ECTI TRANSACTIONS ON COMPUTER INFORMATION TECHNOLOGY
Volume 15, No. 01, April 2021, Pages 108-122
Information Extraction on Tourism Domain Using SpaCy and BERT
Chantana Chantrapornchai, Aphisit Tunsakul
Abstract
In this paper, we present two methodologies for extracting particular information from the full text returned by a search engine, in order to assist users. The approaches are based on three tasks: named entity recognition (NER), text classification, and text summarization. The first step is building the training data and data cleansing. We consider the tourism domain, covering restaurants, hotels, shopping, and tourism data sets crawled from websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, the recognition model for a given entity type can be built. In the experiments, given review texts, we demonstrate how to build models that extract the desired entities (i.e., name, location, and facility) as well as relation types, classify the reviews, or summarize the reviews. Two tools, SpaCy and BERT, are used to compare performance on these tasks.
Named Entity Recognition, Text Classification, BERT, SpaCy
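
The training-data preparation outlined in the abstract (sentence extraction followed by vocabulary-based entity tagging) can be illustrated with a minimal sketch. It is not the authors' pipeline: the vocabulary entries, the label names NAME, LOCATION and FACILITY, and the helper functions `split_sentences`, `tag_sentence` and `build_training_data` are all hypothetical, and the naive regex sentence splitter stands in for whatever extraction the paper actually uses. The output follows the common `(text, {"entities": [(start, end, label)]})` character-offset layout often used for NER training examples.

```python
import re

# Hypothetical vocabulary (illustrative only): surface form -> entity label.
VOCAB = {
    "Grand Palace Hotel": "NAME",
    "Bangkok": "LOCATION",
    "swimming pool": "FACILITY",
}


def split_sentences(text):
    """Naive sentence extraction: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence, vocab):
    """Tag vocabulary terms in one sentence with character-offset spans."""
    spans = []
    # Try longer terms first so a short term cannot claim part of a longer one.
    for term in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, sentence, flags=re.IGNORECASE):
            start, end = match.span()
            # Keep the span only if it does not overlap one already recorded.
            if all(end <= s or start >= e for s, e, _ in spans):
                spans.append((start, end, vocab[term]))
    spans.sort()
    return sentence, {"entities": spans}


def build_training_data(text, vocab):
    """Split a review into sentences and tag each one."""
    return [tag_sentence(s, vocab) for s in split_sentences(text)]


if __name__ == "__main__":
    review = "Grand Palace Hotel is in Bangkok. It has a swimming pool."
    for sentence, annotations in build_training_data(review, VOCAB):
        print(sentence, annotations)
```

Dictionary matching of this kind only bootstraps the labels; the tagged sentences would still need cleansing and review before being used to train a recognition model.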