

ECTI TRANSACTIONS ON COMPUTER INFORMATION TECHNOLOGY


Volume 15, No. 01, Month APRIL, Year 2021, Pages 108 - 122


Information Extraction on Tourism Domain using SpaCy and BERT

Chantana Chantrapornchai, Aphisit Tunsakul


Abstract

In this paper, we present two methodologies for extracting particular information from the full text returned by a search engine, in order to assist users. The approaches are based on three tasks: named entity recognition (NER), text classification, and text summarization. The first step is building and cleansing the training data. We consider the tourism domain, covering restaurants, hotels, shopping, and tourism data sets crawled from websites. First, the tourism data are gathered and the vocabularies are built. Several minor steps follow, including sentence extraction and relation and named entity extraction for tagging purposes. These steps are needed to create proper training data. Then, a recognition model for a given entity type can be built. In the experiments, given review texts, we demonstrate how to build models that extract the desired entities, i.e., name, location, and facility, as well as the relation type, classify the reviews, or summarize the reviews. Two tools, SpaCy and BERT, are used, and their performance on these tasks is compared.
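
As a rough illustration of the tagging step described in the abstract, the following sketch shows how a vocabulary could be used to produce entity-annotated training examples. It is a minimal sketch, not the authors' implementation: the vocabulary entries, the function name `tag_entities`, and the example sentence are all hypothetical, and the output uses the character-offset tuple layout `(text, {"entities": [(start, end, label)]})` familiar from spaCy v2-style training data, which is an assumption about the target format.

```python
import re

# Hypothetical vocabulary: surface form -> entity label (illustrative values only,
# not taken from the paper's data set).
VOCAB = {
    "Bangkok": "LOCATION",
    "swimming pool": "FACILITY",
    "free wifi": "FACILITY",
}


def tag_entities(text, vocab):
    """Annotate `text` with character-offset entity spans found by vocabulary lookup.

    Matching is case-insensitive and restricted to whole words. Returns a tuple
    (text, {"entities": [(start, end, label), ...]}) sorted by start offset.
    """
    spans = []
    # Try longer terms first so a multi-word entry wins over a shorter overlapping one.
    for term in sorted(vocab, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            # Keep the match only if it does not overlap a span already accepted.
            if all(match.end() <= s or match.start() >= e for s, e, _ in spans):
                spans.append((match.start(), match.end(), vocab[term]))
    spans.sort()
    return text, {"entities": spans}


# Hypothetical review sentence used only to exercise the function.
example = tag_entities("The hotel in Bangkok has a swimming pool and free WiFi.", VOCAB)
```

Each span in `example[1]["entities"]` can be checked by slicing the text with its offsets. In practice, examples produced this way would still need manual review before being used to train a recognition model, since plain dictionary lookup cannot resolve ambiguous terms.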


Keywords

Name Entity Recognition, Text Classification, BERT, SpaCy





Published by : ECTI Association
Contributions welcome at : http://www.ecti-thailand.org/paper/journal/ECTI-CIT