

THAMMASAT INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLOGY


Volume 26, No. 02, April 2021, Pages 61-78


Enhancement of Character-Level Representation in Bi-LSTM Model for Thai NER

Kitiya Suriyachay, Thatsanee Charoenporn, Virach Sornlertlamvanich, Natsuda Kaothanthong


Abstract

Named Entity Recognition (NER) in Thai is a relatively challenging task because the language has no explicit word boundaries. This makes word segmentation difficult, and segmentation errors in turn reduce the effectiveness of downstream NLP tasks such as NER. Another important problem is ambiguity: common nouns are often used to express named entities. In Thai, most named entities appear close to a verb or a preposition in a specific pattern, which means that the part of speech (POS) can be used effectively as a feature for determining the type of a named entity. For these reasons, in this paper we build a BiLSTM-CNN-CRF model to investigate the effectiveness of combining word, POS, and Thai character cluster (TCC) features. We use TCCs instead of single characters to minimize the effect of word segmentation errors in the corpora and to make model training more efficient. Experimental results show that our proposed model outperforms other models. The TCC is a suitable unit for character embedding, providing better results than single-character embedding.
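
As a rough illustration of the feature combination described in the abstract, the sketch below builds one input vector per token by concatenating a word embedding, a POS embedding, and a character-level vector obtained by a small 1-D convolution with max pooling over the token's sub-word units. It is a minimal sketch under stated assumptions, not the authors' implementation: the dimensions, the random embedding tables, the function names, and the placeholder units `u1`, `u2`, `u3` (standing in for TCCs produced by some external segmenter) are all hypothetical. In a full model, the resulting vectors would be fed to a BiLSTM followed by a CRF layer.

```python
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the example small.
WORD_DIM, POS_DIM, UNIT_DIM, N_FILTERS, KERNEL = 4, 2, 3, 5, 2


def make_table(keys, dim):
    """Random embedding table; a real model would learn these vectors."""
    return {k: [random.uniform(-1, 1) for _ in range(dim)] for k in keys}


def unit_cnn(units, unit_table, filters):
    """1-D convolution over a token's sub-word units, then max pooling."""
    vecs = [unit_table[u] for u in units]
    # Zero-pad short tokens so that at least one window exists.
    while len(vecs) < KERNEL:
        vecs.append([0.0] * UNIT_DIM)
    pooled = []
    for weights in filters:  # one flat weight list per filter
        scores = []
        for i in range(len(vecs) - KERNEL + 1):
            window = [x for v in vecs[i:i + KERNEL] for x in v]
            scores.append(sum(w * x for w, x in zip(weights, window)))
        pooled.append(max(scores))  # max over all window positions
    return pooled


def token_features(word, pos, units, word_table, pos_table, unit_table, filters):
    """Concatenate word, POS, and unit-level (e.g. TCC) representations."""
    return word_table[word] + pos_table[pos] + unit_cnn(units, unit_table, filters)


# Placeholder data: "w1"/"w2" are words, "u1".."u3" stand in for TCCs.
sentence = [("w1", "NOUN", ["u1", "u2", "u3"]), ("w2", "VERB", ["u2"])]
word_table = make_table(["w1", "w2"], WORD_DIM)
pos_table = make_table(["NOUN", "VERB"], POS_DIM)
unit_table = make_table(["u1", "u2", "u3"], UNIT_DIM)
filters = [[random.uniform(-1, 1) for _ in range(KERNEL * UNIT_DIM)]
           for _ in range(N_FILTERS)]

features = [token_features(w, p, u, word_table, pos_table, unit_table, filters)
            for w, p, u in sentence]
```

Each token ends up with a vector of length `WORD_DIM + POS_DIM + N_FILTERS`, regardless of how many units it contains, because max pooling collapses the variable-length unit sequence into a fixed number of filter responses.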


Keywords

Named Entity Recognition, Recurrent Neural Network, Bidirectional LSTM, CNN, CRF, Thai language, Thai named entity, TCC





Published by: Thammasat University
Contributions welcome at: http://www.tijsat.tu.ac.th