ThaiScience  


ECTI TRANSACTIONS ON COMPUTER INFORMATION TECHNOLOGY


Volume 15, No. 02, Month AUGUST, Year 2021, Pages 166 - 176


Hierarchical text classication using relative inverse document frequency

Boonthida Chiraratanasopha, Thanaruk Theeramunkong, Salin Boonbrahm


Abstract Download PDF

Automatic hierarchical text classification has been a challenging and in-needed task with an increasing of hierarchical taxonomy from the booming of knowledge organization. The hierarchical structure identifies the relationships of dependence between different categories in which can be overlapped of generalized and specific concepts within the tree. This paper presents the use of frequency of the occurring term in related categories among the hierarchical tree to help in document classification. The four extended term weighting of Relative Inverse Document Frequency (IDFr) including its located category, its parent category, its sibling categories and its child categories are exploited to generate a classifier model using centroid-based technique. From the experiment on hierarchical text classification of Thai documents, the IDFr achieved the best accuracy and F-measure as 53.65% and 50.80% in Top-n features set from family-based evaluation in which are higher than TF-IDF for 2.35% and 1.15% in the same settings, respectively.


Keywords

Hierarchical Text Classi- cation, Term Weighting, Hierarchical Categories, Relative Inverse Documents Frequency (IDFr)



ECTI TRANSACTIONS ON COMPUTER INFORMATION TECHNOLOGY


Published by : ECTI Association
Contributions welcome at : http://www.ecti-thailand.org/paper/journal/ECTI-CIT