Automatic hierarchical text classification is a challenging and much-needed task, as the number of hierarchical taxonomies grows with the boom in knowledge organization. A hierarchical structure captures the dependency relationships between categories, in which general and specific concepts within the tree can overlap. This paper uses the frequency with which a term occurs in related categories of the hierarchical tree to aid document classification. Four extended term weightings based on Relative Inverse Document Frequency (IDFr), computed over a term's own category, its parent category, its sibling categories, and its child categories, are used to build a classifier model with a centroid-based technique. In an experiment on hierarchical text classification of Thai documents, IDFr achieved its best accuracy and F-measure, 53.65% and 50.80% respectively, with the Top-n feature set under family-based evaluation. These results are higher than those of TF-IDF in the same settings by 2.35% and 1.15%, respectively.
Keywords
Hierarchical Text Classification, Term Weighting, Hierarchical Categories, Relative Inverse Document Frequency (IDFr)
ECTI TRANSACTIONS ON COMPUTER INFORMATION TECHNOLOGY