A Hybrid Multi-Word Terms Extraction System Applied to Topic Detection
Keywords:Multi-word Terms Extraction, Topic Detection, C-value, LLR.
Mutli-word Terms extraction plays an important role in many Natural Language Processing (NLP) tasks. Despite their major importance, few works were dedicated to Arabic multi-word terms extraction. This paper proposes an automatic Arabic multi-word terms (MWTs) extraction system based on two major filtering steps: linguistics filter using a part-of-speech tagger along with morphological patterns and statistical filter based on probabilistic methods, namely: Log-Likelihood Ratio (LLR) and C-value. We evaluate the performances of the realized systems on Wattan; an Arabic oriented topic newspaper corpus. Our system manages to achieve 90.23% in term of multi-word extraction precision. We also study the use of MWTs as features in Arabic Topic Detection. The conducted experiments show good results.