A Hybrid Multi-Word Terms Extraction System Applied to Topic Detection

Multi-word term (MWT) extraction plays an important role in many Natural Language Processing (NLP) tasks. Despite its importance, few works have been dedicated to Arabic multi-word term extraction. This paper proposes an automatic Arabic MWT extraction system based on two major filtering steps: a linguistic filter using a part-of-speech tagger along with morphological patterns, and a statistical filter based on two probabilistic measures, the Log-Likelihood Ratio (LLR) and the C-value. We evaluate the performance of the resulting system on Wattan, a topic-oriented Arabic newspaper corpus. Our system achieves 90.23% precision in multi-word term extraction. We also study the use of MWTs as features in Arabic topic detection, where the experiments yield an average F1-measure of 83.46%.


INTRODUCTION
The increasing availability of Arabic electronic documents has led to extensive research across the various fields of Arabic Natural Language Processing (ANLP), taking into consideration the particularities and complex morphology of the Arabic language. In contrast, little research has been undertaken on multi-word term extraction for Arabic documents.
Although the notion of a multi-word term (MWT) has no uniform definition, it can be understood as a sequence of two or more consecutive nouns forming a semantic unit [1]. In fact, the meaning of an MWT cannot be derived from the meanings of its component words taken separately.
MWT extraction is an important task of automatic term recognition and is employed in numerous NLP fields such as text mining [2], syntactic parsing [3], [4], machine translation [5] and text classification [6]. The MWT extraction task covers the detection and extraction of consecutive sets of semantically related words. The techniques used in MWT extraction can be classified into four categories:


Statistical approaches based on frequency, probability and co-occurrence measures [7].
Hybrid approaches are widely used since they combine the benefits of statistical and symbolic methods.
Our work is part of the semantic processing of unvowelized Arabic documents and aims to develop an MWT extraction prototype for Arabic texts based on the hybrid approach, using lexical patterns and two statistical measures: the C-value and the LLR. We also experimentally investigate the use of MWTs as features in topic detection.
This paper is organized as follows. In Section II we present related work. In Section III we describe the developed MWT extraction system. Section IV details the conducted experiments and the obtained results. Finally, we conclude the paper in Section V and outline future work.

Related Work
Although MWT extraction systems and prototypes have been developed for various languages such as English, French, Chinese, Turkish, Dutch and Urdu, little research has been dedicated to MWT extraction for the Arabic language.
The authors of [12] explore three approaches. The first is based on crossing correspondence asymmetries between Arabic Wikipedia titles and titles in 21 different languages. The second translates English MWTs into Arabic and then validates them. The last one benefits from large corpora and lexical association measures. These approaches prove to be very efficient for large-scale extraction of Arabic MWTs.
The authors of [9] created an MWT extraction tool by adopting the hybrid approach. The first step extracts MWT candidates using a set of three syntactic patterns, taking morphological variants into consideration. The second step uses several statistical scores, namely T-Score, FLR, Mutual Information and LLR, to rank the extracted MWTs. The authors used an Arabic corpus to calculate the precision of each statistical method and a collection of Arabic MWTs for validation. The experiments show that the LLR gave the best results: 85% in terms of precision.
A similar work, presented in [13], implements the hybrid approach to extract bigrams. For the linguistic step, the authors used morpho-syntactic analysis to extract two categories of MWT candidates: sequences of nouns and sequences of nouns separated by a preposition. The statistical filter uses two statistical measures: the C-value and the LLR. A corpus composed of 522845 is used to implement the extraction system. The authors used two validation methods: the first considers an MWT correct if the translation of the MWT candidate is included in a terminological database, and the second is manual validation. The experiments show that combining the two metrics to rank MWTs gives better results than using only one of them, especially as the number of MWTs increases.

MULTI-WORD TERMS EXTRACTION SYSTEM
Our system is based on the hybrid approach and operates in two major steps:

Linguistic filter
The linguistic filter is of major importance because it performs the very early selection of MWT candidate terms. It covers the following steps:

Document pre-treatment:
This task covers the unification of document encodings to avoid any ambiguity, and the elimination of Latin words, symbols, numbers, Roman numerals, special characters, etc.
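As an illustration, the pre-treatment step could be sketched as follows. This is a minimal sketch and not the implementation used in our system: the function name `pretreat`, the choice of Unicode NFKC normalization as the encoding unification, and the retained character range (Arabic letters and diacritics, U+0621 to U+0652) are assumptions made for the example. Full stops are kept because they are needed later as sentence delimiters.

```python
import re
import unicodedata

# Assumption: keep Arabic letters and diacritics (U+0621..U+0652),
# full stops and whitespace; everything else is treated as noise.
_NOISE = re.compile(r"[^\u0621-\u0652.\s]")
_SPACES = re.compile(r"\s+")


def pretreat(text: str) -> str:
    """Normalize the text, then drop Latin words, digits and symbols."""
    # NFKC folds Arabic presentation forms into their base letters.
    text = unicodedata.normalize("NFKC", text)
    text = _NOISE.sub(" ", text)
    return _SPACES.sub(" ", text).strip()
```

For example, a sentence mixing Arabic words with a year and a Latin word is reduced to its Arabic words and its final full stop.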

Sentence boundary determination:
To extract MWTs from documents, we implemented a program that splits the corpus documents into sentences. The full stop is considered to be the sentence delimiter.
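The sentence splitting described above can be sketched in a few lines. The function name `split_sentences` is hypothetical, and the sketch assumes that the text has already been cleaned so that the full stop is the only delimiter to handle.

```python
from typing import List


def split_sentences(document: str) -> List[str]:
    """Split a document into sentences, using the full stop as delimiter."""
    # Empty fragments (e.g. after a trailing full stop) are discarded.
    return [part.strip() for part in document.split(".") if part.strip()]
```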

Document POS-tagging:
We assign part-of-speech tags to the sentences of the corpus documents using the Stanford Arabic POS Tagger. This step helps us detect possible MWTs that follow a set of morphological patterns. To extract multi-word terms, the document sentences are scanned for sequences of words that conform to one of these patterns, and the matches are ordered by their number of occurrences. The linguistic filter extracts MWT candidates of various sizes: bigrams, trigrams and four-grams.
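The pattern scan can be illustrated with the sketch below. It assumes that the sentences have already been tagged and are given as lists of (word, tag) pairs. The three patterns and the simplified tag names (`NOUN`, `ADJ`, `PREP`) are hypothetical placeholders chosen for the example; they are neither the patterns of our system nor the tagset of the Stanford tagger.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical patterns over simplified tags, for illustration only.
PATTERNS = [
    ("NOUN", "NOUN"),
    ("NOUN", "ADJ"),
    ("NOUN", "PREP", "NOUN"),
]


def extract_candidates(
    tagged_sentences: List[List[Tuple[str, str]]],
) -> List[Tuple[str, int]]:
    """Return the word sequences matching a pattern, most frequent first."""
    counts = Counter()
    for sentence in tagged_sentences:
        words = [word for word, _ in sentence]
        tags = [tag for _, tag in sentence]
        for pattern in PATTERNS:
            size = len(pattern)
            # Slide a window of the pattern's length over the sentence.
            for start in range(len(sentence) - size + 1):
                if tuple(tags[start:start + size]) == pattern:
                    counts[" ".join(words[start:start + size])] += 1
    return counts.most_common()
```

Candidates of different lengths are collected in a single pass, so a bigram nested inside a matching trigram is counted as well.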

Stop-Word filter:
We eliminate the extracted MWTs that begin with a stop word, using a list of 600 stop words.
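A minimal sketch of this filter is given below. The function name `drop_stopword_initial` is hypothetical, candidates are assumed to be space-separated strings, and the two stop words in the usage example stand in for the full list.

```python
from typing import Iterable, List, Set


def drop_stopword_initial(candidates: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Keep only the candidates whose first word is not a stop word."""
    kept = []
    for candidate in candidates:
        tokens = candidate.split()
        if tokens and tokens[0] not in stop_words:
            kept.append(candidate)
    return kept


# Usage with a two-word stand-in for the stop-word list.
filtered = drop_stopword_initial(["في البيت", "كرة القدم"], {"في", "من"})
```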

Statistical filter
To reduce linguistic ambiguities and increase the ratio of correctly extracted MWTs, we combined two methods that are well known for their effectiveness in MWT extraction:
LLR [14]: a unithood method used to quantify the association between the two words of a bigram by calculating the ratio between two likelihoods: the probability of observing one component of a collocation given that the other is present, and the probability of observing the same component in the absence of the other. For a bigram with components U and V, the LLR metric is computed from co-occurrence counts that include:
Number of compound nouns with U but without V.
Number of compound nouns with V but without U.
Number of compound nouns without U and without V.
C-value [15]: a termhood statistical method based on the frequency of occurrence, which gives the best results for ranking nested MWTs. The C-value measure comes together with a computationally efficient algorithm, which scores candidate multi-token terms according to the measure. Fig. 2 describes the C-value formula. For a candidate term a, it relies on the following quantities:
Frequency of occurrence of a in the corpus.
Set of extracted candidate MWTs that contain a, denoted Ta.
Number of candidate terms in Ta.
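The C-value quantities listed above can be put together as in the following sketch. It assumes the commonly cited formulation of the measure, in which the frequency of a nested term is reduced by the mean frequency of the longer candidates that contain it and the result is weighted by the base-2 logarithm of the term length. The function names and the representation of candidates as tuples of words are choices made for the example, and the sketch omits the efficient incremental algorithm that accompanies the measure.

```python
import math
from typing import Dict, Tuple

Term = Tuple[str, ...]


def contains(longer: Term, shorter: Term) -> bool:
    """True if `shorter` occurs as a contiguous sub-sequence of `longer`."""
    n, m = len(longer), len(shorter)
    return n > m and any(longer[i:i + m] == shorter for i in range(n - m + 1))


def c_values(freq: Dict[Term, int]) -> Dict[Term, float]:
    """Score each candidate term from its frequency and its nesting."""
    scores = {}
    for a, f_a in freq.items():
        # Ta: the longer candidates that contain a.
        t_a = [b for b in freq if contains(b, a)]
        if t_a:
            adjusted = f_a - sum(freq[b] for b in t_a) / len(t_a)
        else:
            adjusted = float(f_a)
        scores[a] = math.log2(len(a)) * adjusted
    return scores
```

With this weighting a single-word candidate would always score zero, which is consistent with restricting the candidates to terms of at least two words.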
We used the C-value metric for nested terms and their variations; the LLR metric was used for the remaining bigrams.
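For the bigrams scored with the LLR, the computation can be sketched from a 2x2 contingency table. The sketch uses one standard formulation of the log-likelihood ratio statistic (with the conventional factor 2), which may differ in presentation from the formulas of [14]. The argument names are assumptions: `a` is taken to be the number of compound nouns containing both U and V, while `b`, `c` and `d` correspond to the counts with U only, with V only, and with neither.

```python
import math


def _xlogx(x: float) -> float:
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def llr(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood ratio of a bigram (U, V) from its contingency counts."""
    n = a + b + c + d
    return 2.0 * (
        _xlogx(a) + _xlogx(b) + _xlogx(c) + _xlogx(d)
        - _xlogx(a + b) - _xlogx(a + c)
        - _xlogx(b + d) - _xlogx(c + d)
        + _xlogx(n)
    )
```

The score is zero when U and V occur independently of each other and grows as the two words become more strongly associated.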

DATASET
For our experiments, we used a corpus of 20,291 articles collected from the Arabic newspaper Wattan for the year 2004 [16]. The corpus contains articles covering the following six topics: culture, economics, international, local, religion and sport. The distribution of documents per topic is described in Table 3; the Sports topic, for instance, contains 4450 documents.

EVALUATION METHOD
The evaluation of an MWT extraction system is a very difficult task because there is no evaluation standard for MWTs, which are language and domain dependent. In general, two categories of evaluation methods are used:


Manual validation relies on a human expert with linguistic knowledge. Human judgement is more accurate. However, this method requires more resources and time in the case of large corpora.


Validation against dictionaries and standards is carried out automatically, by comparing the output of the MWT extraction system with the dictionaries. Although this method is useful for large corpora, the lack of standard dictionaries makes the comparison difficult and not objective.
To evaluate the developed MWT extraction system, we use manual validation through the n-first multi-word evaluation method [17]. This method works in three steps. First, the extracted MWTs are sorted according to the scores obtained using the LLR and the C-value, and only the n first MWTs, those with the best scores, are selected. Then, the list of the n first MWTs is evaluated manually with the help of a human expert. Finally, the system precision is calculated according to the following formula:
Precision = (number of correct MWTs among the n first) / n
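The n-first evaluation can be sketched as follows. The function name `precision_at_n` is hypothetical, and the set `judged_correct` stands for the terms validated by the human expert. If fewer than n terms are available, the sketch divides by the number of terms actually evaluated, which is a choice made for the example.

```python
from typing import List, Set


def precision_at_n(ranked_terms: List[str], judged_correct: Set[str], n: int) -> float:
    """Share of the n best-ranked terms that the expert judged correct."""
    top = ranked_terms[:n]
    if not top:
        return 0.0
    return sum(1 for term in top if term in judged_correct) / len(top)
```

The ranked list is expected to be sorted by decreasing score before the call, so that the first n entries are the n-first MWTs.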

MWTs Extraction system
The multi-word term extraction system allows the extraction of terms composed of 2 to 6 words. Fig. 1 gives examples of extracted MWTs for each of the six topics of the corpus. The results shown in Fig. 2 and Fig. 3 give the precision of the developed system for several subsets of extracted MWTs, of sizes 25, 50, 100 and 200 respectively. Fig. 2 illustrates the precision of the MWT extraction system, which averages 89.44%. We observe that the topic "Economy" gave the weakest performance in comparison with the other topics. This can be explained by the nature of the topic, which does not require many MWTs when writing documents. The rest of the topics show good performance.
After the elimination of the MWTs containing one or more stop words, the average precision of the system increased to 90.23% (Fig. 3). These results show that using a stop-word filter improved the system performance. However, using a general, topic-independent stop-word list decreased the performance for some topics such as religion, where MWTs like "عليه السلام" and "ليال عشر" were deleted because they contain stop words. We conclude that a stop-word filter helps to improve the general performance of the system, but its impact depends on the nature of the topic (literary, scientific, journalistic, …).

Topic Detection with MWTs features
Since MWT extraction aims to extract specific terms from specialized corpora, we decided to study the impact of using MWTs as features in Arabic topic detection. In line with an earlier work [19], we built a topic detection system based on Topic Oriented Vocabularies (TOV), the Jaccard index and an adaptation of the TF-IDF classifier. We conducted experiments using MWTs as features of the TOV. To the best of our knowledge, this is the first time an Arabic topic detection system employs an MWT vocabulary. Fig. 4 shows the results obtained in terms of F1-measure. For documents containing MWTs, the average F1-measure of the topic detection system is 83.46%, the average precision is 84.10% and the average recall is 85.81%.
As shown in Fig. 4, the system achieves higher performance for the religion and sports topics. This can be explained by the specificity of the MWTs extracted for these topics and by the literary nature of the other topics, which produces some ambiguity. The topic "Economy" presents the lowest performance, in agreement with the results obtained earlier.
We conclude that the performance of the MWT-based topic detection system depends on the topics, which confirms that MWTs depend on the topics and their nature.
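To make the role of the MWT features concrete, the sketch below assigns a document to the topic whose vocabulary is most similar to the document's MWTs according to the Jaccard index. It is a simplified illustration, not the system described above: the adaptation of the TF-IDF classifier is left out, and the function names and the toy vocabularies are hypothetical.

```python
from typing import Dict, Iterable, Set


def jaccard(x: Iterable[str], y: Iterable[str]) -> float:
    """Jaccard index: size of the intersection over size of the union."""
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    return len(set_x & set_y) / len(union) if union else 0.0


def detect_topic(doc_terms: Set[str], tov: Dict[str, Set[str]]) -> str:
    """Return the topic whose vocabulary best overlaps the document terms."""
    return max(tov, key=lambda topic: jaccard(doc_terms, tov[topic]))


# Toy topic-oriented vocabularies, for illustration only.
toy_tov = {
    "sport": {"football match", "world cup"},
    "economy": {"stock market"},
}
predicted = detect_topic({"football match"}, toy_tov)
```

In this toy example the document shares one of the two sport terms and none of the economy terms, so it is assigned to the sport topic.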

Conclusion
We developed a multi-word term extraction system for Arabic electronic documents based on linguistic patterns and two statistical methods: LLR and C-value. We were able to extract MWTs in the form of bigrams, trigrams and four-grams.