Implementation and Evaluation of Rule Induction Algorithm with Association Rule Mining: A study in life insurance



INTRODUCTION
Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access and, more recently, produced technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is ready for application in the business community because it is supported by three technologies that are now sufficiently mature:


- Massive data collection
- Powerful multiprocessor computers
- Data mining algorithms

Separate and Conquer paradigm:
Among rule induction methods, the "separate and conquer" approaches were very popular during the 1990s. The goal is to learn prediction rules from data, each of the form:
If Premise Then Conclusion
The premise is a conjunction of conditions of the form "attribute - relational operator - value", for instance: Age > 45 and Profession = Workman. In the supervised learning framework, the attribute in the conclusion is the target attribute. A rule concludes on only one value of the target attribute, but one value of the target attribute may be covered by several rules.
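As an illustration of the rule format described above, the following minimal sketch represents a rule as a conjunction of conditions plus one target value. The class names `Condition` and `Rule`, the operator table and the conclusion label "yes" are assumptions made for the example; they do not come from the tool used later in this study.

```python
import operator
from dataclasses import dataclass

# Relational operators allowed in a condition (an illustrative subset).
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


@dataclass(frozen=True)
class Condition:
    """One test of the form 'attribute - relational operator - value'."""
    attribute: str
    op: str
    value: object

    def holds(self, instance: dict) -> bool:
        return OPERATORS[self.op](instance[self.attribute], self.value)


@dataclass(frozen=True)
class Rule:
    """If Premise Then Conclusion, where the premise is a conjunction."""
    premise: tuple   # tuple of Condition objects
    conclusion: str  # exactly one value of the target attribute

    def covers(self, instance: dict) -> bool:
        return all(c.holds(instance) for c in self.premise)


# The premise is the example from the text; the conclusion "yes" is hypothetical.
rule = Rule(
    premise=(Condition("Age", ">", 45), Condition("Profession", "=", "Workman")),
    conclusion="yes",
)

print(rule.covers({"Age": 52, "Profession": "Workman"}))  # True
print(rule.covers({"Age": 30, "Profession": "Workman"}))  # False
```

A rule only predicts its conclusion for the instances it covers; what happens to uncovered instances depends on the rest of the rule set.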

Compared to classification tree algorithms:
Classification tree algorithms are based on the divide and conquer paradigm. The representation bias of separate and conquer rule learners is more powerful because it is not constrained by the arborescent (tree) structure. A very complicated tree is sometimes needed to express the equivalent of a simple rule-based system, because some splitting sequences must be replicated in several parts of the tree. This is known as the "replication problem".

Compared to the predictive association rule algorithms:
Unlike predictive association rule algorithms, separate and conquer methods do not suffer from redundancy among the induced rules. The aim is to produce a minimal set of rules that still classifies a new instance accurately. This also helps to handle rule collisions, where an instance activates two or more rules that lead to inconsistent conclusions.
We first describe two separate and conquer algorithms for rule induction. We then show the behavior of the classification rule algorithms as implemented in a tool.

Separate and Conquer algorithms


Induction of ordered rules (Decision list induction)
The induction process is based on the top-down separate and conquer approach. Nested procedures build the set of rules from the target attribute, the input variables and the instances.
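The nested procedures can be sketched as follows. This is a deliberately simplified illustration, not the algorithm implemented by the tool used in this study: it assumes nominal attributes, conditions of the form attribute = value only, and a greedy choice by precision (ties broken by coverage), with no pruning. All function names are hypothetical.

```python
from collections import Counter


def covers(premise, instance):
    """A premise is a dict {attribute: value}; every equality must hold."""
    return all(instance[a] == v for a, v in premise.items())


def grow_rule(instances, target, attributes):
    """Inner procedure: add conditions greedily until the covered
    instances share one target value or no attribute is left."""
    premise = {}
    covered = list(instances)
    while len({x[target] for x in covered}) > 1:
        best = None
        for a in attributes:
            if a in premise:
                continue
            for v in sorted({x[a] for x in covered}):
                subset = [x for x in covered if x[a] == v]
                hits = Counter(x[target] for x in subset).most_common(1)[0][1]
                score = (hits / len(subset), hits)  # precision, then coverage
                if best is None or score > best[0]:
                    best = (score, a, v)
        if best is None:  # every attribute is already in the premise
            break
        _, a, v = best
        premise[a] = v
        covered = [x for x in covered if x[a] == v]
    conclusion = Counter(x[target] for x in covered).most_common(1)[0][0]
    return premise, conclusion


def induce_decision_list(instances, target, attributes):
    """Outer procedure: learn a rule, remove ('separate') the instances
    it covers, and repeat on the remaining instances."""
    rules = []
    remaining = list(instances)
    while remaining:
        premise, conclusion = grow_rule(remaining, target, attributes)
        rules.append((premise, conclusion))
        if not premise:  # empty premise acts as the final default rule
            break
        remaining = [x for x in remaining if not covers(premise, x)]
    return rules


def classify(rules, instance):
    """Apply the ordered rules: the first matching rule decides."""
    for premise, conclusion in rules:
        if covers(premise, instance):
            return conclusion
    return None


# Toy data, invented for the example.
data = [
    {"age": "old", "job": "workman", "target": "yes"},
    {"age": "old", "job": "clerk", "target": "no"},
    {"age": "young", "job": "workman", "target": "no"},
    {"age": "young", "job": "clerk", "target": "no"},
]
rules = induce_decision_list(data, "target", ["age", "job"])
for premise, conclusion in rules:
    print(premise, "->", conclusion)
```

Because each rule is learned only on the instances left uncovered by the earlier rules, the resulting list must be read in order.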

Induction of unordered rules:
With an ordered set of rules, reading the i-th rule requires taking the (i-1) preceding rules into account. This becomes impractical when the number of rules is large.

The classifier is now outlined as follows:
If Condition
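The difference between reading an ordered and an unordered rule set can be sketched as below. This is a sketch under stated assumptions: the rule tuples, the class counts and the conflict-resolution scheme (summing the class counts of all rules that fire, in the spirit of CN2 unordered rules) are illustrative and are not taken from the tool used in this study.

```python
from collections import Counter


def matches(premise, instance):
    """A premise is a dict {attribute: value}; every equality must hold."""
    return all(instance.get(a) == v for a, v in premise.items())


def classify_ordered(rules, instance):
    """Decision list: rule i is reached only if rules 1..i-1 did not fire."""
    for premise, label, _counts in rules:
        if matches(premise, instance):
            return label
    return None


def classify_unordered(rules, instance, default):
    """Unordered set: each rule stands alone; all firing rules vote
    with the class counts they were learned on (assumed scheme)."""
    votes = Counter()
    for premise, _label, counts in rules:
        if matches(premise, instance):
            votes.update(counts)
    return votes.most_common(1)[0][0] if votes else default


# Hypothetical rules: (premise, conclusion, class counts on the learning sample).
rules = [
    ({"age": "old"}, "good", {"good": 8, "bad": 2}),
    ({"job": "workman"}, "bad", {"good": 1, "bad": 12}),
]
instance = {"age": "old", "job": "workman"}

print(classify_ordered(rules, instance))           # good
print(classify_unordered(rules, instance, "bad"))  # bad
```

In the ordered reading the second rule is never consulted for this instance, whereas in the unordered reading both rules fire and the collision is resolved by the combined counts.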

PREVIOUS WORKS
A number of practical works have been presented that use existing rule induction algorithms. The authors of [1] proposed the discovery of spatial association rules in georeferenced census data, using a relational mining approach. The authors of [3,4] proposed top-down induction of model trees with regression and splitting nodes, and ranking mechanisms in metadata information systems for geospatial data. The authors of [8] presented rule induction with CN2, with some recent improvements over traditional algorithms. They also proposed post-pruning and hybrid pruning techniques alongside the rule induction method to obtain highly accurate results, reduced the induced set of rules and the computation time while keeping high coverage of large data sets, and used decision tree and rule induction methods with the help of data mining software.
www.ijctonline.com

Dataset
We use life insurance policy data. We want to detect the customers who hold a good policy, based on customer categories, and to obtain accurate results with little computation time.

Importing the database
After launching Tanagra, we create a new diagram by clicking on the FILE / NEW menu. We import the life insurance .xls file.

Sampling Algorithm
We want to subdivide the dataset into a learning sample (50%) and a test sample. We use the SAMPLING Component.
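Outside the tool, the same kind of subdivision can be sketched as a random split. This is only an illustration: the function name `split_sample`, the fixed seed and the stand-in rows are assumptions, and the sketch does not claim to reproduce how the SAMPLING component draws its sample.

```python
import random


def split_sample(rows, learning_fraction=0.5, seed=0):
    """Randomly split rows into a learning sample and a test sample."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = int(len(rows) * learning_fraction)
    learning = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return learning, test


rows = list(range(10))  # stand-in for the records of the dataset
learning, test = split_sample(rows)
print(len(learning), len(test))  # 5 5
```

The rules are induced on the learning sample only, so that the test sample gives an honest estimate of their accuracy.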

Fig 1
We now set "target" as the TARGET attribute and the other attributes as INPUT, using the DEFINE STATUS component.

Induction of Decision Lists
We add the DECISION LIST component to the diagram. We click on the SUPERVISED PARAMETERS menu; J-MEASURE is the default measure.
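For reference, the J-measure of a rule "If x Then y" is usually defined as J = p(x) [ p(y|x) log2(p(y|x)/p(y)) + (1 - p(y|x)) log2((1 - p(y|x))/(1 - p(y))) ]. The sketch below computes it from counts; it assumes this standard definition, which may differ in detail from what the tool computes, and the function name and arguments are hypothetical.

```python
import math


def j_measure(n_total, n_premise, n_conclusion, n_both):
    """J-measure of a rule from counts on the learning sample.

    n_total      : number of instances
    n_premise    : instances covered by the premise x
    n_conclusion : instances with the target value y
    n_both       : instances covered by x that also have value y
    """
    if not 0 < n_premise <= n_total:
        raise ValueError("the premise must cover at least one instance")
    if not 0 < n_conclusion < n_total:
        raise ValueError("p(y) must lie strictly between 0 and 1")
    p_x = n_premise / n_total
    p_y = n_conclusion / n_total
    p_y_given_x = n_both / n_premise

    def term(p, q):
        # Convention: 0 * log(0 / q) = 0.
        return 0.0 if p == 0 else p * math.log2(p / q)

    return p_x * (term(p_y_given_x, p_y) + term(1 - p_y_given_x, 1 - p_y))


# A premise that does not change the class distribution carries no information.
print(j_measure(100, 50, 40, 20))  # 0.0
# A pure rule covering 20% of the data when p(y) = 0.5.
print(round(j_measure(100, 20, 50, 20), 6))  # 0.2
```

The measure rewards rules that both cover many instances (large p(x)) and shift the class distribution strongly away from the prior p(y).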

Fig 3
We validate these settings and we click on the VIEW menu. We obtain 20 rules in 703 ms.
We validate these settings and we click on the VIEW menu. We obtain 172 rules in 9203 ms.

Induction of unordered rules
We use the RULE INDUCTION component (SPV LEARNING tab) to generate a set of unordered rules. We click on the SUPERVISED PARAMETERS menu; the default settings are as follows.

Fig 7
We validate them. We click on the VIEW menu. We obtain only 1 rule in 250 ms.