Implementation and Analysis of Clustering Algorithms in Data Mining

Abstract: Data mining plays a very important role in the information industry and in society because of the huge amount of data now available, and organizations around the world are already aware of it. Data mining is the process of using various data analysis tools to discover patterns in data; it is also referred to as knowledge discovery from data. Clustering is an unsupervised learning technique, since the groups are not predefined but are determined by the data. Among the many research areas in data mining, this paper focuses on the performance and evaluation of three clustering algorithms: K-means, the self-organizing map (SOM) and hierarchical agglomerative clustering (HAC). The evaluation of these three algorithms is based purely on survey-based analysis. The algorithms are analyzed by applying them to a banking data set, which is very high dimensional, and their performance is compared. Our results indicate that the SOM technique is better than K-means and as good as or better than the hierarchical clustering technique. We have also written code in Orange Python that implements an enhanced algorithm based on a hybrid approach of SOM, K-means and HAC.


Introduction:
Any clustering technique [1] aims to evolve a K × n partition matrix U(x) of a dataset, where n is the number of data points and K is the number of clusters. Clustering techniques fall broadly into two main classes: partitioning and hierarchical. In any clustering system two fundamental questions arise: 1) how many clusters are actually present in the data, and 2) how real or good is the clustering itself. Data mining algorithms [2] that process large amounts of data must be scalable, and algorithms that process data with changing patterns must be able to update and learn from new data. Clustering [3] is one of the traditional data mining techniques: an unsupervised learning paradigm in which clustering methods try to identify the inherent grouping of text documents, so that the resulting clusters exhibit high intracluster similarity and low intercluster similarity.
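To make the K × n partition matrix concrete, the following minimal sketch builds a hard (0/1) partition matrix from a list of cluster labels. The function name `partition_matrix` and the sample labels are illustrative assumptions and are not taken from the cited work.

```python
def partition_matrix(labels, k):
    """Build a hard K x n partition matrix U from cluster labels.

    U[i][j] is 1 if point j belongs to cluster i, otherwise 0.
    """
    n = len(labels)
    u = [[0] * n for _ in range(k)]
    for j, label in enumerate(labels):
        u[label][j] = 1
    return u


# Hypothetical labels for n = 5 points and K = 2 clusters.
labels = [0, 1, 0, 1, 1]
U = partition_matrix(labels, k=2)
for row in U:
    print(row)
```

In a hard partition every column of U contains exactly one 1, because each point belongs to exactly one cluster.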
The K-means clustering algorithm is a very simple iterative method for partitioning a dataset into k clusters, where k is specified by the user. This algorithm [4] can easily adapt to a dynamic P2P network, in which existing nodes drop out and new nodes join during the execution of the algorithm and the data in the network changes.
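The iterative method can be sketched as follows. This is a minimal illustration of the standard assignment and update steps, not the implementation used in this work; it assumes Euclidean distance and, for simplicity, takes the first k points as initial centroids (practical implementations usually choose them randomly). The function names and sample points are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100):
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids


# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The only user-specified quantity is k; everything else is derived from the data.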
Objective: The objectives of our research are as follows:
1. To evaluate the performance of clustering algorithms.
2. To analyze the banking data by applying clustering algorithms to it.
3. To find the best possible solution for handling a large amount of data.
Dataset: We analyze a banking dataset. This is a real dataset, which we selected after surveying many candidate datasets for one that met our requirements. The data contain 1001 entries. In our research work we focus on the performance and evaluation of clustering algorithms. There are many clustering algorithms in data mining, but we focus mainly on K-means, SOM and HAC, and we have adopted a hybrid approach that combines the three.
Tool Used: We used the open source tool Tanagra. Tanagra is a very powerful tool that supports supervised learning as well as other paradigms such as clustering and factorial analysis. In this project we apply K-means, SOM and HAC using Tanagra, measure the efficiency and performance of each algorithm, and determine which algorithm, as run in Tanagra, is able to handle a large amount of high-dimensional data.
First, we applied the Kohonen SOM algorithm to the dataset. We set the map parameters to 2 rows and 3 columns, which gives a map of 6 nodes.
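For readers unfamiliar with the Kohonen SOM, the sketch below trains a map with 2 rows and 3 columns on a few hypothetical two-dimensional points. It is a generic textbook-style illustration and not Tanagra's implementation: the Gaussian neighbourhood, the linear decay schedules, the parameter names (`lr0`, `sigma0`, `epochs`) and the sample data are all assumptions.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the node whose weights are closest to x."""
    return min(
        weights,
        key=lambda node: sum((a - b) ** 2 for a, b in zip(x, weights[node])),
    )


def train_som(data, rows=2, cols=3, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per map node, initialised from random training samples.
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)               # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 0.01  # neighbourhood radius shrinks
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Nodes close to the winner on the grid are pulled towards x.
                grid_dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_dist2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Hypothetical data scaled to the range [0, 1].
data = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.8), (0.85, 0.9), (0.5, 0.5), (0.45, 0.55)]
weights = train_som(data)
for x in data:
    print(x, "->", best_matching_unit(weights, x))
```

Each data point is mapped to one of the 6 nodes, so the map acts as a clustering with at most 6 groups whose neighbouring nodes hold similar records.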
After applying SOM, we obtain an error ratio of 0.6382.
May 15, 2013
Next, we opened the data visualization options and dragged the scatter plot component onto the Kohonen-SOM node. Once we select one attribute (say, income tax) for one axis and another attribute (say, minimal deposit) for the other axis, the scatter plot shows the various banking services as dots, squares and triangles in different colors. The dots that form the clusters are not clearly separated, because the data is high dimensional.
When we drag K-means onto Kohonen-SOM, we obtain the results shown above. In this case R-Square is 0.6600, which seems very high compared with Kohonen-SOM. The within sum of squares is 1700.2125 and the total sum of squares is 5000.000.
The figure above shows the dendrogram of HAC.
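The reported K-means figures can be cross-checked with a short calculation. The sketch assumes that R-Square is the share of the total sum of squares explained by the clustering, that is, 1 minus the ratio of the within sum of squares to the total sum of squares; this definition is an assumption, but it reproduces the reported value.

```python
within_ss = 1700.2125  # within sum of squares reported for K-means
total_ss = 5000.0      # total sum of squares reported for K-means

# Assumed definition: share of the total variation explained by the clusters.
r_square = 1.0 - within_ss / total_ss
print(f"{r_square:.4f}")
```

Rounded to four decimal places, the result agrees with the reported R-Square of 0.6600.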
Conclusion: From the above work, we can see that K-means does not handle a large amount of data well. Its error ratio is also very large, since in this case we deal mainly with high-dimensional data. SOM is helpful for handling large data and is also used for pattern recognition, image processing, etc.
Proposed work: We propose an enhanced algorithm that gives better results and a better visualization of the nodes, because the results in Tanagra at the end of the hybrid approach were not clear.
We have written code in Orange Python based on the hybrid approach of SOM, HAC and K-means.
We browse to the script file and press the OK button. Once the code is running, we see the nodes with their different attribute values and services.
Future work: We will develop an optimal algorithm that overcomes the drawbacks of SOM, HAC and K-means, and implement it as a Python program that outputs the resulting clusters.