Identifying Network Anomalies Using Clustering Technique in Weblog Data

In this paper we present an approach for identifying network anomalies by visualizing network flow data stored in weblogs. Various clustering techniques can be used to identify different anomalies in a network. Here, we present a new approach based on Simple K-Means for analyzing network flow data using attributes such as IP address, protocol and port number to detect anomalies. Using visualization, we can identify which sites users access most frequently. Our approach provides an overview of a given dataset by studying key network parameters. In this process we used preprocessing techniques to eliminate unwanted attributes from the weblog data.


INTRODUCTION
As demand for Internet access increases, networks are growing rapidly and have become open to public access. This has led to a dramatic increase in the number and types of intrusions, which poses serious challenges for network analysis in detecting anomalies that occur at different points in the network. Firewalls can be used to prevent illegal access to web pages, but network intrusions take advantage of vulnerabilities in some systems in the network. Hence, there is a need for an intrusion detection system alongside firewalls to track intrusions in the network. Intrusion detection systems are mainly of two types: 1) Anomaly Detection System (ADS): it models the normal behavior of a network, user or application and identifies deviations from these profiles, which may be potential security breaches. 2) Misuse Intrusion Detection System (MIDS): it compares attack signatures with packet payloads to identify intrusions. With this method, it is not possible to identify new attacks.
As the speed, complexity and size of networks grow, the number and types of intrusions have increased to the point where manual analysis is impractical. To simplify analysis, data mining techniques can be used for network intrusion detection and for the analysis of network flow data. Data mining approaches based on fuzzy logic [1], neural networks [2] and agents [3] are widely used in intrusion detection systems. Clustering, association rule mining and sequential association rule mining are well-known data mining techniques used in intrusion detection systems. Clustering partitions data items into a finite number of groups based on their similarity. Partitioning methods, hierarchical methods, density-based methods, grid-based methods and model-based methods are the different clustering techniques available in data mining. In this paper we use the Simple K-Means clustering technique to partition the weblog data and detect network anomalies.
In applying visualization to Internet security, researchers exploit the human ability to process visual information quickly, which enables the complex task of network security monitoring and intrusion detection to be performed accurately and efficiently, as discussed in [4] [5]. In this paper we collected network flow data from the web server of our organization, and then preprocessed and filtered the data. We partitioned the preprocessed data into clusters based on various attributes of the data set. We used the JFreeChart library for Java to visualize the clusters. Through cluster analysis we can identify different anomalies in the data set.
The rest of the paper is organized as follows. In section 2, we present the related work and discuss various techniques used for visualization. We next present our approach in section 3. In section 4 we present results and analysis. Finally, conclusions are made in section 5.

RELATED WORK
Visualization is used in intrusion detection systems to address a number of problems encountered in intrusion detection. Conti et al. [4] provided a survey of packet and alert visualization techniques, described the challenges involved in information visualization of security-related data, and presented techniques for network traffic visualization. D'Amico et al. [6] presented a visualization technique that depicts patterns in massive amounts of data, along with methods for interacting with those visualizations to help analysts prepare for unforeseen events. Xiao et al. [7] presented their work on visualization of network traffic, which is applied to classifying the traffic, and used a PLOT to visualize the data set. Itoh et al. [8] presented different techniques for visualizing the content of huge IDS log files. Goodall et al. [9] presented a technique called Time Based Network Traffic Visualization, which supports the complex task of searching for indications of attacks and misuse in vast amounts of network data. Tee et al. [10] described their work on the Origin Autonomous System Change technique, which is based on the premise that valuable knowledge can be gleaned from large data sets, the same premise that underlies knowledge discovery. Munz et al. [11] presented a flow-based anomaly detection scheme based on the K-Means clustering algorithm, in which they cluster unlabelled flow data into normal and anomalous traffic. Our approach, in contrast, focuses on providing an intuitive visualization of network traffic based on flow analysis by applying the Simple K-Means clustering algorithm.
Visualization Techniques: Pen et al. [12] discuss how intrusions can be detected by visualizing cluster groups through a technique called IDGraphs. Conti et al. [4] present a technique called IDS RainStorm for visualizing the data. D'Amico et al. [6] present a technique called Vi-Assist, and Abdullah et al. [5] discuss a scaling technique and stacked histograms. Goodall et al. [9] used Time Based Network Traffic Visualization.

OUR APPROACH
In this section we describe the data collection method and the preprocessing and filtering of the data. We then use the K-Means clustering algorithm to form clusters on different network flow attributes, and analyze those clusters by visualizing the flow data to detect network anomalies. Figure 1 depicts the overall process of our approach. First we collect network flow data in the form of web log records from the web server. A sample of the collected raw data is shown in figure 2. This data set is preprocessed and then filtered based on key parameters such as IP address, protocol and URL. The preprocessing step is very important for our approach, since it determines how the clusters are formed. In the preprocessing step we trim the length of the URL by eliminating unnecessary control characters and request parameters. Preprocessing is done in such a way that the clusters do not overlap and are far apart. After preprocessing, the data set is filtered by eliminating unwanted attributes. In the filtering step we eliminate the request method (GET/POST). The preprocessed and filtered data set is shown in figure 3. After filtering the data set, we apply the Simple K-Means algorithm to meaningful attributes such as IP address, protocol and URL. The resultant clusters are shown in figure 4.
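The preprocessing and filtering steps can be illustrated with a minimal Python sketch. The exact web log format is not given here, so the record layout (a dictionary with `ip`, `protocol`, `method` and `url` keys) and the helper names `trim_url` and `preprocess` are assumptions made for illustration only, not the implementation used in this work.

```python
import re
from urllib.parse import urlsplit

# Matches ASCII control characters (assumed to be the "unnecessary control
# characters" mentioned in the text).
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]")


def trim_url(url):
    """Drop control characters, request parameters and fragments from a URL."""
    cleaned = _CONTROL_CHARS.sub("", url)
    parts = urlsplit(cleaned)
    host = parts.netloc.lower()
    path = parts.path or "/"
    # Keep only host and path; the query string and fragment are discarded.
    return host + path if host else path


def preprocess(record):
    """Trim the URL, then keep only the attributes used for clustering.

    The request method (GET/POST) is filtered out by not copying it.
    The input keys are a hypothetical record layout.
    """
    return {
        "ip": record["ip"].strip(),
        "protocol": record["protocol"].strip().upper(),
        "url": trim_url(record["url"]),
    }
```

Stripping the query string means that requests for the same page with different parameters map to the same URL value, which is one plausible way to keep the resulting clusters from fragmenting.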

Fig 3: Preprocessed and Filtered data.
The formed clusters are visualized and analyzed based on IP address, Protocol and URL to identify network anomalies.
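The clustering step on the IP address attribute can be sketched as follows. This is a plain K-Means (Lloyd's algorithm) written for illustration. The encoding of an IPv4 address as four scaled octets and the deterministic farthest-point seeding are assumptions chosen to make the example reproducible, and may differ from the Simple K-Means configuration used in our experiments.

```python
import ipaddress


def ip_to_vector(ip):
    """Encode an IPv4 address as four octets scaled to [0, 1] (assumed encoding)."""
    return [octet / 255.0 for octet in ipaddress.IPv4Address(ip).packed]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_seeds(points, k):
    """Pick the first point, then repeatedly the point farthest from all seeds."""
    seeds = [list(points[0])]
    while len(seeds) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, s) for s in seeds))
        seeds.append(list(nxt))
    return seeds


def k_means(points, k, max_iterations=100):
    """Return (centroids, labels) after running Lloyd's algorithm."""
    centroids = farthest_point_seeds(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

For example, calling `k_means([ip_to_vector(ip) for ip in ips], k=2)` on a list of addresses drawn from two distinct subnets groups the addresses by subnet. Categorical attributes such as protocol and URL would need their own numeric encoding before they could be clustered this way.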

RESULTS AND ANALYSIS
We used JFreeChart, an open source charting library for Java distributed under the GNU Lesser General Public License (LGPL), to visualize the formed clusters. First we applied clustering on IP address, which yielded 6 clusters. We then clustered each of these clusters again on URL. In the analysis of the protocol attribute we observed that, for normal flow data, TCP and UDP packets are almost equal in number. For attack flow data, however, TCP packets outnumber UDP packets, and a few ICMP and IGMP packets are also observed. Using the protocol attribute, we can detect attacks such as Denial of Service (DoS) and malware propagation. The visualization chart is shown in figure 5.
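The protocol observation above can be expressed as a simple check on the protocol mix of a cluster. The sketch below is illustrative only: the function name `protocol_profile`, the record layout and the `tcp_udp_ratio` threshold of 2.0 are hypothetical, since no numeric threshold is reported in our analysis.

```python
from collections import Counter


def protocol_profile(records, tcp_udp_ratio=2.0):
    """Summarize the protocol mix of a set of flow records.

    Returns the share of each protocol, whether TCP dominates UDP by more
    than the (hypothetical) ratio, and whether ICMP or IGMP traffic appears.
    """
    counts = Counter(r["protocol"].upper() for r in records)
    total = sum(counts.values())
    shares = {proto: n / total for proto, n in counts.items()} if total else {}
    tcp = shares.get("TCP", 0.0)
    udp = shares.get("UDP", 0.0)
    return {
        "shares": shares,
        "tcp_dominant": tcp > tcp_udp_ratio * udp,
        "control_seen": any(p in shares for p in ("ICMP", "IGMP")),
    }
```

A cluster with roughly equal TCP and UDP shares would not be flagged, while a cluster dominated by TCP with some ICMP or IGMP traffic would be marked for closer inspection. Such a flag is only a hint for the analyst, not a classification of the traffic.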

CONCLUSIONS
In this paper we presented an approach for analyzing and visualizing network flow data using clustering. It is a simple and fast way of analyzing flow data. With the help of clustering, we can predict the type of flow, i.e. attack or normal, by clustering on particular attributes. Our analysis is based mainly on three attributes, so only a limited number of attributes were considered. Increasing the number of attributes analyzed should give a much clearer picture of the type of data. The preprocessing algorithm can also be enhanced further.