PRIVACY PRESERVING CLUSTERING BASED ON SINGULAR VALUE DECOMPOSITION AND GEOMETRIC DATA PERTURBATION

Privacy preservation is a major concern when data mining techniques are applied to large repositories of data that contain personal, sensitive and confidential information. Singular Value Decomposition (SVD) is a matrix factorization method that can produce perturbed data by efficiently removing information that is unnecessary for data mining. In this paper, two hybrid methods are proposed that take advantage of the existing techniques, SVD and geometric data transformations, in order to provide better privacy preservation. Reflection data perturbation and scaling data perturbation are familiar geometric data transformation methods that retain the statistical properties of the dataset. In hybrid method one, SVD and scaling data perturbation are combined to obtain the distorted dataset. In hybrid method two, SVD and reflection data perturbation are combined to obtain the distorted dataset. The experimental results demonstrate that the proposed hybrid methods provide higher utility without breaching privacy.


INTRODUCTION
With the advancement in hardware technology, large amounts of data are collected and stored by various companies and organizations. Data mining is a useful process for analyzing and extracting information from these large amounts of data. Association rule mining, classification and clustering are some of the data mining tasks. Clustering is a tool that segments the dataset into meaningful clusters, so that similar objects are placed in the same cluster and dissimilar objects are placed in different clusters. A recent survey of web users reveals that 86% of the users believe that participation in the survey results in a violation of the privacy rights of individuals [1]. On the one hand, storing and sharing data and applying data mining techniques to these data are useful for the decision making process; on the other hand, privacy violation occurs when the extracted data mining patterns contain delicate personal information. Privacy preserving data mining has been developed to avoid the disclosure of sensitive information, maintain the confidentiality of organizations and preserve the privacy of individuals. This paper proposes SVD based hybrid methods for privacy preserving clustering in a centralized database environment. In hybrid method one, the dataset is perturbed using SVD and scaling data perturbation. In hybrid method two, SVD and reflection data perturbation are used to protect the sensitive attribute values. These hybrid methods preserve both privacy and the utility of the data for clustering. Related work on privacy preserving clustering is discussed in the following section.

LITERATURE SURVEY
The authors in [2] addressed the statistical inference problem in online query processing systems and presented a framework for evaluating and comparing different controls. Data perturbation techniques for privacy preserving data mining have been discussed by the authors in [3]. The authors in [4] [5] presented geometric data transformation methods for privacy preserving clustering in a centralized database environment. Privacy preserving classification using singular value decomposition and the application of SVD on structural partitions are discussed in [6]. The authors in [7] proposed a sparsified SVD method for data distortion and a simplified model for terrorist analysis. In [8], an SVD based data distortion method for privacy preserving clustering has been addressed. A hybrid data distortion method using isometric transformations such as translation, rotation and reflection to preserve the confidentiality of numerical attributes in centralized data has been presented by the authors in [10]. A hybrid data transformation based on double reflecting data perturbation and rotation data perturbation, which preserves the privacy of confidential numerical attributes, is introduced in [11]. The proposed hybrid methods are explained in the following section.

SVD BASED DATA DISTORTION
Various single data perturbation techniques exist for preserving the privacy of individuals. To enhance the privacy provided by single data perturbation methods such as SVD, scaling data perturbation and reflection data perturbation, two hybrid data perturbation methods are proposed in this paper.

Singular Value Decomposition
Singular Value Decomposition (SVD) is a matrix factorization method [13] that reduces the dimensionality of a dataset and can also be used as a data distortion method. Let A be a matrix of dimension n × m representing the original dataset. The rows of the matrix correspond to data objects and the columns to attributes. SVD is a general method that factors any n × m matrix A of rank r into a product of three matrices, $A = U W V^T$, where U is an n × n orthonormal matrix, W is an n × m diagonal matrix whose nonnegative diagonal entries (the singular values) are in descending order, and $V^T$ is an m × m orthonormal matrix. Because of the arrangement of the singular values in W, the SVD transformation has the property that the maximum variation among the objects is captured in the first dimension, much of the remaining variation is captured in the second dimension, and so on. The rank-k approximation $A_k$ of the matrix A is defined as $A_k = U_k W_k V_k^T$, where $U_k$ contains the first k columns of U, $W_k$ contains the first k nonzero singular values, and $V_k^T$ contains the first k rows of $V^T$. With k usually small, the dimensionality of the dataset is reduced dramatically from min(m, n) to k (assuming all attributes are linearly independent). The steps for obtaining the distorted database with SVD are given in Table 1.

Input: Dataset D containing m rows and n columns.
Output: Distorted Dataset D′ containing m rows and n columns.

Begin
Step 1: Suppress all identifier attributes from the given m × n matrix D.
Step 2: Apply SVD on the matrix D to obtain the decomposed matrices U, W, V^T.
Step 3: Compute the distorted matrix D′ = U W V^T.
Step 4: Release the distorted dataset D′ for clustering analysis.
End
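
The steps of Table 1 can be sketched in Python with NumPy. This is a minimal sketch, not the authors' implementation: the function name `svd_distort` and the rank parameter `k` are assumptions, identifier attributes (Step 1) are assumed to have been removed already, and the truncation to the k largest singular values follows the rank-k approximation described above, since multiplying the full factors back together would simply reproduce D.

```python
import numpy as np

def svd_distort(D, k):
    """Return the rank-k SVD approximation of D (hypothetical helper)."""
    D = np.asarray(D, dtype=float)
    # Step 2: thin SVD; NumPy returns the singular values in descending order.
    U, w, Vt = np.linalg.svd(D, full_matrices=False)
    # Step 3: keep only the k largest singular values and matching vectors.
    return U[:, :k] @ np.diag(w[:k]) @ Vt[:k, :]

# Illustrative values only: 5 data objects, 3 numerical attributes.
D = np.array([[30.0, 64.0, 1.0],
              [31.0, 59.0, 2.0],
              [33.0, 58.0, 10.0],
              [34.0, 60.0, 0.0],
              [38.0, 69.0, 21.0]])
D_prime = svd_distort(D, k=2)  # same shape as D, but of rank 2
```

The distorted matrix keeps the shape of the input, so it can be passed to a clustering algorithm unchanged; smaller values of `k` discard more of the original variation.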

Geometric Data Transformation Methods
A geometric data transformation method of dimension d is an ordered pair, defined as GDTM = (V, f), where: a) $V \subseteq \mathbb{R}^d$ is a representative vector subspace of the data points to be transformed; b) f is a geometric transformation function, $f: \mathbb{R}^d \rightarrow \mathbb{R}^d$. For geometric data transformation methods, the inputs are the vectors V, composed of confidential numerical attributes, and the uniform noise vector N, while the output is the transformed vector space V′. The geometric data transformation methods are translation data perturbation, scaling data perturbation, rotation data perturbation and reflection data perturbation. Among these, scaling data perturbation and reflection data perturbation are adopted in the proposed hybrid methods to transform the original dataset, protecting the privacy of individuals while maintaining the similarity between the data objects.

Scaling Data Perturbation (SDP):
In the scaling data perturbation method, the noise term is applied to each confidential numerical attribute: all values of a selected attribute are multiplied by a positive or negative constant. The data transformed in this way form the distorted dataset, which is shared for clustering analysis.
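
As a small illustration of this multiplication, the sketch below scales each attribute (column) by its own constant. The helper name `scale_perturb` and the sample values are assumptions made for the example; the rejection of a zero noise term is likewise an added safeguard, not something stated in the method.

```python
def scale_perturb(rows, noise):
    """Multiply column j of `rows` by noise[j] (hypothetical helper)."""
    if any(e == 0 for e in noise):
        # Assumption: a zero constant would erase the attribute entirely.
        raise ValueError("noise terms must be nonzero")
    return [[a * e for a, e in zip(row, noise)] for row in rows]

# Illustrative values only: 3 records, 2 confidential attributes.
rows = [[30.0, 64.0], [31.0, 59.0], [33.0, 58.0]]
noise = [1.5, -0.5]          # one positive and one negative constant
perturbed = scale_perturb(rows, noise)
```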

ReFlection Data Perturbation (RFDP):
In the reflection data perturbation method, the noise term is a rotation angle that is applied to the confidential numeric attributes. k pairs of attributes are selected from the data matrix D. When the number of attributes in D is even, each attribute is taken exactly once. When it is odd, the last attribute is paired with a randomly chosen attribute that has already been selected.
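
The pairing rule and the application of an angle to one attribute pair can be sketched as follows. The function names are hypothetical, attributes are referred to by zero-based index, and the standard 2-D rotation matrix is assumed as the way the angle acts on a pair, which the description above does not spell out.

```python
import math
import random

def pair_attributes(n, rng):
    """Pair attribute indices 0..n-1; an odd last index reuses a random earlier one."""
    if n < 2:
        raise ValueError("at least two attributes are needed to form a pair")
    pairs = [(i, i + 1) for i in range(0, n - 1, 2)]
    if n % 2 == 1:
        # Odd count: pair the last attribute with an already selected one.
        pairs.append((n - 1, rng.choice(range(n - 1))))
    return pairs

def rotate_pair(x, y, theta_deg):
    """Apply a 2-D rotation by theta (in degrees) to one pair of attribute values."""
    t = math.radians(theta_deg)
    return (math.cos(t) * x - math.sin(t) * y,
            math.sin(t) * x + math.cos(t) * y)

pairs_even = pair_attributes(4, random.Random(0))   # [(0, 1), (2, 3)]
pairs_odd = pair_attributes(5, random.Random(0))    # third pair reuses an index
```

With n attributes this produces n/2 pairs when n is even and (n+1)/2 pairs when n is odd.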
Single data distortion methods do not provide good privacy protection: in many cases the original data can be extracted from the perturbed data. Transformations that leave the metric properties unaltered are called isometric transformations. The SVD data transformation retains the general trends in the data while also protecting privacy. In order to provide better privacy preservation, two hybrid methods are proposed in this paper that take advantage of the existing techniques, SVD and geometric data perturbation. In hybrid method one, SVD and scaling data perturbation are combined to obtain the distorted dataset. In hybrid method two, SVD and reflection data perturbation are combined to obtain the distorted dataset. The following section presents hybrid method-1 (SVD & SDP).

Hybrid Method-1 Based On SVD & SDP
The main aim of developing a hybrid technique is to efficiently hide sensitive data from the outside world while still allowing the useful patterns in the dataset to be extracted. In case one, a hybrid method is proposed by combining two existing techniques, singular value decomposition and scaling data perturbation. Table 2 shows the algorithm for the proposed hybrid method-1 based on SVD and SDP.

Input: Dataset D containing m rows and n columns.
Output: Distorted Dataset D′′ containing m rows and n columns.

Begin
Step 1: Suppress all identifier attributes from the given m × n matrix D.
Step 2: Apply SVD on the matrix D to obtain the decomposed matrices U, W, V^T.
Step 3: Compute the distorted matrix D′ = U W V^T.
Step 4: For each confidential numerical attribute Aj in D′, where 1 ≤ j ≤ n, do
    1. Select the noise term ej for the attribute Aj.
    2. For each aij, an instance of Aj, where 1 ≤ i ≤ m, do
           aij ← aij * ej
       End For
End For
Step 5: Release the distorted dataset D′′ for clustering analysis.
End

The data perturbation process in Table 2 consists of two steps. In step one, the given input dataset is transformed using SVD data perturbation. In step two, the result is used as input to the scaling data perturbation to obtain the final distorted dataset, which achieves higher privacy preservation. The final distorted dataset is released for clustering analysis. The algorithm for case two, which is based on SVD and reflection data perturbation, is discussed in the following section.
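
The two-step process of hybrid method-1 can be sketched with NumPy. This is a sketch under stated assumptions rather than the authors' code: the name `hybrid_svd_sdp` is hypothetical, the SVD step is taken to be a rank-k truncation (the parameter `k` is an assumption), and the noise terms are supplied by the caller because the source does not say how they are selected.

```python
import numpy as np

def hybrid_svd_sdp(D, k, noise):
    """SVD distortion followed by per-attribute scaling (hypothetical helper)."""
    D = np.asarray(D, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if noise.shape != (D.shape[1],):
        raise ValueError("one noise term is needed per attribute")
    # Step one: rank-k SVD approximation D' of the input.
    U, w, Vt = np.linalg.svd(D, full_matrices=False)
    D1 = (U[:, :k] * w[:k]) @ Vt[:k, :]
    # Step two: multiply every value of attribute j by noise[j] (broadcasting).
    return D1 * noise

# Illustrative values only.
D = np.array([[30.0, 64.0, 1.0],
              [31.0, 59.0, 2.0],
              [33.0, 58.0, 10.0],
              [34.0, 60.0, 0.0],
              [38.0, 69.0, 21.0]])
D_double_prime = hybrid_svd_sdp(D, k=2, noise=[2.0, -1.0, 0.5])
```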

Hybrid Method-2 Based On SVD & RFDP
To provide higher privacy protection, a hybrid data perturbation method is proposed as a combination of singular value decomposition and reflection data perturbation. Table 3 shows the algorithm for the proposed hybrid method-2. The given input dataset is perturbed in two steps. In step one, the given dataset is transformed using SVD data perturbation. In step two, the result is used as input to the reflection data perturbation to obtain the final distorted dataset.
Aug 10, 2013
Input: Dataset D containing m rows and n columns.
Output: Distorted Dataset D′′ containing m rows and n columns.

Begin
Step 1: Suppress all identifier attributes from the given m × n matrix D.
Step 2: Apply SVD on the matrix D to obtain the decomposed matrices U, W, V^T.
Step 3: Compute the distorted matrix D′ = U W V^T.
Step 4: Calculate k = n/2 if n is even, else k = (n+1)/2.
Step 5: Select k pairs of attributes in D′.
Step 6: For each pair of attributes Ai, Aj from Step 5, where 1 ≤ i ≤ n and 1 ≤ j ≤ n, do
    1. Compute D′′(Ai′, Aj′) = Ro(θ) × D′(Ai, Aj) for different values of θ and identify the range that gives higher privacy values.
    2. Select an angle θ from the selected range that gives the highest privacy preservation to compute the noise term Ro(θ).
    3. Using this Ro(θ), compute D′′(Ai′, Aj′) = Ro(θ) × D′(Ai, Aj).
End For
Step 7: Release the distorted dataset D′′ for clustering analysis.
End
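
The steps of hybrid method-2 can be sketched with NumPy as below. Several points are assumptions, not taken from the source: the helper names, the rank-k truncation in the SVD step, the use of the standard 2-D rotation matrix for Ro(θ), and the rule for choosing θ, which here simply picks, from a caller-supplied list of candidate angles, the one that maximizes the summed privacy value var(X − Y) / var(X) of the two attributes in the pair.

```python
import numpy as np

def rotation(theta_deg):
    """Standard 2-D rotation matrix, assumed here for Ro(theta)."""
    t = np.radians(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def privacy(x, y):
    """S = var(X - Y) / var(X) for one attribute."""
    return np.var(x - y) / np.var(x)

def hybrid_svd_rfdp(D, k, pairs, angles):
    """SVD distortion, then an angle-based transform of each attribute pair."""
    D = np.asarray(D, dtype=float)
    U, w, Vt = np.linalg.svd(D, full_matrices=False)
    D2 = (U[:, :k] * w[:k]) @ Vt[:k, :]        # step one: rank-k approximation
    for i, j in pairs:
        block = D2[:, [i, j]].T                # shape (2, m): one row per attribute

        def score(angle):
            rotated = rotation(angle) @ block
            return privacy(block[0], rotated[0]) + privacy(block[1], rotated[1])

        best = max(angles, key=score)          # assumed selection rule
        D2[:, [i, j]] = (rotation(best) @ block).T
    return D2

# Illustrative values only.
D = np.array([[30.0, 64.0, 1.0],
              [31.0, 59.0, 2.0],
              [33.0, 58.0, 10.0],
              [34.0, 60.0, 0.0],
              [38.0, 69.0, 21.0]])
D_double_prime = hybrid_svd_rfdp(D, k=3, pairs=[(0, 1)], angles=[30, 60, 90])
```

Because a rotation is an isometric transformation, the distances between data objects within each transformed pair of attributes are unchanged, which is why clustering quality can be retained.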
Experimental results of the proposed methods are discussed in the next section.

IMPLEMENTATION OF PROPOSED METHODS
The proposed methods are validated empirically by conducting experiments on three real-life datasets obtained from UCI [9]: the Haberman dataset with 3 attributes and 306 records, the Credit-g dataset with 5 numerical attributes and 1000 records, and the Abalone dataset with 5 numerical attributes and 4177 instances. The performance of a data distortion method is measured on two factors: I) utility measures and II) privacy measures. The utility of the dataset is high when the data distortion technique gives high clustering accuracy after the distortion. The well-known k-means clustering algorithm is used to measure the clustering quality.
The utility of the dataset is measured by the misclassification error. After the data are transformed, the clusters in the original dataset should be equal to those in the distorted dataset. WEKA (Waikato Environment for Knowledge Analysis) software is used to test the clustering accuracy of the original and modified databases. The misclassification error, denoted by ME, is measured as follows:

$$ME = \frac{1}{N} \sum_{i=1}^{K} \left( |Cluster_i(D)| - |Cluster_i(D')| \right)$$

In the above formula:
N - Number of points in the original dataset.
K - Number of clusters.
|Cluster_i(D)| - Number of data points of the i-th cluster of the original dataset.
|Cluster_i(D′)| - Number of data points of the i-th cluster of the transformed dataset.
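
A small sketch of how ME could be computed from two cluster assignments is given below. It rests on assumptions that the text does not state: ME is read as the fraction of points that leave their original cluster, the cluster labels of the two clustering runs are assumed to have been matched beforehand, and the function name is hypothetical.

```python
def misclassification_error(labels_orig, labels_dist):
    """Fraction of points whose cluster differs after distortion (assumed reading of ME)."""
    if not labels_orig or len(labels_orig) != len(labels_dist):
        raise ValueError("both label lists must be non-empty and of equal length")
    n = len(labels_orig)
    moved = 0
    for c in set(labels_orig):
        # Size of cluster c in the original dataset.
        size = sum(1 for a in labels_orig if a == c)
        # Points of cluster c that are still in cluster c after distortion.
        kept = sum(1 for a, b in zip(labels_orig, labels_dist) if a == c and b == c)
        moved += size - kept
    return moved / n

# Illustrative labels only: one of six points changes cluster.
me = misclassification_error([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1])
```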
The cluster quality of the distorted dataset is measured by calculating misclassification error (ME) values. Higher ME values indicate lower clustering quality, whereas lower ME values indicate higher clustering quality. The k-means clustering algorithm is used to generate the clusters for the three original datasets as well as the distorted datasets. Each experiment is repeated 10 times because the k-means algorithm is not deterministic, and the reported ME value is the average of the 10 runs. Table 4 shows the ME values of all three datasets for the SVD method, the scaling data perturbation method, the reflection data perturbation method, hybrid method-1 (SVD & SDP) and hybrid method-2 (SVD & RFDP). A comparison of the misclassification error values in Table 4 indicates that the proposed hybrid method-2 (SVD & RFDP) yields lower misclassification error rates for all three datasets, and that the SDP method gives the lowest misclassification error values. Even though SDP gives lower misclassification error values, many researchers have pointed out that the multiplicative noise added by the SDP method can easily be filtered out using a logarithmic transformation and other attack methods. Hence an intruder can recover the original dataset. Among these methods, SVD gives lower misclassification error than ReFlection Data Perturbation (RFDP) and also protects the privacy of individuals, so SVD is selected and included in the hybrid methods. Figure 2 depicts the effectiveness of the proposed hybrid methods. Figure 2(b) shows the misclassification error values related to hybrid method-2 (SVD & RFDP). When these values are compared, the SVD method gives lower misclassification error than RFDP, and hybrid method-2 (SVD & RFDP) gives the lowest misclassification error among the three methods. The misclassification error values of the single data perturbation methods and the hybrid methods are illustrated in Figure 2(c).
When the hybrid methods are compared, hybrid method-2 (SVD & RFDP) gives the lower misclassification error values, while the SDP method gives misclassification error values of 0, which are the lowest among all the methods.
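
The weakness of multiplicative noise noted above can be illustrated with a short sketch. It assumes a simple attack setting that is not described in the source: the intruder knows one original value of the attribute, all values are nonzero, and a single constant was used for the whole attribute. In log space the perturbation is then just a constant offset, which one known value is enough to remove. The function name and the data are hypothetical.

```python
import math

def recover_scale(perturbed_value, known_original):
    """Recover the constant e from one known pair, using log|a*e| - log|a| = log|e|."""
    log_e = math.log(abs(perturbed_value)) - math.log(abs(known_original))
    # The sign of e is the sign of the product of the two known values.
    sign = math.copysign(1.0, perturbed_value * known_original)
    return sign * math.exp(log_e)

# Illustrative values only: the released column is the original times -2.5.
original = [30.0, 31.0, 33.0, 38.0]
released = [a * -2.5 for a in original]

e = recover_scale(released[0], original[0])   # intruder knows the first value
recovered = [v / e for v in released]         # whole attribute is restored
```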
The privacy of the perturbation technique is measured as the variance between the actual and the perturbed values [4]. This measure is given by $\mathrm{var}(X - Y)$, where X represents a single original attribute and Y is the distorted attribute. The privacy value S is computed as $S = \mathrm{var}(X - Y) / \mathrm{var}(X)$.
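
The privacy value can be computed directly from its definition. The sketch below uses only the Python standard library; the function name and the sample attribute are assumptions, and the population variance is used, which does not affect the ratio as long as the same estimator is applied to both terms.

```python
from statistics import pvariance

def privacy_value(original, distorted):
    """S = var(X - Y) / var(X) for one attribute (hypothetical helper)."""
    if len(original) != len(distorted):
        raise ValueError("X and Y must have the same length")
    var_x = pvariance(original)
    if var_x == 0:
        raise ValueError("var(X) is zero, so S is undefined")
    diffs = [x - y for x, y in zip(original, distorted)]
    return pvariance(diffs) / var_x

# Illustrative attribute only.
X = [1.0, 2.0, 3.0, 4.0]
s_same = privacy_value(X, X)                       # no distortion
s_scaled = privacy_value(X, [2 * x for x in X])    # Y = 2X
s_flipped = privacy_value(X, [-x for x in X])      # Y = -X
```

For this sample, an undistorted attribute gives S = 0, scaling by 2 gives S = 1, and negating the attribute gives S = 4, so larger departures from the original values produce larger S.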
Higher S values indicate higher privacy protection. The S values computed on the three datasets for the three single data perturbation methods (SVD, scaling data perturbation and reflection data perturbation) and for hybrid methods 1 and 2 are shown in Table 5. The privacy measures assess the privacy protection of the data distortion methods. The graphical representation of the privacy values for all the individual methods and hybrid methods is given in Figure 3. A comparison of the privacy values of the perturbation methods indicates that the hybrid methods provide higher security and can preserve privacy.

Figure 3: Privacy Values
The privacy values depicted in Figure 3(a) clearly reveal that the proposed hybrid method-1 (SVD & SDP) yields higher privacy values. Hence hybrid method-1 (SVD & SDP) gives higher privacy preservation than the single data perturbation methods SVD and SDP.