A Preview on Subspace Clustering of High Dimensional Data

When clustering high dimensional data, traditional clustering methods fall short because they consider all of the dimensions of the dataset when discovering clusters, whereas only some of the dimensions are relevant. Clusters may therefore exist only within subspaces of the dataset. Feature selection removes irrelevant and redundant dimensions by analyzing the entire dataset. Subspace clustering, in contrast, is the problem of automatically identifying clusters that exist in multiple, possibly overlapping subspaces of high dimensional data, allowing better clustering of the data points. There are two major approaches to subspace clustering, distinguished by search strategy. Top-down algorithms find an initial clustering in the full set of dimensions and evaluate the subspaces of each cluster, iteratively improving the results. Bottom-up approaches start by finding low dimensional dense regions and then use them to form clusters. Based on a survey of subspace clustering, we identify the challenges and issues involved in clustering gene expression data.


INTRODUCTION
The purpose of cluster analysis is to detect groups or clusters of similar objects, where an object is represented as a vector of measurements or points in multidimensional space. The distance measure determines the dissimilarity between objects in the various dimensions in the dataset [1]. With advances in technology, data collection has become easier and faster, leading to large and complex datasets containing many objects and dimensions. This requires that the existing algorithms be enhanced to speedily detect clusters of high quality. In sharp contrast to subspace clustering algorithms, traditional clustering algorithms consider all of the dimensions of the dataset in discovering clusters although many of the dimensions are often irrelevant [2][3] [4]. These irrelevant dimensions can confuse clustering algorithms by hiding clusters in noisy data. The purpose of generating high quality clusters in high dimensional data such as microarray gene data is to get a correct and informative biological interpretation of the gene cluster. One such technique to extract biologically relevant information about genes in a dataset is a tree of genes called GERC tree [5], which is produced by a divisive clustering algorithm, and the leaves represent the generated clusters.
Gene expression data are generated by DNA chips and other microarray techniques and they are often presented as matrices of expression levels of genes under different conditions, including environments, individuals and tissues. One of the major objectives of gene expression data analysis is to identify groups of genes having similar expression patterns under the full space or subspace of conditions. It may result in the discovery of regulatory patterns or condition similarities. Generally co-expressed genes, which are members of the same clusters, are expected to have similar functions. A method is presented in [6] to build a gene co-expression network (CEN), which is an undirected graph of nodes representing genes, connected by an edge if the corresponding gene pairs are significantly co-expressed. A gene expression similarity measure called NMRS (Normalized mean residue similarity) is used to construct the CEN, which is used to detect network modules from the built network.
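As an illustration of the co-expression network idea, the following minimal sketch connects two genes by an edge when their profiles are strongly correlated. It is not the method of [6]: absolute Pearson correlation is used as a stand-in for NMRS, and the function name `coexpression_network`, the threshold of 0.8 and the toy matrix are assumptions made only for this example.

```python
import numpy as np

def coexpression_network(expr, threshold=0.8):
    """Build an undirected gene graph from an expression matrix.

    expr: 2-D array, rows = genes, columns = conditions.
    Two genes are linked when |Pearson r| >= threshold.
    NOTE: Pearson correlation is a stand-in; [6] uses NMRS instead.
    """
    corr = np.corrcoef(expr)              # gene-by-gene correlation matrix
    n_genes = expr.shape[0]
    edges = {g: set() for g in range(n_genes)}
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if abs(corr[i, j]) >= threshold:
                edges[i].add(j)           # undirected: store both directions
                edges[j].add(i)
    return edges

# Toy data (hypothetical): gene 1 follows gene 0, gene 2 does not.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.0],
    [4.0, 1.0, 3.5, 1.2],
])
network = coexpression_network(expr)
print(network)
```

Network modules could then be sought as densely connected groups of nodes in the resulting graph.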

Problems associated with Clustering High Dimensional Data
In high dimensional data, it is common for all the objects in a dataset to be spread out until they are almost equidistant from each other. The distance measures between the objects become meaningless, and because of this "curse of dimensionality" [7] the performance of many clustering algorithms suffers, giving rise to several issues. (i) Any optimization problem becomes increasingly difficult with an increasing number of variables (attributes) [8]. (ii) The discrimination between the nearest and the farthest neighbors becomes rather poor in high dimensional data spaces [9] [10]. (iii) Many irrelevant attributes in a data set are collected due to the highly automated data acquisition process. Since the clusters are defined by only some of the attributes, the remaining irrelevant attributes ("noise") may interfere with the efforts to find the "true" number of clusters. (iv) In a data set containing many attributes, some attributes will most likely be correlated with each other. (v) A wrong choice of the proximity measure used by a clustering technique may lead to the discovery of some similar groups of genes at the expense of obscuring other similar groups.
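The loss of contrast between the nearest and the farthest neighbor in issue (ii) can be observed on synthetic data. The sketch below is only an illustration under stated assumptions: points are drawn uniformly from the unit hypercube, Euclidean distance is used, and the name `relative_contrast` is introduced here for the quantity (d_max - d_min) / d_min.

```python
import numpy as np

def relative_contrast(n_points, n_dims, rng):
    """(d_max - d_min) / d_min for distances from one query to random points."""
    points = rng.random((n_points, n_dims))     # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: relative_contrast(1000, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:5d}  relative contrast={c:.3f}")
```

As the number of dimensions grows, the printed contrast shrinks, so the nearest and the farthest neighbor become harder to tell apart.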
Feature selection methods have been employed somewhat successfully to improve cluster quality. These algorithms find a subset of dimensions on which to perform clustering by removing irrelevant and redundant dimensions. Unlike feature selection methods which examine the dataset as a whole, subspace clustering algorithms evaluate features only on a subset of the data, based on a measure referred to as a "measure of locality" [11] representing a cluster, and are able to uncover clusters that exist in multiple, possibly overlapping subspaces and represent them in easily interpretable and meaningful ways [12].

Relevance of Subspace Clustering
Subspace clustering has been applied in text-mining, network anomaly detection, object detection in hyper-spectral satellite data and gene expression data analysis for finding co-expressed or coherent patterns. Clustering algorithms have been used with DNA microarray data for identification and characterization of genes. However, the high dimensionality of microarray and text data makes the task of pattern discovery difficult for traditional clustering algorithms and for this reason subspace clustering techniques can be used to uncover the complex relationships found in data in these areas. The work proposed in [6] has been extended to trace correlation among genes over a subspace of samples, represented by a co-expression network [13].

Proximity Measures
There are different methods [14] for quantifying the similarity or dissimilarity between two gene expression profiles, described in terms of the distance between them in the high-dimensional space of gene expression measurements. A dissimilarity measure D(gi, gj) for any two gene profiles gi and gj obeys the following properties.
i. The distance between any two profiles cannot be negative, i.e., D(gi, gj) ≥ 0.
ii. The distance between a profile and itself must be zero, i.e., D(gi, gi) = 0.
iii. A zero distance between two profiles implies that the profiles are identical.
iv. The distance between profile gi and profile gj is the same as the distance between profile gj and profile gi, i.e., D(gi, gj) = D(gj, gi).
v. The distance measure obeys the triangle inequality, i.e., for profiles gi, gj, gk, we have D(gi, gk) ≤ D(gi, gj) + D(gj, gk).
A microarray experiment compares genes from an organism under different development time points, conditions or treatments. For an n condition experiment, a single gene has an n-dimensional observation vector known as its gene expression profile. A proximity measure is a real-valued function that assigns a positive real number as a similarity value between any two expression vectors. The choice of the proximity measure depends upon the type of data, the dimensionality and the approach used in the identification of coherent patterns. Therefore, to identify genes or samples that have similar expression profiles, selecting an appropriate proximity measure is essential.
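To make the dependence on the proximity measure concrete, the following sketch compares Euclidean distance with a correlation-based dissimilarity on three toy profiles. The function names and the profiles are assumptions for illustration. Note that 1 - r is zero for any two perfectly correlated profiles even when they are not identical, so it does not satisfy property iii above.

```python
import numpy as np

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def pearson_dissimilarity(x, y):
    """1 - Pearson r: 0 for perfectly correlated, 2 for anti-correlated profiles."""
    r = np.corrcoef(x, y)[0, 1]
    return float(1.0 - r)

# Hypothetical profiles over four conditions.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]   # same trend as g1, larger magnitude
g3 = [4.0, 3.0, 2.0, 1.0]       # opposite trend to g1

for name, other in (("g2", g2), ("g3", g3)):
    print(name,
          "euclidean:", round(euclidean_distance(g1, other), 3),
          "pearson dissimilarity:", round(pearson_dissimilarity(g1, other), 3))
```

Euclidean distance treats g2 as far from g1 because of its magnitude, whereas the correlation-based measure treats the two as coherent, which is often the behavior wanted when looking for co-expressed genes.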

GENE-BASED CLUSTERING USING SUBSPACE APPROACHES
The purpose of gene-based clustering is to group together co-expressed genes, indicating co-function and co-regulation. It has already been established in molecular biology that normally only a small subset of genes participate in any cellular process of interest. However, traditional clustering algorithms are generally concerned with clusters in the full feature space. Subspace clustering was initially proposed by Agrawal et al. [11] to evaluate features in only a subset of the data, based on a "measure of locality" representing a cluster. Subspace clustering algorithms are further divided into two categories: (i) bottom-up search and (ii) top-down search.

Bottom-Up Subspace Search Methods
This category of methods takes advantage of the downward closure property of density to reduce the search space. This property states that if a subspace S contains a cluster, then any subspace T ⊆ S must also contain a cluster. These methods determine locality by creating bins for each dimension, which together form a multi-dimensional grid. There are two approaches to determining the cut-points: (i) a static grid size and (ii) data-driven strategies. For the first approach, the two popular algorithms are CLIQUE [11], which finds clusters within subspaces using a grid- and density-based clustering approach, and ENCLUS [14], an Apriori-like clustering method that defines clusters based on entropy. For the second approach, the algorithm MAFIA [15], a variant of CLIQUE, uses an adaptive grid with parallelism to improve scalability. Unlike CLIQUE and MAFIA, CBF [16] creates partitions optimally, to avoid exponential growth of bins as the number of dimensions increases. CLTree [17] separates high and low density areas by using a modified decision tree algorithm with a modified gain criterion to measure the density of a region. DOC [18] is a hybrid of the bottom-up and top-down approaches and introduces the concept of an optimal projective cluster with strong clustering tendency over a subset of dimensions, found by iterative improvement.
The algorithms under the bottom-up approach are suitable for clustering gene expression data as they (i) are able to handle high dimensional data, (ii) can find clusters of arbitrary sizes and shapes, (iii) can find clusters which are embedded, intersected or disjoint, and (iv) scale reasonably well with the amount of data. However, their disadvantages are that (i) they do not scale well with the number of dimensions, (ii) they may sometimes eliminate small clusters, and (iii) their running time grows exponentially with the number of dimensions in the dataset.
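The bottom-up search can be sketched with a static grid as follows. This is a simplified illustration in the spirit of grid-based methods such as CLIQUE, not the published algorithm: the function names, the equal-width grid over [0, 1), the density threshold `min_points` and the synthetic data are all assumptions. Downward closure is exploited by building higher dimensional candidates only from units that are already dense in fewer dimensions.

```python
import numpy as np
from itertools import combinations

def dense_units_1d(data, n_bins, min_points):
    """Dense one-dimensional units, keyed by ((dimension, bin),)."""
    bins = np.minimum((data * n_bins).astype(int), n_bins - 1)
    units = {}
    for dim in range(data.shape[1]):
        for b in range(n_bins):
            members = frozenset(np.flatnonzero(bins[:, dim] == b).tolist())
            if len(members) >= min_points:
                units[((dim, b),)] = members
    return units

def join_units(units, min_points):
    """Join k-dimensional dense units into dense (k+1)-dimensional units."""
    joined = {}
    for (u1, m1), (u2, m2) in combinations(units.items(), 2):
        merged = tuple(sorted(set(u1) | set(u2)))
        dims = [d for d, _ in merged]
        # Keep only joins that add exactly one new dimension.
        if len(merged) != len(u1) + 1 or len(set(dims)) != len(dims):
            continue
        members = m1 & m2                 # points lying in both units
        if len(members) >= min_points:
            joined[merged] = members
    return joined

# Synthetic data: a cluster in dimensions 0 and 1 only, plus uniform noise.
rng = np.random.default_rng(0)
cluster = np.hstack([rng.uniform(0.21, 0.29, (50, 1)),
                     rng.uniform(0.61, 0.69, (50, 1)),
                     rng.random((50, 1))])
noise = rng.random((50, 3))
data = np.vstack([cluster, noise])

one_d = dense_units_1d(data, n_bins=10, min_points=30)
two_d = join_units(one_d, min_points=30)
print(sorted(one_d), sorted(two_d))
```

Dimension 2 never produces a dense unit, so no candidate involving it is generated, which is how the search space is pruned.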

Top-Down Subspace Search Methods
This approach starts with an initial approximation of clusters in an equally weighted full feature space. Next, it follows an iterative procedure to update the weights and re-form the clusters accordingly. It is expensive in the full feature space; however, the use of sampling techniques can improve the performance. The number of clusters and the size of the subspace are the most critical factors in this approach. Several algorithms have been described in the literature. PROCLUS [19] is a sampling based top-down subspace clustering algorithm that randomly selects a set of k medoids from a sample and iteratively improves the choice of medoids to form better clusters. ORCLUS [20], like PROCLUS, also attempts to form clusters iteratively by assigning a point to its nearest cluster. It computes the dissimilarity between a pair of points in a subspace defined by a set of orthonormal vectors. FINDIT [21] is a sampling based subspace clustering algorithm that finds clusters in three phases. The algorithm δ-Clusters [22] starts with an initial seed and attempts to improve the overall quality of the clusters iteratively by swapping dimensions with instances. The use of coherence as a similarity measure makes it more relevant for microarray data analysis. COSA [23] uses k-nearest neighbors (kNN) to iteratively calculate the dimension weights for each instance, assigning higher weights to the dimensions along which the instances have less dispersion within the neighbourhood, until the weights stabilize. The output of COSA is a distance matrix, which can be used as an input to any distance based clustering algorithm.
Top-down subspace search methods are fast and scalable, and the performance improves if sampling is used for large databases. However, their main disadvantages are that (i) they are sensitive to input parameters, (ii) the quality of the clusters depends upon the size of the sample chosen, and (iii) sampling may lead to some significant results being missed.
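The top-down iteration can be sketched as an alternation between assigning points to medoids and re-selecting the dimensions of each cluster. The following is a loose illustration inspired by medoid-based methods such as PROCLUS, not the published algorithm: the medoids are fixed rather than sampled and refined, and the function names, the number of retained dimensions and the synthetic data are assumptions.

```python
import numpy as np

def select_dimensions(points, medoid, n_dims):
    """Pick the n_dims dimensions along which points lie closest to the medoid."""
    spread = np.abs(points - medoid).mean(axis=0)
    return np.sort(np.argsort(spread)[:n_dims])

def segmental_distance(x, medoid, dims):
    """Average Manhattan distance restricted to the chosen dimensions."""
    return np.abs(x[dims] - medoid[dims]).mean()

def assign(data, medoids, dims_per_medoid):
    """Assign each point to the medoid that is nearest in that medoid's subspace."""
    dists = np.array([[segmental_distance(x, m, d)
                       for m, d in zip(medoids, dims_per_medoid)]
                      for x in data])
    return dists.argmin(axis=1)

# Synthetic data: cluster A is tight in dimensions 0-1, cluster B in 2-3.
rng = np.random.default_rng(1)
n = 40
cluster_a = np.hstack([0.1 + rng.uniform(-0.02, 0.02, (n, 2)),
                       rng.uniform(0.0, 0.5, (n, 2))])
cluster_b = np.hstack([rng.uniform(0.5, 1.0, (n, 2)),
                       0.9 + rng.uniform(-0.02, 0.02, (n, 2))])
medoids = np.array([[0.1, 0.1, 0.25, 0.25], [0.75, 0.75, 0.9, 0.9]])
data = np.vstack([medoids, cluster_a, cluster_b])

# Start in the equally weighted full space, then refine the subspaces.
all_dims = np.arange(data.shape[1])
dims = [all_dims, all_dims]
for _ in range(3):
    labels = assign(data, medoids, dims)
    dims = [select_dimensions(data[labels == k], medoids[k], 2) for k in range(2)]

print([d.tolist() for d in dims])
```

A full method would also replace poor medoids between iterations and choose the number of dimensions per cluster from the data, which is where the parameter sensitivity noted above arises.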

Biclustering Algorithms
A bicluster [24] is an I x J sub-matrix that exhibits some coherent tendency, where I and J are subsets of the genes (rows) N and the conditions (columns) M respectively, so that |I| ≤ |N| and |J| ≤ |M|. Biclustering algorithms of this kind use a measure called the mean squared residue: the residue of an element indicates its degree of coherence with respect to the remaining elements of the given bicluster, and the mean squared residue aggregates this over the whole bicluster. The lower the mean squared residue, the stronger the coherence exhibited by the cluster and the better the quality of the bicluster. The problem of finding the largest bicluster with minimum mean squared residue is NP-hard [24]. Biclustering algorithms employ different heuristic approaches to address this problem and can be divided into the following categories [25].
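For reference, the mean squared residue of a bicluster (I, J) is commonly written as H(I, J) = (1 / (|I||J|)) Σ (a_ij - a_iJ - a_Ij + a_IJ)^2, where a_iJ, a_Ij and a_IJ are the row mean, the column mean and the overall mean of the sub-matrix. A minimal sketch of this computation is given below; the function name and the toy matrix are assumptions for illustration.

```python
import numpy as np

def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue H(I, J) of the sub-matrix given by rows and cols."""
    sub = matrix[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # a_Ij
    residue = sub - row_mean - col_mean + sub.mean()
    return float((residue ** 2).mean())

# Rows 0-2 follow an additive pattern (row effect + column effect);
# row 3 does not.
expr = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [9.0, 1.0, 5.0],
])
coherent = mean_squared_residue(expr, [0, 1, 2], [0, 1, 2])
with_outlier = mean_squared_residue(expr, [0, 1, 2, 3], [0, 1, 2])
print(coherent, with_outlier)
```

A perfectly additive sub-matrix has a residue of zero, and including the incoherent row raises the score.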

Greedy Iterative Search
Greedy iterative search is based on the idea of forming biclusters by adding or deleting rows and columns, with an attempt to maximize the local gain, which may lead to faster processing at the cost of losing good biclusters. Cheng and Church [24] pioneered the application of the greedy approach to gene expression data, with the limitation that overlapping or embedded clusters cannot be identified because the elements of the already identified biclusters are masked by random noise. This limitation was addressed by FLOC [26], which uses a probabilistic algorithm to discover a set of k possibly overlapping biclusters simultaneously. OPSM [27] is another probabilistic model; it searches for large Order-Preserving SubMatrices (OPSM) with maximum statistical significance, where a bicluster is determined by a set of rows, a set of columns and a linear ordering of the columns. Murali and Kasif proposed xMOTIF [28]. Its purpose is to compute the set of conserved rows I and the set of columns J that give the largest xMotif, i.e., the one that contains the largest number of rows.
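The greedy deletion idea can be sketched as follows: while the mean squared residue of the current sub-matrix exceeds a threshold δ, remove the single row or column that contributes most to it. This is a simplified sketch of single node deletion only; it omits the other steps of the approach in [24] (such as node addition and masking of discovered biclusters), and the function names, the threshold and the toy matrix are assumptions.

```python
import numpy as np

def squared_residues(sub):
    """Element-wise squared residues of a sub-matrix."""
    return (sub - sub.mean(axis=1, keepdims=True)
                - sub.mean(axis=0, keepdims=True) + sub.mean()) ** 2

def greedy_delete(matrix, delta):
    """Remove the worst row or column until the mean squared residue <= delta."""
    rows = list(range(matrix.shape[0]))
    cols = list(range(matrix.shape[1]))
    while len(rows) > 1 and len(cols) > 1:
        sq = squared_residues(matrix[np.ix_(rows, cols)])
        if sq.mean() <= delta:
            break
        row_scores = sq.mean(axis=1)      # contribution of each row
        col_scores = sq.mean(axis=0)      # contribution of each column
        if row_scores.max() >= col_scores.max():
            rows.pop(int(row_scores.argmax()))
        else:
            cols.pop(int(col_scores.argmax()))
    return rows, cols

# Rows 0-3 are additive (coherent); row 4 is incoherent.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 0.0, 7.0, 1.0],
])
rows, cols = greedy_delete(expr, delta=0.01)
print(rows, cols)
```

Because each step optimizes only the local gain, a different deletion order on noisier data could discard rows that belong to a good bicluster, which is the trade-off described above.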

Divide-and-Conquer
Divide and conquer algorithms have the significant advantage of being potentially very fast. However, they have the significant drawback of missing good biclusters that may be split before they can be identified. The Block Clustering [29] algorithm begins with the entire data in one block (bicluster) and iteratively tries to find the best split. Since estimating the optimal number of splits is difficult, Duffy and Quiroz [30] suggested the use of permutation tests to determine when a given block split is not significant. Following this direction, Tibshirani et al. [31] added a backward pruning method to the block splitting algorithm and designed a permutation-based method called Gap Statistics to determine the optimal number of biclusters, K.

Exhaustive Bicluster Enumeration
Exhaustive bicluster enumeration methods are based on the idea that the best biclusters can only be identified by exhaustively enumerating all possible biclusters that exist in the data matrix. These algorithms can certainly find the best biclusters, if they exist, but have a very serious drawback. Due to their high complexity, they can only be executed by assuming restrictions on the size of the biclusters. Tanay et al. [32] introduced SAMBA, a biclustering algorithm that performs simultaneous bicluster identification by using exhaustive enumeration.

Iterative Row and Column Clustering Combination
This method applies clustering methods to the columns and rows of a data matrix separately and then combines the results to obtain biclusters. CTWC [33] tries to identify pairs of small subsets of features (Fj) and objects (Oj), where both Fj and Oj can be either rows or columns. ITWC [34] is an iterative biclustering algorithm based on a combination of the results obtained by clustering performed on each of the two dimensions of the data matrix separately. DCC [35] uses self-organizing maps (SOM) to perform clustering in the row and column spaces of the data matrix and uses an angle metric as the similarity measure.

TriClustering Algorithms
TriClusters are coherent clusters along gene-sample-time (temporal) or gene-sample-region (spatial) dimensions, which may be arbitrarily positioned and overlapped [36]. TriClustering algorithms are used for mining such coherent clusters in three-dimensional gene expression datasets. TriCluster [36] uses a graph-based approach to detect different types of clusters depending on different parameter values, including arbitrarily positioned and overlapping clusters. gTRICLUSTER [37] accepts four input parameters, namely, minimum similarity threshold, minimum sample threshold, minimum gene threshold and minimum time threshold, and gives as output coherent clusters along the gene-sample-time dimensions. The work in [38] uses a heuristic TRI-Clustering algorithm to integrate gene expression and gene regulation information, by defining regulated expression values (REV) as indicators of how a gene is regulated by a specific factor. A selected survey outlining the basic challenges of triclustering is presented in [39], based on an analysis of three popular triclustering algorithms. The work in [40] proposes a technique, based on order preserving submatrices, to find a set of triclusters from gene-sample-time data.

CLUSTER VALIDITY MEASURES
For gene expression data, clustering results in groups of co-expressed genes, groups of samples with a common phenotype, or "blocks" of genes and samples involved in specific biological processes. However, different clustering algorithms, or even different runs of a single clustering algorithm using different parameters, generally produce different sets of clusters [41]. Therefore, it is important to compare various clustering results and select the one that best fits the "true" data distribution. Cluster validation assesses the quality and reliability of the clusters obtained from various clustering processes.
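One simple way to compare two clustering results, or a result with a reference partition, is the Rand index, which is the fraction of object pairs on which the two partitions agree (both place the pair together, or both place it apart). The sketch below is a direct pairwise implementation; the function name and the toy labelings are assumptions for illustration.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of object pairs treated the same way by both partitions."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Same grouping with different label names: full agreement.
same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A grouping that cuts across the first one: low agreement.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

The pairwise loop is quadratic in the number of objects, which is acceptable for a sketch but would need a contingency-table formulation for large gene sets.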
Generally, cluster validity has three aspects. First, the quality of clusters can be measured in terms of homogeneity and separation on the basis of the definition of a cluster: objects within one cluster are similar to each other, while objects in different clusters are dissimilar to each other. The second aspect relies on a given "ground truth" of the clusters. The "ground truth" could come from domain knowledge, such as known function families of genes, or from other sources such as the clinical diagnosis of normal or cancerous tissues. Cluster validation is based on the agreement between the clusters obtained and the "ground truth." The third aspect of cluster validity focuses on the reliability of the clusters, or the likelihood that the cluster structure is not formed by chance. Some of the popular cluster validity measures used to compare clustering results are Rand index [42], Cluster Homogeneity [43], Silhouette index [44], Z-score [45] and P-values [46].
Clustering gene expression data poses challenges that are different from those of clustering non-biological data. This is due to the very nature of data being collected from microarray experiments.

Research Challenges
Studies have confirmed that clustering algorithms are useful in identifying groups of co-expressed genes and discovering coherent expression patterns. However, due to the distinct characteristics of time-series gene expression data and the special requirements from the biology domain, clustering gene expression data still faces the following challenges.
i. The effectiveness of a clustering technique is highly dependent on the proximity measure used by the technique. Finding an appropriate proximity measure or developing a clustering technique which works independently of any proximity measure is a challenging task.
ii. Most existing clustering techniques are either dependent on input parameter(s) or stopping criteria for discovery of the "true" number of clusters. However, providing an appropriate set of parameter(s), or stopping criteria, or developing a parameterless clustering technique able to find biologically relevant clusters is a major task and hence a challenge.
iii. Hence, the clustering algorithm should be capable of identifying all these types of clusters simultaneously. This is a challenging task, irrespective of size of dataset or dimensionality.
iv. The available gene datasets often contain a lot of noise and missing values. Thus a clustering algorithm should be capable of extracting the "true" number of clusters in the presence of this noise and also be able to handle missing values.
v. Apart from clustering, an algorithm should be capable of showing the associations among the clusters which may be useful for drawing conclusions, i.e., the clusters to be represented in interpretable and meaningful ways.
vi. The problem for subspace algorithms is compounded in that they must also determine the appropriate dimensionality of the subspaces.
vii. Subspace clustering algorithms must also define the subset of features that are relevant for each cluster and these are in most cases found to be overlapping.
viii. In addition to producing high quality, interpretable clusters, subspace clustering methods must also be scalable with respect to the dimensionality of the subspaces where the clusters are found. In high dimensional data, the number of possible subspaces is huge, requiring efficient search algorithms.

Issues to be Addressed
Based on our survey, we identify the following issues.
i. Proximity Selection: Concepts such as proximity, distance, or neighborhood become less meaningful with increasing dimensionality of a dataset [9][10] [49]. That is, discrimination between the nearest and the farthest neighbors becomes rather poor in high-dimensional space. Euclidean distance, Pearson correlation and cosine angle all seem to work reasonably well as distance measures [45][50]. Euclidean distance seems to be more appropriate for ratio data, whereas Pearson correlation seems to work better for absolute-valued data [45].
As a solution, a more deliberate choice of distance metrics (e.g., the use of Manhattan distance or even fractional distance metrics) has been proposed in [49]. Another method is to normalize and standardize the expression profile of each gene [51], with the aim of filtering out genes with a flat profile by detecting differences between replicates and separating genes which are not significantly different from the rest. Still another proposed approach [52] is based on inferring confidence intervals, making more efficient use of the measured data and avoiding the subjective choice of a dissimilarity measure.
ii. Missing Values: The acquisition and analysis of microarray data influence the interpretation of the results. Missing values can lead to erroneous conclusions about the data, and substituting them may introduce inaccuracies and inconsistencies. Accurate prediction of missing values therefore remains an important issue.
iii. Relevance to Biologists: Appropriate clustering can reveal hidden structures in biological data and it can provide accurate means for extracting biologically significant pattern(s). It is particularly helpful to biologists in investigating and understanding the activities of uncharacterized genes and proteins.
iv. Cluster Expansion: While expanding a cluster, one must consider cluster quality along with cluster validity. Validation should not be a separate overhead; validity should remain high alongside quality as the cluster grows, so that computation time can be saved.
v. Fusion of Gene Expression Dataset with Annotated Dataset: Genes determined to be co-expressed by clustering techniques are not necessarily co-regulated and hence may not have similar functions [53]. A possible approach may be that the annotated subset of differentially expressed genes clustered together based on functional similarity could be superimposed on top of co-expressed genes, leading to stability of clustering results. One such approach to fuse the expression dataset with the annotated dataset and other related functional annotation information has been proposed in [54].
vi. Similarity Measure and Clustering Solution: One of the main issues in subspace clustering is the definition of similarity, taking into account only certain subspaces. Different subspaces may be derived by specifying different weights, different selections, or even different combinations of attributes of a data set, to which a desirable similarity model has been applied. Since the subspace is not necessarily the same for different clusters within one clustering solution, this selection of a "desirable" similarity model is a task that cannot be accomplished independently of the clustering solution. Hence subspace clustering algorithms cannot be thought of as traditional clustering algorithms that merely use a different definition of similarity; rather, the similarity measure and the clustering solution depend on each other and are to be derived simultaneously.
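Two of the remedies mentioned under proximity selection (item i), standardizing each gene profile and using Manhattan or fractional distance metrics, can be sketched as follows. The function names and the example values are assumptions for illustration, and the handling of flat profiles is one possible convention rather than the procedure of [51]. Note that for 0 < p < 1 the Minkowski form no longer satisfies the triangle inequality, so it is a dissimilarity rather than a true metric.

```python
import numpy as np

def standardize_profiles(expr):
    """Z-score each gene (row) to zero mean and unit standard deviation."""
    mean = expr.mean(axis=1, keepdims=True)
    std = expr.std(axis=1, keepdims=True)
    std[std == 0] = 1.0   # flat profiles become all zeros instead of NaN
    return (expr - mean) / std

def minkowski_distance(x, y, p):
    """L_p distance: p=2 Euclidean, p=1 Manhattan, 0<p<1 fractional."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float((diff ** p).sum() ** (1.0 / p))

# Hypothetical profiles: one varying gene and one flat gene.
expr = np.array([[1.0, 2.0, 3.0],
                 [5.0, 5.0, 5.0]])
z = standardize_profiles(expr)
print(z)

for p in (2, 1, 0.5):
    print(p, minkowski_distance([0, 0], [3, 4], p))
```

After standardization a flat profile is easy to recognize, since it maps to the zero vector and can be filtered out before clustering.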

CONCLUSION
In this paper, we have made an attempt to identify the problems associated with clustering of gene expression data, using traditional clustering methods, mainly due to the high dimensionality of the data involved. For this reason, subspace clustering techniques can be used to uncover the complex relationships found in data since they evaluate features only on a subset of the data. Differentiating between the nearest and the farthest neighbors becomes extremely difficult in high dimensional data spaces. Hence a thoughtful choice of the proximity measure has to be made to ensure the effectiveness of a clustering technique. The automated acquisition of data in many application domains gives rise to collection of many redundant features, ultimately interfering with the identification of "true" clusters. Moreover, the substitution of missing values and handling of "noise" may introduce inaccuracies and inconsistencies, leading to erroneous conclusions about the data by the domain experts.
The authors feel that genes determined to be co-expressed based on clustering techniques may not always have similar functions and hence may not be co-regulated. Therefore, one can also explore the possibility of superimposing the clusters obtained with the annotated subset of differentially expressed genes clustered together based on functional similarity.
It is well known that most clustering methods are highly variable and a slight variation or change in the data may result in very different gene clusters. If the information from genomic knowledge bases, such as Gene Ontology, could be incorporated using data fusion earlier in the analysis of genomic data, the additional information about genes and their relationship with each other may improve stability, accuracy and/or biological relevance of the clusters.