# Cluster Analysis in Data Mining

In general, data mining based on cluster analysis is descriptive and aims to visualize patterns in the data, although hypothesis testing on cluster profile differentiation is usually conducted as well. We introduce commonly used non-parametric and parametric clustering methods. Non-parametric clustering approaches are in general unsupervised learning and include methods based on data partitioning and methods that cluster the data hierarchically. Partitioning-based methods aim to optimize certain criteria to map vectors to clusters, while hierarchical clustering methods construct “trees” showing similarities between groups. Parametric approaches in general model the distribution of the data as a mixture of a certain number of distributions, such that each part of the data is assumed to be generated from one of the mixture components. The goal is to optimize the distribution mixture to determine the number of clusters. Thinking of a data matrix with n rows (representing n subjects) and m columns (representing m variables), in both non-parametric and parametric clustering methods the objects to be clustered (clustering objects) can be rows (subjects) or columns (variables), depending on the research interest.

## NON-PARAMETRIC CLUSTER ANALYSIS METHODS

Non-parametric cluster analyses identify clusters based on distances between clustering objects. Various ways have been proposed to define distance; among these, for continuous variables, distances based on Pearson sample correlations, Euclidean distance, and Manhattan distance are often used. Gower distance can be applied to a mixture of continuous and categorical variables. In the following, we introduce each of these distance measures.

### Distances

Let $x = \{x_1, \cdots, x_n\}$ and $y = \{y_1, \cdots, y_n\}$ denote two vectors.

A Pearson sample correlation-based distance is defined as

$$d_{cor}(x, y) = 1 - cor(x, y),$$

with zero distance corresponding to a perfect correlation between $x$ and $y$. In the Euclidean metric, the distance between $x$ and $y$ is defined as

$$d_{euc}(x, y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}.$$

These two distance metrics, $d_{cor}$ and $d_{euc}$, are closely related. Denote by $z_x$ and $z_y$ standardized $x$ and $y$, respectively. We have (when standardization uses the sample standard deviation)

$$d_{euc}(z_x, z_y) = \sqrt{2(n-1)\,d_{cor}(x, y)}.$$

To reduce the impact of outliers, Spearman sample correlation-based distances are also sometimes used. Spearman correlations are calculated in the same way as Pearson correlations except that ranks of the data, instead of the original data, are used to determine the correlations. The Manhattan distance is in the $L_1$ norm and is defined as

$$d_{man}(x, y) = \sum_{i=1}^{n}|x_i - y_i|.$$

Unlike the Pearson or Spearman sample correlation-based distance metrics, the Euclidean and Manhattan distance metrics do not take into account variations in the data. The Mahalanobis distance metric, on the other hand, incorporates data variations into its distance definition, but assumes $x$ and $y$ are from the same multivariate distribution with mean $\mu$ and covariance $\Sigma$:

$$d_{mah}(x, y) = \sqrt{(x - y)^{T}\Sigma^{-1}(x - y)}.$$

The Mahalanobis distance is equivalent to the Euclidean distance when $\Sigma$ is the identity matrix.
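As a quick illustration (not from the text; the two short vectors below are made up for demonstration), these distances can be computed in R with the built-in dist and cor functions:

```r
# Two made-up vectors for demonstration
x = c(1, 2, 3, 4)
y = c(2, 4, 6, 9)
dist(rbind(x, y), method = "euclidean")    # Euclidean distance
dist(rbind(x, y), method = "manhattan")    # Manhattan (L1) distance
1 - cor(x, y)                              # Pearson correlation-based distance
1 - cor(x, y, method = "spearman")         # Spearman-based, less sensitive to outliers
```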

In many situations, the variables in a cluster analysis are a mixture of continuous and categorical variables. In this case, the Gower distance can be applied [GO], which is an average of distances across all the variables, with each distance between 0 and 1. For binary variables, the Gower distance is calculated as the proportion of mismatches, defined based on Jaccard similarity. Similar calculations are applied to nominal variables. For continuous variables, the distance is calculated as $|x_i - y_i|/R_i$, with $R_i$ denoting the range of variable $i$ over all clustering objects. For ordinal variables, the variables are first ranked and the distance is then calculated in the same way as for continuous variables. In some other situations, the interest is in the distance between distributions. Metrics that assess such distances include the Kullback-Leibler distance (KLD) and its special case, mutual information (MI). The KLD metric measures the difference between two probability distributions. MI is a special case of KLD and measures the distance between the joint probability distribution of two random variables and the product of their marginal distributions. Essentially, MI measures distance from independence.
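For reference, the KLD and MI described above can be written out explicitly; a standard formulation (the notation here is not taken from the text) is:

```latex
% Kullback-Leibler distance between distributions P and Q
D_{KL}(P \,\|\, Q) = \sum_{u} p(u) \log \frac{p(u)}{q(u)}

% Mutual information as the KLD between the joint distribution
% of X and Y and the product of their marginals
MI(X; Y) = D_{KL}\big(P_{XY} \,\|\, P_X P_Y\big)
         = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}
```

When $X$ and $Y$ are independent, $p(x, y) = p(x)\,p(y)$ and the MI is zero, which is the sense in which MI measures distance from independence.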

With a distance metric defined, clustering objects are grouped together if they are close to each other, as defined by their distance to a cluster center or the distances between objects. Through simulated data, we introduce three commonly applied clustering techniques: partitioning-based methods, hierarchical clustering methods, and a hybrid of these two approaches. This is then followed by a demonstration using gene expression data.

### Partitioning-based methods

The K-means approach is among the most used partitioning-based clustering techniques due to its simplicity. The concept was first introduced by Cox, and the method later experienced various improvements [102, 41, 105, 64]. The algorithm by Hartigan and Wong [64] is used most often. It starts from specifying the number of clusters, K, and the initial centroid of each cluster. Let $c$ with length K denote a collection of centroids, $c = \{c_1, \cdots, c_K\}$, with $c_k = \{c_{k1}, \cdots, c_{kp}\}$ being the centroid of cluster $k$ with $p$ components. Depending on the clustering objects, the $p$ components can be $p$ variables or $p$ subjects. After selecting a distance metric, we assign each data point to a specific cluster following the steps below:

1. Calculate the distance between each data point and the cluster centers. Euclidean distances are commonly used.
2. Assign each data point to a cluster based on $\arg\min_{c_k \in c} d(c_k, x)$, where $x$ denotes a data point with length $p$ ($p$ components).
3. Update the centroid of each cluster, calculated as the mean of the data points in that cluster.
4. If no data point is reassigned to a different cluster, stop. Otherwise, repeat steps 1 through 3.
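In symbols, one iteration of the steps above alternates an assignment step and an update step (the assignment labels $a_i$ are introduced here for notation):

```latex
% Assignment step: each data point x_i goes to its nearest centroid
a_i = \arg\min_{c_k \in c} d(c_k, x_i), \quad i = 1, \ldots, n

% Update step: each centroid becomes the mean of its assigned points
c_k = \frac{1}{|\{i : a_i = k\}|} \sum_{i \,:\, a_i = k} x_i, \quad k = 1, \ldots, K
```

The algorithm terminates when the assignments $a_i$ no longer change between iterations.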

To determine the number of clusters, we can choose K by optimizing the total within-cluster variation, which for continuous variables is defined as

$$SS_{within} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - c_k\|^2,$$

where $C_k$ denotes cluster $k$ and $x_i$ the $i$th data point. The number of clusters can be determined by an empirical rule, the so-called “elbow” rule: plot a set of $SS_{within}$'s sorted by the number of clusters and select K such that, at that K, $SS_{within}$ experiences the biggest reduction, followed by a flattened-out reduction in $SS_{within}$. This is essentially the scree test commonly used in factor analyses. In Figure 5.1, the biggest reduction of $SS_{within}$ happens at 3 clusters, and the reduction of $SS_{within}$ between 3 and 4 clusters is quite small. Following the “elbow” rule, with 3 clusters the data are likely to be grouped neatly.

The advantage of the K-means approach lies in its easy-to-follow algorithm and its clear separation of the data when the data distributions have a roughly regular shape. However, this approach is sensitive to the scales of the data and works better if the data distributions have spherical or elliptical shapes. In addition, K-means is sensitive to outliers. More robust variants of Hartigan and Wong's method have been proposed, such as generalized K-means, trimmed K-means, and improved K-means via outlier removal, among others [68, 53]. Even so, the classical algorithm by Hartigan and Wong in general does well and is still the most used.

The kmeans function in R implements the K-means approach. We demonstrate this through a simulated data set:

> set.seed(12345)
> x1 = c(rnorm(15, sd = .05), rnorm(20, mean = 1, sd = .05),
+ rnorm(25, mean = 1.5, sd = .05))
> x2 = c(rnorm(15, mean = 3, sd = 2), rnorm(20, mean = 11, sd = 1),
+ rnorm(25, mean = 12, sd = 1))
> x = cbind(x1, x2)
> rownames(x) = c(rep(1,15), rep(2,20), rep(3,25))
> plot(x, pch = as.character(rownames(x)), asp = 1)
> x1Sd = x1/sd(x1)
> x2Sd = x2/sd(x2)
> x = cbind(x1Sd, x2Sd)
> rownames(x) = c(rep(1,15), rep(2,20), rep(3,25))
> plot(x, pch = as.character(rownames(x)), asp = 1)

Figure 5.1 Illustration of utilizing the “elbow” rule to determine the optimized number of clusters.

Data with 60 rows and 2 columns are generated, with each column simulated from a mixture of three normal distributions. This type of data can be seen in gene expression or DNA methylation studies on pattern recognition, e.g., detecting clusters of genes (60 genes in this example, with measurements from two subjects) with each cluster sharing comparable expression levels. Scaling by x1/sd(x1) and x2/sd(x2) removes the effects of different scales in the two variables. To observe the pattern of the original and scaled data, we draw scatter plots using the plot function. The argument asp is set at 1 to produce plots in which distances between points are represented accurately on screen. The clustering pattern in the original data (Figure 5.2a) is not as clear as that in the scaled data (Figure 5.2b).

Figure 5.2 Scatter plots of the simulated data, raw (a) and scaled (b).

The following code utilizes the function kmeans to cluster the scaled data.

> x = cbind(x1Sd, x2Sd)
> kmeansClust = kmeans(x, centers = 3, nstart = 2)
> # plot the clusters
> plot(x1Sd, x2Sd, main = "K-means clustering", cex.main = 2,
+ pch = kmeansClust$cluster)
> legend(2, 1.5, c(1:3), pch = unique(kmeansClust$cluster),
+ bty = "n")

In the kmeans function, the number of clusters is fixed at 3. To determine the number of clusters, the scree test noted earlier can be applied by obtaining kmeansClust$tot.withinss for each number of clusters and plotting kmeansClust$tot.withinss versus the number of clusters. The default clustering algorithm is that proposed by Hartigan and Wong, which in general outperforms the several other methods [102, 41, 105].
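A sketch of this scree test (code not from the text; the upper bound of 6 clusters is an arbitrary choice) collects tot.withinss over several values of K and plots them:

```r
# Collect the total within-cluster SS for K = 1, ..., 6
wss = sapply(1:6, function(k)
  kmeans(x, centers = k, nstart = 10)$tot.withinss)
# Plot SS_within against the number of clusters and look for the "elbow"
plot(1:6, wss, type = "b", xlab = "Number of clusters K",
     ylab = "Total within-cluster SS")
```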

It is suggested to try several random starts by setting nstart > 1; in the above, we set nstart = 2. As seen in Figure 5.3 (generated by the last two lines in the program above), all objects are clustered correctly. We will see later in this section that having data on the same scale substantially improves the quality of the identified clusters.

Figure 5.3 Three clusters inferred using the K-means approach.

Another partitioning-based method similar to K-means is the partitioning around medoids (PAM) algorithm proposed by Kaufman and Rousseeuw. PAM uses medoids (actual data points) as the center of each cluster, rather than the mean of the data points in a cluster as in K-means. Because of this, PAM is relatively less sensitive to outliers. PAM calculates Silhouette information, S, which can be used to optimize the number of clusters. For each clustering object,

$$S_i = \frac{\bar{d}_{between,i} - \bar{d}_{within,i}}{\max\left(\bar{d}_{within,i},\; \bar{d}_{between,i}\right)},$$

where $S_i$ is the Silhouette information for clustering object $i$ and $\bar{d}$ denotes an average distance or dissimilarity. Consequently, $\bar{d}_{within,i}$ refers to the average distance between clustering object $i$ and the other points in the same cluster, and $\bar{d}_{between,i}$ to the average distance between clustering object $i$ and the objects in other clusters. The Silhouette information $S_i$ ranges from $-1$ to 1, and the higher the $S_i$, the better the clustering quality of object $i$. Averaging $S_i$ across all clustering objects helps us determine the number of clusters such that within-cluster distances are minimized while between-cluster distances are maximized. Because of its clustering scheme, PAM can be applied to categorical variables, or to a mixture of continuous and categorical variables, with the Gower distance utilized. The corresponding R function is pam in the cluster package. In this function, the averaged Silhouette information is denoted as the “average Silhouette width”. We use the same data set to demonstrate this function.

> library(cluster)
> pamClust = pam(x = cbind(x1Sd, x2Sd), k = 3, diss = FALSE,
+ metric = "euclidean")
> summary(pamClust)$silinfo$avg.width
[1] 0.6466787

In the above, we specify the number of clusters as k = 3. Since x is a data matrix, the logical flag for x being a dissimilarity matrix is diss = FALSE. The dissimilarity is defined as the Euclidean distance in the data. To extract the averaged Silhouette width, we use the summary statement, summary(pamClust)$silinfo$avg.width, based on which we can optimize the number of clusters by setting k at a set of different values and choosing k using the scree test noted earlier for K-means.
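A sketch of that tuning loop (code not from the text; the candidate values of k are an arbitrary choice) selects the k with the largest average Silhouette width:

```r
# Average Silhouette width for candidate numbers of clusters
k.values = 2:6                          # candidate k's (arbitrary range)
avg.width = sapply(k.values, function(k)
  pam(x, k = k, diss = FALSE, metric = "euclidean")$silinfo$avg.width)
# Choose the k with the largest average Silhouette width
k.values[which.max(avg.width)]
```

Unlike the elbow rule for K-means, here larger is better, so no visual judgment of a bend in the curve is required.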