# Hierarchical clustering

Hierarchical clustering comprises two subcategories of methods: agglomerative and divisive clustering. Agglomerative clustering starts from singleton clusters, that is, clusters that each contain a single clustering object. Based on certain criteria, pairs of clusters are successively merged (agglomerated) until all objects belong to one single cluster. Divisive clustering, on the other hand, starts from one cluster containing all clustering objects (either variables or subjects), and that cluster splits into two clusters following certain criteria defined based on distances. The splitting continues, with each cluster split into two, until only singleton clusters remain.

Agglomerative clustering In agglomerative clustering, the criteria for merging are determined based on different definitions of distances or dissimilarities between clusters. In the following, we introduce various commonly applied criteria: single linkage, complete linkage, average linkage, Ward’s method, and centroid method.

The single linkage criterion focuses on distances between nearest neighbors, and methods based on this linkage are thus also called "nearest neighbor" procedures. Specifically, the distance between two clusters is defined as the minimum distance over all pairs of clustering objects, one from each cluster. For instance, denote by $c_1 = \{x_1, x_2, x_3\}$ and $c_2 = \{x_4, x_5\}$ two clusters, the first containing $\{x_1, x_2, x_3\}$. Following the single linkage criterion, the distance between these two clusters is defined as

$$d_1 = \min\{d(x_1,x_4),\, d(x_1,x_5),\, d(x_2,x_4),\, d(x_2,x_5),\, d(x_3,x_4),\, d(x_3,x_5)\},$$

the minimum distance between every pair of data points in the two clusters. In agglomerative clustering, at each step the two clusters that are most similar are combined; this rule of merging is applied in all the hierarchical clustering methods. As seen in the definition, the linkage is sensitive to units and scaling.

Complete linkage measures the distance between two clusters from the opposite side: the distance between two clusters is defined as the maximum distance between every pair of individuals. For the two clusters noted above, their distance is defined as

$$d_2 = \max\{d(x_1,x_4),\, d(x_1,x_5),\, d(x_2,x_4),\, d(x_2,x_5),\, d(x_3,x_4),\, d(x_3,x_5)\},$$

the distance between the "farthest" neighbors. As with single linkage, a change of measures or scales may change the behavior of clustering. From the definitions of these two linkages, it is not difficult to see that clustering methods built upon the single linkage criterion can find non-spherical or non-elliptical (i.e., irregular) clusters and tend to isolate extreme values into separate clusters, while complete linkage tends to produce clusters with similar diameters.

Instead of searching for minimum or maximum distances between two clusters as in single and complete linkage, the average linkage criterion focuses on the average of distances between clusters. In this criterion, the distance between two clusters is the average distance between individuals in the two clusters. Using the same two clusters, $c_1 = \{x_1, x_2, x_3\}$ and $c_2 = \{x_4, x_5\}$, the distance between them is defined as

$$d_3 = \frac{1}{n_{c_1} n_{c_2}} \sum_{x_i \in c_1} \sum_{x_j \in c_2} d(x_i, x_j),$$

where $n_{c_1}$ and $n_{c_2}$ denote the number of clustering objects in each of the two clusters, and thus $n_{c_1} n_{c_2}$ is the total number of object pairs. This criterion tends to produce clusters with similar variances.
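The three criteria can be checked by hand on a toy example. The sketch below (with made-up coordinates for $x_1, \dots, x_5$; any numbers would do) computes the single, complete, and average linkage distances between $c_1$ and $c_2$ from the between-cluster block of the distance matrix:

```r
# Hypothetical 2-d points: rows 1-3 form c1, rows 4-5 form c2
c1 = rbind(c(0, 0), c(1, 0), c(0, 1))
c2 = rbind(c(4, 4), c(5, 5))

# Pairwise Euclidean distances between objects of c1 and objects of c2
d12 = as.matrix(dist(rbind(c1, c2)))[1:3, 4:5]

min(d12)   # d1: single linkage (nearest pair)
max(d12)   # d2: complete linkage (farthest pair)
mean(d12)  # d3: average linkage (average over all 3 * 2 = 6 pairs)
```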

Ward’s method determines merging based on the total variation in the data, evaluated as the sum of squared Euclidean distances from each object to the mean vector (or centroid) of its cluster,

$$SST = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \lVert x_{ij} - \bar{x}_i \rVert^2,$$

where $I$ is the number of clusters, $n_i$ is the number of clustering objects in cluster $i$, $x_{ij}$ is the $j$th object in cluster $i$, and $\bar{x}_i$ is the centroid of cluster $i$. At each step, the pair of clusters whose merger yields the smallest increase in $SST$ is combined. As seen from the definition of $SST$, outliers can strongly impact the quality of clusters.
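As a sketch of how this drives merging, the increase in total within-cluster variation caused by joining two clusters can be computed directly (hypothetical clusters for illustration; Ward's method merges the pair with the smallest such increase):

```r
# Within-cluster sum of squared deviations from the centroid
sse = function(m) sum(sweep(m, 2, colMeans(m))^2)

# Two hypothetical clusters in two dimensions
c1 = rbind(c(0, 0), c(1, 0), c(0, 1))
c2 = rbind(c(4, 4), c(5, 5))

# Merging cost under Ward's criterion: the increase in SST
sse(rbind(c1, c2)) - (sse(c1) + sse(c2))
```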

An approach that is less sensitive to outliers is the centroid method, which evaluates distances between clusters using their centroids,

$$d(c_i, c_j) = d(\bar{x}_i, \bar{x}_j),$$

where $\bar{x}_i$ is the centroid of cluster $i$, and $\bar{x}_j$ the centroid of cluster $j$.

Two R functions are commonly used to perform agglomerative clustering: hclust and agnes; their implementations are similar. In the following, we apply agnes, from the cluster package, to the simulated data discussed in Section 5.1.2 to demonstrate agglomerative clustering in R. We first cluster the objects using the original (unscaled) data. The dendrogram of the inferred clusters is in Figure 5.4.

> library(cluster)

> x = cbind(x1,x2)

> rownames(x) = c(rep(1,15), rep(2,20), rep(3,25))

> AgnesFit = agnes(x)

> plot(AgnesFit, which.plots = 2, main = "(Unscaled) agnes - dendrogram",
+   sub = "", cex = 1, cex.main = 2, xlab = "")

The default distance metric in agnes is Euclidean distance, metric="euclidean"; another option is "manhattan". The default criterion for combining two clusters is average linkage, method="average", and the other criteria discussed earlier are also available. The plot call used here dispatches to the plot.agnes method in the cluster package and creates plots for visualizing the clusters inferred by agnes. Its which.plots argument indicates which graphs to draw: NULL plots both a banner plot and a dendrogram, 1 a banner only, and 2 a dendrogram only. By default, the sub argument includes the agglomerative coefficient in the graph. The agglomerative coefficient is used to assess the quality of the clustering: let d(i) denote the dissimilarity (i.e., distance) of object i, i = 1, ..., n, to the first cluster it is merged with, divided by the dissimilarity of the merger in the final step of the algorithm. The agglomerative coefficient is defined as the average of all 1 − d(i). It ranges between 0 and 1, the higher the better. This statistic is only useful for comparing data sets of similar size (it is sensitive to sample size) and seems not sensitive enough to misclassification; for this reason, Figure 5.4 does not include it. With the original scale, several clustering objects are apparently misplaced due to the sensitivity of the merging criterion to scale (Figure 5.4).
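Although omitted from the figure, the coefficient can be retrieved from a fitted agnes object via its ac component. A small sketch (with simulated two-group data, not the chapter's data) comparing the coefficient across linkage criteria:

```r
library(cluster)

# Simulated illustration data: two separated groups in two dimensions
set.seed(1)
x = rbind(matrix(rnorm(30, mean = 0), ncol = 2),
          matrix(rnorm(30, mean = 4), ncol = 2))

# agnes stores the agglomerative coefficient in the ac component;
# it is mainly useful for comparing criteria on the same data
sapply(c("single", "complete", "average", "ward"),
       function(m) agnes(x, method = m)$ac)
```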

We next apply agnes to the scaled data, x1Sd and x2Sd.

> x1Sd = x1/sd(x1)

> x2Sd = x2/sd(x2)

> xSd = cbind(x1Sd,x2Sd)

> rownames(xSd) = c(rep(1,15), rep(2,20), rep(3,25))

> AgnesFit1 = agnes(xSd)

> plot(AgnesFit1, which.plots = 2, main = "(Scaled) agnes - dendrogram",
+   sub = "", cex = 1, cex.main = 2, xlab = "")

After scaling, the objects are all clustered correctly (Figure 5.5). Almost all clustering methods are sensitive to measures or scales, and it is thus recommended to scale or standardize the data before performing cluster analyses, especially when the clustering objects are not measured in the same units.

Divisive clustering For divisive clustering, the five criteria discussed earlier are applicable when splitting a cluster, except for complete linkage. In that case, the splitting criterion is the smallest distance between the two candidate subsets, instead of the largest between-cluster distance.

Figure 5.4 Dendrograms of clusters inferred by agnes using original data.

The most commonly used divisive clustering approach is the DIANA algorithm, presented in Chapter 6 of Kaufman and Rousseeuw [84] and available in R as the function diana. To choose a cluster to split at each stage, the largest distance (or dissimilarity) between any two clustering objects within a cluster is used; this largest distance is called the diameter, and the cluster with the largest diameter is selected to split. To divide the selected cluster, the method first identifies the clustering object whose average distance to the other clustering objects in the cluster is the largest. This object initiates a "splinter group". Clustering objects are then moved from the cluster (the "parent group") if they are closer to the "splinter group" than to the "parent group". This process continues until no more objects can be moved, at which point the cluster has been split into two clusters.
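The splinter-group mechanics can be illustrated by a hand-rolled single split on a small distance matrix (a sketch of the idea on made-up data, not the internals of diana):

```r
# Hand-rolled sketch of one DIANA-style split on simulated data
set.seed(2)
x = rbind(matrix(rnorm(10, 0, 0.5), ncol = 2),
          matrix(rnorm(10, 3, 0.5), ncol = 2))
D = as.matrix(dist(x))
n = nrow(D)

# Seed the splinter group with the object having the largest
# average distance to the other objects in the cluster
splinter = which.max(rowSums(D) / (n - 1))
parent = setdiff(seq_len(n), splinter)

# Move objects that are on average closer to the splinter group
# than to the remaining parent group, until none qualify
repeat {
  gain = sapply(parent, function(i)
    mean(D[i, parent[parent != i]]) - mean(D[i, splinter]))
  if (all(gain <= 0)) break
  move = parent[which.max(gain)]
  splinter = c(splinter, move)
  parent = setdiff(parent, move)
}

sort(splinter)  # one of the two clusters after the split
```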

We again use the simulated data x1 and x2 to demonstrate the use of diana. As noted earlier, scaled data are recommended in cluster analyses, and thus in the following diana is applied to x1Sd and x2Sd.

> xSd = cbind(x1Sd,x2Sd)

> DianaFit = diana(xSd)

> plot(DianaFit, which.plots = 2,
+   main = "(Scaled) diana - dendrogram", sub = "",
+   cex = .8, cex.main = 2, xlab = "")

Figure 5.5 Dendrograms of clusters inferred by agnes using scaled data.

The arguments of diana are overall the same as those of agnes, such as the distance metrics, except that diana does not provide an option for setting the splitting criterion. As seen in the results from agnes, with scaled data the method in diana also places all the objects in the right cluster (Figure 5.6).

# Hybrids of partitioning-based and hierarchical clustering

Approaches have been developed that mix partitioning-based and hierarchical clustering, or mix agglomerative and divisive clustering. Approaches that combine partitioning and hierarchical clustering in general start from a root node with all clustering objects included, select the best partition of the objects, and then possibly collapse one or more pairs of clusters. We introduce one such approach, the hierarchical ordered partitioning and collapsing hybrid (HOPACH) method [142], a mixture of divisive, partitioning, and agglomerative clustering. Approaches jointly using agglomerative and divisive clustering are usually built upon divisive clustering: they start from a single cluster, split a cluster into two, but

Figure 5.6 Dendrograms of clusters inferred by diana using scaled data.

allow subclusters (children clusters) to merge, or leave a cluster unsplit if it should not be split. For methods in this category, we focus on the hybrid hierarchical clustering via mutual clusters approach (hybridHclust in R), where a mutual cluster is a set of clustering objects closer to each other than to all other objects, and thus a cluster that should never be broken [20].

The HOPACH method This clustering algorithm is a hybrid of an agglomerative and a divisive algorithm, with a partitioning algorithm built in to determine the best split at each step. It starts from a single cluster with all clustering objects (the root node) and, at each level, collapsing steps are applied to unite similar clusters. For the partitioning, the partitioning around medoids (PAM) algorithm proposed by Kaufman and Rousseeuw [83] is applied.

The median (or mean) split silhouette (MSS) criterion, adapted from the silhouette information in PAM, is proposed in HOPACH to determine the number of subclusters, optimize partitions, and select pairs of clusters to collapse at each level. The goal is to minimize MSS, a measure of cluster heterogeneity [142], and to find small clusters, if any, inside big clusters. For each clustering object, the HOPACH algorithm measures how well matched it is to the other objects in its cluster compared to how well matched it would be to the next closest cluster, evaluated using the silhouette information defined in (5.1). The average of the silhouettes determines the quality of a cluster, and the splitting of a cluster into subclusters is optimized by maximizing the mean silhouette. After splitting, each clustering object's silhouette is re-evaluated relative to its sibling clusters, and a median of these silhouettes, named the split silhouette [142], is taken within each parent cluster. The median of the split silhouettes across all parent clusters, the MSS, is an overall measure of homogeneity within parent clusters. Two parent clusters can be collapsed if doing so improves MSS.
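For reference, the silhouette being averaged here is, for object $i$ with average within-cluster distance $a(i)$ and average distance $b(i)$ to the next closest cluster (cf. (5.1)),

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,$$

so larger mean silhouettes indicate tighter, better-separated subclusters, which is why splits are chosen to maximize them and collapses are accepted when they reduce MSS.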

An R package, hopach, which implements this algorithm, is available in Bioconductor. In the following, we apply the package to cluster the simulated data discussed above.

> library(hopach)

> hopClust = hopach(xSd, d = "euclid")

> # Look at num. of clusters

> hopClust$clust$k
[1] 24

> # Look at sizes of clusters

> table(hopClust$clust$sizes)

 1  2  3  4  9 
 7  5 10  1  1 

In the above, the clustering is applied to the rows of the scaled data xSd. The Euclidean distance metric, indicated by d = "euclid" in the function hopach, is used to determine the distance between clustering points. We can use hopClust$clust$k to see the number of clusters and table(hopClust$clust$sizes) to check the size of each cluster.

By use of this hybrid clustering method, 24 clusters are identified, including 7 singleton clusters. The large number of clusters is expected since this clustering method is built upon divisive clustering. To visualize the clusters, heat maps can be used, which helps tease out the overall clustering pattern rather than scrutinizing all 24 clusters. The R function dplot can be used for this purpose: a heat map is generated based on the ordered dissimilarity matrix. In dplot, two ways are available to define the order of the dissimilarity matrix, one being the order of elements in the final level of the hierarchical tree, and the other being the order from the level of the tree corresponding to the main clusters. To apply dplot to the above hopClust object,

> library(RColorBrewer)

> source("https://bioconductor.org/biocLite.R")

> biocLite("bioDist")

> library(bioDist)

> colRB = colorRampPalette(brewer.pal(10, "RdBu"))(256)

> rownames(xSd) = seq(1,nrow(xSd))

> distanceEuc = distancematrix(xSd, d = "euclid")

> dplot(distanceEuc, hopClust, lab = rownames(xSd), ord = "final",
+   showclust = FALSE, col = colRB, main = "HOPACH - Euclidean distance")

Instead of the default red-yellow color theme supplied by dplot, a red-blue color scheme defined by colorRampPalette in the RColorBrewer package is used in the plot: red represents high similarity and blue high dissimilarity. A distance matrix needs to be calculated and passed to the dplot function. To be consistent with the distance metric used to cluster xSd, a Euclidean distance matrix is calculated using the function distancematrix in the hopach package. In the above, the elements of the dissimilarity matrix are ordered based on the final level of the tree, ord="final". Running the dplot function, the ordered dissimilarity matrix is shown as a heat map, and clusters appear as blocks on the diagonal. For the simulated data, a clear pattern of 3 clusters is shown (Figure 5.7). Note that the color is not shown in the figure due to the black-and-white printing style.

The hybrid hierarchical clustering method This divisive clustering method has the feature of maintaining groups of clustering objects called mutual clusters [20]. To define a mutual cluster, let S denote such a set. For any object $x \in S$ and any object $y \notin S$, we have $d(x, y) > \max_{w, z \in S} d(w, z)$. That is, a subset of clustering objects is a mutual cluster if the minimum distance between any object in the subset and any object outside it is larger than the maximum distance between points within the subset. Mutual clusters are never divided; within each mutual cluster, however, clustering is performed to create hybrid clusters. The algorithm has three steps:

• 1. Identify mutual clusters using the distance-based definition of such clusters.
• 2. Perform a constrained divisive clustering such that each mutual cluster stays intact. The divisive clustering is achieved by use of the K-means approach at each node, which is known as "tree-structured vector quantization" (TSVQ) in the area of engineering [54].
• 3. After divisive clustering is complete, hybrid clusters are inferred within each mutual cluster. Since mutual clusters are often small, similar results will be obtained from divisive or agglomerative methods. However, since divisive clustering is implemented in step 2, it is simpler to use the same algorithm when determining the clusters within each mutual cluster.

Figure 5.7 Heat map of ordered dissimilarity matrix showing 3 clusters.
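The mutual cluster condition in step 1 can be verified by brute force for any candidate subset (a sketch on made-up points; the actual search is done by mutualCluster in hybridHclust):

```r
# Check whether the objects indexed by S form a mutual cluster:
# every distance leaving S must exceed the largest distance inside S
isMutualCluster = function(x, S) {
  D = as.matrix(dist(x))
  outside = setdiff(seq_len(nrow(x)), S)
  min(D[S, outside]) > max(D[S, S])
}

# Tiny hypothetical example: a tight pair far from a third point
x = rbind(c(0, 0), c(0.1, 0), c(5, 5))
isMutualCluster(x, c(1, 2))  # TRUE: the pair is mutually closest
isMutualCluster(x, c(1, 3))  # FALSE
```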

This algorithm is implemented in R in the hybridHclust package, which detects the mutual clusters and divisively clusters points with mutual clusters protected. We continue to use the simulated data x1 and x2 to demonstrate this package, again with the scaled data x1Sd and x2Sd.

> library(hybridHclust)

> x = cbind(x1Sd,x2Sd)

> rownames(x) = c(rep(1,15), rep(2,20), rep(3,25))

> hyb = hybridHclust(x)

Figure 5.8 Dendrogram of the clusters detected by use of mutual clusters.

In the above program, hyb is a dendrogram and we can use plot(hyb,rownames(x)) to examine the quality of clustering.

The function mutualCluster outputs the identified mutual clusters:

> mcl <- mutualCluster(x)

> mcl

• 1 : 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
• 2 : 5 12
• 3 : 2 10 13
• 4 : 1 11
• 5 : 3 9
• 6 : 4 8 15

Note that observations 16 to 60 are included in one mutual cluster. As seen in the cluster dendrogram (Figure 5.8), these 45 observations are further split into two clusters. The inclusion of these observations in one mutual cluster implies shorter distances among them, compared with observations outside this mutual cluster. This is supported by the plot of the data shown in Figure 5.2a.

# Examples – clustering to detect gene expression patterns

In this section, we apply the clustering methods discussed so far to a data set of gene expressions from the RNA-Seq technique. The data are from the study in Brooks et al. [13]. The authors combined RNAi and RNA-Seq to identify exons regulated by Pasilla, the Drosophila melanogaster ortholog of mammalian NOVA1 and NOVA2. The RNA-Seq data and related information are available from the Gene Expression Omnibus (GEO) database under accession numbers GSM461176–GSM461181. The data are downloaded from https://figshare.com/s/e08e71c42f118dbe8be6. We first read in the data and perform necessary quality controls. The reads were aligned to the Drosophila reference genome and summarized at the gene level.

File counts_Drosophila.txt contains read counts of 14,869 genes on seven samples. Information on sample IDs, treatment status, and sequencing depth (indicated by Library) is saved in SampleInfo_Drosophila.txt. The abbreviation PE denotes paired-end sequencing and SE single-end sequencing.

> # read in the RNA-Seq data

> # Information on the data including

> # treatment status and library (sequencing depth)

> Counts = read.delim(file = "counts_Drosophila.txt", row.names = 1)

> Targets = read.delim(file = "SampleInfo_Drosophila.txt")

> dim(Counts)

[1] 14869 7

> Targets

  SampleName     Group Library
1  SRR031714 Untreated      PE
2  SRR031716 Untreated      PE
3  SRR031724   Treated      PE
4  SRR031726   Treated      PE
5  SRR031708 Untreated      SE
6  SRR031718   Treated      SE
7  SRR031728 Untreated      SE

To filter out genes with low counts, we compute counts per million mapped reads (cpm), scaled by the library size of each sample, using the R function cpm in the edgeR package (which also requires the limma package). We then specify a cpm threshold for low expression, chosen to correspond to roughly 10–15 reads. In the following, the threshold is set at 3 cpm, corresponding to about 12 reads (Figure 5.9).

> library(limma)

> library(edgeR)

> # compute counts per million reads mapped

> CountsRate = cpm(Counts)

> plot(Counts[,1], CountsRate[,1], xlim = c(0,20), ylim = c(0,5))

> abline(v = 12,col = 2)

> abline(h = 3,col = 4)

> # 3 counts per million is used as the cutoff

> # corresponding to 12 absolute reads

> Cutoff = CountsRate > 3

Figure 5.9 Plot of counts per million vs. number of reads. The vertical dashed line corresponds to 12 reads.
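The mapping between the cpm cutoff and an absolute read count depends on library size, since cpm = counts / librarySize × 10^6. Assuming libraries of roughly 4 million mapped reads (an illustrative figure consistent with 3 cpm ≈ 12 reads; the actual libraries vary), the conversion is:

```r
# counts corresponding to a cpm cutoff: counts = cpm * librarySize / 1e6
cpmCutoff = 3
librarySize = 4e6   # assumed ~4 million mapped reads, for illustration
cpmCutoff * librarySize / 1e6   # = 12 absolute reads
```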

Having decided on the threshold, we keep genes with at least 3 cpm observed in at least three samples, achieved by use of the function rowSums. This screening step excluded more than half of the genes; in total, 7245 genes are included in the subsequent analyses.

> Screen = rowSums(Cutoff) >= 3

> table(Screen)

Screen

FALSE  TRUE 
 7624  7245 

>

> CountsKeep = Counts[Screen,]

> dim(CountsKeep)

[1] 7245 7

To adjust the impact of library size, we normalize the counts by the sample normalization factor available in the edgeR package.

> y = DGEList(CountsKeep)

> # calculate log base 2 of the counts with counts normalized

> # by the sample normalization factor.

> NormLogCounts = cpm(y,log = TRUE)

For the purpose of demonstration, we choose the 50 genes showing the largest variation across the 7 samples. The variances are calculated using the function apply.

> VarGenes = apply(NormLogCounts, 1, var)

> # get the top 50 gene names

> SelectVar = names(sort(VarGenes, decreasing = TRUE))[1:50]

> HighVarCountsRate = NormLogCounts[SelectVar, ]

> dim(HighVarCountsRate)

[1] 50 7

We now have normalized log-cpm values for 50 genes in 7 samples. The cluster analyses will focus on the genes, and we start with the K-means approach. As done earlier, we create a scree plot of the total within-cluster sum of squares to assist the decision on the number of clusters. As shown in Figure 5.10, the turning point of the within-cluster variation is at 4 clusters.

> max<-10

> ssWithin = rep(0,(max-1))

> for (i in 2:max)
+ {
+   set.seed(i*5000)
+   kClust = kmeans(x = HighVarCountsRate, centers = i)
+   ssWithin[i-1] = kClust$tot.withinss
+ }

> # choose the number of clusters minimizing within cluster variations

> plot(seq(2,max), ssWithin, pch = 19, cex = 0.8,
+   xlab = "Number of clusters", ylab = "Within cluster variations")

> lines(seq(2,max), ssWithin)

Figure 5.10 Scree plot of the within cluster sum of squares (K-means).

We then apply kmeans again to the RNA-Seq data of 50 genes with centers=4. The function clusplot in the R package cluster performs principal components analysis and displays the clusters based on the first two components, which explain 76.22% of the variation in the data. The data are overall well grouped using four clusters (Figure 5.11). The shading with different densities is achieved by setting shade=TRUE; a unit area with more data points is drawn with a higher density.

Figure 5.11 Display of the four clusters using the first two principal components (K-means).

> kmeansClust = kmeans(x=HighVarCountsRate,centers = 4, nstart = 2)

> library(cluster)

> clusplot(HighVarCountsRate, kmeansClust$cluster, shade = TRUE,
+   color = TRUE, col.clus = rep("gray",4), col.p = "black",
+   lines = 0, main = NULL)

To use PAM to cluster the data, the program is similar to what has been introduced earlier using simulated data. To choose the best number of clusters, we utilize grid search to maximize average Silhouette width (Figure 5.12). Using PAM, 5 clusters are identified (Figure 5.13), but the separation of the data is not as good as the result based on K-means.

> # using PAM to cluster

> library(cluster)

> SiWidth = rep(0,(max-1))

> for (i in 2:max)
+ {
+   pamClust = pam(x = HighVarCountsRate, k = i, diss = FALSE,
+     metric = "euclidean")
+   SiWidth[i-1] = pamClust$silinfo$avg.width
+ }

Figure 5.12 The scree plot based on average silhouette width (PAM).

> # Choose the number of cluster maximizing

> # the average silhouette widths

> plot(seq(2,max), SiWidth, pch = 19, cex = 0.8,
+   xlab = "Number of clusters", ylab = "Average silhouette width")

> lines(seq(2,max), SiWidth)

> pamClust = pam(x = HighVarCountsRate, k = 5, diss = FALSE,
+   metric = "euclidean")

> dev.new()

> clusplot(pamClust, shade = TRUE,
+   color = TRUE, col.clus = rep("gray",4), col.p = "black",
+   lines = 0, main = NULL)

For clusters analyzed using agnes and diana, for comparison with the results from the partitioning-based methods, we first estimate the best number of clusters using the NbClust function in the package NbClust. In the following code, the best number of clusters is estimated as 6, using the silhouette index with clusters inferred from Euclidean distances using Ward's method.

> library(NbClust)

> hierarClust = NbClust(data = HighVarCountsRate,
+   distance = "euclidean",
+   min.nc = 2, max.nc = 10, method = "ward.D",
+   index = "silhouette")

> hierarClust$Best.nc

Number_clusters     Value_Index
         6.0000          0.3778

Figure 5.13 Display of the five clusters using the first two principal components (PAM).

We next trim the tree inferred using Ward's method in agnes; the function cutree is applied to extract a sub-tree. The function clusplot is not applicable to hierarchical clusters, so to visualize the clusters we use the fviz_cluster function in the factoextra package (which also needs the ggplot2 package). This function uses the top two principal components to display the grouping patterns. Each gene is plotted as a point, specified by geom="point", and each cluster is framed with a shape whose type is specified by ellipse.type. In each cluster, the point with a larger size represents the medoid of that cluster. The six clusters detected by agnes have some overlaps, and it may be reasonable to collapse some of them (Figure 5.14).

> # hierarchical clustering using agglomerative nesting (agnes)

Figure 5.14 The six clusters identified using agnes.

> AgnesFit = agnes(HighVarCountsRate, method = "ward")

> subTree = cutree(as.hclust(AgnesFit), k = 6)

>

> # plot the clusters

> library(factoextra)

> library(ggplot2)

> fviz_cluster(list(data = HighVarCountsRate, cluster = subTree),
+   geom = "point", ellipse.type = "convex",
+   palette = rep("black",6), ggtheme = theme_gray())

For divisive cluster analysis, we essentially follow the same procedure. The function diana identified a singleton cluster, and the genes in the other clusters are better separated compared with the results from agnes (Figure 5.15).

Next we detect clusters of genes taking into account the existence of mutual clusters. The coding is similar to what was discussed earlier; here we focus on the detected mutual clusters as well as the dendrogram of the results. In total, 14 mutual clusters are detected. To plot the dendrogram with colors indicating different clusters, we use ggplot. We first convert the object from hybridHclust to a dendrogram using the as.dendrogram function. Next we specify the number of clusters k

Figure 5.15 The six clusters identified using diana.

and the colors used to mark clusters. Finally, the function ggplot is utilized to plot the dendrogram (Figure 5.16). Based on the dendrogram, it is reasonable to conclude that 6 clusters separate the 50 genes well.

> mcl <- mutualCluster(HighVarCountsRate)

> mcl

• 1 : 2 4
• 2 : 3 5
• 3 : 6 7 8

• 4 : 11 15
• 5 : 29 34
• 6 : 12 22
• 7 : 20 49
• 8 : 45 48
• 9 : 32 43
• 10 : 23 28
• 11 : 18 19
• 12 : 37 41
• 13 : 27 39 42 50
• 14 : 30 36 46

> hybl <- hybridHclust(HighVarCountsRate,mcl)

> library("reshape2")

> library("purrr")

> library("dplyr")

> library("dendextend")

> dendro = as.dendrogram(hybl)

> dendro.col <- dendro %>%
+   set("branches_k_color", k = 6, value =
+     rep(c("black", "gray"), 3)) %>%
+   set("branches_lwd", 0.8) %>%
+   set("labels_colors",
+     value = c("darkslategray")) %>%
+   set("labels_cex", 1)

> ggdl <- as.ggdend(dendro.col)

> ggplot(ggdl, theme = theme_minimal()) +
+   labs(x = "Gene numbers", y = "Height", title = "6 clusters")

Warning message:
Removed 99 rows containing missing values (geom_point).

Figure 5.16 The six clusters identified using hybridHclust with mutual clusters.