# Clustering, blockmodeling, and community detection

The subnetworks obtained as cuts or islands are usually dense, and the standard 'dots and lines' representation is unreadable. In such cases a much better visualization is the matrix representation with the 'right' permutation of vertices. The permutation can be obtained using seriation algorithms, clustering, blockmodeling, or community detection algorithms (the Louvain method, VOS clustering).

Network/Create Permutation/
Operations/Network+Cluster/Dissimilarity*/Network based/
Network/Create Hierarchy/Clustering*/
Network/Create Partition/Blockmodeling*/

## The Louvain method and VOS

One of the approaches to determine the clustering **C** of a network is to maximize its *modularity*

$$Q(\mathbf{C}) = \sum_{C \in \mathbf{C}} \left( \frac{|E(C)|}{m} - \left( \frac{d(C)}{2m} \right)^2 \right)$$

where *|E(C)|* is the number of edges between vertices belonging to cluster *C*, *d(C)* is the sum of the degrees of vertices from *C*, and *m* is the total number of edges.

The modularity maximization problem is NP-complete.
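To make the formula concrete, the following minimal Python sketch (illustrative only, not part of Pajek) evaluates the modularity of a given partition directly from the definition above:

```python
from collections import defaultdict

def modularity(edges, clustering):
    """Modularity Q of a clustering: for each cluster C it adds
    |E(C)|/m - (d(C)/(2m))^2, where |E(C)| counts edges with both
    endpoints in C, d(C) is the total degree of C's vertices, and
    m is the number of edges."""
    m = len(edges)
    internal = defaultdict(int)    # |E(C)| per cluster
    degree_sum = defaultdict(int)  # d(C) per cluster
    for u, v in edges:
        degree_sum[clustering[u]] += 1
        degree_sum[clustering[v]] += 1
        if clustering[u] == clustering[v]:
            internal[clustering[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
print(round(modularity(edges, split), 4))  # → 0.3571
```

Splitting the two triangles apart scores higher than any clustering that mixes them, which is exactly what modularity maximization looks for.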

The Louvain method and VOS are available in Pajek using the commands:

Network/Create Partition/Communities/Louvain Method
Network/Create Partition/Communities/VOS Clustering
[Draw] Layout/VOS Mapping
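For readers without Pajek at hand, the core of the Louvain method, its local-moving phase, can be sketched in pure Python as follows (the full method additionally aggregates communities into vertices and repeats; all names here are illustrative, and the gain formula drops constant factors that do not affect which move is best):

```python
def local_moving(adj, m):
    """One pass of the Louvain local-moving phase: each vertex is moved
    to the neighbouring community with the largest modularity gain,
    repeating until no move improves modularity.  `adj` maps a vertex
    to the list of its neighbours; `m` is the number of edges."""
    comm = {v: v for v in adj}             # start: singleton communities
    deg = {v: len(adj[v]) for v in adj}
    improved = True
    while improved:
        improved = False
        for v in adj:
            # total degree of every community, with v temporarily removed
            comm_deg = {}
            for u in adj:
                if u != v:
                    comm_deg[comm[u]] = comm_deg.get(comm[u], 0) + deg[u]
            # number of links from v into each neighbouring community
            links = {}
            for u in adj[v]:
                links[comm[u]] = links.get(comm[u], 0) + 1
            # modularity gain of placing v into community c
            # (constants dropped): k_{v,c}/m - deg(v) d(c) / (2 m^2)
            def gain(c):
                return (links.get(c, 0) / m
                        - deg[v] * comm_deg.get(c, 0) / (2 * m * m))
            best, best_gain = comm[v], gain(comm[v])
            for c in links:
                if gain(c) > best_gain + 1e-12:
                    best, best_gain = c, gain(c)
            if best != comm[v]:
                comm[v], improved = best, True
    return comm

# Two triangles joined by a bridge: the phase finds the two triangles.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
       3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
comm = local_moving(adj, m=7)
```

A production implementation maintains the community degrees incrementally instead of recomputing them for every vertex, which is what makes the Louvain method fast on large networks.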

# Clustering symbolic data

Given a set of units, *V*, clustering is the process of organizing these units into clusters so that each cluster contains units more similar to each other than to units from other clusters, according to some well-specified criterion. In real-life (empirical) clustering problems, we have to deal with their different characteristics:

• descriptions of units in the form of vectors (capturing types of measurement scales, numbers of variables, missing values...) or structured units;

• sizes of the resulting clusters of units;

• structure of units in a 'space' (including features such as density and the 'shapes' of clusters).

A recent survey on clustering is given in Gan et al. (2007). Here, we present an approach to clustering (very) large datasets of *mixed* units, that is, units measured in different scales.

The approach is based on representation of units by *symbolic objects* (SOs) (Billard and Diday 2006). The SOs can describe either single units or groups of initial units condensed into SOs in a pre-processing step.

For clustering SOs, we adapted two classical clustering methods:

1. the *leaders method -* a generalization of the *k-means* method (Hartigan 1975) and dynamic clouds (Diday 1979)

2. Ward's *hierarchical clustering method* (Ward 1963)

Both adapted methods are based on the *same* criterion function - they are solving the *same* clustering problem.

With the leaders method, a large set of units is reduced to a manageable number of leaders. The obtained leaders can then be further clustered with the compatible agglomerative hierarchical clustering method to reveal the relations among them; the resulting dendrogram also helps to decide upon the appropriate number of clusters.

## Symbolic objects described with distributions

An SO, *X*, is described by a list *X = [x_i]* of descriptions of its *m* variables *V_i*. In our model, each variable is described with the frequency distribution *(bar chart)* of its *k_i* values

$$x_i = [f_{i1}, f_{i2}, \dots, f_{ik_i}]$$

with

$$n_{x,i} = \sum_{j=1}^{k_i} f_{ij};$$

we denote the corresponding probability distribution by

$$[p_{i1}, p_{i2}, \dots, p_{ik_i}], \qquad p_{ij} = \frac{f_{ij}}{n_{x,i}}.$$
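As a small illustration (the function name and data are hypothetical), one variable of an SO can be built from raw categorical values as a bar chart together with its normalized probability distribution:

```python
from collections import Counter

def describe(values, categories):
    """Describe one variable of a symbolic object by the frequency
    distribution (bar chart) over its k_i categories, together with
    the corresponding probability distribution."""
    counts = Counter(values)
    freq = [counts.get(c, 0) for c in categories]
    n = sum(freq)                  # number of observed values
    prob = [f / n for f in freq]   # normalized bar chart
    return freq, prob

freq, prob = describe(list("ababbc"), categories=["a", "b", "c"])
print(freq)  # → [2, 3, 1]
```

A pre-processing step that condenses groups of initial units into SOs would apply such a description to each group and each variable in turn.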

We approach the clustering problem as an optimization problem over the set of *feasible* clusterings, Φ_k, the partitions of units into *k* clusters. The *criterion function* has the form

$$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C).$$

The *total error*, *P(***C***)*, of the clustering, **C**, is the sum of the *cluster errors*, *p(C)*.

There are many possibilities for expressing the cluster error, *p(C)*. We assume a model in which the error of a cluster is the sum of dissimilarities of its units from the cluster's *representative*, *T*,

$$p(C, T) = \sum_{X \in C} d(X, T).$$

Note that, in general, the representative need not be from the same 'space' (set) as the units. The *best* representative is called a *leader*:

$$T_C = \mathop{\mathrm{argmin}}_{T} \; p(C, T).$$

Then we define

$$p(C) = p(C, T_C) = \min_{T} \sum_{X \in C} d(X, T).$$

The SO, *X*, is described by a list *X = [x_i]*. Assume also that the representatives are described in the same way: *T = [t_i]*, *t_i = [t_{i1}, t_{i2}, ..., t_{ik_i}]*.

We introduce a dissimilarity measure between SOs with

$$d(X, T) = \sum_{i=1}^{m} d_i(x_i, t_i)$$

where

$$d_i(x_i, t_i) = w_{x,i} \sum_{j=1}^{k_i} \delta(x_{ij}, t_{ij}), \qquad \delta(x, t) = (x - t)^2.$$

This measure is a generalization of the squared Euclidean distance. Using a more appropriate type of dissimilarity, *δ*, it is possible to counter the tendency of the squared Euclidean distance to respond mostly to the largest values. Some alternatives for *δ(x, t)* are shown in the following table:

The weight, *w_{x,i}*, can be different for the same unit *X* for each variable *V_i* (as is needed in descriptions of egocentric networks, population pyramids, etc.).
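Assuming the squared-Euclidean form with one weight per unit and variable, the dissimilarity between a unit and a candidate representative can be sketched as follows (all names and data are illustrative):

```python
def dissimilarity(x, t, w):
    """Weighted squared-Euclidean-type dissimilarity between two SO
    descriptions: sum over variables i of w[i] * sum over values j of
    (x[i][j] - t[i][j])**2.  w[i] plays the role of the unit-and-variable
    weight w_{x,i}; x and t are lists of distributions, one per variable."""
    return sum(wi * sum((xj - tj) ** 2 for xj, tj in zip(xi, ti))
               for wi, xi, ti in zip(w, x, t))

x = [[0.2, 0.8], [0.5, 0.3, 0.2]]   # a unit: two variables as distributions
t = [[0.4, 0.6], [0.4, 0.4, 0.2]]   # a candidate representative
print(round(dissimilarity(x, t, w=[1.0, 2.0]), 4))  # → 0.12
```

With all weights equal to 1 this reduces to the ordinary squared Euclidean distance between the concatenated descriptions.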

## The leaders method

The *leaders method* is a generalization of the popular non-hierarchical *k-means* clustering method. The idea is to get an 'optimal' clustering into a *pre-specified* number of clusters with the following iterative procedure:

determine an initial clustering;
**repeat**
determine the leaders of the clusters in the current clustering;
assign each unit to the nearest new leader, producing a new clustering;
**until** the leaders stabilize.
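The procedure above can be sketched in Python for distribution-valued units, using unit weights and the plain squared Euclidean distance for brevity (names and data are illustrative):

```python
import random

def sq_dist(x, t):
    """Squared Euclidean distance between two flattened descriptions."""
    return sum((a - b) ** 2 for a, b in zip(x, t))

def mean(cluster):
    """Leader of a cluster: the component-wise mean; for probability
    distributions this is again a distribution."""
    n = len(cluster)
    return [sum(x[j] for x in cluster) / n for j in range(len(cluster[0]))]

def leaders(units, k, max_iter=100, seed=0):
    """Leaders (k-means-style) clustering of distribution-valued units:
    alternate between recomputing leaders and reassigning units to the
    nearest leader until the assignment stabilizes."""
    rng = random.Random(seed)
    reps = rng.sample(units, k)      # initial leaders: k distinct units
    assign = None
    while max_iter > 0:
        max_iter -= 1
        new = [min(range(k), key=lambda c: sq_dist(x, reps[c]))
               for x in units]
        if new == assign:            # leaders stabilized
            break
        assign = new
        for c in range(k):
            members = [x for x, a in zip(units, assign) if a == c]
            if members:
                reps[c] = mean(members)
    return assign, reps

units = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
assign, reps = leaders(units, k=2)
```

Because each leader is a component-wise mean of distributions, it is itself a distribution, which is the key property exploited later when the leaders are clustered hierarchically.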

Given a cluster, *C*, the corresponding leader, *T_C*, is the solution of the problem

$$T_C = \mathop{\mathrm{argmin}}_{T} \sum_{X \in C} d(X, T).$$

Therefore *T_C = [t_i^*]* and

$$t_i^* = \mathop{\mathrm{argmin}}_{t} \sum_{X \in C} d_i(x_i, t);$$

again, to simplify the notation, we omit the index *i*:

$$t^* = \mathop{\mathrm{argmin}}_{t} \sum_{X \in C} w_x \sum_{j} (x_j - t_j)^2,$$

and, omitting also the index *j*,

$$t^* = \mathop{\mathrm{argmin}}_{t} \sum_{X \in C} w_x (x - t)^2.$$

This is a standard optimization problem with one real variable. The solution has to satisfy the condition

$$\sum_{X \in C} w_x (x - t) = 0, \qquad \text{giving} \qquad t^* = \frac{\sum_{X \in C} w_x\, x}{\sum_{X \in C} w_x}.$$

The leaders for this *δ* have two important properties. If *w_{x,i} = w_x*, then for each *i = 1, ..., m* the leader's component is the weighted average of the members' distributions:

$$t_i^* = \frac{\sum_{X \in C} w_x\, x_i}{\sum_{X \in C} w_x}.$$

The leaders' components are *distributions*.

Further, let *w_{x,i} = n_{x,i}*; then, for each *i = 1, ..., m*, the representative of a cluster is again a distribution over this cluster.

Given the leaders, **T**, the corresponding optimal clustering, **C***, is determined from

$$X \in C_k \iff k = \mathop{\mathrm{argmin}}_{j} d(X, T_j);$$

that is, we assign each unit, *X*, to the closest leader, *T_k ∈* **T**.

## An agglomerative method

The hierarchical agglomerative clustering procedure is based on a step-by-step merging of the two closest clusters.

*C_k* is a partition of the finite set of units, *V*, into *k* clusters. The level, *h(C)*, of a merged cluster *C = C_u ∪ C_v* is the dissimilarity between the two clusters merged into it.

Therefore, the computation of dissimilarities between the new (merged) cluster and the remaining clusters has to be specified. To obtain compatibility with the adapted leaders method, we define the dissimilarity between clusters *C_u* and *C_v*, *C_u ∩ C_v = ∅*, as

$$D(C_u, C_v) = p(C_u \cup C_v) - p(C_u) - p(C_v).$$

When this dissimilarity is expressed with the leaders' components, *u_i* and *v_i* denote components of the leaders of the clusters *C_u* and *C_v*, and *z_i* a component of the leader of the cluster *C_u ∪ C_v*. It can be shown that a *generalized Ward's relation* holds.

Instead of the squared Euclidean distance, other dissimilarity measures *δ(x, t)* can be used; see Kejzar et al. (2011). Relations similar to Ward's can be derived for them.
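Under the Ward-compatible choice D(C_u, C_v) = p(C_u ∪ C_v) − p(C_u) − p(C_v), with p(C) the sum of squared distances to the cluster mean, a naive agglomerative pass can be sketched as follows (illustrative, O(n³); real implementations update the dissimilarities via Ward-type relations instead of recomputing the cluster errors):

```python
def sq_dist(x, t):
    """Squared Euclidean distance between two descriptions."""
    return sum((a - b) ** 2 for a, b in zip(x, t))

def cluster_error(cluster):
    """p(C): sum of squared distances of the members to the cluster mean."""
    n = len(cluster)
    mean = [sum(x[j] for x in cluster) / n for j in range(len(cluster[0]))]
    return sum(sq_dist(x, mean) for x in cluster)

def agglomerate(units):
    """Merge, at each step, the pair of clusters with the smallest
    D(Cu, Cv) = p(Cu u Cv) - p(Cu) - p(Cv); record the merge levels
    and sizes (the information needed to draw a dendrogram)."""
    clusters = [[u] for u in units]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = (cluster_error(clusters[i] + clusters[j])
                     - cluster_error(clusters[i])
                     - cluster_error(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merged]
        merges.append((d, len(merged)))
    return merges

merges = agglomerate([[0.0], [0.1], [1.0], [1.1]])
```

The merge levels grow monotonically here because each D is the increase in total error caused by the merge, mirroring the classical Ward method.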

The proposed approach to clustering symbolic data is implemented, for the dissimilarity measures discussed in Section 3.10.1, in the R package *clamix*, currently available upon request; clustering without the weights *w_{x,i}* is implemented in the R package *clustDDist* (Kejzar et al. 2009).