GAUSSIAN MIXTURES IN PRACTICE AND THE CURSE OF DIMENSIONALITY
Despite their conceptual and mathematical elegance, general Gaussian mixture models are almost never used in practice for clustering high-dimensional data. The reason is simple: the models have too many parameters to fit. Why the number of parameters grows so large is an example of the so-called “curse of dimensionality.” Let’s count the parameters in the likelihood function of a Gaussian mixture model with K components in d dimensions. Each component has a mean vector with d parameters and a symmetric covariance matrix with d(d + 1)/2 parameters. We also need K − 1 mixing parameters. In all, we’re looking at an objective function with K − 1 + Kd(1 + (d + 1)/2), or more than (1/2)Kd², parameters.
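The counting argument above can be sketched in a few lines of code (the function name is illustrative, not from the text):

```python
def gmm_param_count(K, d):
    """Number of free parameters in a K-component, d-dimensional
    Gaussian mixture with full (unconstrained) covariance matrices."""
    means = K * d                  # one mean vector of length d per component
    covs = K * d * (d + 1) // 2    # one symmetric covariance matrix per component
    mixing = K - 1                 # mixing weights, which must sum to one
    return means + covs + mixing

print(gmm_param_count(10, 100))  # → 51509
```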
For 10 clusters of 100-dimensional data, we’re already looking at more than 50,000 parameters. An objective function like this is simply too hard to optimize: although the E-M algorithm is guaranteed to give us a local optimum of the likelihood, there’s just no way to try enough random starting values to convince ourselves that we’ve actually found the maximum likelihood. Furthermore, even if we give up on E-M and implement fancier optimization methods, we would need thousands of 100-dimensional observations to have enough data to estimate all these parameters.
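To make the multiple-restarts idea concrete, here is a minimal sketch of E-M with random restarts on a toy two-component, one-dimensional mixture. The synthetic data, component count, and iteration budget are all illustrative assumptions; in high dimensions, no feasible number of restarts gives the same reassurance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data: two well-separated Gaussian clusters.
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

def em_1d(x, n_iter=50, seed=0):
    """Tiny E-M for a 2-component, 1-D Gaussian mixture;
    returns the log-likelihood at the local optimum reached."""
    r = np.random.default_rng(seed)
    mu = r.choice(x, size=2, replace=False)  # random starting means drawn from the data
    var = np.full(2, x.var())
    pi = np.full(2, 0.5)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(dens.sum(axis=1)).sum()

# Several random restarts; keep the best local optimum we happen to find.
lls = [em_1d(x, seed=s) for s in range(10)]
best = max(lls)
```

Even in this one-dimensional toy problem, different starting values can land in different local optima; the restart loop only samples a handful of them.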
In practice, we reduce the complexity of the likelihood function by placing constraints on the covariance matrices, which reduces the number of parameters (see Exercises). In the limit where the covariance is the identity matrix (I), we have something very much like K-means. A very interesting case is the diagonal covariance Σ = diag(s), where s is a vector of variances. This allows each cluster to have a different spread, and also weights the dimensions differently. In practice, this can be very powerful when the observations are not equally spread out in the high-dimensional space. Figure 6.3 illustrates the effect of the diagonal covariance assumption on the model (compare a and c to b and d). At least for this two-dimensional example, the assumption of diagonal covariance does not greatly affect which clusters are found.
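A quick way to see how much the constraints help is to count parameters under each covariance assumption (the function and constraint names below are illustrative, not from the text):

```python
def constrained_gmm_params(K, d, covariance="full"):
    """Free parameters in a K-component, d-dimensional Gaussian mixture
    under different covariance constraints (names are illustrative)."""
    mixing = K - 1
    means = K * d
    cov_params = {
        "full": K * d * (d + 1) // 2,  # unconstrained symmetric matrices
        "diag": K * d,                 # one variance per dimension per component
        "spherical": K,                # a single variance per component
        "identity": 0,                 # covariance fixed at I; essentially K-means
    }[covariance]
    return mixing + means + cov_params

for c in ["full", "diag", "spherical", "identity"]:
    print(c, constrained_gmm_params(10, 100, c))
```

For 10 clusters in 100 dimensions, the diagonal constraint cuts the count from 51,509 to 2,009, which is a far more realistic number of parameters to estimate.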