# MLEs FOR MULTIVARIATE DISTRIBUTIONS

The ideas that we’ve already introduced about optimizing objective functions can be transferred directly from the univariate case to the multivariate case. Only minor technical complication will arise because multivariate distributions have more parameters, and therefore the set of equations to solve can be larger and more complicated.

To illustrate the kind of mathematical tricks that we’ll need to use, we’ll consider two examples of multivariate distributions that are very commonly used. First, the multinomial distribution, which is the multivariate generalization of the binomial. This distribution describes the numbers of times we observe events from multiple categories. For example, the traditional use of this type of distribution would be to describe the number of times each face of a die (1-6) turned up. In bioinformatics, it is often used to describe the numbers of each of the bases in DNA (A, C, G, T).

FIGURE 4.3 Multivariate Gaussians and correlation. The panel on the left shows real gene-expression data for the CD8 antigen (from ImmGen). The panel on the right shows four parameterizations of the multivariate Gaussian in two dimensions. In each case, the mean is at (0, 0). Notice that none of the simple Gaussian models fit the observed CD8 expression data very well.

If we say the X is the number of counts of the four bases, and f is a vector of probabilities of observing each base, such that f = 1, where i

indexes the four bases. The multinomial probability for the counts of each base in DNA is given by

The term on the right-hand side (with the factorial of the sum over the product of factorials) has to do with the number of “ways” that you could have observed, say, X = (535 462 433 506) for A, C, G, and T. For our purposes (to derive the MLEs), we don’t need to worry about this term because it doesn’t depend on the parameters, so when you take the log and then the derivative, it will disappear. We start by writing the log-likelihood of the data given the model

The parameters are the fs, so the equation to solve for each one is If we solved this equation directly, we get

Although this equation seems easy to solve, there is one tricky issue: the sum of the parameters has to be 1. If we solved the equation here, we’d always set each of the parameters to infinity, in which case they would not sum up to one. The optimization (taking derivatives and setting them to zero) doesn’t know that we’re working on a probabilistic model—it’s just trying to do the optimization. In order to enforce that the parameters stay between 0 and 1, we need to add a constraint to the optimization. This is most easily done through the method of Lagrange multipliers, where we rewrite the constraint as an equation that equals zero, for example, 1 - fi = 0, and add it to the function we are trying to optimize multiplied by a constant, the so-called Lagrange multiplier,

Taking the derivatives gives which we can solve to give

Here, I have used MLE to indicate that we now have the maximum likelihood estimator for the parameter fA. Of course, this is not very useful because it is in terms of the Lagrange multiplier. To figure out what the actual MLEs are, we have to think about the constraint 1 - X fi=°. Since we need the derivatives with respect to all the parameters to be 0, we’ll get a similar equation for fC, fG, and fT. Putting these together gives us

or

which says that is just the total number of bases we observed. So the MLE for the parameter fA is just

where I have now dropped the cumbersome MLE notation. This formula is the intuitive result that the estimate for probability of observing A is just the fraction of bases that were actually A.