Despite the popularity, conceptual clarity, and attractive theoretical properties of maximum likelihood estimation, many other objective functions and corresponding estimators are widely used.

Another simple, intuitive objective function is "least squares": simply adding up the squared differences between the model's predictions and the data. Minimizing the sum of squared differences leads to the maximum likelihood estimates in many cases, but not always. One good thing about least squares estimation is that it can be applied even when your model doesn't actually conform to a probability distribution (or when it's very hard to write out or compute the probability distribution).
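As a concrete sketch (not from the text), a straight line can be fit by least squares in a few lines of NumPy; the data, true slope, and intercept here are invented for illustration:

```python
import numpy as np

# Least squares line fit: minimize the sum of (y_i - (a*x_i + b))^2.
# Toy data (hypothetical): y = 2x + 1 plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Design matrix [x, 1]; lstsq solves the least-squares problem directly.
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

print(a, b)  # close to the true slope 2 and intercept 1
```

Note that nothing here required writing down a probability distribution for the noise; the squared-error objective is usable on its own.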

One of the most important objective functions for machine learning is the so-called posterior probability and the corresponding maximum a posteriori probability (or MAP) estimates/estimators. In contrast to ML estimation, MAP estimation says: "choose the parameters so the model is most probable given the data we observed." Now, the objective function is P(θ|X) and the equation to solve is

θ_MAP = argmax_θ P(θ|X)

As you probably already guessed, the MAP and ML estimation problems are related via Bayes' theorem, so that this can be written as

P(θ|X) = P(X|θ) P(θ) / P(X)

Once again, it is convenient to think about the optimization problem in log space, where the objective function breaks into three parts, only two of which actually depend on the parameters:

log P(θ|X) = log P(X|θ) + log P(θ) − log P(X)

The first term is the log-likelihood, the second is the log of the prior on the parameters, and the third does not depend on the parameters at all.

Interestingly, optimizing the posterior probability therefore amounts to optimizing the likelihood function plus another term that depends only on the parameters.
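To make this likelihood-plus-prior structure concrete, here is a minimal sketch (the Gaussian model, the prior, and all the numbers are assumptions for illustration, not from the text) of estimating a Gaussian mean with known variance under a Gaussian prior centered at zero:

```python
import numpy as np

# MAP vs. ML for the mean of a Gaussian with known variance sigma^2,
# under a (hypothetical) prior mu ~ Normal(0, tau^2). The log posterior
# is log-likelihood + log prior, so the MAP estimate is pulled toward
# the prior mean 0 relative to the ML estimate (the sample mean).
rng = np.random.default_rng(1)
sigma, tau = 1.0, 0.5
x = rng.normal(loc=3.0, scale=sigma, size=10)
n = x.size

mu_ml = x.mean()  # maximum likelihood estimate
# Setting the derivative of the log posterior to zero gives:
mu_map = (x.sum() / sigma**2) / (n / sigma**2 + 1 / tau**2)

print(mu_ml, mu_map)  # the MAP estimate lies between the sample mean and 0
```

The extra term from the prior acts exactly like the penalty described below: it biases the estimate away from what the likelihood alone would choose.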

The posterior probability objective function turns out to be one of a class of so-called "penalized" likelihood functions, where the likelihood is combined with mathematical functions of the parameters to create a new objective function. As we shall see in Chapter 9, these objective functions underlie several intuitive and powerful machine learning methods.
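One widely used member of this penalized family is ridge regression, where the sum of squared errors is penalized by the squared norm of the weights (equivalent to a Gaussian prior on the weights). A minimal sketch with invented data and a hypothetical penalty strength lam:

```python
import numpy as np

# Penalized least squares (ridge): minimize ||y - Xw||^2 + lam * ||w||^2.
# Toy data: 30 observations of a 3-parameter linear model.
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=30)

lam = 1.0  # hypothetical penalty strength
# Closed forms: ridge solves (X^T X + lam*I) w = X^T y; ordinary
# least squares solves X^T X w = X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

print(w_ols, w_ridge)  # the penalty shrinks the weights toward zero
```

The penalty term changes the answer: the ridge weights are systematically smaller in norm than the unpenalized ones.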

Some practically important classes of models used for machine learning (e.g., neural networks and SVMs) have specialized objective functions that have been developed for them. These models do not have probabilistic interpretations, so their objective functions cannot usually be related to likelihoods. Nevertheless, these models still have parameters to estimate (or "train"), and the efficiency and accuracy of the available "training" algorithms are critical to the practical applicability of these methods.

No matter what objective function is chosen, estimation almost always involves solving a mathematical optimization problem, and in practice this is done using a computer, either with a statistical software package such as R or MATLAB®, or using purpose-written code.
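As a sketch of what such packages do under the hood, here a Gaussian negative log-likelihood is minimized numerically with SciPy's general-purpose optimizer (the model and data are invented for illustration):

```python
import numpy as np
from scipy.optimize import minimize

# Numerically minimize an objective function: here, the negative log
# likelihood of a Gaussian, with the mean and log-standard-deviation
# as free parameters (log-sigma keeps sigma positive during the search).
rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=200)

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    # NLL of N(mu, sigma^2), dropping the constant term:
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2 + x.size * log_sigma

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the true values 1 and 2
```

The same call works with any objective discussed above (least squares, penalized likelihood, hinge loss); only the function being minimized changes.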
