SO WHAT IS LEARNING ANYWAY?
So far, I’ve been a bit cavalier in using the term learning, and I’ve said that K-means is the first algorithm we’ve seen where the computer will really learn something. Then I described an iterative procedure to infer the parameters that describe clusters in the data. I think it’s important that you’ve now seen the kind of thing that machine learning really is before we get too carried away in analogy.
Despite these modest beginnings, I hope that you’re starting to get some feeling for what we mean by learning: the algorithm automatically develops a simplified representation of the observations. When we say that “learning” has happened, we mean that a description of the data has now been stored in a previously naive or empty space. This type of learning can be quantified using ideas from information theory: if we can measure how much information was in the model to begin with, then run our learning algorithm (i.e., infer some parameter values) and measure how much information the model holds afterward, we can say that the amount the machine learned is the difference between the two. How information is actually measured is beyond the scope of this book, but it turns out to be simpler than one would imagine.
Granted, K-means (and most of the models that we will consider in this book) actually contains relatively little information about the data (100 parameters or so), so the computer doesn’t actually learn very much. In fact, the amount that can be learned is limited on the one hand by how much information there really is in the data (and how much of it is really noise), and on the other by the complexity of the model (how many parameters are available) and the effectiveness of the learning algorithm that is used to train it (whether we reliably extract the information in the data and store it in the parameters). A key issue that arises in machine learning is that we tend to overestimate how much the computer has learned, because it’s hard to know when the model is learning about the real information in the data as opposed to learning something about the noise in this particular dataset. This relates to the issues of overfitting and generalizability, and we will return to it throughout the book.
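To make the worry about learning noise concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the procedure used elsewhere in this book: the data are simulated, structureless one-dimensional Gaussian noise, the clustering is a bare-bones version of the iterative K-means procedure, and every function name (kmeans_1d, mean_sq_dist, train_and_heldout_cost) is made up for this example. The idea is to fit the cluster centers on one sample and then score them both on that sample and on a fresh sample from the same source.

```python
import random


def nearest(x, centers):
    """Return the index of the center closest to the 1-D point x."""
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)


def kmeans_1d(points, k, n_iter=50):
    """Bare-bones iterative K-means in one dimension.

    Centers start at the first k points, an arbitrary choice for this sketch.
    """
    centers = list(points[:k])
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for x in points:
            groups[nearest(x, centers)].append(x)
        # Move each center to the mean of its points; keep it if the cluster is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def mean_sq_dist(points, centers):
    """Average squared distance from each point to its nearest center."""
    return sum((x - centers[nearest(x, centers)]) ** 2 for x in points) / len(points)


def train_and_heldout_cost(k, n=40, seed=0):
    """Fit on pure noise, then score on the training sample and on fresh data."""
    rng = random.Random(seed)
    train = [rng.gauss(0.0, 1.0) for _ in range(n)]    # no real clusters here
    heldout = [rng.gauss(0.0, 1.0) for _ in range(n)]  # same source, new draw
    centers = kmeans_1d(train, k)
    return mean_sq_dist(train, centers), mean_sq_dist(heldout, centers)


if __name__ == "__main__":
    for k in (1, 2, 5, 10, 20, 40):
        train_cost, heldout_cost = train_and_heldout_cost(k)
        print(f"K={k:2d}  train={train_cost:.3f}  held-out={heldout_cost:.3f}")
```

When K equals the number of training points, every point gets its own center and the training cost drops to zero, even though there was never any cluster structure to find. The held-out cost does not drop to zero, and the gap between the two numbers is a rough indication of how much of what was "learned" is specific to the noise in one particular dataset.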
These are important (although rarely appreciated) considerations for any modeling endeavor. For example, learning even a single real number (e.g., 43.9872084...) at very high precision could take a lot of accurate data. Famously, Newton’s gravitational constant is actually very hard to measure. So, even though the law of gravitation is an extremely simple model, it’s hard to learn its parameter. On the other hand, a model that is parameterized by numbers that can only be 0 or 1 (yes-or-no questions) can only describe a very limited amount of variation. However, with the right algorithm, it might be possible to learn these from much less (or less accurate) data. So as you consider machine learning algorithms, it’s always important to keep in mind what is actually being learned, how complex it is, and how effective the inference procedure (or algorithm) actually is.
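A rough way to see why a high-precision real number is expensive to learn is the following Python sketch. It rests on assumptions that are mine and not part of the argument above: the measurements are independent, the noise is Gaussian with a known standard deviation, and the parameter is estimated by a plain average, so that the typical error shrinks like the noise divided by the square root of the number of measurements. The function names (samples_needed, estimate) and the threshold of 40 used for the yes-or-no question are hypothetical choices for illustration.

```python
import math
import random
import statistics


def samples_needed(noise_sd, target_error):
    """Approximate number of independent measurements needed so that the
    standard error of the average, noise_sd / sqrt(n), is at most target_error."""
    return math.ceil((noise_sd / target_error) ** 2)


def estimate(true_value, noise_sd, n, rng):
    """Average n noisy measurements of true_value."""
    return statistics.fmean(rng.gauss(true_value, noise_sd) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(1)
    true_value = 43.9872084

    # Each additional decimal digit of precision costs about 100 times more data.
    for digits in range(1, 5):
        n = samples_needed(noise_sd=1.0, target_error=10.0 ** -digits)
        print(f"{digits} decimal digit(s): roughly {n} measurements")

    # A yes-or-no question about the same quantity needs far less data.
    rough = estimate(true_value, noise_sd=1.0, n=5, rng=rng)
    print("Is the value above 40?", rough > 40)
```

Under these assumptions the real-valued parameter needs on the order of a hundred million measurements for four reliable decimal digits, while the single 0-or-1 answer is settled by a handful. The model with the binary parameter describes far less, but what it does describe can be learned from much less (or less accurate) data.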