Deep Learning of Representations: A Review and Recent Trends
Greedy Layerwise Pre-training
The following basic recipe was introduced in 2006 (Hinton and Salakhutdinov, 2006; Hinton et al., 2006; Ranzato et al., 2007a; Bengio et al., 2007):
- 1. Let h0(x) = x be the lowest-level representation of the data, given by the observed raw input x.
- 2. For l = 1 to L: train an unsupervised learning model taking as observed data the training examples hl−1(x) represented at level l − 1, and producing after training representations hl(x) = Rl(hl−1(x)) at the next level.
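Steps 1–2 can be sketched in NumPy with small tied-weight autoencoders as the per-level unsupervised learners; tanh units, squared reconstruction error, and tied weights are illustrative choices here, not part of the recipe itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(H, n_hidden, lr=0.1, epochs=200):
    """Fit one tied-weight autoencoder on representations H (one row
    per example) and return the trained encoder parameters (W, b)."""
    n, n_in = H.shape
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))  # tied encoder/decoder weights
    b = np.zeros(n_hidden)                      # encoder bias
    c = np.zeros(n_in)                          # decoder bias
    for _ in range(epochs):
        Z = np.tanh(H @ W + b)                  # code for each example
        err = (Z @ W.T + c) - H                 # linear reconstruction error
        Ga = (err @ W) * (1.0 - Z**2)           # backprop through tanh
        W -= lr * (H.T @ Ga + err.T @ Z) / n    # both uses of the tied W
        b -= lr * Ga.sum(axis=0) / n
        c -= lr * err.sum(axis=0) / n
    return W, b

def greedy_pretrain(X, layer_sizes):
    """Step 1: h_0(x) = x.  Step 2: for l = 1..L, train an unsupervised
    model on the level l-1 representation and keep its encoder R_l."""
    H = X                        # h_0(x) = x
    encoders = []
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(H, n_hidden)
        encoders.append((W, b))
        H = np.tanh(H @ W + b)   # h_l(x) = R_l(h_{l-1}(x))
    return encoders
```

Each level sees only the output of the level below it, so training is purely greedy: no gradient flows back into already-trained layers at this stage.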
From this point on, several variants have been explored in the literature. For supervised learning with fine-tuning, which is the most common variant (Hinton et al., 2006; Ranzato et al., 2007b; Bengio et al., 2007):
3. Initialize a supervised predictor whose first stage is the parametrized representation function hL(x), followed by a linear or non-linear predictor as the second stage (i.e., taking hL(x) as input).
4. Fine-tune the supervised predictor with respect to a supervised training criterion, based on a labeled training set of (x, y) pairs, and optimizing the parameters in both the representation stage and the predictor stage.
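A minimal NumPy sketch of steps 3–4, assuming the pretrained representation stage is a stack of tanh layers stored as (W, b) pairs and the second stage is a linear softmax predictor (both assumptions made here for illustration; any differentiable predictor would do):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(A):
    E = np.exp(A - A.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def fine_tune(X, y, encoders, n_classes, lr=0.1, epochs=100):
    """Step 3: put a linear softmax predictor on top of h_L(x).
    Step 4: jointly update the predictor and every encoder layer by
    backpropagating a supervised cross-entropy criterion."""
    V = rng.normal(0.0, 0.1, (encoders[-1][0].shape[1], n_classes))
    d = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]            # one-hot targets
    n = len(X)
    for _ in range(epochs):
        # Forward pass through the pretrained representation stage.
        Hs = [X]
        for W, b in encoders:
            Hs.append(np.tanh(Hs[-1] @ W + b))
        P = softmax(Hs[-1] @ V + d)
        # Backward pass: mean cross-entropy gradient w.r.t. the logits.
        G = (P - Y) / n
        Gh = G @ V.T                    # grad w.r.t. h_L(x), before V moves
        V -= lr * Hs[-1].T @ G
        d -= lr * G.sum(axis=0)
        for l in range(len(encoders) - 1, -1, -1):
            W, b = encoders[l]
            Ga = Gh * (1.0 - Hs[l + 1]**2)   # through the tanh non-linearity
            Gh = Ga @ W.T
            encoders[l] = (W - lr * Hs[l].T @ Ga, b - lr * Ga.sum(axis=0))
    return encoders, V, d
```

The pretrained weights serve only as an initialization: after fine-tuning, every layer has been adjusted by the supervised criterion.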
Another supervised variant involves using all the levels of representation as input to the predictor, keeping the representation stage fixed, and optimizing only the predictor parameters (Lee et al., 2009a,b):
3. Train a supervised learner taking as input (hk(x), hk+1(x), ..., hL(x)) for some choice of 0 ≤ k ≤ L, using a labeled training set of (x, y) pairs.
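With the representation stage frozen, this variant reduces to feature concatenation; a sketch, again assuming the stack is a list of tanh layers stored as (W, b) pairs:

```python
import numpy as np

def concat_features(X, encoders, k=0):
    """Variant step 3: run the fixed stack forward and return the
    concatenation (h_k(x), h_{k+1}(x), ..., h_L(x)) as predictor input."""
    Hs = [X]                                 # h_0(x) = x
    for W, b in encoders:
        Hs.append(np.tanh(Hs[-1] @ W + b))   # h_l(x), parameters frozen
    return np.concatenate(Hs[k:], axis=1)
```

Any supervised learner can then be trained on these fixed features; only the predictor's parameters are optimized.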
Finally, there is a common unsupervised variant, e.g. for training deep auto-encoders (Hinton and Salakhutdinov, 2006) or a Deep Boltzmann Machine (Salakhutdinov and Hinton, 2009):
- 3. Initialize an unsupervised model of x based on the parameters of all the stages.
- 4. Fine-tune the unsupervised model with respect to a global (all-levels) training criterion, based on the training set of examples x.
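A sketch of this unsupervised variant in the spirit of the deep auto-encoder of Hinton and Salakhutdinov (2006): the pretrained stack is unrolled into an encoder–decoder pair, with decoder weights initialized to the transposed encoder weights, and all layers are fine-tuned on a global reconstruction criterion. Tanh units, a linear output layer, and squared error are illustrative assumptions here:

```python
import numpy as np

def unroll_and_fine_tune(X, encoders, lr=0.01, epochs=100):
    """Steps 3-4 (unsupervised variant): build a deep auto-encoder from
    the pretrained stack and fine-tune every layer on reconstruction."""
    # Decoder initialized by transposing the encoder weights (then untied).
    decoders = [(W.T.copy(), np.zeros(W.shape[0])) for W, b in encoders][::-1]
    n = len(X)
    for _ in range(epochs):
        layers = encoders + decoders
        # Forward: tanh everywhere except a linear output layer.
        Hs = [X]
        for i, (W, b) in enumerate(layers):
            A = Hs[-1] @ W + b
            Hs.append(A if i == len(layers) - 1 else np.tanh(A))
        # Backward: gradient of the mean squared reconstruction error.
        G = (Hs[-1] - X) / n
        grads = []
        for l in range(len(layers) - 1, -1, -1):
            Ga = G if l == len(layers) - 1 else G * (1.0 - Hs[l + 1]**2)
            grads.append((Hs[l].T @ Ga, Ga.sum(axis=0)))
            G = Ga @ layers[l][0].T
        for l, (gW, gb) in zip(range(len(layers) - 1, -1, -1), grads):
            W, b = layers[l]
            layers[l] = (W - lr * gW, b - lr * gb)
        encoders[:] = layers[:len(encoders)]
        decoders[:] = layers[len(encoders):]
    return encoders, decoders
```

Unlike the greedy phase, the criterion here is global: reconstruction error at the output is backpropagated through all decoder and encoder layers jointly.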