tuning, a layer-by-layer pre-training of RBMs is performed: the outputs of one
layer (RBM) are treated as inputs to the next layer (RBM), and the procedure
repeats until all the RBMs are trained.
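The greedy stacking procedure can be sketched in NumPy as follows. `train_rbm` is a toy CD-1 trainer introduced here purely for illustration; the function names, epoch count, and learning rate are assumptions, not taken from the original work:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=5, lr=0.1, rng=None):
    """Toy CD-1 training of a Bernoulli-Bernoulli RBM (illustration only)."""
    rng = rng or np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    a = np.zeros(n_hidden)        # hidden biases
    b = np.zeros(data.shape[1])   # visible biases
    for _ in range(n_epochs):
        ph = sigmoid(data @ W + a)                    # P(h=1|v)
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)                     # P(v=1|h)
        ph2 = sigmoid(pv @ W + a)
        W += lr * (data.T @ ph - pv.T @ ph2) / len(data)
    return W, a, b

def pretrain_dbn(data, layer_sizes, rng=None):
    """Greedy layer-by-layer pre-training: each trained RBM's hidden
    activations become the 'visible' data for the next RBM."""
    rng = rng or np.random.default_rng(0)
    stack, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden, rng=rng)
        stack.append((W, a, b))
        x = sigmoid(x @ W + a)    # outputs feed the next layer
    return stack
```

Because each RBM sees only the previous layer's activations, training time grows linearly with the number of stacked layers.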
This layer-by-layer unsupervised
learning is important in DBN training because, in practice, it helps avoid local
optima and alleviates the overfitting problem that arises when thousands of
parameters must be fitted. Furthermore, the algorithm is efficient in terms of its
time complexity, which is linear in the size of the RBMs [36]. Features at
different layers contain different information about the data structure, with
high-level features derived from
low-level features. For a simple RBM with Bernoulli distributions of both the
visible and hidden layers, the sampling probabilities are as follows [36]:

P(h_j = 1 | v; W) = σ(a_j + Σ_i w_ij v_i),   (1)

P(v_i = 1 | h; W) = σ(b_i + Σ_j w_ij h_j),   (2)
where v and h represent an I×1 visible unit vector and a J×1 hidden unit vector,
respectively; W is the matrix of weights (w_ij) connecting the visible and
hidden layers; a_j and b_i are bias terms; and σ(·) is the sigmoid function. For
the case of real-valued visible units, the conditional probability distributions
are quite different: typically, a Gaussian-Bernoulli distribution is assumed and
P(v_i | h; W) is Gaussian. The weights w_ij are updated by an approximate method
called contrastive divergence (CD). For example, the (t+1)th update for w_ij can
be written as

Δw_ij(t+1) = c Δw_ij(t) + η (⟨v_i h_j⟩_data − ⟨v_i h_j⟩_model),   (3)

where η is the learning rate and c is the momentum factor; ⟨·⟩_data and
⟨·⟩_model are the expectations under the distributions defined by the data and
the model, respectively.
While the expectations could be calculated exactly only by running Gibbs
sampling infinitely long, in practice one-step CD is often used because it
performs well [37].
Other model parameters (e.g., the biases) can be updated similarly. As a
generative model, the RBM is trained with a Gibbs sampler that samples the hidden
units based on the visible units and vice versa (Eqs. (1) and (2)). The weights
between the two layers are then updated using the CD rule (Eq. (3)). This
procedure repeats until convergence. An RBM thus models the data distribution
using hidden units without employing any label information.
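The alternating Gibbs sampling and CD-1 weight update described above can be sketched as a single training step in NumPy. This is a minimal illustration; the function name `cd1_step` and the hyperparameter values are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, dW, lr=0.1, momentum=0.5, rng=None):
    """One contrastive-divergence (CD-1) weight update for a
    Bernoulli-Bernoulli RBM.
    v0: batch of visible vectors (batch x I)
    W:  I x J weight matrix; a: hidden biases (J,); b: visible biases (I,)
    dW: previous weight increment, reused in the momentum term c*dW
    """
    rng = rng or np.random.default_rng(0)
    # Positive phase: sample hidden units from the data, Eq. (1).
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to the visible layer, Eq. (2).
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + a)
    # CD-1 update with momentum, Eq. (3): <vh>_data - <vh>_model.
    grad = (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    dW = momentum * dW + lr * grad
    return W + dW, dW
```

Calling `cd1_step` repeatedly over mini-batches, while updating the biases analogously, constitutes the RBM training loop.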
After pre-training, the information about the input data is stored in the weights
between adjacent layers. The DBN then adds a final layer representing the
desired outputs, and the overall network is fine-tuned using labeled data and
backpropagation for better discrimination (in some implementations, another
layer, called the associative memory and determined by supervised learning
methods, sits on top of the stacked RBMs).
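As a sketch of this fine-tuning stage, the code below stacks pre-trained RBM weights into a feed-forward network, adds a randomly initialized softmax output layer, and runs plain backpropagation on labeled data. This is a hypothetical minimal implementation; the `(W, bias)` weight format, learning rate, and epoch count are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(rbm_weights, X, y_onehot, n_classes, lr=0.1, n_epochs=10, rng=None):
    """Fine-tune a DBN: hidden layers are initialized from pre-trained
    RBM (W, bias) pairs; a new softmax output layer is appended and the
    whole network is trained by backpropagation on labeled data."""
    rng = rng or np.random.default_rng(0)
    layers = [(W.copy(), a.copy()) for W, a in rbm_weights]
    W_out = 0.01 * rng.standard_normal((layers[-1][0].shape[1], n_classes))
    b_out = np.zeros(n_classes)
    for _ in range(n_epochs):
        # Forward pass, caching activations for backprop.
        acts = [X]
        for W, a in layers:
            acts.append(sigmoid(acts[-1] @ W + a))
        p = softmax(acts[-1] @ W_out + b_out)
        # Softmax + cross-entropy gives (p - y) at the output pre-activation.
        delta = (p - y_onehot) / len(X)
        gW_out, gb_out = acts[-1].T @ delta, delta.sum(axis=0)
        delta = delta @ W_out.T
        for i in reversed(range(len(layers))):
            delta = delta * acts[i + 1] * (1 - acts[i + 1])  # sigmoid'
            W, a = layers[i]
            layers[i] = (W - lr * acts[i].T @ delta, a - lr * delta.sum(axis=0))
            delta = delta @ W.T
        W_out -= lr * gW_out
        b_out -= lr * gb_out
    return layers, W_out, b_out
```

The pre-trained weights merely serve as the starting point; backpropagation adjusts every layer discriminatively.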
There are other variations for pre-training: instead of using RBMs, for example,
stacked denoising auto-encoders and stacked predictive sparse coding have also
been proposed for unsupervised feature learning. Furthermore, recent results
show that when a large amount of training data is available, fully supervised
training starting from random initial weights instead of pre-trained weights
(i.e., without using RBMs or auto-encoders) works well in practice. For example,
a discriminative model starts with a network containing a single hidden layer
(i.e., a shallow neural network), which is trained by backpropagation. Upon
convergence, a new hidden layer is inserted into this shallow NN (between the
first hidden layer and the desired output layer), and the full network is
discriminatively trained again. This process continues until a predetermined
criterion is met (e.g., a target number of hidden neurons).
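This insert-and-retrain scheme can be sketched as follows. `train` is a bare-bones backpropagation routine and `grow_network` implements the layer-insertion loop; all names, the squared-error loss, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(weights, X, y, lr=0.5, n_epochs=300):
    """Discriminative training of a sigmoid MLP by plain backpropagation.
    weights: list of (W, b) pairs; the last pair is the output layer."""
    for _ in range(n_epochs):
        acts = [X]
        for W, b in weights:
            acts.append(sigmoid(acts[-1] @ W + b))
        # Squared-error gradient at the (sigmoid) output pre-activation.
        delta = (acts[-1] - y) * acts[-1] * (1 - acts[-1]) / len(X)
        for i in reversed(range(len(weights))):
            W, b = weights[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            weights[i] = (W - lr * grad_W, b - lr * grad_b)
    return weights

def grow_network(X, y, n_hidden, max_layers=3, rng=None):
    """Layer-by-layer discriminative training: start with one hidden layer,
    then repeatedly insert a fresh hidden layer just below the output layer
    and retrain the full network, until max_layers hidden layers exist."""
    rng = rng or np.random.default_rng(0)
    init = lambda m, n: 0.5 * rng.standard_normal((m, n))
    weights = [(init(X.shape[1], n_hidden), np.zeros(n_hidden)),
               (init(n_hidden, y.shape[1]), np.zeros(y.shape[1]))]
    weights = train(weights, X, y)
    while len(weights) - 1 < max_layers:
        # New hidden layer goes between the top hidden layer and the output.
        weights.insert(-1, (init(n_hidden, n_hidden), np.zeros(n_hidden)))
        weights = train(weights, X, y)
    return weights
```

Here the stopping criterion is simply a maximum depth; a validation-error criterion would slot into the same loop.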
In summary, DBNs use a greedy and efficient layer-by-layer approach to learn the
latent variables (weights) in each hidden layer and a backpropagation method for
fine-tuning. This hybrid training strategy improves both the generative
performance and the discriminative power of the network.