Before fine-tuning, a layer-by-layer pre-training of RBMs is performed: the outputs of one layer (RBM) are treated as inputs to the next layer (RBM), and the procedure repeats until all the RBMs are trained. This layer-by-layer unsupervised learning is important in DBN training because, in practice, it helps avoid local optima and alleviates the overfitting problem that arises when thousands of parameters must be estimated. Furthermore, the algorithm is efficient in terms of its time complexity, which is linear in the size of the RBMs [36]. Features at different layers contain different information about the data structure, with high-level features derived from low-level features. For a simple RBM with Bernoulli distributions for both the visible and hidden layers, the sampling probabilities are as follows [36]:

$P(h_j = 1 \mid \mathbf{v}; W) = \sigma\left(\sum_i w_{ij} v_i + a_j\right)$  (1)

and

$P(v_i = 1 \mid \mathbf{h}; W) = \sigma\left(\sum_j w_{ij} h_j + b_i\right)$  (2)
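A minimal NumPy sketch of the sampling probabilities in Eqs. (1) and (2); the layer sizes, weight initialization, and function names here are illustrative assumptions, not from the original:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, a):
    """Eq. (1): P(h_j = 1 | v; W) = sigmoid(sum_i w_ij * v_i + a_j)."""
    return sigmoid(v @ W + a)

def p_v_given_h(h, W, b):
    """Eq. (2): P(v_i = 1 | h; W) = sigmoid(sum_j w_ij * h_j + b_i)."""
    return sigmoid(h @ W.T + b)

rng = np.random.default_rng(0)
I, J = 6, 4                        # visible and hidden layer sizes (illustrative)
W = rng.normal(0, 0.1, (I, J))     # weight matrix (w_ij)
a, b = np.zeros(J), np.zeros(I)    # hidden and visible bias terms

v = rng.integers(0, 2, I).astype(float)  # a binary visible unit vector
probs = p_h_given_v(v, W, a)             # J hidden activation probabilities
```

Sampling a hidden vector then amounts to comparing these probabilities against uniform random draws, e.g. `(rng.random(J) < probs).astype(float)`.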

where $\mathbf{v}$ and $\mathbf{h}$ represent an $I \times 1$ visible unit vector and a $J \times 1$ hidden unit vector, respectively; $W$ is the matrix of weights ($w_{ij}$) connecting the visible and hidden layers; $a_j$ and $b_i$ are bias terms; and $\sigma(\cdot)$ is the sigmoid function. For the case of real-valued visible units, the conditional probability distributions are quite different: typically, a Gaussian-Bernoulli distribution is assumed and $P(v_i \mid \mathbf{h}; W)$ is Gaussian. The weights $w_{ij}$ are updated using an approximate method called contrastive divergence (CD). For example, the $(t+1)$th update for $w_{ij}$ can be computed as follows:

$\Delta w_{ij}(t+1) = c\,\Delta w_{ij}(t) + \eta\left(\langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}\right)$  (3)

where $\eta$ is the learning rate and $c$ is the momentum factor; $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{model}}$ are the expectations under the distributions defined by the data and the model, respectively. While the expectations could be computed by running Gibbs sampling to convergence, in practice one-step CD is often used because it performs well [37]. Other model parameters (e.g., the biases) can be updated similarly. As a generative model, RBM training includes a Gibbs sampler that samples the hidden units given the visible units and vice versa (Eqs. (1) and (2)). The weights between the two layers are then updated using the CD rule (Eq. (3)). This procedure repeats until convergence. An RBM thus models the data distribution using hidden units without employing label information.
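The one-step CD procedure described above (sample hidden units from the data, reconstruct the visible units, resample the hidden units, then apply the update rule of Eq. (3)) can be sketched as follows; the layer sizes and hyperparameter values are illustrative assumptions:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
I, J = 6, 4                       # visible and hidden layer sizes (illustrative)
W = rng.normal(0, 0.1, (I, J))    # weights w_ij
a, b = np.zeros(J), np.zeros(I)   # biases
dW = np.zeros_like(W)             # previous weight update, for the momentum term
eta, c = 0.1, 0.5                 # learning rate and momentum factor (illustrative)

v0 = rng.integers(0, 2, I).astype(float)   # one binary training example

# Positive phase: sample hidden units from the data (Eq. (1))
ph0 = sigmoid(v0 @ W + a)
h0 = (rng.random(J) < ph0).astype(float)

# One Gibbs step (CD-1): reconstruct visible units (Eq. (2)), resample hidden
pv1 = sigmoid(h0 @ W.T + b)
v1 = (rng.random(I) < pv1).astype(float)
ph1 = sigmoid(v1 @ W + a)

# Eq. (3): momentum term plus contrastive divergence gradient estimate
dW = c * dW + eta * (np.outer(v0, ph0) - np.outer(v1, ph1))
W += dW
```

Here the one-step reconstruction statistics `np.outer(v1, ph1)` stand in for the intractable model expectation $\langle v_i h_j \rangle_{\text{model}}$.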

After pre-training, information about the input data is stored in the weights between adjacent layers. The DBN then adds a final layer representing the desired outputs, and the overall network is fine-tuned using labeled data and backpropagation for better discrimination (in some implementations, an additional layer called an associative memory sits on top of the stacked RBMs and is determined by supervised learning methods).
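The greedy stacking procedure, in which each trained RBM's hidden activations become the next RBM's input, might be sketched as follows (a bare-bones CD-1 trainer with no momentum or mini-batching; all names, sizes, and hyperparameters are illustrative assumptions):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, n_epochs=5, eta=0.1, seed=0):
    """Train one RBM with CD-1 (minimal sketch; mean-field reconstruction)."""
    rng = np.random.default_rng(seed)
    I = data.shape[1]
    W = rng.normal(0, 0.1, (I, n_hidden))
    a, b = np.zeros(n_hidden), np.zeros(I)
    for _ in range(n_epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + a)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b)          # reconstruction
            ph1 = sigmoid(pv1 @ W + a)
            W += eta * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            a += eta * (ph0 - ph1)
            b += eta * (v0 - pv1)
    return W, a

def pretrain_dbn(data, layer_sizes):
    """Greedy layer-by-layer pre-training: the outputs of each RBM
    are treated as inputs to the next RBM."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, a = train_rbm(x, n_hidden)
        weights.append((W, a))
        x = sigmoid(x @ W + a)   # feed this layer's activations to the next
    return weights

rng = np.random.default_rng(2)
data = rng.integers(0, 2, (20, 8)).astype(float)
stack = pretrain_dbn(data, [6, 4])   # two stacked RBMs: 8 -> 6 -> 4 units
```

Fine-tuning would then initialize a feed-forward network with `stack`, append an output layer, and train the whole network with backpropagation on labeled data.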

There are other variations for pre-training: instead of using RBMs, for example, stacked denoising autoencoders and stacked predictive sparse coding have also been proposed for unsupervised feature learning. Furthermore, recent results show that when a large amount of training data is available, fully supervised training using random initial weights instead of pre-trained weights (i.e., without using RBMs or autoencoders) works well in practice. For example, a discriminative model starts with a network with a single hidden layer (i.e., a shallow neural network), which is trained by backpropagation. Upon convergence, a new hidden layer is inserted into this shallow NN (between the first hidden layer and the desired output layer) and the full network is discriminatively trained again. This process continues until a predetermined criterion is met (e.g., the number of hidden neurons).
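This layer-growing procedure can be sketched with scikit-learn's MLPClassifier, with one simplification: each growth step below retrains the whole network from random initialization rather than keeping the previously converged weights, since MLPClassifier cannot warm-start across a change in architecture. The toy data, layer sizes, and stopping criterion are illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # a toy binary labeling

hidden = []
for _ in range(3):                          # stop criterion: three hidden layers
    hidden.append(16)                       # insert a new hidden layer before the output
    net = MLPClassifier(hidden_layer_sizes=tuple(hidden),
                        max_iter=500, random_state=0)
    net.fit(X, y)                           # discriminative training by backpropagation
```

After the loop, `net` is a network with three hidden layers, grown one layer at a time with full discriminative retraining at each step.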

In summary, DBNs use a greedy and efficient layer-by-layer approach to learn the latent variables (weights) in each hidden layer and a backpropagation method for fine-tuning. This hybrid training strategy improves both the generative performance and the discriminative power of the network.