Weight Initialization
Why do we initialize with random weights? (Why don't we start with all-zero or all-equal weights?)
Because with zero/equal weights every neuron computes the same output and receives the same gradient, so the updates are identical across all neurons and they never become different.
Why do we initialize with small weights? (Why not large weights?) And are there any problems with too small weights?
There are problems with too-small weights as well - it also depends on the activation; say we have tanh.
As we move forward through the network, the mean of the activations stays around zero (because tanh is zero-centered) but the SD keeps diminishing; at the last layers the activations are far too small and the final output is almost zero. In backprop the gradients are also small, so the weights won't update - vanishing gradients.
For larger weights, the pre-activations fall in the saturation region of tanh, and at saturation the same thing happens: the gradient is low. But what about ReLU?
-------------------------------
So the weight initialization must be chosen differently depending on the activation function:
https://machinelearningmastery.com/weight-initialization-for-deep-learning-neural-networks/
Weight Initialization for Sigmoid and Tanh
Xavier Weight Initialization
Normalized Xavier Weight Initialization
Weight Initialization for ReLU
He Weight Initialization
More on this later
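As a quick reference before the details below (a minimal sketch; the ranges follow the linked article, with n = fan_in and m = fan_out, and the layer sizes here are assumed):
import numpy as np

n, m = 512, 256                                   # fan_in, fan_out (example sizes, assumed)

# Xavier: uniform in [-1/sqrt(n), +1/sqrt(n)]
w_xavier = np.random.uniform(-1/np.sqrt(n), 1/np.sqrt(n), size=(n, m))

# Normalized Xavier: uniform in [-sqrt(6)/sqrt(n+m), +sqrt(6)/sqrt(n+m)]
limit = np.sqrt(6) / np.sqrt(n + m)
w_xavier_norm = np.random.uniform(-limit, limit, size=(n, m))

# He: Gaussian with mean 0 and SD sqrt(2/n)
w_he = np.random.randn(n, m) * np.sqrt(2 / n)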
--------------------------------
If all weights are zero/equal, every neuron performs the same operation and gets the same update, so all neurons stay identical.
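A minimal sketch of this symmetry problem (toy sizes assumed): with equal weights every hidden unit computes the same value and gets an identical gradient column, so the symmetry never breaks.
import numpy as np

x = np.random.randn(4)                   # one input example
W = np.full((4, 3), 0.5)                 # all weights equal (same issue with all zeros)
h = np.tanh(W.T @ x)                     # three identical hidden activations
print(h)

upstream = np.ones(3)                    # some upstream gradient
dW = np.outer(x, upstream * (1 - h**2))  # gradient wrt W for the tanh units
print(dW)                                # all three columns identical -> identical updates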
First idea:
Small random numbers - gaussian with zero mean and 1e-2 SD
W = 0.01 * np.random.randn(D,H)
- Good for small networks but problems with deeper networks
Let's say we initialize a "deep" neural net with small random weights. We pass random data through it and look at the output of each activation (often just called the "activation") at every layer - specifically its mean and SD, computed over the whole batch and all features.
First layer output -> mean around zero and SD fairly high; the output is centered around zero because tanh is zero-centered.
In the later layers the mean is still around zero, but the SD keeps decreasing and eventually collapses to zero. The Gaussian distribution shrinks - becomes thinner and thinner - because at every layer the output is multiplied by small numbers.
x --> w*x --> h1
h1 --> w*h1 --> h2
h2 --> w*h2 --> h3
If weight 'w' is 0.001, then h3 ≈ 0.001 * 0.001 * 0.001 * x, which is very small - the spread/SD decreases and the distribution shrinks layer after layer, so with many layers the final output is almost zero.
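A sketch of that experiment (assumed: 10 tanh layers of 500 units, a batch of 1000 random inputs): print the mean and SD of the activations at each layer and watch the SD collapse.
import numpy as np

np.random.seed(0)
D = 500                                   # units per layer (assumed)
h = np.random.randn(1000, D)              # random input batch
for layer in range(10):
    W = 0.01 * np.random.randn(D, D)      # small random init
    h = np.tanh(h @ W)
    print(f"layer {layer}: mean={h.mean():+.5f}  SD={h.std():.5f}")
# the SD shrinks at every layer; in the last layers the activations are ~0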
In the backward pass, the gradient w.r.t. the weights, d(w*x)/dw, is again x - and x (the activation coming into the layer) is itself tiny, so just like in saturation the gradient is low and the weights won't update. Also, the gradient flowing backward is the upstream gradient (the topmost upstream being the loss) times the local gradient w.r.t. x, which is w; so as we move backwards/downwards from the upstream, the gradient keeps getting multiplied by small w's and decreases further and further.
Now let's say our weights are initialized not with low values but with high values. Then the outputs fall in the saturation region of the activation function (for tanh), so the gradient is low, there is not much update, and so on - the problems start to occur.
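The same sketch with large weights instead (scale 1.0 instead of 0.01, an assumed value): tanh saturates at ±1 and the local gradient 1 - tanh^2 collapses towards 0.
import numpy as np

np.random.seed(0)
D = 500                                   # units per layer (assumed)
h = np.random.randn(1000, D)
for layer in range(10):
    W = 1.0 * np.random.randn(D, D)       # large random init
    h = np.tanh(h @ W)
    local_grad = 1.0 - h**2               # derivative of tanh at the output
    print(f"layer {layer}: mean |h|={np.abs(h).mean():.3f}  mean local grad={local_grad.mean():.5f}")
# activations pile up at -1/+1, so the local gradient is ~0 and almost nothing updates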
So there is a need for a proper weight initialization method.
Xavier Initialization [Glorot et al. 2010]:
W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
fan_in = input feature length/ input nodes
fan_out = output nodes in a layer / output feature length
This ensures that the variance of the output is the same as the variance of the input:
if we have a small number of inputs, only a few weighted terms get summed, so we want each weight to be bigger to maintain the input/output variance (and vice versa for many inputs). [assuming we are in the active, roughly linear region of tanh]
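Running the same activation-statistics sketch with Xavier scaling (assumed sizes as before), the SD now stays roughly constant across layers instead of collapsing:
import numpy as np

np.random.seed(0)
fan_in = fan_out = 500                    # assumed layer sizes
h = np.random.randn(1000, fan_in)
for layer in range(10):
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)   # Xavier init
    h = np.tanh(h @ W)
    print(f"layer {layer}: mean={h.mean():+.5f}  SD={h.std():.5f}")
# the SD settles at a reasonable value instead of shrinking towards zero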
*****
But this too breaks with ReLU [46:40].
ReLU sets roughly half of the activations to zero.
So we do: W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
Since half of the units get killed, the extra factor of 2 compensates - this works well [He et al. 2015]. See the sketch below.
*****
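The ReLU version of the sketch (assumed sizes as before): dividing by sqrt(fan_in / 2) instead of sqrt(fan_in) compensates for the half of the units that ReLU zeroes out, keeping the activation statistics healthy.
import numpy as np

np.random.seed(0)
fan_in = fan_out = 500                    # assumed layer sizes
h = np.random.randn(1000, fan_in)
for layer in range(10):
    W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)   # He init
    h = np.maximum(0, h @ W)                                     # ReLU
    print(f"layer {layer}: mean={h.mean():.5f}  SD={h.std():.5f}")
# with plain Xavier (no /2) the SD would keep shrinking because ReLU kills half
# of the activations; the extra factor of 2 keeps it roughly stable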
General rule of thumb: use Xavier initialization for sigmoid/tanh.
Xavier initialization's derivation is based on the assumption that the activations are linear. This assumption is invalid for ReLU and PReLU.
Bias: commonly just initialized to zero (its initialization matters much less than the weights').
Later we'll see Batch Normalization, which frees us to choose almost any initialization, because we explicitly normalize the activations - or, even better, we let the network "learn" by itself whether it needs to normalize them at a given point or not.
**All taken from Stanford Visual Recognition Lecture 6.
https://medium.com/usf-msds/deep-learning-best-practices-1-weight-initialization-14e5c0295b94
Now let's talk about ReLU