Slides from Stanford CS231n - Lecture 6
Sigmoid:
Output Range: 0 (for -inf) to 1 (for inf)
Problem:
a) Saturation - At saturation the gradient is zero - so weights will stop updating - or update very slowly.
Weights ceasing to update is fine if it happens because the model already fits the training data well.
But if a neuron is saturated at initialization (or gets pushed into saturation early in training), it can barely learn - that is the problem!
Thus saturation is a problem because it erodes the plasticity of the network and usually results in worse test performance, much like overfitting does. [Source]
b) Outputs are not zero centered across the y-axis (the output range is 0-1, so outputs are always positive). [Time 12:00]
So the gradients on a neuron's weights will be all positive or all negative: with f = w*x + b and loss L, dL/dw_i = dL/df * df/dw_i, and df/dw_i = x_i is always +ve when the inputs x_i are all +ve (as they are when they come from a sigmoid); dL/df can be +ve or -ve, so every weight's gradient shares the same sign. This forces inefficient zig-zag updates, which is why we want x to be a mixture of +ve and -ve values.
c) The exp() function is computationally expensive - not a big issue though.
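A minimal NumPy sketch (my own code, not from the slides) showing how the sigmoid's local gradient collapses once the input saturates:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # derivative of sigmoid w.r.t. its input: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.2e}")
# At x=10 the local gradient is about 4.5e-05, so almost no signal flows back.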
Tanh:
Output range -1 to 1 [Zero Centered!]
But the saturation problem still persists!
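A tiny sketch (again my own code) showing that tanh outputs are zero centered but the gradient still vanishes at saturation:

import numpy as np

def tanh_grad(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

for x in [0.0, 2.0, 5.0]:
    print(f"x={x:4.1f}  tanh={np.tanh(x):+.5f}  grad={tanh_grad(x):.2e}")
# Outputs span (-1, 1), i.e. zero centered, but at x=5 the gradient is ~1.8e-04.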
ReLU
Does not saturate in +ve region.
Computationally efficient
Converges much faster than sigmoid/tanh (6x)
Biologically more plausible
Problem:
Not zero centered - across y axis
The negative half still saturates - there the output is 0 and the gradient is exactly zero.
So with bad initialization, a neuron whose pre-activation is negative for every input will never receive a gradient and its weights will never be updated (a "dead ReLU").
If the learning rate is too high, a large update can push the weights into that same regime - the same problem again! See the sketch after this list.
So sometimes 10-20% of the network may have dead ReLUs.
** So in practice people like to initialize ReLU neurons with slightly positive biases (e.g. 0.01)
Sometimes helps sometimes doesn't.
Sometimes people don't care at all and use 0 bias.
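A small sketch (my own NumPy code, hypothetical single neuron) illustrating a dead ReLU: when the pre-activation is negative for every input, the local gradient is zero everywhere and the weights never update.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 10))   # a batch of 1000 inputs, 10 features each

# Hypothetical single ReLU neuron with a badly scaled initialization:
w = rng.standard_normal(10)
b = -20.0                             # large negative bias pushes the neuron into the dead regime

pre = X @ w + b                       # pre-activations
local_grad = (pre > 0).astype(float)  # ReLU gradient is 1 where pre > 0, else 0

print("fraction of inputs that activate the neuron:", local_grad.mean())
# Almost surely 0.0: no input produces a gradient, so w and b never get updated.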
Leaky ReLU
Won't die in the negative region (a small slope keeps the gradient non-zero there).
Outputs are not restricted to be positive, so it is closer to zero centered than sigmoid/ReLU.
Parametric ReLU
alpha parameter is learnable
Exponential Linear Unit (ELU)
All the benefits of Leaky ReLU - but exp() is computationally expensive.
The negative region will have saturation - makes it robust to noise [Clevert et al 2015]
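A quick sketch of Leaky ReLU and ELU forward passes (plain NumPy; the function names and default alpha values are my own choices):

import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small fixed slope alpha in the negative region, so the gradient never becomes exactly zero.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # ELU (Clevert et al. 2015): identity for x > 0, saturates smoothly toward -alpha for very negative x.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("leaky relu:", leaky_relu(x))
print("elu:       ", elu(x))
# In PReLU the negative-region slope alpha is a learned parameter rather than a fixed constant.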
Maxout [Goodfellow et al. 2013]
Generalizes ReLU and Leaky ReLU
Doesn't saturate, doesn't die. Both regions have linear regime.
max (w1*x + b1, w2*x + b2)
Problem: Number of parameters/neuron doubles
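A minimal sketch of a maxout unit with two linear pieces (NumPy, my own variable names):

import numpy as np

rng = np.random.default_rng(0)
D, H = 10, 4                                  # input dimension, number of maxout units

x = rng.standard_normal(D)

# Two separate affine maps per unit, so the parameter count per neuron doubles vs. ReLU.
W1, b1 = rng.standard_normal((H, D)), np.zeros(H)
W2, b2 = rng.standard_normal((H, D)), np.zeros(H)

out = np.maximum(W1 @ x + b1, W2 @ x + b2)    # elementwise max of the two linear pieces
print(out)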
Data Preprocessing
Say X is an N x D matrix, with each example in a row.
Zero centering: X -= np.mean(X, axis = 0)
Normalization: X /= np.std(X, axis = 0)
Reason for zero-centering: to avoid the problem discussed for sigmoid - if all inputs to a layer are positive, the gradients on its weights all share the same sign.
Reason for normalization: normalization is done per feature, so that all features are on a comparable scale and contribute roughly equally.
For images: we only do zero-centering (mean subtraction); there is no need to normalize because all pixel values are already on the same scale ([0, 255] or [0, 1]).
We also don't do PCA/Whitening for images.
TL;DR for images: center only (subtract the per-channel mean); it is not common to normalize the variance or to do PCA/whitening.
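A minimal sketch of that image preprocessing step, assuming images are stored as an (N, H, W, 3) uint8 array (array names are my own):

import numpy as np

images = np.random.randint(0, 256, size=(32, 64, 64, 3), dtype=np.uint8)  # fake batch of images

X = images.astype(np.float32)
channel_mean = X.mean(axis=(0, 1, 2))   # one mean per color channel, shape (3,)
X -= channel_mean                       # center only: no variance normalization, no PCA, no whitening

print("per-channel mean before:", channel_mean)
print("per-channel mean after: ", X.mean(axis=(0, 1, 2)))   # ~0 after centering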