Incomplete - Need to study more
Batch Normalization is important for training deep neural networks.
VGG-16 and VGG-19 were developed before BN.
So they trained an 11-layer network first, added a few layers, trained again, and so on.
Inception used auxiliary losses during training; not strictly necessary, but they help propagate the loss into the early layers.
Residual Nets:
1. Important property: if the weights of a residual block are zero, the block behaves as an identity transformation, so the network can choose what it doesn't need (sketch below).
Easy for the model to learn not to use the layers it doesn't need.
L2 regularization pushes the weights toward zero, i.e. toward that identity behaviour.
2. Gradient flow in the backward pass is easy (through the skip connections), so deeper nets can be designed.
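A minimal NumPy sketch of the identity property (the residual branch F here is a made-up relu(x @ W), not the real conv + BN block; only the skip connection matters for the point):

import numpy as np

def residual_block(x, W):
    # Toy residual block: out = x + F(x), with a hypothetical branch F(x) = relu(x @ W).
    F = np.maximum(0, x @ W)
    return x + F            # skip connection adds the input back

x = np.random.randn(4, 8)
W_zero = np.zeros((8, 8))
# With zero weights the residual branch outputs 0, so the block is exactly the identity.
assert np.allclose(residual_block(x, W_zero), x)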
DenseNet and FractalNet - Study!
Recurrent Neural Network
- Variable Size Data
x -------> [ RNN ] --------> y
Every time an x is fed in, the RNN's hidden state is updated. Here's the difference: the internal hidden state is fed back into the model along with the next input, and so on. So: input -> update hidden state -> produce output.
now_state = func(prev_state, now_input); the function has some weights W, like a CNN.
To get y, we can apply an FC layer to "now_state" to produce the output at now_input.
We use the same parameters and the same function at every step of the computation.
Simple RNN, (Vanilla RNN):
h_t = f_w(h_{t-1}, x_t)
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t)   # tanh adds some non-linearity to the system
y_t = W_hy * h_t
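The same equations as a small NumPy sketch (toy sizes and random weights, chosen just for illustration):

import numpy as np

hidden_size, input_size, output_size = 16, 8, 4   # toy sizes (assumed)
rng = np.random.default_rng(0)
W_hh = 0.01 * rng.standard_normal((hidden_size, hidden_size))
W_xh = 0.01 * rng.standard_normal((hidden_size, input_size))
W_hy = 0.01 * rng.standard_normal((output_size, hidden_size))

def rnn_step(h_prev, x_t):
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t), y_t = W_hy h_t
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h0 = np.zeros(hidden_size)
x1 = rng.standard_normal(input_size)
h1, y1 = rnn_step(h0, x1)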
                  y1 --> L1         y2 --> L2     L = L1 + L2; for backprop, dL/dW
                  ^                 ^
                  NN                NN
                  |                 |
h0 ---> f_w ----> h1 ---> f_w ----> h2 ----> ...
         ^                 ^
         |                 |
         x1                x2
W -> the same W is used over the whole graph.
Now, how to initialize/set h0 and W?
f_w receives a unique x and a unique h at each step, but always the same W.
x_t are our sequence of inputs.
In the backward pass you end up summing the gradient contributions from every step.
x1, x2, ..., xt together form one input sequence, so the same weights must be used throughout one computation.
The sum of the losses over all steps is used to train W; the learnable parameters are W.
What about h? -> you only need to initialize h0; h1 is computed from x1 and h0, and so on.
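A sketch of the unrolled forward pass under the same toy setup (made-up sizes, random weights, and a squared-error loss; the point is the weight sharing and the summed loss, not the specific loss):

import numpy as np

hidden_size, input_size, output_size, T = 16, 8, 4, 5   # toy sizes (assumed)
rng = np.random.default_rng(0)
W_hh = 0.01 * rng.standard_normal((hidden_size, hidden_size))
W_xh = 0.01 * rng.standard_normal((hidden_size, input_size))
W_hy = 0.01 * rng.standard_normal((output_size, hidden_size))

xs      = [rng.standard_normal(input_size)  for _ in range(T)]   # x1 ... xT: one input sequence
targets = [rng.standard_normal(output_size) for _ in range(T)]   # made-up per-step targets

h = np.zeros(hidden_size)        # only h0 needs to be initialized
total_loss = 0.0
for x_t, t_t in zip(xs, targets):
    h = np.tanh(W_hh @ h + W_xh @ x_t)     # the same W_hh, W_xh reused at every step
    y = W_hy @ h                           # the same W_hy reused at every step
    total_loss += np.sum((y - t_t) ** 2)   # L = L1 + L2 + ...  (toy squared-error loss)
# Backprop through this sum accumulates a dL/dW contribution from every timestep,
# which is why the gradients end up being summed over the steps.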
Many to one, one to many.
For one to many, use the x to initialize the hidden state of the model.
For language translation -> instead of a single many-to-many network, we use an encoder (many to one) and then a decoder (one to many).
The encoder encodes the gist of the sentence in the input language into one vector, and the decoder takes that gist and expands it into the other language.
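A toy sketch of that encoder/decoder split (made-up sizes and weights; a real seq2seq model uses embeddings, LSTMs, and feeds previous outputs back into the decoder, this only shows the many-to-one -> one-to-many shape):

import numpy as np

hidden, in_dim, out_dim = 16, 8, 10      # toy sizes (assumed)
rng = np.random.default_rng(0)
We_hh = 0.01 * rng.standard_normal((hidden, hidden))   # encoder weights (many to one)
We_xh = 0.01 * rng.standard_normal((hidden, in_dim))
Wd_hh = 0.01 * rng.standard_normal((hidden, hidden))   # decoder weights (one to many)
Wd_hy = 0.01 * rng.standard_normal((out_dim, hidden))

def encode(xs):
    # Many to one: read the whole source sequence, return the final hidden state (the "gist").
    h = np.zeros(hidden)
    for x_t in xs:
        h = np.tanh(We_hh @ h + We_xh @ x_t)
    return h

def decode(gist, steps):
    # One to many: the gist initializes the decoder's hidden state, which then emits one output per step.
    h, ys = gist, []
    for _ in range(steps):
        h = np.tanh(Wd_hh @ h)
        ys.append(Wd_hy @ h)
    return ys

source  = [rng.standard_normal(in_dim) for _ in range(6)]
outputs = decode(encode(source), steps=4)   # output length need not match input length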
Why not many to many?
gist.github.com/karpathy/d4dee566867f8291f086
h and x are stacked into [h x] and fed through a single W.
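Quick NumPy check of that stacking trick (toy sizes): the single matrix [W_hh | W_xh] applied to the stacked [h x] equals the two-matrix form above.

import numpy as np

rng = np.random.default_rng(0)
hidden, in_dim = 4, 3                       # toy sizes (assumed)
W_hh = rng.standard_normal((hidden, hidden))
W_xh = rng.standard_normal((hidden, in_dim))
h = rng.standard_normal(hidden)
x = rng.standard_normal(in_dim)

W  = np.concatenate([W_hh, W_xh], axis=1)   # one big matrix [W_hh | W_xh]
hx = np.concatenate([h, x])                 # stacked vector [h x]
assert np.allclose(W @ hx, W_hh @ h + W_xh @ x)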
LSTM -> addresses the vanishing and exploding gradient problems of RNNs.
Open in sublime: https://drive.google.com/open?id=1O6geIf6GN2WhVzJOxZhM1ha6yc-d4vo5