
Lecture 10 - Recurrent Neural Networks

Incomplete - Need to study more

Batch Normalization is important for training deep neural networks.
VGG-16 and VGG-19 were developed before BN.
So they trained an 11-layer network first, added a few more layers, trained again, and so on.
Inception used auxiliary losses during training; they are not strictly necessary, but they help propagate the loss (gradient) back into the first layers.

Residual Nets:
1 important property: if the weights of a residual block are zero, the block behaves as an identity transformation, so the network can choose what it doesn't need.
Easy for the model to learn not to use the layers it doesn't need.
L2 regularization pushes weights toward zero, which pushes unused residual blocks toward the identity.
2 Gradient flow in the backward pass is easy (through the skip connections), so deeper nets can be designed.
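
A minimal numpy sketch of property 1, assuming a simple two-layer residual block F(x) with no batch norm or biases (my own toy shapes); with zero weights the block is exactly the identity:

    import numpy as np

    def residual_block(x, W1, W2):
        f = np.maximum(0, x @ W1) @ W2    # F(x): two linear layers with a ReLU in between
        return x + f                      # skip connection: output = x + F(x)

    x = np.random.randn(4, 8)
    W1, W2 = np.zeros((8, 8)), np.zeros((8, 8))
    assert np.allclose(residual_block(x, W1, W2), x)   # zero weights -> identity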

DenseNet and FractalNet - Study!

Recurrent Neural Network
  - Variable Size Data

  x -------> [ RNN ] --------> y
  Every time an x is input, the RNN's hidden state is updated. Here's the difference: the internal hidden state is fed back into the model on the next input, and so on. So: input -> update hidden state -> produce output

  now_state = func ( prev_state, now_input ), where the function has some weights W, like in a CNN

  To get y, we can use an FC layer on "now_state" to get the output at the current input

  we use the same parameters and the same function at every step of the computation

  Simple RNN, (Vanilla RNN):
  h_t = f_w ( h_tm1, x_t )

  h_t = tanh ( W_hh * h_tm1 + W_xh * x_t )             # tanh to add some non-linearity in the system
  y_t = W_hy * h_t
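
  A minimal numpy sketch of one step of this recurrence (the toy sizes and the zero-initialized h0 are my own assumptions; biases are omitted as in the formulas above):

  import numpy as np

  def rnn_step(h_prev, x_t, W_hh, W_xh, W_hy):
      h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t)
      y_t = W_hy @ h_t                            # y_t = W_hy h_t (an FC layer on h_t)
      return h_t, y_t

  H, D, O = 5, 3, 2                 # hidden, input, output sizes (made up)
  rng = np.random.default_rng(0)
  W_hh, W_xh, W_hy = rng.standard_normal((H, H)), rng.standard_normal((H, D)), rng.standard_normal((O, H))
  h0 = np.zeros(H)                  # h0 is usually just initialized to zeros
  h1, y1 = rnn_step(h0, rng.standard_normal(D), W_hh, W_xh, W_hy)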

                      y1--> L1          y2 --> L2             L = L1 + L2, for back prop, dL/dW
                      ^                 ^
                      NN                NN
                      |                 |                     
  h0 ---> f_w ----> h1 ---> f_w ----> h2 ----> ...
           ^                 ^
           |                 |           
           x1                x2

    W -> the same W is used over the whole unrolled graph.
    Now, how to initialize/set h0 and W?
    f_w receives a unique x and a unique h at every step, but the same W
    x_t are our sequence of inputs

    In the backward pass you end up summing the gradients from every step

    x1, x2, .... xt are one input sequence, so the same weights must be used throughout one computation,
    and the sum of the losses over all steps is used to train W; the learnable params are W.
    What about h? -> you only need to initialize h0; h1 is calculated from the input and h0, and so on.
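
    A minimal sketch of the unrolled forward pass (the per-step squared-error loss and the toy shapes are my own assumptions, just to show the shared W and the summed loss):

    import numpy as np

    def rnn_forward(xs, targets, h0, W_hh, W_xh, W_hy):
        # unrolled forward pass: the same W_hh, W_xh, W_hy are reused at every step
        h, total_loss = h0, 0.0
        for x_t, t_t in zip(xs, targets):
            h = np.tanh(W_hh @ h + W_xh @ x_t)          # shared weights at each step
            y = W_hy @ h
            total_loss += 0.5 * np.sum((y - t_t) ** 2)  # L = L1 + L2 + ...
        # in the backward pass dL/dW_hh (likewise W_xh, W_hy) is the SUM of the
        # gradient contributions from every time step, because W is shared
        return total_loss

    H, D, O, T = 5, 3, 2, 4
    rng = np.random.default_rng(1)
    W_hh, W_xh, W_hy = rng.standard_normal((H, H)), rng.standard_normal((H, D)), rng.standard_normal((O, H))
    xs = [rng.standard_normal(D) for _ in range(T)]       # x1, x2, ..., xT: one input sequence
    targets = [rng.standard_normal(O) for _ in range(T)]
    loss = rnn_forward(xs, targets, np.zeros(H), W_hh, W_xh, W_hy)   # h0 initialized to zeros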

    Many to one, one to many.

    for one to many, use the x to initialize the hidden state of the model.

    for language translation -> instead of a single many-to-many RNN, we use an encoder (many to one) and then a decoder (one to many)

    the encoder encodes the gist of the sentence in the input language into one vector, and the decoder takes that gist and expands it into the other language.

    Why not many to many? -> the input and output sentences generally have different lengths and no step-by-step alignment, so we first compress the whole input, then generate (see the sketch below).
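
    A minimal sketch of the encoder-decoder idea, assuming plain vanilla-RNN cells and made-up shapes (a real decoder would also feed its previous output back in as input):

    import numpy as np

    rng = np.random.default_rng(2)
    H, D_in, D_out = 5, 3, 4
    We_hh, We_xh = rng.standard_normal((H, H)), rng.standard_normal((H, D_in))
    Wd_hh, Wd_hy = rng.standard_normal((H, H)), rng.standard_normal((D_out, H))

    def encode(xs):
        # many to one: run over the source sequence; the final hidden state
        # is the fixed-size "gist" of the whole sentence
        h = np.zeros(H)
        for x_t in xs:
            h = np.tanh(We_hh @ h + We_xh @ x_t)
        return h

    def decode(h, steps):
        # one to many: the gist initializes the decoder's hidden state,
        # which then unrolls on its own to emit the output sequence
        ys = []
        for _ in range(steps):
            h = np.tanh(Wd_hh @ h)
            ys.append(Wd_hy @ h)
        return ys

    src = [rng.standard_normal(D_in) for _ in range(8)]
    out = decode(encode(src), 6)    # output length need not match input length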


    gist.github.com/karpathy/d4dee566867f8291f086

    h and x can be stacked into one vector [h x] and fed to a single combined weight matrix W = [W_hh W_xh]
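
    A quick numpy check that the stacked form is the same computation (shapes are made up):

    import numpy as np

    rng = np.random.default_rng(3)
    H, D = 5, 3
    W_hh, W_xh = rng.standard_normal((H, H)), rng.standard_normal((H, D))
    h, x = rng.standard_normal(H), rng.standard_normal(D)

    W = np.hstack([W_hh, W_xh])              # W = [W_hh  W_xh]
    hx = np.concatenate([h, x])              # stacked vector [h x]
    assert np.allclose(W @ hx, W_hh @ h + W_xh @ x)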

    LSTM -> helps with the vanishing gradient problem of vanilla RNNs (exploding gradients are usually handled separately with gradient clipping)
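
    Not covered in detail in these notes; a minimal sketch of one LSTM step (gate layout and shapes are my assumption, following the usual formulation): the additive cell-state update is what lets gradients flow back without vanishing.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # W has shape (4H, H + D): the i, f, o, g gates are computed from the
        # stacked [h x] vector in one matmul
        H = h_prev.shape[0]
        a = W @ np.concatenate([h_prev, x_t]) + b
        i = sigmoid(a[0:H])         # input gate
        f = sigmoid(a[H:2*H])       # forget gate
        o = sigmoid(a[2*H:3*H])     # output gate
        g = np.tanh(a[3*H:4*H])     # candidate cell update
        c_t = f * c_prev + i * g    # additive cell-state update -> better gradient flow
        h_t = o * np.tanh(c_t)
        return h_t, c_t

    H, D = 5, 3
    rng = np.random.default_rng(4)
    W, b = rng.standard_normal((4 * H, H + D)), np.zeros(4 * H)
    h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)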
   

   
