
Lecture 7 - Training Neural Networks II - Stanford

Default Activation Choice: ReLU

Weight Initialization: Too small - as activations get multiplied layer after layer, they shrink toward zero. Too large - they get multiplied over and over and explode.

Default Initialization: Xavier or MSRA (He).
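
A minimal sketch of both initializations for one fully connected layer (numpy; the layer sizes are illustrative, and the exact scaling variant differs slightly between the Xavier and MSRA/He papers):

import numpy as np

fan_in, fan_out = 512, 256   # layer dimensions (illustrative)

# Xavier: scale by 1/sqrt(fan_in) so activation variance stays roughly constant
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# MSRA / He: scale by sqrt(2/fan_in), which compensates for ReLU zeroing half the units
W_msra = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)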

Zero-center and normalize the inputs to each layer - otherwise sensitivity will be high and generalization low.
Sensitivity - how much the loss changes when the parameters change. The less sensitive the loss, the easier the optimization.

Batch Normalization - normalize intermediate activations to zero mean and unit variance. The number of means and variances equals the number of channels: each mean is taken over every dimension except the channel dimension (i.e. over the batch and spatial dimensions).
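
A rough sketch of the statistics BatchNorm computes for a conv feature map of shape (N, C, H, W), in PyTorch (tensor shape and eps are illustrative):

import torch

x = torch.randn(32, 64, 28, 28)              # (batch N, channels C, height H, width W)

# one mean / variance per channel: reduce over batch and spatial dims
mean = x.mean(dim=(0, 2, 3), keepdim=True)   # shape (1, 64, 1, 1)
var = x.var(dim=(0, 2, 3), keepdim=True)

x_hat = (x - mean) / torch.sqrt(var + 1e-5)  # zero mean, unit variance per channel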

If validation loss plateaus while training loss keeps decreasing - you may be overfitting - so add some regularization, like dropout.

A low learning rate should, in theory, give better results if you train for many epochs, but it costs a lot of time - so be careful.

Getting stuck in local minima with a low learning rate - a problem in theory, but rarely in practice.

Today:

Stochastic Gradient Descent - a minibatch of data per update
Full-batch Gradient Descent - all of the data per update
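
A minimal minibatch SGD loop sketch (the toy data, batch size, and learning rate below are illustrative):

import torch

X = torch.randn(1000, 20)                  # toy dataset
y = torch.randn(1000, 1)
model = torch.nn.Linear(20, 1)
loss_fn = torch.nn.MSELoss()
lr, batch_size = 1e-2, 64

for epoch in range(10):
    perm = torch.randperm(X.size(0))       # reshuffle each epoch
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]       # one minibatch, not the whole dataset
        loss = loss_fn(model(X[idx]), y[idx])
        model.zero_grad()
        loss.backward()
        with torch.no_grad():              # plain SGD step: w -= lr * grad
            for p in model.parameters():
                p -= lr * p.grad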

Fancier optimization:
Default: Adam with beta1 = 0.9, beta2 = 0.999, lr = 1e-3 or 5e-4
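
A sketch of the Adam update for a single parameter tensor, written as a hypothetical helper adam_step (bias-corrected first and second moments; eps is the usual small constant; t is the timestep, starting at 1):

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # first moment: exponential moving average of gradients
    m = beta1 * m + (1 - beta1) * grad
    # second moment: exponential moving average of squared gradients
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction for the zero-initialized moments
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v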

LR Decay:
Step decay: halve the LR every few epochs
Exponential decay: lr = lr0 * exp(-k*t)
1/t decay: lr = lr0 / (1 + k*t)

LR decay is less common with Adam, more common with SGD.
To pick a decay schedule: first train with a constant learning rate, look at the loss curve, and see where a lower learning rate would help.
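
A small sketch of the three schedules as functions of the epoch t (lr0, k, and the step interval are illustrative values):

import math

lr0, k = 1e-2, 0.1

def step_decay(t, drop_every=30):
    # halve the learning rate every `drop_every` epochs
    return lr0 * (0.5 ** (t // drop_every))

def exp_decay(t):
    return lr0 * math.exp(-k * t)

def inv_t_decay(t):
    return lr0 / (1 + k * t)

print(step_decay(60), exp_decay(60), inv_t_decay(60))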

Second-order optimization:
No learning rate needed (the Hessian determines the step), but forming and inverting the Hessian is far too much computation for large networks.
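
A toy sketch of why there is no learning rate: for a quadratic loss, one Newton step using the Hessian jumps straight to the minimum (the matrix and vector below are illustrative):

import numpy as np

# quadratic loss: f(x) = 0.5 * x^T A x - b^T x, minimized where A x = b
A = np.array([[3.0, 0.5], [0.5, 2.0]])   # Hessian of f (constant for a quadratic)
b = np.array([1.0, -1.0])

x = np.zeros(2)                           # starting point
grad = A @ x - b                          # gradient at x
x = x - np.linalg.solve(A, grad)          # Newton step: x -= H^{-1} grad, no learning rate

print(x, A @ x - b)                       # gradient is now ~0: we are at the minimum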

Regularization:
Explicit penalties added to the loss, like L1 or L2, are not so common for NNs.
1. Dropout for NNs - mostly in FC layers; in conv layers we drop some channels at random, which is less common (see the sketch after this list).
2. Batch Normalization
3. Data Augmentation
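
A minimal sketch of inverted dropout on an FC activation (the keep probability p is illustrative; at test time nothing is dropped and no rescaling is needed because the scaling was done at train time):

import torch

p = 0.5                                           # probability of keeping a unit

def dropout_forward(h, train=True):
    if not train:
        return h                                  # test time: use all units as-is
    mask = (torch.rand_like(h) < p).float() / p   # "inverted" dropout: scale at train time
    return h * mask

h = torch.randn(32, 100)                          # activations of an FC layer
h_train = dropout_forward(h, train=True)
h_test = dropout_forward(h, train=False)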

Transfer Learning:
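
No notes beyond the heading here; a minimal sketch of the usual recipe (take a CNN pretrained on ImageNet, freeze it, and retrain only the last layer), assuming torchvision's ResNet-18 and a hypothetical 10-class target task:

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)   # CNN pretrained on ImageNet

for param in model.parameters():
    param.requires_grad = False            # freeze the pretrained backbone

# replace the final FC layer for the new task (10 classes is a hypothetical example)
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# only the new layer's parameters get updated
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)

With more target data, the usual advice is to unfreeze the later layers as well and fine-tune them with a smaller learning rate.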

PyTorch:
import torch

dtype = torch.FloatTensor            # CPU tensors
# dtype = torch.cuda.FloatTensor     # GPU tensors

x = torch.randn(100, 200).type(dtype)
y = torch.randn(100, 10).type(dtype)
model = torch.nn.Linear(200, 10).type(dtype)

# size_average=False: sum the squared errors over all elements instead of averaging them
loss_fn = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = loss_fn(model(x), y)
optimizer.zero_grad()    # clear gradients from the previous step
loss.backward()          # backprop: compute gradients of the loss w.r.t. parameters
optimizer.step()         # take one Adam update step