Default Activation Choice: ReLU
Weight Initialization: Too small - as activations get multiplied by the weights layer after layer, they shrink toward zero. Too large - they grow layer after layer and explode.
Default Initialization: Xavier or MSRA (He).
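A minimal sketch of the two rules (the layer sizes below are made up for illustration):
import numpy as np
fan_in, fan_out = 512, 256                                          # hypothetical layer dimensions
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)       # Xavier: keeps activation variance roughly constant (tanh-like units)
W_msra = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # MSRA/He: the factor 2 compensates for ReLU zeroing half the inputs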
Zero-center and normalize the data/layer inputs - otherwise sensitivity will be high and generalization low.
Sensitivity - how much the loss changes when the parameters change; if the loss is less sensitive, optimization is easier.
Batch Normalization - normalize intermediate activations to zero mean and unit variance. The number of means/variances equals the number of channels: statistics are computed over the batch and spatial dimensions, separately for each channel.
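A rough sketch of the per-channel statistics, assuming an (N, C, H, W) batch of conv activations with made-up shapes:
import torch
x = torch.randn(32, 64, 28, 28)                               # (N, C, H, W) conv activations
mean = x.mean(dim=(0, 2, 3), keepdim=True)                    # one mean per channel -> shape (1, 64, 1, 1)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)     # one variance per channel
x_hat = (x - mean) / torch.sqrt(var + 1e-5)                   # zero mean, unit variance per channel
# a real BatchNorm layer also learns a per-channel scale (gamma) and shift (beta)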
If validation loss plateaus while training loss keeps decreasing, you may be overfitting - add some regularization, e.g. dropout.
A low learning rate should in theory give better results if you train for many epochs, but it costs a lot of time - so be careful.
Getting stuck in local minima with a low learning rate - a problem in theory, but rarely one in practice.
Today:
Stochastic Gradient Descent - each step uses a minibatch.
Normal (full-batch) Gradient Descent - each step uses all the data (the whole batch).
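A minimal sketch of a minibatch SGD loop; the toy dataset, model, and batch size are made up for illustration:
import torch
X = torch.randn(1000, 200)                       # toy dataset: 1000 samples, 200 features
y = torch.randn(1000, 10)
model = torch.nn.Linear(200, 10)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
for step in range(100):
    idx = torch.randint(0, X.shape[0], (64,))    # sample a random minibatch of 64
    loss = loss_fn(model(X[idx]), y[idx])        # loss on the minibatch only, not the whole dataset
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()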
Fancier optimization:
Default: Adam: b1 = 0.9, b2 = 0.999, lr = 1e-3 or 5e-4
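In PyTorch these defaults map onto the betas argument, roughly like this (the model here is just a placeholder):
import torch
model = torch.nn.Linear(200, 10)                 # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,            # or 5e-4
                             betas=(0.9, 0.999)) # b1, b2 from above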
LR Decay:
Step Decay: Decay LR by half every few epochs
Exponential Decay: lr = lr0 * e^(-kt)
1/t decay: lr = lr0 / (1 + kt)
LR decay is less common in Adam, more common in SGD.
Finding a decay schedule: first train with a fixed learning rate (no decay), look at the loss curve, and see where a lower learning rate might be needed; a small sketch of the three schedules follows.
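A small sketch of the three schedules as functions of the epoch (lr0, k, and the step interval are made-up values):
import math
lr0, k = 1e-3, 0.1                               # initial LR and decay rate
def step_decay(epoch, every=10):
    return lr0 * (0.5 ** (epoch // every))       # halve the LR every `every` epochs
def exp_decay(epoch):
    return lr0 * math.exp(-k * epoch)            # lr = lr0 * e^(-kt)
def inv_decay(epoch):
    return lr0 / (1 + k * epoch)                 # lr = lr0 / (1 + kt)
for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(epoch), exp_decay(epoch), inv_decay(epoch))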
Second-order optimization:
Uses curvature (the Hessian), so there is no learning rate to tune - but it is far too much computation for networks with millions of parameters.
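A toy sketch of why no learning rate is needed: a Newton step on a small made-up quadratic jumps straight to the minimum, but it requires solving against the Hessian, which does not scale to network-sized parameter counts.
import numpy as np
A = np.array([[3.0, 0.5], [0.5, 1.0]])           # Hessian of f(w) = 0.5 * w^T A w - b^T w
b = np.array([1.0, -2.0])
w = np.zeros(2)
grad = A @ w - b                                 # gradient at w
w = w - np.linalg.solve(A, grad)                 # Newton update: no learning rate; lands on the minimum in one step for a quadratic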
Regularization:
An explicit penalty added to the loss, like L1 or L2 - not so common on its own for neural nets.
1. Dropout for NNs - mostly in FC layers; in conv layers entire channels are dropped at random, which is less common (see the sketch after this list).
2. Batch Normalization
3. Data Augmentation
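A minimal sketch of (inverted) dropout on an FC activation, assuming a keep probability of 0.5:
import numpy as np
p = 0.5                                          # keep probability (assumed value)
def dropout_forward(x, train=True):
    if not train:
        return x                                 # test time: keep all units, no scaling needed
    mask = (np.random.rand(*x.shape) < p) / p    # inverted dropout: scale by 1/p at train time
    return x * mask
h = np.random.randn(64, 4096)                    # e.g. activations of an FC layer
h_dropped = dropout_forward(h)                   # roughly half the units zeroed, the rest scaled up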
Transfer Learning:
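A common recipe, sketched here with torchvision's ResNet-18 (the class count of the new task and the choice to freeze everything but the last layer are assumptions for illustration): load a net pretrained on ImageNet, freeze the features, and retrain only a new final layer.
import torch
import torchvision
model = torchvision.models.resnet18(pretrained=True)             # convnet pretrained on ImageNet
for param in model.parameters():
    param.requires_grad = False                                   # freeze the pretrained features
model.fc = torch.nn.Linear(model.fc.in_features, 10)              # fresh final layer for a hypothetical 10-class task
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)      # only the new layer gets updated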
PyTorch:
import torch

dtype = torch.FloatTensor          # CPU tensors
# dtype = torch.cuda.FloatTensor   # GPU tensors

# toy data and model so the training step below runs (shapes/model are made up)
x = torch.randn(100, 200).type(dtype)
y = torch.randn(100, 10).type(dtype)
model = torch.nn.Linear(200, 10)

# size_average=False sums the per-element losses instead of averaging (newer PyTorch: reduction='sum')
loss_fn = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = loss_fn(model(x), y)   # forward pass
optimizer.zero_grad()         # clear gradients from the previous step
loss.backward()               # backprop
optimizer.step()              # parameter update