
PyTorch Guide [Understanding PyTorch Code]

TORCH

torch.backends.cudnn.benchmark = True enables the built-in cuDNN auto-tuner, which searches for the fastest set of algorithms for your particular hardware and input configuration (the search itself takes some time). This usually leads to faster runtime, but it depends on the task: if your input sizes change a lot, each new shape triggers a new search and it might hurt runtime; if they are fixed, it should be much faster.
https://discuss.pytorch.org/t/what-does-torch-backends-cudnn-benchmark-do/5936/3
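A minimal sketch of where the flag typically goes (set it once, before the training loop):

import torch

# enable the cuDNN auto-tuner; worth it when input shapes are fixed,
# counter-productive when they vary from batch to batch
torch.backends.cudnn.benchmark = True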

[https://hsaghir.github.io/data_science/pytorch_starter/]
torch:
a general-purpose array library similar to NumPy that can do computations on GPU when the tensor is cast to a CUDA type (e.g. torch.cuda.FloatTensor)
torch.autograd:
a package for building a computational graph and automatically obtaining gradients
torch.nn:
a neural net library with common layers and cost functions
torch.optim:
an optimization package with common optimization algorithms like SGD, Adam, etc.
torch.jit:
a just-in-time (JIT) compiler that at runtime takes your PyTorch models and rewrites them to run at production-efficiency. The JIT compiler can also export your model to run in a C++-only runtime based on Caffe2 bits.

import torch # arrays on GPU
import torch.autograd as autograd #build a computational graph
import torch.nn as nn ## neural net library
import torch.nn.functional as F ## most non-linearities are here
import torch.optim as optim # optimization package


VARIABLES AND TENSORS

Variable is now Tensor since PyTorch 0.4.

Previously, a Variable was just a wrapper around a Tensor so that gradients could be auto-computed: it provided a backward method to perform backpropagation. Back then, Tensors were the actual data and Variables the wrapper.
Since PyTorch 0.4.0, Variables and Tensors are the same. Variables are no longer necessary for autograd: set requires_grad=True on a Tensor and you are good to go.
Variable(tensor) and Variable(tensor, requires_grad) still work, but they return Tensors instead of Variables now.
var.data is the same as tensor.data. [https://www.quora.com/What-is-the-difference-between-a-Tensor-and-a-Variable-in-Pytorch]
[https://pytorch.org/docs/stable/autograd.html#variable-deprecated]

In short, now Tensors are Variables, and Variables no longer exist.
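A minimal sketch of the post-0.4 style, no Variable wrapper needed:

import torch

x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()   # y = 2*x1 + 2*x2 + 2*x3
y.backward()        # fills x.grad with dy/dx
print(x.grad)       # tensor([2., 2., 2.])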

[https://medium.com/@layog/a-comprehensive-overview-of-pytorch-7f70b061963f]
Variable: To store tensors' gradients we had to wrap them in Variables. A Variable has certain properties: .data (the tensor under the variable), .grad (the gradient computed for this variable, with the same shape and type as .data), .requires_grad (a boolean indicating whether to calculate the gradient for the Variable during backpropagation), and .grad_fn (the function that created this Variable, used when backpropagating the gradients). There was one more attribute, .volatile (since removed). Variable is available under torch.autograd as torch.autograd.Variable

Before 0.4 it was not possible to forward a plain tensor through a Sequential model; only Variables worked. Since the merge, tensors can be passed directly.

A parameter of some module cannot be represented by a plain Tensor (no gradient) nor by a Variable (not registered as a module parameter). So a wrapper around Variable was created, called Parameter: assigning one as a module attribute registers it with that module. This is available under torch.nn as torch.nn.Parameter
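A small sketch of how Parameter registers itself (the Scale module here is a made-up toy example):

import torch
import torch.nn as nn

class Scale(nn.Module):
    def __init__(self):
        super(Scale, self).__init__()
        # assigning an nn.Parameter as a module attribute
        # automatically registers it in module.parameters()
        self.s = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.s * x

print(list(Scale().parameters()))  # contains the registered parameter s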


PYTORCH TENSORS VS NUMPY NDARRAY
Both are almost the same; the difference is that PyTorch tensors can use GPUs.
All NumPy operations can be done in PyTorch; PyTorch tensors are like NumPy arrays, N-dimensional.
device = torch.device('cuda') -> to run a pytorch tensor on gpu
Ex: torch.randn(N, 1000, device=device)
tens = torch.from_numpy(arr)
Also remember: these days PyTorch variables and tensors are the same.
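A quick sketch of the interop (the CUDA line assumes a GPU is available):

import numpy as np
import torch

arr = np.zeros((2, 3))
tens = torch.from_numpy(arr)  # shares memory with arr
back = tens.numpy()           # back to numpy, still shared
if torch.cuda.is_available():
    gpu_tens = tens.to(torch.device('cuda'))  # copy onto the GPU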

COMPUTATION GRAPHS
In the computational graph, a node is an array/tensor and an edge is an operation on the array/tensor.
Pytorch also creates computational graphs, but does so dynamically or on-the-fly.
In graph, nodes are tensors, edges are functions.

Every matrix corresponds to a graph (https://www.math3ma.com/blog/matrices-probability-graphs)
matmul is actually a function / an edge.

from torch.autograd import Variable
Convert a Tensor to a node in the computational graph using torch.autograd.Variable()
Access its value using x.data
Access its gradient using x.grad
# No longer required to wrap Tensor in Variable. We can directly use Tensors.
torch.tensor(d) # d = a Python list; torch.tensor is the preferred constructor over torch.Tensor

Doing operations on the Variable makes edges of the graph.

The edges of the graph also result in new nodes in the computational graph.
Each node in the graph has a .data property, which is a multi-dimensional array, and a .grad property, which is its gradient with respect to some scalar value (.grad is itself a Variable).

FORWARD AND BACKWARD
There are some variables/tensors in our architecture/graph/model which we do not need to update and some which we need to update.
Example: x ---(w1)---> h1 ---(w2)---> y and [loss]
Here, x, w1, h1, w2 and y are variables, but we only need to update w1 and w2 (with respect to loss).
To make a distinction between them, there is requires_grad (which is False by default):
x = autograd.Variable(d, requires_grad=False)
w1 = autograd.Variable(d, requires_grad=True)
w1.data.size(), w1.grad [.creator was the old name for what is now .grad_fn]

import torch
a = torch.autograd.Variable(torch.Tensor(10,10))
print(a.requires_grad) # prints False

[https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html]
torch.Tensor is the central class of the package. If you set its attribute .requires_grad as True, it starts to track all operations on it.
To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.
To prevent tracking history (and using memory), you can also wrap the code block in with torch.no_grad():
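A short sketch of both mechanisms:

x = torch.ones(2, requires_grad=True)
y = x * 3
z = y.detach()          # same values, but cut off from the graph
print(z.requires_grad)  # False

with torch.no_grad():   # nothing inside this block is tracked
    w = x * 3
print(w.requires_grad)  # False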

[https://github.com/jcjohnson/pytorch-examples]
forward -> computes output tensors from input tensors.
backward-> receives gradient of output tensors wrt some scalar value - like loss
and computes gradient of input tensors wrt same scalar value

TORCH.AUTOGRAD
Problem: Calculating gradients manually and updating every node by hand is impossibly difficult for huge networks.

torch.autograd : automatically implement backpropagation / automatic differentiation [data, grad, creator]
https://discuss.pytorch.org/t/what-does-the-backward-function-do/9944
Let's assume the w's are our variables/tensors.

.backward()
loss.backward() computes dloss/dw for every parameter w which has requires_grad=True. These are "accumulated" into w.grad for every parameter w. In pseudo-code:
w.grad += dloss/dw    (note the += part, accumulation)

Two important things:
1. loss.backward() only computes the gradients and "accumulates" it into w.grad, it doesn't update w, we have to do it later.
2. Since the gradients are accumulated into w.grad, we need to zero the grad first so that we get the gradient of one operation/pass only: call w.grad.data.zero_() before doing .backward().

So far we have a way to compute gradients (automatically) for all the variables/tensors we need to update. But we still need to apply the updates, and updating variables/tensors manually is still very tedious: there might be a huge number of them, e.g. weights.

Also, manually calling w.grad.data.zero_() for all w's is tedious.

When can accumulation be useful?
Say you cannot fit a large batch in memory at once, so you feed small batches but add up their gradients and only update after several of them, as in the sketch below. Or when dealing with multiple losses.
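A hedged sketch of gradient accumulation (model, loss_fn, optimizer and loader are assumed to exist):

accum_steps = 4
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y) / accum_steps  # scale so the sum matches one big batch
    loss.backward()                            # grads accumulate in param.grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # update once every accum_steps batches
        optimizer.zero_grad()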

optimizer.step()
optimizer.step updates the value of w using the gradient w.grad. For example, the SGD optimizer performs:
w += -lr * w.grad, for every parameter w whose requires_grad is True

Update:
Instead of using an optimizer we can also get all the model parameters using model.parameters()
and then update each parameter as parameter.data -= learning_rate * parameter.grad.data.
Useful if you are researching a new optimizer.

And what about manually setting all grads to zero? Well, model.zero_grad() recursively sets the gradient buffers of all the parameters in the model to zero.

There are many methods available for each module to access its children: model.modules(), model.named_modules(), model.parameters(), model.named_parameters(), model.children() and model.named_children(). The most used is model.parameters(), since it accesses all the parameters recursively and hence can be passed to an optimizer.
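For example, to inspect what an optimizer would receive (assumes model is defined, as below):

for name, param in model.named_parameters():
    print(name, param.shape, param.requires_grad)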

model = SomeOurModule() # a module which inherits from nn.Module; more on this below

optimizer.zero_grad()
optimizer.zero_grad() clears w.grad for every parameter w in the optimizer. It’s important to call this before loss.backward(), otherwise you’ll accumulate the gradients from multiple passes.

Remember: Update happens only when you call step().

So general way to go is:
model = MyModelLikeSRGAN(some_input)
optim = optimizer_defined_on_some_parameters_we_want_to_update_like_model.parameters() # torch.optim.Adam(model.parameters())
optim.zero_grad()
y_pred=model(x) #or we can do, y = model.forward(x)
loss = loss_fn(y_pred,y) #loss_fn = nn.MSELoss() or we can do loss=nn.functional.mse_loss(y_pred, y) directly
loss.backward() #calc grads
optim.step()    #update params

or "maybe"
model.zero_grad()
loss.backward()
for parameter in model.parameters():
parameter.data -= learning_rate*parameter.grad.data


Transforms: [https://medium.com/@layog/a-comprehensive-overview-of-pytorch-7f70b061963f]
Functions:
You give input, you get output; functions have no memory of their own and do not store any state or buffer, like the log function.
A linear layer cannot be a function, because it has internal state, such as weights and biases.
If we supply the weights and biases externally/explicitly, then a linear layer becomes a function.
In short, functions do not have learnable parameters.
Common mathematical functions are implemented under torch as torch.log, torch.sum etc.
Other neural network related functions are implemented under torch.nn.functional

Ex: nn.MSELoss(). Since it is a non-parametric function, we can use the functional form nn.functional.mse_loss as well. The only difference is that we call it directly, like nn.functional.mse_loss(out, y). See the example above, in "So general way to go is".
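Both spellings compute the same value (out and y are assumed to be tensors of the same shape):

import torch.nn as nn
import torch.nn.functional as F

criterion = nn.MSELoss()
loss1 = criterion(out, y)   # module form: instantiate, then call
loss2 = F.mse_loss(out, y)  # functional form: call directly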

Modules:
Consist of parameters, layers, functions or other modules.
When we backprop, the gradients of all parameters of a module and its child modules are calculated.
After constructing a module we need to call backward only once and it will automatically compute gradients for the child modules recursively.
Predefined modules are implemented under torch.nn as torch.nn.Conv2d, torch.nn.Linear etc.

Whenever we need to create a new function we will subclass torch.autograd.Function and whenever we need to define a new model (module) we will subclass torch.nn.Module

When we define a new model or module, we need to define two functions: __init__ and forward.

class LinearRegressor(nn.Module):
    def __init__(self, inp_size, require_bias=True):
        super(LinearRegressor, self).__init__()
        self.linear = nn.Linear(in_features=inp_size, out_features=1, bias=require_bias)
 
    def forward(self, inp_batch):
        return self.linear(inp_batch)

In __init__, we instantiate all our parametric layers (and sometimes non-parametric ones too).
In forward, we apply all the layers and other functions to our input, and that's it: we have created our model.
[https://gist.github.com/layog/a006e8ec8c201639d46a34e77a693554#file-pytorch_linear_regression-py]

super(LinearRegressor, self).__init__() calls the __init__ of nn.Module, since LinearRegressor inherits from nn.Module

https://pytorch.org/docs/stable/nn.html
nn.Module is the base class for all neural network modules.
It basically has:

def __init__(self):
    self._backend = thnn_backend
    self._parameters = OrderedDict()
    self._buffers = OrderedDict()
    self._backward_hooks = OrderedDict()
    self._forward_hooks = OrderedDict()
    self._forward_pre_hooks = OrderedDict()
    self._state_dict_hooks = OrderedDict()
    self._load_state_dict_pre_hooks = OrderedDict()
    self._modules = OrderedDict()
    self.training = True

def forward(self, *input):
and other methods as well.

So your module should look like:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Defining parametric layers and initializing parameters:
lin = nn.Linear(3, 1) # positional args are in_features=3, out_features=1; the layer holds a learnable weight and bias internally
y = lin(x)
lin = nn.Linear(in_features=inp_size, out_features=1, bias=require_bias) # same thing with keyword args; bias can be switched off
y = lin(x)

Initializing parameters:
Is just putting values in the .data attribute of Variables.
1. We can iterate over model.parameters() and init them with tensor functions such as exponential_, uniform_, fill_ etc.
2. Every module has a method .apply. We can call .apply on the module and pass it a function which handles the initialization for each parameter. Whenever .apply is called on a module, it is called on each submodule recursively (see the sketch below).
3. Using the torch.nn.init module: suppose we want to initialize w using Xavier initialization, do torch.nn.init.xavier_uniform_(w). Combining this with .apply we can use torch.nn.init to initialize parameters of all sorts.

Default: In PyTorch, parameters are automatically initialized using some predefined initialization.
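A sketch combining .apply with torch.nn.init (assumes model is already built; uses the underscored in-place init names):

import torch.nn as nn

def init_weights(m):
    # .apply calls this once for every submodule, recursively
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
        if m.bias is not None:
            nn.init.constant_(m.bias, 0)

model.apply(init_weights)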

Not that I'll ever do this, but a note: a+b is not allowed if a is on the CPU and b is on the GPU; both operands must live on the same device.
To run on GPU: copy the model parameters and the input data to the GPU.
Done using the .cuda() method available on all tensors and modules.
For tensors it is straightforward: x.cuda(), y.cuda()
For a module (model), calling .cuda() will recursively copy all child modules and parameters to the GPU.
First check torch.cuda.is_available(), which returns True if a GPU is available on the machine.

We have a model,
model = SRGAN()
if torch.cuda.is_available(): model.cuda()
model.train() and model.eval() do nothing but set the mode, to affect only those layers that behave differently during training and testing, like dropout and batchnorm;
they don't actually train or evaluate, it's like setting is_train to true or false.

Multiple gpus: [I won't worry about this now]
torch.nn.DataParallel
model = torch.nn.DataParallel(model)
with torch.cuda.device(<device_id>)
or <tensor>.cuda(device_id)
torch.cuda.empty_cache()
<tensor>.pin_memory()

Custom functions in PyTorch are written just like ordinary Python functions.

Defining new functions with a custom backward pass can be accomplished by subclassing torch.autograd.Function and defining two static methods, forward and backward.
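A minimal sketch: a ReLU implemented by hand this way (the class name is illustrative):

import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)  # stash what backward will need
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[x < 0] = 0     # gradient is zero wherever the input was negative
        return grad_input

y = MyReLU.apply(torch.randn(3, requires_grad=True))  # note: .apply, not a plain call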

Excluding parts of the graph from optimization:
1. Do not pass the parameters that are not to be optimized to the optimizer,
i.e. construct torch.optim.<your-favorite-optimizer> without the parameters which are not to be updated.
Inefficient: gradients for those parameters will still be calculated by loss.backward().
2. Set requires_grad to False.
While creating the tensor/variable, pass the requires_grad=False argument,
or, after creation, <param>.requires_grad = False
Ex: find the names using model.named_parameters(), then model.linear.weight.requires_grad = False
named_parameters() returns an iterator of tuples with two elements: first the name of the parameter, second the parameter itself.
3. To avoid storing temporary Variables or buffers and save memory, either set requires_grad = False for all of them (overkill!)
or use the volatile attribute of a Variable, which makes all the children volatile as well, so no buffer is stored for any variable.
Volatile propagates faster than requires_grad. (volatile was removed in 0.4; wrap the code in torch.no_grad() instead.)

Optimizers:
List: https://pytorch.org/docs/stable/optim.html#algorithms
Every optimizer expects a list of parameters as input during initialization; these are the parameters the optimizer will try to optimize. If any parameter which does not require gradients is passed to the optimizer, it will throw an error.

During fine-tuning we want the lower layers to have a much lower learning rate than the higher layers.
So, instead of passing a flat list of parameters to the optimizer, we can also pass a list of dictionaries, where each dictionary defines the arguments for one parameter group. The parameters of each group are given by the params key of the dictionary.

# Setting conv1 parameters to non-trainable
# Note that this model won't learn well since we are not training the very first layer
# Generally, we will requires_grad=False when we load a pretrained model
model.conv1.weight.requires_grad = False
model.conv1.bias.requires_grad = False

# Type 1 optimizer definition, all the parameters are using the same configuration
# Note that we do not have much flexibility with this optimizer definition and differential learning rate is not possible
optim1 = torch.optim.SGD([param for param in model.parameters() if param.requires_grad], lr=0.01, momentum=0.9)

# Type 2 optimizer definition, different parameter group have different configuration
# Here conv2 parameters have learning rate of 0.0001 and 0 momentum
# while other parameter groups have learning rate of 0.01 and momentum of 0.9
optim2 = torch.optim.SGD([
    {'params': model.conv2.parameters(), 'lr': 0.0001, 'momentum': 0},
    {'params': model.linear1.parameters()},
    {'params': model.linear2.parameters()}
], lr=0.01, momentum=0.9)

We can now call optim.step() and this will update all the parameters according to the defined configuration.


Schedulers and updating optimizer:
For varying learning rates we need to update the optimizer's parameter groups.
Access parameter groups: optim.param_groups -> a list of dictionaries, each dict being a param group.
To set the learning rate of group 0 to 0.001, do: optim.param_groups[0]['lr'] = 0.001

Or use the predefined schedulers to auto-update the lr:
torch.optim.lr_scheduler
Pass in the optimizer whose learning rate is to be scheduled plus other scheduler-specific arguments,
then call scheduler.step(...) every epoch, with the proper arguments.
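A sketch with StepLR, which here halves the lr every 10 epochs (optimizer, num_epochs and train_one_epoch are assumed):

from torch.optim.lr_scheduler import StepLR

scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
for epoch in range(num_epochs):
    train_one_epoch()  # hypothetical training function
    scheduler.step()   # update the learning rate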

Working with datasets:
Module: torch.utils.data. Classes: Dataset and DataLoader.
To define our custom dataset, we subclass Dataset and override the methods __len__ (defining the length of our dataset) and __getitem__ (getting one item from our dataset by index).
Batch size, shuffling etc. are handled by DataLoader, which takes an object of our Dataset class as input plus other optional arguments such as batch_size, shuffle etc.
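A minimal sketch of a custom Dataset plus its loader (xs and ys are assumed to be same-length indexable tensors):

from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self, xs, ys):
        self.xs, self.ys = xs, ys

    def __len__(self):
        return len(self.xs)                # how many samples we have

    def __getitem__(self, idx):
        return self.xs[idx], self.ys[idx]  # one (input, target) pair

loader = DataLoader(MyDataset(xs, ys), batch_size=32, shuffle=True)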

Saving and Loading the model:
torch.save(model, <PATH>)
model = torch.load(<PATH>)
But this is not often used, since directly saving the complete model is error-prone and can break in a number of ways, even after something as simple as moving the model code to another directory.

torch.save(model.state_dict(), PATH)
model.load_state_dict(torch.load(PATH))

To load a state dict, we first need to create the model by defining it exactly as before.
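So the usual pattern looks like this (LinearRegressor from above; PATH is assumed):

model = LinearRegressor(inp_size=3)
model.load_state_dict(torch.load(PATH))
model.eval()  # set inference mode for dropout/batchnorm layers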


There are some other features as well:
torch.nn.ModuleList
torch.nn.ParameterList
.apply and .extend methods
Say a is a DoubleTensor; a = a.type(torch.FloatTensor) makes it a FloatTensor (.type returns a converted copy).
To pack different-sized inputs and unpack the outputs of an RNN, use pack_padded_sequence and pad_packed_sequence from the torch.nn.utils.rnn module.
RNN/LSTM : https://stackoverflow.com/questions/49466894/how-to-correctly-give-inputs-to-embedding-lstm-and-linear-layers-in-pytorch/49473068#49473068
torch.set_default_tensor_type(<tensor_type>)
torch.set_default_tensor_type(torch.cuda.FloatTensor)
Wonderful torch functions: https://pytorch.org/docs/stable/tensors.html
torch.bmm for batch matrix multiplication, torch.squeeze & torch.unsqueeze for removing and adding a dimension of a tensor, torch.cat for concatenation, etc. (a few of these in action below)
Numpy to Torch: torch.from_numpy(<array_name>)
Torch to Numpy: <tensor_name>.cpu().detach().numpy() (.cpu().data alone still gives a tensor, not an array)
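A few of those in action:

import torch

a = torch.randn(8, 3, 4)
b = torch.randn(8, 4, 5)
c = torch.bmm(a, b)           # batched matmul -> shape (8, 3, 5)
d = c.unsqueeze(0)            # add a dimension -> (1, 8, 3, 5)
e = d.squeeze(0)              # remove it again -> (8, 3, 5)
f = torch.cat([e, e], dim=0)  # concatenate -> (16, 3, 5)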





TORCH.NN
[https://hsaghir.github.io/data_science/pytorch_starter/]
[https://github.com/jcjohnson/pytorch-examples]
The nn package defines a set of Modules, which are roughly equivalent to neural network layers.
A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters.
The nn package also defines a set of useful loss functions.

We can use the nn package to define our model as a sequence of layers.
nn.Sequential is a Module which contains other Modules, and applies them in sequence to produce its output.

Each Linear Module computes output from input using a linear function, and holds internal Tensors for its weight and bias.

After constructing the model we use the .to() method to move it to the desired device.

model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

loss_fn = torch.nn.MSELoss(reduction='sum')
reduction='sum' means that we are computing the *sum* of squared errors rather than the mean;
it is more common to use mean squared error as a loss by setting reduction='mean' (called 'elementwise_mean' in older releases).

Also remember every module has __init__ and forward functions [see above].
So model(x) means model.forward(x) (calling model(x) goes through __call__, which also runs any registered hooks).
And generally in the forward method we have x = conv(x), x = relu(x), ..., meaning we pass x through one layer, take the result, put it into the next layer, and so on until we get the output.

Optimizer is like: optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
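Putting the pieces of this section together in a hedged sketch (D_in, H, D_out, device and the data tensors x, y are assumed):

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
).to(device)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for t in range(500):
    y_pred = model(x)          # forward pass, invokes model.forward(x)
    loss = loss_fn(y_pred, y)
    optimizer.zero_grad()      # clear previously accumulated grads
    loss.backward()            # compute grads
    optimizer.step()           # update params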




EXTRAS
What is nn.Sequential
What is mode CNA, NAC, CNAC?
These are custom defined structures, CNA = Conv->Norm->Activation
What is from torch.nn import init
init.kaiming_normal

Source: https://github.com/jcjohnson/pytorch-examples



Backpropagating through the graph computes gradients easily:

If we want to compute gradients with respect to some Tensor,
then we set requires_grad=True when constructing that Tensor,
like the weights: we want differentiation w.r.t. the weights so that we can update them.

Ex: w1 = torch.randn(D_in, H, device=device, requires_grad=True)

# Use autograd to compute the backward pass. This call will compute the
# gradient of loss with respect to all Tensors with requires_grad=True.
# After this call w1.grad and w2.grad will be Tensors holding the gradient
# of the loss with respect to w1 and w2 respectively.
loss.backward()

After this we have w1.grad and w2.grad for every iteration, and we can then update the weights manually:

w1 -= learning_rate * w1.grad
w2 -= learning_rate * w2.grad

use the torch.no_grad() context manager
to prevent PyTorch from building a computational graph

# Update weights using gradient descent. For this step we just want to mutate
# the values of w1 and w2 in-place; we don't want to build up a computational
# graph for the update steps, so we use the torch.no_grad() context manager
# to prevent PyTorch from building a computational graph for the updates
with torch.no_grad():
    w1 -= learning_rate * w1.grad
    w2 -= learning_rate * w2.grad

    # Manually zero the gradients after running the backward pass
    w1.grad.zero_()
    w2.grad.zero_()


https://hsaghir.github.io/data_science/pytorch_starter/

Variables have requires_grad=False by default. There is also retain_graph=True (called retain_variables in old releases), an argument to .backward() that keeps the intermediate buffers so backward can be called more than once.

So:
Nodes are tensors - and the tensors we need to update are the ones whose gradients we compute.