

Showing posts from April, 2019

Lecture 10 - Recurrent Neural Networks

Incomplete - need to study more.

Batch Normalization is important for training deep NNs. VGG 16 and 19 were developed before BN, so they trained an 11-layer network first, added a few layers, trained again, and so on. Inception used auxiliary losses during training - not strictly necessary, but they help propagate the loss into the early layers.

Residual Nets have two important properties:
1. If the weights of a residual block are zero, the block behaves as an identity transformation, so the network can choose which layers it doesn't need - it is easy for the model to learn not to use them (L2 regularization pushes the weights toward zero).
2. Gradient flow in the backward pass is easy, so deeper nets can be designed.

DenseNet and FractalNet - study!

Recurrent Neural Network - handles variable-size data.

  x -------> [ RNN ] --------> y

Every time an x is fed in, the RNN's hidden state is updated. Here's the difference: the internal hidden state is fed back to the model on the next input, and so on. So input -> update
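To make the input -> update loop concrete, here is a minimal vanilla-RNN step in PyTorch; the sizes and the toy loop are illustrative, not from the lecture:

    import torch
    import torch.nn as nn

    rnn = nn.RNNCell(input_size=10, hidden_size=20)  # h_t = tanh(W_ih x_t + W_hh h_{t-1} + b)
    h = torch.zeros(1, 20)                           # initial hidden state
    for t in range(5):                               # works for any sequence length
        x_t = torch.randn(1, 10)                     # one input at time step t
        h = rnn(x_t, h)                              # updated state is fed back next step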

Cross Entropy Squared Error KL-Divergence

Cross Entropy vs. Squared Error | KL-Divergence | Generative Adversarial Networks

Generative modeling is about finding the underlying distribution (model) from which the samples of images/data are supposed to come. Say we have a million images of cats; we can assume all the images come from one distribution, P_cat, so a sample from P_cat is an image of a cat. We want to model P_cat, i.e. we model the data / find the model of the data. Model = Distribution. [yt: hQv8FNaJHEA]

Explicit model: learns the actual parameters of the distribution. X_1 ~ P(X) --> finding P.
Implicit model: doesn't bother to learn the distribution of the data; instead it focuses on the stochastic [random] procedure that directly generates the data from some input. X_1 = F_W(Z), Z = latent input, W = learned weights. GANs are implicit models.

Overall view (minimax objective):

  min_G max_D  E_{x ~ P_data(x)}[ log D(x) ] + E_{z ~ P_z(z)}[ log( 1 - D(G(z)) ) ]

X is real data, Z is the noise, and G(Z) is
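A minimal sketch of the two loss terms using binary cross-entropy; the tiny D and G here are stand-ins just to make it runnable, not the post's models:

    import torch
    import torch.nn as nn

    # Toy discriminator and generator (placeholders)
    D = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
    G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))

    bce = nn.BCELoss()
    x_real = torch.randn(64, 2)                      # samples from P_data
    z = torch.randn(64, 4)                           # noise from P_z
    ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

    # D step: maximize log D(x) + log(1 - D(G(z))), written as BCE minimization
    d_loss = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)

    # G step: the common non-saturating form, maximize log D(G(z))
    g_loss = bce(D(G(z)), ones)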

Popular CNN Architectures

[Source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html]

output_size = ( I + P_left + P_right - D*(F-1) - 1 ) / S + 1
I = Input Size, F = Filter Size, D = Dilation, S = Stride, P = Padding
Number of filters (counted as 2D kernels) = No. of Input Channels x No. of Output Channels.
No. of params = Filter Size (F x F) * No. of Filters (plus one bias per output channel).

AlexNet [2012] - the pioneering paper.
    - Input = 227 x 227 x 3
    - First use of ReLU
    - Used norm layers (not common anymore)
    - Heavy data augmentation - flipping, jittering, cropping, color normalization, etc.
    - Dropout = 0.5
    - Batch size = 128
    - SGD momentum = 0.9
    - Learning rate 1e-2, reduced by 10x manually when val accuracy plateaus
    - L2 weight decay 5e-4
    - 7-CNN ensemble: 18.2% -> 15.4% (training multiple models and averaging them together)

ZF Net [2013] - a slight modification of AlexNet, but it explains convolutional networks very well - so far we believed convnets ar
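The output-size formula as a small Python helper for sanity checks (the function name is mine):

    def conv_output_size(i, f, s=1, p_left=0, p_right=0, d=1):
        """output_size = (I + P_left + P_right - D*(F-1) - 1) // S + 1"""
        return (i + p_left + p_right - d * (f - 1) - 1) // s + 1

    # AlexNet's first conv layer: 227x227 input, 11x11 filters, stride 4, no padding
    print(conv_output_size(227, 11, s=4))  # 55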

Lecture 7 - Training Neural Networks II - Stanford

Default activation choice: ReLU.

Weight initialization: too small - as we keep multiplying over and over, the activations diminish; too large - multiply over and over again and they explode. Default initialization: Xavier or MSRA.

Zero-center and normalize layers - otherwise sensitivity will be high and generalization low. Sensitivity = how much the loss function changes when the params change; if the loss is less sensitive, optimization is easier.

Batch Normalization - normalizes intermediate activations to zero mean and unit variance. The number of means/variances = the number of channels, i.e. the mean is taken over the batch and all other dimensions except channels (see the sketch below).

If validation plateaus while training loss keeps decreasing, maybe you are overfitting - so add some regularization, like dropout. A low learning rate should in theory give better results if you train for many epochs, but it takes a lot of time - so be careful. Getting stuck in local minima with a low learning rate - theoretically a problem, yes; practically, no. Today: Stochastic Gradient De
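A minimal sketch of those per-channel statistics (the tensor shape is illustrative):

    import torch

    x = torch.randn(32, 64, 28, 28)                 # (batch N, channels C, H, W)
    mean = x.mean(dim=(0, 2, 3))                    # one mean per channel -> shape (64,)
    var = x.var(dim=(0, 2, 3), unbiased=False)      # one variance per channel
    x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + 1e-5)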

VGG19 / VGG16

Input image: 224 x 224 x 3. All pre-trained models expect input images normalized the same way, i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224. The images have to be loaded into a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].

For all conv layers: bias = True, filter size = 3x3, stride = 1, padding = 1 (same). Each conv layer is followed by ReLU. The 19 layers include 3 fully connected layers, so there are 16 convolutional layers (with ReLU), 5 max-pool layers, and 3 fully connected layers. That means 16 kernels and 16 biases for the conv layers and 3 weight matrices and 3 biases for the FC layers. With 5 max pools (2x2), the spatial size before the FC layers = 224 / (2^5) = 224/32 = 7.

Also, Torch input = C x H x W = 3 x 224 x 224. nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding)
----------------------------------------------------------------
Layer (type)               Outpu
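A minimal sketch of loading the pre-trained model with exactly that normalization (torchvision as of this era; the preprocess name is mine):

    import torch
    from torchvision import models, transforms

    preprocess = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),                       # loads into [0, 1], shape C x H x W
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    vgg19 = models.vgg19(pretrained=True)
    vgg19.eval()                                     # inference mode for the pre-trained net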

Things Learned from implementing SRGAN in PyTorch

https://twitter.com/tim_dettmers/status/1059539322985054208?lang=en
PyTorch has .half(); it should work out of the box with that.

For faster input/output speed: https://github.com/xinntao/BasicSR/wiki/Faster-IO-speed
1. Put the data on an SSD.
2. Crop the images beforehand so you won't have to load full images during training - since you are going to train multiple times, for debugging and whatnot, this is a good approach (see the sketch below).
3. Convert to lmdb - it's faster.

In my case: like the author, I generated 480x480 sub-images with a sliding window of step = 240. Initially I had 800 training images (800 items, totalling 3.5 GB); after cropping I had 32,208 items, totalling 12.3 GB, each of size 480x480x3.

cv2 reads images in BGR format and the output is numpy unsigned 8-bit integer, so be careful: unsigned integer means that if you subtract 62 - 91 you get 227 (wrap-around).

    import numpy as np
    x = np.uint8(62)
    y = np.uint8(91)
    print(x - y)  # 227

To be safe, before doing any operation on a cv2 image, do img = img.astype(np.float64)
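A minimal sketch of that sub-image cropping step (paths and the helper name are mine, not from the BasicSR script):

    import cv2

    def crop_subimages(img, size=480, step=240):
        """Slide a size x size window with the given step and collect the crops."""
        h, w = img.shape[:2]
        crops = []
        for y in range(0, h - size + 1, step):
            for x in range(0, w - size + 1, step):
                crops.append(img[y:y + size, x:x + size])
        return crops

    img = cv2.imread("train/0001.png")               # BGR, dtype uint8
    for i, sub in enumerate(crop_subimages(img)):
        cv2.imwrite("train_sub/0001_s%03d.png" % i, sub)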

Pytorch: detach() and detach_()

In short, the commonly used pattern is: var.detach().float().cpu()

detach() doesn't affect the original graph; it returns a new tensor that shares the same data but has requires_grad=False. [There is much more as well.] detach_() is the in-place version of detach().

More on detach(): .detach() doesn't change the requires_grad property of the tensor it is applied to, but it does for the ones that follow. Think of detach as the breaking point between two graphs.

https://github.com/pytorch/examples/issues/116
http://www.bnikolic.co.uk/blog/pytorch-detach.html
https://github.com/szagoruyko/pytorchviz/blob/master/examples.ipynb
http://ruotianluo.github.io/2017/02/11/pytorch-attempt/
https://blog.csdn.net/u012436149/article/details/76714349

Detach: this method is described in the official documentation. It returns a new Variable, detached from the current graph. The returned Variable will never require a gradient. If the Variable being detached has volatile=True, the detached one is also volatile. There is also a caveat: the ret
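A minimal sketch of the "breaking point between two graphs" behavior (toy tensors, mine):

    import torch

    x = torch.ones(3, requires_grad=True)
    y = x * 2                    # part of x's graph
    w = y.detach()               # shares y's data but is cut from the graph
    print(w.requires_grad)       # False
    loss = y.sum() + w.sum()     # only the y branch carries gradient
    loss.backward()
    print(x.grad)                # tensor([2., 2., 2.]) - the detached branch contributed nothing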

Pytorch Basics 2 - Transfer Learning - Augmentation

https://www.youtube.com/watch?v=_H3aw6wkCv0

set_trace() -> to debug or see a value at any point.
numpy -> ndarray; pytorch -> tensor.
@ = matrix multiplication. x.T @ x (x transpose matmul x) in numpy; x.t() @ x in pytorch.
np.linalg.inv(x) in numpy; torch.inverse(x) in pytorch.
x.add(1) vs x.add_(1): _ means an in-place operation, e.g. x.t_() will change x.
torch to numpy: A.numpy(); numpy to torch: torch.from_numpy(x).

Difference between detach() and with torch.no_grad(): https://pytorch.org/blog/pytorch-0_4_0-migration-guide/
Always use detach() to get the tensor's data, because it is safer. Example: x = [1, 2, 3] and y = x.data; modify y in place to [4, 5, 6]; then x also becomes [4, 5, 6], and when we call loss.backward(), x's gradient is computed from its new value - harmful, because we changed it behind autograd's back. With x = [1, 2, 3] and y = x.detach(), modifying y in place to [4, 5, 6] also changes x, but when we call loss.backward() there is an error indicating the value has changed. So in short, do not change the variable on which det
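A minimal sketch of that safety check (toy tensors, mine):

    import torch

    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
    loss = (x * x).sum()         # autograd saves x for the backward pass
    y = x.detach()
    y[0] = 4.0                   # in-place edit also changes x (shared storage)
    try:
        loss.backward()
    except RuntimeError as e:
        print(e)                 # autograd's version counter caught the change
    # With y = x.data instead, backward() would silently use the modified values.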

PSNR

https://www.mathworks.com/help/vision/ref/psnr.html
PSNR = 10 * log10( R^2 / MSE )
MSE = Sum[ (I1 - I2)^2 ] / (M*N)
For images in range [0, 1], R = 1.
For images in range [0, 255], R = 255.
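The same formula as a small numpy helper (the function name is mine):

    import numpy as np

    def psnr(i1, i2, r=255.0):
        """PSNR = 10 * log10(R^2 / MSE), with MSE averaged over all M*N pixels."""
        mse = np.mean((i1.astype(np.float64) - i2.astype(np.float64)) ** 2)
        return 10 * np.log10(r ** 2 / mse)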