
Popular CNN Architectures

[Source: https://adeshpande3.github.io/adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html]

output_size = ( I + P_left + P_right - D*(F-1) -1) / S + 1
I = Input Size, F = Filter Size, D = Dilation, S = Stride, P = Padding

Number of Filters = No. of Output Channels; each filter spans all input channels, so a layer has C_in x C_out 2-D kernels.
No. of Params = Filter Size (F x F) x No. of Input Channels x No. of Output Channels (+ one bias per filter)
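A quick sanity check of these two formulas in Python (a minimal sketch; the 227x227x3 input, 11x11 filters, and stride 4 below are AlexNet's first conv layer):

def conv_output_size(i, f, s=1, p_left=0, p_right=0, d=1):
    # output_size = (I + P_left + P_right - D*(F-1) - 1) / S + 1
    return (i + p_left + p_right - d * (f - 1) - 1) // s + 1

def conv_params(f, c_in, c_out, bias=True):
    # Each of the C_out filters spans all C_in input channels: F*F*C_in weights.
    return f * f * c_in * c_out + (c_out if bias else 0)

# AlexNet's first conv layer: 227x227x3 input, 96 filters of 11x11, stride 4, no padding
print(conv_output_size(227, 11, s=4))   # 55  -> output volume is 55x55x96
print(conv_params(11, 3, 96))           # 34944 (= 11*11*3*96 weights + 96 biases)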


AlexNet [2012] - The pioneering paper.
    - IP = 227 x 227 x 3
    - First use of ReLU
    - Used Norm Layers (not common anymore)
    - Heavy Data Augmentation - Flipping, Jittering, Cropping, Color Normalization, etc
    - Dropout = 0.5
    - Batch Size = 128
    - SGD Momentum = 0.9
    - Learning rate 1e-2 reduced by 10 manually when val accuracy plateaus
    - L2 weight decay 5e-4
    - 7-CNN ensemble: 18.2% -> 15.4% error, by training multiple models and averaging their predictions.
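
    A minimal PyTorch sketch of this training recipe (the numbers mirror the bullets above; the toy model and the patience value are placeholders, not from the paper):

import torch

model = torch.nn.Linear(10, 2)  # placeholder model; substitute the real AlexNet
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-2,            # initial learning rate
                            momentum=0.9,       # SGD momentum
                            weight_decay=5e-4)  # L2 weight decay
# Drop the LR by 10x when validation accuracy plateaus
# (the paper did this manually; ReduceLROnPlateau automates it)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=5)
# ...each epoch, after computing val_accuracy: scheduler.step(val_accuracy)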

ZF Net [2013] - A slight modification of AlexNet, but the paper explains how convolutional networks work very well; until then, convnet design was largely seen as trial and error.

    - ZF Net Improved hyperparameters over AlexNet

    “Visualizing and Understanding Convolutional Networks”

    ZF Net used a 7x7 filter with a smaller stride in the first layer (AlexNet used 11x11), because we want to retain more information in the first layer. The number of filters increases as we move deeper.

    Activation Function = ReLU
    Error Function = Cross-entropy
    Batch stochastic gradient descent.

    DeconvNet - maps features to pixels.

    Suppose we have multiple feature maps at a certain layer and want to see what one of them has learned from the original image. We take that one activation (output), set all other activations to zero, and pass it through the deconvnet - unpooling, rectifying, and filtering for each preceding layer - until the input (pixel) space is reached. [How does it handle unpooling and upsampling on an already trained network?]

    First layer of your ConvNet detects low level features like simple edges or colors.

    Second layer detects more circular features.

    The 3rd, 4th, and 5th layers detect higher-level features like dogs' faces or flowers.

    So this paper explains the working of CNNs.

VGG Net [2014]
    - Simple and deep [very deep, with small filters]
    - Up to 19 weighted layers - 3x3 filters, stride 1, padding 1; MaxPool 2x2, stride 2.

    Alex 8 layers, VGG 16 to 19 layers. [By layers we only count weighted layers, conv and fc]
    Alex 11x11, ZF 7x7, VGG 3x3
    Why use smaller Filters? (3x3 Conv)
        Stack of three 3x3 conv layers has same effective receptive field as one 7x7 conv filter.
        - First conv layer receptive field = 3x3
        - Second conv layer receptive field = 5x5
        - Third conv layer receptive field = 7x7
    VGG reasoned that two 3x3 conv layers have an effective receptive field of 5x5
        - So two small filters see as much as one larger filter + fewer params + more nonlinearity (two ReLUs instead of one)

    Params: 3 * (3^2 * C^2) for the three 3x3 layers, C = channels per layer
    Params: 7^2 * C^2 for the single 7x7 layer
    Dropping C: 27 weights vs 49 weights per input-output channel pair
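
    A tiny check of this arithmetic (C = 256 is just an illustrative channel count, not a specific VGG layer):

C = 256  # channels per layer (illustrative)

# Receptive field of stacked stride-1 3x3 convs grows by 2 per layer: 3 -> 5 -> 7
rf = 1
for _ in range(3):
    rf += 3 - 1
print(rf)                    # 7, matching a single 7x7 conv

print(3 * (3 ** 2) * C * C)  # 1769472 params for three 3x3 conv layers
print(7 ** 2 * C * C)        # 3211264 params for one 7x7 conv layer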

    Three 3x3 conv layers have an effective receptive field of 7x7
    As the filter size decreases, the depth (number of layers) increases
    Number of filters doubles after each maxpool. Shrinking spatial dimensions but growing depth.
    Scale Jittering - one data augmentation technique

    Activation Function = ReLU
    Error Function = Cross-entropy (multinomial logistic loss)
    Batch stochastic gradient descent.

    CNNs have to have a deep network of layers in order for this hierarchical representation of visual data to work.

    - "SIMPLE AND DEEP"

     Calculating the memory for VGG,
     We need to store 2 things, data and params.

     We have a total of ~24M activation values to store per image.
     Say we need 4 bytes to store each value, then
     Total memory = 24M * 4 bytes ~= 96 MB / image (forward pass only; ~2x that for the backward pass)

     With 5 GB of memory, we can hold only about 50 images' activations.

     Total params: 138M (vs 60M for AlexNet)

     The FC layers hold most of the params because they are densely connected; some architectures drop the FC layers to save params.
     GoogLeNet threw away the FC layers.
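
     A rough breakdown of why the FC layers dominate, using the standard VGG-16 layout (last conv volume 7x7x512, then FC-4096, FC-4096, FC-1000; biases ignored):

fc6 = 7 * 7 * 512 * 4096  # ~102.8M weights
fc7 = 4096 * 4096         # ~16.8M weights
fc8 = 4096 * 1000         # ~4.1M weights
print((fc6 + fc7 + fc8) / 1e6)  # ~123.6M of the ~138M total parameters live in the FC layers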

     VGG Net Details:
        - ILSVRC'14 : 2nd in Classification (first GoogLeNet), 1st in Localization
            - Localization -> Not just classifying image but also finding where that object is in the image
        - Similar training procedure as Alex
        - No Local Response Normalisation (LRN)
        - Use VGG 16 or VGG19 (VGG 19 is only slightly better, more memory)
        - Use ensembles for best results
        - FC7 features generalize well to other tasks.



GoogLeNet [2014] (At the same time as VGG. GoogLeNet better on ImageNet slightly, VGG better in some other tracks)
    - Says simple isn't the best - need to consider memory and power usage
    Naively stacking layers and adding a huge number of filters increases computational and memory cost, and increases the chance of overfitting.

                    |-----> 1x1 Conv ---------------------->|C| (network in network layer)
                    |                                       |O|
                    |-----> 1x1 Conv -----> 3x3 Conv ------>|N| (a medium sized filter convolution)
    Input Layer --->|                                       |C|--->
                    |-----> 1x1 Conv -----> 5x5 Conv ------>|A| (a large sized filter convolution)
                    |                                       |T|
                    |-----> 3x3 MaxP -----> 1x1 Conv ------>| | (a pooling operation)

    - Apply parallel filter operations on the input from previous layer:
        - Multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
        - Pooling operation (3x3)
    - Concatenate all filter outputs together DEPTH-WISE; the spatial dimensions are kept equal by zero padding (with stride 1), so only the depth changes.

    - Example: if we don't use 1x1 reductions, output channels = 128 (1x1), 192 (3x3), 96 (5x5), plus 256 (pool).
    So if ip = 28x28x256, op = 28x28x(128+192+96+256) = 28x28x672
    Conv Operations:
    [1x1 conv, 128] 28x28x128x1x1x256
    [3x3 conv, 192] 28x28x192x3x3x256
    [5x5 conv, 96] 28x28x96x5x5x256
    Total = 854M ops

    The pooling branch also preserves the full feature depth, so the total depth after concat can only grow at every layer.

    Now say we add 1x1 conv, 64-filter reductions before the 3x3 and 5x5 convs (and after the pool); the conv ops become:
    [1x1 conv, 64] 28x28x64x1x1x256
    [1x1 conv, 64] 28x28x64x1x1x256
    [1x1 conv, 128] 28x28x128x1x1x256
    [3x3 conv, 192] 28x28x192x3x3x64
    [5x5 conv, 96] 28x28x96x5x5x64
    [1x1 conv, 64] 28x28x64x1x1x256
    Total: 358M ops

    These 1x1 filters are also called bottleneck filters.

    Why the 1x1 conv? - To reduce the depth to decrease computation.
    Example 100x100x60 to 100x100x20

    Calculating the number of multiplications in a convolutional layer:
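
    A minimal helper for these op counts (multiplies per conv layer = H_out x W_out x C_out x F x F x C_in); it reproduces the ~854M naive total above:

def conv_ops(h, w, c_out, f, c_in):
    # multiplies for one conv layer: H x W x C_out x F x F x C_in
    return h * w * c_out * f * f * c_in

naive = (conv_ops(28, 28, 128, 1, 256)    # 1x1 conv, 128
         + conv_ops(28, 28, 192, 3, 256)  # 3x3 conv, 192
         + conv_ops(28, 28, 96, 5, 256))  # 5x5 conv, 96
print(round(naive / 1e6))  # ~854M ops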

    Inception Module:
        - At each layer of a traditional ConvNet, you have to make a choice of whether to have a pooling operation or a conv operation (there is also the choice of filter size).
        - Inception module allows both - in parallel and later concatenate them.
        - Performing convolution (different filter sizes) and also pooling at the same time.
            Drawbacks:
                - the output depth becomes too large after concatenation.
                - to alleviate this, 1x1 convs (each followed by a ReLU, which adds nonlinearity rather than hurting) are used to reduce depth.

        How does this architecture help?
            - network in network layer (1x1)
                Extract information about the very fine-grained details in the volume.
            - a medium sized filter convolution (3x3)

            - a large sized filter convolution (5x5)
                Cover a large receptive field of the input and extract its information

            - a pooling operation
                Reduce spatial sizes and combat overfitting

            - ReLUs after each conv layer adds nonlinearity

            - Perform all these while remaining computationally considerate

            - Sparsity and dense connections [https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf]

    How is concatenation done in the inception module - are the output sizes of all parallel operations the same?
        - "Same" padding is used for all convolution layers so that the output size equals the input size.
        - For pooling, a stride of 1 with "same" padding is used; the stride of 1 maintains the spatial size. (See the sketch after the links below.)
        [https://github.com/Natsu6767/Inception-Module-Tensorflow]
        [https://hacktilldawn.com/2016/09/25/inception-modules-explained-and-implemented]
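
    A minimal PyTorch sketch of such a module, using the 28x28x256 input and the filter counts from the bottleneck example above (these are the notes' example numbers, not GoogLeNet's actual per-module counts):

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, c_in=256):
        super().__init__()
        # 1x1 "network in network" branch
        self.branch1 = nn.Sequential(nn.Conv2d(c_in, 128, 1), nn.ReLU())
        # 1x1 reduction, then 3x3 conv; padding=1 keeps the 28x28 spatial size
        self.branch2 = nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU(),
                                     nn.Conv2d(64, 192, 3, padding=1), nn.ReLU())
        # 1x1 reduction, then 5x5 conv; padding=2 keeps the 28x28 spatial size
        self.branch3 = nn.Sequential(nn.Conv2d(c_in, 64, 1), nn.ReLU(),
                                     nn.Conv2d(64, 96, 5, padding=2), nn.ReLU())
        # 3x3 max pool with stride 1 and padding 1 keeps 28x28, then a 1x1 conv
        self.branch4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                     nn.Conv2d(c_in, 64, 1), nn.ReLU())

    def forward(self, x):
        # every branch outputs 28x28 maps, so they can be concatenated depth-wise
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 480, 28, 28]); depth = 128+192+96+64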


    Main Points
        - Used 9 Inception modules in the whole architecture, with over 100 layers in total! Now that is deep…
        - No use of fully connected layers! They use an average pool instead, to go from a 7x7x1024 volume to a 1x1x1024 volume. This saves a huge number of parameters.
        - Uses 12x fewer parameters than AlexNet.
        - During testing, multiple crops of the same image were created, fed into the network, and the softmax probabilities were averaged to give the final prediction.
        - Utilized concepts from R-CNN (a paper we’ll discuss later) for their detection model.
        - There are updated versions of the Inception module (Inception-v2, v3, and v4).
        - Trained on “a few high-end GPUs within a week”.

    Details
        - 22 Layers
        - Efficient "Inception" module
        - No FC Layers
        - Only 5 million parameters - 12x less than Alex
        - ILSVRC'14 Classification Winner


Microsoft ResNet (2015): 152 Layers

    What happens when we continue stacking deeper layers on a "plain" convolutional neural network?
    - Deeper is not always better: a plain 56-layer net did worse than a 20-layer net.
    Hypothesis - this is an "optimization" problem, deeper models are harder to optimize

    - Even doubling the layer count of earlier nets (VGG, GoogLeNet) still isn't as deep as ResNet
    - 152 Layers, setting records in classification, detection and localization.

              |----------------------------------->|
              |                                    v
    Input --->| ---> Conv ---> ReLU ---> Conv --->[+]---> ReLU ---> Output
      X                       F(.)                                    Y

    Input is X, Conv-ReLU-Conv is F(.), and Output Y = F(X) + X
    If there were no skip connection above,
        Y = F(X), a traditional layer; F(X) is the mapping we are finding/modeling/estimating
    With connection,
        Y = F(X) + X, so we are finding F(X) = Y - X, which is the difference between Y and X

    So in ResNet we are not learning the mapping directly, but the residual - the difference between the output and the input.

    - The authors believe that “it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping”. [WHY?]

    - Another reason for why this residual block might be effective is that during the backward pass of backpropagation, the gradient will flow easily through the graph because we have addition operations, which distributes the gradient.


    ResNet inside ResNet [https://arxiv.org/pdf/1608.02908.pdf]

    - For the residual connection
        - The input channels must equal the output channels; otherwise we cannot compute the residual / do the element-wise sum at the end.
        - To match dimensions, you can use a 1x1 conv on the shortcut, like the 1x1s in Inception (see the sketch below).
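
    A minimal PyTorch sketch of a basic residual block; the 1x1 projection is the dimension-matching trick mentioned above and is used only when the channel counts differ:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # F(x): Conv -> BN -> ReLU -> Conv -> BN (batch norm after every conv)
        self.f = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out))
        # identity shortcut when shapes match, otherwise a 1x1 conv projection
        self.shortcut = nn.Identity() if c_in == c_out else nn.Conv2d(c_in, c_out, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))  # Y = F(X) + X

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 64)(x).shape)   # torch.Size([1, 64, 56, 56])
print(ResidualBlock(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])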

    Training ResNet in practice:
        - Batch Normalization after every CONV Layer
        - Xavier/2 initialization from He et al.
        - SGD + Momentum (0.9)
        - LR = 0.1 divided by 10 when validation error plateaus
        - Mini-batch size 256
        - Weight decay of 1e-5
        - No dropout used.

Improving ResNets: [2016]
    Identity Mappings in Deep Residual Networks
    - Improved ResNet block design from creators of ResNet
    - Creates more direct path for propagating information throughout network (moves activation to residual mapping pathway)
    - Gives better performance

              |-------------------------------------->|
              |                                       v
    Input --->|--BN--ReLU-->Conv-->BN--ReLU-->Conv-->[+]--> Output
      X                       F(.)                            Y
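
    A short sketch of the same block in this pre-activation ordering (BN and ReLU moved ahead of each conv; the shortcut stays a pure identity):

import torch.nn as nn

class PreActResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        # BN -> ReLU -> Conv -> BN -> ReLU -> Conv, as in the diagram above
        self.f = nn.Sequential(
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1),
            nn.BatchNorm2d(c), nn.ReLU(), nn.Conv2d(c, c, 3, padding=1))

    def forward(self, x):
        return self.f(x) + x  # no ReLU after the addition: a clean identity path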


Wide Residual Networks (2016)
    - Argues that residuals are the important factor, not depth
    - Uses wider residual blocks (F x k filters instead of F filters in each ResNet Layer)
    - 50-layer wide ResNet outperforms 152-Layer original ResNet
    - Increasing width instead of depth more computationally efficient (parallelizable)

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
    - Increases width of residual block through multiple parallel pathways ("cardinality")
    - Parallel pathways similar in spirit to inception module

FractalNet [2017]: argues residual representations are not necessary; the key is transitioning effectively from shallow to deep (fractal branch structure).
DenseNet - Densely Connected Convolutional Networks [2017]: each layer connects to every later layer; alleviates the vanishing-gradient problem and encourages feature reuse.

See Inception-v4: ResNet + Inception!
Read: Spatial Transformer Networks


Automatic Data Augmentation using Deep Learning:
    "Googled automatic data augmentation deep learning"
    - https://infoscience.epfl.ch/record/218496/files/ICIP_CAMERAREADY_2715.pdf [Good]
    - Automatic Data Augmentation from Massive Web Images for Deep Visual Recognition: https://arxiv.org/pdf/1808.05130.pdf
    - AutoAugment: Learning Augmentation Policies from Data https://arxiv.org/pdf/1805.09501.pdf


Typical input image sizes to a Convolutional Neural Network trained on ImageNet are 224×224, 227×227, 256×256, and 299×299; however, you may see other dimensions as well.
VGG16, VGG19, and ResNet all accept 224×224 input images, while Inception V3 and Xception require 299×299 pixel inputs.

https://drive.google.com/open?id=1dWE655PDpC9bq3exVVRB_6mY-DsB9sw4
