Add layers during training
Net2Net - Ian Goodfellow
Network Morphism - Microsoft
Deep Visual-Semantic Alignments for Generating Image Descriptions
Types of RNN
- Vanilla RNN / Simple RNN / Elman RNN
- Long Short Term Memory (LSTM)
- Helps improve gradient flow during backpropagation
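The LSTM's additive cell-state update is what helps gradients flow during backprop. A minimal NumPy sketch of one LSTM step (the weight layout, with all four gates stacked into one matrix `W`, is just one common convention, not a specific library's API):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4H, D+H), b: (4H,). Gate order: i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])        # input gate
    f = sigmoid(z[H:2*H])      # forget gate
    o = sigmoid(z[2*H:3*H])    # output gate
    g = np.tanh(z[3*H:4*H])    # candidate values
    c = f * c_prev + i * g     # additive update -> better gradient flow
    h = o * np.tanh(c)
    return h, c

# tiny usage with made-up sizes
D, H = 3, 2
rng = np.random.default_rng(0)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H),
                 0.1 * rng.standard_normal((4 * H, D + H)), np.zeros(4 * H))
```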
Computer Vision Tasks
- Classification [already done] - Very basic
- Localization
- Segmentation
- Detection
1. Semantic Segmentation - Grass, Cat, Tree, Sky in 1 image. [No object just pixels] - No box but all pixels are classified.
2. Classification + Localization - Class = Cat and a box around cat [Single Object]
3. Object detection - Box around objects - Dog, Dog, Cat
4. Instance Segmentation - All pixels are classified - Dog Cat Dog
3 and 4 - Multiple objects.
Semantic Segmentation
- For every pixel we want to say what it is - cat, grass, sky, etc
- Two cows will both be labeled "cow"; they are not differentiated as cow 1 and cow 2 - instance segmentation does that.
- Just labeling the pixels.
- Sliding Window approach
- Extract many patches from the image and classify each patch - like a classification problem. The centre pixel of each patch gets the label.
- Very expensive computationally.
- Nobody does this.
- Fully Convolutional Network
- Final layer will have channels / features equal to the number of classes we have.
- For medical images: often just 2 classes, background or cell, so each pixel's label is a one-hot vector of length 2.
- To train, average the loss over all pixels - cross-entropy loss on every pixel of the output, then sum or average over space, then over the minibatch.
- Still a lot of computation - keeping full-resolution feature maps at every layer is huge - not much used in practice.
So instead we downsample inside the network and then upsample back to full resolution - this is where deconvolution comes in.
- This also lets us make the network very deep, because there is less computation per layer.
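The per-pixel cross-entropy described above can be sketched in NumPy (function name and array layout `(N, C, H, W)` are my own choices, not from any particular framework):

```python
import numpy as np

def pixelwise_cross_entropy(logits, labels):
    """logits: (N, C, H, W) class scores; labels: (N, H, W) integer class ids.
    Softmax over C independently at every pixel, then average over
    space and over the minibatch."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    N, C, H, W = logits.shape
    # pick out the log-probability of the true class at every pixel
    n, h, w = np.meshgrid(np.arange(N), np.arange(H), np.arange(W),
                          indexing='ij')
    return -log_probs[n, labels, h, w].mean()
```

With all-zero logits over 2 classes, every pixel predicts a uniform distribution, so the loss is ln 2.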
- Upsampling Ideas
- Unpooling
Nearest Neighbor
[1 2      [1 1 2 2
 3 4] -->  1 1 2 2
           3 3 4 4
           3 3 4 4]
"Bed of Nails"
[1 2      [1 0 2 0
 3 4] -->  0 0 0 0
           3 0 4 0
           0 0 0 0]
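Both unpooling schemes above are a couple of lines of NumPy (a sketch; function names are mine):

```python
import numpy as np

def nn_unpool(x, s=2):
    """Nearest-neighbor unpooling: copy each value into an s x s block."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def bed_of_nails(x, s=2):
    """Bed of nails: value goes to the top-left of each s x s block, rest zeros."""
    out = np.zeros((x.shape[0] * s, x.shape[1] * s), dtype=x.dtype)
    out[::s, ::s] = x
    return out

x = np.array([[1, 2],
              [3, 4]])
```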
Max Unpooling
- Remember which position the max value came from during pooling; on upsampling, put each value back into that same position and keep the rest zeros.
- "Need to remember position"
- Transpose Convolution
Multiply each input pixel by the filter: for a 3x3 filter, 1 pixel in the input produces a 3x3 patch in the output. Patches are placed according to the stride, and we sum where they overlap.
Other names:
- Deconvolution (Bad)
- Upconvolution
- Fractionally strided convolution
- Backward strided convolution
A 3x3 stride-2 transpose convolution produces checkerboard artifacts; using 4x4 stride 2 or 2x2 stride 2 instead helps alleviate the problem a bit.
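The "scale the filter by each input pixel, place with stride, sum overlaps" description can be written directly (a sketch with no padding/cropping, unlike library implementations):

```python
import numpy as np

def conv_transpose2d(x, k, stride=2):
    """Transpose convolution: each input pixel scales a copy of the kernel;
    copies are placed `stride` apart and summed where they overlap."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((stride * (H - 1) + kh, stride * (W - 1) + kw))
    for i in range(H):
        for j in range(W):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out
```

With a 2x2 kernel and stride 2 the copies don't overlap, so an all-ones kernel reproduces nearest-neighbor unpooling; a 3x3 kernel with stride 2 overlaps unevenly, which is the source of the checkerboard artifacts.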
Classification + Localization:
- You know ahead of time that you are looking for 1 object only
- So we produce 1 bounding box
Prev -> FC Layers -> 1000 classes (one hot) (softmax loss)
Now -> add one more path, FC Layers -> 4 coordinates [x, y, w, h]
Treat Localization as a regression problem. (L2 Loss)
If there are multiple losses, one loss might dominate the other because of its magnitude, so a hyperparameter scales the losses differently.
Finding that hyperparameter is difficult
So,
Take some other performance metric you actually care about (not the raw loss value), and use cross-validation against that metric to find the proper loss-scaling hyperparameter, instead of comparing loss values directly.
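The combined loss is just a weighted sum; a sketch (function names and the `lam` hyperparameter name are mine):

```python
import numpy as np

def softmax_loss(scores, label):
    """Standard softmax cross-entropy for a single example."""
    z = scores - scores.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def multitask_loss(scores, label, box_pred, box_gt, lam=1.0):
    """Classification (softmax) + localization (L2 on [x, y, w, h]).
    `lam` scales the losses; tune it by cross-validating on a metric
    you care about (e.g. IoU / mAP), not on the loss value itself."""
    cls = softmax_loss(scores, label)
    loc = np.sum((box_pred - box_gt) ** 2)
    return cls + lam * loc
```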
Finding bounding box can also be used for other tasks like human pose estimation.
Basic Regression Loss - usually L2 or sometimes L1 or smooth L1
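Smooth L1 (a Huber-style loss) is quadratic near zero and linear for large errors, so outlier boxes don't dominate the gradient the way L2 does. A sketch:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise smooth L1: 0.5*x^2/beta for |x| < beta, else |x| - 0.5*beta."""
    a = np.abs(x)
    return np.where(a < beta, 0.5 * a**2 / beta, a - 0.5 * beta)
```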
Object Detection:
Fixed set of categories.
Now we don't know how many objects will be in the image, so the output size varies.
Say if there is 1 cat: 4 numbers (x, y, w, h)
2 cats: 8 numbers
Sliding window
Slide a window over the image and do classification + localization on each window.
The localization output is still 4 numbers per window.
Problem: what crop size?
So we need many, many crops - very expensive.
- Nobody does this
My thought - while sliding the window, detect a fixed maximum number of outputs: say at most 3 objects, so the output is 12 numbers, and classify whether each window is a cat, dog, or background; if background, ignore it.
Region Proposals
Find "blobby" image regions using traditional (non-learned) approaches to get candidate crops.
Example: Selective Search
R-CNN [2014]
Gives proposed regions - different crop sizes.
We still need fixed input sizes for CNNs, so we warp each crop.
But warping changes the aspect ratio - how much does that affect accuracy? - there are probably papers on this.
Still Expensive
Because every proposed region has to be run through the CNN separately, and there can be many of them.
Fast R-CNN
Instead of cropping the input image, we crop the corresponding region of the feature map produced by the CNN, so the convolutional computation is shared across proposals.
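The "crop on the feature map" step is RoI pooling: take the proposal's region of the shared feature map and max-pool it down to a fixed grid so the FC layers get a fixed-size input. A rough single-channel sketch (real Fast R-CNN also projects image coordinates to feature coordinates and handles rounding/batching):

```python
import numpy as np

def roi_pool(features, roi, out_size=2):
    """Crop roi = (x0, y0, x1, y1) (feature-map coords) from `features`
    and max-pool it to a fixed out_size x out_size grid."""
    x0, y0, x1, y1 = roi
    crop = features[y0:y1, x0:x1]
    H, W = crop.shape
    out = np.zeros((out_size, out_size))
    ys = np.linspace(0, H, out_size + 1).astype(int)   # bin edges
    xs = np.linspace(0, W, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = crop[ys[i]:ys[i+1], xs[j]:xs[j+1]].max()
    return out
```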
Mask R-CNN
I read this somewhere.
Does all of the above - best of all.
Read Mask R-CNN and ResNeXt from Kaiming He's group.