
Detection and Segmentation

 Add layers during training

Net2Net - Ian Goodfellow et al.

Network Morphism - Microsoft


Deep Visual-Semantic Alignments for Generating Image Descriptions


Types of RNN

- Vanilla RNN / Simple RNN / Elman RNN

- Long Short Term Memory (LSTM)

- Helps improve gradient flow during backprop


Computer Vision Tasks

 - Classification [already done] - Very basic

 - Localization

 - Segmentation

 - Detection



1. Semantic Segmentation - Grass, Cat, Tree, Sky in one image. [No objects, just pixels] - No boxes, but every pixel is classified.

2. Classification + Localization - Class = Cat and a box around cat [Single Object]

3. Object detection  - Box around objects - Dog, Dog, Cat

4. Instance Segmentation - All pixels are classified, and different instances of the same class are separated - Dog, Cat, Dog


3 and 4 - Multiple objects.



Semantic Segmentation 

- For every pixel we want to say what it is - cat, grass, sky, etc

- Two cows will both be labeled cow; they are not differentiated as cow 1 and cow 2. - Instance segmentation does that.

- Just labeling the pixels.


- Sliding Window approach

- Extract many patches from the image and classify each patch - like a classification problem. The centre pixel of each patch gets the label.

- Very expensive computationally.

- Nobody does this.


- Fully Convolutional Network

- Final layer will have channels / features equal to the number of classes we have.

- For a medical image with 2 classes - either background or cell - each pixel label is a one-hot vector of length 2.

- Average loss over all pixels to train - cross-entropy loss on every pixel of the output, then sum or average over space, then over the minibatch.
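The per-pixel loss described above can be sketched in numpy (shapes and names here are illustrative, not from any particular framework):

```python
import numpy as np

# Hypothetical shapes: class scores from the final conv layer, (C, H, W),
# and integer pixel labels, (H, W).
C, H, W = 2, 4, 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(C, H, W))
labels = rng.integers(0, C, size=(H, W))

# Softmax over the class dimension, independently at every pixel.
e = np.exp(scores - scores.max(axis=0, keepdims=True))
probs = e / e.sum(axis=0, keepdims=True)

# Cross-entropy at every pixel, then averaged over space.
pixel_loss = -np.log(probs[labels, np.arange(H)[:, None], np.arange(W)[None, :]])
loss = pixel_loss.mean()
```

In training you would then also average `loss` over the minibatch, as the notes say.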

- Still a lot of computation - the number of feature maps kept at full resolution at every layer is huge - not much used in practice.


So instead we downsample inside the network and then upsample back to full resolution - this is where deconvolution (transpose convolution) comes in.

- We can also make the network very deep, because the downsampled layers need less computation


- Upsampling Ideas

- Unpooling

Nearest Neighbor

[1 2        [1 1 2 2
 3 4]  -->   1 1 2 2
             3 3 4 4
             3 3 4 4]

"Bed of Nails"

[1 2        [1 0 2 0
 3 4]  -->   0 0 0 0
             3 0 4 0
             0 0 0 0]
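Both schemes above can be written in one line each of numpy, using the 2x2 example from the notes:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# Nearest-neighbor unpooling: repeat each value along both dimensions.
nn = x.repeat(2, axis=0).repeat(2, axis=1)

# "Bed of nails": put each value in the top-left of its 2x2 block, zeros elsewhere.
nails = np.zeros((4, 4), dtype=x.dtype)
nails[::2, ::2] = x
```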


Max Unpooling

- Remembers which position held the max value during pooling; on upsampling the value is put back into that same position and the others are set to zero.

- "Need to remember position"
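A numpy sketch of the "remember the position" idea, pairing 2x2 max pooling with its matching unpooling (the input values are made up):

```python
import numpy as np

x = np.array([[1, 5, 2, 0],
              [3, 4, 1, 6],
              [7, 0, 2, 2],
              [1, 2, 8, 3]], dtype=float)

# 2x2 max pooling that also records where each max came from.
H, W = x.shape
pooled = np.zeros((H // 2, W // 2))
positions = {}  # (pooled row, pooled col) -> position of the max in x
for i in range(0, H, 2):
    for j in range(0, W, 2):
        window = x[i:i+2, j:j+2]
        r, c = np.unravel_index(window.argmax(), window.shape)
        pooled[i // 2, j // 2] = window[r, c]
        positions[(i // 2, j // 2)] = (i + r, j + c)

# Max unpooling: each value goes back into its remembered position,
# every other output pixel stays zero.
unpooled = np.zeros_like(x)
for (pi, pj), (r, c) in positions.items():
    unpooled[r, c] = pooled[pi, pj]
```

In a real network the unpooling layer sits later and restores values from a different tensor, but the remembered positions work exactly like this.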


- Transpose Convolution

Multiply the whole filter by each input pixel, so for a 3x3 filter, one input pixel produces a 3x3 patch in the output; the patches are placed at stride intervals and summed where they overlap.
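That "scale the filter, place it, sum the overlaps" description translates directly into a scatter-add loop; here is a minimal single-channel numpy sketch:

```python
import numpy as np

def transpose_conv2d(x, k, stride):
    """Transpose convolution: each input pixel scales the whole kernel,
    and the scaled copies are summed where they overlap in the output."""
    H, W = x.shape
    kH, kW = k.shape
    out = np.zeros(((H - 1) * stride + kH, (W - 1) * stride + kW))
    for i in range(H):
        for j in range(W):
            out[i*stride:i*stride+kH, j*stride:j*stride+kW] += x[i, j] * k
    return out

x = np.array([[1., 2.],
              [3., 4.]])
k = np.ones((3, 3))
y = transpose_conv2d(x, k, stride=2)  # 2x2 input -> 5x5 output
```

With a 3x3 kernel and stride 2, the center output pixel receives contributions from all four inputs (1 + 2 + 3 + 4 = 10), which is exactly the overlap summing described above.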


Other names:

- Deconvolution (Bad)

- Upconvolution

- Fractionally strided convolution

- Backward strided convolution


A 3x3 stride-2 transpose convolution produces checkerboard artifacts, because output pixels receive an uneven number of overlapping contributions; using 4x4 stride 2 or 2x2 stride 2 instead (kernel size divisible by stride) helps alleviate the problem a bit.
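The uneven-overlap claim is easy to check by counting, in 1-D, how many kernel copies land on each output position (a toy sketch, ignoring border pixels):

```python
import numpy as np

def coverage(kH, stride, n=6):
    # For n input positions, count how many kernel copies of size kH,
    # placed at stride intervals, cover each output position (1-D view).
    out = np.zeros((n - 1) * stride + kH)
    for i in range(n):
        out[i * stride : i * stride + kH] += 1
    return out[kH:-kH]  # drop the borders, keep the interior pattern

interior_3 = coverage(3, 2)  # 3x3 kernel, stride 2: alternates 1, 2, 1, 2, ...
interior_4 = coverage(4, 2)  # 4x4 kernel, stride 2: uniform, every position gets 2
```

The alternating counts for kernel 3 / stride 2 are exactly the checkerboard; when the kernel size divides evenly by the stride, the interior coverage is uniform.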



Classification + Localization:

- You know ahead of time that you are looking for 1 object only

- So we produce 1 bounding box

Prev -> FC Layers -> 1000 classes  (one hot) (softmax loss)

Now -> add one more path, FC Layers -> 4 coordinates [x, y, w, h]

Treat Localization as a regression problem. (L2 Loss)

If there are multiple losses, one loss might dominate the total because of its magnitude, so there is a hyperparameter to scale the losses relative to each other.


Finding that hyperparameter is difficult

So,


Take some other performance metric that you care about, beyond the raw loss value, and use that metric in cross-validation to choose the loss-scaling hyperparameter, instead of comparing loss values directly (the loss values aren't comparable across different scalings).
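The combined loss from the two heads is just a weighted sum; a minimal sketch (the name `lambda_box` is mine, not standard notation from the notes):

```python
def total_loss(cls_loss, box_loss, lambda_box=1.0):
    """Multi-task loss: softmax classification loss plus a scaled
    L2 box-regression loss. lambda_box is the hyperparameter the
    notes describe - it rebalances terms of different magnitude."""
    return cls_loss + lambda_box * box_loss
```

For example, if the raw box loss is ~100x the classification loss, a `lambda_box` around 0.01 brings the two terms back to the same scale.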


The same fixed-output regression idea can be used for other tasks, like human pose estimation - regress a fixed set of joint coordinates instead of box coordinates.


Basic Regression Loss - usually L2 or sometimes L1 or smooth L1
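Smooth L1 (as used in box regression) is quadratic near zero like L2, and linear for large errors like L1, so outliers don't blow up the gradient; a numpy sketch:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss, elementwise: 0.5*d^2/beta for |d| < beta,
    |d| - 0.5*beta otherwise. Quadratic near zero, linear in the tails."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d**2 / beta, d - 0.5 * beta)

vals = smooth_l1(np.array([0.5, 3.0]), np.zeros(2))  # [0.125, 2.5]
```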


Object Detection:

Fixed set of categories.

Now we don't know how many objects will be in the image, so the output size varies

Say if there is 1 cat: 4 numbers (x, y, w, h)

2 cats: 8 numbers


Sliding window

Slide window and do classification and localization for that window.

Output of localization still being 4.


Problems: choice of crop size and position

so we need many, many crops.

- Nobody does this


My thought - with a localizing sliding window, output a fixed number of boxes: say assume the max number of objects is 3, so output = 12 numbers, and check whether each window is a cat, dog, or background; if background, ignore it.


Region Proposals

Blobby image regions found with traditional (non-learned) methods, used as candidate crops

Example: Selective Search


R-CNN [2014]

Gives proposed regions - different crop sizes.

We still need a fixed input size for the CNN, so we warp each region.

But warping changes the aspect ratio - how much does that affect accuracy? - maybe some papers study this.


Still Expensive

Because if many regions are proposed, we need to run the CNN on every one of them.


Fast R-CNN

Instead of cropping the input image, we crop the corresponding regions of the CNN feature map, so the convolutional features are computed once and shared across all proposals.
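The feature-map cropping (RoI pooling in the Fast R-CNN paper) has to produce a fixed-size output from a variable-size region; a rough numpy sketch (real implementations snap the bins to the feature grid more carefully, and the RoI must be at least `out_size` in each dimension here):

```python
import numpy as np

def roi_pool(feat, roi, out_size=2):
    """Crop a region of a 2-D feature map and max-pool it into a
    fixed out_size x out_size grid, whatever the region's size."""
    r0, c0, r1, c1 = roi
    crop = feat[r0:r1, c0:c1]
    H, W = crop.shape
    out = np.zeros((out_size, out_size))
    rb = np.linspace(0, H, out_size + 1).astype(int)  # row bin edges
    cb = np.linspace(0, W, out_size + 1).astype(int)  # col bin edges
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = crop[rb[i]:rb[i+1], cb[j]:cb[j+1]].max()
    return out

feat = np.arange(36.0).reshape(6, 6)
roi_feat = roi_pool(feat, (0, 0, 4, 4))  # 2x2 output regardless of RoI size
```

The fixed-size output is what lets the downstream FC layers run on every proposal.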



Mask R-CNN

I read this somewhere.

Does everything - classification, boxes, and masks - best of all.


Read Mask R-CNN and ResNeXt (or something like that) from Kaiming He