# Lecture Outline
- Multi-object detection (YOLO, Faster R-CNN)
- Region proposal networks and tracking
- Semantic and instance segmentation
- Self-supervised depth estimation
- Autonomous control and robotics
# Object Detection
# What is object detection?
Instead of predicting just one label, we want to predict two things: a box, which gives the location of the object, and the type of that object.
Two things:
- The position of the box (x, y, h, w)
- The object within the box (see the sketch below)
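A minimal sketch (PyTorch) of a head that predicts both outputs from a single feature vector; the feature size and number of classes are made-up values for illustration only.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Toy head predicting one box and one class label from a feature vector."""
    def __init__(self, feat_dim=512, num_classes=20):   # illustrative sizes
        super().__init__()
        self.box = nn.Linear(feat_dim, 4)                # (x, y, h, w)
        self.cls = nn.Linear(feat_dim, num_classes)      # object class logits

    def forward(self, features):
        return self.box(features), self.cls(features)

head = DetectionHead()
feats = torch.randn(1, 512)                  # features from some CNN backbone
box, cls_logits = head(feats)
print(box.shape, cls_logits.shape)           # torch.Size([1, 4]) torch.Size([1, 20])
```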
# A simple solution
Previously we couldn't change the number of outputs of our CNN, because it was always fixed by the architecture.
We pick a random box in this image, pass it through the CNN, and try to classify what is in that box. Then we take a different box and repeat the process. If the CNN identifies an object, we store the box. If it doesn't identify an object, we ignore it and move on to the next box.
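As a rough illustration, here is a minimal sketch of that brute-force loop; the `classifier`, box size, stride, and confidence threshold are all placeholder assumptions, not part of any specific method.

```python
import torch
import torch.nn.functional as F

def sliding_window_detect(image, classifier, box_size=64, stride=32, thresh=0.9):
    """Naive detection: crop many boxes, resize each one, classify it, keep the confident ones."""
    _, H, W = image.shape                                # image assumed to be (3, H, W)
    kept = []
    for y in range(0, H - box_size, stride):
        for x in range(0, W - box_size, stride):
            crop = image[:, y:y + box_size, x:x + box_size]
            crop = F.interpolate(crop.unsqueeze(0), size=(224, 224),
                                 mode="bilinear", align_corners=False)
            probs = classifier(crop).softmax(dim=-1)     # classifier: any CNN -> class logits
            score, label = probs.max(dim=-1)
            if score.item() > thresh:                    # the CNN found an object: store the box
                kept.append((x, y, box_size, box_size, label.item()))
    return kept
```

Note how the number of crops grows with every extra position, size, and scale we try, which is exactly the problem described next.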
Problem:
Explosion in the number of inputs: too many sizes, scales, and positions to try. And each time we see a brand new image, we need to repeat this whole process from scratch.
# R-CNN: Region-based Convolutional Neural Network
Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation
Instead of just randomly picking boxes, this model takes the input image and tries to extract what are called proposal regions, or region proposals.
The idea of step number two is to propose all of the potential boxes (around 2000) that might contain something interesting in the image. We shrink them all to the same size. All the regions (boxes) get fed through the CNN in parallel, and we put classifications on all 2000 boxes.
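A minimal sketch of that warp-and-classify step, assuming the ~2000 proposals already come from an external algorithm (e.g. selective search) and `cnn` is any image classifier; all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def classify_proposals(image, proposals, cnn, out_size=224):
    """R-CNN-style step: warp every proposed region to one fixed size, then classify them as one batch."""
    crops = []
    for (x, y, w, h) in proposals:                       # proposals: list of (x, y, w, h) boxes
        crop = image[:, y:y + h, x:x + w]
        # Warp every region to the same shape so they can share one forward pass.
        crops.append(F.interpolate(crop.unsqueeze(0), size=(out_size, out_size),
                                   mode="bilinear", align_corners=False))
    batch = torch.cat(crops, dim=0)                      # (num_proposals, 3, 224, 224)
    return cnn(batch)                                    # class logits for every proposal
```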
This is a much faster way:
- We're not going through all of the boxes.
- We're able to leverage GPU parallelization by warping all the regions to the same shape, and therefore we're able to feed them all simultaneously through the CNN.
This is a very bad solution in practice!
- Even though the testing process can be done in parallel, the training process still has to process each one of those 2000 boxes in sequence, one after another.
- The regions we proposed are of variable size, but we had to warp them all into the same shape. Thus we lose a lot of important information about the structure of the image.
# Faster R-CNN
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
Instead of having an external neural network, or an external algorithm, propose those regions, we now try to learn the regions directly within the same neural network. That way the whole process is end to end, and we don't have to deal with all of these 2000 different regions separately.
The way this works is that we take our big image and, instead of feeding small regions through a CNN, we feed the entire image through a few convolutional layers. The output is a set of feature maps over the entire image. This is important because we're not learning convolutional features over small parts or small boxes; we're learning feature maps over the whole image. By looking at these feature maps, the neural network can learn to highlight where the important parts are.

A region proposal network then takes these feature maps and extracts regions where the features are being activated. If the activations are inaccurate, the network gets that error signal back, propagates it, and adjusts the activations on the next round of training, so the region proposal network improves its proposals on subsequent iterations.

Each of the region proposals learned from these activations can then be fed through its own convolutional network that actually performs the classification for that region.
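Here is a minimal sketch of that structure: the whole image goes through a shared backbone once, and a small proposal head then predicts, for every location of the feature map, an objectness score and box offsets. The channel counts and the number of anchors per location are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyRPN(nn.Module):
    """Sketch of a region proposal head over shared feature maps."""
    def __init__(self, in_channels=256, num_anchors=9):     # assumed sizes
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.objectness = nn.Conv2d(256, num_anchors, kernel_size=1)      # fg/bg score per anchor
        self.box_deltas = nn.Conv2d(256, num_anchors * 4, kernel_size=1)  # (dx, dy, dw, dh) per anchor

    def forward(self, feature_map):
        h = torch.relu(self.conv(feature_map))
        return self.objectness(h), self.box_deltas(h)

backbone = nn.Sequential(nn.Conv2d(3, 256, 3, padding=1), nn.ReLU())  # stand-in backbone
feats = backbone(torch.randn(1, 3, 64, 64))      # the entire image is fed through once
scores, deltas = TinyRPN()(feats)
print(scores.shape, deltas.shape)                # (1, 9, 64, 64) and (1, 36, 64, 64)
```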
A few very important points about this algorithm that made it so strong and so state of art:
- It's extremely fast, because we only feed the entire image directly into this first convolutional feature extractor instead of feeding in the boxes one by one.
- The region proposal network is learning to output, i.e. predict, the proposals. This means we're not relying on some non-neural-network algorithm to generate these proposals. We're directly learning how to identify and localize these boxes as part of the feature extraction phase. This makes the whole object detection pipeline end to end: we're not learning each part separately; instead, we learn all of these features together. If one part is broken, the neural network has to adjust itself and learn; all of the parts contribute to fixing that problem.
- We actually continue extracting features over these proposals. Once we've identified and proposed these regions, the neural network is still building more features and depth along these regions as it goes deeper.
- Each proposal can be fed into its independent classifier to perform the detection.
The second step above is the main contribution of this type of algorithm.
# Region Proposal Network(RPN)
After feeding the input image into convolutional layers (or residual layers), we want to take the feature maps and learn proposal regions.
Each proposal region is represented by an anchor, which encodes two things:
- It's where the region is (where)
- It's kind of the class of the region (what)
The center area of each of these anchors is going to be maintained throughout the depth of the network. This is because this network is fully convolutional. We're never shrinking the dimensionality of our features. Our features are actually going to mirror the original input size.
The goal is to have this network identify a variable number of anchor boxes. An anchor box is defined by the center of the box (its center coordinate) plus the width and the height of that box.
First, we want to classify each box as either background or foreground. We're just trying to classify two classes: background or foreground. If a box is classified as foreground, it will be proposed: it gets passed forward into the classification layers to figure out what type of object it is.
Second, we regress the position, height, and width of the anchor boxes.
To do these two things, we need two loss functions. We combine objective number one with objective number two in the form of two different loss terms:

$L = L_{cls} + L_{reg}$

$L_{cls}$ is the loss with respect to the probability of this being an object, and $L_{reg}$ is the loss with respect to the coordinates of this object (which can be thought of as the target space of this object).
The first step is objective number one: classify as background or foreground. To do this, we can literally take the labels of the scene and check what fraction of a ground-truth object actually falls inside this region proposal. We can then use a binary softmax layer, or a binary sigmoid cross-entropy loss, to compare our predicted probability against the label from our dataset.
The second step focuses on the regression of the size and position of each box. Imagine the predicted position is shifted a little from the ground-truth position. We can take a mean squared error loss between the coordinates of the ground-truth box and the closest predicted box, and regress on the error between these two boxes. The predicted box that is closest to the ground truth will move towards it on the next iteration of training.
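A minimal sketch of combining the two objectives, assuming per-anchor objectness logits and box coordinates that have already been matched to ground truth (the matching step is omitted); following the lecture, the regression term uses mean squared error.

```python
import torch
import torch.nn.functional as F

def rpn_loss(obj_logits, obj_labels, box_preds, box_targets, lam=1.0):
    """Two-part proposal loss: binary foreground/background classification plus box regression."""
    # Objective 1: is each anchor background (0) or foreground (1)?
    cls_loss = F.binary_cross_entropy_with_logits(obj_logits, obj_labels)
    # Objective 2: regress coordinates only for anchors matched to an object.
    fg = obj_labels > 0.5
    reg_loss = F.mse_loss(box_preds[fg], box_targets[fg]) if fg.any() else box_preds.sum() * 0
    return cls_loss + lam * reg_loss

# Toy usage with made-up anchor counts.
logits = torch.randn(8)                                  # objectness score per anchor
labels = torch.tensor([1., 0., 1., 0., 0., 1., 0., 0.])  # ground-truth fg/bg per anchor
preds = torch.randn(8, 4)                                # predicted (x, y, w, h) per anchor
targets = torch.randn(8, 4)                              # matched ground-truth boxes
print(rpn_loss(logits, labels, preds, targets))
```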
# Semantic segmentation
Instead of classifying for a box, we actually classify every single pixel in the original image.
This essentially means we take a pixel, say the one in the top left, and we want to output another image where that same pixel in the output image corresponds to the label, i.e., the object class of that pixel.
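In code this just means the network outputs a class score for every pixel, and the loss is averaged over all pixels; the shapes and number of classes below are illustrative.

```python
import torch
import torch.nn.functional as F

num_classes = 5                                        # illustrative number of classes
logits = torch.randn(1, num_classes, 128, 128)         # (batch, classes, H, W) from some network
labels = torch.randint(0, num_classes, (1, 128, 128))  # ground-truth class of every pixel

loss = F.cross_entropy(logits, labels)                 # per-pixel classification loss, averaged
pred = logits.argmax(dim=1)                            # predicted label image, shape (1, 128, 128)
print(loss.item(), pred.shape)
```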
# Fully convolutional neural networks (FCNN)
Fully Convolutional Networks for Semantic Segmentation
For the first part of this model: convolutional layers, followed by max pooling layers, followed by nonlinearities, and repeat.
For the second part, we want to upscale, which is called unpooling; bilinear upsampling is one common way to do it.
Two different ways to do unpooling operation:
- Nearest neighbor unpooling: assume we want to double the sides of the image; we duplicate each pixel four times, filling its small quadrant.
- Bed of nails: drop certain pixels into certain positions of the larger image, and keep everything else at zero. The goal of the next convolutional layer is to learn how to interpolate and combine all of this information together (see the sketch below).
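A minimal sketch of both variants on a tiny 2x2 feature map:

```python
import torch
import torch.nn.functional as F

x = torch.arange(4.0).reshape(1, 1, 2, 2)     # tiny 2x2 feature map

# Nearest-neighbor unpooling: every pixel is copied into its 2x2 quadrant.
nearest = F.interpolate(x, scale_factor=2, mode="nearest")

# Bed of nails: drop each value into one corner of its quadrant, zeros elsewhere;
# the next convolutional layer learns how to fill in the rest.
nails = torch.zeros(1, 1, 4, 4)
nails[:, :, ::2, ::2] = x

print(nearest.squeeze())
print(nails.squeeze())
```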
# The importance of skip connections
In these types of networks, the middle part is very small, so a lot of information can be lost.
Loss function:
Binary cross entropy loss minimizes distance between ground truth and predicted probability distributions.
We do the same object classification for every pixel. But because the middle layer is very small, we're not able to achieve very accurate predictions at the boundaries.
Problem:
The encoder reduces the dimensionality of our input and makes it difficult for the decoder to capture low-level details.
To solve this, we also connect every layer to the same-sized layer in the later part of the network.
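A minimal sketch of one such skip connection, U-Net style: the upsampled decoder features are concatenated with the same-sized encoder features, so the fine spatial detail lost in the bottleneck is available again. All shapes and channel counts are illustrative.

```python
import torch
import torch.nn as nn

encoder_feats = torch.randn(1, 64, 56, 56)   # saved from the encoder (same spatial size as target)
decoder_feats = torch.randn(1, 64, 28, 28)   # coming up from the bottleneck

up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
merged = torch.cat([up(decoder_feats), encoder_feats], dim=1)   # (1, 128, 56, 56)
refined = nn.Conv2d(128, 64, kernel_size=3, padding=1)(merged)  # decoder conv mixes both sources
print(refined.shape)
```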
# Remember: Residual layers
Instead of trying to learn the entire function from x to y, we break it up and learn a lot of small changes to x to get to y. This makes training a lot smoother for the neural network and improves the quality of the solutions that are learned.
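A minimal residual block sketch: the convolutional layers only learn the change to apply to x, and the input is added back at the end (the channel count is an example value).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        delta = self.conv2(torch.relu(self.conv1(x)))   # the learned change to x
        return torch.relu(x + delta)                    # y = x + change

block = ResidualBlock()
print(block(torch.randn(1, 64, 32, 32)).shape)          # same shape as the input
```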
# U-Net
U-Net: Convolutional Networks for Biomedical Image Segmentation
# Depth estimation using U-Net
We as humans have two eyes, and we essentially compute the distance, or disparity, between the two images (the images seen by our two eyes). If a pixel in one eye is very close to the same pixel in the second eye, the point is very far away.
We can compute the depth for every pixel using this type of algorithm.
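For a calibrated stereo pair, the standard relation is depth = focal_length * baseline / disparity: large disparity means a close point, small disparity means a far one. A minimal sketch, with made-up camera constants:

```python
import torch

def depth_from_disparity(disparity, focal_length, baseline):
    """Classic stereo relation: depth = focal_length * baseline / disparity."""
    return focal_length * baseline / disparity.clamp(min=1e-6)   # clamp avoids division by zero

disparity = torch.rand(1, 1, 128, 128) * 50 + 1.0    # fake disparity map, in pixels
depth = depth_from_disparity(disparity, focal_length=720.0, baseline=0.54)
print(depth.min().item(), depth.max().item())
```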