Computer Vision Flashcards
What is representation learning of CNN
The network not only predicts the classification but also learns how the image is composed, i.e. all the interconnections in the image that make it an object
Self supervised learning, what is it?
Learning where the supervision signal comes from the data itself rather than from manual labels (e.g. predicting how patches of an image relate to each other, as in the later card); the learned representation can then be reused for other tasks
What is perception?
Ability to capture and process information from our senses
What is the perceptron
A simple model that defines a linear decision boundary and outputs 0 or 1 (sign(x·w)). It does not work if the observations are not linearly separable. You can see the perceptron as a single neuron
Learning algorithm
Is an algorithm that is able to learn from data. Ingredients: - Task - Performance Measures - Experience
Unsupervised Learning
Find patterns in data without giving the algorithm any direction (no expert labels). Mainly used for clustering
Reinforcement learning
An agent performs a certain task; to learn it there is a feedback loop that gives the agent a reward whenever it takes a correct action. Used a lot for games
The agent is in a particular state and performs a certain action; this action brings the agent to state t+1. The move is associated with a positive, negative or neutral reward
List all the type of learnings
- Active Learning
- Online Learning and Incremental Learning
- Weakly supervised Learning
- Self-supervised Learning
- Deep Learning
- Federated Learning
What is inductive bias
Inductive bias is the bias introduced by the hypothesis selection (meaning which assumptions I am using in the modeling phase)
There are two types of inductive bias:
- Restriction: limit the hypothesis space
- Preference: impose an ordering on the hypotheses (priorities)
Bias and Variance trade-off
When you are modeling you risk overfitting or underfitting, i.e. introducing a lot of bias or a lot of variance in your predictions.
Bias means that your assumptions are too strict and you are not able to fully explain the phenomenon.
Variance means that you are explaining not only the phenomenon but also the noise, such as measurement errors, which does not help when generalizing the results.
With high variance you see strong performance on the training set and bad performance on the test set. With high bias you see bad performance on both datasets
What is Algorithmic Bias
An algorithm that produces unfair outcomes has a bias, e.g. a system that achieves better results for one ethnic group than for another
Cost Function
Is your tool to measure performance and feed this information back into the model. Your goal is to minimize it.
Gradient Descent
Is an iterative algorithm that allows you to estimate the parameters.
You are basically descending the loss function towards lower values of it, using the gradient to guide the steps.
Update the weights in the direction of the negative gradient, multiplied by a learning rate: w ← w − η·∇L(w)
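A minimal sketch of the update rule (assuming a simple least-squares loss; the data and learning rate are made-up example values):

```python
import numpy as np

# Gradient descent: w <- w - lr * gradient of the loss.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0]])
y = np.array([3.0, 3.0, 7.0])
w = np.zeros(2)
lr = 0.01  # learning rate

for _ in range(1000):
    residual = X @ w - y                 # prediction error
    grad = 2 * X.T @ residual / len(y)   # gradient of the mean squared error
    w -= lr * grad                       # step towards the negative gradient
```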
Challenges connected to Vision
- Illumination
- Shadow
- Scale
- Perspective view
- Viewpoint
- Deformation
- Occlusion
- Clutter
- For classification: intra-class variation and inter-class variation
Image representation
Binary: black and white images can be represented as a matrix of 0s and 1s
Greyscale: values from 0 to 255
Color images: 3 channels with values between 0 and 255 (blue, green and red)
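A small sketch of these representations as NumPy arrays (the values are arbitrary examples):

```python
import numpy as np

binary = np.array([[0, 1], [1, 0]], dtype=np.uint8)        # 0/1 matrix
grey   = np.array([[0, 128], [200, 255]], dtype=np.uint8)  # one value 0..255 per pixel
color  = np.zeros((2, 2, 3), dtype=np.uint8)               # H x W x 3 channels (B, G, R)
color[..., 2] = 255                                        # set the red channel
```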
What is color constancy?
The tendency to perceive an object's color as roughly constant under changing illumination; as a consequence, the same color can look different depending on the colors that surround it
What is the issue due to sampling?
It is when the borders in an image are not well defined (jagged, aliased edges); to reduce this problem you can increase the resolution (dpi)
Quantized?
Images are mapped into a matrix with a discrete value for each pixel
Image histogram
A nice way of visualizing an image as a histogram: since every pixel is a value between 0 and 255 you can visualize their distribution. Can be done for greyscale and color images.
It is not really useful for image comparison
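A sketch of how a greyscale histogram can be computed (the random image is a placeholder for a real one):

```python
import numpy as np

img = np.random.randint(0, 256, size=(100, 100), dtype=np.uint8)  # placeholder image
hist, bin_edges = np.histogram(img, bins=256, range=(0, 256))     # one bin per grey level
# For a color image you would compute one histogram per channel.
```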
What does it mean that you can see the image as a function?
You can see an image as a matrix of values, with x the rows and y the columns, so that the function f(x, y) maps these two coordinates to a particular value between 0 and 255
What is a filter?
It is used to transform the pixels of an image: to extract information, or to transform the image in order to simplify it or add information. It is basically a function that defines how to process the pixels
Use case:
- Extract Info
- Detect patterns
- De-noising
Filters are typically Convolution filters
Smoothing filter
Smoothing filters are filters applied to an image to remove sharp features from it.
They need to have some properties:
- They need to have positive values
- They need to sum up to one
- The amount of smoothing depends on the kernel size
- They remove high frequencies (they remove sharp transitions between nearby regions, e.g. next to a region with a lot of black)
Moving average = a k×k filter whose weights sum to 1, applied to each region of the image. It replaces each pixel with the average of its neighbourhood. Its goal is to remove sharp features
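A sketch of a moving-average (box) filter, assuming SciPy is available; the image is a random placeholder:

```python
import numpy as np
from scipy.signal import convolve2d

img = np.random.rand(64, 64)      # placeholder image
k = 3
box = np.ones((k, k)) / (k * k)   # k x k kernel whose weights sum to 1
smoothed = convolve2d(img, box, mode="same", boundary="symm")
```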
Boundary issue
Every time you apply the filter, the output does not have the same shape as the input unless you use padding
Gaussian Filter
The Gaussian filter is a type of convolution filter where the pixels are weighted depending on their distance from the filter center.
Two parameters rule the filter:
- Size of the filter
- Scale (sigma) of the Gaussian distribution
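A minimal sketch of Gaussian smoothing with SciPy (sigma is an example value; the kernel size is derived from it internally):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.random.rand(64, 64)               # placeholder image
blurred = gaussian_filter(img, sigma=2.0)  # larger sigma -> more smoothing
```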
What is a convolution
Given an n-dimensional object, a convolution is the function that applies a certain kernel sequentially to each region of the input.
To call it a convolution you need to flip the kernel, vertically and then horizontally; otherwise you are looking at cross-correlation. It all depends on the causality of the signal, what depends on what.
It is widely used in signal processing
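A sketch showing the flip: convolution equals cross-correlation with the kernel flipped in both axes (the kernel is deliberately asymmetric so the two differ):

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

img = np.random.rand(5, 5)
kernel = np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0]])  # asymmetric on purpose

conv = convolve2d(img, kernel, mode="same")
corr = correlate2d(img, kernel, mode="same")
flipped_corr = correlate2d(img, kernel[::-1, ::-1], mode="same")
assert np.allclose(conv, flipped_corr)  # convolution == correlation with flipped kernel
```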
Propreties of convolution
- Commutative: f*g = g*f
- Associative: (f*g)*h = f*(g*h)
- Homogeneity: (k·f)*g = f*(k·g) = k·(f*g)
- Distributive: f*(g+h) = f*g + f*h
- Shift invariance: if you shift the input, the output is shifted in the same way
- Separability: you can separate a 2D filter into two 1D filters (an outer product of two 1D filters)
Sharpening filter
You can build it as a convolution combining two filters: the impulse (identity) filter plus the difference between the image and a smoothed version of it, so the detail removed by the smoothing is added back (unsharp masking)
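A sketch of this idea as unsharp masking (the amount 1.0 and the sigma are arbitrary example values):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.random.rand(64, 64)                # placeholder image
smoothed = gaussian_filter(img, sigma=2.0)
detail = img - smoothed                     # high-frequency content removed by smoothing
sharpened = img + 1.0 * detail              # add the detail back to accentuate edges
```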
How can you use filters?
Extract information
Detect patterns
De-noising
Definition of Edge
It is a sudden change in the image (a discontinuity). Shape information can be encoded in the edges.
It is important to detect edges because they allow us to recognize objects in an image and also to recover the geometry of objects
The edges in an image are created by:
- Different illumination
- Change in orientation of the surface
- Different depth
- Color
How can you spot edges
Since edges are discontinuities, a rapid change in the image, you can spot them by looking at the image derivative (since we can view an image as a function).
The edge is indeed in proximity to the extreme values of the derivative.
For images you implement derivatives by applying filters (since the image is not a continuous function)
Describe the forward, backward and central derivatives in 1D
Backward: looks at the change between positions i and i−1
Forward: between i and i+1
Central: between i−1 and i+1
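A tiny sketch of the three approximations on a 1D signal (the signal is a random placeholder):

```python
import numpy as np

f = np.random.rand(10)  # placeholder 1D signal
i = 5
backward = f[i] - f[i - 1]
forward  = f[i + 1] - f[i]
central  = (f[i + 1] - f[i - 1]) / 2
```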
What is the image gradient
It is the gradient computed on the image, composed of two components: the derivative moving from left to right (x) and the one moving from top to bottom (y).
The gradient vector points in the direction of the most rapid increase, and the strength (magnitude) of the gradient is the square root of the sum of the two derivatives squared.
You can use the gradient to do image editing and smooth out strong edges
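A sketch of the gradient components and magnitude using SciPy's Sobel operator (the image is a placeholder):

```python
import numpy as np
from scipy.ndimage import sobel

img = np.random.rand(64, 64)
gx = sobel(img, axis=1)             # left-to-right derivative
gy = sobel(img, axis=0)             # top-to-bottom derivative
magnitude = np.sqrt(gx**2 + gy**2)  # gradient strength
direction = np.arctan2(gy, gx)      # points towards the most rapid increase
```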
What is the intensity profile
Given a row or a column of the image you can look at the pixel intensities and their derivative. In this way you obtain a 1D signal
What is the effect of noise on the intensity profile
Adding noise to the image also adds noise to the signal: if every pixel is very different from its neighbours you might think you have a lot of edges while you actually have only a few.
The derivative of random noise is also random noise
How can you overcome the effect of noise in the intensity profile
You can use a smoothing filter to force nearby pixels to be similar. Of course this fails if too many pixels are noisy.
For instance you can use a Gaussian filter.
We can also directly convolve the image with the derivative of the filter to find where the peak of the derivative is (a property of derivatives: d/dx(f*g) = f*(dg/dx)). This saves one step
Derivative of Gaussian filters
The Gaussian filter too has an x-directional derivative filter and a y-directional derivative filter.
You can express the filter as the product of two 1D functions (Gaussians with mean 0).
The scale is the parameter of the Gaussian kernel that controls how much smoothing there will be (the larger the sigma, the more influence far-away pixels have).
The larger the sigma, the more blurred the edges will be and the more you focus on macro features.
Depending on the type of task we should choose a particular sigma.
Really important to remember: the weights of a derivative filter need to sum to 0, so that constant regions give a response of 0
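A sketch of derivative-of-Gaussian filtering with SciPy, where order=1 along an axis means "derivative of the Gaussian" in that direction and sigma is the scale:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

img = np.random.rand(64, 64)                           # placeholder image
dog_x = gaussian_filter(img, sigma=2.0, order=(0, 1))  # x-directional derivative of Gaussian
dog_y = gaussian_filter(img, sigma=2.0, order=(1, 0))  # y-directional derivative of Gaussian
```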
Sobel filter
The Sobel filter is a derivative filter with some smoothing added to it.
It comes from the multiplication of a 1D Gaussian-like smoothing filter by a 1D derivative filter.
The x-derivative version leads to vertical edge detection
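A small sketch of how the Sobel x-kernel arises as an outer product of a 1D smoothing filter and a 1D derivative (which also makes it separable):

```python
import numpy as np

smooth = np.array([[1], [2], [1]])  # 1D smoothing (rough Gaussian)
deriv  = np.array([[-1, 0, 1]])     # 1D central-difference derivative
sobel_x = smooth @ deriv
# [[-1  0  1]
#  [-2  0  2]
#  [-1  0  1]]  -> responds to vertical edges
```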
Tell me which ones are the derivative filters
- Sobel
- Scharr
- Prewitt
- Roberts
Definition of corner
It is a region where there is a significant change in all directions (otherwise it would be an edge).
Corners can be thought of as keypoints because they identify a certain geometry of the object
How can you calculate if there is a corner
You search for simultaneous change of the image in all directions: shift a small window by a certain (u, v) and measure the difference (sum of squared differences) between the original and shifted patch; a corner gives a large change for every (u, v)
Why do we need keypoint
Keypoints are reference points of the image that should have certain characteristics:
- Repeatability: can be found even after geometric transformations
- Saliency: distinct from each other
- Compactness and efficiency: few keypoints
- Locality: each one occupies a small area of the picture
Keypoints allow us to match images
Harris Corner
Harris corner detection is an algorithm that detects corners
- It is translation invariant
- Rotation invariant
- Not scale invariant
The problem is that it is not scale invariant; we need a scale-invariant algorithm because we would like to extract universal features of an image
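A sketch using OpenCV's Harris detector; the file name is hypothetical and blockSize, ksize and k are typical example values:

```python
import cv2
import numpy as np

img = cv2.imread("image.png")                               # hypothetical file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)
corners = response > 0.01 * response.max()                  # threshold the corner response
```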
Scale invariant selection
Given two images at very different scales, you need to find regions of similar content that appear in both pictures.
The problem is that it is difficult to automatically define the size of the area, because the right size depends on the image.
We need functions that are scale invariant per se, functions that give similar results independently of the image scale, and we apply such a function to regions of varying size.
A good function has one sharp peak in the region.
In order to do that we can use the second derivative of a Gaussian
Blob detection
Blob detection tries to detect keypoints by applying a second-order derivative of a Gaussian to an image at multiple scales. We look for the extrema.
These filters are invariant to scale and rotation
Blob Kernels
- Laplacian (of Gaussian)
- Difference of Gaussians (DoG)
Both are scale and rotation invariant
Harris Laplacian
It is a scale-invariant detector that uses the Harris corner detector to find keypoints in space and the Laplacian to account for scale differences
Sift
It is another scale-invariant detector that uses the DoG both in space and in scale. The difference of Gaussians is more efficient than the Laplacian because it does not need to calculate second-order derivatives.
You choose the keypoints as the extrema across scale and space, meaning that across all the different region scales and positions you choose the maxima
Rotational ambiguity
If you need to match two pictures you might end up with images that are rotated with respect to each other, which makes them difficult to match.
That is why you need invariant local features
Feature descriptor desired property
They need to be invariant to:
- Translation
- Rotation
- Scale
- Change in brightness
- Perspective
SIFT descriptors
You can overcome the rotation ambiguity by creating a histogram of local gradient directions and choosing the most prominent direction; all features in the patch are then described relative to that orientation.
Each SIFT descriptor is a 128-dimensional vector.
You can match SIFT descriptors to check keypoint similarity (e.g. with the Euclidean distance)
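A sketch of SIFT detection and matching with OpenCV (a recent OpenCV build is assumed; the file names and the 0.75 ratio are example values):

```python
import cv2

img1 = cv2.imread("img1.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file names
img2 = cv2.imread("img2.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)  # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)           # Euclidean distance between descriptors
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # ratio test
```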
What is visual recognition
It means identifying the content of the image. To do this you can use a data-driven approach (or a more rule-driven one), e.g. k-NN (O(1) at training time, O(n) at prediction time)
How do you represent images for image classification
- You can represent them as raw pixels
- Bag of words
Define bag of words
You do not use raw pixels anymore: you count the number of occurrences of visual primitives or visual words. It originated from text categorization.
The steps are:
- Feature detection (extract local features from the image), e.g. DoG or Harris-Laplacian, or you could randomly sample
- Create an image codebook, i.e. a set of visual primitives
- For each feature, check the distance to each visual word in the dictionary and build a histogram
- Use a learning algorithm to classify the image based on the histogram
So you are basically representing the image as frequencies of visual words
This method has been really useful for:
- Image classification
- Large image search
- Discover visual theme
How do you learn the visual vocabulary
You take the various local features and cluster them into feature clusters. The centroid of each cluster becomes a visual word (because it summarizes a certain feature).
You can use k-means:
- randomly assign the centroids
- find the closest features
- re-estimate the centroids
You can learn the codewords on a separate training set; indeed these codewords should be universal.
You do not want a vocabulary that is too small, but not too large either.
To make the vocabulary really efficient you can build vocabulary trees (the comparison is also faster)
What is a vector quantizer
It is a function that takes a feature and maps it to the closest codevector
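A sketch of codebook learning plus vector quantization with scikit-learn; the descriptor arrays and the vocabulary size are made-up placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

train_features = np.random.rand(1000, 128)  # placeholder local descriptors (training set)
image_features = np.random.rand(50, 128)    # placeholder descriptors of one image

kmeans = KMeans(n_clusters=100, n_init=10).fit(train_features)  # centroids = visual words
words = kmeans.predict(image_features)                          # quantize to nearest codevector
bow_hist, _ = np.histogram(words, bins=np.arange(101))          # bag-of-words histogram
```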
What is the TF-IDF normalization
The TF-IDF normalization is a way to give more weight to visual words that do not appear everywhere: the term frequency is multiplied by the inverse document frequency, idf = log(n_documents / n_documents_containing_the_word)
What is a problem of Bag of words and how is solved
With bag of words you lose the spatial information, the relations that exist between nearby visual features.
To solve it you can use the spatial pyramid; here are the steps:
- Take the image and compute the bag of words
- Divide the image into 4 and compute the bag of words on each sub-image
- Keep dividing until you reach a satisfactory level
It is really good for image representation
Why is Deep learning so good
Because it can automatically both learn the representation of the image and classify at the same time
Which layers do we have in a CNN
- Fully connected
- Convolutional Layer
- Pooling Layer
Basically it is a series of convolutional layers and their activation functions
What is the output dimension after a convolution step
(W − F + 2P)/S + 1, where W is the input volume size, F the filter size, P the padding and S the stride
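A one-line check of the formula with a worked example (32×32 input, 5×5 filter, padding 2, stride 1):

```python
def conv_output_size(W, F, P, S):
    # Spatial output size of a convolution layer.
    return (W - F + 2 * P) // S + 1

assert conv_output_size(32, 5, 2, 1) == 32  # padding 2 preserves the 32x32 size
```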
What is the pooling layer
The pooling layer does not need any learning: you simply pool several values into one (downsampling the image)
Max pool
Mean Pool
Describe different common architecture
AlexNet: pretty fast and memory efficient.
VGGNet: famous because it showed the effect of having a deeper network, but it is really expensive to evaluate.
GoogLeNet: has the Inception module, a very smart way of extracting local features, where several convolutions run in parallel and their outputs are concatenated. Even though it is deeper, it is very fast
What is Transfer learning
It is a really common practice where you do not train the entire model from scratch: you start from a pretrained model and train on top of it to improve generalization
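A sketch with torchvision (a recent version is assumed): freeze a pretrained backbone and train only a new head; the 10-class output is an arbitrary example:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained backbone
for param in model.parameters():
    param.requires_grad = False                    # keep the pretrained weights fixed
model.fc = nn.Linear(model.fc.in_features, 10)     # new head, trained from scratch
```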
What is dataaugmentation
It is a technique to increase the generalization of the model, where you make your predictions robust to certain kinds of variation (such as rotation, for instance)
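A sketch of a typical augmentation pipeline with torchvision; the specific transforms and parameters are example choices, not a fixed recipe:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# augmented = augment(pil_image)  # applied to a PIL image at training time
```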
Self supervised learning
You divide the image into patches and use their arrangement as the training signal, so the network learns the representation without manual labels
How do we process videos?
We cannot simply use a CNN; we have to use an RNN, where the network has an additional component, the state variable.
Examples are the:
- LSTM
- GRU
- Vanilla RNN
Predictive vision
Given a picture and an agent, we would like to predict the action of the agent.
This ability depends on two factors:
- Knowledge of the dynamics
- Understanding of the semantics of the scene
You need to use knowledge transfer to augment the data
What is image captioning
Given an image, you generate a textual description of it.
We do it by training an RNN on the output of a fully connected layer: this will be our initial state, then at each step the RNN outputs a probability over words and you continue generating with the RNN