Computer Vision Flashcards

Question

What is a convolution?

Answer 1

Given an n-dimensional input, a convolution is the operation that applies a kernel sequentially to each region of the input. For the operation to be a convolution, the kernel must first be flipped, vertically and then horizontally; otherwise you are computing a cross-correlation. Which one you want depends on the causality of the signal (what depends on what). It is used for signal processing.
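A minimal NumPy sketch of the difference between the two operations. The function names and the tiny 4x4 image are made up for illustration, and only the "valid" region (no padding) is computed.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Slide the kernel over the image WITHOUT flipping it (valid region only)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d(image, kernel):
    """Convolution = cross-correlation with the kernel flipped on both axes."""
    return cross_correlate2d(image, kernel[::-1, ::-1])

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0], [3.0, 4.0]])  # asymmetric on purpose

conv = convolve2d(image, kernel)
corr = cross_correlate2d(image, kernel)
print(conv[0, 0], corr[0, 0])  # the two results differ for an asymmetric kernel
```

For a kernel that is symmetric under the flip (a Gaussian, for instance) the two operations give the same output.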

Answer 2

- Commutative: f*g = g*f
- Associative: (f*g)*h = f*(g*h)
- Homogeneity: (k f)*g = f*(k g) = k (f*g)
- Distributive: f*(g+h) = f*g + f*h
- Shift invariant: if you shift the image, the output is the same result shifted by the same amount (the inputs are the same, just shifted)
- Separability: some 2D filters can be separated into two 1D filters (the 2D filter is the matrix product of a column filter and a row filter)
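The separability property can be checked numerically. This is a sketch under assumptions: the two 1D filters and the random test image are arbitrary choices, and the helper names are hypothetical.

```python
import numpy as np

def convolve_rows(image, k):
    """Convolve every row with a 1D kernel (valid region)."""
    return np.array([np.convolve(row, k, mode="valid") for row in image])

def convolve_cols(image, k):
    """Convolve every column with a 1D kernel (valid region)."""
    return convolve_rows(image.T, k).T

def convolve2d_valid(image, kernel):
    """Direct 2D convolution (kernel flipped), valid region only."""
    flipped = kernel[::-1, ::-1]
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

col = np.array([1.0, 2.0, 1.0])    # 1D filter applied along columns
row = np.array([1.0, 0.0, -1.0])   # 1D filter applied along rows
kernel2d = np.outer(col, row)      # separable 3x3 kernel (column times row)

rng = np.random.default_rng(0)
image = rng.random((6, 7))

one_pass = convolve2d_valid(image, kernel2d)
two_pass = convolve_cols(convolve_rows(image, row), col)
print(np.allclose(one_pass, two_pass))
```

The two-pass version needs 3 + 3 multiplications per output pixel instead of 9, which is why separable filters are cheaper.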

Answer 3

You can use convolution where you apply two different filters (explain)

Answer 4

- Extract information
- Detect patterns
- De-noising

Answer 5

It is a sudden change in the image (a discontinuity). Shape information can be encoded in the edges. Defining edges is important because they let us recognize objects in an image and recover the geometry of objects. The edges in an image are created by:
- Different illumination
- Change in the orientation of the surface
- Different depth
- Color

Answer 6

Since edges are discontinuities, i.e. rapid changes in the image, you can spot them by looking at the image derivative (an image can be viewed as a function). An edge lies in the proximity of an extreme value of the derivative. For images, derivatives are implemented by applying filters, since an image is a discrete function rather than a continuous one.

Answer 7

- Backward: looks at the change between positions i and i-1
- Forward: looks at the change between positions i and i+1
- Central: looks at the change between positions i-1 and i+1
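The three finite differences can be sketched in a few lines. The division by 2 in the central difference is the usual normalization and is an assumption here (some filters leave it out); the sample signal is made up.

```python
def backward_diff(f, i):
    """Change between positions i and i-1."""
    return f[i] - f[i - 1]

def forward_diff(f, i):
    """Change between positions i and i+1."""
    return f[i + 1] - f[i]

def central_diff(f, i):
    """Change between positions i-1 and i+1, divided by the 2-pixel distance."""
    return (f[i + 1] - f[i - 1]) / 2

# Samples of f(x) = x^2, whose true derivative at x = 2 is 4
f = [0, 1, 4, 9, 16]
print(backward_diff(f, 2), forward_diff(f, 2), central_diff(f, 2))
```

On this example the backward and forward differences under- and over-estimate the slope, while the central difference lands on the true value.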

Answer 8

It is the computation of the gradient of the image. It is composed of two quantities: the derivative moving from left to right and the derivative moving from top to bottom. The gradient vector points in the direction of the most rapid increase in intensity, and the strength (magnitude) of the gradient is the square root of the sum of the two squared derivatives. You can use the gradient for image editing, for example to smooth out strong edges.
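A small sketch of the gradient magnitude and direction, assuming central differences on the interior pixels. The function name and the synthetic ramp image are illustrative only.

```python
import numpy as np

def image_gradient(image):
    """Central-difference gradient on the interior pixels of a 2D image."""
    gx = (image[1:-1, 2:] - image[1:-1, :-2]) / 2.0   # left-to-right change
    gy = (image[2:, 1:-1] - image[:-2, 1:-1]) / 2.0   # top-to-bottom change
    magnitude = np.sqrt(gx**2 + gy**2)                # gradient strength
    direction = np.arctan2(gy, gx)                    # direction of fastest increase
    return gx, gy, magnitude, direction

# Intensity ramp: +3 per pixel to the right, +4 per pixel downwards
ys, xs = np.mgrid[0:5, 0:5]
image = 3.0 * xs + 4.0 * ys

gx, gy, magnitude, direction = image_gradient(image)
print(magnitude[0, 0])
```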

Answer 9

Given a row or a column of the image, you can plot the pixel intensity as well as its derivative. In this way you obtain a 1D signal.

Answer 10

Adding noise to the image also adds noise to the signal: if every pixel is very different from its neighbors, you might think there are a lot of edges when there are actually only a few. The derivative of random noise is also random noise.

Answer 11

You can use a smoothing filter to force nearby pixels to be similar, for instance a Gaussian filter. Of course this fails if neighboring pixels really are very different (the image actually has a lot of fine detail). We can also convolve the image directly with the derivative of the filter and look for the peak of the result, because differentiating the smoothed image is the same as convolving with the differentiated filter (a property of derivatives and convolution). This saves one step.
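A 1D sketch of the "save one step" trick. Assumptions: a noisy step edge between samples 29 and 30, a Gaussian with sigma 1.5, and a central-difference kernel as the derivative; all names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Sampled 1D Gaussian, normalized so the weights sum to 1."""
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()

radius = 5
g = gaussian_kernel(sigma=1.5, radius=radius)
deriv = np.array([1.0, 0.0, -1.0]) / 2.0   # central difference as a convolution kernel
gauss_deriv = np.convolve(g, deriv)        # derivative-of-Gaussian kernel, built once

rng = np.random.default_rng(0)
signal = np.concatenate([np.zeros(30), np.ones(30)])
signal = signal + 0.05 * rng.standard_normal(signal.size)

two_steps = np.convolve(np.convolve(signal, g), deriv)  # smooth, then differentiate
one_step = np.convolve(signal, gauss_deriv)             # single convolution

# Full convolutions shift indices by the kernel centers (radius + 1 in total)
edge = np.argmax(one_step) - (radius + 1)
print(np.allclose(two_steps, one_step), edge)
```

The equality follows from associativity: (signal * g) * deriv = signal * (g * deriv).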

Answer 12

The Gaussian derivative filter also comes in an x-directional and a y-directional version. The 2D Gaussian can be expressed as the product of two 1D functions (Gaussians with mean 0). The scale is the parameter (sigma) of the Gaussian kernel that controls how much smoothing there is: the larger the sigma, the more influence far-away pixels have. The larger the sigma, the more blurred the edges are and the more you focus on macro features, so sigma should be chosen depending on the type of task. It is really important to remember that the weights of a derivative filter must sum to 0, so that constant regions give a response of 0 (the weights of a pure smoothing filter sum to 1 instead).
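A sketch of these kernels built as products of 1D functions. Sigma values and radii are arbitrary choices, and the helper names are hypothetical.

```python
import numpy as np

def gaussian_1d(sigma, radius):
    """Return sample positions and a zero-mean 1D Gaussian that sums to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return x, g / g.sum()

def gaussian_derivative_1d(sigma, radius):
    """Sampled analytic derivative of the 1D Gaussian above."""
    x, g = gaussian_1d(sigma, radius)
    return -x / sigma**2 * g

x, g = gaussian_1d(sigma=2.0, radius=6)
dg = gaussian_derivative_1d(sigma=2.0, radius=6)

smooth_2d = np.outer(g, g)     # 2D Gaussian = product of two 1D Gaussians
ddx_2d = np.outer(g, dg)       # x-directional: smooth along y, differentiate along x
ddy_2d = np.outer(dg, g)       # y-directional

constant_patch = np.full(ddx_2d.shape, 7.0)
print(smooth_2d.sum(), ddx_2d.sum(), np.sum(constant_patch * ddx_2d))

# A larger sigma spreads the weight over more distant pixels (lower, wider peak)
_, g_wide = gaussian_1d(sigma=4.0, radius=12)
```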

Answer 13

The Sobel filter is a derivative filter with some smoothing built in. It comes from multiplying a 1D Gaussian smoothing filter (as a column) by a 1D derivative filter (as a row). This version, which differentiates along x, detects vertical edges.
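The construction can be sketched as an outer product. Assumptions: [1, 2, 1] as the small Gaussian approximation, [-1, 0, 1] as the derivative, and the filter applied by cross-correlation (sign conventions vary between libraries). The test image is synthetic.

```python
import numpy as np

smooth = np.array([1.0, 2.0, 1.0])    # 1D smoothing (binomial approximation of a Gaussian)
deriv = np.array([-1.0, 0.0, 1.0])    # 1D derivative

sobel_x = np.outer(smooth, deriv)     # differentiates along x: responds to vertical edges
sobel_y = np.outer(deriv, smooth)     # differentiates along y: responds to horizontal edges

def correlate_valid(image, kernel):
    """Apply the kernel at every position where it fits entirely inside the image."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Vertical edge: dark left half, bright right half
image = np.zeros((5, 6))
image[:, 3:] = 1.0

resp_x = correlate_valid(image, sobel_x)
resp_y = correlate_valid(image, sobel_y)
print(resp_x[0], resp_y[0])
```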

Answer 14

- Sobel
- Scharr
- Prewitt
- Roberts

Answer 15

It is a region where the image changes significantly in all directions; if it changed in only one direction it would be an edge. Corners can be thought of as keypoints because they identify a certain geometry of the object.

Answer 16

You can search for a simultaneous change of the image in every direction when you shift a window by a certain (u, v).

Answer 17

Keypoints are reference points of the image that should have certain characteristics:
- Repeatability = can be found even after a geometric transformation
- Salience = distinct from each other
- Compactness and efficiency = few keypoints
- Locality = each occupies a small area of the picture
Keypoints allow us to match images.

Answer 18

Harris corner detection is an algorithm that detects corners. It is:
- Translation invariant
- Rotation invariant
- Not scale invariant
The lack of scale invariance is the problem: we need a scale-invariant algorithm because we would like to extract universal features of an image.
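A rough sketch of the Harris response R = det(M) - k * trace(M)^2 on a synthetic corner. Assumptions that are not from the notes: a 3x3 box window instead of the usual Gaussian window, k = 0.05, and `np.gradient` for the derivatives. The helper names are hypothetical.

```python
import numpy as np

def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window, zero-padded at the border."""
    p = np.pad(a, r)
    out = np.zeros_like(a)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + a.shape[0], dx:dx + a.shape[1]]
    return out

def harris_response(image, k=0.05, r=1):
    """Corner score per pixel: positive at corners, negative on edges, ~0 in flat areas."""
    gy, gx = np.gradient(image.astype(float))   # derivatives along rows (y) and columns (x)
    sxx = box_sum(gx * gx, r)
    syy = box_sum(gy * gy, r)
    sxy = box_sum(gx * gy, r)
    det = sxx * syy - sxy**2
    trace = sxx + syy
    return det - k * trace**2

# Bright quadrant whose corner sits at row 4, column 4
image = np.zeros((12, 12))
image[4:, 4:] = 1.0

R = harris_response(image)
corner = np.unravel_index(np.argmax(R), R.shape)
print(corner, R[4, 4], R[8, 4], R[1, 1])
```

Scaling the image up would turn the corner into something that looks like a gentle edge inside a fixed 3x3 window, which is the scale problem described above.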

Answer 19

Given two images at very different scales, you need to find regions of corresponding size that appear in both pictures. The problem is that it is difficult to define the size of the region automatically, because it depends on the image. We need functions that are scale invariant per se, i.e. functions that give similar results independently of the image scale when applied to regions of corresponding size. A good function has one sharp peak over the region size. For this we can use the second derivative of a Gaussian.

Answer 20

Blob detection tries to detect keypoints by applying the second-order derivative of a Gaussian to an image at multiple scales and looking for the extrema. These filters are invariant to scale and rotation.
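A sketch of scale selection with the Laplacian of Gaussian. Assumptions: a synthetic Gaussian blob of standard deviation 3, a hand-picked list of candidate sigmas, and the response multiplied by sigma^2 (the usual scale normalization) so that scales can be compared.

```python
import numpy as np

def log_kernel(sigma, radius):
    """Sampled Laplacian of a 2D Gaussian with standard deviation sigma."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    r2 = x**2 + y**2
    g = np.exp(-r2 / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return (r2 / sigma**4 - 2 / sigma**2) * g

radius = 15
y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
blob = np.exp(-(x**2 + y**2) / (2 * 3.0**2))   # bright blob centered in the patch

sigmas = [1.0, 2.0, 3.0, 4.0, 5.0]
# Scale-normalized filter response at the blob center, one value per scale
responses = [s**2 * abs(np.sum(blob * log_kernel(s, radius))) for s in sigmas]
best_sigma = sigmas[int(np.argmax(responses))]
print(best_sigma)
```

The response has a single peak over scale, and the sigma at the peak tells you the size of the blob.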

Answer 21

- Laplacian
- Difference of Gaussians
Both are scale and rotation invariant.
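A minimal sketch of a difference-of-Gaussians kernel. The scale ratio k = 1.6 and the kernel radius are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_2d(sigma, radius):
    """Sampled 2D Gaussian normalized to sum to 1."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1].astype(float)
    g = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return g / g.sum()

def dog_kernel(sigma, k=1.6, radius=10):
    """Wide Gaussian minus narrow Gaussian: only two blurs, no second derivatives."""
    return gaussian_2d(k * sigma, radius) - gaussian_2d(sigma, radius)

kernel = dog_kernel(2.0)
print(kernel.sum(), kernel[10, 10])
```

Like the Laplacian, the kernel sums to zero (no response on constant regions) and depends only on the distance from the center.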

Answer 22

It is a scale-invariant detector that uses the Harris corner measure to detect keypoints in space and the Laplacian to account for scale differences.

Answer 23

It is another scale-invariant detector, which uses the DoG in both space and scale. The difference of Gaussians is more efficient than the Laplacian because it does not need to calculate second-order derivatives. You choose the keypoints as the extrema across scale and space, meaning that across all the different scales and positions you keep the points where the response is extremal.

Answer 24

If you need to match two pictures, one of them may be rotated, which makes them difficult to match. That is why you need invariant local features.

Answer 25

They need to be invariant to:
- Translation
- Rotation
- Scale
- Change in brightness
- Perspective

Answer 26

You can overcome the rotation ambiguity by building a histogram of local gradient directions and choosing the most prominent direction. All features in the patch are then described relative to that orientation. Each SIFT descriptor is a 128-dimensional vector. You can match SIFT descriptors to check keypoint similarity (for example with the Euclidean distance).
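A sketch of nearest-neighbor matching with the Euclidean distance. The descriptors here are random 128-dimensional vectors standing in for real SIFT descriptors, and `match_descriptors` is a hypothetical helper, not a library function.

```python
import numpy as np

def match_descriptors(desc_a, desc_b):
    """For each row of desc_a, return the index of the closest row of desc_b and its distance."""
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=2))          # pairwise Euclidean distances
    nearest = dist.argmin(axis=1)
    return nearest, dist[np.arange(len(desc_a)), nearest]

rng = np.random.default_rng(0)
desc_b = rng.random((5, 128))                      # descriptors from image B (stand-ins)
true_match = [3, 0, 4]
# Image A sees three of the same keypoints, slightly perturbed
desc_a = desc_b[true_match] + 0.01 * rng.standard_normal((3, 128))

nearest, distances = match_descriptors(desc_a, desc_b)
print(nearest, distances)
```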

Answer 27

It means identifying the content of the image. To do this you can use a data-driven approach (or a more rule-driven approach), such as KNN (O(1) at training time, O(n) at prediction time).
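A toy KNN sketch that makes the two costs visible. The class, the 2D "feature vectors" and the labels are invented for illustration; real inputs would be image features.

```python
import numpy as np
from collections import Counter

class KNNClassifier:
    def fit(self, X, y):
        # "Training" only stores the data, hence the constant training cost
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict(self, X, k=3):
        predictions = []
        for x in np.asarray(X, dtype=float):
            # Each query is compared against all n stored samples: O(n)
            dist = np.linalg.norm(self.X - x, axis=1)
            nearest = np.argsort(dist)[:k]
            votes = Counter(self.y[i] for i in nearest)
            predictions.append(votes.most_common(1)[0][0])
        return predictions

X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y_train = ["dark", "dark", "dark", "bright", "bright", "bright"]

knn = KNNClassifier().fit(X_train, y_train)
predicted = knn.predict([[0.5, 0.5], [5.5, 5.5]])
print(predicted)
```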

Answer 28

- Raw pixels
- Bag of words

Answer 29

You do not use raw pixels anymore; instead you count the number of occurrences of visual primitives (visual words). The idea originated from text categorization. The steps are:
- Feature detection: extract local features from the image (DoG, Harris-Laplacian, or random sampling)
- Create an image codebook, i.e. a group of visual primitives
- For each feature, compute the distance to each visual word in the dictionary and build a histogram
- Use a learning algorithm to decide which image it is based on the histogram
So basically you are representing the image as frequencies of visual words. This method has been really useful for:
- Image classification
- Large-scale image search
- Discovering visual themes
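The histogram step can be sketched as follows, assuming a codebook is already available. The 2D features and the three-word codebook are toy values; real local features would be higher dimensional (for example 128-D).

```python
import numpy as np

def quantize(features, codebook):
    """Index of the closest visual word for every feature."""
    dist = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dist.argmin(axis=1)

def bow_histogram(features, codebook):
    """Normalized frequency of each visual word in one image."""
    words = quantize(features, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
features = np.array([[0.5, 0.2], [9.0, 1.0], [10.5, -0.5], [1.0, 9.0], [0.0, 1.0]])

words = quantize(features, codebook)
hist = bow_histogram(features, codebook)
print(words, hist)
```

The resulting histogram is the fixed-length vector that gets passed to the learning algorithm.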

Answer 30

You take the diverse local features and cluster them to create feature clusters. The centroid of each cluster is a visual word (because it summarizes a certain kind of feature). You can use k-means:
- Randomly initialize the centroids
- Assign each feature to its closest centroid
- Re-evaluate the centroids
You can learn the codewords on a separate training set; indeed these codewords should be universal. The vocabulary should be neither too small nor too large. To make the vocabulary really efficient you can build vocabulary trees (the comparison is also faster).
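A compact k-means sketch of the three steps. Assumptions: centroids are initialized from randomly chosen features, a fixed number of iterations is used instead of a convergence test, and the eight 2D points are toy features.

```python
import numpy as np

def kmeans(features, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random initialization (here: k distinct features picked at random)
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every feature to its closest centroid
        dist = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned features
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

features = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                     [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

codebook, labels = kmeans(features, k=2)
codebook = codebook[np.argsort(codebook[:, 0])]   # sort for a stable printout
print(codebook)
```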

Answer 31

It is a function that takes a feature and maps it to the closest codevector.

Answer 32

The TF-IDF normalization is a way to give more weight to features that do not appear in every image. The weight is log(n_docu/n_docu_with_feature).
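A sketch of the weighting on bag-of-words counts. The IDF term is the formula above; using the normalized count as the TF term is an assumption, and the three count vectors are made up.

```python
import math

def tfidf(histograms):
    """histograms: one list of visual-word counts per image (document)."""
    n_docs = len(histograms)
    n_words = len(histograms[0])
    # Number of images in which each visual word appears at least once
    doc_freq = [sum(1 for h in histograms if h[w] > 0) for w in range(n_words)]
    weighted = []
    for h in histograms:
        total = sum(h)
        weighted.append([
            (h[w] / total) * math.log(n_docs / doc_freq[w]) if doc_freq[w] else 0.0
            for w in range(n_words)
        ])
    return weighted

counts = [[4, 1, 0],   # word 0 appears in every image, so it carries no weight
          [2, 0, 2],
          [3, 0, 0]]
weights = tfidf(counts)
print(weights)
```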

Answer 33

With bag of words you lose the spatial information, i.e. the relation that exists between two nearby visual features. To solve this you can use the spatial pyramid. Here are the steps:
- Take the image and compute the bag of words
- Divide the image into 4 and compute the bag of words on each subimage
- Keep dividing until you reach a satisfactory level
It is really good for image representation.
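A simplified sketch of the pyramid. Assumptions: the image is already given as a grid of visual-word indices (a hypothetical `word_map`), the per-cell histograms are simply concatenated, and the per-level weights used in the full method are left out.

```python
import numpy as np

def spatial_pyramid(word_map, n_words, levels=1):
    """Concatenate bag-of-words histograms of the whole map and of its subdivisions."""
    h, w = word_map.shape
    hists = []
    for level in range(levels + 1):
        cells = 2 ** level                     # cells per side: 1, 2, 4, ...
        for rows in np.array_split(np.arange(h), cells):
            for cols in np.array_split(np.arange(w), cells):
                cell = word_map[np.ix_(rows, cols)]
                hists.append(np.bincount(cell.ravel(), minlength=n_words))
    return np.concatenate(hists)

# Word 0 only in the top-left quadrant, word 1 everywhere else
word_map = np.ones((4, 4), dtype=int)
word_map[:2, :2] = 0

pyramid = spatial_pyramid(word_map, n_words=2, levels=1)
print(pyramid)
```

The first two entries are the plain bag of words; the remaining ones record where in the image each word occurred.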

Answer 34

Because it can automatically learn the representation of the image and classify it at the same time.

Answer 35

- Fully connected layer
- Convolutional layer
- Pooling layer
Basically it is a series of convolutional layers and their activation functions.

Answer 36

```
(W - F + 2*P)/S + 1
W is the input volume size
F is the filter size
P is the padding
S is the stride
```
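The formula as a small helper, with a few worked values. The function name is hypothetical, and raising an error when the stride does not divide evenly is a design choice made here (some frameworks round down instead).

```python
def conv_output_size(w, f, p, s):
    """Output size along one dimension: (W - F + 2*P)/S + 1."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("the filter does not tile the input evenly with this stride")
    return span // s + 1

print(conv_output_size(32, 5, 0, 1))    # 5x5 filter, no padding, stride 1
print(conv_output_size(32, 5, 2, 1))    # padding 2 keeps the size unchanged
print(conv_output_size(227, 11, 0, 4))  # 11x11 filter with stride 4
```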

Answer 37

The pooling layer does not need any learning: you simply pool several values into one (downsampling the image). Common choices are:
- Max pool
- Mean pool
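A sketch of both pooling types on non-overlapping windows. Assumptions: the stride equals the window size, and rows or columns that do not fill a whole window are dropped.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Downsample by pooling each non-overlapping size x size window into one value."""
    h, w = image.shape
    h, w = h - h % size, w - w % size          # drop rows/cols that do not fill a window
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "mean":
        return blocks.mean(axis=(1, 3))
    raise ValueError("mode must be 'max' or 'mean'")

image = np.array([[1, 2, 5, 6],
                  [3, 4, 7, 8],
                  [9, 10, 13, 14],
                  [11, 12, 15, 16]], dtype=float)

max_pooled = pool2d(image, mode="max")
mean_pooled = pool2d(image, mode="mean")
print(max_pooled)
print(mean_pooled)
```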

Answer 38

- AlexNet = pretty fast and memory efficient
- VGGNet = famous because it showed the effect of having a deeper network, but it is really expensive to evaluate
- GoogLeNet = it has the inception module, a very smart way of extracting local features: you have a set of convolutions whose outputs get stacked on top of each other. Even though it is deeper, it is very fast

Answer 39

It is a really common practice where you do not train the entire model from scratch: you start from a pretrained model and train on top of it to improve generalization.

Answer 40

It is a technique to increase the generalization of the model by making its predictions robust to certain kinds of variation (rotation, for instance).

Answer 41

You divide the image into regions and use this information to train on and to learn the representation.

Answer 42

We cannot simply use a CNN; we have to use an RNN, where the network has an additional parameter, the state variable. Examples are:
- LSTM
- GRU
- Vanilla RNN
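A minimal vanilla RNN step showing the state variable being carried along a sequence. The weight shapes, the random inputs and the tanh non-linearity are assumptions chosen for illustration; nothing is trained here.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """New state from the current input and the previous state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

rng = np.random.default_rng(0)
W_xh = 0.1 * rng.standard_normal((4, 3))   # input (3 features) to state (4 units)
W_hh = 0.1 * rng.standard_normal((4, 4))   # state to state
b = np.zeros(4)

sequence = rng.standard_normal((5, 3))     # 5 time steps of 3 features each
h = np.zeros(4)                            # initial state
states = []
for x in sequence:
    h = rnn_step(x, h, W_xh, W_hh, b)      # the state is what links the time steps
    states.append(h)

print(len(states), h.shape)
```

LSTM and GRU keep the same loop but replace `rnn_step` with gated updates of the state.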

Answer 43

Given a picture and an agent, we would like to predict the action of the agent. This ability is governed by two factors:
- Knowledge of the dynamics
- Understanding of the semantics of the scene
You need to use knowledge transfer to augment the data.

Answer 44

Given an image, you generate a textual description of it. We do this by training an RNN model on the output of a fully connected layer. This output is the initial state; it gives a certain probability of each word being the next one, and then you continue with the RNN.

(68 cards)