07-08 - BoW & FaceRecognition Flashcards
How to gather the correct data for learning a vocabulary (local features from a training set) - BoW
Use SIFT or ORB for example
The chosen detector must be invariant (e.g. to scale and rotation), and the number of descriptors to extract should be chosen with both runtime and its influence on the vocabulary in mind
Describe the k-means algorithm and the interpretation of the clusters?
→ naive algorithm
Input: n feature vectors x, number of clusters k
Initialization: initialize the k means mu
- e.g. as k random feature vectors (SIFT descriptors are 128-dimensional)
Now iterate the following two steps until convergence:
1. Assign each data point to the nearest cluster mean using squared Euclidean distance
2. Update each cluster mean as the average of its new members
Interpretation: each cluster mean is a visual word; together the k means form the vocabulary.
Problems with k-means:
- initialization can be poor; a smarter initialization method or several runs with different inits can help
- poor clustering due to either too few or too many clusters (the number of clusters can be tuned, e.g. with the elbow method)
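The two iteration steps can be sketched in NumPy. This is a minimal illustrative version, not a production implementation; the function name, the fixed iteration count, and the optional `init` parameter are my own choices:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0, init=None):
    """Naive k-means. X: (n, d) feature vectors. Returns the k cluster
    means (the visual words) and the cluster label of each point."""
    rng = np.random.default_rng(seed)
    if init is None:
        # initialize the k means as k random feature vectors from the data
        init = X[rng.choice(len(X), size=k, replace=False)]
    mu = np.array(init, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # 1. assign each data point to the nearest mean
        #    (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. update each mean as the average of its members
        for j in range(k):
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    return mu, labels
```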
How to make the BoW training set searchable by forming global image descriptors?
We quantize: assign each local feature to its nearest visual word and build a histogram over the vocabulary, where each cluster is one bin. This histogram is the global image descriptor.
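A sketch of that quantization step (assuming a Euclidean nearest-word assignment; the function name is mine):

```python
import numpy as np

def bow_descriptor(features, vocab):
    """Quantize local features (m, d) against the vocabulary (k, d):
    assign each feature to its nearest visual word, then histogram
    the word counts. Each cluster is one bin."""
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalize: invariant to feature count
```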
How BoW descriptors can be used for retrieval based on a novel image?
Brute-force approach: loop over all training images, compare the BoW descriptor of the novel image to that of each training image, and compute the distance between them (e.g. Euclidean or chi-square)
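The brute-force loop, sketched with the chi-square distance (the epsilon guard and the function name are my additions):

```python
import numpy as np

def retrieve(query_hist, train_hists):
    """Brute-force BoW retrieval: compute the chi-square distance from
    the query histogram to every training histogram and rank by it."""
    eps = 1e-10  # avoid division by zero in empty bins
    d = 0.5 * np.sum((train_hists - query_hist) ** 2
                     / (train_hists + query_hist + eps), axis=1)
    return np.argsort(d)  # best match first
```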
How to perform weighting to obtain better Bag Of Words matching results?
Term Frequency: upweights words that are frequent within a single image
Inverse Document Frequency: upweights words that are rare across the whole training set
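The two weightings combine multiplicatively into TF-IDF; a minimal sketch (this particular normalization is one common variant, not the only one):

```python
import numpy as np

def tfidf(counts):
    """counts: (n_images, n_words) raw visual-word counts.
    TF upweights words frequent within one image; IDF upweights
    words that appear in few images of the whole training set."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    n_images = counts.shape[0]
    df = (counts > 0).sum(axis=0)          # images containing each word
    idf = np.log(n_images / np.maximum(df, 1))
    return tf * idf
```

Note that a word occurring in every training image gets IDF = log(1) = 0, so it contributes nothing to matching.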
How to reduce the BoW search problem?
hierarchical clustering & inverted file index
inverted file index:
For each cue/word/cluster, have a list of indices of images that have that word matched to them.
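A sketch of such an index (function names are mine; only images sharing at least one word with the query then need to be compared):

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """image_words[i] = iterable of visual-word ids present in image i.
    Returns: word id -> list of image indices containing that word."""
    index = defaultdict(list)
    for img_id, words in enumerate(image_words):
        for w in set(words):
            index[w].append(img_id)
    return index

def candidates(index, query_words):
    """Collect only the images that share at least one visual word
    with the query, instead of scanning the whole training set."""
    hits = set()
    for w in set(query_words):
        hits.update(index.get(w, []))
    return hits
```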
The features used by the Viola-Jones detector, including how to compute them?
In the Viola-Jones algorithm, three types of features are used:
- Two rectangle (difference of the sum of the pixel within two rectangular regions)
- Three rectangle (sum within two outside rectangles subtracted from the sum in a center rectangle)
- Four rectangle (difference between diagonal pairs of rectangles)
The features are computed very efficiently using a so-called integral image, an intermediate representation of the image.
Each location in the integral image contains the sum of the pixels above and to the left of (and including) that pixel.
Once computed, any of these Haar-like features can be evaluated at any scale or location in constant time.
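The constant-time rectangle sum works out to four array lookups; a sketch (inclusive rectangle bounds are my convention here):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels at rows <= y and cols <= x."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum over img[top:bottom+1, left:right+1] with four lookups,
    i.e. constant time regardless of rectangle size."""
    s = ii[bottom, right]
    if top > 0:
        s -= ii[top - 1, right]
    if left > 0:
        s -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1, left - 1]
    return s
```

A two-rectangle feature is then simply `rect_sum` over one region minus `rect_sum` over the adjacent one.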
The basic building blocks, the weak learners, including intuition
Weak learner: a single feature, a threshold for the classification function, and a polarity.
Intuition:
In each iteration, the weak classifier with the lowest weighted error rate is chosen. Afterwards the weights are updated so that the easy (already correctly classified) examples count less in the next round's error calculation.
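A single weak learner is tiny; a sketch of the decision rule (the polarity just picks which side of the threshold means "face"):

```python
def weak_classify(feature_value, threshold, polarity):
    """One weak learner: a Haar-like feature response, a threshold, and a
    polarity in {1, -1} choosing which side of the threshold means
    'face' (returns 1) versus 'non-face' (returns 0)."""
    return 1 if polarity * feature_value < polarity * threshold else 0
```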
Viola Jones Cascade optimization
Simpler classifiers are used to reject the majority of subwindows before more complex classifiers are called upon to achieve low false positive rates.
A positive result from the first classifier triggers the evaluation of the second, and so on. A negative outcome at any stage leads to immediate rejection of the subwindow.
The recipe for the boosting algorithm
Input: image patches (labeled)
Initialize uniform weights for all training examples
repeat:
- normalize the weights
- for each feature/weak learner, compute the weighted misclassification rate
- pick the single weak learner with the smallest weighted error
- reduce the weights of the training examples that the chosen weak learner classified correctly
Finally: the strong classifier is the weighted sum of the M chosen weak learners
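The recipe can be sketched end-to-end for simple threshold stumps. This is a didactic AdaBoost sketch in the spirit of Viola-Jones, not their exact formulation; the epsilon guards and the exhaustive threshold search are my simplifications:

```python
import numpy as np

def train_adaboost(X, y, rounds=5):
    """X: (n, f) feature responses, y: (n,) labels in {0, 1}.
    Each round picks the (feature, threshold, polarity) stump with the
    lowest weighted error, then downweights correctly classified examples."""
    n, f = X.shape
    w = np.ones(n) / n                          # uniform initial weights
    learners = []
    for _ in range(rounds):
        w = w / w.sum()                         # normalize weights
        best = None
        for j in range(f):                      # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = (pol * X[:, j] < pol * thr).astype(int)
                    err = float(np.sum(w[pred != y]))
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        beta = max(err, 1e-10) / max(1.0 - err, 1e-10)
        pred = (pol * X[:, j] < pol * thr).astype(int)
        w = w * np.where(pred == y, beta, 1.0)  # shrink correct examples
        learners.append((np.log(1.0 / beta), j, thr, pol))  # (alpha, stump)
    return learners

def strong_classify(learners, x):
    """Strong classifier: weighted majority vote of the weak learners."""
    total = sum(a for a, _, _, _ in learners)
    score = sum(a for a, j, thr, pol in learners if pol * x[j] < pol * thr)
    return int(score >= 0.5 * total)
```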
How to compute the eigenface decomposition?
All image patches are flattened into 1-D vectors whose length is the number of pixels
- find the mean and the covariance matrix (the statistical relationship between the pixel intensities of the images)
- Compute eigenvectors and eigenvalues with the SVD. The eigenvectors are directions in face space, sorted by importance according to their eigenvalues, and are called eigenfaces
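A sketch of the decomposition via `numpy.linalg.svd` (applying the SVD to the centered data matrix directly, so the covariance matrix never needs to be formed explicitly):

```python
import numpy as np

def eigenfaces(images):
    """images: (n, p) matrix, each row a flattened face patch.
    Returns the mean face and the eigenfaces as rows of Vt,
    sorted by singular value (i.e. by importance)."""
    mean = images.mean(axis=0)
    A = images - mean                    # center the data
    # SVD of the centered data: the right singular vectors are the
    # eigenvectors of the covariance matrix A.T @ A / (n - 1)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return mean, Vt
```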
How to project a face to the face space
We project a novel image onto the face space by subtracting the mean from it and taking the dot product of the result with each eigenvector. Each dot product gives a scalar response, often called a weight. Comparing these weights to those of the training images tells us which known face is closest to the novel image. We can also reconstruct the image by adding to the mean each eigenvector multiplied by its corresponding weight. The more similar that reconstruction is to the original image, the more likely the image shows a face.
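Projection and reconstruction are each one matrix product; a sketch (function names are mine, and `E` is assumed to hold eigenfaces as rows):

```python
import numpy as np

def project(x, mean, E):
    """E: (k, p) matrix whose rows are eigenfaces. The weights are the
    dot products of the mean-subtracted image with each eigenface."""
    return E @ (x - mean)                 # one scalar weight per eigenface

def reconstruct(w, mean, E):
    """Back-projection: the mean plus the weighted sum of eigenfaces."""
    return mean + w @ E
```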
How to use face space to detect/recognize a face?
Whether the novel image shows a face is decided via the distance from face space, DFFS = ||x − x̃||, the distance between the image x and its reconstruction x̃.
The lower the DFFS, the more likely the image shows a face.
To determine who it is, a "distance in face space" (DIFS) is computed between the input image and each training image. Recognition of a person is then simply thresholding the distance between the input image x and the closest match among the training images.
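Both distances in a short sketch (the threshold value and function names are illustrative; `E` holds eigenfaces as rows and `train_weights` the projected training images):

```python
import numpy as np

def dffs(x, mean, E):
    """Distance from face space: how far x is from its reconstruction
    out of the eigenfaces. Low DFFS -> x probably shows a face."""
    w = E @ (x - mean)
    return np.linalg.norm(x - (mean + w @ E))

def recognize(x, mean, E, train_weights, threshold):
    """DIFS: distance in face space between x's weights and each training
    image's weights; return the closest identity if under threshold."""
    w = E @ (x - mean)
    d = np.linalg.norm(train_weights - w, axis=1)
    best = int(np.argmin(d))
    return best if d[best] < threshold else None
```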
What are the pros and cons for eigenfaces?
Pros:
- Cheap and flexible face comparison
- only need projection and vector space distance computations
- Allows for unsupervised discovery of new identities (whenever a novel image's projection has a low DFFS but the smallest DIFS is large)
Cons/Limitations
- In real life, images of the same person do not always cluster nicely together. Sometimes intra-class variation can be larger than inter-class variation, for example because of lighting, viewing angle, etc. This can result in overlapping clusters.
The structure of FaceNet, especially compared to how eigenfaces projects faces?
A neural network architecture that learns to map face images directly into an embedding space where facial similarity is preserved. Unlike the fixed linear projection of eigenfaces, the mapping is learned end-to-end, trained with a (triplet) loss that forces embeddings of the same face to be close and embeddings of different faces to be far apart.