07-08 - BoW & FaceRecognition Flashcards
How to gather the correct data for learning a vocabulary (local features from a training set) - BoW
Use SIFT or ORB for example
The chosen detector must be invariant (e.g. to scale and rotation), and the number of descriptors to extract should be chosen with both runtime and its influence on the vocabulary in mind
Describe the k-means algorithm and the interpretation of the clusters?
→ naive algorithm
Input: n feature vectors x, number of clusters k
Initialization: initialize the k means mu
- e.g. as k random feature vectors (SIFT descriptors are 128-dimensional)
Now iterate the following two steps until convergence:
1. Assign each data point to the nearest cluster mean using squared Euclidean distance
2. Update each cluster mean as the average of its new members
Interpretation: each cluster mean is a visual word; together the k means form the vocabulary.
Problems with k-means:
- initialization can be poor; a smarter initialization method or several runs with different inits can help
- poor clustering due to either too few or too many clusters (the number of clusters can be tuned, e.g. with the elbow method)
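The two iteration steps can be sketched in NumPy. This is a minimal illustrative version, not a production implementation; the function name, the fixed iteration count, and the optional `init` parameter are my own choices:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0, init=None):
    """Naive k-means. X: (n, d) feature vectors. Returns the k cluster
    means (the visual words) and the cluster label of each point."""
    rng = np.random.default_rng(seed)
    if init is None:
        # initialize the k means as k random feature vectors from the data
        init = X[rng.choice(len(X), size=k, replace=False)]
    mu = np.array(init, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # 1. assign each data point to the nearest mean
        #    (squared Euclidean distance)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. update each mean as the average of its members
        for j in range(k):
            if np.any(labels == j):
                mu[j] = X[labels == j].mean(axis=0)
    return mu, labels
```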
How to make the BoW training set searchable by forming global image descriptors?
We quantize: assign each local feature to its nearest visual word and build a histogram over the vocabulary, where each cluster is one bin. This histogram is the global image descriptor.
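A sketch of that quantization step (assuming a Euclidean nearest-word assignment; the function name is mine):

```python
import numpy as np

def bow_descriptor(features, vocab):
    """Quantize local features (m, d) against the vocabulary (k, d):
    assign each feature to its nearest visual word, then histogram
    the word counts. Each cluster is one bin."""
    d2 = ((features[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    hist = np.bincount(words, minlength=len(vocab)).astype(float)
    return hist / hist.sum()  # normalize: invariant to feature count
```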
How BoW descriptors can be used for retrieval based on a novel image?
Brute-force approach: loop over all training images, compare the BoW descriptor of the novel image to that of each training image, and compute the distance between them (e.g. Euclidean or chi-square)
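The brute-force loop, sketched with the chi-square distance (the epsilon guard and the function name are my additions):

```python
import numpy as np

def retrieve(query_hist, train_hists):
    """Brute-force BoW retrieval: compute the chi-square distance from
    the query histogram to every training histogram and rank by it."""
    eps = 1e-10  # avoid division by zero in empty bins
    d = 0.5 * np.sum((train_hists - query_hist) ** 2
                     / (train_hists + query_hist + eps), axis=1)
    return np.argsort(d)  # best match first
```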
How to perform weighting to obtain better Bag Of Words matching results?
Term Frequency: upweights words that are frequent within a single image
Inverse Document Frequency: upweights words that are rare across the whole training set
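The two weightings combine multiplicatively into TF-IDF; a minimal sketch (this particular normalization is one common variant, not the only one):

```python
import numpy as np

def tfidf(counts):
    """counts: (n_images, n_words) raw visual-word counts.
    TF upweights words frequent within one image; IDF upweights
    words that appear in few images of the whole training set."""
    tf = counts / counts.sum(axis=1, keepdims=True)
    n_images = counts.shape[0]
    df = (counts > 0).sum(axis=0)          # images containing each word
    idf = np.log(n_images / np.maximum(df, 1))
    return tf * idf
```

Note that a word occurring in every training image gets IDF = log(1) = 0, so it contributes nothing to matching.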
How to reduce the BoW search problem?
hierarchical clustering & inverted file index
inverted file index:
For each cue/word/cluster, have a list of indices of images that have that word matched to them.
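A sketch of such an index (function names are mine; only images sharing at least one word with the query then need to be compared):

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """image_words[i] = iterable of visual-word ids present in image i.
    Returns: word id -> list of image indices containing that word."""
    index = defaultdict(list)
    for img_id, words in enumerate(image_words):
        for w in set(words):
            index[w].append(img_id)
    return index

def candidates(index, query_words):
    """Collect only the images that share at least one visual word
    with the query, instead of scanning the whole training set."""
    hits = set()
    for w in set(query_words):
        hits.update(index.get(w, []))
    return hits
```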
The features used by the Viola-Jones detector, including how to compute them?
In the Viola-Jones algorithm, three types of features are used:
- Two rectangle (difference of the sum of the pixel within two rectangular regions)
- Three rectangle (sum within two outside rectangles subtracted from the sum in a center rectangle)
- Four rectangle (difference between diagonal pairs of rectangles)
The features are computed very efficiently using a so-called integral image, an intermediate representation of the image.
Each location in the integral image contains the sum of the pixels above and to the left of (and including) that pixel.
Once computed, any of these Haar-like features can be evaluated at any scale or location in constant time.
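The constant-time rectangle sum works out to four array lookups; a sketch (inclusive rectangle bounds are my convention here):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img over all pixels at rows <= y and cols <= x."""
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    """Sum over img[top:bottom+1, left:right+1] with four lookups,
    i.e. constant time regardless of rectangle size."""
    s = ii[bottom, right]
    if top > 0:
        s -= ii[top - 1, right]
    if left > 0:
        s -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        s += ii[top - 1, left - 1]
    return s
```

A two-rectangle feature is then simply `rect_sum` over one region minus `rect_sum` over the adjacent one.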
The basic building blocks, the weak learners, including intuition
Weak learner: a single feature, a threshold for the classification function, and a polarity.
Intuition:
In each iteration, the weak classifier with the lowest weighted error rate is chosen. Afterwards the weights are updated so that the easy (already correctly classified) examples count less in the next round's error calculation.
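A single weak learner is tiny; a sketch of the decision rule (the polarity just picks which side of the threshold means "face"):

```python
def weak_classify(feature_value, threshold, polarity):
    """One weak learner: a Haar-like feature response, a threshold, and a
    polarity in {1, -1} choosing which side of the threshold means
    'face' (returns 1) versus 'non-face' (returns 0)."""
    return 1 if polarity * feature_value < polarity * threshold else 0
```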
Viola Jones Cascade optimization
Simpler classifiers are used to reject the majority of subwindows before more complex classifiers are called upon to achieve low false positive rates.
A positive result from the first classifier triggers the evaluation of the second, and so on. A negative outcome at any stage leads to immediate rejection of the subwindow.
The recipe for the boosting algorithm
Input: image patches (labeled)
Initialize uniform weights for all training examples
repeat:
- normalize the weights
- for each feature/weak learner, compute the weighted misclassification rate
- pick the single weak learner with the smallest weighted error
- reduce the weights of the training examples that the chosen weak learner classified correctly
Finally: the strong classifier is the weighted sum of the M chosen weak learners
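The recipe can be sketched end-to-end for simple threshold stumps. This is a didactic AdaBoost sketch in the spirit of Viola-Jones, not their exact formulation; the epsilon guards and the exhaustive threshold search are my simplifications:

```python
import numpy as np

def train_adaboost(X, y, rounds=5):
    """X: (n, f) feature responses, y: (n,) labels in {0, 1}.
    Each round picks the (feature, threshold, polarity) stump with the
    lowest weighted error, then downweights correctly classified examples."""
    n, f = X.shape
    w = np.ones(n) / n                          # uniform initial weights
    learners = []
    for _ in range(rounds):
        w = w / w.sum()                         # normalize weights
        best = None
        for j in range(f):                      # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = (pol * X[:, j] < pol * thr).astype(int)
                    err = float(np.sum(w[pred != y]))
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        beta = max(err, 1e-10) / max(1.0 - err, 1e-10)
        pred = (pol * X[:, j] < pol * thr).astype(int)
        w = w * np.where(pred == y, beta, 1.0)  # shrink correct examples
        learners.append((np.log(1.0 / beta), j, thr, pol))  # (alpha, stump)
    return learners

def strong_classify(learners, x):
    """Strong classifier: weighted majority vote of the weak learners."""
    total = sum(a for a, _, _, _ in learners)
    score = sum(a for a, j, thr, pol in learners if pol * x[j] < pol * thr)
    return int(score >= 0.5 * total)
```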
How to compute the eigenface decomposition?
All image patches are flattened into 1-D vectors whose length is the number of pixels
- find the mean and the covariance matrix (the statistical relationship between the pixel intensities of the images)
- Compute eigenvectors and eigenvalues with the SVD. The eigenvectors are directions in face space, sorted by importance according to their eigenvalues, and are called eigenfaces
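A sketch of the decomposition via `numpy.linalg.svd` (applying the SVD to the centered data matrix directly, so the covariance matrix never needs to be formed explicitly):

```python
import numpy as np

def eigenfaces(images):
    """images: (n, p) matrix, each row a flattened face patch.
    Returns the mean face and the eigenfaces as rows of Vt,
    sorted by singular value (i.e. by importance)."""
    mean = images.mean(axis=0)
    A = images - mean                    # center the data
    # SVD of the centered data: the right singular vectors are the
    # eigenvectors of the covariance matrix A.T @ A / (n - 1)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return mean, Vt
```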
How to project a face to the face space
We project a novel image onto the face space by subtracting the mean from it and taking the dot product of the result with each eigenvector. Each dot product gives a scalar response, often called a weight. Comparing these weights to those of the training images tells us which known face is closest to the novel image. We can also reconstruct the image by adding to the mean each eigenvector multiplied by its corresponding weight. The more similar that reconstruction is to the original image, the more likely the image shows a face.
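Projection and reconstruction are each one matrix product; a sketch (function names are mine, and `E` is assumed to hold eigenfaces as rows):

```python
import numpy as np

def project(x, mean, E):
    """E: (k, p) matrix whose rows are eigenfaces. The weights are the
    dot products of the mean-subtracted image with each eigenface."""
    return E @ (x - mean)                 # one scalar weight per eigenface

def reconstruct(w, mean, E):
    """Back-projection: the mean plus the weighted sum of eigenfaces."""
    return mean + w @ E
```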
How to use face space to detect/recognize a face?
Whether the novel image shows a face is decided via the distance from face space, DFFS = ||x − x̃||, the distance between the image x and its reconstruction x̃.
The lower the DFFS, the more likely the image shows a face.
To determine who it is, a "distance in face space" (DIFS) is computed between the input image and each training image. Recognition of a person is then simply thresholding the distance between the input image x and the closest match among the training images.
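Both distances in a short sketch (the threshold value and function names are illustrative; `E` holds eigenfaces as rows and `train_weights` the projected training images):

```python
import numpy as np

def dffs(x, mean, E):
    """Distance from face space: how far x is from its reconstruction
    out of the eigenfaces. Low DFFS -> x probably shows a face."""
    w = E @ (x - mean)
    return np.linalg.norm(x - (mean + w @ E))

def recognize(x, mean, E, train_weights, threshold):
    """DIFS: distance in face space between x's weights and each training
    image's weights; return the closest identity if under threshold."""
    w = E @ (x - mean)
    d = np.linalg.norm(train_weights - w, axis=1)
    best = int(np.argmin(d))
    return best if d[best] < threshold else None
```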
What are the pros and cons for eigenfaces?
Pros:
- Cheap and flexible face comparison
- only need projection and vector space distance computations
- Allows for unsupervised discovery of new identities (whenever a novel image's projection has a low DFFS but the smallest DIFS is large)
Cons/Limitations
- In real life, images of the same person do not always cluster nicely together. Sometimes intra-class variation can be larger than inter-class variation, for example because of lighting, viewing angle, etc. This can result in overlapping clusters.
The structure of FaceNet, especially compared to how eigenfaces projects faces?
A neural network architecture that learns to map face images directly into an embedding space where facial similarity is preserved. Unlike the fixed linear projection of eigenfaces, the mapping is learned end-to-end, trained with a (triplet) loss that forces embeddings of the same face to be close and embeddings of different faces to be far apart.