Bag of Words Flashcards
What is Bag of Words (BoW)?
An image-representation technique (used for classification and retrieval) that represents an image as a set of local features. Each feature consists of a keypoint and a descriptor; the descriptors are clustered to construct a vocabulary of visual words.
❗️❗️❗️What are the requirements for the feature extraction in BoW?
- Sample a number of descriptors in each frame (from a feature extractor, e.g. SIFT or ORB)
- Descriptors must be invariant to lighting, rotation, and scale, giving us more robust features
- The number of descriptors influences the runtime and the size of the vocabulary
What is a visual word in BoW?
A cluster of descriptors: the descriptors are clustered in descriptor space, and each resulting cluster represents one visual word.
What is a vocabulary?
This is a collection of visual words.
❗️❗️❗️Training/clustering (1):
How to gather the correct data for learning a vocabulary (local features from a training set)
Sample invariant 2D descriptors (e.g. SIFT, ORB) from each image in the training data.
Cluster them using k-means to form the visual words of the vocabulary.
❗️❗️❗️Training/clustering (2):
The k-means algorithm and the interpretation of the clusters
K-means clustering is used to cluster the features of the training data. The clusters represent the “words” of our visual vocabulary.
How does K-means clustering work for the features?
- k initial means are chosen at random among the n feature vectors x.
- k clusters are created by associating every data point x_i with the nearest mean (squared Euclidean distance).
- The centroid of each of the k clusters becomes the new mean.
- Steps 2 and 3 are repeated until convergence.
The size of k is important:
- Too small: the vocabulary underrepresents the descriptor space
- Too big: the vocabulary is over-determined (overfits the training descriptors)
Use an elbow plot to find a good k.
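The k-means steps above can be sketched directly in NumPy. This is a minimal illustration, not a production vocabulary builder: `X` is assumed to be an (n, d) matrix of descriptors, and the function name is illustrative.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means over descriptor rows of X (shape n x d)."""
    rng = np.random.default_rng(seed)
    # Step 1: k initial means chosen among the n feature vectors
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign every descriptor to its nearest mean (squared Euclidean)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: the centroid of each cluster becomes the new mean
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # convergence: means stopped moving
            break
        centers = new_centers
    return centers, labels
```

The returned `centers` are the visual words of the vocabulary; `labels` gives the word index of each training descriptor.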
❗️❗️❗️Training/clustering (3):
How to make the training set searchable by forming global image descriptors
- Have k clusters in descriptor space
- Quantize all descriptors of each image by taking the index of the cluster each descriptor belongs to.
- Bin the index-quantized features into a histogram, giving the global BoW image descriptor
Each image can be represented with a histogram (Global BoW descriptor)
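The quantize-and-bin step above can be sketched as follows. This assumes the cluster centres from k-means are available as an array; the function name is illustrative.

```python
import numpy as np

def bow_descriptor(descriptors, centers):
    """Quantize each descriptor to its nearest visual word, then bin
    the word indices into a k-dim histogram (the global BoW descriptor)."""
    # nearest cluster index for every descriptor (squared Euclidean)
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)                     # index-quantized features
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()                      # normalized histogram
```

One such histogram is computed per image, so every image in the training set is reduced to a single k-dimensional vector.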
❗️❗️❗️Retrieval(1):
How BoW descriptors can be used for retrieval based on a novel image
The vocabulary contains all visual words found with k-means on the training set.
For a new image, find the nearest match in the training set by comparing the image’s histogram against the stored histograms.
- Repeat the pipeline for the new image to get its global BoW descriptor (feature extraction, index-quantization against the existing vocabulary, histogram)
Use cosine distance (or Euclidean, chi-squared) for comparison; the smaller the distance, the more similar the images, and the smallest distance gives the best match.
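The comparison step can be sketched with cosine distance, one of the metrics mentioned above. The function names and the flat list of database histograms are illustrative assumptions.

```python
import numpy as np

def cosine_distance(h1, h2):
    """1 - cosine similarity; smaller means more similar histograms."""
    return 1.0 - np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2))

def retrieve(query_hist, database_hists):
    """Return the index of the training image whose BoW histogram
    is closest to the query's (the best match)."""
    dists = [cosine_distance(query_hist, h) for h in database_hists]
    return int(np.argmin(dists))
```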
❗️❗️❗️Retrieval(2):
How to perform tf-idf weighting to obtain better matching results
Some words carry no special meaning (“and” in text, sky in an outdoor environment), so add weights to each bin of the k-dim BoW descriptor.
Use tf-idf
- tf: Instead of using the raw bin count, normalize each component: a_i/n_i, the count of word i in the image over the total number of words in the image. “Upvotes” words that are frequent in a single image.
- idf: log(N/N_i), where N is the total number of images and N_i is the number of images in which visual word i appears at least once. Prioritizes words that are rare over the whole training set.
Final weight per bin: (a_i/n_i) * log(N/N_i)
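The tf-idf weighting above can be sketched over a matrix of raw count histograms (one row per training image). This is a minimal illustration; the function name and the guard for unseen words are my own additions.

```python
import numpy as np

def tfidf_weight(hists):
    """Re-weight raw BoW count histograms (rows = images) with tf-idf.

    tf  = a_i / n_i   : count of word i in the image over total words in the image
    idf = log(N / N_i): N images total, N_i images containing word i at least once
    """
    hists = np.asarray(hists, dtype=float)
    N = hists.shape[0]
    N_i = np.count_nonzero(hists > 0, axis=0)   # images containing each word
    idf = np.log(N / np.maximum(N_i, 1))        # guard: avoid division by zero
    tf = hists / hists.sum(axis=1, keepdims=True)
    return tf * idf
```

Note how a word that appears in every image gets idf = log(N/N) = 0, so its bin contributes nothing to the match, exactly the behaviour wanted for “sky”-like words.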
❗️❗️❗️Retrieval(3):
How to reduce the search problem using an inverted file index
Store, for each visual word, the indices of the training-set images in which it appears; during retrieval, match only against the images associated with the non-zero words of the query descriptor.
- Meaning you only search a smaller set of images instead of the whole database
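The inverted index can be sketched with plain dictionaries and sets. The function names are illustrative; histograms are assumed to be raw per-image word counts.

```python
from collections import defaultdict

def build_inverted_index(bow_hists):
    """Map each visual word to the set of training images where it appears."""
    index = defaultdict(set)
    for img_id, hist in enumerate(bow_hists):
        for word, count in enumerate(hist):
            if count > 0:
                index[word].add(img_id)
    return index

def candidate_images(index, query_hist):
    """Only images sharing at least one non-zero word with the query
    need to be compared; everything else is skipped."""
    cands = set()
    for word, count in enumerate(query_hist):
        if count > 0:
            cands |= index.get(word, set())
    return cands
```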
❗️❗️❗️What can BoW be used for related to our study?
Used for loop-closure detection in a VSLAM algorithm. It provides a fast way of comparing (finding similarities between) a new image and the images in our database.