Week 9 - Object Recognition and Categorisation Flashcards
What is Indexing with Local Features
Each patch/region surrounding a point of interest has a descriptor: a point in some high-dimensional feature space (e.g., 128-D for SIFT)
Close points in feature space have similar descriptors, indicating similar local content
Important for 3D reconstruction and for
retrieving images of similar objects
(try to match target image features to database descriptors)
How do we efficiently find the relevant features of a new image
Using the idea of an inverted file index
What is Inverted File Index
for text docs, use an index to find pages where a word occurs
We want to find all images (pages) in which a feature (word) occurs,
so we want to map our features to ‘visual words’
Visual words: Main idea
Extract some local features from a number of images and map them into the 128-dimensional space (if using SIFT)
Each point in the space is a local descriptor (SIFT vector)
How do we match visual words
When we see close points in feature space, we have similar descriptors (similar content)
Content is close enough to assume it is the same
How do we use clusters for visual words
We can create clusters to reduce the complexity (millions of points) to far fewer clusters which are considered the same
“quantize via clustering”
What are the cluster centres
the prototype “words”
How do we create the inverted file for visual words
-database of images
-run sift, find interest points and encode descriptors
-cluster descriptors
-Create our list of visual words (cluster centres)
-we pass all images through the visual words
-for each word, we have the list of images where this visual word occurs
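The steps above can be sketched as a small function. This is a minimal sketch, not the lecture's implementation: the function name and data layout (a dict of image id to descriptor array, plus a precomputed array of cluster centres) are assumptions for illustration.

```python
import numpy as np

def build_inverted_index(image_descriptors, centres):
    """For each visual word (cluster centre), record which images contain it.
    image_descriptors: dict of image_id -> (n, d) array of SIFT descriptors.
    centres: (k, d) array of cluster centres (the visual vocabulary)."""
    index = {w: set() for w in range(len(centres))}
    for img_id, desc in image_descriptors.items():
        # quantisation step: assign each descriptor to its nearest centre
        d2 = ((desc[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        words = d2.argmin(axis=1)
        for w in set(words.tolist()):
            index[w].add(img_id)
    return index
```

Looking up a word in the resulting dict gives the set of images containing it, mirroring the text-retrieval inverted file.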
How does the inverted file index handle new images
extract the visual words it contains (SIFT)
map image to relevant words in the index
find all the other images that contain the same words
then compare word counts
(similar images will have many visual words in common)
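A query against the index might look like the sketch below. Names are hypothetical, and histogram intersection is just one reasonable choice for comparing word counts:

```python
from collections import Counter

def query_index(index, db_word_counts, query_words):
    """index: visual word -> set of image ids containing it.
    db_word_counts: image id -> {word: count} for database images.
    query_words: list of visual word ids extracted from the new image."""
    # candidate images: anything sharing at least one visual word
    candidates = set().union(*(index.get(w, set()) for w in set(query_words)))
    q = Counter(query_words)
    scores = {}
    for img in candidates:
        c = db_word_counts[img]
        # histogram intersection as a simple word-count similarity
        scores[img] = sum(min(q[w], c.get(w, 0)) for w in q)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```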
What is Spatial Verification
Sometimes, dissimilar images will have high visual word similarity (eg buildings with lots of similar windows)
spatial verification is used to check the images are actually the same
only some of the matches are mutually consistent
What is the spatial verification strategy
Use the generalised Hough transform
Let each matched feature cast a vote on location, scale, orientation of the model object
(uses encoded information about the position, scale, and orientation of each feature match)
Verify parameters with enough votes
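The voting strategy can be sketched as a coarse accumulator over pose bins. The bin sizes and the match representation (each match as a translation, log-scale, and orientation difference) are illustrative assumptions, not values from the lecture:

```python
from collections import Counter

def hough_verify(matches, bin_size=(20, 20, 0.5, 0.25), min_votes=3):
    """Each match casts a vote for a (x, y, log-scale, orientation) model
    pose; mutually consistent matches fall into the same bin."""
    votes = Counter()
    for dx, dy, dlogs, dtheta in matches:
        b = (round(dx / bin_size[0]), round(dy / bin_size[1]),
             round(dlogs / bin_size[2]), round(dtheta / bin_size[3]))
        votes[b] += 1
    best_bin, best_count = votes.most_common(1)[0]
    return best_count >= min_votes, best_bin, best_count
```

Outlier matches scatter across many bins, while true matches concentrate their votes, so thresholding the best bin filters spurious word matches.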
What is the Video Google system
eg find all scenes in film where actor is wearing blue tie
1.Collect all words within query region
2.Inverted file index to find relevant frames
3.Compare word counts
4.Spatial verification
What are the issues with visual vocabulary formation
- Sampling strategy: where to extract features eg blobs or corners…?
- Clustering / quantization algorithm
- Unsupervised vs. supervised (external labels or annotations are used to guide the clustering process)
- What corpus provides features (universal vocabulary?)
- Vocabulary size, number of words (too small -> may not capture visual content)
What is a good sampling strategy to find specific, textured objects
Sparse sampling at interest points
What is a good sampling strategy for object categorisation
Dense sampling
What is a good sampling strategy for more image coverage
Multiple complementary interest operators
What are 4 main sampling strategies
Randomly
Multiple interest operators
Dense
Sparse
What are some clustering/quantisation methods
k-means (typical choice)
agglomerative clustering
mean-shift
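k-means, the typical choice above, can be sketched as a bare-bones Lloyd's algorithm (real systems would use an optimised library implementation over millions of SIFT descriptors):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: X is an (n, d) float array of descriptors;
    returns (k, d) cluster centres (the visual words) and point labels."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centre
        labels = ((X[:, None] - centres[None]) ** 2).sum(-1).argmin(1)
        # move each centre to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(0)
    return centres, labels
```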
What is a query region
Eg want to find specific object in images
pull out only the SIFT descriptors whose positions are within the relevant polygon
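A minimal sketch of that filtering step, simplified to a rectangular region rather than an arbitrary polygon (function name and data layout are assumptions):

```python
def descriptors_in_region(keypoints, descriptors, region):
    """Keep only descriptors whose keypoint (x, y) lies inside a
    rectangular query region (x0, y0, x1, y1) - a simple stand-in
    for a general point-in-polygon test."""
    x0, y0, x1, y1 = region
    return [(kp, d) for kp, d in zip(keypoints, descriptors)
            if x0 <= kp[0] <= x1 and y0 <= kp[1] <= y1]
```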
What is object categorisation
Instead of recognising a specific dress
how can we recognise any dress
(a category)
What is the task description of object categorisation
Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label
What are Visual object categories
humans tend to find ‘basic-level categories’
EG
abstract level = animal, mammal
basic level = dog, cat, cow
individual level = Dobermann, “Gary”
What is a-priori unknown
not known or seen by the system during the training phase
What is a functional category
“Something i can sit on”
“something i can eat”
what is an ad-hoc category
“something you can find in an office environment”
What are challenges to robustness of object categorisation
illumination
object pose
clutter
occlusions
intra-class appearance
viewpoint
what is the scale of supervision we can use
less: unlabeled, multiple objects
medium: classes labelled, some clutter
more: cropped to object, parts within object are labelled
What features must our visual word representation have for object categorisation
robust to intra-category variation
robust to deformation, articulation
…
still discriminative
What is the loose v strict BoW definition
Looser: independent features
Stricter: independent features with histogram representation (how frequently each word appears in image)
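The stricter (histogram) representation is easy to sketch: given each feature's assigned word id, count occurrences into a fixed-length vector (function name hypothetical):

```python
from collections import Counter

def bow_histogram(word_ids, vocab_size):
    """Frequency histogram over the visual vocabulary: entry w counts
    how often visual word w appears in the image."""
    counts = Counter(word_ids)
    return [counts[w] for w in range(vocab_size)]
```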
BoW overview
- feature detection and representation (SIFT)
- create codewords dictionary (visual word index)
- Image representation via bag of codewords
BoW: Feature detection and representation
Regular grid:
- sample patches at fixed grid positions (dense coverage)
Interest point detector:
- Use state-of-the-art interest point detector
- represent features using SIFT
BoW: Image representation
for each image, we have frequency histogram
with each visual word and the count of how many times it appears
How do we compare BoWs
We can use many methods with histogram frequency data
Eg euclidean distance, normalised scalar product
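The normalised scalar product is cosine similarity between the two frequency histograms; a minimal sketch:

```python
import numpy as np

def normalised_scalar_product(h1, h2):
    """Cosine similarity between two word-frequency histograms:
    1.0 for identical distributions, 0.0 for no shared words."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```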
Recognition with BoW histograms
The BoW representation means we can describe
the unordered set of points with a single vector (of fixed dimension across image examples)
this provides an easy way to use the distribution of feature types
what are the two method types for recognition
generative methods
discriminative methods
What is a discriminative method for recognition
Learn a decision boundary / rule (classifier) assigning bag-of-features representations of images to different classes
Zebra/non-zebra
What is a generative method for recognition
Use probability
p(image|zebra)
p(image|no zebra)
look at likelihood values
Example discriminative: knn classification
map each training histogram to a point in feature space, with boundaries between classes
when we have a new image, build its histogram, which maps to a point in the same space
find the k nearest histograms
if the majority vote is positive (eg yes zebra) then positive (negative -> negative)
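The steps above can be sketched directly (using Euclidean distance between histograms; names are illustrative):

```python
import numpy as np

def knn_classify(train_hists, train_labels, query_hist, k=3):
    """Classify a query BoW histogram by majority vote over its
    k nearest training histograms (Euclidean distance)."""
    d = np.linalg.norm(np.asarray(train_hists, dtype=float)
                       - np.asarray(query_hist, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority vote
```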
nearest neighbour classification pros
Simple to implement
Flexible to feature / distance choices
Naturally handles multi-class cases
Can do well in practice with enough representative data
nearest neighbour classification cons
Large search problem to find nearest neighbours
Storage of data
Must know we have a meaningful distance function
what are some other types of discriminative classifiers
boosting
SVMs
Example generative: the naive bayes model
Assume that each feature is conditionally independent
p(w1, …, wn | c) = ∏i p(wi | c)
want to maximise:
c* = argmax_c p(c) ∏i p(wi | c)
if we know nothing about data assume uniform prior p(c)
What is p(c) in naive bayes
Prior prob. of the object classes
What is p(wi|c) in naive bayes
Likelihood of i-th visual word given the class
Estimated by empirical frequencies of visual words in images from a given class
what is ∏
multiply them together
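A minimal naive Bayes sketch over visual-word lists follows. Two additions beyond the flashcards, both standard practice: Laplace smoothing (so an unseen word does not zero out a class) and working in log space (summing logs instead of multiplying probabilities, for numerical stability). Function names are hypothetical; a uniform prior p(c) is assumed and dropped as a constant:

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocab_size, alpha=1.0):
    """Estimate p(wi|c) from empirical word frequencies per class,
    with Laplace smoothing parameter alpha."""
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        model[c] = {'counts': counts, 'total': sum(counts.values()),
                    'alpha': alpha, 'V': vocab_size}
    return model

def classify_nb(model, words):
    """Return argmax_c sum_i log p(wi|c), assuming a uniform prior."""
    best, best_lp = None, -math.inf
    for c, m in model.items():
        lp = 0.0
        for w in words:
            p = (m['counts'][w] + m['alpha']) / (m['total'] + m['alpha'] * m['V'])
            lp += math.log(p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```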
How can we improve spatial information of BoW model
-visual phrases: frequently co-occurring words
-semi-local features: describe configuration, neighbourhood
-let position be part of each feature
BoW pros
- Flexible to geometry / deformations / viewpoint
- Compact summary of image content
- Provides vector representation for sets
- Empirically good recognition results in practice
BoW cons
- Basic model ignores geometry – must verify afterwards, or encode via features.
- Background and foreground mixed when bag covers whole image
- Interest points or sampling: no guarantee to capture object-level parts.
- Optimal vocabulary formation remains unclear.