Week 9 - Object Recognition and Categorisation Flashcards
What is Indexing with Local Features
Each patch/region surrounding a point of interest has a descriptor: a point in some high-dimensional feature space (e.g., 128-D for SIFT)
Close points in feature space have similar descriptors, indicating similar local content
Important for 3D reconstruction and for
retrieving images of similar objects
(try to match target image features to database descriptors)
How do we efficiently find the relevant features of a new image
Using the idea of an inverted file index
What is Inverted File Index
for text docs, use an index to find pages where a word occurs
We want to find all images (pages) in which a feature (word) occurs,
so we want to map our features to ‘visual words’
Visual words: Main idea
Extract some local features from a number of images and map them into the 128-dimensional space (if using SIFT)
Each point in the space is a local descriptor (SIFT vector)
How do we match visual words
When we see close points in feature space, we have similar descriptors (similar content)
Content is close enough to assume it is the same
How do we use clusters for visual words
We can create clusters to reduce the complexity (millions of points) to far fewer clusters which are considered the same
“quantize via clustering”
What are the cluster centres
the prototype “words”
How do we create the inverted file for visual words
-database of images
-run sift, find interest points and encode descriptors
-cluster descriptors
-Create our list of visual words (cluster centres)
-we pass all images through the visual words
-for each word, we have the list of images where this visual word occurs
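The steps above can be sketched as a small function. This is a minimal sketch, not the lecture's implementation: the function name and data layout (a dict of image id to descriptor array, plus a precomputed array of cluster centres) are assumptions for illustration.

```python
import numpy as np

def build_inverted_index(image_descriptors, centres):
    """For each visual word (cluster centre), record which images contain it.
    image_descriptors: dict of image_id -> (n, d) array of SIFT descriptors.
    centres: (k, d) array of cluster centres (the visual vocabulary)."""
    index = {w: set() for w in range(len(centres))}
    for img_id, desc in image_descriptors.items():
        # quantisation step: assign each descriptor to its nearest centre
        d2 = ((desc[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        words = d2.argmin(axis=1)
        for w in set(words.tolist()):
            index[w].add(img_id)
    return index
```

Looking up a word in the resulting dict gives the set of images containing it, mirroring the text-retrieval inverted file.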
How does the inverted file index handle new images
extract the visual words it contains (SIFT)
map image to relevant words in the index
find all the other images that contain the same words
then compare word counts
(similar images will have many visual words in common)
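A query against the index might look like the sketch below. Names are hypothetical, and histogram intersection is just one reasonable choice for comparing word counts:

```python
from collections import Counter

def query_index(index, db_word_counts, query_words):
    """index: visual word -> set of image ids containing it.
    db_word_counts: image id -> {word: count} for database images.
    query_words: list of visual word ids extracted from the new image."""
    # candidate images: anything sharing at least one visual word
    candidates = set().union(*(index.get(w, set()) for w in set(query_words)))
    q = Counter(query_words)
    scores = {}
    for img in candidates:
        c = db_word_counts[img]
        # histogram intersection as a simple word-count similarity
        scores[img] = sum(min(q[w], c.get(w, 0)) for w in q)
    return sorted(scores.items(), key=lambda kv: -kv[1])
```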
What is Spatial Verification
Sometimes, dissimilar images will have high visual word similarity (eg buildings with lots of similar windows)
spatial verification is used to check the images are actually the same
only some of the matches are mutually consistent
What is the spatial verification strategy
Use the generalised Hough transform
Let each matched feature cast a vote on location, scale, orientation of the model object
(uses encoded information about the position, scale, and orientation of each feature match)
Verify parameters with enough votes
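The voting strategy can be sketched as a coarse accumulator over pose bins. The bin sizes and the match representation (each match as a translation, log-scale, and orientation difference) are illustrative assumptions, not values from the lecture:

```python
from collections import Counter

def hough_verify(matches, bin_size=(20, 20, 0.5, 0.25), min_votes=3):
    """Each match casts a vote for a (x, y, log-scale, orientation) model
    pose; mutually consistent matches fall into the same bin."""
    votes = Counter()
    for dx, dy, dlogs, dtheta in matches:
        b = (round(dx / bin_size[0]), round(dy / bin_size[1]),
             round(dlogs / bin_size[2]), round(dtheta / bin_size[3]))
        votes[b] += 1
    best_bin, best_count = votes.most_common(1)[0]
    return best_count >= min_votes, best_bin, best_count
```

Outlier matches scatter across many bins, while true matches concentrate their votes, so thresholding the best bin filters spurious word matches.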
What is the Video Google system
eg find all scenes in film where actor is wearing blue tie
1.Collect all words within query region
2.Inverted file index to find relevant frames
3.Compare word counts
4.Spatial verification
What are the issues with visual vocabulary formation
- Sampling strategy: where to extract features eg blobs or corners…?
- Clustering / quantization algorithm
- Unsupervised vs. supervised (external labels or annotations are used to guide the clustering process)
- What corpus provides features (universal vocabulary?)
- Vocabulary size, number of words (too small -> may not capture visual content)
What is a good sampling strategy to find specific, textured objects
Sparse sampling at interest points
What is a good sampling strategy for object categorisation
Dense sampling
What is a good sampling strategy for more image coverage
Multiple complementary interest operators
What are 4 main sampling strategies
Randomly
Multiple interest operators
Dense
Sparse
What are some clustering/quantisation methods
k-means (typical choice)
agglomerative clustering
mean-shift
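k-means, the typical choice above, can be sketched as a bare-bones Lloyd's algorithm (real systems would use an optimised library implementation over millions of SIFT descriptors):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means: X is an (n, d) float array of descriptors;
    returns (k, d) cluster centres (the visual words) and point labels."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each descriptor to its nearest centre
        labels = ((X[:, None] - centres[None]) ** 2).sum(-1).argmin(1)
        # move each centre to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centres[j] = X[labels == j].mean(0)
    return centres, labels
```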
What is a query region
Eg want to find specific object in images
pull out only the SIFT descriptors whose positions are within the relevant polygon
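A minimal sketch of that filtering step, simplified to a rectangular region rather than an arbitrary polygon (function name and data layout are assumptions):

```python
def descriptors_in_region(keypoints, descriptors, region):
    """Keep only descriptors whose keypoint (x, y) lies inside a
    rectangular query region (x0, y0, x1, y1) - a simple stand-in
    for a general point-in-polygon test."""
    x0, y0, x1, y1 = region
    return [(kp, d) for kp, d in zip(keypoints, descriptors)
            if x0 <= kp[0] <= x1 and y0 <= kp[1] <= y1]
```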
What is object categorisation
Instead of recognising a specific dress
how can we recognise any dress
(a category)
What is the task description of object categorisation
Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label
What are Visual object categories
humans tend to find ‘basic-level categories’
EG
abstract level = animal, mammal
basic level = dog, cat, cow
individual level = Dobermann, “Gary”
What is a-priori unknown
not known or seen by the system during the training phase
What is a functional category
“Something i can sit on”
“something i can eat”
what is an ad-hoc category
“something you can find in an office environment”
What are challenges to robustness of object categorisation
illumination
object pose
clutter
occlusions
intra-class appearance
viewpoint
what is the scale of supervision we can use
less: unlabeled, multiple objects
medium: classes labelled, some clutter
more: cropped to object, parts within object are labelled
What features must our visual word representation have for object categorisation
robust to intra-category variation
robust to deformation, articulation
…
still discriminative
What is the loose v strict BoW definition
Looser: independent features
Stricter: independent features with histogram representation (how frequently each word appears in image)
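The stricter (histogram) representation is easy to sketch: given each feature's assigned word id, count occurrences into a fixed-length vector (function name hypothetical):

```python
from collections import Counter

def bow_histogram(word_ids, vocab_size):
    """Frequency histogram over the visual vocabulary: entry w counts
    how often visual word w appears in the image."""
    counts = Counter(word_ids)
    return [counts[w] for w in range(vocab_size)]
```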
BoW overview
- feature detection and representation (SIFT)
- create codewords dictionary (visual word index)
- Image representation via bag of codewords
BoW: Feature detection and representation
Regular grid:
- sample patches at fixed grid positions (dense coverage)
Interest point detector:
- Use state-of-the-art interest point detector
- represent features using SIFT
BoW: Image representation
for each image, we have frequency histogram
with each visual word and the count of how many times it appears
How do we compare BoWs
We can use many methods with histogram frequency data
Eg euclidean distance, normalised scalar product
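The normalised scalar product is cosine similarity between the two frequency histograms; a minimal sketch:

```python
import numpy as np

def normalised_scalar_product(h1, h2):
    """Cosine similarity between two word-frequency histograms:
    1.0 for identical distributions, 0.0 for no shared words."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))
```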
Recognition with BoW histograms
The BoW representation means we can describe
the unordered set of points with a single vector (of fixed dimension across image examples)
this provides an easy way to use the distribution of feature types
what are the two method types for recognition
generative methods
discriminative methods
What is a discriminative method for recognition
Learn a decision boundary / rule (classifier) assigning bag-of-features representations of images to different classes
Zebra/non-zebra
What is a generative method for recognition
Use probability
p(image|zebra)
p(image|no zebra)
look at likelihood values
Example discriminative: knn classification
map each training histogram to a point in feature space, with boundaries between classes
when we have a new image, build its histogram, which maps to a point in the same space
find the k nearest histograms
if the majority vote is positive (eg yes zebra) then positive (negative -> negative)
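The steps above can be sketched directly (using Euclidean distance between histograms; names are illustrative):

```python
import numpy as np

def knn_classify(train_hists, train_labels, query_hist, k=3):
    """Classify a query BoW histogram by majority vote over its
    k nearest training histograms (Euclidean distance)."""
    d = np.linalg.norm(np.asarray(train_hists, dtype=float)
                       - np.asarray(query_hist, dtype=float), axis=1)
    nearest = np.argsort(d)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority vote
```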
nearest neighbour classification pros
Simple to implement
Flexible to feature / distance choices
Naturally handles multi-class cases
Can do well in practice with enough representative data
nearest neighbour classification cons
Large search problem to find nearest neighbours
Storage of data
Must know we have a meaningful distance function
what are some other types of discriminative classifiers
boosting
SVMs
Example generative: the naive bayes model
Assume that each feature is conditionally independent
p(w1, …, wn | c) = ∏i p(wi | c)
want to maximise:
c* = argmax_c p(c) ∏i p(wi | c)
if we know nothing about data assume uniform prior p(c)
What is p(c) in naive bayes
Prior prob. of the object classes
What is p(wi|c) in naive bayes
Likelihood of i-th visual word given the class
Estimated by empirical frequencies of visual words in images from a given class
what is ∏
multiply them together
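A minimal naive Bayes sketch over visual-word lists follows. Two additions beyond the flashcards, both standard practice: Laplace smoothing (so an unseen word does not zero out a class) and working in log space (summing logs instead of multiplying probabilities, for numerical stability). Function names are hypothetical; a uniform prior p(c) is assumed and dropped as a constant:

```python
import math
from collections import Counter

def train_nb(docs_by_class, vocab_size, alpha=1.0):
    """Estimate p(wi|c) from empirical word frequencies per class,
    with Laplace smoothing parameter alpha."""
    model = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        model[c] = {'counts': counts, 'total': sum(counts.values()),
                    'alpha': alpha, 'V': vocab_size}
    return model

def classify_nb(model, words):
    """Return argmax_c sum_i log p(wi|c), assuming a uniform prior."""
    best, best_lp = None, -math.inf
    for c, m in model.items():
        lp = 0.0
        for w in words:
            p = (m['counts'][w] + m['alpha']) / (m['total'] + m['alpha'] * m['V'])
            lp += math.log(p)
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```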
How can we improve spatial information of BoW model
-visual phrases: frequently co-occurring words
-semi-local features: describe configuration, neighbourhood
-let position be part of each feature
BoW pros
- Flexible to geometry / deformations / viewpoint
- Compact summary of image content
- Provides vector representation for sets
- Empirically good recognition results in practice
BoW cons
- Basic model ignores geometry – must verify afterwards, or encode via features.
- Background and foreground mixed when bag covers whole image
- Interest points or sampling: no guarantee to capture object-level parts.
- Optimal vocabulary formation remains unclear.