6. Interest Points Flashcards
What is an interest point?
An interest point is a point in an image that helps us decide whether two images show the same object. A point is interesting if, when we take a small window around it and shift it slightly in any direction, the content inside the window changes noticeably. On the other hand, if we can move the window (in some direction) and almost nothing changes, we say that the point is not interesting.
What are some easy cases for deciding whether two images depict the same object/scene? What are the difficult ones?
Easy: Same viewpoint, but maybe with a different camera, or rotated a bit
Hard: different viewpoint, or out-of-plane rotation where some new information is shown but some is removed.
How to find interest points in an image?
Idea: an interest point is a point that is locally unique. If we move a small window around it in any direction, the window content changes a lot. A point on an edge is not an interest point, because we can slide the window along the edge without much change. Usually we want corners or blobs.
So, if we shift a window by (u, v), we can measure the change using the SSD (sum of squared differences). This can be approximated with a 1st-order Taylor expansion, which only needs the first derivatives (gradients) of the pixels in the window.
Goal: maximize this error, but the error must be large in all directions. We check this via the eigendecomposition of the structure tensor, which is just the (window-summed) outer product of the gradient at each pixel:
H = [ Ix*Ix  Ix*Iy ]
    [ Iy*Ix  Iy*Iy ]
A point is interesting if the smaller eigenvalue is big (lambda_minus is above some threshold)
At the end, we apply non-max suppression to keep only local maxima (so the interest points are well localized).
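A minimal sketch of this pipeline in NumPy/SciPy (the smoothing sigmas, threshold and NMS window size below are illustrative assumptions, not fixed values):

```python
import numpy as np
from scipy import ndimage

def smallest_eigenvalue_response(image, sigma=1.0, window_sigma=2.0):
    """Per-pixel lambda_minus of the structure tensor."""
    smoothed = ndimage.gaussian_filter(image.astype(float), sigma)
    Iy, Ix = np.gradient(smoothed)                    # image gradients
    # Structure tensor entries, accumulated over a Gaussian window.
    Sxx = ndimage.gaussian_filter(Ix * Ix, window_sigma)
    Sxy = ndimage.gaussian_filter(Ix * Iy, window_sigma)
    Syy = ndimage.gaussian_filter(Iy * Iy, window_sigma)
    # Closed-form smaller eigenvalue of the 2x2 symmetric matrix.
    return (Sxx + Syy) / 2.0 - np.sqrt(((Sxx - Syy) / 2.0) ** 2 + Sxy ** 2)

def detect_interest_points(image, threshold=1e-3, nms_size=5):
    """Threshold the response and keep only local maxima (non-max suppression)."""
    response = smallest_eigenvalue_response(image)
    local_max = ndimage.maximum_filter(response, size=nms_size)
    keep = (response == local_max) & (response > threshold)
    return np.argwhere(keep)                          # (row, col) coordinates
```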
What is a Harris operator?
It is an approximation used for interest point detection. Computing eigenvalues and eigenvectors is fairly expensive (a square root is needed). Instead of calculating lambda_minus, we compute a response from det(H) and trace(H), which are cheaper: R = det(H) - k * trace(H)^2, with k typically around 0.04-0.06.
The local maxima land on the same interest points as with lambda_minus; only the response values of the surrounding pixels differ, which does not matter because of the non-max suppression.
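A tiny sketch of this operator, reusing structure-tensor entries like those computed in the sketch above (the value of k is an assumed, commonly used one):

```python
def harris_response(Sxx, Sxy, Syy, k=0.05):
    """Harris operator: det(H) - k * trace(H)^2, no eigendecomposition needed."""
    det = Sxx * Syy - Sxy ** 2
    trace = Sxx + Syy
    return det - k * trace ** 2
```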
Explain the Hessian determinant in interest point detection.
- It is an interest point detector similar to the Harris detector. The difference is that the Harris operator builds its matrix from products of 1st derivatives, while the Hessian determinant uses 2nd derivatives. It is much more focused on blob-like structures.
H = [ Ixx  Ixy ]
    [ Iyx  Iyy ]
To find the points, we calculate det(H) = Ixx*Iyy - Ixy^2. The rest of the process is the same as for Harris:
- threshold it
- non-max suppression (find local maximas)
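A minimal sketch of the determinant-of-Hessian response, using Gaussian derivative filters for the second derivatives (the sigma is an illustrative assumption):

```python
import numpy as np
from scipy import ndimage

def hessian_det_response(image, sigma=2.0):
    """Determinant of the Hessian: Ixx*Iyy - Ixy^2, computed per pixel."""
    image = image.astype(float)
    Ixx = ndimage.gaussian_filter(image, sigma, order=(0, 2))  # d2/dx2
    Iyy = ndimage.gaussian_filter(image, sigma, order=(2, 0))  # d2/dy2
    Ixy = ndimage.gaussian_filter(image, sigma, order=(1, 1))  # d2/dxdy
    return Ixx * Iyy - Ixy ** 2

# Thresholding and non-max suppression then proceed exactly as for Harris.
```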
Give me the sensitivity to invariances for Harris/Hessian detectors.
- image plane rotation: it can deal with it
- translation: it can deal with it, since the computation is applied per pixel, so a shift of the image does not matter
- projective transformation: a problem
- scaling: not invariant, we have to use a modified version of Harris
Is Harris detection rotation invariant? Why?
Yes. Rotating the image applies the same rotation to its gradients (the directions of maximal change in intensity). The eigenvectors simply rotate along with the coordinate system, while the eigenvalues stay the same, so the response is unchanged.
Is Harris detection scale invariant? Why? How can we deal with it?
It is not, because if we have an image and a zoomed-in version of it and we detect the same interest point in both, an 11x11 crop around that point will contain different content in each image (one crop shows detail that the other does not).
We need scale-invariant interest point detectors. One is Harris-Laplace, which also estimates the scale, i.e., how large an area around the interest point we should consider.
Explain the Harris-Laplace interest point detector.
The idea is that points become interest points at different scales (i.e., depending on how much area we consider around them). Together with the interest point, this algorithm also finds the scale (sigma) at which some criterion reaches a local maximum.
We do not actually resize the window; instead we work in scale space with different levels of smoothing, because smoothing (e.g., Gaussian) brings in information from farther away.
The criterion is the Laplacian (trace of the Hessian), Ixx + Iyy. It finds the scale that maximizes the "blobbiness" of the response.
How it works: for a particular point, we evaluate the Laplacian at all sigmas, which gives the "multiscale Harris points". Then we compare each response with its neighbours across scales; if it is a local maximum, we have found the right sigma. The scale is selected according to the Laplacian criterion described above (in its scale-normalized form, sigma^2 * (Ixx + Iyy)).
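A compact sketch of the scale-selection step only (characteristic scale via a scale-normalized Laplacian; the sigma list and the omission of the spatial Harris check are simplifying assumptions):

```python
import numpy as np
from scipy import ndimage

def characteristic_scale(image, y, x, sigmas=(1.0, 1.6, 2.6, 4.1, 6.6)):
    """Return the sigma at which the scale-normalized Laplacian peaks at (y, x)."""
    responses = []
    for s in sigmas:
        lap = ndimage.gaussian_laplace(image.astype(float), s)  # Ixx + Iyy at scale s
        responses.append((s ** 2) * abs(lap[y, x]))              # scale normalization
    best = int(np.argmax(responses))
    # Accept only an interior local maximum across the sampled scales.
    return sigmas[best] if 0 < best < len(sigmas) - 1 else None
```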
What is a scale space?
It is the space obtained by smoothing the same image at different scales: a stack of copies of the image, each blurred with a different smoothing scale (sigma, in the case of Gaussian smoothing).
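A minimal sketch of building such a scale space (the image and sigma values are placeholders):

```python
import numpy as np
from scipy import ndimage

image = np.random.rand(128, 128)          # placeholder image
sigmas = [1.0, 1.6, 2.6, 4.1, 6.6]        # illustrative smoothing scales
# The scale space: the same image blurred with increasing sigma.
scale_space = [ndimage.gaussian_filter(image, s) for s in sigmas]
```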
What are SIFT and SURF?
They are versions of blob interest point detectors. SIFT, similarly to Harris-Laplace, detects local maxima in space and scale. It verifies the interest points with a Hessian-based criterion.
SURF is a more efficient variant of SIFT that relies only on 2nd derivatives.
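A usage sketch with OpenCV (assuming opencv-python >= 4.4, where SIFT is available in the main module as cv2.SIFT_create; the file name is a placeholder):

```python
import cv2

gray = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # placeholder file name
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
# Each keypoint stores its position, detected scale and orientation.
print(len(keypoints), descriptors.shape)                # N keypoints, N x 128 descriptors
```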
Are blob detectors better or worse than corner ones?
For natural images (scenes from nature), blobs work better (corners are rare in nature), while for man-made objects and scenes, corner detectors work better.
How to evaluate the interest points?
The problem is ground truth: we cannot say that a detector MUST find certain points. Instead, we require that it finds the same interest points in altered versions of the same image (with noise, rotation…).
Comparing the original image with its altered version, we measure what percentage of the interest points found in the original are found again by the detector in the altered image.
To also handle zoomed-in (rescaled) versions, we can compare the detected regions using the intersection over union (Jaccard index).
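A minimal sketch of a repeatability check (points are (row, col) arrays; the pixel tolerance and the assumption that the reference points have already been mapped through the known transformation are illustrative):

```python
import numpy as np

def repeatability(points_ref, points_detected, tolerance=2.0):
    """Fraction of mapped reference points re-detected within `tolerance` pixels."""
    if len(points_ref) == 0 or len(points_detected) == 0:
        return 0.0
    # Pairwise distances between mapped reference points and new detections.
    d = np.linalg.norm(points_ref[:, None, :] - points_detected[None, :, :], axis=2)
    found = (d.min(axis=1) <= tolerance).sum()
    return found / len(points_ref)
```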
How to match interest points from two images?
We can use local features/descriptors: image regions cropped around each interest point. We extract them in both images and compare the two descriptors. This can be done by using, for example:
- Image patches: use windows around an interest point
How to find what interest points correspond to one another?
- Find the local descriptors (a region around each interest point) and compare them using some distance metric, e.g. on image patches, filter-bank responses, or SIFT descriptors.
What are important properties of local descriptors in interest point matching?
- Distinctiveness: different points should produce clearly different descriptors.
- Robustness/invariance: the descriptor should change little under lighting changes, noise, rotation and scale.
- Compactness: low dimensionality, so descriptors are cheap to store and compare.
What are image patches and how can they be used to match interest points?
Image patches (a vector of pixels): we compare pixel intensities directly, after some scale and rotation normalization (e.g., from Harris-Laplace). This is very distinctive, but it is highly sensitive to lighting changes, not robust to noise, and has high dimensionality (it does not abstract at all).
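A minimal sketch of patch descriptors and their comparison (the patch size and the mean/std normalization used to reduce lighting sensitivity are illustrative assumptions):

```python
import numpy as np

def extract_patch(image, y, x, size=11):
    """Crop a size x size patch centered on (y, x) and flatten it into a vector."""
    half = size // 2
    patch = image[y - half:y + half + 1, x - half:x + half + 1].astype(float)
    patch = (patch - patch.mean()) / (patch.std() + 1e-8)   # crude photometric normalization
    return patch.ravel()

def patch_distance(p1, p2):
    """Sum of squared differences (SSD) between two patch vectors."""
    return np.sum((p1 - p2) ** 2)
```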
What are filter banks and how can they be used to match interest points?
Instead of using the pixels themselves, we use a subspace representation built from responses to a bank of Gaussian derivative filters (1st, 2nd…), because these carry the most invariance for general images (we do not know in advance whether the feature is a face or something else).
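A minimal sketch of a Gaussian-derivative filter-bank descriptor (the chosen sigmas and derivative orders are illustrative assumptions):

```python
import numpy as np
from scipy import ndimage

def filter_bank_descriptor(image, y, x, sigmas=(1.0, 2.0, 4.0)):
    """Describe point (y, x) by Gaussian derivative responses at several scales."""
    image = image.astype(float)
    responses = []
    for s in sigmas:
        for order in [(0, 1), (1, 0), (0, 2), (2, 0), (1, 1)]:  # Ix, Iy, Ixx, Iyy, Ixy
            responses.append(ndimage.gaussian_filter(image, s, order=order)[y, x])
    return np.array(responses)
```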
What is the idea of SIFT for matching interest points?
Orientation histograms: we take a 16x16 window around the interest point and compute gradient-orientation histograms. We first subdivide the window into a 4x4 grid of 16 cells (each 4x4 pixels) and compute the edge orientation (angle of the gradient minus 90 degrees) for every pixel. Within each of the 16 cells, the orientations are counted into an 8-bin orientation histogram (the axes and the diagonals). Concatenating the 16 histograms gives a 16 * 8 = 128-dimensional descriptor.
We subdivide because a single histogram over the whole 16x16 window would lose all information about where the gradients are: a totally different image could produce the same histogram. Dividing into 16 cells keeps some distinctiveness (though not as much as raw image patches).
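A simplified sketch of this grid-of-histograms descriptor (it omits the Gaussian weighting, interpolation and rotation normalization that full SIFT uses):

```python
import numpy as np

def sift_like_descriptor(image, y, x):
    """128-d descriptor: 4x4 grid of cells, 8-bin orientation histogram per cell."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                                  # gradient angle in [-pi, pi]
    win_mag = mag[y - 8:y + 8, x - 8:x + 8]
    win_ang = ang[y - 8:y + 8, x - 8:x + 8]
    descriptor = []
    for cy in range(4):                                       # 4x4 grid of 4x4-pixel cells
        for cx in range(4):
            cell_ang = win_ang[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            cell_mag = win_mag[cy * 4:(cy + 1) * 4, cx * 4:(cx + 1) * 4]
            hist, _ = np.histogram(cell_ang, bins=8, range=(-np.pi, np.pi),
                                   weights=cell_mag)          # 8 orientation bins
            descriptor.extend(hist)
    descriptor = np.array(descriptor)
    return descriptor / (np.linalg.norm(descriptor) + 1e-8)   # normalize for lighting
```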
What are the properties of SIFT?
- Can handle changes in viewpoint (up to about 60 degree out of plane rotation)
- Can handle significant changes in illumination
- Fast and efficient
What is the idea of shape context for matching interest points?
Features come in different shapes and sizes but can still represent the same object (like the fish example). To tackle this, we use a log-polar histogram. First, we run some edge detector and find all edge points of the image. Then, for each interest point, we histogram the relative positions of the edge points in log-polar bins: points close to the interest point are localized precisely, while points farther away are allowed a higher variance in distance (coarser bins).
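A minimal sketch of a log-polar shape-context histogram (the numbers of radial/angular bins and the radius range are illustrative assumptions):

```python
import numpy as np

def shape_context(edge_points, center, n_r=5, n_theta=12, r_min=1.0, r_max=100.0):
    """Log-polar histogram of edge-point positions relative to `center`."""
    d = np.asarray(edge_points, dtype=float) - np.asarray(center, dtype=float)
    r = np.linalg.norm(d, axis=1)
    theta = np.arctan2(d[:, 0], d[:, 1])
    keep = r > 0                                       # drop the center point itself
    # Logarithmic radial bins: fine near the point, coarse far away.
    r_bins = np.logspace(np.log10(r_min), np.log10(r_max), n_r + 1)
    t_bins = np.linspace(-np.pi, np.pi, n_theta + 1)
    hist, _, _ = np.histogram2d(r[keep], theta[keep], bins=(r_bins, t_bins))
    return hist / (hist.sum() + 1e-8)                  # normalized n_r x n_theta histogram
```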
How do we evaluate the algorithm for matching interest points?
We want to maximise the true positives (correctly matched points) while minimising the false positives (incorrectly matched points). Usually we threshold the distance between matched descriptors, and the chosen threshold affects the performance/evaluation.
We can plot the ROC curve to see the TP/FP rates for all thresholds (it represents the tradeoff between TP and FP), and we want to maximise the AUC metric (area under the curve).
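A minimal sketch of sweeping the distance threshold to get an ROC curve and its AUC (it assumes we already have descriptor distances together with ground-truth correct/incorrect labels for each candidate match):

```python
import numpy as np

def roc_curve_auc(distances, is_correct):
    """ROC points (FPR, TPR) over all distance thresholds, plus the AUC."""
    distances = np.asarray(distances, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)
    tpr, fpr = [], []
    for t in np.sort(np.unique(distances)):
        accepted = distances <= t                      # matches kept at this threshold
        tpr.append((accepted & is_correct).sum() / max(is_correct.sum(), 1))
        fpr.append((accepted & ~is_correct).sum() / max((~is_correct).sum(), 1))
    return np.array(fpr), np.array(tpr), np.trapz(tpr, fpr)
```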
What is a feature matcher?
Descriptor + distance/classifier