7. Single and Two-View Geometry Flashcards

1
Q

What is homography?

A

It is the perspective image of a plane: if we change the angle of the plane we are taking a picture of, the images are related by a homography.

The 3x3 matrix that maps points on a plane in the real world to a general quadrilateral in the image is called the homography matrix.

2
Q

What is single-view geometry?

A

It is taking pictures without moving the camera's position, only changing its viewing angle (the orientation towards the scene plane). A typical application is panorama stitching, where multiple photos are taken and then assembled into one.

3
Q

What is a projection of a planar object (plane)?

A

When we take a photo of some plane (e.g. a building facade) from an angle, lines that we know are parallel in reality (e.g. the rows of windows) no longer appear parallel in the image (perspective distortion). The plane gets projected to some other, non-rectangular shape.

4
Q

Why is homography matrix 3x3?

A

The 3x3 matrix that maps a plane in the real world to a general quadrilateral in the image is called the homography matrix. It has 8 degrees of freedom and is invertible. It is 3x3 and not 3x4 because we assume that the plane we are projecting lies at Z = 0; we may assume this because we can position the camera (using the extrinsic transformation) however we want. With Z = 0, the third column of the projection matrix drops out.
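
A short sketch of why the column drops, using standard pinhole notation (K is the calibration matrix, [R | t] = [r1 r2 r3 | t] the extrinsics; these symbols are not from the card):

    x ~ K [r1 r2 r3 t] (X, Y, 0, 1)^T = K [r1 r2 t] (X, Y, 1)^T = H (X, Y, 1)^T

so H = K [r1 r2 t] is 3x3; it is defined only up to scale, hence 9 - 1 = 8 degrees of freedom.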

5
Q

Why do you need to estimate the homography and how to do it?

A

Only 4 matching interest points are needed to compute the homography matrix, but the problem is false matches (outliers). To mitigate this, we do a robust estimation.

It is done by estimating the homography between two overlapping images and transforming one into the image plane of the other:

  1. We scale and center the interest point coordinates to achieve numerical stability (avoiding equations where some coefficients are around 0 or 1 and others are a few million, as happens with pixel products like x·y).
  2. We pick 4 different point pairs, build the equations and estimate the homography matrix. We then transform all points from the object plane into the image plane (not just those 4) and measure the distance to the corresponding points x' (in non-homogeneous coordinates). Think of this as fitting a line and measuring how far the other points are from it.
  3. We take the inliers (points whose distance falls within some threshold).
  4. We re-estimate the final homography using only the inliers.
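
A minimal Python/NumPy sketch of these four steps (the function names, the iteration count and the inlier threshold in normalized units are illustrative assumptions, not from the card):

    import numpy as np

    def normalize(pts):
        # Step 1: shift to zero mean and scale so coordinates are O(1) (numerical stability).
        mean = pts.mean(axis=0)
        scale = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[scale, 0, -scale * mean[0]],
                      [0, scale, -scale * mean[1]],
                      [0, 0, 1]])
        return np.c_[pts, np.ones(len(pts))] @ T.T, T

    def dlt_homography(p1, p2):
        # Direct Linear Transform: each correspondence gives 2 equations in the 9 entries of H.
        A = []
        for (x, y, _), (u, v, _) in zip(p1, p2):
            A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
            A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
        return np.linalg.svd(np.asarray(A))[2][-1].reshape(3, 3)   # solution with ||h|| = 1

    def ransac_homography(pts1, pts2, iters=1000, thresh=0.01):
        p1, T1 = normalize(pts1)
        p2, T2 = normalize(pts2)
        best = np.zeros(len(pts1), dtype=bool)
        for _ in range(iters):
            idx = np.random.choice(len(pts1), 4, replace=False)
            H = dlt_homography(p1[idx], p2[idx])      # step 2: estimate H from 4 point pairs
            proj = p1 @ H.T
            proj = proj[:, :2] / proj[:, 2:]          # transform all points, non-homogeneous
            err = np.linalg.norm(proj - p2[:, :2], axis=1)
            inliers = err < thresh                    # step 3: threshold in normalized units
            if inliers.sum() > best.sum():
                best = inliers
        H = dlt_homography(p1[best], p2[best])        # step 4: re-estimate H from all inliers
        return np.linalg.inv(T2) @ H @ T1             # undo the normalization
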
6
Q

What is RANSAC?

A

(random sample consensus)

Assuming we know (or guess) the proportion of inliers, we can calculate how many times we have to draw samples from our points to reach some confidence that at least one sample contained only inliers.

Example for line fitting:
- Pick 2 points and compute the fitted line.
- Define some threshold around the fitted line; the inliers are the points within that threshold.
- Re-estimate the line using the inliers.
- Repeat this k times, where k depends on the desired confidence and on the proportion of inliers.
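
A small sketch of the standard sample-count formula (the default confidence p, inlier ratio w and sample size s below are illustrative values):

    import math

    def ransac_iterations(p=0.99, w=0.5, s=2):
        # k such that with probability p at least one of k samples of size s is all inliers,
        # given inlier ratio w:  (1 - w**s)**k = 1 - p  =>  k = log(1-p) / log(1 - w**s)
        return math.ceil(math.log(1 - p) / math.log(1 - w ** s))

    print(ransac_iterations(s=2, w=0.5))   # line fitting: 17 samples for 99% confidence
    print(ransac_iterations(s=4, w=0.5))   # homography (4-point samples): 72 samples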

7
Q

What is stereo?

A

Stereo vision, stereopsis, or simply stereo is the perception or measurement of depth from two projections.

8
Q

What is triangulation?

A

It is 3D reconstruction of a point by ray intersection. There are multiple approaches:
- Geometric method
- Linear method
- Non-linear geometric method

9
Q

Explain the simple geometric triangulation

A

If we have two image points and assume they are projections of the same point in the scene, we can intersect the rays going from each camera's pinhole through its image point (one ray per camera). We also assume we know the intrinsics of both cameras.

Because of noise and numerical errors, the rays do not intersect exactly. What we can compute is the shortest line segment connecting the two rays and take its midpoint; that is our estimate of the 3D point.

Mathematically, we construct two planes, each containing one ray and the direction of the common perpendicular between the rays. Intersecting each plane with the other ray gives one endpoint of the segment, and we average the two endpoints.

It does not generalize nicely to more than 2 views.
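
A minimal Python/NumPy sketch of the midpoint idea, using the equivalent least-squares formulation for the closest points on the two rays (camera centers o1, o2 and ray directions d1, d2 are assumed given; this is not the plane construction described above, but it yields the same midpoint):

    import numpy as np

    def midpoint_triangulation(o1, d1, o2, d2):
        d1, d2 = d1 / np.linalg.norm(d1), d2 / np.linalg.norm(d2)
        # Solve [d1 -d2] [s, t]^T = o2 - o1 in the least-squares sense:
        # the closest points on the two rays are o1 + s*d1 and o2 + t*d2.
        A = np.stack([d1, -d2], axis=1)                     # 3x2 system
        s, t = np.linalg.lstsq(A, o2 - o1, rcond=None)[0]
        p1 = o1 + s * d1                                    # closest point on ray 1
        p2 = o2 + t * d2                                    # closest point on ray 2
        return 0.5 * (p1 + p2)                              # midpoint = 3D estimate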

10
Q

Explain the linear approach in triangulation.

A

We use the 3x4 projection matrix of each camera, which maps the homogeneous 3D point to a 2D image point. With the two observed points from the two image planes we obtain a homogeneous linear equation system, which we solve using the SVD.

Each equation states that a row of the coefficient matrix is orthogonal to the solution vector X (A X = 0), and we fix the norm of the solution vector (the arbitrary scale) to 1 to avoid the trivial solution X = 0.

It generalizes to more than 2 views: just add more equations and find the best point.
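
A minimal sketch of the linear (DLT) method, assuming the 3x4 projection matrices and the non-homogeneous 2D image points are given (the function name is illustrative); more views just add more rows:

    import numpy as np

    def triangulate_linear(P_list, x_list):
        A = []
        for P, (u, v) in zip(P_list, x_list):
            A.append(u * P[2] - P[0])     # each view contributes two equations
            A.append(v * P[2] - P[1])
        _, _, Vt = np.linalg.svd(np.asarray(A))
        X = Vt[-1]                        # solution with ||X|| = 1 (avoids X = 0)
        return X[:3] / X[3]               # back to non-homogeneous 3D coordinates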

11
Q

Explain the non-linear geometric approach in triangulation.

A

The idea is to minimize the squared reprojection error in 2D: instead of some arbitrary error measured in meters or centimeters between 3D points, we minimize the error in pixels.

This error also takes into account the measurement error of the cameras (hardware error, projection error etc.).

To find the optimum we need an iterative, gradient-based optimization technique. We initialize it with the linear estimate (a pretty good estimate, close to the minimum) to avoid getting stuck in local minima.
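
A minimal sketch, using SciPy's non-linear least-squares solver (rather than plain gradient descent) and initializing with the linear estimate; the function names are illustrative:

    import numpy as np
    from scipy.optimize import least_squares

    def reprojection_residuals(X, P_list, x_list):
        res = []
        for P, x in zip(P_list, x_list):
            proj = P @ np.append(X, 1.0)              # project the candidate 3D point
            res.append(proj[:2] / proj[2] - x)        # 2D error in pixels for this view
        return np.concatenate(res)

    def triangulate_nonlinear(P_list, x_list, X0):
        # X0 is the linear (DLT) estimate, used as an initialization close to the minimum.
        return least_squares(reprojection_residuals, X0, args=(P_list, x_list)).x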

12
Q

Explain an overview of epipolar geometry.

A
  • We have two converging cameras and a point P in the world that projects onto both cameras' image planes (as p1 and p2).
13
Q

What is an epipole?

A

It is the image location of the optical center of the other camera. It can lie outside the visible image area.

14
Q

What is the baseline in epipolar geometry?

A

It is the line that connects the optical centers of the two cameras; it intersects each image plane at that camera's epipole. For the epipoles to be finite image points, the cameras have to be converging (not parallel or diverging).

15
Q

What is the epipolar plane?

A

It is the plane that passes through the two camera centers, the epipoles, and some world point P. It also passes through the projections p1 and p2 of P in both images.

16
Q

What are epipolar lines? What is its interesting feature?

A

Given the epipolar plane and the two image planes, the epipolar lines are the intersections of the epipolar plane with the image planes.

An interesting feature: if we have a point p1 in one image plane and we know the same world point is also projected into the other camera, its projection p2 must lie somewhere on the corresponding epipolar line.

17
Q

Are there multiple epipolar lines? If so, how are they related?

A

Yes: two different points in the world give two different epipolar lines. Assuming the cameras are not binocular (parallel), these lines intersect at the epipole.

18
Q

What is the difference between homography and epipolar geometry in terms of mapping of points?

A

With a homography, one point p1 corresponds to exactly one point p2 in the other image (projection of a plane onto an image plane). In epipolar geometry, one point p1 corresponds to a whole line (the epipolar line) in the other camera's image plane.

19
Q

Where are the epipoles in binocular cameras?

A

They are at infinity, which means the epipolar lines are parallel to one another.

20
Q

What is forward motion? What are the epipoles?

A

It is when the viewer moves through the world while the world itself does not move. We can simulate this by taking one image, walking a few steps forward, and taking another image, assuming that nothing in the scene changed. The baseline is then orthogonal to the image planes, which means the epipoles have the same coordinates in both images.

21
Q

What is the point of expansion?

A

When we have forward motion (the viewer/camera moved forward through the world), the epipoles of the cameras (which have the same coordinates in both images) are called the focus of expansion, since while we move forward we have the feeling that the image expands from that point.

22
Q

Explain the epipolar constraint

A

It is the constraint that relates corresponding points p1 and p2 (the same world point seen in two different image planes).

p1 is simply the vector from the camera center O1 to the image point p1 (p1^T is its transpose).
t is the baseline vector, i.e. the translation from center O1 to O2 (choosing O1 as the origin of the world).
p2 is likewise the vector from O2 to the image point p2, but we have to rotate it (by R) so that it is expressed consistently with O1 being the origin of the world.
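
One way to write this out, in the notation of the following cards (with [t]_x denoting the cross-product matrix of t): the three vectors p1, t and R·p2 must be coplanar, so their scalar triple product is zero:

    p1^T (t × R p2) = p1^T [t]_x R p2 = p1^T E p2 = 0,   with E = [t]_x R.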

23
Q

What is the essential matrix? What are its properties?

A

Assuming calibrated cameras (we know the intrinsics of both cameras):

It is the matrix that relates a point p1 in one image plane to the corresponding point p2 in the other image plane (via the epipolar constraint), under the assumption that both are projections of the same point in the 3D world.

Assuming we have many p1-p2 correspondences that we know come from the same world points, we can estimate the essential matrix that is common to all of them.

Properties:
p1^T E p2 = 0
Epipolar lines: l1 = E p2, l2 = E^T p1
Epipoles are the left/right null-spaces of the essential matrix: e1^T E = 0 (equivalently E^T e1 = 0), and E e2 = 0
It has rank 2 (the smallest singular value is 0), and the other two singular values are equal.
5 degrees of freedom (translation + rotation = 3 + 3 = 6, but the scale is arbitrary because of homogeneous coordinates, so 5).
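
A quick numeric sanity check of the rank and singular-value properties (the random rotation and translation are purely illustrative; SciPy's Rotation is only used to build a valid R):

    import numpy as np
    from scipy.spatial.transform import Rotation

    def skew(t):
        # Cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v).
        return np.array([[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]])

    R = Rotation.random().as_matrix()
    t = np.random.randn(3)
    E = skew(t) @ R                        # E = [t]_x R
    print(np.linalg.matrix_rank(E))        # 2
    print(np.linalg.svd(E)[1])             # two equal singular values, third ~ 0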

24
Q

What is the binocular stereo? What is the essential matrix for it?

A

It is the estimation of the depth of a point using two cameras that are parallel to each other (they have the same y and z coordinates but different x coordinates, and point in the same direction).

The rotation between the two cameras is R = I (no rotation, since they point in the same direction), and the translation is t = [-b, 0, 0]^T: the second camera is only shifted left/right (a change of the x-coordinate).

This means that the essential matrix is E = [t]_x R = [t]_x:
[ 0   0   0 ]
[ 0   0   b ]
[ 0  -b   0 ]

If we write out the epipolar constraint, we get b·y1 - b·y2 = 0, which means the y-coordinates of corresponding points have to be equal; this gives parallel (horizontal) epipolar lines.
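
Checking that derivation with the matrix above:

    E · (x2, y2, 1)^T = (0, b, -b·y2)^T, so
    p1^T E p2 = (x1, y1, 1) · (0, b, -b·y2)^T = b·y1 - b·y2 = 0  ⟹  y1 = y2.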

25
Q

What happens when we want stereo using uncalibrated cameras?

A

To derive the essential matrix, we used the intersection of the epipolar plane with the image planes, which works in camera coordinates and therefore requires knowing the intrinsics of the cameras (focal length, image plane geometry etc.). But if we have two cameras that are not calibrated, we cannot use the same formula directly; we also have to take the calibration matrix K into account to go from camera coordinates to image coordinates.

x1 = K1 p1, where x1 are the image coordinates and p1 the camera coordinates, and from the essential matrix we know that p1^T E p2 = 0. Substituting p1 = K1^-1 x1 and p2 = K2^-1 x2 gives x1^T F x2 = 0, where F = K1^-T E K2^-1.

This is called a fundamental matrix.

26
Q

What is the fundamental matrix and what are its properties?

A

The fundamental matrix is the common element relating two image points x1 and x2 that are projections of the same world point P, without knowing the intrinsics of the cameras (the calibration matrices).

x1^T F x2 = 0, where F = K1^-T E K2^-1 and E is the essential matrix [t]_x R.

Epipolar lines:
l1 = F x2
l2 = F^T x1

Epipoles:
e1^T F = 0 (equivalently F^T e1 = 0)
F e2 = 0

It is a singular matrix with rank 2 (the smallest singular value is 0).
7 degrees of freedom (more than the essential matrix): 9 entries, minus 1 for the arbitrary scale, minus 1 for the rank-2 constraint det F = 0.

27
Q

How do we estimate the fundamental matrix?

A

Since it has 7 degrees of freedom, we need at least 7 point pairs with one equation per pair, but the resulting problem is non-linear. A non-linear (gradient-based) method can get stuck in a local minimum, so it should be initialized with the linear approach.

For a linear solution we need 8 point pairs and use the normalized eight-point algorithm.

28
Q

Explain the normalized eight-point algorithm

A

It is an algorithm to estimate the fundamental matrix.

  1. Find interest points in both images and use RANSAC to draw multiple samples of 8 point pairs from the matched interest points (the number of samples depends on the desired confidence). Perform the estimation below on each sample of 8 points.

Take the 8 point pairs and build the equation system Af = 0, where A is the matrix obtained from the homogeneous 2D points x1 and x2, and f is the fundamental matrix written as a vector. We stack the equations and solve with homogeneous least squares: we impose the constraint ||f|| = 1 (to avoid the trivial solution f = 0) and solve using the SVD.

That is the plain eight-point algorithm, but there is a problem:
- The coefficients of A have very different orders of magnitude (from 1 up to a few million). The coordinates therefore first have to be shifted so they are centred around 0 and scaled to roughly [-1, 1] (numerical stability).

Even with the normalized coordinates, the resulting matrix is not necessarily of rank 2 but of full rank 3. We then have to find the closest rank-2 matrix (in the least-squares sense) by taking the SVD of the estimated fundamental matrix and setting its smallest singular value to zero (D_33 = 0).

Finally, we undo the coordinate normalization at the end.
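
A minimal Python/NumPy sketch of the normalized eight-point algorithm (the outer RANSAC loop is omitted; the point arrays pts1, pts2 of shape (N, 2) with N >= 8 and the helper name are assumptions):

    import numpy as np

    def normalize_pts(pts):
        # Centre around 0 and scale so the coordinates are O(1) (numerical stability).
        mean = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - mean, axis=1))
        T = np.array([[s, 0, -s * mean[0]], [0, s, -s * mean[1]], [0, 0, 1]])
        return np.c_[pts, np.ones(len(pts))] @ T.T, T

    def eight_point(pts1, pts2):
        x1, T1 = normalize_pts(pts1)
        x2, T2 = normalize_pts(pts2)
        # One equation x1^T F x2 = 0 per pair, linear in the 9 entries of F (row-major vector f).
        A = np.stack([np.kron(a, b) for a, b in zip(x1, x2)])
        _, _, Vt = np.linalg.svd(A)
        F = Vt[-1].reshape(3, 3)                  # homogeneous least squares with ||f|| = 1
        U, D, Vt = np.linalg.svd(F)               # enforce rank 2: zero the smallest
        F = U @ np.diag([D[0], D[1], 0]) @ Vt     # singular value (closest rank-2 matrix)
        return T1.T @ F @ T2                      # undo the coordinate normalization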

29
Q

Why is eight-point algorithm not enough for reconstruction?

A

Because we do not know the focal length (intrinsics), we cannot get a metrically accurate reconstruction of an object. We can still determine line intersections and coplanarity, but we need either the focal length or some reference from the scene (e.g. knowing that a human is on average 1.75 m tall) as an additional constraint. A third view is also enough.

Even if we have calibrated cameras but no metric reference, we can reconstruct the scene, but we do not know the scale.