Computer Vision Flashcards

1
Q

Perspective projection

A
  • y / Y = f / Z -> y = fY/Z
  • Similarly, x = fX/Z
  • (X,Y,Z) -> (x,y): R³ -> R²
2
Q

Coordinate conversion

A
  • Cartesian to homogeneous
    1) P = (x,y) -> (x,y,1)
    2) P ∈ R² -> P̃ ∈ P²
  • Homogeneous to Cartesian
    1) P̃ = (x̃, ỹ, z̃) -> P = (x,y)

Sidenote: x = x̃/z̃, y = ỹ/z̃

3
Q

Perspective projection equation

A

2D point = [x̃;ỹ;z̃]

Projection matrix = [f 0 0 0; 0 f 0 0; 0 0 1 0]

3D point: [X; Y; Z; 1]

  • 2D point = projection matrix * 3D point
  • x̃ = fX, ỹ = fY, z̃ = Z
  • To convert back to Cartesian, divide by the third coordinate (i.e., x = fX/Z, y = fY/Z)
4
Q

Intrinsic model

A
  • An upgrade from the simple projection model
  • Includes the following properties:
    1) Change of coordinates from meters to pixels
    2) Focal lengths are independent for x and y (defined as fₓ and fᵧ)
    3) A skew, s, is included
    4) Optical centre (cₓ, cᵧ) is included
5
Q

Equation for intrinsic model

A

[fₓ s cₓ 0; 0 fᵧ cᵧ 0; 0 0 1 0] = [1/pₓ 1/pₛ cₓ; 0 1/pᵧ cᵧ; 0 0 1] * [f 0 0 0; 0 f 0 0; 0 0 1 0]

6
Q

Extrinsic model

A
  • Also known as the pose
  • Consists of a 3x1 3D-Translation vector t
  • Consists of a 3x3 3D-Rotation matrix R
7
Q

Overall camera matrix

A
  • Considers both intrinsic and extrinsic calibration parameters together
  • [x̃;ỹ;z̃] = [fₓ s cₓ; 0 fᵧ cᵧ; 0 0 1]*[R3D t]*[X;Y;Z;1]
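
A minimal NumPy sketch of this projection; the K, R, t and world point below are illustrative values only, not from the course:

import numpy as np

# Illustrative intrinsics: focal lengths fx, fy, zero skew, optical centre (cx, cy)
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 820.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)                                   # example rotation
t = np.array([[0.0], [0.0], [1.0]])             # example translation
C = K @ np.hstack([R, t])                       # 3x4 camera matrix K[R t]

P_world = np.array([0.2, -0.1, 4.0, 1.0])       # homogeneous 3D point [X;Y;Z;1]
x_t, y_t, z_t = C @ P_world                     # homogeneous image point [x̃;ỹ;z̃]
x, y = x_t / z_t, y_t / z_t                     # divide by the third coordinate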
8
Q

Lens cameras

A
  • Larger aperture = more light = less in focus (shallower depth of field)
  • Shorter shutter time = less exposure time = less light reaching the sensor
9
Q

Focal length

A
  • Equation: 1/f = 1/zᵢ + 1/z₀
  • f = focal length
  • z₀ = where object is
  • zᵢ = where image is formed
10
Q

A/D

A
  • Analogue to Digital conversion
  • The spatially-continuous image is sampled (by a CCD or CMOS sensor) into a finite grid of pixels
  • Pixel values are then quantized e.g. 0 (black) to 255 (white)
11
Q

Shannon Sampling Theorem

A
  • Used to choose sampling resolution
  • If the signal is bandlimited, then fₛ ≥ 2 * fₘₐₓ, where:
    1) fₛ = sampling rate (2 * fₘₐₓ is the Nyquist rate)
    2) fₘₐₓ = maximum signal frequency
  • This condition is enough to guarantee that the signal can be perfectly (linearly) reconstructed from its samples
  • If the signal is not bandlimited, then use an analogue low-pass filter to clip the frequency content to f_cut ≤ fₛ/2
12
Q

Quantization

A
  • Used to determine bit rate
  • In audio, this would be the number of bits used per sample (CDs are usually 16-bit)
  • In imaging, this would be in terms of coding the value of pixels (images are usually 24-bit)
13
Q

Compressed sensing

A

Steps

1) Regular sampling (namely Shannon)
2) Random sampling
- Used to collect a subset
- Complexity: M = O (K log (N/K))
3) Reconstruction
- Obtain x from y
4) Sparse approximation
- Is typically a non-linear reconstruction
- Should have a prior awareness
- Equation for prior awareness:

14
Q

Compressed sensing hallmarks

A
  • Stable
    1) Acquisition and recovery is numerically stable
  • Asymmetrical
    1) Acquisition (encoding) is cheap; most of the computation is shifted to reconstruction (decoding)
  • Democratic
    1) Each measurement carries the same amount of information
  • Encrypted
    1) Random measurements are encrypted
  • Universal
    1) Same random patterns can be used to sense any sparse signal class
15
Q

Convolution steps

A

1) Flip filter vertically and horizontally
2) Pad image with zeros
3) Slide flipped filter over the image and compute weighted sum
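
A small NumPy sketch of these three steps for a single-channel image (scipy.ndimage.convolve does the same job in practice; this just makes the steps explicit):

import numpy as np

def convolve2d(image, kernel):
    # 1) Flip the filter vertically and horizontally
    k = np.flip(kernel)
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    # 2) Pad the image with zeros
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)), mode='constant')
    # 3) Slide the flipped filter over the image and compute the weighted sum
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i+kh, j:j+kw] * k)
    return out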

16
Q

Convolution distributive property

A
  • Consider an image I
  • Consider two filters g and h
  • Image I is convolved by filters g and h and the results are added together (i.e. Î = (I * g) + (I * h))
  • Distributive property states that ((I * g) + (I * h)) ≡ (I * (g + h))
17
Q

Correlation

A
  • No flipping of kernel
  • No padding image
  • Just slide filter over the image and compute weighted sum
18
Q

Normalized correlation

A

  • Used to measure similarity between a patch and the image at location (x,y)

19
Q

SSD

A
  • Stands for Sum of Squared Differences
  • Calculates the moving difference between the intensities of a patch and an image
  • Given a template image/patch, find where it is located in another image by finding the displacement (x*,y*) which minimises the SSD cost
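
A brute-force NumPy sketch of SSD template matching, assuming a grayscale image and a smaller template (a real implementation would vectorise or use FFTs):

import numpy as np

def ssd_match(image, template):
    th, tw = template.shape
    H, W = image.shape
    best_cost, best_xy = np.inf, (0, 0)
    for y in range(H - th + 1):
        for x in range(W - tw + 1):
            patch = image[y:y+th, x:x+tw].astype(float)
            cost = np.sum((patch - template) ** 2)   # SSD cost at displacement (x, y)
            if cost < best_cost:
                best_cost, best_xy = cost, (x, y)
    return best_xy                                   # (x*, y*) minimising the SSD cost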
20
Q

Low-pass filter

A
  • Smooths (blurs) an image by (weighted/ unweighted) averaging pixels in its neighborhood
  • A Gaussian filter is a common example
21
Q

High-pass filter

A
  • Computes finite (weighted/unweighted) differences between pixels and their neighbors
  • Extracts important low-level information: intensity variations
  • HP filters have zero mean and remove DC component from images
  • Example of HP filter is edge detection
  • Horizontal edge: [-1 -1; 1 1]
  • Vertical edge: [-1 1; -1 1]
22
Q
A
23
Q

Derivative of Gaussian

A
  • A high-pass filter
  • Used to implement nth order image differences
    • For example, with a second order derivative
      • ∂²I/∂x² ≈ f1 ∗ f1 ∗ I
      • ∂²I/∂y² ≈ f2 ∗ f2 ∗ I
24
Q

Laplacian of Gaussian filter

A
  • A high-pass filter
  • Used to implement 2nd order image differences
  • In the equation, ∇² represents the Laplacian operator
  • ∇²f = (∂²f / ∂x²) + (∂²f / ∂y²)
25
Q

Subtraction of filters

A
  • Typically used for sharpening
  • ISharp(x,y) = I(x,y) - α(I ∗ ∇²Gσ(x,y))
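
A short SciPy sketch of this sharpening rule; σ and α are illustrative values:

import numpy as np
from scipy.ndimage import gaussian_laplace

def sharpen(image, sigma=1.0, alpha=0.5):
    # ISharp(x,y) = I(x,y) - alpha * (I convolved with the Laplacian of Gaussian at scale sigma)
    log_response = gaussian_laplace(image.astype(float), sigma)
    return image - alpha * log_response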
26
Q

Gaussian vs mean filtering

A
  • Gaussian blur results in fewer artifacts

27
Q

Anti-aliasing

A
  • Aliasing = appearance of artefacts due to loss of image data
  • To solve, apply a low-pass filter to the image before downsizing it
28
Q

Difference of Gaussians (DoG)

A
  • DoG{I,σ₁,σ₂} = gσ₁ ∗ I - gσ₂ ∗ I = (gσ₁ - gσ₂) ∗ I
  • Extracts “edges” at multiple scales i.e. the information difference between two scales
29
Q

Image resampling (stretching)

A
  • Increase image size
  • Interleave the small image with zeros
  • Apply a low-pass filter
  • This linearly interpolates between the pixels
30
Q

Canny edge detection

A
  • The most widely used edge detector

Steps

1) Compute a robust gradient of the image (e.g. using Gaussian derivative filters)
   • Compute the gradient’s magnitude and direction
2) Apply non-maxima suppression for edge-thinning
   • Keep the largest edge response (i.e. the local maximum in a neighborhood) across the gradient direction, suppress the rest
3) “Hysteresis thresholding” to discard non-connected components
   • Define two thresholds: T-high and T-low
   • ‘Track’ a strong edge from T-high, keep connected edges above T-low, discard the rest

Note: Steps 2 & 3 discard many false positives to give clean edges
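
A hedged SciPy sketch of step 1 (derivative-of-Gaussian gradients); steps 2 and 3 then operate on the returned magnitude and direction, and library routines such as skimage.feature.canny wrap the full pipeline:

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_gradient(image, sigma=1.4):
    I = image.astype(float)
    Ix = gaussian_filter(I, sigma, order=[0, 1])   # smoothed derivative along x (columns)
    Iy = gaussian_filter(I, sigma, order=[1, 0])   # smoothed derivative along y (rows)
    magnitude = np.hypot(Ix, Iy)
    direction = np.arctan2(Iy, Ix)
    return magnitude, direction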
31
Q

Non-maxima suppression

A

Check if pixel is local maximum along gradient direction (in a window); if yes then keep it & discard other candidates

32
Q

Hysteresis thresholding

A
  • Threshold the gradient magnitude with two thresholds: T-high and T-low
  • Strong Edges start at edge locations with gradient magnitude > T-high
  • The values in between may be an edge (values below T-low are not)
  • Continue tracing that edge until gradient magnitude falls below T-low, keeping only connected components in this range
33
Q

Effect of σ

A
  • σ = Gaussian kernel spread/size
  • The choice of σ depends on the desired behavior:
    1) Large σ detects large scale edges
    2) Small σ detects fine features
34
Q

Intensity-based matching process

A
  • Select patches
  • Slide them across the image
  • Score the SSD & report minimizer
35
Q

Good features of intensity-based matching

A

Locality

  • In a “neighbourhood” pixels reveal important structures
  • Shift-invariance (e.g. patch space vs. original image)

Efficiency

  • Feature extraction in real-time is possible

Quantity

  • Instead of O(10⁶) pixels, e.g. O(10³) features to match
  • Search in real-time is possible

Distinctiveness

  • Can differentiate a large database of objects

Building invariances

  • Invariant to image transformations e.g. shift, rotation, scale, deformations, …
  • Robust to clutter, noise and occlusion
36
Q

Gradients behavior in a patch equations

A
37
Q

Gradients behavior in a patch analysis

A
38
Q

PCA

A
  • Stands for Principal Component Analysis

39
Q

Harris corners

A
  • In practice, computing eigenvalues is not very fast
  • Harris instead proposed to approximate the process with a response function R computed from the structure matrix M (commonly R = det(M) - k·trace(M)², with k ≈ 0.04-0.06)

If R scores higher than a threshold, the point is a corner candidate

40
Q

Harris detector process

A
  1. Compute directional derivatives Ix, Iy at each pixel, and the products Ix², Iy², IxIy (smoothed with Gaussian filters).
  2. Compute the Structure Matrix in a window/patch around each pixel
  3. Compute corner response function R
  4. Threshold R
  5. Find local maxima of response function (non-maximum suppression)
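
A compact NumPy/SciPy sketch of these five steps; the response used here is the common R = det(M) - k·trace(M)², and k, σ, the threshold and the NMS window size are illustrative choices:

import numpy as np
from scipy.ndimage import sobel, gaussian_filter, maximum_filter

def harris_corners(image, sigma=1.0, k=0.05, rel_thresh=0.01, nms_size=7):
    I = image.astype(float)
    Ix, Iy = sobel(I, axis=1), sobel(I, axis=0)                 # 1) directional derivatives
    Sxx = gaussian_filter(Ix * Ix, sigma)                       # 2) smoothed structure-matrix entries
    Syy = gaussian_filter(Iy * Iy, sigma)
    Sxy = gaussian_filter(Ix * Iy, sigma)
    R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2           # 3) corner response
    R[R < rel_thresh * R.max()] = 0                             # 4) threshold
    peaks = (R == maximum_filter(R, size=nms_size)) & (R > 0)   # 5) non-maximum suppression
    return np.argwhere(peaks)                                   # (row, col) corner locations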
41
Q

SIFT

A
  • Stands for Scale Invariant Feature Transform
  • Used in object recognition as it’s powerful, especially for well-textured objects
42
Q

SIFT matching process

A
  • Each image has 100 to 1000 SIFT descriptors (instead of 1M pixels)
  • Each is represented by a 128-dimensional vector φ
  • SSD can be used as the measure: ∑_(i=1)^128 (φi(1) – φi(2))²
  • Feature matching = minimizing SSD (a search problem)
  • A fast KD-tree type search can be adopted
  • SIFT feature matching can run in real-time!
43
Q

SIFT fine-tuning

A

Feature matching still returns a set of noisy correspondences

Refinements:

  1. Discard weak matches with SSD(φ(1), φ(2)) > threshold
  2. Keep only clear matches i.e. R << 1
    • R = SSD(φ(1), φ(2)) / SSD(φ(1), φ′(2))
      • φ(2) = best SSD match to φ(1) in the second image
      • φ′(2) = 2nd best SSD match to φ(1) in the second image
      • R gives large values (~1) for ambiguous matches
  3. Discard inconsistent matches; for which we need to know something about the geometry of the scene
44
Q

Applications of SIFT

A

Without geometry

Object recognition

With geometry

  • Building panoramas
    • Extract & match SIFT features
    • Use matched pairs to learn a (homography) mapping which holds for all points in two images
    • Apply the mapping to stitch two images
  • 3D reconstruction
    • Use SIFT to find dense correspondences between multiple images of a scene
    • Use matched tuples (+ non-planar geometry) to estimate 3D shape
45
Q

SIFT vs Harris

A
  • SIFT has scale invariance, Harris doesn’t
  • Both have shift, rotation and illumination invariance
  • SIFT provides descriptors, Harris doesn’t
46
Q

Measuring performance of a feature matcher

A
  • TP (true positive): number of correct matches
  • FN (false negative): number of matches that were not correctly detected
  • FP (false positive): number of proposed matches that are incorrect
  • TN (true negative): number of non-matches that were correctly rejected
  • TP rate = TP/(TP+FN) = TP/P
  • FP rate = FP/(FP+TN) = FP/N
  • Accuracy = (TP+TN)/(N+P)
47
Q

Taylor Expansion of DoG

A

Has three purposes:

1) Achieves sub-pixel accuracy for keypoint localization
2) Reject edges
3) Reject low-contrast keypoints

48
Q

Key point localisation

A

Detect maxima and minima of DoG in scale & space

  • Each point is compared with its 26 neighbouring pixels in the current and adjacent scales
  • Detected if larger/smaller than all of the 26 pixels
  • Extrema must be stronger than a threshold to reject candidates with low contrast
49
Q

Discarding of low-contrast points

A
50
Q

Elimination of edge responses

A
  • For candidates, construct a 2x2 (Hessian) matrix:
  • [Dxx Dxy; Dxy Dyy]
51
Q

Orientation Assignment

A
52
Q

Homography

A
  • Defines a relation between two images of the same planar surface
  • Defined as: [ũ; ṽ; w̃] = [h11 h12 h13; h21 h22 h23; h31 h32 1][x; y; 1]
  • Extra step included: x’ = ũ/w̃, y’ = ṽ/w̃
  • ũ = h11x + h12y + h13
  • ṽ = h21x + h22y + h23
  • w̃ = h31x + h32y + 1
  • h33 = 1 due to scale invariance
  • Degrees of freedom = 8 as a result
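
A tiny NumPy sketch of applying such a mapping to one point; the H values are made up for illustration:

import numpy as np

H = np.array([[1.02, 0.01,  5.0],
              [0.00, 0.98, -3.0],
              [1e-4, 2e-4,  1.0]])      # example homography, h33 = 1

p = np.array([100.0, 50.0, 1.0])        # [x; y; 1]
u_t, v_t, w_t = H @ p                   # [ũ; ṽ; w̃]
x_p, y_p = u_t / w_t, v_t / w_t         # extra step: x' = ũ/w̃, y' = ṽ/w̃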
53
Q

Application of homography

A

Building a panorama

Process

  1. Find correspondences
  2. Reject outliers using RANSAC
  3. Compute the homography using inlier matches
  4. Harmonize both image intensities by gain adjustment
  • Compute the average RGB intensity of each image in the overlapping region
  • Normalize the intensities using the ratio between these averages
54
Q

2D geometric transform

A
  • Result = [u;v;1]
  • Matrix = [α cos(θ), -α sin(θ), tx; α sin(θ), α cos(θ), ty; 0 0 1]
  • Point = [x;y;1]
  • Result = Matrix * Point
  • Result = [αx cos(θ) - αy sin(θ) + tx; αx sin(θ) + αy cos(θ) + ty; 1]
  • α = scale
  • tx, ty = translation along each axis
  • θ = rotation angle
55
Q

Calculating relation between two images

A

ĩ1 = H1[X; Y; 1], ĩ2 = H2[X; Y; 1]

-> ĩ2 = H2H1⁻¹ĩ1, where H = homography

56
Q

LSQ

A

Stands for Least Squares

Used to solve the equation At = b

Find t that minimises ||At - b||²

To solve, form the following normal equations:

1) AᵀAt = Aᵀb
2) t = (AᵀA)⁻¹Aᵀb

LSQ is robust against (moderate) noise but not against outliers

Including outliers in model-fitting (LSQ) leads to a large model error

Outliers are avoided through checking global consistency
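
A minimal NumPy sketch of both routes (normal equations and the built-in least-squares solver), assuming A and b have already been formed:

import numpy as np

A = np.random.randn(20, 3)                       # example over-determined system At = b
b = np.random.randn(20)

t_normal = np.linalg.solve(A.T @ A, A.T @ b)     # normal equations: t = (AᵀA)⁻¹Aᵀb
t_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)  # numerically preferable in practice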

57
Q

RANSAC

A

Known as Random Sample Consensus

Used for robust model fitting

Process

  • Loop
    • Pick a random set of K points, where K is the minimal number of points required to fit the model M
    • Compute (fit) the model M using these points
    • Apply M to the other points & measure the error
    • Label high-error matches as outliers
    • Keep a copy of M each time the number of outliers reduces
  • After S iterations, return M with fewest outliers
  • Recompute M using all of the inliers
58
Q

RANSAC calculation

A

Aim: estimate a model with K points, given the probability that a point is an inlier

  • q = the probability that a point is an inlier
    • q^K = probability of selecting all inliers for a K-point model (at each iteration)
    • 1 - q^K = probability of choosing at least one outlier (at each iteration)
  • P = probability that RANSAC chooses an outlier-free model after S iterations: P = 1 - (1 - q^K)^S

59
Q

Calculating iterations from RANSAC

A

S ≥ log(1 - P) / log(1 - q^K)  (rearranging P = 1 - (1 - q^K)^S from the previous card), where:

S = iterations to be calculated

K = points in model

q = 1 - probability of choosing outlier
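
For instance (illustrative numbers): with q = 0.5, a K = 8 point model (e.g. a fundamental matrix) and a desired success probability P = 0.99, this gives S = log(1 - 0.99) / log(1 - 0.5^8) ≈ 1177 iterations.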

60
Q

Forward model

A
  • Used to generate 2D images from a 3D shape
  • Properties such as texture and reflectance are also defined
61
Q

Camera model

A
  • Explains the formation of images from 3D points (i.e., a linear model in homogeneous coordinates)
    • P = [X, Y, Z], a point in 3D
    • p = [x,y], a corresponding image pixel
  • [x̃; ỹ; z̃] = K[R t] [X;Y;Z;1]
  • Extra step required: x = x̃/z̃, y = ỹ/z̃
62
Q

3D reconstruction

A
  • The inverse to the camera model, where a 3D shape is estimated that would generate the input photographs given the same camera viewpoints, as well as material and illumination
  • Requires more than one view i.e. motion to resolve depth/size ambiguities in “back projection”
  • Process is typically as follows:
    1. Camera calibration
    2. Matching correspondences (e.g. SIFT) between Multiview images
    3. Triangulation
63
Q

Uses of 3D reconstruction

A
  • 3D reconstruction allows the creation of a “digital copy” of a real object
  • This allows the following things:
    • Inspection of object details
    • Measurement of properties
    • Reproduction of an object in a different material
  • Therefore, 3D recon has several applications such as:
    • Preservation of cultural heritage
      • Some companies use 3D recon to provide “digital museums”
    • Computer games
    • Movies
    • City modelling
      • Typically done for historical buildings, in order to help preserve them in the long term
    • E-commerce
64
Q

Camera calibration

A
  • The pre-requisite step for many computer vision tasks, including 3D reconstruction
  • [x̃;ỹ;z̃] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X;Y;Z;1]
  • There are 11 parameters for camera calibration, due to scale invariance (the 3x4 matrix has 12 entries, with the last one fixed to 1)
65
Q

Determining camera calibration

A
  • A number of correspondences, n, between the 3D object and 2D images is needed: {(xi,yi, Xi,Yi,Zi )}(i=1…n)
  • [x̃1 x̃2 … x̃n; ỹ1 ỹ2 … ỹn; z̃1 z̃2 … z̃n] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X1 X2 … Xn; Y1 Y2 … Yn; Z1 Z2 … Zn; 1 1 … 1]
    • This matrix equation is in homogeneous coordinates but the image pixels are in a Cartesian domain
    • Therefore, a division is required to relate X,Y,Z to some x and y
  • xi = x̃i/z̃i and yi = ỹi/z̃i
    • x̃i = C11Xi + C12Yi + C13Zi + C14
    • ỹi = C21Xi + C22Yi + C23Zi + C24
    • z̃i = C31Xi + C32Yi + C33Zi + 1
66
Q

Solving a camera matrix with a linear system

A
  • This involves the factoring of knowns and unknowns and writing it as a linear system
    • System: A*[C11; C12; C13; C14; … C33] = b
  • Each correspondence adds two rows to A: one for the x-coordinate and one for the y-coordinate of the point
    • At least 6 correspondences are needed to compute all 11 unknowns in the camera matrix
    • Number of rows in A = 2n
    • Number of columns in A= 11
    • Number of rows in B = 2n
    • Number of columns in B = 1
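
A hedged NumPy sketch of assembling A and b from n ≥ 6 correspondences and solving for the 11 unknowns; points3d is an n×3 array of (X,Y,Z) and points2d an n×2 array of (x,y) (hypothetical argument names):

import numpy as np

def calibrate_camera(points3d, points2d):
    rows, rhs = [], []
    for (X, Y, Z), (x, y) in zip(points3d, points2d):
        # from x = x̃/z̃ :  C11 X + C12 Y + C13 Z + C14 - x(C31 X + C32 Y + C33 Z) = x
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -x*X, -x*Y, -x*Z]); rhs.append(x)
        # from y = ỹ/z̃ :  C21 X + C22 Y + C23 Z + C24 - y(C31 X + C32 Y + C33 Z) = y
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -y*X, -y*Y, -y*Z]); rhs.append(y)
    A, b = np.array(rows, float), np.array(rhs, float)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)        # the 11 unknowns C11..C33
    return np.append(c, 1.0).reshape(3, 4)           # 3x4 camera matrix with C34 = 1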
67
Q

Finding correspondences

A
  • Steps of usual pipeline
    • Use a calibration rig e.g. checkerboard with known dimensions & geometry
    • This gives a number of known 3D world points.
    • Chess corners are ideal.
    • Find corresponding points (chess corners) in the image using e.g. Harris or SIFT.
    • It is common to use 3 planar objects, such as the inside corner of a cube. Assume a checkerboard corner is the origin of the world coordinate frame
68
Q

Internal and external calibrations

A
  • This is found after computing the 3x4 camera matrix C
  • C can be factorised as K[R t]
  • K = 3x3 intrinsic calibration matrix: [fx s cx; 0 fy cy; 0 0 1]
  • [R t] = 3x4 pose matrix
69
Q

Triangulation

A
  • The estimation of points (shapes) in 3D from their Multiview images through the use of calibrated cameras
  • We need to know the camera matrices and we require robust (outlier-free) matches
70
Q

Triangulation process

A
  1. First, find Multiview correspondences e.g. using SIFT
  2. For each view, back project a ray originating at camera centre and passing through the correspondence pixel
  3. Find intersection of the cross view rays to determine 3D location
71
Q

Solving triangulation using linear systems

A

Given Multiview correspondences {p1,p2,…} in calibrated cameras {C1,C2,…}, reconstruct the corresponding 3D point P = [X, Y, Z, 1] compliant with the projection model in every view
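
A hedged sketch of the standard linear (DLT) triangulation for two views, assuming 3×4 camera matrices C1, C2 and matched pixels p1 = (x1, y1), p2 = (x2, y2):

import numpy as np

def triangulate(C1, C2, p1, p2):
    # Each view contributes two equations: x·(row3·P) - row1·P = 0 and y·(row3·P) - row2·P = 0
    A = np.vstack([p1[0] * C1[2] - C1[0],
                   p1[1] * C1[2] - C1[1],
                   p2[0] * C2[2] - C2[0],
                   p2[1] * C2[2] - C2[1]])
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1]                       # homogeneous solution [X, Y, Z, W]
    return P[:3] / P[3]              # Cartesian 3D point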

72
Q

Structure from motion (SFM)

A

3D reconstruction from uncalibrated cameras:

Given a set of Multiview correspondences in unknown cameras, estimate jointly the 3D locations and camera viewpoints

  • Yields camera pose
  • Estimates 3D world coordinates from 2D images
73
Q

Steps in solving SFM

A
  1. Group similar images
    • Extract SIFT features
    • Identify/group strong overlaps
  2. Alternate between the two steps with an initialization:
    • With the current estimated camera calibrations solve a Triangulation problem to find 3D locations
    • Use 3D locations and their Multiview pixel correspondences to Calibrate Cameras
  3. This will first give a sparse feature-based reconstruction
  4. Next, seed a dense (intensity-based) reconstruction with the sparse results
74
Q

Epipolar constraints

A
  • Outlier correspondences (e.g. similar patches) result in false matches and deteriorate 3D recon
  • Epipolar constraints are introduced to reduce this
75
Q

Epipolar plane

A
  • In non-planar geometry, there are point-to-line correspondences between views, contrasting from the point-to-point correspondences present in planar geometry
  • They are relevant for finding robust correspondences in non-planar geometry
  • They restrict the search space from an entire image to a line, which saves search time and brings robustness to the keypoint matching stage
  • This can be used in non-planar geometry applications, namely depth estimation, Structure from Motion and 3D reconstruction
  • An epipolar plane has all of the following aspects on the same plane:
    1. Camera centres
    2. 3D point
    3. Corresponding image points
    4. A baseline (a line connecting the two camera centers)
    5. Epipoles
      • Projection of Camera 1’s centre (also baseline) to Camera 2’s image
      • Projection of Camera 2’s centre (also baseline) to Camera 1’s image
    6. Epipolar line l’ (plane projection onto Camera 2)
      • Potential matches for x have to lie on the corresponding epipolar line l’.
    7. Epipolar line l (plane projection onto Camera 1)
      • Potential matches for x′ have to lie on the corresponding epipolar line l
    8. Duality
      • All points on line 1 correspond to all points on line 2 (& vice versa)
76
Q

Example of epipolar lines

A
  • Converging views
    • Span the whole 3D space by all epipolar planes:
    • This gives all epipolar lines pairs
    • All epipolar planes intersect at the baseline
    • All epipolar lines converge to the epipoles because the epipolar planes intersect at the baseline, and the baseline’s projection in each view is the epipole
  • To span the whole 3D space, infinitely many epipolar planes need to be considered
    • Since all of these planes share the same baseline, they can be viewed as rotating about the baseline axis to span the 3D space
    • If the 3D space is spanned in this way, and each epipolar plane is projected onto the two image views, we get a set of dual epipolar lines that span the images of each camera
      • They are dual because a point on a given epipolar line in one camera corresponds to the matching epipolar line in the other camera, i.e. it finds its correspondences there
77
Q

Examples of epipolar lines #2

A
  • Motion parallel to image plane
    • This is a typical scenario in stereo vision, parallel views with motion
    • However, there is no convergence due to the views being parallel with motion
    • Each view contains parallel epipolar lines
    • Epipoles are at infinity
78
Q

Examples of epipolar lines #3

A
  • Motion perpendicular to image plane
    • Epipoles have same coordinates in both images
    • Epipolar lines radiate from the epipoles
79
Q

Rejecting outliers

A
  • Geometry tells point-to-line correspondences between views
  • Epipolar lines reduce (constrain) the search space. Useful for feature matching:
    • Rejecting outliers
    • Fast search, which is even faster when the lines are aligned (e.g., stereo rectification)
80
Q

Fundamental Matrix for characterising epipolar lines

A
  • Used to characterise epipolar lines via the equation x′ᵀFx = 0
    • F = a 3x3 fundamental matrix determined by the internal and external calibrations
    • x,x′ = image points (homogeneous coordinates) in camera 1 and 2, respectively
81
Q

Forming the fundamental matrix for known cameras

A
  • Fundamental matrix is related to the internal and external calibrations
    • F = K′⁻ᵀ [t]x R K⁻¹
  • Relative rotation R and translation t between two cameras with Internal calibrations K, K’
    • [t]x = [0 -t3 t2; t3 0 -t1; -t2 t1 0]
  • Epipolar lines can be used to refine correspondences
82
Q

Forming the fundamental matrix from unknown cameras

A
  • It’s possible to jointly estimate the Fundamental matrix and reject outliers when calibrations are missing
    • The first step is to take a data-driven approach to compute the fundamental matrix (through model fitting)
    • Use RANSAC to reject outliers during model-fitting
83
Q

Epipolar constraints on data

A
  • Given: noisy correspondences (x,x’) in two views
  • Unknown: cameras’ internal/external calibrations
  • Goal: compute the 3x3 fundamental matrix F
  • If the points x and x’ are correspondences of each other, then they satisfy the epipolar constraint and the equation x′ᵀFx = 0 holds
84
Q

Using model fitting to find the fundamental matrix

A
  • x & x’ indexed by i indicate a pair of matched points between the views
  • To find F, a number N of such raw correspondences between the two views needs to be established
  • Each of these x & x’ is in homogeneous coordinate form and thus can be expanded as the 3-vector (u, v, 1)
    • [u′ v′ 1][f11 f12 f13; f21 f22 f23; f31 f32 f33][u; v; 1] = 0
  • Every pair must individually satisfy the epipolar constraint, through some matrix F to be determined
  • Every match gives one equation by expanding the epipolar constraint; this is done for each pair
  • Factoring out knowns and unknowns to form a standard linear system gives the form: a tall matrix A times the vectorised form of F equals 0 (Af = 0)
  • This system has one uninteresting trivial solution, F = 0
  • To mitigate this, the scale of F can be fixed
    • This can be done either by assuming one element of F = 1, or (as here) by constraining the norm of F to 1, i.e. F is normalised to disambiguate the scale
    • It is then possible to solve a constrained least squares problem, minimising ||Af|| subject to ||f|| = 1, to find f, the elements of the fundamental matrix
    • For this, at least 8 pairs of correspondences between the views are required
    • F has 9 elements but is scale invariant, so by constraining its norm, a minimum of 8 equations (i.e. correspondences) are required to solve this constrained system and find F
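
A minimal NumPy sketch of this constrained least-squares (eight-point) step, assuming pts1 and pts2 are N×2 arrays of matched (u, v) pixels with N ≥ 8; a practical version would also normalise the pixel coordinates first:

import numpy as np

def fundamental_8point(pts1, pts2):
    u, v = pts1[:, 0], pts1[:, 1]
    up, vp = pts2[:, 0], pts2[:, 1]
    # One epipolar-constraint equation per match: [u'u, u'v, u', v'u, v'v, v', u, v, 1] · f = 0
    A = np.column_stack([up*u, up*v, up, vp*u, vp*v, vp, u, v, np.ones(len(u))])
    _, _, Vt = np.linalg.svd(A)              # minimise ||Af|| subject to ||f|| = 1
    F = Vt[-1].reshape(3, 3)
    # Standard extra step (not part of the constrained LSQ above): enforce rank 2
    U, S, Vt2 = np.linalg.svd(F)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt2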

However:

  • Outliers have large distances (errors) to the true epipolar lines
  • This can deteriorate the outcome of model fitting and lead to an incorrect F
  • RANSAC can be used as a meta-algorithm to fix this
85
Q

Using RANSAC in finding a fundamental matrix

A

Process:

  • Iterate x times:
    • Randomly select noisy matches
    • Compute a fundamental matrix which satisfies the epipolar constraints x′ᵀFx ≈ 0
    • Find the corresponding epipolar lines
    • Count number of inliers, if improved keep current F and iterate next

Process with a non-planar fit

F = eye(3,3); nBest = 0; K = 8;
for i = 1:S                               % S = number of RANSAC iterations
    P2 = SelectRandomSubset(Matches, K);  % pick K = 8 random correspondences
    f = ComputeFundamental(P2);           % fit a candidate F (e.g. 8-point algorithm)
    nInliers = ComputeInliers(f);         % count matches consistent with f
    if (nInliers > nBest)
        F = f;                            % keep the best model so far
        nBest = nInliers;
    end
end

86
Q

Human depth cues: Motion

A

Convergence

When watching an object close to us, our eyes point slightly inward. This difference in the direction of the eyes is called convergence. This depth cue is effective only on short distances (less than 10 meters).

Binocular Parallax

As our eyes see the world from slightly different locations, the images sensed by the eyes are slightly different. This difference in the sensed images is called binocular parallax. The human visual system is very sensitive to these differences, and binocular parallax is the most important depth cue for medium viewing distances. The sense of depth can be achieved using binocular parallax even if all other depth cues are removed.

Monocular Movement Parallax

If we close one of our eyes, we can perceive depth by moving our head. This happens because the human visual system can extract depth information from two similar images sensed one after the other, in the same way it combines two images from different eyes.

87
Q

Human depth cues: Focus

A

Accommodation

Accommodation is the tension of the muscle that changes the focal length of the lens of eye. Thus it brings into focus objects at different distances. This depth cue is quite weak, and it is effective only at very short viewing distances (less than 2 meters) and with other cues.

88
Q

Human depth cues: Image cues

A

Retinal Image Size

When the real size of the object is known, our brain compares the sensed size of the object to this real size, and thus acquires information about the distance of the object.

Perspective

When looking down a straight level road we see the parallel sides of the road meet in the horizon. This effect is often visible in photos and it is an important depth cue. It is called linear perspective. (assumes geometry is known).

Texture Gradient

The closer we are to an object, the more detail we can see of its surface texture. So objects with smooth textures are usually interpreted as being farther away. This is especially true if the surface texture spans all the distance from near to far.

Occlusions

When objects block each other out of our sight, we know that the object that blocks the other one is closer to us. The object whose outline pattern looks more continuous is felt to lie closer.

Aerial Haze

The mountains on the horizon always look slightly blurry or hazy. The reason for this is small water and dust particles in the air between the eye and the far objects. The farther the objects, the hazier they look.

Shades and Shadows

When we know the location of a light source and see objects casting shadows on other objects, we learn that the object shadowing the other is closer to the light source. As most illumination comes downward we tend to resolve ambiguities using this information. Also, bright objects seem to be closer to the observer than dark ones.

89
Q

Depth from stereo

A

Goal

Recover depth by finding image coordinates for x’ that correspond to x

Problems

  1. Intrinsic & extrinsic calibration: recover the relation of the cameras
  2. Correspondence: where do we search for the matching point x’?
90
Q

Depth from disparity

A
  • Disparity is inversely proportional to depth.
  • Disparity = x - x’ = (B • f) / z
    • f = focal length (an intrinsic parameter)
    • B = baseline, the distance between the camera centres (an extrinsic parameter)
  • (x - x’) / (O - O’) = f / z
91
Q

Stereo Image Rectification

A
  • Project image planes onto a common plane parallel to the baseline (the line between camera centers)
  • Two Homographies (3x3 transform), one for each input image projection
  • Pixel motion is horizontal after this transformation
92
Q

Correspondence search

A

Process

Slide a window along the right scanline and compare contents of that window with the reference window in the left image

Challenges

  • Textureless surfaces
  • Occlusions
  • Non-Lambertian surfaces
  • Specularities
  • Dense scenes, unique patterns
93
Q

Effects of window size

A

Smaller window

Pro: More detail

Con: More noise

Larger window

Pro: Smoother disparity maps

Con: Less detail

Con: Fails near boundaries

94
Q

Useful non-local prior: Uniqueness

A

For any point in one image, there should be at most one matching point in the other image

95
Q

Useful non-local prior: Ordering

A

Corresponding points are in the same order in both views

96
Q

Useful non-local prior: Piecewise smoothness

A

Disparity values are expected to change slowly for the most part (except at edges)

Energy minimisation framework

Used to enforce piecewise smoothness

Pros

  • Generality
  • Does not need training data
  • Rather uses geometry knowledge (physics)

Cons

  • Solving optimisation is time-consuming

Process

Typically the energy E(D) = Edata(D) + λ·Esmooth(D) is minimised over the disparity map D, where:

Edata = data fidelity

Esmooth = data smoothness

D = disparity

97
Q

Deep learning for depth estimation

A
  • Deep learning helps to do a task in real-time
    • It prioritises effort during training rather than in testing

Components

  • Convolutional Neural Network (CNN)
    • A CNN that uses epipolar geometry in every layer doesn’t need to learn it from scratch
    • As a result, it uses less training data and achieves accurate performance compared to a purely end-to-end approach
  • The incorporation of epipolar geometry constraints into the depth reconstruction loss reduces the need for training data
    • Etraining = EL-R consistency + Esmooth disparity + Ereconstruction
98
Q

Applications of face detection

A
  • Facial motion capture
    • Tracking facial expressions, motion
  • Photography
    • Auto-focus, intensity adjustment, artistic blur (portrait images)
  • Photo tagging
    • Recognition, social media, security, marketing, etc.
99
Q

Challenges faced by face detection

A
  • Facial expressions
  • Facial appearance
  • Illumination variation
  • Variation in pose
  • Occlusions
  • Variations across individuals

Sidenote: Good algorithms run in real-time and are robust against these challenges

100
Q

Feature extraction

A
  • Example. Given a dataset {x} in 2D
  • Feature: transformation of raw data attributes to a new reduced set of discriminative features
  • Motivations in high dimensional settings: real-time processing, robustness to nuisance attributes, noise, outliers etc

Process

  • Place data on a scatter graph
  • Place a line that represents a projection of high-dimensional space into a lower-dimensional space
  • This line creates two conditional distributions
    • A threshold on the projected value can then be set to separate these distributions
101
Q

Weak classifier

A
  • Weak classifiers can be defined for each boxlet feature
  • Weak classifiers individually have high detection errors
  • They are parameterised by a threshold value and a polarity value (±1) that sets the direction of the comparison against the threshold
  • A problem with such a classifier is that, since each feature is processed individually, it is not enough on its own to correctly detect “face” or “not face”
102
Q

Boxlet filter library

A

Viola Jones uses 3 types of boxlet:

  • 2-rectangle: difference between regions of equal size
  • 3-rectangle: difference between extremes and centre region
  • 4-rectangle: difference between diagonal pairs of rectangles

Considering all possible filter parameters: position, scale, and type: there are at least 160,000 possible features associated with (small) windows

However, a problem is that the library is extremely large and informative combinations have to be subselected as a result

103
Q

AdaBoost

A

Also known as adaptive boosting

Idea

  • Build a complex classifier using weighted combination of a subset (m<<160K ) of weak classifiers
  • Weak classifier may individually have high error rate but the combination will perform well

Each weak classifier is parameterised in the following format: {type, polarity, threshold}

Training the parameters

  1. Given a face & not-face training dataset {(xi, yi)}
    • xi image patch
    • yi ∈ {-1, 1} ‘not face’ or ‘face’
  2. Initialize uniform weights wi for all data samples
    • Select a feature f_t at a time (randomly)
    • Find parameters that minimize the overall misclassification error on the training data, e.g. e = ∑i wi |ht(xi) - yi|
    • Boosting: weight the misclassified samples more strongly at the next stage of learning (the next feature should focus more on them)
      • wi ← wiδ if xi is misclassified (δ > 1)

104
Q

Boosting for face detection

A
  • An m = 200 feature classifier can yield a 95% detection rate and a false positive rate of 1 in 14084
  • However, while a larger m value brings better accuracy, it also brings a slower runtime
105
Q

Attentional cascade

A

Motivation

Improved runtime for a given budget mtotal

Chain of classifiers

  • Start with simple classifiers (i.e. small m) which reject many of the non-face patches (negatives) while detecting almost all face patches (positives)
  • Positive response from the first classifier triggers the evaluation of a second classifier, and so on
  • A negative outcome at any point leads to the immediate rejection of the sub-window (time saved)
  • Keeping the same stagewise TPR, FPR for classifiers 1 to N means that they become progressively more complex
106
Q

Attentional cascade calculation

A
  • TPR₁ x TPR₂ x …. x TPRₙ = y
  • Or FPR₁ x FPR₂ x …. x FPRₙ = y
  • n = number of stages
  • y = end-to-end TPR/FPR
  • Detection rate = end-to-end TPR
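
For example (illustrative numbers): a 10-stage cascade in which every stage keeps TPR = 0.99 and FPR = 0.30 gives an end-to-end detection rate of 0.99^10 ≈ 0.90 and an end-to-end false positive rate of 0.30^10 ≈ 6 × 10⁻⁶.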
107
Q

Training an attentional cascade

A
108
Q

Misclassification rate

A
  • Denoted as a(1 - TPR) + b(FPR)
  • a = chance of face
  • b = chance of not face
  • a = 100% - b and vice versa
109
Q

NMS as a detector

A
  • Gives too many false positives after thresholding
  • Nonetheless, it keeps enough spatial separation between detected objects
110
Q

Segmentation

A
  • Goal: Find boundaries around objects of interest
  • Each object/part might have texture and other fine details, but it’s pivotal to group its pixels together, form a region and draw a contour around it
111
Q

Applications of segmentation

A
  1. Biomedical imaging
    • Measure tissue volume
    • Location and diagnosis of tumours
    • Surgery planning
    • Study of cell structures
  2. Remote sensing, satellite imaging
    • Hyperspectral imagery: measures 100-1K wavelengths (channels) to discriminate materials based on their reflectance properties
  3. Object recognition
    • Fingerprint/iris recognition
    • Driverless cars
    • Pedestrian detection
112
Q

Histogram splitting

A

Process

  • Groups of similar pixels appear as bumps in the histogram
  • Split the histogram at local-minima
  • Label pixels according to which bump they belong to

Pros

  • Works well if the bumps are well separated
  • If two regions have good separation in the means and low variance, then they can be segmented
  • This makes a split threshold easy to define
113
Q

Fisher’s Separability

A
  • Commonly defined for two classes as separability = (µ₁ - µ₂)² / (σ₁² + σ₂²)
  • µ = mean
  • σ = standard deviation (i.e., σ² = variance)
114
Q

Lack of valid threshold

A
  • Thresholds can be hard to find, or may not globally exist.
  • Also, histograms do not account for neighbourhood relationships
115
Q

Colour histograms

A
  • Good for visualising segmentation
  • Makes it possible to perform clustering based on separate histograms
  • However, it’s non-scalable
  • Complexity = b^c
    • b = # bins
    • c = # channels
116
Q

K-Means

A
  • A popular approach for high dimensional data clustering
  • K means finds K centres and associates pixels to these centres (clustering)
  • K-Means is fast & scalable to multichannel images
  • However it needs good initialization
    • Bad initialization can lead to poor local minima
  • Complexity: O(NKd x iter)
    • N = pixels
    • K = clusters
    • d = channels

Goal

  • Cluster multichannel pixels xi ∈ Rd around K centroids
  • Jointly find K clusters and centroids such that clusters have minimal distance to their centres

Process

  1. Initialization: Choose K centroids at random
  2. Pixels are assigned (clustered) to the closest centroid
  3. Move (update) centroids toward the center (mean, median)
  4. Iterate 2&3 until convergence to a defined threshold value
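
A short NumPy sketch of this loop for clustering pixel colours; K, the iteration count and the fixed seed are illustrative, and the convergence check is omitted:

import numpy as np

def kmeans(pixels, K=5, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), K, replace=False)].astype(float)  # 1) random init
    for _ in range(iters):
        dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                    # 2) assign pixels to the closest centroid
        for k in range(K):                               # 3) move centroids to the cluster mean
            members = pixels[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return labels, centroids

# Usage on an H x W x 3 image: labels, _ = kmeans(img.reshape(-1, 3).astype(float))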
117
Q

Connectivity clusters

A
  • Sometimes segments do not cluster around centroids
  • But data in each cluster is well connected
  • Connections are defined by affinity (similarity) between data points (i.e. pixel values)
  • Points in the same cluster are close to at least one other point in that cluster
118
Q

Graph-partitioning

A
  • Idea: graph-partitioning
  • Similar pixels have similar values (e.g. RGB) and are spatially close to each other i.e. neighbors
  • A graph represents an image where the nodes represent the pixels & edges measure affinity (similarity) between pixels
119
Q

Spectral clustering

A
  • Goal: to approximate the Ncut partitioning
  • Application: image clustering
  • Recursively apply spectral clustering for new sub-segments
  • Pros: adds local similarity, great accuracy
  • Cons: memory = O(N²), runtime = O(N³)
120
Q

Challenges with triangulation

A
  • Outlier-free matches do not exist in real world
    • The epipolar geometry can help us to search for the correspondences not in the entire image but only along the epipolar lines
    • This restriction not only saves the search time but also brings robustness to the keypoint matching stage
  • Another challenge to this problem is the occlusion
    • A 3D point may not be visible in small number of views and thus we may not be able to find correspondences for it
    • To mitigate this issue, we can incorporate a larger number of views (possibly from different angles) to make sure the point is visible in cameras and correspondences can be found
121
Q

Image suppression techniques

A
  • Before computing the image gradients, we can first smooth the image using a lowpass filter. The low-pass filter will remove the noise.
  • A better way is to use smoothed high-pass filters to compute the image gradients. The high-pass filters here are smoothed by e.g. a Gaussian blur.
122
Q

Degrees of freedom

A
  • The set of independent displacements that specify completely the displaced or deformed position of the body or system
123
Q

2D to 2D transformations

A
124
Q

3D to 3D transformations

A
125
Q
A