Computer Vision Flashcards
Perspective projection
- y / Y = f / Z -> y = fY/Z
- Similarly, x = fX/Z
- (X,Y,Z) -> (x,y): R³ -> R²
Coordinate conversion
- Cartesian to homogeneous
1) P = (x,y) -> (x,y,1)
2) P ∈ R² -> P̃ ∈ P²
- Homogeneous to Cartesian
1) P̃ = (x̃, ỹ, z̃) -> P = (x,y)
Sidenote: x = x̃/z̃, y = ỹ/z̃
Perspective projection equation
2D point = [x̃;ỹ;z̃]
Projection matrix = [f 0 0 0; 0 f 0 0; 0 0 1 0]
3D point: [X; Y; Z; 1]
- 2D point = projection matrix * 3D point
- x̃ = fX, ỹ = fY, z̃ = Z
- To convert back to cartesian, divide by third coordinate (i.e., x = fX/Z, y = fY/Z)
Intrinsic model
- An upgrade from the simple projection model
- Includes the following properties:
1) Change of coordinates from meters to pixels
2) Focal lengths are independent for x and y (defined as fₓ and fᵧ)
3) A skew s is included
4) Optical centre (cₓ, cᵧ) is included
Equation for intrinsic model
[fₓ s cₓ 0; 0 fᵧ cᵧ 0; 0 0 1 0] = [1/pₓ s cₓ; 0 1/pᵧ cᵧ; 0 0 1] * [f 0 0 0; 0 f 0 0; 0 0 1 0]
- pₓ, pᵧ = pixel sizes (meters per pixel), so fₓ = f/pₓ and fᵧ = f/pᵧ
Extrinsic model
- Also known as the pose
- Consists of a 3x1 3D-Translation vector t
- Consists of a 3x3 3D-Rotation matrix R
Overall camera matrix
- Considers both intrinsic and extrinsic calibration parameters together
- [x̃;ỹ;z̃] = [fₓ s cₓ; 0 fᵧ cᵧ; 0 0 1]*[R3D t]*[X;Y;Z;1]
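Illustrative sketch (not from the notes): projecting a 3D point with the overall camera matrix in NumPy; the intrinsics K, pose (R, t) and the 3D point below are made-up values.

import numpy as np

# Made-up intrinsics: fx, fy, zero skew, optical centre (cx, cy)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 820.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera aligned with the world axes
t = np.zeros((3, 1))                   # camera at the world origin

P_world = np.array([0.5, -0.2, 4.0, 1.0])       # homogeneous 3D point [X, Y, Z, 1]
C = K @ np.hstack([R, t])                       # 3x4 camera matrix
x_tilde = C @ P_world                           # homogeneous 2D point [x~, y~, z~]
x, y = x_tilde[0] / x_tilde[2], x_tilde[1] / x_tilde[2]   # divide by the third coordinate
print(x, y)                                     # pixel coordinates (420.0, 199.0)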
Lens cameras
- Larger aperture = more light = shallower depth of field (less of the scene in focus)
- Shorter shutter speed = shorter exposure time = less light reaching the sensor
Focal length
- Equation: 1/f = 1/zᵢ + 1/z₀
- f = focal length
- z₀ = where object is
- zᵢ = where image is formed
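Worked example (illustrative numbers): for f = 50 mm and an object at z₀ = 2000 mm, 1/zᵢ = 1/50 − 1/2000 = 0.0195 mm⁻¹, so the image forms at zᵢ ≈ 51.3 mm behind the lens.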
A/D
- Analogue to Digital conversion
- Spatially-continuous image is sampled by the sensor (CCD or CMOS) into a finite grid of pixels
- Pixel values are then quantized e.g. 0 (black) to 255 (white)
Shannon Sampling Theorem
- Used to choose sampling resolution
- If signal is bandlimited, then choose fₛ ≥ 2 · fₘₐₓ, where:
1) fₛ = sampling rate (2 · fₘₐₓ is the Nyquist rate)
2) fₘₐₓ = maximum signal frequency
- This condition is enough to guarantee that the signal can be perfectly linearly reconstructed from its samples
- If signal is not bandlimited, then use an analogue low-pass filter to clip the frequency content to f_cut ≤ fₛ/2 before sampling
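Worked example: audible sound extends to fₘₐₓ ≈ 20 kHz, so fₛ must be at least 40 kHz; CD audio samples at 44.1 kHz, which satisfies the theorem with a small margin.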
Quantization
- Used to determine bit rate
- In audio, this would be the number of bits per sample (CDs are usually 16-bit)
- In imaging, this would be in terms of coding the value of pixels (Images are usually 24-bit)
Compressed sensing
Steps
1) Regular sampling (namely Shannon)
2) Random sampling
- Used to collect a subset
- Number of measurements: M = O(K log(N/K))
3) Reconstruction
- Obtain x from y
4) Sparse approximation
- Is typically a non-linear reconstruction
- Should have a prior awareness (e.g. knowing that the signal x is sparse in some basis)
- Equation for prior awareness (a standard sparse-recovery formulation):
x̂ = argmin ‖x‖₁ subject to y = Ax, where A is the random sampling (measurement) matrix
Compressed sensing hallmarks
- Stable
1) Acquisition and recovery are numerically stable
- Asymmetrical
1) Acquisition is simple; most of the computation is moved to the decoder (reconstruction)
- Democratic
1) Each measurement carries the same amount of information
- Encrypted
1) Random measurements are encrypted
- Universal
1) Same random patterns can be used to sense any sparse signal class
Convolution steps
1) Flip filter vertically and horizontally
2) Pad image with zeros
3) Slide flipped filter over the image and compute weighted sum

Convolution distributive property
- Consider an image I
- Consider two filters g and h
- Image I is convolved by filters g and h and the results are added together (i.e. Î = (I * g) + (I * h))
- Distributive property states that ((I * g) + (I * h)) ≡ (I * (g + h))
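A minimal NumPy/SciPy check of the distributive property; the image and the two small filters below are illustrative.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.random((32, 32))                 # stand-in image
g = np.array([[1.0, 2.0, 1.0]]) / 4.0    # small smoothing filter
h = np.array([[-1.0, 0.0, 1.0]])         # small derivative filter

lhs = convolve2d(I, g, mode='same') + convolve2d(I, h, mode='same')   # (I*g) + (I*h)
rhs = convolve2d(I, g + h, mode='same')                               # I*(g + h)
print(np.allclose(lhs, rhs))             # True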
Correlation
- No flipping of kernel
- No padding of the image
- Just slide filter over the image and compute weighted sum

Normalized correlation
- Used to measure similarity between a patch and the image at location (x,y)
- One standard form: NCC(x,y) = Σ (patch − patch mean)·(window − window mean) / (‖patch − patch mean‖ · ‖window − window mean‖), giving a score in [−1, 1]
SSD
- Stands for Sum of Squared Differences
- Computes the sum of squared intensity differences between a patch and the image as the patch slides across it
- Given a template image/patch, find where it is located in another image by finding the displacement (x*,y*) which minimises the SSD cost (see the sketch below)
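A minimal brute-force NumPy sketch of SSD template matching (illustrative; real implementations vectorise the search or use FFT-based correlation).

import numpy as np

def ssd_match(image, template):
    # Slide the template over the image and return the top-left (y, x) of the minimum-SSD position.
    H, W = image.shape
    h, w = template.shape
    best_cost, best_pos = np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            diff = image[y:y + h, x:x + w] - template
            cost = np.sum(diff ** 2)          # SSD cost at this displacement
            if cost < best_cost:
                best_cost, best_pos = cost, (y, x)
    return best_pos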
Low-pass filter
- Smooths (blurs) an image by (weighted/unweighted) averaging pixels in its neighborhood
- The Gaussian filter is the most common example

High-pass filter
- Computes finite (weighted/unweighted) differences between pixels and their neighbors
- Extracts important low-level information: intensity variations
- HP filters have zero mean and remove DC component from images
- Example of HP filter is edge detection
- Horizontal edge: [-1 -1; 1 1]
- Vertical edge: [-1 1; -1 1]
Derivative of Gaussian
- A high-pass filter
- Used to implement nth order image differences
- For example, with a second order derivative:
- ∂²I/∂x² ≈ f₁ ∗ f₁ ∗ I, where f₁ is a (Gaussian-smoothed) first-derivative filter in x
- ∂²I/∂y² ≈ f₂ ∗ f₂ ∗ I, where f₂ is a (Gaussian-smoothed) first-derivative filter in y
Laplacian of Gaussian filter
- A high-pass filter
- Used to implement 2nd order image differences
- In the equation, ∇² represents the Laplacian operator
- ∇²f = (∂²f / ∂x²) + (∂²f / ∂y²)

Subtraction of filters
- Typically used for sharpening
- ISharp(x,y) = I(x,y) - α(I ∗ ∇²Gσ(x,y))
Gaussian vs mean filtering
- Gaussian blur results in fewer artefacts than mean (box) filtering, because the Gaussian kernel falls off smoothly
Anti-aliasing
- Aliasing = appearance of artefacts due to undersampling (loss of high-frequency image data)
- To solve, apply a low-pass filter to an image before downsizing an image
Difference of Gaussians (DoG)
- DoG{I,σ₁,σ₂} = gσ₁ ∗ I - gσ₂ ∗ I = (gσ₁ - gσ₂) ∗ I
- Extracts “edges” at multiple scales i.e. the information difference between two scales
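A minimal SciPy sketch of a DoG response (the σ values and the stand-in image are illustrative).

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
I = rng.random((128, 128))                       # stand-in image
sigma1, sigma2 = 1.0, 2.0                        # two illustrative scales
dog = gaussian_filter(I, sigma1) - gaussian_filter(I, sigma2)   # (g_sigma1 - g_sigma2) * I
# Large |dog| responses mark edges/blobs at scales between sigma1 and sigma2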
Image resampling (stretching)
- Increase image size
- Interleave the small image with zeros
- Apply a low-pass filter
- This linearly interpolates between the pixels
Canny edge detection
- The most widely used edge detector
Steps
1) Compute a robust gradient of the image (i.e. using Gaussian differentiation)
- Compute gradient’s magnitude and direction
2) Apply non-maxima suppression for edge-thinning
- Keep the largest edge (i.e. the local maximum in a neighborhood) across the gradient direction, suppress the rest
3) “Hysteresis thresholding” to discard non-connected components
- Define two thresholds: T-high and T-low
- ‘Track’ a strong edge from T-high, keep connected edges above T-low, discard the rest
- Note: Steps 2 & 3 discard many false positives to get clean edges (a minimal OpenCV sketch follows below)
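A minimal OpenCV sketch (the file name, σ and thresholds are illustrative); cv2.Canny performs the gradient, non-maxima suppression and hysteresis steps internally.

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)   # illustrative input file
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)          # robust gradient via Gaussian smoothing
edges = cv2.Canny(blurred, 50, 150)                   # T-low = 50, T-high = 150 (hysteresis)
cv2.imwrite('edges.png', edges)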
Non-maxima suppression
Check if pixel is local maximum along gradient direction (in a window); if yes then keep it & discard other candidates
Hysteresis thresholding
- Threshold the gradient magnitude with two thresholds: T-high and T-low
- Strong Edges start at edge locations with gradient magnitude > T-high
- Values between T-low and T-high may be edges (values below T-low are not)
- Continue tracing that edge until gradient magnitude falls below T-low, keeping only connected components in this range
Effect of σ
- σ = Gaussian kernel spread/size
- The choice of σ depends on the desired behavior
1) Large σ detects large scale edges
2) Small σ detects fine features
Intensity-based matching process
- Select patches
- Slide them across the image
- Score the SSD & report minimizer
Good features of intensity-based matching
Locality
- In a “neighbourhood” pixels reveal important structures
- Shift-invariance (e.g. patch space vs. original image)
Efficiency
- Feature extraction in real-time is possible
Quantity
- Instead of O(10⁶) pixels, e.g. O(10³) features to match
- Search in real-time is possible
Distinctiveness
- Can differentiate a large database of objects
Building invariances
- Invariant to image transformations e.g. shift, rotation, scale, deformations, …
- Robust to clutter, noise and occlusion
Gradients behavior in a patch equations
- Collect the gradients (Ix, Iy) of all pixels in a patch and form the 2x2 structure matrix M = Σ [Ix² IxIy; IxIy Iy²]
Gradients behavior in a patch analysis
- The eigenvalues λ₁, λ₂ of M summarise the distribution of gradients in the patch:
1) Both small -> flat (textureless) region
2) One large, one small -> edge
3) Both large -> corner
PCA
- Stands for Principal Component Analysis
- Finds the orthogonal directions of maximal variance in the data (eigenvectors of the covariance matrix); here it summarises the dominant gradient directions within a patch
Harris corners
- In practice computing eigenvalues is not very fast
- Harris instead proposed to approximate the process by the corner response function:
- R = det(M) − k · trace(M)², with k typically 0.04–0.06
- If R scores higher than a threshold, the pixel is a corner candidate
Harris detector process
- Compute directional derivatives Ix, Iy and IxIy at each pixel (and smooth them with Gaussian filters).
- Compute the Structure Matrix in a window/patch around each pixel
- Compute corner response function R
- Threshold R
- Find local maxima of response function (non-maximum suppression)
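A minimal NumPy/SciPy sketch of the Harris response (k, σ and the stand-in image are illustrative values).

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

rng = np.random.default_rng(0)
I = rng.random((64, 64))                    # stand-in grayscale image

Ix = sobel(I, axis=1)                       # directional derivatives
Iy = sobel(I, axis=0)
Sxx = gaussian_filter(Ix * Ix, sigma=1.5)   # smoothed structure-matrix entries
Syy = gaussian_filter(Iy * Iy, sigma=1.5)
Sxy = gaussian_filter(Ix * Iy, sigma=1.5)

k = 0.05                                    # typically 0.04-0.06
R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2   # det(M) - k * trace(M)^2
corners = R > 0.01 * R.max()                # threshold, then apply non-maximum suppression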
SIFT
- Stands for Scale Invariant Feature Transform
- Used in object recognition as it’s powerful, especially for well-textured objects
SIFT matching process
- Each image has 100 to 1000 SIFT descriptors (instead of 1M pixels)
- Each is represented by 128 numbers vector φ
- SSD can be used as the measure: ∑ᵢ₌₁¹²⁸ (φᵢ(1) – φᵢ(2))²
- Feature matching = minimizing SSD (search)
- fast KD-tree type search can be adopted
- SIFT feature matching can run in real-time!
SIFT fine-tuning
Feature matching still returns a set of noisy correspondences
Refinements:
- Discard weak matches with SSD(φ(1),φ(2)) > threshold
- Keep only clear matches i.e. R << 1
- R = SSD(φ(1),φ(2))/SSD(φ(1),φ′(2))
- φ(2) = best SSD match to φ(1) in second image
- φ′(2) = 2nd best SSD match to φ(1) in second image
- R gives large values ~ 1 for ambiguous matches
- Discard inconsistent matches; for which we need to know something about the geometry of the scene
Applications of SIFT
Without geometry
Object recognition
With geometry
- Building panoramas
- Extract & match SIFT features
- Use matched pairs to learn a (homography) mapping which holds for all points in two images
- Apply the mapping to stitch two images
- 3D reconstruction
- Use SIFT to find dense correspondences between multiple images of a scene
- Use matched tuples (+ non-planar geometry) to estimate 3D shape
SIFT vs Harris
- SIFT has scale invariance, Harris doesn’t
- Both have shift, rotation and illumination invariance
- SIFT provides descriptors, Harris doesn’t
Measuring performance of a feature matcher
- TP (true positive): number of correct matches
- FN (false negative): number of matches that were not correctly detected
- FP (false positive): number of proposed matches that are incorrect
- TN (true negative): number of non-matches that were correctly rejected
- TP rate = TP/(TP+FN) = TP/P
- FP rate = FP/(FP+TN) = FP/N
- Accuracy = (TP+TN)/(N+P)
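Worked example (illustrative counts): with TP = 80, FN = 20, FP = 10, TN = 90, the TP rate is 80/100 = 0.8, the FP rate is 10/100 = 0.1, and the accuracy is (80 + 90)/200 = 0.85.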
Taylor Expansion of DoG
Has three purposes:
1) Achieves sub-pixel accuracy for keypoint localization
2) Reject edges
3) Reject low-contrast keypoints
Key point localisation
Detect maxima and minima of DoG in scale & space
- Each point is compared with its 26 neighbors in the current and adjacent scales (8 + 9 + 9)
- Detected if larger/smaller than all of the 26 pixels
- Extrema must be stronger than a threshold to reject candidates with low contrast
Discarding of low-contrast points
Elimination of edge responses
- For candidates, construct a 2x2 (Hessian) matrix of the DoG:
- [Dxx Dxy; Dxy Dyy]
- Reject a candidate if the ratio of its principal curvatures is too large, i.e. tr(H)²/det(H) > (r+1)²/r (an edge-like response); Lowe uses r = 10
Orientation Assignment
- Compute gradient magnitudes and orientations around each keypoint (at its detected scale)
- Build a 36-bin orientation histogram weighted by gradient magnitude
- The dominant peak(s) give the keypoint orientation, which makes the descriptor rotation invariant
Homography
- Defines a relation between two images of the same planar surface
- Defined as: [ũ; ṽ; w̃] = [h11 h12 h13; h21 h22 h23; h31 h32 1][x; y; 1]
- Extra step included: x’ = ũ/w̃, y’ = ṽ/w̃
- ũ = h11x + h12y + h13
- ṽ = h21x + h22y + h23
- w̃ = h31x + h32y + 1
- h33 = 1 due to scale invariance
- Degrees of freedom = 8 as a result
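A minimal NumPy sketch of applying a homography to a point (the entries of H and the point are illustrative).

import numpy as np

H = np.array([[1.1,  0.02,  5.0],
              [0.01, 0.95, -3.0],
              [1e-4, 2e-4,  1.0]])          # illustrative homography, h33 = 1

p = np.array([100.0, 50.0, 1.0])            # point [x, y, 1] in the first image
u, v, w = H @ p                             # [u~, v~, w~]
x_prime, y_prime = u / w, v / w             # divide by w~ to get the mapped pixel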
Application of homography
Building a panorama
Process
- Find correspondences
- Reject outliers using RANSAC
- Compute the homography using inlier matches
- Harmonize both image intensities by gain adjustment
- Compute the average RGB intensity of each image in the overlapping region
- Normalize the intensities using the ratio between these averages
2D geometric transform
- Result = [u;v;1]
- Matrix = [α cos(θ), −α sin(θ), tx; α sin(θ), α cos(θ), ty; 0 0 1]
- Point = [x;y;1]
- Result = Matrix * Point
- Result = [αx cos(θ) − αy sin(θ) + tx; αx sin(θ) + αy cos(θ) + ty; 1]
- α = scale
- tx/y = translation in axis
- θ = rotation angle
Calculating relation between two images
ĩ₁ = H₁[X; Y; 1], ĩ₂ = H₂[X; Y; 1]
-> ĩ₂ = H₂H₁⁻¹ĩ₁, where H = homography
LSQ
Stands for Least Squares
Used to solve the equation At = b
Find t that minimises ||At − b||²
To solve, form the following normal equations:
1) AᵀAt = Aᵀb
2) t = (AᵀA)⁻¹Aᵀb
LSQ is robust against (moderate) noise but not against outliers
Including outliers in model-fitting (LSQ) leads to a large model error
Outliers are avoided through checking global consistency
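A minimal NumPy sketch comparing the normal equations with a library solver (A and b are illustrative random data).

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 3))                             # overdetermined system: 20 equations, 3 unknowns
b = rng.random(20)

t_normal = np.linalg.solve(A.T @ A, A.T @ b)        # t = (A^T A)^-1 A^T b
t_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # numerically preferable in practice
print(np.allclose(t_normal, t_lstsq))               # True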
RANSAC
Known as Random Sample Consensus
Used for robust model fitting
Process
- Loop
- Pick a random set of K points, where K represents the minimal number of points required for fitting the model M
- Compute (fit) the model M using these points
- Apply M to the other points & measure the error
- Label high error matches as outliers
- Keep a copy (M) each time the number of outliers reduces
- After S iterations, return M with fewest outliers
- Recompute M using all of the inliers
RANSAC calculation
Aim: estimate a model with K points, given the probability that a point is an inlier
- q = the probability that a point is an inlier
- q^K = probability of selecting all inliers for a K-point model (at each iteration)
- 1 − q^K = probability of choosing at least one outlier (at each iteration)
- P = probability that RANSAC chooses an outlier-free model after S iterations: P = 1 − (1 − q^K)^S
Calculating iterations from RANSAC
S = iterations to be calculated
K = points in model
q = 1 − probability of choosing an outlier (i.e. the probability that a point is an inlier)
Solving P = 1 − (1 − q^K)^S for S gives S ≥ log(1 − P) / log(1 − q^K), rounded up
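Worked example (illustrative numbers): fitting a homography (K = 4) with inlier probability q = 0.5 and target confidence P = 0.99 needs S = log(0.01) / log(1 − 0.5⁴) ≈ 71.4, i.e. 72 iterations.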
Forward model
- Used to generate 2D images from a 3D shape
- Properties such as texture and reflectance are also defined
Camera model
- Explains the formation of images from 3D points (i.e., a linear model in homogeneous coordinates)
- P = [X, Y, Z], a point in 3D
- p = [x,y], a corresponding image pixel
- [x̃; ỹ; z̃] = K[R t] [X;Y;Z;1]
- Extra step required: x = x̃/z̃, y = ỹ/z̃
3D reconstruction
- The inverse to the camera model, where a 3D shape is estimated that would generate the input photographs given the same camera viewpoints, as well as material and illumination
- Requires more than one view i.e. motion to resolve depth/size ambiguities in “back projection”
- Process is typically as follows:
- Camera calibration
- Matching correspondences (e.g. SIFT) between Multiview images
- Triangulation
Uses of 3D reconstruction
- 3D reconstruction allows the creation of a “digital copy” of a real object
- This allows the following things:
- Inspection of object details
- Measurement of properties
- Reproduction of an object in different material
- Therefore, 3D recon has several applications such as:
- Preservation of cultural heritage
- Some companies use 3D recon to provide “digital museums”
- Computer games
- Movies
- City modelling
- Typically done for historical buildings, in order to help preservation of them in the long-term
- E-commerce
Camera calibration
- The pre-requisite step for many computer vision tasks, including 3D reconstruction
- [x̃;ỹ;z̃] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X;Y;Z;1]
- There are 11 parameters for camera calibration, due to scale invariance (the last entry is fixed to 1)
Determining camera calibration
- A number of correspondences, n, between the 3D object and 2D images is needed: {(xi,yi, Xi,Yi,Zi )}(i=1…n)
- [x̃1 x̃2 … x̃n; ỹ1 ỹ2 … ỹn; z̃1 z̃2 … z̃n] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X1 X2 … Xn; Y1 Y2 … Yn; Z1 Z2 … Zn; 1 1 … 1]
- This matrix is in homogeneous coordinates but the image pixels are in a cartesian domain
- Therefore, a division is required to relate X,Y,Z to some x and y
- x = x̃/z̃ and y = ỹ/z̃
- x̃ = C11Xi + C12Yi + C13Zi + C14
- ỹ = C21Xi + C22Yi + C23Zi + C24
- z̃ = C31Xi + C32Yi + C33Zi + 1
Solving a camera matrix with a linear system
- This involves the factoring of knowns and unknowns and writing it as a linear system
- System: A*[C11; C12; C13; C14; … C33] = b
- Each correspondence adds two rows to A: one for the x-coordinate and one for the y-coordinate of the point
- At least 6 correspondences are needed to compute all 11 unknowns in the camera matrix
- Number of rows in A = 2n
- Number of columns in A= 11
- Number of rows in B = 2n
- Number of columns in B = 1
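A minimal NumPy sketch of building and solving this 2n x 11 system; the function name is illustrative and the correspondences are assumed to be supplied as arrays.

import numpy as np

def calibrate(points_3d, points_2d):
    # points_3d: (n, 3) world points, points_2d: (n, 2) pixels, n >= 6
    rows_A, rows_b = [], []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        # From x = (C11 X + C12 Y + C13 Z + C14) / (C31 X + C32 Y + C33 Z + 1), and similarly for y
        rows_A.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z])
        rows_b.append(x)
        rows_A.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z])
        rows_b.append(y)
    A, b = np.array(rows_A), np.array(rows_b)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)        # the 11 unknowns C11..C33
    return np.append(c, 1.0).reshape(3, 4)           # C34 fixed to 1

With n ≥ 6 well-spread, non-coplanar points this recovers the camera matrix up to the fixed last entry.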
Finding correspondences
- Steps of usual pipeline
- Use a calibration rig e.g. checkerboard with known dimensions & geometry
- This gives a number of known 3D world points.
- Chess corners are ideal.
- Find corresponding points (chess corners) in the image using e.g. Harris or SIFT.
- It is common to use 3 planar objects, e.g. the inside corner of a cube; a checkerboard corner is assumed to be the centre (origin) of the world
Internal and external calibrations
- This is found after computing the 3x4 camera matrix C
- C can be factorised as K[R t]
- K = 3x3 intrinsic calibration matrix: [fx s cx; 0 fy cy; 0 0 1]
- [R t] = 3x4 pose matrix
Triangulation
- The estimation of points (shapes) in 3D from their Multiview images through the use of calibrated cameras
- We need to know the camera matrices and we require robust (outlier-free) matches
Triangulation process
- First, find Multiview correspondences e.g. using SIFT
- For each view, back project a ray originating at camera centre and passing through the correspondence pixel
- Find intersection of the cross view rays to determine 3D location
Solving triangulation using linear systems
Given Multiview correspondences {p1,p2,…} in calibrated cameras {C1,C2,…}, reconstruct the corresponding 3D point P = [X, Y, Z, 1] compliant with the projection model in every views
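A minimal NumPy sketch of linear (DLT) triangulation from two views; C1, C2 are the 3x4 camera matrices and p1, p2 the matched pixels (all placeholders supplied by the caller).

import numpy as np

def triangulate(C1, C2, p1, p2):
    # Each view contributes two equations of the form x * (row3 . P) - (row1 . P) = 0
    A = np.vstack([
        p1[0] * C1[2] - C1[0],
        p1[1] * C1[2] - C1[1],
        p2[0] * C2[2] - C2[0],
        p2[1] * C2[2] - C2[1],
    ])
    _, _, Vt = np.linalg.svd(A)          # homogeneous least squares: smallest right singular vector
    P = Vt[-1]
    return P[:3] / P[3]                  # back to Cartesian [X, Y, Z]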
Structure from motion (SFM)
3D reconstruction from uncalibrated cameras:
Given a set of Multiview correspondences in unknown cameras, estimate jointly the 3D locations and camera viewpoints
- Yields camera pose
- Estimates 3D world coordinates from 2D images
Steps in solving SFM
- Group similar images
- Extract SIFT features
- Identify/group strong overlaps
- Alternate between the two steps with an initialization:
- With the current estimated camera calibrations solve a Triangulation problem to find 3D locations
- Use 3D locations and their Multiview pixel correspondences to Calibrate Cameras
- This will first give a sparse feature-based reconstruction
- Next, seed a dense (intensity-based) reconstruction with the sparse results
Epipolar constraints
- Outlier correspondences (e.g. similar patches) result in false matches and deteriorate 3D recon
- Epipolar constraints are introduced to reduce this
Epipolar plane
- In non-planar geometry, there are point-to-line correspondences between views, in contrast to the point-to-point correspondences present in planar geometry
- They are relevant for finding robust correspondences in non-planar geometry
- They restrict the search space from an entire image to a line, which saves search time and brings robustness to the keypoint matching stage
- This is used in non-planar geometry applications, namely depth estimation, Structure from Motion and 3D reconstruction
- An epipolar plane has all of the following on the same plane:
- Camera centres
- 3D point
- Corresponding image points
- A baseline (a line connecting the two camera centers)
- Epipoles
- Projection of Camera 1’s centre (also baseline) to Camera 2’s image
- Projection of Camera 2’s centre (also baseline) to Camera 1’s image
- Epipolar line l’ (plane projection onto Camera 2)
- Potential matches for x have to lie on the corresponding epipolar line l’.
- Epipolar line l (plane projection onto Camera 1)
- Potential matches for x′ have to lie on the corresponding epipolar line l
- Duality
- All points on line 1 correspond to all points on line 2 (& vice versa)
Example of epipolar lines
- Converging views
- Span the whole 3D space by all epipolar planes:
- This gives all epipolar lines pairs
- All epipolar planes intersect at the baseline
- All epipolar lines converge to the epipoles, because all epipolar planes intersect at the baseline, and the baseline’s projection in each view is the epipole
- To span the whole 3D space, infinitely many epipolar planes need to be considered
- Since all of these planes share the same baseline, they can be viewed as rotating about the baseline axis to span the 3D space
- If the 3D space is spanned in this way, and each epipolar plane is projected to the two image views, we get a set of dual epipolar lines that span the images of each camera
- They are dual because a point on an epipolar line in one camera corresponds to the matching epipolar line in the other camera, i.e. its correspondence is found there
Examples of epipolar lines #2
- Motion parallel to image plane
- This is a typical scenario in stereo vision, parallel views with motion
- However, there is no convergence due to the views being parallel with motion
- Each view contains parallel epipolar lines
- Epipoles are at infinity
Examples of epipolar lines #3
- Motion perpendicular to image plane
- Epipoles have same coordinates in both images
- Epipolar lines radiate from the epipoles
Rejecting outliers
- Geometry tells point-to-line correspondences between views
- Epipolar lines reduce (constraint) the search space. Useful for feature matching:
- Rejecting outliers
- Fast search, which is even faster when the lines are aligned (e.g., stereo rectification)
Fundamental Matrix for characterising epipolar lines
- Used to characterise epipolar lines as an equation x′ᵀ F x = 0
- F = a 3x3 fundamental matrix determined by the internal and external calibrations
- x,x′ = image points (homogeneous coordinates) in camera 1 and 2, respectively
Forming the fundamental matrix for known cameras
- Fundamental matrix is related to the internal and external calibrations
- F = K′⁻ᵀ [t]ₓ R K⁻¹
- R, t = relative rotation and translation between the two cameras, with internal calibrations K and K′
- [t]x = [0 -t3 t2; t3 0 -t1; -t2 t1 0]
- Epipolar lines can be used to refine correspondences
Forming the fundamental matrix from unknown cameras
- It’s possible to jointly estimate the Fundamental matrix and reject outliers when calibrations are missing
- The first step is to take a data-driven approach to compute the fundamental matrix (through model fitting)
- Use RANSAC to reject outliers during model-fitting
Epipolar constraints on data
- Given: noisy correspondences (x,x’) in two views
- Unknown: cameras’ internal/external calibrations
- Goal: compute the 3x3 fundamental matrix F
- If the points x and x’ are correspondences of each other, then they satisfy the epipolar constraint and the equation x′ᵀFx = 0 holds
Using model fitting to find the fundamental matrix
- x & x’ indexed by i indicate a pair of matched points between the views
- For finding F, a number N of such raw correspondences between two views needs to be established
- Each of these x & x’ is in homogeneous coordinate form and can thus be written as the 3-vector [u, v, 1]
- [u′ v′ 1][f11 f12 f13; f21 f22 f23; f31 f32 f33][u; v; 1] = 0
- Every pair must individually satisfy the epipolar constraint, through some matrix F to be determined
- Every match gives one equation by expanding the epipolar constraint; this is done for each pair
- Factoring out knowns and unknowns into a standard linear system gives A f = 0, where A is a tall matrix and f is the vectorised form of F
- This system has one uninteresting trivial solution, F = 0
- To avoid it, the scale of F can be fixed
- This can be done either by assuming one element of F equals 1, or (as here) by constraining the norm of F to 1, i.e. F is normalised to disambiguate the scale
- Therefore it is possible to solve a constrained least squares problem, minimising ‖Af‖ subject to ‖f‖ = 1, to find f, the elements of the fundamental matrix
- For this, at least 8 pairs of correspondences between the views are required
- F has 9 elements but is scale invariant, so with its norm constrained a minimum of 8 equations (correspondences) are needed to solve this constrained system and find F
However:
- Outliers have large distances (error) to the true epipolar lines
- This can deteriorate the outcome of model fitting to find correct F
- RANSAC can be used as a meta-algorithm to fix this (a minimal sketch of the 8-point estimate follows below)
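A minimal NumPy sketch of the 8-point estimate of F from N ≥ 8 raw matches (coordinate normalisation, which improves conditioning, and the RANSAC wrapper described next are omitted).

import numpy as np

def eight_point(pts1, pts2):
    # pts1, pts2: (N, 2) matched pixel coordinates in image 1 and image 2, N >= 8
    u, v = pts1[:, 0], pts1[:, 1]
    up, vp = pts2[:, 0], pts2[:, 1]
    # Each match gives one row of A such that A f = 0, f = vectorised F
    A = np.column_stack([up * u, up * v, up,
                         vp * u, vp * v, vp,
                         u, v, np.ones_like(u)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)             # minimiser of ||A f|| subject to ||f|| = 1
    U, S, Vt2 = np.linalg.svd(F)         # enforce rank 2 (fundamental matrices are singular)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt2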
Using RANSAC in finding a fundamental matrix
Process:
- Iterate S times:
- Randomly select a minimal subset of the noisy matches
- Compute a fundamental matrix which satisfies the epipolar constraints x′ᵀ F x ≈ 0
- Find the corresponding epipolar lines
- Count number of inliers, if improved keep current F and iterate next
Process with a non-planar fit
F = eye(3,3); nBest = 0; K = 8;            % K = minimal number of matches to fit F
for i = 1:S                                % S = number of RANSAC iterations
    P2 = SelectRandomSubset(Matches, K);   % random minimal sample of matches
    f = ComputeFundamental(P2);            % fit F from the sample (8-point)
    nInliers = ComputeInliers(f, Matches); % count matches consistent with f
    if (nInliers > nBest)                  % keep the best model found so far
        F = f;
        nBest = nInliers;
    end
end
Human depth cues: Motion
Convergence
When watching an object close to us, our eyes point slightly inward. This difference in the direction of the eyes is called convergence. This depth cue is effective only on short distances (less than 10 meters).
Binocular Parallax
As our eyes see the world from slightly different locations, the images sensed by the eyes are slightly different. This difference in the sensed images is called binocular parallax. Human visual system is very sensitive to these differences, and binocular parallax is the most important depth cue for medium viewing distances. The sense of depth can be achieved using binocular parallax even if all other depth cues are removed.
Monocular Movement Parallax
If we close one of our eyes, we can perceive depth by moving our head. This happens because human visual system can extract depth information in two similar images sensed after each other, in the same way it can combine two images from different eyes.
Human depth cues: Focus
Accommodation
Accommodation is the tension of the muscle that changes the focal length of the lens of eye. Thus it brings into focus objects at different distances. This depth cue is quite weak, and it is effective only at very short viewing distances (less than 2 meters) and with other cues.
Human depth cues: Image cues
Retinal Image Size
When the real size of the object is known, our brain compares the sensed size of the object to this real size, and thus acquires information about the distance of the object.
Perspective
When looking down a straight level road we see the parallel sides of the road meet in the horizon. This effect is often visible in photos and it is an important depth cue. It is called linear perspective. (assumes geometry is known).
Texture Gradient
The closer we are to an object the more detail we can see of its surface texture. So objects with smooth textures are usually interpreted being farther away. This is especially true if the surface texture spans all the distance from near to far.
Occlusions
When objects block each other out of our sight, we know that the object that blocks the other one is closer to us. The object whose outline pattern looks more continuous is felt to lie closer.
Aerial Haze
The mountains in the horizon look always slightly blurry or hazy. The reason for this are small water and dust particles in the air between the eye and the far objects. The farther the objects, the hazier they look.
Shades and Shadows
When we know the location of a light source and see objects casting shadows on other objects, we learn that the object shadowing the other is closer to the light source. As most illumination comes downward we tend to resolve ambiguities using this information. Also, bright objects seem to be closer to the observer than dark ones.
Depth from stereo
Goal
Recover depth by finding image coordinates for x’ that correspond to x
Problems
- Intrinsic & extrinsic calibration: recover the relation of the cameras
- Correspondence: where do we search for the matching point x’?
Depth from disparity
- Disparity is inversely proportional to depth.
- Disparity = x - x’ = (B • f) / z
- f = focal length (intrinsic parameter)
- B = baseline, the distance between the camera centres O and O’ (extrinsic parameter)
- By similar triangles: (x - x’) / B = f / z
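Worked example (illustrative numbers): with baseline B = 0.1 m and focal length f = 700 pixels, a disparity of 35 pixels gives z = (0.1 × 700) / 35 = 2 m; halving the depth to 1 m doubles the disparity to 70 pixels.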
Stereo Image Rectification
- Project image planes onto a common plane parallel to the baseline (the line between camera centers)
- Two Homographies (3x3 transform), one for each input image projection
- Pixel motion is horizontal after this transformation
Correspondence search
Process
Slide a window along the right scanline and compare contents of that window with the reference window in the left image
Challenges
- Textureless surfaces
- Occlusions
- Non-Lambertian surfaces
- Specularities
- Repetitive (non-unique) patterns in dense scenes
Effects of window size
Smaller window
Pro: More detail
Con: More noise
Larger window
Pro: Smoother disparity maps
Con: Less detail
Con: Fails near boundaries
Useful non-local prior: Uniqueness
For any point in one image, there should be at most one matching point in the other image
Useful non-local prior: Ordering
Corresponding points are in the same order in both views
Useful non-local prior: Piecewise smoothness
Disparity values are expected to change slowly for the most parts (except for the edges)
Energy minimisation framework
Used to enforce piecewise smoothness
Pros
- Generality
- Does not need training data
- Rather uses geometry knowledge (physics)
Cons
- Solving optimisation is time-consuming
Process
- Minimise E(D) = E_data(D) + λ · E_smooth(D) over the disparity map D
- E_data = data fidelity (agreement with the observed left/right intensities)
- E_smooth = smoothness of the disparity map
- D = disparity
Deep learning for depth estimation
- Deep learning helps to do a task in real-time
- It prioritises effort during training rather than in testing
Components
- Convolutional Neural Network (CNN)
- A CNN that uses epipolar geometry in every layer doesn’t need to learn it from scratch
- As a result, it needs less training data and achieves accurate performance compared to purely end-to-end approaches
- Incorporating epipolar geometry constraints into the depth reconstruction loss reduces the need for training data
- E_training = E_L-R consistency + E_smooth disparity + E_reconstruction
Applications of face detection
- Facial motion capture
- Tracking facial expressions, motion
- Photography
- Auto-focus, intensity adjustment, artistic blur (portrait images)
- Photo tagging
- Recognition, social media, security, marketing, etc.
Challenges faced by face detection
- Facial expressions
- Facial appearance
- Illumination variation
- Variation in pose
- Occlusions
- Variations across individuals
Sidenote: Good algorithms run in real-time and are robust against these challenges
Feature extraction
- Example. Given a dataset {x} in 2D
- Feature: transformation of raw data attributes to a new reduced set of discriminative features
- Motivations in high dimensional settings: real-time processing, robustness to nuisance attributes, noise, outliers etc
Process
- Place data on a scatter graph
- Place a line that represents a projection of high-dimensional space into a lower-dimensional space
- This line creates two conditional distributions
- A threshold on this projection then separates the two conditional distributions
Weak classifier
- Weak classifiers can be defined for each boxlet feature
- Weak classifiers individually have high detection errors
- They are parameterised by a threshold value and a polarity value (±1) giving the direction of the inequality
- A problem with such a classifier is that, because each feature is processed individually, it is not enough on its own to correctly decide “face” or “not face”
Boxlet filter library
Viola Jones uses 3 types of boxlet:
- 2-rectangle: difference between regions of equal size
- 3-rectangle: difference between extremes and centre region
- 4-rectangle: difference between diagonal pairs of rectangles
Considering all possible filter parameters: position, scale, and type: there are at least 160,000 possible features associated with (small) windows
However, a problem is that the library is extremely large and informative combinations have to be subselected as a result
AdaBoost
Also known as adaptive boosting
Idea
- Build a complex classifier using weighted combination of a subset (m<<160K ) of weak classifiers
- Weak classifier may individually have high error rate but the combination will perform well
Each weak classifier is parameterised in the following format: {type, polarity, threshold}
Training the parameters
- Given a face & not-face training dataset {(xi, yi)}
- xi image patch
- yi ∈ {-1, 1} ‘not face’ or ‘face’
- Initialize uniform weights wᵢ for all data samples
- Select a feature fₜ at a time (randomly)
- Find parameters that minimise the overall misclassification error on the training data, e.g. e = ∑ᵢ wᵢ |hₜ(xᵢ) − yᵢ|
- Boosting: weight the misclassified samples more strongly at the next stage of learning (the next feature should focus more on them)
- wᵢ ← wᵢ · δ if xᵢ is misclassified (δ > 1)
Boosting for face detection
- An m = 200 feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084
- However, while a larger m value brings better accuracy, it also brings a slower runtime
Attentional cascade
Motivation
Improved runtime for a given budget mtotal
Chain of classifiers
- Start with simple classifiers (i.e. small m) which reject many of the non-face patches (negatives) while detecting almost all face patches (positives)
- Positive response from the first classifier triggers the evaluation of a second classifier, and so on
- A negative outcome at any point leads to the immediate rejection of the sub-window (time saved)
- Keeping the same stagewise TPR, FPR for classifiers 1 to N means that they become progressively more complex
Attentional cascade calculation
- TPR₁ x TPR₂ x …. x TPRₙ = y
- Or FPR₁ x FPR₂ x …. x FPRₙ = y
- n = number of stages
- y = end-to-end TPR/FPR
- Detection rate = end-to-end TPR
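Worked example (illustrative rates): a 10-stage cascade in which every stage has TPR = 0.99 and FPR = 0.30 gives an end-to-end detection rate of 0.99¹⁰ ≈ 0.90 and an end-to-end false positive rate of 0.30¹⁰ ≈ 6 × 10⁻⁶.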
Training an attentional cascade
Misclassification rate
- Denoted as a(1 - TPR) + b(FPR)
- a = chance of face
- b = chance of not face
- a + b = 1, i.e. a = 1 − b and vice versa
NMS as a detector
- Thresholding the response alone gives too many overlapping false positives
- NMS (non-maximum suppression) keeps only local maxima, which maintains spatial separation between detected objects
Segmentation
- Goal: Find boundaries around objects of interest
- Each object/part might have texture and other fine details, but it’s pivotal to group its pixels together, form a region and draw a contour around it
Applications of segmentation
- Biomedical imaging
- Measure tissue volume
- Location and diagnosis of tumours
- Surgery planning
- Study of cell structures
- Remote sensing, satellite imaging
- Hyperspectral imagery: measures 100-1K wavelengths (channels) to discriminate materials based on their reflectance properties
- Object recognition
- Fingerprint/iris recognition
- Driverless cars
- Pedestrian detection
Histogram splitting
Process
- Groups of similar pixels appear as bumps in the histogram
- Split the histogram at local-minima
- Label pixels according to which bump they belong to
Pros
- Works well if the bumps are well separated
- If two regions have good separation in their means and low variance, then they can be segmented
- This makes a threshold easy to define
Fisher’s Separability
- Measures how well two clusters separate: separability = (µ₁ − µ₂)² / (σ₁² + σ₂²)
- µ = mean
- σ = standard deviation (i.e., σ² = variance)
Lack of valid threshold
- Thresholds can be hard to find, or may not globally exist.
- Also, histograms do not account for neighbourhood relationships
Colour histograms
- Good for visualising segmentation
- Makes it possible to perform clustering based on separate histograms
- However, it’s non-scalable
- Complexity = bᶜ (one bin per colour combination)
- b = # bins per channel
- c = # channels
K-Means
- A popular approach for high dimensional data clustering
- K means finds K centres and associates pixels to these centres (clustering)
- K-Means is fast & scalable to multichannel images
- However it needs good initialization
- Bad initialization can lead to poor local minima
- Complexity: O(NKd x iter)
- N = pixels
- K = clusters
- d = channels
Goal
- Cluster multichannel pixels xi ∈ Rd around K centroids
- Jointly find K clusters and centroids such that clusters have minimal distance to their centres
Process
- Initialization: Choose K centroids at random
- Pixels are assigned (clustered) to the closest centroid
- Move (update) centroids toward the center (mean, median)
- Iterate steps 2 & 3 until the change falls below a defined threshold (see the sketch below)
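A minimal NumPy sketch of the K-Means loop on multichannel pixels (K, the iteration count and the random data are illustrative; convergence checks and empty-cluster handling are omitted).

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5000, 3))                                  # N pixels with d = 3 channels (e.g. RGB)
K = 4
centroids = X[rng.choice(len(X), K, replace=False)]        # 1) random initialisation

for _ in range(20):                                        # fixed iteration count for brevity
    # 2) assign each pixel to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3) move each centroid to the mean of its cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])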
Connectivity clusters
- Sometimes segments do not cluster around centroids
- But data in each cluster is well connected
- Connections are defined by affinity (similarity) between data points (i.e. pixel values)
- Points in the same cluster are close to at least one other point in that cluster
Graph-partitioning
- Idea: graph-partitioning
- Similar pixels have similar values (e.g. RGB) and are spatially close to each other i.e. neighbors
- A graph represents an image where the nodes represent the pixels & edges measure affinity (similarity) between pixels
Spectral clustering
- Goal: to approximate the Ncut partitioning
- Application: image clustering
- Recursively apply spectral clustering for new sub-segments
- Pros: adds local similarity, great accuracy
- Cons: memory = O(N²), runtime = O(N³)
Challenges with triangulation
- Outlier-free matches do not exist in real world
- The epipolar geometry can help us to search for the correspondences not in the entire image but only along the epipolar lines
- This restriction not only saves the search time but also brings robustness to the keypoint matching stage
- Another challenge to this problem is the occlusion
- A 3D point may not be visible in a small number of views and thus we may not be able to find correspondences for it
- To mitigate this issue, we can incorporate a larger number of views (possibly from different angles) to make sure the point is visible in cameras and correspondences can be found
Noise suppression techniques
- Before computing the image gradients, we can first smooth the image using a lowpass filter. The low-pass filter will remove the noise.
- A better way is to use smoothed high-pass filters to compute the image gradients. The high-pass filters here are smoothed by e.g. a Gaussian blur.
Degrees of freedom
- The set of independent displacements that specify completely the displaced or deformed position of the body or system
2D to 2D transformations
- Translation: 2 DoF; Euclidean (rigid): 3; Similarity: 4; Affine: 6; Projective (homography): 8
3D to 3D transformations
- Translation: 3 DoF; Euclidean (rigid): 6; Similarity: 7; Affine: 12; Projective: 15