Computer Vision Flashcards
Perspective projection
- y / Y = f / Z -> y = fY/Z
- Similarly, x = fX/Z
- (X,Y,Z) -> (x,y): R³ -> R²
Coordinate conversion
- Cartesian to homogeneous
1) P = (x,y) -> (x,y,1)
2) P ∈ R² -> P̃ ∈ P²
- Homogeneous to Cartesian
1) P̃ = (x̃, ỹ, z̃) -> P = (x,y)
Sidenote: x = x̃/z̃, y = ỹ/z̃
Perspective projection equation
2D point = [x̃;ỹ;z̃]
Projection matrix = [f 0 0 0; 0 f 0 0; 0 0 1 0]
3D point: [X; Y; Z; 1]
- 2D point = projection matrix * 3D point
- x̃ = fX, ỹ = fY, z̃ = Z
- To convert back to cartesian, divide by third coordinate (i.e., x = fX/Z, y = fY/Z)
Intrinsic model
- An upgrade from the simple projection model
- Includes the following properties:
1) Change of coordinates from meters to pixels
2) Focal lengths are independent for x and y (defined as fₓ and fᵧ)
3) A skew s is included
4) Optical centre (cₓ, cᵧ) is included
Equation for intrinsic model
[fₓ s cₓ 0; 0 fᵧ cᵧ 0; 0 0 1 0] = [1/pₓ s cₓ; 0 1/pᵧ cᵧ; 0 0 1] * [f 0 0 0; 0 f 0 0; 0 0 1 0]
- pₓ, pᵧ = pixel sizes (meters per pixel), so fₓ = f/pₓ and fᵧ = f/pᵧ
Extrinsic model
- Also known as the pose
- Consists of a 3x1 3D-Translation vector t
- Consists of a 3x3 3D-Rotation matrix R
Overall camera matrix
- Considers both intrinsic and extrinsic calibration parameters together
- [x̃;ỹ;z̃] = [fₓ s cₓ; 0 fᵧ cᵧ; 0 0 1]*[R3D t]*[X;Y;Z;1]
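Illustrative sketch (not from the notes): projecting a 3D point with the overall camera matrix in NumPy; the intrinsics K, pose (R, t) and the 3D point below are made-up values.

import numpy as np

# Made-up intrinsics: fx, fy, zero skew, optical centre (cx, cy)
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 820.0, 240.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                          # camera aligned with the world axes
t = np.zeros((3, 1))                   # camera at the world origin

P_world = np.array([0.5, -0.2, 4.0, 1.0])       # homogeneous 3D point [X, Y, Z, 1]
C = K @ np.hstack([R, t])                       # 3x4 camera matrix
x_tilde = C @ P_world                           # homogeneous 2D point [x~, y~, z~]
x, y = x_tilde[0] / x_tilde[2], x_tilde[1] / x_tilde[2]   # divide by the third coordinate
print(x, y)                                     # pixel coordinates (420.0, 199.0)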
Lens cameras
- Larger aperture = more light = shallower depth of field (less of the scene in focus)
- Shorter shutter speed = shorter exposure time = less light reaching the sensor
Focal length
- Equation: 1/f = 1/zᵢ + 1/z₀
- f = focal length
- z₀ = where object is
- zᵢ = where image is formed
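Worked example (illustrative numbers): for f = 50 mm and an object at z₀ = 2000 mm, 1/zᵢ = 1/50 − 1/2000 = 0.0195 mm⁻¹, so the image forms at zᵢ ≈ 51.3 mm behind the lens.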
A/D
- Analogue to Digital conversion
- Spatially-continuous image is sampled by the sensor (CCD or CMOS) into a finite grid of pixels
- Pixel values are then quantized e.g. 0 (black) to 255 (white)
Shannon Sampling Theorem
- Used to choose sampling resolution
- If signal is bandlimited, then choose fₛ ≥ 2 · fₘₐₓ, where:
1) fₛ = sampling rate (2 · fₘₐₓ is the Nyquist rate)
2) fₘₐₓ = maximum signal frequency
- This condition is enough to guarantee that the signal can be perfectly linearly reconstructed from its samples
- If signal is not bandlimited, then use an analogue low-pass filter to clip the frequency content to f_cut ≤ fₛ/2 before sampling
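Worked example: audible sound extends to fₘₐₓ ≈ 20 kHz, so fₛ must be at least 40 kHz; CD audio samples at 44.1 kHz, which satisfies the theorem with a small margin.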
Quantization
- Used to determine bit rate
- In audio, this would be the number of bits per sample (CDs are usually 16-bit)
- In imaging, this would be in terms of coding the value of pixels (Images are usually 24-bit)
Compressed sensing
Steps
1) Regular sampling (namely Shannon)
2) Random sampling
- Used to collect a subset
- Number of measurements: M = O(K log(N/K))
3) Reconstruction
- Obtain x from y
4) Sparse approximation
- Is typically a non-linear reconstruction
- Should have a prior awareness (e.g. knowing that the signal x is sparse in some basis)
- Equation for prior awareness (a standard sparse-recovery formulation):
x̂ = argmin ‖x‖₁ subject to y = Ax, where A is the random sampling (measurement) matrix
Compressed sensing hallmarks
- Stable
1) Acquisition and recovery are numerically stable
- Asymmetrical
1) Acquisition is simple; most of the computation is moved to the decoder (reconstruction)
- Democratic
1) Each measurement carries the same amount of information
- Encrypted
1) Random measurements are encrypted
- Universal
1) Same random patterns can be used to sense any sparse signal class
Convolution steps
1) Flip filter vertically and horizontally
2) Pad image with zeros
3) Slide flipped filter over the image and compute weighted sum

Convolution distributive property
- Consider an image I
- Consider two filters g and h
- Image I is convolved by filters g and h and the results are added together (i.e. Î = (I * g) + (I * h))
- Distributive property states that ((I * g) + (I * h)) ≡ (I * (g + h))
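A minimal NumPy/SciPy check of the distributive property; the image and the two small filters below are illustrative.

import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
I = rng.random((32, 32))                 # stand-in image
g = np.array([[1.0, 2.0, 1.0]]) / 4.0    # small smoothing filter
h = np.array([[-1.0, 0.0, 1.0]])         # small derivative filter

lhs = convolve2d(I, g, mode='same') + convolve2d(I, h, mode='same')   # (I*g) + (I*h)
rhs = convolve2d(I, g + h, mode='same')                               # I*(g + h)
print(np.allclose(lhs, rhs))             # True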
Correlation
- No flipping of kernel
- No padding of the image
- Just slide filter over the image and compute weighted sum

Normalized correlation
- Used to measure similarity between a patch and the image at location (x,y)
- One standard form: NCC(x,y) = Σ (patch − patch mean)·(window − window mean) / (‖patch − patch mean‖ · ‖window − window mean‖), giving a score in [−1, 1]
SSD
- Stands for Sum of Squared Differences
- Computes the sum of squared intensity differences between a patch and the image as the patch slides across it
- Given a template image/patch, find where it is located in another image by finding the displacement (x*,y*) which minimises the SSD cost (see the sketch below)
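A minimal brute-force NumPy sketch of SSD template matching (illustrative; real implementations vectorise the search or use FFT-based correlation).

import numpy as np

def ssd_match(image, template):
    # Slide the template over the image and return the top-left (y, x) of the minimum-SSD position.
    H, W = image.shape
    h, w = template.shape
    best_cost, best_pos = np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            diff = image[y:y + h, x:x + w] - template
            cost = np.sum(diff ** 2)          # SSD cost at this displacement
            if cost < best_cost:
                best_cost, best_pos = cost, (y, x)
    return best_pos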
Low-pass filter
- Smooths (blurs) an image by (weighted/unweighted) averaging pixels in its neighborhood
- The Gaussian filter is the most common example

High-pass filter
- Computes finite (weighted/unweighted) differences between pixels and their neighbors
- Extracts important low-level information: intensity variations
- HP filters have zero mean and remove DC component from images
- Example of HP filter is edge detection
- Horizontal edge: [-1 -1; 1 1]
- Vertical edge: [-1 1; -1 1]
Derivative of Gaussian
- A high-pass filter
- Used to implement nth order image differences
- For example, with a second order derivative:
- ∂²I/∂x² ≈ f₁ ∗ f₁ ∗ I, where f₁ is a (Gaussian-smoothed) first-derivative filter in x
- ∂²I/∂y² ≈ f₂ ∗ f₂ ∗ I, where f₂ is a (Gaussian-smoothed) first-derivative filter in y
Laplacian of Gaussian filter
- A high-pass filter
- Used to implement 2nd order image differences
- In the equation, ∇² represents the Laplacian operator
- ∇²f = (∂²f / ∂x²) + (∂²f / ∂y²)

Subtraction of filters
- Typically used for sharpening
- ISharp(x,y) = I(x,y) - α(I ∗ ∇²Gσ(x,y))
Gaussian vs mean filtering
- Gaussian blur results in fewer artefacts than mean (box) filtering, because the Gaussian kernel falls off smoothly
Anti-aliasing
- Aliasing = appearance of artefacts due to undersampling (loss of high-frequency image data)
- To solve, apply a low-pass filter to an image before downsizing an image
Difference of Gaussians (DoG)
- DoG{I,σ₁,σ₂} = gσ₁ ∗ I - gσ₂ ∗ I = (gσ₁ - gσ₂) ∗ I
- Extracts “edges” at multiple scales i.e. the information difference between two scales
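A minimal SciPy sketch of a DoG response (the σ values and the stand-in image are illustrative).

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
I = rng.random((128, 128))                       # stand-in image
sigma1, sigma2 = 1.0, 2.0                        # two illustrative scales
dog = gaussian_filter(I, sigma1) - gaussian_filter(I, sigma2)   # (g_sigma1 - g_sigma2) * I
# Large |dog| responses mark edges/blobs at scales between sigma1 and sigma2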
Image resampling (stretching)
- Increase image size
- Interleave the small image with zeros
- Apply a low-pass filter
- This linearly interpolates between the pixels
Canny edge detection
- The most widely used edge detector
Steps
1) Compute a robust gradient of the image (i.e. using Gaussian differentiation)
- Compute gradient’s magnitude and direction
2) Apply non-maxima suppression for edge-thinning
- Keep the largest edge (i.e. the local maximum in a neighborhood) across the gradient direction, suppress the rest
3) “Hysteresis thresholding” to discard non-connected components
- Define two thresholds: T-high and T-low
- ‘Track’ a strong edge from T-high, keep connected edges above T-low, discard the rest
- Note: Steps 2 & 3 discard many false positives to get clean edges (a minimal OpenCV sketch follows below)
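A minimal OpenCV sketch (the file name, σ and thresholds are illustrative); cv2.Canny performs the gradient, non-maxima suppression and hysteresis steps internally.

import cv2

img = cv2.imread('image.png', cv2.IMREAD_GRAYSCALE)   # illustrative input file
blurred = cv2.GaussianBlur(img, (5, 5), 1.4)          # robust gradient via Gaussian smoothing
edges = cv2.Canny(blurred, 50, 150)                   # T-low = 50, T-high = 150 (hysteresis)
cv2.imwrite('edges.png', edges)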
Non-maxima suppression
Check if pixel is local maximum along gradient direction (in a window); if yes then keep it & discard other candidates
Hysteresis thresholding
- Threshold the gradient magnitude with two thresholds: T-high and T-low
- Strong Edges start at edge locations with gradient magnitude > T-high
- Values between T-low and T-high may be edges (values below T-low are not)
- Continue tracing that edge until gradient magnitude falls below T-low, keeping only connected components in this range
Effect of σ
- σ = Gaussian kernel spread/size
- The choice of σ depends on the desired behavior
1) Large σ detects large scale edges
2) Small σ detects fine features
Intensity-based matching process
- Select patches
- Slide them across the image
- Score the SSD & report minimizer
Good features of intensity-based matching
Locality
- In a “neighbourhood” pixels reveal important structures
- Shift-invariance (e.g. patch space vs. original image)
Efficiency
- Feature extraction in real-time is possible
Quantity
- Instead of O(10⁶) pixels, e.g. O(10³) features to match
- Search in real-time is possible
Distinctiveness
- Can differentiate a large database of objects
Building invariances
- Invariant to image transformations e.g. shift, rotation, scale, deformations, …
- Robust to clutter, noise and occlusion
Gradients behavior in a patch equations
- Collect the gradients (Ix, Iy) of all pixels in a patch and form the 2x2 structure matrix M = Σ [Ix² IxIy; IxIy Iy²]
Gradients behavior in a patch analysis
- The eigenvalues λ₁, λ₂ of M summarise the distribution of gradients in the patch:
1) Both small -> flat (textureless) region
2) One large, one small -> edge
3) Both large -> corner
PCA
- Stands for Principal Component Analysis
- Finds the orthogonal directions of maximal variance in the data (eigenvectors of the covariance matrix); here it summarises the dominant gradient directions within a patch
Harris corners
- In practice computing eigenvalues is not very fast
- Harris instead proposed to approximate the process by the corner response function:
- R = det(M) − k · trace(M)², with k typically 0.04–0.06
- If R scores higher than a threshold, the pixel is a corner candidate
Harris detector process
- Compute directional derivatives Ix, Iy and IxIy at each pixel (and smooth them with Gaussian filters).
- Compute the Structure Matrix in a window/patch around each pixel
- Compute corner response function R
- Threshold R
- Find local maxima of response function (non-maximum suppression)
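A minimal NumPy/SciPy sketch of the Harris response (k, σ and the stand-in image are illustrative values).

import numpy as np
from scipy.ndimage import gaussian_filter, sobel

rng = np.random.default_rng(0)
I = rng.random((64, 64))                    # stand-in grayscale image

Ix = sobel(I, axis=1)                       # directional derivatives
Iy = sobel(I, axis=0)
Sxx = gaussian_filter(Ix * Ix, sigma=1.5)   # smoothed structure-matrix entries
Syy = gaussian_filter(Iy * Iy, sigma=1.5)
Sxy = gaussian_filter(Ix * Iy, sigma=1.5)

k = 0.05                                    # typically 0.04-0.06
R = (Sxx * Syy - Sxy ** 2) - k * (Sxx + Syy) ** 2   # det(M) - k * trace(M)^2
corners = R > 0.01 * R.max()                # threshold, then apply non-maximum suppression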
SIFT
- Stands for Scale Invariant Feature Transform
- Used in object recognition as it’s powerful, especially for well-textured objects
SIFT matching process
- Each image has 100 to 1000 SIFT descriptors (instead of 1M pixels)
- Each is represented by 128 numbers vector φ
- SSD can be used as the measure: ∑ᵢ₌₁¹²⁸ (φᵢ(1) – φᵢ(2))²
- Feature matching = minimizing SSD (search)
- fast KD-tree type search can be adopted
- SIFT feature matching can run in real-time!
SIFT fine-tuning
Feature matching still returns a set of noisy correspondences
Refinements:
- Discard weak matches with SSD(φ(1),φ(2)) > threshold
- Keep only clear matches i.e. R << 1
- R = SSD(φ(1),φ(2))/SSD(φ(1),φ′(2))
- φ(2) = best SSD match to φ(1) in second image
- φ′(2) = 2nd best SSD match to φ(1) in second image
- R gives large values ~ 1 for ambiguous matches
- Discard inconsistent matches; for which we need to know something about the geometry of the scene
Applications of SIFT
Without geometry
Object recognition
With geometry
- Building panoramas
- Extract & match SIFT features
- Use matched pairs to learn a (homography) mapping which holds for all points in two images
- Apply the mapping to stitch two images
- 3D reconstruction
- Use SIFT to find dense correspondences between multiple images of a scene
- Use matched tuples (+ non-planar geometry) to estimate 3D shape
SIFT vs Harris
- SIFT has scale invariance, Harris doesn’t
- Both have shift, rotation and illumination invariance
- SIFT provides descriptors, Harris doesn’t
Measuring performance of a feature matcher
- TP (true positive): number of correct matches
- FN (false negative): number of matches that were not correctly detected
- FP (false positive): number of proposed matches that are incorrect
- TN (true negative): number of non-matches that were correctly rejected
- TP rate = TP/(TP+FN) = TP/P
- FP rate = FP/(FP+TN) = FP/N
- Accuracy = (TP+TN)/(N+P)
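Worked example (illustrative counts): with TP = 80, FN = 20, FP = 10, TN = 90, the TP rate is 80/100 = 0.8, the FP rate is 10/100 = 0.1, and the accuracy is (80 + 90)/200 = 0.85.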
Taylor Expansion of DoG
Has three purposes:
1) Achieves sub-pixel accuracy for keypoint localization
2) Reject edges
3) Reject low-contrast keypoints
Key point localisation
Detect maxima and minima of DoG in scale & space
- Each point is compared with its 26 neighbors in the current and adjacent scales (8 + 9 + 9)
- Detected if larger/smaller than all of the 26 pixels
- Extrema must be stronger than a threshold to reject candidates with low contrast
Discarding of low-contrast points
Elimination of edge responses
- For candidates, construct a 2x2 (Hessian) matrix of the DoG:
- [Dxx Dxy; Dxy Dyy]
- Reject a candidate if the ratio of its principal curvatures is too large, i.e. tr(H)²/det(H) > (r+1)²/r (an edge-like response); Lowe uses r = 10
Orientation Assignment
- Compute gradient magnitudes and orientations around each keypoint (at its detected scale)
- Build a 36-bin orientation histogram weighted by gradient magnitude
- The dominant peak(s) give the keypoint orientation, which makes the descriptor rotation invariant
Homography
- Defines a relation between two images of the same planar surface
- Defined as: [ũ; ṽ; w̃] = [h11 h12 h13; h21 h22 h23; h31 h32 1][x; y; 1]
- Extra step included: x’ = ũ/w̃, y’ = ṽ/w̃
- ũ = h11x + h12y + h13
- ṽ = h21x + h22y + h23
- w̃ = h31x + h32y + 1
- h33 = 1 due to scale invariance
- Degrees of freedom = 8 as a result
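A minimal NumPy sketch of applying a homography to a point (the entries of H and the point are illustrative).

import numpy as np

H = np.array([[1.1,  0.02,  5.0],
              [0.01, 0.95, -3.0],
              [1e-4, 2e-4,  1.0]])          # illustrative homography, h33 = 1

p = np.array([100.0, 50.0, 1.0])            # point [x, y, 1] in the first image
u, v, w = H @ p                             # [u~, v~, w~]
x_prime, y_prime = u / w, v / w             # divide by w~ to get the mapped pixel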
Application of homography
Building a panorama
Process
- Find correspondences
- Reject outliers using RANSAC
- Compute the homography using inlier matches
- Harmonize both image intensities by gain adjustment
- Compute the average RGB intensity of each image in the overlapping region
- Normalize the intensities using the ratio between these averages
2D geometric transform
- Result = [u;v;1]
- Matrix = [α cos(θ), −α sin(θ), tx; α sin(θ), α cos(θ), ty; 0 0 1]
- Point = [x;y;1]
- Result = Matrix * Point
- Result = [αx cos(θ) − αy sin(θ) + tx; αx sin(θ) + αy cos(θ) + ty; 1]
- α = scale
- tx/y = translation in axis
- θ = rotation angle
Calculating relation between two images
ĩ₁ = H₁[X; Y; 1], ĩ₂ = H₂[X; Y; 1]
-> ĩ₂ = H₂H₁⁻¹ĩ₁, where H = homography
LSQ
Stands for Least Squares
Used to solve the equation At = b
Find t that minimises ||At − b||²
To solve, form the following normal equations:
1) AᵀAt = Aᵀb
2) t = (AᵀA)⁻¹Aᵀb
LSQ is robust against (moderate) noise but not against outliers
Including outliers in model-fitting (LSQ) leads to a large model error
Outliers are avoided through checking global consistency
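A minimal NumPy sketch comparing the normal equations with a library solver (A and b are illustrative random data).

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 3))                             # overdetermined system: 20 equations, 3 unknowns
b = rng.random(20)

t_normal = np.linalg.solve(A.T @ A, A.T @ b)        # t = (A^T A)^-1 A^T b
t_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # numerically preferable in practice
print(np.allclose(t_normal, t_lstsq))               # True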
RANSAC
Known as Random Sample Consensus
Used for robust model fitting
Process
- Loop
- Pick a random set of K points, where K represents the minimal number of points required for fitting the model M
- Compute (fit) the model M using these points
- Apply M to the other points & measure the error
- Label high error matches as outliers
- Keep a copy (M) each time the number of outliers reduces
- After S iterations, return M with fewest outliers
- Recompute M using all of the inliers
RANSAC calculation
Aim: estimate a model with K points, given the probability that a point is an inlier
- q = the probability that a point is an inlier
- q^K = probability of selecting all inliers for a K-point model (at each iteration)
- 1 − q^K = probability of choosing at least one outlier (at each iteration)
- P = probability that RANSAC chooses an outlier-free model after S iterations: P = 1 − (1 − q^K)^S
Calculating iterations from RANSAC
S = iterations to be calculated
K = points in model
q = 1 − probability of choosing an outlier (i.e. the probability that a point is an inlier)
Solving P = 1 − (1 − q^K)^S for S gives S ≥ log(1 − P) / log(1 − q^K), rounded up
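Worked example (illustrative numbers): fitting a homography (K = 4) with inlier probability q = 0.5 and target confidence P = 0.99 needs S = log(0.01) / log(1 − 0.5⁴) ≈ 71.4, i.e. 72 iterations.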
Forward model
- Used to generate 2D images from a 3D shape
- Properties such as texture and reflectance are also defined
Camera model
- Explains the formation of images from 3D points (i.e., a linear model in homogeneous coordinates)
- P = [X, Y, Z], a point in 3D
- p = [x,y], a corresponding image pixel
- [x̃; ỹ; z̃] = K[R t] [X;Y;Z;1]
- Extra step required: x = x̃/z̃, y = ỹ/z̃
3D reconstruction
- The inverse to the camera model, where a 3D shape is estimated that would generate the input photographs given the same camera viewpoints, as well as material and illumination
- Requires more than one view i.e. motion to resolve depth/size ambiguities in “back projection”
- Process is typically as follows:
- Camera calibration
- Matching correspondences (e.g. SIFT) between Multiview images
- Triangulation
Uses of 3D reconstruction
- 3D reconstruction allows the creation of a “digital copy” of a real object
- This allows the following things:
- Inspection of object details
- Measurement of properties
- Reproduction of an object in different material
- Therefore, 3D recon has several applications such as:
- Preservation of cultural heritage
- Some companies use 3D recon to provide “digital museums”
- Computer games
- Movies
- City modelling
- Typically done for historical buildings, in order to help preservation of them in the long-term
- E-commerce
Camera calibration
- The pre-requisite step for many computer vision tasks, including 3D reconstruction
- [x̃;ỹ;z̃] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X;Y;Z;1]
- There are 11 parameters for camera calibration, due to scale invariance (the last entry is fixed to 1)
Determining camera calibration
- A number of correspondences, n, between the 3D object and 2D images is needed: {(xi,yi, Xi,Yi,Zi )}(i=1…n)
- [x̃1 x̃2 … x̃n; ỹ1 ỹ2 … ỹn; z̃1 z̃2 … z̃n] = [C11 C12 C13 C14; C21 C22 C23 C24; C31 C32 C33 1]*[X1 X2 … Xn; Y1 Y2 … Yn; Z1 Z2 … Zn; 1 1 … 1]
- This matrix is in homogeneous coordinates but the image pixels are in a cartesian domain
- Therefore, a division is required to relate X,Y,Z to some x and y
- x = x̃/z̃ and y = ỹ/z̃
- x̃ = C11Xi + C12Yi + C13Zi + C14
- ỹ = C21Xi + C22Yi + C23Zi + C24
- z̃ = C31Xi + C32Yi + C33Zi + 1
Solving a camera matrix with a linear system
- This involves the factoring of knowns and unknowns and writing it as a linear system
- System: A*[C11; C12; C13; C14; … C33] = b
- Each correspondence adds two rows to A: one for the x-coordinate and one for the y-coordinate of the point
- At least 6 correspondences are needed to compute all 11 unknowns in the camera matrix
- Number of rows in A = 2n
- Number of columns in A= 11
- Number of rows in B = 2n
- Number of columns in B = 1
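A minimal NumPy sketch of building and solving this 2n x 11 system; the function name is illustrative and the correspondences are assumed to be supplied as arrays.

import numpy as np

def calibrate(points_3d, points_2d):
    # points_3d: (n, 3) world points, points_2d: (n, 2) pixels, n >= 6
    rows_A, rows_b = [], []
    for (X, Y, Z), (x, y) in zip(points_3d, points_2d):
        # From x = (C11 X + C12 Y + C13 Z + C14) / (C31 X + C32 Y + C33 Z + 1), and similarly for y
        rows_A.append([X, Y, Z, 1, 0, 0, 0, 0, -x * X, -x * Y, -x * Z])
        rows_b.append(x)
        rows_A.append([0, 0, 0, 0, X, Y, Z, 1, -y * X, -y * Y, -y * Z])
        rows_b.append(y)
    A, b = np.array(rows_A), np.array(rows_b)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)        # the 11 unknowns C11..C33
    return np.append(c, 1.0).reshape(3, 4)           # C34 fixed to 1

With n ≥ 6 well-spread, non-coplanar points this recovers the camera matrix up to the fixed last entry.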
Finding correspondences
- Steps of usual pipeline
- Use a calibration rig e.g. checkerboard with known dimensions & geometry
- This gives a number of known 3D world points.
- Chess corners are ideal.
- Find corresponding points (chess corners) in the image using e.g. Harris or SIFT.
- It is common to use 3 planar objects, e.g. the inside corner of a cube; a checkerboard corner is assumed to be the centre (origin) of the world
Internal and external calibrations
- This is found after computing the 3x4 camera matrix C
- C can be factorised as K[R t]
- K = 3x3 intrinsic calibration matrix: [fx s cx; 0 fy cy; 0 0 1]
- [R t] = 3x4 pose matrix
Triangulation
- The estimation of points (shapes) in 3D from their Multiview images through the use of calibrated cameras
- We need to know the camera matrices and we require robust (outlier-free) matches
Triangulation process
- First, find Multiview correspondences e.g. using SIFT
- For each view, back project a ray originating at camera centre and passing through the correspondence pixel
- Find intersection of the cross view rays to determine 3D location
Solving triangulation using linear systems
Given Multiview correspondences {p1,p2,…} in calibrated cameras {C1,C2,…}, reconstruct the corresponding 3D point P = [X, Y, Z, 1] compliant with the projection model in every views
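A minimal NumPy sketch of linear (DLT) triangulation from two views; C1, C2 are the 3x4 camera matrices and p1, p2 the matched pixels (all placeholders supplied by the caller).

import numpy as np

def triangulate(C1, C2, p1, p2):
    # Each view contributes two equations of the form x * (row3 . P) - (row1 . P) = 0
    A = np.vstack([
        p1[0] * C1[2] - C1[0],
        p1[1] * C1[2] - C1[1],
        p2[0] * C2[2] - C2[0],
        p2[1] * C2[2] - C2[1],
    ])
    _, _, Vt = np.linalg.svd(A)          # homogeneous least squares: smallest right singular vector
    P = Vt[-1]
    return P[:3] / P[3]                  # back to Cartesian [X, Y, Z]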
Structure from motion (SFM)
3D reconstruction from uncalibrated cameras:
Given a set of Multiview correspondences in unknown cameras, estimate jointly the 3D locations and camera viewpoints
- Yields camera pose
- Estimates 3D world coordinates from 2D images
Steps in solving SFM
- Group similar images
- Extract SIFT features
- Identify/group strong overlaps
- Alternate between the two steps with an initialization:
- With the current estimated camera calibrations solve a Triangulation problem to find 3D locations
- Use 3D locations and their Multiview pixel correspondences to Calibrate Cameras
- This will first give a sparse feature-based reconstruction
- Next, seed a dense (intensity-based) reconstruction with the sparse results
Epipolar constraints
- Outlier correspondences (e.g. similar patches) result in false matches and deteriorate 3D recon
- Epipolar constraints are introduced to reduce this
Epipolar plane
- In non-planar geometry, there are point-to-line correspondences between views, in contrast to the point-to-point correspondences present in planar geometry
- They are relevant for finding robust correspondences in non-planar geometry
- They restrict the search space from an entire image to a line, which saves search time and brings robustness to the keypoint matching stage
- This is used in non-planar geometry applications, namely depth estimation, Structure from Motion and 3D reconstruction
- An epipolar plane has all of the following on the same plane:
- Camera centres
- 3D point
- Corresponding image points
- A baseline (a line connecting the two camera centers)
- Epipoles
- Projection of Camera 1’s centre (also baseline) to Camera 2’s image
- Projection of Camera 2’s centre (also baseline) to Camera 1’s image
- Epipolar line l’ (plane projection onto Camera 2)
- Potential matches for x have to lie on the corresponding epipolar line l’.
- Epipolar line l (plane projection onto Camera 1)
- Potential matches for x′ have to lie on the corresponding epipolar line l
- Duality
- All points on line 1 correspond to all points on line 2 (& vice versa)
Example of epipolar lines
- Converging views
- Span the whole 3D space by all epipolar planes:
- This gives all epipolar lines pairs
- All epipolar planes intersect at the baseline
- All epipolar lines converge to the epipoles, because all epipolar planes intersect at the baseline, and the baseline’s projection in each view is the epipole
- To span the whole 3D space, infinitely many epipolar planes need to be considered
- Since all of these planes share the same baseline, they can be viewed as rotating about the baseline axis to span the 3D space
- If the 3D space is spanned in this way, and each epipolar plane is projected to the two image views, we get a set of dual epipolar lines that span the images of each camera
- They are dual because a point on an epipolar line in one camera corresponds to the matching epipolar line in the other camera, i.e. its correspondence is found there
Examples of epipolar lines #2
- Motion parallel to image plane
- This is a typical scenario in stereo vision, parallel views with motion
- However, there is no convergence due to the views being parallel with motion
- Each view contains parallel epipolar lines
- Epipoles are at infinity
Examples of epipolar lines #3
- Motion perpendicular to image plane
- Epipoles have same coordinates in both images
- Epipolar lines radiate from the epipoles
Rejecting outliers
- Geometry tells point-to-line correspondences between views
- Epipolar lines reduce (constraint) the search space. Useful for feature matching:
- Rejecting outliers
- Fast search, which is even faster when the lines are aligned (e.g., stereo rectification)
Fundamental Matrix for characterising epipolar lines
- Used to characterise epipolar lines as an equation x′ᵀ F x = 0
- F = a 3x3 fundamental matrix determined by the internal and external calibrations
- x,x′ = image points (homogeneous coordinates) in camera 1 and 2, respectively
Forming the fundamental matrix for known cameras
- Fundamental matrix is related to the internal and external calibrations
- F = K′⁻ᵀ [t]ₓ R K⁻¹
- R, t = relative rotation and translation between the two cameras, with internal calibrations K and K′
- [t]x = [0 -t3 t2; t3 0 -t1; -t2 t1 0]
- Epipolar lines can be used to refine correspondences
Forming the fundamental matrix from unknown cameras
- It’s possible to jointly estimate the Fundamental matrix and reject outliers when calibrations are missing
- The first step is to take a data-driven approach to compute the fundamental matrix (through model fitting)
- Use RANSAC to reject outliers during model-fitting
Epipolar constraints on data
- Given: noisy correspondences (x,x’) in two views
- Unknown: cameras’ internal/external calibrations
- Goal: compute the 3x3 fundamental matrix F
- If the points x and x’ are correspondences of each other, then they satisfy the epipolar constraint and the equation x′ᵀFx = 0 holds
Using model fitting to find the fundamental matrix
- x & x’ indexed by i indicate a pair of matched points between the views
- For finding F, a number N of such raw correspondences between two views needs to be established
- Each of these x & x’ is in homogeneous coordinate form and can thus be written as the 3-vector [u, v, 1]
- [u′ v′ 1][f11 f12 f13; f21 f22 f23; f31 f32 f33][u; v; 1] = 0
- Every pair must individually satisfy the epipolar constraint, through some matrix F to be determined
- Every match gives one equation by expanding the epipolar constraint; this is done for each pair
- Factoring out knowns and unknowns into a standard linear system gives A f = 0, where A is a tall matrix and f is the vectorised form of F
- This system has one uninteresting trivial solution, F = 0
- To avoid it, the scale of F can be fixed
- This can be done either by assuming one element of F equals 1, or (as here) by constraining the norm of F to 1, i.e. F is normalised to disambiguate the scale
- Therefore it is possible to solve a constrained least squares problem, minimising ‖Af‖ subject to ‖f‖ = 1, to find f, the elements of the fundamental matrix
- For this, at least 8 pairs of correspondences between the views are required
- F has 9 elements but is scale invariant, so with its norm constrained a minimum of 8 equations (correspondences) are needed to solve this constrained system and find F
However:
- Outliers have large distances (error) to the true epipolar lines
- This can deteriorate the outcome of model fitting to find correct F
- RANSAC can be used as a meta-algorithm to fix this (a minimal sketch of the 8-point estimate follows below)
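A minimal NumPy sketch of the 8-point estimate of F from N ≥ 8 raw matches (coordinate normalisation, which improves conditioning, and the RANSAC wrapper described next are omitted).

import numpy as np

def eight_point(pts1, pts2):
    # pts1, pts2: (N, 2) matched pixel coordinates in image 1 and image 2, N >= 8
    u, v = pts1[:, 0], pts1[:, 1]
    up, vp = pts2[:, 0], pts2[:, 1]
    # Each match gives one row of A such that A f = 0, f = vectorised F
    A = np.column_stack([up * u, up * v, up,
                         vp * u, vp * v, vp,
                         u, v, np.ones_like(u)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)             # minimiser of ||A f|| subject to ||f|| = 1
    U, S, Vt2 = np.linalg.svd(F)         # enforce rank 2 (fundamental matrices are singular)
    S[2] = 0.0
    return U @ np.diag(S) @ Vt2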
Using RANSAC in finding a fundamental matrix
Process:
- Iterate S times:
- Randomly select a minimal subset of the noisy matches
- Compute a fundamental matrix which satisfies the epipolar constraints x′ᵀ F x ≈ 0
- Find the corresponding epipolar lines
- Count number of inliers, if improved keep current F and iterate next
Process with a non-planar fit
F = eye(3,3); nBest = 0; K = 8;            % K = minimal number of matches to fit F
for i = 1:S                                % S = number of RANSAC iterations
    P2 = SelectRandomSubset(Matches, K);   % random minimal sample of matches
    f = ComputeFundamental(P2);            % fit F from the sample (8-point)
    nInliers = ComputeInliers(f, Matches); % count matches consistent with f
    if (nInliers > nBest)                  % keep the best model found so far
        F = f;
        nBest = nInliers;
    end
end
Human depth cues: Motion
Convergence
When watching an object close to us, our eyes point slightly inward. This difference in the direction of the eyes is called convergence. This depth cue is effective only on short distances (less than 10 meters).
Binocular Parallax
As our eyes see the world from slightly different locations, the images sensed by the eyes are slightly different. This difference in the sensed images is called binocular parallax. Human visual system is very sensitive to these differences, and binocular parallax is the most important depth cue for medium viewing distances. The sense of depth can be achieved using binocular parallax even if all other depth cues are removed.
Monocular Movement Parallax
If we close one of our eyes, we can perceive depth by moving our head. This happens because human visual system can extract depth information in two similar images sensed after each other, in the same way it can combine two images from different eyes.
Human depth cues: Focus
Accommodation
Accommodation is the tension of the muscle that changes the focal length of the lens of eye. Thus it brings into focus objects at different distances. This depth cue is quite weak, and it is effective only at very short viewing distances (less than 2 meters) and with other cues.
Human depth cues: Image cues
Retinal Image Size
When the real size of the object is known, our brain compares the sensed size of the object to this real size, and thus acquires information about the distance of the object.
Perspective
When looking down a straight level road we see the parallel sides of the road meet in the horizon. This effect is often visible in photos and it is an important depth cue. It is called linear perspective. (assumes geometry is known).
Texture Gradient
The closer we are to an object the more detail we can see of its surface texture. So objects with smooth textures are usually interpreted being farther away. This is especially true if the surface texture spans all the distance from near to far.
Occlusions
When objects block each other out of our sight, we know that the object that blocks the other one is closer to us. The object whose outline pattern looks more continuous is felt to lie closer.
Aerial Haze
The mountains in the horizon look always slightly blurry or hazy. The reason for this are small water and dust particles in the air between the eye and the far objects. The farther the objects, the hazier they look.
Shades and Shadows
When we know the location of a light source and see objects casting shadows on other objects, we learn that the object shadowing the other is closer to the light source. As most illumination comes downward we tend to resolve ambiguities using this information. Also, bright objects seem to be closer to the observer than dark ones.
Depth from stereo
Goal
Recover depth by finding image coordinates for x’ that correspond to x
Problems
- Intrinsic & extrinsic calibration: recover the relation of the cameras
- Correspondence: where do we search for the matching point x’?
Depth from disparity
- Disparity is inversely proportional to depth.
- Disparity = x - x’ = (B • f) / z
- f = focal length (intrinsic parameter)
- B = baseline, the distance between the camera centres O and O’ (extrinsic parameter)
- By similar triangles: (x - x’) / B = f / z
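Worked example (illustrative numbers): with baseline B = 0.1 m and focal length f = 700 pixels, a disparity of 35 pixels gives z = (0.1 × 700) / 35 = 2 m; halving the depth to 1 m doubles the disparity to 70 pixels.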
Stereo Image Rectification
- Project image planes onto a common plane parallel to the baseline (the line between camera centers)
- Two Homographies (3x3 transform), one for each input image projection
- Pixel motion is horizontal after this transformation
Correspondence search
Process
Slide a window along the right scanline and compare contents of that window with the reference window in the left image
Challenges
- Textureless surfaces
- Occlusions
- Non-Lambertian surfaces
- Specularities
- Repetitive (non-unique) patterns in dense scenes
Effects of window size
Smaller window
Pro: More detail
Con: More noise
Larger window
Pro: Smoother disparity maps
Con: Less detail
Con: Fails near boundaries
Useful non-local prior: Uniqueness
For any point in one image, there should be at most one matching point in the other image
Useful non-local prior: Ordering
Corresponding points are in the same order in both views
Useful non-local prior: Piecewise smoothness
Disparity values are expected to change slowly for the most parts (except for the edges)
Energy minimisation framework
Used to enforce piecewise smoothness
Pros
- Generality
- Does not need training data
- Rather uses geometry knowledge (physics)
Cons
- Solving optimisation is time-consuming
Process
- Minimise E(D) = E_data(D) + λ · E_smooth(D) over the disparity map D
- E_data = data fidelity (agreement with the observed left/right intensities)
- E_smooth = smoothness of the disparity map
- D = disparity
Deep learning for depth estimation
- Deep learning helps to do a task in real-time
- It prioritises effort during training rather than in testing
Components
- Convolutional Neural Network (CNN)
- A CNN that uses epipolar geometry in every layer doesn’t need to learn it from scratch
- As a result, it needs less training data and achieves accurate performance compared to purely end-to-end approaches
- Incorporating epipolar geometry constraints into the depth reconstruction loss reduces the need for training data
- E_training = E_L-R consistency + E_smooth disparity + E_reconstruction
Applications of face detection
- Facial motion capture
- Tracking facial expressions, motion
- Photography
- Auto-focus, intensity adjustment, artistic blur (portrait images)
- Photo tagging
- Recognition, social media, security, marketing, etc.
Challenges faced by face detection
- Facial expressions
- Facial appearance
- Illumination variation
- Variation in pose
- Occlusions
- Variations across individuals
Sidenote: Good algorithms run in real-time and are robust against these challenges
Feature extraction
- Example. Given a dataset {x} in 2D
- Feature: transformation of raw data attributes to a new reduced set of discriminative features
- Motivations in high dimensional settings: real-time processing, robustness to nuisance attributes, noise, outliers etc
Process
- Place data on a scatter graph
- Place a line that represents a projection of high-dimensional space into a lower-dimensional space
- This line creates two conditional distributions
- A threshold on this projection then separates the two conditional distributions
Weak classifier
- Weak classifiers can be defined for each boxlet feature
- Weak classifiers individually have high detection errors
- They are parameterised by a threshold value and a polarity value (±1) giving the direction of the inequality
- A problem with such a classifier is that, because each feature is processed individually, it is not enough on its own to correctly decide “face” or “not face”
Boxlet filter library
Viola Jones uses 3 types of boxlet:
- 2-rectangle: difference between regions of equal size
- 3-rectangle: difference between extremes and centre region
- 4-rectangle: difference between diagonal pairs of rectangles
Considering all possible filter parameters: position, scale, and type: there are at least 160,000 possible features associated with (small) windows
However, a problem is that the library is extremely large and informative combinations have to be subselected as a result
AdaBoost
Also known as adaptive boosting
Idea
- Build a complex classifier using weighted combination of a subset (m<<160K ) of weak classifiers
- Weak classifier may individually have high error rate but the combination will perform well
Each weak classifier is parameterised in the following format: {type, polarity, threshold}
Training the parameters
- Given a face & not-face training dataset {(xi, yi)}
- xi image patch
- yi ∈ {-1, 1} ‘not face’ or ‘face’
- Initialize uniform weights wᵢ for all data samples
- Select a feature fₜ at a time (randomly)
- Find parameters that minimise the overall misclassification error on the training data, e.g. e = ∑ᵢ wᵢ |hₜ(xᵢ) − yᵢ|
- Boosting: weight the misclassified samples more strongly at the next stage of learning (the next feature should focus more on them)
- wᵢ ← wᵢ · δ if xᵢ is misclassified (δ > 1)
Boosting for face detection
- An m = 200 feature classifier can yield 95% detection rate and a false positive rate of 1 in 14084
- However, while a larger m value brings better accuracy, it also brings a slower runtime
Attentional cascade
Motivation
Improved runtime for a given budget mtotal
Chain of classifiers
- Start with simple classifiers (i.e. small m) which reject many of the non-face patches (negatives) while detecting almost all face patches (positives)
- Positive response from the first classifier triggers the evaluation of a second classifier, and so on
- A negative outcome at any point leads to the immediate rejection of the sub-window (time saved)
- Keeping the same stagewise TPR, FPR for classifiers 1 to N means that they become progressively more complex
Attentional cascade calculation
- TPR₁ x TPR₂ x …. x TPRₙ = y
- Or FPR₁ x FPR₂ x …. x FPRₙ = y
- n = number of stages
- y = end-to-end TPR/FPR
- Detection rate = end-to-end TPR
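Worked example (illustrative rates): a 10-stage cascade in which every stage has TPR = 0.99 and FPR = 0.30 gives an end-to-end detection rate of 0.99¹⁰ ≈ 0.90 and an end-to-end false positive rate of 0.30¹⁰ ≈ 6 × 10⁻⁶.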
Training an attentional cascade
Misclassification rate
- Denoted as a(1 - TPR) + b(FPR)
- a = chance of face
- b = chance of not face
- a + b = 1, i.e. a = 1 − b and vice versa
NMS as a detector
- Thresholding the response alone gives too many overlapping false positives
- NMS (non-maximum suppression) keeps only local maxima, which maintains spatial separation between detected objects
Segmentation
- Goal: Find boundaries around objects of interest
- Each object/part might have texture and other fine details, but it’s pivotal to group its pixels together, form a region and draw a contour around it
Applications of segmentation
- Biomedical imaging
- Measure tissue volume
- Location and diagnosis of tumours
- Surgery planning
- Study of cell structures
- Remote sensing, satellite imaging
- Hyperspectral imagery: measures 100-1K wavelengths (channels) to discriminate materials based on their reflectance properties
- Object recognition
- Fingerprint/iris recognition
- Driverless cars
- Pedestrian detection
Histogram splitting
Process
- Groups of similar pixels appear as bumps in the histogram
- Split the histogram at local-minima
- Label pixels according to which bump they belong to
Pros
- Works well if the bumps are well separated
- If two regions have good separation in their means and low variance, then they can be segmented
- This makes a threshold easy to define
Fisher’s Separability
- Measures how well two clusters separate: separability = (µ₁ − µ₂)² / (σ₁² + σ₂²)
- µ = mean
- σ = standard deviation (i.e., σ² = variance)
Lack of valid threshold
- Thresholds can be hard to find, or may not globally exist.
- Also, histograms do not account for neighbourhood relationships
Colour histograms
- Good for visualising segmentation
- Makes it possible to perform clustering based on separate histograms
- However, it’s non-scalable
- Complexity = bᶜ (one bin per colour combination)
- b = # bins per channel
- c = # channels
K-Means
- A popular approach for high dimensional data clustering
- K means finds K centres and associates pixels to these centres (clustering)
- K-Means is fast & scalable to multichannel images
- However it needs good initialization
- Bad initialization can lead to poor local minima
- Complexity: O(NKd x iter)
- N = pixels
- K = clusters
- d = channels
Goal
- Cluster multichannel pixels xi ∈ Rd around K centroids
- Jointly find K clusters and centroids such that clusters have minimal distance to their centres
Process
- Initialization: Choose K centroids at random
- Pixels are assigned (clustered) to the closest centroid
- Move (update) centroids toward the center (mean, median)
- Iterate steps 2 & 3 until the change falls below a defined threshold (see the sketch below)
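A minimal NumPy sketch of the K-Means loop on multichannel pixels (K, the iteration count and the random data are illustrative; convergence checks and empty-cluster handling are omitted).

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5000, 3))                                  # N pixels with d = 3 channels (e.g. RGB)
K = 4
centroids = X[rng.choice(len(X), K, replace=False)]        # 1) random initialisation

for _ in range(20):                                        # fixed iteration count for brevity
    # 2) assign each pixel to its closest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # 3) move each centroid to the mean of its cluster
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])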
Connectivity clusters
- Sometimes segments do not cluster around centroids
- But data in each cluster is well connected
- Connections are defined by affinity (similarity) between data points (i.e. pixel values)
- Points in the same cluster are close to at least one other point in that cluster
Graph-partitioning
- Idea: graph-partitioning
- Similar pixels have similar values (e.g. RGB) and are spatially close to each other i.e. neighbors
- A graph represents an image where the nodes represent the pixels & edges measure affinity (similarity) between pixels
Spectral clustering
- Goal: to approximate the Ncut partitioning
- Application: image clustering
- Recursively apply spectral clustering for new sub-segments
- Pros: adds local similarity, great accuracy
- Cons: memory = O(N²), runtime = O(N³)
Challenges with triangulation
- Outlier-free matches do not exist in real world
- The epipolar geometry can help us to search for the correspondences not in the entire image but only along the epipolar lines
- This restriction not only saves the search time but also brings robustness to the keypoint matching stage
- Another challenge to this problem is the occlusion
- A 3D point may not be visible in a small number of views and thus we may not be able to find correspondences for it
- To mitigate this issue, we can incorporate a larger number of views (possibly from different angles) to make sure the point is visible in cameras and correspondences can be found
Noise suppression techniques
- Before computing the image gradients, we can first smooth the image using a lowpass filter. The low-pass filter will remove the noise.
- A better way is to use smoothed high-pass filters to compute the image gradients. The high-pass filters here are smoothed by e.g. a Gaussian blur.
Degrees of freedom
- The set of independent displacements that specify completely the displaced or deformed position of the body or system
2D to 2D transformations
- Translation: 2 DoF; Euclidean (rigid): 3; Similarity: 4; Affine: 6; Projective (homography): 8
3D to 3D transformations
- Translation: 3 DoF; Euclidean (rigid): 6; Similarity: 7; Affine: 12; Projective: 15