Midterm Flashcards
Define image processing
Image Processing is manipulating an image to improve its quality, extract information, or enable further analysis
Define feature
A distinctive attribute or description used to label or differentiate objects in images
Feature extraction involves two things. What are they?
Detection (finding features) and Description (quantifying features)
What are invariant and covariant features?
Invariant features: Values remain unchanged under specific transformations (e.g., rotation, scaling)
Covariant features: Values change predictably under transformations (e.g., scaling affects area proportionally)
What are local and global features?
Local features: Apply to individual image regions (e.g., corners, edges)
Global features: Describe entire images (e.g., colour histogram)
The purpose of preprocessing techniques is to…
Prepare images for further analysis by reducing noise, enhancing features, and normalizing data
Define boundary analysis
An analysis of the edges or outlines of objects to aid in object shape identification
Define region analysis
An analysis of the areas or segments within an image to support texture and pattern recognition
What is boundary following/tracing?
A technique to identify the boundary of an object in a binary image
What are the requirements for boundary following/tracing?
- Must be a binary image
- Image padded with a border of 0’s
- Single connected region
What are chain codes?
Chain codes represent the boundary of an object as a sequence of connected line segments. These segments are described using directional numbers based on connectivity
What are the different connectivity types?
4-Connectivity: Segments connect pixels in horizontal and vertical directions
8-Connectivity: Segments connect pixels in horizontal, vertical, and diagonal directions (finer boundary representation than 4-C)
What are the two types of chain codes?
Freeman chain codes and slope chain codes
Define Freeman chain codes
A boundary chain code that assigns a directional number (e.g., 0 for right, 1 for top-right, etc.) to each segment between consecutive boundary pixels (e.g., 0766666453321212)
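A minimal sketch of how a traced boundary could be turned into a Freeman code, assuming the boundary is already an ordered, closed list of 8-connected pixel coordinates (the direction-number convention below is one common choice; courses differ on orientation):

```python
# Freeman 8-connectivity: (dx, dy) step -> direction number, with 0 = right
# and numbers increasing counter-clockwise (conventions vary by textbook).
DIRECTIONS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
              (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def freeman_chain_code(boundary):
    """boundary: ordered list of (x, y) pixels from boundary tracing (closed)."""
    code = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        code.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return code
```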
What is a strategy that could reduce the length of a boundary chain?
Resample fine-grained grid to a coarser grid spacing. This also helps with reducing sensitivity to noise or segmentation errors
What are some normalization techniques for chain codes?
Rotation normalization and starting point normalization
What is rotation normalization?
A normalization technique that uses the difference between consecutive directions (the first difference of the chain code) instead of absolute directions
What is starting point normalization?
A normalization technique for chain codes that treats the chain code as circular and shifts it to start with the smallest sequence
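A minimal sketch of both normalization techniques, assuming an 8-directional Freeman code stored as a Python list of integers:

```python
def first_difference(code, n_dirs=8):
    """Rotation normalization: circular differences between consecutive directions."""
    return [(code[(i + 1) % len(code)] - code[i]) % n_dirs for i in range(len(code))]

def smallest_rotation(code):
    """Starting point normalization: circular shift giving the smallest sequence."""
    rotations = [code[i:] + code[:i] for i in range(len(code))]
    return min(rotations)
```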
Define slope chain codes (SCCs)
A chain code for boundary analysis that uses slope changes between contiguous line segments to represent a boundary
How do you normalize a slope chain code?
Positive and zero slope changes are normalized to [0, 1), negative slope changes are normalized to (-1, 0)
What are the advantages of SCCs over Freeman codes?
- Provide finer granularity by utilizing a continuous slope range (-1, 1)
- Better representation under rotation
- Simpler process as SCCs do not require defining a grid
Define boundary approximation using minimum-perimeter polygons (MPP)
Approximating a boundary with a polygon that minimizes the total perimeter while maintaining the shape’s integrity; this provides a compact, simplified representation of object boundaries
What are the advantages of boundary approximation using MPP?
- Reduces computational complexity
- Simplifies boundary representation for storage and analysis
- Useful in applications like shape matching and object recognition
Define scale-invariant feature transform (SIFT)
SIFT extracts features that are invariant to scale, rotation, and certain changes in illumination
SIFT is designed to detect and describe _______ features in images
Local
SIFT features are ___________
Invariant
Describe the first step of the SIFT algorithm: Scale Space Pyramid Construction
The scale-space pyramid construction step represents the image at multiple scales so that features can be detected across varying object sizes
How would you construct a Scale Space Pyramid?
Repeatedly blur (with a Gaussian filter) and downsample the image
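A minimal OpenCV sketch of the repeated blur-and-downsample idea; the octave count, scales per octave, and base sigma below are illustrative, not the exact SIFT settings:

```python
import cv2

def scale_space_pyramid(image, n_octaves=4, scales_per_octave=5, sigma=1.6):
    """image: 2D grayscale numpy array."""
    pyramid = []
    current = image.copy()
    for _ in range(n_octaves):
        octave = []
        for s in range(scales_per_octave):
            k = 2 ** (s / (scales_per_octave - 1))           # blur grows within the octave
            octave.append(cv2.GaussianBlur(current, (0, 0), sigma * k))
        pyramid.append(octave)
        current = current[::2, ::2]                          # downsample for the next octave
    return pyramid
```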
Each group of blurred images in a scale space pyramid is called ________
An octave
Describe the second step of the SIFT algorithm: Obtain Initial Keypoints
Compute the difference of Gaussians (DoG) and find local extrema
How would you find the local extrema when obtaining initial keypoints in the SIFT algorithm?
Compare each pixel’s intensity value in the 2D DoG image to the intensity values of its 8 neighbours (the full algorithm also compares against the 9 neighbours in each adjacent DoG scale, 26 in total). The pixel is marked as an extremum if its value is greater/smaller than all its neighbours
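A minimal sketch of the 8-neighbour extremum test on a single 2D DoG image (a naive double loop, for illustration only):

```python
import numpy as np

def local_extrema_2d(dog):
    """Return (row, col) positions strictly larger or smaller than all 8 neighbours."""
    keypoints = []
    for r in range(1, dog.shape[0] - 1):
        for c in range(1, dog.shape[1] - 1):
            patch = dog[r - 1:r + 2, c - 1:c + 2]
            neighbours = np.delete(patch.flatten(), 4)       # drop the centre value
            if dog[r, c] > neighbours.max() or dog[r, c] < neighbours.min():
                keypoints.append((r, c))
    return keypoints
```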
Describe the third step of the SIFT algorithm: Improve Keypoint Localization Accuracy
SIFT interpolates the DoG values around each candidate keypoint (using the linear and quadratic terms of a Taylor series expansion) to locate the true extremum position with sub-pixel accuracy
What are the six key steps in the SIFT algorithm?
- Construct a scale-space pyramid
- Obtain initial keypoints
- Improve keypoint localization accuracy
- Delete unsuitable keypoints
- Compute keypoint orientations
- Compute keypoint descriptor
How do unstable keypoints occur in the SIFT algorithm, and why are they deleted?
Can occur due to:
Low Contrast + Noise: Keypoints with insignificant intensity changes are sensitive to noise
Edge Responses: Keypoints along edges are not well-localized and are less robust
Removing these keypoints ensures that SIFT retains only distinctive and stable features
What is a keypoint descriptor?
A “unique fingerprint” for each keypoint, used to match features across images, even under changes in scale/rotation/illumination
How do you compute keypoint descriptors?
- Select neighbourhood
- Divide into subregions
- Compute gradients
- Create histograms
- Combine histograms
- Normalize the descriptor
What is a prototype?
Predefined patterns or templates representing specific classes, often stored in raw or processed forms for comparison
What is prototype matching?
Comparing unknown patterns to stored prototypes to determine their class; the similarity between the unknown and known data determines the classification
What are some methods for prototype-based matching?
Minimum Distance Classifier and Template Matching
Define Minimum Distance Classifier
Compares an unknown pattern to the mean of each class and assigns it to the class with the smallest distance
Define template matching
Uses correlation to find the best match between an unknown pattern and stored templates
What are the steps for minimum distance classification?
- Mean calculation: Compute mean vector for each class using training data
- Distance measurement: Measure distance between unknown pattern and each class mean
- Class assignment: Assign unknown pattern to class with smallest distance
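A minimal NumPy sketch of those three steps, assuming X_train is an (n_samples, n_features) array and y_train holds integer class labels:

```python
import numpy as np

def fit_class_means(X_train, y_train):
    """Mean calculation: one mean vector per class."""
    return {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def classify(x, class_means):
    """Distance measurement + class assignment."""
    distances = {c: np.linalg.norm(x - mean) for c, mean in class_means.items()}
    return min(distances, key=distances.get)   # class with the smallest distance
```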
What are the steps to template matching?
- Start with a template
- Slide template across bigger image
- Compare at each position
- Find best match
What is a similarity score?
A score used in prototype matching that measures how close a region of an image is to a predefined prototype
How is similarity score calculated?
It is calculated using the correlation coefficient, which involves:
- Pixel-by-pixel comparison with the template
- Normalization (compensates for brightness differences between the template and the image)
- An output score between -1 and 1 (1: perfect match, 0: no match, -1: perfect inverse match)
What is the limitation of the basic correlation formula? Is there a way to address this limitation?
The basic formula is sensitive to intensity changes (if the image becomes brighter or darker, the correlation score is affected). To address this limitation, use a normalized correlation formula, which normalizes the correlation result to account for intensity variations in the template or the image
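A minimal sketch of the normalized correlation coefficient for one template position; subtracting the means and dividing by the magnitudes makes the score insensitive to global brightness changes (in practice, OpenCV’s cv2.matchTemplate with the TM_CCOEFF_NORMED method computes this at every position):

```python
import numpy as np

def normalized_correlation(patch, template):
    """Score in [-1, 1]: 1 = perfect match, 0 = no match, -1 = perfect inverse match."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return (p * t).sum() / denom if denom != 0 else 0.0
```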
How does SIFT matching work?
Matching involves comparing SIFT descriptors from a known image (prototype) with descriptors from an unknown image
SIFT descriptors are high-dimensional vectors, which means matching directly can be computationally expensive. What strategies can be implemented to improve performance?
Best-Bin-First Search: Quickly identifies potential matches by approximating the nearest neighbours using limited computations.
Clusters of Matches: To improve reliability, clusters of potential matches are identified using the generalized Hough transform, which groups matches that align well geometrically
What are the steps for SIFT feature matching?
- Keypoint detection: Identify distinctive points in both images
- Descriptor generation: Compute a 128-dimensional vector for each keypoint
- Feature matching: Compare descriptors from both images and find the best match for each keypoint
- Filter matches: Use techniques like Lowe’s Ratio Test and Clustering to improve accuracy
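A minimal OpenCV sketch of these steps using the built-in SIFT implementation, a brute-force matcher, and Lowe’s Ratio Test (the 0.75 ratio is a common but illustrative choice; a KD-tree/FLANN matcher could replace brute force for speed):

```python
import cv2

def sift_match(img1, img2, ratio=0.75):
    """img1, img2: grayscale images."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)     # keypoints + 128-D descriptors
    kp2, des2 = sift.detectAndCompute(img2, None)
    candidates = cv2.BFMatcher().knnMatch(des1, des2, k=2)   # two nearest neighbours each
    good = [m for m, n in candidates if m.distance < ratio * n.distance]  # ratio test
    return kp1, kp2, good
```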
Describe Best-Bin-First Search (BBF Search)
Since comparing all features (brute force) is too slow, BBF Search focuses on the most likely matches first. This is done by:
- Organizing descriptors into bins (data structures like KD-trees)
- Searching in the best bin (most promising candidates)
- Stopping early if a good match is found
A good analogy is searching for a book by starting in the correct section instead of scanning the entire library
Describe Clusters of Matches (Generalized Hough Transform)
Since individual matches can be noisy or incorrect, the Generalized Hough Transform identifies clusters of consistent matches. This is done by:
- Grouping matches that agree on a geometric transformation (e.g., scaling, rotation)
- Discard outliers that don’t align with the cluster
A good analogy is solving a jigsaw puzzle by fitting groups of pieces together
What is a Neural Network (NN)?
A Neural Network (NN) is a computational system inspired by the human brain, designed to recognize patterns and solve problems
What is the basic structure of a neural network?
An NN is composed of interconnected units called neurons organized in layers. Key components include an input layer, hidden layers, and an output layer
What is the difference between a biological and artificial neuron?
Biological Neurons:
- Process and transmit information in the brain
- Receive signals, integrate inputs, and send outputs
Artificial Neurons:
- Perform mathematical operations
- Use activation functions to decide outputs
What is the structure of an artificial neuron?
Inputs: Data features or signals
Weights: Influence the strength of each input
Bias: Adds flexibility to the decision boundary
Activation Function: Determines whether a neuron should “fire” (output)
Output: Result of processing inputs
Formula:
Output = Activation(Σ(Input × Weight) + Bias)
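A minimal NumPy sketch of that formula for a single neuron, using sigmoid as an example activation:

```python
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias      # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))         # activation decides the output ("fires")
```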
Describe weights in neural networks
Weights determine the importance of each input feature to the neuron’s output (larger = stronger influence). Weights are adjusted during training to minimize loss. Higher weights amplify corresponding inputs; lower weights diminish them. Fine-tuning weights enables the network to adapt to patterns in the data
Describe bias in neural networks
Bias is a trainable parameter that allows the model to shift the activation function. Bias enables the neuron to make decisions independent of weighted inputs (helps network fit data more flexibly)
Describe activation functions in neural networks
Activation functions introduce non-linearity to the network. They decide whether or not to ‘fire’ the neuron’s output
What are the most commonly used activation functions?
Sigmoid: Smooth gradient, used for binary classification
ReLU: Efficient and widely used for hidden layers
Tanh: Zero-centred, scales outputs between -1 and 1
Softmax: Converts outputs to probabilities
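A minimal NumPy sketch of these four activation functions:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0, x)
def tanh(x):    return np.tanh(x)
def softmax(x):                              # converts a vector of scores to probabilities
    e = np.exp(x - np.max(x))                # subtract max for numerical stability
    return e / e.sum()
```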
What is a Multi-Layer Perceptron (MLP)?
A Multi-Layer Perceptron is a class of feed-forward neural networks consisting of multiple layers of neurons. MLPs can learn complex patterns by stacking layers. The architecture is structured in the following manner:
Input Layer: Receives the input features
Hidden Layers: Perform feature extraction through non-linear transformations
Output Layer: Provides predictions
What is the forward propagation process?
- Input features are passed through the network
- Each layer applies weights, biases, and activation functions
- Outputs are propagated to the next layer until the final output is produced
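A minimal NumPy sketch of forward propagation through a small MLP; the layer sizes (4 → 8 → 3) and the ReLU/softmax choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input (4 features) -> hidden (8 units)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> output (3 classes)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)              # hidden layer: weights, bias, ReLU
    logits = h @ W2 + b2                        # output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax probabilities

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))
```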
What is the difference between an objective function and a loss function?
Loss Function: Measures the error for a single data point or batch of data
Objective Function: The function to be minimized (or maximized) during training (often represents the aggregate loss over the entire dataset)
What is backpropagation?
- Computes gradients of the loss function with respect to the weights and biases (via the chain rule)
- These gradients are then used by optimization algorithms to adjust weights and biases and minimize the loss
What is a gradient?
A gradient is a vector representing the direction and rate of a function’s steepest increase. In neural networks, it typically refers to the partial derivatives of the loss function with respect to the model’s parameters (weights and biases). Think of it as a ‘guide’ or a ‘pointer’: moving the parameters against the gradient is the quickest way to reduce the errors in a neural network
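A minimal worked example of one gradient step for a single neuron with squared-error loss (the numbers are illustrative):

```python
# Loss L = (w*x + b - y)^2 for one training example
x, y = 2.0, 3.0                  # input and target
w, b, lr = 0.5, 0.0, 0.1         # initial weight, bias, learning rate
error = (w * x + b) - y          # prediction error = -2.0 (loss = 4.0)
grad_w = 2 * error * x           # dL/dw = -8.0
grad_b = 2 * error               # dL/db = -4.0
w, b = w - lr * grad_w, b - lr * grad_b   # step against the gradient
print(((w * x + b) - y) ** 2)    # loss after the update: 0.0
```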
What are Convolutional Neural Networks (CNNs)?
Specialized neural networks primarily used for image recognition and computer vision tasks. CNNs achieve state-of-the-art performance in many tasks (e.g., image classification, object detection)
What makes CNNs stand out from traditional machine learning?
Traditional machine learning methods require manual feature extraction, whereas CNNs learn hierarchical feature representations directly from raw data (e.g., images). CNNs also have fewer parameters than fully connected networks (MLPs) because they exploit local connectivity and parameter sharing
What is the architecture of a CNN?
- Convolution Layer
- Pooling Layer
- Fully Connected Layer (FC)
- Activation Functions
What is the convolution layer in a CNN?
The convolution layer performs filtering by sliding filters (kernels) over the input. It learns filters that activate when they see specific features
What is the pooling layer in a CNN?
The pooling layer reduces spatial dimensions (e.g., max pooling), which helps reduce computation and control overfitting.
What is the fully connected layer (FC) in a CNN?
The fully connected layer (FC) is the final layer for classification or regression.
Describe what a filter (kernel), stride, and padding is in a convolution operation
Filter (kernel): A small matrix applied over the input (e.g., 3×3 or 5×5).
Stride: The step size with which the filter moves across the input.
Padding: Zero-padding preserves spatial dimensions.
What is the formula for determining the output size (OS) in a convolutional neural network?
OS = 1 + (W - K + 2P)/S
W: Input dimension
K: Kernel size
P: Padding
S: Stride
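A small worked check of the formula with illustrative numbers:

```python
def conv_output_size(W, K, P, S):
    return 1 + (W - K + 2 * P) // S

assert conv_output_size(32, 5, 2, 1) == 32   # 5x5 kernel, padding 2, stride 1 preserves size
assert conv_output_size(28, 3, 0, 2) == 13   # 1 + (28 - 3)/2 = 13
```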
The main building block of a CNN is the __________ layer
Convolutional
Explain the process that occurs during a convolution operation
- Filter Sliding: The kernel moves across the input with the chosen stride until it covers the full width, then moves down (by the stride) and starts again at the left. This repeats until the entire image is traversed
- Element-wise Multiplication & Summation: At each position, we multiply the overlapping input patch by the filter and sum the results
- Feature Map: The sum is stored in the feature map at the corresponding location
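A minimal NumPy sketch of that slide-multiply-sum process for a single-channel input with stride 1 and no padding (note that CNN “convolution” is implemented as cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):                               # filter sliding
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            feature_map[r, c] = np.sum(patch * kernel)   # element-wise multiply + sum
    return feature_map
```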
Describe the difference between grayscale image convolution and RGB (colour) image convolution
Grayscale Image Convolution:
- A grayscale image has only one channel (intensity values from 0 to 255).
- Image shape is denoted as (H×W×1)
- Convolution filter shape: (f×f×1)
- The convolution operation applies a single 2D filter over the image.
- Produces a single feature map as output.
RGB (Color) Image Convolution:
- Each pixel has three separate intensity values (3 channels: Red, Green, Blue).
- Image shape is denoted as (H×W×3)
- Convolution filter shape: (f×f×3) – one filter “slice” per channel, then summed into a single feature map.
- Element-wise multiplication is performed independently for each channel, and the results are summed across channels.
- Produces a single feature map per filter.
Why do we need multiple filters in a convolutional layer?
- A single filter captures only one type of feature (e.g., horizontal edges).
- A Convolutional Layer applies multiple filters to extract different features at the same time.
- More filters = richer feature representation.
Example:
- Filter 1: Detects vertical edges.
- Filter 2: Detects horizontal edges.
- Filter 3: Detects diagonal lines
What is depth in a convolutional layer?
- The number of filters in a convolutional layer determines its depth.
- If a layer has 64 filters, it produces 64 feature maps.
- The output of a convolutional layer has the shape:
H×W×D
where D = number of filters (depth)
How do CNNs learn filters?
- Filters are not manually set; they are learned during training.
- The CNN adjusts filter values using backpropagation.
- Each filter activates strongly when it detects a matching pattern.
Over multiple layers:
- Early layers: Detect edges & textures.
- Middle layers: Detect shapes & parts.
- Deeper layers: Detect high-level objects (faces, animals, etc.).
Why do we need pooling layers in CNNs?
- Feature maps generated by Convolutional Layers are large.
- Pooling reduces spatial size, keeping only the most important information.
- Helps prevent overfitting by forcing CNNs to generalize.
- Makes CNNs translation invariant (small shifts in the image don’t affect detection).
What are the different types of pooling?
Max Pooling (Most Common): Takes the maximum value from each sub-region.
Average Pooling (Less Common): Takes the average value from each region, retains the overall smoothness of feature maps.
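A minimal NumPy sketch of 2×2 max pooling with stride 2, assuming the feature map has even height and width:

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) feature map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))   # max of each 2x2 block
```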
Describe the difference between batch processing and single image processing?
Instead of processing one image at a time, CNNs process multiple images in parallel (batch). A batch size is the number of images processed together before updating weights. The characteristics of both options are listed below:
Batch:
- Updates weights after computing the gradient over a batch of images
- More stable gradients, efficient GPU use
- Requires more memory
Single-Image:
- Updates weights after every image
- Faster weight updates
- Unstable training, noisy updates
True or False: Batch processing adds another dimension to an image tensor
True, with batch processing, an additional Batch Size (B) dimension is added: (H×W×D×B)
What is Batch Normalization (BN)?
Neural networks suffer from internal covariate shift, where layer activations change drastically, slowing training. Batch Normalization (BN) normalizes activations, reducing variance between batches and improving stability.
How does Batch Normalization (BN) work?
- Computes the mean and variance for each batch.
- Normalizes activations by subtracting the mean and dividing by standard deviation.
- Applies learnable scale and shift parameters to maintain network flexibility.
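A minimal NumPy sketch of those steps for one batch of activations, with gamma (scale) and beta (shift) as the learnable parameters:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch_size, n_features) activations."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift
```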
What are the benefits of Batch Normalization (BN)?
- Faster convergence (reduces training time).
- More stable training (reduces sensitivity to learning rate).
- Reduces dependence on careful weight initialization.
- Acts as a mild regularizer (reduces overfitting).
What is regularization?
CNNs can overfit, memorizing training data instead of generalizing. Regularization techniques help improve model generalization.
What is dropout?
- During training, random neurons are deactivated with probability p
- This forces the network to learn multiple representations, improving generalization.
How does dropout work?
- In each training step, some neurons are ignored.
- During testing, all neurons are active, but their activations are scaled by the keep probability (1 − p), where p is the dropout probability
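A minimal NumPy sketch, assuming p is the drop probability (the popular “inverted dropout” variant instead rescales at training time, so no test-time scaling is needed):

```python
import numpy as np

def dropout(x, p, training):
    """x: activations; p: probability of deactivating each neuron."""
    if training:
        mask = np.random.rand(*x.shape) >= p   # randomly deactivate neurons
        return x * mask
    return x * (1.0 - p)                       # test time: scale by the keep probability
```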
What are the benefits of dropout?
- Prevents overfitting.
- Helps CNNs learn redundant features.
- Improves model robustness.
What are Autoencoders?
- Neural networks designed for unsupervised learning.
- Learn compact representations (encoding) of input data.
- Used to pre-train deep models when labelled data is scarce
Consists of two main parts:
- Encoder: Compresses input into a lower-dimensional representation.
- Decoder: Reconstructs the input from this compressed representation.
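A minimal PyTorch sketch of the encoder/decoder pair; the layer sizes (784 → 32, e.g. for flattened 28×28 images) are illustrative:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # compress input to a low-dimensional code
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(            # reconstruct the input from the code
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))
```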
Why use deep autoencoders?
- Reduce dimensionality (feature compression).
- Learn meaningful latent representations of data.
- Useful for denoising, anomaly detection, and pretraining deep models.
What are the use cases for autoencoders?
- Image reconstruction
- Anomaly detection
- Data generation (using variational autoencoders (VAEs))
- Image Segmentation (U-Net)
What are Variational Autoencoders (VAEs)?
VAEs learn probabilistic latent representations rather than a single deterministic encoding. The encoder network converts the input into two vectors:
- Mean (μ): Center of the latent space distribution.
- Variance (σ²): Spread of the distribution.
Then, instead of sampling z directly from this distribution, the reparameterization trick is used:
z = μ + σ⋅ϵ, where ϵ ∼ N(0,1)
Then, the decoder takes the sampled latent vector, z, and reconstructs the original input
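A minimal PyTorch sketch of the reparameterization step, assuming the encoder outputs the mean mu and the log-variance log_var:

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, 1), keeping sampling differentiable."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(mu)
    return mu + sigma * eps
```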
Explain how an autoencoder works
- Train the autoencoder to reconstruct its input: the encoder compresses the input, the decoder reconstructs it, and the reconstruction error drives learning of unsupervised feature representations.
- Use the encoder’s output as input features for a classifier.
What are Generative Adversarial Networks (GANs)?
Generative Adversarial Networks (GANs) are a type of deep learning model used for generating new data that mimics a given dataset.
Consists of two competing neural networks:
- Generator (“Artist”): Creates fake data.
- Discriminator (“Critic”): Evaluates if data is real or fake
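A minimal PyTorch sketch of the two competing networks; the layer sizes and the 100-D noise / 784-D flattened-image shapes are illustrative:

```python
import torch.nn as nn

generator = nn.Sequential(          # "artist": maps random noise to fake data
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh())

discriminator = nn.Sequential(      # "critic": scores data as real (1) or fake (0)
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())
```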
What is deconvolution?
Deconvolution (also called transposed convolution) is used to increase the spatial resolution of feature maps in CNNs. It helps reconstruct finer details lost during convolution. Often used in image segmentation and super-resolution tasks
How do deconvolution layers work?
- Works by spreading pixel values over a larger area.
- Deconvolution uses a learnable kernel like standard convolution but performs an inverse process.
- Unlike fixed upsampling (e.g., nearest-neighbour or bilinear interpolation), deconvolution learns its weights during training.
What are the components in a transposed convolution layer?
Stride: Spacing between output values (upsampling factor).
Kernel: Similar concept to the convolution kernel, but effectively “spread out.”
Padding & Output Shape: Calculations ensure the desired output height/width.
Learnable Parameters: Weights are learned just like in forward convolution.
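A minimal PyTorch sketch showing how a transposed convolution upsamples a feature map; the channel counts and kernel/stride values are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                        # (batch, channels, H, W) feature map
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=2, stride=2)  # stride 2 doubles the spatial size
print(deconv(x).shape)                                # torch.Size([1, 32, 32, 32])
```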
Why would you use deconvolution in segmentation?
- Segmentation demands pixel-wise classification.
- Deep networks (like CNNs) typically reduce resolution to capture context.
- Need to “decode” feature maps back to full resolution (Transposed Conv.)
- To classify each pixel in the original image, we need to restore or approximate its original spatial resolution.
What is image segmentation?
The process of dividing an image into meaningful regions. Each pixel is assigned a label corresponding to an object/class
What are the different types of image segmentation?
Semantic: Labels every pixel with a class
Instance: Identifies and separates individual objects within an image
Panoptic: Combination of semantic + instance segmentation (Recognizes both object boundaries and individual instances)
What is U-Net?
A widely used model for biomedical image segmentation that helps precisely segment small objects. U-Net concatenates feature maps from the encoder to the decoder, preserving features from earlier layers (skip connections). Consists of two parts:
- Contracting path (Downsampling via convolutional layers)
- Expanding path (Upsampling via deconvolution layers)
What is transfer learning?
Transfer Learning is a deep learning technique where a pre-trained model is adapted for a new task. Instead of training from scratch, we reuse knowledge from existing models trained on large datasets (e.g., ImageNet). This saves computational resources and improves performance on smaller datasets.
How does transfer learning work?
- Select a Pre-trained Model: Choose a model trained on a large dataset (e.g., VGG16, ResNet, EfficientNet)
- Feature Extraction or Fine-tuning:
  - Feature Extraction: Freeze convolutional layers and use them to extract useful representations.
  - Fine-tuning: Unfreeze some deeper layers and retrain them on the new dataset.
- Train a New Classifier: Replace the final classification layer with a new one tailored to the target task.
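A minimal PyTorch/torchvision sketch of the feature-extraction variant; ResNet-18, the 10-class head, and the weights argument (newer torchvision versions) are illustrative choices:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False                    # freeze the convolutional backbone
model.fc = nn.Linear(model.fc.in_features, 10)     # new classifier head for the target task
# Only model.fc is trained; to fine-tune instead, unfreeze some of the deeper layers.
```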
What are some common image reconstruction techniques?
Denoising: Removes noise while preserving details
Inpainting: Fills in missing parts or damaged regions of an image
Super-Resolution: Enhances low-resolution images to high-resolution
What are some deep learning models that can be used for image reconstruction?
- CNNs
- Autoencoders
- GANs
What is image augmentation?
It increases dataset diversity by creating artificial/modified copies of existing data. Prevents overfitting and improves model robustness to variations
What are some image augmentation techniques?
Geometric Transformations: Rotation, flipping, cropping, scaling
Colour-Based Transformations: Brightness adjustment, contrast enhancement, colour jittering
Noise Addition: Gaussian noise, salt-and-pepper noise
Synthetic Data Generation: GANs and diffusion models for generating new samples
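A minimal torchvision sketch combining several of these techniques; the specific parameters are illustrative:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                         # geometric
    transforms.RandomRotation(degrees=15),                     # geometric
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # colour-based
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping + scaling
    transforms.ToTensor(),
])
```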
How do image augmentation and image reconstruction complement each other?
- Augmentation enhances training datasets to improve reconstruction models
- Reconstruction techniques can be used to clean augmented images
- Example: Super-resolution can be combined with augmentation for better data quality
- High-quality data + Diverse training = Robust models.