Midterm Flashcards
Define image processing
Image Processing is manipulating an image to improve its quality, extract information, or enable further analysis
Define feature
A distinctive attribute or description used to label or differentiate objects in images
Feature extraction involves two things. What are they?
Detection (finding features) and Description (quantifying features)
What are invariant and covariant features?
Invariant features: Values remain unchanged under specific transformations (e.g., rotation, scaling)
Covariant features: Values change predictably under transformations (e.g., scaling affects area proportionally)
What are local and global features?
Local features: Apply to individual image regions (e.g., corners, edges)
Global features: Describe entire images (e.g., colour histogram)
The purpose of preprocessing techniques is to…
Prepare images for further analysis by reducing noise, enhancing features, and normalizing data
Define boundary analysis
An analysis of the edges or outlines of objects to aid in object shape identification
Define region analysis
An analysis of the areas or segments within an image to support texture and pattern recognition
What is boundary following/tracing?
A technique to identify the boundary of an object in a binary image
What are the requirements for boundary following/tracing?
- Must be a binary image
- Image padded with a border of 0’s
- Single connected region
What are chain codes?
Chain codes represent the boundary of an object as a sequence of connected line segments. These segments are described using directional numbers based on connectivity
What are the different connectivity types?
4-Connectivity: Segments connect pixels in horizontal and vertical directions
8-Connectivity: Segments connect pixels in horizontal, vertical, and diagonal directions (finer boundary representation than 4-C)
What are the two types of chain codes?
Freeman chain codes and slope chain codes
Define Freeman chain codes
A boundary chain code that assigns a directional number (e.g., 0 for right, 1 for top-right, etc.) to each segment between consecutive boundary pixels (e.g., 0766666453321212)
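A minimal sketch of how a traced boundary could be turned into a Freeman code, assuming the boundary is already an ordered, closed list of 8-connected pixel coordinates (the direction-number convention below is one common choice; courses differ on orientation):

```python
# Freeman 8-connectivity: (dx, dy) step -> direction number, with 0 = right
# and numbers increasing counter-clockwise (conventions vary by textbook).
DIRECTIONS = {(1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
              (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7}

def freeman_chain_code(boundary):
    """boundary: ordered list of (x, y) pixels from boundary tracing (closed)."""
    code = []
    for (x0, y0), (x1, y1) in zip(boundary, boundary[1:] + boundary[:1]):
        code.append(DIRECTIONS[(x1 - x0, y1 - y0)])
    return code
```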
What is a strategy that could reduce the length of a boundary chain?
Resample fine-grained grid to a coarser grid spacing. This also helps with reducing sensitivity to noise or segmentation errors
What are some normalization techniques for chain codes?
Rotation normalization and starting point normalization
What is rotation normalization?
A normalization technique that uses the difference between consecutive directions (the first difference of the chain code) instead of absolute directions
What is starting point normalization?
A normalization technique for chain codes that treats the chain code as circular and shifts it to start with the smallest sequence
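A minimal sketch of both normalization techniques, assuming an 8-directional Freeman code stored as a Python list of integers:

```python
def first_difference(code, n_dirs=8):
    """Rotation normalization: circular differences between consecutive directions."""
    return [(code[(i + 1) % len(code)] - code[i]) % n_dirs for i in range(len(code))]

def smallest_rotation(code):
    """Starting point normalization: circular shift giving the smallest sequence."""
    rotations = [code[i:] + code[:i] for i in range(len(code))]
    return min(rotations)
```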
Define slope chain codes (SCCs)
A chain code for boundary analysis that uses slope changes between contiguous line segments to represent a boundary
How do you normalize a slope chain code?
Positive and zero slope changes are normalized to [0, 1), negative slope changes are normalized to (-1, 0)
What are the advantages of SCCs over Freeman codes?
- Provide finer granularity by utilizing a continuous slope range (-1, 1)
- Better representation under rotation
- Simpler process as SCCs do not require defining a grid
Define boundary approximation using minimum-perimeter polygons (MPP)
Approximating a boundary with a polygon that minimizes the total perimeter while maintaining the shape’s integrity; this provides a compact, simplified representation of object boundaries
What are the advantages of boundary approximation using MPP?
- Reduces computational complexity
- Simplifies boundary representation for storage and analysis
- Useful in applications like shape matching and object recognition
Define scale-invariant feature transform (SIFT)
SIFT extracts features that are invariant to scale, rotation, and certain changes in illumination
SIFT is designed to detect and describe _______ features in images
Local
SIFT features are ___________
Invariant
Describe the first step of the SIFT algorithm: Scale Space Pyramid Construction
The scale-space pyramid construction step represents the image at multiple scales so that features can be detected across varying object sizes
How would you construct a Scale Space Pyramid?
Repeatedly blur (with a Gaussian filter) and downsample the image
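A minimal OpenCV sketch of the repeated blur-and-downsample idea; the octave count, scales per octave, and base sigma below are illustrative, not the exact SIFT settings:

```python
import cv2

def scale_space_pyramid(image, n_octaves=4, scales_per_octave=5, sigma=1.6):
    """image: 2D grayscale numpy array."""
    pyramid = []
    current = image.copy()
    for _ in range(n_octaves):
        octave = []
        for s in range(scales_per_octave):
            k = 2 ** (s / (scales_per_octave - 1))           # blur grows within the octave
            octave.append(cv2.GaussianBlur(current, (0, 0), sigma * k))
        pyramid.append(octave)
        current = current[::2, ::2]                          # downsample for the next octave
    return pyramid
```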
Each group of blurred images in a scale space pyramid is called ________
An octave
Describe the second step of the SIFT algorithm: Obtain Initial Keypoints
Compute the difference of Gaussians (DoG) and find local extrema
How would you find the local extrema when obtaining initial keypoints in the SIFT algorithm?
Compare each pixel’s intensity value in the 2D DoG image to the intensity values of its 8 neighbours (the full algorithm also compares against the 9 neighbours in each adjacent DoG scale, 26 in total). The pixel is marked as an extremum if its value is greater/smaller than all its neighbours
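A minimal sketch of the 8-neighbour extremum test on a single 2D DoG image (a naive double loop, for illustration only):

```python
import numpy as np

def local_extrema_2d(dog):
    """Return (row, col) positions strictly larger or smaller than all 8 neighbours."""
    keypoints = []
    for r in range(1, dog.shape[0] - 1):
        for c in range(1, dog.shape[1] - 1):
            patch = dog[r - 1:r + 2, c - 1:c + 2]
            neighbours = np.delete(patch.flatten(), 4)       # drop the centre value
            if dog[r, c] > neighbours.max() or dog[r, c] < neighbours.min():
                keypoints.append((r, c))
    return keypoints
```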
Describe the third step of the SIFT algorithm: Improve Keypoint Localization Accuracy
SIFT interpolates the DoG values around each candidate keypoint (using the linear and quadratic terms of a Taylor series expansion) to locate the true extremum position with sub-pixel accuracy
What are the six key steps in the SIFT algorithm?
- Construct a scale-space pyramid
- Obtain initial keypoints
- Improve keypoint localization accuracy
- Delete unsuitable keypoints
- Compute keypoint orientations
- Compute keypoint descriptor
How do unstable keypoints occur in the SIFT algorithm, and why are they deleted?
Can occur due to:
Low Contrast + Noise: Keypoints with insignificant intensity changes are sensitive to noise
Edge Responses: Keypoints along edges are not well-localized and are less robust
Removing these keypoints ensures that SIFT retains only distinctive and stable features
What is a keypoint descriptor?
A “unique fingerprint” for each keypoint, used to match features across images, even under changes in scale/rotation/illumination
How do you compute keypoint descriptors?
- Select neighbourhood
- Divide into subregions
- Compute gradients
- Create histograms
- Combine histograms
- Normalize the descriptor
What is a prototype?
Predefined patterns or templates representing specific classes, often stored in raw or processed forms for comparison
What is prototype matching?
Comparing unknown patterns to stored prototypes to determine their class; the similarity between the unknown and known data determines the classification
What are some methods for prototype-based matching?
Minimum Distance Classifier and Template Matching
Define Minimum Distance Classifier
Compares an unknown pattern to the mean of each class and assigns it to the class with the smallest distance
Define template matching
Uses correlation to find the best match between an unknown pattern and stored templates
What are the steps for minimum distance classification?
- Mean calculation: Compute mean vector for each class using training data
- Distance measurement: Measure distance between unknown pattern and each class mean
- Class assignment: Assign unknown pattern to class with smallest distance
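A minimal NumPy sketch of those three steps, assuming X_train is an (n_samples, n_features) array and y_train holds integer class labels:

```python
import numpy as np

def fit_class_means(X_train, y_train):
    """Mean calculation: one mean vector per class."""
    return {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}

def classify(x, class_means):
    """Distance measurement + class assignment."""
    distances = {c: np.linalg.norm(x - mean) for c, mean in class_means.items()}
    return min(distances, key=distances.get)   # class with the smallest distance
```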
What are the steps to template matching?
- Start with a template
- Slide template across bigger image
- Compare at each position
- Find best match
What is a similarity score?
A score used in prototype matching that measures how close a region of an image is to a predefined prototype
How is similarity score calculated?
It is calculated using the correlation coefficient, which involves:
- Pixel-by-pixel comparison with the template
- Normalization (compensates for brightness differences between the template and the image)
- An output score between -1 and 1 (1: perfect match, 0: no match, -1: perfect inverse match)
What is the limitation of the basic correlation formula? Is there a way to address this limitation?
The basic formula is sensitive to intensity changes (if the image becomes brighter or darker, the correlation score is affected). To address this limitation, use a normalized correlation formula, which normalizes the correlation result to account for intensity variations in the template or the image
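A minimal sketch of the normalized correlation coefficient for one template position; subtracting the means and dividing by the magnitudes makes the score insensitive to global brightness changes (in practice, OpenCV’s cv2.matchTemplate with the TM_CCOEFF_NORMED method computes this at every position):

```python
import numpy as np

def normalized_correlation(patch, template):
    """Score in [-1, 1]: 1 = perfect match, 0 = no match, -1 = perfect inverse match."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return (p * t).sum() / denom if denom != 0 else 0.0
```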
How does SIFT matching work?
Matching involves comparing SIFT descriptors from a known image (prototype) with descriptors from an unknown image
SIFT descriptors are high-dimensional vectors, which means matching directly can be computationally expensive. What strategies can be implemented to improve performance?
Best-Bin-First Search: Quickly identifies potential matches by approximating the nearest neighbours using limited computations.
Clusters of Matches: To improve reliability, clusters of potential matches are identified using the generalized Hough transform, which groups matches that align well geometrically
What are the steps for SIFT feature matching?
- Keypoint detection: Identify distinctive points in both images
- Descriptor generation: Compute a 128-dimensional vector for each keypoint
- Feature matching: Compare descriptors from both images and find the best match for each keypoint
- Filter matches: Use techniques like Lowe’s Ratio Test and Clustering to improve accuracy
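A minimal OpenCV sketch of these steps using the built-in SIFT implementation, a brute-force matcher, and Lowe’s Ratio Test (the 0.75 ratio is a common but illustrative choice; a KD-tree/FLANN matcher could replace brute force for speed):

```python
import cv2

def sift_match(img1, img2, ratio=0.75):
    """img1, img2: grayscale images."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)     # keypoints + 128-D descriptors
    kp2, des2 = sift.detectAndCompute(img2, None)
    candidates = cv2.BFMatcher().knnMatch(des1, des2, k=2)   # two nearest neighbours each
    good = [m for m, n in candidates if m.distance < ratio * n.distance]  # ratio test
    return kp1, kp2, good
```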
Describe Best-Bin-First Search (BBF Search)
Since comparing all features (brute force) is too slow, BBF Search focuses on the most likely matches first. This is done by:
- Organizing descriptors into bins (data structures like KD-trees)
- Searching in the best bin (most promising candidates)
- Stopping early if a good match is found
A good analogy is searching for a book by starting in the correct section instead of scanning the entire library
Describe Clusters of Matches (Generalized Hough Transform)
Since individual matches can be noisy or incorrect, the Generalized Hough Transform identifies clusters of consistent matches. This is done by:
- Grouping matches that agree on a geometric transformation (e.g., scaling, rotation)
- Discard outliers that don’t align with the cluster
A good analogy is solving a jigsaw puzzle by fitting groups of pieces together
What is a Neural Network (NN)?
A Neural Network (NN) is a computational system inspired by the human brain, designed to recognize patterns and solve problems
What is the basic structure of a neural network?
An NN is composed of interconnected units called neurons organized in layers. Key components include an input layer, hidden layers, and an output layer
What is the difference between a biological and artificial neuron?
Biological Neurons:
- Process and transmit information in the brain
- Receive signals, integrate inputs, and send outputs
Artificial Neurons:
- Perform mathematical operations
- Use activation functions to decide outputs
What is the structure of an artificial neuron?
Inputs: Data features or signals
Weights: Influence the strength of each input
Bias: Adds flexibility to the decision boundary
Activation Function: Determines whether a neuron should “fire” (output)
Output: Result of processing inputs
Formula:
Output = Activation(Σ(Input × Weight) + Bias)
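A minimal NumPy sketch of that formula for a single neuron, using sigmoid as an example activation:

```python
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(inputs, weights) + bias      # weighted sum of inputs plus bias
    return 1.0 / (1.0 + np.exp(-z))         # activation decides the output ("fires")
```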
Describe weights in neural networks
Weights determine the importance of each input feature to the neuron’s output (larger = stronger influence). Weights are adjusted during training to minimize loss. Higher weights amplify corresponding inputs; lower weights diminish them. Fine-tuning weights enables the network to adapt to patterns in the data
Describe bias in neural networks
Bias is a trainable parameter that allows the model to shift the activation function. Bias enables the neuron to make decisions independent of weighted inputs (helps network fit data more flexibly)
Describe activation functions in neural networks
Activation functions introduce non-linearity to the network. They decide whether or not to ‘fire’ the neuron’s output
What are the most commonly used activation functions?
Sigmoid: Smooth gradient, used for binary classification
ReLU: Efficient and widely used for hidden layers
Tanh: Zero-centred, scales outputs between -1 and 1
Softmax: Converts outputs to probabilities
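A minimal NumPy sketch of these four activation functions:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x):    return np.maximum(0, x)
def tanh(x):    return np.tanh(x)
def softmax(x):                              # converts a vector of scores to probabilities
    e = np.exp(x - np.max(x))                # subtract max for numerical stability
    return e / e.sum()
```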
What is a Multi-Layer Perceptron (MLP)?
A Multi-Layer Perceptron is a class of feed-forward neural networks consisting of multiple layers of neurons. MLPs can learn complex patterns by stacking layers. The architecture is structured in the following manner:
Input Layer: Receives the input features
Hidden Layers: Perform feature extraction through non-linear transformations
Output Layer: Provides predictions
What is the forward propagation process?
- Input features are passed through the network
- Each layer applies weights, biases, and activation functions
- Outputs are propagated to the next layer until the final output is produced
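A minimal NumPy sketch of forward propagation through a small MLP; the layer sizes (4 → 8 → 3) and the ReLU/softmax choices are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input (4 features) -> hidden (8 units)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> output (3 classes)

def forward(x):
    h = np.maximum(0, x @ W1 + b1)              # hidden layer: weights, bias, ReLU
    logits = h @ W2 + b2                        # output layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                          # softmax probabilities

print(forward(np.array([0.5, -1.2, 3.0, 0.1])))
```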
What is the difference between an objective function and a loss function?
Loss Function: Measures the error for a single data point or batch of data
Objective Function: The function to be minimized (or maximized) during training (often represents the aggregate loss over the entire dataset)
What is backpropagation?
- Computes gradients of the loss function with respect to the weights and biases (via the chain rule)
- These gradients are then used by optimization algorithms to adjust weights and biases and minimize the loss
What is a gradient?
A gradient is a vector representing the direction and rate of a function’s steepest increase. In neural networks, it typically refers to the partial derivatives of the loss function with respect to the model’s parameters (weights and biases). Think of it as a ‘guide’ or a ‘pointer’: moving the parameters against the gradient is the quickest way to reduce the errors in a neural network
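A minimal worked example of one gradient step for a single neuron with squared-error loss (the numbers are illustrative):

```python
# Loss L = (w*x + b - y)^2 for one training example
x, y = 2.0, 3.0                  # input and target
w, b, lr = 0.5, 0.0, 0.1         # initial weight, bias, learning rate
error = (w * x + b) - y          # prediction error = -2.0 (loss = 4.0)
grad_w = 2 * error * x           # dL/dw = -8.0
grad_b = 2 * error               # dL/db = -4.0
w, b = w - lr * grad_w, b - lr * grad_b   # step against the gradient
print(((w * x + b) - y) ** 2)    # loss after the update: 0.0
```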
What are Convolutional Neural Networks (CNNs)?
Specialized neural networks primarily used for image recognition and computer vision tasks. CNNs achieve state-of-the-art performance in many tasks (e.g., image classification, object detection)
What makes CNNs stand out from traditional machine learning?
Traditional machine learning methods require manual feature extraction, whereas CNNs learn hierarchical feature representations directly from raw data (e.g., images). CNNs also have fewer parameters than fully connected networks (MLPs) because they exploit local connectivity and parameter sharing
What is the architecture of a CNN?
- Convolution Layer
- Pooling Layer
- Fully Connected Layer (FC)
- Activation Functions
What is the convolution layer in a CNN?
The convolution layer performs filtering by sliding filters (kernels) over the input. It learns filters that activate when they see specific features
What is the pooling layer in a CNN?
The pooling layer reduces spatial dimensions (e.g., max pooling), which helps reduce computation and control overfitting.
What is the fully connected layer (FC) in a CNN?
The fully connected layer (FC) is the final layer for classification or regression.
Describe what a filter (kernel), stride, and padding is in a convolution operation
Filter (kernel): A small matrix applied over the input (e.g., 3×3 or 5×5).
Stride: The step size with which the filter moves across the input.
Padding: Zero-padding preserves spatial dimensions.
What is the formula for determining the output size (OS) in a convolutional neural network?
OS = 1 + (W - K + 2P)/S
W: Input dimension
K: Kernel size
P: Padding
S: Stride
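A small worked check of the formula with illustrative numbers:

```python
def conv_output_size(W, K, P, S):
    return 1 + (W - K + 2 * P) // S

assert conv_output_size(32, 5, 2, 1) == 32   # 5x5 kernel, padding 2, stride 1 preserves size
assert conv_output_size(28, 3, 0, 2) == 13   # 1 + (28 - 3)/2 = 13
```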
The main building block of a CNN is the __________ layer
Convolutional
Explain the process that occurs during a convolution operation
- Filter Sliding: The kernel moves across the input with the chosen stride until it covers the full width, then moves down (by the stride) and starts again at the left. This repeats until the entire image is traversed
- Element-wise Multiplication & Summation: At each position, we multiply the overlapping input patch by the filter and sum the results
- Feature Map: The sum is stored in the feature map at the corresponding location
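A minimal NumPy sketch of that slide-multiply-sum process for a single-channel input with stride 1 and no padding (note that CNN “convolution” is implemented as cross-correlation, i.e. the kernel is not flipped):

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for r in range(out_h):                               # filter sliding
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            feature_map[r, c] = np.sum(patch * kernel)   # element-wise multiply + sum
    return feature_map
```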
Describe the difference between grayscale image convolution and RGB (colour) image convolution
Grayscale Image Convolution:
- A grayscale image has only one channel (intensity values from 0 to 255).
- Image shape is denoted as (H×W×1)
- Convolution filter shape: (f×f×1)
- The convolution operation applies a single 2D filter over the image.
- Produces a single feature map as output.
RGB (Color) Image Convolution:
- Each pixel has three separate intensity values (3 channels: Red, Green, Blue).
- Image shape is denoted as (H×W×3)
- Convolution filter shape: (f×f×3) – one filter “slice” per channel, then summed into a single feature map.
- Element-wise multiplication is performed independently for each channel, and the results are summed across channels.
- Produces a single feature map per filter.
Why do we need multiple filters in a convolutional layer?
- A single filter captures only one type of feature (e.g., horizontal edges).
- A Convolutional Layer applies multiple filters to extract different features at the same time.
- More filters = richer feature representation.
Example:
- Filter 1: Detects vertical edges.
- Filter 2: Detects horizontal edges.
- Filter 3: Detects diagonal lines
What is depth in a convolutional layer?
- The number of filters in a convolutional layer determines its depth.
- If a layer has 64 filters, it produces 64 feature maps.
- The output of a convolutional layer has the shape:
H×W×D
where D = number of filters (depth)
How do CNNs learn filters?
- Filters are not manually set; they are learned during training.
- The CNN adjusts filter values using backpropagation.
- Each filter activates strongly when it detects a matching pattern.
Over multiple layers:
- Early layers: Detect edges & textures.
- Middle layers: Detect shapes & parts.
- Deeper layers: Detect high-level objects (faces, animals, etc.).
Why do we need pooling layers in CNNs?
- Feature maps generated by Convolutional Layers are large.
- Pooling reduces spatial size, keeping only the most important information.
- Helps prevent overfitting by forcing CNNs to generalize.
- Makes CNNs translation invariant (small shifts in the image don’t affect detection).
What are the different types of pooling?
Max Pooling (Most Common): Takes the maximum value from each sub-region.
Average Pooling (Less Common): Takes the average value from each region, retains the overall smoothness of feature maps.
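A minimal NumPy sketch of 2×2 max pooling with stride 2, assuming the feature map has even height and width:

```python
import numpy as np

def max_pool_2x2(x):
    """x: (H, W) feature map with even H and W."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))   # max of each 2x2 block
```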
Describe the difference between batch processing and single image processing?
Instead of processing one image at a time, CNNs process multiple images in parallel (batch). A batch size is the number of images processed together before updating weights. The characteristics of both options are listed below:
Batch:
- Updates weights after computing the gradient over a batch of images
- More stable gradients, efficient GPU use
- Requires more memory
Single-Image:
- Updates weights after every image
- Faster weight updates
- Unstable training, noisy updates
True or False: Batch processing adds another dimension to an image tensor
True, with batch processing, an additional Batch Size (B) dimension is added: (H×W×D×B)
What is Batch Normalization (BN)?
Neural networks suffer from internal covariate shift, where layer activations change drastically, slowing training. Batch Normalization (BN) normalizes activations, reducing variance between batches and improving stability.
How does Batch Normalization (BN) work?
- Computes the mean and variance for each batch.
- Normalizes activations by subtracting the mean and dividing by standard deviation.
- Applies learnable scale and shift parameters to maintain network flexibility.
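A minimal NumPy sketch of those steps for one batch of activations, with gamma (scale) and beta (shift) as the learnable parameters:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch_size, n_features) activations."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift
```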
What are the benefits of Batch Normalization (BN)?
- Faster convergence (reduces training time).
- More stable training (reduces sensitivity to learning rate).
- Reduces dependence on careful weight initialization.
- Acts as a mild regularizer (reduces overfitting).
What is regularization?
CNNs can overfit, memorizing training data instead of generalizing. Regularization techniques help improve model generalization.
What is dropout?
- During training, random neurons are deactivated with probability p
- This forces the network to learn multiple representations, improving generalization.
How does dropout work?
- In each training step, some neurons are ignored.
- During testing, all neurons are active, but their activations are scaled by the keep probability (1 − p), where p is the dropout probability
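A minimal NumPy sketch, assuming p is the drop probability (the popular “inverted dropout” variant instead rescales at training time, so no test-time scaling is needed):

```python
import numpy as np

def dropout(x, p, training):
    """x: activations; p: probability of deactivating each neuron."""
    if training:
        mask = np.random.rand(*x.shape) >= p   # randomly deactivate neurons
        return x * mask
    return x * (1.0 - p)                       # test time: scale by the keep probability
```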
What are the benefits of dropout?
- Prevents overfitting.
- Helps CNNs learn redundant features.
- Improves model robustness.
What are Autoencoders?
- Neural networks designed for unsupervised learning.
- Learn compact representations (encoding) of input data.
- Used to pre-train deep models when labelled data is scarce
Consists of two main parts:
- Encoder: Compresses input into a lower-dimensional representation.
- Decoder: Reconstructs the input from this compressed representation.
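A minimal PyTorch sketch of the encoder/decoder pair; the layer sizes (784 → 32, e.g. for flattened 28×28 images) are illustrative:

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(            # compress input to a low-dimensional code
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(            # reconstruct the input from the code
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))
```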
Why use deep autoencoders?
- Reduce dimensionality (feature compression).
- Learn meaningful latent representations of data.
- Useful for denoising, anomaly detection, and pretraining deep models.
What are the use cases for autoencoders?
- Image reconstruction
- Anomaly detection
- Data generation (using variational autoencoders (VAEs))
- Image Segmentation (U-Net)
What are Variational Autoencoders (VAEs)?
VAEs learn probabilistic latent representations rather than a single deterministic encoding. The encoder network converts the input into two vectors:
- Mean (μ): Center of the latent space distribution.
- Variance (σ²): Spread of the distribution.
Then, instead of sampling z directly from this distribution, the reparameterization trick is used:
z = μ + σ⋅ϵ, where ϵ ∼ N(0,1)
Then, the decoder takes the sampled latent vector, z, and reconstructs the original input
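A minimal PyTorch sketch of the reparameterization step, assuming the encoder outputs the mean mu and the log-variance log_var:

```python
import torch

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, 1), keeping sampling differentiable."""
    sigma = torch.exp(0.5 * log_var)
    eps = torch.randn_like(mu)
    return mu + sigma * eps
```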
Explain how an autoencoder works
- Train the autoencoder to reconstruct its input: the encoder compresses the input, the decoder reconstructs it, and the reconstruction error drives learning of unsupervised feature representations.
- Use the encoder’s output as input features for a classifier.
What are Generative Adversarial Networks (GANs)?
Generative Adversarial Networks (GANs) are a type of deep learning model used for generating new data that mimics a given dataset.
Consists of two competing neural networks:
- Generator (“Artist”): Creates fake data.
- Discriminator (“Critic”): Evaluates if data is real or fake
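A minimal PyTorch sketch of the two competing networks; the layer sizes and the 100-D noise / 784-D flattened-image shapes are illustrative:

```python
import torch.nn as nn

generator = nn.Sequential(          # "artist": maps random noise to fake data
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh())

discriminator = nn.Sequential(      # "critic": scores data as real (1) or fake (0)
    nn.Linear(784, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid())
```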
What is deconvolution?
Deconvolution (also called transposed convolution) is used to increase the spatial resolution of feature maps in CNNs. It helps reconstruct finer details lost during convolution. Often used in image segmentation and super-resolution tasks
How do deconvolution layers work?
- Works by spreading pixel values over a larger area.
- Deconvolution uses a learnable kernel like standard convolution but performs an inverse process.
- Unlike fixed upsampling (e.g., nearest-neighbour or bilinear interpolation), deconvolution learns its weights during training.
What are the components in a transposed convolution layer?
Stride: Spacing between output values (upsampling factor).
Kernel: Similar concept to the convolution kernel, but effectively “spread out.”
Padding & Output Shape: Calculations ensure the desired output height/width.
Learnable Parameters: Weights are learned just like in forward convolution.
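A minimal PyTorch sketch showing how a transposed convolution upsamples a feature map; the channel counts and kernel/stride values are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 16, 16)                        # (batch, channels, H, W) feature map
deconv = nn.ConvTranspose2d(in_channels=64, out_channels=32,
                            kernel_size=2, stride=2)  # stride 2 doubles the spatial size
print(deconv(x).shape)                                # torch.Size([1, 32, 32, 32])
```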
Why would you use deconvolution in segmentation?
- Segmentation demands pixel-wise classification.
- Deep networks (like CNNs) typically reduce resolution to capture context.
- Need to “decode” feature maps back to full resolution (Transposed Conv.)
- To classify each pixel in the original image, we need to restore or approximate its original spatial resolution.
What is image segmentation?
The process of dividing an image into meaningful regions. Each pixel is assigned a label corresponding to an object/class
What are the different types of image segmentation?
Semantic: Labels every pixel with a class
Instance: Identifies and separates individual objects within an image
Panoptic: Combination of semantic + instance segmentation (Recognizes both object boundaries and individual instances)
What is U-Net?
A widely used model for biomedical image segmentation that helps precisely segment small objects. U-Net concatenates feature maps from the encoder to the decoder, preserving features from earlier layers (skip connections). Consists of two parts:
- Contracting path (Downsampling via convolutional layers)
- Expanding path (Upsampling via deconvolution layers)
What is transfer learning?
Transfer Learning is a deep learning technique where a pre-trained model is adapted for a new task. Instead of training from scratch, we reuse knowledge from existing models trained on large datasets (e.g., ImageNet). This saves computational resources and improves performance on smaller datasets.
How does transfer learning work?
- Select a Pre-trained Model: Choose a model trained on a large dataset (e.g., VGG16, ResNet, EfficientNet)
- Feature Extraction or Fine-tuning:
  - Feature Extraction: Freeze convolutional layers and use them to extract useful representations.
  - Fine-tuning: Unfreeze some deeper layers and retrain them on the new dataset.
- Train a New Classifier: Replace the final classification layer with a new one tailored to the target task.
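A minimal PyTorch/torchvision sketch of the feature-extraction variant; ResNet-18, the 10-class head, and the weights argument (newer torchvision versions) are illustrative choices:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False                    # freeze the convolutional backbone
model.fc = nn.Linear(model.fc.in_features, 10)     # new classifier head for the target task
# Only model.fc is trained; to fine-tune instead, unfreeze some of the deeper layers.
```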
What are some common image reconstruction techniques?
Denoising: Removes noise while preserving details
Inpainting: Fills in missing parts or damaged regions of an image
Super-Resolution: Enhances low-resolution images to high-resolution
What are some deep learning models that can be used for image reconstruction?
- CNNs
- Autoencoders
- GANs
What is image augmentation?
It increases dataset diversity by creating artificial/modified copies of existing data. Prevents overfitting and improves model robustness to variations
What are some image augmentation techniques?
Geometric Transformations: Rotation, flipping, cropping, scaling
Colour-Based Transformations: Brightness adjustment, contrast enhancement, colour jittering
Noise Addition: Gaussian noise, salt-and-pepper noise
Synthetic Data Generation: GANs and diffusion models for generating new samples
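A minimal torchvision sketch combining several of these techniques; the specific parameters are illustrative:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                         # geometric
    transforms.RandomRotation(degrees=15),                     # geometric
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # colour-based
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),  # cropping + scaling
    transforms.ToTensor(),
])
```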
How do image augmentation and image reconstruction complement each other?
- Augmentation enhances training datasets to improve reconstruction models
- Reconstruction techniques can be used to clean augmented images
- Example: Super-resolution can be combined with augmentation for better data quality
- High-quality data + Diverse training = Robust models.