Post-Midterm Flashcards
Define discontinuity-based segmentation
A segmentation technique that identifies abrupt changes in intensity (where values jump significantly compared to their neighbours)
What is the goal of discontinuity-based segmentation
To separate objects from background by finding boundary pixels (edges/lines/points) that mark transitions between regions
In order to detect edges, an important mathematical foundation needed is _________________
The derivative
A positive derivative means the intensity is __________ as x increases. Vice versa for a negative derivative.
Increasing
(dark → bright when moving to the right)
How does a first-order derivative (gradient) behave in an edge detection algorithm
Gradients highlight where large changes occur, but also indicate which side is darker or lighter (via sign)
How does a second-order derivative (Laplacian) behave in an edge detection algorithm
Laplacians reinforce fine details and help locate edges precisely (the zero crossing between the positive and negative responses marks the edge centre). The sign of the second derivative often reveals whether an edge is going from dark-to-light or light-to-dark
What are the three different types of discontinuity
Points, lines, & edges
Describe a point discontinuity
Single pixel that differs sharply from neighbours (detected by the Laplacian)
Describe a line discontinuity
1-2 pixel wide structures differing in intensity from surroundings (detected by directional masks or second derivatives)
Describe an edge discontinuity
Transition zones (ideal: step edges; real: blurred or ramp edges)
What are the advantages of discontinuity segmentation
Can directly locate boundaries. It’s good for images where objects exhibit strong contrast against the background
What are the challenges of discontinuity segmentation
It can be sensitive to noise (derivatives amplify noise). Images often require smoothing (pre-filtering) and careful threshold selection to avoid false edges or fragmented edges
Describe what a ‘point’ is in the context of point detection
An isolated pixel whose intensity differs significantly from its immediate neighbours. It typically appears as a bright or dark ‘spot’ in a relatively uniform background
What are the steps involved to implement a point detection algorithm
Step 1: Apply second-order derivative (Laplacian) filter
Step 2: Take the absolute value of the response
Step 3: Threshold the absolute response
Step 4: Label isolated points
Describe how to accomplish the first step, ‘Apply Laplacian Filter’, when implementing a point detection algorithm
Convolve the kernel below with an image to obtain a filter response
Second-Order 3x3 Laplacian kernel:
0 1 0
1 -4 1
0 1 0
Describe how to accomplish the second step, ‘Take the absolute value of the response’, when implementing a point detection algorithm following the application of the Laplacian filter
After the first step, the Laplacian response can be positive or negative. By taking the absolute value of the response, we get a magnitude that indicates how large the change is, regardless of sign
Describe how to accomplish the third step, ‘Threshold the absolute response’, when implementing a point detection algorithm after taking the absolute value of the Laplacian response
Choose a threshold T, often a fixed percentage of the maximum magnitude, so that only prominent ‘spikes’ exceed it and get labelled
Describe how to accomplish the fourth and final step, ‘Label isolated points’, when implementing a point detection algorithm after thresholding the absolute response
If the magnitude of Z(x,y) is greater than the threshold, declare (x,y) an isolated point. Store this point as a 1 (or ‘white’) in an output binary image, while others are stored as 0 (or ‘black’)
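A minimal Python sketch of the full four-step procedure (NumPy and SciPy assumed available; the threshold fraction t is an illustrative choice, not a fixed rule):

import numpy as np
from scipy.ndimage import convolve

def detect_points(image, t=0.9):
    laplacian = np.array([[0, 1, 0],
                          [1, -4, 1],
                          [0, 1, 0]])
    response = convolve(image.astype(float), laplacian)  # Step 1: Laplacian filter
    magnitude = np.abs(response)                         # Step 2: absolute value
    threshold = t * magnitude.max()                      # Step 3: threshold (t is tunable)
    return (magnitude > threshold).astype(np.uint8)      # Step 4: binary map of isolated points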
Describe a ‘line’ in the context of line detection
A set of connected pixels with similar intensity, often just 1-2 pixels in thickness, differing in intensity from its background
Briefly describe the process to apply line detection to an image
Convolve the image with a second-order derivative filter or with specialized directional filters, as shown below. After convolution, threshold the filter response to isolate line pixels.
Vertical kernel:
-1 2 -1
-1 2 -1
-1 2 -1
Horizontal kernel:
-1 -1 -1
2 2 2
-1 -1 -1
45-degree kernel:
2 -1 -1
-1 2 -1
-1 -1 2
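A short sketch of this process in Python (NumPy and SciPy assumed; the threshold fraction is illustrative):

import numpy as np
from scipy.ndimage import convolve

kernels = {
    'horizontal': np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),
    'vertical':   np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),
    'diag45':     np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),
}

def detect_lines(image, direction, t=0.8):
    response = np.abs(convolve(image.astype(float), kernels[direction]))
    return response > t * response.max()  # binary mask of line pixels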
Define ‘edge’ in the context of edge detection
A boundary between two distinct regions of intensity or texture
What are the different types of edges?
Step edge - Sudden transition in intensity (ideal)
Ramp edge - Gradual transition (common)
Roof edge - Transition to one intensity from another, then quickly back to the original (typical in thin lines or object ridges)
True or False: Image noise/blur do not cause step edges to turn into ramp edges
False
What is ‘clustering’ in clustering segmentation
The clustering approach in segmentation involves grouping pixels based on intensity, colour, or feature similarity without requiring labelled data
What is ‘K-Means’ clustering?
An approach to clustering segmentation that involves the following steps:
- Choose k (# of clusters)
- Initialize cluster centres (random or heuristics)
- Assign each pixel to nearest cluster centre (Euclidean distance in intensity space)
- Update cluster centres as mean of assigned pixels
- Iterate until convergence
The output of this process will be an image with each pixel labelled with a cluster index (segmented image with k segments)
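A bare-bones K-Means sketch on grayscale intensities (NumPy only; in practice a library such as scikit-learn would be used, and empty clusters are not handled here):

import numpy as np

def kmeans_segment(image, k=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 1).astype(float)
    centres = rng.choice(pixels.ravel(), size=k, replace=False)[:, None]  # random init
    for _ in range(iters):  # iterate until (approximate) convergence
        labels = np.argmin(np.abs(pixels - centres.T), axis=1)  # assign to nearest centre
        centres = np.array([pixels[labels == c].mean(axis=0) for c in range(k)])  # update
    return labels.reshape(image.shape)  # each pixel labelled with a cluster index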
What is ‘Fuzzy C-Means’ clustering?
An approach to clustering segmentation where each pixel has partial memberships to clusters (helps with ambiguous boundaries/smooth transitions) and is most useful when cluster boundaries are not sharp. Cluster centres and membership weights are updated iteratively, much like K-Means but with soft assignments, making it more computationally intensive than K-Means clustering.
What is watershed segmentation
A gradient-based segmentation method that interprets the gradient magnitude as a topographic surface (low intensities = valleys; high intensities = ridges). Flooding the surface from its regional minima forms catchment basins (the segments), and the lines where floods would merge become the region boundaries
What are active contours (or snakes)
A gradient-based segmentation method that ‘evolves’ a curve such that it locks onto region boundaries. The curve evolves under internal smoothness constraints and external ‘image forces’ derived from the gradient. This means that the curve evolves to minimize a total energy combining internal energy (smoothness) and external energy (which pulls the curve toward edges)
What are some common metrics used to evaluate the effectiveness of a segmentation algorithm?
Pixel Accuracy:
(# of correct pixels) / (# of total pixels)
IoU (Intersection over Union):
|A ∩ B| / |A ∪ B|
Dice Coefficient:
2|A ∩ B| / (|A| + |B|)
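These metrics are straightforward to compute for binary masks; a sketch in NumPy:

import numpy as np

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return inter / np.logical_or(pred, gt).sum()

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())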
What are common challenges involved with image segmentation?
- Noise sensitivity (derivatives amplify noise)
- Parameter tuning (thresholds, cluster sizes, kernel sizes)
- Complex scenes (overlapping objects, illumination changes, low contrast)
Define structuring element (SE)
A small, predefined shape (or set of pixels) used in morphological image processing. It acts like a ‘probe’ that scans over an image to analyze or modify its shapes
When processing an image using mathematical morphology, a _____________ is moved across the image to modify object shapes based on specific rules
Structuring element (SE)
What are the most common morphological operations
- Erosion (removes pixels from objects)
- Dilation (adds pixels to objects)
- Opening & Closing (combinations of erosion and dilation)
Define reflection and translation in the context of image morphology
Reflection: Flipping an SE by 180 degrees
Translation: Moving an SE across an image
Define the erosion morphological operation
This operation shrinks foreground objects in an image. This is done by translating an SE over all possible positions in the image and marking the origin of the SE as a foreground pixel (1) wherever the SE fits entirely within the foreground. All other pixels are marked as background pixels (0)
What are the effects of the erosion morphological operation?
- Thins/Shrinks objects
- Removes small noise
- Separates connected components in an image
- Shapes objects based on structuring element (e.g. elongated SE can reduce objects to a line)
Define the dilation morphological operation
This operation expands/thickens foreground objects in an image. This is done by translating the (reflected) SE over all possible positions in the image and marking the origin of the SE as a foreground pixel (1) wherever the SE overlaps at least one foreground pixel. All other pixels are marked as background pixels (0)
What are the effects of the dilation morphological operation?
- Grows objects
- Fills small gaps/holes
- Shapes expand based on SE size and shape
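Both operations are available in SciPy; a toy sketch showing their effects:

import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

binary_img = np.zeros((7, 7), dtype=bool)
binary_img[2:5, 2:5] = True                          # a 3x3 foreground square
se = np.ones((3, 3), dtype=bool)                     # 3x3 square structuring element
eroded = binary_erosion(binary_img, structure=se)    # shrinks the square to a single pixel
dilated = binary_dilation(binary_img, structure=se)  # grows the square to 5x5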
Describe the concept of ‘morphological duality’ in the context of erosion and dilation
Duality describes the relationship between dilation and erosion through complementation (one operation can be derived from the other by working with the image’s background instead of the foreground)
What is the erosion-dilation duality
The complement of the erosion of A by B is equal to the dilation of the complement of A using the reflected structuring element, Br. In symbols: (A ⊖ B)ᶜ = Aᶜ ⊕ Br
What is the dilation-erosion duality
The complement of the dilation of A by B is equal to the erosion of the complement of A using the reflected structuring element, Br. In symbols: (A ⊕ B)ᶜ = Aᶜ ⊖ Br
Define the opening morphological operation
This operation removes small objects or noise while preserving the general shape of larger objects. This is done by first eroding A by B (shrinking it), then dilating it back using B, partially restoring the main structure. In symbols: A ∘ B = (A ⊖ B) ⊕ B
Define the closing morphological operation
This operation fills small gaps or holes, smooths object contours, and fuses narrow breaks. This is done by first dilating A by B (expanding it), then eroding it back using B, smoothing object boundaries and filling in small gaps. In symbols: A • B = (A ⊕ B) ⊖ B
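The compositions can be written directly; SciPy's binary_opening/binary_closing implement the same operations:

from scipy.ndimage import binary_erosion, binary_dilation

def opening(a, se):
    return binary_dilation(binary_erosion(a, structure=se), structure=se)  # erode, then dilate

def closing(a, se):
    return binary_erosion(binary_dilation(a, structure=se), structure=se)  # dilate, then erode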
Define ‘Hit-Or-Miss’ (HMT) Transform
A morphological tool used for shape detection in binary images. It relies on two SEs rather than one (one for the foreground, the other for the background)
Why would one use a hit-or-miss transform
HMT allows detection of small features (e.g. corners, endpoints) precisely by combining erosion with a specially designed pair of SEs
Define boundary extraction
A basic morphological algorithm that isolates the edges of a foreground object using erosion and set difference
What is the process involved in a boundary extraction algorithm?
Erode the object within the image using a structuring element. Then, subtract the eroded image from the original image to leave only the boundary pixels. In symbols: β(A) = A − (A ⊖ B)
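A one-line sketch of this process (NumPy + SciPy, boolean images assumed):

import numpy as np
from scipy.ndimage import binary_erosion

def boundary(a, se):
    return a & ~binary_erosion(a, structure=se)  # beta(A) = A minus (A eroded by B)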
Define hole filling algorithm
A basic morphological algorithm that fills background regions enclosed by a connected foreground boundary using dilation, complementation, and intersection
What is the process involved in a hole filling algorithm?
Create an array of 0’s the same size as the image and set 1’s at known hole locations. Then, apply dilation to the newly created array using a symmetric SE. Intersect the result with the complement of the image to limit expansion inside the hole, and repeat this process until no further changes occur. The union of this array and the original image fills all holes while preserving object boundaries
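SciPy's binary_fill_holes is a practical equivalent of this procedure (it handles the seeding automatically); a toy example:

import numpy as np
from scipy.ndimage import binary_fill_holes

ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[2:5, 2:5] = False            # a square outline enclosing a hole
filled = binary_fill_holes(ring)  # the hole becomes foreground; the boundary is preserved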
Define connected component extraction
A basic morphological algorithm that identifies and isolates groups of connected foreground pixels in a binary image
What is the process involved in a connected component extraction algorithm?
Create an array of 0’s the same size as the image and set 1’s at known points within each connected component. Apply dilation to this array using a symmetric SE. Intersect this array with the original image to restrict growth, and repeat this process until no further changes occur. The newly created array contains all connected components of the original image
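In practice, scipy.ndimage.label extracts all components in a single call (a convenient alternative to the iterative procedure above):

import numpy as np
from scipy.ndimage import label

img = np.array([[1, 1, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 1]], dtype=bool)
labelled, n = label(img)  # n = number of components; labelled assigns 1..n per pixel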
Define convex hull
The convex hull of a set A is the smallest convex set that fully contains A.
In practice, the same hit-or-miss-style operation is run to convergence with several different SEs (one per orientation), and the union of all these converged outputs is the convex hull
Define thinning
A morphological operation that reduces a binary object to a skeleton-like shape while preserving its connectivity. It is defined using HMT and an iterative process involving SEs (see lecture slides for SE structures)
Define thickening
A morphological operation that expands the foreground structure in a controlled way. It is the ‘morphological dual’ of thinning
Define skeleton
A thin, central representation of a set, preserving its topology and shape while reducing redundancy
Define morphological reconstruction
A powerful transformation that uses two images (marker and mask) and an SE to extract or restore objects
What is a marker image and what is a mask image?
A marker image defines starting points for morphological reconstruction. A mask image restricts growth (conditions the reconstruction)
Define geodesic dilation
A morphological operation that expands the marker image while limiting growth by using a mask (each step: dilate the marker, then intersect the result with the mask)
Define geodesic erosion
A morphological operation that shrinks the marker image while staying greater than or equal to the mask (each step: erode the marker, then take the pointwise maximum/union with the mask)
What is ‘Reconstruction by Dilation’
A type of morphological reconstruction that utilizes geodesic dilation iterated until stability
What is ‘Reconstruction by Erosion’
A type of morphological reconstruction that utilizes geodesic erosion iterated until stability
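A sketch of reconstruction by dilation for binary images (the erosion version is symmetric; NumPy + SciPy assumed, with marker and mask as boolean arrays, marker contained in mask):

import numpy as np
from scipy.ndimage import binary_dilation

def reconstruct_by_dilation(marker, mask, se=None):
    prev = np.zeros_like(marker)
    while not np.array_equal(marker, prev):  # iterate until stability
        prev = marker
        marker = binary_dilation(marker, structure=se) & mask  # one geodesic dilation step
    return marker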
Describe how dilation would work in a grayscale image rather than a binary image
The origin pixel (centre of SE) is replaced with the maximum value in the SE neighbourhood. This expands bright regions and enhances peaks
Describe how erosion would work in a grayscale image rather than a binary image
The origin pixel (centre of SE) is replaced with the minimum value in the SE neighbourhood. This shrinks bright regions and enhances valleys
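Grayscale dilation and erosion are therefore just max and min filters over the SE neighbourhood; a sketch using SciPy:

import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

img = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(np.uint8)
dil = grey_dilation(img, size=(3, 3))  # each pixel -> max of its 3x3 neighbourhood
ero = grey_erosion(img, size=(3, 3))   # each pixel -> min of its 3x3 neighbourhood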
Define Large Language Models (LLMs)
Neural networks trained on vast amounts of text data that can predict/generate human-like text, often using the transformer architecture
What are the three different variations of transformer architecture
- Encoder-Decoder Structure
- Encoder-Only Structure
- Decoder-Only Structure
Describe an encoder-decoder transformer architecture
The encoder processes the input (e.g. text tokens) and produces hidden representations; the decoder generates the output (e.g. translated text or the next token) by attending to the encoder outputs and previously generated tokens. It’s best applied to text summarization or question answering applications
Describe an encoder-only transformer architecture
This architecture focuses on understanding or embedding text (e.g. BERT) and is not typically used for text generation. It’s best applied to sentiment analysis and text classification applications
Describe a decoder-only transformer architecture
This architecture generates text using past context in a stack of decoder blocks, without an encoder (ChatGPT does this). It operates in an autoregressive manner (generates one token at a time while using previous outputs as context). It’s best applied to story/article generation, chatbots, and code generation applications
Define scaled dot-product attention
A core component of Transformer models that calculates ‘attention’ weights by taking the dot product of query and key vectors, scaling the result, and applying a softmax function to obtain normalized weights, which are then used to weight the value vectors. In symbols: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
What are the three inputs that scaled dot-product attention takes
Queries (Q), Keys (K), and Values (V)
Describe the Key (K) input to scaled dot-product attention
An input that represents the “memory” or reference that other tokens compare themselves to. Basically, keys define what each token has to offer
Describe the Query (Q) input to scaled dot-product attention
An input that indicates what the current token is looking for in the other tokens. We compare this query against all the keys to figure out how relevant each key is to our query
Describe the Value (V) input to scaled dot-product attention
Holds the actual content that can be retrieved if the token is deemed relevant (based on the query-key comparison). Once we determine how much attention (weight) to assign to each token (via the query-key matching), we use the corresponding values to form the output representation
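Putting Q, K, and V together, a minimal NumPy sketch of scaled dot-product attention (single head; rows of Q, K, V are tokens):

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over the keys
    return w @ V  # attention-weighted sum of the values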
Define multi-head attention
A key mechanism of transformers that allows the model to look at different parts of the sequence. It splits the hidden representations into multiple “heads” to attend to different positions or features in parallel. This allows the model to capture more nuanced relationships within text
Define vision transformers (ViT)
The same architecture as an LLM transformer, but image patches are treated as tokens instead of words: the image is divided into fixed-size patches, each acting as a ‘token’
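A sketch of the patching step in NumPy (the 224x224 RGB image and 16x16 patch size are illustrative choices):

import numpy as np

img = np.zeros((224, 224, 3))  # H x W x C
p = 16                         # patch size -> (224/16)^2 = 196 tokens
patches = img.reshape(224 // p, p, 224 // p, p, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)
# patches: (196, 768); each row is one flattened patch, which a linear layer then embeds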
Define Contrastive Language-Image Pre-training (CLIP)
A model developed by OpenAI that learns to connect images and text by training on a massive dataset of image-text pairs. It creates a shared latent space where visual and textual concepts are aligned
What are the architecture components of CLIP
Image encoder (converts input image into feature vector) and text encoder (converts text description into a corresponding feature vector)
What are the two components that allow the CLIP training mechanism to function
Contrastive loss and joint representation space
Describe contrastive loss
A component in the CLIP training mechanism where the training objective is to bring the representations of matching image-text pairs closer while pushing apart non-matching pairs. This is achieved using a loss function that considers all pairs in a batch
Describe joint representation space
A component in the CLIP training mechanism where both encoders are optimized so that semantically related images and texts have similar representations
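A NumPy sketch of how the symmetric contrastive objective uses this joint space (the temperature value is illustrative; matching image-text pairs lie on the diagonal of the similarity matrix):

import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # L2-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # i-th image matches i-th text

    def xent(l):  # row-wise cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image directions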
Define diffusion models
A class of generative models that create images by iteratively denoising random noise
Fill in the Blank: The diffusion process has two components, _________ and ________
Forward, reverse
Describe the forward process in diffusion
The forward process is the actual ‘diffusion’ portion. Gaussian noise is added gradually over many time steps such that the image slowly degrades until it becomes nearly pure noise. The purpose of this process is to define a known probabilistic path from data to noise, which the model learns to reverse
Describe the reverse (backward) process in diffusion
The reverse process is the ‘denoising’ portion. Starting from pure noise, the model performs a series of denoising steps by predicting and subtracting the noise component. This progressively refines the image and, once complete, the model reconstructs a high-quality image
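A NumPy sketch of the forward process in closed form (DDPM-style; the linear schedule values are illustrative). The reverse model is trained to predict the noise added here:

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # noise schedule
abar = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def add_noise(x0, t, rng=np.random.default_rng()):
    eps = rng.standard_normal(x0.shape)  # Gaussian noise
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps
    return xt, eps  # the reverse model learns to predict eps from (xt, t)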
What is the data space of traditional diffusion models
Traditional diffusion models operate directly on pixel space
Describe the architecture of traditional diffusion models
Consists of two components: Forward (diffusion) and Reverse (denoising) Process
True or False: Traditional diffusion models have low computational cost due to their data space
False, computational cost is high due to the high dimensionality of pixel space
True or False: Traditional diffusion models require many denoising steps for high-quality image reconstruction
True
What is the data space of latent diffusion models
The data space of latent diffusion models is a compressed, lower-dimensional latent space rather than directly on pixels
Describe the architecture of latent diffusion models
Consists of an encoder, decoder, and a diffusion process. The encoder is a pre-trained autoencoder or VAE that converts images into a latent representation. The decoder transforms the latent representation back into a high-resolution image. In between, a U-Net or similar network applies the diffusion process within the latent space (prior to decoding)
True or False: Latent diffusion is more efficient in terms of computation and memory than traditional diffusion
True
What is the data space of stable diffusion models
Like latent diffusion, stable diffusion operates in a compressed latent space; in addition, it incorporates a text encoder (e.g., from CLIP) to guide the image generation process based on textual prompts
Describe the architecture of stable diffusion models
Consists of an autoencoder, U-Net diffusion model, and cross-attention. The autoencoder encodes images into latent space and decodes them back, as in latent diffusion. The U-Net diffusion model applies the diffusion process in latent space, but enhanced with cross-attention layers to integrate text embeddings. Cross-attention merges the text conditioning with the image latent features to steer the generation towards the desired content
True or False: Latent diffusion models can produce high-quality, detailed images guided by natural language inputs
False, a stable diffusion model does that
When integrating vision and language models, what are the three fusion strategies for accomplishing this?
Early Fusion: Combining modalities early in the network (less common due to different data structures)
Late Fusion: Independent processing followed by alignment in a joint embedding space
Cross-Attention: Integrates features from both modalities during processing for deeper interaction
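A sketch of the cross-attention pattern in NumPy (the projection matrices Wq, Wk, Wv are hypothetical learned parameters):

import numpy as np

def cross_attention(image_latents, text_emb, Wq, Wk, Wv):
    Q = image_latents @ Wq  # queries come from the image side
    K = text_emb @ Wk       # keys come from the text side
    V = text_emb @ Wv       # values come from the text side
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return w @ V  # text-conditioned image features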