Post-Midterm Flashcards
Define discontinuity-based segmentation
A segmentation technique that identifies abrupt changes in intensity (where values jump significantly compared to their neighbours)
What is the goal of discontinuity-based segmentation
To separate objects from background by finding boundary pixels (edges/lines/points) that mark transitions between regions
In order to detect edges, an important mathematical foundation needed is _________________
The derivative
A positive derivative means the intensity is __________ as x increases. Vice versa for a negative derivative.
Increasing
(dark → bright when moving to the right)
How does a first-order derivative (gradient) behave in an edge detection algorithm
Gradients highlight where large changes occur, but also indicate which side is darker or lighter (via sign)
How does a second-order derivative (Laplacian) behave in an edge detection algorithm
Laplacians reinforce fine details and help locate edges precisely (the zero crossing between the positive and negative responses marks the edge centre). The sign of the second derivative often reveals whether an edge is going from dark-to-light or light-to-dark
What are the three different types of discontinuity
Points, lines, & edges
Describe a point discontinuity
Single pixel that differs sharply from neighbours (detected by the Laplacian)
Describe a line discontinuity
1-2 pixel wide structures differing in intensity from surroundings (detected by directional masks or second derivatives)
Describe an edge discontinuity
Transition zones (ideal: step edges; real: blurred or ramp edges)
What are the advantages of discontinuity segmentation
Can directly locate boundaries. It’s good for images where objects exhibit strong contrast against the background
What are the challenges of discontinuity segmentation
It can be sensitive to noise (derivatives amplify noise). Images often require smoothing (pre-filtering) and careful threshold selection to avoid false edges or fragmented edges
Describe what a ‘point’ is in the context of point detection
An isolated pixel whose intensity differs significantly from its immediate neighbours. It typically appears as a bright or dark ‘spot’ in a relatively uniform background
What are the steps involved to implement a point detection algorithm
Step 1: Apply second-order derivative (Laplacian) filter
Step 2: Take the absolute value of the response
Step 3: Threshold the absolute response
Step 4: Label isolated points
Describe how to accomplish the first step, ‘Apply Laplacian Filter’, when implementing a point detection algorithm
Convolve the kernel below with an image to obtain a filter response
Second-Order 3x3 Laplacian kernel:
0 1 0
1 -4 1
0 1 0
Describe how to accomplish the second step, ‘Take the absolute value of the response’, when implementing a point detection algorithm following the application of the Laplacian filter
After the first step, the Laplacian response can be positive or negative. By taking the absolute value of the response, we get a magnitude that indicates how large the change is, regardless of sign
Describe how to accomplish the third step, ‘Threshold the absolute response’, when implementing a point detection algorithm after taking the absolute value of the Laplacian response
Choose a threshold T, often a fixed percentage of the maximum magnitude, so that only prominent ‘spikes’ exceed it and get labelled
Describe how to accomplish the fourth and final step, ‘Label isolated points’, when implementing a point detection algorithm after thresholding the absolute response
If the magnitude of Z(x,y) is greater than the threshold, declare (x,y) an isolated point. Store this point as a 1 (or ‘white’) in an output binary image, while others are stored as 0 (or ‘black’)
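A minimal Python sketch of the full four-step procedure (NumPy and SciPy assumed available; the threshold fraction t is an illustrative choice, not a fixed rule):

import numpy as np
from scipy.ndimage import convolve

def detect_points(image, t=0.9):
    laplacian = np.array([[0, 1, 0],
                          [1, -4, 1],
                          [0, 1, 0]])
    response = convolve(image.astype(float), laplacian)  # Step 1: Laplacian filter
    magnitude = np.abs(response)                         # Step 2: absolute value
    threshold = t * magnitude.max()                      # Step 3: threshold (t is tunable)
    return (magnitude > threshold).astype(np.uint8)      # Step 4: binary map of isolated points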
Describe a ‘line’ in the context of line detection
A set of connected pixels with similar intensity, often just 1-2 pixels in thickness, differing in intensity from its background
Briefly describe the process to apply line detection to an image
Convolve the image with a second-order derivative filter or with specialized directional filters, as shown below. After convolution, threshold the filter response to isolate line pixels.
Vertical kernel:
-1 2 -1
-1 2 -1
-1 2 -1
Horizontal kernel:
-1 -1 -1
2 2 2
-1 -1 -1
45-degree kernel:
2 -1 -1
-1 2 -1
-1 -1 2
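A short sketch of this process in Python (NumPy and SciPy assumed; the threshold fraction is illustrative):

import numpy as np
from scipy.ndimage import convolve

kernels = {
    'horizontal': np.array([[-1, -1, -1], [ 2,  2,  2], [-1, -1, -1]]),
    'vertical':   np.array([[-1,  2, -1], [-1,  2, -1], [-1,  2, -1]]),
    'diag45':     np.array([[ 2, -1, -1], [-1,  2, -1], [-1, -1,  2]]),
}

def detect_lines(image, direction, t=0.8):
    response = np.abs(convolve(image.astype(float), kernels[direction]))
    return response > t * response.max()  # binary mask of line pixels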
Define ‘edge’ in the context of edge detection
A boundary between two distinct regions of intensity or texture
What are the different types of edges?
Step edge - Sudden transition in intensity (ideal)
Ramp edge - Gradual transition (common)
Roof edge - Transition to one intensity from another, then quickly back to the original (typical in thin lines or object ridges)
True or False: Image noise/blur do not cause step edges to turn into ramp edges
False
What is ‘clustering’ in clustering segmentation
The clustering approach in segmentation involves grouping pixels based on intensity, colour, or feature similarity without requiring labelled data
What is ‘K-Means’ clustering?
An approach to clustering segmentation that involves the following steps:
- Choose k (# of clusters)
- Initialize cluster centres (random or heuristics)
- Assign each pixel to nearest cluster centre (Euclidean distance in intensity space)
- Update cluster centres as mean of assigned pixels
- Iterate until convergence
The output of this process will be an image with each pixel labelled with a cluster index (segmented image with k segments)
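A bare-bones K-Means sketch on grayscale intensities (NumPy only; in practice a library such as scikit-learn would be used, and empty clusters are not handled here):

import numpy as np

def kmeans_segment(image, k=3, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 1).astype(float)
    centres = rng.choice(pixels.ravel(), size=k, replace=False)[:, None]  # random init
    for _ in range(iters):  # iterate until (approximate) convergence
        labels = np.argmin(np.abs(pixels - centres.T), axis=1)  # assign to nearest centre
        centres = np.array([pixels[labels == c].mean(axis=0) for c in range(k)])  # update
    return labels.reshape(image.shape)  # each pixel labelled with a cluster index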
What is ‘Fuzzy C-Means’ clustering?
An approach to clustering segmentation where each pixel has partial memberships to clusters (helps with ambiguous boundaries/smooth transitions) and is most useful when cluster boundaries are not sharp. Cluster centres and membership weights are updated iteratively, much like K-Means but with soft assignments, making it more computationally intensive than K-Means clustering.
What is watershed segmentation
A gradient-based segmentation method that interprets the gradient magnitude as a topographic surface (low intensities = valleys; high intensities = ridges). Flooding the surface from its regional minima forms catchment basins (the segments), and the lines where floods would merge become the region boundaries
What are active contours (or snakes)
A gradient-based segmentation method that ‘evolves’ a curve such that it locks onto region boundaries. The curve evolves under internal smoothness constraints and external ‘image forces’ derived from the gradient. This means that the curve evolves to minimize a total energy combining internal energy (smoothness) and external energy (which pulls the curve toward edges)
What are some common metrics used to evaluate the effectiveness of a segmentation algorithm?
Pixel Accuracy:
(# of correct pixels) / (# of total pixels)
IoU (Intersection over Union):
|A ∩ B| / |A ∪ B|
Dice Coefficient:
2|A ∩ B| / (|A| + |B|)
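These metrics are straightforward to compute for binary masks; a sketch in NumPy:

import numpy as np

def pixel_accuracy(pred, gt):
    return (pred == gt).mean()

def iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return inter / np.logical_or(pred, gt).sum()

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    return 2 * inter / (pred.sum() + gt.sum())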
What are common challenges involved with image segmentation?
- Noise sensitivity (derivatives amplify noise)
- Parameter tuning (thresholds, cluster sizes, kernel sizes)
- Complex scenes (overlapping objects, illumination changes, low contrast)
Define structuring element (SE)
A small, predefined shape (or set of pixels) used in morphological image processing. It acts like a ‘probe’ that scans over an image to analyze or modify its shapes
When processing an image using mathematical morphology, a _____________ is moved across the image to modify object shapes based on specific rules
Structuring element (SE)
What are the most common morphological operations
- Erosion (removes pixels from objects)
- Dilation (adds pixels to objects)
- Opening & Closing (combinations of erosion and dilation)
Define reflection and translation in the context of image morphology
Reflection: Flipping an SE by 180 degrees
Translation: Moving an SE across an image
Define the erosion morphological operation
This operation shrinks foreground objects in an image. This is done by translating an SE over all possible positions in the image and marking the origin of the SE as a foreground pixel (1) wherever the SE fits entirely within the foreground. All other pixels are marked as background pixels (0)
What are the effects of the erosion morphological operation?
- Thins/Shrinks objects
- Removes small noise
- Separates connected components in an image
- Shapes objects based on structuring element (e.g. elongated SE can reduce objects to a line)
Define the dilation morphological operation
This operation expands/thickens foreground objects in an image. This is done by translating the (reflected) SE over all possible positions in the image and marking the origin of the SE as a foreground pixel (1) wherever the SE overlaps at least one foreground pixel. All other pixels are marked as background pixels (0)
What are the effects of the dilation morphological operation?
- Grows objects
- Fills small gaps/holes
- Shapes expand based on SE size and shape
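Both operations are available in SciPy; a toy sketch showing their effects:

import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

binary_img = np.zeros((7, 7), dtype=bool)
binary_img[2:5, 2:5] = True                          # a 3x3 foreground square
se = np.ones((3, 3), dtype=bool)                     # 3x3 square structuring element
eroded = binary_erosion(binary_img, structure=se)    # shrinks the square to a single pixel
dilated = binary_dilation(binary_img, structure=se)  # grows the square to 5x5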
Describe the concept of ‘morphological duality’ in the context of erosion and dilation
Duality describes the relationship between dilation and erosion through complementation (one operation can be derived from the other by working with the image’s background instead of the foreground)
What is the erosion-dilation duality
The complement of the erosion of A by B is equal to the dilation of the complement of A using the reflected structuring element, Br. In symbols: (A ⊖ B)ᶜ = Aᶜ ⊕ Br
What is the dilation-erosion duality
The complement of the dilation of A by B is equal to the erosion of the complement of A using the reflected structuring element, Br. In symbols: (A ⊕ B)ᶜ = Aᶜ ⊖ Br
Define the opening morphological operation
This operation removes small objects or noise while preserving the general shape of larger objects. This is done by first eroding A by B (shrinking it), then dilating it back using B, partially restoring the main structure. In symbols: A ∘ B = (A ⊖ B) ⊕ B
Define the closing morphological operation
This operation fills small gaps or holes, smooths object contours, and fuses narrow breaks. This is done by first dilating A by B (expanding it), then eroding it back using B, smoothing object boundaries and filling in small gaps. In symbols: A • B = (A ⊕ B) ⊖ B
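The compositions can be written directly; SciPy's binary_opening/binary_closing implement the same operations:

from scipy.ndimage import binary_erosion, binary_dilation

def opening(a, se):
    return binary_dilation(binary_erosion(a, structure=se), structure=se)  # erode, then dilate

def closing(a, se):
    return binary_erosion(binary_dilation(a, structure=se), structure=se)  # dilate, then erode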
Define ‘Hit-Or-Miss’ (HMT) Transform
A morphological tool used for shape detection in binary images. It relies on two SEs rather than one (one for the foreground, the other for the background)
Why would one use a hit-or-miss transform
HMT allows detection of small features (e.g. corners, endpoints) precisely by combining erosion with a specially designed pair of SEs
Define boundary extraction
A basic morphological algorithm that isolates the edges of a foreground object using erosion and set difference
What is the process involved in a boundary extraction algorithm?
Erode the object within the image using a structuring element. Then, subtract the eroded image from the original image to leave only the boundary pixels. In symbols: β(A) = A − (A ⊖ B)
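A one-line sketch of this process (NumPy + SciPy, boolean images assumed):

import numpy as np
from scipy.ndimage import binary_erosion

def boundary(a, se):
    return a & ~binary_erosion(a, structure=se)  # beta(A) = A minus (A eroded by B)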
Define hole filling algorithm
A basic morphological algorithm that fills background regions enclosed by a connected foreground boundary using dilation, complementation, and intersection
What is the process involved in a hole filling algorithm?
Create an array of 0’s the same size as the image and set 1’s at known hole locations. Then, apply dilation to the newly created array using a symmetric SE. Intersect the result with the complement of the image to limit expansion inside the hole, and repeat this process until no further changes occur. The union of this array and the original image fills all holes while preserving object boundaries
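SciPy's binary_fill_holes is a practical equivalent of this procedure (it handles the seeding automatically); a toy example:

import numpy as np
from scipy.ndimage import binary_fill_holes

ring = np.zeros((7, 7), dtype=bool)
ring[1:6, 1:6] = True
ring[2:5, 2:5] = False            # a square outline enclosing a hole
filled = binary_fill_holes(ring)  # the hole becomes foreground; the boundary is preserved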
Define connected component extraction
A basic morphological algorithm that identifies and isolates groups of connected foreground pixels in a binary image
What is the process involved in a connected component extraction algorithm?
Create an array of 0’s the same size as the image and set 1’s at known points within each connected component. Apply dilation to this array using a symmetric SE. Intersect this array with the original image to restrict growth, and repeat this process until no further changes occur. The newly created array contains all connected components of the original image
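In practice, scipy.ndimage.label extracts all components in a single call (a convenient alternative to the iterative procedure above):

import numpy as np
from scipy.ndimage import label

img = np.array([[1, 1, 0, 0],
                [1, 0, 0, 1],
                [0, 0, 1, 1]], dtype=bool)
labelled, n = label(img)  # n = number of components; labelled assigns 1..n per pixel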
Define convex hull
The convex hull of a set A is the smallest convex set that fully contains A.
In practice, the same hit-or-miss-style operation is run to convergence with several different SEs (one per orientation), and the union of all these converged outputs is the convex hull
Define thinning
A morphological operation that reduces a binary object to a skeleton-like shape while preserving its connectivity. It is defined using HMT and an iterative process involving SEs (see lecture slides for SE structures)
Define thickening
A morphological operation that expands the foreground structure in a controlled way. It is the ‘morphological dual’ of thinning
Define skeleton
A thin, central representation of a set, preserving its topology and shape while reducing redundancy
Define morphological reconstruction
A powerful transformation that uses two images (marker and mask) and an SE to extract or restore objects
What is a marker image and what is a mask image?
A marker image defines starting points for morphological reconstruction. A mask image restricts growth (conditions the reconstruction)
Define geodesic dilation
A morphological operation that expands the marker image while limiting growth by using a mask (each step: dilate the marker, then intersect the result with the mask)
Define geodesic erosion
A morphological operation that shrinks the marker image while staying greater than or equal to the mask (each step: erode the marker, then take the pointwise maximum/union with the mask)
What is ‘Reconstruction by Dilation’
A type of morphological reconstruction that utilizes geodesic dilation iterated until stability
What is ‘Reconstruction by Erosion’
A type of morphological reconstruction that utilizes geodesic erosion iterated until stability
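A sketch of reconstruction by dilation for binary images (the erosion version is symmetric; NumPy + SciPy assumed, with marker and mask as boolean arrays, marker contained in mask):

import numpy as np
from scipy.ndimage import binary_dilation

def reconstruct_by_dilation(marker, mask, se=None):
    prev = np.zeros_like(marker)
    while not np.array_equal(marker, prev):  # iterate until stability
        prev = marker
        marker = binary_dilation(marker, structure=se) & mask  # one geodesic dilation step
    return marker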
Describe how dilation would work in a grayscale image rather than a binary image
The origin pixel (centre of SE) is replaced with the maximum value in the SE neighbourhood. This expands bright regions and enhances peaks
Describe how erosion would work in a grayscale image rather than a binary image
The origin pixel (centre of SE) is replaced with the minimum value in the SE neighbourhood. This shrinks bright regions and enhances valleys
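Grayscale dilation and erosion are therefore just max and min filters over the SE neighbourhood; a sketch using SciPy:

import numpy as np
from scipy.ndimage import grey_dilation, grey_erosion

img = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(np.uint8)
dil = grey_dilation(img, size=(3, 3))  # each pixel -> max of its 3x3 neighbourhood
ero = grey_erosion(img, size=(3, 3))   # each pixel -> min of its 3x3 neighbourhood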
Define Large Language Models (LLMs)
Neural networks trained on vast amounts of text data that can predict/generate human-like text, often using the transformer architecture
What are the three different variations of transformer architecture
- Encoder-Decoder Structure
- Encoder-Only Structure
- Decoder-Only Structure
Describe an encoder-decoder transformer architecture
The encoder processes the input (e.g. text tokens) and produces hidden representations; the decoder generates the output (e.g. translated text or the next token) by attending to the encoder outputs and previously generated tokens. It’s best applied to text summarization or question answering applications
Describe an encoder-only transformer architecture
This architecture focuses on understanding or embedding text (e.g. BERT) and is not typically used for text generation. It’s best applied to sentiment analysis and text classification applications
Describe a decoder-only transformer architecture
This architecture generates text using past context in a stack of decoder blocks, without an encoder (ChatGPT does this). It operates in an autoregressive manner (generates one token at a time while using previous outputs as context). It’s best applied to story/article generation, chatbots, and code generation applications
Define scaled dot-product attention
A core component of Transformer models that calculates ‘attention’ weights by taking the dot product of query and key vectors, scaling the result, and applying a softmax function to obtain normalized weights, which are then used to weight the value vectors. In symbols: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
What are the three inputs that scaled dot-product attention takes
Queries (Q), Keys (K), and Values (V)
Describe the Key (K) input to scaled dot-product attention
An input that represents the “memory” or reference that other tokens compare themselves to. Basically, keys define what each token has to offer
Describe the Query (Q) input to scaled dot-product attention
An input that indicates what the current token is looking for in the other tokens. We compare this query against all the keys to figure out how relevant each key is to our query
Describe the Value (V) input to scaled dot-product attention
Holds the actual content that can be retrieved if the token is deemed relevant (based on the query-key comparison). Once we determine how much attention (weight) to assign to each token (via the query-key matching), we use the corresponding values to form the output representation
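Putting Q, K, and V together, a minimal NumPy sketch of scaled dot-product attention (single head; rows of Q, K, V are tokens):

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # query-key similarity, scaled by sqrt(d_k)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over the keys
    return w @ V  # attention-weighted sum of the values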
Define multi-head attention
A key mechanism of transformers that allows the model to look at different parts of the sequence. It splits the hidden representations into multiple “heads” to attend to different positions or features in parallel. This allows the model to capture more nuanced relationships within text
Define vision transformers (ViT)
The same architecture as an LLM transformer, but image patches are treated as tokens instead of words: the image is divided into fixed-size patches, each acting as a ‘token’
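A sketch of the patching step in NumPy (the 224x224 RGB image and 16x16 patch size are illustrative choices):

import numpy as np

img = np.zeros((224, 224, 3))  # H x W x C
p = 16                         # patch size -> (224/16)^2 = 196 tokens
patches = img.reshape(224 // p, p, 224 // p, p, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * 3)
# patches: (196, 768); each row is one flattened patch, which a linear layer then embeds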
Define Contrastive Language-Image Pre-training (CLIP)
A model developed by OpenAI that learns to connect images and text by training on a massive dataset of image-text pairs. It creates a shared latent space where visual and textual concepts are aligned
What are the architecture components of CLIP
Image encoder (converts input image into feature vector) and text encoder (converts text description into a corresponding feature vector)
What are the two components that allow the CLIP training mechanism to function
Contrastive loss and joint representation space
Describe contrastive loss
A component in the CLIP training mechanism where the training objective is to bring the representations of matching image-text pairs closer while pushing apart non-matching pairs. This is achieved using a loss function that considers all pairs in a batch
Describe joint representation space
A component in the CLIP training mechanism where both encoders are optimized so that semantically related images and texts have similar representations
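A NumPy sketch of how the symmetric contrastive objective uses this joint space (the temperature value is illustrative; matching image-text pairs lie on the diagonal of the similarity matrix):

import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)  # L2-normalize
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # i-th image matches i-th text

    def xent(l):  # row-wise cross-entropy against the diagonal
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return (xent(logits) + xent(logits.T)) / 2  # image->text and text->image directions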
Define diffusion models
A class of generative models that create images by iteratively denoising random noise
Fill in the Blank: The diffusion process has two components, _________ and ________
Forward, reverse
Describe the forward process in diffusion
The forward process is the actual ‘diffusion’ portion. Gaussian noise is added gradually over many time steps such that the image slowly degrades until it becomes nearly pure noise. The purpose of this process is to define a known probabilistic path from data to noise, which the model learns to reverse
Describe the reverse (backward) process in diffusion
The reverse process is the ‘denoising’ portion. Starting from pure noise, the model performs a series of denoising steps by predicting and subtracting the noise component. This progressively refines the image and, once complete, the model reconstructs a high-quality image
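A NumPy sketch of the forward process in closed form (DDPM-style; the linear schedule values are illustrative). The reverse model is trained to predict the noise added here:

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # noise schedule
abar = np.cumprod(1.0 - betas)      # cumulative product of (1 - beta_t)

def add_noise(x0, t, rng=np.random.default_rng()):
    eps = rng.standard_normal(x0.shape)  # Gaussian noise
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps
    return xt, eps  # the reverse model learns to predict eps from (xt, t)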
What is the data space of traditional diffusion models
Traditional diffusion models operate directly on pixel space
Describe the architecture of traditional diffusion models
Consists of two components: Forward (diffusion) and Reverse (denoising) Process
True or False: Traditional diffusion models have low computational cost due to their data space
False, computational cost is high due to the high dimensionality of pixel space
True or False: Traditional diffusion models require many denoising steps for high-quality image reconstruction
True
What is the data space of latent diffusion models
The data space of latent diffusion models is a compressed, lower-dimensional latent space rather than directly on pixels
Describe the architecture of latent diffusion models
Consists of an encoder, decoder, and a diffusion process. The encoder is a pre-trained autoencoder or VAE that converts images into a latent representation. The decoder transforms the latent representation back into a high-resolution image. In between, a U-Net or similar network applies the diffusion process within the latent space (prior to decoding)
True or False: Latent diffusion is more efficient in terms of computation and memory than traditional diffusion
True
What is the data space of stable diffusion models
Like latent diffusion, stable diffusion operates in a compressed latent space; in addition, it incorporates a text encoder (e.g., from CLIP) to guide the image generation process based on textual prompts
Describe the architecture of stable diffusion models
Consists of an autoencoder, U-Net diffusion model, and cross-attention. The autoencoder encodes images into latent space and decodes them back, as in latent diffusion. The U-Net diffusion model applies the diffusion process in latent space, but enhanced with cross-attention layers to integrate text embeddings. Cross-attention merges the text conditioning with the image latent features to steer the generation towards the desired content
True or False: Latent diffusion models can produce high-quality, detailed images guided by natural language inputs
False, a stable diffusion model does that
When integrating vision and language models, what are the three fusion strategies for accomplishing this?
Early Fusion: Combining modalities early in the network (less common due to different data structures)
Late Fusion: Independent processing followed by alignment in a joint embedding space
Cross-Attention: Integrates features from both modalities during processing for deeper interaction
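A sketch of the cross-attention pattern in NumPy (the projection matrices Wq, Wk, Wv are hypothetical learned parameters):

import numpy as np

def cross_attention(image_latents, text_emb, Wq, Wk, Wv):
    Q = image_latents @ Wq  # queries come from the image side
    K = text_emb @ Wk       # keys come from the text side
    V = text_emb @ Wv       # values come from the text side
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)  # softmax over text tokens
    return w @ V  # text-conditioned image features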