Clustering Flashcards
What is unsupervised learning?
Learning patterns in data without labeled outputs or a “teacher”.
What is the goal of clustering?
To partition data into groups or clusters based on similarity.
What does the K-means algorithm minimize?
The within-cluster point scatter/variance.
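As a sketch, the within-cluster scatter that K-means minimizes can be written (with μ_k denoting the centroid of cluster C_k) as:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```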
What are the two main steps of the K-means algorithm?
1) Assign each point to its nearest cluster center, 2) Update each cluster center to the mean of its assigned points. Repeat until convergence.
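The two steps above can be sketched as follows; this is a minimal illustrative implementation on toy data, not a production one (the `kmeans` helper and the synthetic blobs are assumptions for the example):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch: alternate the two steps until centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest cluster center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Step 2: update each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # convergence: stable centers
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```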
What is the K-means++ algorithm used for?
To initialize the cluster centers for K-means in a way that improves convergence, by spreading out the initial cluster centers.
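A sketch of the K-means++ seeding idea: each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far, which spreads the initial centers apart (the `kmeans_pp_init` helper and the toy data are assumptions for the example):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding sketch: sample new centers far from existing ones."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniformly random point
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing center
        d2 = ((X[:, None, :] - np.array(centers)) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()  # far-away points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 4.0, 8.0)])
init = kmeans_pp_init(X, k=3)
```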
How can the number of clusters K be determined in non-probabilistic models?
By computing the sum of squared errors (SSE) for different values of K and applying the Elbow Method: pick the K at which the SSE curve's slope changes sharply (the "elbow").
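The Elbow Method can be sketched by running K-means for several K and recording the SSE; on data with three true clusters, the curve drops sharply up to K=3 and then flattens (the `kmeans_sse` helper and the synthetic data are assumptions for the example):

```python
import numpy as np

def kmeans_sse(X, k, iters=50, seed=0):
    """Run a basic K-means sketch and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # keep a center in place if its cluster becomes empty
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.4, size=(40, 2)) for m in (0.0, 5.0, 10.0)])
sse = [kmeans_sse(X, k) for k in range(1, 7)]
# SSE shrinks as K grows; the "elbow" near the true K (here 3) suggests the choice
```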
What is a mixture model?
A probabilistic model that represents the presence of subpopulations within an overall population.
What are the two steps in generating samples from a Gaussian mixture model?
1) Draw a categorical variable Z to select a component, 2) Draw an observation from the selected Gaussian component.
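The two-step sampling procedure above can be sketched for a 1-D mixture; the particular weights, means, and standard deviations are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # mixing coefficients (assumed toy values)
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

# Step 1: draw the categorical latent variable Z to select a component
z = rng.choice(2, size=10000, p=weights)
# Step 2: draw each observation from its selected Gaussian component
x = rng.normal(means[z], stds[z])
```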
What is the EM algorithm used for in mixture models?
To estimate the parameters of the mixture model by maximizing the likelihood.
What are the two main steps of the EM algorithm?
The Expectation (E) step and the Maximization (M) step, repeated until convergence (stable assignments and parameters).
In the context of Gaussian Mixture Models, what does the E-step compute?
The E-step computes the expected values of the latent variables: the posterior probabilities (responsibilities) that each data point belongs to each Gaussian component.
In the context of Gaussian Mixture Models, what does the M-step compute?
The M-step updates the model parameters (means, covariances, and mixing coefficients) to maximize the expected log-likelihood from the E-step, re-estimating the parameters to better fit the data given the current responsibilities.
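The E and M steps can be sketched for a 1-D two-component GMM; this is a toy illustration (the `pdf` helper, the initialization, and the synthetic data are assumptions), not a robust implementation:

```python
import numpy as np

def pdf(x, mu, s):
    """Gaussian density (assumed helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# crude initial parameters (assumed)
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); s = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities r[i, k] proportional to w_k * N(x_i | mu_k, s_k)
    r = w * pdf(x[:, None], mu, s)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    nk = r.sum(axis=0)
    w = nk / len(x)                                  # mixing coefficients
    mu = (r * x[:, None]).sum(axis=0) / nk           # means
    s = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)  # std devs
```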
How does K-means compare to Gaussian Mixture Models?
K-means is usually faster, requiring fewer iterations and less computation per iteration. K-means assumes spherical clusters with equal variance, while a GMM allows clusters of different shapes and sizes. A GMM also produces soft assignments (a probability of belonging to each cluster), while K-means produces hard assignments.
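The soft- vs hard-assignment contrast can be sketched with a point midway between two equal components: the GMM responsibilities split 50/50, while K-means must commit to one label (the `pdf` helper and the toy parameters are assumptions for the example):

```python
import numpy as np

def pdf(x, mu, s):
    """Gaussian density (assumed helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

mu = np.array([0.0, 4.0])   # two symmetric components (assumed toy setup)
s = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

x = 2.0                      # a point exactly between the two means
resp = w * pdf(x, mu, s)
resp /= resp.sum()           # GMM soft assignment: [0.5, 0.5]
hard = int(np.argmin((x - mu) ** 2))  # K-means hard assignment: a single label
```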
What is an advantage of mixture models over K-means?
Mixture models allow explicit distributional assumptions, and the fit to the data can be assessed by computing the likelihood.
What is an issue with mixture models in terms of identifiability?
The likelihood is invariant to permutations of the component labels (label switching), so the parameter estimates are identifiable only up to a permutation.