Gaussian Mixture Models Flashcards

1
Q
What is the K-means objective function and what constraints does it have?
A

The K-means objective minimizes the total within-cluster variance, formulated as: minimize R(µ, Z; X) = Σₙ=1^N Σₖ=1^K znk ||xn − µk||², subject to znk ∈ {0,1} for all n and k, and Σₖ=1^K znk = 1 for every data point xn.
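
As a minimal numpy sketch of this objective (the function name kmeans_objective and the array shapes are illustrative choices, not part of the original material):

```python
import numpy as np

def kmeans_objective(X, mu, Z):
    """R(mu, Z; X) = sum_n sum_k z_nk * ||x_n - mu_k||^2.

    X  : (N, D) data matrix
    mu : (K, D) centroid matrix
    Z  : (N, K) hard assignment matrix with exactly one 1 per row
    """
    # Squared Euclidean distance from every point to every centroid, shape (N, K)
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
    # Only the assigned centroid contributes for each data point
    return float((Z * sq_dists).sum())
```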

2
Q
How is the hard clustering constraint in K-means relaxed to allow soft assignments?
A

Instead of requiring znk to be 0 or 1 with Σₖ znk = 1, we relax the constraint so that znk ∈ [0,1] and Σₖ znk = 1. Each data point can then be partially assigned to several clusters, giving a weighted (soft) clustering in which each centroid becomes a weighted mean of all data points.

3
Q
How do mixture models extend the idea of clustering beyond K-means?
A

Mixture models assume that the data is generated from a combination of K probability distributions. The overall density is modeled as p(x) = Σₖ=1^K πk p(x|θk), where each component p(x|θk) (often a Gaussian) has its own parameters θk and mixing coefficient πk, representing the probability of a data point coming from that component.

4
Q
What is the definition of a one-dimensional Gaussian distribution and its key statistics?
A

A one-dimensional Gaussian is defined as p(x|µ,σ) = (1/√(2πσ²)) exp(− (x−µ)²/(2σ²)). Its mean is µ and its variance is σ².
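
A direct sketch of this density in numpy (the function name is an illustrative choice):

```python
import numpy as np

def gaussian_1d(x, mu, sigma):
    """p(x | mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)."""
    return np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / np.sqrt(2.0 * np.pi * sigma ** 2)
```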

5
Q
How is a D-dimensional Gaussian distribution defined?
A

For a D-dimensional vector x, the Gaussian is defined as p(x|µ,Σ) = 1/((2π)^(D/2)|Σ|^(1/2)) exp(−½ (x−µ)ᵀ Σ⁻¹ (x−µ)), where µ is the mean vector and Σ is the covariance matrix with determinant |Σ|.
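
A sketch of the same formula in numpy (in practice a library routine such as scipy.stats.multivariate_normal.pdf is typically preferred for numerical robustness):

```python
import numpy as np

def gaussian_nd(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma) for a D-dimensional vector x."""
    D = mu.shape[0]
    diff = x - mu
    # Mahalanobis term (x - mu)^T Sigma^{-1} (x - mu), computed via a linear solve
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm_const
```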

6
Q
How is a mixture of K probability densities formulated in a Gaussian Mixture Model (GMM)?
A

A GMM models the data as p(x) = Σₖ=1^K πk N(x|µk,Σk), where each component is a Gaussian with parameters µk and Σk, and πk are the mixing coefficients satisfying Σₖ πk = 1 and πk ≥ 0.
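
Building on the gaussian_nd sketch from the previous card, the mixture density can be evaluated as (illustrative only):

```python
def gmm_density(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k), reusing gaussian_nd from the card above."""
    return sum(pi_k * gaussian_nd(x, mu_k, Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))
```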

7
Q
What are the roles of the mixing coefficients πk in a GMM?
A

The mixing coefficients πk represent the prior probability that a data point comes from the k-th component. They must satisfy πk ≥ 0 for all k and Σₖ=1^K πk = 1, thereby defining a categorical distribution over the clusters.

8
Q
Describe the generative process assumed by a GMM.
A

For each data point xn, a component index k is first sampled from the categorical distribution defined by πk. Then, xn is generated by sampling from the Gaussian distribution N(xn|µk, Σk) corresponding to that component.
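
A sketch of this two-stage (ancestral) sampling procedure in numpy (names and the seed are illustrative):

```python
import numpy as np

def sample_gmm(n_samples, pis, mus, Sigmas, seed=0):
    """Draw a component index from Categorical(pi), then draw x from that Gaussian."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    # Step 1: choose which component generates each point
    ks = rng.choice(K, size=n_samples, p=pis)
    # Step 2: sample each point from its chosen Gaussian component
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks
```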

9
Q
How is the full data likelihood for a GMM defined given a dataset X = {x1, …, xN}?
A

Assuming the data points are independent, the likelihood is p(X|π,µ,Σ) = ∏ₙ=1^N p(xn) = ∏ₙ=1^N Σₖ=1^K πk N(xn|µk, Σk).

10
Q
Why is it difficult to directly maximize the likelihood for a GMM using the log-likelihood formulation?
A

Taking the logarithm of the likelihood yields ln p(X|π,µ,Σ) = Σₙ=1^N ln (Σₖ=1^K πk N(xn|µk,Σk)), where the summation inside the logarithm prevents a closed-form analytic solution for the maximum likelihood estimates.
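
A sketch of this log-likelihood, using logsumexp to evaluate the sum inside the logarithm stably (scipy is assumed to be available; names are illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ], axis=1)
    # logsumexp over the component axis avoids underflow of the inner sum
    return float(logsumexp(log_probs, axis=1).sum())
```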

11
Q
What algorithm is employed to solve the maximum likelihood estimation problem for GMMs, and what is the key idea behind it?
A

The Expectation-Maximization (EM) algorithm is used. The key idea is to introduce latent (hidden) variables that indicate which component generated each data point, allowing us to alternate between estimating these assignments (E-step) and updating the model parameters (M-step).

12
Q
How does the introduction of a latent variable z help in the EM algorithm for GMMs?
A

By defining a latent variable z for each data point with a one-hot (1-of-K) representation (where exactly one element is 1), the complete-data likelihood becomes easier to optimize. If we knew z, we could easily estimate the parameters. The EM algorithm iteratively estimates the posterior probabilities (responsibilities) for z and then updates the parameters accordingly.

13
Q
What is the definition of the responsibility γ(znk) in the context of the EM algorithm for GMMs?
A

The responsibility γ(znk) is defined as the posterior probability that data point xn was generated by component k, i.e., γ(znk) = p(znk = 1 | xn) = [πk N(xn|µk, Σk)] / [Σⱼ=1^K πj N(xn|µj, Σj)].
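
A sketch of this computation, normalising in log space for stability (it uses the same per-component log-densities as the log-likelihood sketch above):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def responsibilities(X, pis, mus, Sigmas):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    log_probs = np.stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mean=mu_k, cov=Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ], axis=1)                                                   # shape (N, K)
    # Subtracting the row-wise log-normaliser implements Bayes' rule stably
    return np.exp(log_probs - logsumexp(log_probs, axis=1, keepdims=True))
```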

14
Q
What happens in the E-step of the EM algorithm for GMMs?
A

In the E-step, using the current estimates of the parameters (πk, µk, Σk), we compute the responsibilities γ(znk) for each data point xn and each component k via Bayes’ rule.

15
Q
How is the update for the mean µk computed in the M-step of the EM algorithm?
A

The new estimate for the mean µk is computed as a weighted average: µk = (Σₙ=1^N γ(znk) xn) / (Σₙ=1^N γ(znk)), where Nk = Σₙ γ(znk) is the effective number of points assigned to cluster k.

16
Q
How are the covariance matrices Σk updated in the M-step?
A

The covariance for component k is updated as: Σk = (Σₙ=1^N γ(znk) (xn − µk)(xn − µk)ᵀ) / Nk, reflecting the weighted scatter of the data points around the updated mean.

17
Q
How are the mixing coefficients πk updated in the M-step?
A

The mixing coefficient πk is updated by πk = Nk / N, where Nk = Σₙ=1^N γ(znk), ensuring that the mixing coefficients sum to 1.
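
Taken together with the two previous cards, the full M-step can be sketched as follows (gamma is the (N, K) responsibility matrix from the E-step; variable names are illustrative):

```python
import numpy as np

def m_step(X, gamma):
    """M-step: pi_k = N_k / N, mu_k = weighted mean, Sigma_k = weighted scatter."""
    N, D = X.shape
    Nk = gamma.sum(axis=0)                        # effective counts N_k, shape (K,)
    pis = Nk / N                                  # mixing coefficients
    mus = (gamma.T @ X) / Nk[:, None]             # weighted means, shape (K, D)
    Sigmas = np.empty((len(Nk), D, D))
    for k in range(len(Nk)):
        diff = X - mus[k]                         # deviations from the updated mean
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, Sigmas
```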

18
Q
Outline the overall steps of the EM algorithm for GMMs.
A

1) Initialize the parameters (πk, µk, Σk) randomly or using another method like K-means; 2) E-step: Compute responsibilities γ(znk) for each data point; 3) M-step: Update the parameters using the computed responsibilities; 4) Evaluate the log-likelihood; 5) Repeat steps 2–4 until convergence.
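
Combining the responsibilities, m_step, and gmm_log_likelihood sketches from the earlier cards gives a minimal EM loop (the random initialisation and tolerance below are illustrative choices, not prescribed by the original material):

```python
import numpy as np

def fit_gmm_em(X, K, n_iter=100, tol=1e-6, seed=0):
    """Minimal EM loop for a GMM, following steps 1-5 of this card."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # 1) Initialise: random data points as means, shared global covariance, uniform weights
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(D) for _ in range(K)])
    pis = np.full(K, 1.0 / K)
    log_liks = []
    for _ in range(n_iter):
        gamma = responsibilities(X, pis, mus, Sigmas)              # 2) E-step
        pis, mus, Sigmas = m_step(X, gamma)                        # 3) M-step
        log_liks.append(gmm_log_likelihood(X, pis, mus, Sigmas))   # 4) evaluate ln p(X)
        # 5) stop once the log-likelihood no longer improves noticeably
        if len(log_liks) > 1 and abs(log_liks[-1] - log_liks[-2]) < tol:
            break
    return pis, mus, Sigmas, log_liks
```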

19
Q
How does the EM algorithm for GMMs relate to K-means clustering?
A

K-means can be seen as a hard-assignment version of EM where responsibilities are either 0 or 1. K-means only updates cluster means (centroids) without estimating covariances or mixing coefficients, whereas EM computes soft assignments and estimates full Gaussian parameters.

20
Q
What is the trade-off in model order selection for mixture models?
A

The trade-off is between data fit and model complexity. A higher number of clusters K generally improves the likelihood (better fit) but increases the number of parameters (more complexity). The goal is to select a model that balances these aspects.

21
Q
How is the negative log-likelihood used in model selection for GMMs?
A

The negative log-likelihood (−ln p(X|π,µ,Σ)) quantifies the lack of fit. A lower value indicates a better fit to the data. However, increasing K typically decreases this value, so we must penalize complexity to avoid overfitting.

22
Q
What is the Akaike Information Criterion (AIC) and how is it defined for model selection in GMMs?
A

AIC is defined as AIC_K = −ln p(X|.) + cK, where cK is the number of parameters in the model. It penalizes models with more parameters to help balance data fit with complexity.

23
Q
How does the Bayesian Information Criterion (BIC) differ from AIC in penalizing model complexity?
A

BIC is defined as BIC_K = −ln p(X|.) + ½ cK ln N, where N is the number of data points. BIC penalizes complexity more strongly than AIC, especially for larger datasets, leading to more parsimonious models.

24
Q
How do you compute the number of parameters cK for a Gaussian Mixture Model with full covariance matrices?
A

For a GMM, cK = K·D + (K−1) + K·D·(D+1)/2, where K is the number of clusters, D is the dimensionality of the data (K·D for the means, K−1 for the mixing coefficients, and K·D(D+1)/2 for the covariance matrices).
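
Using the definitions on this card and the two before it, the parameter count and both criteria can be computed as follows (neg_log_lik is −ln p(X|·), e.g. the negative of the gmm_log_likelihood sketch above; this is a sketch, not a prescribed implementation):

```python
import numpy as np

def num_params_full_cov(K, D):
    """c_K = K*D (means) + (K - 1) (mixing coefficients) + K*D*(D+1)/2 (full covariances)."""
    return K * D + (K - 1) + K * D * (D + 1) // 2

def aic(neg_log_lik, K, D):
    """AIC_K = -ln p(X|.) + c_K, as defined on the AIC card."""
    return neg_log_lik + num_params_full_cov(K, D)

def bic(neg_log_lik, K, D, N):
    """BIC_K = -ln p(X|.) + 0.5 * c_K * ln N, as defined on the BIC card."""
    return neg_log_lik + 0.5 * num_params_full_cov(K, D) * np.log(N)
```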

25
Q
How does the number of parameters change if the covariance matrices are assumed to be known in advance?
A

If the covariance matrices are fixed and known, then the number of parameters reduces to cK = K·D + (K−1), since only the means and mixing coefficients need to be estimated.

26
Q
What do AIC and BIC help determine in the context of GMMs?
A

AIC and BIC provide criteria for model order selection by balancing the likelihood (data fit) with model complexity. The optimal number of clusters K is typically chosen as the one that minimizes AIC or BIC.

27
Q
How can K-means be used in relation to the EM algorithm for GMMs?
A

K-means can be used to provide a good initialization for the EM algorithm. The cluster centers from K-means can initialize the means µk, the sample covariances of the clusters can initialize Σk, and the fractions of data points in each cluster can initialize the mixing coefficients πk.
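
One way to implement this initialisation, sketched with scikit-learn's KMeans (assumed to be available; very small clusters would need extra care in practice):

```python
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    """Initialise (pi_k, mu_k, Sigma_k) for EM from a K-means clustering of X."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X)
    labels = km.labels_
    mus = km.cluster_centers_                                   # cluster centres -> means
    pis = np.array([np.mean(labels == k) for k in range(K)])    # cluster fractions -> weights
    Sigmas = np.array([np.cov(X[labels == k], rowvar=False)     # per-cluster sample covariances
                       for k in range(K)])
    return pis, mus, Sigmas
```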

28
Q
Describe a scenario where a GMM might encounter an issue if one component mean equals a data point, and the covariance approaches zero. (Exercise Part 1)
A

If for a given component j, µj equals a data point xn and Σj → 0, then the probability density p(xn|µj,Σj) becomes extremely high (tending to infinity), which can dominate the likelihood and destabilize parameter estimation.

29
Q
In the scenario described in Flashcard 28, how do you write down the log-likelihood for that data point? (Exercise Part 2)
A

The log-likelihood for data point xn is ln p(xn|π,µ,Σ) = ln (Σₖ πk N(xn|µk,Σk)). When µj = xn and Σj is very small, the term πj N(xn|µj,Σj) will be very large, affecting the overall log-likelihood.

30
Q
Continuing the exercise, what happens to p(xn|µj,Σj) in the limit as σj → 0, and what is its impact on the likelihood maximization? (Exercise Part 3)
A

As σj → 0, N(xn|µj,Σj) (with µj = xn) tends to infinity because the Gaussian becomes infinitely peaked. The log-likelihood therefore diverges, so maximizing it drives one component to collapse onto this single point, a degenerate form of overfitting in which that point is 'explained' perfectly while the general fit suffers.
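
This blow-up is easy to see numerically: the log-density of a point placed exactly at its own mean grows without bound as the standard deviation shrinks (the values below are purely illustrative):

```python
import numpy as np

# log N(x | mu = x, sigma^2) = -0.5 * ln(2 * pi * sigma^2), which diverges as sigma -> 0
for sigma in [1.0, 1e-2, 1e-4, 1e-8]:
    log_density = -0.5 * np.log(2 * np.pi * sigma ** 2)
    print(f"sigma = {sigma:g}: log N(x_n | mu_j = x_n, sigma^2) = {log_density:.2f}")
```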

31
Q
Can the situation described in Flashcards 28–30 occur when there is only a single Gaussian distribution (K=1)? (Exercise Part 4)
A

No, when K=1, every data point is modeled by the same Gaussian. There is no competition between components, so even if one data point is exactly at the mean, the covariance is estimated over all data and cannot collapse to zero solely due to one point.

32
Q
Propose a heuristic to avoid the issue of a covariance collapsing to zero in a GMM. (Exercise Part 5)
A

A common heuristic is to add a small regularization term (e.g., a minimum variance floor) to the covariance estimates. This prevents the covariance from becoming too small, ensuring numerical stability and avoiding overfitting to individual points.
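
One possible implementation of such a floor (a sketch; the ridge value 1e-6 is an arbitrary illustrative choice):

```python
import numpy as np

def regularize_covariances(Sigmas, min_var=1e-6):
    """Add a small ridge min_var * I to each (K, D, D) covariance so none can collapse."""
    D = Sigmas.shape[-1]
    return Sigmas + min_var * np.eye(D)
```

Applying this to the M-step output at every iteration keeps each covariance positive definite and bounds the component densities away from infinity.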