Clustering Flashcards
What is unsupervised learning?
Learning patterns in data without labeled outputs or a “teacher”.
What is the goal of clustering?
To partition data into groups or clusters based on similarity.
What does the K-means algorithm minimize?
The within-cluster point scatter/variance.
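As a sketch, the within-cluster scatter that K-means minimizes can be written (with μ_k denoting the centroid of cluster C_k) as:

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
```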
What are the two main steps of the K-means algorithm?
1) Assign each point to its nearest cluster center, 2) Update each cluster center to the mean of its assigned points. Repeat until convergence.
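The two steps above can be sketched as follows; this is a minimal illustrative implementation on toy data, not a production one (the `kmeans` helper and the synthetic blobs are assumptions for the example):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal K-means sketch: alternate the two steps until centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 1: assign each point to its nearest cluster center
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # Step 2: update each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # convergence: stable centers
            break
        centers = new_centers
    return centers, labels

# Two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, labels = kmeans(X, k=2)
```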
What is the K-means++ algorithm used for?
To initialize the cluster centers for K-means in a way that improves convergence, by spreading out the initial cluster centers.
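A sketch of the K-means++ seeding idea: each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far, which spreads the initial centers apart (the `kmeans_pp_init` helper and the toy data are assumptions for the example):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """K-means++ seeding sketch: sample new centers far from existing ones."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]  # first center: uniformly random point
    for _ in range(k - 1):
        # squared distance of every point to its nearest existing center
        d2 = ((X[:, None, :] - np.array(centers)) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()  # far-away points are more likely to be picked
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(30, 2)) for m in (0.0, 4.0, 8.0)])
init = kmeans_pp_init(X, k=3)
```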
How can the number of clusters K be determined in non-probabilistic models?
By computing the sum of squared errors (SSE) for different values of K and applying the Elbow Method: pick the K at which the SSE curve's slope changes sharply (the "elbow").
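The Elbow Method can be sketched by running K-means for several K and recording the SSE; on data with three true clusters, the curve drops sharply up to K=3 and then flattens (the `kmeans_sse` helper and the synthetic data are assumptions for the example):

```python
import numpy as np

def kmeans_sse(X, k, iters=50, seed=0):
    """Run a basic K-means sketch and return the final sum of squared errors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=2), axis=1)
        # keep a center in place if its cluster becomes empty
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return float(((X - centers[labels]) ** 2).sum())

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.4, size=(40, 2)) for m in (0.0, 5.0, 10.0)])
sse = [kmeans_sse(X, k) for k in range(1, 7)]
# SSE shrinks as K grows; the "elbow" near the true K (here 3) suggests the choice
```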
What is a mixture model?
A probabilistic model that represents the presence of subpopulations within an overall population.
What are the two steps in generating samples from a Gaussian mixture model?
1) Draw a categorical variable Z to select a component, 2) Draw an observation from the selected Gaussian component.
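The two-step sampling procedure above can be sketched for a 1-D mixture; the particular weights, means, and standard deviations are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.3, 0.7])   # mixing coefficients (assumed toy values)
means = np.array([-2.0, 3.0])
stds = np.array([0.5, 1.0])

# Step 1: draw the categorical latent variable Z to select a component
z = rng.choice(2, size=10000, p=weights)
# Step 2: draw each observation from its selected Gaussian component
x = rng.normal(means[z], stds[z])
```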
What is the EM algorithm used for in mixture models?
To estimate the parameters of the mixture model by maximizing the likelihood.
What are the two main steps of the EM algorithm?
The Expectation (E) step and the Maximization (M) step, repeated until convergence (stable assignments and parameters).
In the context of Gaussian Mixture Models, what does the E-step compute?
The E-step computes the expected values of the latent variables: the posterior probabilities (responsibilities) that each data point belongs to each Gaussian component.
In the context of Gaussian Mixture Models, what does the M-step compute?
The M-step updates the model parameters (means, covariances, and mixing coefficients) to maximize the expected log-likelihood from the E-step, re-estimating the parameters to better fit the data given the current responsibilities.
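The E and M steps can be sketched for a 1-D two-component GMM; this is a toy illustration (the `pdf` helper, the initialization, and the synthetic data are assumptions), not a robust implementation:

```python
import numpy as np

def pdf(x, mu, s):
    """Gaussian density (assumed helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

# crude initial parameters (assumed)
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); s = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities r[i, k] proportional to w_k * N(x_i | mu_k, s_k)
    r = w * pdf(x[:, None], mu, s)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    nk = r.sum(axis=0)
    w = nk / len(x)                                  # mixing coefficients
    mu = (r * x[:, None]).sum(axis=0) / nk           # means
    s = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)  # std devs
```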
How does K-means compare to Gaussian Mixture Models?
K-means is usually faster, requiring fewer iterations and less computation per iteration. K-means assumes spherical clusters with equal variance, while a GMM allows clusters of different shapes and sizes. A GMM also produces soft assignments (a probability of belonging to each cluster), while K-means produces hard assignments.
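The soft- vs hard-assignment contrast can be sketched with a point midway between two equal components: the GMM responsibilities split 50/50, while K-means must commit to one label (the `pdf` helper and the toy parameters are assumptions for the example):

```python
import numpy as np

def pdf(x, mu, s):
    """Gaussian density (assumed helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

mu = np.array([0.0, 4.0])   # two symmetric components (assumed toy setup)
s = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])

x = 2.0                      # a point exactly between the two means
resp = w * pdf(x, mu, s)
resp /= resp.sum()           # GMM soft assignment: [0.5, 0.5]
hard = int(np.argmin((x - mu) ** 2))  # K-means hard assignment: a single label
```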
What is an advantage of mixture models over K-means?
Mixture models allow explicit distributional assumptions, and the fit to the data can be assessed by computing the likelihood.
What is an issue with mixture models in terms of identifiability?
The likelihood is invariant to permutations of the component labels (label switching), so the parameter estimates are identifiable only up to a permutation.