Unsupervised learning Flashcards
Explain the difference between nonparametric and model-based cluster analysis.
Nonparametric cluster analysis does not assume a model for the underlying data distribution and focuses on the arrangement of the data, often using a dissimilarity measure. Model-based clustering assumes a statistical model, often a mixture model, and uses inference on this model to determine clusters.
How does the k-Means clustering algorithm determine the number of clusters?
k-Means does not determine the number of clusters on its own: the user specifies k, often aided by methods like the elbow plot, which identifies a suitable number by examining how the within-cluster sum of squares decreases as k grows.
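For a concrete picture of what k-Means does once k is chosen, here is a minimal pure-Python sketch of Lloyd's algorithm (the naive "first k points" initialization is an illustrative assumption, not a recommended choice):

```python
import math

def kmeans(points, k, iters=20):
    """Minimal Lloyd's algorithm; points is a list of coordinate tuples."""
    centers = list(points[:k])  # naive init for illustration only
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # Update step: move each center to its cluster's mean.
        centers = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters
```

Running this with different values of k and recording the resulting within-cluster sum of squares is exactly what produces an elbow plot.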
Describe the role of dissimilarity measures in cluster analysis and provide examples.
Dissimilarity measures, such as Euclidean or Manhattan distance, quantify how different or similar two data points are, influencing cluster formation by determining how data points are grouped.
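The two distances named above can be written out directly; a minimal sketch for points given as coordinate tuples:

```python
def euclidean(a, b):
    # Straight-line (L2) distance: square root of summed squared differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # City-block (L1) distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))
```

Note that the two measures can rank pairs of points differently, which is why the choice of dissimilarity can change which clusters form.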
What is the elbow plot and how is it used in determining the optimal number of clusters?
The elbow plot is a graphical tool used in k-Means to plot the sum of squared distances of samples to their closest cluster center. The number of clusters is chosen at the ‘elbow’ point where adding another cluster does not provide much better modeling of the data.
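The quantity plotted on the elbow plot's y-axis can be computed as follows (a sketch assuming each cluster is a non-empty list of coordinate tuples):

```python
def within_cluster_ss(clusters):
    """Total within-cluster sum of squares: for each cluster, sum the
    squared distance of every point to that cluster's mean."""
    total = 0.0
    for c in clusters:
        mean = tuple(sum(v) / len(c) for v in zip(*c))
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean)) for p in c)
    return total
```

Plotting this value against k always yields a decreasing curve; the elbow is where the decrease flattens out.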
Explain the Gaussian mixture model (GMM) and how it is used in clustering.
A Gaussian mixture model represents the data as a mixture of several Gaussian distributions. Each component corresponds to a cluster, and the model parameters are estimated by maximum likelihood, typically via the EM algorithm.
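In one dimension, the mixture density is just a weighted sum of component Gaussian densities; a minimal sketch:

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of a univariate Gaussian with mean mu and std dev sigma.
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def mixture_pdf(x, weights, mus, sigmas):
    # Weighted sum of component densities; the weights must sum to 1.
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))
```

In clustering, a point is typically assigned to the component with the highest posterior probability of having generated it.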
Describe the Expectation-Maximization (EM) algorithm used in GMM.
The EM algorithm alternates between estimating the expected cluster memberships (responsibilities) given the current parameters (E-step) and updating the parameters given those memberships (M-step), iterating until convergence.
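The alternation can be sketched for the simplest possible case: a one-dimensional two-component mixture with unit variances and equal weights, where only the means are estimated (a teaching simplification; full EM also updates weights and variances):

```python
import math

def em_two_means(data, mu_init, iters=50):
    """EM for a 1-D, 2-component Gaussian mixture; unit variances and
    equal mixing weights are assumed so only the means are learned."""
    mu1, mu2 = mu_init
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point.
        r = []
        for x in data:
            p1 = math.exp(-0.5 * (x - mu1) ** 2)
            p2 = math.exp(-0.5 * (x - mu2) ** 2)
            r.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means.
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2
```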
How does Bayesian Information Criterion (BIC) help in selecting the number of clusters?
BIC evaluates the trade-off between the fit of the model and the complexity of the model. It penalizes models with more parameters, helping to avoid overfitting by selecting a model with fewer clusters if it offers a similar fit.
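One common convention writes BIC as a penalty term plus a (negated) fit term, so that lower is better; sign conventions vary across texts, so treat this as one sketch:

```python
import math

def bic(log_likelihood, n_params, n_samples):
    """BIC = k * ln(n) - 2 * ln(L_hat); lower values are preferred.
    The first term penalizes complexity, the second rewards fit."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood
```

For model selection in clustering, one fits mixtures with different numbers of components and keeps the one with the lowest BIC.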
What is weighted dissimilarity and how does it affect the clustering results?
Weighted dissimilarity applies different weights to features according to their importance, which can adjust the influence of particular features on the clustering process, potentially leading to different clustering outcomes.
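A common form is a weighted Euclidean distance, where each feature's squared difference is scaled by its weight; a minimal sketch:

```python
def weighted_euclidean(a, b, weights):
    """Euclidean distance with per-feature weights: a larger weight makes
    that feature contribute more to the dissimilarity."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)) ** 0.5
```

Setting a feature's weight to zero removes it from the comparison entirely, which is one way the weighting can change the resulting clusters.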
Compare hierarchical clustering to k-Means clustering.
Hierarchical clustering creates a tree of clusters and does not require a pre-specified number of clusters; it can be more informative for understanding data structure. k-Means is faster and more scalable but requires specifying the number of clusters.
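The agglomerative (bottom-up) variant can be sketched naively: start with singleton clusters and repeatedly merge the closest pair. Single linkage (closest-pair distance between clusters) is assumed here; other linkages such as complete or average differ only in that line:

```python
import math

def single_linkage(points, n_clusters):
    """Naive agglomerative clustering: merge the two closest clusters
    until only n_clusters remain. O(n^3)-ish; for illustration only."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
    return clusters
```

Recording the merge order and distances yields the cluster tree (dendrogram), which is what makes hierarchical clustering informative about data structure.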
What are silhouettes and how do they assist in evaluating clustering quality?
Silhouettes measure how similar an object is to its own cluster compared with other clusters. A higher silhouette value indicates a better fit to the assigned cluster, so silhouettes are used to assess the appropriateness of a clustering.
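The silhouette of a single point is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the mean distance to the nearest other cluster. A sketch assuming the point's own cluster has at least two members:

```python
import math

def silhouette(point, own_cluster, other_clusters):
    """Silhouette value of one point, in [-1, 1]; values near 1 mean the
    point sits well inside its assigned cluster."""
    others = [p for p in own_cluster if p != point]
    a = sum(math.dist(point, p) for p in others) / len(others)
    b = min(
        sum(math.dist(point, p) for p in c) / len(c)
        for c in other_clusters
    )
    return (b - a) / max(a, b)
```

Averaging this value over all points gives an overall quality score for the clustering, often used to compare different choices of k.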
Explain how autoassociative neural networks function in unsupervised learning.
Autoassociative neural networks, or autoencoders, compress input data into a lower-dimensional representation and then reconstruct the output to match the input, facilitating dimensionality reduction and feature learning without labeled data.
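A toy illustration of the compress-then-reconstruct idea: a linear autoencoder with a single bottleneck unit and tied weights, trained by gradient descent on squared reconstruction error (a teaching sketch with a fixed initialization and no biases or nonlinearity; real autoencoders use nonlinear layers):

```python
def train_autoencoder(data, lr=0.01, epochs=200):
    """Encode h = w . x, decode xhat = h * w; minimize ||x - xhat||^2.
    With 2-D inputs and a 1-D code, w learns the main direction of the data."""
    w = [0.5, 0.1]  # fixed small init, assumed for reproducibility
    for _ in range(epochs):
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]            # encode
            e = [x[0] - h * w[0], x[1] - h * w[1]]   # reconstruction error
            ew = e[0] * w[0] + e[1] * w[1]
            # Gradient descent step on the squared error w.r.t. tied weights.
            for j in range(2):
                w[j] += lr * 2 * (x[j] * ew + h * e[j])
    return w

def reconstruct(w, x):
    h = w[0] * x[0] + w[1] * x[1]
    return [h * w[0], h * w[1]]
```

No labels are used anywhere: the reconstruction target is the input itself, which is what makes this unsupervised.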