2: Clustering Flashcards
What is meant by clustering?
Clustering means gathering data records into natural groups (i.e., clusters) of similar samples according to predefined similarity/dissimilarity metrics, which yields useful information about the given dataset.
What is high inter-cluster separation?
The contents of any cluster should be very different from the contents of every other cluster.
What is the similarity/dissimilarity metric used in unsupervised learning (UL)?
A form of distance function between each pair of data records.
Distance measure and combining A & B.
The distance measures how close A and B are to each other, and a decision is made whether to combine A and B in one cluster.
What are the forms of distance functions?
Euclidean and Manhattan.
How are the Euclidean and Manhattan distances calculated?
… for two dimensional datasets (i.e., having two features) ….
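As an illustration, the two distance functions can be sketched for two-feature records as follows. This uses the standard textbook definitions, which may differ in notation from the formulas elided in the card above; the records `A` and `B` are made-up values.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical data records, each with two features
A = (1.0, 2.0)
B = (4.0, 6.0)

print(euclidean(A, B))  # 5.0
print(manhattan(A, B))  # 7.0
```

Both functions also work unchanged for records with more than two features, as long as the two records have the same length.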
What are the applications of clustering?
Pattern recognition, image processing, spatial data analysis, bioinformatics, crime analysis, medical imaging, climatology, robotics, market segmentation, and recommendation systems (e.g., Spotify).
What are the main classes of methods and techniques in UL?
Centroid-based clustering methods
Gaussian mixture model clustering methods
Hierarchical clustering methods
Density-based clustering methods
What is centroid-based clustering (K-Means)?
Centroid-based clustering searches for a pre-determined number of clusters within an unlabeled and possibly multidimensional dataset. The rule is that the distance between a data record and each cluster's centroid is calculated, and the data record is assigned to the cluster with the minimum distance.
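The assignment rule can be sketched as below. This is a minimal illustration, assuming Euclidean distance; the function name `nearest_centroid` and the centroid values are hypothetical.

```python
import math

def nearest_centroid(record, centroids):
    """Return the index of the centroid closest to `record` (Euclidean distance)."""
    distances = [math.dist(record, c) for c in centroids]
    return distances.index(min(distances))

# Two hypothetical cluster centroids in a two-feature space
centroids = [(0.0, 0.0), (10.0, 10.0)]

print(nearest_centroid((1.0, 2.0), centroids))  # 0
print(nearest_centroid((8.0, 9.0), centroids))  # 1
```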
What steps does K-means consist of?
Initialization (choose the number of clusters and random centroids), assignment (the clusters are formed by connecting each data record with its nearest centroid), and update (each centroid is recomputed as the mean of the records assigned to it). Assignment and update are repeated until convergence.
K-Means + & -?
+ Simple to employ; + can be used smoothly on large datasets; + adjusts the outliers; - significantly difficult to predict the number of clusters; - assumes clusters are round, so it doesn't work well for data that has groups with different shapes; - evaluating distances becomes increasingly less informative in high-dimensional spaces.
Explain k-means clustering.
K-means clustering is a simple and elegant algorithm to partition a dataset into k distinct, non-overlapping clusters. Choose a number of clusters k. Randomly assign a number between one and k to each observation; these serve as initial cluster assignments. For each of the k clusters, compute the cluster centroid: the kth cluster centroid is the vector of the p feature means for the observations in the kth cluster. Assign each observation to the cluster whose centroid is closest (where closest is defined using a distance metric). Repeat the centroid and assignment steps until the cluster assignments stop changing.
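The procedure described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, numbers the clusters 0 to k-1 rather than 1 to k, and re-seeds an empty cluster with a random observation (a choice the card does not specify). The dataset is made up.

```python
import math
import random

def centroid(points):
    """Vector of per-feature means of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def k_means(data, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial step: randomly assign each observation a cluster label in 0..k-1
    labels = [rng.randrange(k) for _ in data]
    centroids = []
    for _ in range(max_iter):
        # Compute the centroid of each cluster from its current members
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random observation
            centroids.append(centroid(members) if members else rng.choice(data))
        # Reassign each observation to the cluster with the closest centroid
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in data
        ]
        # Stop once the cluster assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical two-feature dataset with two well-separated groups
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

The label numbers themselves are arbitrary and depend on the random initial assignment; only the grouping of observations is meaningful.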
What is a Gaussian mixture model?
In Gaussian mixture model (GMM) clustering, each cluster is considered as a probabilistic generative model. A data record has a probability of belonging to each cluster, and it is assigned to the cluster returning the highest probability. As in the k-means method, the number of clusters for the input dataset must be assumed up front. GMM tries to fit a mixture of Gaussian distributions to the dataset, where each distribution defines one cluster.
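The probabilistic assignment can be sketched for a one-feature dataset as follows. This is only an illustration of the assignment rule: it assumes the mixture has already been fitted, and the weights, means, and standard deviations in `components` are invented values. Fitting the parameters themselves is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def membership_probabilities(x, components):
    """Probability that x belongs to each cluster.

    `components` is a list of (weight, mean, std) tuples; the weights are the
    mixing proportions and are assumed to sum to 1.
    """
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical, already-fitted mixture with two clusters
components = [(0.5, 0.0, 1.0), (0.5, 5.0, 1.0)]

probs = membership_probabilities(1.0, components)
cluster = probs.index(max(probs))  # assign to the most probable cluster
print(probs, cluster)
```

Unlike the hard assignment in k-means, every record keeps a probability for every cluster; the final label is just the largest of them.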
What is the limitation of the GMM?
The algorithm can be slow because it finds the probability distribution for each cluster. It can also get stuck in a local maximum of the log-likelihood function. The steps, in general, suffer from heavy computations of the conditional probabilities.
What are the two parameters that describe each cluster in a GMM?
Cluster mean and standard deviation (a covariance matrix when the data has more than one feature).