Module 2: Chapter 2 - Unsupervised Learning Flashcards
What is the logic behind k-means clustering?
The purpose of K-means is not to minimize the distance between points, but rather to minimize the distance between each point and its centroid
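In symbols, this objective can be written as follows (a standard formulation, not taken from the card itself; Cj denotes the set of points assigned to cluster j and µj its centroid):

```latex
\min_{\{C_j\},\,\{\mu_j\}} \; \sum_{j=1}^{K} \sum_{X_i \in C_j} \lVert X_i - \mu_j \rVert^{2}
```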
In k-means clustering, what is determined for numerical data and for categorical data to find the centers?
Numerical data: centroids
Categorical data: medoids
Medoids = the most centrally located actual data points within each cluster (the points with the smallest total dissimilarity to the other points in their cluster)
Explain the four steps of k-means clustering algorithm
The algorithm, sometimes also known as Lloyd’s algorithm, proceeds as follows:
1) Randomly choose initial values for the centroids, µj, j = 1, …, K, which are the centers of the clusters.
2) Allocate each data point, Xi, i = 1, …, N, to its nearest centroid.
3) Recalculate the centroids to be at the centers of all the data points assigned to them.
4) Repeat steps 2 and 3 until the centroids no longer change.
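The four steps can be sketched directly in NumPy (an illustrative sketch only: the function name, the choice to initialize from the data points themselves, and the absence of empty-cluster handling are assumptions, not part of the algorithm description above):

```python
import numpy as np

def lloyd_kmeans(X, K, max_iter=100, seed=0):
    """Plain Lloyd's algorithm: X is an (N, d) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K initial centroids at random (here: K distinct data points).
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: allocate each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of its assigned points
        # (no handling of empty clusters in this sketch).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```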
What two types of distance measurement for k-means clustering exist? Explain how they work
(1) Euclidean distance (“as the crow flies”)
–> Square root of the sum of the squares of the distances in each dimension, sometimes known as the L2-norm
(2) Manhattan distance (“city block”)
–> Sum of the absolute values of the distances in each dimension, sometimes known as the L1-norm
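A small numerical sketch of the two measures (the vectors are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean (L2-norm): square root of the sum of squared differences -> ~3.61
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan (L1-norm): sum of the absolute differences in each dimension -> 5.0
manhattan = np.sum(np.abs(a - b))
```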
What performance measurements for k-means clustering are generally used?
(1) Inertia
The better the model’s fit, the closer the data points will be, collectively, to their respective centroids.
Calculate the within-cluster sum of squares [WCSS]: the sum of squared Euclidean distances between each point and its centroid. The lower the inertia, the better each cluster fits its data points, and so the within-cluster sum of squares should be minimized.
(2) Between Cluster sum of squares (BCSS)
The sum of squares of the distances of each centroid from the centroid of all the data points, weighted by the number of data points in each cluster. This measure, which increases as the number of clusters grows, gives another indication of how well the clusters are separated.
(3) Variance Ratio Criterion [VRC] / Calinski-Harabasz Index
VRC = [BCSS / (K − 1)] / [WCSS / (N − K)] = BCSS(N − K) / [WCSS(K − 1)], where N is the number of data points and K the number of clusters; a higher VRC indicates better-defined clusters.
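A sketch of how the three measures could be computed, here using scikit-learn on a synthetic data set (the library, the toy data, and the parameter values are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data
N, K = len(X), 4
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)

# (1) Inertia = within-cluster sum of squares (WCSS).
wcss = km.inertia_

# (2) BCSS: squared distance of each centroid from the overall mean,
#     weighted by the number of points in that cluster.
overall_mean = X.mean(axis=0)
sizes = np.bincount(km.labels_)
bcss = np.sum(sizes * np.sum((km.cluster_centers_ - overall_mean) ** 2, axis=1))

# (3) VRC = [BCSS / (K - 1)] / [WCSS / (N - K)]; higher is better.
vrc = (bcss / (K - 1)) / (wcss / (N - K))
print(vrc, calinski_harabasz_score(X, km.labels_))  # the two should agree
```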
What is the problem with choosing the initial positions of the K centroids randomly?
The allocation of data points to clusters depends to some degree on the initial positions of the centroids. Hence, the initial positioning of the centroids can influence the final result, and this is especially true the larger the value of K.
Solution: Run K-means several times with different random initializations, and then select the outcome with the lowest inertia.
What is K-means++?
It involves establishing the initial positions of the centroids far apart from each other in feature space: only the first centroid is chosen purely at random, and each subsequent one is placed far from those already chosen. This can lead to better outcomes and faster convergence to the optimal solution. With purely random initialization, it is possible that the starting centroid positions are close together, making it hard to identify distinct clusters; this means it can take many iterations to reach convergence when the numbers of features and data points are large.
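Both remedies (multiple random restarts and k-means++ seeding) are available as options in, for example, scikit-learn's KMeans; a minimal sketch, with the data and parameter values chosen arbitrarily:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=1)  # toy data

# init="k-means++" spreads the initial centroids out in feature space;
# n_init repeats the whole algorithm with different initializations and
# keeps the run with the lowest inertia.
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=1).fit(X)
print(km.inertia_)
```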
What is a problem with the measurement of inertia? (similar to R squared)
With the rise in K, inertia can only fall (adding centroids never increases the WCSS), so choosing K by minimizing inertia alone would always favor more clusters, just as R-squared never falls when more regressors are added to a model.
What is the rule of thumb in setting K in k-means approach?
Set K equal to the square root of N/2, i.e., K ≈ √(N/2), where N is the number of data points (for example, N = 200 gives K ≈ 10).
Apart from the rule of thumb, what two other approaches can be used to choose an appropriate value of K?
(1) Scree Plot
Calculate the value of the WCSS for different values of K and plot the result; in other words, one can plot the within-cluster sum of squares against the number of clusters.
Then examine the figure to determine whether there is an obvious point at which the WCSS starts to decline more slowly as K is further increased (a so-called “elbow”). Because, by construction, the inertia scales with the square of the feature values, a particular value of it cannot be judged “good” or “bad” in isolation; it can only be compared across models (a sketch of this plot follows after this answer).
(2) Silhouette Scores
Compare the average distance of each observation from other points in its own cluster (known as cluster cohesion, denoted ai) with its average distance from points in the closest other cluster (known as cluster separation, denoted bi)
For each point, the silhouette is si = (bi − ai) / max(ai, bi), and the silhouette score s is the mean of si over all points; it measures the difference between the intra-cluster and inter-cluster distances, normalized by the larger of the two. The best value of K is the one that gives the highest silhouette score s, which implies that the points within a cluster are closest together but furthest away from other clusters. The limiting case s = 1 would imply that all the points allocated to each cluster sit exactly on their respective centroids, whereas s ≈ 0 implies that the clusters overlap significantly (both approaches are sketched in the code below).
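A sketch combining both approaches on synthetic data (the library, data, and range of K values are illustrative assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=2)  # toy data

ks = range(2, 11)                      # silhouette needs at least 2 clusters
wcss, sil = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=2).fit(X)
    wcss.append(km.inertia_)           # for the scree/elbow plot
    sil.append(silhouette_score(X, km.labels_))

# (1) Scree plot: look for the "elbow" where WCSS starts falling more slowly.
plt.plot(ks, wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("Within-cluster sum of squares")
plt.show()

# (2) Silhouette: pick the K with the highest average score.
print(max(zip(sil, ks)))
```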
What are three advantages of k-means clustering?
(1) Makes intuitive sense and is easy to visualize if there are up to three features.
(2) Is quite fast and scales well to large numbers of data points.
(3) Will always converge to a solution (even though it might not always be the most appropriate one).
What are three problems with k-means clustering?
(1) Non-spherical clusters. As well as the need to specify the number of clusters a priori, a further disadvantage of the technique is that, because it is based on distances from a centroid, it tends to produce spherical clusters. Some of the clusters that arise in practice have non-spherical shapes, and this can cause problems: K-means might not identify appropriate clusters in any application where the data do not form approximately spherical groups. Possible solutions are either to apply a non-linear transformation to the input data, known as a kernel function, or to switch to a different technique that can cope with a wider variety of data patterns.
(2) The presence of outliers. K-means can be highly sensitive to outliers, which can either distort the centroids or, in extreme cases, end up in clusters of their own. Standardizing the data using the min-max transformation will mitigate this issue to some extent, although, from this perspective, it may be preferable to remove outlying data points from the sample entirely before beginning the analysis.
(3) The curse of dimensionality. Although K-means scales well as the number of data points increases, it tends to perform poorly when the number of features is large, converging very slowly to an optimal solution. The reason is that, as the number of features increases for a fixed sample size, the data points become increasingly sparse (“spread out”) in feature space. Although the distance between each point and a centroid will increase as features are added (the measure being a sum of squares), the distances between a given point and the different centroids will become increasingly similar, which is known as distance concentration. This makes it hard for K-means to identify the clusters. Moreover, K-means struggles to construct the clusters if the features are highly correlated.
Solutions:
- Some features can be removed from the analysis.
- The features can be transformed to a smaller subset using principal components analysis or a similar technique (see the sketch after this list).
- A distance measure that does not always increase with the number of features, such as the cosine distance, can be employed instead of the Euclidean distance.
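As an illustration of the second solution, the features could be standardized and compressed with principal components analysis before clustering; a minimal sketch assuming scikit-learn, with the toy data and the number of components chosen arbitrarily:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with many (partly redundant) features.
X, _ = make_blobs(n_samples=300, n_features=50, centers=4, random_state=4)

# Standardize, reduce to a handful of principal components, then cluster.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KMeans(n_clusters=4, n_init=10, random_state=4),
)
labels = model.fit_predict(X)
```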
What is Fuzzy K-Means?
Soft clustering, also known as fuzzy clustering or fuzzy C-means, allows each data point to belong to more than one cluster, with a probability (membership degree) assigned to each cluster. Implementing this requires a modification of the inertia equation: instead of directly summing the squared distances dj², each term is weighted by the membership probability raised to a power f, where f is a fuzziness coefficient. The larger the value of f, the greater the extent to which the clusters will overlap (i.e., they will be fuzzier).
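In symbols, the fuzzy objective replaces the plain sum of squared distances with a membership-weighted sum (standard fuzzy C-means notation, not from the card itself; uij is the degree to which point i belongs to cluster j):

```latex
J_f = \sum_{i=1}^{N} \sum_{j=1}^{K} u_{ij}^{\,f}\, d_{ij}^{2},
\qquad d_{ij} = \lVert X_i - \mu_j \rVert
```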
What is an advantage of hierarchical clustering?
It does not require the number of clusters to be specified a priori; instead, every possible value of K (from 1 to N) is considered as an integral part of the process.
What two types of hierarchical clustering exist?
(1) Divisive: This begins with just one cluster and then sequentially partitions the data by adding another cluster until every data point has its own cluster.
(2) Agglomerative: This begins with each point having its own cluster, and then the closest two clusters in feature space are merged. The process ends when all points have been merged into a single cluster.
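A minimal sketch of the agglomerative variant using SciPy's hierarchical clustering tools (the Ward linkage and the toy data are assumptions for illustration); the resulting dendrogram records every merge, so any value of K can be read off the same tree:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=5)  # toy data

# Agglomerative clustering: each point starts in its own cluster and the two
# closest clusters are merged repeatedly until only one remains; the
# dendrogram records every merge, so any K from N down to 1 can be recovered.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```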