Module 2: Chapter 2 - Unsupervised Learning Flashcards

1
Q

What is the logic behind k-means clustering?

A

The purpose of K-means is not to minimize the distances between points, but rather to minimize the total (squared) distance between each point and the centroid of the cluster to which it is assigned.

2
Q

In k-means clustering, what is determined for numerical data and for categorical data to find the centers?

A

Numerical data: centroids
Categorical data: medoids

Medoids = actual data points used as the cluster centers (the most frequently occurring points), since an average cannot be computed for categorical data

3
Q

Explain the four steps of the k-means clustering algorithm

A

The algorithm, sometimes also known as Lloyd’s algorithm, proceeds as follows:

1) Randomly choose initial values for the centroids, µ_j, j = 1, …, K, which are the centers of the clusters.

2) Allocate each data point, X_i, i = 1, …, N, to its nearest centroid.

3) Recalculate the centroids to be at the centers of all the data points assigned to them.

4) Repeat steps 2 and 3 until the centroids no longer change.
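
As a rough illustration, the four steps above can be sketched in Python with NumPy (a minimal from-scratch version; the names X, K, max_iter and the random-seed choice are assumptions for illustration, not taken from the text, and the empty-cluster edge case is ignored):

import numpy as np

def lloyd_kmeans(X, K, max_iter=100, seed=0):
    # Step 1: randomly choose K data points as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 2: allocate each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recalculate each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        # Step 4: stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels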

4
Q

What two types of distance measure are used in k-means clustering? Explain how they work

A

(1) Euclidean distance (“as the crow flies”)
–> Square root of the sum of the squares of the distances in each dimension, sometimes known as the L2-norm

(2) Manhattan distance
–> Sum of the absolute values of the distances in each dimension, sometimes known as the L1-norm
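
A small worked example in Python, assuming two illustrative points a = (1, 2) and b = (4, 6) (numbers chosen purely for illustration):

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(((a - b) ** 2).sum())   # sqrt(3^2 + 4^2) = 5.0 (L2-norm)
manhattan = np.abs(a - b).sum()             # |3| + |4| = 7.0 (L1-norm)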

5
Q

What performance measurements for k-means clustering are generally used?

A

(1) Inertia
The better the model’s fit, the closer the data points will be, collectively, to their respective centroids.

Inertia is the within-cluster sum of squares (WCSS): the sum of the squared Euclidean distances between each point and its centroid. The lower the inertia, the better each cluster fits its data points, and so the within-cluster sum of squares should be minimized.

(2) Between Cluster sum of squares (BCSS)

The sum of squares of the distances of each centroid to the centroid of all data points weighted by the number of data points in each cluster. This measure, which increases as the number of clusters grows, gives another indication of how well the clusters are separated

(3) Variance Ratio Criterion (VRC) / Calinski-Harabasz index

VRC = [BCSS / (K − 1)] / [WCSS / (N − K)] = BCSS(N − K) / [WCSS(K − 1)]

The higher the VRC, the better defined the clusters, since it rewards a high BCSS relative to the WCSS.
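
As a rough sketch, these measures can be computed with scikit-learn (the library, the synthetic data and the choice of K = 4 are assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # illustrative data

kmeans = KMeans(n_clusters=4, random_state=0).fit(X)

wcss = kmeans.inertia_                      # within-cluster sum of squares (inertia)

# BCSS: squared distances of each centroid to the overall centroid,
# weighted by the number of points in each cluster
overall_centroid = X.mean(axis=0)
counts = np.bincount(kmeans.labels_)
bcss = (counts * ((kmeans.cluster_centers_ - overall_centroid) ** 2).sum(axis=1)).sum()

vrc = calinski_harabasz_score(X, kmeans.labels_)   # Variance Ratio Criterion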

6
Q

What is the problem with choosing the initial positions of the K centroids randomly, especially when K is large?

A

The allocation of data points to clusters depends to some degree on the initial positions of the centroids. Hence, the initial positioning of the centroids can influence the final result, and this is especially the case the higher the value of K.

Solution: Run K-means several times with different random initializations, and then select the outcome with the lowest inertia.
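
A sketch of how this looks in scikit-learn (library, data and parameter values are illustrative assumptions): the n_init argument reruns the algorithm with different random initializations and keeps the run with the lowest inertia.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=1)   # illustrative data

# init='random' draws the starting centroids at random; n_init=20 repeats this
# 20 times and retains the solution with the lowest inertia
kmeans = KMeans(n_clusters=5, init='random', n_init=20, random_state=1).fit(X)
print(kmeans.inertia_)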

7
Q

What is K-means++?

A

With purely random initial centroid locations, it is possible that some initial positions are close together, making it hard to identify distinct clusters and meaning that many iterations may be needed to reach convergence when the numbers of features and data points are large. K-means++ addresses this by establishing the initial positions of the centroids far from each other in feature space: only the first centroid is chosen purely at random, and each subsequent centroid is placed far from those already selected. This can lead to better outcomes and faster convergence to the optimal solution.
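
In scikit-learn, k-means++ is the default initialization; a minimal sketch (the data and parameter values are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=2)   # illustrative data

# init='k-means++' spreads the initial centroids out in feature space
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=2).fit(X)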

8
Q

What is a problem with using inertia to choose K? (analogous to R-squared in regression)

A

As K rises, inertia can only fall (just as R-squared can only rise as more features are added to a regression), so a lower inertia does not by itself indicate a better choice of K.

9
Q

What is the rule of thumb for setting K in the k-means approach?

A

Set K equal to the square root of N/2, i.e., K ≈ √(N/2). For example, with N = 200 data points, K ≈ √100 = 10.

10
Q

Apart from the rule of thumb, what two other approaches can be used to choose an appropriate value of K?

A

(1) Scree Plot
Calculate the WCSS for different values of K and plot the results; in other words, plot the within-cluster sum of squares against the number of clusters.
Then examine the figure to determine whether there is an obvious point at which the WCSS starts to decline more slowly as K is further increased (a so-called “elbow”). Because, by construction, the inertia scales with the square of the feature values, we cannot say that a particular value of it is “good” or “bad” in absolute terms; we can only compare it across models.

(2) Silhouette Scores
Compare the average distance of each observation from the other points in its own cluster (known as cluster cohesion, denoted a_i) with its average distance from the points in the closest other cluster (known as cluster separation, denoted b_i). For each point, s_i = (b_i − a_i) / max(a_i, b_i).

The silhouette score s is the mean of these values: it measures the mean difference between the intra-cluster and inter-cluster distances, normalized by the larger of the two. The best value of K is the one that gives the highest silhouette score s, which implies that the points within a cluster are closest together but furthest away from other clusters. The limit case s = 1 would imply that all the points allocated to each cluster sit exactly on the centroid of their respective clusters, whereas s ≈ 0 implies that the clusters overlap significantly.
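
A sketch of both approaches using scikit-learn and matplotlib (the libraries, the synthetic dataset and the range of K values tried are assumptions for illustration):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, random_state=3)   # illustrative data

ks = range(2, 11)
wcss = []
silhouettes = []
for k in ks:
    km = KMeans(n_clusters=k, random_state=3).fit(X)
    wcss.append(km.inertia_)                              # for the scree (elbow) plot
    silhouettes.append(silhouette_score(X, km.labels_))   # mean silhouette score

plt.plot(ks, wcss, marker='o')   # look for the 'elbow' where the curve flattens
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.show()

best_k = ks[silhouettes.index(max(silhouettes))]   # K with the highest silhouette score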

11
Q

What are three advantages of k-means clustering?

A

(1) Makes intuitive sense and is easy to visualize if there are up to three features.

(2) Is quite fast and scales well to large numbers of data points.

(3) Will always converge to a solution (even though it might not always be the most appropriate one).

12
Q

What are three problems with k-means clustering?

A

(1) Non-spherical clusters. As well as the need to specify the number of clusters a priori, a further disadvantage of the technique is that, because it is based on distances from a centroid, it tends to produce spherical clusters. Some of the clusters that arise in practice have non-spherical shapes, and this can cause problems: K-means might not identify appropriate clusters in applications where the clusters in the data are not approximately spherical. Possible solutions are either to apply a non-linear transformation to the input data, known as a kernel function, or to switch to a different technique that can cope with a wider variety of data patterns.

(2) The presence of outliers. K-means can be highly sensitive to outliers, which can either distort the centroids or, in extreme cases, end up in clusters of their own. Standardizing the data using the min-max transformation will mitigate this issue to some extent, although it may be preferable to remove outlying data points from the sample entirely before beginning the analysis.

(3) The curse of dimensionality. Although K-means scales well as the number of data points increases, it tends to perform poorly when the number of features is large, converging very slowly to an optimal solution. The reason is that, as the number of features increases for a fixed sample size, the data points become increasingly sparse (“spread out”) in feature space. Although the distance between each point and a centroid will increase as features are added (because the distance is a sum of squares over the features), the distances between a given point and the different centroids will become increasingly similar, which is known as distance concentration. This makes it hard for K-means to identify the clusters. Moreover, K-means struggles to construct the clusters if the features are highly correlated.

Solutions:
- Some features can be removed from the analysis.
- The features can be transformed to a smaller subset using principal components analysis or a similar technique
- A distance measure that does not always increase with the number of features, such as the cosine distance, can be employed instead of the Euclidean distance.
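
A sketch of the second mitigation, reducing the features with principal components analysis before clustering (the library calls, the synthetic high-dimensional data and the number of retained components are assumptions for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, n_features=50, centers=4, random_state=4)  # many features

X_scaled = StandardScaler().fit_transform(X)              # scale the features first
X_reduced = PCA(n_components=5).fit_transform(X_scaled)   # keep 5 components (arbitrary choice)

kmeans = KMeans(n_clusters=4, random_state=4).fit(X_reduced)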

13
Q

What is Fuzzy K-Means?

A

Soft clustering, also known as fuzzy clustering or fuzzy C-means. Here, each data point can belong to more than one cluster, with a probability assigned to each cluster. Implementing this requires a modification of the equation for inertia: instead of summing the squared distances d_j² directly, each squared distance is weighted by the membership probability raised to a power f, where f is a fuzziness coefficient. The larger the value of f, the greater the extent to which the clusters will overlap (i.e., the fuzzier they will be).
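
Written out, a standard form of this modified objective (the notation u_ij and d_ij is assumed here rather than taken from the text, with u_ij the probability that point i belongs to cluster j and d_ij its distance to centroid j) is:

J = Σ_i Σ_j (u_ij)^f × d_ij²

which collapses to the usual inertia when every u_ij is either 0 or 1.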

14
Q

What is an advantage of hierarchical clustering?

A

It does not require the number of clusters to be specified a priori. Instead, every possible value of K (from 1 to N) is considered as an integral part of the process.

15
Q

What two types of hierarchical clustering exist?

A

(1) Divisive: This begins with just one cluster and then sequentially partitions the data by adding another cluster until every data point has its own cluster.

(2) Agglomerative: This begins with each point having its own cluster, and then the closest two clusters in feature space are merged. The process ends when all points have been merged into a single cluster.

16
Q

How does hierarchical clustering proceed computationally?

A

A definition of the proximity between sets of points is required. Here, we will need to capture the distance between one cluster and another cluster and compare that with the distance between one cluster and a single point not in that cluster. Instead of a distance measure, a linkage criterion is used, of which there are several variants.

17
Q

What two types of distance measures (linkage criteria) are used for hierarchical clustering?

A

(1) Single linkage
Calculates the distance between clusters based on the distance between two data points from those clusters that are closest to one another.

(2) Complete linkage
Calculates the distance between the points in the clusters that are furthest apart from one another.
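
A sketch with scikit-learn's agglomerative clustering, where the linkage argument selects the criterion (the library, dataset and number of clusters are illustrative assumptions):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=5)   # illustrative data

single   = AgglomerativeClustering(n_clusters=3, linkage='single').fit(X)    # closest points
complete = AgglomerativeClustering(n_clusters=3, linkage='complete').fit(X)  # furthest points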

18
Q

What is used to represent the results of hierarchical clustering?

A

Dendrograms
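
A dendrogram can be produced with SciPy, for example (a minimal sketch; the dataset and the choice of complete linkage are assumptions for illustration):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=6)   # small illustrative sample

Z = linkage(X, method='complete')   # agglomerative clustering with complete linkage
dendrogram(Z)
plt.xlabel('Data points')
plt.ylabel('Merge distance')
plt.show()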

19
Q

What is density-based clustering?

A

Another approach to clustering is the family of techniques including “DBSCAN” (density-based spatial clustering of applications with noise) and “SNN” (shared nearest neighbors) that are based on densities of points. Here, a region in feature space constitutes a cluster if the density of points exceeds a pre-specified threshold.

(1) An observation is a core point if at least a pre-specified number of other points are within a threshold distance of it.

(2) An observation is a border point if it is within the threshold distance from a core point, but it has fewer than the pre-specified number of other points close to it.

(3) An observation is considered a noise point if it is neither a core nor a border point.
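
A sketch with scikit-learn's DBSCAN, where eps is the threshold distance and min_samples the pre-specified number of neighbouring points (the dataset and parameter values are arbitrary illustrative assumptions):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=7)   # non-spherical clusters

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                          # cluster label per point; -1 marks noise points
n_noise = np.sum(labels == -1)
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core points identified by the algorithm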

20
Q

What are the advantages of density-based clustering?

A

(1) It can handle non-spherical distributions of points in feature space.

(2) It is considerably more robust with respect to outliers; indeed, observations identified as noise points are automatically excluded from the clusters.

21
Q

How would you choose the number of clusters when using unsupervised learning?

A

The more clusters are used when fitting an unsupervised model, the better the fit of the algorithm to the data, but as the number of clusters increases, the usefulness of the model will start to diminish.

Determining the most appropriate number of clusters for a particular dataset could involve constructing a “scree plot,” which charts the inertia (sum of squared distances of each point to its centroid) against the number of clusters. We would then search for the number of clusters beyond which the inertia only declines very slowly. Silhouette scores, which compare the distance of each point (a) to points in its own cluster and (b) to points in the closest other cluster, can also be used.

22
Q

Explain the steps in using the K-means clustering algorithm

A

(1) Specify the number of centroids, K and choose a distance measure (e.g., the Euclidean or Manhattan distance).

(2) Scale the features using either standardization or normalization.

(3) Select K points at random from the training data to be the centroids

(4) Allocate each data point to its nearest centroid.

(5) Given the points allocated to each centroid, re-calculate the appropriate location of the centroids.

(6) If the positions of the centroids have changed from those in the previous iteration, then return to step 4. If the positions of the centroids have not changed (and so the cluster allocations are unchanged), then stop.
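
The same steps expressed as a scikit-learn sketch (the library, dataset, scaling choice and value of K are assumptions; scikit-learn's KMeans uses the Euclidean distance):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=8)   # illustrative data

# Steps 1-2: choose K and scale the features; steps 3-6 are handled inside KMeans,
# which iterates until the centroids stop moving (or max_iter is reached)
model = make_pipeline(StandardScaler(),
                      KMeans(n_clusters=4, init='random', n_init=10, random_state=8))
labels = model.fit_predict(X)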

23
Q

In practice, the K-means clustering algorithm is often carried out with several different initial values for the centroids. How would you choose between clusters that result from different initial choices for the centroids?

A

You could select the centroids where the total inertia was the lowest, as this would represent the choice of centroid positions that best fitted the feature data.

24
Q

How do you use the scree plot for choosing the number of K-means clusters?

A

A scree plot has the number of clusters on the x-axis and the within-cluster sum of squares on the y-axis, which is a measure of how closely the points allocated to each cluster fit their respective centroids. We look for an ‘elbow’ in the plot, where its gradient changes from steep to almost flat and that is the optimal K.

25
Q

What are the two types of hierarchical clustering? What are the advantages of hierarchical clustering?

A

Hierarchical clustering starts with all points in one cluster and then sequentially splits them into separate clusters until an optimal allocation is reached (divisive hierarchical clustering) or starts with each data point in its own cluster and sequentially combines them until an optimal allocation is reached (agglomerative hierarchical clustering).

The advantages of hierarchical clustering are first, that it does not require pre-defining the number of clusters and second, it can uncover hierarchical relationships within the data, which can reveal nested clusters within larger groups. Third, the dendrograms produced are straightforward to interpret.

26
Q

State whether the following statement is true or false and explain your answer: K-means clustering may not identify appropriate clusters when the clusters in the data are not approximately spherical.

A

True. Because K-means is based on linear Euclidean distances, it runs into problems when the cluster is not approximately spherically shaped. When the Manhattan distance measure is used with K-means, the clusters are approximately rhombus-shaped. This can result in poorly defined clusters, or some points even being allocated to the wrong cluster.

27
Q

The within cluster sum of squares will never rise when the number of clusters is increased in a K-means application

A

True. The situation is analogous to what happens to the residual sum of squares when more features are added to a linear regression model. When additional clusters are added (i.e., the value of K is increased), the fit of the model to the data cannot get worse. Therefore, the within-cluster sum of squares must fall, as the new cluster will capture one or more of the data points better than the cluster(s) to which they were previously allocated.

28
Q

What is a dendrogram and how would we interpret one?

A

A dendrogram is a pictorial representation of the steps in a hierarchical clustering application, which shows how the clusters are split or combined. Each bifurcation (for divisive clustering) or combination (for agglomerative clustering) of the lines shows a cluster being formed or removed, respectively. The heights of the vertical lines show the impact of the marginal cluster on the distance of the points affected by it to their nearest cluster. If a particular vertical line is long, this would suggest that the additional cluster has a considerable effect on the model fit and therefore is worth incorporating.