K-means and Hierarchical Clustering Flashcards

1
Q

What is the purpose of K-means and Hierarchical clustering?

A

Both are unsupervised learning techniques used to find groups or clusters of data points. Their main goal is to partition data into clusters such that similar data points are grouped together.

2
Q

Are they best for supervised or unsupervised learning?

A

They are best suited for unsupervised learning, finding clustering tendencies without knowing the true labels (y).

3
Q

What are they both based on?

A

They are both based on Euclidean distance.

4
Q

What are the three different clustering techniques?

A

Well-separated, Center-based and Contiguity-based

5
Q

What are the two more elaborate clustering techniques?

A

Density-based and Conceptual

6
Q

What is the well-separated clustering technique?

A

Each point is closer to all points in its cluster than to any point in another cluster. This is similar to maximum (complete) linkage in hierarchical clustering, assuming well-separated clusters.

7
Q

What is the center-based clustering technique?

A

Each point is closer to the center of its cluster than to the center of any other cluster. K-means and Ward clustering take a center-based approach to finding clusters.

8
Q

What is the contiguity-based clustering technique?

A

Each point is closer to at least one point in its cluster than to any point in another cluster. Hierarchical clustering with minimum-linkage takes a contiguity-based approach to finding clusters.

9
Q

What is the density-based clustering technique?

A

Clusters are regions of high density separated by regions of low density. The Gaussian mixture model takes a density-based approach to finding clusters.

10
Q

What is the conceptual clustering technique?

A

Points in a cluster share some general property that is derived from the entire set of points.

11
Q

What is the goal of K-means?

A

The goal of K-means is to partition the dataset into K clusters, where the distance between data points within a cluster is smaller than the distance between data points in different clusters. It aims to minimize the sum of squared distances (error) between points and their assigned cluster centroids.
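
A minimal sketch of this objective (toy data and variable names are my own, not from the cards): it computes the sum of squared distances by hand and checks it against scikit-learn's inertia_ attribute, which stores the same quantity.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# K-means objective: sum over clusters of squared distances from each
# point to its assigned centroid.
sse = sum(np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
          for k in range(2))
print(sse, km.inertia_)  # the two values agree up to floating-point rounding
```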

12
Q

How are the centroids μ_k updated?

A

In K-means, the centroids μ_k (the mean of each cluster) are updated by calculating the mean of all the points assigned to each cluster. The centroid is the point that minimizes the distance to all points within the cluster.
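
A minimal numpy sketch of this update step (function and variable names are my own):

```python
import numpy as np

def update_centroids(X, labels, K):
    """Recompute mu_k as the mean of the points assigned to cluster k.

    Assumes every cluster has at least one assigned point; in practice an
    empty cluster is usually re-seeded, since the mean of no points is
    undefined.
    """
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```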

13
Q

What is μ_k?

A

The center or mean of a cluster (or another reference point) in the same space as the data.

14
Q

What is z_ik?

A

A binary indicator for each point x_i: z_ik = 1 if x_i belongs to cluster k, and 0 otherwise.

15
Q

What is euclidean distance?

A

Euclidean distance is a measure of the straight-line distance between two points in Euclidean space:
||x_i - μ_k|| = sqrt( Σ_j (x_ij - μ_kj)² )
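
A worked example (the points are made up):

```python
import numpy as np

x_i = np.array([1.0, 2.0])
mu_k = np.array([4.0, 6.0])

# ||x_i - mu_k|| = sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5
print(np.linalg.norm(x_i - mu_k))  # 5.0
```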

16
Q

How is the distance to μ_k determined?

A

If the distance from x_i to the centroid of cluster k is the smallest, then z_ik = 1; otherwise z_ik = 0, meaning the point is not part of that cluster and is not included in the updated calculation of μ_k (the centroid).
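
A numpy sketch of this assignment step (names are my own), producing the full z_ik membership matrix:

```python
import numpy as np

def assign(X, centroids):
    """Set z_ik = 1 for the nearest centroid of each point, 0 otherwise."""
    # Squared distance from every point to every centroid, shape (n, K).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)            # index of the closest centroid
    Z = np.zeros_like(d2, dtype=int)
    Z[np.arange(len(X)), nearest] = 1      # one-hot membership z_ik
    return Z
```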

17
Q

How do we know we have set the optimal number of clusters K?

A

By the sum of squared errors (SSE), based on the Euclidean distance from each x_i to its centroid: compute the squared error within each cluster, then sum over all clusters to get the overall SSE. We want a low error, but without overfitting by using too many clusters.
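
A sketch of this procedure (toy data and choices are my own), printing the total SSE for a range of K; the "elbow" where the error stops dropping sharply suggests a good K:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three true blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 3, 6)])

# Total SSE (inertia) for each K; expect a sharp drop until K=3, then a flat tail.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
```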

18
Q

Issues with K-means?

A
  • Sensitivity to initialization: Different initial centroids can lead to different results.
  • Requires predefined K: The number of clusters K needs to be specified in advance, which can be difficult to determine.
  • Sensitive to outliers: Outliers can distort the centroids significantly.
  • Assumes spherical clusters: Works best with well-separated and roughly spherical clusters.
19
Q

How to ensure centroids are well spread out?

A

To ensure centroids are well spread out, use better initialization methods like the farthest-first initialization. These methods select initial centroids that are far apart from each other, reducing the likelihood of poor cluster assignments.
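
A sketch of farthest-first initialization (my own implementation of the idea described on the card):

```python
import numpy as np

def farthest_first(X, K, seed=0):
    """Pick a random first centroid, then repeatedly pick the point
    farthest from all centroids chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d2)])
    return np.array(centroids)
```

scikit-learn's default init="k-means++" is a randomized relative of this idea: instead of always taking the farthest point, it samples the next centroid with probability proportional to the squared distance.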

20
Q

What is hierarchical clustering?

A

Hierarchical clustering is an unsupervised method that builds a hierarchy of clusters by iteratively merging or splitting clusters. It doesn’t require specifying the number of clusters in advance and produces a dendrogram, a tree-like diagram that shows the relationships between clusters.
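
A minimal sketch using scipy (toy data; the linkage choice here is arbitrary) that builds the hierarchy and draws the dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Toy data: two small blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(2, 0.3, (10, 2))])

# Agglomerative clustering: every point starts as its own cluster and the
# two closest clusters are merged at each step, building the hierarchy.
Z = linkage(X, method="average")
dendrogram(Z)
plt.show()
```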

21
Q

What is minimum linkage?

A

Minimum (or single) linkage is a method in hierarchical clustering where the distance between two clusters is defined as the distance between the closest pair of observations in the two clusters.

22
Q

What is maximum linkage?

A

Maximum (or complete) linkage in hierarchical clustering defines the distance between two clusters as the distance between the most distant pair of observations in the two clusters.

23
Q

What is average linkage?

A

Average linkage calculates the distance between two clusters as the average distance between all pairs of points in the two clusters.

24
Q

What is the Ward method?

A

The Ward method is a hierarchical clustering method that minimizes the variance within clusters. It merges the two clusters whose combination results in the smallest increase in the total sum of squared distances.
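
The four criteria from the last few cards map directly onto scipy's linkage methods; a sketch (toy data, with my own choice of cutting the tree into two clusters):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (15, 2)), rng.normal(3, 0.3, (15, 2))])

# "single" = minimum linkage, "complete" = maximum linkage,
# "average" = average linkage, "ward" = Ward's minimum-variance criterion.
for method in ("single", "complete", "average", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut tree into 2 clusters
    print(method, labels)
```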

25
Q

When is Hierarchical clustering more suited than K-means?

A

  • The number of clusters is not known in advance.
  • The data has a nested structure (i.e., clusters within clusters).
  • You need a detailed, visual representation of data relationships (via a dendrogram).

26
Q

What is the purpose of comparing partitions?

A

The purpose of comparing partitions is to evaluate how well different clustering methods or different numbers of clusters match or agree with each other. This comparison helps determine which clustering solution best captures the true structure of the data.

27
Q

What is the purpose of the Rand index?

A

The Rand index measures the similarity between two different clusterings of the same dataset. It counts the pairs of points on which the two clusterings agree: pairs grouped together in both (S, similar) and pairs separated in both (D, dissimilar).

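A sketch of the definition from first principles (variable names are my own); scikit-learn also provides rand_score and adjusted_rand_score for the same purpose:

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of point pairs on which two clusterings agree: pairs
    grouped together in both (S) plus pairs separated in both (D),
    divided by the total number of pairs."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, relabelled
```
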
28
Q

What limitations are there to the Rand index?

A

With many clusters, few pairs of points end up in the same cluster, so the count of pairs separated in both clusterings (D) dominates the count of pairs grouped together (S), pushing the Rand index close to 1 regardless of how good the clustering is. Because the Rand index evaluates overall agreement in this way, it behaves much like the Similarity Matching Coefficient (SMC).

29
Q

What is the purpose of Jaccard similarity?

A

Jaccard similarity measures the similarity between two sets by comparing the intersection and union of the sets. For clustering, it compares the pairs of points that are in the same cluster versus those that are not.

30
Q

How is Jaccard similarity different from the Rand index?

A

Jaccard similarity focuses on the pairs of points that are in the same cluster and ignores the true negatives (pairs that are in different clusters in both partitions). The Rand index considers both pairs that are in the same cluster and pairs that are in different clusters. Thus, Jaccard similarity is more focused on the positive matches, while the Rand index gives a more overall measure of agreement.

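A sketch making the difference concrete (the pair-count names f11/f00/f10/f01 are my own): on the same two partitions, the f00 term inflates the Rand index while Jaccard ignores it.

```python
from itertools import combinations

def pair_counts(a, b):
    """f11: together in both; f00: apart in both;
    f10/f01: together in exactly one of the two clusterings."""
    f11 = f00 = f10 = f01 = 0
    for i, j in combinations(range(len(a)), 2):
        sa, sb = a[i] == a[j], b[i] == b[j]
        if sa and sb:
            f11 += 1
        elif not sa and not sb:
            f00 += 1
        elif sa:
            f10 += 1
        else:
            f01 += 1
    return f11, f00, f10, f01

f11, f00, f10, f01 = pair_counts([0, 0, 0, 1, 1, 2], [0, 0, 1, 1, 1, 2])
rand = (f11 + f00) / (f11 + f00 + f10 + f01)  # counts agreements of both kinds
jaccard = f11 / (f11 + f10 + f01)             # ignores the f00 (negative) pairs
print(round(rand, 3), round(jaccard, 3))      # 0.733 vs 0.333
```
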
31
Q

What is the purpose of the Similarity Matching Coefficient (SMC)?

A

The Similarity Matching Coefficient (SMC) is used to evaluate the similarity between two sets. It measures the proportion of matching pairs (either both in the same cluster or both in different clusters) relative to all possible pairs in the dataset.

32
Q

What are the limitations of Hierarchical clustering?

A

  • Computationally expensive: Its time complexity (typically O(n²) to O(n³) for agglomerative methods) makes it infeasible for large datasets.
  • Memory-intensive: Requires storing a distance matrix, which becomes impractical as the dataset grows.
  • Sensitive to outliers: Outliers can heavily influence the hierarchy and clustering results.
  • Irrevocable merging: Once clusters are merged, the decision cannot be undone, which may lead to suboptimal clusters.