Clustering Flashcards

1
Q

What is clustering?

A

Clustering is grouping “similar” objects together. It is used:
* To establish prototypes, or detect outliers.
* To simplify data for further analysis/learning.
* To visualize data (in conjunction with dimensionality reduction).
The notion of similarity must be given to the clustering algorithm.
Clusterings are not “right” or “wrong”; different clusterings/clustering criteria can reveal different things about the data.

2
Q

How do clustering algorithms work?

A

Clustering algorithms:
* Employ some notion of similarity between objects.
* Have an explicit or implicit criterion defining what a good cluster is.
* Heuristically optimize that criterion to determine the clustering.
Some clustering criteria/algorithms have natural probabilistic interpretations (e.g., K-means can be viewed as a limiting case of fitting a Gaussian mixture model).

3
Q

What is K-means clustering?

A

K-means is one of the most commonly used clustering algorithms, because it is easy to implement and quick to run.
It assumes the objects (instances) to be clustered are 𝑝-dimensional vectors 𝐱𝑖.
It uses a distance measure between the instances (typically Euclidean distance).
The goal is to partition the data into 𝐾 disjoint subsets (clusters).

4
Q

Explain the loss function of K-means clustering

A

The loss function in K-means clustering is known as the sum of squared errors (SSE) or within-cluster sum of squares (WCSS). It measures the total squared distance between data points and their respective cluster centroids, which the algorithm aims to minimize. Here’s a detailed explanation:

Loss Function in K-means Clustering:
Cluster Centroids:

K-means clustering partitions the data into 𝐾 clusters, each represented by a centroid. The centroid is the mean of all data points within the cluster.
Distance Calculation:

For each data point, the algorithm calculates the distance to its assigned cluster centroid. Typically, the Euclidean distance is used.
Sum of Squared Errors (SSE):

The loss function is the sum of the squared distances between each data point and its cluster centroid. With cluster assignment 𝐶 and centroids 𝛍𝑘, it can be written as SSE = Σ_{k=1}^{K} Σ_{i : C(i)=k} ‖𝐱𝑖 − 𝛍𝑘‖², where 𝛍𝑘 is the centroid (mean) of cluster k.
Minimization:

The goal of K-means clustering is to minimize the SSE. The algorithm iteratively updates the cluster centroids and reassigns data points to the nearest centroid until the SSE converges to a (local) minimum.
Key Characteristics:
Objective: Minimize the within-cluster sum of squared distances.
Iterative Process: K-means uses an iterative process to refine cluster assignments and centroids.
Convergence: The algorithm stops when the change in SSE between iterations falls below a threshold or after a fixed number of iterations.

Minimizing the K-means loss exactly is an NP-hard problem; in practice the algorithm only finds a local optimum.
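
A minimal NumPy sketch of this loss, assuming the instances are the rows of an array X, labels holds each point’s cluster index, and centroids holds the cluster means (the names are illustrative, not from the card):

    import numpy as np

    def kmeans_sse(X, labels, centroids):
        # Sum of squared Euclidean distances from each point to its assigned centroid.
        diffs = X - centroids[labels]
        return float((diffs ** 2).sum())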

5
Q

Break down the K-means algorithm

A

* Inputs:
◦ A set of 𝑝-dimensional real vectors {𝐱1, 𝐱2, … , 𝐱𝑛}.
◦ 𝐾, the desired number of clusters.
* Output: A mapping of the vectors into 𝐾 clusters (disjoint subsets),
𝐶: {1, … , 𝑛} ↦ {1, … , 𝐾}.
* Initialize 𝐶 randomly.
* Repeat:
◦ Compute the centroid of each cluster (the mean of all the instances in the cluster).
◦ Reassign each instance to the cluster with the closest centroid.
* until 𝐶 stops changing.
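
A short NumPy sketch of this procedure (the function name, the empty-cluster handling, and the max_iters cap are illustrative choices, not part of the card):

    import numpy as np

    def kmeans(X, K, max_iters=100, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        C = rng.integers(0, K, size=n)                      # initialize C randomly
        centroids = np.empty((K, X.shape[1]))
        for _ in range(max_iters):
            # Compute the centroid (mean) of each cluster; reseed empty clusters with a random point.
            for k in range(K):
                centroids[k] = X[C == k].mean(axis=0) if np.any(C == k) else X[rng.integers(n)]
            # Reassign each instance to the cluster with the closest centroid (Euclidean distance).
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_C = dists.argmin(axis=1)
            if np.array_equal(new_C, C):                    # until C stops changing
                break
            C = new_C
        return C, centroids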

6
Q

What is the elbow method?

A

The elbow method is a popular technique used to determine the optimal number of clusters in K-means clustering. Here’s how it works:

Elbow Method Explained:
Calculate SSE for Different Numbers of Clusters:

Run K-means clustering for a range of cluster numbers 𝐾 (e.g., from 1 to 10).
For each 𝐾, calculate the sum of squared errors (SSE), which measures the total within-cluster variation.
Plot SSE vs. Number of Clusters:

Create a plot with the number of clusters 𝐾 on the x-axis and the SSE on the y-axis.
The plot typically shows a decreasing SSE as the number of clusters increases, because more clusters mean less variation within each cluster.
Identify the “Elbow” Point:

Look for the point on the plot where the SSE starts to level off, forming an “elbow” shape.
The “elbow” point suggests the optimal number of clusters. Adding more clusters beyond this point doesn’t significantly reduce the SSE, indicating diminishing returns.
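
A brief sketch of the elbow method using scikit-learn and matplotlib (assumed dependencies, not mentioned on the card); KMeans.inertia_ is the fitted model’s SSE:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)                  # placeholder data; use your own instances

    ks = range(1, 11)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), sse, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("SSE (within-cluster sum of squares)")
    plt.show()                                  # look for the elbow where the curve levels off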

7
Q

What is the silhouette method?

A

The silhouette method is another technique used to evaluate the quality of clustering and determine the optimal number of clusters. Here’s how it works:

Silhouette Method Explained:
Silhouette Score Calculation:

For each data point i, calculate the silhouette score s(i) = (b(i) − a(i)) / max(a(i), b(i)), where a(i) is the mean distance from i to the other points in its own cluster and b(i) is the mean distance from i to the points in the nearest neighboring cluster. This score measures how similar a point is to its own cluster compared to other clusters.

Interpretation of Silhouette Scores:

The silhouette score ranges from -1 to 1:
Close to 1: Indicates that the data point is well-clustered and far from neighboring clusters.
Close to 0: Indicates that the data point is on or very close to the boundary between two neighboring clusters.
Negative values: Indicate that the data point might have been assigned to the wrong cluster.
Average Silhouette Score:

Calculate the average silhouette score over all data points in the dataset. This gives an overall measure of how well the data has been clustered.
Optimal Number of Clusters:

Run the clustering algorithm for different numbers of clusters 𝐾 (e.g., from 2 to 10).
Calculate the average silhouette score for each 𝐾.
Plot the average silhouette score against the number of clusters 𝐾.
The optimal number of clusters is the one that maximizes the average silhouette score.
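
A sketch of choosing K by average silhouette score, again assuming scikit-learn (silhouette_score computes the mean score over all points):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    X = np.random.rand(200, 2)                  # placeholder data

    scores = {}
    for k in range(2, 11):                      # the silhouette needs at least 2 clusters
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)

    best_k = max(scores, key=scores.get)
    print("Best K by average silhouette:", best_k)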

8
Q

What is agglomerative clustering?

A

Agglomerative clustering is a type of hierarchical clustering method that builds a hierarchy of clusters by progressively merging smaller clusters into larger ones. Here’s a detailed explanation of how it works:

How Agglomerative Clustering Works:
Initialization:

Start with each data point as its own individual cluster. If you have n data points, you begin with n clusters.
Merging Clusters:

At each step, merge the two clusters that are closest to each other based on a chosen distance metric. Common distance metrics include Euclidean distance, Manhattan distance, and others.
Distance Calculation:

The distance between clusters can be calculated in several ways:
Single Linkage: Distance between the closest points of the two clusters.
Complete Linkage: Distance between the farthest points of the two clusters.
Average Linkage: Average distance between all points in the two clusters.
Centroid Linkage: Distance between the centroids of the two clusters.
Repeat:

Continue merging clusters until all data points are grouped into a single cluster or until a stopping criterion is met (e.g., a predefined number of clusters).
Dendrogram:
The result of agglomerative clustering is often visualized using a dendrogram, which is a tree-like diagram that shows the order in which clusters were merged. The height of the branches represents the distance at which clusters were merged.

Key Characteristics:
Hierarchical: Builds a hierarchy of clusters, providing a detailed view of the clustering process.
Versatile: Can use different linkage criteria to define cluster distances.
No Need for Predefined Number of Clusters: Unlike K-means, agglomerative clustering doesn’t require specifying the number of clusters in advance.
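
A small sketch using SciPy’s hierarchical-clustering routines (an assumed dependency, not named on the card) to build the merge tree, cut it at a chosen number of clusters, and draw the dendrogram:

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

    X = np.random.rand(30, 2)                         # placeholder data

    # Merge clusters bottom-up; 'method' selects the linkage criterion
    # ('single', 'complete', 'average', 'centroid', ...).
    Z = linkage(X, method="average", metric="euclidean")

    labels = fcluster(Z, t=3, criterion="maxclust")   # stop once 3 clusters remain
    dendrogram(Z)                                     # branch height = merge distance
    plt.show()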

9
Q

Explain spectral clustering

A

Spectral clustering is a powerful technique for clustering data based on the eigenvalues (spectrum) of a similarity matrix. Here’s a detailed explanation of how it works:

How Spectral Clustering Works:
Constructing the Similarity Matrix:

Start by constructing a similarity matrix S that quantifies the similarity between pairs of data points. This matrix is symmetric, with entry Sij representing the similarity between data points i and j.
Creating the Laplacian Matrix:

From the similarity matrix, construct the Laplacian matrix L. There are different ways to define the Laplacian matrix, but a common form is
L = D − S,
where D is the degree matrix (a diagonal matrix in which each entry Dii is the sum of similarities for data point i).

Eigenvalue Decomposition:

Perform eigenvalue decomposition on the Laplacian matrix to obtain its eigenvalues and eigenvectors. The eigenvectors corresponding to the smallest non-zero eigenvalues are used for clustering.
Embedding in Low-Dimensional Space:

Use the selected eigenvectors to embed the data points into a lower-dimensional space. Each data point is represented by a vector in this new space.
Clustering:

Apply a standard clustering algorithm, such as K-means, to the embedded data points in the lower-dimensional space. The clusters formed in this space correspond to clusters in the original high-dimensional space.

Key Characteristics:

Flexibility: Spectral clustering can handle complex, non-convex cluster shapes that traditional methods like K-means might struggle with.

Graph-Based: Treats clustering as a graph partitioning problem, making it suitable for data that can be represented as a graph.
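
A compact NumPy sketch of these steps (RBF similarity, unnormalized Laplacian, eigenvector embedding, then K-means via scikit-learn); the gamma kernel width and other choices are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(X, K, gamma=1.0):
        # 1. Similarity matrix S from a Gaussian (RBF) kernel on pairwise squared distances.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        S = np.exp(-gamma * sq_dists)
        # 2. Unnormalized Laplacian L = D - S, where D is the diagonal degree matrix.
        D = np.diag(S.sum(axis=1))
        L = D - S
        # 3. Eigendecomposition; np.linalg.eigh returns eigenvalues in ascending order.
        eigvals, eigvecs = np.linalg.eigh(L)
        # 4. Embed each point as a row of the first K eigenvectors.
        embedding = eigvecs[:, :K]
        # 5. Cluster the embedded points with ordinary K-means.
        return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(embedding)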
