Clustering Flashcards

1
Q

What is generalisation in machine learning?

A

Generalisation in machine learning refers to the ability of a trained model to make accurate predictions on unseen data, i.e. data that the model has not encountered during training

2
Q

Why is evaluating model performance on the training data problematic?

A

Evaluating model performance on the training data is problematic because it can lead to overfitting, where the model becomes too complex and adapts too well to the training data, resulting in poor performance on new, unseen data

3
Q

What is bias in machine learning?

A

Bias refers to the difference between the model’s predictions and the true values or measurements. A model with high bias tends to underfit the data, meaning it is not complex enough to capture patterns in the data

4
Q

What is variance in machine learning?

A

Variance refers to the variability or spread of the model’s predictions in contrast to the true values or measurements. A model with high variance tends to overfit the data, meaning it is too complex and captures noise in the training data

5
Q

How do bias and variance affect model performance?

A

Bias and variance affect model performance by creating a trade-off between underfitting and overfitting. Models with high bias tend to underfit the data and have poor performance on both the training and test data, while models with high variance tend to overfit the data and have excellent performance on the training data but poor performance on the test data. Therefore, it is essential to strike a balance between bias and variance to achieve optimal model performance

6
Q

What is model complexity in machine learning?

A

Model complexity refers to the level of sophistication or intricacy of the model in capturing the patterns or relationships in the data. A more complex model may have more parameters or features and represent more complex functions, while a simpler model has fewer parameters and features and represents simpler functions

7
Q

What are the typical percentages of training data and test data used for models?

A

80% training data and 20% test data

8
Q

What is unsupervised learning in machine learning?

A

Unsupervised learning is a type of machine learning where the data is unlabelled and untagged, and the goal is to find patterns or structure in the data without any guidance or supervision.

9
Q

What is dimensionality reduction in unsupervised learning?

A

Dimensionality reduction is a technique in unsupervised learning that reduces the number of features or variables in the data while preserving most of the relevant information. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are examples of dimensionality reduction techniques.
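
As a sketch of the idea, PCA can be done in a few lines of NumPy via the SVD (the data here is synthetic, invented purely for illustration):

```python
import numpy as np

# Synthetic data: 100 samples, 5 features, two of them near-copies of others
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)
X[:, 4] = X[:, 1] + 0.1 * rng.normal(size=100)

# PCA via SVD: centre the data, decompose, project onto the top-k components
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
X_reduced = Xc @ Vt[:k].T  # 100 samples, now only 2 features
print(X_reduced.shape)     # (100, 2)
```

Most of the variance survives in the two retained components because the dropped features were largely redundant.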

10
Q

What are autoencoders in unsupervised learning?

A

Autoencoders are neural networks used in unsupervised learning to learn compressed representations of the data by encoding the data into a lower-dimensional space and decoding it back to its original form. Autoencoders can be used for dimensionality reduction, data compression, and image denoising.

11
Q

What is clustering in unsupervised learning?

A

Clustering is a technique in unsupervised learning used to group a set of objects into clusters based on their similarities or dissimilarities. Exclusive or non-overlapping clustering techniques assign each object to only one cluster, while overlapping clustering techniques allow objects to belong to multiple clusters. Hierarchical and probabilistic clustering are examples of clustering techniques.

12
Q

What are association rules in unsupervised learning?

A

Association rules are used in unsupervised learning to discover interesting relationships or patterns in the data. They are used in market basket analysis to find correlations between items purchased together and recommend items to customers based on their purchase history.

13
Q

What is the notion of distance in machine learning?

A

The notion of distance in machine learning is used to measure the dissimilarity or similarity between objects based on their features or characteristics. The distance can be computed with various metrics, such as Euclidean distance, Manhattan distance, or cosine similarity.

14
Q

How do we use distance to differentiate objects?

A

We use distance to differentiate objects by measuring the differences in their features or characteristics. For instance, an orange and a lime are both round, but the lime is smaller, so we can measure the difference in their radii, in units such as centimeters.

15
Q

How can we represent a binary variable using distance?

A

We can represent a binary variable using distance by assigning a value of 0 if the attribute is absent and 1 if it is present; the distance between two objects is then 0 if they agree on the attribute and 1 if they differ. For instance, a pepper is not hollow, so we assign it a 0, while a bell pepper is hollow, so we assign it a 1.

16
Q

How can we represent a continuous variable using distance?

A

We can represent a continuous variable using distance by measuring the differences between the values of the variable for different objects. For instance, we can measure the volume of empty space inside a pepper and assign a distance based on the differences between the volumes for different peppers

17
Q

What are some examples of distance metrics used in machine learning?

A

Examples of distance metrics used in machine learning include Euclidean distance, Manhattan distance, cosine similarity, and Jaccard distance. These metrics measure the distance or dissimilarity between objects based on their features or characteristics.
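
These metrics are straightforward to compute by hand in NumPy (the vectors and sets below are invented for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences: 7.0
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity

# Jaccard distance compares sets, e.g. items in two shopping baskets
A, B = {"milk", "bread"}, {"milk", "eggs"}
jaccard = 1 - len(A & B) / len(A | B)      # 1 - 1/3
```

Note that cosine is a similarity (higher means more alike), while the others are distances (higher means more different).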

18
Q

What are vector operations in machine learning?

A

Vector operations in machine learning involve performing mathematical operations on vectors, which are arrays of numbers or values.

19
Q

What is the basis for conducting vector operations?

A

Vector operations are conducted on a component-by-component basis. This means that each component or element of the vectors is treated independently and the operations are performed on them separately.

20
Q

What is the purpose of performing vector operations?

A

The purpose of performing vector operations is to manipulate the data contained in the vectors to perform various mathematical or statistical analyses, such as calculating the mean, variance, or correlation between vectors.

21
Q

What are some examples of vector operations?

A

Examples of vector operations include vector addition, subtraction, multiplication, and division. Other operations include dot product, cross product, and projection.
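
In NumPy these operations look as follows; note that arithmetic operators act component by component, while `@` gives the dot product (example vectors invented for illustration):

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u + v)  # [5. 7. 9.]    component-wise addition
print(u - v)  # [-3. -3. -3.] component-wise subtraction
print(u * v)  # [ 4. 10. 18.] component-wise multiplication
print(u / v)  #               component-wise division
print(u @ v)  # 32.0          dot product (a single scalar)
print(np.cross(u, v))         # [-3.  6. -3.] cross product (3-D vectors only)
print((u @ v) / (v @ v) * v)  # projection of u onto v
```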

22
Q

How are vector operations used in machine learning?

A

Vector operations are used in machine learning for various purposes, such as data preprocessing, feature engineering, and model training. They are used to manipulate and transform the data to make it suitable for analysis and modeling. For instance, in deep learning, vector operations are used extensively to manipulate the weights and biases of the neural network.

23
Q

What is K-means clustering?

A

K-means clustering is a popular unsupervised learning algorithm used to group data points into clusters based on their similarities. It is a centroid-based algorithm that iteratively assigns each data point to a cluster based on its distance from the centroid of that cluster.
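
As a quick illustration, scikit-learn's KMeans recovers two synthetic, well-separated blobs (the data is made up for demonstration; assumes scikit-learn and NumPy are installed):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic blobs of 50 points each, centred near (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centre near (0, 0), the other near (5, 5)
print(km.labels_[:5])       # cluster assignment of the first five points
```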

24
Q

What is the central concept of clustering?

A

The central concept of clustering is that objects in a cluster must be similar to each other based on some predefined similarity metric or distance measure.

25
Q

How does K-means clustering work?

A

K-means clustering works by randomly assigning each data point to a cluster and then iteratively updating the centroids of the clusters until convergence is reached. The algorithm minimizes the sum of the distances of each data point to the centroid of its assigned cluster. This process is repeated until the centroids no longer change or a predetermined number of iterations is reached.

26
Q

What is the objective of K-means clustering?

A

The objective of K-means clustering is to partition the data into K clusters such that the sum of the distances of each data point to the centroid of its assigned cluster is minimized. The value of K is chosen by the user and represents the number of clusters desired.

27
Q

What are some advantages of K-means clustering?

A

Some advantages of K-means clustering include its simplicity and efficiency, making it easy to implement and scalable to large datasets. It is also widely used in various fields such as marketing, biology, and computer science for data analysis and pattern recognition.

28
Q

What is the purpose of scaling/normalizing the data in this algorithm?

A

Scaling/normalizing the data is important in this algorithm to ensure that all features are on the same scale and that no single feature dominates the clustering process. This helps to prevent bias towards certain features and allows for a more accurate and unbiased clustering.

29
Q

How is the value of K chosen in this algorithm?

A

The value of K is chosen based on the number of clusters desired or the number of distinct groups in the data. In this case, the value of K was chosen as 2 based on visual inspection of the data.

30
Q

How are centroids selected in this algorithm?

A

Centroids are selected at random from the dataset if this is the first iteration of the algorithm. In subsequent iterations, the centroids are updated based on the mean of the data points assigned to each cluster.

31
Q

What is the purpose of associating each point to the nearest centroid?

A

The purpose of associating each point to the nearest centroid is to assign each data point to its closest cluster and to form initial clusters based on the location of the centroids.

32
Q

How are centroids updated in this algorithm?

A

Centroids are updated by taking the mean (vector average) of the data points assigned to each cluster. This generates a new proposition for each centroid, which is used in the next iteration of the algorithm.

33
Q

What are the steps for this algorithm? (Check Notes)

A

Choose the number of clusters, K, to create.
Initialize K cluster centroids (center points) either randomly or based on some heuristics.
Assign each data point to the nearest centroid. The most common distance metric used is Euclidean distance.
Update each centroid by calculating the mean of all data points assigned to it.
Repeat steps 3-4 until convergence (i.e. centroids no longer move or a maximum number of iterations is reached).
Optionally, evaluate the quality of the clustering using a clustering metric such as silhouette score or inertia.
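
The steps above can be sketched directly in NumPy. This is a minimal illustration on invented data, with no empty-cluster handling, not a production implementation:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: initialise centroids as k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Demo on two synthetic, well-separated blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
```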

34
Q

What is the Confusion matrix based Fowlkes-Mallows score (FMI)?

A

The Confusion matrix based Fowlkes-Mallows score (FMI) is a performance evaluation metric defined as the geometric mean of the pairwise precision and recall.
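
This score is available as `sklearn.metrics.fowlkes_mallows_score` (assumes scikit-learn is installed). One useful property: it is invariant to how the cluster ids are numbered, since only pairwise co-membership matters:

```python
from sklearn.metrics import fowlkes_mallows_score

# Same grouping, opposite numbering: still a perfect score of 1.0
score = fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0])
print(score)  # 1.0
```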

35
Q

In what type of problem would we not know the class labels?

A

In a true clustering problem, we would not know the class labels.

36
Q

What technique should we use if we have the labels in a clustering problem?

A

If we have the labels in a clustering problem, we should deploy a classification technique.

37
Q

What is the Silhouette Coefficient used for?

A

The Silhouette Coefficient is used for evaluating the quality of clusters when the true class labels are unknown.

38
Q

What does the Silhouette Coefficient measure?

A

The Silhouette Coefficient measures how well each data point fits into its assigned cluster based on both the mean intra-cluster distance (a) and the mean distance between a sample and all other points in the next nearest cluster (b).

39
Q

What are the advantages of using the Silhouette Coefficient?

A

The Silhouette Coefficient is advantageous because it can detect if the clustering is incorrect, if the clusters are overlapping, or if the clusters are highly dense and well-separated. The score is also conceptually sound, as a higher score indicates a better clustering.

40
Q

What is the Silhouette Coefficient formula?

A

The formula for Silhouette Coefficient is:

s(i) = (b(i) - a(i)) / max{a(i), b(i)}

where:

s(i) is the Silhouette Coefficient for the i-th data point
a(i) is the mean distance between the i-th data point and all other points in the same cluster
b(i) is the mean distance between the i-th data point and all other points in the next nearest cluster.
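
scikit-learn implements this formula directly: `silhouette_samples` returns s(i) for every point and `silhouette_score` returns their mean (synthetic data below, invented for illustration):

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two synthetic, well-separated blobs, so the silhouette is close to 1
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])
labels = np.array([0] * 30 + [1] * 30)

s = silhouette_samples(X, labels)   # one s(i) per data point
print(silhouette_score(X, labels))  # the mean of all s(i)
```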

41
Q

What is the inertia metric for determining the best K in clustering?

A

Inertia measures the sum of squared distances of all samples to their closest cluster center, and can be used to evaluate the quality of clustering. A lower inertia value generally indicates better clustering, but it may not be the best metric for all cases.

42
Q

How can the silhouette score be used to determine the best K in clustering?

A

The silhouette score measures how similar an object is to its own cluster compared to other clusters. It ranges from -1 to 1, with higher scores indicating better clustering. To determine the best K, one can plot the silhouette scores for different K values and select the K with the highest average score.

43
Q

What are some other metrics that can be used to determine the best K in clustering?

A

Other metrics that can be used include the Calinski-Harabasz index, Davies-Bouldin index, and Gap statistic. These metrics can help evaluate the quality of clustering based on different criteria such as compactness, separation, and cluster size. It’s important to select the appropriate metric based on the data and clustering goals.

44
Q

How can we determine the best value of K using the elbow method?

A

To determine the best value of K using the elbow method, we compute the inertia for different values of K, ranging from 2 up to some maximum (at most as many clusters as we have data points). Then, we plot the inertia against K and look for the point where the inertia starts to decrease at a slower rate, forming an elbow shape. The value of K at the elbow point is considered to be the best value for K.
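
The elbow computation (minus the plotting) can be sketched with scikit-learn on synthetic data; with three invented blobs, the sharp drop in inertia stops at K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs, so the elbow should appear at K = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 4, 8)])

inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 7)}
for k, v in inertias.items():
    print(k, round(v, 1))  # inertia drops sharply up to K=3, then flattens
```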

45
Q

What is the main idea behind agglomerative/hierarchical clustering?

A

The main idea is to group the nearest points in their clusters, and then recursively merge these clusters until all points are in a single cluster.

46
Q

What is a dendrogram in the context of agglomerative/hierarchical clustering?

A

A dendrogram is a tree-like diagram that shows the order in which clusters are merged. It displays the distance between each pair of clusters and can be used to determine the optimal number of clusters.

47
Q

What are the two types of hierarchical clustering?

A

The two types are divisive and agglomerative. Divisive (top-down) clustering starts with all points in one cluster and recursively divides them into smaller clusters. Agglomerative (bottom-up) clustering starts with each point in its own cluster and then repeatedly merges the nearest clusters.

48
Q

What is the linkage criterion in agglomerative/hierarchical clustering?

A

The linkage criterion is a rule for determining the distance between two clusters. There are several different linkage criteria, including single linkage, complete linkage, and average linkage.
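
In SciPy, the linkage criterion is the `method` argument: `scipy.cluster.hierarchy.linkage` builds the full merge history (the dendrogram data), and `fcluster` cuts the tree into a chosen number of clusters (synthetic data below, invented for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two synthetic blobs; method can be 'single', 'complete', or 'average'
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])

Z = linkage(X, method="average")                 # merge history for the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```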