lecture 4 - clustering Flashcards

1
Q

two types of setup

A
  1. per instance
  2. per person
2
Q

setup per instance

A
  • per time point, over all quantified selves (QS)
  • each instance (time point) is treated as a separate observation
  • for each instance, there is a feature vector
  • these feature vectors are combined into one large matrix X, where each row corresponds to an instance (a measurement at a particular time point) (X_N, qs_n)
3
Q

setup per person

A
  • for finding types of people
  • data is organized by individual, where each person’s data is grouped together
  • each person has multiple sets of features, representing different measurements or time points
  • the matrix X is now composed of submatrices, each corresponding to a different person (X_qs_n)
4
Q

individual distance metrics (instance based)

A
  1. euclidean distance
  2. manhattan distance
  3. minkowski distance
  4. gower’s similarity
5
Q

euclidean distance

A

the length of the shortest straight-line path between two points

6
Q

manhattan distance

A

block-based (city-block) structure: the sum of the absolute differences per attribute

7
Q

minkowski distance

A
  • generalized form of the euclidean and manhattan distances
  • q = 1: manhattan distance
  • q = 2: euclidean distance
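a minimal sketch of the general form (the q-th root of the summed absolute differences raised to the power q); NumPy and the function name are assumptions for illustration:

```python
import numpy as np

def minkowski_distance(x_i, x_j, q=2):
    """Minkowski distance of order q between two numeric feature vectors.
    q=1 gives the manhattan distance, q=2 the euclidean distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

# example: the same pair of points under both special cases
print(minkowski_distance([0, 0], [3, 4], q=1))  # 7.0 (manhattan)
print(minkowski_distance([0, 0], [3, 4], q=2))  # 5.0 (euclidean)
```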
8
Q

important to consider for euclidean, manhattan, and minkowski distance

A

scaling the data: these metrics assume numeric values, and attributes with larger ranges would otherwise dominate the distance

9
Q

gower’s similarity

A
  • does not assume numeric values, so it can be used to compute similarities between different types of features:
  1. dichotomous attributes
  2. categorical attributes
  3. numerical attributes
10
Q

value of s(x^k_i, x^k_j) for dichotomous attributes

A
  • 1 when x^k_i and x^k_j are both present
  • 0 otherwise
  • i.e., similar when both instances indicate presence
11
Q

value of s(x^k_i, x^k_j) for categorical attributes

A
  • 1 when x^k_i = x^k_j
  • 0 otherwise
  • i.e., similar when instances are of the same category
12
Q

value of s(x^k_i, x^k_j) for numerical attributes

A
  • 1 - ((absolute difference between x^k_i and x^k_j) / (range of the attribute))
  • 1 - normalized absolute difference
  • automatically scaled!
13
Q

gower’s similarity: final similarity

A

gower’s similarity of two instances

[sum over all attributes k of s(x^k_i, x^k_j)] / [number of attributes k for which x^k_i and x^k_j can be compared]
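a minimal sketch following the per-attribute definitions above; the attribute-type labels and the ranges argument are illustrative assumptions:

```python
def gower_similarity(x_i, x_j, types, ranges):
    """Gower's similarity between two instances.
    types[k] is 'dichotomous', 'categorical' or 'numeric';
    ranges[k] is the range of attribute k (used for numeric attributes only)."""
    total, comparable = 0.0, 0
    for k, attr_type in enumerate(types):
        if x_i[k] is None or x_j[k] is None:   # attribute cannot be compared
            continue
        if attr_type == 'dichotomous':
            s = 1.0 if (x_i[k] == 1 and x_j[k] == 1) else 0.0
        elif attr_type == 'categorical':
            s = 1.0 if x_i[k] == x_j[k] else 0.0
        else:   # numeric: 1 - normalized absolute difference
            s = 1.0 - abs(x_i[k] - x_j[k]) / ranges[k]
        total += s
        comparable += 1
    return total / comparable

# one attribute of each type; the numeric attribute has range 10
print(gower_similarity([1, 'walk', 4.0], [1, 'run', 6.0],
                       ['dichotomous', 'categorical', 'numeric'],
                       [None, None, 10.0]))   # (1 + 0 + 0.8) / 3 ≈ 0.6
```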

14
Q

person level distance metrics (person-dataset based)

A
  • how do we compare the similarity between two persons’ datasets (qs1, qs2)?
  1. without explicit ordering
  2. with temporal ordering
15
Q

person-dataset similarity: without explicit ordering

A
  1. summarize the values per attribute over the entire dataset into a single value and compare these values with the same distance metrics as before
    –> you lose a lot of information this way
  2. estimate the parameters of a distribution per attribute and compare the parameter values with the same distance metrics as before
  3. compare the distributions of values for an attribute with a statistical test (e.g., kolmogorov-smirnov) and take 1 - p as the distance metric
    –> a low p means very different distributions, so the distance will be close to 1
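a sketch of option 3, using SciPy's two-sample kolmogorov-smirnov test on synthetic, purely illustrative data:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance(values_qs1, values_qs2):
    """Distance between two persons for one attribute: 1 - p of the KS test.
    A low p (very different distributions) gives a distance close to 1."""
    _, p_value = ks_2samp(values_qs1, values_qs2)
    return 1.0 - p_value

rng = np.random.default_rng(0)
print(ks_distance(rng.normal(0, 1, 200), rng.normal(0, 1, 200)))  # close to 0
print(ks_distance(rng.normal(0, 1, 200), rng.normal(3, 1, 200)))  # close to 1
```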
16
Q

person-dataset similarity: datasets with temporal ordering

A
  1. raw-data based
  2. feature based: same as the non-temporal case; extract features from the temporal dataset and compare those values
  3. model based: fit a time series model and compare its parameters, again in line with the non-temporal case, except that the type of model differs
17
Q

raw-data based similarity

A
  1. simplest case: assume an equal number of points, compute the euclidean distance between qs1 and qs2 per time point of an attribute, and sum over the attributes
    –> i.e., calculate the euclidean distance at each time point
  2. if the time series are more or less the same but shifted in time, we use the concept of lag and the cross-correlation coefficient to compute the cc_distance
  3. for the different frequencies at which different people perform their activities, we can use dynamic time warping
18
Q

shifted time series: lag

A
  • lag τ is the amount of time by which one time series is shifted relative to another
  • the goal is to find the lag τ that maximizes the similarity between the two time series: an optimization problem
19
Q

shifted time series: cross-correlation coefficient (ccc)

A
  • measures the similarity between two time series attributes after shifting one of them by τ
  • for each time point t of an attribute, multiply the value of qs1 at t with the value of qs2 at t + τ, then sum these products
20
Q

shifted time series: cross-correlation distance (cc_distance)

A
  • gives us the distance between two time series when one is shifted by a certain time lag
  • we test all possible shifts τ, from 1 up to the length of the shorter of the two datasets
  • for each lag, we sum the inverse of the cross-correlation coefficient (1/ccc) between the two qs over the attributes
  • the best time lag corresponds to the smallest cc_distance
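a minimal sketch of ccc and cc_distance for a single attribute, assuming equal-length series; names and data are illustrative:

```python
import numpy as np

def ccc(s1, s2, lag):
    """Cross-correlation coefficient for a shift of `lag`: multiply each value of
    s1 with the value of s2 shifted by lag time points, and sum the products."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    n = min(len(s1), len(s2))
    return float(np.sum(s1[:n - lag] * s2[lag:n]))

def cc_distance(s1, s2):
    """Try all shifts tau from 1 up to the shorter series length; the best lag
    gives the smallest distance 1/ccc. (With several attributes, sum 1/ccc
    over the attributes per lag.)"""
    n = min(len(s1), len(s2))
    best = float('inf')
    for lag in range(1, n):
        c = ccc(s1, s2, lag)
        if c > 0:                      # guard against non-positive coefficients
            best = min(best, 1.0 / c)
    return best

# s2 is s1 shifted by two time points, so the best lag is 2
s1 = [0, 1, 4, 2, 1, 0, 0, 0]
s2 = [0, 0, 0, 1, 4, 2, 1, 0]
print(cc_distance(s1, s2))
```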
21
Q

dynamic time warping

A
  • for different frequencies at which different persons perform their activities
  • finds the best pairing of instances between the two sequences, i.e., the alignment with the minimum total distance
22
Q

dynamic time warping: pairing conditions

A
  1. monotonicity condition: the time order must be preserved
    –> i.e., we cannot go back to a previous instance; in the matrix we can only move right, up, or diagonally (up-right), never backwards
  2. boundary condition: the first and last points must be matched; we start at the bottom-left cell and end at the top-right cell, and cannot move outside the time series
23
Q

dynamic time warping: cheapest path in the matrix

A
  1. start at (0,0)
  2. per pair (cell), compute the cheapest way to get there, given the constraints and the distance between the two points
  • cost = [distance between the two points] + [cheapest previous path (from the left, from below, or diagonally)]
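a minimal dynamic-programming sketch of this computation, using the absolute difference as the point-wise distance (names are illustrative):

```python
import numpy as np

def dtw_distance(s1, s2):
    """Fill a cost matrix in which cell (i, j) holds the cheapest cost of aligning
    s1[:i+1] with s2[:j+1]: the point-wise distance plus the cheapest allowed
    predecessor (from the left, from below, or diagonally)."""
    n, m = len(s1), len(s2)
    cost = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            d = abs(s1[i] - s2[j])
            if i == 0 and j == 0:              # boundary condition: start at (0, 0)
                cost[i, j] = d
            else:
                prev = min(cost[i - 1, j] if i > 0 else np.inf,                # from below
                           cost[i, j - 1] if j > 0 else np.inf,               # from the left
                           cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf) # diagonal
                cost[i, j] = d + prev
    return cost[n - 1, m - 1]   # final cell (the matrix's top-right corner) = DTW distance

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # 0.0: same shape, different pace
```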
24
Q

dynamic time warping: DTW distance

A
  • the value in the top-right cell of the matrix is the DTW distance
  • this represents the minimum cost of aligning the two series
  • finding this distance is computationally expensive; this is mitigated with the keogh bound
25
Q

DTW: keogh bound

A

provides a cheap estimate (a lower bound) of what the cheapest path will cost

this makes DTW less computationally expensive

26
Q

clustering approaches

A
  1. k-means
  2. k-medoids
  3. hierarchical (divisive & agglomerative)
  4. subspace clustering
27
Q

k-means clustering

A
  • the goal of k-means clustering is to partition a set of data points into k clusters, where each data point belongs to the cluster with the nearest mean
  1. initialization: select k random points in the data space; these serve as the initial cluster centers
  2. assign points to clusters: each data point is assigned to the cluster whose center is nearest, typically using the euclidean distance
  3. update centers: recalculate the centers (centroids); the new center of each cluster is the mean of all data points assigned to it
  4. repeat: reassign points based on the updated centers and recompute the centers, until convergence (the assignments no longer change)
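a sketch using scikit-learn's KMeans on synthetic, purely illustrative data:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: two groups of instances in a 2-D feature space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

# initialize k centers, assign points to the nearest center, update the centers,
# and repeat until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the final cluster means
print(kmeans.labels_[:5])        # cluster assignments of the first instances
```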
28
Q

k-means: selecting the best value for k

A

using the silhouette score

29
Q

k-means: silhouette score

A
  • measure used to determine how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
  • The silhouette score ranges from -1 to 1, where a value closer to 1 indicates better clustering.
  • you want to be close to the points in your own cluster and far away from points in other clusters: b > a gives a score closer to 1
  • based on the silhouette score for each k, we can decide which k is best
30
Q

silhouette score a(x_i) and b(x_i)

A
  • a: average distance from x_i to all other points in the same cluster
  • b: average distance from x_i to all points in the nearest neighboring cluster
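a sketch of choosing k via the average silhouette score, using scikit-learn on illustrative synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 3, 6)])  # three clear groups

# the k whose average silhouette score is closest to 1 is preferred
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```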
31
Q

k-medoids clustering

A
  • whereas k-means uses the mean of the assigned points as a cluster center (which need not be an actual data point), k-medoids uses an actual data point (the medoid) as the center
  • This makes it more robust to noise and outliers.
32
Q

types of hierarchical clustering

A
  1. divisive clustering
  2. agglomerative clustering

these take a more incremental approach than k-means and k-medoids: a hierarchy of clusterings is built step by step, so the number of clusters does not have to be fixed up front

33
Q

divisive clustering

A
  • start with one big cluster C and make one split at each step
  • calculate the dissimilarity of each point to the other points inside cluster C
  • create a new cluster C’ and move the most dissimilar point from C to C’; keep moving points until no point left in C is less dissimilar to the points in C’ than to the rest of C
  • at each step, select the cluster C with the largest diameter for splitting
    –> the diameter of a cluster is the maximum distance between points in the cluster
34
Q

divisive clustering: dissimilarity of a point to other points within cluster C

A

dissimilarity(x_i, C) = [sum over x_j in C of distance(x_i, x_j)] / |C|

  • i.e., the average distance from x_i to the points in the cluster
35
Q

divisive clustering: largest diameter

A

over all pairs of points in C:

diameter(C) = max distance(x_i, x_j)

  • i.e., the largest distance between any two points in the cluster (the most spread-out cluster)
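a minimal sketch of both quantities for a cluster given as an array of points, assuming euclidean distance (names and data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dissimilarity(x_i, C):
    """Average distance from point x_i to the points in cluster C."""
    return float(np.mean(np.linalg.norm(np.asarray(C) - np.asarray(x_i), axis=1)))

def diameter(C):
    """Maximum pairwise distance between points in cluster C."""
    return float(cdist(C, C).max())

C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(dissimilarity([0.0, 0.0], C))   # average distance of this point to the cluster
print(diameter(C))                    # the cluster with the largest diameter is split next
```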
36
Q

agglomerative clustering

A
  • start with one cluster per instance and merge clusters into larger ones
  1. initialization: start with each data point in its own cluster
  2. merge clusters: at each step, find the pair of clusters that are closest to each other (according to the chosen merging criterion) and merge them
  3. continue this process until the desired number of clusters is reached or a stopping criterion is met
37
Q

agglomerative clustering: criteria for which clusters to merge

A
  1. single linkage: distance between two clusters is defined as the minimum distance between two points in separate clusters
  2. complete linkage: distance between two clusters is defined as the maximum distance between two points in separate clusters
  3. group average: distance between two clusters is defined as the average distance between all pairs of points, one from each cluster. This method balances between single and complete linkage.
  4. ward’s criterion: defines the distance between clusters as the increase in the standard deviation when the clusters are merged.
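a sketch using SciPy's hierarchical clustering, where the merging criterion is selected via the method argument ('single', 'complete', 'average' for group average, or 'ward'); the data is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(2, 0.3, (20, 2))])

# merge the closest pair of clusters at each step, according to the chosen criterion
Z = linkage(X, method='ward')                     # try 'single', 'complete', 'average' too
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy into 2 clusters
print(labels)
```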
38
Q

subspace clustering

A
  • handles a large number of features (high dimensional data)
  • uses the CLIQUE algorithm
39
Q

CLIQUE algorithm

A
  1. create units (u): split the range of each feature into ε distinct intervals
    –> a unit u is defined by means of boundaries per dimension: u = {u_1, …, u_p}
  2. define the selectivity and density of each unit u
40
Q

CLIQUE algorithm: units

A
  • defined by upper and lower boundaries per feature
  • u = {u_1, … ,u_p}
  • an instance x is part of this unit when its value falls within the boundaries for all features
41
Q

CLIQUE algorithm: selectivity(u)

A

[number of points in u] / [total number of points]

  • defines the proportion of points inside a unit
42
Q

CLIQUE algorithm: dense(u)

A
  • 1: when selectivity(u) is larger than a threshold τ
  • 0: otherwise
  • a density score of 1 indicates that the unit holds a relatively large share of the points and is therefore relevant / informative
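a minimal sketch of units, selectivity and density over the full feature space (not the complete CLIQUE algorithm); the function name and data are illustrative:

```python
import numpy as np
from itertools import product

def dense_units(X, n_intervals, tau):
    """Split the range of each feature into n_intervals equal intervals, compute
    the selectivity of each unit (the share of points inside it), and keep the
    dense units, i.e., those whose selectivity exceeds the threshold tau."""
    X = np.asarray(X, float)
    n, p = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_intervals
    # index of the interval each point falls into, per dimension
    idx = np.clip(((X - lo) / width).astype(int), 0, n_intervals - 1)
    dense = {}
    for unit in product(range(n_intervals), repeat=p):
        selectivity = float(np.mean(np.all(idx == unit, axis=1)))
        if selectivity > tau:
            dense[unit] = selectivity
    return dense

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
print(dense_units(X, n_intervals=5, tau=0.05))   # units near the centre are dense
```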
43
Q

CLIQUE algorithm: subspaces

A
  • we want subspaces (subsets of attributes) so our units do not have to cover all attributes
  • we can have units that cover p-k attributes
44
Q

CLIQUE algorithm: common face

A
  • two units have a common face when all of their boundaries are the same except in one dimension; in that dimension, the upper bound of one unit equals the lower bound of the other (or vice versa)
  • i.e., they are adjacent
  • we define a cluster as a maximal set of connected dense units
45
Q

CLIQUE algorithm: connected

A

units are connected when:

  1. they share a common face (they are each other’s common face)

or

  2. they share a unit that is a common face to both (i.e., they are connected through that common unit)
46
Q

visualizing hierarchical clustering

A
  • dendrogram
  • can be done for both divisive and agglomerative clustering
47
Q

ward’s criterion algorithm

A
  1. define clusters A, B, and the merged cluster AB
  2. for each cluster, take the sum of squared differences between all of its points and the center of that cluster
  3. subtract the within-cluster errors: (AB) - (A) - (B)
  4. if this increase is small, it indicates that clusters A and B should be merged
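a small sketch of this computation on toy clusters (the helper name and data are illustrative):

```python
import numpy as np

def sum_of_squares(points):
    """Sum of squared differences between the points and their cluster center."""
    points = np.asarray(points, float)
    return float(np.sum((points - points.mean(axis=0)) ** 2))

A = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
B = np.array([[0.1, 0.2], [0.3, 0.0]])   # close to A -> small increase when merged
C = np.array([[5.0, 5.0], [5.2, 4.9]])   # far from A -> large increase when merged

increase_AB = sum_of_squares(np.vstack([A, B])) - sum_of_squares(A) - sum_of_squares(B)
increase_AC = sum_of_squares(np.vstack([A, C])) - sum_of_squares(A) - sum_of_squares(C)
print(increase_AB, increase_AC)   # the pair with the smaller increase is merged first
```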
48
Q

problems with k-means, k-medoids, and hierarchical clustering (+ solution)

A
  1. clustering will take a long time to compute
  2. calculating distances over a large number of attributes can be problematic, and the distances might not distinguish cases very clearly
  3. the results will not be very insightful due to the high dimensionality

Hence, we need to define a subset of the attributes (or subspace) to perform clustering (CLIQUE)