lecture 4 - clustering Flashcards

1
Q

two types of setup

A
  1. per instance
  2. per person
2
Q

setup per instance

A
  • per time point, over all quantified selves (QS)
  • each instance (time point) is treated as a separate observation
  • for each instance, there is a feature vector
  • these feature vectors are combined into one large matrix X, where each row corresponds to an instance (a measurement at a particular time point) (X_N, qs_n)
3
Q

setup per person

A
  • for finding types of people
  • data is organized by individual, where each person’s data is grouped together
  • each person has multiple sets of features, representing different measurements or time points
  • the matrix X is now composed of submatrices, each corresponding to a different person (X_qs_n)
4
Q

individual distance metrics (instance based)

A
  1. euclidean distance
  2. manhattan distance
  3. minkowski distance
  4. gower’s similarity
5
Q

euclidean distance

A

the length of the shortest straight-line path between two points

6
Q

manhattan distance

A

block-based (city-block) structure: the sum of the absolute differences per attribute

7
Q

minkowski distance

A
  • generalized form of the euclidean and manhattan distances
  • q = 1: manhattan distance
  • q = 2: euclidean distance
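a minimal sketch of the general form (the q-th root of the summed absolute differences raised to the power q); NumPy and the function name are assumptions for illustration:

```python
import numpy as np

def minkowski_distance(x_i, x_j, q=2):
    """Minkowski distance of order q between two numeric feature vectors.
    q=1 gives the manhattan distance, q=2 the euclidean distance."""
    x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

# example: the same pair of points under both special cases
print(minkowski_distance([0, 0], [3, 4], q=1))  # 7.0 (manhattan)
print(minkowski_distance([0, 0], [3, 4], q=2))  # 5.0 (euclidean)
```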
8
Q

important to consider for euclidean, manhattan, and minkowski distance

A

scaling the data: these metrics assume numeric values, and attributes with larger ranges would otherwise dominate the distance

9
Q

gower’s similarity

A
  • does not assume numeric values, so it can be used to compute similarities between different types of features:
  1. dichotomous attributes
  2. categorical attributes
  3. numerical attributes
10
Q

value of s(x^k_i, x^k_j) for dichotomous attributes

A
  • 1 when x^k_i and x^k_j are both present
  • 0 otherwise
  • i.e., similar when both instances indicate presence
11
Q

value of s(x^k_i, x^k_j) for categorical attributes

A
  • 1 when x^k_i = x^k_j
  • 0 otherwise
  • i.e., similar when instances are of the same category
12
Q

value of s(x^k_i, x^k_j) for numerical attributes

A
  • 1 - ((absolute difference between x^k_i and x^k_j) / (range of the attribute))
  • 1 - normalized absolute difference
  • automatically scaled!
13
Q

gower’s similarity: final similarity

A

gower’s similarity of two instances

[sum over all attributes k of s(x^k_i, x^k_j)] / [number of attributes k for which x^k_i and x^k_j can be compared]
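a minimal sketch following the per-attribute definitions above; the attribute-type labels and the ranges argument are illustrative assumptions:

```python
def gower_similarity(x_i, x_j, types, ranges):
    """Gower's similarity between two instances.
    types[k] is 'dichotomous', 'categorical' or 'numeric';
    ranges[k] is the range of attribute k (used for numeric attributes only)."""
    total, comparable = 0.0, 0
    for k, attr_type in enumerate(types):
        if x_i[k] is None or x_j[k] is None:   # attribute cannot be compared
            continue
        if attr_type == 'dichotomous':
            s = 1.0 if (x_i[k] == 1 and x_j[k] == 1) else 0.0
        elif attr_type == 'categorical':
            s = 1.0 if x_i[k] == x_j[k] else 0.0
        else:   # numeric: 1 - normalized absolute difference
            s = 1.0 - abs(x_i[k] - x_j[k]) / ranges[k]
        total += s
        comparable += 1
    return total / comparable

# one attribute of each type; the numeric attribute has range 10
print(gower_similarity([1, 'walk', 4.0], [1, 'run', 6.0],
                       ['dichotomous', 'categorical', 'numeric'],
                       [None, None, 10.0]))   # (1 + 0 + 0.8) / 3 ≈ 0.6
```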

14
Q

person level distance metrics (person-dataset based)

A
  • how do we compare the similarity between two persons’ datasets (qs1, qs2)?
  1. without explicit ordering
  2. with temporal ordering
15
Q

person-dataset similarity: without explicit ordering

A
  1. summarize the values per attribute over the entire dataset into a single value and compare these values with the same distance metrics as before
    –> you lose a lot of information this way
  2. estimate the parameters of a distribution per attribute and compare the parameter values with the same distance metrics as before
  3. compare the distributions of values for an attribute with a statistical test (e.g., kolmogorov-smirnov) and take 1 - p as the distance metric
    –> a low p means very different distributions, so the distance will be close to 1
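a sketch of option 3, using SciPy's two-sample kolmogorov-smirnov test on synthetic, purely illustrative data:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance(values_qs1, values_qs2):
    """Distance between two persons for one attribute: 1 - p of the KS test.
    A low p (very different distributions) gives a distance close to 1."""
    _, p_value = ks_2samp(values_qs1, values_qs2)
    return 1.0 - p_value

rng = np.random.default_rng(0)
print(ks_distance(rng.normal(0, 1, 200), rng.normal(0, 1, 200)))  # close to 0
print(ks_distance(rng.normal(0, 1, 200), rng.normal(3, 1, 200)))  # close to 1
```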
16
Q

person-dataset similarity: datasets with temporal ordering

A
  1. raw-data based
  2. feature based: same as the non-temporal case; extract features from the temporal dataset and compare those values
  3. model based: fit a time series model and compare its parameters, again in line with the non-temporal case, except that the type of model differs
17
Q

raw-data based similarity

A
  1. simplest case: assume an equal number of points, compute the euclidean distance between qs1 and qs2 per time point of an attribute, and sum over the attributes
    –> i.e., calculate the euclidean distance at each time point
  2. if the time series are more or less the same but shifted in time, we use the concept of lag and the cross-correlation coefficient to compute the cc_distance
  3. for the different frequencies at which different people perform their activities, we can use dynamic time warping
18
Q

shifted time series: lag

A
  • lag τ is the amount of time by which one time series is shifted relative to another
  • the goal is to find the lag τ that maximizes the similarity between the two time series: an optimization problem
19
Q

shifted time series: cross-correlation coefficient (ccc)

A
  • measures the similarity between two time series attributes after shifting one of them by τ
  • for each time point t of an attribute, multiply the value of qs1 at t with the value of qs2 at t + τ, then sum these products
20
Q

shifted time series: cross-correlation distance (cc_distance)

A
  • gives us the distance between two time series when one is shifted by a certain time lag
  • we test all possible shifts τ, from 1 up to the length of the shorter of the two datasets
  • for each lag, we sum the inverse of the cross-correlation coefficient (1/ccc) between the two qs over the attributes
  • the best time lag corresponds to the smallest cc_distance
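a minimal sketch of ccc and cc_distance for a single attribute, assuming equal-length series; names and data are illustrative:

```python
import numpy as np

def ccc(s1, s2, lag):
    """Cross-correlation coefficient for a shift of `lag`: multiply each value of
    s1 with the value of s2 shifted by lag time points, and sum the products."""
    s1, s2 = np.asarray(s1, float), np.asarray(s2, float)
    n = min(len(s1), len(s2))
    return float(np.sum(s1[:n - lag] * s2[lag:n]))

def cc_distance(s1, s2):
    """Try all shifts tau from 1 up to the shorter series length; the best lag
    gives the smallest distance 1/ccc. (With several attributes, sum 1/ccc
    over the attributes per lag.)"""
    n = min(len(s1), len(s2))
    best = float('inf')
    for lag in range(1, n):
        c = ccc(s1, s2, lag)
        if c > 0:                      # guard against non-positive coefficients
            best = min(best, 1.0 / c)
    return best

# s2 is s1 shifted by two time points, so the best lag is 2
s1 = [0, 1, 4, 2, 1, 0, 0, 0]
s2 = [0, 0, 0, 1, 4, 2, 1, 0]
print(cc_distance(s1, s2))
```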
21
Q

dynamic time warping

A
  • for different frequencies at which different persons perform their activities
  • finds the best pairing of instances between the two sequences, i.e., the alignment with the minimum total distance
22
Q

dynamic time warping: pairing conditions

A
  1. monotonicity condition: the time order must be preserved
    –> i.e., we cannot go back to a previous instance; in the matrix we can only move right, up, or diagonally (up-right), never backwards
  2. boundary condition: the first and last points must be matched; we start at the bottom-left cell and end at the top-right cell, and cannot move outside the time series
23
Q

dynamic time warping: cheapest path in the matrix

A
  1. start at (0,0)
  2. per pair (cell), compute the cheapest way to get there, given the constraints and the distance between the two points
  • cost = [distance between the two points] + [cheapest previous path (from the left, from below, or diagonally)]
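a minimal dynamic-programming sketch of this computation, using the absolute difference as the point-wise distance (names are illustrative):

```python
import numpy as np

def dtw_distance(s1, s2):
    """Fill a cost matrix in which cell (i, j) holds the cheapest cost of aligning
    s1[:i+1] with s2[:j+1]: the point-wise distance plus the cheapest allowed
    predecessor (from the left, from below, or diagonally)."""
    n, m = len(s1), len(s2)
    cost = np.full((n, m), np.inf)
    for i in range(n):
        for j in range(m):
            d = abs(s1[i] - s2[j])
            if i == 0 and j == 0:              # boundary condition: start at (0, 0)
                cost[i, j] = d
            else:
                prev = min(cost[i - 1, j] if i > 0 else np.inf,                # from below
                           cost[i, j - 1] if j > 0 else np.inf,               # from the left
                           cost[i - 1, j - 1] if i > 0 and j > 0 else np.inf) # diagonal
                cost[i, j] = d + prev
    return cost[n - 1, m - 1]   # final cell (the matrix's top-right corner) = DTW distance

print(dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 3, 4]))  # 0.0: same shape, different pace
```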
24
Q

dynamic time warping: DTW distance

A
  • the value in the top-right cell of the matrix is the DTW distance
  • this represents the minimum cost of aligning the two series
  • finding this distance is computationally expensive; this is mitigated with the keogh bound
25
Q

DTW: keogh bound

A

provides a cheap estimate (a lower bound) of what the cheapest path will cost

this makes DTW less computationally expensive

26
Q

clustering approaches

A
  1. k-means
  2. k-medoids
  3. hierarchical (divisive & agglomerative)
  4. subspace clustering
27
Q

k-means clustering

A
  • the goal of k-means clustering is to partition a set of data points into k clusters, where each data point belongs to the cluster with the nearest mean
  1. initialization: select k random points in the data space; these serve as the initial cluster centers
  2. assign points to clusters: each data point is assigned to the cluster whose center is nearest, typically using the euclidean distance
  3. update centers: recalculate the centers (centroids); the new center of each cluster is the mean of all data points assigned to it
  4. repeat: reassign points based on the updated centers and recompute the centers, until convergence (the assignments no longer change)
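a sketch using scikit-learn's KMeans on synthetic, purely illustrative data:

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic data: two groups of instances in a 2-D feature space
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

# initialize k centers, assign points to the nearest center, update the centers,
# and repeat until convergence
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # the final cluster means
print(kmeans.labels_[:5])        # cluster assignments of the first instances
```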
28
Q

k-means: selecting the best value for k

A

using the silhouette score

29
Q

k-means: silhouette score

A
  • measure used to determine how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
  • The silhouette score ranges from -1 to 1, where a value closer to 1 indicates better clustering.
  • you want to be close to the points in your own cluster and far away from points in other clusters: b > a gives a score closer to 1
  • based on the silhouette score for each k, we can decide which k is best
30
Q

silhouette score a(x_i) and b(x_i)

A
  • a: average distance from x_i to all other points in the same cluster
  • b: average distance from x_i to all points in the nearest neighboring cluster
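a sketch of choosing k via the average silhouette score, using scikit-learn on illustrative synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in (0, 3, 6)])  # three clear groups

# the k whose average silhouette score is closest to 1 is preferred
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```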
31
Q

k-medoids clustering

A
  • whereas k-means uses the mean of the assigned points as a cluster center (which need not be an actual data point), k-medoids uses an actual data point (the medoid) as the center
  • This makes it more robust to noise and outliers.
32
Q

types of hierarchical clustering

A
  1. divisive clustering
  2. agglomerative clustering

these take a more incremental approach than k-means and k-medoids: a hierarchy of clusterings is built step by step, so the number of clusters does not have to be fixed up front

33
Q

divisive clustering

A
  • start with one big cluster C and make one split at each step
  • calculate the dissimilarity of each point to the other points inside cluster C
  • create a new cluster C’ and move the most dissimilar point from C to C’; keep moving points until no point left in C is less dissimilar to the points in C’ than to the rest of C
  • at each step, select the cluster C with the largest diameter for splitting
    –> the diameter of a cluster is the maximum distance between points in the cluster
34
Q

divisive clustering: dissimilarity of a point to other points within cluster C

A

dissimilarity(x_i, C) = [sum over x_j in C of distance(x_i, x_j)] / |C|

  • i.e., the average distance from x_i to the points in the cluster
35
Q

divisive clustering: largest diameter

A

over all pairs of points in C:

diameter(C) = max distance(x_i, x_j)

  • i.e., the largest distance between any two points in the cluster (the most spread-out cluster)
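a minimal sketch of both quantities for a cluster given as an array of points, assuming euclidean distance (names and data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def dissimilarity(x_i, C):
    """Average distance from point x_i to the points in cluster C."""
    return float(np.mean(np.linalg.norm(np.asarray(C) - np.asarray(x_i), axis=1)))

def diameter(C):
    """Maximum pairwise distance between points in cluster C."""
    return float(cdist(C, C).max())

C = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(dissimilarity([0.0, 0.0], C))   # average distance of this point to the cluster
print(diameter(C))                    # the cluster with the largest diameter is split next
```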
36
Q

agglomerative clustering

A
  • start with one cluster per instance and merge clusters into larger ones
  1. initialization: start with each data point in its own cluster
  2. merge clusters: at each step, find the pair of clusters that are closest to each other (according to the chosen merging criterion) and merge them
  3. continue this process until the desired number of clusters is reached or a stopping criterion is met
37
Q

agglomerative clustering: criteria for which clusters to merge

A
  1. single linkage: distance between two clusters is defined as the minimum distance between two points in separate clusters
  2. complete linkage: distance between two clusters is defined as the maximum distance between two points in separate clusters
  3. group average: distance between two clusters is defined as the average distance between all pairs of points, one from each cluster. This method balances between single and complete linkage.
  4. ward’s criterion: defines the distance between clusters as the increase in the standard deviation when the clusters are merged.
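a sketch using SciPy's hierarchical clustering, where the merging criterion is selected via the method argument ('single', 'complete', 'average' for group average, or 'ward'); the data is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(2, 0.3, (20, 2))])

# merge the closest pair of clusters at each step, according to the chosen criterion
Z = linkage(X, method='ward')                     # try 'single', 'complete', 'average' too
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the hierarchy into 2 clusters
print(labels)
```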
38
Q

subspace clustering

A
  • handles a large number of features (high dimensional data)
  • uses the CLIQUE algorithm
39
Q

CLIQUE algorithm

A
  1. create units (u): split the range of each feature into ε distinct intervals
    –> a unit u is defined by means of boundaries per dimension: u = {u_1, …, u_p}
  2. define the selectivity and density of each unit u
40
Q

CLIQUE algorithm: units

A
  • defined by upper and lower boundaries per feature
  • u = {u_1, … ,u_p}
  • an instance x is part of this unit when its value falls within the boundaries for all features
41
Q

CLIQUE algorithm: selectivity(u)

A

[number of points in u] / [total number of points]

  • defines the proportion of points inside a unit
42
Q

CLIQUE algorithm: dense(u)

A
  • 1: when selectivity(u) is larger than a threshold τ
  • 0: otherwise
  • a density score of 1 indicates that the unit holds a relatively large share of the points and is therefore relevant / informative
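a minimal sketch of units, selectivity and density over the full feature space (not the complete CLIQUE algorithm); the function name and data are illustrative:

```python
import numpy as np
from itertools import product

def dense_units(X, n_intervals, tau):
    """Split the range of each feature into n_intervals equal intervals, compute
    the selectivity of each unit (the share of points inside it), and keep the
    dense units, i.e., those whose selectivity exceeds the threshold tau."""
    X = np.asarray(X, float)
    n, p = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_intervals
    # index of the interval each point falls into, per dimension
    idx = np.clip(((X - lo) / width).astype(int), 0, n_intervals - 1)
    dense = {}
    for unit in product(range(n_intervals), repeat=p):
        selectivity = float(np.mean(np.all(idx == unit, axis=1)))
        if selectivity > tau:
            dense[unit] = selectivity
    return dense

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (200, 2))
print(dense_units(X, n_intervals=5, tau=0.05))   # units near the centre are dense
```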
43
Q

CLIQUE algorithm: subspaces

A
  • we want subspaces (subsets of attributes) so our units do not have to cover all attributes
  • we can have units that cover p-k attributes
44
Q

CLIQUE algorithm: common face

A
  • two units have a common face when all of their boundaries are the same except in one dimension; in that dimension, the upper bound of one unit equals the lower bound of the other (or vice versa)
  • i.e., they are adjacent
  • we define a cluster as a maximal set of connected dense units
45
Q

CLIQUE algorithm: connected

A

units are connected when:

  1. they share a common face (they are each other’s common face)

or

  2. they share a unit that is a common face to both (i.e., they are connected through that common unit)
46
Q

visualizing hierarchical clustering

A
  • dendrogram
  • can be done for both divisive and agglomerative clustering
47
Q

ward’s criterion algorithm

A
  1. define clusters A, B, and the merged cluster AB
  2. for each cluster, take the sum of squared differences between all of its points and the center of that cluster
  3. subtract the within-cluster errors: (AB) - (A) - (B)
  4. if this increase is small, it indicates that clusters A and B should be merged
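a small sketch of this computation on toy clusters (the helper name and data are illustrative):

```python
import numpy as np

def sum_of_squares(points):
    """Sum of squared differences between the points and their cluster center."""
    points = np.asarray(points, float)
    return float(np.sum((points - points.mean(axis=0)) ** 2))

A = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3]])
B = np.array([[0.1, 0.2], [0.3, 0.0]])   # close to A -> small increase when merged
C = np.array([[5.0, 5.0], [5.2, 4.9]])   # far from A -> large increase when merged

increase_AB = sum_of_squares(np.vstack([A, B])) - sum_of_squares(A) - sum_of_squares(B)
increase_AC = sum_of_squares(np.vstack([A, C])) - sum_of_squares(A) - sum_of_squares(C)
print(increase_AB, increase_AC)   # the pair with the smaller increase is merged first
```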
48
Q

problems with k-means, k-medoids, and hierarchical clustering (+ solution)

A
  1. clustering will take a long time to compute
  2. calculating distances over a large number of attributes can be problematic, and the distances might not distinguish cases very clearly
  3. the results will not be very insightful due to the high dimensionality

Hence, we need to define a subset of the attributes (or subspace) to perform clustering (CLIQUE)