lecture 4 - clustering Flashcards
two types of setup
- per instance
- per person
setup per instance
- per timepoint, over all QS
- Each instance (time point) is treated as a separate observation.
- For each instance, there is a feature vector.
- These feature vectors are combined into one large matrix X, where each row corresponds to an instance (a measurement at a particular time point) (X_N, qs_n)
setup per person
- for finding types of people
- data is organized by individual, where each person’s data is grouped together
- each person has multiple sets of features, representing different measurements or time points
- the matrix X is now composed of submatrices, one per person (X_qs_n)
individual distance metrics (instance based)
- euclidean distance
- manhattan distance
- minkowski distance
- gower’s similarity
euclidean distance
the shortest straight-line path between two points
manhattan distance
block-based (city-block) path: the sum of absolute differences per attribute
minkowski distance
- generalized form of euclidean and manhattan distance
- q = 1: manhattan distance
- q = 2: euclidean distance
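a minimal sketch (not from the slides) showing how q = 1 and q = 2 reduce to the manhattan and euclidean distances; the feature vectors are made-up numbers:

```python
import numpy as np

def minkowski_distance(x_i, x_j, q):
    # minkowski distance: (sum over attributes of |x_i^k - x_j^k|^q)^(1/q)
    return np.sum(np.abs(x_i - x_j) ** q) ** (1.0 / q)

# made-up feature vectors of two instances
x_i = np.array([1.0, 2.0, 3.0])
x_j = np.array([2.0, 0.0, 3.5])

print(minkowski_distance(x_i, x_j, q=1))  # manhattan: |1-2| + |2-0| + |3-3.5| = 3.5
print(minkowski_distance(x_i, x_j, q=2))  # euclidean: sqrt(1 + 4 + 0.25) ≈ 2.29
```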
important to consider for euclidean, manhattan, and minkowski distance
scale (normalize) the data first: these metrics assume numeric values, and attributes with larger ranges would otherwise dominate the distance
gower’s similarity
- does not assume numeric values, so we can use this to find distances for different types of features
- dichotomous attributes
- categorical attributes
- numerical attributes
value of s(x^k_i, x^k_j) for dichotomous attributes
- 1 when x^k_i and x^k_j are both present
- 0 otherwise
- i.e., similar when both instances indicate presence
value of s(x^k_i, x^k_j) for categorical attributes
- 1 when x^k_i = x^k_j
- 0 otherwise
- i.e., similar when instances are of the same category
value of s(x^k_i, x^k_j) for numerical attributes
- 1 - ((absolute difference between x^k_i and x^k_j) / (range of the attribute))
- 1 - normalized absolute difference
- automatically scaled!
gower’s similarity: final similarity
gower’s similarity of two instances
[sum over all attributes k of s(x^k_i, x^k_j)] / [number of attributes for which x^k_i and x^k_j can both be compared]
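a minimal sketch of gower's similarity for two instances; the attribute names, types, and the use of None for missing values are assumptions made for illustration:

```python
def gower_similarity(x_i, x_j, types, ranges):
    """types[k]: 'dichotomous', 'categorical' or 'numerical';
    ranges[k]: range of attribute k (only used for numerical attributes);
    None marks a value that cannot be compared."""
    total, comparable = 0.0, 0
    for k, t in enumerate(types):
        a, b = x_i[k], x_j[k]
        if a is None or b is None:                      # skip attributes that cannot be compared
            continue
        if t == 'dichotomous':
            s = 1.0 if (a == 1 and b == 1) else 0.0     # 1 only when both indicate presence
        elif t == 'categorical':
            s = 1.0 if a == b else 0.0                  # 1 when same category
        else:
            s = 1.0 - abs(a - b) / ranges[k]            # 1 - normalized absolute difference
        total += s
        comparable += 1
    return total / comparable                           # divide by number of comparable attributes

# hypothetical attributes: smoker (dichotomous), activity (categorical), heart rate (numerical)
x_i = [1, 'walking', 80.0]
x_j = [0, 'walking', 95.0]
print(gower_similarity(x_i, x_j,
                       types=['dichotomous', 'categorical', 'numerical'],
                       ranges=[None, None, 60.0]))      # (0 + 1 + 0.75) / 3 ≈ 0.58
```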
person level distance metrics (person-dataset based)
- how do we compare similarity between datasets (qs1, qs2)
- without explicit ordering
- with temporal ordering
person-dataset similarity: without explicit ordering
- summarize the values per attribute over the entire dataset into a single value and use the same distance metrics as before
–> you lose a lot of information this way
- estimate parameters of a distribution per attribute and compare the parameter values with the same distance metrics as before
- compare the distributions of values for an attribute with a statistical test (e.g., kolmogorov-smirnov) and take 1 - p as the distance metric
–> low p = very different distributions = distance metric will be close to 1
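a minimal sketch of the statistical-test approach using scipy's two-sample kolmogorov-smirnov test; averaging 1 - p over the attributes and the generated data are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance(qs1, qs2):
    """qs1, qs2: 2-D arrays (instances x attributes) of two persons.
    returns the mean of 1 - p over the attributes."""
    distances = []
    for k in range(qs1.shape[1]):
        _, p = ks_2samp(qs1[:, k], qs2[:, k])   # compare the value distributions of attribute k
        distances.append(1.0 - p)               # low p (very different distributions) -> close to 1
    return float(np.mean(distances))

rng = np.random.default_rng(0)
person_a = rng.normal(0.0, 1.0, size=(100, 2))  # hypothetical sensor values for person a
person_b = rng.normal(2.0, 1.0, size=(100, 2))  # clearly shifted distribution for person b
print(ks_distance(person_a, person_b))          # close to 1
```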
person-dataset similarity: datasets with temporal ordering
- raw-data based
- feature based: same as non-temporal case. extract features from temporal data set and compare those values.
- model-based: fit a time series model and use those parameters. again in line with the non-temporal case, except for the type of model being different
raw-data based similarity
- simplest case: assume an equal number of points, compute the euclidean distance between qs1 and qs2 per time point of an attribute, then sum over the attributes
–> i.e., calculate the euclidean distance at each time point (a sketch of this simplest case follows below)
- if the time series are more or less the same but shifted in time, we use the concept of lag and the cross-correlation coefficient to get the cc_distance
- for different frequencies at which different people perform their activities, we can use dynamic time warping
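a minimal sketch of the simplest raw-data based case (one reading of the card above: per attribute, take the euclidean distance over the time points, then sum over attributes); equal lengths and aligned time points are assumed:

```python
import numpy as np

def raw_data_distance(qs1, qs2):
    """qs1, qs2: 2-D arrays (time points x attributes) of equal length.
    per attribute, euclidean distance over the time points; then sum over attributes."""
    per_attribute = np.sqrt(np.sum((qs1 - qs2) ** 2, axis=0))
    return float(np.sum(per_attribute))

t = np.linspace(0, 10, 50)
qs1 = np.column_stack([np.sin(t), np.cos(t)])         # hypothetical series for person 1
qs2 = np.column_stack([np.sin(t) + 0.1, np.cos(t)])   # almost identical series for person 2
print(raw_data_distance(qs1, qs2))                    # small distance
```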
shifted time series: lag
- Lag τ is the amount of time by which one time series dataset is shifted relative to another.
- The goal is to find the best lag τ that maximizes the similarity between the two time series. This is an optimisation problem
shifted time series: cross-correlation coefficient (ccc)
- measures the similarity between two time series attributes after shifting one of them by τ
- for each time point of an attribute, multiply the value of qs1 at that time point with the value of qs2 at that time point + τ, then sum these products
shifted time series: cross-correlation distance (cc_distance)
- gives us the distance between two time series as one is shifted by a certain time lag
- we are testing all possible shifts τ from 1 up to the smaller length of the two data sets
- we sum the inverse of the cross-correlation coefficient (ccc) between the two qs for each attribute: 1/ccc
- the best time lag is related to the smallest cc_distance
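a minimal sketch of ccc and cc_distance for a single attribute; the restricted lag window (to keep ccc positive so that 1/ccc behaves) and the generated series are assumptions:

```python
import numpy as np

def ccc(s1, s2, tau):
    """cross-correlation coefficient of two 1-D series for lag tau:
    multiply s1 at each time point with s2 at that time point + tau, then sum the products."""
    n = min(len(s1), len(s2) - tau)
    return float(np.sum(s1[:n] * s2[tau:tau + n]))

def cc_distance(qs1, qs2, tau):
    """sum over the attributes of 1 / ccc for a given lag tau."""
    return sum(1.0 / ccc(qs1[:, k], qs2[:, k], tau) for k in range(qs1.shape[1]))

t = np.arange(200) * 0.05
qs1 = np.sin(t).reshape(-1, 1)         # hypothetical single-attribute series
qs2 = np.sin(t - 0.5).reshape(-1, 1)   # same pattern, lagging by 0.5 / 0.05 = 10 samples

# the slides test all shifts up to the shorter series length; a small lag window is
# assumed here so that ccc stays positive and 1/ccc stays well behaved
lags = range(1, 30)
best_tau = min(lags, key=lambda tau: cc_distance(qs1, qs2, tau))
print(best_tau)                        # close to the true shift of ~10 samples
```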
dynamic time warping
- for different frequencies at which different persons perform their activities
- finds the best pairs of instances in sequences to find the minimum distance
dynamic time warping: pairing conditions
- monotonicity condition: time order should be preserved
–> i.e., we can’t go back to a previous instance; you can only move right, up, or diagonally (up-right), never backwards
- boundary condition: first and last points should be matched
–> this requires that we start at the bottom left and end at the top right; we cannot move outside of our time series
dynamic time warping: cheapest path in the matrix
- start at (0,0)
- per pair, compute the cheapest way to get there given the constraints and the distance between each pair
- cost = [distance between the two points] + [cheapest previous path (from the left, from below, or diagonally)]
dynamic time warping: DTW distance
- the value in the top-right cell of the matrix is the DTW distance
- this represents the minimum cost to align the two series.
- finding this distance is computationally expensive: solved with the keogh bound
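a minimal sketch of the dynamic-programming computation; a single attribute, absolute difference as the point distance, and leaving out the keogh bound are assumptions here:

```python
import numpy as np

def dtw_distance(s1, s2):
    """fill the cost matrix pair by pair:
    cost = distance between the two points + cheapest previous path
    (from the left, from below, or diagonally)."""
    n, m = len(s1), len(s2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0                           # virtual start; enforces the boundary condition
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(s1[i - 1] - s2[j - 1])     # distance between the two points
            cost[i, j] = d + min(cost[i - 1, j],       # from below
                                 cost[i, j - 1],       # from the left
                                 cost[i - 1, j - 1])   # diagonally
    return cost[n, m]                          # top-right cell = minimum alignment cost

s1 = np.sin(np.linspace(0, 2 * np.pi, 60))     # hypothetical activity signal
s2 = np.sin(np.linspace(0, 2 * np.pi, 90))     # same shape, performed at a slower pace
print(dtw_distance(s1, s2))                    # small despite the different lengths
```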