lecture 4 - clustering Flashcards
two types of setup
- per instance
- per person
setup per instance
- per timepoint, over all QS
- Each instance (time point) is treated as a separate observation.
- For each instance, there is a feature vector.
- These feature vectors are combined into a large matrix X, where each row corresponds to an instance (a measurement at a particular time). (X_N, qs_n)
setup per person
- for finding types of people
- data is organized by individual, where each person’s data is grouped together
- each person has multiple sets of features, representing different measurements or time points
- matrix X now corresponds to submatrices, each corresponding to a different person (X_qs_n)
individual distance metrics (instance based)
- euclidean distance
- manhattan distance
- minkowski distance
- gower’s similarity
euclidean distance
length of the shortest straight-line path between two points
manhattan distance
distance along a block-based (grid) path: the sum of the absolute differences per attribute
minkowski distance
- generalized form of euclidean and manhattan distance
- q = 1: manhattan distance
- q = 2: euclidean distance
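A minimal Python sketch of the Minkowski distance, to show how q = 1 and q = 2 recover the two special cases. The function name `minkowski_distance` and the example vectors are made up for illustration.

```python
def minkowski_distance(x, y, q):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if q < 1:
        raise ValueError("q must be >= 1")
    # Sum the q-th powers of the absolute differences, then take the q-th root.
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


# Illustrative values only.
a = [0.0, 0.0]
b = [3.0, 4.0]
manhattan = minkowski_distance(a, b, 1)  # q = 1: |3| + |4|
euclidean = minkowski_distance(a, b, 2)  # q = 2: sqrt(3^2 + 4^2)
```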
important to consider for euclidean, manhattan, and minkowski distance
scaling the data: all of these metrics assume numeric values, and without scaling the attributes with the largest value ranges dominate the distance
gower’s similarity
- does not assume numeric values, so we can use this to find distances for different types of features
- dichotomous attributes
- categorical attributes
- numerical attributes
value of s(x^k_i, x^k_j) for dichotomous attributes
- 1 when x^k_i and x^k_j are both present
- 0 otherwise
- i.e., similar when both instances indicate presence
value of s(x^k_i, x^k_j) for categorical attributes
- 1 when x^k_i = x^k_j
- 0 otherwise
- i.e., similar when instances are of the same category
value of s(x^k_i, x^k_j) for numerical attributes
- 1 - ((absolute difference between x^k_i and x^k_j) / (range of the attribute))
- 1 - normalized absolute difference
- automatically scaled!
gower’s similarity: final similarity
gower’s similarity of two instances
[sum over all attributes k of s(x^k_i, x^k_j)] / [number of attributes k for which x^k_i and x^k_j can be compared]
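A rough Python sketch of Gower's similarity for two instances with mixed attribute types, following the per-attribute rules on the cards above. The function name, the `kinds` and `ranges` arguments, and the use of `None` to mark a value that cannot be compared are all assumptions made for this example. Numerical ranges are assumed to be greater than zero.

```python
def gower_similarity(x_i, x_j, kinds, ranges):
    """Gower's similarity between two instances with mixed attribute types.

    kinds[k] is "dichotomous", "categorical", or "numerical".
    ranges[k] is the value range of attribute k (used for numerical ones only).
    Assumption: a value of None means the attribute cannot be compared.
    """
    total = 0.0
    comparable = 0
    for k, kind in enumerate(kinds):
        a, b = x_i[k], x_j[k]
        if a is None or b is None:
            continue  # not comparable: contributes to neither sum
        if kind == "dichotomous":
            s = 1.0 if (a and b) else 0.0  # 1 only when both are present
        elif kind == "categorical":
            s = 1.0 if a == b else 0.0  # 1 only for the same category
        elif kind == "numerical":
            s = 1.0 - abs(a - b) / ranges[k]  # 1 - normalized absolute difference
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        total += s
        comparable += 1
    if comparable == 0:
        raise ValueError("no comparable attributes")
    return total / comparable


# Illustrative instances: (dichotomous, categorical, numerical, numerical).
kinds = ["dichotomous", "categorical", "numerical", "numerical"]
ranges = [None, None, 100.0, 10.0]
x_i = [True, "walking", 60.0, None]
x_j = [True, "running", 80.0, 5.0]
similarity = gower_similarity(x_i, x_j, kinds, ranges)  # (1 + 0 + 0.8) / 3
```

The last attribute is skipped because one value is missing, so the sum is divided by 3 rather than 4.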
person level distance metrics (person-dataset based)
- how do we compare similarity between datasets (qs1, qs2)
- without explicit ordering
- with temporal ordering
person-dataset similarity: without explicit ordering
- summarize the values per attribute over the entire dataset into a single value (e.g., the mean) and compare these values with the same distance metrics as before
–> you lose a lot of information this way
- estimate the parameters of a distribution per attribute and compare the parameter values with the same distance metrics as before
- compare the distributions of values for an attribute with a statistical test (e.g., kolmogorov-smirnov). take 1-p as distance metric.
–> low p = very different distributions = distance metric will be close to 1
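A small Python sketch of the first option (summarizing each attribute into a single value), assuming the mean as the summary and the euclidean distance as the metric. The function name, the dict-of-lists layout, and the attribute names are hypothetical. In practice the attributes would likely need scaling first.

```python
from statistics import mean


def summary_distance(qs1, qs2):
    """Euclidean distance between the per-attribute means of two datasets.

    qs1, qs2: dicts mapping attribute name -> list of numeric values.
    Assumption: both datasets contain the same attributes; the number of
    measurements per person may differ.
    """
    return sum((mean(qs1[k]) - mean(qs2[k])) ** 2 for k in qs1) ** 0.5


# Illustrative data: two people, two attributes, different numbers of measurements.
qs1 = {"heart_rate": [60, 62, 64], "steps": [10, 20, 30]}
qs2 = {"heart_rate": [65, 65], "steps": [24, 24, 24, 24]}
distance = summary_distance(qs1, qs2)  # means (62, 20) vs (65, 24)
```

All ordering and spread within each attribute is discarded here, which is the information loss the card refers to.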
person-dataset similarity: datasets with temporal ordering
- raw-data based
- feature based: same as non-temporal case. extract features from temporal data set and compare those values.
- model-based: fit a time series model and use those parameters. again in line with the non-temporal case, except for the type of model being different
raw-data based similarity
- simplest case: assume an equal number of points, compute the euclidean distance between qs1 and qs2 over the time points of each attribute, then sum over attributes
–> i.e., the two series are compared time point by time point
- if the time series are more or less the same but shifted in time, we use the concept of lag and the cross-correlation coefficient to get the cc_distance
- if different people perform their activities at different frequencies, we can use dynamic time warping
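A minimal Python sketch of the simplest raw-data case, assuming both datasets have the same attributes and the same number of time points per attribute. The function name and the dict-of-lists layout are assumptions for illustration.

```python
def raw_euclidean_distance(qs1, qs2):
    """Sum over attributes of the euclidean distance between two aligned series.

    qs1, qs2: dicts mapping attribute name -> list of values ordered in time.
    Assumption: equal number of time points per attribute in both datasets.
    """
    total = 0.0
    for k in qs1:
        s1, s2 = qs1[k], qs2[k]
        if len(s1) != len(s2):
            raise ValueError(f"attribute {k!r}: series lengths differ")
        # Compare the series time point by time point.
        total += sum((a - b) ** 2 for a, b in zip(s1, s2)) ** 0.5
    return total


# Illustrative data only.
qs1 = {"heart_rate": [1.0, 2.0, 3.0], "acceleration": [0.0, 0.0, 0.0]}
qs2 = {"heart_rate": [1.0, 2.0, 7.0], "acceleration": [3.0, 4.0, 0.0]}
distance = raw_euclidean_distance(qs1, qs2)  # 4.0 + 5.0
```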
shifted time series: lag
- Lag τ is the amount of time by which one time series dataset is shifted relative to another.
- The goal is to find the best lag τ that maximizes the similarity between the two time series. This is an optimisation problem over τ.
shifted time series: cross-correlation coefficient (ccc)
- measures the similarity between two time series attributes after shifting one of them by τ
- for each time point t of an attribute, multiply the value of qs1 at t with the value of qs2 at t + τ, then sum these products.
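A short Python sketch of the sum of products described above for a single attribute, plus a brute-force search for the best lag. The function names, the `max_lag` search window, and the choice to skip time points where t + τ falls outside the second series are assumptions for this example. The value is not normalized.

```python
def ccc(tau, s1, s2):
    """Sum over t of s1[t] * s2[t + tau], using only indices valid in both series."""
    total = 0.0
    for t in range(len(s1)):
        u = t + tau
        if 0 <= u < len(s2):  # assumption: ignore points shifted outside s2
            total += s1[t] * s2[u]
    return total


def best_lag(s1, s2, max_lag):
    """Lag in [-max_lag, max_lag] that maximizes ccc (simple exhaustive search)."""
    return max(range(-max_lag, max_lag + 1), key=lambda tau: ccc(tau, s1, s2))


# Illustrative series: s2 is s1 shifted two time points to the right.
s1 = [0, 1, 2, 1, 0, 0]
s2 = [0, 0, 0, 1, 2, 1]
tau_star = best_lag(s1, s2, max_lag=3)
score = ccc(tau_star, s1, s2)
```

Here the search returns a lag of 2, matching the shift used to build `s2`.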