Clustering Flashcards

1
Q

Unsupervised learning introduction

A

We have the input features X, but we do not have the labels y.
We are interested in finding some interesting structure in the data, or , equivalently, to organize it in some meaningful way.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Clustering definition and difficulties

A

Clustering: task of grouping objects such that similar objects end up in the same group and dissimilar objects are separated into different groups.

Difficulties:

  • Similarity (or proximity) is not a transitive relation: Example, sequence of objects, x1,…, xm such that each xi is very similar to its two neiighbors, xi-1 and xi+1, but x1 and xm are very dissimilar
  • Lack of ground truth: a given set of objects can be clustered in various different meaningful ways
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Model for Clustering

A

Let’s formulate the clustering problem more formally:

  • Input: set of elements X and distance function d: X x X -> R+ that is
    symmetric
    d(x,x) = 0
    d satisfies the triangle inequality
  • Output: a partition of X into clusters that is C = (C1,…,Ck) with
    their union Ci = X
    for all i not j; Ci intersection Cj = 0
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

k-means clustering and Lloyd’s algorithm

A

Cost minimization in Clustering

  • define a cost function over possible partitions of the objects
  • find the partition (=clustering) of minimal cost

assumptions:
- data points x in X come from a larger space X’, that is X proper subset X’
- distance function d(x,x’) for x,x’ in X

for simplicity X’ = R^d and d(x,x’) = ||x-x’||

k-means clustering

input: data points x1, … , xm; k in N+
Goal: find
- partition C = (C1, … , Ck) of x1, … , xm
- centers mu1,…, mui in X’ center for Ci

that minimizes the k-means objective (cost)

SUMSUM d(x,mui)^2

to solve k means:
- Llyod’s Algorithm O(tkmd), where t = #ofiterations, m = #datapoints

Input: data points X = {x1,…xm}
Output: clustering C of X; centers

Pseudocode

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

k-means++

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Linkage-based clustering

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly