Clustering Flashcards
Unsupervised learning introduction
We have the input features X, but we do not have the labels y.
We are interested in finding some interesting structure in the data or, equivalently, in organizing it in some meaningful way.
Clustering definition and difficulties
Clustering: task of grouping objects such that similar objects end up in the same group and dissimilar objects are separated into different groups.
Difficulties:
- Similarity (or proximity) is not a transitive relation. Example: a sequence of objects x1, …, xm such that each xi is very similar to its two neighbors, xi-1 and xi+1, but x1 and xm are very dissimilar
- Lack of ground truth: a given set of objects can be clustered in various different meaningful ways
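The non-transitivity difficulty can be made concrete with a tiny numeric example (the points and spacing below are hypothetical, chosen only for illustration): points spaced evenly on a line, where every point is close to its neighbors but the endpoints are far apart.

```python
import numpy as np

# Hypothetical chain x1, ..., x10 of evenly spaced points on the real line.
points = np.arange(10, dtype=float)

neighbor_dists = np.abs(np.diff(points))     # distances between consecutive points
endpoint_dist = abs(points[-1] - points[0])  # distance between x1 and xm

print(neighbor_dists.max())  # 1.0 -> each xi is "similar" to its neighbors
print(endpoint_dist)         # 9.0 -> yet x1 and xm are very dissimilar
```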
Model for Clustering
Let’s formulate the clustering problem more formally:
- Input: a set of elements X and a distance function d: X x X -> R+ that
  - is symmetric: d(x, x') = d(x', x)
  - satisfies d(x, x) = 0
  - satisfies the triangle inequality
- Output: a partition of X into clusters C = (C1, …, Ck) such that
  - the clusters cover X: C1 ∪ … ∪ Ck = X
  - for all i ≠ j: Ci ∩ Cj = ∅
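The three distance-function properties above can be spot-checked in code. A minimal sketch using the Euclidean distance as the running example (the sample points are hypothetical):

```python
import itertools
import numpy as np

def d(x, y):
    """Euclidean distance, the standard example of a valid distance function."""
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

# Hypothetical sample points in R^2 used to spot-check the three properties.
pts = [(0.0, 0.0), (3.0, 4.0), (1.0, -2.0)]

for x, y in itertools.product(pts, repeat=2):
    assert d(x, y) == d(y, x)                     # symmetry
for x in pts:
    assert d(x, x) == 0.0                         # d(x, x) = 0
for x, y, z in itertools.product(pts, repeat=3):
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12   # triangle inequality
```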
k-means clustering and Lloyd’s algorithm
Cost minimization in Clustering
- define a cost function over possible partitions of the objects
- find the partition (=clustering) of minimal cost
Assumptions:
- data points x in X come from a larger space X', that is, X is a proper subset of X'
- a distance function d(x, x') is defined for x, x' in X'
- for simplicity, X' = R^d and d(x, x') = ||x - x'||
k-means clustering
input: data points x1, … , xm; k in N+
Goal: find
- partition C = (C1, … , Ck) of x1, … , xm
- centers mu1, …, muk in X', where mui is the center of Ci
that minimize the k-means objective (cost)
sum_{i=1}^{k} sum_{x in Ci} d(x, mui)^2
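The objective can be evaluated directly given a clustering and its centers. A minimal sketch (function name and the tiny example data are hypothetical):

```python
import numpy as np

def kmeans_cost(X, labels, centers):
    """k-means objective: sum over clusters i of sum over x in Ci of ||x - mui||^2."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diffs = X - centers[labels]   # each point minus its own cluster's center
    return float(np.sum(diffs ** 2))

# Tiny hypothetical example: two clusters on the real line,
# each point exactly 1 away from its center.
X = np.array([[0.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.0], [11.0]])
print(kmeans_cost(X, labels, centers))  # 4.0
```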
To solve k-means:
- Lloyd's algorithm, O(tkmd), where t = # of iterations, k = # of clusters, m = # of data points, d = dimension
Input: data points X = {x1,…xm}
Output: clustering C of X; centers
Pseudocode
k-means++
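k-means++ changes only the seeding of the initial centers: the first center is drawn uniformly from the data, and each subsequent center is drawn with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch of the seeding step (function name hypothetical):

```python
import numpy as np

def kmeans_pp_seeds(X, k, seed=0):
    """k-means++ seeding sketch: first center uniform over the data; each
    later center sampled proportionally to squared distance to the nearest
    already-chosen center."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

# usage: seed k=2 centers from three hypothetical points
seeds = kmeans_pp_seeds(np.array([[0.0, 0.0], [1.0, 0.0], [50.0, 50.0]]), k=2)
```

The resulting seeds can then be passed to Lloyd's algorithm in place of uniform initialization, which in expectation yields a solution within an O(log k) factor of the optimal cost.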
Linkage-based clustering
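Linkage-based (agglomerative) clustering starts with every point as its own cluster and repeatedly merges the closest pair of clusters until the desired number remains; with single linkage, the distance between two clusters is the minimum pairwise distance between their members. A minimal, unoptimized sketch (function name hypothetical; real implementations are far more efficient):

```python
import numpy as np

def single_linkage(X, k):
    """Single-linkage sketch: merge the two clusters whose closest pair of
    points is nearest, until only k clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    while len(clusters) > k:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage distance between clusters a and b
                dist = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if dist < best_d:
                    best_d, best = dist, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair of clusters
        del clusters[b]
    return clusters

# usage: two tight groups on the line should merge into two clusters
result = single_linkage(np.array([[0.0], [1.0], [10.0], [11.0]]), k=2)
```

Other linkage rules (complete, average) differ only in how the inter-cluster distance is computed.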