Representative Based Flashcards

1
Q

Representative Based

A

N instances, k clusters, partition C of N in K
centroid_i = (1/n_i) * sum(xj)
where n_i = |Ci|

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Brute Force

A

All possible clustering
Select best one
O(k^N/k!)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

K-means

A
Greedy iterative approach
	SSE = sum(sum(||xj-centroid_i||^2))
	Cā€™ = arg min SSE
Most widely know rep.
Complexity O(nkd), centroid recompute O(nd)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

K-means Centroid init

A

1) Pick points far away
2) First point at random. add point larga distance
3) Hierarchically so k clusters/ Pick point of each cluster
4) more than K centroids and then merge

P = K! / K^K

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Bisecting

A

one cluster, then divide, then divide.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Updating Center incrementally

A

After each assignment

More expensive

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Pre processing

A

Normalize data

Eliminate outliers

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Post processing

A

Eliminate small clusters
Split loose clusters
Merge clusters closer

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Problems of Kmeans

A
Densities
Non globular shapes
Sizes
Need to specify k
Sensitive to outliers
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

BFR Algorithm

A

Very large data sets
Clusters normally distributed around centroid
Axis aligned ellipses
O(clusters) and not O(data)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

BFR Steps

A

K random points
Small random sample and cluster optimally
Take sample, pick random point, and then k-1 more points, as far from previous
Discard Set, Compression Set, Retained Set

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Mean Shift Clustering

A
Iterative, non parametric algorithm
It searches for the mode
Find densest region
Number of modes gives number of clusters
Can handle arbitrarily shaped clusters
Robust to initializations
O(T*n^2) expensive
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Mean Shift Pseudo

A

1) choose bandwidthh
2) initial location of search window
3) compute mean location
4) center search window point 3
5) repeat 3 and 4

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Expectation Max - Clustering

A

soft assignment. each point has a probability
Mean vector u_i, covariance matrix E_i
phi = {u_i * E_i * P(C_i)}
Goal: maximize phi

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Density Based

A

Arbitrary Shape
Handle noise
One Scan

Neighborhood within a radius E
Core object: if E-neig contains at least minpts
Direct density reachable from y if x is within e-neigh
Density Reachable, Density Connected.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly