Clustering Flashcards

1
Q

Basic Clustering problem

A

Given a set of objects X, partition them into groups (clusters).

Inter-object distances are symmetric: D(x, y) = D(y, x)

2
Q

Hierarchical Agglomerative clustering

A

Starts by treating each observation as its own cluster, then iteratively merges the two closest clusters (n − k merges to go from n clusters down to k, or all the way up to a single cluster). Dendrograms are used to represent hierarchical clustering results.

Clusters are merged based on the distance between them; how the distance between two clusters is computed is defined by the linkage criterion.

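A minimal sketch of this procedure using SciPy's hierarchical-clustering utilities (assuming numpy, scipy, and matplotlib are available; the toy data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                        # 20 toy points in 2-D
Z = linkage(X, method="single")                  # records the n - 1 merges bottom-up
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters

dendrogram(Z)                                    # visualize the merge hierarchy
plt.show()
```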
3
Q

What is the linkage criterion?

A

It determines the distance between sets of observations as a function of the pairwise distances between observations (see the sketch after this list).

Single linkage -> minimum distance between members of the two clusters (tends to produce long chains)
Complete linkage -> maximum distance (tends to produce compact, roughly spherical clusters of consistent diameter)
Average linkage -> average of all pairwise distances (less affected by outliers)
Centroid linkage -> distance between the cluster centroids (requires a Euclidean space)

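As an illustration, each of these rules can be computed directly from the pairwise distances between two clusters. A minimal NumPy sketch (the function name and toy points are made up):

```python
import numpy as np

def linkage_distance(A, B, kind="single"):
    """Distance between clusters A and B (arrays of points) under one linkage rule."""
    # All pairwise Euclidean distances between members of A and members of B.
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    if kind == "single":    # minimum pairwise distance
        return d.min()
    if kind == "complete":  # maximum pairwise distance
        return d.max()
    if kind == "average":   # mean of all pairwise distances
        return d.mean()
    if kind == "centroid":  # distance between the centroids (Euclidean space)
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(kind)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print({k: linkage_distance(A, B, k) for k in ["single", "complete", "average", "centroid"]})
```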
4
Q

What is the running time of SLC (single-linkage clustering)?

A
  • It is deterministic; no randomized optimization is required.
  • If the distances are treated as edge lengths of a graph, the procedure is the same as building a minimum spanning tree.
  • The runtime is O(n^3).
  • It can be made somewhat faster by caching the pairwise distances.

Repeat n − k times: scan all pairwise distances, O(n^2), to find the closest pair of points that have different labels, then merge their clusters (see the sketch below).

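A minimal sketch of that naive O(n^3) loop in Python (the function name and structure are illustrative, not the lecture's exact code):

```python
import numpy as np

def slc(X, k):
    """Naive single-linkage clustering: merge the closest differently-labeled pair n - k times."""
    n = len(X)
    labels = list(range(n))                                 # start: each point is its own cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)    # all pairwise distances
    for _ in range(n - k):                                  # n - k merges to reach k clusters
        best, pair = np.inf, None
        for i in range(n):                                  # O(n^2) scan per merge -> O(n^3) total
            for j in range(i + 1, n):
                if labels[i] != labels[j] and D[i, j] < best:
                    best, pair = D[i, j], (i, j)
        i, j = pair
        old, new = labels[j], labels[i]
        labels = [new if l == old else l for l in labels]   # relabel the merged cluster
    return labels
```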
5
Q

Could we do SLC in linear time?

A

No. It can be somewhat better than O(n^3) for very small or very large k, and the time can be reduced to O(n^2 log n) by using a priority queue.

6
Q

Issues with SLC?

A

Premature merging of groups that happen to contain a close pair, even if those groups are quite dissimilar overall (the chaining phenomenon).
It also does not work well on large datasets, since the runtime is O(n^3).

7
Q

How does k-means work?

A

1) Pick k centers at random
2) Each center claims its closest points
3) Re-compute the centers by averaging the clustered points
4) Repeat until convergence (see the sketch below)

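A minimal NumPy sketch of those four steps (the iteration cap, seed, and lack of empty-cluster handling are simplifications):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # 1) pick k centers at random
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        assign = d.argmin(axis=1)                        # 2) each center claims its closest points
        # 3) re-compute centers by averaging (empty clusters are not handled in this sketch)
        new = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centers):                    # 4) repeat until convergence
            break
        centers = new
    return centers, assign
```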
8
Q

Does a k-means center have to be a point in the data?

And how do we pick the initial k points?

A

Not necessarily.

1) Sampling: cluster a sample of the data with hierarchical clustering, then pick a point from each resulting cluster.
2) Pick a dispersed set of points: the first point at random; afterward, pick the point whose minimum distance to the already-chosen points is as large as possible (see the sketch below).

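A minimal sketch of option 2, the dispersed (farthest-first) initialization (illustrative only; the function name is made up):

```python
import numpy as np

def farthest_first_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]      # first center: a random data point
    for _ in range(k - 1):
        # distance of every point to its nearest already-chosen center
        d = np.min(np.linalg.norm(X[:, None] - np.array(centers)[None], axis=-1), axis=1)
        centers.append(X[d.argmax()])        # pick the point farthest from all chosen centers
    return np.array(centers)
```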
9
Q

How is k-means optimized?

A

It looks like hill climbing: each step evaluates a score (the sum of squared distances to the centroids) and moves to the neighboring configuration that improves it.
  • The error can only go down or stay the same.

10
Q

How to pick the right k for k-means?

A

Use the elbow curve: plot the average distance to the centroid against k and pick the smallest k after which the curve stops decreasing quickly (see the sketch below).

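A minimal sketch of the elbow curve using scikit-learn (assumed available); here inertia_, the sum of squared distances to the closest centroid, stands in for the average distance mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)        # toy data
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")   # look for the "elbow" where the curve flattens
plt.xlabel("k")
plt.ylabel("sum of squared distances to centroid")
plt.show()
```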
11
Q

What is the complexity of k-means?

A

  • In each round, we examine each input point exactly once to find its closest centroid.
  • Each round is therefore O(kn) for n points and k clusters.
  • But the number of rounds to convergence can be very large, up to O(k^n).
12
Q

In k-means, why can't the error go up when we move the centers?

A

Because the average is the best single point to represent a set of points: the average minimizes the squared error (a quick numerical check follows below).
So the error is monotonically non-increasing!

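A quick numerical check of that claim (toy values; the grid search over candidate centers is just for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 7.0])
candidates = np.linspace(0, 10, 1001)
sse = [((x - c) ** 2).sum() for c in candidates]     # squared error for each candidate center
best = candidates[int(np.argmin(sse))]
print(best, x.mean())   # both ~3.33: the mean minimizes the squared error
```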
13
Q

How does k-means converge?

A

There is a finite number of configurations, because there is a finite number of objects and a finite number of labels each object could have; once you have chosen a label for every object, the centers are defined deterministically from that.
You also need some way to break ties. Then you never go back to an old configuration, so eventually you run out of configurations.

There are k^n configurations.

Finite number of configurations -> never getting worse in the error metric -> some way of breaking ties -> you have to stop -> that is how you get convergence.

14
Q

Can k-means get stuck?

A

Yes, depending on the initial random centers it can converge to a local optimum.
We can reduce the risk by repeating the run (random restarts).
There are also smarter ways to choose the starting points.

15
Q

Is k-means hard or soft clustering?

A

Hard clustering: each point is assigned to exactly one cluster.

16
Q

What is soft-clustering?

A

Assume the data was generated by:

1) select one of k Gaussians (with fixed, known variance) uniformly at random
2) sample x_i from that Gaussian
3) repeat n times

Task: find a hypothesis h = <mu_1, ..., mu_k> that maximizes the probability of the data (maximum likelihood).

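A minimal sketch of fitting this generative model by EM using scikit-learn's GaussianMixture (assumed available; spherical covariances stand in for the fixed known variance, and the data is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from two Gaussians.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gmm = GaussianMixture(n_components=2, covariance_type="spherical").fit(X)
soft = gmm.predict_proba(X)   # soft assignments: P(cluster | point) for each point
print(gmm.means_)             # estimated Gaussian means
print(soft[:3])
```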
17
Q

Maximum likelihood Gaussian?

A

The maximum-likelihood mean mu of the Gaussian is the mean of the data.

18
Q

What are the properties of EM?

A

1) Monotonically non-decreasing likelihood
2) does not have to converge in theory (in practice it does)
3) will not diverge
4) can get stuck
5) works with any distribution

19
Q

How does EM converge?

A

The configurations are probabilities, and there are infinitely many of them. You never do worse, and you keep moving closer and closer, but you may never actually reach the final best configuration.