Clustering Flashcards
Basic Clustering problem
Given a set of objects X, partition them into groups.
Inter-object distance is symmetric: D(x, y) = D(y, x)
Hierarchical Agglomerative clustering
Starts by treating each observation as its own cluster, then iteratively merges the two closest clusters (n − k merges to go from n clusters to k) until all the data points are merged into a single cluster. Dendrograms are used to represent hierarchical clustering results.
Clusters are merged based on the distance between them; to calculate the distance between clusters, we use different types of linkages.
What is the linkage criterion?
It determines the distance between sets of observations as a function of the pairwise distances between observations.
Single linkage -> minimum distance between members of the two clusters (tends to produce long chains)
Complete linkage -> maximum distance between members (tends to produce spherical clusters with consistent diameter)
Average linkage -> the average of all pairwise distances (less affected by outliers)
Centroid linkage -> the distance between the clusters' centroids (in Euclidean space)
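A minimal sketch of the four linkages using SciPy, assuming scipy and numpy are available (the two toy blobs are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),   # one blob near the origin
               rng.normal(5, 1, (20, 2))])  # a second, well-separated blob

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                    # (n-1) x 4 merge history
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at 2 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage
```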
What is the running time of SLC (single-linkage clustering)?
- It is deterministic; it doesn't require randomized optimization.
- If the distances represent edge lengths of a graph, SLC is the same as the minimum spanning tree algorithm.
- The naive runtime is O(N^3).
- It could be made faster by storing the pairwise distances.
Repeat n − k times (O(N) merges):
look at all O(N^2) distances to find the closest pair of points
that have different labels, and merge their clusters (see the sketch below).
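A minimal numpy sketch of that naive O(N^3) loop (the function name slc is my own):

```python
import numpy as np

def slc(X, k):
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labels = np.arange(n)                    # every point starts as its own cluster
    for _ in range(n - k):                   # n - k merges leave k clusters
        best, pair = np.inf, None
        for i in range(n):                   # scan all O(N^2) pairs for the
            for j in range(i + 1, n):        # closest pair with different labels
                if labels[i] != labels[j] and dist[i, j] < best:
                    best, pair = dist[i, j], (i, j)
        i, j = pair
        labels[labels == labels[j]] = labels[i]   # merge j's cluster into i's
    return labels
```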
Could we do SLC in linear time?
No. It can be a little better than O(N^3) for very small or very large k, and we can reduce the time to O(N^2 log N) by using a priority queue.
Issues with SLC?
Premature merging of groups that happen to contain a close pair, even if those groups are quite dissimilar overall (the chaining phenomenon).
It also doesn't work well with large datasets, because of the O(N^3) runtime.
How does k-means work?
1) Pick k centers at random
2) Each center claims its closest points
3) Re-compute the centers by averaging the clustered points
4) Repeat until convergence (see the sketch below)
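A minimal numpy sketch of those four steps (Lloyd's algorithm); the function name kmeans is my own, and it assumes no cluster goes empty:

```python
import numpy as np

def kmeans(X, k, rng=np.random.default_rng(0)):
    centers = X[rng.choice(len(X), k, replace=False)]  # 1) pick k centers at random
    while True:
        d = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = d.argmin(axis=1)                      # 2) each center claims its closest points
        centers_new = np.array([X[labels == i].mean(axis=0)  # 3) re-compute the centers by
                                for i in range(k)])           #    averaging (no empty clusters)
        if np.allclose(centers_new, centers):          # 4) repeat until converged
            return labels, centers
        centers = centers_new
```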
Does the center in k-means need to be a point in the data?
And how do we pick the initial k points?
Not necessarily.
1) Sampling: sample the data, cluster the sample with hierarchical clustering, then pick a point from each cluster.
2) Pick a dispersed set of points: the first point at random; afterwards, pick the point whose minimum distance to the points already chosen is as large as possible (sketched below).
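A sketch of the dispersed-points heuristic (2), often called farthest-first traversal; the function name is my own:

```python
import numpy as np

def farthest_first(X, k, rng=np.random.default_rng(0)):
    centers = [X[rng.integers(len(X))]]          # first point at random
    for _ in range(k - 1):
        # distance from each point to its nearest already-chosen center
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])            # pick the point whose min distance is largest
    return np.array(centers)
```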
How is k-means optimized?
It behaves like hill climbing: it evaluates the error score and moves to a neighboring configuration with a better score.
*) The error can only go down or stay the same.
How to pick the right k for k-means?
Use the elbow curve: plot the average distance to the centroid against k, and pick the smallest k after which the curve stops improving quickly (see the sketch below).
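A sketch of the elbow curve using scikit-learn, assuming it is installed (inertia_ is the total squared distance from points to their closest centroid):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))  # look for the k where the drop flattens out
```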
What is the complexity of k-means?
*) In each round, we have to examine each input point exactly once to find its closest centroid.
- ) Each round is O(KN) for N points and K clusters.
- ) But the number of rounds to convergence can be very large, up to O(K^N).
In k-means, why does moving the center never make the error go up?
Because the average is the best way to represent a set of points (the average minimizes squared error; see the one-line derivation below).
The error is monotonically non-increasing!
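A one-line derivation of that claim, in LaTeX: setting the gradient of the squared error to zero shows the minimizer is exactly the mean.

```latex
c^* = \arg\min_c \sum_{i=1}^{n} \lVert x_i - c \rVert^2,
\qquad
\nabla_c \sum_i \lVert x_i - c \rVert^2 = -2 \sum_i (x_i - c) = 0
\;\Rightarrow\;
c^* = \frac{1}{n} \sum_{i=1}^{n} x_i
```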
How does k-means converge?
There is a finite number of configurations: there is a finite number of objects, each object can take a finite number of labels, and once you've chosen a label for every object, the centers are defined deterministically from that.
You have to have some way to break ties; then you never return to an old configuration, so eventually you run out of configurations.
There are K^N configurations.
Finite number of configurations -> never getting worse in the error metric -> some way of breaking ties -> you have to stop. This is how you get convergence.
Can k-means get stuck?
Yes, depending on the first random points chosen (a local optimum).
We can avoid that by repeating the run (random restarts; see the sketch below).
There are also smarter ways to choose the starting points (e.g., the dispersed-points heuristic above).
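A sketch of random restarts, reusing the kmeans sketch from above: run it several times and keep the run with the lowest total squared error.

```python
import numpy as np

def kmeans_restarts(X, k, restarts=10):
    best_err, best = np.inf, None
    for seed in range(restarts):
        labels, centers = kmeans(X, k, rng=np.random.default_rng(seed))
        err = ((X - centers[labels]) ** 2).sum()  # total squared error of this run
        if err < best_err:
            best_err, best = err, (labels, centers)
    return best
```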
Is k-means hard or soft clustering?
Hard clustering: each point is assigned to exactly one cluster.