Clustering Flashcards

1
Q

K-means clustering algorithm?

A

-Randomly assign a cluster number to each observation

-Repeat until the cluster assignments stop changing:
-for each cluster, compute the centroid (mean of its points)
-assign each observation to the cluster whose centroid is closest (Euclidean distance)
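A minimal NumPy sketch of the steps above (the function name `kmeans` and the empty-cluster reseeding guard are illustrative choices, not part of the card):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Lloyd's algorithm: alternate centroid update and reassignment."""
    rng = np.random.default_rng(seed)
    # randomly assign a cluster number to each observation
    labels = rng.integers(0, k, size=len(X))
    centroids = X[:k].copy()
    for _ in range(iters):
        # for each cluster, compute the centroid (mean of its points);
        # reseed an empty cluster from a random point (illustrative guard)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else X[rng.integers(len(X))] for j in range(k)])
        # assign each observation to the closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids
```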

2
Q

What kind of distance does KMeans Clustering require and why?

A

-Euclidean distance
-each k-means step minimizes squared Euclidean error
-without a Euclidean space, the centroid (mean) is not well defined

3
Q

How to use kmeans with Manhattan distance?

A

Use PAM (Partitioning Around Medoids)
- medoid - an actual data point in the cluster that minimizes the total distance to all other points in it
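A small sketch of what a medoid is, assuming Manhattan distance (the helper name `medoid` is illustrative):

```python
import numpy as np

def medoid(points, dist=lambda a, b: np.abs(a - b).sum()):
    """The medoid is an actual data point that minimizes the total
    distance (here: Manhattan) to all other points in the cluster."""
    costs = [sum(dist(p, q) for q in points) for p in points]
    return points[int(np.argmin(costs))]
```

Unlike a centroid, the medoid is always one of the cluster's own points, so it stays meaningful for any dissimilarity measure.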

4
Q

How to use kmeans with Cosine distance?

A

First normalize every data row to unit length, then apply kmeans (data points end up on the unit sphere, where Euclidean distance is monotone in cosine distance)
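A tiny NumPy check of why this works: after row normalization, the squared Euclidean distance between two points equals exactly 2 × (1 − cosine similarity), so minimizing one minimizes the other:

```python
import numpy as np

X = np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
# normalize each row to unit length -> points land on the unit sphere
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

u, v = Xn[0], Xn[1]
cos_dist = 1 - u @ v              # cosine distance of the unit vectors
eucl_sq = np.sum((u - v) ** 2)    # squared Euclidean distance
# on the unit sphere: ||u - v||^2 == 2 * (1 - cos similarity)
assert np.isclose(eucl_sq, 2 * cos_dist)
```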

5
Q

What are the different specifications for linkage in hierarchical clustering?

A
  • Single linkage - distance between the closest pair of points
  • Complete linkage - distance between the farthest pair of points
  • Average linkage - mean distance over all pairs of points in the two clusters
  • Ward distance - difference between the within-cluster sum of squares of the merged cluster and the total of the within-cluster sums of squares of the two clusters kept separate
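A minimal NumPy sketch of the four linkage rules between two clusters (function names are illustrative):

```python
import numpy as np

def wcss(S):
    """Within-cluster sum of squares: squared distances to the mean."""
    return np.sum((S - S.mean(axis=0)) ** 2)

def linkage_dists(A, B):
    """All pairwise distances between cluster A and cluster B,
    reduced according to each linkage rule."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return {
        "single": D.min(),     # closest pair
        "complete": D.max(),   # farthest pair
        "average": D.mean(),   # mean over all pairs
        # Ward: increase in WCSS caused by merging A and B
        "ward": wcss(np.vstack([A, B])) - wcss(A) - wcss(B),
    }
```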
6
Q

Which linkage methods need the distance to be metric?

A
  • Ward and centroid-based linkage need Euclidean distance (they minimize squared error)
  • single, complete and average linkage only need a symmetric, non-negative dissimilarity
7
Q

DBScan core?

A

When the number of points in its ε-neighborhood (ball of radius ε) is >= t

8
Q

DBScan border?

A

When the number of points in its ε-neighborhood is < t, but there is a core point in the neighborhood

9
Q

DBScan noise?

A

When the number of points in its ε-neighborhood is < t and there is no core point in the neighborhood

10
Q

DBScan algorithm?

A

Determine core, border and noise points
Connect core points within distance ε
Find all connected components of core points
Assign each border point to the closest connected component
Return the components as clusters (noise points stay unassigned)
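The steps above can be sketched directly in NumPy (a simplified implementation for small data; the O(n²) distance matrix and the function name `dbscan` are illustrative choices):

```python
import numpy as np

def dbscan(X, eps, t):
    """DBSCAN sketch: t = minimum number of points in the eps-ball."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neigh = D <= eps                  # eps-neighborhoods (incl. the point itself)
    core = neigh.sum(axis=1) >= t     # core: >= t points in the ball

    labels = np.full(n, -1)           # -1 = noise (until proven otherwise)
    cluster = 0
    for i in range(n):                # connected components of core points
        if core[i] and labels[i] == -1:
            stack = [i]
            labels[i] = cluster
            while stack:
                j = stack.pop()
                for m in np.flatnonzero(neigh[j]):
                    if core[m] and labels[m] == -1:
                        labels[m] = cluster
                        stack.append(m)
            cluster += 1
    for i in range(n):                # border: non-core with a core point in its ball
        if not core[i]:
            cores_near = np.flatnonzero(neigh[i] & core)
            if len(cores_near):
                labels[i] = labels[cores_near[D[i, cores_near].argmin()]]
    return labels                     # points still labeled -1 are noise
```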

11
Q

Does DBScan need Euclidean?

A

No - it only needs neighborhood counts, not centroids, so any distance measure works

12
Q

Should we normalize observations/features first?

A

No! Normalization changes the distances between observations, and therefore the resulting clusters.

13
Q

Centroid-based clustering issues?

A

-can get stuck in local minima
-initialization can be problematic
-unclear how to handle streaming data
-cannot handle non-spherical clusters

14
Q

Hierarchical, density, graph based clustering issues?

A

-Requires computing the full distance matrix -> expensive
-unclear how to handle streaming data

15
Q

Describe kmeans++

A

Choose the first centroid c uniformly at random
For each data point x, compute D(x), the distance to its nearest chosen centroid
Choose the next centroid at random with probability proportional to D(x)²; repeat until k centroids are chosen, then run standard kmeans
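A minimal NumPy sketch of the seeding procedure (the function name `kmeanspp_init` is illustrative):

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids out."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    while len(centroids) < k:
        # squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```

Since an already-chosen point has D(x) = 0, it can never be picked twice; far-away points are favored, which avoids placing two seeds in the same cluster.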

16
Q

Describe CURE

A

Select a small sample of data points to represent the clusters
Run an expensive algorithm (e.g. DBScan) on the sample to find an initialization
Select representatives for each cluster, making sure they capture the cluster's shape
Maintain mini-clusters that are later merged into larger ones

17
Q

Potential problems of CURE?

A

Representatives may lie too far out (capturing outliers) for clusters to merge; fix: move each representative a fraction of the way towards its cluster's centroid

18
Q

Does CURE need euclidean distance?

A

Yes, because of that modification: moving representatives towards the centroid assumes a Euclidean space

19
Q

What are the three sets used by BFR?

A
  • Discard set - points close enough to a pre-existing cluster; they update its statistics and are discarded
  • Compressed set - mini-clusters formed from the remaining points
  • Retained set - outliers
    only the retained set keeps individual points in memory
20
Q

What sufficient statistics do we need for a cluster?

A
  • N - number of data points
  • SUM - vector with the sums of all feature values
  • SUMSQ - vector with the sums of squared feature values
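A small NumPy check that these three statistics (N, SUM, and the sum of squares, often written SUMSQ) are enough to recover a cluster's centroid and per-feature variance, so the individual points can be thrown away:

```python
import numpy as np

points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
N = len(points)
SUM = points.sum(axis=0)            # per-feature sums
SUMSQ = (points ** 2).sum(axis=0)   # per-feature sums of squares

centroid = SUM / N                        # mean per feature
variance = SUMSQ / N - (SUM / N) ** 2     # variance per feature

assert np.allclose(centroid, points.mean(axis=0))
assert np.allclose(variance, points.var(axis=0))
```

The statistics are also additive: merging two clusters just adds their N, SUM and SUMSQ, which is what makes the BFR bookkeeping cheap.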
21
Q

BFR clustering alg

A

For each batch:
-if a point fits an existing discard-set cluster, update that cluster's statistics and discard the point
-else:
-use a standard clustering algorithm to form mini-clusters
-compute statistics for the mini-clusters and discard their points
-keep unclustered points in the retained set
-merge new mini-clusters and retained points with older mini-clusters, updating the statistics

22
Q

Does BFR need Euclidean?

A

Theoretically yes (it relies on centroids and per-feature statistics), but it has been applied without Euclidean distance, if done carefully

23
Q

What does Silhouette coefficient tell us?

A
  • between -1 and 1
    large positive values: good clustering (points well matched to their clusters)
    values near 0 or negative: clusters are mixed / points may be misassigned
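A simplified single-point sketch of the coefficient s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b the mean distance to the nearest other cluster (the full definition excludes the point itself and takes the minimum b over all other clusters; the function name is illustrative):

```python
import numpy as np

def silhouette(x, own_cluster, other_cluster):
    """Silhouette of a single point x: s = (b - a) / max(a, b)."""
    a = np.mean([np.linalg.norm(x - p) for p in own_cluster])
    b = np.mean([np.linalg.norm(x - p) for p in other_cluster])
    return (b - a) / max(a, b)
```

A point much closer to its own cluster than to the other gives s near 1; a point closer to the other cluster gives a negative s.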
24
Q

How to evaluate kmeans clusters?

A

Using WCV (within-cluster variation): the total within-cluster sum of squares; lower is better for a fixed k
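A minimal NumPy sketch of the within-cluster variation, assuming it is measured as the total within-cluster sum of squares (function name illustrative):

```python
import numpy as np

def total_wcv(X, labels):
    """Total within-cluster variation: sum of squared distances
    of each point to its cluster's centroid (lower is better)."""
    return sum(np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
               for j in np.unique(labels))
```

Note that WCV always decreases as k grows, so it is only comparable between clusterings with the same k (or via an elbow plot over k).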

25
Q

Can BFR represent non spherical clusters?

A

No

26
Q

can CURE represent non spherical clusters?

A

Yes

27
Q

Does BFR discard outliers (points that do not belong to pre-existing clusters)?

A

No

28
Q

t-SNE advantages?

A

-can capture nonlinear relationships
-preserves local structure (and, to an extent, global structure) well for visualization
-effective in practice

29
Q

t-SNE disadvantages?

A

-only used for visualization
-parameters need to be set by hand (sensitive to initial conditions)
-slow for large datasets

30
Q

what does the k in UMAP do?

A

k - number of nearest neighbours
large k - preserves global structure
small k - preserves local structure

31
Q

UMAP advantages?

A

-control over local vs global preservation
-fast
-well separated clusters

32
Q

UMAP disadvantages?

A

-k should be optimized
-distance info still lost (only visualization)
-sensitive to optim. parameters and initial conditions