Clustering Flashcards

1
Q

What is the goal of clustering?

A

To try and capture which data points belong to which generating distribution

2
Q

(T/F) Clustering is a form of unsupervised learning

A

True

3
Q

What is the hypothesis behind clustering?

A

That there is a set of K generating distributions from which the data were created

4
Q

What is clustering sensitive to?

A

Noise and the scale of each feature; if one feature has a range of 0-100 and the others are between 0 and 1, the distance calculations will be dominated by the feature with the larger range

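For illustration, a minimal numpy sketch (with hypothetical two-feature data) of how an unscaled feature with a 0-100 range dominates Euclidean distances, and how standardizing each feature before clustering evens out the contributions:

```python
import numpy as np

# Hypothetical two-feature data: feature 0 spans roughly 0-100, feature 1 spans 0-1
X = np.array([[90.0, 0.2],
              [10.0, 0.9],
              [95.0, 0.1]])

# Before scaling, Euclidean distance is driven almost entirely by feature 0
print(np.linalg.norm(X[0] - X[1]))                # ~80.0

# Standardize each feature to zero mean and unit variance before clustering
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))  # both features now contribute comparably
```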
5
Q

What are the four different methods for clustering?

A

1) Partitioning
2) Hierarchical methods
3) Density-based methods
4) Grid-based methods

6
Q

What are the general characteristics of partitioning methods?

A
  • find mutually exclusive clusters of spherical shape
  • distance-based
  • may use mean or medoid to represent cluster center
  • effective for small- to medium-sized data sets
7
Q

What are the general characteristics of hierarchical methods?

A
  • clustering is a hierarchical decomposition
  • cannot correct erroneous merges or splits
  • may incorporate other techniques like microclustering or consider object “linkages”
8
Q

What are the general characteristics of density-based methods?

A
  • can find arbitrarily shaped clusters
  • clusters are dense regions of objects in space that are separated by low-density regions
  • cluster density: each point must have a minimum number of points within its “neighborhood”
  • may filter out outliers
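As an illustrative sketch (assuming scikit-learn is available; the eps and min_samples values are made up for this toy data), a density-based clusterer such as DBSCAN finds the two dense blobs and flags the isolated point as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.0, 0.2],
              [5.0, 5.0], [5.1, 5.1], [5.0, 5.2],
              [20.0, 20.0]])

# "Cluster density": a core point needs at least min_samples points
# (itself included) inside its eps-neighborhood
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # e.g. [ 0  0  0  1  1  1 -1]; label -1 marks the filtered-out outlier
```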
9
Q

What are the general characteristics of grid-based methods?

A
  • use a multiresolution grid data structure
  • fast processing time (typically independent of the number of data objects, yet dependent on grid size)
10
Q

What is the difference between extrinsic and intrinsic measures of cluster goodness?

A

Extrinsic measures assign a score to a clustering given the ground truth. Intrinsic measures are used when the ground truth is unavailable; they evaluate a clustering by examining how well the clusters are separated and how compact they are

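A small sketch of the two kinds of measures, assuming scikit-learn; the adjusted Rand index is used here as just one example of an extrinsic measure and the silhouette score as an intrinsic one:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
labels_pred = np.array([0, 0, 1, 1])   # the clustering being evaluated
labels_true = np.array([1, 1, 0, 0])   # ground truth, only needed for the extrinsic score

# Extrinsic: score the clustering against the ground truth
print(adjusted_rand_score(labels_true, labels_pred))   # 1.0, the groupings match exactly

# Intrinsic: no ground truth; judge separation and compactness from the data alone
print(silhouette_score(X, labels_pred))
```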
11
Q

What is another name for SSE?

A

Within-cluster variance

12
Q

What is the SSE?

A

The sum of squared errors: the sum of squared distances between each data point and the centroid of its cluster. It can be used to compare the quality of different clusterings with the same K

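A minimal numpy sketch of the SSE / within-cluster variance computation on toy data (the helper name sse is just for illustration):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances from each point to the centroid of its cluster."""
    return sum(np.sum((X[labels == k] - c) ** 2) for k, c in enumerate(centroids))

X = np.array([[0.0, 0.0], [0.0, 2.0], [5.0, 5.0], [5.0, 7.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

# Lower SSE = tighter clusters; only comparable across clusterings with the same K
print(sse(X, labels, centroids))   # 4.0
```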
13
Q

(T/F) The higher the SSE the higher quality the clustering

A

False; the LOWER the SSE, the higher the quality of the clustering

14
Q

What is a silhouette score?

A

An intrinsic measure of cluster goodness that rewards clusterings in which data points within a cluster are close together and far from points in other clusters

15
Q

What is the range of the silhouette score?

A

[-1, 1]

16
Q

What happens when the silhouette coefficient of an object o approaches 1?

A

The cluster containing o is compact and far from other clusters

17
Q

What happens if b(o) < a(o)?

A

Then o is, on average, closer to objects in another cluster than to objects in its own cluster

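A small numpy sketch of the silhouette coefficient for a single object o, using the a(o)/b(o) notation from these cards (toy data, no library silhouette function):

```python
import numpy as np

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(o):
    # a(o): mean distance from o to the other objects in its own cluster
    own = [i for i in range(len(X)) if labels[i] == labels[o] and i != o]
    a = np.mean([np.linalg.norm(X[o] - X[i]) for i in own])
    # b(o): smallest mean distance from o to the objects of any other cluster
    b = min(np.mean(np.linalg.norm(X[labels == k] - X[o], axis=1))
            for k in np.unique(labels) if k != labels[o])
    # s(o) approaches 1 when o's cluster is compact and far from the others;
    # it turns negative when b(o) < a(o), i.e. o sits closer to another cluster
    return (b - a) / max(a, b)

print(silhouette_of(0))   # ~0.87 for this toy data
```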
18
Q

What is the goal of K-means algorithm?

A

High intracluster similarity and low intercluster similarity

19
Q

What is iterative relocation?

A

The process of iteratively reassigning objects to clusters to improve the partitioning.

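A minimal numpy sketch of K-means-style iterative relocation on toy data (naive random initialization and no handling of empty clusters, so it is only illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # naive initialization
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Relocation step: move each center to the mean of its assigned objects
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stop once the partitioning is stable
            break
        centers = new_centers
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, k=2))
```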
20
Q

Why is the choice of K important?

A

Choosing the appropriate number of clusters controls the granularity of the cluster analysis. It also helps to find a good balance between compressibility and accuracy

21
Q

What is a simple method for choosing K?

A

Set the number of clusters to K = √(n/2), so that each cluster has roughly √(2n) points

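A quick worked example of the rule of thumb (n = 200 is arbitrary):

```python
import math

n = 200                        # number of data points (arbitrary)
k = round(math.sqrt(n / 2))    # K = sqrt(n/2) -> 10
print(k, n / k)                # each cluster then holds about sqrt(2n) = 20 points
```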
22
Q

What are the x and y axes of the elbow method?

A

x: number of clusters
y: avg within-cluster sum of squares

23
Q

What is the heuristic for the elbow method?

A

The turning point in the curve of the sum of within-cluster variances with respect to the number of clusters: choose the k at the elbow, the last value for which adding a cluster still produces a substantial decrease

24
Q

What is the elbow method based on?

A

The observation that increasing the number of clusters can help reduce the sum of the within-cluster variances

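A sketch of an elbow plot, assuming scikit-learn and matplotlib are available; "avg within-cluster sum of squares" is taken here as KMeans inertia divided by the number of clusters, which is one reading of the card's y-axis:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Three well-separated Gaussian blobs, so the elbow should appear around k = 3
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [4, 4], [8, 0])])

ks = range(1, 9)
avg_wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ / k
           for k in ks]

plt.plot(list(ks), avg_wss, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("avg within-cluster sum of squares")
plt.show()
```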
25
Q

What are three ways of choosing K?

A

1) The K = √(n/2) rule of thumb
2) elbow method
3) cross-validation