Question 1

What is the goal of clustering?

Accepted Answer

To group data points so that each group captures the points produced by the same generating distribution

Question 2

(T/F) Clustering is a form of unsupervised learning

Accepted Answer

True

Question 3

What is the hypothesis behind clustering?

Accepted Answer

That the data were generated by a set of K underlying distributions

Question 4

What is clustering sensitive to?

Accepted Answer

Noise and the scale of each feature; if one feature has a range of 0-100 and the others lie between 0 and 1, the distance calculations will be dominated by differences in the feature with the larger range
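The scale effect can be checked with a few lines of arithmetic. The sketch below uses made-up points and min-max scaling as one possible remedy; the helper names `euclidean` and `min_max_scale` are illustrative, not from the cards.

```python
import math

def euclidean(p, q):
    """Plain Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def min_max_scale(points):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# Hypothetical data: feature 0 spans 0-100, feature 1 spans 0-1.
points = [(0.0, 0.0), (100.0, 0.0), (10.0, 1.0)]

# Unscaled: the 0-100 feature decides almost everything.
raw_a = euclidean(points[0], points[1])  # full range apart in feature 0
raw_b = euclidean(points[0], points[2])  # full range apart in feature 1

# Scaled: both features contribute on the same footing.
scaled = min_max_scale(points)
scaled_a = euclidean(scaled[0], scaled[1])
scaled_b = euclidean(scaled[0], scaled[2])

print(raw_a, raw_b)        # unscaled distances differ by roughly a factor of 10
print(scaled_a, scaled_b)  # scaled distances are nearly equal
```

Before scaling, the third point looks about ten times closer to the first than the second does, even though it sits at the opposite extreme of the 0-1 feature.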

Question 5

What are the four different methods for clustering?

Accepted Answer

1) Partitioning methods
2) Hierarchical methods
3) Density-based methods
4) Grid-based methods

Question 6

What are the general characteristics of partitioning methods?

Accepted Answer

- find mutually exclusive clusters of spherical shape
- distance-based
- may use mean or medoid to represent cluster center
- effective for small- to medium-sized data sets
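As an illustration of a partitioning method that uses the mean as the cluster center, here is a minimal k-means-style sketch. The cards do not name a specific algorithm, so treating k-means as the example is an assumption, as are the function names, the sample points, and the fixed starting centers.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, centers, max_iter=100):
    """Alternate between assigning points to the nearest center and
    recomputing each center as the mean of its members."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_point(members)
    return labels, centers

# Hypothetical data: two well-separated, roughly round groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = k_means(points, centers=[points[0], points[3]])
print(labels)
print(centers)
```

Each point ends up in exactly one cluster (mutually exclusive), and assignment depends only on distance to the center, which is why the resulting clusters tend toward spherical shapes.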

Question 7

What are the general characteristics of hierarchical methods?

Accepted Answer

- clustering is a hierarchical decomposition
- cannot correct erroneous merges or splits
- may incorporate other techniques like microclustering or consider object "linkages"
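A small bottom-up (agglomerative) sketch makes the "cannot correct erroneous merges" point concrete: once two clusters are merged, nothing in the loop ever splits them again. Single linkage is used here as one possible linkage choice; the function names and sample points are assumptions for illustration.

```python
import math

def single_link_distance(a, b):
    """Smallest pairwise distance between two clusters (single linkage)."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, num_clusters):
    """Start with one cluster per point and repeatedly merge the two
    closest clusters until num_clusters remain. Merges are never undone."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda pair: single_link_distance(clusters[pair[0]],
                                                  clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = agglomerative(points, num_clusters=3)
print(result)
```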

Question 8

What are the general characteristics of density-based methods?

Accepted Answer

- can find arbitrarily shaped clusters
- clusters are dense regions of objects in space that are separated by low-density regions
- cluster density: each point must have a minimum number of points within its "neighborhood"
- may filter out outliers
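The neighborhood condition can be sketched as a simple count. The parameter names `eps` (neighborhood radius) and `min_pts` (minimum count) follow the DBSCAN convention, which is an assumption here since the cards do not name an algorithm; counting a point as its own neighbor is also a choice made for this sketch.

```python
import math

def neighbors(points, index, eps):
    """Indices of all points within distance eps of points[index],
    including the point itself."""
    return [j for j, q in enumerate(points)
            if math.dist(points[index], q) <= eps]

def core_points(points, eps, min_pts):
    """Indices of points whose neighborhood holds at least min_pts points."""
    return [i for i in range(len(points))
            if len(neighbors(points, i, eps)) >= min_pts]

# Hypothetical data: a dense square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
core = core_points(points, eps=1.5, min_pts=3)
outliers = [i for i in range(len(points)) if i not in core]
print(core)      # the four points of the dense square
print(outliers)  # the isolated point
```

A full density-based method would go on to connect neighboring core points into clusters; this sketch covers only the density test and the outlier filtering it enables.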

Question 9

What are the general characteristics of grid-based methods?

Accepted Answer

- use a multiresolution grid data structure
- fast processing time (typically independent of the number of data objects, yet dependent on grid size)
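The processing-time claim follows from the fact that, after one pass to bin the points, later steps look only at grid cells. Below is a minimal single-resolution sketch (a multiresolution structure would repeat this at several cell sizes); `cell_size`, the density threshold of 2, and the sample points are assumptions.

```python
from collections import Counter

def grid_cell_counts(points, cell_size):
    """Map each point to the grid cell containing it and count points per cell."""
    return Counter(
        tuple(int(v // cell_size) for v in p)
        for p in points
    )

# Hypothetical data in a 2-D space.
points = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 3.2), (7.5, 0.4)]
counts = grid_cell_counts(points, cell_size=1.0)

# Later steps iterate over cells, not over the original points.
dense_cells = [cell for cell, n in counts.items() if n >= 2]
print(counts)
print(dense_cells)
```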

Question 10

What is the difference between extrinsic and intrinsic measures of cluster goodness?

Accepted Answer

Extrinsic measures assign a score to a clustering by comparing it against a known ground truth. Intrinsic measures are used when the ground truth is unavailable; they evaluate a clustering by examining how well separated and how compact the clusters are
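To make the extrinsic side concrete, the sketch below computes purity, one common extrinsic measure: the fraction of points that carry the majority ground-truth label of their cluster. The cards do not name a specific extrinsic measure, so purity is only an example, and the labels are made up.

```python
from collections import Counter

def purity(cluster_labels, true_labels):
    """Fraction of points whose true label matches the majority
    true label of the cluster they were assigned to."""
    clusters = {}
    for c, t in zip(cluster_labels, true_labels):
        clusters.setdefault(c, []).append(t)
    majority_total = sum(
        Counter(members).most_common(1)[0][1]
        for members in clusters.values()
    )
    return majority_total / len(true_labels)

# Hypothetical clustering and ground truth for six points.
cluster_labels = [0, 0, 0, 1, 1, 1]
true_labels = ["a", "a", "b", "b", "b", "b"]
score = purity(cluster_labels, true_labels)
print(score)
```

This only works because `true_labels` is available; without it, an intrinsic measure has to be used instead.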

Question 11

What is another name for SSE?

Accepted Answer

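Within-cluster variation, often loosely called within-cluster variance (also known as the within-cluster sum of squares, WCSS)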

Question 12

What is the SSE?

Accepted Answer

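The sum of squared errors: the sum, over all points, of the squared distance between each point and the center of its assigned cluster. It can be used to compare the quality of different clusterings with the same K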
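A short sketch of the SSE computation, taking SSE to be the sum of squared distances from each point to its cluster's mean. It compares two clusterings of the same made-up points with the same K = 2; the function names and data are assumptions.

```python
def centroid(members):
    """Mean of a list of points, feature by feature."""
    return tuple(sum(c) / len(members) for c in zip(*members))

def sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    total = 0.0
    for members in groups.values():
        center = centroid(members)
        total += sum(
            sum((a - b) ** 2 for a, b in zip(p, center))
            for p in members
        )
    return total

# Hypothetical data: two tight pairs of points.
points = [(1.0, 1.0), (1.0, 3.0), (7.0, 7.0), (9.0, 7.0)]

sse_tight = sse(points, [0, 0, 1, 1])  # nearby points grouped together
sse_mixed = sse(points, [0, 1, 0, 1])  # distant points grouped together
print(sse_tight, sse_mixed)
```

The clustering that keeps nearby points together has the smaller SSE.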

Question 13

(T/F) The higher the SSE, the higher the quality of the clustering

Accepted Answer

False; for the same K, the LOWER the sum, the higher the quality of the clustering

Question 14

What is a silhouette score?

Accepted Answer

An intrinsic measure of cluster goodness that rewards clusterings in which each point is close to the other points in its own cluster and far from the points in other clusters

Question 15

What is the range of the silhouette score?

Accepted Answer

[-1, 1]
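The silhouette score can be sketched from its usual definition: for each point, let a be the mean distance to the other points in its own cluster and b the smallest mean distance to the points of any other cluster; the point's score is (b - a) / max(a, b), and the overall score is the mean over all points. The code below assumes Euclidean distance, at least two clusters, and a score of 0 for single-point clusters; the names and data are illustrative.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette over all points (requires at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Distance from p to itself is 0, so divide by the other members only.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight, far-apart pairs.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]

good = silhouette_score(points, [0, 0, 1, 1])  # compact, well separated
bad = silhouette_score(points, [0, 1, 0, 1])   # each cluster straddles both pairs
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; negative values suggest points sit closer to another cluster than to their own.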

clustering Flashcards

(25 cards)