Clustering Flashcards
What is the goal of clustering?
To try and capture which data points belong to which generating distribution
(T/F) Clustering is a form of unsupervised learning
True
What is the hypothesis behind clustering?
That there is a set of K generating distributions from which the data were created
What is clustering sensitive to?
Noise and the scale of each feature. If one feature ranges from 0 to 100 and the others from 0 to 1, distance calculations will be dominated by the feature with the larger range
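A minimal sketch of the scale effect described above, assuming Euclidean distance and min-max scaling (the card names neither; both are illustrative choices, as are the sample points and helper names):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical data: feature 0 spans roughly 0-100, feature 1 spans 0-1.
points = [[10.0, 0.1], [90.0, 0.2], [12.0, 0.9]]

# Before scaling, feature 0 decides almost everything.
raw_d01 = euclidean(points[0], points[1])
raw_d02 = euclidean(points[0], points[2])

# After scaling, both features contribute on a comparable footing.
scaled = min_max_scale(points)
scaled_d01 = euclidean(scaled[0], scaled[1])
scaled_d02 = euclidean(scaled[0], scaled[2])
```

On the raw data, point 2 looks far closer to point 0 than point 1 does, even though it differs strongly in feature 1; after scaling the two distances are nearly equal.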
What are the four main categories of clustering methods?
1) Partitioning
2) Hierarchical methods
3) Density-based methods
4) Grid-based methods
What are the general characteristics of partitioning methods?
- find mutually exclusive clusters of spherical shape
- distance-based
- may use mean or medoid to represent cluster center
- effective for small- to medium-sized data sets
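The characteristics above can be sketched with k-means, one common partitioning method that uses the mean as the cluster center. The card does not name a specific algorithm, so this choice, the fixed initial centers, and the sample points are all assumptions for illustration:

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

def kmeans(points, centers, iterations=10):
    """Alternate between assigning points and recomputing means."""
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        centers = new_centers
    return centers, assign(points, centers)

# Hypothetical data: two well-separated groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
# Assumption: start from two hand-picked points instead of a random init.
centers, labels = kmeans(points, centers=[[1, 1], [8, 8]])
```

Each point ends up in exactly one cluster (mutually exclusive), and every assignment is decided purely by distance to a center.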
What are the general characteristics of hierarchical methods?
- clustering is a hierarchical decomposition
- cannot correct erroneous merges or splits
- may incorporate other techniques like microclustering or consider object “linkages”
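A minimal sketch of a bottom-up (agglomerative) hierarchical method, assuming single linkage (cluster distance = distance between the closest pair of points). The linkage choice, function names, and 1-D sample data are assumptions, not something the card specifies:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points, target_clusters):
    """Repeatedly merge the two closest clusters until target_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point
    merges = []                                   # record of the hierarchy
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]   # a merge is never revisited
        del clusters[b]
    return clusters, merges

# Hypothetical 1-D data with two obvious groups.
points = [[0.0], [1.0], [1.5], [10.0], [10.4]]
clusters, merges = single_linkage(points, target_clusters=2)
```

Once two clusters are merged, the loop never splits them again, which is why an erroneous merge cannot be corrected later.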
What are the general characteristics of density-based methods?
- can find arbitrarily shaped clusters
- clusters are dense regions of objects in space that are separated by low-density regions
- cluster density: each point must have a minimum number of points within its “neighborhood”
- may filter out outliers
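The minimum-points-in-a-neighborhood rule above can be sketched as a core-point test. This is only that one step, not a full density-based algorithm; the parameter names `eps` and `min_pts` and the sample data are assumptions:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def core_points(points, eps, min_pts):
    """Flag points that have at least min_pts points within distance eps.

    Assumption: a point counts as a member of its own neighborhood.
    """
    flags = []
    for p in points:
        neighbors = sum(1 for q in points if dist(p, q) <= eps)
        flags.append(neighbors >= min_pts)
    return flags

# Hypothetical data: a dense square of four points plus one far-away point.
points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
flags = core_points(points, eps=1.5, min_pts=3)
```

The isolated point fails the test and is not within `eps` of any point that passes it, so a density-based method could treat it as an outlier.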
What are the general characteristics of grid-based methods?
- use a multiresolution grid data structure
- fast processing time (typically independent of the number of data objects, yet dependent on grid size)
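A minimal sketch of the grid idea, assuming a single fixed-resolution grid (real grid-based methods may use several resolutions). The cell size, density threshold, and sample points are hypothetical:

```python
from collections import Counter

def grid_counts(points, cell_size):
    """Map each point to its grid cell and count the points per cell."""
    return Counter(tuple(int(c // cell_size) for c in p) for p in points)

# Hypothetical 2-D data.
points = [[0.2, 0.3], [0.8, 0.1], [0.5, 0.9],
          [5.1, 5.2], [5.7, 5.9],
          [9.5, 0.5]]

counts = grid_counts(points, cell_size=1.0)

# Assumption: a cell is "dense" when it holds at least 2 points.
dense_cells = {cell for cell, n in counts.items() if n >= 2}
```

After the single pass that fills `counts`, later steps work on the cells only, so their cost depends on the number of occupied cells, not on the number of points.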
What is the difference between extrinsic and intrinsic measures of cluster goodness?
Extrinsic measures assign a score to a clustering by comparing it against the ground truth. Intrinsic measures are used when the ground truth is unavailable; they evaluate a clustering by examining how well separated and how compact the clusters are
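As one illustration of an extrinsic measure, here is a sketch of purity: the fraction of points that belong to the majority ground-truth class of their cluster. The card does not name a particular measure, so purity is an assumed example, and the labels are made up:

```python
from collections import Counter

def purity(predicted, truth):
    """Fraction of points matching the majority true class of their cluster."""
    total = 0
    for cluster in set(predicted):
        classes = [t for p, t in zip(predicted, truth) if p == cluster]
        total += Counter(classes).most_common(1)[0][1]  # size of the majority class
    return total / len(truth)

# Hypothetical cluster assignments and ground-truth classes.
predicted = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]

score = purity(predicted, truth)
```

Here cluster 0 contributes 2 matching points and cluster 1 contributes 3, giving a purity of 5/6. Note that the function cannot be called at all without `truth`, which is what makes it extrinsic.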
What is another name for SSE?
Within-cluster variation (also called the within-cluster sum of squares)
What is the SSE?
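The sum of squared errors: the sum, over all points, of the squared distance from each point to the center of its cluster. It can be used to compare the quality of different clusterings with the same K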
(T/F) The higher the SSE the higher quality the clustering
False; the LOWER the SSE, the higher the quality of the clustering (for the same K)
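A minimal sketch of an SSE calculation, assuming each cluster center is the mean of its members and distance is Euclidean. The sample points and both labelings are hypothetical:

```python
def sse(points, labels):
    """Sum of squared distances from each point to its cluster mean."""
    total = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        center = [sum(col) / len(members) for col in zip(*members)]
        total += sum(sum((x - c) ** 2 for x, c in zip(p, center))
                     for p in members)
    return total

# Hypothetical data: two well-separated groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]

good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups, same K = 2

good_sse = sse(points, good_labels)
bad_sse = sse(points, bad_labels)
```

Both labelings use K = 2, so the comparison is fair, and the labeling that respects the two groups has the far lower SSE.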
What is a silhouette score?
An intrinsic measure of cluster goodness that rewards data points for being close to the other points in their own cluster and far from the points in other clusters
What is the range of the silhouette score?
[-1, 1]
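A sketch of the usual silhouette calculation, under these assumptions: for each point, a is its mean distance to the other points in its own cluster, b is the lowest mean distance to the points of any other cluster, and the score is (b - a) / max(a, b), averaged over all points. Euclidean distance, a score of 0 for singleton clusters, and the sample data are further assumptions; the function expects at least two clusters.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette score over all points (requires at least two clusters)."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # assumed convention for a singleton cluster
            continue
        a = sum(own) / len(own)  # cohesion: mean distance within own cluster
        b = min(                 # separation: nearest other cluster, on average
            sum(dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two well-separated groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]

good = silhouette(points, [0, 0, 0, 1, 1, 1])
bad = silhouette(points, [0, 1, 0, 1, 0, 1])
```

Scores near 1 indicate compact, well-separated clusters; scores near -1 indicate points that sit closer to another cluster than to their own.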