Clustering Flashcards
What is clustering?
process of grouping a set of data objects into multiple groups or clusters. Objects within a cluster have high similarity but very disimilar to objects in other cluster.
What is clustering or cluster analysis?
process of partitioning a set of data objects (or observations) into subsets where each subste is a cluster
What is the other name for clustering?
data segmentation
Is clustering supervised or unsupervised learning? Why?
unsupervised, because class label information is not present.
Is clustering learning by observation or by examples?
Learning by observation
What are the 8 requirements for clustering in data mining?
Scalability, Ability to deal with multitype attributes, discover clusters with a different shape, domain knowledge to determine input parameters, deal with noisy data, and incremental clustering and insensitivity to input order.
What are the two main distances that clustering determines? What type of shapes do they usually identify?
Euclidean or Manhattan distance measures. spherical clusters with similar size and density.
What general input do clustering algorithms need from user?
desired number of clusters.
Are clusters sensitivie or insensitive to noisy data generally?
Sensitive
What happens to clusters with incremental updates? What other effect does it have?
They usually have to recompute clusters from scratch. If data order is changed clusters may be completly different.
What are the name of algorithms wich can take incremental updates?
Incremental clustering algorithms
Clustering methods can be compared with what orthogonal aspects?
Paritioning criteria, separation of clusters, similarity measure and clustering space
What two types of partitioning exist?
Hierarchy and non-Hierarchy partitions
What to types of separation of clusters exist?
Mutually exclusive (only belong to one group) and non-exclusive (can belong to two or more).
How can the distance be defined in terms of measuring similarity of two objects?
Euclidean space, vector space, or any other space
What to types of measures exist for similarity
distance based methods and density- and continuity - based methods
Do basic partitioning methods adobe exclusive or non exclusive cluster separation?
exclusive
are partitioning methods usually distance or density based?
distance based
When do heurestic clustering methods work well? What type of shpaes?
spherical shape clusters in small to medium size databases
What are two popular heurestic methods?
k-means and k-medioids
What are the two classifications of hierarchical methods?
agglomerative and divisive
What does the aggloerative approach in hierarchical methods consist off?
Bootom-up approach, succesively merges objects that are close together until all groups are merged into one
What does the divisive approach in hierarchical methods consist off?
top-down approach, with each iteration they split whole data into smaller clusters
Hierrarchical methods can be based on what two methods
distance or density and continuity
What is the main disadvatage of hierrarchical methods?
once merge or split is done it cannot be undone therefore cannot correct erroneous desicions
What do density based methods consist of?
continues growing clusters as long as density in the neighborhood exheeds a certain threshold
What are density based methods good for?
detecting outliers and to discover clusters of arbitrary shape
What do grid based methods consist of?
quantize the object space into a finite number of cells that form a grid structure
What are some advantages of grid baesd-methods ?
fast processing time and possible integratino with other clustering methods such as density based methods and hierarchical methods
What are the three main tasks when evaluating clustering?
assesing cluster tendency, determine number of clusters in a dataset, measuring the quality of the clustering
What does the clustering tendecy asses?
determines if there is a non-random structure which may lead to meaningful clusters.
Clustering requieres what type of distributed data?
non uniform distribution of data
What is the hopkins statistic ?
spatial statistic that tests the spatial randomness of a variable as distributed in a space
What are the homogeneous and non-homogeneous hypothesis in evaluating cluster tendency?
that D is uniformly (not meaningufl) or non uniformily (meaningful clusters) respectiveley distributed
Why is it important to determine the right number of cluters in a data set?
1 - INPUT parameter for some algorithms
2 - controls proper granularity of cluster analysis
The right number of clusters for a data set often depends on?
distribution shape and scale in the data set as well as clustering resolution needed by the user
What is the popular number of clusters used? How many points would that cluster have?
sqrt(n/2) where n is the # of objects, it would have sqrt(2n) points
What is the elbow method based on?
observation that increasing the number of clusters helps reduce the sum of within-cluster variance of each cluster
What is the heurestic for selecting the appropiate number of k ?
Selecting the turning point in the curve of the sum of within-cluster variance with respect to number of clusters
What happens in the elbow method to the sum of within-cluster variance when increasing k?
the effect of reducing the sum of within cluster variance may drop
What other method is appropiate for determining the appropiate numbeer of clusters (not sqrt())?
cross validation
What is cross validation?
building cluster with n-1 dataset objects and using remaining to test quality of clustering by calculating within-clustering variance of test points to centroids.
What method do we use to measure clusteirng quality if we have ground truth? What if we dont?
Extrinsic method, Intrinsic method
What do extrinsic method consist of?
comparing clustering against the group truth and measure
What do intrinsic method consist of?
eavluate goodnes of clustering by considering how well the clusters are separated
The clustering quality, Q is effective if it satisfies what four qualityes?
Cluster homogeneity, cluster completeness, rag bag, and small cluster preservation
What does cluster homogeneity consist of in clustering quality?
That the more pure the clusters in a clustering are the better (using ground truth)
What does cluster completeness requires in clustering quality?
requires that a clustering should assign objects belonging to the same category (according to ground truth) to the same cluster
What does the small cluster preservation states?
splitting a small category into pieces is more harmful than splitting a large category into pieces.
What clustering quality measures satisfy all four requirements? Name 2.
BCubed precision and recall
What does BCubed evaluates?
the precisionand recall for every object in a clustering on a given data set according to ground truth
What is precision in clustering?
how many other objects in the same cluster belong to the same category as the object
What is recall in clustering?
The recall of an object reflects how many objects of the same category are assigned to the same cluster
In the silhouette coefficient what are a(o) and b(o)?
a(o) is the average distance between o and all other objects in the cluster to which o belongs
b(o) is the minimum average distance from o to all clusters to which o does not belong
What is a method of evaluation of intriscic methods? how do the value ranges? what do the value tells you
silhouette coefficient and ranges from -1 and 1 where a negative value means the point is closer to a point in another cluster and positive means it is compact (good)
What is the difference in grouping in classification and clustering?
classification groups data points with respect to a target and clustering with respect to a similarity metric
Cluster labels may be
existing feature included in the clustering, existing feature not included (target), latent attribute you dont have acccess to
What is the within cluster variance?
sum of squared error between data points ant respective cluster center
What are the three main steps in k-mean algorithm?
- randomly chooses k data points to serve as initial centroids
- runs until a max iteration or when there is no change in cluster asignments
- returns cluster membership of all data points
What is the elbow method?
choos a range of k, run for every k, calculate SSE, calculate change in slope between consecutive sums, choose the k where the largest difference in slope is calculated