Clustering Flashcards

1
Q

What is clustering?

A

process of grouping a set of data objects into multiple groups or clusters; objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.

2
Q

What is clustering or cluster analysis?

A

process of partitioning a set of data objects (or observations) into subsets, where each subset is a cluster

3
Q

What is the other name for clustering?

A

data segmentation

4
Q

Is clustering supervised or unsupervised learning? Why?

A

unsupervised, because class label information is not present.

5
Q

Is clustering learning by observation or by examples?

A

Learning by observation

6
Q

What are the 8 requirements for clustering in data mining?

A

Scalability; ability to deal with different types of attributes; discovery of clusters with arbitrary shape; requirements for domain knowledge to determine input parameters; ability to deal with noisy data; incremental clustering and insensitivity to input order.

7
Q

What are the two main distance measures used in clustering? What type of shapes do they usually identify?

A

Euclidean and Manhattan distance measures; methods based on them tend to find spherical clusters of similar size and density.
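
As a minimal illustration (the point values below are hypothetical, not from the deck), the two measures can be computed like this:

import math

def euclidean(p, q):
    # straight-line distance: square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # city-block distance: sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0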

8
Q

What general input do clustering algorithms need from the user?

A

desired number of clusters.

9
Q

Are clustering methods generally sensitive or insensitive to noisy data?

A

Sensitive

10
Q

What happens to clusters with incremental updates? What other effect does input order have?

A

Many algorithms must recompute the clustering from scratch. If the input order is changed, the resulting clusters may be completely different.

11
Q

What is the name for algorithms which can handle incremental updates?

A

Incremental clustering algorithms

12
Q

Clustering methods can be compared along which orthogonal aspects?

A

Partitioning criteria, separation of clusters, similarity measure, and clustering space

13
Q

What two types of partitioning exist?

A

Hierarchical and non-hierarchical partitioning

14
Q

What two types of separation of clusters exist?

A

Mutually exclusive (only belong to one group) and non-exclusive (can belong to two or more).

15
Q

In what kinds of spaces can the distance used to measure the similarity of two objects be defined?

A

Euclidean space, vector space, or any other space

16
Q

What two types of measures exist for similarity?

A

distance-based methods, and density- and continuity-based methods

17
Q

Do basic partitioning methods adopt exclusive or non-exclusive cluster separation?

A

exclusive

18
Q

Are partitioning methods usually distance-based or density-based?

A

distance based

19
Q

When do heuristic clustering methods work well? What type of shapes?

A

spherical-shaped clusters in small- to medium-sized databases

20
Q

What are two popular heuristic methods?

A

k-means and k-medoids

21
Q

What are the two classifications of hierarchical methods?

A

agglomerative and divisive

22
Q

What does the agglomerative approach in hierarchical methods consist of?

A

Bottom-up approach; it successively merges objects or groups that are close to one another until all the groups are merged into one.

23
Q

What does the divisive approach in hierarchical methods consist of?

A

Top-down approach; starting from the whole data set, each iteration splits clusters into smaller clusters.

24
Q

Hierarchical methods can be based on which two kinds of measures?

A

Distance-based, or density- and continuity-based

25
What is the main disadvantage of hierarchical methods?
Once a merge or split is done it cannot be undone, so the method cannot correct erroneous decisions.
26
What do density-based methods consist of?
They continue growing a cluster as long as the density in its neighborhood exceeds a certain threshold.
27
What are density-based methods good for?
Detecting outliers and discovering clusters of arbitrary shape.
28
What do grid-based methods consist of?
quantize the object space into a finite number of cells that form a grid structure
29
What are some advantages of grid-based methods?
Fast processing time and possible integration with other clustering methods, such as density-based and hierarchical methods.
30
What are the three main tasks when evaluating clustering?
Assessing clustering tendency, determining the number of clusters in a data set, and measuring the quality of the clustering.
31
What does clustering tendency assess?
Whether there is a non-random structure in the data that may lead to meaningful clusters.
32
Clustering requires what type of data distribution?
A non-uniform distribution of data.
33
What is the Hopkins statistic?
A spatial statistic that tests the spatial randomness of a variable as it is distributed in a space.
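
A rough Python sketch of the idea (my own, not from the deck), assuming the common convention H = sum(u_i) / (sum(u_i) + sum(w_i)); some texts swap the terms, in which case clustered data gives values near 0 rather than near 1. D is assumed to be a 2-D NumPy array of points.

import numpy as np

def hopkins(D, m=50, rng=np.random.default_rng(0)):
    n, d = D.shape
    m = min(m, n - 1)
    # w_i: distance from a sampled real point to its nearest other point in D
    w = []
    for i in rng.choice(n, size=m, replace=False):
        dists = np.linalg.norm(D - D[i], axis=1)
        dists[i] = np.inf                      # exclude the point itself
        w.append(dists.min())
    # u_i: distance from a uniform random point in D's bounding box to its nearest point in D
    U = rng.uniform(D.min(axis=0), D.max(axis=0), size=(m, d))
    u = [np.linalg.norm(D - q, axis=1).min() for q in U]
    return sum(u) / (sum(u) + sum(w))          # ~0.5 suggests uniform data; near 1 suggests cluster structure
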
34
What are the homogeneous and non-homogeneous hypotheses in evaluating clustering tendency?
That D is uniformly distributed (no meaningful clusters) or non-uniformly distributed (meaningful clusters), respectively.
35
Why is it important to determine the right number of clusters in a data set?
1 - It is an input parameter for some algorithms | 2 - It controls the proper granularity of the cluster analysis
36
The right number of clusters for a data set often depends on?
The shape and scale of the distribution in the data set, as well as the clustering resolution required by the user.
37
What is a popular rule of thumb for the number of clusters? How many points would each cluster have?
sqrt(n/2), where n is the number of objects; each cluster would then have about sqrt(2n) points.
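
For instance (an illustrative n, not from the deck), a data set of n = 200 objects would give roughly sqrt(200/2) = 10 clusters of about sqrt(2 * 200) = 20 points each.
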
38
What is the elbow method based on?
observation that increasing the number of clusters helps reduce the sum of within-cluster variance of each cluster
39
What is the heuristic for selecting an appropriate value of k?
Selecting the turning point in the curve of the sum of within-cluster variance plotted against the number of clusters.
40
What happens in the elbow method to the sum of within-cluster variance when increasing k?
Beyond a certain point, the marginal effect of each additional cluster on reducing the sum of within-cluster variance drops.
41
What other method is appropriate for determining the number of clusters (other than the sqrt(n/2) rule)?
cross validation
42
What is cross validation?
Building the clustering with part of the data set (e.g., n − 1 of the objects) and using the remaining objects to test the quality of the clustering by calculating the within-cluster variance of the test points with respect to the centroids.
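
A hedged sketch of this idea, assuming k-means as the clustering algorithm and scikit-learn (neither is prescribed by the card), with k-fold splits of a NumPy array X and held-out points scored by squared distance to the nearest fitted centroid:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

def cv_cluster_score(X, k, n_splits=5):
    # lower average held-out SSE suggests a better-fitting number of clusters k
    scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        # squared distance from each held-out point to its nearest centroid
        d = np.linalg.norm(X[test_idx, None, :] - km.cluster_centers_[None, :, :], axis=2)
        scores.append((d.min(axis=1) ** 2).sum())
    return np.mean(scores)   # compare this value across candidate values of k
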
43
What method do we use to measure clustering quality if we have ground truth? What if we don't?
Extrinsic methods; intrinsic methods.
44
What do extrinsic methods consist of?
Comparing the clustering against the ground truth and measuring how well they match.
45
What do intrinsic methods consist of?
Evaluating the goodness of a clustering by considering how well the clusters are separated.
46
A clustering quality measure Q is effective if it satisfies which four criteria?
Cluster homogeneity, cluster completeness, rag bag, and small cluster preservation
47
What does cluster homogeneity consist of in clustering quality?
That the purer the clusters in a clustering are, the better (with respect to the ground truth).
48
What does cluster completeness require in clustering quality?
That a clustering should assign objects belonging to the same category (according to the ground truth) to the same cluster.
49
What does small cluster preservation state?
That splitting a small category into pieces is more harmful than splitting a large category into pieces.
50
What clustering quality measures satisfy all four requirements? Name 2.
BCubed precision and recall
51
What does BCubed evaluate?
The precision and recall for every object in a clustering of a given data set, according to the ground truth.
52
What is precision in clustering?
how many other objects in the same cluster belong to the same category as the object
53
What is recall in clustering?
The recall of an object reflects how many objects of the same category are assigned to the same cluster
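
A minimal sketch of the per-object computation (the dict-based layout and names are hypothetical, not from the deck): clusters maps each object id to its cluster label and categories maps each object id to its ground-truth category.

def bcubed(clusters, categories):
    objs = list(clusters)
    precisions, recalls = [], []
    for o in objs:
        same_cluster = [x for x in objs if clusters[x] == clusters[o]]
        same_category = [x for x in objs if categories[x] == categories[o]]
        both = [x for x in same_cluster if categories[x] == categories[o]]
        precisions.append(len(both) / len(same_cluster))
        recalls.append(len(both) / len(same_category))
    # overall BCubed precision and recall are the averages over all objects
    return sum(precisions) / len(objs), sum(recalls) / len(objs)
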
54
In the silhouette coefficient, what are a(o) and b(o)?
a(o) is the average distance between o and all other objects in the cluster to which o belongs; b(o) is the minimum average distance from o to all clusters to which o does not belong.
55
What is a common measure used by intrinsic evaluation methods? What is its value range? What do the values tell you?
The silhouette coefficient; it ranges from -1 to 1, where a negative value means the object is on average closer to objects in another cluster than to its own, and a positive value means it lies in a compact, well-separated cluster (good).
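
Putting the two cards together, the silhouette of an object o is s(o) = (b(o) - a(o)) / max(a(o), b(o)). A minimal sketch, assuming Euclidean distance, NumPy arrays of points, and that o's own cluster has at least two members (the function and argument names are mine):

import numpy as np

def silhouette(o, own_cluster, other_clusters):
    # a(o): average distance from o to the other members of its own cluster
    a = np.mean([np.linalg.norm(o - p) for p in own_cluster if not np.array_equal(o, p)])
    # b(o): smallest average distance from o to any cluster it does not belong to
    b = min(np.mean([np.linalg.norm(o - q) for q in c]) for c in other_clusters)
    return (b - a) / max(a, b)   # lies in [-1, 1]; closer to 1 is better
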
56
What is the difference in grouping in classification and clustering?
classification groups data points with respect to a target and clustering with respect to a similarity metric
57
Cluster labels may be
An existing feature included in the clustering, an existing feature not included in it (a target), or a latent attribute you don't have access to.
58
What is the within-cluster variance?
The sum of squared errors between data points and their respective cluster centers.
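
In symbols (a standard form, not spelled out on the card): SSE = sum over clusters C_i of the sum over points p in C_i of ||p - c_i||^2, where c_i is the centroid of C_i. A minimal sketch, assuming each cluster is given as a NumPy array of its points:

import numpy as np

def within_cluster_sse(clusters):
    # clusters: list of arrays, one array of points per cluster (hypothetical layout)
    sse = 0.0
    for pts in clusters:
        centroid = pts.mean(axis=0)
        sse += ((pts - centroid) ** 2).sum()
    return sse
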
59
What are the three main steps in the k-means algorithm?
1. Randomly choose k data points to serve as the initial centroids.
2. Run until a maximum number of iterations is reached or there is no change in cluster assignments.
3. Return the cluster membership of all data points.
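
A compact sketch of those three steps (the parameter names and NumPy implementation details are mine, not the deck's); X is assumed to be a 2-D NumPy array of points:

import numpy as np

def k_means(X, k, max_iter=100, rng=np.random.default_rng(0)):
    # 1. randomly choose k data points to serve as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # 2. stop at max_iter or when cluster assignments no longer change
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # recompute each centroid as the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    # 3. return the cluster membership of all data points
    return labels, centroids
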
60
What is the elbow method?
Choose a range of k, run the clustering for every k, calculate the SSE, calculate the change in slope between consecutive sums, and choose the k where the largest change in slope occurs.
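
A hedged sketch of that procedure, assuming scikit-learn's KMeans (its inertia_ attribute is the within-cluster SSE); the card does not prescribe a particular library or range of k:

import numpy as np
from sklearn.cluster import KMeans

def elbow_k(X, k_range=range(1, 11)):
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in k_range]
    slopes = np.diff(sse)                 # change in SSE between consecutive values of k
    slope_changes = np.diff(slopes)       # change in slope between consecutive segments
    # the turning point: the k where the slope changes the most
    return list(k_range)[int(np.argmax(slope_changes)) + 1], sse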