Clustering Flashcards

1
Q

What is clustering?

A

process of grouping a set of data objects into multiple groups or clusters, such that objects within a cluster have high similarity to one another but are very dissimilar to objects in other clusters.

2
Q

What is clustering or cluster analysis?

A

process of partitioning a set of data objects (or observations) into subsets, where each subset is a cluster

3
Q

What is the other name for clustering?

A

data segmentation

4
Q

Is clustering supervised or unsupervised learning? Why?

A

unsupervised, because class label information is not present.

5
Q

Is clustering learning by observation or by examples?

A

Learning by observation

6
Q

What are the main requirements for clustering in data mining?

A

Scalability, ability to deal with different types of attributes, discovery of clusters with arbitrary shape, domain knowledge to determine input parameters, ability to deal with noisy data, incremental clustering, and insensitivity to input order.

7
Q

What are the two main distance measures used in clustering? What type of clusters do they usually identify?

A

Euclidean or Manhattan distance measures. They tend to find spherical clusters with similar size and density.
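
Both measures can be computed in a few lines of plain Python (a minimal sketch; the example coordinates are made up for illustration):

```python
import math

def euclidean(p, q):
    # straight-line distance: square root of the summed squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # city-block distance: summed absolute differences per dimension
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```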

8
Q

What general input do clustering algorithms need from the user?

A

desired number of clusters.

9
Q

Are clustering methods generally sensitive or insensitive to noisy data?

A

Sensitive

10
Q

What happens to clusters with incremental updates? What effect does input order have?

A

They usually have to be recomputed from scratch. If the order of the input data is changed, the resulting clusters may be completely different.

11
Q

What is the name for algorithms which can handle incremental updates?

A

Incremental clustering algorithms

12
Q

Clustering methods can be compared along what orthogonal aspects?

A

Partitioning criteria, separation of clusters, similarity measure, and clustering space

13
Q

What two types of partitioning exist?

A

Hierarchical and non-hierarchical partitioning

14
Q

What two types of separation of clusters exist?

A

Mutually exclusive (each object belongs to only one group) and non-exclusive (an object can belong to two or more groups).

15
Q

How can the distance used to measure the similarity of two objects be defined?

A

Euclidean space, vector space, or any other space

16
Q

What two types of measures exist for similarity?

A

distance-based methods and density- and continuity-based methods

17
Q

Do basic partitioning methods adopt exclusive or non-exclusive cluster separation?

A

exclusive

18
Q

Are partitioning methods usually distance-based or density-based?

A

distance based

19
Q

When do heuristic clustering methods work well? For what type of shapes?

A

spherical-shaped clusters in small- to medium-size databases

20
Q

What are two popular heuristic methods?

A

k-means and k-medoids

21
Q

What are the two classifications of hierarchical methods?

A

agglomerative and divisive

22
Q

What does the agglomerative approach in hierarchical methods consist of?

A

Bottom-up approach; it successively merges objects or groups that are close together until all groups are merged into one

23
Q

What does the divisive approach in hierarchical methods consist of?

A

top-down approach; with each iteration it splits the data into smaller and smaller clusters

24
Q

Hierarchical methods can be based on what two approaches?

A

distance, or density and continuity

25
Q

What is the main disadvantage of hierarchical methods?

A

once a merge or split is done it cannot be undone, so erroneous decisions cannot be corrected

26
Q

What do density-based methods consist of?

A

they continue growing a cluster as long as the density in its neighborhood exceeds a certain threshold

27
Q

What are density-based methods good for?

A

detecting outliers and to discover clusters of arbitrary shape

28
Q

What do grid-based methods consist of?

A

quantize the object space into a finite number of cells that form a grid structure

29
Q

What are some advantages of grid-based methods?

A

fast processing time and possible integration with other clustering methods, such as density-based methods and hierarchical methods

30
Q

What are the three main tasks when evaluating clustering?

A

assessing cluster tendency, determining the number of clusters in a data set, and measuring the quality of the clustering

31
Q

What does assessing clustering tendency determine?

A

it determines whether there is a non-random structure in the data, which may lead to meaningful clusters.

32
Q

Clustering requires what type of data distribution?

A

a non-uniform distribution of data

33
Q

What is the Hopkins statistic?

A

a spatial statistic that tests the spatial randomness of a variable as it is distributed in a space
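
A rough sketch of one common formulation, H = Σu / (Σu + Σw), where u are nearest-neighbor distances from uniform random points to the data and w are nearest-neighbor distances within the data; values near 0.5 suggest uniformly random data and values near 1 suggest clustered data. The sample size m and the two-blob test data are illustrative assumptions:

```python
import random

def hopkins(data, m=None, seed=0):
    # H near 0.5 -> uniform (random) data; H near 1 -> clustered data
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    m = m or max(1, n // 10)
    lo = [min(p[j] for p in data) for j in range(d)]
    hi = [max(p[j] for p in data) for j in range(d)]

    def nearest(point, exclude=None):
        # Euclidean distance to the nearest data point (optionally skipping one)
        return min(sum((a - b) ** 2 for a, b in zip(point, q)) ** 0.5
                   for q in data if q is not exclude)

    # u: distances from m uniform random points to their nearest data point
    u = [nearest([rng.uniform(lo[j], hi[j]) for j in range(d)]) for _ in range(m)]
    # w: distances from m sampled data points to their nearest other data point
    w = [nearest(p, exclude=p) for p in rng.sample(data, m)]
    return sum(u) / (sum(u) + sum(w))

# two tight, well-separated blobs -> H should be close to 1
blobs = [[0, 0], [0.1, 0], [0, 0.1], [10, 10], [10.1, 10], [10, 10.1]]
print(hopkins(blobs, m=3))
```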

34
Q

What are the homogeneous and non-homogeneous hypotheses in evaluating cluster tendency?

A

that D is uniformly distributed (no meaningful clusters) or non-uniformly distributed (meaningful clusters), respectively

35
Q

Why is it important to determine the right number of clusters in a data set?

A

1 - it is an input parameter for some algorithms

2 - it controls the proper granularity of cluster analysis

36
Q

The right number of clusters for a data set often depends on what?

A

the shape and scale of the distribution in the data set, as well as the clustering resolution required by the user

37
Q

What is a popular rule of thumb for the number of clusters? How many points would each cluster have?

A

sqrt(n/2), where n is the # of objects; each cluster would then have about sqrt(2n) points
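
The rule of thumb is easy to check numerically (the value of n below is hypothetical, just for illustration):

```python
import math

n = 200                                # hypothetical number of objects
k = math.sqrt(n / 2)                   # rule of thumb: k = sqrt(n/2) clusters
points_per_cluster = math.sqrt(2 * n)  # each cluster then holds sqrt(2n) points
print(k, points_per_cluster)           # 10.0 20.0
# consistency check: k clusters of sqrt(2n) points cover all n objects
print(k * points_per_cluster)          # 200.0
```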

38
Q

What is the elbow method based on?

A

observation that increasing the number of clusters helps reduce the sum of within-cluster variance of each cluster

39
Q

What is the heuristic for selecting the appropriate value of k?

A

Selecting the turning point in the curve of the sum of within-cluster variance plotted against the number of clusters

40
Q

In the elbow method, what happens to the sum of within-cluster variance as k increases?

A

past the turning point, the effect of each additional cluster on reducing the sum of within-cluster variance may drop

41
Q

What other method is appropriate for determining the number of clusters (besides sqrt(n/2))?

A

cross validation

42
Q

What is cross validation?

A

building a clustering with n-1 of the data set's objects and using the remaining object to test the quality of the clustering, by calculating the within-cluster variance of the test points to the centroids.

43
Q

What method do we use to measure clustering quality if we have ground truth? What if we don't?

A

Extrinsic method, Intrinsic method

44
Q

What do extrinsic methods consist of?

A

comparing the clustering against the ground truth and measuring the difference

45
Q

What do intrinsic methods consist of?

A

evaluating the goodness of a clustering by considering how well the clusters are separated

46
Q

The clustering quality measure Q is effective if it satisfies what four criteria?

A

Cluster homogeneity, cluster completeness, rag bag, and small cluster preservation

47
Q

What does cluster homogeneity consist of in clustering quality?

A

That the purer the clusters in a clustering are, the better (measured using ground truth)

48
Q

What does cluster completeness require in clustering quality?

A

requires that a clustering should assign objects belonging to the same category (according to ground truth) to the same cluster

49
Q

What does small cluster preservation state?

A

splitting a small category into pieces is more harmful than splitting a large category into pieces.

50
Q

What clustering quality measures satisfy all four requirements? Name 2.

A

BCubed precision and recall

51
Q

What does BCubed evaluate?

A

the precision and recall for every object in a clustering on a given data set, according to ground truth

52
Q

What is precision in clustering?

A

how many other objects in the same cluster belong to the same category as the object

53
Q

What is recall in clustering?

A

The recall of an object reflects how many objects of the same category are assigned to the same cluster
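
Putting the two definitions together, BCubed precision and recall can be sketched as follows (a minimal version that includes the object itself in the counts; conventions for handling the object itself vary):

```python
def bcubed(clusters, labels):
    # clusters[i]: cluster id of object i; labels[i]: ground-truth category of object i
    n = len(clusters)
    prec = rec = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_label = [j for j in range(n) if labels[j] == labels[i]]
        # precision: fraction of i's cluster sharing i's category
        prec += sum(1 for j in same_cluster if labels[j] == labels[i]) / len(same_cluster)
        # recall: fraction of i's category assigned to i's cluster
        rec += sum(1 for j in same_label if clusters[j] == clusters[i]) / len(same_label)
    return prec / n, rec / n

print(bcubed([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # perfect clustering: (1.0, 1.0)
```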

54
Q

In the silhouette coefficient what are a(o) and b(o)?

A

a(o) is the average distance between o and all other objects in the cluster to which o belongs
b(o) is the minimum average distance from o to all clusters to which o does not belong

55
Q

What is a method of evaluation for intrinsic methods? What is its value range? What do the values tell you?

A

the silhouette coefficient; it ranges from -1 to 1, where a negative value means the object is on average closer to objects in another cluster, and a positive value means its cluster is compact (good)
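
A minimal sketch of the silhouette coefficient for a single object, written directly from the a(o) and b(o) definitions above (the sample points and cluster labels are invented for illustration):

```python
def silhouette(point_idx, points, clusters):
    # s(o) = (b(o) - a(o)) / max(a(o), b(o))
    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5
    o, own = points[point_idx], clusters[point_idx]
    # a(o): average distance between o and all other objects in o's own cluster
    in_own = [points[j] for j in range(len(points))
              if clusters[j] == own and j != point_idx]
    a = sum(dist(o, p) for p in in_own) / len(in_own)
    # b(o): minimum average distance from o to the objects of any other cluster
    b = min(
        sum(dist(o, points[j]) for j in range(len(points)) if clusters[j] == c)
        / sum(1 for j in range(len(points)) if clusters[j] == c)
        for c in set(clusters) if c != own
    )
    return (b - a) / max(a, b)

pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(0, pts, [0, 0, 1, 1]))  # close to 1: well-separated clusters
```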

56
Q

What is the difference in grouping in classification and clustering?

A

classification groups data points with respect to a target, while clustering groups them with respect to a similarity metric

57
Q

Cluster labels may be

A

an existing feature included in the clustering, an existing feature not included (a target), or a latent attribute you don't have access to

58
Q

What is the within cluster variance?

A

the sum of squared errors between data points and their respective cluster centers

59
Q

What are the three main steps in the k-means algorithm?

A
  1. randomly choose k data points to serve as the initial centroids
  2. run until a maximum number of iterations is reached or there is no change in cluster assignments
  3. return the cluster membership of all data points
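
The three steps above can be sketched in plain Python (a minimal illustration using squared Euclidean distance; not a production implementation):

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # step 1: randomly choose k data points as the initial centroids
    centroids = rng.sample(points, k)
    assign = None
    for _ in range(max_iter):  # step 2: iterate until stable or max_iter
        new_assign = [min(range(k), key=lambda c: sum(
            (x - m) ** 2 for x, m in zip(p, centroids[c]))) for p in points]
        if new_assign == assign:  # no change in cluster assignments
            break
        assign = new_assign
        for c in range(k):  # move each centroid to the mean of its members
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return assign  # step 3: return cluster membership of all data points

# two well-separated pairs of points should end up in two clusters
print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2))
```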
60
Q

What is the elbow method?

A

choose a range of k values, run the algorithm for every k, calculate the SSE for each, calculate the change in slope between consecutive sums, and choose the k where the largest difference in slope occurs
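
The slope-difference selection can be sketched on a hypothetical SSE curve (the SSE values below are invented; in practice each would come from an actual clustering run at that k):

```python
def elbow_k(ks, sses):
    # slope of the SSE curve between consecutive values of k
    slopes = [(sses[i + 1] - sses[i]) / (ks[i + 1] - ks[i])
              for i in range(len(ks) - 1)]
    # change in slope between consecutive segments; pick the k at the
    # largest change (the turning point, or "elbow", of the curve)
    deltas = [slopes[i + 1] - slopes[i] for i in range(len(slopes) - 1)]
    return ks[deltas.index(max(deltas)) + 1]

ks = [1, 2, 3, 4, 5, 6]
sses = [100.0, 55.0, 20.0, 17.0, 15.0, 14.0]  # hypothetical SSE per k
print(elbow_k(ks, sses))  # 3: SSE flattens out after k = 3
```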