Clustering FINAL Flashcards

1
Q

Cluster analysis divides data into

A

groups (clusters) that are meaningful, useful, or both

2
Q

Objects in a cluster

A

share similar characteristics

3
Q

What fields is clustering used in

A

a variety of fields, e.g., health and medicine, business, and computer networking

4
Q

How is clustering used in health and medicine?

A

Cluster patients according to symptoms presented upon evaluation

5
Q

How is clustering used in business?

A

Cluster stores according to sales/customer satisfaction/…

6
Q

How is clustering used in computer networking?

A

Cluster traffic according to application type

7
Q

What does it mean that clusters are meaningful?

A

Clusters should capture the natural structure of the data

8
Q

What does it mean that clusters are useful?

A

Using the cluster prototype, clusters can be used as a starting point for data summarization, compression (vector quantization), classification (nearest neighbor)

9
Q

Clustering groups data objects based on

A

information found in the data that describes the objects and their relationships

10
Q

What type of learning is clustering

A

Unsupervised learning

11
Q

Clustering goal

A

Objects within a cluster should be similar to one another, but different from objects in other clusters

12
Q

Is the notion of a cluster well defined?

A

No

13
Q

Does clustering have to know exactly what it is sorting

A

No; it can sort things like pennies, nickels, and dimes without knowing how much they are worth

14
Q

Partitional clustering

A

Divide objects into non-overlapping clusters such that each object belongs to exactly one cluster

15
Q

Hierarchical clustering

A

Clusters can have subclusters

16
Q

Exclusive Clustering

A

1:1 relationship between object and cluster

17
Q

Why is a hyperparameter needed for clustering?

A

It tells us how many clusters we expect from the dataset

18
Q

Overlapping clustering

A

1:n relationship between object and cluster; an object can belong to > 1 cluster

19
Q

Fuzzy clustering

A

n:n relationship, all objects belong to all clusters with a certain probability (or membership weight)

20
Q

In fuzzy clustering, each object’s probability of belonging to all clusters should sum to

A

1.0

21
Q

Complete clustering

A

Assign every point to at least one cluster

22
Q

Partial clustering

A

Some objects may not be assigned to any cluster

23
Q

What might some objects not assigned to a cluster represent?

A

Noise or outliers

24
Q

Well-separated clusters

A

Each point is closer to all of the points in its cluster than to any point in another cluster

25
Q

Center-based clusters

A

Each point is closer to the center of its cluster than to the center of any other cluster

26
Q

Density based clusters

A

identifies clusters as regions of high data point density separated by regions of low density. It is useful for discovering clusters of arbitrary shapes and can handle noise effectively.

27
Q

Conceptual clusters

A

Points in a cluster share some general property that derives from the entire set of points

28
Q

K-means clustering definition

A

A prototype-based, partitional clustering technique that finds patterns in the data to create k clusters

29
Q

Hierarchical clustering

A

Build a hierarchy of clusters starting from singleton clusters

30
Q

DBSCAN

A

Density based clustering

31
Q

K-means clustering process

A

Takes n observations and partitions them into k clusters

32
Q

In k means clustering, relationship between k and n

A

k ≪ n (k is much smaller than n)

33
Q

What type of clustering scheme is K-means

A

Prototype-based

34
Q

How to use supervised techniques from unsupervised learning

A

Derive labels from unsupervised learning, then create a data set with the labels

35
Q

Meaning of prototype based clustering scheme

A

It has a notion of a centroid, called a prototype in the literature, and we want every point to be closer to its own cluster's centroid than to any other centroid

36
Q

How difficult is k means clustering?

A

Computationally difficult (NP-hard)

37
Q

Prerequisites of K-means algorithm

A

Attributes must be numeric
Attributes must be standardized (scaled)
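
For instance, a minimal preprocessing sketch in Python (assuming scikit-learn is available; the sample values are made up for illustration):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # two numeric attributes on very different scales
    X = np.array([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

    # standardize so each attribute has mean 0 and standard deviation 1
    X_scaled = StandardScaler().fit_transform(X)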

38
Q

K-means algorithm

A

Select K points as initial centroids
Form K clusters by assigning each point to its closest centroid
Recompute the centroid of each cluster
Repeat until the centroids do not change
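
A minimal NumPy sketch of these steps (illustrative only; it assumes random initialization, Euclidean distance, and that no cluster becomes empty):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Select K points as initial centroids (random selection)
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # 2. Form K clusters by assigning each point to its closest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3. Recompute the centroid of each cluster (assumes none is empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4. Stop when the centroids do not change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids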

39
Q

How to compute distances for k-means clustering

A

Euclidean distance
Manhattan Distance
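
For two points p and q these can be computed directly; a small NumPy illustration (the values are arbitrary):

    import numpy as np

    p = np.array([1.0, 2.0])
    q = np.array([4.0, 6.0])

    euclidean = np.linalg.norm(p - q)   # sqrt((1-4)^2 + (2-6)^2) = 5.0
    manhattan = np.abs(p - q).sum()     # |1-4| + |2-6|         = 7.0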

40
Q

When do we stop algorithm for K means clustering

A

When the SSE does not change from one iteration to the next, the algorithm has converged and the SSE is (locally) minimized

41
Q

Does K-means always converge?

A

Yes; it will find a minimum (perhaps not the global one) because the SSE decreases at every iteration

42
Q

How to choose initial centroids?

A

Random selection
A point in each natural cluster

43
Q

Random selection of initial centroid problem

A

May lead to suboptimal clustering

44
Q

Best way to initially choose centroids

A

Put all of the centroids close to each other in some dense area of your space

45
Q

Bisecting K-means goal

A

Form best (optimal) clusters

46
Q

Bisecting K-means idea

A

To obtain K clusters, split the set of all points into two clusters, select one of the clusters to split, and so on until you have k clusters

47
Q

Which cluster to split?

A

The largest one or the one with largest error (SSE)

48
Q

Bisecting K-means algorithm

A

Initialize the list of clusters to contain the cluster consisting of all points
Remove a cluster from the list of clusters
Bisect the selected cluster using basic K-means
Select the two clusters from the bisection with the lowest total SSE
Add these two clusters to the list of clusters
Repeat until the list of clusters contains K clusters
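
scikit-learn (version 1.1 and later) ships an implementation of this idea; a minimal usage sketch on random sample data:

    import numpy as np
    from sklearn.cluster import BisectingKMeans  # requires scikit-learn >= 1.1

    X = np.random.default_rng(0).random((100, 2))

    # bisecting_strategy="biggest_inertia" splits the cluster with the largest SSE;
    # "largest_cluster" splits the one with the most points
    model = BisectingKMeans(n_clusters=4, bisecting_strategy="biggest_inertia")
    labels = model.fit_predict(X)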

49
Q

How to choose the k in K-means?

A

We want the total within cluster sum of squares to be minimal

50
Q

Clustering should exhibit

A

Low intra-cluster SS
High inter-cluster SS

51
Q

Low intra-cluster SS

A

Cohesion

52
Q

High inter-cluster SS

A

Separation

53
Q

Within cluster SS

A

Sum of squared distances from each point in the cluster to the cluster centroid

54
Q

Total SS

A

Sum of squared distances from each point in the dataset to the global mean

55
Q

Between SS

A

Total SS - total within SS

56
Q

Variance of clustering

A

Ratio of between SS/Total SS (between 0.0 and 1.0)
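
These quantities can be computed directly from the data and cluster labels; a small NumPy sketch (the function name is mine):

    import numpy as np

    def variance_explained(X, labels):
        # Total SS: squared distances of every point to the global mean
        total_ss = ((X - X.mean(axis=0)) ** 2).sum()
        # Total within SS: squared distances of each point to its cluster centroid
        within_ss = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
                        for j in np.unique(labels))
        between_ss = total_ss - within_ss   # Between SS
        return between_ss / total_ss        # ratio in [0.0, 1.0]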

57
Q

What does a high ratio of between SS/Total SS mean?

A

The more variance is explained by the clusters

58
Q

The higher the variance

A

the better, because the clustering explains that much of the variance

59
Q

How does K-means handle outliers

A

Not gracefully, it will try to include these in a cluster

60
Q

What can help K-means

A

Outlier detection and removal prior to clustering

61
Q

in Kmeans how many points are assigned to a cluster?

A

Every point

62
Q

Kmeans only creates what type of clusters

A

Globular clusters (round)

63
Q

K-means may result in an

A

empty cluster

64
Q

How to handle empty cluster problem

A

Choose the point that is farthest away from any centroid (essentially assigning an outlier to the empty cluster)
Another option is to choose a replacement centroid from the cluster that has the highest SSE, which splits that cluster

65
Q

K means has problems when clusters are of differing

A

sizes, densities, or non-globular shapes

66
Q

K-means complexity

A

Space: O((m + K)n)
Time: O(I * K * m * n), where I = iterations, K = clusters, m = points, n = attributes

67
Q

What might outliers lead to in k means

A

Higher SSE and non-representative cluster centroids

68
Q

What are hierarchical clustering inspired by?

A

The area of taxonomy, where hierarchical structures are common and elements under the same hierarchy automatically constitute a cluster

69
Q

What is not required of hierarchical clustering unlike k means

A

Choosing k a priori. You look at what it creates and then choose k according to your understanding of the problem

70
Q

Is not having to choose k a priori an advantage or a disadvantage?

A

Both

71
Q

Two approaches to hierarchical clustering

A

Agglomerative (bottom-up)
Divisive (top-down)

72
Q

Agglomerative

A

Each point starts off as an individual cluster; at each step, merge the closest pair of clusters (requires a cluster proximity metric)

73
Q

Divisive

A

All points start in one cluster; at each step, split a cluster until only singleton clusters of individual points remain (must decide which cluster to split and how to do the splitting)

74
Q

How is hierarchical clustering depicted

A

As a tree-like structure called a dendrogram

75
Q

Dendrogram y-axis and x-axis

A

The y-axis is labeled with the proximity between clusters; the x-axis is labeled with the data points

76
Q

How to interpret dendrogram

A

Look for the two closest clusters (in terms of squared distance) and join them. Continue until one cluster is left

77
Q

How to pick clusters in dendrogram

A

Draw a horizontal line that intersects k vertical lines, k being the number of clusters you want

78
Q

Hierarchical clustering algorithm

A

Compute proximity matrix if necessary
Merge the closest two clusters
Update the proximity matrix to reflect the proximity between the new cluster and original clusters
Repeat until only one cluster remains
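
SciPy implements this agglomerative procedure; a minimal sketch on random data (the linkage method names map to the merge criteria covered below):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.default_rng(0).random((20, 2))

    # method="single" is MIN, "complete" is MAX, "average" is group average
    Z = linkage(X, method="average")
    dendrogram(Z)        # y-axis shows the proximity at which clusters merge
    plt.show()

    # "cut" the dendrogram to obtain k = 3 clusters
    labels = fcluster(Z, t=3, criterion="maxclust")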

79
Q

3 ways for merge to be done in hierarchical clustering algorithm

A

MIN (single link), MAX (complete link), Group average

80
Q

MIN (single link)

A

Defines the distance between two clusters as the shortest distance between any pair of points in the two clusters

81
Q

MAX (complete Link)

A

Measures the distance between two clusters as the greatest distance between any pair of points in the two clusters

82
Q

Group Average

A

Defines the distance between two clusters as the average of all pairwise distances between points in the two clusters

83
Q

Which merging is sensitive to outliers, which one is not sensitive to outliers

A

Sensitive to outliers: MIN (single link). Not sensitive: MAX (complete link)

84
Q

Hierarchical clustering complexity

A

O(m^2) space
O(m^3) time for the basic approach
Using a heap -> O(m^2 log m)

85
Q

Does scaling matter in hierarchical clustering

A

Yes, if attributes have wide ranges

86
Q

What dissimilarity measure should be used in HC?

A

Euclidean
Manhattan

87
Q

What linkage to be used in HC?

A

Same as before: MIN, MAX, or group average; it depends on whether or not your data is susceptible to outliers

88
Q

Unlike k-means, merging decisions are

A

final; observations cannot be reassigned to different clusters later

89
Q

Even though clustering in general is considered an unsupervised learning method,

A

it can be used in supervised learning mode

90
Q

Clusters indicate a _ and each object in a certain cluster _

A

class label, belongs to that class

91
Q

Error measures for supervised clustering are

A

the same as classification

92
Q

What is DBSCAN?

A

A density-based clustering algorithm parameterized by a radius (neighborhood) and a number of neighboring points (MinPts)

93
Q

e-Neighbourhood

A

Objects within a radius e from a source object

94
Q

Density (DBSCAN)

A

If e-neighborhood of a source point (object) contains at least MinPts other points (object), then the source point is in a “high-density” area

95
Q

DBSCAN divides all points into

A

Core point
Border point
Noise point

96
Q

Core point

A

A point that has at least MinPts points within a radius of e

97
Q

Border point

A

A point that is not a core point, but is in the neighborhood of a core point

98
Q

Noise point

A

Points that are neither core nor border points

99
Q

Points idea

A

Pick a point and draw a circle of radius e around it
If at least MinPts points fall within the circle, the point is a core point
A non-core point that falls within the circle of some core point is a border point
Anything that falls within no core point's circle is a noise point

100
Q

Density-based clustering: dbscan algorithm steps

A

Label all points as core, border, or noise points
Eliminate all noise points
Put an edge between all core points that are within Eps of each other
Make each group of connected core points into a separate cluster
Assign each border point to one of the clusters of its associated core points
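
scikit-learn's DBSCAN follows this scheme; a minimal usage sketch (the eps and min_samples values here are arbitrary):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.default_rng(0).random((200, 2))

    db = DBSCAN(eps=0.1, min_samples=4)   # radius e and MinPts
    labels = db.fit_predict(X)            # noise points are labeled -1

    core_mask = np.zeros(len(X), dtype=bool)
    core_mask[db.core_sample_indices_] = True   # core vs. border/noise points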

101
Q

How to pick epsilon

A

Plot the distance from each point in the data set to its nearest neighbors (e.g., its 5 nearest), computed across all points. Sort the distances and look at the curve; choose the epsilon value at the point where the curve starts to rise

102
Q

dbscan complexity

A

O(m) space
O(m^2) time in the worst case
O(m log m) in low-dimensional spaces

103
Q

What happens if neighborhood (MinPts) is too small

A

Then a sparse cluster may be erroneously labeled as noise

104
Q

What happens if neighborhood (MinPts) is too large

A

Then dense clusters may be merged together, and small clusters may be labeled as noise

105
Q

How many MinPts satisfy for 2 dimensions

A

4

106
Q

For greater than 2 dimensions, how many MinPts

A

2*dimensions

107
Q

How many MinPts for large and noisy datasets

A

A large value

108
Q

What to do if clusters are too large

A

Decrease epsilon

109
Q

Too much noise

A

Increase epsilon

110
Q

How to choose epsilon concise

A

Calculate the average distance of every point to its k nearest neighbors; k-dist will be large for noise points. Sort the values and look for the point where the curve starts rising
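
A sketch of this k-dist heuristic with scikit-learn (here using the distance to the k-th nearest neighbor, a common variant; k = 4 matches the MinPts suggestion for 2-D data):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.neighbors import NearestNeighbors

    X = np.random.default_rng(0).random((200, 2))

    k = 4
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own neighbor
    dists, _ = nn.kneighbors(X)
    k_dist = np.sort(dists[:, -1])   # each point's distance to its k-th neighbor, sorted

    plt.plot(k_dist)   # choose epsilon where the curve starts rising sharply
    plt.show()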

111
Q

Why validate clustering?

A

Avoid finding random patterns in data
Compare two clusters
Compare two clustering algorithms

112
Q

Types of validations

A

Internal
External
Relative

113
Q

Internal validation

A

Used to measure the goodness of a clustering structure without respect to external information (SSE for example)

114
Q

External Validation

A

Match cluster result with external results (class labels)

115
Q

Relative Validation

A

Used to compare two different clustering algorithms or clusters

116
Q

Two measures of internal validation

A

Cohesion
Separation

117
Q

Cohesion

A

Measures how close objects in the same cluster are (measured by within-cluster SSE)

118
Q

Separation

A

Measures how well separated the clusters are

119
Q

Silhouette value

A

Measures the degree of confidence in the clustering assignment of a particular observation

120
Q

Silhouette formula

A

si = (bi - ai) / max(ai, bi)

121
Q

Silhouette width

A

Ranges over [-1, 1]. A large si means the observation is well clustered; a small si means it lies between two clusters; a negative si means it was probably placed in the wrong cluster

122
Q

what is ai

A

the average distance from the ith point to all the other points in the same cluster as the ith point

123
Q

what is bi

A

the minimum average distance from the ith point to the points in any other cluster (i.e., the average distance to its nearest neighboring cluster)
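
scikit-learn computes ai, bi, and si for you; a minimal sketch on a K-means result:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_samples, silhouette_score

    X = np.random.default_rng(0).random((100, 2))
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    s = silhouette_samples(X, labels)     # one silhouette value per observation, in [-1, 1]
    print(silhouette_score(X, labels))    # mean silhouette width over all points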