Cluster Analysis Flashcards

1
Q

What is Cluster Analysis and what is its Goal

A
  • Intra-cluster distances should be minimized
  • Inter-cluster distances should be maximized

Goal: Gain a better understanding of the data

2
Q

Types of clustering and the differences between them

A
  • Partitional clustering: divides records into non-overlapping clusters so that each record belongs to exactly one subset
  • Hierarchical clustering: produces a set of nested clusters (clusters overlap)
3
Q

Application areas of cluster analysis

A

Market Research:
- Identify similar customers

E-Commerce:
- Identify offers of the same product

Image Recognition:
- Identify parts of an image that belong to the same object (1. clustering identifies a region, 2. a classifier identifies the region as an object)

4
Q

Which types of learning tasks exist and which type is cluster analysis

A

Supervised Learning:

  • Discover patterns in data that relate data attributes with a target (class) attribute
  • Use patterns on unseen data to predict target attribute
  • Set of classes is known beforehand
  • Requires training data

Unsupervised Learning (Cluster analysis):

  • Explore data to find intrinsic patterns
  • Set of classes/clusters is not known beforehand
  • No training data required
5
Q

List cluster analysis methods

A
  • Partitional
    • K-Means
    • DBSCAN
  • Hierarchical
    • Agglomerative
    • Divisive
6
Q

Describe K-Means Clustering

A
  • Each cluster has a centroid
  • Each data point is assigned to its closest centroid
  • The number of clusters (centroids) needs to be specified manually

Algorithm (see the sketch below):
  1. Select k initial points as centroids
  2. Repeat:
     - Form k clusters by assigning every point to its closest centroid
     - Recompute each centroid as the mean of its cluster
     Until the centroids no longer change
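A minimal sketch of the algorithm in Python/NumPy, assuming numerical data in an (n, d) matrix; the function and parameter names are illustrative, not from the source:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means (Lloyd's algorithm) on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # 1. Select k initial points as centroids (random data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Form k clusters: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence criterion: centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = k_means(np.random.rand(200, 2), k=3)   # example usage
```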
7
Q

What is a Convergence Criterion and list them for K-means

A

Convergence criterion: the stopping condition of the algorithm (choose a setting that fits the task's requirements)

Default:
- No or minimal change of the centroids

Alternatives:
- No or minimal re-assignment of data points to different clusters
- Stop after x iterations
- Minimum decrease in the sum of squared errors (SSE)

8
Q

What is a cohesion measure

A

Similarity of the data points within a cluster

How close the data points in a cluster are to their centroid

9
Q

How to evaluate K-Means Clustering

A

Most common: Sum of Squared Errors (SSE)

Error: distance of a point to its nearest centroid
SSE: square the errors and sum them over the k clusters

SSE = SUM over clusters i ( SUM over points x in cluster i ( dist(centroid_i, x)^2 ) )

Minimization task: the lower the SSE, the better the clustering
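A short sketch of computing the SSE for a given clustering, assuming `labels` and `centroids` arrays like those returned by the k_means sketch above:

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared errors: squared distance of each point to its assigned centroid."""
    diffs = X - centroids[labels]          # vector from each point to its own centroid
    return float(np.sum(diffs ** 2))       # summed over all points and clusters
```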

10
Q

Weaknesses of K-Means

A
  • Need to determine the number of clusters beforehand
  • All items are assigned to a cluster (outliers cannot be excluded)
  • Sensitive to outliers
  • Result depends on the position of the initial seeds
11
Q

How can you increase the chance of finding good clusters

A

Option 1) Restart several times with different random seeds

Option 2) Run k-means with different values of k
- SSE values of clusterings with different k cannot be compared directly (the higher k, the lower the SSE)

Workarounds:
1. Choose the k where the SSE improvement starts to flatten (the "knee"/"elbow" value of k), as in the sketch below
2. Use X-Means (determines k automatically)
- Starts with a small k, splits large clusters, and checks whether the results improve
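A possible way to produce the SSE-vs-k curve for the knee/elbow heuristic, assuming scikit-learn is available (its `inertia_` attribute is the SSE of the fitted clustering); the data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)                      # example data

sse_per_k = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse_per_k[k] = km.inertia_                  # SSE of this clustering

# Inspect where the improvement starts to flatten (the "knee"/"elbow")
for k, sse in sse_per_k.items():
    print(k, round(sse, 2))
```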

12
Q

How to handle outliers with K-Means

A

Use K-Medoids

  • Uses the medoid (a central actual data point) instead of the mean
  • The medoid is less affected by extreme values than the mean

Use DBSCAN

13
Q

Advantages of K-Means

A
  • Simple
  • Efficient (complexity: O(t·k·n))

t: number of iterations, k: number of clusters, n: number of data points

14
Q

Characteristics of DBSCAN

A
  • Density based algorithm

Density: number of points within a specified radius (eps)

Three different classes of data points:

1) Core points: number of neighboring points (within eps) >= MinPts
2) Border points: number of neighboring points (within eps) < MinPts, but the point is a neighbor of a core point
3) Noise points: any point that is neither a core point nor a border point

15
Q

How does DBSCAN work

A
  1. Label all points (core, border, or noise point)
  2. Eliminate noise points
  3. Group connected core points within their range (eps) into clusters
  4. Assign each border point to the cluster of its closest core point
    If a border point lies on the border of multiple clusters:
    • use voting, or random assignment if the vote is tied

Time complexity of O(n log n): dominated by the neighborhood search for each point
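A minimal usage sketch, assuming scikit-learn's DBSCAN implementation; `eps` and `min_samples` correspond to eps and MinPts, and the data and parameter values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(300, 2)                        # example data

db = DBSCAN(eps=0.1, min_samples=4).fit(X)
labels = db.labels_                               # cluster id per point, -1 = noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, "clusters,", n_noise, "noise points")
```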

16
Q

How to determine suitable EPS and MinPts values

A

Ideal: Points in a cluster should have their kth nearest neighbor at roughly the same distance, while noise points have their kth nearest neighbor at a farther distance.

  1. MinPts = 4 (rule of thumb)
  2. Plot the sorted distance of every point to its kth nearest neighbor
  3. Set eps to the point where the distance sharply increases (start of the noise points); see the sketch after this list

Adjusting:

  • Decrease k if small clusters are labeled as noise (lowers the required density)
  • Increase k if outliers are included in clusters (raises the required density)
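A possible sketch of the sorted k-distance plot used to pick eps, assuming scikit-learn and matplotlib are available; data and values are illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(300, 2)                  # example data
k = 4                                       # MinPts rule of thumb

# Distance of every point to its kth nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1 because the point itself is returned
dists, _ = nn.kneighbors(X)
kth_dist = np.sort(dists[:, -1])

plt.plot(kth_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}th nearest neighbor")
plt.show()                                  # eps: where the curve bends sharply upward
```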
17
Q

When does DBSCAN work well

A
  • Resistant to noise
  • Can handle clusters of different shapes and sizes

18
Q

When does DBSCAN not work well

A
  • Datasets of varying densities
19
Q

What is Hierarchical Clustering

A
  • Produces a set of nested clusters

- Can be visualized as a Dendrogram

20
Q

What is a Dendrogram

A
  • Tree-like diagram that records the sequences of merges or splits
  • The y-axis displays the distance at which clusters were merged
21
Q

Strength of Hierarchical Clustering

A
  • No particular number of clusters has to be assumed beforehand
  • The desired number of clusters can be obtained by cutting the dendrogram at the proper level
  • Can be used to discover meaningful taxonomies (biological species / customer groups)
22
Q

What are the two main types of hierarchical clustering

A

Agglomerative (bottom-up):

  • Start with individual points and merge clusters iteratively until only one cluster is left
  • Requires a similarity measure

Divisive (top-down):

  • Start with one large cluster
  • Divide clusters until each cluster contains a single point
  • Requires a distance measure

Agglomerative clustering is used more often

23
Q

The basic agglomerative algorithm

A
  1. Compute the proximity matrix
  2. Start with each data point as its own cluster
    Repeat:
    - Merge the two closest clusters
    - Update the proximity matrix
    Until only one cluster remains

Key operation: calculating the proximity matrix (see the sketch below)
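A minimal sketch using SciPy's agglomerative clustering, assuming SciPy is available; the `method` argument selects the inter-cluster similarity ('single', 'complete', 'average', ...), and the data is illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)                        # example data

# Agglomerative clustering; the linkage matrix records the sequence of merges
Z = linkage(X, method="average", metric="euclidean")

# Cut the dendrogram to obtain a desired number of clusters
labels = fcluster(Z, t=3, criterion="maxclust")

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree (with matplotlib)
print(labels[:10])
```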

24
Q

How to determine Inter-Cluster Similarity

A
  1. Single Link
  2. Complete Link
  3. Group Average
  4. Distance between centroids
25
Q

How to calculate the cluster similarity: Single Link

A
  • Similarity is based on two most similar (closest) points of different clusters
  • Determined by one pair of points
26
Q

How to calculate the cluster similarity: Complete Linkage

A
  • Similarity is based on the two least similar (most distant) points in the two clusters
  • The pair of clusters with the shortest such "longest distance" is merged
  • Determined by all points in the two clusters
27
Q

Differences between Single Link and Complete Linkage

A

Single Link (Shortest Distance):
+ Can handle non-elliptic shapes
- Sensitive to noise and outliers

Complete Linkage (Shortest Longest Distance):
+ Less sensitive to noise and outliers
- Biased towards globular clusters
- Tends to break large clusters (because decisions can’t be undone)

28
Q

General characteristic of agglomerative clustering

A

They make good local decisions about merging two clusters, but a global optimum is not guaranteed because a merge cannot be undone afterwards.

29
Q

How to calculate the cluster similarity: Group Average

A
  • Proximity is the average of pair-wise proximity between all points in the two clusters
30
Q

Comparison between Group Average and Single / Complete Link

A

Compromise between single and complete link
+ Less sensitive to noise and outliers than single link
- Biased towards globular clusters

31
Q

General Problems and Limitations of hierarchical clustering

A
  1. Sensitivity to noise and outliers (except complete link)
  2. Difficulty handling non-elliptic shapes (except single link)
  3. Breaks large clusters

High Space and Time Complexity

  • At least O(N^2) due to proximity matrix
  • Often O(N^3): N steps, each requiring a search and update of the proximity matrix
  • Workaround: apply hierarchical clustering to a random sample of the original data (< 10,000 examples)
32
Q

List the proximity measures

A

Similarity:

  • Numerical value of how alike two data objects are
  • Often ranges between [0,1]

Dissimilarity / Distance:
- Numerical value of how different two data objects are

There are proximity measures for single attributes and for multidimensional data points (records)

33
Q

Levenshtein Distance

A
  • Measures Dissimilarity of two strings
  • Minimum number of edits to transform one string into the other

Edit operations:

  • insert a character
  • delete a character
  • replace a character
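A minimal dynamic-programming sketch of the Levenshtein distance (the helper name is illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements to turn a into b."""
    prev = list(range(len(b) + 1))              # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1         # replace only if characters differ
            curr.append(min(prev[j] + 1,        # delete a character
                            curr[j - 1] + 1,    # insert a character
                            prev[j - 1] + cost))# replace a character
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))         # 3
```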
34
Q

Proximity of Multidimensional Data Points

A
  • A record with many attributes (age, height, weight)
  • Euclidean Distance
  • Simple Matching Coefficient
  • Jaccard Coefficient
35
Q

Formula of Euclidean Distance

A

dist(p, q) = sqrt( SUM over k of (p_k - q_k)^2 )

p_k and q_k are the kth attributes of the data points p and q

Square root of the sum of squared differences over all attributes of the two records

Euclidean distance is used for numerical values
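A minimal sketch of the formula in Python/NumPy (the record values are illustrative):

```python
import numpy as np

def euclidean(p, q):
    """Square root of the sum of squared attribute differences between two records."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))

print(euclidean([25, 180, 75], [30, 170, 80]))   # e.g. (age, height, weight)
```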

36
Q

Why is normalization important

A
  • To ensure that all attributes have equal influence on the computed distance
  • Normalized attributes share a common value range, e.g. [0,1]
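A possible min-max normalization sketch that brings every attribute into the range [0, 1] before distances are computed (illustrative helper, not from the source):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each column (attribute) of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)   # avoid division by zero
    return (X - mins) / ranges
```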
37
Q

How to obtain the similarity of Binary Attributes

A

E.g. products in shopping basket or courses attended by students

  • SMC
  • Jaccard
Both use the following quantities:
M11 = number of attributes where p = 1 and q = 1
M00 = number of attributes where p = 0 and q = 0
M10 = number of attributes where p = 1 and q = 0
M01 = number of attributes where p = 0 and q = 1
38
Q

How to handle symmetric binary attributes and how to calculate the similarity

A

Symmetric binary attribute: both of its states have equal importance (male / female, agreement with political parties)

Similarity Measure SMC:
Number of matches / number of all attribute values
(M11 + M00) / (M11 + M00 + M10 + M01)

39
Q

How to handle asymmetric binary attributes and how to calculate the similarity

A

Asymmetric binary attributes: one of the states is more important (item in basket / item not in basket); 1 is the more important state by convention

Similarity Measure Jaccard Coefficient:

Number of 11 matches / number of not both-zero attribute values

M11 / (M01 + M10 + M11)
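A short sketch computing both coefficients for two binary records (illustrative helper name):

```python
def smc_and_jaccard(p, q):
    """Simple Matching Coefficient and Jaccard coefficient for two binary vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (m11 + m00) / (m11 + m00 + m10 + m01)
    jaccard = m11 / (m01 + m10 + m11) if (m01 + m10 + m11) else 0.0
    return smc, jaccard

print(smc_and_jaccard([1, 0, 0, 1, 1], [1, 0, 1, 0, 1]))   # (0.6, 0.5)
```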

40
Q

SMC vs Jaccard

A

Dating agency (many attributes):

  • Jaccard
  • only the 11 matches are important
  • can ignore 00 combinations

Wahl-O-Mat (fewer attributes):

  • SMC
  • The 00 matches are also important in order to capture disagreements
41
Q

How can you combine similarities ?

A
  • Weight each attribute according to its importance
  • The weights should sum up to 1
    E.g. weighted Euclidean distance (see the sketch below)
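A minimal sketch of a weighted Euclidean distance, assuming per-attribute weights that sum to 1 (illustrative helper and values):

```python
import numpy as np

def weighted_euclidean(p, q, weights):
    """Euclidean distance where each attribute difference is weighted by its importance.
    The weights are assumed to sum to 1 (as stated on the card above)."""
    p, q, w = (np.asarray(v, dtype=float) for v in (p, q, weights))
    return float(np.sqrt(np.sum(w * (p - q) ** 2)))

print(weighted_euclidean([0.2, 0.9], [0.4, 0.1], weights=[0.7, 0.3]))
```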
42
Q

How to choose a good clustering algorithm

A

A good algorithm ("best" is misleading) depends on

  • the analytical goal of the use case
  • the distribution of the data

  1. Run several algorithms with different hyperparameter settings and feature subsets
  2. Visualize and interpret the results based on application knowledge
43
Q

What has a high influence on the clustering results?

A
  • Standardization
  • Feature Selection
  • Distance Function
  • Parameter Settings