Cluster Analysis Flashcards
What is Cluster Analysis and what is its Goal
- Intra-cluster distances should be minimized
- Inter-cluster distances should be maximized
Goal: Get a better understanding of the data
Types of clustering and the differences between them
- Partitional clustering
- Divides records into non-overlapping clusters so that each record is in exactly one subset
- Hierarchical clustering
- A set of nested clusters (clusters may overlap)
Application areas of cluster analysis
Market Research:
-Identify similar customers
E-Commerce:
-Identify offers of the same product
Image Recognition:
-Identify parts of an image that belong to the same object (1. clustering identifies a region, 2. a classifier identifies the region as an object)
Which types of learning tasks exist and which type is cluster analysis
Supervised Learning:
- Discover patterns in data that relate data attributes with a target (class) attribute
- Use patterns on unseen data to predict target attribute
- Set of classes is known before
- Training data
Unsupervised Learning (Cluster analysis):
- Explore data to find intrinsic patterns
- Set of classes/clusters not known before
- No training data
List cluster analysis methods
- Partitional
- K-Means
- DBSCAN
- Hierarchical
- Agglomerative
- Divisive
Describe K-Means Clustering
- Each cluster has a centroid
- Assign data point to closest centroid
- The number of clusters (centroids) needs to be specified manually
- Select k initial points as centroids
repeat
- form k clusters by assigning all points to their closest centroid
- recompute centroid (to the mean of each cluster)
until centroids don’t change
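A minimal NumPy sketch of this loop (illustrative only; X is assumed to be an (n, d) array of numeric data points, and empty clusters are not handled):
```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # select k initial points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # form k clusters by assigning every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```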
What is a Convergence Criterion and list them for K-means
Convergence criterion: stopping condition for the algorithm (a setting that fits the task's requirements)
Default:
- no or minimum change of centroids
Alternative:
- no or minimum re-assignments of data points to different clusters
- stop after x iterations
- min decrease in sum of squared errors (SSE)
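In libraries these criteria typically show up as hyperparameters; for example, scikit-learn's KMeans exposes an iteration limit and a tolerance (the values below are arbitrary):
```python
from sklearn.cluster import KMeans

# max_iter: "stop after x iterations"; tol: stop when the centroids barely change
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=42)
```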
What is a cohesion measure
Similarity of the data points within a cluster
How close the data points in a cluster are to their centroid
How to evaluate K-Means Clustering
Most common: Sum of Squared Errors (SSE)
Error: distance of a point to its nearest centroid
SSE: square the errors and sum them over all k clusters
SSE = SUM over clusters i ( SUM over points x in cluster i ( dist(c_i, x)^2 ) )
Minimization task (lower SSE is better for a fixed k)
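A small sketch of the SSE computation, assuming labels and centroids come from a k-means run like the sketch above:
```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its assigned centroid."""
    total = 0.0
    for i, c in enumerate(centroids):
        diff = X[labels == i] - c      # errors of the points in cluster i
        total += np.sum(diff ** 2)     # square the errors and sum them
    return total
```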
Weaknesses of K-Means
- Need to determine number of clusters
- All items are assigned to a cluster
- Sensitive to outliers
- Depends on position of initial seeds
How can you increase the chance to find good clusters
Option 1) Restart several times with different random seeds
Option 2) Run k-means with different k’s
- SSE values of clusterings with different k's can't be compared directly (the higher k, the lower the SSE)
Workaround:
1. Choose k where the SSE improvement starts to level off (knee value of k; see the sketch after this list)
2. Use X-Means (automatically determines k)
- start with a small k, split large clusters and check if the results improve
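A sketch of the knee/elbow workaround using scikit-learn, plotting the SSE (inertia_) over a range of k values (range chosen arbitrarily):
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_sse_vs_k(X, k_range=range(2, 11)):
    sse_values = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse_values.append(km.inertia_)        # inertia_ holds the SSE of the fit
    plt.plot(list(k_range), sse_values, marker="o")
    plt.xlabel("k")
    plt.ylabel("SSE")
    plt.show()                                # pick k at the knee of the curve
```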
How to handle outliers with K-Means
Use K-Medoids
- Uses the medoid (the most centrally located actual data point) instead of the mean
- the medoid is less affected by extreme values
Use DBSCAN
Advantages of K-Means
- Simple
- Efficient (Complexity: O(tkn))
t: iterations, k: clusters, n: data points
Characteristics of DBSCAN
- Density based algorithm
Density: number of points within a specified radius (eps)
Three different classes of data points:
1) Core points: Neighboring points (within eps) >= MinPts
2) Border points: Neighboring points (within eps) < MinPts, but a neighbor of a core point
3) Noise point: any point that is not 1 or 2
How does DBSCAN work
- Label all points (core, border, or noise point)
- Eliminate noise points
- Group connected core points within their range (eps) to clusters
- Assign each border point to a cluster based on its closest core point
if a border point lies on the border of multiple clusters: use voting or random assignment (if the vote is tied)
Time complexity of O(n log n): dominated by the neighborhood search for each point
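A minimal usage sketch with scikit-learn's DBSCAN (the eps and min_samples values are placeholders; X is again an (n, d) array of points):
```python
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=4).fit(X)   # X: (n, d) array of data points
labels = db.labels_                          # cluster ids; noise points get the label -1
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True      # True for core points, False for border/noise points
```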
How to determine suitable EPS and MinPts values
Ideal: points in a cluster should have their kth nearest neighbor at roughly the same distance; noise points have their kth nearest neighbor at a farther distance
- MinPts = 4 (rule of thumb)
- Plot the sorted distance of every point to its kth nearest neighbor
- Set eps to the point where the distance sharply increases (start of noise points)
Adjusting:
- Decrease k if small clusters are labeled as noise (lowers the required density)
- Increase k if outliers are included in clusters (raises the required density)
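A sketch of the k-distance plot used to choose eps (scikit-learn's NearestNeighbors, with k = MinPts = 4 as the rule of thumb):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4                                              # MinPts (rule of thumb)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is returned as its own neighbor
dists, _ = nn.kneighbors(X)
kth_dist = np.sort(dists[:, -1])                   # distance of every point to its kth nearest neighbor, sorted
plt.plot(kth_dist)
plt.ylabel("distance to kth nearest neighbor")
plt.show()                                         # set eps where the curve bends sharply upward
```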
When does DBSCAN work well
- Resistant to noise
- If clusters have different shapes and sizes
When does DBSCAN not work well
- Datasets of varying densities
What is Hierarchical Clustering
- Produces a set of nested clusters
- Can be visualized as a Dendrogram
What is a Dendrogram
- Tree like diagram that records sequences of merges or splits
- The y-axis displays the distance at which two clusters were merged
Strength of Hierarchical Clustering
- Don’t need to assume any particular number of clusters
- Desired number of clusters can be obtained by cutting the dendrogram at the proper level
- can be used to discover meaningful taxonomies (biological species / customer groups)
What are the two main types of hierarchical clustering
Agglomerative (bottom-up):
- Start with single points and merge iteratively until only one cluster is left
- requires similarity measure
Divisive (Top-down):
- Start with one large cluster
- Divide clusters until each cluster contains a single point
- requires distance measure
Agglomerative is more often used
The basic agglomerative algorithm
- Compute the proximity matrix
- Each data point is a cluster
Repeat:
- Merge the two closest clusters
- Update the proximity matrix
Until only one cluster remains
Key operation: calculating the proximity between two clusters (to update the proximity matrix)
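A sketch using SciPy, which implements this loop and draws the dendrogram (the linkage method and the cut into 3 clusters are arbitrary choices):
```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method="average")                    # records the sequence of merges
dendrogram(Z)                                       # visualize the nested clusters
plt.show()
clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
```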
How to determine Inter-Cluster Similarity
- Single Link
- Complete Link
- Group Average
- Distance between centroids
How to calculate the cluster similarity: Single Link
- Similarity is based on two most similar (closest) points of different clusters
- Determined by one pair of points
How to calculate the cluster similarity: Complete Linkage
- Similarity is based on the two least similar (most distant) points in different clusters
- Shortest longest distance
- Determined by all points in the two clusters
Differences between Single Link and Complete Linkage
Single Link (Shortest Distance):
+ Can handle non-elliptic shapes
- Sensitive to noise and outliers
Complete Linkage (Shortest Longest Distance):
+ Less sensitive to noise and outliers
- Biased towards globular clusters
- Tends to break large clusters (because decisions can’t be undone)
General characteristic of agglomerative clustering
They make good local decisions about merging two clusters, but a global optimum is not guaranteed because a cluster merge cannot be undone afterwards.
How to calculate the cluster similarity: Group Average
- Proximity is the average of pair-wise proximity between all points in the two clusters
Comparison between Group Average and Single / Complete Link
Compromise between single and complete link
+ Less sensitive to noise and outliers than single link
- Biased towards globular clusters
General Problems and Limitations of hierarchical clustering
- Sensitivity to noise and outliers (except complete link)
- Difficulty handling non-elliptic shapes (except single link)
- Breaks large clusters
High Space and Time Complexity
- At least O(N^2) due to proximity matrix
- Often O(N^3): N steps, each requiring a search and update of the proximity matrix
- Workaround: Apply hierarchical clustering to a random sample of the original data (< 10,000 examples)
List the proximity measures
Similarity:
- Numerical value of how alike two data objects are
- Often in the range [0,1]
Dissimilarity / Distance:
- Numerical value of how different two data objects are
There are proximity measures for single attributes and for multidimensional data points (records)
Levenshtein Distance
- Measures Dissimilarity of two strings
- Minimum number of edits to transform one string into the other
Edit operations:
- insert a character
- delete a character
- replace a character
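A compact dynamic-programming sketch of the Levenshtein distance:
```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations to turn a into b."""
    prev = list(range(len(b) + 1))                # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1           # replacing is free when characters match
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # replace (or keep) a character
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")                  # -> 3
```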
Proximity of Multidimensional Data Points
- A record with many attributes (age, height, weight)
- Euclidean Distance
- Simple Matching Coefficient
- Jaccard Coefficient
Formula of Euclidean Distance
dist(p, q) = sqrt( SUM over k ( (pk - qk)^2 ) )
pk and qk are the kth attributes of data points p and q
Square root of the summed squared differences over all attributes of the two records
Euclidean Distance is for numerical values
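A small NumPy sketch of this formula (the two records below are made up):
```python
import numpy as np

def euclidean(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))   # sqrt of the summed squared attribute differences

euclidean([25, 1.80, 75], [30, 1.65, 60])  # two made-up records: age, height, weight
```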
Why is normalization important
- To ensure that all attributes have equal influence on the computed distance
- Normalized attributes have a common value range e.g. [0,1]
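A min-max normalization sketch that maps each attribute of a data matrix to [0,1] (constant columns would need extra handling):
```python
import numpy as np

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)   # every attribute now lies in [0, 1]
```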
How to obtain the similarity of Binary Attributes
E.g. products in shopping basket or courses attended by students
- SMC
- Jaccard
Uses the following quantities:
- M11: p is 1, q is 1
- M00: p is 0, q is 0
- M10: p is 1, q is 0
- M01: p is 0, q is 1
How to handle symmetric binary attributes and how to calculate the similarity
Symmetric binary attribute: both of its states have equal importance (male / female, agreement with political parties)
Similarity Measure SMC:
Number of matches / number of all attributes values
(M11 + M00) / (M11 + M00 + M10 + M01)
How to handle asymmetric binary attributes and how to calculate the similarity
Asymmetric binary attributes: one of the states is more important (item in basket / item not in basket); 1 is the more important state by convention
Similarity Measure Jaccard Coefficient:
Number of 11 matches / number of not both-zero attribute values
M11 / (M01 + M10 + M11)
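A sketch computing both coefficients from two binary vectors of equal length (the example vectors are made up):
```python
def smc(p, q):
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    return (m11 + m00) / len(p)                  # matches / all attribute values

def jaccard(p, q):
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    return m11 / (m11 + m10 + m01)               # 0-0 matches are ignored

p, q = [1, 0, 0, 1, 0], [1, 1, 0, 0, 0]
smc(p, q), jaccard(p, q)                         # -> (0.6, 0.333...)
```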
SMC vs Jaccard
Dating agency (many attributes):
- Jaccard
- only the 11 matches are important
- can ignore 00 combinations
Wahl-O-Mat (fewer attributes):
- SMC
- The 0-0 matches are also important (shared disagreement carries information)
How can you combine similarities ?
- Weight an attribute according to its importance
- The weights should sum up to 1
E.g. weighted euclidean distance
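A sketch of a weighted Euclidean distance (the weights and records below are illustrative; the weights sum to 1):
```python
import numpy as np

def weighted_euclidean(p, q, w):
    p, q, w = (np.asarray(v, dtype=float) for v in (p, q, w))
    return np.sqrt(np.sum(w * (p - q) ** 2))   # each squared difference scaled by its weight

w = [0.5, 0.3, 0.2]                            # illustrative attribute weights, summing to 1
weighted_euclidean([25, 1.80, 75], [30, 1.65, 60], w)
```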
How to choose a good clustering algorithm
A good algorithm ("best" is misleading) depends on:
- the analytical goal of the use case
- the distribution of the data
- Run several algorithms using different hyperparameter settings and feature subsets
- Visualize and interpret the results based on application knowledge
Factors with a high influence on the clustering results:
- Standardization
- Feature Selection
- Distance Function
- Parameter Settings