Cluster Analysis Flashcards
What is Cluster Analysis and what is its Goal
- Intra-cluster distances should be minimized
- Inter-cluster distances should be maximized
Goal: Get a better understanding of the data
Types of clustering and the differences between them
- Partitional clustering
- Divides records into non-overlapping clusters so that each record is in exactly one subset
- Hierarchical clustering
- A set of nested clusters (clusters may overlap)
Application areas of cluster analysis
Market Research:
-Identify similar customers
E-Commerce:
-Identify offers of the same product
Image Recognition:
-Identify parts of an image that belong to the same object (1. clustering identifies a region, 2. a classifier identifies the region as an object)
Which types of learning tasks exist and which type is cluster analysis
Supervised Learning:
- Discover patterns in data that relate data attributes with a target (class) attribute
- Use patterns on unseen data to predict target attribute
- Set of classes is known before
- Training data
Unsupervised Learning (Cluster analysis):
- Explore data to find intrinsic patterns
- Set of classes/clusters not known before
- No training data
List cluster analysis methods
- Partitional
- K-Means
- DBSCAN
- Hierarchical
- Agglomerative
- Divisive
Describe K-Means Clustering
- Each cluster has a centroid
- Assign data point to closest centroid
- The number of clusters (centroids) needs to be specified manually
- Select k initial points as centroids
repeat
- form k clusters by assigning all points to their closest centroid
- recompute centroid (to the mean of each cluster)
until centroids don’t change
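A minimal NumPy sketch of this loop (illustrative only; X is assumed to be an (n, d) array of numeric data points, and empty clusters are not handled):
```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Naive k-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # select k initial points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # form k clusters by assigning every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # stop when the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```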
What is a Convergence Criterion and list them for K-means
Convergence criterion: stopping condition for the algorithm (a setting that fits the task's requirements)
Default:
- no or minimum change of centroids
Alternative:
- no or minimum re-assignments of data points to different clusters
- stop after x iterations
- min decrease in sum of squared errors (SSE)
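In libraries these criteria typically show up as hyperparameters; for example, scikit-learn's KMeans exposes an iteration limit and a tolerance (the values below are arbitrary):
```python
from sklearn.cluster import KMeans

# max_iter: "stop after x iterations"; tol: stop when the centroids barely change
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=42)
```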
What is a cohesion measure
Similarity of the data points within a cluster
How close the data points in a cluster are to their centroid
How to evaluate K-Means Clustering
Most common: Sum of Squared Errors (SSE)
Error: distance of a point to its nearest centroid
SSE: square the errors and sum them over all k clusters
SSE = SUM over clusters i ( SUM over points x in cluster i ( dist(c_i, x)^2 ) )
Minimization task (lower SSE is better for a fixed k)
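A small sketch of the SSE computation, assuming labels and centroids come from a k-means run like the sketch above:
```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of squared distances of each point to its assigned centroid."""
    total = 0.0
    for i, c in enumerate(centroids):
        diff = X[labels == i] - c      # errors of the points in cluster i
        total += np.sum(diff ** 2)     # square the errors and sum them
    return total
```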
Weaknesses of K-Means
- Need to determine number of clusters
- All items are assigned to a cluster
- Sensitive to outliers
- Depends on position of initial seeds
How can you increase the chance to find good clusters
Option 1) Restart several times with different random seeds
Option 2) Run k-means with different k’s
- SSE values of clusterings with different k's can't be compared directly (the higher k, the lower the SSE)
Workaround:
1. Choose k where the SSE improvement starts to level off (knee value of k; see the sketch after this list)
2. Use X-Means (automatically determines k)
- start with a small k, split large clusters and check if the results improve
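A sketch of the knee/elbow workaround using scikit-learn, plotting the SSE (inertia_) over a range of k values (range chosen arbitrarily):
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def plot_sse_vs_k(X, k_range=range(2, 11)):
    sse_values = []
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        sse_values.append(km.inertia_)        # inertia_ holds the SSE of the fit
    plt.plot(list(k_range), sse_values, marker="o")
    plt.xlabel("k")
    plt.ylabel("SSE")
    plt.show()                                # pick k at the knee of the curve
```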
How to handle outliers with K-Means
Use K-Medoids
- Uses the medoid (the most centrally located actual data point) instead of the mean
- the medoid is less affected by extreme values
Use DBSCAN
Advantages of K-Means
- Simple
- Efficient (Complexity: O(tkn))
t: iterations, k: clusters, n: data points
Characteristics of DBSCAN
- Density based algorithm
Density: number of points within a specified radius (eps)
Three different classes of data points:
1) Core points: Neighboring points (within eps) >= MinPts
2) Border points: Neighboring points (within eps) < MinPts, but a neighbor of a core point
3) Noise point: any point that is not 1 or 2
How does DBSCAN work
- Label all points (core, border, or noise point)
- Eliminate noise points
- Group connected core points within their range (eps) to clusters
- Assign each border point to a cluster based on its closest core point
if a border point lies on the border of multiple clusters: use voting or random assignment (if the vote is tied)
Time complexity of O(n log n): dominated by the neighborhood search for each point
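A minimal usage sketch with scikit-learn's DBSCAN (the eps and min_samples values are placeholders; X is again an (n, d) array of points):
```python
import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=4).fit(X)   # X: (n, d) array of data points
labels = db.labels_                          # cluster ids; noise points get the label -1
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True      # True for core points, False for border/noise points
```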
How to determine suitable EPS and MinPts values
Ideal: points in a cluster should have their kth nearest neighbor at roughly the same distance; noise points have their kth nearest neighbor at a farther distance
- MinPts = 4 (rule of thumb)
- Plot the sorted distance of every point to its kth nearest neighbor
- Set eps to the point where the distance sharply increases (start of noise points)
Adjusting:
- Decrease k if small clusters are labeled as noise (lowers the required density)
- Increase k if outliers are included in clusters (raises the required density)
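A sketch of the k-distance plot used to choose eps (scikit-learn's NearestNeighbors, with k = MinPts = 4 as the rule of thumb):
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4                                              # MinPts (rule of thumb)
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)    # +1 because each point is returned as its own neighbor
dists, _ = nn.kneighbors(X)
kth_dist = np.sort(dists[:, -1])                   # distance of every point to its kth nearest neighbor, sorted
plt.plot(kth_dist)
plt.ylabel("distance to kth nearest neighbor")
plt.show()                                         # set eps where the curve bends sharply upward
```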
When does DBSCAN work well
- Resistant to noise
- If clusters have different shapes and sizes
When does DBSCAN not work well
- Datasets of varying densities
What is Hierarchical Clustering
- Produces a set of nested clusters
- Can be visualized as a Dendrogram
What is a Dendrogram
- Tree like diagram that records sequences of merges or splits
- The y-axis displays the distance at which two clusters were merged
Strength of Hierarchical Clustering
- Don’t need to assume any particular number of clusters
- Desired number of clusters can be obtained by cutting the dendrogram at the proper level
- can be used to discover meaningful taxonomies (biological species / customer groups)
What are the two main types of hierarchical clustering
Agglomerative (bottom-up):
- Start with single points and merge iteratively until only one cluster is left
- requires similarity measure
Divisive (Top-down):
- Start with one large cluster
- Divide clusters until each cluster contains a single point
- requires distance measure
Agglomerative is more often used
The basic agglomerative algorithm
- Compute the proximity matrix
- Each data point is a cluster
Repeat:
- Merge the two closest clusters
- Update the proximity matrix
Until only one cluster remains
Key operation: calculating the proximity between two clusters (to update the proximity matrix)
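A sketch using SciPy, which implements this loop and draws the dendrogram (the linkage method and the cut into 3 clusters are arbitrary choices):
```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

Z = linkage(X, method="average")                    # records the sequence of merges
dendrogram(Z)                                       # visualize the nested clusters
plt.show()
clusters = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
```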
How to determine Inter-Cluster Similarity
- Single Link
- Complete Link
- Group Average
- Distance between centroids
How to calculate the cluster similarity: Single Link
- Similarity is based on two most similar (closest) points of different clusters
- Determined by one pair of points
How to calculate the cluster similarity: Complete Linkage
- Similarity is based on the two least similar (most distant) points in different clusters
- Shortest longest distance
- Determined by all points in the two clusters
Differences between Single Link and Complete Linkage
Single Link (Shortest Distance):
+ Can handle non-elliptic shapes
- Sensitive to noise and outliers
Complete Linkage (Shortest Longest Distance):
+ Less sensitive to noise and outliers
- Biased towards globular clusters
- Tends to break large clusters (because decisions can’t be undone)
General characteristic of agglomerative clustering
They make good local decisions about merging two clusters, but a global optimum is not guaranteed because a cluster merge cannot be undone afterwards.
How to calculate the cluster similarity: Group Average
- Proximity is the average of pair-wise proximity between all points in the two clusters
Comparison between Group Average and Single / Complete Link
Compromise between single and complete link
+ Less sensitive to noise and outliers than single link
- Biased towards globular clusters
General Problems and Limitations of hierarchical clustering
- Sensitivity to noise and outliers (except complete link)
- Difficulty handling non-elliptic shapes (except single link)
- Breaks large clusters
High Space and Time Complexity
- At least O(N^2) due to proximity matrix
- Often O(N^3): N steps, each requiring a search and update of the proximity matrix
- Workaround: Apply hierarchical clustering to a random sample of the original data (< 10,000 examples)
List the proximity measures
Similarity:
- Numerical value of how alike two data objects are
- Often in the range [0,1]
Dissimilarity / Distance:
- Numerical value of how different two data objects are
There are proximity measures for single attributes and for multidimensional data points (records)
Levenshtein Distance
- Measures Dissimilarity of two strings
- Minimum number of edits to transform one string into the other
Edit operations:
- insert a character
- delete a character
- replace a character
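A compact dynamic-programming sketch of the Levenshtein distance:
```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insert/delete/replace operations to turn a into b."""
    prev = list(range(len(b) + 1))                # distances from "" to every prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1           # replacing is free when characters match
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # replace (or keep) a character
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")                  # -> 3
```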
Proximity of Multidimensional Data Points
- A record with many attributes (age, height, weight)
- Euclidean Distance
- Simple Matching Coefficient
- Jaccard Coefficient
Formula of Euclidean Distance
dist(p, q) = sqrt( SUM over k ( (pk - qk)^2 ) )
pk and qk are the kth attributes of data points p and q
Square root of the summed squared differences over all attributes of the two records
Euclidean Distance is for numerical values
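A small NumPy sketch of this formula (the two records below are made up):
```python
import numpy as np

def euclidean(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))   # sqrt of the summed squared attribute differences

euclidean([25, 1.80, 75], [30, 1.65, 60])  # two made-up records: age, height, weight
```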
Why is normalization important
- To ensure that all attributes have equal influence on the computed distance
- Normalized attributes have a common value range e.g. [0,1]
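A min-max normalization sketch that maps each attribute of a data matrix to [0,1] (constant columns would need extra handling):
```python
import numpy as np

def min_max_normalize(X):
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    return (X - col_min) / (col_max - col_min)   # every attribute now lies in [0, 1]
```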
How to obtain the similarity of Binary Attributes
E.g. products in shopping basket or courses attended by students
- SMC
- Jaccard
Uses the following quantities:
- M11: p is 1, q is 1
- M00: p is 0, q is 0
- M10: p is 1, q is 0
- M01: p is 0, q is 1
How to handle symmetric binary attributes and how to calculate the similarity
Symmetric binary attribute: both of its states have equal importance (male / female, agreement with political parties)
Similarity Measure SMC:
Number of matches / number of all attributes values
(M11 + M00) / (M11 + M00 + M10 + M01)
How to handle asymmetric binary attributes and how to calculate the similarity
Asymmetric binary attributes: one of the states is more important (item in basket / item not in basket); 1 is the more important state by convention
Similarity Measure Jaccard Coefficient:
Number of 11 matches / number of not both-zero attribute values
M11 / (M01 + M10 + M11)
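A sketch computing both coefficients from two binary vectors of equal length (the example vectors are made up):
```python
def smc(p, q):
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
    return (m11 + m00) / len(p)                  # matches / all attribute values

def jaccard(p, q):
    m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
    m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
    return m11 / (m11 + m10 + m01)               # 0-0 matches are ignored

p, q = [1, 0, 0, 1, 0], [1, 1, 0, 0, 0]
smc(p, q), jaccard(p, q)                         # -> (0.6, 0.333...)
```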
SMC vs Jaccard
Dating agency (many attributes):
- Jaccard
- only the 11 matches are important
- can ignore 00 combinations
Wahl-O-Mat (fewer attributes):
- SMC
- The 0-0 matches are also important (shared disagreement carries information)
How can you combine similarities ?
- Weight an attribute according to its importance
- The weights should sum up to 1
E.g. weighted euclidean distance
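A sketch of a weighted Euclidean distance (the weights and records below are illustrative; the weights sum to 1):
```python
import numpy as np

def weighted_euclidean(p, q, w):
    p, q, w = (np.asarray(v, dtype=float) for v in (p, q, w))
    return np.sqrt(np.sum(w * (p - q) ** 2))   # each squared difference scaled by its weight

w = [0.5, 0.3, 0.2]                            # illustrative attribute weights, summing to 1
weighted_euclidean([25, 1.80, 75], [30, 1.65, 60], w)
```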
How to choose a good clustering algorithm
A good algorithm ("best" is misleading) depends on:
- the analytical goal of the use case
- the distribution of the data
- Run several algorithms using different hyperparameter settings and feature subsets
- Visualize and interpret the results based on application knowledge
Factors with a high influence on the clustering results:
- Standardization
- Feature Selection
- Distance Function
- Parameter Settings