Cluster Analysis Flashcards
What is Cluster Analysis and what is its Goal
- Intra-cluster distances should be minimized
- Inter-cluster distances should be maximized
Goal: Get a better understanding of the data
Types of clustering and the differences between them
- Partitional clustering
- Divides records into non-overlapping clusters so that each record belongs to exactly one subset
- Hierarchical clustering
- A set of nested clusters (clusters may overlap)
Application areas of cluster analysis
Market Research:
- Identify similar customers
E-Commerce:
- Identify offers of the same product
Image Recognition:
- Identify parts of an image that belong to the same object (1. clustering identifies a region, 2. a classifier identifies the region as an object)
Which types of learning tasks exist and which type is cluster analysis
Supervised Learning:
- Discover patterns in data that relate data attributes with a target (class) attribute
- Use patterns on unseen data to predict target attribute
- Set of classes is known beforehand
- Training data is required
Unsupervised Learning (Cluster analysis):
- Explore data to find intrinsic patterns
- Set of classes/clusters is not known beforehand
- No training data
List cluster analysis methods
- Partitional
- K-Means
- DBSCAN
- Hierarchical
- Agglomerative
- Divisive
Describe K-Means Clustering
- Each cluster has a centroid
- Assign data point to closest centroid
- Number of clusters (centroids) needs to be specified manually
Algorithm:
- Select k initial points as centroids
repeat
  - form k clusters by assigning every point to its closest centroid
  - recompute each centroid as the mean of its cluster
until the centroids don't change
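A minimal NumPy sketch of this loop (illustrative only; the function and variable names are my own, and it assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    # X: (n, d) data matrix; k: number of clusters (must be chosen manually)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # k initial points
    for _ in range(max_iter):
        # form k clusters: assign every point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its cluster
        # (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids don't change: stop
            break
        centroids = new_centroids
    return centroids, labels
```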
What is a Convergence Criterion and list them for K-means
Convergence criterion: stopping condition for the algorithm (a setting that fits the task's requirements)
Default:
- no or minimum change of centroids
Alternative:
- no or minimum re-assignments of data points to different clusters
- stop after x iterations
- min decrease in sum of squared errors (SSE)
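For reference, scikit-learn's KMeans exposes two of these criteria as parameters (a sketch; X stands for an assumed data matrix):

```python
from sklearn.cluster import KMeans

# max_iter caps the number of iterations ("stop after x iterations");
# tol is the minimum centroid change below which the loop stops early.
km = KMeans(n_clusters=3, max_iter=100, tol=1e-4, n_init=10)
km.fit(X)  # X: (n_samples, n_features) data matrix, assumed to exist
```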
What is a cohesion measure
Similarity of the data points within a cluster, i.e. how close the data points in a cluster are to their centroid
How to evaluate K-Means Clustering
Most common: Sum of Squared Errors (SSE)
Error: distance of a point to its nearest centroid
SSE: square the errors and sum them over all k clusters
$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} dist(m_i, x)^2$ (C_i: cluster i, m_i: its centroid)
Minimization task: lower SSE means tighter clusters
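A short sketch of this computation (names are illustrative; with scikit-learn, the fitted model's inertia_ attribute holds the same value):

```python
import numpy as np

def sse(X, labels, centroids):
    # sum over the k clusters of the squared distances
    # from each point to its cluster's centroid
    return sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    )
```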
Weaknesses of K-Means
- Need to determine number of clusters
- All items are forced into a cluster (no notion of outliers/noise)
- Sensitive to outliers
- Result depends on the position of the initial seeds
How can you increase the chance to find good clusters
Option 1) Restart several times with different random seeds
Option 2) Run k-means with different k’s
- SSEs of clusterings with different k's can't be compared directly (the higher k, the lower the SSE)
Workaround:
1. Choose the k at which the SSE improvement flattens off (knee value of k); see the sketch after this list
2. Use X-Means (automatically determines k)
   - start with a small k, split large clusters, and check whether the results improve
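A sketch of the knee workaround with scikit-learn (X is an assumed data matrix; the knee is then read off a plot of these values):

```python
from sklearn.cluster import KMeans

# Run k-means for a range of k values and record the SSE (inertia_).
sse_per_k = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10).fit(X)
    sse_per_k[k] = km.inertia_
# Plot k vs. SSE and pick the k where the improvement flattens (the knee).
```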
How to handle outliers with K-Means
Use K-Medoids
- Uses the medoid (the most centrally located data point of a cluster) instead of the mean; a sketch follows below
- The medoid is less affected by extreme values
Use DBSCAN
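For the K-Medoids option above, a minimal sketch of how a medoid can be computed for one cluster (the function name is my own):

```python
import numpy as np

def medoid(points):
    # pairwise Euclidean distances within the cluster
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # the medoid is the actual member with the smallest total
    # distance to all other members, so outliers pull it less
    return points[dists.sum(axis=1).argmin()]
```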
Advantages of K-Means
- Simple
- Efficient (complexity: O(t·k·n))
  t: iterations, k: clusters, n: data points
Characteristics of DBSCAN
- Density based algorithm
Density: number of points within a specified radius (eps)
Three different classes of data points:
1) Core point: has at least MinPts neighboring points within eps
2) Border point: has fewer than MinPts neighbors within eps, but is itself a neighbor of a core point
3) Noise point: any point that is neither a core nor a border point
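A sketch of this three-way labeling in NumPy (names are my own; the point itself is counted as part of its own eps-neighborhood, one common convention):

```python
import numpy as np

def classify_points(X, eps, min_pts):
    # pairwise distances; neighbors[i, j] is True if j lies within eps of i
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps
    core = neighbors.sum(axis=1) >= min_pts                   # enough points within eps
    border = ~core & (neighbors & core[None, :]).any(axis=1)  # neighbor of a core point
    noise = ~core & ~border                                   # everything else
    return core, border, noise
```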
How does DBSCAN work
- Label all points (core, border, or noise point)
- Eliminate noise points
- Group connected core points within their range (eps) to clusters
- Assign each border point to the cluster of its closest core point
  - if a border point borders multiple clusters: use voting, or random assignment if the vote is tied
Time complexity: O(n log n), dominated by the neighborhood search for each point
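In practice the whole procedure is available off the shelf; a sketch with scikit-learn (X is an assumed data matrix):

```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
labels = db.labels_  # cluster index per point; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```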
How to determine suitable EPS and MinPts values
Ideal: points in a cluster should have their kth nearest neighbor at roughly the same distance, while noise points have their kth nearest neighbor at a farther distance
- MinPts = 4 (rule of thumb)
- Plot the sorted distances of every point to its kth nearest neighbor
- Set eps to the distance at which the curve sharply increases (start of the noise points), as sketched below
Adjusting:
- Decrease k if small clusters are labeled as noise (lowers the required density)
- Increase k if outliers are included in clusters (raises the required density)
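A sketch of the k-distance plot with scikit-learn's NearestNeighbors (k = MinPts = 4 here; X is an assumed data matrix):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

k = 4  # rule-of-thumb MinPts
# querying the training data returns each point itself as its first
# neighbor, so ask for k+1 neighbors to reach the kth real one
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
kth_dist = np.sort(dists[:, -1])  # sorted distance to the kth nearest neighbor
# plot kth_dist and set eps where the curve bends sharply upward (noise starts)
```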
When does DBSCAN work well
- Resistant to noise
- Handles clusters of different shapes and sizes