Clustering Flashcards
Clustering Criteria
1) Distance
2) conceptual (shared attributes)
3) density
Applications of Clustering
Pattern recognition, spatial data analysis, image processing, economic science (market research), WWW
Good clustering characteristics (what is optimized)
High intra-class similarity - minimize intra-cluster distances
Low inter-class similarity - maximize inter-cluster distances
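A minimal NumPy sketch of measuring both quantities; the 2-D points and cluster labels here are hypothetical, purely for illustration:

```python
import numpy as np

# Hypothetical 2-D points with hypothetical cluster labels (illustration only)
X = np.array([[1, 1], [1, 2], [2, 1],     # cluster 0
              [8, 8], [8, 9], [9, 8]])    # cluster 1
labels = np.array([0, 0, 0, 1, 1, 1])

centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])

# Intra-cluster distance: average distance from each point to its own centroid (minimize)
intra = np.mean([np.linalg.norm(x - centroids[k]) for x, k in zip(X, labels)])

# Inter-cluster distance: distance between the two centroids (maximize)
inter = np.linalg.norm(centroids[0] - centroids[1])

print(f"intra = {intra:.3f}, inter = {inter:.3f}")
```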
Major clustering approaches
- partitioning algorithms (k-means)
- hierarchical methods - cluster tree
- density-based methods (e.g. DBSCAN)
Partitioning Clustering
Must define the number of clusters you want
Global optimum - exhaustively enumerate all possible partitions (intractable for real data)
Heuristic methods - k-means / k-medoids
Hierarchical agglomerative clustering
Every point starts as its own cluster. At each step, the two closest clusters merge, until only one big cluster is left at the top. This produces a dendrogram that you can cut at any height; the height of a merge indicates how far apart the joined clusters are.
agglomerative = bottom up
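A short sketch of bottom-up merging, assuming SciPy's hierarchical-clustering routines and invented toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Toy 2-D data: two loose groups, invented for illustration
X = np.array([[0, 0], [0, 1], [1, 0],
              [5, 5], [5, 6], [6, 5]])

# Bottom-up merging; each point starts as its own cluster
Z = linkage(X, method="average")

# Each row of Z records one merge: [cluster_i, cluster_j, merge_height, new_size]
print(Z)
# scipy.cluster.hierarchy.dendrogram(Z) draws the tree; low merges = close items
```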
Similarity measures in clustering
Distance-based - Euclidean, Manhattan, Minkowski
Correlation distance - the degree to which two variables are related
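These measures are sketched below with SciPy's distance module; the example vectors are made up:

```python
from scipy.spatial import distance

a, b = [0.0, 0.0, 1.0], [1.0, 2.0, 3.0]   # two made-up vectors

print(distance.euclidean(a, b))       # straight-line distance
print(distance.cityblock(a, b))       # Manhattan: sum of absolute differences
print(distance.minkowski(a, b, p=3))  # Minkowski generalizes both (p=1 Manhattan, p=2 Euclidean)
print(distance.correlation(a, b))     # 1 - Pearson correlation between the vectors
```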
Inter cluster similarity
Inter - between clusters.
Min, max, group average, distance between centroids, or some other novel measure (see the sketch below)
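A quick NumPy/SciPy sketch of the four standard measures, computed on two hypothetical clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[0, 0], [0, 1], [1, 0]])
B = np.array([[4, 4], [5, 5]])

D = cdist(A, B)  # all pairwise distances between the two clusters

print(D.min())   # min (single-link)
print(D.max())   # max (complete-link)
print(D.mean())  # group average
print(np.linalg.norm(A.mean(axis=0) - B.mean(axis=0)))  # distance between centroids
```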
Hierarchical clustering issues
Distinct clusters are not produced directly
Methods to cut the dendrogram into clusters exist, but they are somewhat arbitrary (see the sketch below)
If the original data has no hierarchical structure, this approach may be completely the wrong fit
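One common (if still somewhat arbitrary) way to cut is SciPy's fcluster, sketched here on toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [9, 9]])  # toy data
Z = linkage(X, method="average")

# Cut by a desired number of clusters...
labels_k = fcluster(Z, t=2, criterion="maxclust")
# ...or by a distance threshold; either choice is a judgment call
labels_d = fcluster(Z, t=3.0, criterion="distance")

print(labels_k, labels_d)
```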
K Means clustering algorithm
Step 0: Start with a random partition into k clusters (pick k data points as the starting cluster centers)
Step 1: Generate a new clustering by assigning each data point to its closest cluster center
Step 2: Compute new centroids
Step 3: Repeat steps 1 and 2 until there is no change in membership
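A minimal NumPy sketch of these steps (Lloyd's algorithm); the data is invented and the empty-cluster edge case is ignored:

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 0: pick k data points as the starting cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    while True:
        # Step 1: assign each point to its closest center
        dists = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop once membership no longer changes
        if np.array_equal(new_labels, labels):
            return labels, centers
        labels = new_labels
        # Step 2: recompute each centroid (empty-cluster edge case ignored here)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels, centers)
```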
K-means - optimizing the number of clusters
Elbow plot
Plot the reduction in variation (y-axis) against the number of clusters (x-axis); the curve rises steeply, then flattens. Pick k at the "elbow" where adding more clusters gives diminishing returns.
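A sketch of how the elbow curve is typically computed, assuming scikit-learn's KMeans and synthetic blob data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in ((0, 0), (5, 5), (0, 5))])

# Inertia = within-cluster sum of squares; its drop flattens past the "true" k
for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))
# Plot inertia vs. k and pick the elbow where the curve bends
```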
Properties of K-means
Guaranteed to converge
Guaranteed to reach a local optimum, but not necessarily the global optimum
Pros
Low computational complexity (roughly linear in the number of points per iteration)
Cons
Must specify K
Sensitive to noise / outliers
Clusters are sensitive to the initial random points chosen - different starts can yield different clusterings (see the sketch below)
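A quick demonstration of this sensitivity, again assuming scikit-learn (n_init restarts k-means and keeps the best run):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, size=(20, 2)) for c in ((0, 0), (3, 0), (1.5, 3))])

# Single random starts can land in different local optima with different inertias
for seed in range(3):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(seed, round(km.inertia_, 2))

# Restarts (n_init > 1) and k-means++ seeding are the usual mitigation
best = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("best of 10 restarts:", round(best.inertia_, 2))
```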
Density-based clustering methods
Local cluster criterion
Major features:
can discover clusters of arbitrary shape
handles noisy data well
needs only one scan of the data
needs density parameters as a termination condition
DBSCAN Process
Define the neighborhood radius N (often called eps)
Define minpts (the minimum number of neighbors required for a point to seed a cluster)
Categorize points as:
Core: has at least minpts points in its N-neighborhood
Border: has at least one core point in its neighborhood, but does not itself meet the minpts criterion
Noise: all other points
Grow clusters with a DFS from each core point, assigning density-reachable points to the same cluster (see the sketch below)
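A compact sketch assuming scikit-learn's DBSCAN, where eps plays the role of the radius N and the data is synthetic:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal((0, 0), 0.3, size=(40, 2)),  # dense blob
    rng.normal((4, 4), 0.3, size=(40, 2)),  # second dense blob
    rng.uniform(-2, 6, size=(10, 2)),       # sparse background noise
])

db = DBSCAN(eps=0.6, min_samples=5).fit(X)  # eps = N, min_samples = minpts
print(set(db.labels_))                      # cluster ids; -1 marks noise points
# Core points are listed in db.core_sample_indices_
```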
Density Clustering Drawbacks
Very sensitive to the user-chosen parameters - N (eps) and minpts (see the sketch below)
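A small demonstration of that sensitivity on synthetic data, varying eps while holding minpts fixed:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
X = np.vstack([rng.normal((0, 0), 0.3, size=(50, 2)),
               rng.normal((3, 3), 0.3, size=(50, 2))])

# Small changes to eps (N) can merge clusters or turn points into noise
for eps in (0.1, 0.3, 0.6, 1.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: {n_clusters} clusters, {(labels == -1).sum()} noise points")
```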