Part 2: Clustering Flashcards
Clustering (unsupervised)
- Similarity
- Partitioning (combinatorial and k-means)
- Hierarchical
- Model based
Applications of clustering
- Market segmentation: groups of customers based on past purchasing behavior, demographic characteristics or other features.
- City-planning: groups of houses according to their house type, value, and geographical location.
- Medical: divide lab data into two clusters: healthy and suspicious.
Objective of clustering
Divide the objects in the data into natural groups (clusters) such that:
- Objects in the same cluster are similar.
- Objects in different clusters are dissimilar.
Unsupervised: there is no class label; the examples themselves determine how the data can be grouped together.
Cluster analysis
Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Distances between data points within a cluster are minimized, and distances between data points of different clusters are maximized.
- Goal is to group together similar data; what does that mean?
  + The similarity measure is often more important than the clustering algorithm used.
  + Define a distance function as in k-nearest neighbors, but in clustering no class label is available to pick discriminating attributes.
Distance
- For numerical values:
  + Euclidean distance
  + Manhattan distance
- For nominal/binary attributes:
  + Proportion of unequal attributes out of the total number of attributes.
When you have mixed attributes, you need to use weights, so that no single attribute dominates the distance.
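A minimal sketch of these distance measures in Python (function names and the simple weighting scheme are illustrative, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance for numeric attributes."""
    return np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def manhattan(x, y):
    """Manhattan (city-block) distance for numeric attributes."""
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)))

def nominal_mismatch(x, y):
    """Proportion of attributes with unequal values (nominal/binary)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x != y)

def mixed_distance(x_num, y_num, x_nom, y_nom, w_num=0.5, w_nom=0.5):
    """Weighted combination so that neither attribute type dominates.
    Numeric attributes are assumed to be scaled to [0, 1] beforehand."""
    return w_num * manhattan(x_num, y_num) / len(x_num) + w_nom * nominal_mismatch(x_nom, y_nom)

print(euclidean([0, 0], [3, 4]))                     # 5.0
print(manhattan([0, 0], [3, 4]))                     # 7
print(nominal_mismatch(["red", "S"], ["red", "M"]))  # 0.5
```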
Types of clustering methods
- Partitioning methods
- Hierarchical methods
- Model based clustering
Partitioning - Combinatorial
- Assign each object to one of k clusters.
- Minimize sum of distances within clusters
- Maximize sum of distances between clusters
The distances (combinatorial)
- W is the sum of all distances between points in the same cluster.
- B is the sum of all distances between points in different clusters.
- W + B = constant -> combining W and B gives the sum of all pairwise distances (counted twice if both orderings are summed), which does not depend on the clustering.
The algorithm minimizes W and maximizes B. Minimizing W takes exponential time in the number of data points.
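A small sketch, assuming W and B are computed from Euclidean distances and each unordered pair is counted once, showing that W + B stays the same for any cluster assignment (function and variable names are my own):

```python
import numpy as np
from itertools import combinations

def within_between(points, labels):
    """Sum of within-cluster (W) and between-cluster (B) pairwise distances."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    W = B = 0.0
    for i, j in combinations(range(len(points)), 2):
        d = np.linalg.norm(points[i] - points[j])
        if labels[i] == labels[j]:
            W += d          # same cluster -> within
        else:
            B += d          # different clusters -> between
    return W, B

pts = [[0, 0], [1, 0], [10, 0], [11, 0]]
W, B = within_between(pts, [0, 0, 1, 1])
total = sum(np.linalg.norm(np.subtract(a, b)) for a, b in combinations(pts, 2))
print(W, B, np.isclose(W + B, total))   # W is small, B is large, W + B is the fixed total
```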
K-means clustering
Because minimizing W exactly takes too much time, k-means clustering was developed. It is a good approximation: not exact, but fast.
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid.
Steps:
- Select K points as the initial centroids.
- Repeat until the centroids no longer change:
  + Form K clusters by assigning each point to its closest centroid.
  + Recompute the centroid of each cluster.
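A minimal k-means sketch following these steps (plain NumPy; it does not handle the rare case of an empty cluster, and the names are illustrative):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```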
How many clusters are needed/wanted?
- Algorithmic approach
- Decide yourself
Selecting the right K
- Try different k, looking at the change in the average distance to centroids as k increases.
- The average falls rapidly until the right k, then changes little.
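A sketch of this elbow heuristic, assuming scikit-learn's KMeans is available; the synthetic three-blob data is only for illustration, so the elbow should appear around k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [5, 5], [10, 0])])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Average distance from each point to its assigned centroid.
    avg_dist = np.mean(np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1))
    print(k, round(avg_dist, 3))   # drops sharply up to k = 3, then flattens
```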
Hierarchical clustering
- Agglomerative -> bottom-up
- Divisive -> top-down
Hierarchical clustering - Bottom-up
- Start with many clusters: every data point begins in its own cluster.
- At each level, two clusters are merged into one.
- From the distance matrix, find the two nearest clusters and merge them.
- Update distance matrix.
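A bare-bones bottom-up sketch with single linkage (names are my own); instead of shrinking the distance matrix after each merge, it recomputes cluster distances from the original point-to-point matrix, which is equivalent but slower:

```python
import numpy as np

def agglomerative(X, n_clusters=2):
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]               # each point starts in its own cluster
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distance matrix

    def cluster_dist(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(D[i, j] for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the two nearest clusters and merge them.
        pairs = [(cluster_dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

print(agglomerative([[0, 0], [0, 1], [5, 5], [5, 6]], n_clusters=2))
# [[0, 1], [2, 3]]
```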
Hierarchical clustering - cluster distance
- Single linkage
- Complete linkage
- Average linkage
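A sketch comparing the three linkage criteria, assuming SciPy's scipy.cluster.hierarchy is available; the merge tree is built bottom-up and then cut into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                     # bottom-up merge tree (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```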
Hierarchical clustering - Top-down
- Apply 2-means (k-means with k = 2) to the root.
- Apply k-means to the new clusters.
- Repeat the k-means splits until the desired number of clusters is reached.
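A rough top-down (bisecting) sketch using scikit-learn's KMeans; always splitting the largest remaining cluster is just one reasonable choice, not prescribed by the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters=3, seed=0):
    X = np.asarray(X, dtype=float)
    clusters = [np.arange(len(X))]                    # start with everything in the root cluster
    while len(clusters) < n_clusters:
        # Pick the largest cluster and split it with 2-means.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in ([0, 0], [4, 4], [8, 0])])
print([len(c) for c in bisecting_kmeans(X, n_clusters=3)])   # roughly [20, 20, 20]
```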