Chapter 5: Cluster Analysis Flashcards
What is cluster analysis?
Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.
The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.
Hierarchical vs Partitional
Partitional Clustering
A division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
Hierarchical vs Partitional
Hierarchical Clustering
If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree.
Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects.
Often, but not always, the leaves of the tree are singleton clusters of individual data objects.
Exclusive vs Overlapping vs Fuzzy
Exclusive Clustering
Exclusive clustering assigns each object to a single cluster.
Exclusive vs Overlapping vs Fuzzy
Overlapping Clustering
aka Non-Exclusive Clustering
An object can simultaneously belong to more than one group (class).
Exclusive vs Overlapping vs Fuzzy
Fuzzy Clustering
Every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs).
In other words, clusters are treated as fuzzy sets.
Complete versus Partial Clustering
A complete clustering assigns every object to a cluster, whereas a partial clustering does not.
The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups.
5 Different Types of Clusters
- Well-separated
- Prototype-based
- Graph-based
- Density-based
- Shared-Property (Conceptual Clusters)
Different Types of Clusters
Well-Separated
A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster.
Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another.
Different Types of Clusters
Prototype-Based
A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster.
For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average of all the points in the cluster.
When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e., the most representative point of a cluster.
For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters.
Different Types of Clusters
Graph-based
If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component; i.e. a group of objects that are connected to one another, but that have no connection to objects outside the group.
Graph-based cluster
Contiguity-based cluster
Where two objects are connected only if they are within a specified distance of each other.
This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.
This definition of a cluster is useful when clusters are irregular or intertwined.
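A contiguity-based cluster is simply a connected component of the graph obtained by linking points that lie within the distance threshold. Below is a minimal sketch of that idea, assuming the points are rows of a NumPy array and using a hypothetical threshold eps; it relies on SciPy's connected_components and is illustrative rather than the chapter's own code.

```python
# Sketch: contiguity-based (connected-component) clustering.
# `eps` is a hypothetical distance threshold chosen by the user.
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def contiguity_clusters(X, eps):
    # Connect two objects only if they are within distance eps of each other.
    adjacency = csr_matrix(cdist(X, X) <= eps)
    # Each connected component of the resulting graph is one cluster.
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels
```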
Graph-based cluster
Clique
A set of nodes in a graph that are completely connected to each other.
Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique.
Different Types of Clusters
Density-Based
A cluster is a dense region of objects that is surrounded by a region of low density.
Different Types of Clusters
Shared-Property (Conceptual Clusters)
More generally, we can define a cluster as a set of objects that share some property.
This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid.
K-means
This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.
Agglomerative Hierarchical Clustering
This clustering approach refers to a collection of closely-related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains.
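As a quick illustration, SciPy's hierarchical clustering routines implement this merge process; in the minimal sketch below, the choice of single linkage and the cut into 3 flat clusters are assumptions made only for illustration.

```python
# Sketch: agglomerative hierarchical clustering with SciPy.
# Starts from singleton clusters and repeatedly merges the two closest
# clusters; the linkage criterion and the cut into 3 clusters are
# illustrative choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                        # 20 two-dimensional points
Z = linkage(X, method="single")                  # the merge history (cluster tree)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
```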
DBSCAN
This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.
Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.
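For reference, scikit-learn provides an implementation; the sketch below uses assumed values for eps and min_samples, chosen only for illustration.

```python
# Sketch: density-based clustering with scikit-learn's DBSCAN.
# eps and min_samples are illustrative values, not defaults from the chapter.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(100, 2)
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
# Points in low-density regions are labeled -1 (noise) and left out,
# so the resulting clustering is partial rather than complete.
```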
Algorithm 5.1
Basic K-means algorithm
- Select K points as initial centroids.
- repeat
- . . Form K clusters by assigning each point to its closest centroid.
- . . Recompute the centroid of each cluster.
- until Centroids do not change.
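A minimal NumPy sketch of Algorithm 5.1 follows; the function and variable names are my own, and the initial centroids are simply a random sample of the points.

```python
# Sketch of the basic K-means algorithm (Algorithm 5.1) in NumPy.
import numpy as np

def basic_kmeans(X, K, rng=np.random.default_rng(0)):
    # Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    while True:
        # Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute the centroid of each cluster.  (An empty cluster would
        # produce a NaN centroid here; handling that case is discussed later.)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Until the centroids do not change.
        if np.allclose(new_centroids, centroids):
            return labels, centroids
        centroids = new_centroids
```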
Algorithm 5.2
K-means++ initialization algorithm
- For the first centroid, pick one of the points at random.
- for i=1 to number of trials do
- . . Compute the distance, d(x), of each point to its closest centroid.
- . . Assign each point a probability proportional to each point’s d(x)².
- . . Pick a new centroid from the remaining points using the weighted probabilities.
- end for
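A corresponding sketch of the K-means++ initialization, again with names of my own and the loop written to pick the remaining K-1 centroids:

```python
# Sketch of the K-means++ initialization (Algorithm 5.2) in NumPy.
import numpy as np

def kmeans_pp_init(X, K, rng=np.random.default_rng(0)):
    # For the first centroid, pick one of the points at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # Compute the distance, d(x), of each point to its closest centroid.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        # Assign each point a probability proportional to d(x)^2.
        # (Points already chosen have d(x) = 0, so only the remaining
        # points can be selected.)
        p = d ** 2 / np.sum(d ** 2)
        # Pick the new centroid using the weighted probabilities.
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)
```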
K Means
Time and Space Complexity
The space requirements for K-means are modest because only the data points and centroids are stored.
Specifically, the storage required is O((m + K) n), where m is the number of points and n is the number of attributes.
The time requirements for K-means are also modest, basically linear in the number of data points. In particular, the time required is O(I × K × m × n), where I is the number of iterations required for convergence.
I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
K-Means: Additional Issues
Handling Empty Clusters
One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step.
If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary.
One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error.
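A tiny sketch of that replacement strategy, reusing the array conventions from the K-means sketch above (names are my own):

```python
# Sketch: replace an empty cluster's centroid with the point that is
# farthest from any current centroid.
import numpy as np

def replace_empty_centroid(X, centroids, empty_k):
    # Distance of every point to its closest current centroid.
    d = np.min(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    # The farthest such point contributes most to the total squared error.
    centroids[empty_k] = X[np.argmax(d)]
    return centroids
```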
K-Means: Additional Issues
Outliers
When the squared error criterion is used, outliers can unduly influence the clusters that are found.
In particular, when outliers are present, the resulting cluster centroids (prototypes) are typically not as representative as they otherwise would be and thus, the SSE will be higher. Because of this, it is often useful to discover outliers and eliminate them beforehand.
It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.
K-Means: Reducing the SSE with Postprocessing
2 Ways to reduce SSE by increasing the number of clusters
- Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
- Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE with respect to their closest centroids.
K-Means: Reducing the SSE with Postprocessing
2 Strategies to decrease the number of clusters
- Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to the other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
- Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward’s method, respectively.
Algorithm 5.3.
Bisecting K-means algorithm
- Initialize the list of clusters to contain the cluster consisting of all points.
- repeat
- . . Remove a cluster from the list of clusters.
- . . {Perform several “trial” bisections of the chosen cluster.}
- . . for i=1 to number of trials do
- . . . . Bisect the selected cluster using basic K-means.
- . . end for
- . . Select the two clusters from the bisection with the lowest total SSE.
- . . Add these two clusters to the list of clusters.
- until The list of clusters contains K clusters.
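A sketch of Algorithm 5.3 that reuses the basic_kmeans sketch given after Algorithm 5.1; choosing the largest-SSE cluster to split and the number of trial bisections are illustrative assumptions:

```python
# Sketch of the bisecting K-means algorithm (Algorithm 5.3),
# building on the basic_kmeans sketch defined earlier.
import numpy as np

def sse(points, centroid):
    return np.sum((points - centroid) ** 2)

def bisecting_kmeans(X, K, n_trials=10):
    # Initialize the list of clusters to contain the cluster of all points.
    clusters = [X]
    while len(clusters) < K:
        # Remove a cluster from the list (here: the one with the largest SSE).
        idx = int(np.argmax([sse(c, c.mean(axis=0)) for c in clusters]))
        chosen = clusters.pop(idx)
        best_parts, best_sse = None, np.inf
        # Perform several "trial" bisections of the chosen cluster.
        for _ in range(n_trials):
            labels, cents = basic_kmeans(chosen, 2, rng=np.random.default_rng())
            parts = [chosen[labels == 0], chosen[labels == 1]]
            total = sum(sse(p, c) for p, c in zip(parts, cents))
            # Keep the bisection with the lowest total SSE.
            if total < best_sse:
                best_parts, best_sse = parts, total
        # Add these two clusters to the list of clusters.
        clusters.extend(best_parts)
    return clusters
```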