Chapter 5: Cluster Analysis Flashcards

1
Q

What is cluster analysis?

A

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.

The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

2
Q

Hierarchical vs Partitional

Partitional Clustering

A

A division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

3
Q

Hierarchical vs Partitional

Hierarchical Clustering

A

If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree.

Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects.

Often, but not always, the leaves of the tree are singleton clusters of individual data objects.

4
Q

Exclusive vs Overlapping vs Fuzzy

Exclusive Clustering

A

Exclusive clustering assigns each object to a single cluster.

5
Q

Exclusive vs Overlapping vs Fuzzy

Overlapping Clustering

aka Non-Exclusive Clustering

A

An object can simultaneously belong to more than one group (class).

6
Q

Exclusive vs Overlapping vs Fuzzy

Fuzzy Clustering

A

Every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs).

In other words, clusters are treated as fuzzy sets.

7
Q

Complete versus Partial Clustering

A

A complete clustering assigns every object to a cluster, whereas a partial clustering does not.

The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups.

8
Q

5 Different Types of Clusters

A
  • Well-separated
  • Prototype-based
  • Graph-based
  • Density-based
  • Shared-Property (Conceptual Clusters)
9
Q

Different Types of Clusters

Well-Separated

A

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster.

Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another.

10
Q

Different Types of Clusters

Prototype-Based

A

A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster.

For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average of all the points in the cluster.

When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e. the most representative point of a cluster.

For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters.

11
Q

Different Types of Clusters

Graph-based

A

If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component; i.e. a group of objects that are connected to one another, but that have no connection to objects outside the group.

12
Q

Graph-based cluster

Contiguity-based cluster

A

A graph-based cluster in which two objects are connected only if they are within a specified distance of each other.

This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.

This definition of a cluster is useful when clusters are irregular or intertwined.

13
Q

Graph-based cluster

Clique

A

A set of nodes in a graph that are completely connected to each other.

Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique.

14
Q

Different Types of Clusters

Density-Based

A

A cluster is a dense region of objects that is surrounded by a region of low density.

15
Q

Different Types of Clusters

Shared-Property (Conceptual Clusters)

A

More generally, we can define a cluster as a set of objects that share some property.

This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid.

16
Q

K-means

A

This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.

17
Q

Agglomerative Hierarchical Clustering

A

This clustering approach refers to a collection of closely-related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains.

18
Q

DBSCAN

A

This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.

Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.

19
Q

Algorithm 5.1

Basic K-means algorithm

A
  1. Select K points as initial centroids.
  2. repeat
  3. . . Form K clusters by assigning each point to its closest centroid.
  4. . . Recompute the centroid of each cluster.
  5. until Centroids do not change.
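
As a concrete illustration of Algorithm 5.1, here is a minimal NumPy sketch; the function name, the random initialization, and the max_iter cap are illustrative choices for the example, not part of the algorithm as stated.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of the basic K-means algorithm (Algorithm 5.1)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 3. Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Recompute the centroid of each cluster
        #    (assumes no cluster becomes empty; see the card on handling empty clusters).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 5. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```
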
20
Q

Algorithm 5.2

K-means++ initialization algorithm

A
  1. For the first centroid, pick one of the points at random.
  2. for i=1 to number of trials do
  3. . . Compute the distance, d(x), of each point to its closest centroid.
  4. . . Assign each point a probability proportional to each point’s d(x)².
  5. . . Pick a new centroid from the remaining points using the weighted probabilities.
  6. end for
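
A minimal NumPy sketch of this initialization; in the sketch the loop simply runs until K centroids have been chosen, and the function name is illustrative.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Sketch of K-means++ initialization (Algorithm 5.2)."""
    rng = np.random.default_rng(seed)
    # 1. For the first centroid, pick one of the points at random.
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < K:
        # 3. Compute the distance, d(x), of each point to its closest centroid so far.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
                   axis=1)
        # 4. Assign each point a probability proportional to d(x)².
        p = d**2 / np.sum(d**2)
        # 5. Pick the next centroid using the weighted probabilities
        #    (already-chosen points have d(x) = 0, hence zero probability).
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)
```
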
21
Q

K Means

Time and Space Complexity

A

The space requirements for K-means are modest because only the data points and centroids are stored.

Specifically, the storage required is O((m + K) n), where m is the number of points and n is the number of attributes.

The time requirements for K-means are also modest - basically linear in the number of data points. In particular, the time required is O(I x K x m x n), where I is the number of iterations required for convergence.

I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
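
As a rough, purely illustrative plug-in of these formulas (the numbers below are assumed for the example, not from the text):

```python
# Illustrative plug-in of the K-means cost formulas (all numbers are assumed).
m, n, K, I = 100_000, 10, 20, 15      # points, attributes, clusters, iterations
storage = (m + K) * n                 # O((m + K) n)  ->  1,000,200 stored values
distance_terms = I * K * m * n        # O(I x K x m x n)  ->  300,000,000 operations
```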

22
Q

K-Means: Additional Issues

Handling Empty Clusters

A

One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step.

If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary.

One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error.
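
A minimal NumPy sketch of this replacement strategy; the function and variable names are illustrative.

```python
import numpy as np

def reseed_empty_cluster(X, centroids, empty_k):
    """Sketch: replace the centroid of an empty cluster with the point
    farthest from any current centroid (the largest contributor to the SSE)."""
    d = np.min(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    centroids[empty_k] = X[np.argmax(d)]
    return centroids
```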

23
Q

K-Means: Additional Issues

Outliers

A

When the squared error criterion is used, outliers can unduly influence the clusters that are found.

In particular, when outliers are present, the resulting cluster centroids (prototypes) are typically not as representative as they otherwise would be and thus, the SSE will be higher. Because of this, it is often useful to discover outliers and eliminate them beforehand.

It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.

24
Q

K-Means: Reducing the SSE with Postprocessing

2 Ways to reduce SSE by increasing the number of clusters

A
  1. Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
  2. Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE with respect to their closest centroids.
25
Q

K-Means: Reducing the SSE with Postprocessing

2 Strategies to decrease the number of clusters

A
  1. Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to the other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
  2. Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward’s method, respectively.
26
Q

Algorithm 5.3.

Bisecting K-means algorithm

A
  1. Initialize the list of clusters to contain the cluster consisting of all points.
  2. repeat
  3. . . Remove a cluster from the list of clusters.
  4. . . {Perform several “trial” bisections of the chosen cluster.}
  5. . . for i=1 to number of trials do
  6. . . . . Bisect the selected cluster using basic K-means.
  7. . . end for
  8. . . Select the two clusters from the bisection with the lowest total SSE.
  9. . . Add these two clusters to the list of clusters.
  10. until The list of clusters contains K clusters.
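
A sketch of Algorithm 5.3 on top of scikit-learn's KMeans; the library choice, the largest-SSE rule for picking which cluster to bisect, and the assumption that each chosen cluster has at least two points are all choices of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, n_trials=5, seed=0):
    """Sketch of bisecting K-means (Algorithm 5.3)."""
    clusters = [X]                                         # 1. one cluster with all points
    while len(clusters) < K:                               # 10. until K clusters
        # 3. Remove a cluster from the list (here: the one with the largest SSE).
        sses = [KMeans(n_clusters=1, n_init=1).fit(c).inertia_ for c in clusters]
        chosen = clusters.pop(int(np.argmax(sses)))
        # 4.-8. Perform several trial bisections; keep the pair with the lowest total SSE.
        best = min((KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(chosen)
                    for t in range(n_trials)), key=lambda km: km.inertia_)
        # 9. Add the two resulting clusters back to the list.
        clusters += [chosen[best.labels_ == 0], chosen[best.labels_ == 1]]
    return clusters
```
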
27
Q

K-Means Applicability

A

K-means and its variations have limitations with respect to finding different types of clusters.

In particular, K-means has difficulty detecting “natural” clusters when clusters have
- non-spherical shapes
- or widely different sizes
- or widely different densities.

The difficulty in these situations is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find because it is minimized by globular clusters of equal size and density or by clusters that are well-separated.

28
Q

Strengths of K-means

A
  • K-means is simple
  • It can be used for a wide variety of data types
  • It is quite efficient, even though multiple runs are often performed.
29
Q

Weaknesses of K-means

A
  • Not suitable for all types of data.
  • It cannot handle non-globular clusters or clusters of different sizes and densities.
  • It has trouble clustering data that contains outliers.
  • K-means is restricted to data for which there is a notion of a center (centroid).
30
Q

2 Approaches for generating a hierarchical clustering

A
  • Agglomerative
  • Divisive
31
Q

Agglomerative Clustering

A

Start with the points as individual clusters and, at each step, merge the closest pair of clusters.

This requires defining a notion of cluster proximity.

32
Q

Divisive Clustering

A

Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.

In this case, we need to decide which cluster to split at each step and how to do the splitting.

33
Q

Algorithm 5.4

Basic agglomerative hierarchical clustering algorithm

A
  1. Compute the proximity matrix, if necessary.
  2. repeat
  3. . . Merge the closest two clusters
  4. . . Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
  5. until Only one cluster remains.
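
Rather than re-implementing the proximity-matrix updates, the same procedure can be run with SciPy's hierarchical clustering routines (the library and the toy data are assumptions of this example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 2))    # toy data, assumed for the example

# 1. Compute the proximity matrix (condensed form), then 2.-5. repeatedly merge the
# closest pair of clusters; Z records the sequence of merges (the dendrogram).
Z = linkage(pdist(X), method="single")               # also: "complete", "average", "ward"

# Cut the tree to recover, e.g., 3 flat clusters from the hierarchy.
labels = fcluster(Z, t=3, criterion="maxclust")
```
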
34
Q

Agglomerative Hierarchical Clustering

Space Complexity

A

The basic agglomerative hierarchical clustering algorithm uses a proximity matrix.

This requires the storage of m²/2 proximities (assuming the proximity matrix is symmetric) where m is the number of data points.

The space needed to keep track of the clusters is proportional to the number of clusters, which is m-1.

Hence, the total space complexity is O(m²)

35
Q

Agglomerative Hierarchical Clustering

Computational Complexity

A

O(m²) time is required to compute the proximity matrix. Keeping the proximities sorted (e.g., in a heap) so that the closest pair of clusters can be found quickly brings the time for the overall algorithm to O(m² log m).

36
Q

Key Issues in Hierarchical Clustering

Lack of a Global Objective Function

A

Agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function.

Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches).

This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem.

37
Q

Key Issues in Hierarchical Clustering

Ability to Handle Different Cluster Sizes

A

2 Approaches to treating the relative sizes of the pairs of clusters that are merged:

  • Weighted. All clusters are treated equally.
  • Unweighted. This takes the number of points in each cluster into account.

(The terminology of weighted or unweighted refers to the data points, not the clusters: treating clusters of unequal size equally - the weighted approach - gives different weights to the points in different clusters, while taking the cluster size into account - the unweighted approach - gives points in different clusters the same weight.)

38
Q

Key Issues in Hierarchical Clustering

Merging Decisions are Final

A

Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters because they can use the local information about the pairwise similarity of all points.

However, once a decision is made to merge two clusters, it cannot be undone at a later time.

This approach prevents a local optimization criterion from becoming a global optimization criterion.

39
Q

Hierarchical Clustering

Outliers

A

Outliers pose the most serious problems for Ward’s method and centroid-based hierarchical clustering approaches because they increase SSE and distort centroids.

For single link, complete link and group average, outliers are potentially less problematic. As hierarchical clustering proceeds for these algorithms, outliers tend to form singleton / small clusters that do not merge with any other clusters until much later in the merging process.

By discarding singleton or small clusters that are not merging with other clusters, outliers can be removed.

40
Q

Density-based clustering

A

Density-based clustering locates regions of high density that are separated from one another by regions of low density.

41
Q

Classification of Points According to Center-Based Density

A

A center-based approach to density allows us to classify a point as being:
- A core point, in the interior of a dense region.
- A border point, on the edge of a dense region.
- A noise point, in a sparsely occupied region.

42
Q

Classification of Points According to Center-Based Density

Core points

A

These points are in the interior of a density-based cluster.

A point is a core point if there are at least MinPts points within a distance of Eps of it, where MinPts and Eps are user-specified parameters.

43
Q

Classification of Points According to Center-Based Density

Border points

A

A border point is not a core point, but falls within a neighbourhood of a core point.

A border point can fall within the neighbourhoods of several core points.

44
Q

Classification of Points According to Center-Based Density

Noise points

A

A noise point is any point that is neither a core point nor a border point.

45
Q

Algorithm 5.5

DBSCAN Algorithm

A
  1. Label all points as core, border, or noise points.
  2. Eliminate noise points.
  3. Put an edge between all core points within a distance Eps of each other.
  4. Make each group of connected core points into a separate cluster.
  5. Assign each border point to one of the clusters of its associated core points.
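
A short sketch using scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts (the library and the toy data are assumptions of this example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, assumed for the example

db = DBSCAN(eps=0.3, min_samples=5).fit(X)           # Eps and MinPts
labels = db.labels_                                   # noise points are labelled -1
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True                  # True for core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```
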
46
Q

DBScan Algorithm

Time & Space Complexity

A

The basic time complexity of the DBSCAN algorithm is O(m x time to find points in the Eps-neighbourhood), where m is the number of points.

In the worst case, this complexity is O(m²). However, in low-dimensional spaces, data structures such as kd-trees allow efficient retrieval of all points within a given distance of a specified point, and the time complexity can be as low as O(m log m) in the average case.

The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is necessary to keep only a small amount of data for each point, i.e. the cluster label and the identification of each point as a core, border, or noise point.

47
Q

Selection of DBScan Parameters

How to select Eps and MinPts

A

The basic approach is to look at the behaviour of the distance from a point to its kth nearest neighbour, which we will call the k-dist.

For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size.

If we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps.

If we select this distance as the Eps parameter, and take the value of k as the MinPts parameter, then points for which the k-dist is less than Eps will be labelled as core points, while other points will be labelled as noise or border points.
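
A sketch of this procedure (library choice and toy data assumed): compute each point's distance to its kth nearest neighbour, sort the values, plot them, and read a candidate Eps off the knee of the curve.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data, assumed for the example
k = 4                                                 # candidate MinPts

# Distance from each point to its kth nearest neighbour (column 0 is the point itself).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, k])

plt.plot(k_dist)                                      # look for the sharp change (knee);
plt.xlabel("points sorted by k-dist")                 # its height is a candidate Eps,
plt.ylabel(f"{k}-dist")                               # and k becomes MinPts
plt.show()
```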

48
Q

Strengths of DBScan

A

Because DBSCAN uses a density-based definition of a cluster, it:
- is relatively resistant to noise and
- can handle clusters of arbitrary shapes and sizes.

Thus, it can find many clusters that could not be found using K-means.

49
Q

Weaknesses of DBSCAN

A
  • DBSCAN has trouble when the clusters have widely varying densities.
  • It also has trouble with high-dimensional data because density is more difficult to define for such data.
  • DBSCAN can be expensive when the computation of nearest neighbours requires computing all pairwise proximities, as is usually the case for high-dimensional data.
50
Q

5 Important issues for cluster validation

A
  1. Determining the clustering tendency of a set of data, i.e. distinguishing whether non-random structure actually exists in the data.
  2. Determining the correct number of clusters.
  3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
  4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
  5. Comparing two sets of clusters to determine which is better.
51
Q

3 Types of cluster evaluation measures

A
  • Unsupervised
  • Supervised
  • Relative
52
Q

3 Types of cluster evaluation measures

Unsupervised

A

Measures the goodness of a clustering structure without respect to external information.
An example of this is the SSE.

Unsupervised measures of cluster validity are often further divided into two classes:
- measures of cluster cohesion
- measures of cluster separation

Unsupervised measures are often called internal indices because they use only information present in the data set.

53
Q

Cluster cohesion

A

(tightness, compactness)

Cluster cohesion determines how closely related the objects in a cluster are.

54
Q

Cluster separation

A

(isolation)

Cluster separation determines how distinct or well-separated a cluster is from other clusters.

55
Q

3 Types of cluster evaluation measures

Supervised

A

Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.

E.g. entropy.

Supervised measures are often called external indices because they use information not present in the data set.

56
Q

Graph-based View of Cohesion

A

The sum of the weights of the links in the proximity graph that connect points within the cluster.

57
Q

Graph-based View of Cluster Separation

A

The separation between two clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other cluster.

58
Q

Prototype-based View of Cluster Cohesion

A

The sum of the proximities with respect to the prototype (centroid or medoid) of the cluster.

59
Q

Prototype-based View of Cluster Separation

A

The separation between two clusters can be measured by the proximity of the two cluster prototypes.

60
Q

Between group sum of squares

(SSB)

A

When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between group sum of squares (SSB):

Total SSB = Σᵢ mᵢ dist(cᵢ, c)²

i.e. the sum of the squared distances from each cluster centroid, cᵢ, to the overall mean, c, of all the data points, weighted by the cluster size, mᵢ.
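
A minimal NumPy sketch of this formula (the function name is illustrative, and labels is assumed to be a NumPy array of cluster labels):

```python
import numpy as np

def ssb(X, labels):
    """Between group sum of squares: Σᵢ mᵢ dist(cᵢ, c)² (sketch)."""
    c = X.mean(axis=0)                        # overall mean of all the data points
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)                  # centroid of cluster k
        total += len(Xk) * np.sum((ck - c) ** 2)
    return total
```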

61
Q

Silhouette Coefficient

A

Combines both cohesion and separation.

The value can vary between -1 and 1. A negative value is undesirable, as this corresponds to a case in which aᵢ, the average distance to points in the cluster, is greater than bᵢ, the minimum average distance to points in another cluster.

We want the silhouette coefficient to be positive (aᵢ < bᵢ), and for aᵢ to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when aᵢ = 0.

62
Q

3 Steps to Calculate the Silhouette Coefficient

A

For the ith object:

  1. Calculate its average distance to all other objects in the cluster (aᵢ)
  2. For any cluster not containing i, calculate i’s average distance to all objects in the cluster. Find the minimum w.r.t. all clusters. (bᵢ)
  3. sᵢ = (bᵢ - aᵢ) / max(aᵢ, bᵢ)
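
A minimal NumPy sketch of these three steps for a single object (the function name is illustrative; it assumes labels is a NumPy array and that object i's cluster has at least two members):

```python
import numpy as np

def silhouette_i(X, labels, i):
    """Silhouette coefficient sᵢ of the ith object (sketch of the 3 steps above)."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    # 1. aᵢ: average distance to the other objects in i's own cluster.
    a_i = d[own].mean()
    # 2. bᵢ: minimum, over the other clusters, of i's average distance to that cluster.
    b_i = min(d[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    # 3. sᵢ = (bᵢ - aᵢ) / max(aᵢ, bᵢ).
    return (b_i - a_i) / max(a_i, b_i)
```
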
63
Q

Cophenetic Distance

A

The proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time.

E.g.
If at some point in the agglomerative hierarchical clustering process, the smallest distance between the two clusters that are merged is 0.1, then all the points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster.

64
Q

Cophenetic Distance Matrix

A

A matrix in which the entries are the cophenetic distances between each pair of objects.

The cophenetic distance is different for each hierarchical clustering of points.

65
Q

Cophenetic Correlation Coefficient

(CPCC)

A

The correlation between the entries of the Cophenetic Distance Matrix and the original dissimilarity matrix.

It is a standard measure of how well a hierarchical clustering fits the data.

One of the most common uses is to evaluate which type of hierarchical clustering is best for a particular type of data.
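
A short sketch of that use with SciPy (the library and the toy data are assumptions of this example): compute the CPCC of several hierarchical clusterings of the same data and prefer the method with the highest value.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data, assumed for the example
d = pdist(X)                                        # original dissimilarity matrix (condensed)

for method in ("single", "complete", "average", "ward"):
    Z = linkage(d, method=method)
    cpcc, _ = cophenet(Z, d)                        # CPCC and the cophenetic distances
    print(method, round(cpcc, 3))
```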