Chapter 5: Cluster Analysis Flashcards

1
Q

What is cluster analysis?

A

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships.

The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups.

The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

2
Q

Hierarchical vs Partitional

Partitional Clustering

A

A division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

3
Q

Hierarchical vs Partitional

Hierarchical Clustering

A

If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree.

Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects.

Often, but not always, the leaves of the tree are singleton clusters of individual data objects.

4
Q

Exclusive vs Overlapping vs Fuzzy

Exclusive Clustering

A

Exclusive clustering assigns each object to a single cluster.

5
Q

Exclusive vs Overlapping vs Fuzzy

Overlapping Clustering

aka Non-Exclusive Clustering

A

An object can simultaneously belong to more than one group (class).

6
Q

Exclusive vs Overlapping vs Fuzzy

Fuzzy Clustering

A

Every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn’t belong) and 1 (absolutely belongs).

In other words, clusters are treated as fuzzy sets.

7
Q

Complete versus Partial Clustering

A

A complete clustering assigns every object to a cluster, whereas a partial clustering does not.

The motivation for a partial clustering is that some objects in a data set may not belong to well-defined groups.

8
Q

5 Different Types of Clusters

A
  • Well-separated
  • Prototype-based
  • Graph-based
  • Density-based
  • Shared-Property (Conceptual Clusters)
9
Q

Different Types of Clusters

Well-Separated

A

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster.

Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another.

10
Q

Different Types of Clusters

Prototype-Based

A

A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster.

For data with continuous attributes, the prototype of a cluster is often a centroid, i.e., the average of all the points in the cluster.

When a centroid is not meaningful, such as when the data has categorical attributes, the prototype is often a medoid, i.e. the most representative point of a cluster.

For many types of data, the prototype can be regarded as the most central point, and in such instances, we commonly refer to prototype-based clusters as center-based clusters.

11
Q

Different Types of Clusters

Graph-based

A

If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component; i.e. a group of objects that are connected to one another, but that have no connection to objects outside the group.

12
Q

Graph-based cluster

Contiguity-based cluster

A

A graph-based cluster in which two objects are connected only if they are within a specified distance of each other.

This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster.

This definition of a cluster is useful when clusters are irregular or intertwined.

13
Q

Graph-based cluster

Clique

A

A set of nodes in a graph that are completely connected to each other.

Specifically, if we add connections between objects in the order of their distance from one another, a cluster is formed when a set of objects forms a clique.

14
Q

Different Types of Clusters

Density-Based

A

A cluster is a dense region of objects that is surrounded by a region of low density.

15
Q

Different Types of Clusters

Shared-Property (Conceptual Clusters)

A

More generally, we can define a cluster as a set of objects that share some property.

This definition encompasses all the previous definitions of a cluster; e.g., objects in a center-based cluster share the property that they are all closest to the same centroid or medoid.

16
Q

K-means

A

This is a prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K), which are represented by their centroids.

17
Q

Agglomerative Hierarchical Clustering

A

This clustering approach refers to a collection of closely-related clustering techniques that produce a hierarchical clustering by starting with each point as a singleton cluster and then repeatedly merging the two closest clusters until a single, all-encompassing cluster remains.

18
Q

DBSCAN

A

This is a density-based clustering algorithm that produces a partitional clustering, in which the number of clusters is automatically determined by the algorithm.

Points in low-density regions are classified as noise and omitted; thus, DBSCAN does not produce a complete clustering.

19
Q

Algorithm 5.1

Basic K-means algorithm

A
  1. Select K points as initial centroids.
  2. repeat
  3. . . Form K clusters by assigning each point to its closest centroid.
  4. . . Recompute the centroid of each cluster.
  5. until Centroids do not change.
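
As a concrete illustration of Algorithm 5.1, here is a minimal NumPy sketch; the function name, the random initialization, and the max_iter cap are illustrative choices for the example, not part of the algorithm as stated.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal sketch of the basic K-means algorithm (Algorithm 5.1)."""
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # 3. Form K clusters by assigning each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 4. Recompute the centroid of each cluster
        #    (assumes no cluster becomes empty; see the card on handling empty clusters).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 5. Stop when the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```
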
20
Q

Algorithm 5.2

K-means++ initialization algorithm

A
  1. For the first centroid, pick one of the points at random.
  2. for i=1 to number of trials do
  3. . . Compute the distance, d(x), of each point to its closest centroid.
  4. . . Assign each point a probability proportional to each point’s d(x)².
  5. . . Pick a new centroid from the remaining points using the weighted probabilities.
  6. end for
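
A minimal NumPy sketch of this initialization; in the sketch the loop simply runs until K centroids have been chosen, and the function name is illustrative.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """Sketch of K-means++ initialization (Algorithm 5.2)."""
    rng = np.random.default_rng(seed)
    # 1. For the first centroid, pick one of the points at random.
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < K:
        # 3. Compute the distance, d(x), of each point to its closest centroid so far.
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :], axis=2),
                   axis=1)
        # 4. Assign each point a probability proportional to d(x)².
        p = d**2 / np.sum(d**2)
        # 5. Pick the next centroid using the weighted probabilities
        #    (already-chosen points have d(x) = 0, hence zero probability).
        centroids.append(X[rng.choice(len(X), p=p)])
    return np.array(centroids)
```
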
21
Q

K Means

Time and Space Complexity

A

The space requirements for K-means are modest because only the data points and centroids are stored.

Specifically, the storage required is O((m + K) n), where m is the number of points and n is the number of attributes.

The time requirements for K-means are also modest - basically linear in the number of data points. In particular, the time required is O(I x K x m x n), where I is the number of iterations required for convergence.

I is often small and can usually be safely bounded, as most changes typically occur in the first few iterations. Therefore, K-means is linear in m, the number of points, and is efficient as well as simple provided that K, the number of clusters, is significantly less than m.
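
As a rough, purely illustrative plug-in of these formulas (the numbers below are assumed for the example, not from the text):

```python
# Illustrative plug-in of the K-means cost formulas (all numbers are assumed).
m, n, K, I = 100_000, 10, 20, 15      # points, attributes, clusters, iterations
storage = (m + K) * n                 # O((m + K) n)  ->  1,000,200 stored values
distance_terms = I * K * m * n        # O(I x K x m x n)  ->  300,000,000 operations
```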

22
Q

K-Means: Additional Issues

Handling Empty Clusters

A

One of the problems with the basic K-means algorithm is that empty clusters can be obtained if no points are allocated to a cluster during the assignment step.

If this happens, then a strategy is needed to choose a replacement centroid, since otherwise, the squared error will be larger than necessary.

One approach is to choose the point that is farthest away from any current centroid. If nothing else, this eliminates the point that currently contributes most to the total squared error.
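
A minimal NumPy sketch of this replacement strategy; the function and variable names are illustrative.

```python
import numpy as np

def reseed_empty_cluster(X, centroids, empty_k):
    """Sketch: replace the centroid of an empty cluster with the point
    farthest from any current centroid (the largest contributor to the SSE)."""
    d = np.min(np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    centroids[empty_k] = X[np.argmax(d)]
    return centroids
```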

23
Q

K-Means: Additional Issues

Outliers

A

When the squared error criterion is used, outliers can unduly influence the clusters that are found.

In particular, when outliers are present, the resulting cluster centroids (prototypes) are typically not as representative as they otherwise would be and thus, the SSE will be higher. Because of this, it is often useful to discover outliers and eliminate them beforehand.

It is important, however, to appreciate that there are certain clustering applications for which outliers should not be eliminated. When clustering is used for data compression, every point must be clustered, and in some cases, such as financial analysis, apparent outliers, e.g., unusually profitable customers, can be the most interesting points.

24
Q

K-Means: Reducing the SSE with Postprocessing

2 Ways to reduce SSE by increasing the number of clusters

A
  1. Split a cluster: The cluster with the largest SSE is usually chosen, but we could also split the cluster with the largest standard deviation for one particular attribute.
  2. Introduce a new cluster centroid: Often the point that is farthest from any cluster center is chosen. We can easily determine this if we keep track of the SSE contributed by each point. Another approach is to choose randomly from all points or from the points with the highest SSE with respect to their closest centroids.
25
Q

K-Means: Reducing the SSE with Postprocessing

2 Strategies to decrease the number of clusters

A
  1. Disperse a cluster: This is accomplished by removing the centroid that corresponds to the cluster and reassigning the points to the other clusters. Ideally, the cluster that is dispersed should be the one that increases the total SSE the least.
  2. Merge two clusters: The clusters with the closest centroids are typically chosen, although another, perhaps better, approach is to merge the two clusters that result in the smallest increase in total SSE. These two merging strategies are the same ones that are used in the hierarchical clustering techniques known as the centroid method and Ward’s method, respectively.
26
Q

Algorithm 5.3.

Bisecting K-means algorithm

A
  1. Initialize the list of clusters to contain the cluster consisting of all points.
  2. repeat
  3. . . Remove a cluster from the list of clusters.
  4. . . {Perform several “trial” bisections of the chosen cluster.}
  5. . . for i=1 to number of trials do
  6. . . . . Bisect the selected cluster using basic K-means.
  7. . . end for
  8. . . Select the two clusters from the bisection with the lowest total SSE.
  9. . . Add these two clusters to the list of clusters.
  10. until The list of clusters contains K clusters.
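
A sketch of Algorithm 5.3 on top of scikit-learn's KMeans; the library choice, the largest-SSE rule for picking which cluster to bisect, and the assumption that each chosen cluster has at least two points are all choices of this example.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, n_trials=5, seed=0):
    """Sketch of bisecting K-means (Algorithm 5.3)."""
    clusters = [X]                                         # 1. one cluster with all points
    while len(clusters) < K:                               # 10. until K clusters
        # 3. Remove a cluster from the list (here: the one with the largest SSE).
        sses = [KMeans(n_clusters=1, n_init=1).fit(c).inertia_ for c in clusters]
        chosen = clusters.pop(int(np.argmax(sses)))
        # 4.-8. Perform several trial bisections; keep the pair with the lowest total SSE.
        best = min((KMeans(n_clusters=2, n_init=1, random_state=seed + t).fit(chosen)
                    for t in range(n_trials)), key=lambda km: km.inertia_)
        # 9. Add the two resulting clusters back to the list.
        clusters += [chosen[best.labels_ == 0], chosen[best.labels_ == 1]]
    return clusters
```
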
27
Q

K-Means Applicability

A

K-means and its variations have limitations with respect to finding different types of clusters.

In particular, K-means has difficulty detecting “natural” clusters when clusters have
- non-spherical shapes
- or widely different sizes
- or widely different densities.

The difficulty in these situations is that the K-means objective function is a mismatch for the kinds of clusters we are trying to find because it is minimized by globular clusters of equal size and density or by clusters that are well-separated.

28
Q

Strengths of K-means

A
  • K-means is simple
  • It can be used for a wide variety of data types
  • It is quite efficient, even though multiple runs are often performed.
29
Q

Weaknesses of K-means

A
  • Not suitable for all types of data.
  • It cannot handle non-globular clusters or clusters of different sizes and densities.
  • It has trouble clustering data that contains outliers.
  • K-means is restricted to data for which there is a notion of a center (centroid).
30
Q

2 Approaches for generating a hierarchical clustering

A
  • Agglomerative
  • Divisive
31
Q

Agglomerative Clustering

A

Start with the points as individual clusters and, at each step, merge the closest pair of clusters.

This requires defining a notion of cluster proximity.

32
Q

Divisive Clustering

A

Start with one, all-inclusive cluster and, at each step, split a cluster until only singleton clusters of individual points remain.

In this case, we need to decide which cluster to split at each step and how to do the splitting.

33
Q

Algorithm 5.4

Basic agglomerative hierarchical clustering algorithm

A
  1. Compute the proximity matrix, if necessary.
  2. repeat
  3. . . Merge the closest two clusters
  4. . . Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
  5. until Only one cluster remains.
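
Rather than re-implementing the proximity-matrix updates, the same procedure can be run with SciPy's hierarchical clustering routines (the library and the toy data are assumptions of this example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 2))    # toy data, assumed for the example

# 1. Compute the proximity matrix (condensed form), then 2.-5. repeatedly merge the
# closest pair of clusters; Z records the sequence of merges (the dendrogram).
Z = linkage(pdist(X), method="single")               # also: "complete", "average", "ward"

# Cut the tree to recover, e.g., 3 flat clusters from the hierarchy.
labels = fcluster(Z, t=3, criterion="maxclust")
```
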
34
Q

Agglomerative Hierarchical Clustering

Space Complexity

A

The basic agglomerative hierarchical clustering algorithm uses a proximity matrix.

This requires the storage of m²/2 proximities (assuming the proximity matrix is symmetric) where m is the number of data points.

The space needed to keep track of the clusters is proportional to the number of clusters, which is m-1.

Hence, the total space complexity is O(m²)

35
Q

Agglomerative Hierarchical Clustering

Computational Complexity

A

O(m²) time is required to compute the proximity matrix. Keeping the proximities sorted (e.g., in a heap) so that the closest pair of clusters can be found quickly brings the time for the overall algorithm to O(m² log m).

36
Q

Key Issues in Hierarchical Clustering

Lack of a Global Objective Function

A

Agglomerative hierarchical clustering cannot be viewed as globally optimizing an objective function.

Instead, agglomerative hierarchical clustering techniques use various criteria to decide locally, at each step, which clusters should be merged (or split for divisive approaches).

This approach yields clustering algorithms that avoid the difficulty of attempting to solve a hard combinatorial optimization problem.

37
Q

Key Issues in Hierarchical Clustering

Ability to Handle Different Cluster Sizes

A

2 Approaches to treating the relative sizes of the pairs of clusters that are merged:

  • Weighted. All clusters are treated equally.
  • Unweighted. This takes the number of points in each cluster into account.

(The terminology of weighted or unweighted refers to the data points, not the clusters: treating clusters of unequal size equally - the weighted approach - gives different weights to the points in different clusters, while taking the cluster size into account - the unweighted approach - gives points in different clusters the same weight.)

38
Q

Key Issues in Hierarchical Clustering

Merging Decisions are Final

A

Agglomerative hierarchical clustering algorithms tend to make good local decisions about combining two clusters because they can use the local information about the pairwise similarity of all points.

However, once a decision is made to merge two clusters, it cannot be undone at a later time.

This approach prevents a local optimization criterion from becoming a global optimization criterion.

39
Q

Hierarchical Clustering

Outliers

A

Outliers pose the most serious problems for Ward’s method and centroid-based hierarchical clustering approaches because they increase SSE and distort centroids.

For single link, complete link and group average, outliers are potentially less problematic. As hierarchical clustering proceeds for these algorithms, outliers tend to form singleton / small clusters that do not merge with any other clusters until much later in the merging process.

By discarding singleton or small clusters that are not merging with other clusters, outliers can be removed.

40
Q

Density-based clustering

A

Density-based clustering locates regions of high density that are separated from one another by regions of low density.

41
Q

Classification of Points According to Center-Based Density

A

A center-based approach to density allows us to classify a point as being:
- A core point, in the interior of a dense region.
- A border point, on the edge of a dense region.
- A noise point, in a sparsely occupied region.

42
Q

Classification of Points According to Center-Based Density

Core points

A

These points are in the interior of a density-based cluster.

A point is a core point if there are at least MinPts points within a distance of Eps of it, where MinPts and Eps are user-specified parameters.

43
Q

Classification of Points According to Center-Based Density

Border points

A

A border point is not a core point, but falls within a neighbourhood of a core point.

A border point can fall within the neighbourhoods of several core points.

44
Q

Classification of Points According to Center-Based Density

Noise points

A

A noise point is any point that is neither a core point nor a border point.

45
Q

Algorithm 5.5

DBSCAN Algorithm

A
  1. Label all points as core, border, or noise points.
  2. Eliminate noise points.
  3. Put an edge between all core points within a distance Eps of each other.
  4. Make each group of connected core points into a separate cluster.
  5. Assign each border point to one of the clusters of its associated core points.
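
A short sketch using scikit-learn's DBSCAN, whose eps and min_samples parameters play the roles of Eps and MinPts (the library and the toy data are assumptions of this example):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.default_rng(0).normal(size=(200, 2))   # toy data, assumed for the example

db = DBSCAN(eps=0.3, min_samples=5).fit(X)           # Eps and MinPts
labels = db.labels_                                   # noise points are labelled -1
core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True                  # True for core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```
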
46
Q

DBScan Algorithm

Time & Space Complexity

A

The basic time complexity of the DBSCAN algorithm is O(m x time to find points in the Eps-neighbourhood), where m is the number of points.

In the worst case, this complexity is O(m²). However, in low-dimensional spaces, data structures such as kd-trees allow efficient retrieval of all points within a given distance of a specified point, and the time complexity can be as low as O(m log m) in the average case.

The space requirement of DBSCAN, even for high-dimensional data, is O(m) because it is necessary to keep only a small amount of data for each point, i.e. the cluster label and the identification of each point as a core, border, or noise point.

47
Q

Selection of DBScan Parameters

How to select Eps and MinPts

A

The basic approach is to look at the behaviour of the distance from a point to its kth nearest neighbour, which we will call the k-dist.

For points that belong to some cluster, the value of k-dist will be small if k is not larger than the cluster size.

If we compute the k-dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k-dist that corresponds to a suitable value of Eps.

If we select this distance as the Eps parameter, and take the value of k as the MinPts parameter, then points for which the k-dist is less than Eps will be labelled as core points, while other points will be labelled as noise or border points.
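
A sketch of this procedure (library choice and toy data assumed): compute each point's distance to its kth nearest neighbour, sort the values, plot them, and read a candidate Eps off the knee of the curve.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data, assumed for the example
k = 4                                                 # candidate MinPts

# Distance from each point to its kth nearest neighbour (column 0 is the point itself).
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, k])

plt.plot(k_dist)                                      # look for the sharp change (knee);
plt.xlabel("points sorted by k-dist")                 # its height is a candidate Eps,
plt.ylabel(f"{k}-dist")                               # and k becomes MinPts
plt.show()
```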

48
Q

Strengths of DBScan

A

Because DBSCAN uses a density-based definition of a cluster, it:
- is relatively resistant to noise and
- can handle clusters of arbitrary shapes and sizes.

Thus, it can find many clusters that could not be found using K-means.

49
Q

Weaknesses of DBSCAN

A
  • DBSCAN has trouble when the clusters have widely varying densities.
  • It also has trouble with high-dimensional data because density is more difficult to define for such data.
  • DBSCAN can be expensive when the computation of nearest neighbours requires computing all pairwise proximities, as is usually the case for high-dimensional data.
50
Q

5 Important issues for cluster validation

A
  1. Determining the clustering tendency of a set of data, i.e. distinguishing whether non-random structure actually exists in the data.
  2. Determining the correct number of clusters.
  3. Evaluating how well the results of a cluster analysis fit the data without reference to external information.
  4. Comparing the results of a cluster analysis to externally known results, such as externally provided class labels.
  5. Comparing two sets of clusters to determine which is better.
51
Q

3 Types of cluster evaluation measures

A
  • Unsupervised
  • Supervised
  • Relative
52
Q

3 Types of cluster evaluation measures

Unsupervised

A

Measures the goodness of a clustering structure without respect to external information.
An example of this is the SSE.

Unsupervised measures of cluster validity are often further divided into two classes:
- measures of cluster cohesion
- measures of cluster separation

Unsupervised measures are often called internal indices because they use only information present in the data set.

53
Q

Cluster cohesion

A

(tightness, compactness)

Cluster cohesion determines how closely related the objects in a cluster are.

54
Q

Cluster separation

A

(isolation)

Cluster separation determines how distinct or well-separated a cluster is from other clusters.

55
Q

3 Types of cluster evaluation measures

Supervised

A

Measures the extent to which the clustering structure discovered by a clustering algorithm matches some external structure.

E.g. entropy.

Supervised measures are often called external indices because they use information not present in the data set.

56
Q

Graph-based View of Cohesion

A

The sum of the weights of the links in the proximity graph that connect points within the cluster.

57
Q

Graph-based View of Cluster Separation

A

The separation between two clusters can be measured by the sum of the weights of the links from points in one cluster to points in the other cluster.

58
Q

Prototype-based View of Cluster Cohesion

A

The sum of the proximities with respect to the prototype (centroid or medoid) of the cluster.

59
Q

Prototype-based View of Cluster Separation

A

The separation between two clusters can be measured by the proximity of the two cluster prototypes.

60
Q

Between group sum of squares

(SSB)

A

When proximity is measured by Euclidean distance, the traditional measure of separation between clusters is the between group sum of squares (SSB):

Total SSB = Σᵢ mᵢ dist(cᵢ, c)²

i.e. the sum of the squared distances from each cluster centroid, cᵢ, to the overall mean, c, of all the data points, weighted by the cluster size, mᵢ.
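
A minimal NumPy sketch of this formula (the function name is illustrative, and labels is assumed to be a NumPy array of cluster labels):

```python
import numpy as np

def ssb(X, labels):
    """Between group sum of squares: Σᵢ mᵢ dist(cᵢ, c)² (sketch)."""
    c = X.mean(axis=0)                        # overall mean of all the data points
    total = 0.0
    for k in np.unique(labels):
        Xk = X[labels == k]
        ck = Xk.mean(axis=0)                  # centroid of cluster k
        total += len(Xk) * np.sum((ck - c) ** 2)
    return total
```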

61
Q

Silhouette Coefficient

A

Combines both cohesion and separation.

The value can vary between -1 and 1. A negative value is undesirable, as this corresponds to a case in which aᵢ, the average distance to points in the cluster, is greater than bᵢ, the minimum average distance to points in another cluster.

We want the silhouette coefficient to be positive (aᵢ < bᵢ), and for aᵢ to be as close to 0 as possible, since the coefficient assumes its maximum value of 1 when aᵢ = 0.

62
Q

3 Steps to Calculate the Silhouette Coefficient

A

For the ith object:

  1. Calculate its average distance to all other objects in the cluster (aᵢ)
  2. For any cluster not containing i, calculate i’s average distance to all objects in the cluster. Find the minimum w.r.t. all clusters. (bᵢ)
  3. sᵢ = (bᵢ - aᵢ) / max(aᵢ, bᵢ)
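
A minimal NumPy sketch of these three steps for a single object (the function name is illustrative; it assumes labels is a NumPy array and that object i's cluster has at least two members):

```python
import numpy as np

def silhouette_i(X, labels, i):
    """Silhouette coefficient sᵢ of the ith object (sketch of the 3 steps above)."""
    d = np.linalg.norm(X - X[i], axis=1)
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    # 1. aᵢ: average distance to the other objects in i's own cluster.
    a_i = d[own].mean()
    # 2. bᵢ: minimum, over the other clusters, of i's average distance to that cluster.
    b_i = min(d[labels == k].mean() for k in np.unique(labels) if k != labels[i])
    # 3. sᵢ = (bᵢ - aᵢ) / max(aᵢ, bᵢ).
    return (b_i - a_i) / max(a_i, b_i)
```
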
63
Q

Cophenetic Distance

A

The proximity at which an agglomerative hierarchical clustering technique puts the objects in the same cluster for the first time.

E.g.
If at some point in the agglomerative hierarchical clustering process, the smallest distance between the two clusters that are merged is 0.1, then all the points in one cluster have a cophenetic distance of 0.1 with respect to the points in the other cluster.

64
Q

Cophenetic Distance Matrix

A

A matrix in which the entries are the cophenetic distances between each pair of objects.

The cophenetic distance is different for each hierarchical clustering of points.

65
Q

Cophenetic Correlation Coefficient

(CPCC)

A

The correlation between the entries of the Cophenetic Distance Matrix and the original dissimilarity matrix.

It is a standard measure of how well a hierarchical clustering fits the data.

One of the most common uses is to evaluate which type of hierarchical clustering is best for a particular type of data.
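
A short sketch of that use with SciPy (the library and the toy data are assumptions of this example): compute the CPCC of several hierarchical clusterings of the same data and prefer the method with the highest value.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(30, 2))   # toy data, assumed for the example
d = pdist(X)                                        # original dissimilarity matrix (condensed)

for method in ("single", "complete", "average", "ward"):
    Z = linkage(d, method=method)
    cpcc, _ = cophenet(Z, d)                        # CPCC and the cophenetic distances
    print(method, round(cpcc, 3))
```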