Chapter 4 Flashcards

1
Q

What is cluster analysis?

A

Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset.

2
Q

How does cluster analysis differ from classification analysis?

A
  • In classification analysis, the class of each object is known at the start of the process
  • When carrying out cluster analysis, the class of each object is assumed not known (therefore it is known as an unsupervised, indirect or descriptive method)
3
Q

Is cluster analysis supervised or unsupervised learning?

A

Unsupervised - the class of each object is assumed unknown

4
Q

What is clustering?

A

The process of grouping a set of physical or abstract objects into classes of similar objects

5
Q

What is a cluster?

A

A collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in another cluster.

A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.

6
Q

How can we measure the similarity of the objects within each cluster?

A

The most commonly used methods are distance-based.

There are a variety of techniques available.

7
Q

What is the formal objective of cluster analysis?

A

Given a data matrix composed of n observations (rows) and p variables (columns), the objective of cluster analysis is to cluster the observations into groups that are internally homogeneous (internal cohesion) and heterogeneous from group to group (external separation)

Intra-cluster distances are minimised
Inter-cluster distances are maximised

8
Q

What is the difference between PCA and cluster analysis?

A

Cluster analysis involves clustering the data cases (rows), whereas PCA operates on the data variables (columns)

9
Q

What are some simple examples of clustering objects?

A
  • Marketing researchers use cluster analysis as a customer-segmentation strategy
  • Psychological researchers cluster symptoms and demographics of patients to uncover subtypes of a disorder, enabling more targeted and effective treatments
  • Medical researchers help catalog gene-expression patterns
10
Q

What are the two main areas of cluster analysis techniques?

A
  • Hierarchical
  • Non-hierarchical
11
Q

What are hierarchical methods?

What are the possible approaches?

A

Hierarchical methods create a hierarchical decomposition of the given set of data objects.

  • Agglomerative approach
  • Divisive approach
12
Q

What is the agglomerative approach?

A

Bottom-up approach

Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.

13
Q

What is the divisive approach?

A

Top-down approach

Starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own individual cluster, or a termination condition holds.

14
Q

What are non-hierarchical methods also called?

A

Partitioning methods

15
Q

What are non-hierarchical methods?

A

Non-hierarchical methods produce k partitions of the objects, and the objects can be relocated, allowing poor initial partitions to be corrected at a later stage.

Each of the k partitions represents a cluster which must satisfy:
- Each cluster must contain at least one object
- Each object must belong to exactly one group

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative technique that attempts to improve the partitioning by moving objects from one group to another.

16
Q

How do non-hierarchical and hierarchical methods differ?

A
  • Non-hierarchical methods allow relocation of objects, thus allowing poor initial partitions to be corrected at a later stage
  • Non-hierarchical methods are computationally efficient and therefore suitable for large datasets
  • The number of clusters must be known in advance for non-hierarchical clustering
  • Non-hierarchical methods are non-deterministic - generally producing different clusters with different initialisations
17
Q

What is the general criterion of a good partitioning?

A

Objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different

18
Q

When can hierarchical clustering be particularly useful?

A

When you expect nested clustering and a meaningful hierarchy. This is often the case in the biological sciences.

19
Q

Why may hierarchical algorithms be considered greedy?

A

Once an observation is assigned to a cluster, it cannot be reassigned later in the process.

20
Q

What are the disadvantages of hierarchical clustering?

A

It can be difficult to apply in large samples (it is computationally inefficient).

e.g. hundreds or thousands of observations; partitioning methods can work well in these situations

21
Q

Which clustering method is deterministic, and what does this mean?

A

Hierarchical clustering methods

Deterministic means that the method produces the same clustering each time.

22
Q

What are the common steps in cluster analysis?

A

1 - choose the appropriate attributes
2 - scale the data
3 - screen for outliers
4 - calculate distances
5 - select a clustering algorithm
6 - obtain one or more cluster solutions
7 - determine the number of clusters present
8 - obtain a final clustering solution
9 - visualise the results
10 - interpret the clusters
11 - validate the results

23
Q

How do you choose appropriate attributes?

A

Select the variables you feel may be important for identifying and understanding differences among groups of observations within the data.

A sophisticated cluster analysis cannot compensate for a poor choice of variables.

24
Q

Why would you need to scale the data?

A

If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable.

25
Q

How can you scale the data?

A
  • Most popular approach is to standardise each variable to a mean of 0 and an SD of 1
  • Dividing each variable by its maximum value
  • Subtracting the variable’s mean and dividing by the variable’s mean absolute deviation
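
A minimal NumPy sketch of these three scaling approaches (the toy data matrix is hypothetical):

```python
import numpy as np

# hypothetical data matrix: rows are observations, columns are variables
X = np.array([[170.0, 65.0],
              [160.0, 80.0],
              [180.0, 72.0]])

# 1. standardise each variable to a mean of 0 and an SD of 1
standardised = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. divide each variable by its maximum value
by_max = X / X.max(axis=0)

# 3. subtract the mean and divide by the mean absolute deviation
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
robust = (X - X.mean(axis=0)) / mad
```
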
26
Q

Why do we need to screen for outliers?

A

Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained.

27
Q

How can you screen for outliers?

A
  • Using functions from the outliers package
  • The mvoutlier package contains functions that can be used to identify multivariate outliers

Or you can use a clustering method that is robust to the presence of outliers
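
The packages named above are R packages; as a language-neutral illustration (not those packages' API), a simple z-score screen might look like this in Python:

```python
import numpy as np

def zscore_outliers(X, threshold=3.0):
    """Flag rows where any variable lies more than `threshold` SDs from its mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return np.where((z > threshold).any(axis=1))[0]  # indices of suspect rows
```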

28
Q

What are the most popular measures of the distance between two observations?

A
  • Euclidean distance - the most popular
  • Manhattan
  • Minkowski
  • Hamming
  • Tchebytchev
29
Q

Why might you try more than one algorithm?

A

To see how robust the results are to the choice of methods

30
Q

What does visualisation help determine?

A

The meaning and usefulness of the cluster solution

31
Q

What kind of visualisation is usually used for hierarchical clustering?

A

Dendrogram

32
Q

What kind of visualisation is usually used for partitioning?

A

Bivariate cluster plot

33
Q

What do you do once a cluster solution has been obtained?

A

Interpret (and possibly name) the clusters.

  • What do the observations in a cluster have in common?
  • How do they differ from the observations in other clusters?

This step typically utilises summary statistics for each variable by cluster
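
A minimal pandas sketch of such by-cluster summaries, assuming a hypothetical data frame with a cluster label per observation:

```python
import pandas as pd

# hypothetical data frame: original variables plus a cluster label per observation
df = pd.DataFrame({
    "height":  [170, 160, 180, 175],
    "weight":  [65, 80, 72, 70],
    "cluster": [1, 1, 2, 2],
})

# median of each continuous variable within each cluster
print(df.groupby("cluster").median())
```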

34
Q

What summary statistics are used for continuous data?

A

The mean or median for each variable within each cluster

35
Q

In addition, what summary statistics are included for mixed data (data that contains categorical variables)?

A

Modes or category distributions

36
Q

What does validating the cluster solution involve?

A

Asking the question “are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or of the statistical technique?”

If a different cluster method or different sample is employed, would the same clusters be obtained?

37
Q

What does every cluster analysis begin with?

A

The calculation of a distance, dissimilarity or proximity between each entity to be clustered.

Each clustering problem is based on some kind of “distance” between points

38
Q

When do you use a distance measure such as the Euclidean distance?

A

If the data is predominantly quantitative

39
Q

When do you use an index of dissimilarity?

A

When the data is predominantly qualitative

40
Q

If X and Y are two points, then a function d(X,Y) is said to be a distance between two observations if it satisfies what properties?

A
  • Non-negativity (distance is a non-negative number)
  • Identity (the distance of an object to itself is 0)
  • Symmetry (distance is a symmetric function)
  • Triangle inequality (going directly from X to Y in space is no longer than taking a detour via another object Z)
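
In symbols, for any observations X, Y and Z, the four properties read:

```latex
d(X, Y) \ge 0, \qquad
d(X, X) = 0, \qquad
d(X, Y) = d(Y, X), \qquad
d(X, Y) \le d(X, Z) + d(Z, Y)
```
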
41
Q

Describe the Euclidean distance

A

A Euclidean distance (L2 norm) is based on the locations of points in space. It is a measure of similarity (or dissimilarity) used for variables which are continuous.

[See flashcard for formula]
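
The formula itself is not reproduced in this export; for reference, the standard Euclidean distance between two observations X = (x_1, ..., x_p) and Y = (y_1, ..., y_p) is:

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
```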

42
Q

Give examples of variables which may use the Euclidean distance

A

Continuous variables
- Height
- Weight
- Temperature

43
Q

What is the Manhattan distance?

A

The Manhattan distance (L1 norm or city-block distance) is a variation of the Euclidean distance.

[See flashcard for formula]
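
Again the formula is elided in this export; the standard Manhattan distance is:

```latex
d(X, Y) = \sum_{i=1}^{p} |x_i - y_i|
```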

44
Q

What do hierarchical methods of clustering do?

A

They allow us to obtain a family of partitions, each associated with a subsequent level of grouping among the observations, calculated on the basis of the available data.

45
Q

What makes a hierarchical method agglomerative or divisive?

A

Based on how the hierarchical decomposition (representation) is formed.

46
Q

What does agglomerative clustering do?

A

Initially places each object into a cluster of its own. The clusters are then merged step by step according to some criterion.

47
Q

What does divisive clustering do?

A

All the objects form one initial cluster. The cluster is then split according to some criterion. The cluster splitting process repeats until, eventually, each new cluster contains only a single object.

48
Q

What is the termination condition?

A

A user-specified desired number of clusters, in either agglomerative or divisive clustering.

49
Q

Which is the more popular hierarchical clustering method?

A

Agglomerative

50
Q

What is the first step of agglomerative clustering?

A

To determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according to the “closeness measure”

51
Q

What is used during the agglomerative clustering process?

A

A distance matrix

The number in the i-th row and j-th column is the distance between the i-th and j-th elements.

As clustering progresses, rows and columns are merged as the clusters are merged and the distances updated.

52
Q

What is the benefit of using distance matrices?

A

Caches distances between clusters

53
Q

What happens once the initial distance matrix is created?

A

The smallest distance value is identified, which represents the distance of the closest elements. These elements are then merged together. This process is repeated iteratively until all observations are in one cluster.

54
Q

How do we carry out this process manually?

A

Using visual inspection of “closeness”, i.e. looking for the smallest value

55
Q

What is a dendrogram?

A

A dendrogram is a tree-like diagram that represents the relationships of similarity among a group of entities.

Tree of hierarchical clustering

This structure associates with every step of the hierarchical procedure one and only one clustering of the observations into k groups.
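
A minimal sketch that draws a dendrogram for toy data, using SciPy's linkage and dendrogram (the library choice is an assumption; the deck elsewhere references R):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(10, 2)        # ten hypothetical observations, two variables
Z = linkage(X, method="single")  # agglomerative clustering (single linkage)
dendrogram(Z)                    # the tree of the hierarchical clustering
plt.show()
```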

56
Q

What do the branches of the dendrogram describe?

A

Subsequent clusterings of the observations

Branches indicate divisions of the observations into clusters

57
Q

What is at the root of the tree?

A

All the observations are contained in only one class

58
Q

The partitions defined by a dendrogram are nested - what does this mean?

A

In hierarchical methods, the elements that are united (or divided) at a certain step will remain united (or divided) until the end of the process.

59
Q

Describe the outline of an agglomerative clustering algorithm.

A
  • Initialisation: given n statistical observations to classify, every element represents a group
  • Selection: the two nearest clusters are selected
  • Updating: the number of clusters is reduced by one (n-1 after the first step) by merging the two selected clusters into a single cluster. The distance matrix is updated.
  • Repetition: steps 2 and 3 are performed n-1 times
  • End: the procedure stops when all the elements are incorporated in a single cluster
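
A minimal Python sketch of this outline, assuming a symmetric NumPy distance matrix; single linkage is used for the cluster-distance update, and the function name is illustrative:

```python
import numpy as np

def single_linkage(dist):
    """Agglomerative clustering on a symmetric n x n distance matrix.

    Returns the merge history as (cluster_a, cluster_b, height) tuples.
    """
    n = dist.shape[0]
    clusters = {i: [i] for i in range(n)}  # initialisation: every element is a group
    d = {(i, j): dist[i, j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    while len(clusters) > 1:
        (a, b), h = min(d.items(), key=lambda kv: kv[1])  # selection: two nearest clusters
        clusters[a].extend(clusters.pop(b))               # updating: union into one cluster
        merges.append((a, b, h))
        d = {k: v for k, v in d.items() if b not in k}    # drop stale distances involving b
        for c in clusters:                                # recompute distances to the merged cluster
            if c != a:
                key = (min(a, c), max(a, c))
                # single linkage: shortest pairwise distance between the clusters
                d[key] = min(dist[i, j] for i in clusters[a] for j in clusters[c])
    return merges

# toy usage on a 3 x 3 symmetric distance matrix
print(single_linkage(np.array([[0, 2, 6], [2, 0, 3], [6, 3, 0]])))
```
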
60
Q

How do clustering methods differ?

A

In the way that the distance between two objects, or clusters, is measured

61
Q

What are the five most common hierarchical clustering methods?

A
  • Single linkage
  • Complete linkage
  • Average linkage
  • Centroid
  • Ward
62
Q

What is the definition of the distance between two clusters in the single linkage method?

A

Shortest distance between a point in one cluster and a point in the other cluster

Nearest neighbour

63
Q

What is the definition of the distance between two clusters in the complete linkage method?

A

Longest distance between a point in one cluster and a point in the other cluster

Furthest neighbour

64
Q

What is the definition of the distance between two clusters in the average linkage method?

A

Average distance between each point in one cluster and each point in the other cluster

Also called UPGMA - unweighted pair group method with arithmetic mean

65
Q

What is the definition of the distance between two clusters in the centroid method?

A

Distance between the centroids (vector of variable means) of the two clusters.

For a single observation, the centroid is simply that observation’s variable values

66
Q

What is the definition of the distance between two clusters in the Ward method?

A

The ANOVA sum of squares between the two clusters added up over all the variables

67
Q

What kind of clusters does the single linkage method tend to produce?

A

Elongated, cigar-shaped clusters.

It also commonly displays a phenomenon called chaining - dissimilar observations are joined into the same cluster because they are similar to intermediate observations between them.

68
Q

What kind of clusters does the complete linkage method tend to produce?

A

Compact clusters of approximately equal diameter.

Can be sensitive to outliers

69
Q

Which methods are sensitive to outliers?

A
  • Complete linkage
  • Ward’s method
70
Q

What kind of clusters does the average linkage method tend to produce?

A

A compromise between single and complete linkage. It is less likely to chain and is less susceptible to outliers. It also has a tendency to join clusters with small variances.

71
Q

What kind of clusters does the Ward method tend to produce?

A

Tends to join clusters with small numbers of observations and tends to produce clusters with roughly equal numbers of observations.

It can be sensitive to outliers

72
Q

What kind of clusters does the centroid method produce?

A

It offers an attractive alternative due to its simple and easily understood definition of cluster distances. It is also less sensitive to outliers than other hierarchical methods.

It may not perform as well as the average linkage or Ward method.

73
Q

How do you calculate the centroid of a group?

A

You need the original data.

In merging clusters together iteratively, it is necessary to recalculate the centroid positions of any clusters that have changed membership, i.e. the original distances are replaced with distances measured with respect to the centroid of the new cluster.

[see formula]
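
The referenced formula is not shown in this export; one standard form for the centroid of the cluster formed by merging clusters A and B (with n_A and n_B members and centroids \bar{x}_A, \bar{x}_B) is:

```latex
\bar{x}_{A \cup B} = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}
```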

74
Q

What are the similarities and differences between the average linkage and centroid methods?

A
  • Both methods consider all the observations in each cluster in calculating the distance between clusters
  • Average linkage method considers the average of the distances between observations
  • Centroid method calculates the centroid of each group then measures the distance between the centroids
75
Q

Explain the theory behind Ward’s method.

A

In choosing the groups to be joined, Ward’s method minimises an objective function using the principle that clustering aims to create groups with maximum internal cohesion and maximum external separation.

In Ward’s minimum-variance method, the distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables. At each generation, the within-cluster sum of square sis minimised over all partitions obtainable by merging two clusters from the previous generation.

76
Q

What are the two parts of the total deviance (T) of the p variables in the Ward method?

A

W - deviance within groups
B - deviance between groups

T = W + B

Analogous to dividing the variance into two parts for linear regression.

In Ward’s method, groups are joined so that the increase in W is smaller and the increase in B is larger. This achieves the greatest possible internal cohesion and external separation.

77
Q

Which methods do not require the distance matrix?

A
  • Centroid method
  • Ward’s method (which can be interpreted as a variant of the centroid method)
78
Q

What three things should you identify in a clustering question?

A
  • Whether coordinates are provided or not - ie do we need to calculate the distances or are they given?
  • The clustering method to use
  • The type of distance to use
79
Q

How should you arrange items on the dendrogram axis?

A

Arrange in clustering order, rather than alphabetical/numerical order

This looks the neatest.

80
Q

What does the minimum value in a distance matrix represent?

A

The smallest value (X) identifies the two items to be clustered next. If those items are A and B, it is said that A and B are clustered at X. X will be the height of the dendrogram branch.

81
Q

What does an outlier look like on a dendrogram?

A

A single isolated branch

82
Q

What is k?

A

The number of non-overlapping clusters

It is intended to classify the n observations - the partitioning algorithm classifies each observation on the basis of the selected criterion

83
Q

How are the number of clusters k usually provided?

A

The user has to input the desired number of clusters k

84
Q

What is the general algorithm for partition clustering?

A

1 - choose the number of groups k and an initial clustering of the n statistical units into that number of groups
2 - evaluate the “transfer” of each observation from its initial group to another group. The purpose is to maximise the internal cohesion of the groups. The variation in the objective function determined by the transfer is calculated and, if favourable, the transfer becomes permanent.
3 - repeat step 2 until a stopping rule is applied

85
Q

Which are faster - hierarchical or non-hierarchical algorithms?

A

Non-hierarchical algorithms - they employ an iterative calculation structure, which does not require us to determine the distance matrix.

They are more suitable for large datasets as hierarchical algorithms would be too slow.

86
Q

Why can it be difficult to do a global maximisation of the objective function?

A

There can be many possible ways of dividing n observations into k non-overlapping groups, especially for real data, and it may be impossible to obtain and compare all these combinations.

Non-hierarchical algorithms may produce constrained solutions, often corresponding to local maxima of the objective function.

87
Q

What are the two main partitioning algorithms?

A
  • The k-means algorithm
  • The k-medoids algorithm
88
Q

What are clusters represented by and what does k indicate in the k-means algorithm?

A

Clusters - represented by the mean value of the objects in that cluster
k - the number of groups established a priori

89
Q

What are clusters represented by in the k-medoids algorithm?

A

Each cluster is represented by one of the objects located near the centre of the cluster (the medoid)

90
Q

What are the inputs and outputs of the k-means algorithm?

A

Inputs:
k - the number of clusters
D - a set containing n objects

Output:
- A set of k clusters

91
Q

What are the steps of the k-means algorithm?

A

1 - initialisation
2 - transfer evaluation
3 - repetition

92
Q

What is involved in the initialisation step of the k-means algorithm?

A

Having determined the number of groups, k points (called seeds) are defined in the p-dimensional space. The seeds constitute centroids (measure of position, mean) of the clusters in the initial partition.

The seeds should be sufficiently far apart to improve the convergence properties of the algorithm.

Once the seeds are defined, an initial partition of the observations is built, allocating each observation to the group whose centroid is closest.

93
Q

What is involved in the transfer evaluation step of the k-means algorithm?

A

The distance of each observation from the centroids of the k groups is calculated. The distance between an observation and the centroid of the group to which it has been assigned has to be a minimum; if it is not, the observation is moved to the cluster whose centroid is closest. The centroids of the old group and the new group are then re-calculated.

94
Q

How often is step 2 repeated?

A

Until a suitable stabilisation of the groups.

95
Q

What does the k-means algorithm attempt to minimise and what is the formula for this?

A

Determine k partitions that minimise the square-error function

[See formula]
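
The formula is elided in this export; the square-error function is usually written as below, where C_i is the i-th cluster and m_i its mean (centroid):

```latex
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
```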

96
Q

Which distance measure does the k-means algorithm often employ to calculate the distance between the observations and the centroids?

A

Euclidean distance

97
Q

What is a disadvantage of the k-means method?

A

The possibility of obtaining distorted results when there are outliers in the data

98
Q

What are the steps for carrying out a k-means algorithm question?

A
  • Given a series of points and told how many clusters k are desired
  • Randomly select (or be given) the initial cluster centres
  • Create a table showing the distance of each point from each of the centroids - calculate Euclidean distances
  • Use an asterisk to indicate which centre is closest to each point - we are looking for the minimum in each row
  • Write down the new clusters
  • Recalculate the centroids (determining the mean of the points in each cluster)
  • Repeat this process until the clusters remain the same
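
A minimal Python sketch mirroring these manual steps (the function name and toy inputs are illustrative; it assumes Euclidean distance and that no cluster becomes empty):

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Minimal k-means loop mirroring the manual steps above."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # the distance table: Euclidean distance of each point from each centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # the "asterisk": the closest centre is the minimum in each row
        labels = dists.argmin(axis=1)
        # recalculate the centroids as the mean of the points in each cluster
        new = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):  # clusters remain the same: stop
            break
        centroids = new
    return labels, centroids

# toy usage: k = 2 desired clusters, initial centres taken from the data
labels, centres = kmeans([[1, 1], [1.5, 2], [5, 7], [6, 8]], [[1, 1], [5, 7]])
```
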
99
Q

How do you calculate the centroid?

A

Determine the mean of the points in each cluster

100
Q

Is the k-means algorithm sensitive to outliers? Why?

A

The k-means algorithm is sensitive to outliers, as an object with an extremely large value may distort the centroid for a particular cluster

101
Q

What is the alternative to the k-means algorithm to account for its sensitivity to outliers?

A

Using the k-medoids algorithm

Instead of taking the mean value of the objects in a cluster as a reference point, this algorithm picks an actual object to represent the cluster.

The object is chosen in such a way that it is “closest” to the centre of the cluster

102
Q

What are some issues with determining the number of clusters?

A

There are no completely satisfactory methods that can be used for determining the number of population clusters for any type of cluster analysis.

Looking for a “gap” in the joining of clusters can be subjective and dependent on the clustering method.

103
Q

In the case of hierarchical clustering, how can we determine the optimal number of clusters?

A

Using a dendrogram by comparing the average distances between cluster centres.

104
Q

What is the k-means algorithm sensitive to?

A
  • Outliers
  • The randomly-chosen initial cluster centres
  • The number of clusters
105
Q

The k-means algorithm is very sensitive to the randomly-chosen cluster centres - what does this mean practically?

A

If we had selected a different combination of starting points, we may have found clusters that split the data differently.

106
Q

Why does the choice of the number of clusters require a delicate balance?

A

Setting k to be very large will improve the homogeneity of the clusters but at the same time risks overfitting the data.

107
Q

What is the ideal situation for choosing the number of clusters?

A

Ideally you will have a priori knowledge about the true groupings and you can apply this information to choosing the number of clusters.

108
Q

What might determine the number of clusters?

A
  • Business requirements or the motivation for the analysis
  • e.g. the number of tables in the meeting hall when clustering an attendee list
  • e.g. a budget to create X distinct advertising campaigns
  • e.g. when clustering movies, consider award-show genre categories
109
Q

What is one rule of thumb for setting k without any prior knowledge?

What are the problems with this?

A

k = sqrt(n/2)
n - number of examples in the dataset

If the dataset is large, it is likely to result in an unwieldy number of clusters. Other statistical methods can assist in finding a suitable k-means cluster set.
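
As a quick worked example with a hypothetical dataset of n = 200 observations:

```latex
k = \sqrt{n/2} = \sqrt{200/2} = \sqrt{100} = 10
```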

110
Q

What other method can be used to find a suitable k-means cluster set?

A

The elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k.

Homogeneity is expected to increase as additional clusters are added and heterogeneity continues to decrease with more clusters.

We want to find a k such that there are diminishing returns beyond that point. This value of k is known as the elbow point, because the curve bends there like an elbow.

Numerous statistics can be used with the elbow method.
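
A minimal sketch of the elbow method using scikit-learn (the library choice is an assumption; the deck elsewhere references R). KMeans's inertia_ attribute gives the total within-cluster sum of squares:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # hypothetical data: 200 observations, 2 variables

ks = range(1, 11)
# total within-cluster sum of squares for each candidate k
wss = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")  # the bend in this curve is the "elbow"
plt.xlabel("k (number of clusters)")
plt.ylabel("total within-cluster sum of squares")
plt.show()
```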

111
Q

Why is it not always feasible to iteratively test a large number of k values?

A

Clustering large datasets can be fairly time-consuming; clustering the data repeatedly is even worse.

112
Q

Why is it usually okay to choose a k value based on convenience?

A

Applications requiring the exact optimal set of clusters are fairly rare.

113
Q

How might the process of setting k itself lead to interesting insights?

A

By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries.

Groups that are more tightly clustered will change little, while less homogeneous groups will form and disband over time.
