Chapter 4 Flashcards

1
Q

What is cluster analysis?

A

Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset.

2
Q

How does cluster analysis differ from classification analysis?

A
  • In classification analysis, the class of each object is known at the start of the process
  • When carrying out cluster analysis, the class of each object is assumed not known (therefore it is known as an unsupervised, indirect or descriptive method)
3
Q

Is cluster analysis supervised or unsupervised learning?

A

Unsupervised - the class of each object is assumed unknown

4
Q

What is clustering?

A

The process of grouping a set of physical or abstract objects into classes of similar objects

5
Q

What is a cluster?

A

A collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in another cluster.

A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.

6
Q

How can we measure the similarity of the objects within each cluster?

A

The most commonly used methods are distance-based.

There are a variety of techniques available.

7
Q

What is the formal objective of cluster analysis?

A

Given a data matrix composed of n observations (rows) and p variables (columns), the objective of cluster analysis is to cluster the observations into groups that are internally homogeneous (internal cohesion) and heterogeneous from group to group (external separation)

Intra-cluster distances are minimised
Inter-cluster distances are maximised

8
Q

What is the difference between PCA and cluster analysis?

A

Cluster analysis involves clustering the data cases (rows), whereas PCA operates on the data variables (columns)

9
Q

What are some simple examples of clustering objects?

A
  • Marketing researchers use cluster analysis as a customer-segmentation strategy
  • Psychological researchers cluster symptoms and demographics of patients to uncover subtypes of a disorder, enabling more targeted and effective treatments
  • Medical researchers help catalog gene-expression patterns
10
Q

What are the two main areas of cluster analysis techniques?

A
  • Hierarchical
  • Non-hierarchical
11
Q

What are hierarchical methods?

What are the possible approaches?

A

Hierarchical methods create a hierarchical decomposition of the given set of data objects.

  • Agglomerative approach
  • Divisive approach
12
Q

What is the agglomerative approach?

A

Bottom-up approach

Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.

13
Q

What is the divisive approach?

A

Top-down approach

Starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own individual cluster, or a termination condition holds.

14
Q

What are non-hierarchical methods also called?

A

Partitioning methods

15
Q

What are non-hierarchical methods?

A

Non-hierarchical methods produce k partitions of the objects, and the objects can be relocated, allowing poor initial partitions to be corrected at a later stage.

Each of the k partitions represents a cluster which must satisfy:
- Each cluster must contain at least one object
- Each object must belong to exactly one group

Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative technique that attempts to improve the partitioning by moving objects from one group to another.

16
Q

How do non-hierarchical and hierarchical methods differ?

A
  • Non-hierarchical methods allow relocation of objects, thus allowing poor initial partitions to be corrected at a later stage
  • Non-hierarchical methods are computationally efficient and therefore suitable for large datasets
  • The number of clusters must be known in advance for non-hierarchical clustering
  • Non-hierarchical methods are non-deterministic - generally producing different clusters with different initialisations
17
Q

What is the general criterion of a good partitioning?

A

Objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different

18
Q

When can hierarchical clustering be particularly useful?

A

When you expect nested clustering and a meaningful hierarchy. This is often the case in the biological sciences.

19
Q

Why may hierarchical algorithms be considered greedy?

A

Once an observation is assigned to a cluster, it cannot be reassigned later in the process.

20
Q

What are the disadvantages of hierarchical clustering?

A

It can be difficult to apply in large samples (it is computationally inefficient).

e.g. hundreds or thousands of observations; partitioning methods can work well in these situations

21
Q

Which clustering method is deterministic, and what does this mean?

A

Hierarchical clustering methods

Deterministic means that the method produces the same clustering each time.

22
Q

What are the common steps in cluster analysis?

A

1 - choose the appropriate attributes
2 - scale the data
3 - screen for outliers
4 - calculate distances
5 - select a clustering algorithm
6 - obtain one or more cluster solutions
7 - determine the number of clusters present
8 - obtain a final clustering solution
9 - visualise the results
10 - interpret the clusters
11 - validate the results

23
Q

How do you choose appropriate attributes?

A

Select the variables you feel may be important for identifying and understanding differences among groups of observations within the data.

A sophisticated cluster analysis cannot compensate for a poor choice of variables.

24
Q

Why would you need to scale the data?

A

If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable.

25
Q

How can you scale the data?

A
  • Most popular approach is to standardise each variable to a mean of 0 and an SD of 1
  • Dividing each variable by its maximum value
  • Subtracting the variable’s mean and dividing by the variable’s mean absolute deviation
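
A minimal NumPy sketch of these three scaling approaches (the toy data matrix is hypothetical):

```python
import numpy as np

# hypothetical data matrix: rows are observations, columns are variables
X = np.array([[170.0, 65.0],
              [160.0, 80.0],
              [180.0, 72.0]])

# 1. standardise each variable to a mean of 0 and an SD of 1
standardised = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. divide each variable by its maximum value
by_max = X / X.max(axis=0)

# 3. subtract the mean and divide by the mean absolute deviation
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
robust = (X - X.mean(axis=0)) / mad
```
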
26
Q

Why do we need to screen for outliers?

A

Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained.

27
Q

How can you screen for outliers?

A
  • Using functions from the outliers package
  • The mvoutlier package contains functions that can be used to identify multivariate outliers

Or you can use a clustering method that is robust to the presence of outliers
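
The packages named above are R packages; as a language-neutral illustration (not those packages' API), a simple z-score screen might look like this in Python:

```python
import numpy as np

def zscore_outliers(X, threshold=3.0):
    """Flag rows where any variable lies more than `threshold` SDs from its mean."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return np.where((z > threshold).any(axis=1))[0]  # indices of suspect rows
```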

28
Q

What are the most popular measures of the distance between two observations?

A
  • Euclidean distance - the most popular
  • Manhattan
  • Minkowski
  • Hamming
  • Tchebytchev
29
Q

Why might you try more than one algorithm?

A

To see how robust the results are to the choice of methods

30
Q

What does visualisation help determine?

A

The meaning and usefulness of the cluster solution

31
Q

What kind of visualisation is usually used for hierarchical clustering?

A

Dendrogram

32
Q

What kind of visualisation is usually used for partitioning?

A

Bivariate cluster plot

33
Q

What do you do once a cluster solution has been obtained?

A

Interpret (and possibly name) the clusters.

  • What do the observations in a cluster have in common?
  • How do they differ from the observations in other clusters?

This step typically utilises summary statistics for each variable by cluster
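
A minimal pandas sketch of such by-cluster summaries, assuming a hypothetical data frame with a cluster label per observation:

```python
import pandas as pd

# hypothetical data frame: original variables plus a cluster label per observation
df = pd.DataFrame({
    "height":  [170, 160, 180, 175],
    "weight":  [65, 80, 72, 70],
    "cluster": [1, 1, 2, 2],
})

# median of each continuous variable within each cluster
print(df.groupby("cluster").median())
```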

34
Q

What summary statistics are used for continuous data?

A

The mean or median for each variable within each cluster

35
Q

In addition, what summary statistics are included for mixed data (data that contains categorical variables)?

A

Modes or category distributions

36
Q

What does validating the cluster solution involve?

A

Asking the question “are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or of the statistical technique?”

If a different cluster method or different sample is employed, would the same clusters be obtained?

37
Q

What does every cluster analysis begin with?

A

The calculation of a distance, dissimilarity or proximity between each entity to be clustered.

Each clustering problem is based on some kind of “distance” between points

38
Q

When do you use a distance measure such as the Euclidean distance?

A

If the data is predominantly quantitative

39
Q

When do you use an index of dissimilarity?

A

When the data is predominantly qualitative

40
Q

If X and Y are two points, then a function d(X,Y) is said to be a distance between two observations if it satisfies what properties?

A
  • Non-negativity (distance is a non-negative number)
  • Identity (the distance of an object to itself is 0)
  • Symmetry (distance is a symmetric function)
  • Triangle inequality (going directly from X to Y in space is no longer than taking a detour via another object Z)
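
In symbols, for any observations X, Y and Z, the four properties read:

```latex
d(X, Y) \ge 0, \qquad
d(X, X) = 0, \qquad
d(X, Y) = d(Y, X), \qquad
d(X, Y) \le d(X, Z) + d(Z, Y)
```
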
41
Q

Describe the Euclidean distance

A

A Euclidean distance (L2 norm) is based on the locations of points in space. It is a measure of similarity (or dissimilarity) used for variables which are continuous.

[See flashcard for formula]
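
The formula itself is not reproduced in this export; for reference, the standard Euclidean distance between two observations X = (x_1, ..., x_p) and Y = (y_1, ..., y_p) is:

```latex
d(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
```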

42
Q

Give examples of variables which may use the Euclidean distance

A

Continuous variables
- Height
- Weight
- Temperature

43
Q

What is the Manhattan distance?

A

The Manhattan distance (L1 norm or city-block distance) is a variation of the Euclidean distance.

[See flashcard for formula]
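
Again the formula is elided in this export; the standard Manhattan distance is:

```latex
d(X, Y) = \sum_{i=1}^{p} |x_i - y_i|
```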

44
Q

What do hierarchical methods of clustering do?

A

They allow us to obtain a family of partitions, each associated with a subsequent level of grouping among the observations, calculated on the basis of the available data.

45
Q

What makes a hierarchical method agglomerative or divisive?

A

Based on how the hierarchical decomposition (representation) is formed.

46
Q

What does agglomerative clustering do?

A

Initially places each object into a cluster of its own. The clusters are then merged step by step according to some criterion.

47
Q

What does divisive clustering do?

A

All the objects form one initial cluster. The cluster is then split according to some criterion. The cluster splitting process repeats until, eventually, each new cluster contains only a single object.

48
Q

What is the termination condition?

A

A user-specified desired number of clusters, in either agglomerative or divisive clustering.

49
Q

Which is the more popular hierarchical clustering method?

A

Agglomerative

50
Q

What is the first step of agglomerative clustering?

A

To determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according to the “closeness measure”

51
Q

What is used during the agglomerative clustering process?

A

A distance matrix

The number in the i-th row and j-th column is the distance between the i-th and j-th elements.

As clustering progresses, rows and columns are merged as the clusters are merged and the distances updated.

52
Q

What is the benefit of using distance matrices?

A

Caches distances between clusters

53
Q

What happens once the initial distance matrix is created?

A

The smallest distance value is identified, which represents the distance of the closest elements. These elements are then merged together. This process is repeated iteratively until all observations are in one cluster.

54
Q

How do we carry out this process manually?

A

Using visual inspection of “closeness”, i.e. looking for the smallest value

55
Q

What is a dendrogram?

A

A dendrogram is a tree-like diagram that represents the relationships of similarity among a group of entities.

Tree of hierarchical clustering

This structure associates with every step of the hierarchical procedure one and only one clustering of the observations into k groups.
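
A minimal sketch that draws a dendrogram for toy data, using SciPy's linkage and dendrogram (the library choice is an assumption; the deck elsewhere references R):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(10, 2)        # ten hypothetical observations, two variables
Z = linkage(X, method="single")  # agglomerative clustering (single linkage)
dendrogram(Z)                    # the tree of the hierarchical clustering
plt.show()
```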

56
Q

What do the branches of the dendrogram describe?

A

Subsequent clusterings of the observations

Branches indicate divisions of the observations into clusters

57
Q

What is at the root of the tree?

A

All the observations are contained in only one class

58
Q

The partitions defined by a dendrogram are nested - what does this mean?

A

In hierarchical methods, the elements that are united (or divided) at a certain step will remain united (or divided) until the end of the process.

59
Q

Describe the outline of an agglomerative clustering algorithm.

A
  • Initialisation: given n statistical observations to classify, every element represents a group
  • Selection: the two nearest clusters are selected
  • Updating: the number of clusters is reduced by one (n-1 after the first step) by merging the two selected clusters into a single cluster. The distance matrix is updated.
  • Repetition: steps 2 and 3 are performed n-1 times
  • End: the procedure stops when all the elements are incorporated in a single cluster
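
A minimal Python sketch of this outline, assuming a symmetric NumPy distance matrix; single linkage is used for the cluster-distance update, and the function name is illustrative:

```python
import numpy as np

def single_linkage(dist):
    """Agglomerative clustering on a symmetric n x n distance matrix.

    Returns the merge history as (cluster_a, cluster_b, height) tuples.
    """
    n = dist.shape[0]
    clusters = {i: [i] for i in range(n)}  # initialisation: every element is a group
    d = {(i, j): dist[i, j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    while len(clusters) > 1:
        (a, b), h = min(d.items(), key=lambda kv: kv[1])  # selection: two nearest clusters
        clusters[a].extend(clusters.pop(b))               # updating: union into one cluster
        merges.append((a, b, h))
        d = {k: v for k, v in d.items() if b not in k}    # drop stale distances involving b
        for c in clusters:                                # recompute distances to the merged cluster
            if c != a:
                key = (min(a, c), max(a, c))
                # single linkage: shortest pairwise distance between the clusters
                d[key] = min(dist[i, j] for i in clusters[a] for j in clusters[c])
    return merges

# toy usage on a 3 x 3 symmetric distance matrix
print(single_linkage(np.array([[0, 2, 6], [2, 0, 3], [6, 3, 0]])))
```
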
60
Q

How do clustering methods differ?

A

In the way that the distance between two objects, or clusters, is measured

61
Q

What are the five most common hierarchical clustering methods?

A
  • Single linkage
  • Complete linkage
  • Average linkage
  • Centroid
  • Ward
62
Q

What is the definition of the distance between two clusters in the single linkage method?

A

Shortest distance between a point in one cluster and a point in the other cluster

Nearest neighbour

63
Q

What is the definition of the distance between two clusters in the complete linkage method?

A

Longest distance between a point in one cluster and a point in the other cluster

Furthest neighbour

64
Q

What is the definition of the distance between two clusters in the average linkage method?

A

Average distance between each point in one cluster and each point in the other cluster

Also called UPGMA - unweighted pair group method with arithmetic mean

65
Q

What is the definition of the distance between two clusters in the centroid method?

A

Distance between the centroids (vector of variable means) of the two clusters.

For a single observation, the centroid is simply that observation’s variable values

66
Q

What is the definition of the distance between two clusters in the Ward method?

A

The ANOVA sum of squares between the two clusters added up over all the variables

67
Q

What kind of clusters does the single linkage method tend to produce?

A

Elongated, cigar-shaped clusters.

It also commonly displays a phenomenon called chaining - dissimilar observations are joined into the same cluster because they are similar to intermediate observations between them.

68
Q

What kind of clusters does the complete linkage method tend to produce?

A

Compact clusters of approximately equal diameter.

Can be sensitive to outliers

69
Q

Which methods are sensitive to outliers?

A
  • Complete linkage
  • Ward’s method
70
Q

What kind of clusters does the average linkage method tend to produce?

A

A compromise between single and complete linkage. It is less likely to chain and is less susceptible to outliers. It also has a tendency to join clusters with small variances.

71
Q

What kind of clusters does the Ward method tend to produce?

A

Tends to join clusters with small numbers of observations and tends to produce clusters with roughly equal numbers of observations.

It can be sensitive to outliers

72
Q

What kind of clusters does the centroid method produce?

A

It offers an attractive alternative due to its simple and easily understood definition of cluster distances. It is also less sensitive to outliers than other hierarchical methods.

It may not perform as well as the average linkage or Ward method.

73
Q

How do you calculate the centroid of a group?

A

You need the original data.

In merging clusters together iteratively, it is necessary to recalculate the centroid positions of any clusters that have changed membership, i.e. the original distances are replaced with distances measured with respect to the centroid of the new cluster.

[see formula]
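
The referenced formula is not shown in this export; one standard form for the centroid of the cluster formed by merging clusters A and B (with n_A and n_B members and centroids \bar{x}_A, \bar{x}_B) is:

```latex
\bar{x}_{A \cup B} = \frac{n_A \bar{x}_A + n_B \bar{x}_B}{n_A + n_B}
```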

74
Q

What are the similarities and differences between the average linkage and centroid methods?

A
  • Both methods consider all the observations in each cluster in calculating the distance between clusters
  • Average linkage method considers the average of the distances between observations
  • Centroid method calculates the centroid of each group then measures the distance between the centroids
75
Q

Explain the theory behind Ward’s method.

A

In choosing the groups to be joined, Ward’s method minimises an objective function using the principle that clustering aims to create groups with maximum internal cohesion and maximum external separation.

In Ward’s minimum-variance method, the distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables. At each generation, the within-cluster sum of square sis minimised over all partitions obtainable by merging two clusters from the previous generation.

76
Q

What are the two parts of the total deviance (T) of the p variables in the Ward method?

A

W - deviance within groups
B - deviance between groups

T = W + B

Analogous to dividing the variance into two parts for linear regression.

In Ward’s method, groups are joined so that the increase in W is smaller and the increase in B is larger. This achieves the greatest possible internal cohesion and external separation.

77
Q

Which methods do not require the distance matrix?

A
  • Centroid method
  • Ward’s method (which can be interpreted as a variant of the centroid method)
78
Q

What three things should you identify in a clustering question?

A
  • Whether coordinates are provided or not - ie do we need to calculate the distances or are they given?
  • The clustering method to use
  • The type of distance to use
79
Q

How should you arrange items on the dendrogram axis?

A

Arrange in clustering order, rather than alphabetical/numerical order

This looks the neatest.

80
Q

What does the minimum value in a distance matrix represent?

A

The smallest value (X) identifies the two items to be clustered next. If those items are A and B, it is said that A and B are clustered at X. X will be the height of the dendrogram branch.

81
Q

What does an outlier look like on a dendrogram?

A

A single isolated branch

82
Q

What is k?

A

The number of non-overlapping clusters

It is intended to classify the n observations - the partitioning algorithm classifies each observation on the basis of the selected criterion

83
Q

How are the number of clusters k usually provided?

A

The user has to input the desired number of clusters k

84
Q

What is the general algorithm for partition clustering?

A

1 - choose the number of groups k and an initial clustering of the n statistical units into that number of groups
2 - evaluate the “transfer” of each observation from its initial group to another group. The purpose is to maximise the internal cohesion of the groups. The variation in the objective function determined by the transfer is calculated and, if favourable, the transfer becomes permanent.
3 - repeat step 2 until a stopping rule is applied

85
Q

Which are faster - hierarchical or non-hierarchical algorithms?

A

Non-hierarchical algorithms - they employ an iterative calculation structure, which does not require us to determine the distance matrix.

They are more suitable for large datasets as hierarchical algorithms would be too slow.

86
Q

Why can it be difficult to do a global maximisation of the objective function?

A

There can be many possible ways of dividing n observations into k non-overlapping groups, especially for real data, and it may be impossible to obtain and compare all these combinations.

Non-hierarchical algorithms may produce constrained solutions, often corresponding to local maxima of the objective function.

87
Q

What are the two main partitioning algorithms?

A
  • The k-means algorithm
  • The k-medoids algorithm
88
Q

What are clusters represented by and what does k indicate in the k-means algorithm?

A

Clusters - represented by the mean value of the objects in that cluster
k - the number of groups established a priori

89
Q

What are clusters represented by in the k-medoids algorithm?

A

Each cluster is represented by one of the objects located near the centre of the cluster (the medoid)

90
Q

What are the inputs and outputs of the k-means algorithm?

A

Inputs:
k - the number of clusters
D - a set containing n objects

Output:
- A set of k clusters

91
Q

What are the steps of the k-means algorithm?

A

1 - initialisation
2 - transfer evaluation
3 - repetition

92
Q

What is involved in the initialisation step of the k-means algorithm?

A

Having determined the number of groups, k points (called seeds) are defined in the p-dimensional space. The seeds constitute centroids (measure of position, mean) of the clusters in the initial partition.

The seeds should be sufficiently far apart to improve the convergence properties of the algorithm.

Once the seeds are defined, an initial partition of the observations is built, allocating each observation to the group whose centroid is closest.

93
Q

What is involved in the transfer evaluation step of the k-means algorithm?

A

The distance of each observation from the centroids of the k groups is calculated. The distance between an observation and the centroid of the group to which it has been assigned has to be a minimum; if it is not, the observation is moved to the cluster whose centroid is closest. The centroids of the old group and the new group are then re-calculated.

94
Q

How often is step 2 repeated?

A

Until a suitable stabilisation of the groups.

95
Q

What does the k-means algorithm attempt to minimise and what is the formula for this?

A

Determine k partitions that minimise the square-error function

[See formula]
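
The formula is elided in this export; the square-error function is usually written as below, where C_i is the i-th cluster and m_i its mean (centroid):

```latex
E = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - m_i \rVert^2
```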

96
Q

Which distance measure does the k-means algorithm often employ to calculate the distance between the observations and the centroids?

A

Euclidean distance

97
Q

What is a disadvantage of the k-means method?

A

The possibility of obtaining distorted results when there are outliers in the data

98
Q

What are the steps for carrying out a k-means algorithm question?

A
  • Given a series of points and told how many clusters k are desired
  • Randomly select (or be given) the initial cluster centres
  • Create a table showing the distance of each point from each of the centroids - calculate Euclidean distances
  • Use an asterisk to indicate which centre is closest to each point - we are looking for the minimum in each row
  • Write down the new clusters
  • Recalculate the centroids (determining the mean of the points in each cluster)
  • Repeat this process until the clusters remain the same
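
A minimal Python sketch mirroring these manual steps (the function name and toy inputs are illustrative; it assumes Euclidean distance and that no cluster becomes empty):

```python
import numpy as np

def kmeans(points, centroids, max_iter=100):
    """Minimal k-means loop mirroring the manual steps above."""
    points = np.asarray(points, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # the distance table: Euclidean distance of each point from each centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # the "asterisk": the closest centre is the minimum in each row
        labels = dists.argmin(axis=1)
        # recalculate the centroids as the mean of the points in each cluster
        new = np.array([points[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):  # clusters remain the same: stop
            break
        centroids = new
    return labels, centroids

# toy usage: k = 2 desired clusters, initial centres taken from the data
labels, centres = kmeans([[1, 1], [1.5, 2], [5, 7], [6, 8]], [[1, 1], [5, 7]])
```
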
99
Q

How do you calculate the centroid?

A

Determine the mean of the points in each cluster

100
Q

Is the k-means algorithm sensitive to outliers? Why?

A

The k-means algorithm is sensitive to outliers, as an object with an extremely large value may distort the centroid for a particular cluster

101
Q

What is the alternative to the k-means algorithm to account for its sensitivity to outliers?

A

Using the k-medoids algorithm

Instead of taking the mean value of the objects in a cluster as a reference point, this algorithm picks an actual object to represent the cluster.

The object is chosen in such a way that it is “closest” to the centre of the cluster

102
Q

What are some issues with determining the number of clusters?

A

There are no completely satisfactory methods that can be used for determining the number of population clusters for any type of cluster analysis.

Looking for a “gap” in the joining of clusters can be subjective and dependent on the clustering method.

103
Q

In the case of hierarchical clustering, how can we determine the optimal number of clusters?

A

Using a dendrogram by comparing the average distances between cluster centres.

104
Q

What is the k-means algorithm sensitive to?

A
  • Outliers
  • The randomly-chosen initial cluster centres
  • The number of clusters
105
Q

The k-means algorithm is very sensitive to the randomly-chosen cluster centres - what does this mean practically?

A

If we had selected a different combination of starting points, we may have found clusters that split the data differently.

106
Q

Why does the choice of the number of clusters require a delicate balance?

A

Setting k to be very large will improve the homogeneity of the clusters but at the same time risks overfitting the data.

107
Q

What is the ideal situation for choosing the number of clusters?

A

Ideally you will have a priori knowledge about the true groupings and you can apply this information to choosing the number of clusters.

108
Q

What might determine the number of clusters?

A
  • Business requirements or the motivation for the analysis
  • e.g. the number of tables in the meeting hall when clustering an attendee list
  • e.g. a budget to create X distinct advertising campaigns
  • e.g. when clustering movies, consider award-show genre categories
109
Q

What is one rule of thumb for setting k without any prior knowledge?

What are the problems with this?

A

k = sqrt(n/2)
n - number of examples in the dataset

If the dataset is large, it is likely to result in an unwieldy number of clusters. Other statistical methods can assist in finding a suitable k-means cluster set.
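
As a quick worked example with a hypothetical dataset of n = 200 observations:

```latex
k = \sqrt{n/2} = \sqrt{200/2} = \sqrt{100} = 10
```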

110
Q

What other method can be used to find a suitable k-means cluster set?

A

The elbow method attempts to gauge how the homogeneity or heterogeneity within the clusters changes for various values of k.

Homogeneity is expected to increase as additional clusters are added and heterogeneity continues to decrease with more clusters.

We want to find a k such that there are diminishing returns beyond that point. This value of k is known as the elbow point, because the curve bends there like an elbow.

Numerous statistics can be used with the elbow method.
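
A minimal sketch of the elbow method using scikit-learn (the library choice is an assumption; the deck elsewhere references R). KMeans's inertia_ attribute gives the total within-cluster sum of squares:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)  # hypothetical data: 200 observations, 2 variables

ks = range(1, 11)
# total within-cluster sum of squares for each candidate k
wss = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, wss, marker="o")  # the bend in this curve is the "elbow"
plt.xlabel("k (number of clusters)")
plt.ylabel("total within-cluster sum of squares")
plt.show()
```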

111
Q

Why is it not always feasible to iteratively test a large number of k values?

A

Clustering large datasets can be fairly time-consuming; clustering the data repeatedly is even worse.

112
Q

Why is it usually okay to choose a k value based on convenience?

A

Applications requiring the exact optimal set of clusters are fairly rare.

113
Q

How might the process of setting k itself lead to interesting insights?

A

By observing how the characteristics of the clusters change as k is varied, one might infer where the data have naturally defined boundaries.

Groups that are more tightly clustered will change little, while less homogeneous groups will form and disband over time.
