Chapter 4 Flashcards
What is cluster analysis?
Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset.
How does cluster analysis differ from classification analysis?
- Classification analysis knows the class of each object at the start of the process
- When carrying out cluster analysis, the class of each object is assumed unknown (hence it is known as an unsupervised, indirect, or descriptive method)
Is cluster analysis supervised or unsupervised learning?
Unsupervised - the class of each object is assumed unknown
What is clustering?
The process of grouping a set of physical or abstract objects into classes of similar objects
What is a cluster?
A collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in another cluster.
A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.
How can we measure the similarity of the objects within each cluster?
Most commonly used methods are distance-based
There are a variety of techniques available.
What is the formal objective of cluster analysis?
Given a data matrix composed of n observations (rows) and p variables (columns), the objective of cluster analysis is to cluster the observations into groups that are internally homogeneous (internal cohesion) and heterogeneous from group to group (external separation)
Intra-cluster distances are minimised
Inter-cluster distances are maximised
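The cohesion/separation objective can be illustrated numerically. A minimal sketch with hypothetical 2D data: within-cluster (intra) distances should come out small, between-cluster (inter) distances large.

```python
import numpy as np

# Toy illustration (hypothetical data): two well-separated groups in 2D.
a = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1]])   # cluster A
b = np.array([[5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])   # cluster B

def mean_pairwise(x, y):
    """Mean Euclidean distance between every point in x and every point in y."""
    diffs = x[:, None, :] - y[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

intra_a = mean_pairwise(a, a)   # small: internal cohesion
inter = mean_pairwise(a, b)     # large: external separation
print(intra_a < inter)          # True for a good partition
```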
What is the difference between PCA and cluster analysis?
Cluster analysis involves grouping the data cases (rows), whereas PCA reduces the data variables (columns)
What are some simple examples of clustering objects?
- Marketing researchers use cluster analysis as a customer-segmentation strategy
- Psychological researchers cluster symptoms and demographics of patients to uncover subtypes of disorder - more targeted and effective treatments etc.
- Medical researchers use clustering to help catalogue gene-expression patterns
What are the two main areas of cluster analysis techniques?
- Hierarchical
- Non-hierarchical
What are hierarchical methods?
What are the possible approaches?
Hierarchical methods create a hierarchical decomposition of the given set of data objects.
- Agglomerative approach
- Divisive approach
What is the agglomerative approach?
Bottom-up approach
Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.
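The merging process described above can be sketched with SciPy's agglomerative `linkage` function on hypothetical data (the choice of `ward` as merge criterion is an assumption; `single`, `complete` and `average` are alternatives):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four observations, each starting as its own group; the closest pair
# is merged at every step until a single cluster remains.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

Z = linkage(X, method="ward")

# n observations are merged in n - 1 steps.
print(Z.shape)  # (3, 4): each row records one merge
```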
What is the divisive approach?
Top-down approach
Starts with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until eventually each object is in its own individual cluster, or a termination condition holds.
What are non-hierarchical methods also called?
Partitioning methods
What are non-hierarchical methods?
Non-hierarchical methods produce k partitions of the objects, and the objects can be relocated, allowing poor initial partitions to be corrected at a later stage.
Each of the k partitions represents a cluster which must satisfy:
- Each cluster must contain at least one object
- Each object must belong to exactly one group
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative technique that attempts to improve the partitioning by moving objects from one group to another.
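The iterative-improvement idea behind partitioning methods can be sketched with a hand-rolled k-means loop (data, k, and the iteration count are hypothetical; in practice a library implementation such as scikit-learn's KMeans would be used):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
k = 2

centres = X[rng.choice(len(X), k, replace=False)]  # initial partition
for _ in range(10):
    # Assign each object to its nearest centre (possibly relocating it).
    labels = np.argmin(
        np.linalg.norm(X[:, None] - centres[None], axis=2), axis=1
    )
    # Recompute each centre from its current members (keep the old centre
    # if a cluster happens to become empty).
    centres = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
        for j in range(k)
    ])

print(np.bincount(labels, minlength=k))  # cluster sizes after convergence
```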
How do non-hierarchical and hierarchical methods differ?
- Non-hierarchical methods allow relocation of objects, thus allowing poor initial partitions to be corrected at a later stage
- Non-hierarchical methods are computationally efficient and therefore suitable for large datasets
- The number of clusters must be known in advance for non-hierarchical clustering
- Non-hierarchical is non-deterministic - generally producing different clusters with different initialisations
What is the general criterion of a good partitioning?
Objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different
When can hierarchical clustering be particularly useful?
When you expect nested clustering and a meaningful hierarchy. This is often the case in the biological sciences.
Why may hierarchical algorithms be considered greedy?
Once an observation is assigned to a cluster, it cannot be reassigned later in the process.
What are the disadvantages of hierarchical clustering?
It can be difficult to apply in large samples (it is computationally inefficient).
eg hundreds or thousands of observations - partitioning methods can work well in these situations
Which clustering method is deterministic, and what does this mean?
Hierarchical clustering methods
Deterministic means that the method produces the same clustering each time.
What are the common steps in cluster analysis?
1 - choose the appropriate attributes
2 - scale the data
3 - screen for outliers
4 - calculate distances
5 - select a clustering algorithm
6 - obtain one or more cluster solutions
7 - determine the number of clusters present
8 - obtain a final clustering solution
9 - visualise the results
10 - interpret the clusters
11 - validate the results
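A compressed sketch of several of these steps on synthetic data (the data, the choice of k = 2, and the use of scikit-learn are all assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (25, 3)), rng.normal(6, 1, (25, 3))])

# Step 2: scale the data so no variable dominates by range alone.
X_scaled = StandardScaler().fit_transform(X)

# Steps 4-6: distances are computed internally by the algorithm;
# here a partitioning method with k chosen in advance.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Step 10: interpret clusters via per-cluster means of each variable.
for j in range(2):
    print(j, X[model.labels_ == j].mean(axis=0).round(1))
```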
How do you choose appropriate attributes?
Select the variables you feel may be important for identifying and understanding differences among groups of observations within the data.
A sophisticated cluster analysis cannot compensate for a poor choice of variables.
Why would you need to scale the data?
If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable.
How can you scale the data?
- Most popular approach is to standardise each variable to a mean of 0 and an SD of 1
- Dividing each variable by its maximum value
- Subtracting the variable's mean and dividing by the variable's mean absolute deviation
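The three scaling options above, sketched with NumPy on a toy matrix (rows = observations, columns = variables):

```python
import numpy as np

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 200.0]])

# 1. Standardise: mean 0, SD 1 per variable.
z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Divide each variable by its maximum value.
by_max = X / X.max(axis=0)

# 3. Subtract the mean, divide by the mean absolute deviation.
mad = np.mean(np.abs(X - X.mean(axis=0)), axis=0)
robust = (X - X.mean(axis=0)) / mad

print(z.mean(axis=0))  # each column now has mean 0
```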
Why do we need to screen for outliers?
Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained.
How can you screen for outliers?
- Using functions from the outliers package
- The mvoutlier package contains functions that can be used to identify multivariate outliers
Or you can use a clustering method that is robust to the presence of outliers
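The outliers and mvoutlier packages mentioned above are R packages; a simple Python sketch of the same idea flags observations whose z-score exceeds a threshold (3 is a common but arbitrary choice, and the data is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (100, 2))
X[0] = [10.0, 10.0]  # plant an obvious outlier

# Flag rows where any variable is more than 3 SDs from its mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
flagged = np.where((z > 3).any(axis=1))[0]
print(flagged)  # index 0 should be among the flagged rows
```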
What are the most popular measures of the distance between two observations?
- Euclidean distance - the most popular
- Manhattan
- Minkowski
- Hamming
- Chebyshev
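The five measures above, computed with `scipy.spatial.distance` for two small vectors (Minkowski's order `p=3` is one arbitrary choice of its parameter):

```python
from scipy.spatial import distance

x, y = [0, 0, 0, 1], [3, 4, 0, 0]

print(distance.euclidean(x, y))        # sqrt(9 + 16 + 0 + 1) = sqrt(26)
print(distance.cityblock(x, y))        # Manhattan: 3 + 4 + 0 + 1 = 8
print(distance.minkowski(x, y, p=3))   # (27 + 64 + 0 + 1) ** (1/3)
print(distance.hamming(x, y))          # fraction of differing coords: 3/4
print(distance.chebyshev(x, y))        # max coordinate difference: 4
```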
Why might you try more than one algorithm?
To see how robust the results are to the choice of methods
What does visualisation help determine?
The meaning and usefulness of the cluster solution
What kind of visualisation is usually used for hierarchical clustering?
Dendrogram
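A minimal sketch of building a dendrogram with SciPy on hypothetical data; `no_plot=True` returns the structure without drawing (drop it, with matplotlib installed, to render the tree):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
Z = linkage(X, method="average")

d = dendrogram(Z, no_plot=True)
print(d["ivl"])  # leaf labels in left-to-right display order
```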
What kind of visualisation is usually used for partitioning?
Bivariate cluster plot
What do you do once a cluster solution has been obtained?
Interpret (and possibly name) the clusters.
- What do the observations in a cluster have in common?
- How do they differ from the observations in other clusters?
This step typically utilises summary statistics for each variable by cluster
What summary statistics are used for continuous data?
The mean or median for each variable within each cluster
In addition, what summary statistics are included for mixed data (data that contains categorical variables)?
Modes or category distributions
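These per-cluster summaries can be sketched with a pandas group-by on a toy mixed dataset (column names and cluster labels are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "height":  [160.0, 165.0, 180.0, 185.0],  # continuous -> mean/median
    "segment": ["a", "a", "b", "b"],          # categorical -> mode
})

print(df.groupby("cluster")["height"].mean())
print(df.groupby("cluster")["segment"].agg(lambda s: s.mode()[0]))
```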
What does validating the cluster solution involve?
Asking the question “are these groupings in some sense real, not a manifestation of unique aspects of this dataset or a statistical technique?”
If a different cluster method or different sample is employed, would the same clusters be obtained?
What does every cluster analysis begin with?
The calculation of a distance, dissimilarity or proximity between each entity to be clustered.
Each clustering problem is based on some kind of “distance” between points
When do you use a distance measure such as the Euclidean distance?
If the data is predominantly quantitative
When do you use an index of dissimilarity?
When the data is predominantly qualitative
If X and Y are two points, then a function d(X,Y) is said to be a distance between two observations if it satisfies what properties?
- Non-negativity (distance is a non-negative number)
- Identity (the distance of an object to itself is 0)
- Symmetry (distance is a symmetric function)
- Triangle inequality (going directly from X to Y in space is no longer than taking a detour via another object Z)
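The four properties can be checked numerically for the Euclidean distance on three arbitrary 2D points (a sanity check, not a proof):

```python
import math

def d(p, q):
    return math.dist(p, q)  # Euclidean distance (Python 3.8+)

X, Y, Z = (0, 0), (3, 4), (6, 0)

assert d(X, Y) >= 0                    # non-negativity
assert d(X, X) == 0                    # identity
assert d(X, Y) == d(Y, X)              # symmetry
assert d(X, Y) <= d(X, Z) + d(Z, Y)    # triangle inequality
print("all four metric properties hold")
```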
Describe the Euclidean distance
The Euclidean distance (L2 norm) is based on the locations of points in space. It is a measure of dissimilarity used for variables which are continuous.
d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xp - yp)^2)
Give examples of variables which may use the Euclidean distance
Continuous variables
- Height
- Weight
- Temperature
What is the Manhattan distance?
The Manhattan distance (L1 norm or city-block distance) is a variation of the Euclidean distance.
d(X, Y) = |x1 - y1| + |x2 - y2| + ... + |xp - yp|
What do hierarchical methods of clustering do?
They allow us to obtain a family of partitions, each associated with a different level of grouping among the observations, calculated on the basis of the available data.
What makes a hierarchical method agglomerative or divisive?
Based on how the hierarchical decomposition (representation) is formed.