Chapter 4 Flashcards
What is cluster analysis?
Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset.
How does cluster analysis differ from classification analysis?
- Classification analysis knows the class of each object at the start of the process
- When carrying out cluster analysis, the class of each object is assumed to be unknown (which is why it is called an unsupervised, indirect, or descriptive method)
Is cluster analysis supervised or unsupervised learning?
Unsupervised - the class of each object is assumed unknown
What is clustering?
The process of grouping a set of physical or abstract objects into classes of similar objects
What is a cluster?
A collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in another cluster.
A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression.
How can we measure the similarity of the objects within each cluster?
The most commonly used methods are distance-based.
There are a variety of techniques available.
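As an illustration of a distance-based measure, here is a minimal sketch using Euclidean distance. The choice of Euclidean distance, the function names and the sample data are assumptions made for this example; the card itself does not name a specific measure.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric observations."""
    if len(a) != len(b):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_matrix(rows):
    """Pairwise distances between every pair of observations (rows)."""
    return [[euclidean(r, s) for s in rows] for r in rows]

# Hypothetical data: three observations measured on two variables.
observations = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = distance_matrix(observations)
print(distances[0][1])  # distance between the first two observations
```

Small distances indicate similar observations, so objects in the same cluster should have small entries in this matrix.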
What is the formal objective of cluster analysis?
Given a data matrix composed of n observations (rows) and p variables (columns), the objective of cluster analysis is to cluster the observations into groups that are internally homogeneous (internal cohesion) and heterogeneous from group to group (external separation)
Intra-cluster distances are minimised
Inter-cluster distances are maximised
What is the difference between PCA and cluster analysis?
Cluster analysis groups the data cases (rows), whereas PCA combines the data variables (columns).
What are some simple examples of clustering objects?
- Marketing researchers use cluster analysis as a customer-segmentation strategy
- Psychological researchers cluster symptoms and demographics of patients to uncover subtypes of a disorder, which can lead to more targeted and effective treatments
- Medical researchers use cluster analysis to help catalog gene-expression patterns
What are the two main areas of cluster analysis techniques?
- Hierarchical
- Non-hierarchical
What are hierarchical methods, and what are the possible approaches?
Hierarchical methods create a hierarchical decomposition of the given set of data objects.
- Agglomerative approach
- Divisive approach
What is the agglomerative approach?
Bottom-up approach
Starts with each object forming a separate group. It successively merges the objects or groups that are close to one another, until all of the groups are merged into one, or until a termination condition holds.
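The bottom-up merging can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance and single linkage (cluster distance is the distance between the closest members), neither of which is specified on the card, and it uses a target number of clusters as the termination condition.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(c1, c2, points):
    """Cluster distance = distance between the two closest members (assumed linkage)."""
    return min(euclidean(points[i], points[j]) for i in c1 for j in c2)

def agglomerative(points, n_clusters=1):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    merges = []  # record of (cluster_a, cluster_b, distance) for each merge
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical data: two tight pairs and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerative(points, n_clusters=2)
print(clusters)
```

Note that a merge is never undone: once two objects share a cluster they stay together for the rest of the run.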
What is the divisive approach?
Top-down approach
Starts with all the objects in the same cluster. In each successive iteration, a cluster is split up into smaller clusters, until eventually each object is in its own individual cluster, or a termination condition holds.
What are non-hierarchical methods also called?
Partitioning methods
What are non-hierarchical methods?
Non-hierarchical methods produce k partitions of objects, and the objects can be relocated, allowing poor initial partitions to be corrected at a later stage.
Each of the k partitions represents a cluster which must satisfy:
- Each cluster must contain at least one object
- Each object must belong to exactly one group
Given k, the number of partitions to construct, a partitioning method creates an initial partitioning. It then uses an iterative technique that attempts to improve the partitioning by moving objects from one group to another.
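The iterative relocation described above can be sketched with a k-means style procedure. The card does not name a specific algorithm, so treating k-means as the partitioning method is an assumption, as are the function names, the random initialisation and the sample data.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(members):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(members) for col in zip(*members))

def kmeans(points, centroids, max_iter=100):
    """Relocate objects between k groups until assignments stop changing."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object joins the group with the nearest centre.
        new_labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no object moved, so the partitioning is stable
        labels = new_labels
        # Update step: recompute each centre from its current members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # simplification: an empty group keeps its previous centre
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical data: two well-separated groups of three points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
rng = random.Random(0)            # fixed seed only to make the run repeatable
initial = rng.sample(points, 2)   # initial partitioning: k = 2 random objects
labels, centroids = kmeans(points, initial)
print(labels)
```

Calling `kmeans(points, [(0.0, 0.0), (0.0, 1.0)])` starts from a poor initial partitioning (both centres in the same group), yet the relocation steps still separate the two groups.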
How do non-hierarchical and hierarchical methods differ?
- Non-hierarchical methods allow relocation of objects, thus allowing poor initial partitions to be corrected at a later stage
- Non-hierarchical methods are computationally efficient and therefore suitable for large datasets
- The number of clusters must be known in advance for non-hierarchical clustering
- Non-hierarchical clustering is non-deterministic, generally producing different clusters with different initialisations
What is the general criterion of a good partitioning?
Objects in the same cluster are “close” or related to each other, whereas objects of different clusters are “far apart” or very different
When can hierarchical clustering be particularly useful?
When you expect nested clustering and a meaningful hierarchy. This is often the case in the biological sciences.
Why may hierarchical algorithms be considered greedy?
Once an observation is assigned to a cluster, it cannot be reassigned later in the process.
What are the disadvantages of hierarchical clustering?
It can be difficult to apply in large samples (it is computationally inefficient).
e.g. hundreds or thousands of observations. Partitioning methods can work well in these situations.
Which clustering method is deterministic, and what does this mean?
Hierarchical clustering methods
Deterministic means that the method produces the same clustering each time.
What are the common steps in cluster analysis?
1 - choose the appropriate attributes
2 - scale the data
3 - screen for outliers
4 - calculate distances
5 - select a clustering algorithm
6 - obtain one or more cluster solutions
7 - determine the number of clusters present
8 - obtain a final clustering solution
9 - visualise the results
10 - interpret the clusters
11 - validate the results
How do you choose appropriate attributes?
Select the variables you feel may be important for identifying and understanding differences among groups of observations within the data.
A sophisticated cluster analysis cannot compensate for a poor choice of variables.
Why would you need to scale the data?
If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable.
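One common way to put variables on a comparable footing is to standardise each one to mean 0 and standard deviation 1. The sketch below assumes that approach; the card does not prescribe a scaling method, and the function name and sample data are hypothetical.

```python
import statistics

def standardise(rows):
    """Rescale each variable (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Sample standard deviation; assumes no column is constant (SD of zero).
    sds = [statistics.stdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]

# Hypothetical data: income (large range) and age (small range).
raw = [(20000.0, 25.0), (40000.0, 35.0), (60000.0, 45.0)]
scaled = standardise(raw)
print(scaled)
```

Before scaling, distances between these rows are driven almost entirely by income; after scaling, both variables contribute equally.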