Session 6 Flashcards
Why use clustering?
- Data exploration: “looking for ‘interesting patterns’ without prescribing any specific interpretation”.
- Information reduction.
- Verification.
Clustering in the CRISP-DM cycle
Mostly:
• Data understanding - exploration
• Data preparation - preprocessing, reducing dimensionality
• Modelling
How to interpret clusters?
Characteristically.
• Interpreting a cluster by looking at its members
OR
• Interpreting a cluster by looking at a typical member or typical characteristic(s).
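A minimal sketch of the "typical member / typical characteristic" idea, using only the Python standard library. It assumes numeric variables and Euclidean distance; the function name `summarise_clusters` and the toy customer data are hypothetical, made up for illustration.

```python
from collections import defaultdict
from math import dist


def summarise_clusters(points, labels):
    """Return {cluster_id: (centroid, medoid)} for labelled points."""
    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)

    summary = {}
    for label, pts in members.items():
        # Centroid: per-variable mean, i.e. the "typical characteristics".
        centroid = tuple(sum(column) / len(pts) for column in zip(*pts))
        # Medoid: the real member with the smallest total distance to the
        # other members, i.e. a "typical cluster member".
        medoid = min(pts, key=lambda p: sum(dist(p, q) for q in pts))
        summary[label] = (centroid, medoid)
    return summary


# Hypothetical toy data: (age, monthly_spend) for six customers.
points = [(20, 100), (22, 110), (24, 120), (60, 900), (62, 950), (64, 1000)]
labels = [0, 0, 0, 1, 1, 1]
summary = summarise_clusters(points, labels)
```

The centroid need not correspond to any real data point, whereas the medoid always does, so which one counts as "typical" depends on what is easier to explain.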
How to interpret clusters?
Differentially.
• What differentiates Cluster X from Cluster Y?
• Supervised learning approach.
‣ Each data point has a new label - cluster ID.
‣ Predictive modelling with cluster ID as a target variable.
How would that work? Supervised learning approach - Differentially?
• Set up a classification task: 1) a k-class task or 2) binary classification.
• Ensure intelligibility - you must be able to read off the classifier definition.
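One way this could look for the binary case, as a sketch rather than a prescribed method: treat "Cluster X vs. all other clusters" as the target (an assumption about how the binary task is set up) and search for the single-threshold rule that separates them best. A one-rule classifier is about as intelligible as a classifier gets. The helper `best_stump`, the variable names, and the data are hypothetical.

```python
def best_stump(rows, labels, target, feature_names):
    """Find the one-variable threshold rule that best separates the
    `target` cluster from all other clusters. Returns (accuracy, rule)."""
    is_target = [label == target for label in labels]
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for op in ("<=", ">"):
                if op == "<=":
                    predicted = [row[j] <= threshold for row in rows]
                else:
                    predicted = [row[j] > threshold for row in rows]
                hits = sum(p == t for p, t in zip(predicted, is_target))
                accuracy = hits / len(rows)
                # Keep the first rule found with the highest accuracy.
                if best is None or accuracy > best[0]:
                    best = (accuracy, f"{name} {op} {threshold}")
    return best


# Hypothetical toy data: (age, monthly_spend), with cluster IDs as labels.
rows = [(20, 100), (22, 110), (24, 120), (60, 900), (62, 950), (64, 1000)]
cluster_ids = [0, 0, 0, 1, 1, 1]
accuracy, rule = best_stump(rows, cluster_ids, target=1,
                            feature_names=("age", "monthly_spend"))
```

Here `rule` is a readable statement of what distinguishes cluster 1 from the rest. For the k-class variant, a shallow decision tree plays the same role.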
Cluster validity
How good is a given clustering?
Internal criteria
- Uses only information internal to the clustering process to evaluate the quality of the clustering and how well it fits your data.
- Compactness (members of a cluster are close together) and isolation (clusters are far apart).
• Can be used to specify the optimal number of clusters.
Indices are often (not always) method-dependent.
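As an illustration of an internal index built from compactness and isolation, here is a standard-library sketch of the mean silhouette width. The silhouette is one common choice, not necessarily the index used in the session. The sketch assumes Euclidean distance, at least two clusters, and points that are not all identical; the data and both candidate labelings are hypothetical.

```python
from math import dist


def mean_silhouette(points, labels):
    """Average silhouette width; values near 1 indicate compact,
    well-isolated clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    widths = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, hence len(own) - 1).
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Hypothetical data: two visibly separate groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels_k2 = [0, 0, 0, 1, 1, 1]  # candidate solution with 2 clusters
labels_k3 = [0, 0, 1, 2, 2, 2]  # candidate solution with 3 clusters
score_k2 = mean_silhouette(points, labels_k2)
score_k3 = mean_silhouette(points, labels_k3)
```

Comparing the scores of candidate solutions with different numbers of clusters is how an index like this can guide the choice of k: on this toy data the 2-cluster solution scores higher than the 3-cluster one.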
Cluster validity
How good is a given clustering?
External criteria
• Compare the created clusters to some external reference:
‣ Expert opinion, existing theory.
‣ External variables/groupings.
‣ Labels generated by a different clustering method.
- Similarity between the two clusterings.
- Can be used to find a suitable clustering algorithm for a given data set.
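The similarity between two clusterings can be quantified in several ways. The sketch below uses the plain Rand index (the share of point pairs on which the two groupings agree) as one example. It is not corrected for chance agreement, and adjusted variants exist. The reference grouping and the two method outputs are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings:
    either together in both or apart in both. 1.0 = identical partitions."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical external reference (e.g. expert grouping) and two methods.
reference = ["a", "a", "a", "b", "b", "b"]
method_1 = [1, 1, 1, 0, 0, 0]  # same partition, different label names
method_2 = [0, 0, 1, 1, 2, 2]  # a different partition
agreement_1 = rand_index(reference, method_1)
agreement_2 = rand_index(reference, method_2)
```

Because only pair membership is compared, the actual label names do not matter, which is what makes it possible to compare cluster IDs against an external grouping.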
Steps in a typical cluster analysis:
A. Collect the data to use for clustering. Preprocess if needed.
B. Select the variables.
C. Select a distance measure.
D. Select a clustering method.
E. Experiment with different sets of variables/measures/methods.
F. Determine validity of the selected solution.
G. Interpret the results.
Experiment! It is difficult to anticipate which combinations of variables, similarity measures, and clustering methods will lead to interesting results.
Random Forest model to…
assess the explanatory power of each variable -> variable importance
Where else can we find clustering applications?
- Identifying fake news
- Document analysis (organising the information)
- Weather analysis (e.g. understanding meteorological patterns)
- Genetics