Session 6 Flashcards

Question 1

Q

Why use clustering?

Answer

A

Data exploration: “looking for ‘interesting patterns’ without prescribing any specific interpretation”.
Information reduction.
Verification.

Question 2

Q

Where does clustering fit in the CRISP-DM cycle?

Answer

A

Mostly:
• Data understanding - exploration
• Data preparation - preprocessing, reducing dimensionality
• Modelling

Question 3

Q

How to interpret clusters?

Characteristically.

Answer

A

Characteristically.

• Interpreting a cluster by looking at its members
OR
• Interpreting a cluster by looking at a typical member or its typical characteristic(s).
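
The two options above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the toy points and labels are hypothetical, the centroid stands in for "typical characteristics", and the medoid stands in for "a typical member".

```python
from math import dist

# Hypothetical toy data: (x, y) points with cluster IDs already assigned.
points = [(1.0, 1.0), (2.0, 2.0), (1.0, 3.0), (4.0, 6.0),
          (8.0, 8.0), (9.0, 8.0), (10.0, 8.0)]
labels = [0, 0, 0, 0, 1, 1, 1]

def members(points, labels, cluster_id):
    """Option 1: list every point assigned to the given cluster."""
    return [p for p, lab in zip(points, labels) if lab == cluster_id]

def centroid(cluster):
    """Typical characteristics: the mean of each variable."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def medoid(cluster):
    """Typical member: the real point with the smallest total distance to the rest."""
    return min(cluster, key=lambda p: sum(dist(p, q) for q in cluster))

for cid in sorted(set(labels)):
    group = members(points, labels, cid)
    print(cid, "members:", group)
    print(cid, "centroid:", centroid(group), "medoid:", medoid(group))
```

The centroid need not be an actual data point, whereas the medoid always is, which can make it easier to show to a domain expert.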

Question 4

Q

How to interpret clusters?

Differentially.

Answer

A

Differentially.

• What differentiates Cluster X from Cluster Y?

• Supervised learning approach.
‣ Each data point has a new label - cluster ID.
‣ Predictive modelling with cluster ID as a target variable.

Question 5

Q

How would the supervised learning approach to differential interpretation work?

Answer

A

Set up a classification task:
1) a k-class task or
2) binary classification
• Ensure intelligibility - be able to get the classifier definition
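
A minimal sketch of the binary variant, assuming the task is "cluster 1 versus everything else" and using a one-rule decision stump as the intelligible classifier. The data, the feature names and the `fit_stump` helper are all hypothetical; any classifier whose definition can be read off would serve.

```python
# Hypothetical toy data: rows of (age, income) with cluster IDs from an earlier clustering.
feature_names = ["age", "income"]
rows = [(25, 30), (30, 32), (28, 85), (45, 90), (50, 88), (47, 35)]
cluster_ids = [0, 0, 1, 1, 1, 0]

def fit_stump(rows, targets):
    """Find the single threshold rule on one feature that best predicts a binary target."""
    best = None
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            for below in (True, False):  # which side of the threshold is predicted True
                preds = [(r[j] <= t) == below for r in rows]
                acc = sum(p == y for p, y in zip(preds, targets)) / len(rows)
                if best is None or acc > best[0]:
                    best = (acc, j, t, below)
    return best

# Cluster ID becomes the target variable: is this row in cluster 1 or not?
target = [c == 1 for c in cluster_ids]
acc, j, t, below = fit_stump(rows, target)
rule = f"{feature_names[j]} {'<=' if below else '>'} {t}"
print(f"cluster 1 if {rule} (accuracy {acc:.2f})")
```

The fitted rule is itself the classifier definition, so it can be read directly as a description of what separates the cluster from the rest.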

Question 6

Q

Cluster validity

How good is a given clustering?

Internal criteria

Answer

A

Uses the internal information of the clustering process to evaluate the quality of a clustering and how well it fits the data.
Looks at compactness and isolation.

• Can be used to specify the optimal number of clusters.
• Indices are often (not always) method-dependent.
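
One widely used internal index is the silhouette coefficient, which combines compactness (mean distance to points in the same cluster) and isolation (mean distance to the nearest other cluster). The sketch below is a plain-Python illustration under stated assumptions: Euclidean distance, at least two clusters, and hypothetical toy data.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient; values near 1 mean compact, well isolated clusters."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Collect distances from p to every other point, grouped by that point's cluster.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(dist(p, q))
        if own not in by_cluster:
            scores.append(0.0)  # singleton cluster: silhouette is defined as 0
            continue
        a = sum(by_cluster[own]) / len(by_cluster[own])  # compactness
        b = min(sum(d) / len(d)                          # isolation
                for lab, d in by_cluster.items() if lab != own)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette(points, [0, 1, 0, 1])   # pairs split across clusters
print(good, bad)
```

Computing such an index for clusterings with different numbers of clusters and comparing the scores is one way to choose the number of clusters.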

Question 7

Q

Cluster validity

How good is a given clustering?

External criteria

Answer

A

• Compare the created clusters to some external reference:
‣ Expert opinion, existing theory.
‣ External variables/groupings.
‣ Labels generated by a different clustering method.

Measures the similarity between the two clusterings.
Can be used to find a suitable clustering algorithm for a given data set.
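
The similarity between two clusterings can be quantified with, for example, the Rand index: the share of point pairs on which the two labelings agree. The sketch below is illustrative only; the labelings are hypothetical and the Rand index is one of several possible measures.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Share of point pairs treated alike by both labelings
    (together in both, or apart in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Hypothetical example: cluster IDs versus an external reference grouping.
clusters = [0, 0, 0, 1, 1, 1]
reference = ["b", "b", "b", "a", "a", "c"]
score = rand_index(clusters, reference)
print(score)
```

Only co-membership matters, so the names of the labels are irrelevant: two labelings that group the points identically score 1.0 even if their IDs differ.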

Question 8

Q

Steps in a typical cluster analysis:

Answer

A

A. Collect the data to use for clustering. Preprocess if needed.
B. Select the variables.
C. Select distance measure.
D. Select clustering method.
E. Experiment with different sets of variables/measures/methods.
F. Determine validity of the selected solution.
G. Interpret the results.

Experiment! It is difficult to anticipate which combinations of variables, similarity measures and clustering methods will lead to interesting results.
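
Steps A to E can be sketched as a small pipeline. Everything here is an assumption made for illustration: the toy data, standardisation as the preprocessing step, Euclidean distance, and a bare-bones k-means seeded with the first k rows. Steps F and G (validity and interpretation) are left out.

```python
from math import dist
from statistics import mean, pstdev

# A. Hypothetical data: (height_cm, weight_kg, shoe_size) per person.
data = [(150, 50, 36), (152, 53, 37), (155, 52, 36),
        (180, 80, 44), (183, 85, 45), (185, 82, 44)]

def standardise(rows):
    """A. Preprocess: rescale every variable to mean 0 and standard deviation 1."""
    stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
    return [tuple((v - m) / s for v, (m, s) in zip(r, stats)) for r in rows]

def select(rows, idx):
    """B. Keep only the chosen variables."""
    return [tuple(r[i] for i in idx) for r in rows]

def kmeans(rows, k, iters=20):
    """C + D. Bare-bones k-means with Euclidean distance (math.dist).
    The first k rows seed the centroids, so the result is deterministic."""
    centroids = list(rows[:k])
    labels = []
    for _ in range(iters):
        # Assign each row to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(r, centroids[c])) for r in rows]
        # Move each centroid to the mean of its assigned rows.
        for c in range(k):
            group = [r for r, lab in zip(rows, labels) if lab == c]
            if group:
                centroids[c] = tuple(mean(col) for col in zip(*group))
    return labels

# E. Experiment with different sets of variables.
scaled = standardise(data)
results = {idx: kmeans(select(scaled, idx), k=2) for idx in [(0, 1), (0, 2), (0, 1, 2)]}
for idx, labels in results.items():
    print(idx, labels)
```

On real data the variable subsets would typically not all give the same grouping, which is why the experimentation step matters.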

Question 9

Q

Random Forest model to…

Answer

A

assess the explanatory power of each variable -> variable importance

Question 10

Q

Where else can we find clustering applications?

Answer

A

Identifying fake news
Document analysis (organising the information)
Weather analysis (e.g. understanding meteorological patterns)
Genetics

(10 cards)