Session 6 Flashcards

1
Q

Why use clustering?

A
  1. Data exploration: “looking for ‘interesting patterns’ without prescribing any specific interpretation”.
  2. Information reduction.
  3. Verification.
2
Q

Clustering in the CRISP-DM cycle

A

Mostly:
• Data understanding - exploration
• Data preparation - preprocessing, reducing dimensionality
• Modelling

3
Q

How to interpret clusters?

Characteristically.

A

• Interpret a cluster by looking at its members
OR
• Interpret a cluster by looking at a typical member or its typical characteristic(s).

4
Q

How to interpret clusters?

Differentially.

A

• What differentiates Cluster X from Cluster Y?

• Supervised learning approach.
‣ Each data point has a new label - cluster ID.
‣ Predictive modelling with cluster ID as a target variable.

5
Q

How would the supervised learning approach to differential interpretation work?

A
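Set up a classification task with cluster ID as the target:
1) a k-class task (one class per cluster) or
2) binary classification (e.g. Cluster X versus Cluster Y)
• Ensure intelligibility: you must be able to read off the classifier definition.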
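
As a rough illustration of the binary case, the sketch below fits a one-rule classifier (a decision stump) with cluster ID as the target, so the "classifier definition" is a single readable threshold. The function name `best_stump`, the variables (age, spend) and the cluster IDs are all hypothetical, and a stump is only one of many intelligible classifiers that could be used here.

```python
from collections import Counter


def best_stump(rows, labels):
    """Find the single-variable threshold rule that best separates cluster IDs.

    Returns (variable_index, threshold, label_if_leq, label_if_gt, accuracy).
    """
    best = None
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            # Predict the majority cluster ID on each side of the threshold.
            left_label, left_hits = Counter(left).most_common(1)[0]
            right_label, right_hits = Counter(right).most_common(1)[0]
            acc = (left_hits + right_hits) / len(rows)
            if best is None or acc > best[4]:
                best = (f, t, left_label, right_label, acc)
    return best


# Hypothetical data: (age, spend) per customer, with cluster IDs from a
# previous clustering run used as the target variable.
rows = [(25, 200), (30, 220), (28, 210), (55, 205), (60, 215), (58, 225)]
labels = ["X", "X", "X", "Y", "Y", "Y"]

rule = best_stump(rows, labels)
f, t, left_label, right_label, acc = rule
print(f"variable {f} <= {t}: cluster {left_label}, else cluster {right_label} (accuracy {acc:.2f})")
```

On this toy data the rule splits on the first variable (age), which is the kind of statement a differential interpretation is after: what separates Cluster X from Cluster Y.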
6
Q

Cluster validity

How good is a given clustering?

Internal criteria

A
• Use only information internal to the clustering itself to evaluate its quality and how well it fits the data.
• Measure compactness (how close the members of a cluster are to each other) and isolation (how well separated the clusters are).
• Can be used to choose the optimal number of clusters.
• Indices are often (not always) method-dependent.
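
One widely used internal index that combines compactness and isolation is the silhouette coefficient. The card does not name a specific index, so this is just an example; the sketch below assumes Euclidean distance and at least two clusters, and the data points are made up.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.get(labels[i])
        if not own:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: compactness, mean distance to the other members of p's cluster.
        a = sum(own) / len(own)
        # b: isolation, mean distance to the nearest other cluster.
        b = min(sum(d) / len(d) for lab, d in by_cluster.items() if lab != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Made-up points forming two obvious groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = [0, 0, 1, 1]  # groups nearby points together
bad = [0, 1, 0, 1]   # mixes the two groups

print(silhouette(points, good))  # close to 1: compact, well-isolated clusters
print(silhouette(points, bad))   # negative: points sit nearer to the other cluster
```

Computing the index for several candidate numbers of clusters and keeping the highest-scoring one is a common way to use it for choosing k.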

7
Q

Cluster validity

How good is a given clustering?

External criteria

A

• Compare the created clusters to some external reference:
‣ Expert opinion, existing theory.
‣ External variables/groupings.
‣ Labels generated by a different clustering method.

  • Similarity between the two clusterings.
  • Can be used to find a suitable clustering algorithm for a given data set.
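
The similarity between two clusterings can be quantified in several ways. As a minimal sketch, the Rand index below counts the fraction of point pairs on which the two labelings agree (both put the pair together, or both keep it apart). The card does not prescribe this particular measure, and the label lists are invented for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both labelings."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        # A pair agrees if both labelings group it, or both separate it.
        agree += same_a == same_b
        pairs += 1
    return agree / pairs


# Invented example: cluster IDs from an algorithm vs. an external grouping.
clusters = [0, 0, 1, 1, 2, 2]
reference = ["a", "a", "b", "b", "b", "b"]

print(rand_index(clusters, reference))         # 11 of 15 pairs agree
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same partition, renamed labels
```

Because only pair membership is compared, the label names themselves do not need to match, which is what makes the measure usable against expert groupings or the output of a different clustering method.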
8
Q

Steps in a typical cluster analysis:

A

A. Collect the data to use for clustering. Preprocess if needed.
B. Select the variables.
C. Select distance measure.
D. Select clustering method.
E. Experiment with different sets of variables/measures/methods.
F. Determine validity of the selected solution.
G. Interpret the results.

Experiment! It is difficult to anticipate which combinations of variables, similarity measures, and clustering methods will lead to interesting results.
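
To make steps B to E concrete, here is a small sketch that clusters the same made-up records twice with different variable selections. It assumes Euclidean distance and a basic k-means (Lloyd's algorithm) with hand-picked starting centroids; `kmeans`, `select` and the data are hypothetical and stand in for whatever method and variables a real analysis would use.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Basic k-means from given starting centroids; returns (labels, centroids)."""
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        updated = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])
        centroids = updated
    return labels, centroids


def select(data, cols):
    """Step B: keep only the chosen variables (column indices)."""
    return [tuple(row[i] for i in cols) for row in data]


# Step A: made-up records with three variables each.
data = [
    (1.0, 1.0, 5.0), (1.2, 0.8, 1.0), (0.9, 1.1, 9.0),
    (8.0, 8.0, 5.2), (8.2, 7.9, 0.8), (7.9, 8.1, 9.1),
]

# Step E: the same method and distance, two different variable sets.
pts_xy = select(data, (0, 1))
labels_xy, _ = kmeans(pts_xy, [pts_xy[0], pts_xy[-1]])

pts_z = select(data, (2,))
labels_z, _ = kmeans(pts_z, [pts_z[0], pts_z[-1]])

print(labels_xy)  # grouping driven by the first two variables
print(labels_z)   # a different grouping driven by the third variable
```

The two runs partition the same records differently, which is why the validity check (step F) and the interpretation (step G) have to be repeated for each combination tried.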

9
Q

Random Forest model to…

A

assess the explanatory power of each variable -> variable importance

10
Q

Where else can we find clustering applications?

A
  • Identifying fake news
  • Document analysis (organising the information)
  • Weather analysis (e.g. understanding meteorological patterns)
  • Genetics