Clustering Flashcards

1
Q

What topics are under dissimilarities?

A

Continuous dissims.
Binary dissims.
3 Rules
Dissim matrices

2
Q

What are two types of continuous dissims?

A

Euclidean

Manhattan

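Not from the deck: a minimal Python/NumPy sketch of both distances on two made-up points x and y.

  import numpy as np

  x = np.array([1.0, 2.0, 3.0])
  y = np.array([4.0, 0.0, 3.0])

  # Euclidean: square root of the sum of squared differences
  euclidean = np.sqrt(np.sum((x - y) ** 2))

  # Manhattan: sum of absolute differences
  manhattan = np.sum(np.abs(x - y))

  print(euclidean, manhattan)  # ~3.61 and 5.0
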
3
Q

What are two types of binary dissims?

A

Simple

Jaccard

4
Q

What is the Simple binary dissimilarity equation?

A

d(x,y) = (b+c)/(a+b+c+d)
(a = 1-1 matches, d = 0-0 matches, b and c = mismatches; (a+d)/(a+b+c+d) is the simple matching similarity, and the dissimilarity is its complement)

5
Q

What is the Jaccard binary dissimilarity equation?

A

d(x,y) = (b+c)/(a+b+c)
(equivalently 1 - a/(a+b+c); a/(a+b+c) is the Jaccard similarity, which ignores 0-0 matches)

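A small Python sketch (my own illustration, assuming x and y are 0/1 vectors) of the a, b, c, d counts and both binary dissimilarities.

  import numpy as np

  x = np.array([1, 1, 0, 0, 1])
  y = np.array([1, 0, 0, 1, 1])

  a = np.sum((x == 1) & (y == 1))  # both 1
  b = np.sum((x == 1) & (y == 0))  # 1 in x, 0 in y
  c = np.sum((x == 0) & (y == 1))  # 0 in x, 1 in y
  d = np.sum((x == 0) & (y == 0))  # both 0

  simple_matching = (b + c) / (a + b + c + d)  # counts 0-0 matches as agreement
  jaccard = (b + c) / (a + b + c)              # ignores 0-0 matches

  print(simple_matching, jaccard)  # 0.4 and 0.5
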
6
Q

What are the three rules for dissimilarities?

A

d(x,y) >= 0, and d(x,y) = 0 if x = y (non-negativity)
d(x,y) = d(y,x) (symmetry)
d(x,y) <= d(x,z) + d(z,y) (triangle inequality)

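A quick Python check (my own sketch, using Euclidean distance on three made-up points) that the three rules hold.

  import numpy as np

  def dist(u, v):
      # Euclidean dissimilarity between two points
      return np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2))

  x, y, z = [0, 0], [3, 4], [6, 0]

  assert dist(x, y) >= 0 and dist(x, x) == 0    # non-negativity, d = 0 when x = y
  assert dist(x, y) == dist(y, x)               # symmetry
  assert dist(x, y) <= dist(x, z) + dist(z, y)  # triangle inequality
  print("all three rules hold")
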
7
Q

What are the key features of dissimilarity matrices?

A

0’s along the diagonal
Variables are usually standardized
Variables are usually scaled
Symmetric

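A sketch of building such a matrix with SciPy, assuming a small made-up data matrix X with rows as observations.

  import numpy as np
  from scipy.spatial.distance import pdist, squareform

  X = np.array([[1.0, 200.0],
                [2.0, 150.0],
                [3.0, 400.0]])

  # Standardize each variable (column) so no single scale dominates
  Z = (X - X.mean(axis=0)) / X.std(axis=0)

  # Pairwise Euclidean dissimilarities in square form
  D = squareform(pdist(Z, metric="euclidean"))

  print(np.diag(D))           # zeros along the diagonal
  print(np.allclose(D, D.T))  # True: the matrix is symmetric
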
8
Q

What topics are important to hierarchical clustering?

A

5 steps
Linkage
Chaining
Number of groups

9
Q

What are the five steps in hierarchical clustering?

A
  1. Each obs. starts in its own group
  2. The nearest pair of groups is merged
  3. There is now one less group
  4. The next nearest pair is merged
  5. Repeat until all obs. are in a single group
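A minimal sketch of this agglomerative process using SciPy's linkage on made-up data; each row of Z records one merge.

  import numpy as np
  from scipy.cluster.hierarchy import linkage, dendrogram

  # Six made-up observations in two dimensions
  X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])

  # Each obs. starts in its own group; the nearest pair of groups is merged
  # repeatedly until everything is in a single group
  Z = linkage(X, method="average")
  print(Z)

  # dendrogram(Z) draws the resulting merge tree
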
10
Q

What types of linkage are there?

A

Single (also called simple)
Complete
Average

11
Q

What is the formula for single (simple) linkage?

A

d(A,B) = min { d(x,y) : x ∈ A, y ∈ B }

12
Q

What is the formula for complete linkage?

A

d(A,B) = max { d(x,y) : x ∈ A, y ∈ B }

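A Python sketch (my own example clusters A and B) of all three linkage formulas computed from the pairwise distances.

  import numpy as np
  from scipy.spatial.distance import cdist

  A = np.array([[0.0, 0.0], [1.0, 0.0]])
  B = np.array([[4.0, 0.0], [6.0, 0.0]])

  pair_dists = cdist(A, B)          # d(x, y) for every x in A, y in B

  single   = pair_dists.min()       # single linkage: nearest pair
  complete = pair_dists.max()       # complete linkage: farthest pair
  average  = pair_dists.mean()      # average linkage: mean over all pairs

  print(single, complete, average)  # 3.0, 6.0, 4.5
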
13
Q

What is important to know about the number of groups in hierarchical clustering?

A

The dendrogram is cut at a chosen height on the y-axis; the branches hanging below the cut define the groups, so the cut height determines the number of groups.

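A sketch of cutting the tree with SciPy's fcluster, assuming the same kind of made-up data as above.

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
  Z = linkage(X, method="complete")

  # Cut the dendrogram at height 3 on the y-axis: every branch hanging
  # below the cut becomes one group
  labels = fcluster(Z, t=3, criterion="distance")
  print(labels)  # three groups, e.g. [1 1 2 2 3 3]
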
14
Q

What is chaining?

A

When observations are appended to an existing group one at a time, one after another, instead of forming distinct clusters (a known weakness of single linkage); it results in a poor model.

15
Q

What topics are under partitioning methods?

A

6 steps
K-means
Starting position
Number of groups

16
Q

What are the 6 steps to partitioning methods?

A
  1. Data is partitioned into initial groups
  2. The centroid of each group is calculated
  3. The distance from each obs. to each centroid is calculated
  4. Obs. are moved to the group with the nearest centroid if necessary
  5. If no moves are made, stop: a local minimum has been found
  6. Otherwise repeat from step 2
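A bare-bones Python sketch of this loop (my own illustration, not an optimized or library implementation).

  import numpy as np

  def kmeans(X, k, n_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      # Step 1: partition by picking k random obs. as the starting centroids
      centroids = X[rng.choice(len(X), size=k, replace=False)]
      labels = None
      for _ in range(n_iter):
          # Step 3: distance from every obs. to every centroid
          dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          # Step 4: move each obs. to the group with the nearest centroid
          new_labels = dists.argmin(axis=1)
          # Step 5: if no obs. moved, stop -- a local minimum has been found
          if labels is not None and np.array_equal(new_labels, labels):
              break
          labels = new_labels
          # Steps 2 and 6: recompute each group's centroid and repeat
          centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
      return labels, centroids

  X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.5, 4.8]])
  print(kmeans(X, k=2)[0])  # e.g. [0 0 1 1]
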
17
Q

What is important to know about k-means?

A

It is iterative
It is computationally efficient (+)
It can converge to local minima rather than the global optimum (-)

18
Q

What is important to remember about starting positions?

A

Random points
Random areas of the data space
Or the result of a hierarchical clustering

19
Q

How is the number of clusters in the data determined?

A

Trialling a range of cluster counts (e.g., 1-10)
Plotting the within-cluster sum of squares against the number of clusters
Looking for an elbow in the plot

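A sketch of the elbow approach, assuming scikit-learn is available, on made-up data with three real clusters.

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in (0, 4, 8)])

  # Trial k = 1..10 and record the within-cluster sum of squares (inertia)
  ss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in range(1, 11)]

  for k, s in zip(range(1, 11), ss):
      print(k, round(s, 1))
  # Plot ss against k and look for the elbow (it should appear around k = 3 here)
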
20
Q

What three topics are under cluster validation?

A

5 steps
Rand index
Issues with clustering

21
Q

What are the issues with clustering?

A

Sensitive to the starting point
Only works on continuous data
Assumes clusters are roughly spherical

22
Q

What things are important to note about the Rand index?

A

How to calculate: rand(S1,S2) = A/(A+D), where A = pairs of obs. the two groupings agree on and D = pairs they disagree on

The adjusted Rand index accounts for the agreement expected by chance

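A Python sketch (made-up group labels S1 and S2) computing A, D and the Rand index by brute force over pairs; scikit-learn's adjusted_rand_score gives the chance-corrected version.

  from itertools import combinations
  from sklearn.metrics import adjusted_rand_score

  S1 = [1, 1, 2, 2, 3, 3]  # groups from the trained clusters
  S2 = [1, 1, 2, 3, 3, 3]  # groups from clustering the test data directly

  # A = pairs the two groupings agree on (together in both, or apart in both)
  # D = pairs they disagree on
  A = D = 0
  for i, j in combinations(range(len(S1)), 2):
      if (S1[i] == S1[j]) == (S2[i] == S2[j]):
          A += 1
      else:
          D += 1

  print(A / (A + D))                  # Rand index
  print(adjusted_rand_score(S1, S2))  # adjusted for chance agreement
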
23
Q

What are the 5 steps in cluster validation?

A
  1. Split the data into training and test sets
  2. Run the clustering on the training data
  3. Assign the test data to the trained clusters: S1
  4. Cluster the test data on its own: S2
  5. Cross-tabulate S1 and S2 (e.g., with the Rand index)
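
A sketch of these five steps with scikit-learn on made-up data; the deck does not prescribe a particular library.

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.model_selection import train_test_split
  from sklearn.metrics import adjusted_rand_score

  rng = np.random.default_rng(1)
  X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in (0, 5)])

  # 1. Split the data into training and test sets
  X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

  # 2. Run the clustering on the training data
  km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)

  # 3. Assign the test data to the trained clusters: S1
  S1 = km.predict(X_test)

  # 4. Cluster the test data on its own: S2
  S2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_test)

  # 5. Cross-tabulate S1 and S2 (and/or compare them with the Rand index)
  table = np.zeros((2, 2), dtype=int)
  for i, j in zip(S1, S2):
      table[i, j] += 1
  print(table)
  print(adjusted_rand_score(S1, S2))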