applied statistics terms Flashcards

1
Q

what is clustering about?

A

finding discrete groups such that members of the same group differ little from each other and more from members of other groups

2
Q

is clustering classification?

A

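no: clustering is unsupervised (the groups are not known in advance), whereas classification assigns data points to predefined, labelled classes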

3
Q

hard clustering

A

each data point is only assigned to a single cluster

4
Q

soft clustering

A

a data point is assigned to all clusters, each with a certain degree of strength (not used in this course)

5
Q

can you add data to K-means clustering?

A

yes, by assigning the new data point to a cluster that already exists (the one with the nearest center)
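
Adding a new point to an existing k-means result can be sketched as below. This is a minimal illustration, assuming Euclidean distance; the function name `assign_to_nearest` and the example centers are hypothetical, not from the course.

```python
import math

def assign_to_nearest(point, centers):
    """Return the index of the existing center closest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))

# Hypothetical centers from an earlier k-means run.
centers = [(1.0, 1.0), (8.0, 8.0)]
new_point = (7.5, 8.4)
cluster = assign_to_nearest(new_point, centers)  # index of the cluster it joins
```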

6
Q

what method pre-specifies number of clusters?

A

K-means clustering

7
Q

what method creates a dendrogram?

A

hierarchical clustering

8
Q

how does k-means clustering work?

A

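you specify how many clusters you want (k) and it creates that many random centers. each data point is assigned to its closest center, the centers are then recomputed as the mean of their assigned points, and these two steps repeat until the assignments stop changing.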
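
A minimal sketch of basic k-means (Lloyd's algorithm), assuming Euclidean distance and random data points as starting centers. The function name `kmeans`, the optional `init` argument (fixed starting centers, for reproducibility) and the example data are assumptions for illustration.

```python
import math
import random

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Basic k-means. Returns (centers, labels)."""
    rng = random.Random(seed)
    # Random start: k data points drawn at random, unless `init` is given.
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its closest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its old center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, labels

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(points, k=2)
```

Because the start is random, different seeds can give different results; running it several times and keeping the best solution is a common remedy.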

9
Q

what is agglomerative clustering?

A

from the bottom to the top: every data point starts as its own cluster and clusters are merged step by step
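
The bottom-up merging can be sketched as follows. This is an illustration only, assuming Euclidean distance and single linkage (closest members) to decide which clusters merge; `agglomerate` is a hypothetical name.

```python
import math

def agglomerate(points):
    """Bottom-up clustering with single linkage; returns the list of merges."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the two closest members
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
# Merge distances, i.e. the heights at which branches join in a dendrogram.
heights = [d for _, _, d in agglomerate(points)]
```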

10
Q

what’s divisive clustering?

A

from top to bottom: all data points start in one cluster, which is split step by step

11
Q

which distance measures can be used for clustering binary data?

A

Jaccard distance (1 − intersection / union) or Manhattan
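
A small sketch of the Jaccard distance for two binary (0/1) vectors. The function name and the handling of two all-zero vectors (distance 0) are assumptions made here.

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two equal-length 0/1 vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)     # present in both
    either = sum(1 for x, y in zip(a, b) if x or y)    # present in at least one
    if either == 0:
        return 0.0  # assumed convention: two all-zero vectors are identical
    return 1 - both / either

a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
d = jaccard_distance(a, b)  # 2 shared out of 4 present in either: 1 - 2/4
```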

12
Q

which distance measures can be used for clustering continuous data?

A

Euclidean or Manhattan
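
Both distances in a few lines, as a sketch; the function names are chosen here for illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
d_euclid = euclidean(p, q)   # sqrt(9 + 16)
d_manh = manhattan(p, q)     # 3 + 4
```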

13
Q

which distance techniques look at absolute distances?

A

Euclidean and Manhattan

14
Q

which distance techniques look at relative distances?

A

Jaccard and Bray-Curtis
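
A sketch of what "relative" means here, using Bray-Curtis next to Manhattan: multiplying all counts by 10 changes the absolute distance but not the relative one. The function names and the example counts are made up for illustration.

```python
def manhattan(a, b):
    """Absolute measure: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def bray_curtis(a, b):
    """Relative measure for non-negative counts: sum|x - y| / sum(x + y)."""
    return manhattan(a, b) / sum(x + y for x, y in zip(a, b))

a, b = [10, 0, 5], [5, 5, 5]
a10, b10 = [10 * x for x in a], [10 * x for x in b]  # same profile, 10x the counts

abs_small, abs_big = manhattan(a, b), manhattan(a10, b10)      # grows with scale
rel_small, rel_big = bray_curtis(a, b), bray_curtis(a10, b10)  # stays the same
```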

15
Q

what is linkage?

A

calculating the distance between (sub)clusters in hierarchical clustering

16
Q

single linkage

A

shortest distance, two closest points in the two clusters

17
Q

complete linkage

A

longest distance, farthest points in the two clusters

18
Q

centroid linkage

A

distance between the centroids of the two clusters

19
Q

average linkage

A

average distance of all the pairwise distances
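
The four linkage rules above (single, complete, centroid, average) side by side, as a sketch assuming Euclidean distance. The function names and the two example clusters are chosen for illustration.

```python
import math
from itertools import product

def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [math.dist(p, q) for p, q in product(c1, c2)]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))        # two closest points

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))        # two farthest points

def average_linkage(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)              # mean of all pairwise distances

def centroid(cluster):
    return [sum(coord) / len(cluster) for coord in zip(*cluster)]

def centroid_linkage(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

c1 = [(0.0, 0.0), (1.0, 0.0)]
c2 = [(3.0, 0.0), (5.0, 0.0)]
results = {
    "single": single_linkage(c1, c2),
    "complete": complete_linkage(c1, c2),
    "average": average_linkage(c1, c2),
    "centroid": centroid_linkage(c1, c2),
}
```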

20
Q

Ward's minimum variance method (linkage)

A

you compute the centroid the two clusters would have if they were merged (one centroid for both clusters), then compute the distance from that new centroid to every data point in the merged cluster and take the sum of squares of those distances. the pair of clusters whose merger increases this sum of squares the least is merged.
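
A sketch of the quantity Ward's method compares: how much the within-cluster sum of squares grows when two clusters are merged. The function names and example clusters are assumptions for illustration.

```python
def centroid(cluster):
    return [sum(coord) / len(cluster) for coord in zip(*cluster)]

def sum_of_squares(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_cost(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 were merged."""
    return sum_of_squares(c1 + c2) - sum_of_squares(c1) - sum_of_squares(c2)

c1 = [(0.0, 0.0), (2.0, 0.0)]
c2 = [(4.0, 0.0), (6.0, 0.0)]
merged_ss = sum_of_squares(c1 + c2)  # squared distances to the merged centroid (3, 0)
cost = ward_cost(c1, c2)             # merged_ss minus the two separate sums
```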

21
Q

which linkage to use when you want to do ANOVA or regression?

A

Ward

22
Q

how does single linkage shape clusters?

A

it elongates

23
Q

how does complete linkage shape clusters?

A

they become compact

24
Q

what is the within-cluster sum of squares if each data point is its own cluster?

A

0

25
Q

what is measured when all data points belong to one cluster and you measure sum of squares?

A

the total sum of squares, which is proportional to the variance of the data

26
Q

what is WSS?

A

within-cluster sum of squares: the sum of squared distances from each data point to the centroid of its cluster
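
A sketch of WSS and its two extreme cases: every point as its own cluster, and all points in one cluster. The function name `wss` and the one-dimensional example data are made up for illustration.

```python
import statistics
from collections import defaultdict

def wss(points, labels):
    """Within-cluster sum of squares for a given cluster assignment."""
    groups = defaultdict(list)
    for p, lab in zip(points, labels):
        groups[lab].append(p)
    total = 0.0
    for members in groups.values():
        center = [sum(coord) / len(members) for coord in zip(*members)]
        total += sum(sum((x - m) ** 2 for x, m in zip(p, center)) for p in members)
    return total

points = [(1.0,), (2.0,), (3.0,), (6.0,)]
own = wss(points, [0, 1, 2, 3])   # each point is its own centroid
one = wss(points, [0, 0, 0, 0])   # total sum of squares around the overall mean
var = statistics.variance([x for (x,) in points])  # sample variance
# For one-dimensional data, `one` equals (n - 1) * var.
```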

27
Q

what number of clusters to choose?

A

the one where adding one more cluster doesn’t reduce the total within-cluster sum of squares much (the elbow)
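
The elbow is usually read off a plot by eye. One possible way to automate the idea, as a rough sketch: take the smallest k where adding one more cluster reduces WSS by less than a chosen fraction of the WSS at k = 1. The 5% threshold, the function name and the WSS values below are all assumptions for illustration, not course material.

```python
def choose_k(wss_by_k, min_gain=0.05):
    """wss_by_k[i] is the WSS for k = i + 1 clusters (assumed positive at k = 1).

    Returns the smallest k where going to k + 1 clusters reduces WSS
    by less than min_gain * WSS(k = 1).
    """
    total = wss_by_k[0]
    for i in range(len(wss_by_k) - 1):
        gain = (wss_by_k[i] - wss_by_k[i + 1]) / total
        if gain < min_gain:
            return i + 1
    return len(wss_by_k)  # no elbow found within the values given

# Made-up WSS values for k = 1..5: big drops up to k = 3, then almost flat.
wss_by_k = [100.0, 40.0, 12.0, 10.0, 9.0]
k = choose_k(wss_by_k)
```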

28
Q

how to test the quality of a clustering?

A

silhouette score

29
Q

when do you have a better silhouette score?

A

when the within cluster distance is smaller than the between cluster distance

30
Q

what is better, higher or lower silhouette score?

A

higher
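
A sketch of the silhouette score: for each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). Euclidean distance, the function name and the example data are assumptions; points that are alone in their cluster are given a score of 0 here.

```python
import math

def silhouette(points, labels):
    """Mean silhouette width over all points."""
    scores = []
    for i, p in enumerate(points):
        # distances from p to every other point, grouped by cluster
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        means = {lab: sum(d) / len(d) for lab, d in dists.items()}
        own = labels[i]
        others = [m for lab, m in means.items() if lab != own]
        if own not in means or not others:
            scores.append(0.0)  # point alone in its cluster, or only one cluster
            continue
        a = means[own]    # within-cluster distance
        b = min(others)   # distance to the nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # clusters match the two groups
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the groups
```

The grouping that matches the data gets a score close to 1, the grouping that cuts across it gets a negative score.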

31
Q

what is the difference between Manhattan and Jaccard distance?

A

Manhattan looks at absolute distance, Jaccard at relative

32
Q

what is a tanglegram?

A

a way to compare two dendrograms

33
Q

what does the clustering coefficient say?

A

the degree of clustering, how related points are