applied statistics terms Flashcards
what is clustering about?
finding discrete groups with small differences between group members
is clustering classification?
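no, clustering is unsupervised: it finds groups without known labels, while classification assigns points to predefined classes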
hard clustering
each data point is only assigned to a single cluster
soft clustering
each data point is assigned to every cluster with a certain degree of strength (not used in this course)
can you add data to K-means clustering?
yes, by assigning it to the existing cluster whose center is closest
what method pre-specifies number of clusters?
K-means clustering
what method creates a dendrogram?
hierarchical clustering
how does k-means clustering work?
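you specify how many clusters you want (k) and it creates that many random centers. each data point is assigned to its closest center, each center is moved to the mean of its assigned points, and this repeats until the assignments stop changing.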
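The assign-and-update loop can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `iters` and `seed` parameters, the choice to draw the random starting centers from the data points, and the toy points are all assumptions made for the example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch: random centers, then assign/update until stable."""
    rng = random.Random(seed)
    # Assumption: the random starting centers are picked from the data points.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to the closest center (squared Euclidean).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical toy data: two well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

Which group gets label 0 depends on the random start, so only the grouping is meaningful, not the label numbers.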
what is agglomerative clustering?
bottom-up: every data point starts as its own cluster and the closest clusters are merged step by step
what’s divisive clustering?
top-down: all data points start in one cluster, which is split step by step
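The bottom-up direction can be sketched as a loop that keeps merging the two closest clusters. The sketch below is an illustration under assumptions: it uses Euclidean distance and the shortest point-to-point distance between clusters, and the name `agglomerate` and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist

def cluster_distance(a, b):
    # Assumption for this sketch: distance between clusters = closest pair of points.
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points):
    """Start with one cluster per point and merge the closest pair until one is left."""
    clusters = [[p] for p in points]
    merges = []  # (cluster_a, cluster_b, merge_distance), in merge order
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], cluster_distance(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

points = [(0, 0), (0, 1), (5, 0), (5, 3)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The recorded merge distances are the heights at which branches would join in a tree drawing of the result.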
how to calculate distance for clustering binary data?
Jaccard distance (1 - intersection / union) or Manhattan
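A small worked example for two binary vectors. Jaccard distance is commonly defined as 1 minus the Jaccard similarity (intersection / union), so identical vectors get distance 0. The function names and the vectors `a` and `b` are made up for illustration.

```python
def jaccard_distance(a, b):
    """1 - (positions where both are 1) / (positions where at least one is 1)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-zero vectors are treated as identical (distance 0).
    return 0.0 if either == 0 else 1 - both / either

def manhattan_distance(a, b):
    """Sum of absolute differences; for binary data, the number of mismatches."""
    return sum(abs(x - y) for x, y in zip(a, b))

a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]
print(jaccard_distance(a, b))    # 1 - 2/4 = 0.5
print(manhattan_distance(a, b))  # 2
```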
how to calculate distance for clustering continuous data?
Euclidean or Manhattan
which distance techniques look at absolute distances?
Euclidean and Manhattan
which distance techniques look at relative distances?
Jaccard and Bray-Curtis
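One way to see the absolute versus relative difference: multiply both vectors by 10. The sketch below assumes the usual Bray-Curtis definition for non-negative values (sum of absolute differences divided by the sum of all values); the vectors are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def bray_curtis(a, b):
    # Assumes non-negative values (e.g. counts) with a non-zero total.
    return sum(abs(x - y) for x, y in zip(a, b)) / sum(x + y for x, y in zip(a, b))

a, b = [1, 2, 3], [2, 4, 6]
a10, b10 = [10 * x for x in a], [10 * x for x in b]

print(manhattan(a, b), manhattan(a10, b10))      # 6 60: grows with the scale
print(euclidean(a, b), euclidean(a10, b10))      # also grows with the scale
print(bray_curtis(a, b), bray_curtis(a10, b10))  # unchanged by the scale
```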
what is linkage?
calculating the distance between (sub)clusters in hierarchical clustering
single linkage
shortest distance, two closest points in the two clusters
complete linkage
longest distance, farthest points in the two clusters
centroid linkage
distance between the centroids of the two clusters
average linkage
average of all pairwise distances between points in the two clusters
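The four linkage rules above can be compared on the same pair of clusters. This is a sketch using Euclidean distance; the clusters `A` and `B` and the function names are made up for the example.

```python
from math import dist

def pairwise(a, b):
    """All point-to-point distances between cluster a and cluster b."""
    return [dist(p, q) for p in a for q in b]

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def single_linkage(a, b):
    return min(pairwise(a, b))    # two closest points

def complete_linkage(a, b):
    return max(pairwise(a, b))    # two farthest points

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)        # mean of all pairwise distances

def centroid_linkage(a, b):
    return dist(centroid(a), centroid(b))

A = [(0, 0), (0, 2)]
B = [(3, 0), (5, 0)]
for rule in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(rule.__name__, rule(A, B))
```

For the same two clusters, single linkage gives the smallest value and complete linkage the largest, with average linkage in between.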
Ward's minimum variance method linkage
you compute the centroid the two clusters would have if they were merged (one centroid for both clusters), then sum the squared distances from every data point in them to that new centroid. the pair of clusters whose merge increases the sum of squares the least is merged
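A sketch of the sum-of-squares computation behind Ward's method. Ward's criterion is usually stated as the increase in sum of squares caused by a merge, so the example computes both the merged sum of squares and that increase. The function names and toy clusters are assumptions for illustration.

```python
def sum_of_squares(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    centroid = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum((x - c) ** 2 for p in cluster for x, c in zip(p, centroid))

def ward_cost(a, b):
    """Increase in sum of squares if clusters a and b were merged."""
    return sum_of_squares(a + b) - sum_of_squares(a) - sum_of_squares(b)

A = [(0, 0), (0, 2)]
B = [(4, 0), (4, 2)]
print(sum_of_squares(A + B))  # 20.0, around the merged centroid (2, 1)
print(ward_cost(A, B))        # 16.0 = 20 - 2 - 2
```

At each step, the pair of clusters with the smallest `ward_cost` would be the one to merge.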
what linkage to use when you want to do ANOVA or regression?
Ward
how does single linkage shape clusters?
it elongates
how does complete linkage shape clusters?
they become compact
what is the within-cluster sum of squares if every data point is its own cluster?
0
what is measured when all data points belong to one cluster and you measure sum of squares?
the total sum of squares, which is proportional to the variance
what is WSS?
within-cluster sum of squares
what number of clusters to choose?
the one where adding one more cluster doesn’t reduce the within-cluster sum of squares much (the elbow)
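The within-cluster sum of squares for different numbers of clusters can be checked on a tiny data set. In this sketch the cluster labels are assigned by hand rather than by a clustering algorithm, and the function name `wss` and the points are hypothetical.

```python
def wss(points, labels):
    """Within-cluster sum of squares: squared distances to each cluster's own centroid."""
    total = 0.0
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroid = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2 for p in members for x, c in zip(p, centroid))
    return total

points = [(0, 0), (0, 2), (4, 0), (4, 2)]
one_cluster = wss(points, [0, 0, 0, 0])   # 20.0: everything in one cluster
two_clusters = wss(points, [0, 0, 1, 1])  # 4.0: big drop
own_clusters = wss(points, [0, 1, 2, 3])  # 0.0: each point is its own cluster
print(one_cluster, two_clusters, own_clusters)
```

Going from 1 to 2 clusters removes 16 of the 20 units, while going further removes only the remaining 4, so the elbow here would be at 2 clusters.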
how to test the quality of a clustering?
silhouette score
when do you have a better silhouette score?
when the within cluster distance is smaller than the between cluster distance
what is better, higher or lower silhouette score?
higher
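A minimal sketch of the silhouette computation, assuming the standard definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the points of the nearest other cluster, and the score is (b - a) / max(a, b). The function name, the convention of scoring single-point clusters as 0, and the toy data are assumptions.

```python
from math import dist

def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if j != i and labels[j] == labels[i]]
        if not own:
            scores.append(0.0)  # convention used here for single-point clusters
            continue
        a = sum(own) / len(own)  # within-cluster distance
        b = min(                 # distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == other) / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # nearby points grouped together
bad = silhouette_score(points, [0, 1, 0, 1])   # distant points grouped together
print(good, bad)
```

The sensible grouping scores close to 1, while the grouping that pairs distant points scores below 0.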
what is the difference between Manhattan and Jaccard distance?
Manhattan looks at absolute distance, Jaccard at relative
what is a tanglegram?
a way to compare two dendrograms
what does the clustering coefficient say?
the degree of clustering, how related points are