VL 12 Flashcards
1
Q
What is Clustering? and why do you use it?
A
Clustering is a technique that groups similar data points together based on their similarities, without using predefined categories. It helps find patterns and structures within datasets for various applications.
- need for generalisation, grouping, classification
- 1 or 2 variables –> humans can cluster from plots
- 3, 4, … variables (coplot, image, pairs, PCA, MDS) –> computers are better suited to do the clustering
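A minimal sketch of the “too many variables for the eye” point, using base R. The matrix `x` is a made-up toy example; `prcomp` and `pairs` are the standard R functions mentioned above.

```r
# Toy data: two groups of points in 4 dimensions (made-up example values)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 3), ncol = 4))

# With 1-2 variables a human can cluster directly from a plot;
# with 4 variables we first project the data, e.g. by PCA
pca <- prcomp(x, scale. = TRUE)
plot(pca$x[, 1:2])   # 2D projection: the two groups separate visually
pairs(x)             # alternative: all pairwise scatter plots
```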
2
Q
types of clustering
A
- Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree
–> we will get a dendrogram and obtain a cluster id by cutting the dendrogram
- Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
–> we will only get a cluster id
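A short sketch of the two outcomes, assuming toy one-dimensional data; `hclust`, `cutree`, and `kmeans` are standard R functions.

```r
# Toy data: 6 points forming 2 obvious groups (made-up values)
x <- matrix(c(1, 1.2, 0.9, 5, 5.1, 4.8), ncol = 1)

# Hierarchical: build the full tree (dendrogram) ...
hc <- hclust(dist(x))
plot(hc)                   # the dendrogram implicitly contains all values of k

# ... then obtain cluster ids by dendrogram cutting
id <- cutree(hc, k = 2)    # one cluster id per object

# Partitional (e.g. k-means): returns only cluster ids, no tree
km <- kmeans(x, centers = 2)
km$cluster
```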
3
Q
What is hierarchical clustering?
A
- no need to specify the number of clusters k before clustering starts
- the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
- at one end of the tree there are n clusters, each containing a single object; at the other end there is one cluster containing all n objects
4
Q
What do we need to cluster?
A
- needed for the clustering is a complete set, i.e. a matrix, of the (dis)similarities between all objects
- dissimilarity coefficients may be obtained from the computation of distances (see last section)
- Euclidean distance, Manhattan distance, and correlation-coefficient-based distances are usually used
- high-quality clusters have:
– high intra-class similarity
– low inter-class similarity
- good: small circles, long lines
- bad: large circles, short lines
5
Q
Distance Matrix
A
- you can generate the distance matrix from your data with the dist function:
dist.mt = dist(data, method = "manhattan")
- be careful whether you create a distance matrix for the row or the column items; transpose your matrix if you want to switch between row and column item distances
- if the distance matrix was not generated by the dist function: use dist.mt = as.dist(matrix) to generate a distance matrix object from a “manually” made distance matrix
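A runnable sketch of the three points above, with a made-up toy matrix `m`; `dist`, `t`, and `as.dist` are the standard R functions named in the card.

```r
# Toy matrix: 4 row items x 3 column items (made-up values)
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))

# dist() computes distances between the ROWS of its input
d.rows <- dist(m, method = "manhattan")      # 4 row-item distances

# transpose to get distances between the COLUMN items instead
d.cols <- dist(t(m), method = "manhattan")   # 3 column-item distances

# a "manually" built square distance matrix must be converted first
man     <- as.matrix(d.rows)   # plain matrix, not a dist object
d.again <- as.dist(man)        # dist object usable by hclust
```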
6
Q
Agglomerative Clustering
A
- a distance matrix is required
- agglomerative methods start with n clusters and proceed by successive fusions until a single cluster is obtained
- 1st: the two “most similar” objects are joined into a cluster
- 2nd: the distance matrix is reduced as the rows of the joined objects are merged
- hclust: hierarchical cluster analysis on a set of dissimilarities, with methods for analyzing the result
7
Q
hclust: hierarchical agglomerative clustering
A
- performed on a distance matrix which is stepwise reduced
- the distances of the merged rows to all remaining rows are repeatedly recalculated:
– single linkage merges the closest rows using the smallest distance value
– complete linkage merges the closest rows using the largest distance value
– average linkage merges the closest rows using the average distance value
- default for hclust is complete linkage
- computationally intensive
- don’t do this for 20,000 genes, but do this for 100 samples
- transpose the matrix correctly so that you cluster samples, not genes
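An end-to-end sketch of this card, assuming a made-up expression matrix `expr` with genes as rows and samples as columns; `dist` and `hclust` (with its `method` argument) are the standard R functions described above.

```r
# Toy expression matrix: 50 genes (rows) x 6 samples (columns), made-up values
set.seed(1)
expr <- matrix(rnorm(300), nrow = 50,
               dimnames = list(NULL, paste0("sample", 1:6)))

# cluster the 6 SAMPLES, not the 50 genes -> transpose first
d <- dist(t(expr), method = "euclidean")

hc.complete <- hclust(d)                      # default: complete linkage
hc.single   <- hclust(d, method = "single")   # merge on smallest distance
hc.average  <- hclust(d, method = "average")  # merge on average distance

plot(hc.complete, main = "Complete linkage on samples")
```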