VL 12 Flashcards
1
Q
What is Clustering? and why do you use it?
A
Clustering is a technique that groups similar data points together based on their similarities, without using predefined categories. It helps find patterns and structures within datasets for various applications.
- need for generalisation, grouping, classification
- 1 or 2 variables –> humans can cluster from plots
- 3, 4, … variables (coplot, image, pairs, PCA, MDS) –> computers are better suited to do the clustering
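A minimal sketch of the “too many variables for the eye” point, using base R. The matrix `x` is a made-up toy example; `prcomp` and `pairs` are the standard R functions mentioned above.

```r
# Toy data: two groups of points in 4 dimensions (made-up example values)
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 4),
           matrix(rnorm(40, mean = 3), ncol = 4))

# With 1-2 variables a human can cluster directly from a plot;
# with 4 variables we first project the data, e.g. by PCA
pca <- prcomp(x, scale. = TRUE)
plot(pca$x[, 1:2])   # 2D projection: the two groups separate visually
pairs(x)             # alternative: all pairwise scatter plots
```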
2
Q
types of clustering
A
- Hierarchical Clustering: a set of nested clusters organized as a hierarchical tree
–> we will get a dendrogram and obtain a cluster id by cutting the dendrogram
- Partitional Clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
–> we will only get a cluster id
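A short sketch of the two outcomes, assuming toy one-dimensional data; `hclust`, `cutree`, and `kmeans` are standard R functions.

```r
# Toy data: 6 points forming 2 obvious groups (made-up values)
x <- matrix(c(1, 1.2, 0.9, 5, 5.1, 4.8), ncol = 1)

# Hierarchical: build the full tree (dendrogram) ...
hc <- hclust(dist(x))
plot(hc)                   # the dendrogram implicitly contains all values of k

# ... then obtain cluster ids by dendrogram cutting
id <- cutree(hc, k = 2)    # one cluster id per object

# Partitional (e.g. k-means): returns only cluster ids, no tree
km <- kmeans(x, centers = 2)
km$cluster
```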
3
Q
What is hierarchical clustering?
A
- no need to specify the number of clusters k before clustering starts
- the algorithm constructs a tree-like hierarchy (dendrogram) which (implicitly) contains all values of k
- at one end of the tree there are n clusters, each containing a single object; at the other end there is one cluster containing all n objects
4
Q
What do we need to cluster?
A
- needed for the clustering is a complete set, i.e. a matrix, of the (dis)similarities between all objects
- dissimilarity coefficients may be obtained from the computation of distances (see last section)
- Euclidean distance, Manhattan distance, and correlation-coefficient-based distances are usually used
- high-quality clusters have:
– high intra-class similarity
– low inter-class similarity
- good: small circles, long lines
- bad: large circles, short lines
5
Q
Distance Matrix
A
- you can generate the distance matrix from your data with the dist function:
dist.mt = dist(data, method = "manhattan")
- be careful whether you create a distance matrix for the row or the column items; transpose your matrix if you want to switch between row and column item distances
- if the distance matrix was not generated by the dist function: use dist.mt = as.dist(matrix) to generate a distance matrix object from a “manually” made distance matrix
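A runnable sketch of the three points above, with a made-up toy matrix `m`; `dist`, `t`, and `as.dist` are the standard R functions named in the card.

```r
# Toy matrix: 4 row items x 3 column items (made-up values)
m <- matrix(1:12, nrow = 4,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))

# dist() computes distances between the ROWS of its input
d.rows <- dist(m, method = "manhattan")      # 4 row-item distances

# transpose to get distances between the COLUMN items instead
d.cols <- dist(t(m), method = "manhattan")   # 3 column-item distances

# a "manually" built square distance matrix must be converted first
man     <- as.matrix(d.rows)   # plain matrix, not a dist object
d.again <- as.dist(man)        # dist object usable by hclust
```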
6
Q
Agglomerative Clustering
A
- a distance matrix is required
- agglomerative methods start with n clusters and proceed by successive fusions until a single cluster is obtained
- 1st: the two “most similar” objects are joined into a cluster
- 2nd: the distance matrix is reduced as the rows of the joined objects are merged
- hclust: hierarchical cluster analysis on a set of dissimilarities, with methods for analyzing the result
7
Q
hclust: hierarchical agglomerative clustering
A
- performed on a distance matrix which is stepwise reduced
- the distances of the merged rows to all remaining rows are repeatedly recalculated:
– single linkage merges the closest rows using the smallest distance value
– complete linkage merges the closest rows using the largest distance value
– average linkage merges the closest rows using the average distance value
- default for hclust is complete linkage
- computationally intensive
- don’t do this for 20,000 genes, but do this for 100 samples
- transpose the matrix correctly so that you cluster samples, not genes
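An end-to-end sketch of this card, assuming a made-up expression matrix `expr` with genes as rows and samples as columns; `dist` and `hclust` (with its `method` argument) are the standard R functions described above.

```r
# Toy expression matrix: 50 genes (rows) x 6 samples (columns), made-up values
set.seed(1)
expr <- matrix(rnorm(300), nrow = 50,
               dimnames = list(NULL, paste0("sample", 1:6)))

# cluster the 6 SAMPLES, not the 50 genes -> transpose first
d <- dist(t(expr), method = "euclidean")

hc.complete <- hclust(d)                      # default: complete linkage
hc.single   <- hclust(d, method = "single")   # merge on smallest distance
hc.average  <- hclust(d, method = "average")  # merge on average distance

plot(hc.complete, main = "Complete linkage on samples")
```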