DNA microarrays 2 Flashcards
What is cluster analysis useful for?
Reduces the number of data points by clustering/grouping objects (e.g. genes) together based on their similarity to each other
What are the two measures of similarity in cluster analysis? How do we cluster genes in microarrays?
- Euclidean distance: absolute distance between two points in space
- Correlation distance: similarity of the directions in which two vectors point
In microarrays: we cluster genes with similar expression profiles
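The difference between the two measures can be seen in a small sketch. The gene names and expression values below are made up for illustration, and correlation distance is assumed here to mean 1 minus the Pearson correlation (one common convention; the card itself does not give a formula).

```python
import math

def euclidean_distance(a, b):
    # Absolute distance between two points in space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    # Assumption: correlation distance = 1 - Pearson correlation.
    # 0 means the profiles point in the same direction, 2 means opposite.
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression profiles across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]   # same trend as gene_a, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]       # opposite trend to gene_a

print(euclidean_distance(gene_a, gene_b))    # large: magnitudes differ
print(correlation_distance(gene_a, gene_b))  # about 0: same direction
print(correlation_distance(gene_a, gene_c))  # about 2: opposite direction
```

Genes a and b are far apart by Euclidean distance but essentially identical by correlation distance, which is why the choice of measure changes which genes end up clustered together.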
What are the 2 types of clustering algorithms?
- Hierarchical clustering
- Partitional clustering
Describe hierarchical clustering (4)
- Do not have to specify “k” number of clusters
- Deterministic: cluster together the two closest genes and then the next closest (always get the same answer when clustering the data)
- Clusters are hierarchical, based on similarity
- Dendrogram (tree) is generated
Describe partitional clustering
- Have to specify “k” number of clusters
- Nondeterministic: initialization of clusters is done randomly (can get different answers every time you cluster the data)
- All clusters and their elements are at the same level
What are the two types of hierarchical clustering? Describe them
- Agglomerative (bottom-up): start with single-gene clusters and successively join the closest/most similar clusters until all genes belong to one super-cluster
- Divisive (top-down): start with one super-cluster, split it into smaller clusters based on similarity, and repeat until you get single-gene clusters
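A minimal sketch of the agglomerative (bottom-up) approach, under two assumptions that are not part of the card: Euclidean distance between genes, and the distance between two clusters taken as the distance between their closest members. The function name `agglomerative` and the sample profiles are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points):
    """Return the merges in order as (cluster_a, cluster_b, distance)."""
    # Start with every gene in its own cluster (tuples of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Assumption: cluster distance = distance of closest members.
                d = min(euclidean(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical two-condition expression profiles for four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = agglomerative(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

Running this twice on the same data gives the same merge order, which is the deterministic behaviour described for hierarchical clustering. The recorded merge distances are what a tree plot would draw as node heights.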
Dendrogram
Indicates the degree of similarity or distance between data points
True or false: There is biological rationale for a “genealogy” tree
False
- The tree only reflects similarity; you can cluster all sorts of data this way
Are dendrograms robust?
No, the branching structure is sensitive to particular algorithms and distance metrics used, and to the order of data entry
What does the height of a node in the dendrogram represent?
The distance between its two child clusters (how dissimilar they are)
Single linkage (a way to define the distance between two clusters)
Uses the distance between the closest members of the two clusters (nearest neighbors).
Complete linkage (a way to define the distance between two clusters)
Uses the distance between the farthest members of the two clusters.
Centroid linkage (a way to define the distance between two clusters)
Measures the distance between the centroids (average positions) of the two clusters.
Average linkage (a way to define the distance between two clusters)
Calculates the average of all pairwise distances between members of the two clusters.
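The four linkage definitions can be compared side by side in a short sketch. Euclidean distance is assumed as the underlying measure, and the two small clusters are made-up data.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(cluster_a, cluster_b):
    # Distance between the closest pair of members.
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    # Distance between the farthest pair of members.
    return max(euclidean(p, q) for p in cluster_a for q in cluster_b)

def average_linkage(cluster_a, cluster_b):
    # Mean of all pairwise distances between members.
    dists = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    return sum(dists) / len(dists)

def centroid(cluster):
    # Average position of the cluster's members, coordinate by coordinate.
    return [sum(coords) / len(cluster) for coords in zip(*cluster)]

def centroid_linkage(cluster_a, cluster_b):
    # Distance between the two cluster centroids.
    return euclidean(centroid(cluster_a), centroid(cluster_b))

# Hypothetical clusters of two genes each.
cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [3.0, 2.0]]

print(single_linkage(cluster_a, cluster_b))
print(complete_linkage(cluster_a, cluster_b))
print(average_linkage(cluster_a, cluster_b))
print(centroid_linkage(cluster_a, cluster_b))
```

For the same pair of clusters, single linkage gives the smallest value, complete linkage the largest, and average linkage falls in between, so the choice of linkage can change which clusters get merged first.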
Describe K-means clustering and its steps (5 steps)
A NON-DETERMINISTIC partitioning method that subdivides objects or genes into a predetermined number (k) of clusters
1. Choose the number of clusters, k. The algorithm randomly selects k cluster centers (centroids)
2. Each pattern (object/gene) is assigned to the cluster whose centroid is nearest
3. The centroids are moved to the center of their respective clusters
4. Patterns are reassigned to clusters based on their distance to the new centroids
5. Steps 3 & 4 are repeated until the cluster assignments no longer change
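The five steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, picks the initial centroids by sampling k of the data points at random, and uses a hypothetical `max_iter` safety cap. The function name `k_means` and the sample data are made up.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    # Center of a group of points, coordinate by coordinate.
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, seed=None, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly pick k starting centroids (assumption: sampled
    # from the data points themselves).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each pattern to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the center of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = mean_point(members)
    return assignments, centroids

# Hypothetical expression profiles forming two well-separated groups.
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
        [8.0, 8.0], [9.0, 9.0], [8.5, 9.5]]
assignments, centroids = k_means(data, k=2)
print(assignments)
print(centroids)
```

Because the starting centroids are random, the cluster labels (0 or 1) can swap between runs, and on less cleanly separated data the groupings themselves can differ. Passing a fixed `seed` makes a run repeatable.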