DNA microarrays 2 Flashcards
What is cluster analysis useful for?
Reduces the number of data points by clustering/grouping objects (e.g. genes) together based on their similarity to each other
What are the two measures of similarity in cluster analysis? How do we cluster genes in microarrays?
- Euclidean distance: absolute distance between two points in space
- Correlation distance: similarity of the directions in which two vectors point
In microarrays: we cluster genes with similar expression profiles
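A minimal Python sketch of the two measures. It assumes correlation distance is taken as 1 minus the Pearson correlation, which is one common convention and is not stated on the card. The gene profiles and function names are made up for illustration.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation_distance(x, y):
    """1 - Pearson correlation (assumed convention): 0 when two
    profiles rise and fall together, regardless of magnitude."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)

# Hypothetical expression values for two genes across four arrays.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape, much larger magnitude

print(euclidean_distance(gene_a, gene_b))    # large: the magnitudes differ
print(correlation_distance(gene_a, gene_b))  # about 0: the shapes match
```

The two genes are far apart by Euclidean distance but essentially identical by correlation distance, which is why the choice of measure changes which genes end up clustered together.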
What are the 2 types of clustering algorithms?
- Hierarchical clustering
- Partitional clustering
Describe hierarchical clustering (4)
- Do not have to specify “k” number of clusters
- Deterministic: cluster together the two closest genes and then the next closest (always get the same answer when clustering the data)
- Clusters are hierarchical, based on similarity
- Dendrogram (tree) is generated
Describe partitional clustering
- Have to specify “k” number of clusters
- Nondeterministic: initialization of clusters is done randomly (may get different answers each time you cluster the data)
- All clusters and their elements are at the same level
What are the two types of hierarchical clustering? Describe them
- Agglomerative (bottom-up): start with single-gene clusters and successively join the closest/most similar clusters until all genes belong to a single super-cluster
- Divisive (top-down): start with one super-cluster and begin to split into smaller clusters based on similarity and repeat until you get single gene clusters
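A rough sketch of the agglomerative (bottom-up) approach in Python. Single linkage and Euclidean distance are assumptions chosen to keep the example short; the gene names and values are hypothetical.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def agglomerative(profiles):
    """Bottom-up clustering with single linkage (an illustrative choice).

    profiles: dict mapping gene name -> list of expression values.
    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in sorted(profiles)]  # one gene per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two joined clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

genes = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [5.0, 7.0]}
for a, b, d in agglomerative(genes):
    print(a, b, round(d, 3))
```

Each recorded distance corresponds to the height at which the two clusters would be joined in the tree. Running it twice on the same data gives the same merges, which is the deterministic behaviour described for hierarchical clustering.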
Dendrogram
Indicates the degree of similarity or distance between data points
True or false: There is biological rationale for a “genealogy” tree
False
- You can cluster all sorts of data, so the tree shows similarity, not ancestry
Are dendrograms robust?
No, the branching structure is sensitive to particular algorithms and distance metrics used, and to the order of data entry
What does the height of a node in the dendrogram represent?
The distance between its two child clusters (how dissimilar they are)
Single linkage way to define the distance between two clusters
Uses the distance between the closest members of two clusters (nearest neighbors).
Complete linkage way to define the distance between two clusters
Uses the distance between the farthest members of two clusters.
Centroid linkage way to define the distance between two clusters
Measures the distance between the centroids (average positions) of two clusters.
Average linkage way to define the distance between two clusters
Calculates the average of all distances between members of two clusters.
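The four linkage definitions can be compared side by side in a short Python sketch. Euclidean distance between members is an assumption, and the two example clusters are hypothetical points on a line so the numbers are easy to check by hand.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Average position of the points in a cluster."""
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]

def single_linkage(c1, c2):
    # Closest pair of members (nearest neighbors).
    return min(euclidean(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    # Farthest pair of members.
    return max(euclidean(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    # Mean over all between-cluster member pairs.
    dists = [euclidean(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

def centroid_linkage(c1, c2):
    # Distance between the two cluster centroids.
    return euclidean(centroid(c1), centroid(c2))

cluster_1 = [[0.0, 0.0], [2.0, 0.0]]
cluster_2 = [[5.0, 0.0], [9.0, 0.0]]

print(single_linkage(cluster_1, cluster_2))    # 3.0
print(complete_linkage(cluster_1, cluster_2))  # 9.0
print(average_linkage(cluster_1, cluster_2))   # 6.0
print(centroid_linkage(cluster_1, cluster_2))  # 6.0
```

Average and centroid linkage happen to agree here because the points are collinear; in general they give different values.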
Describe K-means clustering and its steps (5 steps)
A NON-DETERMINISTIC partitioning method that subdivides objects or genes into a predetermined number (k) of clusters
1. Choose the number of clusters, k. The algorithm randomly selects k cluster centers (centroids)
2. Each pattern (object) is assigned to the cluster whose centroid is nearest
3. The centroids are then moved to the center of their respective clusters
4. Patterns are reassigned to clusters based on their distance to the new centroids
5. Steps 3 & 4 are repeated until the cluster assignments no longer change
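The five steps above can be sketched in plain Python. This is an illustrative version only: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and adds a `seed` parameter and a `max_iter` cap that are not part of the card. The sample points are hypothetical.

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    return [sum(coords) / len(points) for coords in zip(*points)]

def k_means(points, k, seed=None, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return assignment, centroids

points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = k_means(points, k=2, seed=0)
print(labels)
print(centers)
```

Leaving `seed` unset lets the starting centroids vary between runs. On well-separated data like this the grouping usually comes out the same, but the cluster numbering can differ, and on less clean data the final clusters themselves can differ.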
Why is k-means clustering non-deterministic?
Because the initial centroids are placed randomly: if the centroids start in different regions relative to the objects, the final clusters can differ
What is two dimensional hierarchical clustering and what is it used for?
2D clustering is the process of grouping together similar entities/patterns in both rows and columns of a dataset
- Identifying co-regulated genes under certain environmental conditions
In 2D hierarchical clustering, what does a vertical dendrogram show? A horizontal dendrogram?
What is generated with rearranged values that correspond to the clustergram?
Vertical dendrogram: Shows similarity of rows (genes)
Horizontal dendrogram: Shows similarity of columns (experiments)
- A matrix is generated with rearranged values that correspond to the clustergram
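A small Python sketch of the rearrangement step only. It assumes the row and column leaf orders have already been read off the two dendrograms; the orders, gene labels, and expression values below are hypothetical.

```python
def reorder_matrix(matrix, row_order, col_order):
    """Rearrange rows and columns to follow the dendrogram leaf orders."""
    return [[matrix[r][c] for c in col_order] for r in row_order]

# Rows are genes g1..g4, columns are experiments e1..e3 (made-up values).
expression = [
    [2.0, -1.0, 2.1],
    [-1.5, 1.0, -1.4],
    [1.9, -0.9, 2.0],
    [-1.6, 1.2, -1.5],
]

# Assumed leaf orders: g1 next to g3, g2 next to g4; e1 next to e3.
row_order = [0, 2, 1, 3]
col_order = [0, 2, 1]

clustered = reorder_matrix(expression, row_order, col_order)
for row in clustered:
    print(row)
```

After reordering, similar genes sit in adjacent rows and similar experiments in adjacent columns, so blocks of high and low expression become visible in the heat map.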
Gene ontology
Collaborative effort to address the need for consistent descriptions of gene products in different databases
In gene ontology, what three main categories are genes assigned into?
- Biological process (e.g. DNA replication, transcription, etc.)
- Molecular function (i.e. what type of protein is encoded? kinase, part of a complex, etc.)
- Cellular component (i.e. where does the protein exist in the cell, which organelle?)
What are genome browsers?
Graphical interface that displays information from a database of genome data
- Includes descriptions of genes (gene ontology, chromosome coordinates, nucleotide and amino acid sequences, expression, mutant, phenotype, genetic interaction)
Genes that are expressed at the same time most likely…
Have the same function and need to be regulated together
- Uncharacterized genes are predicted to share the function of the annotated genes in the same cluster of co-expressed genes (guilt-by-association)
Functional enrichment
A particular function is overrepresented among a set of co-expressed genes compared with what is expected (% of genes in the genome with the same function)
How is functional enrichment determined?
- Input clusters of co-expressed genes into Princeton GO-Term Finder, which then searches for functional enrichment.
- A functional enrichment probability (P-value) is calculated based on the hypergeometric distribution (P<0.01 is considered significant)
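A hedged Python sketch of a hypergeometric enrichment P-value. It computes the upper-tail probability of seeing at least k annotated genes in a cluster, which is the usual form of this test; it is not a reproduction of what GO-Term Finder does internally. The function name, parameter names, and the gene counts are hypothetical.

```python
from math import comb

def enrichment_p_value(N, M, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, M, n).

    N: genes in the genome (background)
    M: background genes annotated with the function
    n: genes in the co-expressed cluster
    k: cluster genes annotated with the function
    """
    upper = min(n, M)
    total = comb(N, n)  # all ways to draw a cluster of size n
    favorable = sum(comb(M, i) * comb(N - M, n - i)
                    for i in range(k, upper + 1))
    return favorable / total

# Hypothetical counts: 100 of 6000 genes carry the annotation (about 1.7%),
# but 10 of the 50 genes in the cluster do (20%).
p = enrichment_p_value(N=6000, M=100, n=50, k=10)
print(p)
print(p < 0.01)
```

With these made-up counts the cluster contains far more annotated genes than the roughly one expected by chance, so the P-value falls well below the 0.01 threshold.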