DNA microarrays 2 Flashcards
What is cluster analysis useful for?
Reduces the number of data points by clustering/grouping objects (e.g. genes) together based on their similarity to each other
What are the two measures of similarity in cluster analysis? How do we cluster genes in microarrays?
- Euclidean distance: absolute distance between two points in space
- Correlation distance: similarity of the directions in which two vectors point
In microarrays: we cluster genes with similar expression patterns
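A minimal sketch of the two similarity measures, assuming correlation distance is taken as 1 minus the Pearson correlation (a common convention, not stated in the cards). The gene names and expression values are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Absolute (straight-line) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression values for two genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same pattern, 10x the magnitude

d_euclid = euclidean_distance(gene_a, gene_b)
d_corr = correlation_distance(gene_a, gene_b)
```

Here the Euclidean distance is large (about 49.3) because the magnitudes differ, while the correlation distance is 0 because both profiles point in the same direction.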
What are the 2 types of clustering algorithms?
- Hierarchical clustering
- Partitional clustering
Describe hierarchical clustering (4)
- Do not have to specify “k” number of clusters
- Deterministic: cluster together the two closest genes and then the next closest (always get the same answer when clustering the data)
- Clusters are hierarchical, based on similarity
- Dendrogram (tree) is generated
Describe partitional clustering
- Have to specify “k” number of clusters
- Nondeterministic: initialization of clusters is done randomly (can get different answers every time you cluster the data)
- All clusters and their elements are at the same level
What are the two types of hierarchical clustering? Describe them
- Agglomerative (bottom-up): start with single gene clusters and successively join the closest/most similar clusters until all genes belong to one super-cluster
- Divisive (top-down): start with one super-cluster and begin to split into smaller clusters based on similarity and repeat until you get single gene clusters
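The agglomerative (bottom-up) procedure can be sketched in plain Python as follows. This is an illustration only: it assumes Euclidean distance and single linkage (closest members) to compare clusters, and the gene names and profiles are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(profiles):
    """Bottom-up clustering; returns the merges as (cluster_a, cluster_b, distance)."""
    # Start with every gene in its own cluster.
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): distance between closest members.
                d = min(euclidean(profiles[g], profiles[h])
                        for g in clusters[i] for h in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Join the two closest clusters and record the merge distance.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical expression profiles (two conditions per gene).
profiles = {
    "geneA": [0.0, 0.0],
    "geneB": [0.0, 1.0],
    "geneC": [5.0, 0.0],
    "geneD": [5.0, 2.0],
}
merges = agglomerative(profiles)
```

The recorded merge distances (1.0, 2.0, then 5.0 for this data) are what a dendrogram would draw as node heights. Running it again on the same data gives the same merges, which is the deterministic behaviour described above.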
Dendrogram
Indicates the degree of similarity or distance between data points
True or false: There is biological rationale for a “genealogy” tree
False
- You can cluster all sorts of data into a tree, so a dendrogram does not by itself imply a biological relationship
Are dendrograms robust?
No, the branching structure is sensitive to particular algorithms and distance metrics used, and to the order of data entry
What does the height of a node in the dendrogram represent?
The distance between the two child clusters (how dissimilar they are)
Single linkage way to define the distance between two clusters
Uses the distance between the closest members of two clusters (nearest neighbors).
Complete linkage way to define the distance between two clusters
Uses the distance between the farthest members of two clusters.
Centroid linkage way to define the distance between two clusters
Measures the distance between the centroids (average position) of two clusters.
Average linkage way to define the distance between two clusters
Calculates the average of all distances between members of two clusters.
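The four linkage definitions can be compared side by side with a short sketch. It assumes Euclidean distance between members; the two clusters are hypothetical points chosen so that the four answers differ.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pairwise(c1, c2):
    """All distances between a member of c1 and a member of c2."""
    return [euclidean(p, q) for p in c1 for q in c2]

def single_linkage(c1, c2):
    return min(pairwise(c1, c2))      # closest members

def complete_linkage(c1, c2):
    return max(pairwise(c1, c2))      # farthest members

def average_linkage(c1, c2):
    d = pairwise(c1, c2)
    return sum(d) / len(d)            # mean of all member-to-member distances

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def centroid_linkage(c1, c2):
    return euclidean(centroid(c1), centroid(c2))  # distance between average positions

# Hypothetical clusters of two points each.
cluster_1 = [[0.0, 0.0], [0.0, 4.0]]
cluster_2 = [[3.0, 0.0], [3.0, 4.0]]

distances = {
    "single": single_linkage(cluster_1, cluster_2),
    "complete": complete_linkage(cluster_1, cluster_2),
    "average": average_linkage(cluster_1, cluster_2),
    "centroid": centroid_linkage(cluster_1, cluster_2),
}
```

For these points the single and centroid distances are both 3, the average is 4 and the complete distance is 5, so the choice of linkage can change which clusters get merged first.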
Describe K-means clustering and its steps (5 steps)
A NON-DETERMINISTIC partitioning method that subdivides objects or genes into a predetermined number (k) of clusters
1. Choose the number of clusters, k. The algorithm randomly selects k cluster centers (centroids)
2. Patterns are assigned to clusters based on the distance of each object to its nearest centroid
3. Now, the centroids are moved to the center of their respective clusters
4. Patterns are reassigned to clusters based on distance to the new centroids
5. Steps 3 & 4 are repeated until the assignment of patterns to clusters no longer changes
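The five steps can be sketched in plain Python. This is an illustration under stated assumptions: Euclidean distance, starting centroids drawn from the data points themselves, and a hypothetical `seed` parameter so a run can be repeated. The data points are made up.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(points, k, seed=None, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick k of the data points at random as the starting centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the centre of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return assignment, centroids

# Hypothetical data: two well-separated groups of three points.
points = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
          [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]]
assignment, centroids = k_means(points, k=2, seed=0)
```

With groups this far apart the same partition is recovered from any start, although which group is labelled 0 or 1 depends on the random initialization.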
Why is k-means clustering non-deterministic?
Non-deterministic because the initial centroids are chosen randomly: if the centroids start in different regions relative to the objects, the final clusters can differ
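A small sketch of this effect, assuming Euclidean distance and passing the starting centroids in explicitly instead of drawing them at random, so both outcomes can be reproduced. The four points are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda c: euclidean(point, centroids[c]))

def run_k_means(points, start_centroids, max_iter=100):
    """k-means iterations from explicitly given starting centroids."""
    centroids = [list(c) for c in start_centroids]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # no point changed cluster
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Hypothetical data: the corners of a 4 x 2 rectangle.
points = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]

# Start with one centroid on the left side and one on the right side.
result_a = run_k_means(points, [points[0], points[2]])
# Start with both centroids on the left side.
result_b = run_k_means(points, [points[0], points[1]])
```

The first start ends with a left/right split (`[0, 0, 1, 1]`), the second with a bottom/top split (`[0, 1, 0, 1]`). Both are stable, so the final answer depends on where the centroids began.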
What is two dimensional hierarchical clustering and what is it used for?
2D clustering is the process of grouping together similar entities/patterns in both rows and columns of a dataset
- Identifying co-regulated genes under certain environmental conditions
In 2D hierarchical clustering, what does a vertical dendrogram show? A horizontal dendrogram?
What is generated with rearranged values that correspond to the clustergram?
Vertical dendrogram: Shows similarity of rows (genes)
Horizontal dendrogram: Shows similarity of columns (experiments)
- A matrix is generated with rearranged values that correspond to the clustergram
Gene ontology
Collaborative effort to address the need for consistent descriptions of gene products in different databases
In gene ontology, what three main categories are genes assigned into?
- Biological process (e.g. DNA replication, transcription, etc.)
- Molecular function (i.e. what type of protein is encoded? kinase, part of a complex, etc.)
- Cellular component (i.e. where does the protein exist in the cell, which organelle?)
What are genome browsers?
Graphical interface that displays information from a database of genome data
- Includes descriptions of genes (gene ontology, chromosome coordinates, nucleotide and amino acid sequences, expression, mutant, phenotype, genetic interaction)
Genes that are expressed at the same time most likely…
Have the same function and need to be regulated together
- Uncharacterized genes are predicted to perhaps have the same function as the annotated genes within the same cluster of co-expressed genes (guilt-by-association)
Functional enrichment
A particular function is overrepresented among co-expressed genes compared with what is expected (% of genes in the genome with the same function)
How is functional enrichment determined?
- Input clusters of co-expressed genes into Princeton GO-Term Finder, which then searches for functional enrichment.
- A functional enrichment probability (P-value) is calculated based on the hypergeometric distribution (P<0.01 is considered significant)
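A hedged sketch of a hypergeometric enrichment P-value: the probability of seeing at least the observed number of annotated genes in a cluster drawn at random from the genome. The function name, its parameters and the example counts are hypothetical, and this is not a description of how GO-Term Finder is implemented internally.

```python
from math import comb

def enrichment_p_value(genome_size, genome_with_term, cluster_size, cluster_with_term):
    """P(X >= cluster_with_term) when cluster_size genes are drawn without replacement."""
    total = comb(genome_size, cluster_size)
    upper = min(cluster_size, genome_with_term)
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, cluster_size - k)
        for k in range(cluster_with_term, upper + 1)
    )
    return tail / total

# Hypothetical counts: 60 of 6000 genes in the genome carry the GO term (1%),
# but 10 of the 50 genes in the co-expressed cluster carry it (20%).
p = enrichment_p_value(genome_size=6000, genome_with_term=60,
                       cluster_size=50, cluster_with_term=10)
is_significant = p < 0.01
```

With these made-up counts the expected number of annotated genes in the cluster is 0.5, so observing 10 gives a P-value far below the 0.01 cutoff.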
Describe how hierarchical clustering of microarray data helps with tumour classification
Hierarchical clustering of microarray data determines specific gene profiles and tumour classification
- Microarray data can distinguish between different tumour types based on which types of genes are expressed
- Can predict the survival of a patient based on how the tumour clusters (e.g. if the patient has proliferation cell set highly expressed, then they will have a poor prognosis, aka low survival)
What are high density microarrays and what are two examples of them?
Microarrays carrying several million probes allow a more thorough investigation of the genome than just measuring expression of known genes.
1. ORF microarray
2. Tiling microarray
ORF microarray and limitation
Probing for sequences of known and predicted genes
Limitation: non-coding genes are harder to find because there are no start/stop codons, splicing sites or exons. The known and predicted genes are unlikely to be all the genes in the genome, so we’re limited to only the genes that we know about.
Tiling microarray and limitation
Cover the entire genome, including non-coding regions, at regular intervals (high resolution)
Limitation: Resolution isn’t perfect compared to RNA-seq. If probes are positioned 300 nucleotides apart, the nucleotides between probes are not measured.
What do tiling microarrays allow for in terms of UTRs?
Allows for defining the length of 5’-UTRs and 3’-UTRs for every mRNA
What can comparing the signal intensities on a tiling array to annotated ORFs allow?
Allows researchers to determine whether transcription is happening where they expect (inside ORFs) or if there’s transcription in regions not previously annotated, indicating the presence of novel transcripts or non-coding RNAs
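As a rough illustration of this comparison, the sketch below flags probes with signal that fall outside every annotated ORF. The coordinates, intensities, threshold and function names are all hypothetical; real analyses also have to deal with noise, strand and probe spacing.

```python
def is_inside(position, intervals):
    """True if the position falls within any (start, end) interval."""
    return any(start <= position <= end for start, end in intervals)

def unannotated_transcription(probes, orfs, threshold):
    """Positions of probes with signal at or above threshold that lie outside every ORF."""
    return [pos for pos, signal in probes
            if signal >= threshold and not is_inside(pos, orfs)]

# Hypothetical annotated ORFs as (start, end) chromosome coordinates.
orfs = [(1000, 2500), (6000, 7200)]

# Hypothetical tiling probes as (position, signal intensity).
probes = [(900, 0.2), (1200, 5.1), (1500, 4.8), (2700, 0.3),
          (4200, 3.9), (4500, 4.2), (6300, 6.0)]

candidates = unannotated_transcription(probes, orfs, threshold=2.0)
```

Here the probes at positions 4200 and 4500 show signal between the two ORFs, which would mark that region as a candidate novel transcript or non-coding RNA.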
UTR function
5’-UTRs and 3’-UTRs contain sequences that regulate mRNA stability, localization and translation (post-transcriptional and translational level of gene regulation)
Longer 3’UTR are found in which types of transcripts?
Found in transcripts that undergo more post-transcriptional regulation (e.g. cell cycle, ion transport, plasma membrane and mitochondrial genes, which require more complex regulation)
Shorter 3’UTR found in which types of transcripts?
Found in transcripts that have reduced need for posttranscriptional regulation (e.g. housekeeping genes that are always turned on, like ribosome genes, glycolysis genes, etc.)
Tiling microarrays led to the discovery of what type of RNAs?
Long noncoding RNAs (lncRNA)
LncRNA
Noncoding RNA molecules longer than 200 nucleotides (different from miRNA, siRNA, snoRNA, etc.)
- Involved in different levels of gene regulation and numerous diseases
Hox transcript antisense RNA (example of LncRNA) function
Silences transcription across 40 kb of the HOXD locus by recruiting PRC2 and inducing a repressive chromatin state
HOTAIR in breast cancer cells
LncRNA HOTAIR reprograms chromatin state to promote cancer metastasis
- HOTAIR is upregulated in breast cancer cells, is a good prognostic marker for metastasis and survival, and promotes invasion of breast carcinoma cells in mice
- Knocking out HOTAIR in mice decreases metastasis
- Can use HOTAIR as a potential gene target
RNA-Sequencing
- Whole transcriptome shotgun sequencing
- Is the current approach to determining the transcriptome
- Higher resolution (1 nucleotide: it can measure RNA sequences at the level of individual nucleotides, the building blocks of RNA), less RNA sample required, and reasonable cost compared to tiling microarrays
What do short reads prevent in RNA-sequencing?
Short reads prevent complete determination of transcriptome for large genomes with no reference genome available.
- People now combine short reads and long reads (2nd gen and 3rd gen sequencing)