DNA microarrays 2 Flashcards by Doreen Amini

What is cluster analysis useful for?

Reduces the number of data points by clustering/grouping objects (e.g. genes) together based on their similarity to each other

What are the two measures of similarity in cluster analysis? How do we cluster genes in microarrays?

Euclidean distance: absolute distance between two points in space
Correlation distance: similarity of the directions in which two vectors point

In microarrays: we cluster genes with similar expression profiles
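
The difference between the two measures can be sketched in a few lines of Python. This is only an illustration: the expression values are made up, and correlation distance is assumed here to mean 1 minus the Pearson correlation, which is a common definition but is not stated on the card.

```python
import math

def euclidean_distance(a, b):
    """Absolute distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation of two equal-length profiles."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Assumed definition: 1 - Pearson r (0 = same trend, 2 = opposite trend)."""
    return 1.0 - pearson_correlation(a, b)

# Hypothetical expression values across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend as gene_a, 10x the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # opposite trend, similar magnitude

print(euclidean_distance(gene_a, gene_b), correlation_distance(gene_a, gene_b))
print(euclidean_distance(gene_a, gene_c), correlation_distance(gene_a, gene_c))
```

Under Euclidean distance gene_a is closer to gene_c, while under correlation distance it is closer to gene_b, so the choice of measure decides which genes count as "similar".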

What are the 2 types of clustering algorithms?

Hierarchical clustering
Partitional clustering

Describe hierarchical clustering (4)

Do not have to specify “k” number of clusters
Deterministic: cluster together the two closest genes and then the next closest (always get the same answer when clustering the data)
Clusters are hierarchical, based on similarity
A dendrogram (tree) is generated

Describe partitional clustering

Have to specify “k” number of clusters
Nondeterministic: clusters are initialized randomly (you may get different answers each time you cluster the data)
All clusters and their elements are at the same level

What are the two types of hierarchical clustering? Describe them

Agglomerative (bottom-up): start with single-gene clusters and successively join the closest/most similar clusters until all genes belong to one super-cluster
Divisive (top-down): start with one super-cluster, split it into smaller clusters based on similarity, and repeat until you reach single-gene clusters
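
The agglomerative (bottom-up) procedure can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance and single linkage, the function name `agglomerate` is hypothetical, and the one-dimensional "profiles" are invented. Each recorded merge distance corresponds to the height of a node in the resulting tree.

```python
import math

def agglomerate(points):
    """Bottom-up clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    in the order the merges happen.
    """
    clusters = [[p] for p in points]  # start with one cluster per gene
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional expression values for four genes.
profiles = [(0.0,), (1.0,), (5.0,), (5.5,)]
for left, right, height in agglomerate(profiles):
    print(left, "+", right, "merged at distance", height)
```

With the same input order the merges always come out the same, which is the deterministic behaviour described for hierarchical clustering.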

Dendrogram

Indicates the degree of similarity or distance between data points

True or false: There is a biological rationale for a “genealogy” tree

False
- You can cluster all sorts of data; the tree reflects similarity, not ancestry

Are dendrograms robust?

No, the branching structure is sensitive to particular algorithms and distance metrics used, and to the order of data entry

What does the height of a node in the dendrogram represent?

The distance between its two child clusters (how dissimilar they are)

Single linkage way to define the distance between two clusters

Uses the distance between the closest members of two clusters (nearest neighbors).

Complete linkage way to define the distance between two clusters

Uses the distance between the farthest members of two clusters.

Centroid linkage way to define the distance between two clusters

Measures the distance between the centroids (average positions) of two clusters.

Average linkage way to define the distance between two clusters

Calculates the average of all distances between members of two clusters.
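
The four linkage definitions (single, complete, centroid, average) can be compared side by side on the same pair of clusters. The sketch below assumes Euclidean distance between members, and the two clusters are made-up two-dimensional points chosen so that all four linkages give different values.

```python
import math

def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Distance between the farthest members of the two clusters."""
    return max(math.dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Average of all pairwise distances between members of the two clusters."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def centroid(cluster):
    """Average position of a cluster's members."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the centroids of the two clusters."""
    return math.dist(centroid(c1), centroid(c2))

# Hypothetical clusters of two-dimensional points.
cluster_1 = [(0.0, 0.0), (0.0, 4.0)]
cluster_2 = [(3.0, 0.0), (6.0, 4.0)]

for name, linkage in [("single", single_linkage), ("complete", complete_linkage),
                      ("centroid", centroid_linkage), ("average", average_linkage)]:
    print(name, round(linkage(cluster_1, cluster_2), 3))
```

Single linkage always gives the smallest value and complete linkage the largest, with average linkage in between, which is one reason the same data can produce different trees under different linkages.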

Describe K-means clustering and its steps (5 steps)

A NON-DETERMINISTIC partitioning method that subdivides objects or genes into a predetermined number (k) of clusters
1. Choose the number of clusters, k. The algorithm randomly selects k cluster centers (centroids)
2. Each pattern is assigned to the cluster whose centroid is nearest to it
3. The centroids are moved to the center of their respective clusters
4. Patterns are reassigned to clusters based on their distance to the new centroids
5. Steps 3 & 4 are repeated until the assignment of patterns to clusters no longer changes
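
The five steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: Euclidean distance, initial centroids drawn at random from the data points, and an optional `seed` parameter (not part of the card) so that a run can be repeated. The function name `k_means` and the data are hypothetical.

```python
import math
import random

def k_means(points, k, seed=None, max_iter=100):
    """Return (assignments, centroids) for a simple k-means run."""
    rng = random.Random(seed)
    # Step 1: randomly pick k starting centroids (here: k of the data points).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the center of its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids

# Hypothetical two-dimensional data with two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = k_means(points, k=2, seed=0)
print(labels)
print(centers)
```

Without a fixed seed the starting centroids differ from run to run, so the cluster labels, and on less clearly separated data the cluster memberships themselves, can change between runs.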

Why is k-means clustering non-deterministic?

Non-deterministic because the initial centroids are chosen randomly: if the centroids start in different regions relative to the objects, the final clusters can differ

What is two dimensional hierarchical clustering and what is it used for?

2D clustering is the process of grouping together similar entities/patterns in both rows and columns of a dataset
- Used to identify genes that are co-regulated under certain environmental conditions

In 2D hierarchical clustering, what does a vertical dendrogram show? A horizontal dendrogram?
What is generated with rearranged values that correspond to the clustergram?

Vertical dendrogram: Shows similarity of rows (genes)
Horizontal dendrogram: Shows similarity of columns (experiments)
- A matrix is generated with rearranged values that correspond to the clustergram

Gene ontology

Collaborative effort to address the need for consistent descriptions of gene products in different databases

In gene ontology, what three main categories are genes assigned into?

Biological process (e.g. DNA replication, transcription, etc.)
Molecular function (i.e. what type of protein is encoded? kinase, part of a complex, etc.)
Cellular component (i.e. where does the protein exist in the cell, which organelle?)

What are genome browsers?

Graphical interface that displays information from a database of genome data
- Includes descriptions of genes (gene ontology, chromosome coordinates, nucleotide and amino acid sequences, expression, mutant, phenotype, genetic interaction)

Genes that are expressed at the same time most likely…

Have the same function and need to be regulated together
- Uncharacterized genes are predicted to perhaps have the same function as the annotated genes within the same cluster of co-expressed genes (guilt-by-association)

Functional enrichment

A particular function is overrepresented among a set of co-expressed genes, i.e. more of them share it than expected (compared to the % of genes in the genome with the same function)

How is functional enrichment determined?

Input clusters of co-expressed genes into Princeton GO-Term Finder, which then searches for functional enrichment.
A functional enrichment probability (P-value) is calculated based on the hypergeometric distribution (P<0.01 is considered significant)
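
The hypergeometric calculation can be sketched as below. This is an illustration of the statistic only, not the GO-Term Finder implementation; the function name, parameter names, and gene counts are all hypothetical. It computes the probability of seeing at least the observed number of annotated genes in a cluster if the cluster were drawn at random from the genome.

```python
from math import comb

def enrichment_p_value(genome_size, genome_hits, cluster_size, cluster_hits):
    """Hypergeometric upper tail: P(at least cluster_hits annotated genes).

    genome_size  -- total number of genes in the genome
    genome_hits  -- genes in the genome annotated with the function
    cluster_size -- genes in the co-expressed cluster
    cluster_hits -- genes in the cluster annotated with the function
    """
    max_hits = min(cluster_size, genome_hits)
    total = comb(genome_size, cluster_size)
    tail = sum(
        comb(genome_hits, x) * comb(genome_size - genome_hits, cluster_size - x)
        for x in range(cluster_hits, max_hits + 1)
    )
    return tail / total

# Made-up numbers: 1% of a 6000-gene genome has the annotation,
# but 10 of the 50 genes in the cluster (20%) have it.
p = enrichment_p_value(genome_size=6000, genome_hits=60,
                       cluster_size=50, cluster_hits=10)
print(p, "significant" if p < 0.01 else "not significant")
```

The P<0.01 threshold in the last line is the one given on the card; whether any multiple-testing correction is applied on top of it is not covered here.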

Describe how hierarchical clustering of microarray data helps with tumour classification

Hierarchical clustering of microarray data determines specific gene profiles and tumour classification
- Microarray data can distinguish between different tumour types based on which types of genes are expressed
- Can predict the survival of a patient based on how the tumour clusters (e.g. if the patient has the proliferation cell set highly expressed, then they will have a poor prognosis, aka low survival)

What are high density tiling microarrays and what are two examples of them?

Microarrays with several million probes each, which allow a more thorough investigation of the genome than just measuring the expression of known genes.
1. ORF microarray
2. Tiling microarray

ORF microarray and limitation

Probes for the sequences of known and predicted genes
Limitation: non-coding genes are harder to find because there are no start/stop codons, splicing sites or exons. The known and predicted genes cannot be all the genes in the genome, so we're limited to only the genes that we know about.

Tiling microarray and limitation

Probes cover the entire genome, including non-coding regions, at regular intervals (high resolution)
Limitation: Resolution is lower than RNA-seq. If probes are positioned 300 nucleotides apart, the nucleotides between them are not probed.

What do tiling microarrays allow for in terms of UTRs?

Allows for defining the length of 5'-UTRs and 3'-UTRs for every mRNA

What can comparing the signal intensities on a tiling array to annotated ORFs allow?

Allows for researchers to determine whether transcription is happening where they expect (inside ORFs) or if there's transcription in regions not previously annotated, indicating the presence of novel transcripts or non-coding RNAs

UTR function

5'-UTRs and 3'-UTRs contain sequences that regulate mRNA stability, localization and translation (post-transcriptional and translational level of gene regulation)

Longer 3'UTR are found in which types of transcripts?

Found in transcripts that undergo more post-transcriptional regulation (e.g. cell cycle, ion transport, plasma membrane and mitochondria genes, which all require more complex regulation)

Shorter 3'UTR found in which types of transcripts?

Found in transcripts that have reduced need for posttranscriptional regulation (e.g. housekeeping genes that are always turned on, like ribosome genes, glycolysis genes, etc.)

Tiling microarrays led to the discovery of what type of RNAs?

Long noncoding RNAs (lncRNA)

LncRNA

Noncoding RNA molecules longer than 200 nucleotides (different from miRNA, siRNA, snoRNA, etc.)
- Involved in different levels of gene regulation and numerous diseases

Hox transcript antisense RNA (example of LncRNA) function

Silences transcription across 40 kb of the HOXD locus by recruiting PRC2 and inducing a repressive chromatin state

HOTAIR in breast cancer cells

LncRNA HOTAIR reprograms chromatin state to promote cancer metastasis
- HOTAIR is upregulated in breast cancer cells, is a good prognostic marker for metastasis and survival, and promotes invasion of breast carcinoma cells in mice
- Knocking out HOTAIR in mice decreases metastasis
- HOTAIR can be used as a potential gene target

RNA-Sequencing

- Whole transcriptome shotgun sequencing
- The current approach for determining the transcriptome
- Higher resolution (1 nucleotide: it measures RNA sequences at the level of individual nucleotides, the building blocks of RNA), less RNA sample required, and a reasonable cost compared to tiling microarrays

What do short reads prevent in RNA-sequencing?

Short reads prevent complete determination of the transcriptome for large genomes with no reference genome available.
- People now combine short reads and long reads (2nd gen and 3rd gen sequencing)