DNA microarrays 2 Flashcards

1
Q

What is cluster analysis useful for?

A

Reduces the number of data points by clustering/grouping objects (e.g. genes) together based on their similarity to each other

2
Q

What are the two measures of similarity in cluster analysis? How do we cluster genes in microarrays?

A
  1. Euclidean distance: absolute distance between two points in space
  2. Correlation distance: similarity of the directions in which two vectors point

In microarrays: we cluster genes with similar expression profiles across conditions
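The two measures can be compared on a small example. This is a minimal sketch that assumes correlation distance is defined as 1 minus the Pearson correlation (one common definition; the card does not specify a formula). The gene profiles are made-up values.

```python
import math

def euclidean_distance(a, b):
    """Absolute distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """1 - Pearson correlation: close to 0 when two profiles share a trend."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (norm_a * norm_b)

# Hypothetical expression values of two genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same trend, much larger magnitude

print(euclidean_distance(gene_a, gene_b))    # large: the magnitudes differ
print(correlation_distance(gene_a, gene_b))  # about 0: the trends match
```

The two genes are far apart by Euclidean distance but essentially identical by correlation distance, which is why the choice of measure changes which genes end up clustered together.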

3
Q

What are the 2 types of clustering algorithms?

A
  1. Hierarchical clustering
  2. Partitional clustering
4
Q

Describe hierarchical clustering (4)

A
  • Do not have to specify “k” number of clusters
  • Deterministic: cluster together the two closest genes and then the next closest (always get the same answer when clustering the data)
  • Clusters are hierarchical, based on similarity
  • Dendrogram (tree) is generated
5
Q

Describe partitional clustering

A
  • Have to specify “k” number of clusters
  • Nondeterministic: initialization of clusters is done randomly (can get different answers every time you cluster the data)
  • All clusters and their elements are at the same level
6
Q

What are the two types of hierarchical clustering? Describe them

A
  1. Agglomerative (bottom-up): start with single-gene clusters and successively join the closest/most similar clusters until all genes belong to one super-cluster
  2. Divisive (top-down): start with one super-cluster, split it into smaller clusters based on similarity, and repeat until every cluster holds a single gene
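The agglomerative (bottom-up) variant can be sketched in a few lines. This is an illustrative, unoptimized version that assumes Euclidean distance and single linkage; the gene names and coordinates are hypothetical.

```python
import math

def agglomerative(points):
    """Bottom-up clustering with single linkage; returns the merge history.

    Each entry is (cluster_a, cluster_b, height): two tuples of point
    names and the distance at which they were joined.
    """
    clusters = [(name,) for name in sorted(points)]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return history

# Hypothetical genes described by expression in two conditions.
genes = {"g1": (0.0, 0.0), "g2": (0.0, 1.0), "g3": (5.0, 0.0), "g4": (5.0, 3.0)}
for left, right, height in agglomerative(genes):
    print(left, right, height)
```

Running it on the same data always gives the same merge order, and the recorded heights are what a dendrogram would draw as node heights.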
7
Q

Dendrogram

A

Indicates the degree of similarity or distance between data points

8
Q

True or false: There is a biological rationale for a “genealogy” tree

A

False
- You can cluster all sorts of data, so the tree need not reflect any biological relationship

9
Q

Are dendrograms robust?

A

No, the branching structure is sensitive to particular algorithms and distance metrics used, and to the order of data entry

10
Q

What does the height of a node in the dendrogram represent?

A

The distance between its two child clusters (how dissimilar they are)

11
Q

Single linkage way to define the distance between two clusters

A

Uses the distance between the closest members of two clusters (nearest neighbors).

12
Q

Complete linkage way to define the distance between two clusters

A

Uses the distance between the farthest members of two clusters.

13
Q

Centroid linkage way to define the distance between two clusters

A

Measures the distance between the centroids (average position) of two clusters.

14
Q

Average linkage way to define the distance between two clusters

A

Calculates the average of all distances between members of two clusters.
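The four linkage definitions (single, complete, centroid, average) can be compared side by side. This is a small sketch assuming Euclidean distance between members; the two clusters are made-up points.

```python
import math
from itertools import product

def single_linkage(c1, c2):
    """Distance between the closest members of the two clusters."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_linkage(c1, c2):
    """Distance between the farthest members of the two clusters."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

def average_linkage(c1, c2):
    """Mean of all pairwise distances between members."""
    dists = [math.dist(a, b) for a, b in product(c1, c2)]
    return sum(dists) / len(dists)

def centroid(cluster):
    """Average position of the members of one cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def centroid_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(c1), centroid(c2))

# Two hypothetical clusters of two points each.
cluster_1 = [(0.0, 0.0), (0.0, 4.0)]
cluster_2 = [(3.0, 0.0), (3.0, 4.0)]

print(single_linkage(cluster_1, cluster_2))    # 3.0
print(complete_linkage(cluster_1, cluster_2))  # 5.0
print(average_linkage(cluster_1, cluster_2))   # 4.0
print(centroid_linkage(cluster_1, cluster_2))  # 3.0
```

The same pair of clusters gets a different distance under each definition, which is one reason the resulting tree depends on the linkage chosen.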

15
Q

Describe K-means clustering and its steps (5 steps)

A

A NON-DETERMINISTIC partitioning method that subdivides objects or genes into a predetermined number (k) of clusters
1. Choose the number of clusters, k. The algorithm randomly selects k cluster centers (centroids)
2. Each pattern is assigned to the cluster whose centroid is nearest to it
3. The centroids are moved to the center (mean) of their respective clusters
4. Patterns are reassigned to clusters based on their distance to the new centroids
5. Steps 3 & 4 are repeated until the cluster assignments no longer change
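The five steps can be sketched as follows. This is a minimal illustration, not a production implementation: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and uses a hypothetical `seed` parameter so a run can be repeated.

```python
import math
import random

def kmeans(points, k, seed=None, max_iter=100):
    """Minimal k-means on a list of coordinate tuples."""
    rng = random.Random(seed)
    # Step 1: randomly select k data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no point changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids

# Two hypothetical, well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2, seed=0)
print(assignment)  # which group receives label 0 depends on the random start
print(centroids)
```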

16
Q

Why is k-means clustering non-deterministic?

A

Because the initial centroids are placed at random: if they start in different regions, the algorithm can converge to a different final set of clusters
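The effect of the starting position can be shown by fixing the initial centroids by hand instead of drawing them at random. In this sketch, the function name `lloyd` and the four points are illustrative assumptions; the same data gives two different, equally stable answers.

```python
import math

def lloyd(points, centroids, max_iter=100):
    """Run the k-means update loop from the given starting centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment

# Four hypothetical points at the corners of a 4 x 1 rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

left_right = lloyd(points, [(0.0, 0.5), (4.0, 0.5)])
top_bottom = lloyd(points, [(2.0, 0.0), (2.0, 1.0)])
print(left_right)  # [0, 0, 1, 1]
print(top_bottom)  # [0, 1, 0, 1]
```

Both results are converged solutions, so a random initialization can land on either one.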

17
Q

What is two-dimensional hierarchical clustering and what is it used for?

A

2D clustering is the process of grouping together similar entities/patterns in both rows and columns of a dataset
- Identifying co-regulated genes under certain environmental conditions

18
Q

In 2D hierarchical clustering, what does a vertical dendrogram show? A horizontal dendrogram?
What is generated with rearranged values that correspond to the clustergram?

A

Vertical dendrogram: Shows similarity of rows (genes)
Horizontal dendrogram: Shows similarity of columns (experiments)
- A matrix is generated with rearranged values that correspond to the clustergram
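The rearranged matrix can be illustrated with a toy example. The leaf orders below are assumed, as if they had been read off the row and column trees; the clustering itself is not performed here, and all names and values are hypothetical.

```python
def reorder(matrix, row_order, col_order):
    """Rearrange rows and columns to follow the two leaf orders."""
    return [[matrix[r][c] for c in col_order] for r in row_order]

genes = ["geneA", "geneB", "geneC", "geneD"]
experiments = ["exp1", "exp2", "exp3"]
expression = [
    [2.0, -1.5, 1.8],    # geneA
    [-1.0, 2.2, -0.9],   # geneB
    [1.9, -1.2, 2.1],    # geneC
    [-1.1, 1.8, -1.3],   # geneD
]

# Assumed leaf orders: similar genes and similar experiments sit side by side.
row_order = [0, 2, 1, 3]  # geneA, geneC, geneB, geneD
col_order = [0, 2, 1]     # exp1, exp3, exp2

clustered = reorder(expression, row_order, col_order)
for r, row in zip(row_order, clustered):
    print(genes[r], row)
```

After reordering, rows and columns with similar values are adjacent, which is what makes blocks of co-regulated genes visible in the heat map.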

19
Q

Gene ontology

A

Collaborative effort to address the need for consistent descriptions of gene products in different databases

20
Q

In gene ontology, what three main categories are genes assigned into?

A
  1. Biological process (e.g. DNA replication, transcription, etc.)
  2. Molecular function (i.e. what type of protein is encoded? kinase, part of a complex, etc.)
  3. Cellular component (i.e. where does the protein exist in the cell, which organelle?)
21
Q

What are genome browsers?

A

Graphical interface that displays information from a database of genome data
- Includes descriptions of genes (gene ontology, chromosome coordinates, nucleotide and amino acid sequences, expression, mutant, phenotype, genetic interaction)

22
Q

Genes that are expressed at the same time most likely…

A

Have the same function and need to be regulated together
- Uncharacterized genes are predicted to perhaps have the same function as the annotated genes within the same cluster of co-expressed genes (guilt-by-association)

23
Q

Functional enrichment

A

A particular function is overrepresented among a group of co-expressed genes compared with what is expected by chance (the % of genes in the genome with the same function)

24
Q

How is functional enrichment determined?

A
  • Input clusters of co-expressed genes into Princeton GO-Term Finder, which then searches for functional enrichment.
  • A functional enrichment probability (P-value) is calculated based on the hypergeometric distribution (P<0.01 is considered significant)
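A hypergeometric enrichment P-value can be computed directly. This is a sketch of the underlying calculation only, not of GO-Term Finder itself; the function name and the gene counts are hypothetical, and the multiple-testing corrections that such tools typically apply are left out.

```python
from math import comb

def enrichment_p_value(genome_size, genome_hits, cluster_size, cluster_hits):
    """P(X >= cluster_hits) when cluster_size genes are drawn without
    replacement from a genome in which genome_hits genes carry the term."""
    total = comb(genome_size, cluster_size)
    upper = min(cluster_size, genome_hits)
    tail = sum(
        comb(genome_hits, i) * comb(genome_size - genome_hits, cluster_size - i)
        for i in range(cluster_hits, upper + 1)
    )
    return tail / total

# Hypothetical numbers: 6000 genes in the genome, 100 annotated with a
# given function, and a 50-gene cluster containing 10 of them.
p = enrichment_p_value(6000, 100, 50, 10)
print(p, "significant" if p < 0.01 else "not significant")
```

By chance, a 50-gene cluster would be expected to contain fewer than one such gene (50 × 100 / 6000), so 10 hits gives a P-value far below 0.01.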
25
Q

Describe how hierarchical clustering of microarray data helps with tumour classification

A

Hierarchical clustering of microarray data determines specific gene profiles and tumour classification
- Microarray data can distinguish between different tumour types based on which types of genes are expressed
- Can predict the survival of a patient based on how the tumour clusters (e.g. if the proliferation gene set is highly expressed in the tumour, the patient has a poor prognosis, i.e. low survival)

26
Q

What are high-density microarrays and what are two examples of them?

A

Several million probes per microarray allow a more thorough investigation of the genome than just measuring the expression of known genes.
1. ORF microarray
2. Tiling microarray

27
Q

ORF microarray and limitation

A

Probing for sequences of known and predicted genes
Limitation: non-coding genes are harder to find because they have no start/stop codons, splicing sites or exons. The known and predicted genes are unlikely to be all the genes in the genome, so we are limited to the genes that we already know about.

28
Q

Tiling microarray and limitation

A

Cover the entire genome, including non-coding regions, at regular intervals (high resolution)
Limitation: Resolution isn’t perfect compared to RNA-seq. If probes are positioned 300 nucleotides apart, the nucleotides between probes are not interrogated.
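The resolution limit can be put into numbers. This is back-of-the-envelope arithmetic that assumes "300 nucleotides apart" means start-to-start spacing and that each probe is 60 nucleotides long; both values, and the 3000-nucleotide region, are assumptions for illustration.

```python
def tiling_coverage(genome_length, probe_length, step):
    """Return (number of probes, fraction of bases covered by a probe).

    Probes start at positions 0, step, 2*step, ... and each spans
    probe_length bases; a probe must fit entirely inside the genome.
    """
    starts = range(0, genome_length - probe_length + 1, step)
    covered = set()
    for s in starts:
        covered.update(range(s, s + probe_length))
    return len(starts), len(covered) / genome_length

# Hypothetical 3000-nt region tiled with 60-nt probes every 300 nt.
print(tiling_coverage(3000, 60, 300))  # (10, 0.2)
# The same region tiled end to end, with no gaps between probes.
print(tiling_coverage(3000, 60, 60))   # (50, 1.0)
```

Under these assumptions only a fifth of the bases sit under a probe, so transcript boundaries can only be placed to within the gap between neighbouring probes.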

29
Q

What do tiling microarrays allow for in terms of UTRs?

A

Allows for defining the length of 5’-UTRs and 3’-UTRs for every mRNA
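Once the transcribed region is known, the UTR lengths follow from simple subtraction. This sketch assumes a forward-strand transcript without introns, with the transcript boundaries taken from the tiling-array signal and the ORF boundaries from the annotation; all coordinates are hypothetical.

```python
def utr_lengths(transcript_start, transcript_end, orf_start, orf_end):
    """Lengths of the 5' and 3' UTRs for a forward-strand transcript.

    Coordinates are 1-based and inclusive. Introns are ignored.
    """
    if not (transcript_start <= orf_start <= orf_end <= transcript_end):
        raise ValueError("ORF must lie inside the transcript")
    five_prime = orf_start - transcript_start
    three_prime = transcript_end - orf_end
    return five_prime, three_prime

# Hypothetical transcript at 1000-2500 containing an ORF at 1080-2300.
print(utr_lengths(1000, 2500, 1080, 2300))  # (80, 200)
```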

30
Q

What can comparing the signal intensities on a tiling array to annotated ORFs allow?

A

Allows researchers to determine whether transcription is happening where they expect (inside ORFs) or if there’s transcription in regions not previously annotated, indicating the presence of novel transcripts or non-coding RNAs
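This comparison can be sketched as a simple filter. The probe positions, intensities, ORF coordinates and the background threshold below are all hypothetical; a real analysis would choose the threshold from the data.

```python
def classify_probes(probes, orfs, threshold):
    """Split expressed probes into those inside and outside annotated ORFs.

    probes: list of (position, intensity); orfs: list of (start, end), inclusive.
    """
    inside, outside = [], []
    for position, intensity in probes:
        if intensity < threshold:
            continue  # treated as background, i.e. not transcribed
        in_orf = any(start <= position <= end for start, end in orfs)
        (inside if in_orf else outside).append(position)
    return inside, outside

probes = [(100, 0.2), (400, 5.1), (700, 4.8), (1000, 0.3), (1300, 3.9), (1600, 0.1)]
orfs = [(350, 800)]

expected, novel = classify_probes(probes, orfs, threshold=1.0)
print(expected)  # [400, 700]
print(novel)     # [1300]
```

The probe at position 1300 shows signal outside any annotated ORF, which makes it a candidate novel transcript or non-coding RNA.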

31
Q

UTR function

A

5’-UTRs and 3’-UTRs contain sequences that regulate mRNA stability, localization and translation (post-transcriptional and translational level of gene regulation)

32
Q

Longer 3’-UTRs are found in which types of transcripts?

A

Found in transcripts that undergo more post-transcriptional regulation (e.g. genes involved in the cell cycle, ion transport, the plasma membrane and mitochondria, which all require more complex regulation)

33
Q

Shorter 3’-UTRs are found in which types of transcripts?

A

Found in transcripts that have a reduced need for post-transcriptional regulation (e.g. housekeeping genes that are always turned on, like ribosome genes, glycolysis genes, etc.)

34
Q

Tiling microarrays led to the discovery of what type of RNAs?

A

Long noncoding RNAs (lncRNA)

35
Q

LncRNA

A

Noncoding RNA molecules longer than 200 nucleotides (different from miRNA, siRNA, snoRNA, etc.)
- Involved in different levels of gene regulation and numerous diseases

36
Q

Hox transcript antisense RNA (example of LncRNA) function

A

Silences transcription across 40 kb of the HOXD locus by recruiting PRC2 and inducing a repressive chromatin state

37
Q

HOTAIR in breast cancer cells

A

LncRNA HOTAIR reprograms chromatin state to promote cancer metastasis
- HOTAIR is upregulated in breast cancer cells, is a good prognostic marker for metastasis and survival, and promotes invasion of breast carcinoma cells in mice
- Knocking out HOTAIR in mice decreases metastasis
- Can use HOTAIR as a potential gene target

38
Q

RNA-Sequencing

A
  • Whole transcriptome shotgun sequencing
  • Is the current approach to determine the transcriptome
  • Higher resolution (1 nucleotide: it can measure RNA sequences at the level of individual nucleotides, the building blocks of RNA), less RNA sample required, and cost is reasonable compared to tiling microarrays
39
Q

What do short reads prevent in RNA-sequencing?

A

Short reads prevent complete determination of transcriptome for large genomes with no reference genome available.
- Short reads and long reads (2nd- and 3rd-generation sequencing) are now often combined