Lecture 10 Flashcards

1
Q

What is the ultimate crime in bioinformatics?

A

not using existing resources
*Excel

2
Q

Name a tool for Data Manipulation

A

BioConductor

3
Q

What is BioConductor?

A

-open-source suite of programs for gene expression profiling analysis
-Runs in R statistical language

4
Q

What can you do with 13 lines in BioConductor?

A
  1. GC arrays
  2. Identify significantly differentially expressed genes
  3. Display in a heatmap
5
Q

Why is gene expression profiling important?

A
  1. Only 40-60% of genes identified in genome sequencing projects are functionally annotated by sequence similarity (lineage and species-specific genes)
  2. sequence similarity will not identify novel functions of proteins
  3. genes involved in regulation, interaction, or integration of pathways
  4. genes expressed at low levels or showing transient expression (missed)
6
Q

How were genes involved in regulation, interaction, or integration of pathways traditionally identified?

A

genetic (mutant) analysis and biochemically

7
Q

What is functional genomics?

A

Field aiming to create and apply technologies that take advantage of sequence information to analyze the full complement of genes and proteins encoded by an organism

8
Q

What are the 4 major approaches used to elucidate possible function of genes?

A
  1. Expression pattern for all genes
  2. Expression and distribution of all proteins
  3. Knocking out of genes and examination of phenotype and/or gene expression patterns
  4. Identifying interactions among proteins (two-hybrid analysis and newer bait methods)
9
Q

What are the four omics approaches? Which one is the only context-independent one?

A
  1. Genome - complete set of genes of an organism or its organelles
    **context-independent
  2. Transcriptome - complete set of mRNA molecules present in a cell, tissue, or organ
  3. Proteome - complete set of protein molecules in a cell, tissue, or organ
  4. Metabolome - complete set of metabolites in a cell, tissue, or organ
10
Q

What methods of analysis are available for studying the genome?

A

Systematic DNA sequencing

11
Q

What methods of analysis are available for studying the transcriptome? (4)

A

Microarrays, high-throughput northern analysis, ESTs, RNA-seq

12
Q

What methods of analysis are available for studying the proteome?

A

-2D gel electrophoresis
-peptide mass spec, BioID
-two-hybrid analysis
-peptide/protein microarray

13
Q

What methods of analysis are available for studying the metabolome?

A
  • nuclear magnetic resonance spectrometry NMRS
  • mass spectrometry
  • infra-red spectroscopy
14
Q

Name three profiling technologies

A
  1. cDNA Microarrays
  2. Oligonucleotide Microarrays
  3. RNA-Seq
15
Q

How do cDNA microarrays work?

A
  1. cDNAs for each gene in genome are spotted onto glass slide
    - Each spot represents specific gene
  2. Take RNA from the populations to compare and label each with a different fluorophore (different dye colours), incorporated during reverse transcription (RT)
  3. Mix samples and hybridize to cDNA microarray
  4. Analyze colour (if two populations were mixed, yellow means equal abundance; red or green means one population has greater expression)
16
Q

How do oligonucleotide microarrays work?

A
  1. Based on the genome sequence, design oligonucleotides (25 nt in length) that match the 3’ end of each transcript
  2. Synthesize the oligonucleotides on a silicon wafer
  3. Label the mRNA population by biotinylation and hybridize it to the array
  4. Scan to see how much labeled RNA is bound to particular probe
  5. Determine gene expression level according to intensity
17
Q

What process used in oligonucleotide microarray technology is also used to make computer chips?

A

Photolithographic process (deprotect nt at each position and add specific nt)

18
Q

What should be considered for oligonucleotide design in microarrays? (5)

A
  1. Should have similar Tm = allow similar hybridization efficiency
  2. Discriminate against members of same family (sometimes impossible for paralogues)
  3. Oligos with specific Tm and length
  4. Free of secondary structure and self-annealing tendency
  5. unique to one species without homology with another species (discriminate bacterial from human genes)
19
Q

Why is image processing done?

A

extract information after hybridization
- input is scanned images of array fluorescence
- grid applied to image
- position and value (with background correction) on slide associated with appropriate identifier
- output: table of IDs and values

*ScanAlyze, Affymetrix

20
Q

Name 5 cell-type specific expression profiling methods

A
  1. Laser Capture Microdissection
  2. Specific GFP lines–>protoplasting–>FACS
  3. INTACT / TRAP (translating ribosome affinity purification)
  4. scRNA-seq
  5. Spatial transcriptomics
21
Q

What are the problems of counts for profiling experiments? why do we need normalization

A

Microarray: counts per pixel (CCD)
numbers (CCD or RNA-seq) are:
-arbitrary
-not comparable between samples
-not linear multiple of abundance of what you want to detect

22
Q

What does normalization do?

A

remove trends that correlate with variables not expected to influence gene expression changes
-mean expression level across samples should be similar

23
Q

What is an MA plot?

A

A = x axis = Log intensity of expression
M = y axis = ratio of intensity relative to median value
- expect a cloud shape
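
A minimal sketch of how MA values could be computed, assuming the common two-sample convention in which M is the log2 ratio of two intensities and A is their mean log2 intensity (the card's "relative to median value" variant would swap the second sample for a median reference). The function name and the example intensities are made up for illustration.

```python
import math

def ma_values(sample_1, sample_2):
    """Return (A, M) pairs for paired intensities from two channels or samples."""
    points = []
    for x, y in zip(sample_1, sample_2):
        m = math.log2(x) - math.log2(y)          # log ratio, plotted on the y axis
        a = 0.5 * (math.log2(x) + math.log2(y))  # mean log intensity, plotted on the x axis
        points.append((a, m))
    return points

# Hypothetical intensities for three genes.
red = [100.0, 400.0, 50.0]
green = [100.0, 100.0, 200.0]
points = ma_values(red, green)
```

Genes with equal intensity in both samples land on M = 0, so a well-normalized experiment should give a cloud centred on that line.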

24
Q

What are the methods of normalization for microarray data? (3)

A
  1. RMA Robust Multichip Analysis
    - quantile normalization
    -better expression estimates but introduces inter-array correlations in coexpression analyses
  2. GCOS/ MAS5.0 Affymetrix normalization algorithm
  3. Loess
    -locally weighted linear regression to smooth data (cDNA microarray)
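
The quantile normalization step can be sketched roughly as follows. This covers only that one step, not the full RMA procedure, and it ignores ties for simplicity. The function name and the toy values are assumptions for illustration.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one per array. Returns normalized copies."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the k-th smallest value across all arrays.
    reference = [sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # gene indices, smallest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]  # replace each value by the reference value of its rank
        normalized.append(out)
    return normalized

# Two hypothetical arrays measuring the same three genes.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

After this step every array has exactly the same distribution of values, which is the sense in which arrays are made comparable.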
25
Q

Name two methods to normalize RNA-seq data

A
  1. RPKM Reads per kilobase per million reads
  2. FPKM Fragments per kilobase per million reads
  3. TPM transcripts per million
  4. TMM trimmed mean of M-values
  5. CPM counts per million
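
As a rough illustration of three of these units, the sketch below computes CPM, RPKM, and TPM from raw counts and gene lengths. The counts and lengths are invented, and the total count is used as the library size, which is a simplifying assumption.

```python
def cpm(counts):
    """Counts per million: scale by library size only."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]

def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: scale by library size and gene length."""
    total = sum(counts)
    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [r * 1e6 / scale for r in rates]

# Hypothetical library with three genes.
counts = [100, 300, 600]
lengths_bp = [1000, 2000, 3000]
cpm_values = cpm(counts)
rpkm_values = rpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, while RPKM values do not.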
26
Q

Name two gene expression databases

A
  1. ArrayExpress
  2. GEO (gene expression omnibus)

-may also have organism-specific gene expression DBs: Arabidopsis, human (RefExA), mouse, worm, fly

27
Q

Why are MIAME and MINSEQE metadata important?

A

Because they contain information relevant to specific experiment that allows correct interpretation of datasets
– source of tissue, age, microarray element identifiers, identifier annotation, fragmentation protocols, library protocols

28
Q

Complete name of MIAME

A

Minimum Information About a Microarray Experiment

29
Q

Complete name of MINSEQE

A

Minimum Information about a high-throughput Nucleotide SeQuencing Experiment

30
Q

What are four methods for selecting significant genes?

A
  1. Fold-change
  2. T-test
  3. S-test
  4. ANOVA
31
Q

For affymetrix chips what is higher: biological variation or chip-chip variation?

A

biological variation

32
Q

For cDNA microarrays which one is higher biological variation or chip-chip variation?

A

chip-chip variation, so technical replicates are needed

33
Q

What factors introduce variability in RNA profiling experiments?

A

gene and variety (types of sample, treatment, time)
individual sample, chip-array, dye (microarrays)
library prep (RNA-seq)

34
Q

How can you statistically validate results?

A
  1. Include duplicates (or replicates) to approximate baseline distribution
  2. Independent biological replicates
  3. Replicates for most variable biological factors
35
Q

Why is fold-change not a statistical test?

A

Subject to bias: genes with low expression have higher variance

36
Q

Pros/ Cons of t-test:

A

Pro: better than fold-change
Con: low power due to small sample size and unstable error variance

37
Q

Pros/Cons of S-test:

A

Pro: small positive constant added to denominator of gene-specific t-test
- genes with small fold-change will not be selected as significant
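
The effect of the added constant can be sketched as below, assuming a pooled-variance two-sample t-statistic with a constant `s0` added to its denominator. The function name, the value of `s0`, and the expression values are hypothetical.

```python
import math
from statistics import mean, variance

def t_like_statistic(group_a, group_b, s0=0.0):
    """Pooled-variance t-statistic; s0 > 0 gives the S-test style denominator."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return (mean(group_a) - mean(group_b)) / (std_err + s0)

# A gene with a tiny difference between conditions but very small variance.
control = [5.00, 5.01, 5.02]
treated = [5.10, 5.11, 5.12]
plain_t = t_like_statistic(control, treated)            # large magnitude
moderated = t_like_statistic(control, treated, s0=0.1)  # much smaller magnitude
```

With `s0 = 0` the tiny variance inflates the statistic. Adding the constant shrinks it, so this small-fold-change gene is no longer ranked as strongly significant.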

38
Q

What does regularized t-test do?

A

combines information from gene-specific and global average variance estimates (weighted average in denominator of t-test)

39
Q

What is B-statistic?

A

Log posterior odds ratio of differential expression versus non-differential expression

40
Q

If there are more than 2 conditions for comparison, which statistical test will you use?

A

ANOVA (with F-test) or Limma

41
Q

Why is Bonferroni correction used?

A

To further adjust the p-value by dividing the cutoff p-value by the number of genes tested *e.g., for 100 genes, p-value = 0.05 –> 0.0005
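
A small sketch of the arithmetic, assuming 100 genes are tested so that a cutoff of 0.05 becomes 0.0005. The function names and the p-values in the second example are made up.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test cutoff after Bonferroni correction."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that passes the corrected cutoff."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [p <= cutoff for p in p_values]

cutoff_100_genes = bonferroni_cutoff(0.05, 100)
flags = bonferroni_significant([0.0001, 0.02, 0.04, 0.6])  # cutoff here is 0.05 / 4
```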

42
Q

How can you select significant genes with multiple testing?

A
  1. convert test statistic to p-value
  2. FWER: family-wise error rate
    - probability of accumulating one or more false-positive errors over a number of tests
  3. FDR: false discovery rate
    - post-data measure of confidence
    - estimate false positive rates by swapping sample labels and asking how many DEGs are identified
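
The card describes a label-swapping (permutation) estimate of the FDR. As a different but widely used way to control the FDR, the sketch below shows the Benjamini-Hochberg adjustment. It is swapped in for illustration and is not the method on the card. The p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

raw_p = [0.01, 0.04, 0.03, 0.5]
adj_p = benjamini_hochberg(raw_p)
```

Genes whose adjusted p-value falls below the chosen FDR level (for example 0.05) would be called differentially expressed.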
43
Q

How can you select significant genes for RNA-seq data?

A

DESeq2: negative binomial distribution
Voom and Limma: after transforming count data to approximate normal distribution

44
Q

Can you use RPKM/FPKM/TMM to select DEGs in RNA-seq?

A

no

45
Q

What type of error is controlled better with Limma and Voom?

A

Type 1 error: false positive

46
Q

Name four ways to organize expression data

A
  1. Similar expression profiles
  2. Groups of genes of interest
  3. Functional classification according to GO or MIPS categories
  4. Pathway analysis
47
Q

When is not normalizing or log-transforming data acceptable?

A

If you really want to cluster based on expression level

48
Q

Why is it better to log-transform if you are working with fold-change? What base log is used?

A

Because ratios less than 1 are no longer compressed between 0 and 1: decreases extend in the negative direction as far as their positive counterparts extend in the positive direction.
- norming and median-centering allow shape of change to be better visualized (median less susceptible to outliers than mean)

Base 2 (log2) so you can use exponent for 2 to get fold change
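
A short illustration of the symmetry, using made-up expression ratios:

```python
import math

# Ratios of 4x up, 2x up, unchanged, 2x down, 4x down.
ratios = [4.0, 2.0, 1.0, 0.5, 0.25]

# On the raw scale, decreases are squeezed between 0 and 1.
# On the log2 scale, increases and decreases are symmetric around 0.
log_ratios = [math.log2(r) for r in ratios]

# Raising 2 to the log2 value recovers the fold change.
recovered = [2 ** v for v in log_ratios]
```

Here `log_ratios` is `[2.0, 1.0, 0.0, -1.0, -2.0]`, so a 4-fold decrease sits as far from zero as a 4-fold increase.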

49
Q

Why do you do median/mean-centering?

A
  1. Want analysis to be independent of the amount of gene present in reference sample
  2. remove biases
50
Q

What assumption is made when median-centering?

A

Average gene in an experiment is expected to have a ratio of 1.0 (log-ratio of 0)

51
Q

What does normalization do?

A

Sets magnitude (sum of squares of values) of row/column vector to 1.0

Divide values of vector by square root of the sum of squares of the values

52
Q

Describe the process of normalizing data

A
  1. Log2 transformation
  2. Compute median of log2 values
  3. Subtract: log2 value - median = log2-MC
  4. Compute sum of squares of log2-MC
  5. Compute log2-MC-N: because the sum of squares is generally not 1, divide each log2-MC value by the square root of the sum of squares so that the sum of squares equals 1
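
The steps above can be sketched for a single gene vector as follows. The function name and the example values are hypothetical.

```python
import math
from statistics import median

def log2_median_center_normalize(values):
    """Log2-transform, median-center, then scale so the sum of squares is 1."""
    logged = [math.log2(v) for v in values]              # step 1
    med = median(logged)                                 # step 2
    centered = [v - med for v in logged]                 # step 3: log2-MC
    magnitude = math.sqrt(sum(v * v for v in centered))  # step 4: root of sum of squares
    if magnitude == 0:
        return centered  # a flat profile cannot be scaled
    return [v / magnitude for v in centered]             # step 5: log2-MC-N

# Hypothetical expression values for one gene across five samples.
profile = [1.0, 2.0, 4.0, 8.0, 16.0]
normalized_profile = log2_median_center_normalize(profile)
```

After this, genes with the same shape of change but different absolute levels end up with the same vector.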
53
Q

For what are volcano plots used?

A

Identify genes that are significantly up- or down-regulated relative to a mock or control sample

54
Q

Name 5 clustering methods

A
  1. Hierarchical
  2. K-means
  3. SOM: self-organizing map
  4. Dimensionality reduction methods: PCA, t-SNE, UMAP (LDA)
  5. SVM
55
Q

What does SOM stand for and what is it?

A

Self-Organizing Map

a clustering method

56
Q

How does hierarchical clustering calculate pairwise distances?

A

Pearson Correlation Coefficient (PCC)

57
Q

What are the steps of hierarchical clustering?

A
  1. Compute all possible pairwise distances (pearson CC)
  2. Join closest neighbours, with branch length reflective of distance between them
  3. Recompute distances. Use average linkage clustering
    - node score= average of PCCs between genes in the node and other gene vectors
    - If average node PCC is higher than that between two individual genes, join it to the best gene. Otherwise, join two genes with best PCC
  4. Repeat until all nodes/genes are joined
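
A minimal sketch of these steps, assuming average linkage is taken as the mean of all pairwise PCCs between the members of two nodes. It records the order of joins rather than drawing a tree, and it does not handle flat profiles, which have zero variance. Gene names and values are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """profiles: dict of name -> values. Returns joins as (node_a, node_b, average PCC)."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Node score: average PCC over all gene pairs between the two nodes.
                pccs = [pearson(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(pccs) / len(pccs)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        merges.append((clusters[i], clusters[j], score))
        joined = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [joined]
    return merges

profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.0, 2.5],
}
merges = average_linkage(profiles)
```

In this toy data the two rising genes join first, then the two falling genes, and the final join links the two anti-correlated groups with a negative average PCC.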
58
Q

What is one difference between hierarchical clustering and k-means?

A

With k-means we approximate the initial number of clusters, while we do not set that number in hierarchical clustering

59
Q

Name 4 applications of SOMs

A
  1. Automatic speech recognition
  2. Cloud classification from satellite images
  3. Analysis of electrical signals from brain
  4. Gene expression data analysis
60
Q

What questions can you ask after establishing similarly expressed genes?

A
  1. Insight into specific biological experiment
  2. Coexpression analysis by using gene expression databases
  3. Do promoters of similarly expressed genes contain common cis-elements?
  4. Is there enrichment of a particular functional category (GO, MIPS, KEGG)–>individual clusters
  5. Are genes in a given cluster all part of given pathway –>map expression info onto pathways