HC 5 - Analysis of Transcriptomics Data - Part 1: Data Analysis Flashcards

Lecture 5

1
Q

Steps for obtaining new knowledge (transcriptomics pipeline)

A

1) Experimental design and data collection
2) Data quality control and preprocessing
3) Data analysis
4) Biological interpretation

2
Q

1) Experimental design and data collection components

A

-Frame biological question
-Choose a platform
-Identify noise factors
-Design Experiment

3
Q

2) Data quality control and preprocessing components

A

-Quality control of raw data
-Calculate expression values (transcriptomics): mapping and counting
-Perform normalization to remove biases introduced by sampling and measurement

4
Q

3) Data analysis components

A

-Perform explorative data analysis
-Analyze the results of assembly/mapping
-Perform hypothesis testing: statistical tests to find significant differences between groups

5
Q

4) Biological Interpretation components

A

-Interpret transcriptome differences in relation to experimental conditions
-Analyze the response of sets of genes

6
Q

Which steps of the transcriptomics pipeline require prior knowledge and context?

A

Experimental design and biological interpretation

7
Q

Genotoxicity

A

The property of chemical agents to damage the genetic information within a cell, causing mutations

8
Q

Phred scores are for a … quality control

A

technical

9
Q

Technical quality control: Spike-in RNA control ratio mixtures

A

Two mixtures of the same 92 ERCC RNA transcripts are prepared with 4 subpools of 23 transcripts per subpool with different defined abundance ratios between the two samples.
> gives information on differential expression and dynamic range

10
Q

Why is normalization needed?

A

We want to measure biological differences, so technical variability (e.g. caused by different transcript lengths) should be removed.

11
Q

Assumptions for recognizing technical variability

A

-The average expression levels are equal
-The distributions of the expression levels are the same
-Most genes have similar expression levels across all samples
-A spike-in standard can be used to quantify technical error

12
Q

RNA-seq issues for which normalization is needed

A

-Sequencing depth
-Transcript length
-Transcriptome composition

13
Q

Issue Sequencing depth: why

A

Expression values are higher for all genes in a certain sample because the total number of reads is higher in that sample

14
Q

Workflow normalisation sequencing depth: two ways to calculate Reads Per Million

A

Method 1
> Add up total mapped reads (depth)
> Divide read counts of each gene by this normalization factor
> Multiply with 10^6
Method 2
> Count up total reads and divide by 10^6 (per million scaling factor)
> Divide read counts by the per million scaling factor
> now you have RPM
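
A minimal sketch of both methods, to show that they give the same RPM values. The gene names and read counts are made-up toy numbers, not data from the lecture.

```python
# Toy read counts for one sample (hypothetical gene names and numbers).
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}


def rpm_method_1(counts):
    """Divide each count by the total mapped reads, then multiply by 10^6."""
    depth = sum(counts.values())
    return {gene: c / depth * 1e6 for gene, c in counts.items()}


def rpm_method_2(counts):
    """Divide the total by 10^6 first, then divide each count by that factor."""
    per_million = sum(counts.values()) / 1e6
    return {gene: c / per_million for gene, c in counts.items()}


rpm_1 = rpm_method_1(counts)
rpm_2 = rpm_method_2(counts)
```

Both methods only differ in the order of the multiplication and the division, so the results agree up to floating-point rounding.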

15
Q

Problems with RNA-seq for the issue transcript length

A

-Longer transcripts give more reads
-Isoforms of a gene can have different lengths
-Important for abundance estimation and differential expression analysis

16
Q

Which value needs to be obtained for normalization for transcript length?

A

RPKM values or FPKM values (reads or fragments per kilobase per million)

17
Q

Workflow obtaining RPKM/FPKM

A

-Count up the total reads in a sample and divide that number by 10^6 (per million scaling factor)
-Divide the read counts by per million scaling factor
-Divide the RPM values by the length of the gene in kilobases
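
The three steps can be sketched as follows. The counts and gene lengths are hypothetical toy values, chosen so that a long gene with many reads and a short gene with few reads end up with the same RPKM.

```python
# Toy data (hypothetical): read counts and gene lengths in kilobases.
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 4.0}


def rpkm(counts, lengths_kb):
    # Step 1: per million scaling factor (total reads / 10^6).
    per_million = sum(counts.values()) / 1e6
    # Step 2: reads per million (RPM).
    rpm = {gene: c / per_million for gene, c in counts.items()}
    # Step 3: divide RPM by the gene length in kilobases.
    return {gene: rpm[gene] / lengths_kb[gene] for gene in counts}


rpkm_values = rpkm(counts, lengths_kb)
```

Here geneB has three times the reads of geneA but is also three times as long, so after length normalization the two are treated as equally abundant.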

18
Q

Workflow obtaining TPM values (transcripts per million)

A

-Divide read counts by length of each gene in kilobases > RPK value
-Count up all RPK values in sample and divide by 10^6 > per million scaling factor
-Divide RPK values by the per million scaling factor
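
A small sketch of the TPM steps on two toy samples (hypothetical numbers; the second sample is simply sequenced twice as deep). It illustrates that the TPM values of every sample add up to 10^6.

```python
# Toy data (hypothetical): two samples with the same composition,
# the second one sequenced at twice the depth.
lengths_kb = {"geneA": 1.0, "geneB": 3.0, "geneC": 4.0}
sample_1 = {"geneA": 500, "geneB": 1500, "geneC": 8000}
sample_2 = {gene: 2 * c for gene, c in sample_1.items()}


def tpm(counts, lengths_kb):
    # Step 1: reads per kilobase (RPK).
    rpk = {gene: counts[gene] / lengths_kb[gene] for gene in counts}
    # Step 2: per million scaling factor from the summed RPK values.
    per_million = sum(rpk.values()) / 1e6
    # Step 3: divide the RPK values by the scaling factor.
    return {gene: value / per_million for gene, value in rpk.items()}


tpm_1 = tpm(sample_1, lengths_kb)
tpm_2 = tpm(sample_2, lengths_kb)
```

Compared with RPKM, only the order of the two normalizations is swapped: length first, depth second.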

19
Q

Why might the RPKM values be hard to compare between samples?

A

The total RPKM differs per sample. A higher total goes together with higher RPKM values for individual genes, so comparisons between samples are biased

20
Q

Why is TPM used for comparing samples?

A

Relative abundances of transcripts are made comparable (equal sums per sample)
> a single gene length does not really exist when there is splicing (the reads come from transcripts)
> TPM is relevant for transcript abundances
> a gene consists of isoforms, and how to calculate gene-level TPMs is still debated

21
Q

Issue transcriptome composition in a MA-plot

A

Horizontal line with a lot of dots (y=0)
> under the line: genes specific for tissue A

22
Q

What is a Funky Gene (FG)?

A

A gene that is not comparable between the samples
> when a gene has a very extreme difference in expression, other genes appear to differ between groups, which is not reliable: the total counts of one group become extremely high, the normalization factor becomes too large, and roughly equal expression values look different (too low in the group with the funky gene)

23
Q

Normalization of funkiness of a FG.

A

Subtract the counts for the FG for both groups from the total counts.
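
A toy illustration of the funky gene problem and this fix, with made-up counts. The `exclude` parameter is a name invented for this sketch: it leaves the listed genes out of the total that serves as normalization factor.

```python
# Toy counts (hypothetical): gene1-3 are identical in both groups,
# only the funky gene "FG" differs extremely.
group_a = {"gene1": 100, "gene2": 200, "gene3": 300, "FG": 400}
group_b = {"gene1": 100, "gene2": 200, "gene3": 300, "FG": 9400}


def normalize(counts, exclude=()):
    """Scale counts per million of the total, ignoring `exclude` in the total."""
    total = sum(c for gene, c in counts.items() if gene not in exclude)
    return {gene: c / total * 1e6 for gene, c in counts.items()}


# Naive: the FG inflates the total of group B, so gene1-3 look too low there.
naive_a = normalize(group_a)
naive_b = normalize(group_b)

# Fix from the card: subtract the FG counts from the totals of both groups.
fixed_a = normalize(group_a, exclude=("FG",))
fixed_b = normalize(group_b, exclude=("FG",))
```

With the naive totals gene1 appears ten times lower in group B, although its raw counts are identical. After removing the FG from the totals it is equal in both groups again.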

24
Q

Assumption for identifying FG

A

The assumption that the majority of genes are not differentially expressed. With an FG, that is the case.
Under this assumption it is more plausible that only the FG is differentially expressed

25
Q

DESeq normalizes data by removing FGs from the normalization equations. Which assumptions are made?

A

-The majority of the genes are not differentially expressed
-Reads are uniquely assigned to genes
-The effect of isoforms on differential expression is negligible
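
One way to keep FGs out of the normalization is the median-of-ratios size factor that DESeq is known for. The sketch below is a plain-Python illustration of that idea on made-up counts, not DESeq's actual code. It relies on the first assumption: because most genes do not change, the median ratio is not affected by a single FG.

```python
import math
import statistics

# Toy counts (hypothetical): only "FG" differs between the samples.
samples = {
    "group_a": {"gene1": 100, "gene2": 200, "gene3": 300, "FG": 400},
    "group_b": {"gene1": 100, "gene2": 200, "gene3": 300, "FG": 9400},
}


def size_factors(samples):
    genes = list(next(iter(samples.values())))
    # Reference: geometric mean of each gene across all samples.
    # (Genes with a zero count would have to be skipped; none occur here.)
    reference = {
        gene: math.exp(
            sum(math.log(counts[gene]) for counts in samples.values())
            / len(samples)
        )
        for gene in genes
    }
    # Size factor per sample: median of the count / reference ratios.
    return {
        name: statistics.median(counts[gene] / reference[gene] for gene in genes)
        for name, counts in samples.items()
    }


factors = size_factors(samples)
```

Both size factors come out at about 1, so the extreme FG does not distort the scaling of the other genes.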
26
Q

Differential splicing and differential gene expression are …

A

confounded

27
Q

What are possible outcomes when differential gene expression is detected?

A

-Differential gene expression
-Differential isoform expression

28
Q

Outcomes for no differential gene expression detected

A

-No differential gene expression
-Differential isoform expression

29
Q

Is normalization trivial for RNA-seq data?

A

No, but it is important

30
Q

Gene level analysis and transcript level analysis are …

A

fundamentally different

31
Q

Is there an established method which takes all three relevant issues of RNA-seq into account?

A

No

32
Q

When upregulation or downregulation of gene expression is measured in transcripts, how is it expressed?

A

In fold change values (FC)
> mostly log transformed

33
Q

What happens with 8 times upregulation and downregulation when FC values are not log transformed?

A

8 seems like a bigger effect than 1/8, but the effects are equal in size and opposite in direction

34
Q

Which log is mainly used for transforming FC values?

A

log2

35
Q

Log2 transformation on 8x up/downregulation

A

-Log2(8/1) = 3
-Log2(1/8) = -3

36
Q

Advantages log2 transformation

A

-Symmetric numbers
-Approximately normal distribution for large datasets
-Up- and downregulation are equivalent

37
Q

Reasons log transformations

A

-Raw ratios are not symmetric; log ratios are symmetric around zero
-Stabilization of variance
-Compress the range of data
-Approx. normally distributed data which is nice for analysis

38
Q

Log2(FC) formula

A

log2(H/B) = log2(H) - log2(B)
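
A quick check of this formula and of the 8x example from the earlier cards. The argument names `h` and `b` simply mirror the H and B of the card, which does not spell out what the letters stand for.

```python
import math


def log2_fc(h, b):
    """log2 fold change of expression value h relative to b."""
    return math.log2(h / b)


up = log2_fc(8, 1)    # 8x upregulation
down = log2_fc(1, 8)  # 8x downregulation

# The quotient rule: log2(H/B) equals log2(H) - log2(B).
difference_form = math.log2(40) - math.log2(5)
quotient_form = log2_fc(40, 5)
```

The up- and downregulated values have the same size and opposite sign, which is the symmetry that the raw ratios 8 and 1/8 lack.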

39
Q

After the normalization, you have reliable gene expression data. What is next?

A

Exploratory data analysis: make pictures, plots and graphs

40
Q

Where is the log(FC) found on a graph with the expression of a gene from two samples on different axes?

A

The log(FC) lies along a line perpendicular to the correlation trend

41
Q

Explorative data analysis plots

A

-Dot plot: expression of different samples on the axes, each dot is a gene
-Boxplot: different samples on the x-axis and gene expression in the y direction
-PCA plot
-MA plot: each point is a gene, y-axis: log FC (M), x-axis: average expression (A)
-Volcano plot: transformed p-value against log FC: each point is a gene, low p-values are more significant and end up higher after the transformation
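
As an illustration of the MA plot coordinates, the sketch below computes M and A for one gene from two expression values. It assumes the usual reading of the letters: M is the log2 fold change and A is the average of the two log2 expression values. The input numbers are made up.

```python
import math


def ma_values(expr_1, expr_2):
    """Return (M, A) for one gene measured in two samples."""
    log_1 = math.log2(expr_1)
    log_2 = math.log2(expr_2)
    m = log_2 - log_1          # y-axis: log2 fold change
    a = (log_1 + log_2) / 2    # x-axis: average log2 expression
    return m, a


m_value, a_value = ma_values(4, 16)
unchanged_m, _ = ma_values(10, 10)
```

A gene that does not change between the samples gets M = 0, so it falls on the horizontal line y = 0.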

42
Q

Are the points outside the box outliers?

A

No, they are genes with very high or very low expression

43
Q

Comparing transcriptome with color diagram

A

Color plot with two samples as two columns and a lot of rows for different genes: the colors represent high or low expression. Sorted on differential expression

44
Q

Which transformation is needed to make very significant (low) p-values a high value for the volcano plot?

A

Negative log transformation
> -log10(p-value)
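
A short sketch of this transformation on a few made-up p-values, showing that the most significant (lowest) p-value ends up highest on the y-axis.

```python
import math

# Hypothetical p-values, from not significant to very significant.
p_values = [0.5, 0.05, 0.001, 1e-8]

# Negative log10 transformation used for the y-axis of a volcano plot.
neg_log10_p = [-math.log10(p) for p in p_values]
```

A p-value of 0.001 becomes 3 and a p-value of 10^-8 becomes 8, so the order is reversed and tiny p-values are spread out instead of being squeezed near zero.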

45
Q

How can a difference be tested between two groups (healthy and patient)?

A

t-test

46
Q

t =

A

t = signal / noise = (mean_patient - mean_healthy) / sqrt((s1^2 + s2^2) / n)
> signal: the difference between the means of the two groups (the average log FC)
> noise: the standard error, an estimate of the standard deviation of the numerator
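
A minimal sketch of this signal-to-noise ratio, reading the formula as the difference of the group means divided by sqrt((s1^2 + s2^2) / n) with equal group size n. The expression values are made up.

```python
import math
import statistics


def t_statistic(patient, healthy):
    """t = (mean difference) / standard error, for two groups of equal size."""
    if len(patient) != len(healthy):
        raise ValueError("this form of the formula assumes equal group sizes")
    n = len(patient)
    signal = statistics.mean(patient) - statistics.mean(healthy)
    # statistics.variance is the sample variance (divides by n - 1).
    noise = math.sqrt(
        (statistics.variance(patient) + statistics.variance(healthy)) / n
    )
    return signal / noise


# Hypothetical log expression values of one gene.
patient = [5.0, 6.0, 7.0]
healthy = [2.0, 3.0, 4.0]
t_value = t_statistic(patient, healthy)
```

Here the means differ by 3 and both variances are 1, so the noise term is sqrt(2/3) and t is roughly 3.67.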

47
Q

How is the p-value derived from the t-value?

A

By looking up the t-value in the t-distribution

48
Q

How can the assumptions of the two-sample t-test be relaxed?

A

Use different variations of the t-test

49
Q

Assumptions classic t-test

A

-Equal sample size
-Pooled variance (equal)

50
Q

What is good practice? Is performing many t-tests for high dimensional groups (many genes) good practice?

A

Good practice is to use specialized methods and software for genome-wide differential gene expression analysis. Running a separate t-test for every gene is not good practice > methodological issues

51
Q

Why is good practice needed instead of genome-wide t-tests? Name 3 reasons

A

-Curse of dimensionality: many measurements, small n
-RNAseq (and microarray) data do not follow a normal distribution
-Control the number of false discoveries

52
Q

Curse of dimensionality: the problem

A

Estimating a parameter 20,000 times (once per gene) from only a few observations is not stable
> if you draw a sample of size n from a normal distribution with sd=1, estimate the sd from those n values, and repeat this 20,000 times, the estimated sd values are not very precisely centered around 1
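
The simulation described above can be sketched with the standard library. The sample sizes 3 and 30 and the random seed are arbitrary choices for this illustration.

```python
import math
import random


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))


def sd_estimates(n, repeats=20_000, seed=0):
    """Estimate the sd from n draws of N(0, 1), repeated `repeats` times."""
    rng = random.Random(seed)
    return [
        sample_sd([rng.gauss(0, 1) for _ in range(n)]) for _ in range(repeats)
    ]


small_n = sd_estimates(3)    # a few replicates per gene
large_n = sd_estimates(30)   # many replicates per gene

# How much the estimates scatter around the true value of 1.
spread_small = sample_sd(small_n)
spread_large = sample_sd(large_n)
```

With only 3 observations the estimates range from far below to far above 1, while with 30 observations they stay much closer to the true value.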

53
Q

Curse of dimensionality: the effect

A

When variances are small by coincidence, the t-value becomes very large or very small (strongly negative) and the gene becomes significant by chance

54
Q

Curse of dimensionality: solution

A

Correct (shrink) the per-gene variance estimates towards the average
> more precise se (standard error) as a result

55
Q

The technical variance in RNAseq data is a Poisson distribution. Explain.

A

The Poisson distribution is the limiting distribution of the binomial as n (sequencing reads) tends to infinity and p (probability of picking a gene) tends to zero

56
Q

Poisson distribution formula

A

P(X=k) = (m^k * e^(-m)) / k! ≈ C(n,k) * p^k * (1-p)^(n-k), with m = n*p
p = chance of picking the gene
q = 1-p = chance of not picking the gene
> the probability of a given number of events in an interval

57
Q

Characteristic of the Poisson distribution

A

The variance and the mean are the same
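
A numerical check of the Poisson formula and of this property, using an arbitrary mean m = 4. The sketch also compares the Poisson probability with the binomial one for large n and small p, taking m = n*p.

```python
import math


def poisson_pmf(k, m):
    """P(X = k) for a Poisson distribution with mean m."""
    return m ** k * math.exp(-m) / math.factorial(k)


def binomial_pmf(k, n, p):
    """P(X = k) for n draws with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


m = 4.0
ks = range(60)  # the probability mass beyond k = 60 is negligible for m = 4

mean = sum(k * poisson_pmf(k, m) for k in ks)
variance = sum((k - mean) ** 2 * poisson_pmf(k, m) for k in ks)

# Many reads (n large), small chance of picking this gene (p small).
n = 1_000_000
p = m / n
gap = abs(binomial_pmf(3, n, p) - poisson_pmf(3, m))
```

Both the mean and the variance come out at about 4, and the binomial and Poisson probabilities are nearly identical.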

58
Q

RNA-seq data are not normally distributed. The idea is that technical replicates can be modelled with the Poisson distribution. The biological replicates show overdispersion (greater variability than expected), which is the biological variance. How can this variance be modelled?

A

With the negative binomial distribution

59
Q

What kind of error is a false positive?

A

A type I error

60
Q

What is Bonferroni correction

A

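A correction that controls the probability of at least one false positive
> example: alpha = 0.05 per test, R = 20,000 tests
Prob(correct for one test) = 1 - 0.05 = 0.95
Prob(globally correct) = (1 - 0.05)^20,000
Prob(at least one error) = 1 - Prob(globally correct)
> corrected cutoff: Ac = Ae/R (Ac: alpha per comparison, Ae: alpha for the whole experiment)
> for Ae = 0.05 and R = 20,000 genes, Ac = 2.5*10^-6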
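
The arithmetic on this card, worked out for alpha = 0.05 and 20,000 tests. The calculation treats the tests as independent, as the formula (1 - alpha)^R does.

```python
alpha = 0.05
n_tests = 20_000

# Without correction: chance that all 20,000 tests are correct.
# The value is so small that it rounds to zero in floating point.
p_globally_correct = (1 - alpha) ** n_tests
p_any_error = 1 - p_globally_correct

# Bonferroni: divide the cutoff by the number of tests.
alpha_corrected = alpha / n_tests
p_any_error_corrected = 1 - (1 - alpha_corrected) ** n_tests
```

Without correction at least one false positive is practically certain. With the corrected cutoff of 2.5*10^-6 per test, the probability of any false positive stays just below 0.05.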

61
Q

What is the p-value?

A

P(data at least this extreme | H0 is true)
> low p-value: reject H0

62
Q

Does a low p-value prove that something is different?

A

No, just that it is not equal

63
Q

If alpha = 0.01 and you test 100 genes, how many false discoveries are expected?

A

1
> if a p-value is lower than alpha, the difference is likely not due to chance but significant

64
Q

Problem with cutoff of 0.01 with gene expression

A

20,000 genes > 200 false discoveries
> Bonferroni
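
The expected number of false discoveries is the cutoff multiplied by the number of tests, assuming that no gene is truly different. A tiny sketch of both examples:

```python
def expected_false_discoveries(alpha, n_tests):
    """Expected false positives when every null hypothesis is true."""
    return alpha * n_tests


few_genes = expected_false_discoveries(0.01, 100)
all_genes = expected_false_discoveries(0.01, 20_000)
```

The same cutoff that gives 1 expected false discovery for 100 genes gives 200 for a whole transcriptome.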

65
Q

Bonferroni correction principle

A

The alpha cutoff value has to be divided by the number of tests (R=20,000)

66
Q

What is FWER?

A

The family-wise error rate: the probability of at least one false positive. Controlling it guards against any false positives, but in many cases a certain number of false positives is acceptable
> use the FDR as a more relevant quantity for multiple testing control (control false discoveries)

67
Q

What is the FDR?

A

The False Discovery Rate: The expected proportion of Type I errors among the rejected hypotheses
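
The cards do not name a specific FDR procedure. As an assumption for illustration, the sketch below uses the Benjamini-Hochberg step-up procedure, a widely used way to control the FDR. The p-values are made up.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return the indices of the hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below rank / m * alpha.
        if p_values[index] <= rank / m * alpha:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for four genes.
rejected_mixed = benjamini_hochberg([0.041, 0.001, 0.2, 0.008])
rejected_all = benjamini_hochberg([0.01, 0.02, 0.03, 0.04])
```

In the second example a Bonferroni cutoff of 0.05 / 4 = 0.0125 would only keep the first gene, while the FDR procedure keeps all four. This illustrates why FDR control is less strict than FWER control.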

68
Q

Biological interpretation: Overrepresentation analysis principle

A

In a given list of genes of interest (DEGs, differentially expressed genes), is there a gene set that is more represented than expected by chance alone?

69
Q

What are ontologies?

A

Controlled vocabularies to describe functions of genes

70
Q

Gene Ontology Database structure

A

Directed acyclic graphs: a tree-like structure running from more specialized terms to less specialized terms

71
Q

Hypergeometric test

A

Test whether there is an association between the DEGs and a gene set
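
A minimal sketch of the test as a one-sided overrepresentation p-value: the probability of seeing at least the observed overlap when the DEGs are drawn at random from all genes. The function and argument names are invented for this example, and the numbers are toy values.

```python
import math


def hypergeom_p_value(n_genes, n_in_set, n_degs, n_overlap):
    """P(overlap >= n_overlap) for n_degs genes drawn without replacement."""
    total = math.comb(n_genes, n_degs)
    upper = min(n_in_set, n_degs)
    favourable = sum(
        math.comb(n_in_set, k) * math.comb(n_genes - n_in_set, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return favourable / total


# Small example that can be checked by hand:
# 10 genes, 4 in the gene set, 3 DEGs, 2 of them in the set.
p_small = hypergeom_p_value(10, 4, 3, 2)

# More realistic sizes (hypothetical): 20,000 genes, a set of 100,
# 500 DEGs of which 10 are in the set (about 2.5 expected by chance).
p_large = hypergeom_p_value(20_000, 100, 500, 10)
```

For the small example the result is (6*6 + 4) / 120 = 1/3. In the larger example the overlap is well above what chance would give, so the p-value is small.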

72
Q

Data analysis results goals

A

Derive p-values with a statistical test
> FDR correction: FDR <= alpha
> Hypergeometric test for biological interpretation