HC 5 - Analysis of Transcriptomics Data - Part 1: Data Analysis Flashcards
Lecture 5
Steps for obtaining new knowledge (the transcriptomics pipeline)
1) Experimental design and data collection
2) Data quality control and preprocessing
3) Data analysis
4) Biological interpretation
1) Experimental design and data collection components
-Frame biological question
-Choose a platform
-Identify noise factors
-Design Experiment
2) Data quality control and preprocessing components
-Quality control of raw data
-Calculate expression values (transcriptomics): mapping and counting
-Perform normalization to remove biases introduced by sampling and measurement
3) Data analysis components
-Perform explorative data analysis
-Analyze the results of assembly/mapping
-Perform hypothesis testing: statistical tests to find significant differences between groups
4) Biological Interpretation components
-Interpret transcriptome differences in relation to experimental conditions
-Analyze the response of sets of genes
Which steps of experimental pipeline transcriptomics require prior knowledge and context?
Experimental design and biological interpretation
Genotoxicity
Property of chemical agents that damage the genetic information within a cell, causing mutations
Phred scores are for a … quality control
technical
Technical quality control: Spike-in RNA control ratio mixtures
Two mixtures of the same 92 ERCC RNA transcripts are prepared with 4 subpools of 23 transcripts per subpool with different defined abundance ratios between the two samples.
> used to assess differential expression and dynamic range
Why is normalization needed?
Biological differences are what we want to measure, but technical variability (e.g. different lengths of transcripts) should be removed.
Assumptions for recognizing technical variability
-The average expression levels are equal
-The distribution of the expression levels are the same
-Most genes have similar expression levels across all samples
-A spike-in standard can be used to quantify technical error
Issues with RNA-seq for which normalization is needed
-Sequencing depth
-Transcript length
-Transcriptome composition
Issue Sequencing depth: why
Expression values are higher for all genes in a certain sample and the total sum of the reads is higher in this sample
Workflow normalization sequencing depth: two ways to calculate Reads Per Million
Method 1
> Add up total mapped reads (depth)
> Divide read counts of each gene by this normalization factor
> Multiply with 10^6
Method 2
> Count up total reads and divide by 10^6 (per million scaling factor)
> Divide read counts by the per million scaling factor
> now you have RPM
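A minimal NumPy sketch of both methods with made-up counts; the two routes give identical RPM values:

```python
import numpy as np

# Hypothetical raw read counts for 4 genes in one sample
counts = np.array([120, 3500, 40, 880])

# Method 1: divide by total mapped reads, then multiply by 10^6
rpm_1 = counts / counts.sum() * 1e6

# Method 2: divide the total by 10^6 first (per million scaling factor),
# then divide each gene's count by that factor
rpm_2 = counts / (counts.sum() / 1e6)

assert np.allclose(rpm_1, rpm_2)
```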
Problems with RNA-seq related to transcript length
-Longer transcripts yield more reads (read length itself is fixed)
-Isoforms of a gene can have different lengths
-Important for abundance estimation and differential expression analysis
Which value needs to be obtained for normalization for transcript length?
RPKM values or FPKM values (reads or fragments per kilobase per million)
Workflow obtaining RPKM/FPKM
-Count up the total reads in a sample and divide that number by 10^6 (per million scaling factor)
-Divide the read counts by per million scaling factor
-Divide the RPM values by the length of the gene in kilobases
Workflow obtaining TPM values (transcripts per million)
-Divide read counts by length of each gene in kilobases > RPK value
-Count up all RPK values in sample and divide by 10^6 > per million scaling factor
-Divide RPK values by the per million scaling factor
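A NumPy sketch of both workflows on made-up counts and gene lengths; the only difference is the order of the two corrections, and (as the next cards explain) only the TPM columns sum to the same total per sample:

```python
import numpy as np

# Hypothetical read counts for 3 genes (rows) in 2 samples (columns),
# and gene lengths in kilobases
counts = np.array([[100, 200],
                   [400, 400],
                   [500, 1400]])
length_kb = np.array([2.0, 4.0, 1.0])

# RPKM: first correct for sequencing depth, then for gene length
rpm = counts / (counts.sum(axis=0) / 1e6)  # reads per million
rpkm = rpm / length_kb[:, None]

# TPM: first correct for gene length, then for depth
rpk = counts / length_kb[:, None]          # reads per kilobase
tpm = rpk / (rpk.sum(axis=0) / 1e6)

print(rpkm.sum(axis=0))  # totals differ between samples
print(tpm.sum(axis=0))   # both totals are exactly 10^6
```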
Why might the RPKM values be hard to compare between samples?
The total RPKM value per sample differs, and a gene's RPKM depends on that total, which biases comparisons between samples
Why is TPM used for comparing samples?
Relative abundances of transcripts are made comparable (equal sums per sample)
> a single gene length does not really exist when there is splicing (the reads come from isoforms of different lengths)
> TPM is relevant for transcript abundances
> Gene consists of isoforms and still debate for ways to calculate gene level TPMs
Issue transcriptome composition in an MA-plot
Most dots lie on a horizontal line (y = 0)
> below the line: genes specific for tissue A
What is a Funky Gene (FG)?
Not comparable gene between the samples
> when a certain gene has a very extreme difference in expression, other genes appear differentially expressed between groups even though they are not: because the total counts of one group become very high, the normalization factor is too large and roughly equal expression values end up looking different (too low in the group with the funkiness)
Normalization of funkiness of a FG.
Subtract the counts for the FG for both groups from the total counts.
Assumption for identifying FG
The majority of genes is assumed not to be differentially expressed; this still holds when an FG is present.
Under that assumption it is more plausible that only the FG is differentially expressed than that most other genes are.
DESeq normalizes data by removing FGs from the normalization equations. Which assumptions are made?
- The majority of the genes are not differentially expressed
- Reads are uniquely assigned to genes
- The effect of isoforms on differential expression is negligible
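A simplified sketch of the median-of-ratios idea behind DESeq's size factors (not the actual DESeq implementation); taking the median makes the factor robust against a single FG:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors; counts is a genes x samples array."""
    log_counts = np.log(counts)
    # Geometric mean per gene across samples; genes with a zero count drop
    # out because log(0) = -inf, mimicking DESeq's exclusion of such genes
    log_geo_means = log_counts.mean(axis=1)
    finite = np.isfinite(log_geo_means)
    # Per sample: median ratio of each count to its gene's geometric mean
    log_ratios = log_counts[finite] - log_geo_means[finite, None]
    return np.exp(np.median(log_ratios, axis=0))

counts = np.array([[100, 210], [400, 790], [50, 20000]])  # last gene is an FG
print(size_factors(counts))  # ~[0.69, 1.45]: a ratio of ~2, unaffected by the FG
```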
Differential splicing and differential gene expression are …
confounded
What are possible outcomes when differential gene expression is detected?
-Differential gene expression
-Differential isoform expression
Outcomes for no differential gene expression detected
-No differential gene expression
-Differential isoform expression (isoform switching without a net gene-level change)
Is normalization trivial for RNA-seq data?
No, but it is important
Gene level analysis and transcript level analysis are …
fundamentally different
Is there an established method which takes all three relevant issues of RNA-seq into account?
No
When up- or downregulation of gene expression is measured in transcripts, how is it expressed?
In fold change values (FC)
> mostly log transformed
What happens with 8 times upregulation and downregulation when FC values are not log transformed
8 seems like a bigger effect than 1/8, but the effect is equal in a different direction
Which log is mainly used for transforming FC values?
log2
Log2 transformation on 8x up/downregulation
-Log2(8/1) = 3
-Log2(1/8) = -3
Advantages log2 transformation
-Symmetric numbers
-Normal distribution of large datasets
-Up- and downregulation are equivalent
Reasons log transformations
-Ratios are not symmetric (around zero)
-Stabilization of variance
-Compress the range of data
-Approx. normally distributed data which is nice for analysis
Log2(FC) formula
log2(H/B) = log2(H) - log2(B)
After the normalization, you have reliable gene expression data. What is next?
Exploratory data analysis: make plots and graphs
Where is the log(FC) found on a graph with the expression of a gene from two samples on different axes?
Log(FC) corresponds to the direction perpendicular to the diagonal correlation trend (the y = x line)
Explorative data analysis plots
-dotplot: expression of different samples on axes and dots of genes
-Boxplot: different samples on x-axis and gene expressions in the y direction
-PCA plot
-MA-plot: each point is a gene; y-axis (M): log FC, x-axis (A): average log expression
-Volcano plot: p-value against FC; each point is a gene, and low (more significant) p-values end up high after transformation
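A matplotlib sketch with random stand-in values, just to show the geometry of the MA and volcano plots:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
log2_fc = rng.normal(0, 1, 2000)      # hypothetical per-gene log2 fold changes
p_values = rng.uniform(0, 1, 2000)    # hypothetical p-values
mean_expr = rng.uniform(2, 12, 2000)  # hypothetical average log expression (A)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# MA-plot: average expression (A) on x, log fold change (M) on y
ax1.scatter(mean_expr, log2_fc, s=4)
ax1.axhline(0, color="red")
ax1.set(xlabel="A (average log expression)", ylabel="M (log2 FC)", title="MA-plot")

# Volcano plot: -log10(p) on y, so significant genes rise to the top
ax2.scatter(log2_fc, -np.log10(p_values), s=4)
ax2.set(xlabel="log2 FC", ylabel="-log10(p-value)", title="Volcano plot")

plt.tight_layout()
plt.show()
```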
Are the points outside the box outliers ?
No, they are genes with very high or low expression
Comparing transcriptomes with a color diagram
A heatmap: two samples as two columns and many rows for different genes; the colors represent high or low expression, sorted on differential expression
Which transformation is needed to make very significant (low) p-values a high value for the volcano plot?
Negative log transformation
> -log10(p-value)
How can a difference be tested between two groups (healthy and patient)
t-test
t =
t = signal/noise = (meanPatient - meanHealthy) / sqrt((s1^2 + s2^2)/n)
> numerator: difference between the means of the two samples (the average log FC)
> denominator: standard error, an estimate of the standard deviation of the numerator
How is the p-value derived from the t-value?
Look it up in the t-distribution
How can the assumptions of the two-sample t-test be relaxed?
Use different variations of the t-test
Assumptions classic t-test
-Equal sample size
-Pooled variance (equal)
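A minimal SciPy sketch on made-up expression values; equal_var=False gives Welch's t-test, which relaxes the pooled-variance assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
healthy = rng.normal(5.0, 1.0, size=6)  # hypothetical log expression values
patient = rng.normal(6.0, 1.0, size=6)

t, p = stats.ttest_ind(patient, healthy, equal_var=True)       # classic t-test
t_w, p_w = stats.ttest_ind(patient, healthy, equal_var=False)  # Welch variant
print(t, p, t_w, p_w)
```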
Is performing many t-tests for high-dimensional data (many genes) good practice?
No > methodological issues. Good practice is to use specialized methods and software for genome-wide differential gene expression analysis.
Why is good practice needed instead of genome-wide t-tests? Name 3 reasons
-Curse of dimensionality: many measurements, small n
-RNAseq (and microarray) data do not follow a normal distribution
-Control the number of false discoveries
Curse of dimensionality: the problem
Estimating 20,000 (all genes) times a parameter based on a few observations is not stable
> if you draw a sample of n from a normal distribution with sd=1 and estimate sd from the n samples, done 20,000 times, the plotted estimated sd values aren't very precise around 1
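A NumPy simulation of exactly this experiment (n = 4 and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # few replicates per gene, as in a typical RNA-seq design

# Estimate sd from n draws of a standard normal, 20,000 times (once per "gene")
sd_estimates = rng.normal(0, 1, size=(20_000, n)).std(axis=1, ddof=1)

print(sd_estimates.min(), sd_estimates.max())
# With n = 4 the estimates scatter widely around the true sd of 1; genes whose
# variance is small by chance get inflated t-values and look significant.
```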
Curse of dimensionality: the effect
When a gene's variance is small by coincidence, the t-value becomes very large (in absolute value) and the gene becomes significant by chance
Curse of dimensionality: solution
Shrink the gene variance estimates toward the average across genes
> more precise se (standard error) as a result
The technical variation in RNA-seq data follows a Poisson distribution. Explain.
The limiting distribution as n (sequencing reads) tends to infinity and p (probability of picking a gene) to zero: Poisson
Poisson distribution formula
P(X=k) = (m^k * e^(-m)) / k!, the limit of the binomial C(n,k) * p^k * (1-p)^(n-k) as n→∞ and p→0 with m = np fixed
p = chance of picking a gene
q = 1-p = chance of not picking it
> gives the probability of a number of occurrences in an interval
Characteristic of the Poisson distribution
The variance and the mean are the same
RNA-seq data do not follow a normal distribution. The idea is that technical replicates can be modelled with the Poisson distribution. The biological replicates show overdispersion (greater variability than expected), which is the biological variance. How can this variance be modelled?
With the negative binomial distribution
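A quick NumPy simulation of the two mean-variance relationships (distribution parameters chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Technical replicates: Poisson, variance equals the mean
pois = rng.poisson(lam=100, size=100_000)
print(pois.mean(), pois.var())  # both ~100

# Biological replicates: negative binomial, variance exceeds the mean
# (overdispersion); here mean = n*(1-p)/p = 100 and variance = mean/p = 1100
nb = rng.negative_binomial(n=10, p=10/110, size=100_000)
print(nb.mean(), nb.var())
```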
What kind of error is a false positive?
A type I error
What is the Bonferroni correction?
A correction that controls the probability of at least one false positive across all tests
> with p = 0.05 per test and R = 20,000 genes:
Prob(correct) = 1 - 0.05 = 0.95
Prob(globally correct) = (1 - 0.05)^20,000 ≈ 0
Prob(at least one error) = 1 - Prob(globally correct) ≈ 1
> corrected cutoff: Ac = Ae/R = 0.05/20,000 = 2.5*10^-6
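The same arithmetic as a short Python check (values taken from the card above):

```python
alpha, R = 0.05, 20_000

p_globally_correct = (1 - alpha) ** R          # ~0: an error somewhere is near-certain
p_at_least_one_error = 1 - p_globally_correct  # ~1

alpha_corrected = alpha / R                    # 2.5e-6, the Bonferroni cutoff per gene
print(p_at_least_one_error, alpha_corrected)
```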
What is the p-value?
P(data | H0 is true)
> low p-value: reject H0
Does a low p-value prove that something is different?
No, only that the data are unlikely under the hypothesis of no difference
If alpha = 0.01 and you test 100 genes, how many false discoveries are expected?
1
> if a p-value is lower than the cutoff, the difference is likely not due to chance but significantly unequal
Problem with cutoff of 0.01 with gene expression
20,000 genes > 200 false discoveries
> Bonferroni
Bonferroni correction principle
The alpha cutoff value has to be divided by the number of tests (R = 20,000 genes)
What is FWER?
The Family-Wise Error Rate: the probability of at least one false positive. It guards against any false positive, but in many cases a certain number of false positives is alright
> use the FDR as a more relevant quantity for multiple testing control (control false discoveries)
What is the FDR?
The False Discovery Rate: The expected proportion of Type I errors among the rejected hypotheses
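A hand-rolled sketch of the Benjamini-Hochberg step-up procedure, the classic way to control the FDR (statsmodels' multipletests with method='fdr_bh' does the same; the p-values here are simulated):

```python
import numpy as np

def bh_fdr(pvals, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / R) * alpha."""
    pvals = np.asarray(pvals)
    R = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= np.arange(1, R + 1) / R * alpha
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(R, dtype=bool)
    reject[order[:k]] = True
    return reject

# 20,000 p-values: mostly uniform (null genes), plus 10 truly tiny ones
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=19_990), rng.uniform(0, 1e-6, size=10)])
print(bh_fdr(pvals).sum())  # roughly the 10 true discoveries, few false ones
```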
Biological interpretation: Overrepresentation analysis principle
In a given list of genes of interest (DEGs, differentially expressed genes), is there a gene set that is more represented than expected by chance alone
What are ontologies?
Controlled vocabularies to describe functions of genes
Gene Ontology Database structure
Directed acyclic graphs (DAGs): edges run from more specialized terms to less specialized terms; unlike in a tree, a term can have multiple parents
Hypergeometric test
Tests whether there is an association between the DEGs and a gene set
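A minimal SciPy sketch with hypothetical numbers; hypergeom.sf(k-1, M, n, N) gives P(X >= k), the chance of seeing at least k gene-set members among the DEGs by chance alone:

```python
from scipy.stats import hypergeom

# Hypothetical numbers: 20,000 genes in total (M), 500 of them in the gene
# set (n), 1,000 DEGs drawn (N), of which 50 land in the gene set (k)
M, n, N, k = 20_000, 500, 1_000, 50

p_value = hypergeom.sf(k - 1, M, n, N)  # P(X >= k)
print(p_value)  # expected overlap is only N*n/M = 25, so 50 is strongly enriched
```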
Data analysis results goals
Derive p-values with statistical test
> FDR correction: FDR <= alpha
> Hypergeometric test for biological interpretation