HC 5 - Analysis of Transcriptomics Data - Part 1: Data Analysis Flashcards
Lecture 5
Steps for obtaining new knowledge (the transcriptomics pipeline)
1) Experimental design and data collection
2) Data quality control and preprocessing
3) Data analysis
4) Biological interpretation
1) Experimental design and data collection components
-Frame biological question
-Choose a platform
-Identify noise factors
-Design Experiment
2) Data quality control and preprocessing components
-Quality control of raw data
-Calculate expression values (transcriptomics): mapping and counting
-Perform normalization to remove biases introduced by sampling and measurement
3) Data analysis components
-Perform explorative data analysis
-Analyze the results of assembly/mapping
-Perform hypothesis testing: statistical tests to find significant differences between groups
4) Biological Interpretation components
-Interpret transcriptome differences in relation to experimental conditions
-Analyze the response of sets of genes
Which steps of experimental pipeline transcriptomics require prior knowledge and context?
Experimental design and biological interpretation
Genotoxicity
Property of chemical agents that damage the genetic information within a cell, causing mutations
Phred scores are for a … quality control
technical
Technical quality control: Spike-in RNA control ratio mixtures
Two mixtures of the same 92 ERCC RNA transcripts are prepared with 4 subpools of 23 transcripts per subpool with different defined abundance ratios between the two samples.
> used to assess differential expression and dynamic range
Why is normalization needed?
Biological differences are what we want to measure, but technical variability (e.g. different lengths of transcripts) should be removed.
Assumptions for recognizing technical variability
-The average expression levels are equal
-The distribution of the expression levels are the same
-Most genes have similar expression levels across all samples
-A spike-in standard can be used to quantify technical error
Issues with RNA-seq for which normalization is needed
-Sequencing depth
-Transcript length
-Transcriptome composition
Issue Sequencing depth: why
Expression values are higher for all genes in a certain sample and the total sum of the reads is higher in this sample
Workflow normalization sequencing depth: two ways to calculate Reads Per Million
Method 1
> Add up total mapped reads (depth)
> Divide read counts of each gene by this normalization factor
> Multiply with 10^6
Method 2
> Count up total reads and divide by 10^6 (per million scaling factor)
> Divide read counts by the per million scaling factor
> now you have RPM
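A minimal NumPy sketch of both methods with made-up counts; the two routes give identical RPM values:

```python
import numpy as np

# Hypothetical raw read counts for 4 genes in one sample
counts = np.array([120, 3500, 40, 880])

# Method 1: divide by total mapped reads, then multiply by 10^6
rpm_1 = counts / counts.sum() * 1e6

# Method 2: divide the total by 10^6 first (per million scaling factor),
# then divide each gene's count by that factor
rpm_2 = counts / (counts.sum() / 1e6)

assert np.allclose(rpm_1, rpm_2)
```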
Problems with RNA-seq related to transcript length
-Longer transcripts yield more reads (read length itself is fixed)
-Isoforms of a gene can have different lengths
-Important for abundance estimation and differential expression analysis
Which value needs to be obtained for normalization for transcript length?
RPKM values or FPKM values (reads or fragments per kilobase per million)
Workflow obtaining RPKM/FPKM
-Count up the total reads in a sample and divide that number by 10^6 (per million scaling factor)
-Divide the read counts by per million scaling factor
-Divide the RPM values by the length of the gene in kilobases
Workflow obtaining TPM values (transcripts per million)
-Divide read counts by length of each gene in kilobases > RPK value
-Count up all RPK values in sample and divide by 10^6 > per million scaling factor
-Divide RPK values by the per million scaling factor
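A NumPy sketch of both workflows on made-up counts and gene lengths; the only difference is the order of the two corrections, and (as the next cards explain) only the TPM columns sum to the same total per sample:

```python
import numpy as np

# Hypothetical read counts for 3 genes (rows) in 2 samples (columns),
# and gene lengths in kilobases
counts = np.array([[100, 200],
                   [400, 400],
                   [500, 1400]])
length_kb = np.array([2.0, 4.0, 1.0])

# RPKM: first correct for sequencing depth, then for gene length
rpm = counts / (counts.sum(axis=0) / 1e6)  # reads per million
rpkm = rpm / length_kb[:, None]

# TPM: first correct for gene length, then for depth
rpk = counts / length_kb[:, None]          # reads per kilobase
tpm = rpk / (rpk.sum(axis=0) / 1e6)

print(rpkm.sum(axis=0))  # totals differ between samples
print(tpm.sum(axis=0))   # both totals are exactly 10^6
```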
Why might the RPKM values be hard to compare between samples?
The total RPKM value per sample differs, and a gene's RPKM depends on that total, which biases comparisons between samples
Why is TPM used for comparing samples?
Relative abundances of transcripts are made comparable (equal sums per sample)
> a single gene length does not really exist when there is splicing (the reads come from isoforms of different lengths)
> TPM is relevant for transcript abundances
> Gene consists of isoforms and still debate for ways to calculate gene level TPMs
Issue transcriptome composition in an MA-plot
Most dots lie on a horizontal line (y = 0)
> below the line: genes specific for tissue A
What is a Funky Gene (FG)?
Not comparable gene between the samples
> when a certain gene has a very extreme difference in expression, other genes appear differentially expressed between groups even though they are not: because the total counts of one group become very high, the normalization factor is too large and roughly equal expression values end up looking different (too low in the group with the funkiness)
Normalization of funkiness of a FG.
Subtract the counts for the FG for both groups from the total counts.
Assumption for identifying FG
The majority of genes is assumed not to be differentially expressed; this still holds when an FG is present.
Under that assumption it is more plausible that only the FG is differentially expressed than that most other genes are.
DESeq normalizes data by removing FGs from the normalization equations. Which assumptions are made?
- The majority of the genes are not differentially expressed
- Reads are uniquely assigned to genes
- The effect of isoforms on differential expression is negligible
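A simplified sketch of the median-of-ratios idea behind DESeq's size factors (not the actual DESeq implementation); taking the median makes the factor robust against a single FG:

```python
import numpy as np

def size_factors(counts):
    """Median-of-ratios size factors; counts is a genes x samples array."""
    log_counts = np.log(counts)
    # Geometric mean per gene across samples; genes with a zero count drop
    # out because log(0) = -inf, mimicking DESeq's exclusion of such genes
    log_geo_means = log_counts.mean(axis=1)
    finite = np.isfinite(log_geo_means)
    # Per sample: median ratio of each count to its gene's geometric mean
    log_ratios = log_counts[finite] - log_geo_means[finite, None]
    return np.exp(np.median(log_ratios, axis=0))

counts = np.array([[100, 210], [400, 790], [50, 20000]])  # last gene is an FG
print(size_factors(counts))  # ~[0.69, 1.45]: a ratio of ~2, unaffected by the FG
```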
Differential splicing and differential gene expression are …
confounded
What are possible outcomes when differential gene expression is detected?
-Differential gene expression
-Differential isoform expression
Outcomes for no differential gene expression detected
-No differential gene expression
-Differential isoform expression (isoform switching without a net gene-level change)
Is normalization trivial for RNA-seq data?
No, but it is important
Gene level analysis and transcript level analysis are …
fundamentally different
Is there an established method which takes all three relevant issues of RNA-seq into account?
No
When up- or downregulation of gene expression is measured in transcripts, how is it expressed?
In fold change values (FC)
> mostly log transformed
What happens with 8 times upregulation and downregulation when FC values are not log transformed
8 seems like a bigger effect than 1/8, but the effect is equal in a different direction
Which log is mainly used for transforming FC values?
log2
Log2 transformation on 8x up/downregulation
-Log2(8/1) = 3
-Log2(1/8) = -3
Advantages log2 transformation
-Symmetric numbers
-Normal distribution of large datasets
-Up- and downregulation are equivalent
Reasons log transformations
-Ratios are not symmetric (around zero)
-Stabilization of variance
-Compress the range of data
-Approx. normally distributed data which is nice for analysis
Log2(FC) formula
log2(H/B) = log2(H) - log2(B)
After the normalization, you have reliable gene expression data. What is next?
Exploratory data analysis: make plots and graphs
Where is the log(FC) found on a graph with the expression of a gene from two samples on different axes?
Log(FC) corresponds to the direction perpendicular to the diagonal correlation trend (the y = x line)
Explorative data analysis plots
-dotplot: expression of different samples on axes and dots of genes
-Boxplot: different samples on x-axis and gene expressions in the y direction
-PCA plot
-MA-plot: each point is a gene; y-axis (M): log FC, x-axis (A): average log expression
-Volcano plot: p-value against FC; each point is a gene, and low (more significant) p-values end up high after transformation
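A matplotlib sketch with random stand-in values, just to show the geometry of the MA and volcano plots:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
log2_fc = rng.normal(0, 1, 2000)      # hypothetical per-gene log2 fold changes
p_values = rng.uniform(0, 1, 2000)    # hypothetical p-values
mean_expr = rng.uniform(2, 12, 2000)  # hypothetical average log expression (A)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# MA-plot: average expression (A) on x, log fold change (M) on y
ax1.scatter(mean_expr, log2_fc, s=4)
ax1.axhline(0, color="red")
ax1.set(xlabel="A (average log expression)", ylabel="M (log2 FC)", title="MA-plot")

# Volcano plot: -log10(p) on y, so significant genes rise to the top
ax2.scatter(log2_fc, -np.log10(p_values), s=4)
ax2.set(xlabel="log2 FC", ylabel="-log10(p-value)", title="Volcano plot")

plt.tight_layout()
plt.show()
```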
Are the points outside the box outliers ?
No, they are genes with very high or low expression
Comparing transcriptomes with a color diagram
A heatmap: two samples as two columns and many rows for different genes; the colors represent high or low expression, sorted on differential expression
Which transformation is needed to make very significant (low) p-values a high value for the volcano plot?
Negative log transformation
> -log10(p-value)
How can a difference be tested between two groups (healthy and patient)
t-test
t =
t = signal/noise = (meanPatient - meanHealthy) / sqrt((s1^2 + s2^2)/n)
> numerator: difference between the means of the two samples (the average log FC)
> denominator: standard error, an estimate of the standard deviation of the numerator
How is the p-value derived from the t-value?
Look it up in the t-distribution
How can the assumptions of the two-sample t-test be relaxed?
Use different variations of the t-test
Assumptions classic t-test
-Equal sample size
-Pooled variance (equal)
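A minimal SciPy sketch on made-up expression values; equal_var=False gives Welch's t-test, which relaxes the pooled-variance assumption:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
healthy = rng.normal(5.0, 1.0, size=6)  # hypothetical log expression values
patient = rng.normal(6.0, 1.0, size=6)

t, p = stats.ttest_ind(patient, healthy, equal_var=True)       # classic t-test
t_w, p_w = stats.ttest_ind(patient, healthy, equal_var=False)  # Welch variant
print(t, p, t_w, p_w)
```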
Is performing many t-tests for high-dimensional data (many genes) good practice?
No > methodological issues. Good practice is to use specialized methods and software for genome-wide differential gene expression analysis.
Why is good practice needed instead of genome-wide t-tests? Name 3 reasons
-Curse of dimensionality: many measurements, small n
-RNAseq (and microarray) data do not follow a normal distribution
-Control the number of false discoveries
Curse of dimensionality: the problem
Estimating 20,000 (all genes) times a parameter based on a few observations is not stable
> if you draw a sample of n from a normal distribution with sd=1 and estimate sd from the n samples, done 20,000 times, the plotted estimated sd values aren't very precise around 1
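A NumPy simulation of exactly this experiment (n = 4 and the seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # few replicates per gene, as in a typical RNA-seq design

# Estimate sd from n draws of a standard normal, 20,000 times (once per "gene")
sd_estimates = rng.normal(0, 1, size=(20_000, n)).std(axis=1, ddof=1)

print(sd_estimates.min(), sd_estimates.max())
# With n = 4 the estimates scatter widely around the true sd of 1; genes whose
# variance is small by chance get inflated t-values and look significant.
```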
Curse of dimensionality: the effect
When a gene's variance is small by coincidence, the t-value becomes very large (in absolute value) and the gene becomes significant by chance
Curse of dimensionality: solution
Shrink the gene variance estimates toward the average across genes
> more precise se (standard error) as a result
The technical variation in RNA-seq data follows a Poisson distribution. Explain.
The limiting distribution as n (sequencing reads) tends to infinity and p (probability of picking a gene) to zero: Poisson
Poisson distribution formula
P(X=k) = (m^k * e^(-m)) / k!, the limit of the binomial C(n,k) * p^k * (1-p)^(n-k) as n→∞ and p→0 with m = np fixed
p = chance of picking a gene
q = 1-p = chance of not picking it
> gives the probability of a number of occurrences in an interval
Characteristic of the Poisson distribution
The variance and the mean are the same
RNA-seq data do not follow a normal distribution. The idea is that technical replicates can be modelled with the Poisson distribution. The biological replicates show overdispersion (greater variability than expected), which is the biological variance. How can this variance be modelled?
With the negative binomial distribution
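A quick NumPy simulation of the two mean-variance relationships (distribution parameters chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Technical replicates: Poisson, variance equals the mean
pois = rng.poisson(lam=100, size=100_000)
print(pois.mean(), pois.var())  # both ~100

# Biological replicates: negative binomial, variance exceeds the mean
# (overdispersion); here mean = n*(1-p)/p = 100 and variance = mean/p = 1100
nb = rng.negative_binomial(n=10, p=10/110, size=100_000)
print(nb.mean(), nb.var())
```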
What kind of error is a false positive?
A type I error
What is the Bonferroni correction?
A correction that controls the probability of at least one false positive across all tests
> with p = 0.05 per test and R = 20,000 genes:
Prob(correct) = 1 - 0.05 = 0.95
Prob(globally correct) = (1 - 0.05)^20,000 ≈ 0
Prob(at least one error) = 1 - Prob(globally correct) ≈ 1
> corrected cutoff: Ac = Ae/R = 0.05/20,000 = 2.5*10^-6
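The same arithmetic as a short Python check (values taken from the card above):

```python
alpha, R = 0.05, 20_000

p_globally_correct = (1 - alpha) ** R          # ~0: an error somewhere is near-certain
p_at_least_one_error = 1 - p_globally_correct  # ~1

alpha_corrected = alpha / R                    # 2.5e-6, the Bonferroni cutoff per gene
print(p_at_least_one_error, alpha_corrected)
```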
What is the p-value?
P(data | H0 is true)
> low p-value: reject H0
Does a low p-value prove that something is different?
No, only that the data are unlikely under the hypothesis of no difference
If alpha = 0.01 and you test 100 genes, how many false discoveries are expected?
1
> if a p-value is lower than the cutoff, the difference is likely not due to chance but significantly unequal
Problem with cutoff of 0.01 with gene expression
20,000 genes > 200 false discoveries
> Bonferroni
Bonferroni correction principle
The alpha cutoff value has to be divided by the number of tests (R = 20,000 genes)
What is FWER?
The Family-Wise Error Rate: the probability of at least one false positive. It guards against any false positive, but in many cases a certain number of false positives is alright
> use the FDR as a more relevant quantity for multiple testing control (control false discoveries)
What is the FDR?
The False Discovery Rate: The expected proportion of Type I errors among the rejected hypotheses
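A hand-rolled sketch of the Benjamini-Hochberg step-up procedure, the classic way to control the FDR (statsmodels' multipletests with method='fdr_bh' does the same; the p-values here are simulated):

```python
import numpy as np

def bh_fdr(pvals, alpha=0.05):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / R) * alpha."""
    pvals = np.asarray(pvals)
    R = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= np.arange(1, R + 1) / R * alpha
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(R, dtype=bool)
    reject[order[:k]] = True
    return reject

# 20,000 p-values: mostly uniform (null genes), plus 10 truly tiny ones
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=19_990), rng.uniform(0, 1e-6, size=10)])
print(bh_fdr(pvals).sum())  # roughly the 10 true discoveries, few false ones
```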
Biological interpretation: Overrepresentation analysis principle
In a given list of genes of interest (DEGs, differentially expressed genes), is there a gene set that is more represented than expected by chance alone
What are ontologies?
Controlled vocabularies to describe functions of genes
Gene Ontology Database structure
Directed acyclic graphs (DAGs): edges run from more specialized terms to less specialized terms; unlike in a tree, a term can have multiple parents
Hypergeometric test
Tests whether there is an association between the DEGs and a gene set
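A minimal SciPy sketch with hypothetical numbers; hypergeom.sf(k-1, M, n, N) gives P(X >= k), the chance of seeing at least k gene-set members among the DEGs by chance alone:

```python
from scipy.stats import hypergeom

# Hypothetical numbers: 20,000 genes in total (M), 500 of them in the gene
# set (n), 1,000 DEGs drawn (N), of which 50 land in the gene set (k)
M, n, N, k = 20_000, 500, 1_000, 50

p_value = hypergeom.sf(k - 1, M, n, N)  # P(X >= k)
print(p_value)  # expected overlap is only N*n/M = 25, so 50 is strongly enriched
```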
Data analysis results goals
Derive p-values with statistical test
> FDR correction: FDR <= alpha
> Hypergeometric test for biological interpretation