HC 5 - Analysis of Transcriptomics Data - Part 1: Data Analysis Flashcards
Lecture 5
Steps for obtaining new knowledge (pipeline for transcriptomics)
1) Experimental design and data collection
2) Data quality control and preprocessing
3) Data analysis
4) Biological interpretation
1) Experimental design and data collection components
-Frame biological question
-Choose a platform
-Identify noise factors
-Design Experiment
2) Data quality control and preprocessing components
-Quality control of raw data
-Calculate expression values (transcriptomics): mapping and counting
-Perform normalization to remove biases introduced by sampling and measurement
3) Data analysis components
-Perform explorative data analysis
-Analyze the results of assembly/mapping
-Perform hypothesis testing: statistical tests to find significant differences between groups
4) Biological interpretation components
-Interpret transcriptome differences in relation to experimental conditions
-Analyze the response of sets of genes
Which steps of the transcriptomics experimental pipeline require prior knowledge and context?
Experimental design and biological interpretation
Genotoxicity
The property of chemical agents to damage the genetic information within a cell, causing mutations
Phred scores are for a … quality control
technical
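A Phred score Q is conventionally related to the base-calling error probability P by Q = -10 * log10(P). A minimal sketch of the conversion (the function names are hypothetical, chosen for illustration):

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score Q into the base-calling error probability P."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P into a Phred quality score Q."""
    return -10 * math.log10(p)
```

For example, Q = 30 corresponds to an error probability of about 1 in 1000.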
Technical quality control: Spike-in RNA control ratio mixtures
Two mixtures of the same 92 ERCC RNA transcripts are prepared with 4 subpools of 23 transcripts per subpool with different defined abundance ratios between the two samples.
> used to assess the measured differential expression and the dynamic range
Why is normalization needed?
The measurements should reflect biological differences; technical variability (for example, caused by different transcript lengths) should be removed.
Assumptions for recognizing technical variability
-The average expression levels are equal
-The distributions of the expression levels are the same
-Most genes have similar expression levels across all samples
-A spike-in standard can be used to quantify technical error
Issues in RNA-seq for which normalization is needed
-Sequencing depth
-Transcript length
-Transcriptome composition
Why is sequencing depth an issue?
A sample that is sequenced more deeply has a higher total read count, so the expression values of all its genes appear higher
Workflow for sequencing depth normalization: two ways to calculate reads per million (RPM)
Method 1
> Add up total mapped reads (depth)
> Divide read counts of each gene by this normalization factor
> Multiply with 10^6
Method 2
> Count up total reads and divide by 10^6 (per million scaling factor)
> Divide read counts by the per million scaling factor
> now you have RPM
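The two methods above are algebraically the same. A minimal sketch, assuming read counts are stored in a dictionary (the gene names, counts, and function names are made up for illustration):

```python
def rpm_method_1(counts):
    """Divide each count by the total mapped reads, then multiply by 10^6."""
    depth = sum(counts.values())
    return {gene: c / depth * 1e6 for gene, c in counts.items()}


def rpm_method_2(counts):
    """Divide the total by 10^6 first, then divide each count by that factor."""
    per_million = sum(counts.values()) / 1e6
    return {gene: c / per_million for gene, c in counts.items()}


# Made-up read counts for one sample (illustration only).
sample_counts = {"geneA": 10, "geneB": 30, "geneC": 60}
rpm_1 = rpm_method_1(sample_counts)
rpm_2 = rpm_method_2(sample_counts)
```

Both methods give the same values up to floating-point rounding, and the RPM values of a sample add up to one million.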
Problems in RNA-seq related to transcript length
-Longer transcripts give more reads
-Isoforms of a gene can have different lengths
-Important for abundance estimation and differential expression analysis
Which value needs to be obtained for normalization for transcript length?
RPKM values or FPKM values (reads or fragments per kilobase per million)
Workflow for obtaining RPKM/FPKM
-Count up the total reads in a sample and divide that number by 10^6 (per million scaling factor)
-Divide the read counts by per million scaling factor
-Divide the RPM values by the length of the gene in kilobases
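The three steps above can be sketched as follows, assuming gene lengths are given in base pairs (the names, counts, and lengths are made up for illustration):

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: normalize for depth first, then for length."""
    # Steps 1 and 2: per million scaling factor, then RPM.
    per_million = sum(counts.values()) / 1e6
    rpm = {gene: c / per_million for gene, c in counts.items()}
    # Step 3: divide RPM by the gene length in kilobases.
    return {gene: rpm[gene] / (lengths_bp[gene] / 1000) for gene in counts}


# Made-up counts and gene lengths (illustration only).
example_counts = {"geneA": 10, "geneB": 20, "geneC": 70}
example_lengths_bp = {"geneA": 2000, "geneB": 4000, "geneC": 1000}
rpkm_values = rpkm(example_counts, example_lengths_bp)
```

For FPKM the same arithmetic applies, with fragments counted instead of reads.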
Workflow for obtaining TPM values (transcripts per million)
-Divide read counts by length of each gene in kilobases > RPK value
-Count up all RPK values in sample and divide by 10^6 > per million scaling factor
-Divide RPK values by per million scaling factor
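The TPM workflow swaps the order of the two normalizations: length first, depth second. A minimal sketch, assuming gene lengths in base pairs (the names and numbers are made up for illustration):

```python
def tpm(counts, lengths_bp):
    """Transcripts per million: normalize for length first, then for depth."""
    # Step 1: reads per kilobase (RPK).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: per million scaling factor based on the sum of RPK values.
    per_million = sum(rpk.values()) / 1e6
    # Step 3: divide RPK by the scaling factor.
    return {gene: value / per_million for gene, value in rpk.items()}


# Made-up counts and gene lengths (illustration only).
example_counts = {"geneA": 10, "geneB": 20, "geneC": 70}
example_lengths_bp = {"geneA": 2000, "geneB": 4000, "geneC": 1000}
tpm_values = tpm(example_counts, example_lengths_bp)
```

Because the scaling factor is computed after the length correction, the TPM values of a sample always add up to one million.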
Why might the RPKM values be hard to compare between samples?
The sum of all RPKM values differs per sample, so the same RPKM value for a gene represents a different share of the transcriptome in each sample (biased)
Why is TPM used for comparing samples?
Relative abundances of transcripts are made comparable (equal sums per sample)
> with splicing there is no single well-defined gene length (the reads in transcriptomics come from transcripts)
> TPM is relevant for transcript abundances
> A gene consists of isoforms, and how to calculate gene-level TPMs is still debated
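The difference between RPKM and TPM can be checked with two made-up samples that have the same sequencing depth but a different composition. This is a sketch under those assumptions; all names and numbers are hypothetical:

```python
def rpkm(counts, lengths_kb):
    """Depth normalization first, then length normalization."""
    per_million = sum(counts.values()) / 1e6
    return {g: counts[g] / per_million / lengths_kb[g] for g in counts}


def tpm(counts, lengths_kb):
    """Length normalization first, then depth normalization."""
    rpk = {g: counts[g] / lengths_kb[g] for g in counts}
    per_million = sum(rpk.values()) / 1e6
    return {g: v / per_million for g, v in rpk.items()}


# Made-up gene lengths (in kilobases) and counts for two samples of equal depth.
lengths_kb = {"geneA": 2.0, "geneB": 4.0, "geneC": 1.0}
sample_1 = {"geneA": 10, "geneB": 20, "geneC": 70}
sample_2 = {"geneA": 30, "geneB": 60, "geneC": 10}

rpkm_sums = [sum(rpkm(s, lengths_kb).values()) for s in (sample_1, sample_2)]
tpm_sums = [sum(tpm(s, lengths_kb).values()) for s in (sample_1, sample_2)]
```

Here the RPKM totals of the two samples differ (about 800,000 versus 400,000), while both TPM totals are one million, so a TPM value means the same proportion in each sample.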
How does the transcriptome composition issue show up in an MA plot?
Horizontal line with a lot of dots (y=0)
> under the line: genes specific for tissue A
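An MA plot shows, per gene, the log ratio between two samples (M, y-axis) against the average log expression (A, x-axis). A minimal sketch of the two values, assuming M is defined as log2(B) - log2(A) so that genes specific for tissue A fall below the line, and assuming a pseudocount of 1 to avoid taking the log of zero (both choices are assumptions for illustration):

```python
import math


def ma_values(count_a, count_b, pseudocount=1.0):
    """Return (M, A) for one gene measured in tissue A and tissue B."""
    log_a = math.log2(count_a + pseudocount)
    log_b = math.log2(count_b + pseudocount)
    m = log_b - log_a        # log ratio: y-axis of the MA plot
    a = (log_a + log_b) / 2  # mean log expression: x-axis of the MA plot
    return m, a


# Made-up counts (illustration only).
equal_gene = ma_values(7, 7)              # same expression in both tissues
tissue_a_specific = ma_values(1023, 0)    # expressed only in tissue A
```

A gene with equal counts lands on y = 0, and a gene expressed only in tissue A lands far below it.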
What is a Funky Gene (FG)?
A gene that is not comparable between the samples
> when a gene has a very extreme difference in expression, other genes also appear to differ between the groups. This is not reliable: the total counts of one group become extremely high, so its normalization factor is too large, and genes with roughly equal expression look different (too low in the group with the FG).
Normalization for the funkiness of an FG
Subtract the counts of the FG from the total counts in both groups.
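The effect of an FG on the normalization factor, and of subtracting it from the totals, can be sketched with made-up counts (the gene names, numbers, and function name are hypothetical):

```python
def rpm_excluding(counts, excluded=()):
    """RPM where the counts of the excluded genes are subtracted from the total."""
    total = sum(c for gene, c in counts.items() if gene not in excluded)
    return {gene: c / total * 1e6 for gene, c in counts.items()}


# Made-up counts: geneA, geneB and geneC are identical, only FG differs.
group_1 = {"geneA": 100, "geneB": 200, "geneC": 300, "FG": 0}
group_2 = {"geneA": 100, "geneB": 200, "geneC": 300, "FG": 5400}

# Naive normalization: the FG inflates the total of group 2.
naive_1 = rpm_excluding(group_1)
naive_2 = rpm_excluding(group_2)

# FG counts subtracted from the totals of both groups.
fixed_1 = rpm_excluding(group_1, excluded=("FG",))
fixed_2 = rpm_excluding(group_2, excluded=("FG",))
```

With the naive totals, geneA looks ten times lower in group 2 although its raw count is identical. After subtracting the FG counts, geneA gets the same value in both groups.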
Assumption for identifying FG
The majority of genes are not differentially expressed. With an FG, this still holds.
Under this assumption it is more plausible that only the FG is differentially expressed
DESeq normalizes data by removing FGs from the normalization equations. Which assumptions are made?
- The majority of the genes are not differentially expressed
- Reads are uniquely assigned to genes
- The effect of isoforms on differential expression is negligible
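DESeq's size factors are commonly described as a median-of-ratios estimate. The plain-Python sketch below illustrates that idea and is not the package's own code; the function name and counts are made up. Because a median is used, a single FG barely affects the factor, which relies on the first assumption above.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: list of genes, each a list with one count per sample.
    """
    n_samples = len(count_matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in count_matrix:
        # Genes with a zero count have no usable geometric mean and are skipped.
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Per sample: median of the log ratios, transformed back to a ratio.
    return [math.exp(median(r)) for r in log_ratios]


# Made-up counts: sample 2 is sequenced twice as deep, the last gene is an FG.
counts = [
    [100, 200],
    [200, 400],
    [300, 600],
    [10, 5000],
]
factors = size_factors(counts)
```

In this example the ratio between the two size factors is 2, matching the twofold difference in depth, even though the FG differs by a factor of 500.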
Differential splicing and differential gene expression are …
confounded
What are possible outcomes when differential gene expression is detected?
-Differential gene expression
-Differential isoform expression
What are possible outcomes when no differential gene expression is detected?
-No differential gene expression
-Differential isoform expression