RNA seq Flashcards
What are the objectives of RNA seq?
- Study gene regulation and expression variation (e.g. compare different tissues, time points, disease states)
- Understand the structure, function and organization of information within the genome
- And many more: sub-classifying cancer, spatial transcriptomics, host-pathogen interaction
Describe microarrays
- quick and cost effective
- based on hybridization to complementary sequence
- With Affymetrix (Affy) arrays you need two chips per experiment: one for the control and one for the actual experiment
- Very noisy
What are some limitations of microarrays?
- The data is very “noisy”
- Expression levels are determined by a spot of light against a noisy background
- Probes are not available for all genes - Affy probes are only present for approx 75-80% of human genes
- Genes with very low expression may not be detected
- The data requires a large degree of statistical manipulation
- Result only shows a gene is expressed but gives no information about which transcript
Outline the workflow used in RNA seq
Compare RNA sequencing and Microarrays
- Method works as it can be assumed every mRNA present will be sequenced the same number of times
- If experiment shows twice as much mRNA for a particular gene as control then gene expression is 2 fold greater
- RNA-seq gives more accurate quantification and has better dynamic range (ability to quantify genes expressed at low and high levels)
- Not limited by microarray probe sequences and availability
- RNA-seq can potentially identify novel transcripts (e.g. new splice sites)
- RNA-seq can be used to study alternative splicing
Outline the RNA seq analysis procedure
Describe library preparation
- Total RNA extraction and target RNA enrichment
- Poly(A) capture
- Ribosomal RNA depletion
- Fragment RNA and reverse transcribe
- Ligate adapters and PCR amplify
- indexes/barcodes allow multiplexing
What should you consider in your experimental design for RNA seq?
- Single vs paired end (latter helps identify e.g. isoforms)
- Sequencing depth (deeper sequencing detects more transcripts)
- Biological replicates (important for differential expression) - you need to have many samples to identify any errors
- Spike-in RNAs (can help with normalization and quality control) - you add a known amount of RNA and then you can normalize your data
- Multiplexing (pool barcoded samples, then split across lanes)
- Batch design (randomize samples across experimental batches, cannot correct for batch effects if technical and experimental factors are confounded)
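The batch design point can be sketched in code. This is a minimal illustration, assuming two hypothetical conditions ("control", "treated") with made-up sample names; it shuffles the samples within each condition and deals them round-robin across batches so that no batch contains only one condition.

```python
import random


def assign_batches(samples_by_condition, n_batches, seed=0):
    """Spread the samples of every condition across all batches."""
    rng = random.Random(seed)  # fixed seed so the design is reproducible
    batches = [[] for _ in range(n_batches)]
    slot = 0
    for condition, samples in samples_by_condition.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)  # random order within each condition
        for sample in shuffled:
            batches[slot % n_batches].append((sample, condition))
            slot += 1
    return batches


# Hypothetical experiment: 4 control and 4 treated samples, 2 library-prep batches
design = {
    "control": ["c1", "c2", "c3", "c4"],
    "treated": ["t1", "t2", "t3", "t4"],
}
batches = assign_batches(design, n_batches=2)
for number, batch in enumerate(batches, start=1):
    print("batch", number, batch)
```

If all controls were in batch 1 and all treated samples in batch 2, the batch effect and the treatment effect could not be separated afterwards.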
Describe the quality control step in RNA seq
- Assess the quality and trim the reads if needed
- Quality control is an essential step in the analysis as poor quality reads can significantly impact results
What are some problems you can face during QC?
- Low-quality sequences (low confidence bases)
- Sequencing artefacts (duplicate reads, sequence bias)
- Sequence contamination (reads from another organism)
How can you solve the problems of low quality in QC?
- FastQC for simple QC checks on raw reads - its report helps you decide which low-quality reads to remove or trim
- Discard low quality reads, and trim adapter sequences & poor quality bases (e.g. using Trimmomatic)
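To make "trim poor quality bases" concrete, here is a minimal sketch, assuming Phred+33 encoded FASTQ quality strings and an arbitrary quality threshold of 20. It is a simple 3' end trim written for illustration, not the algorithm Trimmomatic actually uses, and the read shown is made up.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(sequence, quality_string, min_quality=20):
    """Remove bases from the 3' end until one reaches min_quality."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


def passes_filter(quality_string, min_mean_quality=20, min_length=5):
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(quality_string) < min_length:
        return False
    scores = phred_scores(quality_string)
    return sum(scores) / len(scores) >= min_mean_quality


# Made-up read: eight high-quality bases followed by two low-quality ones
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, passes_filter(trimmed_qual))
```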
Describe read alignment
After quality control, reads are aligned to a reference genome or transcriptome.
Method depends on experiment aims and availability of suitable references.
What will you have to do if you’re not confident in your mRNA reads?
- If you are not confident in your mRNA reads then you will probably have to map against the genome. This is more difficult because reads then have to be mapped across exon boundaries (splice junctions).
What methods of alignment can you have?
- alignment to reference genome
- alignment to reference transcriptome
- alignment to de novo assembled transcriptome
Describe alignment to reference genome
- requires splice-aware aligners (e.g. STAR, HISAT2)
- use known splice junctions, but can also discover new ones
- computational challenge is to accurately align reads that span splice junctions
- you can give the aligner an annotation file listing the exons (e.g. a GTF file) so it is aware of where they are
Describe alignment to reference transcriptome
- unspliced alignment (e.g. Bowtie2)
- generally faster, but requires comprehensive reference transcriptome
- main challenge is dealing with multi-mapping (reads that map to several transcripts)
Describe alignment to de novo assembled transcriptome
if no suitable reference genome, first assemble reads into contigs, and align reads to this de novo transcriptome (e.g. for novel genome, cancer samples)
Describe reference based mapping
- If the reference genome is well annotated e.g. human, the reads can be mapped to known genes using the GTF file
- For less well defined annotations or where more accurate mapping is required then the mapped transcripts need to be assembled
- Novel genome
- To identify variable transcripts/isoforms
- Cancer samples
- Tools to produce these assembled transcripts include Cufflinks and StringTie
- Read mapping aligns reads to the reference genome, marking reads that align with and without splice junctions.
- Those that map unspliced must be exons, any that jump regions must span introns
- Cufflinks/StringTie look at the distribution of reads and estimates a transcript and transcript/gene read counts
Do we have alignment-free methods? What are the benefits of them?
- Recent methods (e.g. Kallisto, Salmon) avoid full alignments of each read, and instead use ‘pseudoalignments’ that identify which transcripts are compatible with a given read (but not exactly where that read aligns).
- Very fast, accurate and computationally efficient
- Assume well annotated transcriptomes and cannot identify novel transcripts
- Pseudo aligners have been developed to provide faster and more efficient read mapping
- Kallisto “can quantify 30 million human reads in less than 3 minutes on a Mac
desktop computer using only the read sequences and a transcriptome index
that itself takes less than 10 minutes to build”
How can you quantify expression?
- direct fragment counting
- transcript level quantification
- alignment-free quantification
Describe direct fragment counting
- Count fragments that overlap each gene (use e.g. featureCounts, HTSeq)
- Simple and fast
- How to deal with multi-mapping reads?
- No information on relative transcript abundances
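A minimal sketch of direct fragment counting, assuming a hypothetical annotation of three genes given as 1-based inclusive intervals and a handful of made-up aligned fragments. Fragments overlapping more than one gene are simply set aside as ambiguous here, which is one possible policy and not a description of what featureCounts or HTSeq do by default.

```python
# Hypothetical annotation: gene -> (chromosome, start, end), 1-based inclusive
annotation = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 450, 900),
    "geneC": ("chr2", 50, 300),
}


def overlapping_genes(chrom, start, end, genes):
    """Return every gene whose interval overlaps the fragment."""
    return [
        name
        for name, (g_chrom, g_start, g_end) in genes.items()
        if g_chrom == chrom and start <= g_end and end >= g_start
    ]


def count_fragments(fragments, genes):
    """Count fragments per gene; fragments hitting several genes are not assigned."""
    counts = {name: 0 for name in genes}
    ambiguous = 0
    for chrom, start, end in fragments:
        hits = overlapping_genes(chrom, start, end, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1
    return counts, ambiguous


# Made-up aligned fragments: (chromosome, start, end)
fragments = [
    ("chr1", 120, 220),  # geneA only
    ("chr1", 460, 520),  # overlaps geneA and geneB
    ("chr1", 600, 700),  # geneB only
    ("chr2", 100, 200),  # geneC only
    ("chr3", 1, 100),    # no annotated gene
]
counts, ambiguous = count_fragments(fragments, annotation)
print(counts, ambiguous)
```

Note that the result is one number per gene, so nothing can be said about which transcript of a gene the fragments came from.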
Describe transcript level quantification
- Assign fragments to specific transcripts
- Can aggregate over all possible isoforms to obtain gene-level count
- Use a statistical model to handle multi-mapping reads, and assign these fragments probabilistically
- Can observe e.g. changes in isoform usage
Describe alignment free quantification
- Bypasses full alignment; fast and accurate
How can you estimate transcript abundances?
- Many methods have been developed to deal with multimapping reads.
- These use statistical models to link the probability of observing a given fragment to the relative transcript abundances.
- Optimisation algorithms are then used to infer transcript abundances given the observed reads and the assumed model.
- Cufflinks is an example which combines transcript assembly with abundance estimation. It uses fragment length information to help assign reads.
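The idea of inferring abundances from multi-mapping reads can be sketched with a small expectation-maximisation (EM) loop. This is a deliberately simplified model, assuming made-up reads and two hypothetical transcripts; it ignores transcript length and fragment length information, so it is not the model Cufflinks or any other named tool uses.

```python
def em_abundances(compatibility_sets, transcripts, n_iter=100):
    """Estimate the fraction of reads from each transcript.

    compatibility_sets: one set of transcript names per read, listing the
    transcripts that read could have come from.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    n_reads = len(compatibility_sets)
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: share each read between its compatible transcripts,
        # in proportion to the current abundance estimates
        for compatible in compatibility_sets:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                expected[t] += theta[t] / total
        # M-step: new abundances are the expected read fractions
        theta = {t: expected[t] / n_reads for t in transcripts}
    return theta


# Made-up data: 6 reads unique to t1, 2 unique to t2, 4 compatible with both
reads = [{"t1"}] * 6 + [{"t2"}] * 2 + [{"t1", "t2"}] * 4
theta = em_abundances(reads, ["t1", "t2"])
print(theta)
```

In this toy case the shared reads end up split 3:1 in favour of t1, because the uniquely mapping reads already suggest t1 is three times as abundant.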
What are pseudo aligners?
- FM indexing illustrates one of the challenges with mapping reads to the entire genome
- Pseudo aligners have been developed to provide faster and more efficient read mapping
- Use k-mers rather than aligning full reads
- Examples include Salmon and Kallisto
- Kallisto “can quantify 30 million human reads in less than 3 minutes on a Mac desktop
computer using only the read sequences and a transcriptome index that itself takes
less than 10 minutes to build”
Explain how Kallisto works
- For each read determines which transcripts it is from rather than where it aligns
- Therefore, not necessary to do a full alignment of the reads to the genome
- Raw sequence reads are directly compared to transcript sequences and then used to quantify transcript abundance
- The comparison of the sequencing reads to the transcripts is done using a transcriptome de Bruijn graph (T-DBG)
- T-DBG constructed from the k-mers present in an input transcriptome as opposed to reads which is done normally for genome/transcriptome assembly.
- Transcripts converted into a T-DBG
- Each node/vertex is a k-mer in the T-DBG and is associated with one or more transcripts - its k-compatibility class
- Left most node has k-compatibility class of all 3 transcripts
- Once T-DBG built kallisto stores a hash table mapping each k-mer to the linear stretches (e.g. the first 3 nodes) it is contained in as well as the position
- Called the “kallisto index”
- Reads are also split into k-mers and matched to transcripts using the hash table
- The black nodes represent the k-mers of the read, where they match transcript k-mers
- To identify the transcript(s) a read is from identify all associated k-compatibility classes
- The k-compatibility classes of all black nodes for this read are the blue and pink transcripts
- Can be extended to paired-end reads using all the k-compatibility classes along both reads
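The core idea (intersecting the k-compatibility classes of a read's k-mers) can be sketched as follows. This is a simplified illustration, assuming three made-up transcript sequences and k = 5; it uses a plain k-mer hash table and does not build a T-DBG or skip redundant nodes as kallisto does.

```python
def kmers(sequence, k):
    """All overlapping k-mers of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(transcripts, k):
    """Map every k-mer to the set of transcripts containing it
    (its k-compatibility class)."""
    index = {}
    for name, sequence in transcripts.items():
        for kmer in kmers(sequence, k):
            index.setdefault(kmer, set()).add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the k-compatibility classes of all indexed k-mers in the read."""
    compatible = None
    for kmer in kmers(read, k):
        classes = index.get(kmer)
        if classes is None:
            continue  # k-mer not in the index; ignored in this sketch
        compatible = set(classes) if compatible is None else compatible & classes
    return compatible or set()


# Made-up transcripts sharing their first 10 bases
K = 5
transcripts = {
    "tx_blue": "ATGGCGTACGTTAGCCA",
    "tx_pink": "ATGGCGTACGTTAGGGT",
    "tx_green": "ATGGCGTACGCCCTTAA",
}
index = build_index(transcripts, K)
print(pseudoalign("GTACGTTAG", index, K))    # compatible with tx_blue and tx_pink
print(pseudoalign("GTACGTTAGCC", index, K))  # compatible with tx_blue only
```

The result says which transcripts a read is compatible with, but not where in those transcripts it aligns.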
What do you need to use Kallisto?
- Only works with a good (well annotated) reference transcriptome
How does Kallisto improve efficiency?
- Kallisto improves efficiency by utilising redundancy - the 3 left most nodes have the same k-compatibility class - the same equivalence class
- When a read k-mer is hashed, the k-compatibility class of its node is identified and kallisto jumps to the node after the last one in the same equivalence class
- Once the left most k-mer of the read is hashed, kallisto ignores the next 2 nodes as they are redundant and hashes only the 4th k-mer of the read
- For most reads kallisto performs a hash lookup for only two k-mers
What are the advantages and disadvantages of using pseudo aligners?
- Pseudo-aligners provide a highly efficient method for mapping reads to transcripts
- Benefits include speed and computational resources required
- Disadvantage is that reads can only be mapped to known transcripts
- Unable to identify and quantify unknown or novel transcripts
- Quantification is at the transcript and not gene level - a slight disadvantage, as we usually work with genes, not transcripts
- R library available to convert to gene level quantification for input to DESeq2 etc - tximport
- Kallisto has a companion R library for transcript differential expression analysis - sleuth
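Converting transcript-level estimates to gene level is, at its simplest, a sum over the transcripts of each gene. The sketch below shows only that summation, assuming made-up estimated counts and a hypothetical transcript-to-gene table; tximport itself is an R package and does more than this.

```python
def summarise_to_gene(transcript_counts, tx2gene):
    """Add up estimated counts of all transcripts belonging to the same gene."""
    gene_counts = {}
    for transcript, count in transcript_counts.items():
        gene = tx2gene[transcript]
        gene_counts[gene] = gene_counts.get(gene, 0.0) + count
    return gene_counts


# Made-up estimated counts (pseudo-aligners report fractional counts)
transcript_counts = {"tx1": 120.5, "tx2": 30.0, "tx3": 75.0}
# Hypothetical transcript-to-gene mapping
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_counts = summarise_to_gene(transcript_counts, tx2gene)
print(gene_counts)
```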
What is normalisation for?
- Normalisation adjusts the read counts to compensate for technical differences, such as feature length and total read count, so that expression values can be compared
What is within sample normalisation?
- Several units normalize counts by feature length to allow comparison of features WITHIN a sample: RPKM, FPKM, TPM
- They also normalize by total read count but this is generally NOT sufficient for comparison between samples
What is TPM?
- Very similar to RPKM and FPKM; the difference is the order of operations (TPM normalizes for feature length first, then for sequencing depth)
- FPKM and TPM (transcripts per million) are both measures of the relative abundance of a transcript in your pool of transcripts.
- TPM is now generally preferred over FPKM (as the proportionality constant for FPKM is experiment specific).
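The difference in order of operations can be seen in a short calculation. This is a sketch using made-up counts and gene lengths: FPKM divides by the total read count and by length, whereas TPM first divides by length and then rescales so that the values in a sample sum to one million.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of feature per million mapped fragments."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths_bp[g] * total) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] / scale * 1e6 for g in counts}


# Made-up sample: geneB has twice the reads of geneA but is also twice as long
counts = {"geneA": 100, "geneB": 200, "geneC": 300}
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 1000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
print(fpkm_values)
print(tpm_values)
print(sum(tpm_values.values()))  # always one million; the FPKM total varies by sample
```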
What are the problems we can face with relative abundance measures?
- FPKM & TPM both normalize for total read count, but this is generally not sufficient to make comparisons between samples.
- They can only tell us the relative proportion of transcripts in a sample.
- If we make further assumptions, we can develop suitable methods to compare gene abundances between samples.
- Ballgown further processes e.g. allowing for total read count per sample.
- YOU CAN’T COMPARE BETWEEN SAMPLES WITH THIS METHOD, ONLY WITHIN A SAMPLE
- Example: assume 4 genes in the genome, each equally expressed in condition 1. Genes C and D are down regulated in condition 2 but the other 2 have the same expression. Both conditions have 8000 reads mapped.
- As the number of reads is the same in both conditions, the RPKM for genes A and B would be higher in condition 2 than in condition 1, wrongly suggesting they are up regulated
- This is the RNA composition effect
- Tools that use this method, such as ballgown, also maintain extra information about samples (e.g. the total number of reads)
- This allows for the proper comparison of these normalized measures across samples
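The four-gene example can be worked through numerically. This sketch assumes, for illustration only, that every gene is 1 kb long and that genes C and D drop to half their expression in condition 2; the 8000 mapped reads per condition come from the example itself.

```python
GENE_LENGTH_BP = 1000  # assumption: all four genes are 1 kb long
TOTAL_READS = 8000     # from the example: 8000 mapped reads per condition


def expected_reads(true_abundance, total_reads):
    """Reads are shared between genes in proportion to true abundance."""
    total = sum(true_abundance.values())
    return {g: total_reads * a / total for g, a in true_abundance.items()}


def rpkm(reads, length_bp, total_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return reads * 1e9 / (length_bp * total_reads)


condition1 = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
condition2 = {"A": 1.0, "B": 1.0, "C": 0.5, "D": 0.5}  # assumption: C and D halved

reads1 = expected_reads(condition1, TOTAL_READS)
reads2 = expected_reads(condition2, TOTAL_READS)
rpkm1 = {g: rpkm(r, GENE_LENGTH_BP, TOTAL_READS) for g, r in reads1.items()}
rpkm2 = {g: rpkm(r, GENE_LENGTH_BP, TOTAL_READS) for g, r in reads2.items()}

for gene in "ABCD":
    print(gene, round(rpkm1[gene]), round(rpkm2[gene]))
```

Genes A and B have the same true expression in both conditions, yet their RPKM rises by a third in condition 2 because they take up a larger share of the same 8000 reads.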
How do you do between sample normalisation?
- Highly and differentially expressed (DE) genes can distort gene abundance measures, so total read count is not an accurate normalization factor.
- Instead, we want to find a ‘control’ set of transcripts that are not DE and use these to estimate ‘size factors’ that enable meaningful comparisons between samples.
- Several methods exist (all assume most genes are not DE), including: TMM and Median of Ratios
Describe TMM
- Uses a trimmed weighted mean. Excludes genes with large log-fold ratios between samples, and those with extreme abundance values, before calculating a weighted mean of log-fold changes.
- Anything extreme is excluded
- Used by edgeR
Describe Median of ratios
- Uses the median of ratios of observed counts. For each gene, calculate the (geometric) mean of its expression across all samples and treat this as a pseudo-reference. For each sample, calculate the ratios of observed gene counts to these pseudo-references, and take the median value as the size factor.
- Used by DESeq2
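The steps described above can be sketched directly. This is a minimal illustration on made-up counts for two samples, assuming genes with a zero count in any sample are skipped; it follows the description on this card and is not DESeq2's code.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: dict mapping sample name -> list of gene counts (same gene order).
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])

    # Log of the geometric mean of each gene across samples (the pseudo-reference)
    log_reference = []
    for g in range(n_genes):
        values = [counts[s][g] for s in samples]
        if all(v > 0 for v in values):
            log_reference.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_reference.append(None)  # skipped: geometric mean would be zero

    factors = {}
    for s in samples:
        log_ratios = [
            math.log(counts[s][g]) - ref
            for g, ref in enumerate(log_reference)
            if ref is not None
        ]
        factors[s] = math.exp(median(log_ratios))
    return factors


# Made-up counts: sample2 was sequenced twice as deeply, and the last gene
# is additionally up regulated in sample2
counts = {
    "sample1": [100, 200, 300, 400],
    "sample2": [200, 400, 600, 4000],
}
factors = size_factors(counts)
print(factors)
```

Because the median is used, the one strongly changed gene does not affect the result: sample2 gets a size factor twice that of sample1, matching the twofold difference in depth.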
What is batch correction?
- As well as normalization, we may need to correct for batch effects.
- These are confounding factors that cause unwanted variation in gene abundances between samples, due to technical factors that differ across batches (samples that are processed in parallel).
- For example, differences in reagents, equipment, or date of library preparation or sequencing may cause batch effects.
- Various tools (e.g. ComBat, svaseq) enable batch correction, assuming suitable experimental design so technical and experimental factors are not confounded.
What do you need for identification of differentially expressed genes?
- More replicates increases our power to detect DE genes.
- Minimum of 3 biological replicates recommended.
- RNA-seq experiments often have few replicates, so specialized statistical methods are required
How do you get DE genes?
- Testing for a statistically significant change in expression
- Many individual statistical hypothesis tests are performed (for each of 100s- 1000s of genes) so p-values need to be corrected for multiple testing
- These methods generally require unnormalized data, as they perform an
integrated normalization step - use raw or estimated read counts
What is multiple-testing correction and why do we need it?
- Calculated p-values need to be adjusted when many statistical tests are performed, in order to control the false discovery rate (FDR)
- This applies to differential expression calculations where multiple genes are being compared
- DESeq2 implements Benjamini Hochberg multiple test correction and reports a q-value along with the p-value
- The q-value is the adjusted p-value and the significance value that should be used
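The Benjamini-Hochberg adjustment itself is short enough to write out. This is a sketch of the standard procedure applied to four made-up p-values; it is written for illustration and is not taken from DESeq2.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four genes
pvalues = [0.01, 0.04, 0.03, 0.005]
qvalues = benjamini_hochberg(pvalues)
print(qvalues)

# Genes called significant at an FDR threshold of 0.05
significant = [q < 0.05 for q in qvalues]
print(significant)
```

Each p-value is multiplied by the number of tests divided by its rank, so the smallest p-values are penalised most heavily.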
What would be next steps in differential expression analysis?
- you can use RNA seq to narrow down the pool of genes and then do qPCR on the subset of genes
- RNA seq is a starting point - then you can take the genes to further analysis if you see anything interesting
What is the role of RNA seq in studying alternative splicing?
- RNA-seq has applications beyond quantifying gene expression.
- One gene may give rise to several different mRNAs (and protein isoforms) due to alternative splicing
- RNA-seq allows us to study changes in isoform expression.
- Identification of mutations
- Mutations may affect RNA cis-regulatory elements, spliceosomal components, or trans-acting regulatory factors.
What are gene fusions and how can we link RNA seq to them?
- Chromosomal rearrangements can lead to fused transcripts.
- RNA-seq allows us to detect these fusion events.
- Gene fusions are commonly reported in many types of cancer, and may be used for diagnosis and prognosis.
- A fusion junction is a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene
- It might occur as result of a translocation, deletion or chromosomal inversion
- Example - PML-RAR protein associated with Acute Promyelocytic Leukaemia
- These types of structural rearrangements can also be identified by direct sequencing using paired end reads