RNA seq Flashcards

1
Q

What are the objectives of RNA seq?

A
  • Study gene regulation and expression variation (e.g. compare different tissues, time points, disease states)
  • Understand the structure, function and organization of information
    within the genome
  • And many more: sub-classifying cancer, spatial transcriptomics, host-pathogen interaction
2
Q

Describe microarrays

A
  • quick and cost effective
  • based on hybridization to complementary sequence
  • with Affymetrix (Affy) arrays you need two chips per experiment: one for the control and one for the actual experiment
  • very noisy
3
Q

What are some limitations of microarrays?

A
  • The data is very “noisy”
  • Expression levels are determined by a spot of light against a noisy background
  • Probes are not available for all genes - Affy probes are only present for approx 75-80% of human genes
  • Genes with very low expression may not be detected
  • The data requires a large degree of statistical manipulation
  • Result only shows a gene is expressed but gives no information about which transcript
4
Q

Outline the workflow used in RNA seq

A
5
Q

Compare RNA sequencing and Microarrays

A
  • RNA-seq quantification works because it can be assumed that every mRNA molecule present is equally likely to be sequenced, so read counts are proportional to abundance
  • If the experiment shows twice as much mRNA for a particular gene as the control, then gene expression is 2-fold greater
  • RNA-seq gives more accurate quantification and has better dynamic range (ability to quantify genes expressed at low and high levels)
  • Not limited by microarray probe sequences and availability
  • RNA-seq can potentially identify novel transcripts (e.g. new splice sites)
  • RNA-seq can be used to study alternative splicing
6
Q

Outline the RNA seq analysis procedure

A
7
Q

Describe library preparation

A
  • Total RNA extraction and target RNA enrichment
    • Poly(A) capture
    • Ribosomal RNA depletion
  • Fragment RNA and reverse transcribe
  • Ligate adapters and PCR amplify
    • indexes/barcodes allow multiplexing
8
Q

What should you consider in your experimental design for RNA seq?

A
  • Single vs paired end (latter helps identify e.g. isoforms)
  • Sequencing depth (deeper sequencing detects more transcripts)
  • Biological replicates (important for differential expression): you need to have many samples to identify any errors
  • Spike-in RNAs (can help with normalization and quality control) - you add a known amount of RNA and then you can normalize your data
  • Multiplexing (pool barcoded samples, then split across lanes)
  • Batch design (randomize samples across experimental batches, cannot correct for batch effects if technical and experimental factors are confounded)
9
Q

Describe the quality control step in RNA seq

A
  • Assess the quality and trim the reads if needed
  • Quality control is an essential step in the analysis as poor quality reads can significantly impact results
10
Q

What are some problems you can face during QC?

A
  • Low-quality sequences (low confidence bases)
  • Sequencing artefacts (duplicate reads, sequence bias)
  • Sequence contamination (reads from another organism)
11
Q

How can you solve the problems of low quality in QC?

A
  • FastQC runs simple QC checks on raw reads and helps you identify low-quality reads to remove or trim
  • Discard low quality reads, and trim adapter sequences and poor quality bases (e.g. using Trimmomatic)
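The trimming idea can be sketched in a few lines. This is a minimal illustration only, assuming Phred+33 encoded quality strings; the function names, the threshold of 20 and the minimum length of 5 are hypothetical choices, and this is not how Trimmomatic itself is implemented.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def keep_read(seq: str, min_len: int = 5) -> bool:
    """Discard reads that are too short after trimming (illustrative cutoff)."""
    return len(seq) >= min_len


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIIII##")
print(seq, qual, keep_read(seq))
```

Here the last two low-confidence bases are trimmed and the remaining read is kept.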
12
Q

Describe read alignment

A

After quality control, reads are aligned to a reference genome or transcriptome.
Method depends on experiment aims and availability of suitable references.

13
Q

What will you have to do if you’re not confident in your mRNA reads?

A

If you are not confident in your mRNA reads, you will probably have to map against the genome. This is more difficult because reads then have to be mapped across exon boundaries.

14
Q

What methods of alignment can you have?

A

- alignment to reference genome
- alignment to reference transcriptome
- alignment to de novo assembled transcriptome

15
Q

Describe alignment to reference genome

A
  • requires splice-aware aligners (e.g. STAR, HISAT2)
  • use known splice junctions, but can also discover new ones
  • computational challenge is to accurately align reads that span splice junctions
  • you can give the aligner an annotation file (e.g. a GTF file) listing the exons so it is aware of where they are
16
Q

Describe alignment to reference transcriptome

A
  • unspliced alignment (e.g. Bowtie2)
  • generally faster, but requires comprehensive reference transcriptome
  • main challenge is dealing with multi-mapping (reads that map to several transcripts)
17
Q

Describe alignment to de novo assembled transcriptome

A

If there is no suitable reference genome, first assemble the reads into contigs, then align the reads to this de novo transcriptome (e.g. for a novel genome or cancer samples)

18
Q

Describe reference based mapping

A
  • If the reference genome is well annotated e.g. human, the reads can be mapped to known genes using the GTF file
  • For less well defined annotations or where more accurate mapping is required then the mapped transcripts need to be assembled
    • Novel genome
    • To identify variable transcripts/isoforms
    • Cancer samples
  • Tools to produce these assembled transcripts include Cufflinks and StringTie
  • Read mapping aligns reads to the reference genome, marking reads that align with and without splice junctions.
  • Reads that map unspliced must lie within exons; any that jump regions must span introns
  • Cufflinks/StringTie look at the distribution of reads and estimate transcripts along with transcript/gene read counts
19
Q

Do we have alignment-free methods? What are their benefits?

A
  • Recent methods (e.g. Kallisto, Salmon) avoid full alignments of each read, and instead use ‘pseudoalignments’ that identify which transcripts are compatible with a given read (but not exactly where that read aligns).
  • Very fast, accurate and computationally efficient
  • Assume well annotated transcriptomes and cannot identify novel transcripts
  • Pseudo aligners have been developed to provide faster and more efficient read mapping
  • “Kallisto can quantify 30 million human reads in less than 3 minutes on a Mac
    desktop computer using only the read sequences and a transcriptome index
    that itself takes less than 10 minutes to build”
20
Q

How can you quantify expression?

A

- direct fragment counting
- transcript level quantification
- alignment-free quantification

21
Q

Describe direct fragment counting

A
  • Count fragments that overlap each gene (use e.g. featureCounts,
    HTSeq)
  • Simple and fast
  • How to deal with multi-mapping reads?
  • No information on relative transcript abundances
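A minimal sketch of the counting idea, assuming a single chromosome and a hypothetical annotation of 1-based inclusive gene intervals. Fragments that overlap more than one gene are simply tallied as ambiguous here, which is only one possible policy; featureCounts and HTSeq offer several options for such cases, and this sketch does not reproduce either tool.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (start, end), 1-based inclusive
genes = {"geneA": (100, 500), "geneB": (450, 900), "geneC": (1500, 2000)}


def overlapping_genes(start, end, genes):
    """Return the names of all genes whose interval overlaps [start, end]."""
    return [name for name, (g_start, g_end) in genes.items()
            if start <= g_end and end >= g_start]


def count_fragments(fragments, genes):
    """Count fragments per gene; fragments hitting several genes are set aside."""
    counts = Counter({name: 0 for name in genes})
    ambiguous = 0
    for start, end in fragments:
        hits = overlapping_genes(start, end, genes)
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif len(hits) > 1:
            ambiguous += 1  # overlaps several genes: not assigned to any
    return dict(counts), ambiguous


fragments = [(120, 220), (460, 480), (600, 700), (1600, 1700), (3000, 3100)]
counts, ambiguous = count_fragments(fragments, genes)
print(counts, "ambiguous:", ambiguous)
```

The fragment at 460-480 falls in the region shared by geneA and geneB, so it is not assigned; the fragment at 3000-3100 overlaps no gene and is ignored.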
22
Q

Describe transcript level quantification

A
  • Assign fragments to specific transcripts
  • Can aggregate over all possible isoforms to obtain gene-level count
  • Use a statistical model to handle multi-mapping reads, and assign these fragments probabilistically
  • Can observe e.g. changes in isoform usage
23
Q

Describe alignment free quantification

A

Bypass full alignment, fast and accurate (see later slides)

24
Q

How can you estimate transcript abundances?

A
  • Many methods have been developed to deal with multimapping reads.
  • These use statistical models to link the probability of observing a given fragment to the relative transcript abundances.
  • Optimisation algorithms are then used to infer transcript abundances given the observed reads and the assumed model.
  • Cufflinks is an example which combines transcript assembly with abundance estimation. It uses fragment length information to help assign reads.
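One common optimisation approach for this kind of model is expectation-maximisation (EM). The sketch below is a deliberately simplified illustration: it assumes all transcripts have the same length and ignores fragment length information, so it is not the Cufflinks model. The function name and the toy reads are hypothetical.

```python
def em_abundances(compat, n_transcripts, n_iter=100):
    """Estimate relative transcript abundances by EM.

    compat: one tuple per read, listing the indices of the transcripts
            that read is compatible with (multi-mapping reads list several).
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from equal abundances
    n_reads = len(compat)
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        # E-step: share each read among its transcripts in proportion to theta
        for transcripts in compat:
            total = sum(theta[t] for t in transcripts)
            for t in transcripts:
                expected[t] += theta[t] / total
        # M-step: new abundances are the expected fractions of reads
        theta = [c / n_reads for c in expected]
    return theta


# 6 reads unique to transcript 0, 2 unique to transcript 1, 4 compatible with both
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
theta = em_abundances(reads, 2)
print([round(t, 3) for t in theta])
```

The ambiguous reads end up split 3:1 in favour of transcript 0, matching the evidence from the uniquely mapping reads.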
25
Q

What are pseudo aligners?

A
  • FM indexing illustrates one of the challenges with mapping reads to the entire genome
  • Pseudo aligners have been developed to provide faster and more efficient read mapping
  • Use k-mers rather than aligning full reads
  • Examples include Salmon and Kallisto
  • “Kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop
    computer using only the read sequences and a transcriptome index that itself takes
    less than 10 minutes to build”
26
Q

Explain how Kallisto works

A
  • For each read determines which transcripts it is from rather than where it aligns
  • Therefore, not necessary to do a full alignment of the reads to the genome
  • Raw sequence reads are directly compared to transcript sequences and then used to quantify transcript abundance
  • The comparison of the sequencing reads to the transcripts is done using a transcriptome de Bruijn graph (T-DBG)
  • T-DBG constructed from the k-mers present in an input transcriptome as opposed to reads which is done normally for genome/transcriptome assembly.
  • Transcripts converted into a T-DBG
  • Each node/vertex is a k-mer in the T-DBG and associated with transcript(s): a k-compatibility class
  • Left most node has k-compatibility class of all 3 transcripts
  • Once T-DBG built kallisto stores a hash table mapping each k-mer to the linear stretches (e.g. the first 3 nodes) it is contained in as well as the position
  • Called the “kallisto index”
  • Reads are also split into k-mers and matched to transcripts using the hash table
  • The black nodes represent the k-mers of the read, where they match transcript k-mers
  • To identify the transcript(s) a read is from identify all associated k-compatibility classes
  • The k-compatibility classes of all black nodes for this read are the blue and pink transcripts
  • Can be extended to paired-end reads using all the k-compatibility classes along both reads
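The core idea (intersecting the k-compatibility classes of a read's k-mers) can be sketched with a plain dictionary. This is a toy illustration under stated assumptions: the transcript sequences are made up, only the forward strand is considered, and a simple k-mer hash table stands in for the T-DBG, so none of kallisto's graph structure or skipping optimisation is reproduced.

```python
from collections import defaultdict


def kmers(seq, k):
    """All overlapping k-mers of a sequence, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(transcripts, k):
    """Map each k-mer to the set of transcripts containing it
    (that set plays the role of the k-compatibility class)."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def pseudoalign(read, index, k):
    """Intersect the k-compatibility classes of all read k-mers found in the index."""
    compatible = None
    for kmer in kmers(read, k):
        if kmer not in index:
            continue  # k-mer absent from the transcriptome, e.g. a sequencing error
        classes = index[kmer]
        compatible = set(classes) if compatible is None else compatible & classes
    return compatible or set()


# Hypothetical transcripts sharing a common 5' end
transcripts = {
    "t1": "ATGGCGTACCTGA",
    "t2": "ATGGCGTTTCTGA",
    "t3": "ATGGCGTACGGGA",
}
index = build_index(transcripts, k=5)
print(pseudoalign("ATGGCGT", index, k=5))  # read from the shared region
print(pseudoalign("GCGTACC", index, k=5))  # read specific to one transcript
```

The result says which transcripts a read could have come from, not where in the transcript it aligns.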
27
Q

What do you need to use Kallisto?

A

It only works with a good (well annotated) reference transcriptome

28
Q

How does Kallisto improve efficiency?

A
  • Kallisto improves efficiency by utilising redundancy - the 3 left most nodes have the same k-compatibility class - the same equivalence class
  • When a read k-mer is hashed, the k-compatibility class of its node is identified and kallisto jumps to the node after the last one in the same equivalence class
  • Once the left most k-mer of the read is hashed, kallisto ignores the next 2 nodes as they are redundant and hashes only the 4th k-mer of the read
  • For most reads kallisto performs a hash lookup for only two k-mers
29
Q

What are the advantages and disadvantages of using pseudo aligners?

A
  • Pseudo-aligners provide a highly efficient method for mapping reads to transcripts
  • Benefits include speed and computational resources required
  • Disadvantage is that reads can only be mapped to known transcripts
  • Unable to identify and quantify unknown or novel transcripts
  • Quantification is at the transcript and not gene level, which is a slight disadvantage as we usually work with genes, not transcripts
  • R library available to convert to gene level quantification for input to DESeq2 etc - tximport
  • Kallisto has a companion R library for transcript differential expression analysis - sleuth
30
Q

What is normalisation for?

A

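Normalisation adjusts the raw read counts to compensate for technical factors such as feature length and total read count, so that expression values can be compared within a sample and between samples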

31
Q

What is within sample normalisation?

A
  • Several units normalize counts by feature length to allow comparison of features WITHIN a sample: RPKM, FPKM, TPM
  • They also normalize by total read count but this is generally NOT sufficient for comparison between samples (see later slide)
32
Q

What is TPM?

A

Very similar to RPKM and FPKM, but the difference is the order of operations: TPM normalises for feature length first and then for total read count.
- FPKM and TPM (transcripts per million) are both measures of the relative abundance of a transcript in your pool of transcripts.
- TPM is now generally preferred over FPKM (as the proportionality constant for FPKM is experiment specific).
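The difference in the order of operations can be seen in a short sketch. The counts and lengths below are made-up illustrative values, and the function names are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: scale by total read count and by length."""
    per_million = sum(counts) / 1e6
    return [c / (length / 1e3) / per_million
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then rescale the
    resulting rates so that they sum to one million."""
    rates = [c / (length / 1e3) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]


counts = [100, 200, 700]       # made-up read counts for three transcripts
lengths = [1000, 2000, 7000]   # made-up transcript lengths in bp

print(rpkm(counts, lengths))
print(tpm(counts, lengths))
print(sum(tpm(counts, lengths)))  # TPM values always sum to one million
```

Because TPM values sum to the same total in every sample, the proportionality constant is not experiment specific, whereas the sum of RPKM/FPKM values differs from sample to sample.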

33
Q

What are the problems we can face with relative abundance measures?

A
  • FPKM & TPM both normalize for total read count, but this is generally not sufficient to make comparisons between samples.
  • They can only tell us the relative proportion of transcripts in a sample.
  • If we make further assumptions, we can develop suitable methods to compare gene abundances between samples.
  • Ballgown further processes e.g. allowing for total read count per sample.
  • You cannot compare between samples with these measures; they only provide within-sample normalisation
  • Example: assume 4 genes in the genome, each equally expressed in condition 1. Genes C and D are down-regulated in condition 2 but the other 2 have the same expression. Both conditions have 8000 reads mapped
  • As the number of reads is the same in both conditions, the RPKM for genes A and B would be higher in condition 2 than in condition 1, suggesting they are up-regulated
  • This is the RNA composition effect
  • Tools that use this method, such as ballgown, also maintain extra information about samples (e.g. the total number of reads)
  • This allows for the proper comparison of these normalized measures across samples
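The four-gene example can be worked through numerically. The sketch below adds assumptions that are not in the card: every gene is 1 kb long, and genes C and D drop to one third of their condition 1 expression. The function names are hypothetical.

```python
def simulate_counts(true_expression, total_reads=8000):
    """Distribute a fixed number of reads in proportion to true expression."""
    total = sum(true_expression)
    return [round(total_reads * e / total) for e in true_expression]


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million mapped reads."""
    per_million = sum(counts) / 1e6
    return [c / (length / 1e3) / per_million
            for c, length in zip(counts, lengths_bp)]


lengths = [1000, 1000, 1000, 1000]                       # assumed: all genes 1 kb
condition_1 = simulate_counts([1.0, 1.0, 1.0, 1.0])      # A, B, C, D equally expressed
condition_2 = simulate_counts([1.0, 1.0, 1 / 3, 1 / 3])  # assumed: C and D drop to 1/3

rpkm_1 = rpkm(condition_1, lengths)
rpkm_2 = rpkm(condition_2, lengths)
for gene, before, after in zip("ABCD", rpkm_1, rpkm_2):
    print(gene, "apparent fold change:", round(after / before, 2))
```

With these numbers genes A and B appear 1.5-fold up-regulated even though their true expression is unchanged, and C and D appear to halve although they actually fell to one third.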
34
Q

How do you do between sample normalisation?

A
  • Highly and differentially expressed (DE) genes can distort gene abundance measures, so total read count is not an accurate normalization factor.
  • Instead, we want to find a ‘control’ set of transcripts that are not DE and use these to estimate ‘size factors’ that enable meaningful comparisons between samples.
  • Several methods exist (all assume most genes are not DE), including: TMM and Median of Ratios
35
Q

Describe TMM

A
  • Uses a trimmed weighted mean. Excludes genes with large log-fold ratios
    between samples, and those with extreme abundance values, before calculating
    a weighted mean of log-fold changes; anything extreme is excluded
  • Used by edgeR
36
Q

Describe Median of ratios

A
  • Uses the median of ratios of observed counts. For each gene, calculate the
    (geometric) mean of its expression across all samples and treat this as a pseudo-
    reference. For each sample, calculate the ratios of observed gene counts to
    these pseudo-references, and take the median value as the size factor.
  • Used by DESeq2
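The calculation described above can be sketched directly. This is an illustration of the procedure rather than DESeq2's own code; the toy counts are made up, and skipping genes with a zero count in any sample is one convention assumed here so that the geometric mean is defined.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of samples, each a list of per-gene counts in the same gene order.
    """
    n_genes = len(counts[0])
    # Log of the geometric mean across samples for each gene (the pseudo-reference)
    log_geo_means = []
    for g in range(n_genes):
        values = [sample[g] for sample in counts]
        if all(v > 0 for v in values):
            log_geo_means.append(sum(math.log(v) for v in values) / len(values))
        else:
            log_geo_means.append(None)  # assumed convention: skip genes with a zero
    factors = []
    for sample in counts:
        log_ratios = [math.log(sample[g]) - log_geo_means[g]
                      for g in range(n_genes) if log_geo_means[g] is not None]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Sample 2 was sequenced twice as deeply; the last gene is also strongly DE
sample_1 = [100, 200, 300, 400]
sample_2 = [200, 400, 600, 4000]
factors = size_factors([sample_1, sample_2])
print(factors, factors[1] / factors[0])
```

The ratio of the two size factors recovers the twofold difference in depth, and the differentially expressed gene does not distort it because the median ignores the outlying ratio.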
37
Q

What is batch correction?

A
  • As well as normalization, we may need to correct for batch effects.
  • These are confounding factors that cause unwanted variation in gene abundances between samples, due to technical factors that differ across batches (samples that are processed in parallel).
  • For example, differences in reagents, equipment, or date of library preparation or sequencing may cause batch effects.
  • Various tools (e.g. ComBat, svaseq) enable batch correction, assuming a suitable experimental design in which technical and experimental factors are not confounded.
38
Q

What do you need for identification of differentially expressed genes?

A
  • More replicates increases our power to detect DE genes.
  • Minimum of 3 biological replicates recommended.
  • RNA-seq experiments often have few replicates, so specialized statistical
    methods are required
39
Q

How do you get DE genes?

A
  • Testing for a statistically significant change in expression
  • Many individual statistical hypothesis tests are performed (one for each of 100s to 1000s of genes), so p-values need to be corrected for multiple testing
  • These methods generally require unnormalized data, as they perform an
    integrated normalization step - use raw or estimated read counts
40
Q

What is multiple-testing correction and why do we need it?

A
  • Calculated p-values need to be adjusted when repeating multiple independent statistical tests to reduce the false discovery rate (FDR)
  • This applies to differential expression calculations where multiple genes are being compared
  • DESeq2 implements Benjamini-Hochberg multiple test correction and reports an adjusted p-value (the padj column, often called a q-value) along with the p-value
  • The q-value is the adjusted p-value and the significance value that should be used
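A minimal sketch of the Benjamini-Hochberg adjustment, written from the standard definition rather than taken from DESeq2. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # made-up per-gene p-values
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

Each p-value is multiplied by the number of tests divided by its rank, so the smallest p-values are penalised most heavily and the adjusted values are then compared against the chosen FDR threshold.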
41
Q

What would be next steps in differential expression analysis?

A
  • you can use RNA seq to narrow down the pool of genes and then do qPCR on the subset of genes
  • RNA seq is a starting point - then you can take the genes to further analysis if you see anything interesting
42
Q

What is the role of RNA seq in studying alternative splicing?

A
  • RNA-seq has applications beyond quantifying gene expression.
  • One gene may give rise to several different mRNAs (and protein isoforms) due to alternative splicing
  • RNA-seq allows us to study changes in isoform expression.
  • Identification of mutations
  • Mutations may affect RNA cis-regulatory elements, spliceosomal components, or trans-acting regulatory factors.
43
Q

What are gene fusions and how can we link RNA seq to them?

A
  • Chromosomal rearrangements can lead to fused transcripts.
  • RNA-seq allows us to detect these fusion events.
  • Gene fusions are commonly reported in many types of cancer, and may be used for diagnosis and prognosis.
  • A fusion junction is a section of transcribed RNA that maps to an exon from one gene followed by an exon from another gene
  • It might occur as result of a translocation, deletion or chromosomal inversion
  • Example - PML-RAR protein associated with Acute Promyelocytic Leukaemia
  • These types of structural rearrangements can also be identified by direct sequencing using paired end reads