5.2 RNA seq Flashcards
Describe how genes in prokaryotes are transcribed
An operon, which may contain multiple genes, is transcribed as a single polycistronic mRNA. The coding regions of this mRNA are translated into individual proteins
Describe transcription in eukaryotes
- A gene comprises coding & non-coding regions (exons & introns), with a 5’ UTR starting at the TSS and a 3’ UTR at the end
- pre-mRNA gets spliced, capped, and polyadenylated into mature mRNA
- mature mRNA exported into the cytoplasm for translation
Define the transcriptome
All RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA produced in one or a population of cells
Is there a molecular or transcriptional technique that can measure the entire transcriptome of a cell? Why or why not?
No, there is not.
Each RNA species requires its own customized experimental workflow and analytical pipeline
Example: miRNAs are small and single-stranded, and require ligation to adaptors
Name 5 challenges of RNA sequencing compared to DNA sequencing
- Sample purity/quality/quantity: RNA is inherently labile, so good samples are hard to obtain
- Transcripts consist of small exons that may be separated by large introns: hard to map with a seed & extension strategy since exon-exon junctions don’t exist in the genome
- Relative abundance of RNAs varies widely (the amount of genomic DNA is consistent across cell types, but RNA expression is not)
- RNAs come in wide range of sizes
- RNA easily degradable
How is RNA quality assessed?
- Using a Bioanalyzer: a small microchip gel is loaded with a polymer and the RNA sample (with dye and markers)
- Output is a fluorescence-unit trace (electropherogram) and an RNA Integrity Number (RIN) value
What is a RIN value?
The ratio of expected RNA species (the 18S and 28S rRNA peaks) over smaller RNA fragments
What is a good and bad RIN score?
Good RIN: 10 (fully intact)
Threshold to proceed in RNA experiment: 7
Bad RIN: 1 (fully degraded; the RIN scale runs from 1 to 10)
What are some challenges in using poor quality RNA?
- RNA degrades in an unpredictable way, creating smaller species than expected
- Smaller species get captured during library construction in a non-random way, which affects gene counts
What form should RNA be in when creating an RNA-seq library?
RNA should be fully intact
In mammalian RNA-seq what is (typically) the first step that is done?
mRNA is separated from other RNA types
In mammalian cells, what is the most abundant type of RNA and why is it not desirable to analyze for gene expression?
~80% of total RNA is rRNA
rRNA is abundant, highly repetitive, and not useful for understanding differential gene expression
How can eukaryotic mRNA be separated from other RNA types? For prokaryotic mRNA?
- Poly-A tail selection (note: not all mRNAs have a poly-A tail): oligo(dT) beads bind to poly-A tails
- Ribodepletion: depletion of rRNA
Prokaryotes: mRNAs have no poly-A tail, so only ribodepletion can be used
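As a toy illustration of the poly-A selection idea above (not a real library-prep or analysis tool), the sketch below keeps only sequences that end in a run of A's, much as oligo(dT) beads capture polyadenylated RNA. The function name `has_poly_a_tail`, the minimum tail length of 10, and the example sequences are all assumptions made up for illustration.

```python
def has_poly_a_tail(rna: str, min_tail: int = 10) -> bool:
    """Return True if the RNA ends with at least `min_tail` consecutive A's."""
    # Length of the trailing run of A's = total length minus length after stripping them.
    tail_len = len(rna) - len(rna.rstrip("A"))
    return tail_len >= min_tail


# Hypothetical sequences: only the first carries a long poly-A tail.
rnas = {
    "mRNA_like": "AUGGCCUUUGGCUAA" + "A" * 20,
    "rRNA_like": "GGCUACCUGGUUGAUCCUGCC",
    "prok_mRNA_like": "AUGAAACGCAUUAGCUAA",
}

selected = [name for name, seq in rnas.items() if has_poly_a_tail(seq)]
print(selected)
```

The prokaryote-like sequence is dropped here, which mirrors why poly-A selection is not an option for prokaryotic mRNA.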
Describe the 7 Steps used to construct an mRNA RNA-seq library
- target mRNA is enriched using oligo(dT) beads or ribodepletion
- RNA fragmented and primed
- First strand of cDNA generated
- Second strand of cDNA generated
- 3’ ends adenylated & 5’ ends repaired
- Adaptors containing barcodes added to both ends
- ligated fragments PCR amplified
How are different adaptors added to the ends of inserts?
Forked adaptors:
- first 14 nts are complementary
- Afterwards, no longer complementary and diverge
- not an issue in PCR
Prevents the loss of ~50% of product that occurs with Ion Torrent
Why is it important to know which strand a transcript originated from?
To be able to distinguish the expression levels of 2 different genes/exons that lie on different strands and may overlap
Strand information can be retained when shearing so the orientation of the insert is known when sequencing
Describe the un-stranded protocol?
- synthesis of randomly primed ds-cDNA + addition of adaptors for sequencing
- info on which strand the original mRNA template came from is lost
- can’t determine gene expression of overlapping genes that are transcribed from different strands
Describe the stranded protocol
- dUTP is used instead of dTTP during synthesis of the 2nd cDNA strand (the 1st strand can be marked instead, but usually the 2nd is)
- before PCR, the strand containing uracils is degraded using uracil-DNA glycosylase (UDG). The remaining 1st strand is complementary to the original mRNA transcript, so the strand of origin is known
Uracil acts as a molecular tag on the 2nd strand for removal prior to sequencing
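The dUTP logic can be sketched as a small string simulation. This is only a schematic under simplifying assumptions: sequences are plain strings, the example mRNA is made up, and "degradation" is modeled by discarding any strand that contains a U. The helper `reverse_complement` is defined here and is not from any library.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T only)."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


mrna = "AUGGCUUACGGA"                      # hypothetical transcript fragment
mrna_as_dna = mrna.replace("U", "T")       # same sequence written in DNA letters

# First-strand cDNA is complementary to the mRNA.
first_strand = reverse_complement(mrna_as_dna)

# Second strand is complementary to the first, synthesized with dUTP instead of dTTP.
second_strand = reverse_complement(first_strand).replace("T", "U")

# Model the glycosylase step: any strand carrying uracil is removed before PCR.
surviving = [s for s in (first_strand, second_strand) if "U" not in s]

print(surviving)
# The survivor is the first strand, so its reverse complement recovers the mRNA sequence.
print(reverse_complement(surviving[0]) == mrna_as_dna)
```

Because only one strand survives, every sequenced insert has a known orientation relative to the original transcript.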
What are the common analysis goals of RNA-seq? (6)
- gene expression and differential expression
- transcript discovery and annotation
- allele-specific expression (& in relation to SNPs or mutations)
- Mutation discovery
- fusion detection
- RNA editing
Should PCR duplicates be removed?
Depends:
RNA-seq: typically include duplicates
ChIP-seq: best practice to remove
Whole-genome analysis: always remove, because duplicates do not represent independent fragments of the genome
To decide, assess library complexity
If removed, assess duplicates at the paired-end read level and not the single-end read level
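A minimal sketch of why the paired-end level matters, using made-up fragment coordinates. Each tuple is a hypothetical (chromosome, strand, read-1 start, read-2 end); the helper `count_duplicates` and the simple "unique fraction" used as a complexity measure are illustrative assumptions, not the method of any particular duplicate-marking tool.

```python
from collections import Counter

# Hypothetical aligned fragments: (chrom, strand, read1_start, read2_end)
fragments = [
    ("chr1", "+", 100, 350),
    ("chr1", "+", 100, 350),  # both ends identical: likely a PCR duplicate
    ("chr1", "+", 100, 420),  # same read-1 start but different mate end: distinct fragment
    ("chr2", "-", 500, 760),
]


def count_duplicates(keys) -> int:
    """Count entries beyond the first occurrence of each key."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


# Single-end view: only chromosome, strand, and read-1 start are compared.
single_end_dups = count_duplicates((c, s, start) for c, s, start, _ in fragments)

# Paired-end view: both fragment ends must match.
paired_end_dups = count_duplicates(fragments)

# Crude library-complexity proxy: fraction of fragments that are unique.
complexity = len(set(fragments)) / len(fragments)

print(single_end_dups, paired_end_dups, complexity)
```

In this toy data the single-end view flags one more fragment as a duplicate than the paired-end view does, which is the over-calling the flashcard warns about.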
What are some concerns about PCR duplicates in RNA-seq?
- may be due to biased PCR amplification of certain fragments (same as with ChIP-seq)
- duplicates that do not come from PCR bias are expected for highly expressed short genes (the over-representation actually reflects biology)
- removing duplicates for short or highly expressed genes compresses the top end of their expression (reduces the dynamic range of the experiment)
What is an appropriate depth for RNA-sequencing?
Depends on:
- research question
- Tissue type, RNA preparation, quality of input RNA, library construction method
- sequencing type: read length, paired vs unpaired
- computational approach and resources
- similar publications
- create pilot experiments
What is the standard read depth for reference mapping?
200 million reads
Describe the RNA-seq workflow when a reference genome is available
- Reads are mapped to the reference genome using a gapped aligner that can deal with exon/intron boundaries
- Novel transcript discovery and quantification can proceed without an annotation file (the annotation file gives coordinates of genes with respect to the reference genome and should usually be used)
Describe the RNA-seq workflow when a reference transcriptome is available
- Reads aligned to reference transcriptome using an ungapped aligner since transcriptome already has introns spliced out
- transcript identification and quantification can occur simultaneously since read counts are now directly associated with the transcripts
Describe the RNA-seq workflow when no reference genome is available
- reads need to be assembled into contigs or transcripts (software can stitch reads back together)
- assembled contigs are contained in FASTA file
- for quantification reads are mapped/aligned to assembled contigs using ungapped aligner
Describe the basic workflow in RNA-seq
- RNA library generated
- library sequenced and reads mapped to the reference genome (e.g. STAR)
- Transcript read count table generated (e.g. HTSeq)
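The last step, turning alignments into a count table, can be illustrated with a toy counter. This is a sketch in the spirit of gene-level read counting, not HTSeq's actual API or algorithm: the gene coordinates, alignments, the `assign_read` helper, and the "no_feature"/"ambiguous" labels are all hypothetical, and a read is counted only if it lies entirely inside exactly one gene interval.

```python
from collections import Counter

# Hypothetical annotation: gene -> (chrom, start, end), inclusive coordinates.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Hypothetical aligned reads: (chrom, start, end).
alignments = [
    ("chr1", 120, 170),
    ("chr1", 450, 500),
    ("chr1", 900, 950),
    ("chr1", 600, 650),  # falls between the two genes
]


def assign_read(read, annotation) -> str:
    """Return the single gene fully containing the read, else a category label."""
    chrom, start, end = read
    hits = [
        name
        for name, (g_chrom, g_start, g_end) in annotation.items()
        if g_chrom == chrom and start >= g_start and end <= g_end
    ]
    if len(hits) == 1:
        return hits[0]
    return "no_feature" if not hits else "ambiguous"


count_table = Counter(assign_read(read, genes) for read in alignments)
for feature, n in sorted(count_table.items()):
    print(feature, n)
```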
Why are exon-exon junctions challenging in RNA-seq?
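Reads that span an exon-exon junction come from sequence that is interrupted by an intron in the reference genome, so they cannot be aligned contiguously. Splitting such reads across the intron is a computational challenge for the aligner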
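A toy example makes the junction problem concrete. The exon and intron sequences below are invented, and `split_align` is a deliberately naive exact-match search written for this illustration; real gapped aligners such as STAR use far more sophisticated strategies.

```python
exon1 = "ATGGCCAAGT"
intron = "GTAAGTCCCTTTTCAG"
exon2 = "CGTTAGGCTA"

genome = exon1 + intron + exon2   # the intron is present in the genome
transcript = exon1 + exon2        # the intron is spliced out of the mature mRNA

# A read covering the last 5 bases of exon 1 and the first 5 bases of exon 2.
read = exon1[-5:] + exon2[:5]

print(read in genome)       # the read has no contiguous match in the genome
print(read in transcript)   # but it matches the spliced transcript


def split_align(read: str, genome: str):
    """Try each split point; return (left_pos, right_pos, gap_len) for the first
    split where both pieces match exactly with a gap between them, else None."""
    for k in range(1, len(read)):
        left, right = read[:k], read[k:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        # The right piece must start at least one base after the left piece ends.
        right_pos = genome.find(right, left_pos + k + 1)
        if right_pos != -1:
            return left_pos, right_pos, right_pos - (left_pos + k)
    return None


print(split_align(read, genome))
```

The gap length recovered by the split equals the intron length, which is the information a gapped aligner has to infer for every junction-spanning read.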