week 2: mRNA seq Flashcards
RNA vs. DNA
Molecular differences
Strand #
DNA is relatively stable and only minor sample degradation is seen with shipping (barring any delays/complications)
RNA is highly UNSTABLE and sample degradation may be seen with shipping (especially if going to China)
RNA
- AGUC
- OH on sugar
- single stranded, and often but not always, linear in shape
DNA
- AGTC
- H on sugar
- two antiparallel, complementary strands form a double helix
central dogma
DNA, RNA, mRNA, protein
transcription: DNA to RNA
polyadenylation: RNA to mRNA
translation: mRNA to protein
mRNA
– polyadenylated mRNA is RNA that is expressed and/or in the process of being expressed. It is therefore an indication of the downstream effects of modifications made to biological control systems that are influenced by both internal and external factors (e.g. drugs, environmental factors, DNA modifications, etc.).
mRNA vs. “Other” RNA
Other types of RNA are important for many applications as well. However, a main focus within genomics is on actively expressed transcripts (the result of transcription: the cell makes RNA out of DNA), which is why mRNA specifically is so important
Other types:
- tRNA, lncRNA, ncRNA, rRNA, miRNA, snRNA, snoRNA
TLDR:
“Gene Expression” study = mRNA Seq
Key feature that differentiates mRNA from other RNA types = polyA tail
QC steps
total RNA
(sample QC)
library construction
(library QC)
sequencing
(data QC)
Bioinformatics analysis
QC & Library Prep
- Starting Material
Cell Pellets or Tissue / Whole Blood / FFPE
- Novogene can extract the total RNA from most cell / tissue
- Novogene can never guarantee the success of an extraction as it’s highly dependent on the quality of the starting material
- Extraction costs are $50-$70/sample (variable based on sample count)
- we encourage clients to perform their own RNA extraction if possible
Total RNA
- >100ng of total RNA needed for non-directional library prep (most common)
- >400ng of total RNA needed for directional library prep
- What if the client has less than 100ng? – we can accommodate it through one of our low input pipelines
Sample QC
- QC Method = Agilent 2100 or AATI fragment analyzer
Checks RNA integrity, purity, and quantity
For all normal RNA/DNA pipelines, we allow for 2 library QC’s within the quoted cost. (i.e. 1 sample + 1 backup can be QC’d for no additional cost)
Any additional QC’s that may be needed on top of that
Sample QC
RNA Sample Requirements
Client Facing Sample Requirements
chart of requirements - look in ppt :)
RNA Sample Requirements
- Qubit
Quantification of RNA
- Unit: ng/µL
Not used as part of Novogene internal QC, but is a common tool used by clients.
A Qubit fluorometer is a sensitive instrument used to quantify RNA (as well as DNA and protein) based on fluorescence
Fluorescent dye used in the Qubit assay has a high affinity for RNA and binds to it, forming a dye-RNA complex. The amount of light emitted is proportional to the amount of RNA present.
RNA Sample Requirements
- Nanodrop
Quantification & Qualification of RNA
Units: ng/µL & Absorbance Units (A)
Not used as part of Novogene internal QC, but is a common tool used by clients.
Sample Requirements:
OD260/280 ≥ 2.0
OD260/230 ≥ 2.0
260nm: nucleic acids (RNA and DNA)
Because DNA and RNA absorb at the same wavelength, DNA contamination can inflate concentrations
280nm: proteins
230nm: contaminants such as TRIzol, guanidine thiocyanate, EDTA, phenol, and guanidine hydrochloride
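A minimal sketch of how the two Nanodrop purity ratios above could be checked against the ≥ 2.0 requirement. The function name and the example absorbance readings are hypothetical, made up only for illustration.

```python
def nanodrop_purity(a260, a280, a230, min_ratio=2.0):
    """Compute OD260/280 and OD260/230 and compare both to the cutoff.

    min_ratio=2.0 follows the sample requirements listed in these notes.
    """
    ratio_280 = a260 / a280  # low value suggests protein contamination
    ratio_230 = a260 / a230  # low value suggests reagent carry-over (phenol, guanidine, etc.)
    return {
        "260/280": round(ratio_280, 2),
        "260/230": round(ratio_230, 2),
        "pass": ratio_280 >= min_ratio and ratio_230 >= min_ratio,
    }


# Hypothetical readings, not real sample data
print(nanodrop_purity(a260=1.0, a280=0.48, a230=0.45))
print(nanodrop_purity(a260=1.0, a280=0.60, a230=0.45))
```

The first sample clears both ratios; the second fails on 260/280 only, which points at protein rather than reagent contamination.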
RNA Sample Requirements
- Agilent 2100 Bioanalyzer / AATI Fragment analyzer / Tapestation
Quantification & Qualification of RNA
Agilent 2100 Bioanalyzer
Utilizes microfluidics technology to perform electrophoresis on a miniaturized scale.
Highly accurate, well known instrument, produces gel like images
AATI Fragment Analyzer
Capillary electrophoresis system that separates nucleic acids based on size and charge.
Suitable for integration into automated workflows for large-scale projects.
Used by Novogene
TapeStation
Automated electrophoresis system, samples are detected using fluorescence
Moderate resolution compared to Bioanalyzer.
Library Prep Options
bug chart look at ppt :)
Library Prep: PolyA Selection vs rRNA Depletion
Poly-A tail selection: oligo(dT) beads selectively pull down the mRNA, leaving everything else behind
rRNA depletion: rRNA is selectively depleted from the sample. In the prep step, tRNA and small RNA aren’t captured due to their small size, leaving primarily mRNA, lncRNA, and circRNA to be captured in the prep
so it is named after what it eliminates basically
Library Preparation: PolyA Selection vs rRNA Depletion
rRNA-Depletion uses
When other ncRNAs are of interest
Eukaryotic spp from FFPE
For all Prokaryotic spp
Library Prep: Directional vs Non-Directional Prep
Directional = Stranded = Strand Specific
Non-Directional = Non-Stranded
What: Directional mRNA library preparation preserves RNA strand directionality (3’-5’ orientation), whereas this information is lost in non-directional prep
How: Stranded/Directional prep uses dUTP in place of dTTP to label the second cDNA strand for selective degradation. This allows the first strand to be the only template left, ensuring that all of the transcripts share a common orientation
Note: rRNA depletion preps are always directional. polyA selection preps can either be directional or non-directional
Library Prep: Directional vs Non-Directional Prep
chart look at ppt :)
Library Prep: Directional vs Non-Directional Prep
- Why Non-Stranded mRNA Library? - so why non-directional?
- Simplicity: Non-stranded library prep is generally less complex and can be more straightforward to perform. It doesn’t require the additional steps needed to preserve strand information, which can simplify the workflow.
- Cost: Because it involves fewer steps and reagents, non-stranded library prep can be less expensive than stranded library prep. This can be an important consideration for clients with budget constraints.
- Experimental Design: Some experiments may not benefit from the additional information that stranded RNA-Seq provides. In cases where the orientation of the transcript does not impact the study’s outcome, a non-stranded prep will suffice. For example, if the goal is to measure overall gene expression levels rather than to study antisense transcription or overlapping genes, non-stranded RNA-Seq could be appropriate.
Why Stranded mRNA Library?
Strand Specificity: Stranded library prep allows researchers to determine the orientation of the RNA transcripts, which is crucial for understanding antisense transcription and enhancing transcript annotation.
Comprehensive Transcriptome Analysis: Stranded mRNA prep offers a clear and comprehensive analysis across the transcriptome, providing more data per assay, including
1. Full sequence and variant information
2. Higher discovery power to detect known and novel transcripts.
3. More accurate gene expression data
4. Increased alignment efficiency
5. Detecting features like gene fusion and allele-specific expression
Library Prep
- Low Input Options
Watchmaker Kit (San Jose Lab)
- rRNA depletion (directional - better for antisense studies, studies that need direction lol, and for understanding the transcriptome). rRNA depletion eliminates rRNA and is used if the client wants ncRNA
- Input: 25ng
- Human Mouse Rat only
Cost: QC (1 time high-sensitivity QC w/ Bioanalyzer included), rRNA Depletion Lib Prep, & Sequencing at 40M PE150 reads (NovaSeq X Plus), no analysis
< 24 Samples: $329 per sample
24-100 Samples: $309 per sample
100+ Samples: $289 per sample
Takara Kit (Outsourced to ADH)
polyA selection (non-directional, less complex so less expensive and is good if client’s study does not require directionality)
Input: at least 10ng
Success of library prep and downstream results cannot be guaranteed
Cost: QC (1 time high-sensitivity QC w/ Bioanalyzer included), Takara SMART-seq V4 ultra low input RNA kit, sequencing 20M PE150 reads, no analysis
< 24 Samples: $450 per sample
24-100 Samples: $400 per sample
Prioritize Watchmaker kit for all low input samples (prob. because the Takara kit has to be outsourced and Watchmaker is also cheaper), UNLESS
- BSL concern
- Previous batch of samples was processed at ADH
- Non HMR species
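The kit priority rule and the Watchmaker price tiers above can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, and because the notes list both "24-100" and "100+" tiers, treating exactly 100 samples as part of the 24-100 tier is an assumption.

```python
HMR_SPECIES = {"human", "mouse", "rat"}


def pick_low_input_kit(species, bsl_concern=False, previous_batch_at_adh=False):
    """Default to Watchmaker unless one of the three listed exceptions applies."""
    if bsl_concern or previous_batch_at_adh or species.lower() not in HMR_SPECIES:
        return "Takara"
    return "Watchmaker"


def watchmaker_price_per_sample(n_samples):
    """Per-sample price in USD from the tiers in these notes."""
    if n_samples < 24:
        return 329
    if n_samples <= 100:  # assumption: exactly 100 falls in the 24-100 tier
        return 309
    return 289


print(pick_low_input_kit("Human"))                    # Watchmaker
print(pick_low_input_kit("zebrafish"))                # Takara (non HMR species)
print(pick_low_input_kit("mouse", bsl_concern=True))  # Takara (BSL concern)
print(30 * watchmaker_price_per_sample(30))           # total for 30 samples
```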
Library Preparation: Globin Depletion
What is gmRNA?
- Globin mRNA (gmRNA) is the messenger RNA that encodes globin proteins.
Globin mRNA is highly abundant in whole blood, overwhelming other RNA types. Its presence can interfere with the detection of low-abundance RNA species.
gmRNA-Depletion
- gmRNA is predominantly found in red blood cells. Any RNA samples derived from whole blood must undergo a globin depletion step
- PBMC(s) & plasma do not typically need gmRNA-depletion
peripheral blood mononuclear cells (PBMCs)
Library Preparation: Globin Depletion
GLOBINclear™ Kit - Thermo Fisher Scientific: Removes the majority of both α and β globin mRNA via a biotinylated Capture Oligo Mix
- For mRNA – paired with directional or non-directional library prep
- HUMAN MOUSE RAT ONLY
- Available in China Lab
TruSeq Stranded Total RNA with RiboZero Globin Depletion: Depletes samples of globin-encoding mRNA in addition to both cytoplasmic and mitochondrial rRNA using biotinylation (total RNA treatment and library preparation).
- For lncRNA
- HUMAN MOUSE RAT ONLY
- Available in China Lab
Watchmaker Kit with rRNA and globin depletion: Uses Polaris Depletion to enhance data quality by efficiently removing rRNA and globin mRNA, improving sensitivity and coverage, particularly in low-input samples
- For lncRNA
- HUMAN MOUSE RAT ONLY
- Available in San Jose Lab
Library Preparation
- FFPE Tissue
Degradation/Fragmentation
Requires rRNA depletion
Followed by Illumina TruSeq RNA Exome Prep
This prep targets the human exome coding regions & uses rRNA depletion
Specific to humans, FFPE from other sp
choosing a sequencer
we only use NovaSeq X Plus now!
- it is less expensive ($7/Gb)
- it is the newest platform by Illumina
calculating read depth
150 basepairs
reads:
- PE150 = paired end sequencing, 150bp reads
- read pair = 2x150bp front and back = 300bp/read pair (read) total
gigabase pair
1Gb = 1,000,000,000 bp (1 billion base pairs)
calculation of Gb to reads
- 1,000,000,000 bp / (150 bp x 2) = 3,333,333 (3.33 million)
1Gb = 3.33 M PE150 reads
3Gb (the size of human genome) = 10M PE150 reads
6Gb = 20M PE150 reads
250 basepairs
reads:
- PE250 = paired end sequencing, 250bp reads
- read pair = 2x250bp = 500bp/read pair (read)
gigabase pair
- 1Gb = 1,000,000,000 bp (1 billion base pairs)
calculation of Gb to reads
1,000,000,000 bp / (250 bp x 2) = 2,000,000 (2 million)
1Gb = 2M PE250 reads
200Gb = 400M PE250 reads (SP lane)
400 Gb = 800M PE250 reads (SP flow cell)
math
Gb x 3.333 = M reads
M reads / 3.333 = Gb
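The conversions above can be sketched as two small helpers. Note that "reads" here means read pairs, as in the rest of these notes (one PE150 read pair = 300 bp). The function names are hypothetical.

```python
def gb_to_million_reads(gb, read_length=150):
    """Convert data volume in Gb to millions of paired-end reads (read pairs)."""
    bases_per_pair = 2 * read_length  # e.g. PE150 = 2 x 150 bp = 300 bp
    return gb * 1_000_000_000 / bases_per_pair / 1_000_000


def million_reads_to_gb(million_reads, read_length=150):
    """Convert millions of paired-end reads (read pairs) to Gb."""
    bases_per_pair = 2 * read_length
    return million_reads * 1_000_000 * bases_per_pair / 1_000_000_000


print(round(gb_to_million_reads(6), 2))                   # 6 Gb of PE150 -> 20.0
print(round(million_reads_to_gb(40), 2))                  # 40M PE150 reads -> 12.0
print(round(gb_to_million_reads(1, read_length=250), 2))  # 1 Gb of PE250 -> 2.0
```

Passing `read_length` explicitly keeps the PE150 shortcut (x 3.333) from being applied to PE250 or other read lengths by mistake.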
Recommended Coverage
Eukaryotic, non-directional: 20M reads/sample (6 Gb)
Eukaryotic, directional: 30M reads (9 Gb) to 40M reads (12 Gb) per sample
Detection of less abundant transcripts: 50M reads (15 Gb) to 100M reads (30 Gb)
Prokaryotic: 2 Gb/sample (~6.7M reads)
analysis - data QC (WOBI)
Offered to all clients for free!
Content & significance:
- Data volume – whether it meets the requirements
- Error rate distribution, Q20/Q30 – the quality of each base
- GC content distribution – whether GC and AT are equal and the content is stable
- Data filtering – whether the data contains low quality reads or reads with adapters
- Mapping status – whether there is contamination
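For context on the Q20/Q30 and GC content checks above, here is a minimal sketch of how those numbers are derived from a single read. It assumes the standard Phred definition (Q = -10 x log10 of the error probability) and the Phred+33 FASTQ encoding. The read and quality string are made up, and this is not Novogene's actual QC pipeline.

```python
def phred_to_error_prob(q):
    """Q20 -> 1 error in 100 bases, Q30 -> 1 error in 1,000 bases."""
    return 10 ** (-q / 10)


def fraction_at_least(quality_string, q_threshold, offset=33):
    """Fraction of bases with a Phred score >= q_threshold (Phred+33 assumed)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(score >= q_threshold for score in scores) / len(scores)


def gc_fraction(sequence):
    """Fraction of bases that are G or C."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical read: 'I' = Q40, '5' = Q20, '#' = Q2
quality = "IIII5555##"
print(fraction_at_least(quality, 20))  # share of bases at Q20 or better
print(fraction_at_least(quality, 30))  # share of bases at Q30 or better
print(gc_fraction("ACGTGGCC"))
```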
workflow
QC
- original data
- data assessment
- mapping to reference genome
gene count
- expression quantification
quantitative analysis
- differential expression analysis
- GO enrichment
- KEGG enrichment analysis
- protein protein interaction analysis
standard analysis
- new transcript prediction
- alternative splicing analysis
- SNP and indel
- transcription factors analysis
gene count analysis
mapping reads to reference genome
- files provided in BAM format
Gene Expression Quantification & Distribution of Gene Expression Levels
- In RNA-seq experiments, gene expression level is estimated by the abundance of transcripts
Correlation analysis (For biological replicates only)
- Correlation of the gene expression levels between biological replicates. The closer the correlation coefficient is to 1, the higher similarity the samples have
- Principal Component Analysis (PCA)
—- Used to evaluate intergroup differences and intragroup sample duplication
—– can help identify and correct for batch effects or other technical variations that are not related to biological differences
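The correlation coefficient between two biological replicates can be sketched as a plain Pearson correlation. The expression values below are hypothetical, and whether the real pipeline uses Pearson on raw or log-transformed values is not stated in these notes.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical expression values for the same five genes in two replicates
replicate_1 = [5.0, 120.0, 33.0, 0.5, 78.0]
replicate_2 = [6.0, 110.0, 30.0, 0.7, 81.0]
print(round(pearson(replicate_1, replicate_2), 3))
```

A value close to 1 means the two replicates rank and scale genes similarly, which is what you want within a group.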
Quantification Analysis
Differential Expression Analysis & Statistics (two or more groups of samples)
- The statistics of the number of differential genes (including up-regulation and down-regulation) for each comparison group at set expression thresholds (log fold change)
- Volcano Plots, Heatmaps, Venn Diagrams
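A minimal sketch of how up- and down-regulated genes could be counted for one comparison group. The cutoffs (|log2 fold change| ≥ 1 and adjusted p-value ≤ 0.05) are commonly used defaults assumed here for illustration, not values taken from these notes, and the gene names and numbers are made up.

```python
def count_differential_genes(results, lfc_cutoff=1.0, padj_cutoff=0.05):
    """results maps gene -> (log2 fold change, adjusted p-value)."""
    counts = {"up": 0, "down": 0, "not_significant": 0}
    for log2_fold_change, padj in results.values():
        if padj <= padj_cutoff and log2_fold_change >= lfc_cutoff:
            counts["up"] += 1
        elif padj <= padj_cutoff and log2_fold_change <= -lfc_cutoff:
            counts["down"] += 1
        else:
            counts["not_significant"] += 1
    return counts


# Hypothetical results for a treated vs. control comparison
example = {
    "geneA": (2.3, 0.001),   # up
    "geneB": (-1.8, 0.01),   # down
    "geneC": (0.4, 0.0001),  # significant p-value but small fold change
    "geneD": (3.0, 0.20),    # large fold change but not significant
}
print(count_differential_genes(example))
```

These are the same two axes a volcano plot shows: fold change on x, significance on y.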
Functional Enrichment / Pathway Analysis
- GO (gene ontology) = To annotate cellular component, molecular function and biological process of DEG
- KEGG = focuses on metabolic pathways & signal transduction pathways associated with DEG
- Reactome = curated database of human molecular pathways to annotate reactions, pathways, and biological process of DEG
- DO (Human Disease Ontology) enrichment = to investigate the human disease and gene function related to DEG (human only)
Protein Protein Interaction Analysis
- mRNA analysis can identify genes that are up- or down-regulated in certain conditions, which might affect protein levels and, consequently, protein interactions
Standard Analysis
Novel Gene Prediction
Alternative Splicing
- Alternative splicing (AS) is a regulated process during gene expression that results in a single gene coding for multiple proteins
- Detection of Differentially Expressed Isoforms
SNP/InDel Analysis
- Sequence variant found when comparing to the reference genome
Fusion Gene Analysis (for tumor sample and cancer cell line)
- A fusion gene is a hybrid gene formed from two previously separate genes. Fusion proteins produced by this change may lead to the development of some types of cancer
Recommendation: Opt for Directional Library Prep to make full use of the analysis package: start and stop sites, strand specificity, novel gene prediction
Novomagic
What is Novomagic?
- It is an add-on function to the analysis our BI team performs to allow additional manipulation of the existing data (altering fold changes, targeting specific genes, regrouping samples, and re-visualizing charts and figures)
- NovoMagic can help you select specific groups of genes, analyze gene expression, identify differentially expressed genes, and perform gene function analysis. Overall, 17 small toolkits are offered in the Toolkit item. In the future, Novogene will gradually launch more toolkits on NovoMagic
Do you need to purchase analysis to access Novomagic?
- Yes!! You must purchase Quantification or Standard analysis to have full access to Novomagic
How long is project data available?
- the data on Novomagic will be preserved for 1 year.
price list for x plus
Whole Transcriptome Sequencing (WTS) Intro…
lncRNA + CircRNA + mRNA + smallRNA
At Novogene we can do any of these parts individually, or we can do them all together (WTS Pipeline)
Long non-coding RNA
Definition:
Transcripts with lengths exceeding 200 nucleotides that are not translated into protein
Characteristics:
Polyadenylated (mRNA-like) or non-polyadenylated at 3′ end
Can be folded into a variety of specific secondary structures which contribute to their regulatory functions
lncRNAs do not have the capacity to translate into proteins.
Biological Function/Significance:
Regulation of gene transcription
Post-transcriptional regulation
Epigenetic regulation
Regulation of DNA replication timing and chromosome stability
Long non-coding RNA location and strategies
Location(s): Davis-US Lab (standard); San Jose-US Lab (Low Input/HMR Blood); or Beijing Lab (Standard/Globin Depletion)
Strategies
RNA QC: Gel Electrophoresis & Agilent 2100 Bioanalyzer
Library Prep: NEB directional with rRNA depletion by Ribo-Zero
Sequencing: PE150 on Illumina NovaSeq 6000 OR NovaSeq X Plus
Recommended: 12 Gb per sample (~40 M reads) on Illumina PE150
Quote Checklist:
Sample number
Species
Sample Origin (blood, tissue type, etc.)
Any BSL concerns?
Sequencing Depth
Material Sent (RNA/cell pellets, etc.)
Analysis (lncRNA only / circRNA only / lncRNA + circRNA)
Timeline
Circular RNA
Definition:
Circular RNAs (circRNAs) are a type of non-coding RNA that form a covalently closed loop structure, making them distinct from linear RNAs.
Characteristics:
circRNAs are highly stable compared to linear RNAs due to their resistance to exonuclease degradation.
They are derived from back-splicing events where a downstream splice donor is joined to an upstream splice acceptor.
circRNAs are often tissue-specific and exhibit conserved sequences across species.
Biological Function/Significance:
circRNAs can act as microRNA sponges, sequestering miRNAs and preventing them from binding to their target mRNAs.
They are involved in the regulation of gene expression and have been implicated in various diseases, including cancer and neurological disorders.
Due to their stability and specific expression patterns, circRNAs are being explored as potential biomarkers for disease diagnosis and therapy.
Circular RNA
Location(s): Tianjin Lab
Strategies
RNA QC: Nanodrop (prelim detection of conc.) –> AATI Fragment analyzer + Gel Electrophoresis
Library Prep: Abclonal Directional Library Prep with linear rRNA depletion by Ribo-Zero
Sequencing: PE150 on Illumina NovaSeq 6000 S4 Flowcell
Recommended: 8 Gb per sample (~26.7 M reads)
Quote Checklist:
Sample number
Species
Sample Origin
Any BSL concerns?
Sequencing Depth
Material Sent (RNA/cell pellets, etc.)
Analysis (yes/no)
Timeline
Small RNA
Definition:
Transcripts with lengths between 18-40nt that are not translated into protein
Characteristics:
5’ phosphate group and 3’ hydroxyl group
Small RNAs include microRNAs (miRNAs), small interfering RNAs (siRNAs), and Piwi-interacting RNAs (piRNAs)
They are known for their high specificity in binding to target messenger RNAs (mRNAs) to regulate gene expression.
Function:
Small RNA plays an important role in regulating almost all events at the cellular level, including individual development, cell proliferation and differentiation, tumor occurrence and development, etc.
Gene silencing (via RNA interference) and post-transcriptional regulation
Regulating mRNA degradation and translation
small rna
Location(s): Tianjin Lab
Strategies
RNA QC: Nanodrop (prelim detection of conc.) –> AATI Fragment analyzer + Gel Electrophoresis
Library Prep: Abclonal small RNA Library Prep for Illumina
Sequencing: SE50 on Illumina NovaSeq 6000
Recommended: 10M reads via Illumina SE50
Eukaryotic Whole Transcriptome Sequencing
Location(s): Tianjin Lab
Strategies
RNA QC: Nanodrop (prelim detection of conc.) –> AATI Fragment analyzer + Gel Electrophoresis
Small RNA
Library Prep: NEBNext Small RNA Library Prep
Sequencing: SE50 on Illumina NovaSeq 6000 SP Flowcell
Recommended: 20 M reads on Illumina SE50
lncRNA, mRNA, circRNA
Library Prep: Abclonal directional with rRNA depletion by Ribo-Zero
Sequencing: PE150 on Illumina NovaSeq 6000 S4 Flowcell
Recommended: 12 Gb per sample (~40 M reads) on Illumina PE150
WTS = lncRNA pipeline + smallRNA pipeline packaged into 1
Prokaryotic RNA Seq
Because rRNA depletion is used, all mRNA and lncRNA are captured every time
What about circRNA?
circRNAs do exist in prokaryotes, but their prevalence and functional significance are not well understood
What about smallRNA?
This isn’t a pipeline we have built out at Novogene. Prok RNA library prep size selection targets cDNA in the 250-300 bp range; cDNAs outside that range (including smallRNA) will also be included, but should make up a fairly insignificant percentage
Prokaryotic RNA Seq
Location(s): Beijing Lab
Strategies
RNA QC: Nanodrop (prelim detection of conc.) –> AATI Fragment analyzer + Gel Electrophoresis
Library Prep: Abclonal directional Library Prep for Illumina with rRNA depletion
Sequencing: PE150 on Illumina NovaSeq 6000 S4 Flowcell
Recommended: 2 Gb per sample (~6.7 M reads) on Illumina PE150
BSL Considerations
All Novogene labs are BSL1 only
BSL restrictions are stricter in China than in the US
Prokaryotic RNA is only processed in China, so be sure to confirm that the bacterial RNA is considered BSL1 and is able to be shipped to China (consult the biohazard form)
If there ARE biohazard concerns, consult with TS on outsourcing
Dual RNA Seq / Metatranscriptomics
Dual RNA
- Dual RNA Seq shows how microbes or viruses sustain themselves within host organisms on a molecular, cellular, organismal or population level
Simultaneously captures all classes of coding and noncoding transcripts in both the pathogen and the host
Two species are present, both identities are known
Library Prep: rRNA depletion by ‘Proprietary rRNA depletion kit’ & ABclonal® Fast RNA-seq Lib Prep Kit V2 for Illumina (Non-Directional (default) & Directional)
Recommended Seq Depth: 12 Gb
Goal: To see Host/Pathogen Interaction
MetaTranscriptomics
A metatranscriptome refers to multiple transcriptomes across populations or communities from natural environment samples, like sea water, soil, stool, ferment, and more.
It mainly studies the gene expression profile of all species as a whole in each environmental sample
Multiple species are present, identities are unknown
Library Prep: rRNA depletion by ‘Proprietary rRNA depletion kit’ & ABclonal® Fast RNA-seq Lib Prep Kit V2 for Illumina (Non-Directional (default) & Directional)
Recommended Seq Depth: 6Gb
Goal: To see environmental changes and microbial community interactions
Long-Read RNA Sequencing Technologies
IsoSeq technology by PacBio offers full-length transcript sequencing without the need for assembly. It captures complete isoforms and accurately identifies splice variants, which is essential for understanding complex transcriptomes.
Applications: Isoform discovery, gene annotation, and alternative splicing studies.
QC & Library Prep: $549/sample
Sequencing: $2,899/SMRT Cell (300Gb of raw data (NOT CCS data*) per SMRT Cell)
up to 10 libraries can be pooled into one SMRT cell –> yields ~30Gb raw data per sample
Or $15/Gb (minimum of 30Gb/sample)
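A rough sketch comparing the two IsoSeq pricing routes above. It assumes a whole SMRT Cell is split evenly across the pooled libraries and that the $549 QC & library prep fee applies per sample in both routes. Both are assumptions for illustration, as are the function names, so confirm against an actual quote.

```python
def isoseq_pooled_cost(n_libraries, cell_price=2899.0, prep_price=549.0,
                       cell_output_gb=300.0, max_pool=10):
    """Return (raw Gb per sample, USD per sample) when pooling on one SMRT Cell."""
    if not 1 <= n_libraries <= max_pool:
        raise ValueError("notes allow 1 to 10 libraries per SMRT Cell")
    gb_per_sample = cell_output_gb / n_libraries
    cost_per_sample = prep_price + cell_price / n_libraries
    return round(gb_per_sample, 1), round(cost_per_sample, 2)


def isoseq_per_gb_cost(gb_requested, price_per_gb=15.0, min_gb=30.0, prep_price=549.0):
    """USD per sample when buying by the Gb, with the 30 Gb minimum applied."""
    billed_gb = max(gb_requested, min_gb)
    return round(prep_price + billed_gb * price_per_gb, 2)


print(isoseq_pooled_cost(10))  # full pool of 10 libraries
print(isoseq_per_gb_cost(30))  # single sample at the 30 Gb minimum
```

Under these assumptions, filling a cell with 10 libraries works out cheaper per sample than buying 30 Gb by the Gb, so the per-Gb route mainly makes sense for small sample counts.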
Kinnex (PacBio)
The Kinnex full-length RNA kit uses the MAS-Seq method to enhance throughput on PacBio platforms by concatenating cDNA molecules into longer HiFi libraries. This approach allows for high-throughput, cost-effective isoform sequencing, making it suitable for large-scale transcriptomic studies
Packaged Pricing (sold by M reads):
5M reads/sample: $900/sample
10M reads/sample: $1500/sample
Long-Read RNA Sequencing Technologies
RNA Seq on Nanopore (with cDNA conversion)
Nanopore sequencing with cDNA conversion enables the sequencing of RNA molecules by converting them into cDNA before sequencing. This method provides long reads that cover entire transcripts, offering insights into isoform structure and expression.
Applications: Long-read transcriptomics, alternative splicing analysis, and comprehensive gene expression profiling.
Up to 24 samples can be pooled per cell
Average data output per cell: 75-90Gb raw data; this is influenced by species, genome size, and sample quality.
Direct RNA Seq on Nanopore
Direct RNA sequencing on Nanopore technology sequences RNA molecules directly, without the need for cDNA conversion. This approach preserves the native RNA structure, including modifications, and provides real-time data.
Applications: RNA modification studies, real-time transcriptomics, and understanding RNA biology at a native state
As there’s no barcode in the direct RNA library kit, only one sample can be loaded on a cell.
The data output per cell normally ranges from 5 Gb to 8 Gb for a QC-pass sample.
Single Cell/Nuclei RNA-seq
Transcriptome Sequencing (RNA-seq) at the single cell or nuclei level
Differentiates RNA expressed by each individual cell rather than the whole tissue
Very expensive
$2.5-4k vs $120 - $250 per sample
Coverage is based on cells captured and reads per cell
10X Single Cell RNA-Seq
bulk rna seq
- measures the average gene expression levels in a group of cells, tissues, or biopsies
10X Single Cell RNA-Seq
Coverage (M reads) = # of captured cells x # reads per cell
Gb = (M Reads/10)*3
Recommendations:
Maximum capture is 10k cells
10X recommends a minimum of 20k reads per cell
NVG recommends 30-50k reads per cell
Example: 10,000 cells * 50,000 reads/cell = 500,000,000 (500M) reads
500M reads = 150Gb
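The single cell coverage math above can be sketched as one helper. It assumes PE150 sequencing, which is what the Gb = (M Reads/10)*3 shortcut implies. The function name is hypothetical.

```python
def single_cell_depth(cells_captured, reads_per_cell, read_length=150):
    """Return (total reads in millions, data volume in Gb) for one sample."""
    total_reads = cells_captured * reads_per_cell
    million_reads = total_reads / 1_000_000
    gigabases = total_reads * 2 * read_length / 1_000_000_000  # read pair = 2 x read_length
    return million_reads, gigabases


# Maximum capture at the upper NVG recommendation
print(single_cell_depth(10_000, 50_000))  # (500.0, 150.0)

# Same capture at the 10X minimum of 20k reads per cell
print(single_cell_depth(10_000, 20_000))  # (200.0, 60.0)
```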