Next Generation Sequencing Flashcards
Typical design of a gene panel for NGS
- Entire exonic sequence of the genes
- +10 base pairs into intronic sequences (NOT deep intronic sequences)
- Promoters are NOT covered (eg, TERT promoter)
- Large indels (about 100 bp or more) are usually missed due to insufficient priming
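As a rough illustration of the design rule above, the sketch below pads each exon interval by 10 bp into the flanking introns. The coordinates and the `pad_exons` helper are hypothetical and not taken from any real panel.

```python
def pad_exons(exons, padding=10):
    """Extend each (start, end) exon interval by `padding` bp into the flanking introns."""
    return [(max(0, start - padding), end + padding) for start, end in exons]

# Hypothetical exon coordinates, for illustration only
exons = [(1000, 1200), (5000, 5150)]
targets = pad_exons(exons)
print(targets)  # [(990, 1210), (4990, 5160)]
```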
Hotspot Panels
Focus on hotspot regions that are frequently associated with SNVs and small indels
Panels are not faster, but can be run on poorer quality / less DNA
NGS sequencing is run in. . .
. . . batches, to reduce costs.
Meta-mutational data
For example, MSI or UV signature – patterns of mutation
Require larger DNA sequence input/reading, since these are effectively statistical assays that require a large N.
Overestimation of tumor percentage risks . . .
. . . a false negative.
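A worked example of why this happens, as a sketch under simplifying assumptions: a heterozygous somatic variant in a copy-neutral diploid region is expected at roughly half the tumor fraction. The 5% limit of detection and the tumor fractions below are hypothetical numbers chosen only for illustration.

```python
def expected_het_vaf(tumor_fraction):
    """Expected VAF of a heterozygous somatic variant in a copy-neutral diploid region."""
    return tumor_fraction / 2

ASSUMED_LOD = 0.05  # hypothetical assay limit of detection (5% VAF)

estimated = expected_het_vaf(0.40)  # VAF implied by an overestimated 40% tumor
actual = expected_het_vaf(0.08)     # VAF if the specimen is really 8% tumor

print(estimated >= ASSUMED_LOD)  # True: the assay looks adequate on paper
print(actual >= ASSUMED_LOD)     # False: a real variant could go undetected
```

If the estimate is trusted, a negative result would be reported as reliable even though the variant could not have been seen.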
Evidence Tiers
Evidence tiers are primarily determined by. . .
. . . evidence type, not necessarily evidence quality
Assessing VAF
Sample isolation techniques
Emulsion PCR
PCR, but the aqueous phase is interrupted and spread across many individual cells within an oil emulsion.
Enables many parallel reactions to occur simultaneously.
This is the fundamental technique which produces the massively parallel component of next generation sequencing – many reactions are run in tiny emulsion chambers which in theory may contain different substrate and allow for numerous separate but simultaneous PCRs.
Amplification in emulsion PCR
(454 method)
The genome is fragmented through one of numerous possible techniques.
3’ overhangs are digested and 5’ overhangs are filled in to create a library of blunt dsDNA fragments. Then, A and B dsDNA adaptor sequences are added to the ends of each DNA fragment. A and B adaptors have 3’ hydroxyls, but lack 5’ phosphates (to prevent A-B pairing).
The non-ligated half of the dsDNA adaptor sequences are melted off and the overhangs are filled in by PCR. PCR enrichment then ensues with A’ and B’ primers, which selectively result in amplification of the library fragments (A-A and B-B form lariats, A- and B- only extend linearly, therefore only A-B amplifies).
Illumina method for library adaptor ligation and amplification
Rather than using separate A and B adapters which both lack a 5’ phosphate, Illumina utilizes the same Y-shaped adapters with a region of homology at the library dsDNA interface, but which then branch out into nonhomologous strands with sequences of A and B’.
When the first round of PCR takes place, you then get your full adaptor sequences connected to the first PCR product (A and B’, B and A’), and the PCR amplifies.
Emulsion PCR setup in 454 NGS (after library preparation)
Ideally, you have created a library fragment:emulsion bubble ratio such that each bubble contains one fragment at most, minimizing bubbles with multiple fragments.
Each bubble also contains a magnetic bead with the A’ primer for PCR, as well as a B’ primer which is free floating in solution.
Sequencing step in 454 NGS
454 NGS is a pyrosequencing-based approach. After library preparation and amplification, beads with attached library amplification product are singly isolated into picoliter wells.
Pyrosequencing via a flow-based sequencing by synthesis is performed in each well.
When the correct nucleotide is flowed in, it is added to the strand and a pyrophosphate is released. The pyrophosphate is then utilized by ATP sulfurylase to make ATP, which powers firefly luciferase to oxidize luciferin and create a flash of light, indicating that this was the correct nucleotide. This occurs simultaneously across the millions of library amplification products bound to the same bead.
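Flow-based sequencing by synthesis can be sketched as an idealized "flowgram": one nucleotide is flowed at a time, and the light signal is proportional to how many bases were incorporated in that flow. The `flowgram` function and the flow order `TACG` are assumptions made for this illustration, and `synthesized` stands for the strand being built.

```python
def flowgram(synthesized, flow_order="TACG", n_flows=4):
    """Ideal signal per flow: number of bases incorporated when each nucleotide is flowed."""
    signals = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # All consecutive matching bases are incorporated within the same flow
        while pos < len(synthesized) and synthesized[pos] == base:
            run += 1
            pos += 1
        signals.append((base, run))
    return signals

print(flowgram("TTACCCG"))  # [('T', 2), ('A', 1), ('C', 3), ('G', 1)]
```

Note that a run of identical bases shows up as one brighter flash rather than several separate ones.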
What is the rate limiting step of the 454 NGS sequencing phase?
The speed at which nucleotides are flowed into the picoliter wells.
What is the biggest challenge in the 454 NGS sequencing phase?
Quickly washing the wells to ensure that only one nucleotide is present at a time for sequencing by synthesis.
Why does the signal:noise ratio decrease with your position in 454 NGS?
Not every strand on the bead will incorporate every time, and so more and more strands will be synthesizing out of phase with the rest as time goes on.
This mostly creates problems with runs of more than 5 identical serial nucleotides (homopolymers), since the variability in signal:noise ratio makes it difficult to precisely estimate how many nucleotides were incorporated.
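A minimal model of this dephasing, assuming each strand independently incorporates correctly with a fixed probability at every step (real chemistry is messier, and the 99.5% efficiency is a hypothetical figure):

```python
def in_phase_fraction(per_step_efficiency, steps):
    """Fraction of strands still in phase after `steps` incorporation steps."""
    return per_step_efficiency ** steps

for steps in (10, 100, 400):
    # Print how much of the bead still contributes a clean signal
    print(steps, round(in_phase_fraction(0.995, steps), 3))
```

Even a small per-step loss compounds, which is why the signal degrades toward the end of the read.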
Illumina PCR setup
Rather than emulsion PCR, Illumina NGS performs its initial PCR amplification of library components by clustering.
A and B primers/adapters are covalently attached to tiles in an 8 lane glass microfluidics chamber. Library fragments with attached A and B’ sequences are flowed and will bind to the primers. They will then be amplified by an in-situ PCR, and separation of library components is effectively achieved by clustering at a site on the tile.
They are then prepared for sequencing in-situ.
Illumina sequencing step
The PCR amplified library (still bound to the glass in clusters on the fluidics chamber) is subjected to sequencing by synthesis.
One fluorophore-conjugated nucleotide at a time (with all four having different colors) is added, enabling identification of each cluster's sequence.
Key differences between Illumina and 454:
In Illumina, each nucleotide has its own color.
In Illumina, the 3’OH is protected and must be unprotected between each new nucleotide, making Illumina better for detecting serial nucleotide chain lengths.
ABI SOLiD NGS Sequencing
ABI SOLiD utilizes emulsion PCR similar to 454, but unlike 454 the sequencing step is based on sequencing by ligation.
Sequencing by ligation relies on T4 ligase, and is based on colors associated with nucleotide pairs. The end result is that each base is read twice, giving you an idea of where errors have been made. This makes ABI SOLiD NGS a much more accurate method; however, it is slower and more expensive. It also requires a different type of software for analysis.
Also limited by very short average read length compared to sequencing by synthesis – 50 nucleotides vs >100 nucleotides.
Due to being a later method which is cumbersome and requires distinct software, it has never really caught on.
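The two-base color encoding described above can be sketched as follows. This assumes the commonly described scheme in which each adjacent pair of bases maps to one of four colors, here computed as the XOR of 2-bit base codes; the sequences are made up for illustration.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    """Encode each adjacent base pair as one of four colors (0-3)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]

reference = "ATGGCA"
variant = "ATGACA"  # single-base change at the fourth base

ref_colors = to_colors(reference)
var_colors = to_colors(variant)
changed = [i for i, (r, v) in enumerate(zip(ref_colors, var_colors)) if r != v]

print(ref_colors)  # [3, 1, 0, 3, 1]
print(var_colors)  # [3, 1, 2, 1, 1]
print(changed)     # [2, 3]
```

Because every base contributes to two colors, a true single-base change alters two adjacent colors, while an isolated color change is more likely a read error.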
Ion Torrent (IT) NGS Sequencing
Very similar to 454, however rather than measuring the pyrophosphate via ATP sulfurylase and luciferase, it measures the H+ released during incorporation via an ion sensor (effectively a pH meter) built into each well. Flow-based sequencing by synthesis. Referred to as “proton detection sequencing.”
Fast, but has the same error problems that 454 sequencing does. Also the coverage is at best ~1 gigabase.
MinION (Nanopore) NGS Sequencing
A form of sequencing by current differential which takes one single molecule of DNA at a time through a molecular ratchet attached to a charged chamber.
Has distinct signals for 3-nucleotide combinations, so effectively 64 distinct signals and 3 reads of each base location.
Errors come in the form of false indels created by imperfect ratcheting. Error rate ~4%.
Only costs $900, pen-sized, plugs into a laptop via USB.
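The "64 signals, 3 reads per base" arithmetic from this card can be checked with a short sketch. The `covering_kmers` helper and the example sequence are hypothetical.

```python
from itertools import product

# Every possible 3-nucleotide combination
kmers = ["".join(p) for p in product("ACGT", repeat=3)]
print(len(kmers))  # 64

def covering_kmers(seq, i, k=3):
    """Return the k-mers of `seq` that include position i (0-based)."""
    starts = range(max(0, i - k + 1), min(i, len(seq) - k) + 1)
    return [seq[s:s + k] for s in starts]

print(covering_kmers("ACGTAC", 2))  # ['ACG', 'CGT', 'GTA']
```

An interior base appears in three overlapping 3-mers, so it is effectively observed three times.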
FASTA format
> name
ACTGATGACTGCC. . . .
FASTQ format
FASTA + quality score
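A minimal sketch of reading FASTQ, assuming the common layout of four unwrapped lines per record (name line starting with `@`, sequence, `+` separator, quality string). The `read_fastq` helper and the example record are made up for illustration.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual

example = io.StringIO("@read1\nACTGATGA\n+\nIIIIHHG#\n")
records = list(read_fastq(example))
print(records)  # [('read1', 'ACTGATGA', 'IIIIHHG#')]
```

The quality string always has the same length as the sequence, one character per base.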
Common start sites as a sign of bad NGS
If you have tons of reads starting at the same site, something has gone wrong. This is the result of artificial duplication of few reads due to PCR.
You were probably only targeting a small subset of your alleles due to some problem early on – maybe poor sample quality, poor DNA shearing, non-random shearing, etc.
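One rough way to quantify this, as a sketch: count how many reads share a start site with an earlier read. The `duplicate_fraction` helper and the positions are hypothetical, and a real pipeline would also take chromosome and strand into account.

```python
from collections import Counter

def duplicate_fraction(start_positions):
    """Fraction of reads whose start site is shared with an earlier read."""
    counts = Counter(start_positions)
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(start_positions)

starts = [100, 100, 100, 100, 250, 250, 317, 402]
print(duplicate_fraction(starts))  # 0.5
```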
What can you do to salvage a somewhat poor quality NGS?
The quality varies with position in the NGS fragment, and you can track it with the FASTQ quality score.
So. . . you can just read a sequence until its quality score drops below the threshold corresponding to a predetermined predicted error rate. This is called quality trimming.
This means that instead of 100 bases where the last 25 are gibberish, you get 75 quality bases. With the number of reads you usually have in NGS, that length is sufficient, so you can salvage the data into good quality with a slightly lower average read length.
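Quality trimming as described here can be sketched in a few lines. This simple version cuts at the first base below the threshold; the Q20 cutoff is a hypothetical choice, and real trimmers typically use more forgiving rules such as sliding windows.

```python
def quality_trim(seq, quals, min_q=20):
    """Keep bases up to (not including) the first one whose Phred score is below min_q."""
    for i, q in enumerate(quals):
        if q < min_q:
            return seq[:i], quals[:i]
    return seq, quals

seq = "ACGTACGTAC"
quals = [38, 37, 36, 35, 30, 28, 12, 8, 5, 2]
trimmed_seq, trimmed_quals = quality_trim(seq, quals)
print(trimmed_seq)  # ACGTAC
```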
Adaptor trimming
Sometimes the adaptor sequence is also read in shorter NGS fragments.
But, since you know your adaptor sequences, you can automatically filter these out.
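A minimal sketch of adaptor trimming, assuming an exact match is enough. Real tools also tolerate mismatches and partial adaptors at the read end; the adaptor string below is a placeholder, so substitute the sequence from your own kit.

```python
def trim_adaptor(read, adaptor):
    """Remove the adaptor and everything after it, if the adaptor occurs in the read."""
    idx = read.find(adaptor)
    return read if idx == -1 else read[:idx]

ADAPTOR = "AGATCGGAAGAGC"  # placeholder adaptor sequence (assumption)
read = "ACGTACGT" + ADAPTOR + "TTT"
print(trim_adaptor(read, ADAPTOR))  # ACGTACGT
```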
Size trimming
Sometimes when you have short NGS fragments, post-trimming your total usable length is only ~15 base pairs.
This makes mapping a problem, as this can map to many locations in the genome.
So, you can just exclude these fragments entirely.
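A back-of-the-envelope check on why ~15 bp is too short, treating the genome as random sequence (which it is not, so this is only a rough guide). The genome size and the 20 bp minimum length are assumptions chosen for illustration.

```python
GENOME_SIZE = 3_000_000_000  # approximate human genome size in bp (assumption)

def expected_random_hits(read_length, genome_size=GENOME_SIZE):
    """Expected number of chance matches for one read in a random genome of this size."""
    return genome_size / 4 ** read_length

def drop_short(reads, min_length=20):
    """Discard reads shorter than min_length after trimming."""
    return [r for r in reads if len(r) >= min_length]

print(round(expected_random_hits(15), 2))  # 2.79
print(expected_random_hits(25) < 1e-5)     # True
```

A 15-mer is expected to occur by chance a few times in a genome this size, while a 25-mer almost never does.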
Two approaches to mapping sequences
De novo: Purely piecing sequences together – no reference sequence.
Resequencing: True mapping onto a reference genome sequence.
Multiplex NGS
Very commonly done in order to save money on NGS runs, and actually very simple.
Build a distinct barcode sequence into the adaptors for different patients. This will be read along with the sequence and can be used to identify which patient the fragment came from, at the cost of a small number of bp reads, say 10.
These sequences are read and then “trimmed” during the demultiplexing step, which should happen before any other trimming.
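Demultiplexing can be sketched as below, assuming the barcode sits at the very start of each read and must match exactly. The barcodes, sample names, and reads are all hypothetical.

```python
from collections import defaultdict

def demultiplex(reads, barcodes):
    """Assign reads to samples by their leading barcode and strip the barcode."""
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read
            by_sample["undetermined"].append(read)
    return dict(by_sample)

barcodes = {"ACGTACGTAC": "patient_1", "TGCATGCATG": "patient_2"}  # hypothetical 10-bp barcodes
reads = ["ACGTACGTACGGGTTTAAA", "TGCATGCATGCCCAAATTT", "NNNNNNNNNNACGT"]
result = demultiplex(reads, barcodes)
print(result["patient_1"])  # ['GGGTTTAAA']
```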
What reference genome should you be using?
hg19
(And in rare cases you may need to convert old data from hg18 to hg19, but try to stick to hg19)
dbSNP
SNP database that catalogs known SNPs, many (but not all) of which are benign
Important reference tool
When does NGS with RNA become less of an option?
7-8 years is when it starts to get questionable, but some cases 10 years out can have decent quality reads.
“RT check”
Checks that RNA quality is sufficient for reverse transcription.
The metric is a measure of GAPDH RNA.
Coverage that we deem adequate (rather than inadequate/conditional)
100 reads
Amplicon vs hybrid capture sequencing
Interpreting FASTQ quality score
The Phred score is given in ASCII code in order to compress the data, with each character corresponding to one position in the FASTQ sequence.
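Decoding the quality string can be sketched as follows, assuming the widely used Phred+33 encoding (older files may use a different offset). A Phred score Q corresponds to an error probability of 10^(-Q/10).

```python
def phred_scores(quality_string, offset=33):
    """Convert each ASCII quality character to its Phred score."""
    return [ord(ch) - offset for ch in quality_string]

def error_probability(q):
    """Predicted probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

print(phred_scores("I5+!"))   # [40, 20, 10, 0]
print(error_probability(20))  # 0.01
```

So Q20 means a predicted 1 in 100 chance of a wrong call, and Q30 means 1 in 1000.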
Full, formal variant reporting format (genome, cDNA, protein)
SAM/BAM
The standard output file of an alignment tool.
SAM = sequence alignment map
BAM = binary alignment map (binary equivalent of SAM)
Header - contains information about the data file, such as genome build version, reference sequence.
Sections are identified by @##, where ## is a two-letter code to identify the type of data being entered.
Includes two quality metrics: The original FASTQ quality score AND the mapping quality score, both of which play different but important roles in variant calling.
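A minimal sketch of reading SAM text, using the 11 mandatory tab-separated alignment fields. The `parse_sam_line` helper and the example lines are made up for illustration; in practice a dedicated library would normally be used for BAM.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

def parse_sam_line(line):
    """Return ('header', (tag, text)) or ('alignment', dict of the 11 mandatory fields)."""
    line = line.rstrip("\n")
    if line.startswith("@"):
        tag, _, rest = line.partition("\t")
        return "header", (tag[1:], rest)
    record = dict(zip(SAM_FIELDS, line.split("\t")[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return "alignment", record

kind, header = parse_sam_line("@HD\tVN:1.6\tSO:coordinate")
_, aln = parse_sam_line("read1\t0\tchr7\t1000\t60\t8M\t*\t0\t0\tACTGATGA\tIIIIHHG#")
print(kind, header[0])           # header HD
print(aln["MAPQ"], aln["QUAL"])  # 60 IIIIHHG#
```

`MAPQ` is the mapping quality score and `QUAL` carries the original per-base FASTQ quality string.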
The central dogma of NGS analysis
Variant reporting/tiering for somatic mutations
Important questions to know before beginning NGS analysis
- DNA or RNA-based assay?
- Is the quality sufficient (FASTQ quality score, alignment quality score, mean start sites, etc)?
- What type of specimen was the material acquired from (precludes certain analyses)?
High expression artifact
Seen in RNA-based fusion assays
If a certain gene is expressed very highly, you will see a lot of artifactual “fusions” involving that gene.
Near-haploidization effect in oncocytic tumors
Multiple oncocytic tumors, including Hurthle cell adenocarcinoma, frequently display haploidization followed by reduplication, resulting in homozygosity of many genes.
Chromosome 7 is often unaffected by this process.