NGS Flashcards
What five areas need to be considered when assessing quality of NGS
Error rates of technology. Read length. Base calling algorithms. Alignment. Read depth coverage
In NGS what can contribute to error rates
Signal to noise ratio. Cross talk from nearby clusters or beads. Homopolymer counts. Incomplete extension. Position on the read (worse at beginning or end).
Error rates are typically 0.1% to several %.
How does read length affect NGS quality
If reads are too short they might not align correctly. Longer read lengths provide more information about relative genomic location but cost more. Paired end sequencing helps align short reads and helps with rearrangements but is more expensive and time consuming.
What do base calling algorithms do in NGS
Identify bases and give them a quality score (phred score) based on noise estimates from image analysis. Can help improve error rates. The higher the phred score the better the quality.
Quality score is important for rejecting low quality reads, trimming low quality bases, improving alignment accuracy and determining consensus sequences.
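A minimal sketch of how a Phred score maps to an error probability, using the standard definition Q = -10 log10(P). The function names are illustrative only, and the Phred+33 ASCII offset is an assumption about how the quality string is encoded.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode a quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]

print(phred_to_error_prob(20))       # about a 1 in 100 chance of error
print(phred_to_error_prob(30))       # about a 1 in 1000 chance of error
print(decode_quality_string("I5+"))  # one score per character
```

So a base with Q30 is ten times less likely to be wrong than a base with Q20.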
What must you be careful of with base calling algorithms in NGS
They can remove real deletions, so special software designed for detecting deletions has to be used.
Why is alignment important and what are the issues in NGS
It’s important that alignment algorithms can cope with both sequencing errors and real differences. Alignment is more difficult than with Sanger because of the short reads. Paired end sequencing contributes to an increase in matched reads. Issues arise in repetitive regions and regions of shared homology.
Must produce well calibrated alignment quality values.
What is depth coverage in NGS
Measurement of the number of times a region has been sequenced during a run. The higher the number of reads, the higher the data quality. Coverage across regions is variable.
Inadequate coverage can result in a false negative result (missing a real SNV). >30-fold coverage is recommended. If this isn’t reached, the region needs repeating (e.g. by Sanger).
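A rough sketch of a per-base coverage check against the 30x recommendation. The depth values are made up for illustration, the function names are hypothetical, and treating exactly 30 reads as a pass is an assumption; a lab would set its own cut-off.

```python
def low_coverage_positions(depths, min_depth=30):
    """Return 0-based positions whose read depth is below min_depth."""
    return [i for i, d in enumerate(depths) if d < min_depth]

def mean_depth(depths):
    """Average read depth across the region."""
    return sum(depths) / len(depths)

# Hypothetical per-base depths across a short region of interest
depths = [45, 52, 28, 31, 12]
print(low_coverage_positions(depths))  # positions that would need repeating
print(mean_depth(depths))
```

The mean alone can hide gaps: this region averages above 30x while two bases still fall short.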
What’s a FASTQ file
A file that contains all base calls and quality scores.
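Each FASTQ record takes four lines: an @-prefixed read ID, the bases, a + separator line and a quality string with one character per base. A minimal sketch of reading such records, assuming well-formed, unwrapped 4-line records (the function name and the example read are invented for illustration):

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it).strip()
        next(it)  # '+' separator line, ignored here
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

# Hypothetical single-read example
example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, qual)
```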
What’s a BAM file
A Binary Alignment/Map file (the compressed binary form of a SAM file) that holds the reads aligned to the reference genome.
What’s a VCF file
A Variant Call Format text file that lists the positions where the patient’s sequence differs from the reference genome, with information about each variant call.
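VCF data lines are tab-separated and begin with eight fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. A minimal sketch of splitting one data line into those fields; the function name and the example line are invented for illustration, and real files are better read with a dedicated library.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),        # 1-based position on the reference
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several ALT alleles are comma-separated
        "qual": qual,
        "filter": flt,
        "info": info,
    }

# Hypothetical data line (not a real variant)
line = "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100\n"
print(parse_vcf_line(line))
```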
Give a basic overview of what needs to be checked for accurate detection of SNV in NGS
1) data must be aligned correctly. 2) alignment quality needs to be checked. 3) coverage of every base needs to be checked (>30x). 4) variant detection is performed. 5) check each base quality (phred) score. 6) check % reads the variant is seen in to determine real vs sequencing error (a threshold must be established).
What must be considered when assessing if a variant call is real in NGS
% of times the SNV appears on the forward and reverse strands. % of times the SNV is called vs wild type. Nature of the SNV (eg in a homopolymer region).
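These checks can be sketched from four read counts at the variant position. The function names, the counts and both thresholds below are hypothetical; each lab has to establish its own cut-offs during validation.

```python
def variant_summary(alt_fwd, alt_rev, ref_fwd, ref_rev):
    """Return (variant read fraction, fraction of variant reads on the forward strand)."""
    alt = alt_fwd + alt_rev
    total = alt + ref_fwd + ref_rev
    vaf = alt / total if total else 0.0
    fwd_fraction = alt_fwd / alt if alt else 0.0
    return vaf, fwd_fraction

def looks_real(vaf, fwd_fraction, min_vaf=0.2, min_strand=0.1):
    """Illustrative filter: enough variant reads, seen on both strands."""
    both_strands = min_strand <= fwd_fraction <= 1 - min_strand
    return vaf >= min_vaf and both_strands

# Hypothetical counts: 18 forward + 22 reverse variant reads out of 100 reads
vaf, fwd = variant_summary(18, 22, 30, 30)
print(vaf, fwd, looks_real(vaf, fwd))
```

A call seen only on one strand fails this filter even at a high read fraction, which is the pattern expected from a strand-specific sequencing artefact.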
What’s a homopolymer
A stretch/run of the same base eg AAAAAA
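A short sketch of finding homopolymer runs in a sequence so that variant calls falling inside them can be flagged. The minimum run length of 4 is an arbitrary assumption and the function name is illustrative.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for each run of one base at least min_len long."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

print(homopolymer_runs("ACGTAAAAAAGGCTTTTC"))
```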
Name 3 causes of error in NGS
Base calling errors. Alignment errors. Low coverage.
For a targeted NGS what has to happen before sequencing
An enrichment step. Either PCR based or hybridisation based
Discuss PCR based enrichment technologies for NGS
Requires a small starting amount of DNA. It’s cheap. Products will contain unwanted introns. Originally long range PCR was performed. Now multiplexed enrichment kits are used.
Nextera (Illumina). 1) tagmentation (transposons simultaneously fragment and tag the DNA with adapters). 2) reduced cycle amplification (adds more motifs to fragments).
Fluidigm Access Array. 1) hybridisation of a sequence specific binder to the DNA (primer contains a universal tag sequence which allows binding of …). 2) annealing of the barcode primer (contains a capture sequence appropriate for the seq tech). 3) final amplicon has barcode seq, pt ID, and is tagged for capture.
What tags need to be added to the fragmented DNA in the library prep stage of NGS
Index sequence to ID pt sample. A primer site for the sequencing primer to anneal to. Capture sequence complementary to the sequencing technology for binding to the cell.
Discuss hybridisation enrichment methods for NGS
Based on capture of target regions. Fragmentation of DNA, tagging of DNA, capture of fragments using a RNA or DNA library.
Describe how SureSelect works
SureSelect (Agilent). 1) shear DNA to produce sequence ready DNA. 2) prepare a biotinylated library of 120mer RNA baits of the RoI with adapter MIDs. 3) hybridise together. 4) separate out hybridised regions using streptavidin beads and magnets. 5) wash beads and digest RNA. Prep is ready for sequencing.
Describe how HaloPlex works
Agilent. Involves an initial restriction endonuclease step.
1) digest and denature DNA (6 digests using different REs). 2) prepare probe library (biotinylated probe consisting of a universal primer site, a sequencing primer motif, an index for pt ID and sequences corresponding to the RE sites). 3) hybridise probe library to fragmented DNA (probe is designed to bind to both ends of the fragmented DNA, resulting in circular DNA). 4) purify and ligate (purify using streptavidin and magnets, then close the circular DNA by ligation). 5) amplify enriched fragments (PCR using a universal barcoded primer that amplifies the circular DNA, producing linear tagged fragments ready for sequencing).
Describe bridge amplification in the Illumina NGS platform
- careful quantification of the concentration of the library required*
1) the template hybridises to the immobilised adapter region on the flow cell (P7). 2) initial extension results in a ds strand attached to the flow cell. 3) dsDNA is denatured, removing the template DNA and leaving the copied sequence attached to the flow cell. 4) the sequence then folds over and anneals to the complementary adapter sequence (P5), forming a bridge. 5) 1st cycle extension results in a dsDNA bridge. 6) 2nd cycle denaturation results in two ssDNA strands (forward and reverse: one attached to P7 and one attached to P5). 7) cycle repeated x35 (folding, annealing, denaturing). 8) cluster is now formed ready for sequencing.
Describe NGS process for Illumina MiSeq, HiSeq, NextSeq
Reversible terminator sequencing is carried out:
a) forward strand is read, b) indexes are read, c) reverse strand is read.
1) forward strand is sequenced first, so the reverse strands are cleaved and washed off. 2) 3’ ends of strands are chemically blocked (to prevent folding over) and primed. 3) all 4 fluorescently tagged nucleotides are added at once and are provided each cycle. Only a single nucleotide extension occurs as there’s a blocking group at the 3’ OH of the deoxyribose. 4) all unincorporated nucleotides are washed away. 5) flow cell is illuminated and each cluster’s fluorescent signal recorded. 6) fluorescent group is cleaved from the nucleotide. 7) the 3’ OH is unblocked, allowing a further nucleotide to be added. 8) cycle is repeated for every nucleotide added.
Describe emulsion PCR required for the ion torrent NGS platform
The library molecules are clonally amplified onto beads in spheres. Spheres are produced using water and oil. Each sphere contains 1 bead, 1 molecule and the reagents required for amplification.
Each bead has probes attached that are complementary to the adapters of the library molecule. The molecule is amplified and attached to the bead.
Describe sequencing using the ion torrent
The emulsion is broken, the beads are cleaned up and the individual beads are loaded into the sensor wells by centrifugation.
Chip: high density array of micro wells. Beneath each well is an ion-sensitive layer and an ion sensor (pH meter).
1) The nucleotides (not fluorescently labelled) are added one type at a time, in order. 2) incorporation into the chain results in hydrolysis of the nucleotide triphosphate and net release of a H+ ion. 3) release of the H+ ion results in a shift of the pH of the surrounding solution that’s PROPORTIONAL to the number of nucleotides incorporated (0.02 pH units/nt). 4) pH change is detected by a semiconductor sensor, converted into voltage and digitised.
After each flow of nucleotides, a wash step ensures nucleotides don’t stay in the wells. Due to the small size of the wells, diffusion into and out of the wells takes about 1/10 of a second, so there’s no need for enzymatic removal of reagents.
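The proportional signal can be sketched as a toy simulation: each flow offers one nucleotide type, and the pH shift scales with how many bases of that type are incorporated in a row. The repeating T, A, C, G flow order and the example read are assumptions for illustration, and real signals are noisy rather than exact multiples of 0.02.

```python
PH_SHIFT_PER_NT = 0.02  # pH units per incorporated nucleotide (figure from the card above)

def simulate_flows(read, flow_order="TACG", max_flows=100):
    """Return (base flowed, nucleotides incorporated, pH shift) for each flow.

    `read` is the strand being synthesised, written 5' to 3'.
    """
    flows = []
    i = 0
    flow = 0
    while i < len(read) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        n = 0
        # A homopolymer run is incorporated in a single flow
        while i < len(read) and read[i] == base:
            n += 1
            i += 1
        flows.append((base, n, round(n * PH_SHIFT_PER_NT, 2)))
        flow += 1
    return flows

for base, n, shift in simulate_flows("TTACCCG"):
    print(base, n, shift)
```

The CCC run gives one signal three times the size of a single incorporation, which is why homopolymer length is hard to call accurately on this platform.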
Name the advantages of whole exome sequencing
Targeted to the ~2% of the genome that’s coding, which contains ~85% of disease causing mutations. Looking at less of the genome means it’s cheaper, produces less data for storage and needs less analysis time. Can analyse more samples (cheaper/quicker/multiplex). Don’t get data fatigue from looking at so much information.
What are some advantages of whole genome sequencing
Examination of the entire genome, including non-coding regions. Can examine for indels, CNVs, SNVs. Has more uniform coverage. PCR amplification step not required (no bias in GC rich areas or at heterozygous sites that could cause a false result).
What are some of the diagnostic benefits of exome sequencing
Accurate diagnosis of pts with Mendelian disorders, with atypical manifestations, with symptoms shared among several disorders, or with a disorder that has a long list of candidate genes, eg Charcot Marie Tooth.
Why carry out a targeted NGS approach to analysis
Cheaper. Coverage of the RoI is better (can fill gaps with Sanger). Can interpret and fully report findings as they are in known genes. Reduces the number of VUS findings.
What’s a virtual gene panel
Where the whole exome, or a large number of genes, is sequenced, but only a subset is analysed per patient, depending on the referral reason. Uses bioinformatics to show only the relevant genes for that gene panel. Allows a greater degree of flexibility as additional genes can be ‘added’ to the panel without the need for revalidation/development of a new panel by the manufacturer. But the increase in breadth of genes reduces the depth at which each is covered.
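A minimal sketch of the bioinformatic filtering behind a virtual panel: everything is sequenced, but only variants in the genes on the referral-specific list are shown. The function name, the panel contents and the variant records are placeholders for illustration, not a validated panel.

```python
def apply_virtual_panel(variants, panel_genes):
    """Keep only the variants that fall in genes on the virtual panel."""
    panel = {g.upper() for g in panel_genes}
    return [v for v in variants if v["gene"].upper() in panel]

# Placeholder variant records from a broader sequencing run
variants = [
    {"gene": "PMP22", "variant": "var1"},
    {"gene": "BRCA1", "variant": "var2"},
    {"gene": "MFN2", "variant": "var3"},
]

# Illustrative neuropathy panel; adding a gene later is just a list change
neuropathy_panel = ["PMP22", "MFN2", "GJB1"]
print(apply_virtual_panel(variants, neuropathy_panel))
```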
Describe a clinical exome
Only genes with a known disease relationship included (but entire exome, not just associated with one disease)
What’s third generation sequencing
Sequencing of single molecules of DNA without the need to halt between read steps (whether enzymatic or other). It removes the need for production of clusters, so there are no synchronisation problems.
Name two third generation platforms
Pacific Biosciences (PacBio) SMRT: single molecule real-time sequencing.
Oxford Nanopore
Describe PacBio SMRT
Sequencing by synthesis. Realtime imaging.
Uses a DNA polymerase anchored to the bottom surface of a well. Differently fluorescently labelled nucleotides enter the well via diffusion. During incorporation, the labelled nucleotide is ‘held’ within the detection volume by the polymerase for tens of milliseconds. As each nucleotide is incorporated, the label, located on the terminal phosphate, is cleaved off and diffuses out of the well.
Produces really long reads (30-200x longer than 2nd gen).
Describe the Oxford Nanopore process
The platform uses an exonuclease coupled to a modified α-hemolysin nanopore positioned within a lipid bilayer. As sequentially cleaved bases are directed through the nanopore, they are transiently bound by a cyclodextrin moiety. This disturbs the current through the nanopore in a manner characteristic for each base.
What is RNA-seq and what are its key aims
Transcriptome profiling using deep sequencing technology. Aims: 1) to catalogue all species of transcript (mRNA, ncRNA). 2) determine the transcriptional structure of genes (start sites, 5’ and 3’ ends, splicing patterns, other post transcriptional modifications). 3) quantify the changing expression levels of each transcript during development and under different conditions.
What’s a transcriptome
The complete set of transcripts, and their quantities, in a cell for a specific developmental stage or physiological condition.
From BPG what sources should you use to determine the clinical significance of a VUS
RNA studies, LOH studies, in silico predictions, functional studies, co-occurrence with a known deleterious variant in the same gene, co-segregation with disease in a family, species conservation, testing matched controls, sporadic, literature/databases.
What are the general considerations when using an external database
Accuracy of data (normal population studies: is everyone normal? Where is the data from? Has it been curated?). Patient consent. Intellectual property rights of info. Adequate biostatistical support. Amount of data in the database. Is it being continually updated? Cost of obtaining licence/access.
Ongoing large scale projects: 1000 Genomes. DDD project. NHLBI Exome seq project.
Name two population (normal) databases (CNV and SNV)
DGV. dbSNP. dbVar. ExAC. 1000 Genomes Project. NHLBI Exome seq project.
Name some disease databases
DECIPHER. ClinVar. OMIM. DMuDB. HGMD
What are the general areas to consider when designing a NGS gene panel
Type of target enrichment. Gene/transcript selection. BED file. DNA quality. Barcoding of samples. Subpanels. IQC and EQA. Whether and how to confirm variants. BPG. Report structure. Cost. Validation (reproducibility, sensitivity and specificity). Bioinformatics pipeline (variant calling, filtering and annotation).
From BPG, in terms of data storage what is it essential to keep
Essential to store output file from the variant annotation step eg VCF and some labs may also retain the FASTQ and BAM files in order to analyse the read data in the future.
Must also keep a log of what bioinformatics processing was applied to the raw data to make the files.