Next Generation DNA Sequencing Flashcards

1
Q

What is “Next Generation Sequencing” (NGS)?

A

In general, the phrase “Next Generation Sequencing” refers to a methodology that:

  1. Uses a ‘flowcell’ to enable ‘massively parallel sequencing’ of millions of DNA molecules at the same time.
  2. Requires ‘clonal amplification’ of individual DNA molecules into a cluster or on a bead to enable visualisation of the reaction.
  3. Relies on the chemical process of ‘sequencing by synthesis’, i.e. the DNA sequence is decoded by the stepwise synthesis of a DNA strand one nucleotide at a time.
2
Q

What else is NGS commonly referred to as?

A

Also known as “Second Generation Sequencing”, because it came after Sanger sequencing, which is “First Generation”.

3
Q

What are the main workflow stages in performing NGS?

A
  1. DNA extraction
  2. Library preparation*
  3. Sequencing*
  4. Bioinformatic analysis
  5. Data Interpretation

*Varies based on platform/system used.

4
Q

What is “library prep”?

A
  • NGS requires the preparation of “DNA libraries” ready for loading onto the NGS instrument.
  • Library prep is the process of converting genomic DNA into a sample suitable for loading onto an NGS instrument.
  • The main objective of library prep is to end with a sample of DNA fragments in which the molecules from each sample are fused to adapters, so that multiple samples can be pooled together on a single run.
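To illustrate why adapters allow pooling, here is a minimal sketch of how pooled reads could be assigned back to their samples. It assumes each read starts with a short sample-specific barcode; the barcodes, sample names and the `demultiplex` function are hypothetical, and on many instruments the barcode is actually read in a separate index read rather than inline.

```python
def demultiplex(reads, barcode_to_sample, barcode_len):
    """Group pooled reads by the sample barcode at the start of each read."""
    by_sample = {}
    for read in reads:
        barcode, insert = read[:barcode_len], read[barcode_len:]
        # Reads whose barcode is not recognised are kept apart, not discarded.
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample.setdefault(sample, []).append(insert)
    return by_sample


# Hypothetical 4-base barcodes for two pooled samples.
barcodes = {"ACGT": "patient_1", "TGCA": "patient_2"}
pooled_reads = ["ACGTTTGACC", "TGCAGGATCC", "NNNNAAAA"]
print(demultiplex(pooled_reads, barcodes, barcode_len=4))
```

Exact barcode matching is used here for simplicity; allowing one mismatch would recover more reads at the cost of more bookkeeping.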
5
Q

What different strategies can be taken towards library prep?

A

Two broad categories:

Targeted: Specific regions of the genome, ranging from a handful of genes up to the entire human exome, are enriched during library prep so that only these regions are sequenced.

Non-targeted: The DNA is processed into fragments with the adapters required for sequencing, but no enrichment is performed, resulting in the entire genome of a given sample being sequenced.

6
Q

What are the advantages of targeted NGS vs non-targeted NGS?

A
  • Much lower per-sample cost, as far less sequencing is performed and many patients can be loaded onto a single sequencing run.
  • Coverage can be better, as enrichment assays can be optimised to fill difficult-to-sequence regions.
  • Less chance of incidental findings.
  • Less chance of VUSs (variants of uncertain significance), as genes in the target region should be clearly linked to the indication.
7
Q

What two methods are available for targeted NGS strategies?

A
  1. PCR-based (amplicon) Methods
  2. Hybridisation (capture) Methods
8
Q

Describe the principle of PCR-based methods for target enrichment.

A
  • Sequencing-ready libraries can be produced by amplifying target fragments with PCR primers.
  • The PCR primers are designed to contain the adapters for flow-cell attachment and the barcodes.
  • The fragments are then purified and sequenced.
  • There are many commercially available platforms based on amplicon sequencing, with most companies offering predesigned off-the-shelf panels in addition to custom panels.
9
Q

What is one of the most popular amplicon-based library prep kits on the market?

A
  • Illumina’s TruSeq methodology is an alternative to PCR but is based upon similar principles.
  • Rather than relying on error-prone PCR, TruSeq uses a single primer extension methodology to generate target regions flanked by appropriate sequencing adaptors and molecular identifiers (MIDs)/barcodes.
10
Q

What are the advantages of using amplicon library prep?

A
  1. Relatively cheap.
  2. Can focus on small regions and multiplex many samples per run.
  3. Technically simple.
  4. Can utilise long-range PCR (LR-PCR) to amplify large genomic regions containing multiple exons - very useful for diseases where there is a common pseudogene (e.g. PKD, CAH).
11
Q

What are the disadvantages of using amplicon library prep?

A
  1. The big drawback of amplicon-based assays is PCR duplicates: every read from a given amplicon starts and ends at the same position, so duplicates cannot be told apart from reads derived from independent molecules.
  2. Read depth (vertical coverage) inflated by PCR duplicates is not an accurate metric and cannot be used to estimate copy number across the target area.
  3. Only unique reads should be used to generate coverage data, so read-depth-based analysis is not recommended for amplicon-based assays.
12
Q

Describe the principle of hybridisation-based methods for target enrichment.

A
  1. Relies on solution-based capture of target regions.
  2. Genomic DNA must first be fragmented.
  3. The fragments are then tagged with sequence-ready adaptors and barcodes.
  4. The targeted regions are captured using biotinylated RNA- or DNA-based oligonucleotide ‘baits’.
  5. These oligos anneal to specific regions of the genome to result in a tiling of captured fragments representative of the entire region of interest.
  6. Magnetic beads coupled to Streptavidin can then be used to physically separate the fragments bound to the baits from the rest of the input DNA.
13
Q

What is one of the most popular hybridisation-based library prep kits on the market?

A
  1. Agilent’s SureSelect kit.
  2. Starting from gDNA, a shearing step produces small fragments
  3. Prepare library with sequencer specific adaptors and sample-specific barcodes
  4. Hybridise sample with biotinylated RNA library baits. Agilent uses ultra long 120mer RNA baits for the highest specificity.
  5. Select targeted regions using magnetic streptavidin beads
  6. Amplify and load on the sequencer
14
Q

What are the advantages of using hybridisation-based library prep?

A
  1. The big advantage is that PCR duplicates can be removed bioinformatically, because the DNA is randomly fragmented before adaptors are added and amplification takes place, so reads from independent molecules rarely share the same start and end positions.
  2. This means that analysis of read depth can give insight into copy number, although this remains a challenge.
  3. Gives better horizontal coverage than amplicon sequencing, as difficult-to-sequence regions can be captured by including many more baits than the average for the rest of the panel.
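The duplicate-removal idea can be sketched as follows: reads that share the same chromosome, start, end and strand are treated as copies of one original molecule, and only the best-quality copy is kept. This is a simplified illustration with a hypothetical read representation (plain dictionaries); production tools apply more detailed rules, for example for read pairs.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as likely PCR duplicates.

    Each read is a dict with keys: name, chrom, start, end, strand, mean_qual.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the copy with the highest mean base quality, flag the rest.
        best = max(group, key=lambda r: r["mean_qual"])
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "a", "chrom": "chr1", "start": 100, "end": 250, "strand": "+", "mean_qual": 30},
    {"name": "b", "chrom": "chr1", "start": 100, "end": 250, "strand": "+", "mean_qual": 35},
    {"name": "c", "chrom": "chr1", "start": 120, "end": 270, "strand": "+", "mean_qual": 32},
]
print(mark_duplicates(reads))
```

The same logic would flag almost every read in an amplicon assay, since all reads from one amplicon share coordinates, which is why this approach only works for randomly fragmented libraries.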
15
Q

What are the disadvantages of using hybridisation-based library prep?

A
  1. The big disadvantage is that DNA must be sheared before beginning. This often requires sonication of gDNA, which can be very expensive and time consuming.
  2. Can be more expensive
  3. Can require extensive optimisation to improve coverage
  4. Technically more complicated than amplicon sequencing requiring more manual handling time
16
Q

What major library prep innovation facilitated the uptake of enrichment-based library preparation without the drawback of expensive DNA shearing?

A

Illumina Nextera “tagmentation” simultaneously fragments and tags DNA without the need for mechanical shearing.

Tagmentation uses a transposase enzyme to simultaneously fragment gDNA and insert sequencing adapters onto the dsDNA fragments.

Enzymatic fragmentation can be more biased than physical shearing methods, but it has been shown to be consistent in the long run and is now widely used.

17
Q

What levels of targeted NGS analysis are commonly performed in diagnostic laboratories?

A
  1. Targeted panel: Assay targets disease-specific genes only.
  2. Clinical Exome: All genes with a known disease association are included (~7,000)
  3. Whole Exome: All protein coding regions of the genome are included.
18
Q

What are the advantages and disadvantages of choosing a Targeted panel approach?

A

Advantages

  • More samples per run = cheaper per patient
  • Less compute power and storage required
  • Assay optimised for 100% coverage

Disadvantages

  • Inflexibility: cannot incorporate novel disease loci without redesigning the capture
  • Need high referral rate or runs could be delayed whilst waiting to ‘batch’
19
Q

What are the advantages and disadvantages of choosing a Clinical Exome approach?

A

Advantages

  • Virtual panels enable complete flexibility to use the same assay for many diseases
  • Enables efficient wet-lab processes e.g. no need to batch for disease-specific assays = cost effective
  • Sequencing not wasted on non-disease associated genes

Disadvantages

  • Cannot optimise the assay, so there could be gaps
  • Newly discovered disease genes may not be included
  • Vast majority of the sequence data not used = wasted sequencing, compute and storage
20
Q

What are the advantages and disadvantages of choosing a Whole Exome approach?

A

Advantages

  • Virtual panels enable complete flexibility to use the same assay for many diseases
  • Enables efficient wet-lab processes e.g. no need to batch for disease-specific assays = cost effective
  • Newly discovered disease genes available - assay = future proof

Disadvantages

  • Cannot optimise the assay, so there could be gaps
  • Vast majority of the sequence data not used = wasted sequencing, compute and storage
  • Non-coding space not included
21
Q

Why is the idea of performing non-targeted sequencing, i.e. Whole Genome Sequencing, becoming more appealing?

A
  • Until very recently, the time, cost and technical expertise required to generate and analyse WGS data largely precluded serious consideration of its use outside of research settings.
  • The situation is changing rapidly, and this can no longer be assumed to be the case, for several key reasons:
  1. Instrument output has massively increased
  2. Better governance frameworks for data storage and sharing mean cloud usage is acceptable
  3. Better governance around interpretation and incidental findings
22
Q

List some advantages of WGS over WES / targeted approaches

A
  • Allows examination of most types of genome variant in coding and non-coding/regulatory spaces e.g. point mutations, indels, CNVs, SVs, repeat expansions
  • Reliable and uniform coverage due to removal of bias introduced by baits and PCR.
  • No wild-type bias caused by baits
  • Future-proof for the vast majority of mutation types and not limited by current knowledge of genome architecture (e.g. rare transcripts), which could be missed with targeted assays
  • Mitochondrial genome included
23
Q

Who are the major NGS instrument manufacturers?

A
  1. Illumina (reversible-terminator) - by far the market leader
  2. Thermo Fisher (Ion Torrent)
  3. (Roche 454 pyrosequencing - no longer an instrument supplier)
24
Q

What are the key attributes of most NGS platforms?

A

Most platforms generally facilitate a series of

  1. automatically coordinated
  2. repeating chemical reactions
  3. typically carried out in a flow cell or plate
  4. which houses the immobilized templates and necessary reagents
25
Q

What are the key stages of the Illumina sequencing process?

A
  1. Library prep - can be PCR, capture or Nextera
  2. Bridge amplification: library fragments are amplified in situ on the flow cell surfaces by use of a bridge amplification step to produce foci for sequencing.
  3. Reversible-terminator sequencing
26
Q

What is the principle of reversible-terminator sequencing chemistry?

A

All four nucleotides carry a different fluorescent label.

Sequencing occurs as single-nucleotide addition reactions because a blocking group at the 3’-OH position of the deoxyribose sugar prevents additional base incorporation.

  1. A nucleotide is added by the polymerase
  2. Unincorporated nucleotides are washed away
  3. The flow cell is imaged
  4. The fluorescent groups are chemically cleaved
  5. The 3’-OH is chemically deblocked (i.e. termination is ‘reversed’)
  6. The cycle repeats
27
Q

What are the key stages in Ion Torrent sequencing?

A
  • Library prep - emulsion PCR: An oil-water emulsion partitions small reaction vesicles. Each contains one sphere, one library molecule and all necessary PCR reagents.
  • Individual beads are loaded into individual sensor wells
  • DNA sequencing by Ion Torrent chemistry
28
Q

What is the principle of Ion Torrent sequencing chemistry?

A
  1. The principle is based on the release of a hydrogen ion as a by-product of the incorporation of a nucleotide into a strand of DNA by a polymerase
  2. The four non-fluorescent nucleotides are added individually in a sequential order
  3. If a nucleotide is incorporated, the released hydrogen ion changes the pH of the surrounding solution.
  4. The platform uses a chip with a high-density array of micro-machined wells, each well holding a different DNA template, below which is a semiconductor pH sensor.
  5. If more than one nucleotide is added in a single flow (e.g. a TTT tract), the size of the pH change is proportional to the number of nucleotides incorporated.
  6. After the flow of each nucleotide, a wash step ensures nucleotides do not remain in the well
  7. Cycle repeated for different nucleotide
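The flow-based chemistry above can be sketched as a small simulation. Each flow offers one nucleotide, and the signal is the number of bases incorporated in that flow (zero if the next base does not match). The fixed `TACG` flow order and the function names are simplifying assumptions for illustration, not the instrument's actual flow order or software.

```python
def simulate_flows(synth_seq, flow_order="TACG"):
    """Return (nucleotide, signal) for each flow needed to synthesise synth_seq."""
    if set(synth_seq) - set(flow_order):
        raise ValueError("sequence contains bases that are never flowed")
    signals = []
    pos = 0
    flow = 0
    while pos < len(synth_seq):
        nuc = flow_order[flow % len(flow_order)]
        incorporated = 0
        # A homopolymer run is incorporated within a single flow.
        while pos < len(synth_seq) and synth_seq[pos] == nuc:
            incorporated += 1
            pos += 1
        signals.append((nuc, incorporated))
        flow += 1
    return signals


def call_bases(signals):
    """Turn (nucleotide, signal) pairs back into sequence by rounding each signal."""
    return "".join(nuc * round(signal) for nuc, signal in signals)


ideal = simulate_flows("TTTAGC")
print(ideal)
print(call_bases(ideal))
# A noisy measurement of a 5-base run that rounds to 4 gives a deletion error.
print(call_bases([("T", 4.4)]))
```

Because run length has to be inferred from signal size, small measurement noise on a long homopolymer can round to the wrong length, which is the source of the indel errors typical of this chemistry.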
29
Q

What are the advantages of Illumina platforms and chemistry?

A
  1. Most widely used platform
  2. Many platforms available to cater for different throughput requirements
  3. Performs well at homopolymer regions
30
Q

What are the disadvantages of Illumina platforms and chemistry?

A
  1. Low multiplexing capabilities of samples, can’t sequence hundreds of samples per run
  2. Sequencers more expensive than Ion Torrent
  3. Don’t offer reagent rental programmes
  4. Bridge amplification stage needs to be performed on separate instrument for some platforms (e.g. older HiSeqs)
  5. Higher throughput platforms often more technically challenging to use (manual handling of different reagents)
31
Q

What are the advantages of Ion Torrent platforms and chemistry?

A
  1. Fastest run times on the market. Good for applications where time to result is vital.
  2. Low cost instruments as expensive optics not required.
  3. Flexible and scalable with a range of chips.
32
Q

What are the disadvantages of Ion Torrent platforms and chemistry?

A
  1. Relatively poor performance at homopolymer regions
  2. Library prep is more manual and technically challenging
  3. Lack of very high capacity instruments
33
Q

Why is ‘bioinformatics’ required for analysis of NGS data?

A
  • NGS generates hundreds of megabases to gigabases of nucleotide sequence output in a single instrument run in the form of short sequence reads.
  • Bioinformatic analysis is needed to:
    • Assemble or align these millions of short sequence reads
    • Extract sequence variants
    • Assess the quality and depth of sequencing reads
    • Quantify the quality of base calls and remove errors so they do not influence downstream analysis. Accurate SNP and genotype calling can be difficult, and there is often uncertainty associated with the results.
34
Q

What factors can affect the quality of the NGS data?

A

Quality can depend on several factors, including:

  1. Signal to noise levels
  2. Cross talk from nearby beads or clusters
  3. De-phasing of molecules within a cluster
  4. Sequence context e.g. homopolymer length
  5. Position of a variant on read
35
Q

How is the quality of the base call measured?

A

A Phred quality score is a measure of the quality of the identification of the nucleobases generated by automated DNA sequencing.

It was originally developed for Phred base calling to help in the automation of DNA sequencing in the Human Genome Project.

36
Q

How are Phred scores calculated and interpreted?

A

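The formula used to calculate a Phred score:

Q(Phred) = -10 × log10(P(error))

where P(error) is the estimated probability that the base call is wrong. A higher score therefore means a more reliable call: Q20 corresponds to a 1 in 100 chance of error and Q30 to 1 in 1,000.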
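As a worked illustration of the formula, the sketch below converts in both directions between error probability and Phred score. The function names are illustrative and not taken from any particular library.

```python
import math


def phred_from_error(p_error):
    """Q = -10 * log10(P(error)), for 0 < P(error) <= 1."""
    if not 0 < p_error <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p_error)


def error_from_phred(q):
    """Invert the formula: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Print the error rate and base-call accuracy for some common scores.
for q in (10, 20, 30, 40):
    p = error_from_phred(q)
    print(f"Q{q}: 1 error in {round(1 / p)} bases, accuracy {100 * (1 - p):.2f}%")
```

Each step of 10 in the score is a tenfold drop in error probability, which is why the scale is convenient for comparing base calls.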

37
Q

Is base-calling error consistent across different NGS platforms / chemistries?

A

Different NGS platforms are prone to different error patterns:

  1. Ion Torrent: the variance of signal intensity for a given homopolymer length is large, resulting in high error rates (~1%) in insertion and deletion (indel) calls and at repeat tract sites.
  2. Illumina: the main complication arises from the synthesis process becoming desynchronised between different copies of the DNA template in the same cluster (overall error rate ~0.1%).
38
Q

What file type is produced once base calling quality has been assessed?

A
  • The FASTQ format is a text-based format for storing both a biological sequence (nucleotide sequence) and its corresponding quality scores.
  • Each sequence letter and each quality score is encoded with a single ASCII character. FASTQ has become the de facto standard for storing the output of high-throughput sequencing instruments.
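To make the single-character encoding concrete, the sketch below parses one four-line FASTQ record and decodes its quality string. It assumes the common Phred+33 encoding (score = ASCII code minus 33); the read shown is made up for illustration and `parse_fastq_record` is a hypothetical helper.

```python
def parse_fastq_record(lines):
    """Parse one 4-line FASTQ record into (name, sequence, quality scores)."""
    header, seq, plus, qual = (line.rstrip("\n") for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Phred+33: each quality character's ASCII code minus 33 is the score.
    scores = [ord(char) - 33 for char in qual]
    return header[1:], seq, scores


record = ["@read1", "GATTACA", "+", "IIII5+!"]
name, seq, scores = parse_fastq_record(record)
for base, score in zip(seq, scores):
    print(base, score)
```

In this made-up read the first four bases decode to Q40 while the last base decodes to Q0, showing how quality can be tracked per base along a read.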
39
Q

What is single-end vs. paired-end sequencing?

A
  • In single-end reading, the sequencer reads a fragment from one end only, generating the sequence of bases.
  • In paired-end reading, it starts at one end, reads in this direction for the specified read length, and then starts another round of reading from the opposite end of the fragment.
40
Q

What are the advantages and disadvantages of paired-end sequencing?

A
  • Paired-end reading improves the ability to identify the relative positions of various reads in the genome, making it much more effective than single-end reading in resolving structural rearrangements such as gene insertions, deletions, or inversions.
  • It can also improve the assembly of repetitive regions. This degree of accuracy may not be required for all experiments.
  • However, paired-end reads are more expensive and time-consuming to perform than single-end reads.
41
Q

What is ‘DNA sequence alignment’?

A

A sequence alignment is a way of arranging two or more DNA sequences against one another in order to identify regions of similarity.

In general, the sequence reads produced by the sequencing run are aligned to the human reference genome sequence.

42
Q

Why is sequence alignment so important?

A
  • The accuracy of the alignment has a crucial role in variant detection.
  • Incorrectly aligned reads may lead to errors in SNP detection and genotype calling.
  • Once accurate alignment is complete, mismatches with the reference can be classified as either true mismatches or artefacts.
43
Q

Why is sequence alignment so challenging for NGS technologies?

A
  • Alignment is substantially more difficult for NGS data than for Sanger data because of the shorter read lengths.
  • When reads become too short they cannot be aligned accurately. Platforms with shorter read lengths therefore produce more unusable data.
  • Accurate alignment is also limited in repetitive regions and in regions of shared homology, e.g. within closely related gene families and pseudogenes.
  • Paired-end sequencing increases the number of reads that align.
44
Q

What are the key considerations for a good sequence alignment algorithm?

A
  • It is important for alignment algorithms to cope with the sequencing errors, as well as the potentially real differences (point mutations and indels) between the reference genome and the sequenced genome.
  • It is important for aligners to produce well calibrated alignment (or mapping) quality values, as variant calls and their posterior probabilities depend on these scores
45
Q

What is ‘sequencing depth’?

A

Also known as the ‘vertical coverage’ - refers to the number of times that a specific genomic site is sequenced during a sequencing run.

This does not mean that every targeted base is sequenced every time; some nucleotides may be read 100 or more times, while others might only be read once or twice, or not at all.

46
Q

Why is ‘sequencing depth’ important and what factors can affect sequencing depth?

A
  • Important because: The higher the number of times that a base is sequenced, the better the quality of the base/variant call.
  • Inadequate depth can cause failure to detect a nucleotide variation because the alt allele is not present in sufficient reads, leading to false negative results.
  • Sequence context (e.g. GC content) can cause poor coverage.
  • Inadequate instrument capacity or too many samples per run can cause poor depth.
  • At least 20-fold depth is recommended for germline genetic variants.
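Depth at each position is simply a count of the aligned reads that overlap it. A minimal sketch, assuming reads are given as 0-based, half-open (start, end) intervals on the same chromosome as the target region; the coordinates below are invented for illustration.

```python
def depth_per_base(reads, region_start, region_end):
    """Count how many reads overlap each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip each read to the region before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


reads = [(0, 5), (2, 8), (4, 10)]
depth = depth_per_base(reads, 0, 10)
print(depth)
print("positions below 2x:", [i for i, d in enumerate(depth) if d < 2])
```

The threshold of 2x here only suits the toy data; the same check with the recommended minimum depth identifies positions at risk of a false negative call.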
47
Q

What is sequencing ‘coverage’?

A
  • Aka ‘horizontal’ coverage.
  • Refers to the proportion of bases in a target region that are sequenced to the required depth.
  • e.g. a 100 bp target region needs to be sequenced to a minimum of 20x
  • Only 80 bp reach a depth of at least 20x
  • Coverage of target region = 80%
  • Coverage of every base needs to be checked, as sequence below the required depth is at risk of producing a false negative result
  • Regions below the required depth are referred to as ‘gaps’ and are often filled in with Sanger sequencing to complete testing.
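The worked example in this card can be reproduced with a short sketch that takes a per-base depth list, reports the fraction of bases at or above the minimum depth, and lists the gaps. The function names and the depth values are illustrative assumptions.

```python
def horizontal_coverage(depth, min_depth=20):
    """Fraction of positions sequenced to at least min_depth."""
    covered = sum(1 for d in depth if d >= min_depth)
    return covered / len(depth)


def find_gaps(depth, min_depth=20):
    """Return half-open (start, end) intervals where depth is below min_depth."""
    gaps = []
    gap_start = None
    for i, d in enumerate(depth):
        if d < min_depth and gap_start is None:
            gap_start = i
        elif d >= min_depth and gap_start is not None:
            gaps.append((gap_start, i))
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, len(depth)))
    return gaps


# 100 bp target: 80 bp at 35x and 20 bp at only 12x.
depth = [35] * 80 + [12] * 20
print(f"coverage at 20x: {horizontal_coverage(depth):.0%}")
print("gaps:", find_gaps(depth))
```

Reporting gaps as intervals rather than a single percentage makes it easier to see which stretches would need follow-up testing.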
48
Q

What applications is NGS used for in diagnostic laboratories?

A
  1. Rare Disease: Targeted NGS, Clinical Exome, Whole Exome (WES), Deep Sequencing (mosaicism)
  2. Prenatal Diagnosis: Exomes (PAGE study)
  3. Non-invasive Prenatal testing: low-coverage WGS (aneuploidy, targeted SGD)
  4. Pre-implantation Genetic Screening/Diagnosis (PGS/D): low-coverage WGS
  5. Tumour profiling, MRD etc: Deep Sequencing
49
Q

What applications is NGS more commonly used for in research settings?

A
  1. Disease gene discovery: WES, WGS, targeted fine-mapping
  2. Gene Expression analysis: RNA-seq
  3. Gene Regulation analysis: ChIP-seq