18.02.16 NGS analysis and quality Flashcards

Question 1

Q

What are the causes of error in NGS data?

Answer

A

1) Base calling errors
2) Alignment errors
3) Low-coverage sequencing: as coverage decreases, there is a higher probability that only one of the two chromosomes of a diploid individual has been sampled at a specific site.
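The effect of low coverage in point 3 can be put into numbers. This is a minimal sketch, assuming that each read covering a heterozygous site comes from either chromosome with equal probability and independently of the other reads (an idealisation). The function name is illustrative only.

```python
def prob_only_one_chromosome(depth: int) -> float:
    """Probability that all `depth` reads at a site come from the same chromosome."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Each read is treated as a fair coin flip between the two chromosomes.
    # Either chromosome can be the one sampled exclusively, hence the factor 2.
    return 2 * 0.5 ** depth


for depth in (1, 2, 4, 8, 16):
    print(depth, prob_only_one_chromosome(depth))
```

Under this assumption the probability halves with every additional read, which is why heterozygous sites are easily missed at very low depth.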

Question 2

Q

What are the factors that influence NGS error rate?

Answer

A

NGS error rates depend on several factors, including signal-to-noise levels, cross-talk from nearby beads or clusters, homopolymer length, incomplete extension and position in the read (slightly worse at the beginning and end of the read).

Question 3

Q

What are the error patterns seen with the Roche 454 platform?

Answer

A

On the Roche 454 and Ion Torrent platforms, the signal intensity is rounded to the nearest integer to give the number of monomers of the corresponding base that were incorporated. A read error occurs whenever the signal intensity differs from the true value by more than 0.5. Most errors are overcalls (65-75%) or undercalls (20-30%), which insert or delete a base from the sequence. Miscalls are much rarer (<5%) and are typically due to undercall/overcall pairs.

The main problem is that the variance of the signal intensity for a given homopolymer length is large, resulting in high error rates in insertion and deletion (indel) calls.
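The rounding step described above can be sketched as follows. This is a simplified illustration, not the vendor's base caller: the flow order, the signal values and the function name are all made up for the example.

```python
def call_flowgram(flow_order: str, signals: list[float]) -> str:
    """Turn per-flow signal intensities into a base sequence by rounding."""
    bases = []
    for base, signal in zip(flow_order, signals):
        # Round to the nearest integer: this is the called homopolymer length.
        count = max(0, int(signal + 0.5))
        bases.append(base * count)
    return "".join(bases)


# Clean signals: T x1, A x0, C x2, G x1
print(call_flowgram("TACG", [1.04, 0.02, 2.1, 0.9]))

# A true AAA homopolymer with a noisy signal of 3.6 is overcalled as AAAA
print(call_flowgram("TACG", [0.98, 3.6, 0.1, 1.1]))
```

The second call shows an overcall: the signal is more than 0.5 away from the true value of 3, so an extra base is inserted.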

Question 4

Q

What are the error patterns seen with the Illumina platforms?

Answer

A

On Illumina platforms, indel errors are rare but the overall miscall error rate is typically around 1%.

The main complication arises from the synthesis process becoming desynchronised between different copies of DNA templates in the same cluster.

The two major triggers are inverted repeats and GGC sequences. Base calling becomes less accurate in later cycles because the extent of asynchrony increases with each sequencing cycle.

Question 5

Q

What are the error patterns seen with the SOLiD platform?

Answer

A

The SOLiD platform (ligation technology/short reads) uses a two-base encoding chemistry in which each fluorescent dye colour represents four dinucleotide combinations (oligos). Distinctive reagents in this technology include DNA ligase and a universal primer. Each base of the DNA template is examined twice, and a nucleotide sequence of length m is represented as a colour sequence of length m-1.

A major complication in colour calling arises from biases in fluorescence intensities that appear in later machine cycles.
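The two-base encoding can be illustrated with a short sketch. The colour numbering used here (the XOR of 2-bit base codes) is the commonly described colour-space mapping and is an assumption of this example, not something stated on the card. The names are illustrative.

```python
from collections import defaultdict
from itertools import product

# 2-bit codes for the bases (assumed mapping for this sketch)
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colour_space(sequence: str) -> list[int]:
    """Encode each overlapping dinucleotide as one of four colours (0-3)."""
    codes = [BASE_CODE[base] for base in sequence.upper()]
    # Every adjacent pair of bases yields one colour, so m bases give m-1 colours.
    return [first ^ second for first, second in zip(codes, codes[1:])]


# Group the 16 dinucleotides by colour: each colour covers exactly four.
by_colour = defaultdict(list)
for first, second in product("ACGT", repeat=2):
    by_colour[BASE_CODE[first] ^ BASE_CODE[second]].append(first + second)

print(dict(by_colour))
print(to_colour_space("ATGGC"))
```

Because each base contributes to two adjacent colours, a single wrong colour call is inconsistent with its neighbours, whereas a real single-base change alters two adjacent colours.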

Question 6

Q

What method is used to calculate base calling accuracy?

Answer

A

Although the calculation of quality varies between platforms, base calling accuracy is the most common metric, and its calculation is related to the historically relevant Phred score (Q score), introduced in 1998 for Sanger sequence data. The Phred quality value, Q, converts the estimated probability of an incorrect call by the sequencer at a given base, e, to a log scale: Q = -10 × log10(e).

Question 7

Q

What do Phred quality scores of 10 and 30 translate to?

Answer

A

Score = 10
Probability that the base is called wrong = 1 in 10 = 90% accuracy of the base call.

Score = 30
Probability that the base is called wrong = 1 in 1000 = 99.9% accuracy of the base call.
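These values follow from the standard Phred definition, Q = -10 × log10(e), where e is the estimated probability of an incorrect base call. A minimal sketch of the conversion in both directions (the function names are illustrative):

```python
import math


def phred_from_error(error_probability: float) -> float:
    """Convert an error probability e into a Phred quality score Q."""
    return -10 * math.log10(error_probability)


def error_from_phred(q: float) -> float:
    """Convert a Phred quality score Q back into an error probability e."""
    return 10 ** (-q / 10)


for q in (10, 20, 30, 40):
    e = error_from_phred(q)
    print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):g}%")
```

Each additional 10 points on the Phred scale corresponds to a tenfold drop in the error probability.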

Question 8

Q

Why are quality values important in NGS data?

Answer

A

Quality values are important for rejecting low-quality reads, trimming low-quality bases, improving alignment accuracy, and determining consensus sequences and variant calls.
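Trimming low-quality bases can be sketched as below. This example assumes FASTQ-style quality strings in Phred+33 encoding and a cut-off of Q20. Neither is specified on the card, and real trimming tools use more elaborate rules (for example sliding windows).

```python
def decode_phred33(quality: str) -> list[int]:
    """Decode a Phred+33 quality string into integer Phred scores."""
    return [ord(char) - 33 for char in quality]


def trim_3prime(sequence: str, quality: str, min_q: int = 20) -> tuple[str, str]:
    """Remove bases from the 3' end until one meets the quality cut-off."""
    scores = decode_phred33(quality)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0
print(trim_3prime("ACGTACGT", "IIIIII#!"))
```

A read whose bases all fall below the cut-off is trimmed to an empty string, which corresponds to rejecting the read.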

Question 9

Q

What is single-end reading?

Answer

A

In single-end reading, the sequencer reads each fragment from one end only, generating a single sequence of bases per fragment.

Question 10

Q

What is paired-end reading, what is its advantage?

Answer

A

In paired-end reading, the sequencer starts at one end of the fragment, reads in this direction for the specified read length, and then starts another round of reading from the opposite end of the fragment.

Paired-end reading improves the ability to identify the relative positions of various reads in the genome, making it much more effective than single-end reading in resolving structural rearrangements such as gene insertions, deletions, or inversions. It can also improve the assembly of repetitive regions. This degree of accuracy may not be required for all experiments, however, and paired-end reads are more expensive and time-consuming to perform than single-end reads.
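The relationship between a fragment and its two reads can be sketched as follows. This is a toy model: it assumes that the second read comes from the opposite strand, so it appears as the reverse complement of the far end of the fragment, and it ignores sequencing errors. The names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, read_length: int) -> tuple[str, str]:
    """Simulate error-free paired-end reads from the two ends of a fragment."""
    read1 = fragment[:read_length]                      # from one end
    read2 = reverse_complement(fragment)[:read_length]  # from the opposite end
    return read1, read2


print(paired_end_reads("AACCGGTTAC", 3))
```

Because both reads come from the same fragment, an aligner knows their expected orientation and approximate separation. Pairs that map too far apart, too close together or in an unexpected orientation point to deletions, insertions or inversions.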

Question 11

Q

Why is sequence alignment more difficult in NGS than it is for Sanger?

Answer

A

Alignment is substantially more difficult for NGS data than for Sanger data because of the shorter read lengths: when reads become too short, they can no longer be aligned accurately.

The platforms with shorter read lengths therefore produce more junk data, e.g. Suzuki et al. found that only half of the reads in the SOLiD data sets mapped to the reference, whereas the Roche platforms have a much higher alignment percentage due to their longer read lengths. Paired-end sequencing increases the number of matched reads.

Question 12

Q

Why is accurate alignment important?

Answer

A

The accuracy of the alignment has a crucial role in variant detection. Incorrectly aligned reads may lead to errors in SNP detection and genotype calling. It is important for alignment algorithms to cope with the sequencing errors, as well as the potentially real differences (point mutations and indels) between the reference genome and the sequenced genome.

Accurate alignment is also limited in repetitive regions, and regions of shared homology e.g. within closely related gene families and pseudogenes.

It is important for aligners to produce well calibrated alignment (or mapping) quality values, as variant calls and their posterior probabilities (calculated by incorporating the information from the next gen sequencing data as well as some prior information) depend on these scores.

Question 13

Q

What is the depth of coverage?

Answer

A

The depth of coverage is a measure of the number of times that a specific genomic site is sequenced during a sequencing run. This does not mean that every targeted base is sequenced every time; some segments may be read 100 or more times, while others might only be read once or twice, or not at all. The higher the number of times that a base is sequenced, the better the quality of the data.
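Per-base depth can be computed from aligned read positions, as in the sketch below. It assumes that reads are given as 0-based, end-exclusive (start, end) intervals on the target region; the function name and the data are made up for the example.

```python
def depth_of_coverage(reads: list[tuple[int, int]], region_length: int) -> list[int]:
    """Count how many reads cover each position of a region."""
    depth = [0] * region_length
    for start, end in reads:
        # Clip each read to the region, then count it at every covered position.
        for position in range(max(start, 0), min(end, region_length)):
            depth[position] += 1
    return depth


reads = [(0, 4), (2, 6), (3, 8)]
depth = depth_of_coverage(reads, region_length=10)
print(depth)
print("mean depth:", sum(depth) / len(depth))
print("uncovered positions:", [i for i, d in enumerate(depth) if d == 0])
```

The mean hides the variation described above: here some positions are covered three times and others not at all.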

Question 14

Q

What is required for the assembly and alignment of NGS data?

Answer

A

Assembly, alignment and analysis of NGS data require an adequate number of overlapping reads, or coverage. In practice, coverage across a sequenced region is variable. Apart from the Poisson-like randomness of library preparation, factors that may contribute to this variability include differential ligation of adaptors to template sequences and differential amplification during clonal template generation.
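The Poisson-like component of this variability can be sketched as follows. The example assumes that depth at a site follows an ideal Poisson distribution around the mean coverage. Since the other factors above add further variability, real data will typically have more poorly covered sites than this baseline suggests. The function name is illustrative.

```python
import math


def prob_depth_below(threshold: int, mean_depth: float) -> float:
    """P(depth < threshold) at a site, under an ideal Poisson model."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(threshold)
    )


for mean_depth in (20, 30, 40):
    fraction = prob_depth_below(20, mean_depth)
    print(f"mean {mean_depth}x: fraction of sites below 20x = {fraction:.4f}")
```

Even in this idealised model, a mean coverage equal to the target depth leaves a large share of sites below it, which is one reason the mean is usually set well above the minimum depth required.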

Question 15

Q

Why should the coverage depth be 20-30x?

Answer

A

Studies have shown that coverage of less than 20-30 fold begins to reduce the accuracy of SNP calls in NGS. Although the numbers vary between groups, at least 30-fold coverage is currently recommended in whole-genome scans for rare genetic variants in human genomes.

Question 16

Q

Summarise the stages for accurate detection of variants in NGS data.

Answer

A

The data must be aligned correctly.
Various algorithms are available; diagnostic packages allow you to alter settings to minimise the risk of missing pathogenic mutations such as small deletions.
Alignment quality MUST be checked.
Variant detection can then be performed: each base call will have a quality score based on a Phred score.
A threshold must be set for the percentage of reads in which a variant is seen (normally around 10%) to separate variants that are likely to be sequencing errors from those that are likely to be real.
Coverage of every base needs to be checked. Any base with a coverage of less than 30 is at risk of producing a false negative result and therefore needs to be repeated (normally by Sanger sequencing). However, every laboratory should validate its coverage thresholds, as higher coverage may be necessary to be confident in the results.
The number of times a variant appears in forward and reverse reads, the number of times the mutation is called versus the wild type, and the nature of the mutation (e.g. whether it lies in a homopolymer tract) can indicate whether the variant is real. Databases such as the Leiden Open Variation Database (LOVD), the DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources (DECIPHER) and the Catalogue of Somatic Mutations in Cancer (COSMIC) can help to confirm these variants.
Sequencing errors are still likely to be present in the data sets due to problems associated with each technology. Therefore, diagnostic laboratories are at present still using Sanger sequencing to confirm any variants detected; guidelines (2016) suggest that NGS used in diagnostic testing should include confirmation of the final result using a companion technology.
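The coverage, read-percentage and strand checks in the stages above can be sketched as a simple triage function. The thresholds of 30x and 10% are the ones quoted in the card. The record layout, the strand rule and the labels are assumptions made for this illustration, and every laboratory would validate its own rules.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    depth: int        # total reads covering the site
    alt_forward: int  # variant-supporting reads on the forward strand
    alt_reverse: int  # variant-supporting reads on the reverse strand


def assess(call: VariantCall, min_depth: int = 30, min_fraction: float = 0.10) -> str:
    """Apply coverage, read-fraction and strand checks to one variant call."""
    if call.depth < min_depth:
        return "repeat"        # coverage too low, e.g. fill in by Sanger
    alt_reads = call.alt_forward + call.alt_reverse
    if alt_reads / call.depth < min_fraction:
        return "likely error"  # seen in too few reads
    if call.alt_forward == 0 or call.alt_reverse == 0:
        return "review"        # seen on one strand only
    return "candidate"         # still needs confirmation


for call in (VariantCall(12, 6, 5), VariantCall(100, 2, 1),
             VariantCall(100, 25, 0), VariantCall(100, 24, 22)):
    print(call, "->", assess(call))
```

A "candidate" result is not a final answer: as noted above, variants are still confirmed with a companion technology.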

(16 cards)