18.02.16 NGS analysis and quality Flashcards
What are the causes of error in NGS data?
1) Base calling errors
2) Alignment errors
3) Low coverage sequencing. As the coverage decreases, there is a higher probability that only one of the two chromosomes of a diploid individual has been sampled at a specific site, so a heterozygous variant can be missed.
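The effect of low coverage on a diploid sample can be illustrated with a small calculation. This is a minimal sketch that assumes each read at a site is drawn independently and with equal probability from either chromosome; the function name `prob_single_chromosome` is made up for this example.

```python
def prob_single_chromosome(depth: int) -> float:
    """Probability that all reads at a site come from the same chromosome.

    Assumes each read samples one of the two chromosomes independently
    with probability 0.5 (an idealised model, not a property of any platform).
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    # Two ways to see only one chromosome: every read from copy 1,
    # or every read from copy 2.
    return 2 * 0.5 ** depth


if __name__ == "__main__":
    for depth in (1, 2, 4, 8, 20):
        print(f"depth {depth:>2}: P(only one chromosome seen) = "
              f"{prob_single_chromosome(depth):.6f}")
```

Under this model the probability halves with every additional read, which is why the risk is concentrated at very low depths.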
What are the factors that influence NGS error rate?
NGS error rates depend on several factors, including signal-to-noise levels, cross-talk from nearby beads or clusters, homopolymer count, incomplete extension and position on the read (slightly worse at the beginning and end of the read).
What are the error patterns seen with the Roche 454 platform?
Roche (454) and Ion Torrent platforms: the signal intensity is rounded to an integer to give the number of monomers of the corresponding base that were incorporated. A read error occurs whenever the signal intensity is more than 0.5 from the true value. Most errors are overcalls (insertions of a base into the sequence, 65-75%) or undercalls (deletions of a base from the sequence, 20-30%). Miscalls are much rarer (<5%) and are typically due to undercall/overcall pairs.
Main problem – the variance of signal intensity for a homopolymer length is large, resulting in high error rates in insertion and deletion (indel) calls.
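The rounding rule described for flow-based platforms can be sketched as follows. The function names and the example signal values are hypothetical and chosen only to illustrate the 0.5 threshold; real base callers use more elaborate signal models.

```python
import math


def call_homopolymer_length(signal: float) -> int:
    """Round a flow signal to the nearest whole number of incorporated bases."""
    # floor(x + 0.5) rounds halves upward; negative signals are clamped to 0.
    return max(0, math.floor(signal + 0.5))


def classify_call(signal: float, true_length: int) -> str:
    """Compare the called homopolymer length with the true length."""
    called = call_homopolymer_length(signal)
    if called > true_length:
        return "overcall"   # extra base(s) reported: an insertion error
    if called < true_length:
        return "undercall"  # base(s) missing: a deletion error
    return "correct"


if __name__ == "__main__":
    # Hypothetical signals for a true homopolymer of length 3.
    for signal in (3.4, 3.6, 2.4):
        print(signal, classify_call(signal, true_length=3))
```

Because the spread of the signal grows with homopolymer length, longer runs are more likely to land more than 0.5 away from the true value.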
What are the error patterns seen with the Illumina platforms?
On Illumina platforms, indel errors are rare but the overall miscall error rate is typically around 1%.
The main complication arises from the synthesis process becoming desynchronised between different copies of DNA templates in the same cluster.
The two major triggers are inverted repeats and GGC sequences. Base calling becomes less accurate in later cycles as the extent of asynchrony is exacerbated with each sequencing cycle.
What are the error patterns seen with the SOLiD platform?
The SOLiD platform (ligation technology/short reads) uses a two-base encoding chemistry scheme in which each fluorescent dye colour represents four dinucleotide combinations (oligos). Distinctive reagents in this technology include DNA ligase and a universal primer. Each base of the DNA template is examined twice in this system and a length m nucleotide sequence is represented as a length m-1 colour sequence.
A major complication in colour calling arises from biases in fluorescence intensities that appear in later machine cycles.
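Two-base encoding can be sketched in a few lines. The mapping below (A=0, C=1, G=2, T=3, colour = XOR of adjacent base codes) is the commonly described colour-space scheme and is used here as an assumption; the names `BASE_CODE` and `to_colour_space` are made up for this example.

```python
from typing import List

# Assumed 2-bit codes for the four bases; the colour of a dinucleotide
# is the XOR of the codes of its two bases.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colour_space(seq: str) -> List[int]:
    """Convert a nucleotide sequence of length m into m-1 colours."""
    codes = [BASE_CODE[base] for base in seq.upper()]
    # Each adjacent pair of bases yields one colour, so every interior
    # base contributes to two colours.
    return [left ^ right for left, right in zip(codes, codes[1:])]


if __name__ == "__main__":
    seq = "ATGGC"
    colours = to_colour_space(seq)
    print(seq, "->", colours)          # 5 bases give 4 colours
    # Each colour stands for four dinucleotides:
    for colour in range(4):
        pairs = [a + b for a in "ACGT" for b in "ACGT"
                 if BASE_CODE[a] ^ BASE_CODE[b] == colour]
        print(colour, pairs)
```

Since the colours only encode transitions between bases, decoding back to nucleotides needs one known starting base.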
What method is used to calculate base calling accuracy?
Although the calculation of quality varies between platforms, the base calling accuracy is the most common metric and the calculation is related to the historically relevant Phred score (Q score), introduced in 1998 for Sanger sequence data. The Phred quality value, Q, converts the estimated probability of an incorrect call by the sequencer at a given base, e, to a log scale: Q = -10 × log10(e).
What do Phred quality scores of 10 and 30 translate to?
Score = 10
Probability base is called wrong = 1 in 10 = 90% accuracy of the base call.
Score = 30
Probability base is called wrong = 1 in 1000 = 99.9% accuracy of the base call.
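These two values follow from the standard Phred definition, Q = -10 × log10(e), where e is the estimated error probability. A minimal sketch of the conversion in both directions (function names are illustrative only):

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred score: e = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(e: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(e)."""
    if not 0 < e <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(e)


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        e = phred_to_error_prob(q)
        print(f"Q{q}: error probability {e:g}, accuracy {100 * (1 - e):g}%")
```

Each increase of 10 in Q corresponds to a tenfold drop in the error probability.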
Why are quality values important in NGS data?
Quality values are important for rejecting low quality reads, trimming low quality bases, improving alignment accuracy and determining consensus sequence and variant calls.
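As an illustration of trimming low quality bases, the sketch below decodes a FASTQ quality string and removes low-scoring bases from the 3' end. It assumes the Phred+33 ASCII encoding, and the cut-off of Q20 is an arbitrary example value rather than a recommendation from the source; dedicated trimming tools use more refined rules.

```python
from typing import List, Tuple


def decode_phred33(quality_string: str) -> List[int]:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(char) - 33 for char in quality_string]


def trim_3prime(seq: str, quals: List[int], min_q: int = 20) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until one reaches min_q (assumed threshold)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


if __name__ == "__main__":
    seq, qual = "ACGT", "II5#"   # toy read, not real data
    scores = decode_phred33(qual)
    print(scores)
    print(trim_3prime(seq, scores))
```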
What is single-end reading?
In single-end reading, the sequencer reads a fragment from only one end to the other, generating the sequence of base pairs.
What is paired-end reading, what is its advantage?
In paired-end reading, the sequencer starts at one end of the fragment, reads in this direction up to the specified read length, and then starts another round of reading from the opposite end of the fragment.
Paired-end reading improves the ability to identify the relative positions of various reads in the genome, making it much more effective than single-end reading in resolving structural rearrangements such as gene insertions, deletions, or inversions. It can also improve the assembly of repetitive regions. This degree of accuracy may not be required for all experiments, however, and paired-end reads are more expensive and time-consuming to perform than single-end reads.
Why is sequence alignment more difficult in NGS than it is for Sanger?
Alignment is substantially more difficult for NGS data than for Sanger data because of the shorter read lengths. When reads become too short, they can no longer be aligned accurately.
The platforms with shorter read lengths therefore produce more junk data. For example, Suzuki et al. found that only half of the reads in the SOLiD data sets were mapped to the reference, whereas the Roche platforms have a much higher alignment percentage due to their longer read lengths. Paired-end sequencing increases the number of matched reads.
Why is accurate alignment important?
The accuracy of the alignment has a crucial role in variant detection. Incorrectly aligned reads may lead to errors in SNP detection and genotype calling. It is important for alignment algorithms to cope with the sequencing errors, as well as the potentially real differences (point mutations and indels) between the reference genome and the sequenced genome.
Accurate alignment is also limited in repetitive regions, and regions of shared homology e.g. within closely related gene families and pseudogenes.
It is important for aligners to produce well calibrated alignment (or mapping) quality values, as variant calls and their posterior probabilities (calculated by incorporating the information from the next gen sequencing data as well as some prior information) depend on these scores.
What is the depth of coverage?
The depth of coverage is a measure of the number of times that a specific genomic site is sequenced during a sequencing run. This does not mean that every targeted base is sequenced every time; some segments may be read 100 or more times, while others might only be read once or twice, or not at all. The higher the number of times that a base is sequenced, the better the quality of the data.
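Per-site depth can be computed directly from read positions. This is a minimal sketch with made-up read intervals (0-based, end-exclusive); real pipelines derive these from alignment files with dedicated tools.

```python
from typing import List, Tuple


def depth_per_position(region_length: int, reads: List[Tuple[int, int]]) -> List[int]:
    """Count how many reads cover each position of a region.

    `reads` holds (start, end) intervals, 0-based and end-exclusive.
    """
    depth = [0] * region_length
    for start, end in reads:
        # Clip each read to the region before counting.
        for pos in range(max(0, start), min(region_length, end)):
            depth[pos] += 1
    return depth


if __name__ == "__main__":
    toy_reads = [(0, 5), (3, 8), (4, 6)]   # hypothetical alignments
    depth = depth_per_position(10, toy_reads)
    print("per-site depth:", depth)
    print("mean depth:", sum(depth) / len(depth))
    print("uncovered sites:", depth.count(0))
```

Even in this toy example the depth is uneven: some sites are covered three times and others not at all, although the mean depth is above 1.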
What is required for the assembly and alignment of NGS data?
Assembly, alignment and analysis of NGS data require an adequate number of overlapping reads, or coverage. In practice, coverage across a sequenced region is variable. Factors other than the Poisson-like randomness of library preparation that may contribute to this variability include differential ligation of adaptors to template sequences and differential amplification during clonal template generation.
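The Poisson-like component of coverage variability can be sketched under an idealised model in which reads land uniformly at random. Mean coverage is taken as (number of reads × read length) / target size, and the depth at a site is assumed to follow a Poisson distribution with that mean. The function names and the example numbers are illustrative only, and real data are typically more variable than this model predicts for the reasons given above.

```python
import math


def mean_coverage(n_reads: int, read_length: int, target_size: int) -> float:
    """Expected depth: total sequenced bases divided by target size."""
    return n_reads * read_length / target_size


def prob_depth_below(k: int, mean: float) -> float:
    """P(depth < k) for a Poisson-distributed depth with the given mean."""
    return sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))


if __name__ == "__main__":
    # Hypothetical run: 1,000,000 reads of 100 bp over a 10 Mb target.
    c = mean_coverage(1_000_000, 100, 10_000_000)
    print("mean coverage:", c)
    print("P(site not covered):", prob_depth_below(1, c))
    print("P(depth < 5):", prob_depth_below(5, c))
```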
Why should the coverage depth be 20-30x?
Studies have shown that coverage of less than 20-30 fold begins to reduce the accuracy of SNP calls in NGS. Although the numbers vary between groups, at least 30 fold coverage is currently recommended in whole genome scans for rare genetic variants in human genomes.