Quality of NGS Reads, Reference mapping and SNP analyses Flashcards
What are the sequencing reads that go through quality control (QC)?
Short DNA sequences produced by the sequencing instruments. The length may vary between 50 and 40,000 bases.
What formats do QC sequencing reads come in?
Binary standard flowgram format (SFF format): Roche 454
FASTQ format: a text-based format; example: Illumina reads
What format are Illumina reads in?
FASTQ format: a text-based format that stores the bases of each read (A, T, C, G) together with a quality score for every base
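As a rough illustration of the FASTQ layout, the sketch below parses one four-line record and decodes its quality string. It assumes the Phred+33 encoding (quality character = score + 33); the read name, the example record and the helper `parse_fastq_record` are made up for this example.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into name, bases and Phred scores."""
    header, bases, plus, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumed Phred+33 encoding: score = ASCII code of the character minus 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], bases, scores


record = ["@read1", "GATTACA", "+", "IIII5+!"]  # hypothetical read
name, bases, scores = parse_fastq_record(record)
print(name, bases, scores)
```

Each base gets exactly one quality character, which is why the sequence and quality lines must have the same length.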
What does a quality score/Phred score of 10 compared to 60 mean?
QC=10 means a 1/10 chance that the base has been called incorrectly = 90% certainty
QC=60 means a 1/10^6 (1 in 1 million) chance that the base has been called incorrectly = 99.9999% certainty
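A Phred score Q and the probability P that a base call is wrong are related by Q = -10 x log10(P). The short sketch below converts in both directions; the function names are chosen for this example only.

```python
import math


def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def phred_score(p_error):
    """Phred score that corresponds to a given error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 60):
    p = error_probability(q)
    print(f"Q{q}: P(error) = {p:g}, accuracy = {100 * (1 - p):.4f}%")
```

Every extra 10 points on the Phred scale makes an error ten times less likely.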
What does an N represent in a sequencing read (instead of A, T, C or G)?
The base could not be called with enough confidence, therefore N was placed instead of a base.
Why do we accept all bases with a QC higher than 20?
A QC of 20 = 99% certainty in the base, so anything above 20 is more than 99% certain, whereas a QC of 10 is only 90% accurate.
What could cause a poor distribution when looking at per base sequence quality? How could this be avoided? (QC page with red, yellow and green bands)
Degradation of quality over the duration of long runs, so quality drops towards the end of the reads.
Quality trimming could be helpful in such cases.
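One simple form of quality trimming can be sketched as follows: bases are removed from the 3' end of a read until a base at or above the threshold is reached. This is a minimal sketch, not how any particular trimming tool works; real trimmers typically use sliding windows or running sums. The function name and the example read are assumptions.

```python
def trim_3prime(bases, scores, min_q=20):
    """Drop bases from the 3' end until one with score >= min_q is reached."""
    end = len(bases)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return bases[:end], scores[:end]


# Hypothetical read whose quality degrades towards the 3' end.
bases, scores = trim_3prime("GATTACAGG", [38, 37, 36, 35, 30, 25, 18, 12, 5])
print(bases, scores)
```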
How does the per tile sequence quality plot work?
It records the average Phred score for each tile in a flow cell.
Each flow cell contains 8 lanes, with 2 columns per lane and 50/60 tiles per column.
Each tile is imaged 4 times, once per base, and a colour is given on a scale from blue to red.
Red = poor quality
Blue = at or above average
What does per sequence QC tell you and why might this be low?
Displays the distribution of average quality per sequence, ideally as one peak; this needs to be 20+.
Low values could be caused by poor imaging or other technical problems during the run.
Low-quality reads must be removed to avoid problems, such as wrongly called bases, which could introduce bias in the downstream analysis.
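Removing low-quality reads can be sketched as a filter on the mean Phred score of each read. The threshold of 20 follows the card above; the tuple layout `(name, bases, scores)` and the function names are assumptions made for this example.

```python
def mean_quality(scores):
    """Arithmetic mean of the Phred scores of one read."""
    return sum(scores) / len(scores)


def filter_reads(reads, min_mean_q=20):
    """Keep (name, bases, scores) tuples whose mean score is >= min_mean_q."""
    return [r for r in reads if r[2] and mean_quality(r[2]) >= min_mean_q]


reads = [
    ("good", "ACGT", [35, 34, 30, 28]),  # hypothetical reads
    ("poor", "ACGT", [12, 10, 8, 2]),
]
kept = filter_reads(reads)
print([name for name, _, _ in kept])
```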
What does Per Sequence GC Content tell you in QC? What does deviation from the expected curve suggest?
It measures GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. Deviation (a jagged rather than smooth curve) may suggest a specific contaminant, for example adapter dimers.
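The per-read GC percentage behind this plot can be computed as in the sketch below. Ignoring N bases in the denominator is a choice made for this example, and the reads are invented.

```python
def gc_percent(bases):
    """Percentage of G and C among the called (non-N) bases of one read."""
    called = [b for b in bases.upper() if b in "ACGT"]
    if not called:
        return 0.0
    gc = sum(b in "GC" for b in called)
    return 100 * gc / len(called)


reads = ["ATGC", "GGCC", "ATAT", "GCNA"]  # hypothetical reads
for read in reads:
    print(read, round(gc_percent(read), 1))
```

Plotting how many reads fall at each GC percentage gives the curve that is compared against the modelled normal distribution.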
What is adapter sequence removal and why do we carry it out in QC?
Adapter sequences that were used for the sequencing reaction need to be removed from the reads before downstream analysis. Adapter removal is the process of defining these sequences so they can be found and trimmed from the reads.
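A minimal sketch of adapter removal, assuming exact matches only: the read is cut where the adapter starts, including the case where only the beginning of the adapter fits at the 3' end of the read. Dedicated tools also allow mismatches, which this sketch does not. The adapter string is a placeholder; use the sequence from your own library preparation kit.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove the adapter and everything after it from the 3' end of a read."""
    # Case 1: the full adapter occurs somewhere in the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first k bases of the adapter.
    start = min(len(adapter) - 1, len(read))
    for k in range(start, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "AGATCGGAAGAGC"  # placeholder adapter sequence
print(trim_adapter("ACGTACGT" + adapter + "TTT", adapter))
print(trim_adapter("ACGTACGTAGATC", adapter))
```

The `min_overlap` parameter is an assumption of this sketch: it stops one or two chance matching bases at the end of a read from being trimmed.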
What does the Per Base Sequence Content tell you, and what are normal levels?
The proportion of each base (A, T, C, G) at each position in the reads, and hence the GC content.
The proportions should be equal or similar across positions. Overrepresented sequences such as adapter dimers or rRNA in a sample may cause bias.
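The per-position base proportions can be tabulated as in this sketch. It assumes reads are plain strings; N bases count towards the total at a position but are not reported. The function name and the reads are invented for the example.

```python
from collections import Counter


def per_base_content(reads):
    """Return one {base: percent} dict per read position."""
    longest = max(len(r) for r in reads)
    table = []
    for i in range(longest):
        # Only reads long enough to cover position i contribute to it.
        counts = Counter(r[i] for r in reads if i < len(r))
        total = sum(counts.values())
        table.append({b: 100 * counts[b] / total for b in "ACGT"})
    return table


reads = ["ACGT", "ACGA", "TCGA", "GCGT"]  # hypothetical reads
table = per_base_content(reads)
for pos, row in enumerate(table, start=1):
    print(pos, row)
```

In this toy data position 2 is always C, the kind of strong positional bias that an overrepresented sequence would produce.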
What does low/high sequence duplication tell you?
A low level of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).
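A simple way to put a number on duplication is to count reads that are exact copies of a sequence already seen. This is a sketch under that exact-match assumption; QC tools may sample or truncate reads before counting, so their figures can differ.

```python
from collections import Counter


def duplication_percent(reads):
    """Percentage of reads that are extra copies of an already-seen sequence."""
    counts = Counter(reads)
    duplicates = len(reads) - len(counts)  # reads beyond the first copy
    return 100 * duplicates / len(reads)


reads = ["ACGT", "ACGT", "ACGT", "TTGA", "CCAG"]  # hypothetical reads
print(duplication_percent(reads))
```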
Explain the process of De novo assembly?
De novo assembly is used when we have no a priori knowledge of the correct sequence.
1. Reads are taken from genomic information. "Short reads" typically range in size from 35 to 1,000 bp; "long reads" typically range in size from 1,000 to 500,000 bp.
2. Contig creation - Contigs are sets of overlapping, oriented reads. A single contig is constructed from two or more overlapping and oriented reads.
3. Scaffold formation - Scaffolds are sets of joined, oriented contigs. A single scaffold is constructed from two or more joined and oriented contigs.
4. Assembled genome/chromosome - A single chromosome is constructed from two or more joined and oriented scaffolds.
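The contig-creation step can be illustrated with a toy greedy assembler that repeatedly merges the two sequences with the longest suffix-prefix overlap. This is only a sketch: it assumes error-free reads that all come from the same strand, and production assemblers use more elaborate overlap or de Bruijn graph methods. The function names and the reads are invented.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]  # hypothetical overlapping reads
print(greedy_assemble(reads))
```

Reads that share no overlap of at least `min_len` bases stay as separate contigs, which is how gaps appear in an assembly.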
Why do we have gaps in de novo assembly and how can they be avoided?
Gaps arise in regions of DNA in which the same sequence of bases is repeated, or in sections of the genome where DNA is densely packed.
Paired-end DNA sequencing reads provide high-quality alignment across DNA regions containing repetitive sequences, and produce long contigs for de novo sequencing by filling gaps in the consensus sequence.