Quality of NGS Reads, Reference mapping and SNP analyses Flashcards
What are the sequencing reads that go through quality control (QC)?
Short DNA sequences produced by the sequencing instrument. Read length may vary between 50 and 40,000 bases.
What formats do sequencing reads come in?
Standard flowgram format (SFF): a binary format; example: Roche 454 reads
FASTQ format: a text-based format; example: Illumina reads
What format are Illumina reads in?
FASTQ format: a text-based format that stores the bases (A, T, C, G) of each read together with a quality score for each base
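To make the FASTQ layout concrete, here is a minimal Python sketch of a parser. It assumes the common four-line record layout (a header starting with "@", the sequence, a "+" separator line, then the quality string) with no line wrapping; the function name parse_fastq and the example record are made up for illustration.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, quality_string) tuples from FASTQ text.

    Assumption: every record is exactly four lines and nothing is wrapped.
    """
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        # Unpacking fails with ValueError if the last record is incomplete.
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality string differ in length")
        yield header[1:], seq, qual


# Hypothetical one-read example.
example = "@read1\nGATTACA\n+\nIIIIII#\n"
records = list(parse_fastq(example))
```

For real data an established parser is usually the safer choice; this sketch only shows how the four lines of a record fit together.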
What is a quality score?
A value assigned to each base by the base-calling software, using information received from the instrument, that reflects confidence in the base call.
What does a quality score (Phred score) of 10 mean compared with 60?
Q=10 means a 1 in 10 chance that the base has been called incorrectly (90% accuracy)
Q=60 means a 1 in 10^6 (1 in a million) chance of error = 99.9999% accuracy
Why do we accept all bases with a quality score higher than 20?
A Phred score of 20 gives 99% accuracy (a 1 in 100 chance of error), which is better than a Phred score of 10 (90% accuracy)
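The figures above follow from the Phred relationship P = 10^(-Q/10), where P is the probability that the base call is wrong. A small Python sketch of the conversion follows; the decoding helper assumes the Phred+33 ASCII encoding, which is an assumption about the input file, and both function names are illustrative.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Turn a FASTQ quality string into a list of Phred scores.

    Assumption: the file uses the Phred+33 offset.
    """
    return [ord(ch) - 33 for ch in quality_string]


for q in (10, 20, 60):
    p = phred_to_error_probability(q)
    print(f"Q={q}: error probability {p:g}, accuracy {100 * (1 - p):.4f}%")
```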
What could cause a poor distribution in the per base sequence quality plot? How could this be addressed? (QC page with red, yellow and green zones)
Degradation of quality over the duration of long runs.
Quality trimming can help in such cases.
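As a rough illustration of quality trimming, the sketch below removes bases from the 3' end of a read for as long as their Phred score is below a threshold. This is a deliberately simple rule chosen for clarity; real trimming tools often use sliding windows or other schemes. The function name and the default threshold of 20 are assumptions.

```python
def trim_3prime(seq, quals, threshold=20):
    """Drop trailing bases whose Phred score is below the threshold.

    seq is the read sequence; quals is a list of Phred scores, one per base.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < threshold:
        end -= 1
    return seq[:end], quals[:end]


# Quality drops off towards the end of this made-up read.
trimmed_seq, trimmed_quals = trim_3prime("ACGTAC", [38, 37, 30, 25, 12, 8])
```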
How does the Per Tile Sequence Quality module work?
It records the average Phred score for each tile in a flow cell.
Each flow cell contains 8 lanes - 2 columns per lane - 50/60 tiles per column
Each tile is imaged 4 times, once per base, and a colour from blue to red is assigned
Red = poor quality
Blue = at or above average
What does Per Sequence GC Content tell you in QC? What does a deviation suggest?
It measures the GC content across the whole length of each sequence in a file and compares it to a modelled normal distribution of GC content. Deviations (jagged rather than smooth lines) may suggest a specific contaminant, for example adaptor dimers.
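The per-read value behind this plot is simply the fraction of G and C bases in each read. A minimal Python sketch, with a hypothetical function name and made-up reads:

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty read)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


reads = ["ATGC", "GGCC", "ATAT"]
gc_values = [gc_fraction(r) for r in reads]
```

Plotting a histogram of gc_values across all reads gives the distribution that is compared against the modelled normal curve.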
What is adaptor sequence removal and why do we carry it out in QC?
The primary reads from the sequencer also carry the adaptor sequences that were used for the sequencing reaction, and these need to be removed before processing. They interfere with downstream analyses.
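A simplified Python sketch of adaptor removal: it cuts the read at the first exact occurrence of the adaptor, and otherwise looks for a partial adaptor hanging off the 3' end. Exact matching, the min_overlap parameter and the adaptor string used here are all assumptions made for illustration; dedicated trimming tools also tolerate sequencing errors in the adaptor.

```python
def remove_adapter(read, adapter, min_overlap=3):
    """Return the read with a 3' adaptor (full or partial) removed."""
    idx = read.find(adapter)
    if idx != -1:
        # Full adaptor present: keep only what comes before it.
        return read[:idx]
    # Otherwise check whether the read ends with a prefix of the adaptor,
    # trying the longest possible prefix first.
    for k in range(min(len(adapter) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


adapter = "AGATCGGAAGAGC"  # example value only
full = remove_adapter("ACGTACGTAGATCGGAAGAGCTTT", adapter)
partial = remove_adapter("ACGTACGTAGATC", adapter)
```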
What does the Per Base Sequence Content tell you, and what are normal levels?
The proportion of each base (A, T, C, G) at each position in the reads, along with the GC content
The proportions should be equal or similar from position to position
Overrepresented sequences, such as adapter dimers or rRNA in a sample, may cause bias
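The per-position proportions can be sketched in Python as below. The function name and the toy reads are made up; reads of different lengths are handled by counting only the reads that reach a given position.

```python
from collections import Counter


def per_base_content(reads):
    """For each position, return the proportion of A, C, G and T."""
    length = max(len(r) for r in reads)
    result = []
    for pos in range(length):
        counts = Counter(r[pos] for r in reads if pos < len(r))
        total = sum(counts.values())
        result.append({base: counts.get(base, 0) / total for base in "ACGT"})
    return result


content = per_base_content(["ACGT", "ACGA", "TCGT", "GCGC"])
```

In this toy input position 1 is C in every read, which is the kind of strong per-position bias the plot is meant to expose.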
What does low/high sequence duplication tell you?
A low level of duplication may indicate a very high level of coverage of the target sequence.
A high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).
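One simple way to put a number on duplication is the fraction of reads that are exact copies of a read already seen. The Python sketch below uses that definition as an assumption; it is not how any particular QC tool computes its duplication estimate.

```python
from collections import Counter


def duplication_fraction(reads):
    """Fraction of reads that are exact repeats of an earlier read."""
    counts = Counter(reads)
    return 1 - len(counts) / len(reads)


dup = duplication_fraction(["AAA", "AAA", "CCC", "GGG"])
```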
Explain the process of de novo assembly.
- Reads are taken from genomic information. "Short reads" typically range in size from 35 to 1,000 bp; "long reads" typically range from 1,000 to 500,000 bp.
- Contig creation - Contigs are a set of overlapping oriented reads. A single contig is constructed from two or more overlapping and oriented reads. Reads must overlap by a minimum number of base pairs or k-mers before joining.
- Scaffold formation - Scaffolds are a set of joined-oriented contigs. A single scaffold is constructed from two or more joined and oriented contigs.
- Assembled genome/ chromosome - A single chromosome is constructed from two or more joined and oriented scaffolds.
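The contig step above can be illustrated with a toy Python sketch that joins two reads when the end of one exactly matches the start of the other by at least a minimum number of bases. The function names and the min_overlap default are assumptions, and real assemblers use far more elaborate methods (for example overlap or k-mer graphs) and tolerate sequencing errors.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of left that is a prefix of right."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads into one contig, or return None if they do not overlap."""
    k = overlap_length(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


contig = merge_reads("ATCGAT", "GATCAC")  # the reads share "GAT"
```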
What are the 3-mers of ATCGATCAC?
3-mer #0: ATC
3-mer #1: TCG
3-mer #2: CGA
3-mer #3: GAT
3-mer #4: ATC
3-mer #5: TCA
3-mer #6: CAC
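The list above can be generated by sliding a window of length k along the sequence one base at a time. A short Python sketch (the function name is illustrative):

```python
def kmers(seq, k):
    """Return every substring of length k, in order of starting position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


for index, kmer in enumerate(kmers("ATCGATCAC", 3)):
    print(f"3-mer #{index}: {kmer}")
```

A sequence of length L has L - k + 1 k-mers, so this 9-base sequence has 7 3-mers, and ATC appears twice.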
What is reference mapping and what is it useful for?
Reference mapping is aligning sequencing reads to the genome of a similar strain (the reference) to identify single nucleotide polymorphisms (SNPs), insertions and deletions (indels)
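Once a read has been placed on the reference, SNP candidates are the positions where the read and the reference disagree. The Python sketch below assumes an ungapped alignment whose start position is already known, so it cannot detect indels; the function name and the sequences are made up. In practice a SNP is called from many overlapping reads, not a single one.

```python
def find_snps(reference, read, start):
    """List mismatches between a read and the reference.

    start is the 0-based reference position where the read begins.
    Returns (position, reference_base, read_base) tuples.
    """
    snps = []
    for offset, base in enumerate(read):
        ref_base = reference[start + offset]
        if base != ref_base:
            snps.append((start + offset, ref_base, base))
    return snps


# The read covers reference positions 3 to 6 ("GATC") with one mismatch.
snps = find_snps("ATCGATCAC", "GTTC", start=3)
```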