QC of NGS data Flashcards
What is a FASTQ file?
A FASTQ file is a text-based format used to store nucleotide sequences along with their corresponding quality scores.
What are the four lines in a FASTQ file sequence entry?
- Sequence Identifier: A header line starting with ‘@’ followed by a unique identifier.
- Raw Sequence: The actual nucleotide sequence (A, T, C, G, with N marking a base the sequencer could not call).
- Quality Identifier: A line starting with ‘+’ that may or may not repeat the sequence identifier.
- Quality Scores: A string of ASCII characters, one per base and the same length as the sequence line, encoding the quality score of each nucleotide.
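The four-line layout above can be read with a few lines of Python. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the two example reads are made up for illustration.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (a blank line also stops the reader)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with '@' and line 3 with '+'.
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"Malformed record near {header!r}")
        # Line 4 carries exactly one quality character per base.
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Hypothetical two-read example; the second '+' line repeats the identifier.
example = StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGTN\n+read2\nII5#!\n")
records = list(read_fastq(example))
```

For real data an established parser is usually the safer choice; the sketch only shows how the four lines map to fields.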
What do Phred quality scores indicate?
Phred quality scores indicate the confidence level of each base call made by the sequencer.
How is a Phred quality score calculated?
The Phred quality score Q is calculated using the formula: Q = -10 * log10(P), where P is the probability that a given base call is incorrect.
What does a Phred score of 30 indicate?
A Phred score of 30 corresponds to a 1 in 1,000 (0.1%) chance that the base call is wrong, i.e. 99.9% accuracy.
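The formula Q = -10 * log10(P) can be checked numerically. The sketch below converts in both directions and decodes a quality string into scores. The offset of 33 (Phred+33) is an assumption about the file's encoding, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Apply Q = -10 * log10(P) for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_quality(quality_string, offset=33):
    """Turn each ASCII quality character into its Phred score.

    offset=33 assumes Phred+33 encoding; older files may use another offset.
    """
    return [ord(ch) - offset for ch in quality_string]


p_q30 = phred_to_error_prob(30)      # about 0.001, i.e. 0.1%
q_from_p = error_prob_to_phred(0.01)  # about 20
scores = decode_quality("I5#!")       # 'I' -> 40, '5' -> 20, '#' -> 2, '!' -> 0
```

Each step of 10 in Q divides the error probability by 10: Q10 is 10%, Q20 is 1%, Q30 is 0.1%.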
What are the two main phases of the NGS data analysis workflow?
The two main phases are Primary Analysis and Secondary Analysis.
What does the Primary Analysis phase involve?
- Raw Data Processing: Converting raw instrument signals into sequence data (FASTQ files).
- Quality Control (QC): Assessing and filtering data based on quality scores to ensure high-quality reads.
- Genome Mapping: Aligning sequences to a reference genome to identify their locations.
What are key steps in Primary Analysis?
- Quality Check: Evaluating the quality of sequences and removing low-quality reads.
- Adapter Removal: Trimming adapter sequence out of reads, and discarding reads that consist mostly of adapter.
- High-Quality Filtered Data: Producing cleaned data with statistics on quality metrics.
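The adapter removal step can be illustrated with a simplified sketch. It assumes the adapter appears as an exact match at the 3' end of the read, either in full or as a truncated prefix. Dedicated trimmers also tolerate mismatches, which this sketch does not. The function name, the `min_overlap` parameter, and the adapter string are chosen for illustration only.

```python
def trim_adapter(sequence, quality, adapter, min_overlap=3):
    """Cut a read at the first exact adapter match, keeping quality in sync.

    Simplified: exact matching only, no mismatches or indels allowed.
    """
    # Case 1: the whole adapter occurs somewhere in the read.
    pos = sequence.find(adapter)
    if pos == -1:
        # Case 2: the read ends partway through the adapter, so only an
        # adapter prefix of at least min_overlap bases is present.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, quality  # no adapter detected
    return sequence[:pos], quality[:pos]


ADAPTER = "AGATCGGAAGAGC"  # example adapter string, assumed for this sketch

full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

A very small `min_overlap` removes more genuine sequence by chance, while a large one misses short adapter remnants, so the value is a trade-off.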
What applications are focused on in the Secondary Analysis phase?
- ChIP-Seq: Analyzing transcription factor binding sites and identifying motifs.
- RNA-Seq: Investigating differential gene expression and transcript analysis.
- Whole Genome DNA Sequencing: Performing variant calling and genomic feature identification.
What are good characteristics of sequencing data?
- High quality scores (indicating low error rates).
- Low adapter contamination.
- Low duplication rates (to avoid biases in data interpretation).
- No GC bias (ensuring even representation across GC content).
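Two of these characteristics are easy to measure directly. The sketch below computes per-read GC fraction and a simple duplication rate. It treats only exact sequence duplicates as duplicates, which is a simplifying assumption, and the function names are illustrative.

```python
from collections import Counter


def gc_fraction(sequence):
    """Fraction of called bases (A, C, G, T) that are G or C; N is ignored."""
    seq = sequence.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / called


def duplication_rate(sequences):
    """Share of reads that are exact copies of an earlier read."""
    if not sequences:
        return 0.0
    unique = len(Counter(sequences))
    return 1 - unique / len(sequences)


gc = gc_fraction("GGCCAATT")                                # 0.5
dup = duplication_rate(["ACGT", "ACGT", "TTTT", "GGGG"])    # 0.25
```

Comparing the distribution of per-read GC fractions against the expected value for the organism is one way to spot GC bias.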
What are some trimming techniques?
- Filtering: Removing all reads below a certain quality threshold.
- Cropping: Trimming bases from one or both ends of reads to remove low-quality regions.
- Removing Short Reads: Discarding reads that are too short after cropping.
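The three techniques above can be combined into one pass. This is a sketch under stated assumptions: Phred+33 encoding, thresholds picked only for illustration (`min_q=20`, `min_mean_q=20`, `min_length=4`), and reads given as (name, sequence, quality) tuples. Here cropping runs first, then the length check, then the mean-quality filter; other orders are possible.

```python
def mean_quality(quality, offset=33):
    """Average Phred score of a quality string (0.0 for an empty string)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def crop_low_quality_ends(sequence, quality, min_q=20, offset=33):
    """Cropping: strip bases below min_q from both ends of the read."""
    start, end = 0, len(sequence)
    while start < end and ord(quality[start]) - offset < min_q:
        start += 1
    while end > start and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    return sequence[start:end], quality[start:end]


def clean_reads(reads, min_q=20, min_mean_q=20, min_length=4, offset=33):
    """Crop each read, drop reads that became too short, then filter by mean quality."""
    kept = []
    for name, seq, qual in reads:
        seq, qual = crop_low_quality_ends(seq, qual, min_q, offset)
        if len(seq) < min_length:
            continue  # removing short reads
        if mean_quality(qual, offset) < min_mean_q:
            continue  # filtering on overall quality
        kept.append((name, seq, qual))
    return kept


# Hypothetical reads: 'I' is Q40 and '#' is Q2 under Phred+33.
reads = [
    ("good", "ACGTACGT", "IIIIIIII"),
    ("ends", "ACGTACGT", "##IIII##"),
    ("short", "ACGTACGT", "###II###"),
    ("lowmean", "ACGTACGT", "I####I##"),
]
cleaned = clean_reads(reads)
```

In this example only `good` and the cropped `ends` read survive; `short` falls below the length cutoff after cropping and `lowmean` fails the mean-quality filter.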
What are common tools for trimming?
- Trimmomatic
- Trim Galore
- FASTX Toolkit
- Galaxy Trimming Tools
Why is proper handling of primers and adapters important?
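Primers and adapters are added during library preparation and are not part of the sample. If their sequence is left in the reads, it acts as contamination: it can lower mapping rates and introduce false mismatches, so it should be trimmed before downstream analysis to ensure accurate results.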