Genomics: Reference Mapping Flashcards
Quality of HGS Reads and SNP Analyses
What are the different sequencing read formats?
Sequencing read formats are file types used to store DNA sequence data generated by sequencing technologies.
Sequence Reads: What are the 2 formats?
Binary Standard Flowgram Format (SFF): used for sequencing data from older platforms like Roche 454.
Includes sequence data, quality scores, and flowgram information.
FASTQ Format: A text-based format widely used in modern sequencing technologies like Illumina.
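Contains four lines per read: an identifier line starting with "@", the sequence, a separator line starting with "+", and a quality string with one character per base. It is human-readable, widely compatible with bioinformatics tools, and stores both sequence and quality information.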
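To make the four-line FASTQ layout concrete, here is a minimal sketch of a reader. The function name `parse_fastq` and the two example reads are made up for illustration, and the sketch assumes unwrapped records (exactly four lines each); it is not a replacement for an established parser.

```python
from io import StringIO

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a FASTQ text stream.

    Assumes each record is exactly four lines (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # separator line starting with '+', ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

# Two hypothetical reads, held in memory instead of a file.
example = StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(parse_fastq(example))
```

Each record comes back as a tuple, so the sequence and its quality string stay paired for later quality checks.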
Why is it important to have quality scores in sequencing reads?
Quality scores indicate the accuracy of each base call in sequencing reads. They help in detecting errors, filtering low-quality data, improving confidence in variant calling, optimizing data usage, and ensuring consistency across experiments. A higher quality score means greater confidence in the sequencing data.
What does a Q30 score mean in sequencing?
A Q30 score means there is a 1 in 1,000 chance of an incorrect base call, representing 99.9% accuracy for that base.
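The Q30 figure follows from the Phred definition, Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below works through that arithmetic. The function names are illustrative, and the ASCII offset of 33 is an assumption that holds for Sanger-style and current Illumina FASTQ files but not for some older encodings.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)

def decode_quality_string(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]

q30_error = phred_to_error_prob(30)      # 1 in 1,000
q20_error = phred_to_error_prob(20)      # 1 in 100
scores = decode_quality_string("I!5")    # 'I' -> 40, '!' -> 0, '5' -> 20
```

Every step of 10 in Q is a tenfold change in error probability, so Q20 corresponds to 99% accuracy and Q40 to 99.99%.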
How are quality scores used in filtering sequencing data?
Quality scores allow researchers to remove low-quality bases or reads from data to ensure only high-confidence sequences are used in analysis, reducing errors in downstream processes.
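As an illustration of read-level filtering, the sketch below keeps only reads whose mean Phred score meets a threshold. The threshold of 20, the function names, and the example reads are assumptions chosen for the example. Averaging Phred scores is a common simplification, and real pipelines may use other criteria.

```python
def mean_quality(quality, offset=33):
    """Mean Phred score of one read (assumes Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores)

def filter_reads(reads, min_mean_q=20):
    """Keep (read_id, sequence, quality) tuples with mean quality >= min_mean_q."""
    return [
        (read_id, seq, qual)
        for read_id, seq, qual in reads
        if qual and mean_quality(qual) >= min_mean_q
    ]

# Hypothetical reads: one high quality, one very low quality.
reads = [("good", "ACGT", "IIII"), ("bad", "ACGT", "!!#!")]
kept = filter_reads(reads)
```

Only the "good" read survives here, because the "bad" read has a mean score below 1.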
Why are quality scores important for variant calling?
Quality scores provide confidence in the base calls, helping distinguish between true genetic variants and sequencing errors, which is critical for accurate mutation detection.
How do quality scores improve comparability across experiments?
By providing a standard measure of sequencing accuracy, quality scores ensure that data from different experiments or platforms can be compared and assessed consistently.
QC: What is the per base quality distribution?
The per-base quality distribution shows the range of quality scores for all bases at each position in the sequencing reads. It helps evaluate the reliability of the sequencing data by visualizing how quality varies across the length of the read.
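A per-base quality distribution can be sketched by collecting the scores seen at each read position and summarising them. The example below is an assumption-laden illustration (Phred+33 encoding, made-up quality strings, a hypothetical function name); QC tools typically report box plots with quartiles rather than just these three statistics.

```python
from statistics import mean, median

def per_base_quality(quality_strings, offset=33):
    """Summarise Phred scores at each read position across many reads."""
    by_position = {}
    for quality in quality_strings:
        for pos, ch in enumerate(quality):
            by_position.setdefault(pos, []).append(ord(ch) - offset)
    return [
        {
            "position": pos + 1,      # 1-based position in the read
            "n": len(scores),         # number of reads covering this position
            "mean": mean(scores),
            "median": median(scores),
            "min": min(scores),
        }
        for pos, scores in sorted(by_position.items())
    ]

# Three hypothetical reads of different lengths, with quality falling towards the 3' end.
summary = per_base_quality(["IIII5", "III5!", "II5"])
for row in summary:
    print(row)
```

In this toy input the mean drops steadily from position 3 onwards, which is the pattern a per-base quality plot makes visible.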
When would you see poor quality distribution?
Poor quality distribution is typically observed:
- At the ends of long reads: Quality often decreases due to instrument or chemistry limitations.
- In degraded samples: DNA degradation or poor library preparation can reduce overall quality.
- In long sequencing runs: Quality generally degrades as the run progresses.
- In reads with high error rates: Caused by factors like incorrect adapter ligation or contamination.
How can poor quality distribution be addressed?
Poor quality can be mitigated by applying quality trimming, which removes low-quality bases (e.g., from the ends of reads) before downstream analysis.
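A minimal sketch of 3' quality trimming is shown below: it strips bases from the end of the read until it reaches one at or above a threshold. The function name, the threshold of 20, and the example read are assumptions. Dedicated trimming tools generally use more robust schemes, such as sliding windows, than this base-by-base rule.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q (assumes Phred+33)."""
    end = len(quality)
    while end > 0 and ord(quality[end - 1]) - offset < min_q:
        end -= 1
    # Sequence and quality are cut at the same point so they stay aligned.
    return sequence[:end], quality[:end]

# Hypothetical read: scores 40,40,40,40,40,20,2,0; the last two bases fall below Q20.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII5#!")
```

Here the last two bases are removed and the read is returned as "ACGTAC" with its matching quality string.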
QC: Why is per tile sequence quality useful?
It helps detect localized problems on the flow cell, such as:
- Uneven illumination.
- Contamination or damage.
Once identified, problematic tiles can be excluded from downstream analysis.
How is per tile quality assessed?
It is visualized using heat maps or graphs that show quality distribution. Uniform colour indicates consistent quality, while deviations highlight problem areas.
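The numbers behind such a heat map can be sketched by grouping reads by tile and averaging their scores. The example assumes read identifiers in the Illumina style `instrument:run:flowcell:lane:tile:x:y`, so that the fourth and fifth colon-separated fields are lane and tile; the identifiers, function name, and Phred+33 encoding are all illustrative assumptions.

```python
from collections import defaultdict

def per_tile_mean_quality(reads, offset=33):
    """Return {(lane, tile): mean Phred score} for (read_id, seq, quality) tuples."""
    totals = defaultdict(lambda: [0, 0])  # (lane, tile) -> [score sum, base count]
    for read_id, _seq, quality in reads:
        fields = read_id.split(" ")[0].split(":")
        tile = (fields[3], fields[4])     # assumes Illumina-style identifiers
        totals[tile][0] += sum(ord(ch) - offset for ch in quality)
        totals[tile][1] += len(quality)
    return {tile: s / n for tile, (s, n) in totals.items() if n}

# Hypothetical reads from two tiles on lane 1.
reads = [
    ("M1:7:FC1:1:1101:100:200 1:N:0:1", "ACGT", "IIII"),
    ("M1:7:FC1:1:1101:150:250 1:N:0:1", "ACGT", "5555"),
    ("M1:7:FC1:1:2104:90:80 1:N:0:1", "ACGT", "!!!!"),
]
tile_quality = per_tile_mean_quality(reads)
```

A tile whose mean sits well below the others, like tile 2104 in this toy input, is the kind of deviation a per-tile plot highlights.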
What is the role of quality scores in quality control?
Quality scores help identify low-quality reads caused by poor imaging or technical issues during sequencing. Filtering on these scores ensures that only high-quality reads are used for downstream analysis.
Why must low-quality reads be removed from downstream analysis?
Low-quality reads can introduce errors (e.g., wrongly called bases) that bias the results of downstream processes, such as alignment, assembly, or variant calling.
How are low-quality reads identified?
Low-quality reads are identified using quality-control software (e.g., FastQC) that evaluates quality scores and flags sequences with poor overall quality values.