Bioinformatics Pipelines Flashcards
What are the main steps in a bioinformatics pipeline?
1) Quality Control
2) Alignment
3) Pre-Processing
4) Variant Calling
5) Sample and Run level QC (Coverage etc)
6) Variant Filtering and Annotation
What happens in the alignment process?
Sequence alignment is the process of determining where the reads align to the reference genome. This process uses a Phred-based mapping quality score (MAPQ) to indicate confidence in the alignment. The standard way of storing aligned reads is in a BAM file.
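A minimal sketch of how this might look in practice, assuming BWA-MEM, samtools and pysam are available; the file names (ref.fasta, sample_R1.fastq.gz, etc.) are placeholders, not part of the flashcards:

```python
# Align paired-end reads, coordinate-sort, index, then inspect Phred-scaled
# mapping qualities (MAPQ) from the resulting BAM with pysam.
import subprocess
import pysam

# Align to the reference and sort the output into a BAM file.
subprocess.run(
    "bwa mem ref.fasta sample_R1.fastq.gz sample_R2.fastq.gz "
    "| samtools sort -o sample.sorted.bam -",
    shell=True, check=True,
)
subprocess.run(["samtools", "index", "sample.sorted.bam"], check=True)

# Each aligned read carries a MAPQ value expressing confidence in its placement.
with pysam.AlignmentFile("sample.sorted.bam", "rb") as bam:
    low_mapq = sum(1 for read in bam
                   if not read.is_unmapped and read.mapping_quality < 20)
    print(f"Reads with MAPQ < 20: {low_mapq}")
```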
What happens in the variant calling process?
Variant calling is the process of identifying variations between the sequenced sample and the reference genome. The typical input is a BAM file. Variant calling is a heterogeneous collection of algorithms tailored to the different types of sequence variant. The accuracy of variant calling is highly dependent on the quality of the called bases and aligned reads, so pre-variant-calling processing steps such as indel realignment and base quality score recalibration are frequently used.
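A minimal sketch of invoking a variant caller, assuming GATK HaplotypeCaller and placeholder file paths; the BAM is assumed to have been pre-processed as described below:

```python
# Call germline SNVs/indels with GATK HaplotypeCaller from a pre-processed BAM.
import subprocess

subprocess.run(
    [
        "gatk", "HaplotypeCaller",
        "-R", "ref.fasta",          # reference genome
        "-I", "sample.sorted.bam",  # aligned, pre-processed reads
        "-O", "sample.vcf.gz",      # output VCF of candidate variants
    ],
    check=True,
)
```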
What is involved with variant filtering?
Variant filtering is the process whereby variants representing false positive artefacts are flagged or filtered from the original VCF on the basis of alignment- and calling-associated metadata such as mapping quality and base-calling quality. This is usually a post-variant-calling step, but some variant callers include it as part of the variant calling algorithm. A minimal filtering sketch is shown after the list below.
Some attributes a variant can be filtered on include
- Consequence (synonymous)
- BED file
- Population frequency
- Location (exonic/intronic)
- Quality
- Intra-run frequency
- Depth (allele frequency)
- Conservation
- Reports in databases (HGMD, Clinvar etc)
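As referenced above, a minimal sketch of hard filtering on quality and depth, assuming pysam and placeholder file names and thresholds; production pipelines usually flag the FILTER column (e.g. with GATK VariantFiltration) rather than dropping records:

```python
# Keep only variants meeting simple QUAL and depth (DP) thresholds.
import pysam

MIN_QUAL = 30    # Phred-scaled variant quality
MIN_DEPTH = 10   # minimum total read depth (DP)

with pysam.VariantFile("sample.vcf.gz") as vcf_in, \
     pysam.VariantFile("sample.filtered.vcf", "w", header=vcf_in.header) as vcf_out:
    for rec in vcf_in:
        depth = rec.info.get("DP", 0)
        if rec.qual is not None and rec.qual >= MIN_QUAL and depth >= MIN_DEPTH:
            vcf_out.write(rec)
```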
What should be carried out as part of the initial validation of a pipeline?
The initial validation acts as a ‘dry’ validation using a truth set to assess the pipeline’s output. An example of this is Genome in a Bottle. There are some biases with this data, but it is useful for establishing baseline sensitivity and specificity. A ‘wet’ validation, in which the Genome in a Bottle reference material is sequenced in-house and run through the pipeline, can also be considered.
The sensitivity of the pipeline should then be established using clinical data, i.e. variants detected by another method as part of a diagnostic service and then also detected by NGS. The resulting sensitivity should be reported with a 95% confidence interval. Detecting 60/60 variants with no false negatives gives a lower confidence bound of approximately 0.95, while detecting 300/300 variants gives a 95% confidence interval with a lower bound >0.99 (see the worked example below).
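A worked example of the confidence bound calculation, assuming scipy is available; it uses the exact (Clopper-Pearson) one-sided lower bound, which for zero false negatives reduces to alpha**(1/n):

```python
# Exact one-sided lower confidence bound on sensitivity when `detected` of
# `total` known variants are found. 60/60 gives ~0.95 and 300/300 gives ~0.99,
# consistent with the figures quoted above.
from scipy.stats import beta

def sensitivity_lower_bound(detected: int, total: int, alpha: float = 0.05) -> float:
    """One-sided 95% lower confidence bound for sensitivity = detected/total."""
    if detected == 0:
        return 0.0
    return beta.ppf(alpha, detected, total - detected + 1)

for detected, total in [(60, 60), (300, 300)]:
    bound = sensitivity_lower_bound(detected, total)
    print(f"{detected}/{total}: lower 95% bound = {bound:.3f}")
```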
What should be done for small changes in the pipeline?
Prior to any changes being merged, a round of validation should be performed. A validation dataset should be maintained to standardise and simplify this process.
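One way such a regression check might look, as a sketch assuming pysam, a stored baseline VCF from the validation dataset, and placeholder file names; a real validation would also compare genotypes, FILTER status and key annotations:

```python
# Compare variants from the modified pipeline against a stored baseline VCF
# and report discordant calls.
import pysam

def variant_keys(path: str) -> set:
    """Return a set of (chrom, pos, ref, alts) tuples for all records in a VCF."""
    with pysam.VariantFile(path) as vcf:
        return {(rec.chrom, rec.pos, rec.ref, rec.alts) for rec in vcf}

baseline = variant_keys("baseline.vcf.gz")
candidate = variant_keys("new_pipeline.vcf.gz")

print(f"Missing from new pipeline (potential false negatives): {len(baseline - candidate)}")
print(f"New calls absent from baseline (potential false positives): {len(candidate - baseline)}")
```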
What can be carried out as part of the variant calling preprocessing? (GATK recommended)
Mark Duplicates:
This step identifies read pairs that are likely to have originated as duplicates of the same original DNA fragment through an artifactual process. These are considered to be non-independent observations, so the program tags all but one of the read pairs within each set of duplicates, causing them to be ignored by default during the variant discovery process. This step constitutes a major bottleneck since it involves making a large number of comparisons between all the read pairs belonging to the sample, across all of its read groups. It is followed by a sorting operation that also constitutes a performance bottleneck, since it also operates across all reads belonging to the sample. Both algorithms continue to be the target of optimization efforts to reduce their impact on latency.
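A minimal sketch of this step, assuming GATK (which wraps the Picard MarkDuplicates tool) and placeholder paths:

```python
# Mark duplicate read pairs; duplicates are flagged (0x400) in the output BAM
# and a metrics file summarising the duplication rate is written.
import subprocess

subprocess.run(
    [
        "gatk", "MarkDuplicates",
        "-I", "sample.sorted.bam",       # coordinate-sorted input BAM
        "-O", "sample.markdup.bam",      # output with duplicates flagged
        "-M", "sample.dup_metrics.txt",  # duplication metrics
    ],
    check=True,
)
```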
Recalibrate Base Quality Scores:
This consists of applying machine learning to detect and correct for patterns of systematic errors in the base quality scores, which are confidence scores emitted by the sequencer for each base. Base quality scores play an important role in weighing the evidence for or against possible variant alleles during the variant discovery process, so it’s important to correct any systematic bias observed in the data. Biases can originate from biochemical processes during library preparation and sequencing, from manufacturing defects in the chips, or instrumentation defects in the sequencer. The recalibration procedure involves collecting covariate statistics from all base calls in the dataset, building a model from those statistics, and applying base quality adjustments to the dataset based on the resulting model.
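A sketch of the two-step recalibration, assuming GATK, a known-sites resource such as dbSNP, and placeholder paths:

```python
# BaseRecalibrator builds the error model from covariate statistics;
# ApplyBQSR then rewrites the base quality scores using that model.
import subprocess

subprocess.run(
    [
        "gatk", "BaseRecalibrator",
        "-R", "ref.fasta",
        "-I", "sample.markdup.bam",
        "--known-sites", "dbsnp.vcf.gz",  # known variant sites excluded from the error model
        "-O", "recal.table",
    ],
    check=True,
)
subprocess.run(
    [
        "gatk", "ApplyBQSR",
        "-R", "ref.fasta",
        "-I", "sample.markdup.bam",
        "--bqsr-recal-file", "recal.table",
        "-O", "sample.recal.bam",
    ],
    check=True,
)
```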
What is involved with the two quality control steps of the pipeline?
1) The first step is carried out on the FASTQ files and looks at the sequencing quality using tools such as FastQC. Adapters can also be trimmed and low-quality reads removed at this step.
2) The second QC step, which occurs after alignment, looks at run and sample quality and can include aspects such as coverage and alignment metrics.
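A minimal sketch of both QC stages, assuming FastQC and samtools are installed and using placeholder paths; adapter/quality trimming (e.g. with cutadapt or Trimmomatic) would sit between the first QC step and alignment:

```python
# Stage 1: raw sequencing quality on the FASTQ files with FastQC.
# Stage 2: post-alignment run/sample QC (mapping rates, per-contig coverage).
import os
import subprocess

os.makedirs("qc", exist_ok=True)
subprocess.run(
    ["fastqc", "-o", "qc", "sample_R1.fastq.gz", "sample_R2.fastq.gz"],
    check=True,
)

subprocess.run(["samtools", "flagstat", "sample.sorted.bam"], check=True)
subprocess.run(["samtools", "coverage", "sample.sorted.bam"], check=True)
```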