Informatics In Sequencing Flashcards
NGS bioinformatics workflow
Step 1: analysis of raw data
Step 2: read alignment and variant calling
Step 3: annotation and variant prioritisation
Base calling
Conversion of fluorescence signals into actual sequence data with quality scores
A Q-score of 30 (Q30) corresponds to a 0.1 percent error rate in base calling, and is widely considered a benchmark for high quality data.
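The Q30 figure above can be checked numerically. This is a minimal sketch, assuming the standard Phred definition Q = -10 * log10(p), where p is the probability that a base call is wrong. The function names are illustrative, not taken from any sequencing toolkit.

```python
import math

def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality score, assuming Q = -10 * log10(p)."""
    return 10 ** (-q / 10)

def error_to_phred(p: float) -> float:
    """Phred quality score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Print the error rate for a few commonly quoted Q-scores.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error rate {phred_to_error(q):.4%}")
```

Each step of 10 in Q corresponds to a tenfold drop in the error probability, so Q30 means roughly one wrong call per 1,000 bases.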
FASTQ files
FASTQ files are important to the first quality control step, as they contain all the raw sequencing reads, the read identifiers and the quality values, with higher numbers indicating higher quality
Sequences and Phred-based quality scores in one file
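To make the "sequence plus quality in one file" idea concrete, here is a minimal sketch of reading FASTQ records. It assumes the common four-lines-per-record layout and Phred+33 quality encoding (quality = ASCII code minus 33). The `read_fastq` helper and the example read are made up for illustration.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file (or blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores  # drop the leading '@'

# Toy record, not real sequencing data.
example = StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
```

In practice an established parser is usually preferable, since real files may be compressed or use a different quality offset.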
Read alignment
A) Align reads to the reference genome and create Binary Alignment Map (BAM) files
B) Post-alignment processing
- its objective is to increase variant call accuracy and the quality of downstream processing by reducing base-calling and alignment artifacts
- it consists of filtering duplicate reads, intensive local realignment and base quality score recalibration
C) Variant calling
- aims to identify variants using the post-processed BAM file
- methods range from basic comparison of the sample to the reference to advanced algorithms that include machine learning and statistical methods
- joint cohort variant calling allows identification of systematic errors/biases, gives better separation of signal from noise and, in turn, calls rare variants with more certainty
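The "basic comparison of the sample to the reference" end of that range can be sketched as a toy caller for a single position. This is not how any production variant caller works. The `naive_call` function and its depth and allele-fraction thresholds are assumptions chosen only for illustration.

```python
from collections import Counter

def naive_call(ref_base, pileup_bases, min_depth=10, min_alt_fraction=0.2):
    """Return (alt_base, alt_fraction) if a non-reference base passes both
    thresholds at this position, otherwise None.

    pileup_bases: the bases observed across all reads covering the position.
    The thresholds are hypothetical defaults, not recommended values.
    """
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # too few reads to say anything
    alt_counts = Counter(base for base in pileup_bases if base != ref_base)
    if not alt_counts:
        return None  # every read agrees with the reference
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    alt_fraction = alt_reads / depth
    if alt_fraction < min_alt_fraction:
        return None  # likely noise under this simple rule
    return alt_base, alt_fraction

print(naive_call("A", "AAAAAGGGGG"))  # half the reads support G
print(naive_call("A", "AAAAAAAAAG"))  # a single G read is below the cut-off
```

A fixed cut-off like this ignores base and mapping qualities, which is one reason the more advanced statistical and machine learning callers exist.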
Variant annotation
- integral part of data interpretation
- adds biological context in an automated way
- gene overlap, functionality, conservation, overlap with disease database
Common tools:
- VEP
- ANNOVAR
- snpEff
Common databases:
- Ensembl
(Provides variant consequence prediction, protein function prediction, linkage disequilibrium data and variant conservation across species)
- RefSeq
- UCSC genome browser
One basic step in annotation is to give each variant its context: which gene the variant is located in, its position within the gene and the impact of the variation
(Missense, nonsense, synonymous, stop-loss, etc)
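The impact categories above can be illustrated for a single-codon substitution. This is a minimal sketch, assuming the standard genetic code and a change confined to one codon. The `classify` function is illustrative and is not how VEP, ANNOVAR or snpEff are implemented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # same amino acid (or stop stays stop)
    if alt_aa == "*":
        return "nonsense"    # amino acid replaced by a stop codon
    if ref_aa == "*":
        return "stop-loss"   # stop codon replaced by an amino acid
    return "missense"        # one amino acid replaced by another

print(classify("GAA", "GAG"))  # Glu -> Glu
print(classify("TAC", "TAA"))  # Tyr -> stop
print(classify("TAA", "CAA"))  # stop -> Gln
print(classify("GAG", "GTG"))  # Glu -> Val
```

Real annotation tools also handle indels, splice sites and multiple transcripts per gene, none of which this sketch covers.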
Other databases and functions
Population frequency databases:
- gnomAD
Disease database:
- OMIM
- PanelApp
- ClinVar
Variant filtering and prioritisation
Phenotype and mode of inheritance (MOI)
Population frequency
Pathogenicity prediction scores and conservation
Variant of interest
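The filtering funnel above can be sketched as successive filters over annotated variants. Everything specific here is an assumption made for illustration: the `Variant` fields, the gene panel, the 0.01 allele frequency cut-off and the 0.5 score cut-off are hypothetical, and the variants are toy data.

```python
from dataclasses import dataclass

@dataclass
class Variant:
    gene: str
    population_af: float        # allele frequency in a population database
    pathogenicity_score: float  # hypothetical prediction score in [0, 1]

def prioritise(variants, panel_genes, max_af=0.01, min_score=0.5):
    """Keep variants in phenotype-relevant genes that are rare and predicted
    damaging, then rank them by descending prediction score."""
    kept = [
        v for v in variants
        if v.gene in panel_genes               # phenotype-driven gene panel
        and v.population_af <= max_af          # rare in the population
        and v.pathogenicity_score >= min_score # predicted damaging
    ]
    return sorted(kept, key=lambda v: v.pathogenicity_score, reverse=True)

# Toy variants with placeholder gene names.
variants = [
    Variant("GENE_A", 0.0001, 0.9),
    Variant("GENE_A", 0.2500, 0.9),  # too common
    Variant("GENE_B", 0.0001, 0.7),
    Variant("GENE_C", 0.0001, 0.95), # not on the panel
    Variant("GENE_B", 0.0005, 0.1),  # predicted benign
]
candidates = prioritise(variants, panel_genes={"GENE_A", "GENE_B"})
print([(v.gene, v.pathogenicity_score) for v in candidates])
```

Mode of inheritance is not modelled here. In practice it would add a further filter on genotype, for example requiring two hits in a gene for a recessive condition.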
Clinical interpretation
Allele frequency
Computational data
Functional data
Segregation data
Genotype phenotype correlation
Scale, from good (benign) to bad (pathogenic):
Benign
Likely benign
Uncertain significance
Likely pathogenic
Pathogenic