NGS pipeline Flashcards
What are the 5 steps in an NGS bioinformatics pipeline?
- Quality control
- Mapping
- Pre-processing
- Variant calling
- Variant annotation
What is involved in the quality control step of a bioinformatics pipeline?
Assessing the quality of raw sequencing data (FASTQ)
e.g. FastQC, MultiQC- can check:
- Per base sequencing quality
- GC content
- Per tile sequencing quality
- Sequence length distribution
What is the Q score, and how can it be used to assess quality?
Q score is a metric calculated using intensity to noise ratios and indicates the probability that a given base has been called incorrectly by the sequencer.
Q score = probability of incorrect base call = accuracy
Q10 = 1 in 10 = 90%
Q20 = 1 in 100 = 99%
Q30 = 1 in 1000 = 99.9%
Q40 = 1 in 10000 = 99.99%
Q50 = 1 in 100000 = 99.999%
The bioinformatics community recommends base calls of >=Q30
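The values above follow the Phred relation Q = -10 * log10(P), where P is the probability of an incorrect base call. A minimal sketch of that arithmetic follows; the function names are illustrative, and the FASTQ decoding assumes Phred+33 encoding.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse relation: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_fastq_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def fraction_at_or_above(scores, threshold=30):
    """Fraction of bases whose score meets the threshold (e.g. Q30)."""
    return sum(s >= threshold for s in scores) / len(scores)


if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```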
What is Trimmomatic and what is its role in quality control?
Trimmomatic is a read-trimming tool used as part of the QC process. It removes adapter sequences and low quality bases from reads. Quality trimming uses a sliding window that trims the read once the average quality within the window falls below a defined Q score.
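The sliding-window idea can be sketched as below. This is a simplified illustration, not Trimmomatic's actual implementation: the function name, the window size of 4 and the threshold of Q20 are assumptions chosen for the example.

```python
def sliding_window_trim(sequence, qualities, window_size=4, min_avg_quality=20):
    """Cut the read at the start of the first window whose mean quality
    falls below min_avg_quality; everything from there on is discarded.

    Reads shorter than one window are returned unchanged.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must be the same length")

    for start in range(len(qualities) - window_size + 1):
        window = qualities[start:start + window_size]
        if sum(window) / window_size < min_avg_quality:
            return sequence[:start], qualities[:start]
    return sequence, qualities


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    quals = [30, 30, 30, 30, 30, 30, 10, 10, 10, 10]
    print(sliding_window_trim(seq, quals))
```

Averaging over a window means a single poor base inside an otherwise good stretch does not trigger trimming, whereas a sustained drop in quality does.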
What is BWA?
BWA (Burrows-Wheeler Aligner) is a popular alignment tool. It uses the Burrows-Wheeler Transform to build a compressed index of the reference genome, against which reads are matched. The BWA-MEM algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending the seeds with an affine-gap Smith-Waterman based algorithm.
Aligns reads of 70bp-1Mbp to reference genomes
Low memory and compute cost, due to the compressed index
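To illustrate how a Burrows-Wheeler Transform supports exact matching, here is a toy sketch: it builds the BWT of a short reference via a naively sorted suffix array and counts exact occurrences of a pattern by backward search. It is not how BWA is implemented (real aligners use far more efficient construction and sampled lookup tables), and it covers exact matching only, not MEM seeding or Smith-Waterman extension. All function names are illustrative.

```python
def suffix_array(text):
    """Naive suffix array: start positions of suffixes in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt(text):
    """Burrows-Wheeler Transform of text, which must end with the sentinel '$'."""
    if not text.endswith("$"):
        raise ValueError("text must end with '$'")
    # The BWT character for each suffix is the character preceding it;
    # for the suffix starting at 0, text[-1] wraps round to the sentinel.
    return "".join(text[i - 1] for i in suffix_array(text))


def first_column_starts(bwt_string):
    """For each character, the row where its block starts in the sorted first column."""
    starts, total = {}, 0
    for ch in sorted(set(bwt_string)):
        starts[ch] = total
        total += bwt_string.count(ch)
    return starts


def count_occurrences(pattern, bwt_string):
    """Count exact matches of pattern using backward search over the BWT."""
    starts = first_column_starts(bwt_string)
    lo, hi = 0, len(bwt_string)  # current range of matching rows [lo, hi)
    for ch in reversed(pattern):
        if ch not in starts:
            return 0
        # Occurrence counts are recomputed by slicing here for clarity;
        # an FM-index precomputes them so each step is constant time.
        lo = starts[ch] + bwt_string[:lo].count(ch)
        hi = starts[ch] + bwt_string[:hi].count(ch)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    reference_bwt = bwt("ACGTACGT$")
    print(reference_bwt, count_occurrences("ACG", reference_bwt))
```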
What is Bowtie2 and how does it differ from BWA?
Bowtie2 is an aligner that also uses a Burrows-Wheeler Transform based index of the reference and a seed-and-extend approach: short exact matches (seeds) between read and reference are found and then extended.
Bowtie2 will choose the 1st seed match, whereas BWA will check multiple MEMs before aligning and extending
What is involved in the pre-processing step of the bioinformatics pipeline?
Removing duplicate reads (e.g. samtools rmdup), local realignment and indexing. Removing duplicates is important because PCR duplicates inflate coverage.
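Duplicate removal can be sketched as follows, under the simplifying assumption that reads sharing chromosome, start position and strand are duplicates and that the copy with the highest mean base quality is kept. The `Read` record and its fields are hypothetical; real tools use more information, for example mate positions for paired-end reads.

```python
from collections import namedtuple

# Hypothetical minimal read record for illustration only.
Read = namedtuple("Read", ["name", "chrom", "start", "strand", "mean_quality"])


def remove_duplicates(reads):
    """Keep one read per (chrom, start, strand), preferring the highest quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.strand)
        if key not in best or read.mean_quality > best[key].mean_quality:
            best[key] = read
    return list(best.values())


if __name__ == "__main__":
    reads = [
        Read("r1", "chr1", 100, "+", 30.0),
        Read("r2", "chr1", 100, "+", 35.0),  # duplicate of r1, higher quality
        Read("r3", "chr1", 150, "+", 32.0),
    ]
    for read in remove_duplicates(reads):
        print(read.name)
```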
What is variant calling, and what are the two types?
Variant calling is the process of identifying variants from sequencing data. There are two types:
- Bayesian based-
Modelling the distribution of observed data using Bayesian statistics to calculate genotype probabilities
e.g. FreeBayes/Platypus
- Heuristic based-
Variants are called based upon defined factors such as minimum allele counts, read quality cut-offs or read depths
e.g. VarScan
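The two approaches can be contrasted with a toy example at a single site, given counts of reads supporting the reference and alternate alleles. Neither function reproduces any real caller: the thresholds are hypothetical (not VarScan defaults), and the Bayesian model assumes a diploid genotype, a flat prior, independent reads and a fixed per-read error rate.

```python
def heuristic_call(ref_count, alt_count, min_depth=10, min_alt_reads=3,
                   min_alt_fraction=0.2):
    """Call a variant only if fixed cut-offs on depth and allele support are met."""
    depth = ref_count + alt_count
    if depth < min_depth or alt_count < min_alt_reads:
        return False
    return alt_count / depth >= min_alt_fraction


def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype from binomial likelihoods."""
    # Probability that a single read shows the alternate allele, per genotype.
    alt_prob = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    if priors is None:
        priors = {genotype: 1 / 3 for genotype in alt_prob}  # flat prior (assumption)

    # The binomial coefficient is identical for every genotype, so it
    # cancels on normalisation and is omitted.
    unnormalised = {
        genotype: priors[genotype] * p ** alt_count * (1 - p) ** ref_count
        for genotype, p in alt_prob.items()
    }
    total = sum(unnormalised.values())
    return {genotype: value / total for genotype, value in unnormalised.items()}


if __name__ == "__main__":
    print(heuristic_call(ref_count=18, alt_count=2))
    print(genotype_posteriors(ref_count=10, alt_count=10))
```

The heuristic caller gives a yes/no answer from cut-offs, whereas the Bayesian sketch returns a probability for each genotype that can then be thresholded or reported as a quality.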
What is variant annotation?
The process of annotating SNVs identified within a VCF with information from a variety of sources:
- Variant databases e.g. ClinVar
- Population frequency information e.g. dbSNP, ESP, 1000 Genomes, ExAC, gnomAD
- In silico tool results e.g. SIFT, PolyPhen
- Literature searches e.g. PubMed IDs/links
Annotation tools: e.g. VEP, ANNOVAR
What is (vertical) coverage?
Coverage (vertical coverage) is the number of times a particular base has been sequenced. A greater depth of coverage increases the confidence in the variant call.
Required coverage differs depending on the investigation. Typically looking for >95% of target bases covered at the following depths:
Germline = ~20-30x
Somatic = ~200-300x, due to low level variants
What is horizontal coverage?
How much of the ROI (region of interest) or genome has been sequenced
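Both kinds of coverage can be computed from a list of per-base depths across the region of interest. The sketch below assumes such a list is already available (for example, parsed from a depth report); the function names and the 20x / 95% defaults are illustrative.

```python
def mean_depth(depths):
    """Average vertical coverage across the region."""
    return sum(depths) / len(depths)


def fraction_covered(depths, min_depth):
    """Fraction of bases sequenced at least min_depth times."""
    return sum(d >= min_depth for d in depths) / len(depths)


def horizontal_coverage(depths):
    """Fraction of the region covered by at least one read."""
    return fraction_covered(depths, 1)


def passes_coverage_qc(depths, min_depth=20, min_fraction=0.95):
    """True if more than min_fraction of bases reach min_depth."""
    return fraction_covered(depths, min_depth) > min_fraction


if __name__ == "__main__":
    depths = [0, 25, 30, 40, 22, 19, 35, 28, 31, 27]
    print(mean_depth(depths))
    print(horizontal_coverage(depths))
    print(fraction_covered(depths, 20))
    print(passes_coverage_qc(depths))
```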
What QA methods do you undertake when implementing a new pipeline?
- Validation/verification
a. Testing samples with known results (gold standard)
b. Calculate sensitivity, specificity and accuracy
c. Use of external software (support, community, updates)
- Code review and version control
a. Git and implementations e.g. GitHub, Bitbucket
- Mention of ACGS best practice guidelines
- Software testing
a. Part of software life cycle
b. Unit, system, user acceptance
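The validation metrics can be calculated from a comparison of pipeline calls against the known results. A minimal sketch, assuming the counts of true/false positives and negatives have already been tallied and that none of the denominators is zero:

```python
def validation_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true variants that were detected
        "specificity": tn / (tn + fp),  # true negatives correctly not called
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(validation_metrics(tp=95, fp=2, tn=898, fn=5))
```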
What is the difference between QA and QC?
QA- Quality assurance- Process orientated- Getting it right the first time- Comprises the different methods used to ensure a quality requirement is met.
QC- Quality control- Product orientated- Seeks to measure whether the quality requirements of a product are met (in real time). Designed to stop a faulty result being reported, not to stop faults from occurring.
What QC processes are carried out during the entire NGS workflow?
- Extraction- DNA quantification- Qubit
- Library prep- Adapter contamination, insert size check
- Post sequencing:
a. File size check
b. SAV quality metrics
i. Cluster density, %>=Q30, reads PF (passing filter), phasing (fallen behind), pre-phasing (jumped ahead), error rate (PhiX control)
c. FASTQC metrics
d. Samtools
i. Insert size, coverage, %mapped
e. VerifyBamID- Identifying contamination
What are the 4 categories of alignment algorithms?
- Database only e.g. BLAST
- Pairwise alignment e.g. Needle
- Multiple sequence alignment e.g. ClustalW
- Genomic analysis e.g. BWA