NGS pipeline Flashcards

Question 1

Q

What are the 5 steps in an NGS bioinformatics pipeline?

Answer

A

Quality control
Mapping
Pre-processing
Variant calling
Variant annotation

Question 2

Q

What is involved in the quality control step of a bioinformatics pipeline?

Answer

A

Assessing the quality of raw sequencing data (FASTQ).
Tools such as FastQC and MultiQC can check:

Per base sequencing quality
GC content
Per tile sequencing quality
Sequence length distribution

Question 3

Q

What is the Q score, and how can it be used to assess quality?

Answer

A

The Q score (Phred quality score) is a metric calculated from signal intensity-to-noise ratios. It indicates the probability P that a given base has been called incorrectly by the sequencer: Q = -10 log10(P).

Q score = probability of incorrect base call = accuracy

Q10 = 1 in 10 = 90%
Q20 = 1 in 100 = 99%
Q30 = 1 in 1,000 = 99.9%
Q40 = 1 in 10,000 = 99.99%
Q50 = 1 in 100,000 = 99.999%

The bioinformatics community generally recommends that base calls have a quality of Q30 or higher (≥Q30).
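The table above follows the standard Phred relationship Q = -10 log10(P). A minimal sketch of the conversion in both directions is shown below. The function names are illustrative, and the FASTQ decoder assumes Phred+33 encoding (the usual encoding for current Illumina data).

```python
import math


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def phred_score(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of Q scores.

    Assumes Phred+33 encoding unless another offset is given.
    """
    return [ord(ch) - offset for ch in quality_line]


for q in (10, 20, 30, 40, 50):
    p = error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```

For example, under Phred+33 the quality character `I` decodes to Q40 and `5` decodes to Q20.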

Question 4

Q

What is Trimmomatic and what is its role in quality control?

Answer

A

Trimmomatic is a read-trimming tool used as part of the QC process. It removes adapter sequences and low-quality bases from reads. Quality trimming uses a sliding window: the read is cut once the average quality within the window falls below a defined Q score.
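The sliding-window idea can be sketched in a few lines. This is a simplified illustration, not Trimmomatic's actual implementation: the function names, the window size of 4 and the threshold of Q20 are assumptions chosen for the example.

```python
def sliding_window_keep_length(quals, window=4, min_avg_q=20):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose average quality is below min_avg_q. Simplified sketch only.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_q:
            return start
    return len(quals)


def trim_read(seq, quals, window=4, min_avg_q=20):
    """Trim a read and its quality scores with the sliding window."""
    keep = sliding_window_keep_length(quals, window, min_avg_q)
    return seq[:keep], quals[:keep]


# Hypothetical read whose quality drops towards the 3' end.
seq = "ACGTACGTAC"
quals = [30, 32, 31, 30, 28, 12, 10, 8, 5, 2]
print(trim_read(seq, quals))
```

In this made-up read the first window with an average below Q20 starts at the fifth base, so only the first four bases are kept.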

Question 5

Q

What is BWA?

Answer

A

BWA (Burrows-Wheeler Aligner) is a popular alignment tool. It uses the Burrows-Wheeler Transform to build a compressed index of the reference genome. The BWA-MEM algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending the seeds with an affine-gap Smith-Waterman based algorithm.

BWA-MEM aligns reads of 70 bp to 1 Mbp to a reference genome.

Low compute and memory cost, due to the compressed index.
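To make the Burrows-Wheeler Transform concrete, here is a naive sketch that builds it by sorting all rotations of the text. This is for illustration only: real aligners build the transform with far more efficient algorithms, and the function name and the `$` terminator are assumptions of this example.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform via sorted rotations.

    The terminator must not occur in text and must sort before
    every other character ('$' sorts before letters in ASCII).
    """
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

The transform tends to group identical characters into runs, which is what makes the result compressible and searchable.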

Question 6

Q

What is Bowtie2 and how does it differ from BWA?

Answer

A

Bowtie2 is an aligner that, like BWA, uses the Burrows-Wheeler Transform for compressed indexing. Matching is also based on maximal exact matches (MEMs), and the seeds are then extended.

Bowtie2 will choose the first MEM, whereas BWA will check multiple MEMs before aligning and extending.

Question 7

Q

What is involved in the pre-processing step of the bioinformatics pipeline?

Answer

A

Removing duplicate reads (e.g. with samtools rmdup), local realignment and indexing. Removing duplicates is important because PCR duplicates inflate coverage.
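The duplicate-removal idea can be sketched as follows: reads that map to the same chromosome, start position and strand are treated as copies of one original fragment, and only the best one is kept. This is a simplified illustration with a hypothetical `Read` record; real tools use more information, such as the position of the mate in paired-end data.

```python
from collections import namedtuple

# Hypothetical minimal read record for the example.
Read = namedtuple("Read", "name chrom pos strand mapq")


def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring higher MAPQ."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 60),  # duplicate of r1, higher MAPQ
    Read("r3", "chr1", 100, "-", 60),  # other strand, not a duplicate
    Read("r4", "chr1", 250, "+", 40),
]
print([read.name for read in remove_duplicates(reads)])
```

Here `r1` and `r2` share a position and strand, so only `r2` survives and the apparent depth at that position drops from three reads to two.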

Question 8

Q

What is variant calling, and what are the two types?

Answer

A

Variant calling is the process of identifying variants from sequencing data. There are two types:

Bayesian-based:
Modelling the distribution of the observed data using Bayesian statistics to calculate genotype probabilities
e.g. FreeBayes, Platypus
Heuristic-based:
Variants are called based on defined factors such as minimum allele counts, read quality cut-offs or read depths
e.g. VarScan
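The two approaches can be contrasted with a toy example at a single site. This is a sketch under stated assumptions, not how any named caller works: the thresholds, the sequencing error rate, the genotype priors and the simple binomial read model are all illustrative values chosen for this example.

```python
from math import comb


def heuristic_call(ref_count, alt_count, min_depth=8, min_alt_reads=2, min_vaf=0.2):
    """Heuristic caller: apply fixed cut-offs to the read counts."""
    depth = ref_count + alt_count
    if depth < min_depth or alt_count < min_alt_reads:
        return False
    return alt_count / depth >= min_vaf


def genotype_posteriors(ref_count, alt_count, error_rate=0.01,
                        priors=(0.98, 0.01, 0.01)):
    """Bayesian caller: posterior for hom-ref, het and hom-alt genotypes.

    Uses a binomial likelihood for the number of alt reads given the
    expected alt fraction under each genotype.
    """
    n = ref_count + alt_count
    alt_fraction = (error_rate, 0.5, 1 - error_rate)
    likelihoods = [
        comb(n, alt_count) * p ** alt_count * (1 - p) ** ref_count
        for p in alt_fraction
    ]
    weighted = [lik * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weighted)
    return [w / total for w in weighted]


print(heuristic_call(10, 10))
print(genotype_posteriors(10, 10))
```

With 10 reference and 10 alternate reads, the heuristic rule passes its cut-offs and the Bayesian model assigns almost all probability to the heterozygous genotype.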

Question 9

Q

What is variant annotation?

Answer

A

The process of annotating SNVs identified within a VCF with information from a variety of sources:

Variant databases e.g. ClinVar
Population frequency information e.g. dbSNP, ESP, 1000 Genomes, ExAC, gnomAD
In silico tool predictions e.g. SIFT, PolyPhen
Literature searches e.g. PubMed IDs/links

Annotation tools include VEP and ANNOVAR.

Question 10

Q

What is (vertical) coverage?

Answer

A

Coverage (vertical coverage, or depth) is the number of times a particular base has been sequenced. A greater depth of coverage increases confidence in the variant call.

The required coverage depends on the investigation. Typically, >95% of target bases should be covered at the following depths:

Germline = ~20-30x
Somatic = ~200-300x, because variants may be present at low levels
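A check of the form ">95% of bases at 20x" can be sketched as below. This is a minimal illustration that assumes reads are given as half-open (start, end) intervals on one chromosome; the function names and the toy coordinates are made up for the example.

```python
def per_base_depth(reads, region_start, region_end):
    """Depth at each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip each read to the region before counting.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


def fraction_at_depth(depth, min_depth):
    """Fraction of positions covered by at least min_depth reads."""
    return sum(d >= min_depth for d in depth) / len(depth)


# Toy example: three overlapping reads over a 10 bp region.
reads = [(0, 6), (2, 8), (4, 10)]
depth = per_base_depth(reads, 0, 10)
print(depth)
print(sum(depth) / len(depth))
print(fraction_at_depth(depth, 2))
```

In practice the threshold would be the depth required for the investigation (for example 20 for germline testing) and the result would be compared with the 95% target.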

Question 11

Q

What is horizontal coverage?

Answer

A

How much of the ROI (region of interest) or genome has been sequenced.
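Horizontal coverage can be computed as the fraction of the region covered by at least one read. The sketch below assumes reads are half-open (start, end) intervals on the same chromosome as the region; the function name and coordinates are illustrative.

```python
def breadth_of_coverage(reads, roi_start, roi_end):
    """Fraction of [roi_start, roi_end) covered by at least one read."""
    covered = 0
    cursor = roi_start  # everything before the cursor is already counted
    for start, end in sorted(reads):
        start = max(start, cursor)
        end = min(end, roi_end)
        if end > start:
            covered += end - start
            cursor = end
    return covered / (roi_end - roi_start)


# Toy example: a 100 bp region with an uncovered gap and uncovered end.
print(breadth_of_coverage([(0, 30), (20, 50), (70, 90)], 0, 100))
```

Overlapping reads are counted once, so the result reflects how much of the region was seen at all rather than how deeply it was sequenced.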

Question 12

Q

What QA methods do you undertake when implementing a new pipeline?

Answer

A

Validation/verification
  a. Testing samples with known results to establish a gold standard
  b. Calculate sensitivity, specificity and accuracy
  c. Use of external software (support, community, updates)
Code review and version control
  a. Git and hosted implementations e.g. GitHub, Bitbucket
Reference to ACGS best practice guidelines
Software testing
  a. Part of the software life cycle
  b. Unit, system, user acceptance
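The validation metrics named above can be calculated from a comparison of pipeline calls against the known results. A minimal sketch follows; the function name is illustrative and the counts are made-up numbers, not real validation data.

```python
def validation_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from a confusion matrix.

    tp/fn: known variants that were detected / missed.
    tn/fp: known non-variant sites called correctly / called as variant.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical validation run.
print(validation_metrics(tp=95, fp=10, tn=890, fn=5))
```

With these made-up counts the pipeline detects 95 of 100 known variants, giving a sensitivity of 0.95.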

Question 13

Q

What is the difference between QA and QC?

Answer

A

QA (quality assurance): process orientated. Getting it right the first time; comprises the different methods taken to ensure a quality requirement is met.

QC (quality control): product orientated. Seeks to measure whether the quality requirements of a product are met (in real time). Designed to stop a faulty result from being reported, not to stop faults from occurring.

Question 14

Q

What QC processes are carried out during the entire NGS workflow?

Answer

A

Extraction - DNA quantification e.g. Qubit
Library prep - Adapter contamination, insert size check
Post sequencing:
  a. File size check
  b. SAV (Sequencing Analysis Viewer) quality metrics
    i. Cluster density, % ≥Q30, reads PF (passing filter), phasing (fallen behind), pre-phasing (jumped ahead), error rate (PhiX control)
  c. FastQC metrics
  d. Samtools
    i. Insert size, coverage, % mapped
  e. VerifyBamID - Identifying contamination

Question 15

Q

What are the 4 categories of alignment algorithms?

Answer

A

Database only e.g. BLAST
Pairwise alignment e.g. Needle
Multiple sequence alignment e.g. ClustalW
Genomic analysis e.g. BWA

Question 16

Q

What are global and local alignment?

Answer

A

Global- Algorithm that forces the alignment to span the entire length of all query sequences

e.g. Needleman-Wunsch

Local- Algorithm identifies regions of similarity within two sequences

e.g. Smith-Waterman

Question 17

Q

What are the 4 steps in a Needleman-Wunsch alignment?

Answer

A

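Complete a matrix comparing the 2 sequences
The first row and first column hold cumulative gap penalties: 0, -1, -2…
Score table e.g. match +1, mismatch 0. For each cell, calculate the candidate scores (from top, top left and left) and record the highest
Trace back from the bottom-right cell, choosing the path with the highest score (including 0s and -1s)

Vertical (up) and horizontal (left) movements represent gaps; diagonal movements represent a match or mismatch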
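The steps above can be sketched as a short function. This is a minimal illustration using the example scores from the card (match +1, mismatch 0) with an assumed linear gap penalty of -1; the function name is illustrative, and where several paths tie the traceback simply prefers the diagonal.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column hold cumulative gap penalties.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Each cell takes the best of the diagonal, up and left candidates.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to the top-left cell.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])  # gap in a
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

For `ACGT` against `AGT` the alignment spans both sequences in full, with a single gap inserted opposite the `C`.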

Question 18

Q

What are the 4 steps in a Smith-Waterman alignment?

Answer

A

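Complete a matrix comparing the 2 sequences
The first row and first column are set to 0
Score table e.g. match +1, mismatch 0. Negative scores are not allowed; any negative cell is set to 0
Trace back from the highest-scoring cell, following the path with the highest score, and stop when you reach 0

Vertical (up) and horizontal (left) movements represent gaps; diagonal movements represent a match or mismatch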
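A minimal sketch of these steps is shown below. The scoring values here (match +1, mismatch -1, gap -1) are assumptions chosen so that the local region stands out clearly; they differ from the example values on the card, and the function name is illustrative.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # First row and column stay at 0.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # Negative candidates are replaced by 0.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the highest-scoring cell until a 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")  # gap in b
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])  # gap in a
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTGGG", "CCACGTAA"))
```

Unlike the global alignment, only the shared `ACGT` region is reported and the unrelated flanking bases are ignored.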

(18 cards)