NGS pipeline Flashcards

1
Q

What are the 5 steps in an NGS bioinformatics pipeline?

A
  1. Quality control
  2. Mapping
  3. Pre-processing
  4. Variant calling
  5. Variant annotation
2
Q

What is involved in the quality control step of a bioinformatics pipeline?

A

Assessing the quality of raw sequencing data (FASTQ files).
Tools such as FastQC and MultiQC can check:

  1. Per base sequencing quality
  2. GC content
  3. Per tile sequencing quality
  4. Sequence length distribution
3
Q

What is the Q score, and how can it be used to assess quality?

A

The Q score is a metric calculated from intensity-to-noise ratios. It indicates the probability that a given base has been called incorrectly by the sequencer.

Q score   Probability of incorrect base call   Accuracy

Q10       1 in 10                              90%
Q20       1 in 100                             99%
Q30       1 in 1,000                           99.9%
Q40       1 in 10,000                          99.99%
Q50       1 in 100,000                         99.999%

The bioinformatics community recommends base calls of Q30 or above.
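
The table above follows the standard Phred relationship Q = -10 x log10(P), where P is the probability of an incorrect base call. A minimal Python sketch of that conversion is shown here; the function names are illustrative, and the quality-string decoder assumes the common Phred+33 FASTQ encoding.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is incorrect."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into Q scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]

if __name__ == "__main__":
    for q in (10, 20, 30, 40, 50):
        p = phred_to_error_prob(q)
        print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```

Under this encoding, `decode_quality_string` turns each quality character of a read into the Q score used in the table.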

4
Q

What is Trimmomatic and what is its role in quality control?

A

Trimmomatic is a read-trimming tool used as part of the QC process. It removes adapter sequences and low-quality bases from reads. Quality trimming uses a sliding window: the read is cut once the average quality within the window falls below a defined Q score.
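
The sliding-window idea can be sketched in a few lines of Python. This is a simplified illustration, not Trimmomatic's actual code; the window size and threshold are assumed example values, and the function name is hypothetical.

```python
def sliding_window_trim(quals, window=4, min_avg_q=20):
    """Return how many bases to keep from the 5' end of a read.

    Windows are scanned from the 5' end. At the first window whose mean
    quality falls below min_avg_q, the read is cut at the start of that
    window. Reads shorter than the window are returned untrimmed.
    """
    for start in range(0, len(quals) - window + 1):
        win = quals[start:start + window]
        if sum(win) / window < min_avg_q:
            return start
    return len(quals)

if __name__ == "__main__":
    quals = [38, 38, 37, 36, 30, 22, 12, 8, 5, 2]  # made-up Q scores
    keep = sliding_window_trim(quals)
    print(f"keep the first {keep} of {len(quals)} bases")
```

A real trimmer would also clip adapters and drop reads that end up too short; those steps are left out here.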

5
Q

What is BWA?

A

BWA is a popular alignment tool. It uses the Burrows-Wheeler Transform to build a compressed index of the reference genome. The BWA-MEM algorithm seeds alignments with maximal exact matches (MEMs) and then extends the seeds with an affine-gap Smith-Waterman algorithm.

Aligns reads of 70 bp to 1 Mbp against reference genomes

Low compute and memory cost, because the index is compressed
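
To make the Burrows-Wheeler Transform less abstract, here is a toy Python sketch that builds the transform by sorting rotations and then counts exact matches with a backward search. It only illustrates the principle: real aligners use far more efficient index construction, and the function names are illustrative rather than taken from BWA.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (toy inputs only)."""
    text = text + "$"  # sentinel; sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def count_occurrences(bwt_str, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row_start[c] = number of characters in the text smaller than c
    first_row_start = {}
    running = 0
    for c in sorted(set(bwt_str)):
        first_row_start[c] = running
        running += bwt_str.count(c)
    lo, hi = 0, len(bwt_str)  # current range of matching sorted rotations
    for c in reversed(pattern):
        if c not in first_row_start:
            return 0
        lo = first_row_start[c] + bwt_str[:lo].count(c)
        hi = first_row_start[c] + bwt_str[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo

if __name__ == "__main__":
    reference = "ACAACG"
    transformed = bwt(reference)
    print(transformed)
    print(count_occurrences(transformed, "AC"))
```

The search never scans the reference itself; it only narrows a range over the transformed string, which is what allows a compressed index to be queried quickly.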

6
Q

What is Bowtie2 and how does it differ from BWA?

A

Bowtie2 is an aligner that also uses the Burrows-Wheeler Transform to index the reference. Like BWA, it finds exact seed matches between the read and the reference and then extends them.

Bowtie2 will choose the 1st MEM, whereas BWA will check multiple MEMs before aligning and extending

7
Q

What is involved in the pre-processing step of the bioinformatics pipeline?

A

Removing duplicate reads (e.g. with rmdup), local realignment and indexing. Removing duplicates is important because PCR duplicates inflate coverage.
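
Duplicate removal can be sketched as follows, assuming that reads mapping to the same chromosome, start position and strand are duplicates and that the copy with the highest summed base quality is kept. The dictionary keys (`chrom`, `pos`, `strand`, `qual_sum`) are hypothetical; real tools work on BAM records and also consider mate positions.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring higher quality."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"])
        if key not in best or read["qual_sum"] > best[key]["qual_sum"]:
            best[key] = read
    return list(best.values())

if __name__ == "__main__":
    # Made-up reads: r1 and r2 share a position and strand
    reads = [
        {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 900},
        {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "qual_sum": 950},
        {"name": "r3", "chrom": "chr1", "pos": 150, "strand": "-", "qual_sum": 800},
    ]
    for read in remove_duplicates(reads):
        print(read["name"])
```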

8
Q

What is variant calling, and what are the two types?

A

Variant calling is the process of identifying variants from sequencing data. There are two types:

  1. Bayesian-based
    Models the distribution of the observed data using Bayesian statistics to calculate genotype probabilities
    e.g. FreeBayes, Platypus
  2. Heuristic-based
    Variants are called based upon defined factors such as minimum allele counts, read quality cut-offs or read depths
    e.g. VarScan
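
A heuristic caller of the kind described above can be sketched as a set of fixed thresholds applied to the bases observed at one position. The thresholds and the function name below are assumed example values for illustration only, not the defaults of any particular tool.

```python
from collections import Counter

def call_variant(ref_base, pileup_bases, min_depth=8, min_alt_reads=2, min_vaf=0.2):
    """Return (alt_base, allele_fraction) if the thresholds are met, else None."""
    depth = len(pileup_bases)
    if depth < min_depth:
        return None  # not enough reads to make a call
    alt_counts = Counter(b for b in pileup_bases if b != ref_base)
    if not alt_counts:
        return None  # every read supports the reference base
    alt_base, alt_reads = alt_counts.most_common(1)[0]
    vaf = alt_reads / depth
    if alt_reads >= min_alt_reads and vaf >= min_vaf:
        return alt_base, vaf
    return None

if __name__ == "__main__":
    print(call_variant("A", list("AAAAAGGGGG")))
    print(call_variant("A", list("AAAAAAAAAG")))
```

A Bayesian caller would replace these hard cut-offs with genotype likelihoods and priors.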
9
Q

What is variant annotation?

A

The process of annotating SNVs identified within a VCF with information from a variety of sources:

  1. Variant databases e.g. ClinVar
  2. Population frequency information e.g. dbSNP, ESP, 1000 Genomes, ExAC, gnomAD
  3. In silico tool results e.g. SIFT, PolyPhen
  4. Literature searches e.g. PubMed IDs/links

Annotation tools include VEP and ANNOVAR.

10
Q

What is (vertical) coverage?

A

Coverage (vertical coverage, or depth) is the number of times a particular base has been sequenced. A greater depth of coverage increases the confidence in the variant call.

The coverage required depends on the investigation. Typically, >95% of target bases should be covered at the following depths:

Germline = ~20-30x
Somatic = ~200-300x, because variants may be present at low levels
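
As a rough illustration, per-base depth and the ">95% at a given depth" check can be computed as below. Reads are represented as hypothetical half-open `(start, end)` intervals on one chromosome; a real pipeline would read these from a BAM file.

```python
def per_base_depth(reads, region_start, region_end):
    """Depth at each position of [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth

def fraction_at_depth(depth, min_depth):
    """Fraction of positions covered at min_depth or more."""
    return sum(d >= min_depth for d in depth) / len(depth)

if __name__ == "__main__":
    reads = [(0, 5), (2, 8), (4, 10)]  # made-up read intervals
    depth = per_base_depth(reads, 0, 10)
    print(depth)
    print(fraction_at_depth(depth, 2) > 0.95)
```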
11
Q

What is horizontal coverage?

A

The proportion of the ROI (region of interest) or genome that has been sequenced
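
One way to compute this proportion is to merge overlapping read intervals and divide the covered length by the ROI length. The sketch below assumes reads are half-open `(start, end)` intervals; the function name is illustrative.

```python
def breadth_of_coverage(reads, roi_start, roi_end):
    """Fraction of [roi_start, roi_end) covered by at least one read."""
    # Clip reads to the ROI and sort by start position
    clipped = sorted(
        (max(s, roi_start), min(e, roi_end))
        for s, e in reads
        if s < roi_end and e > roi_start
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A new, non-overlapping block starts: bank the previous one
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (roi_end - roi_start)

if __name__ == "__main__":
    print(breadth_of_coverage([(0, 30), (20, 50), (70, 90)], 0, 100))
```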

12
Q

What QA methods do you undertake when implementing a new pipeline?

A
  1. Validation/verification
    a. Testing samples with known results (gold standard)
    b. Calculating sensitivity, specificity and accuracy
    c. Use of external software (support, community, updates)
  2. Code review and version control
    a. Git and hosting platforms e.g. GitHub, Bitbucket
  3. ACGS best practice guidelines
  4. Software testing
    a. Part of the software life cycle
    b. Unit, system and user acceptance testing
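
The validation metrics in 1b follow directly from comparing pipeline calls against the known results. A minimal sketch, using the standard definitions and made-up counts:

```python
def performance_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true variants that were detected
        "specificity": tn / (tn + fp),  # true negatives correctly not called
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

if __name__ == "__main__":
    # Made-up validation counts, for illustration only
    metrics = performance_metrics(tp=95, fp=10, tn=890, fn=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
```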
13
Q

What is the difference between QA and QC?

A

QA (quality assurance): process orientated. Getting it right the first time. Comprises the methods used to ensure that a quality requirement is met.

QC (quality control): product orientated. Measures whether a product meets its quality requirements (in real time). Designed to stop a faulty result being reported, not to stop faults from occurring.

14
Q

What QC processes are carried out across the entire NGS workflow?

A
  1. Extraction: DNA quantification e.g. Qubit
  2. Library prep: adapter contamination, insert size check
  3. Post sequencing:
    a. File size check
    b. SAV quality metrics
      i. Cluster density, %≥Q30, reads passing filter (PF), phasing (bases fallen behind), pre-phasing (bases jumped ahead), error rate (PhiX control)
    c. FastQC metrics
    d. Samtools
      i. Insert size, coverage, % mapped
    e. VerifyBamID: identifies sample contamination
15
Q

What are the 4 categories of alignment algorithms?

A
  1. Database only e.g. BLAST
  2. Pairwise alignment e.g. Needle
  3. Multiple sequence alignment e.g. ClustalW
  4. Genomic analysis e.g. BWA
16
Q

What is global and local alignment?

A

Global: the algorithm forces the alignment to span the entire length of all query sequences

e.g. Needleman-Wunsch

Local: the algorithm identifies regions of similarity within two sequences

e.g. Smith-Waterman

17
Q

What are the 4 steps in a Needleman-Wunsch alignment?

A
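  1. Build a matrix comparing the 2 sequences
  2. Initialise the first row and column with cumulative gap penalties: 0, -1, -2…
  3. Fill in the score table using e.g. match +1, mismatch 0. For each cell, calculate the candidate scores from the top, top-left and left cells and record the highest
  4. Trace back from the bottom-right cell along the highest-scoring path (which may pass through 0s and negative scores)

Movements up or left represent gaps; diagonal movements represent a match or mismatch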
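
The four steps can be sketched in Python as below. The scores follow the example on this card (match +1, mismatch 0) with an assumed gap penalty of -1; where several alignments score equally, this sketch simply prefers the diagonal move, so other equally valid alignments may exist.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Step 2: first row and column hold cumulative gap penalties
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Step 3: each cell takes the best of diagonal, up and left
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Step 4: trace back from the bottom-right cell to the top-left
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1  # move up
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1  # move left
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
    print(total)
    print(top)
    print(bottom)
```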

18
Q

What are the 4 steps in a Smith-Waterman alignment?

A
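  1. Build a matrix comparing the 2 sequences
  2. Set the first row and column to 0
  3. Fill in the score table using e.g. match +1, mismatch 0; any cell that would be negative is set to 0
  4. Trace back from the highest-scoring cell and stop when you reach 0

Movements up or left represent gaps; diagonal movements represent a match or mismatch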
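
A minimal Python sketch of these steps follows. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen so that the local region stands out clearly; they differ from the example scores on this card. The key differences from the global algorithm are the floor at 0 and the traceback that starts at the highest cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment. Returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # Step 2: first row and column stay at 0
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Step 3: fill the table, never letting a cell drop below 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Step 4: trace back from the highest cell until a 0 is reached
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1  # move up
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1  # move left
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

if __name__ == "__main__":
    print(smith_waterman("TTTGATTACATTT", "CCGATTACACC"))
```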