20.02.16 NGS - Bioinformatics Flashcards

1
Q

What is a bioinformatic pipeline generally split into

A
  • Quality control
  • Sequence alignment
  • Variant Calling
  • Annotation
2
Q

What is a BCL file

A
  • Base calls per cycle.
  • Binary file containing the base call and quality for each tile in each cycle
  • Raw output produced by Illumina platforms (the MiSeq converts it to FASTQ on the instrument).
  • Must be converted to FASTQs for bioinformatic analysis
3
Q

What are FASTQ files

A
  • Text-based format for storing both a nucleotide sequence and its corresponding quality scores.
  • Generally the input file for most BI pipelines
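As a rough illustration, a FASTQ record is four lines: an @-prefixed identifier, the sequence, a + separator, and a quality string with one character per base. The sketch below parses a single record and decodes the qualities assuming Phred+33 encoding (an assumption here, though it is the usual modern convention). The function name and the example read are made up.

```python
def parse_fastq_record(lines, offset=33):
    """Parse one 4-line FASTQ record into (read_id, sequence, phred_scores).

    `offset` is the ASCII offset of the quality encoding (33 is assumed).
    """
    header, seq, plus, qual = (line.strip() for line in lines)
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one Phred score as ASCII code minus offset.
    scores = [ord(ch) - offset for ch in qual]
    return header[1:], seq, scores


# Hypothetical record, for illustration only.
record = ["@read1", "GATTACA", "+", "IIIIH#5"]
read_id, seq, scores = parse_fastq_record(record)
print(read_id, seq, scores)
```

In practice a library such as Biopython or pysam would normally be used rather than hand-written parsing.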
4
Q

What are BAM files

A
  • Binary Alignment Map.
  • Binary format for storing aligned sequence data.
  • Formed when a set of FASTQs have been aligned to a reference genome using an alignment algorithm.
  • Can be used during analysis to visualise variants or to check quality/coverage
5
Q

What are CRAM files

A

-A more highly compressed alternative to BAM (uses reference-based compression). Likely to become more common as we try to reduce the amount of data stored long-term.

6
Q

What is a VCF file

A
  • Variant Call Format file, used to show the genetic variation in a sample (only the differences from the reference genome).
  • Produced when a variant calling program is applied to a BAM file.
  • Contains chromosome, position, reference and alternate alleles, quality scores and filter status.
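A minimal sketch of how one VCF data line maps onto those fields. It relies only on the eight fixed, tab-separated VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); the function name and the example variant are invented for illustration.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),                 # 1-based position
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),           # several ALT alleles are comma-separated
        "qual": None if qual == "." else float(qual),
        "filter": filt,                  # e.g. PASS, or the names of failed filters
        "info": info,
    }


# Hypothetical variant, for illustration only.
example = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=100"
print(parse_vcf_line(example))
```

Header lines (those starting with #) are not handled here and would need to be skipped first.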
7
Q

What is an AnnotatedVCF

A

A VCF that has had additional information added.

8
Q

Purpose of QC stage of BI pipeline

A
  • To check inherent quality of the data. Allows user to check if there are any major problems with dataset.
  • Essential to improve alignment and downstream analysis.
9
Q

Give some examples of the quality metrics checked in QC stage

A

Per base sequence quality, per sequence quality scores, per base N content, sequence duplication levels, overrepresented sequences, k-mer content.
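To make two of these metrics concrete, the sketch below computes per base mean quality and per base N content over a handful of reads. It is not how any particular QC tool implements them; the function name, input layout and example reads are assumptions.

```python
def per_base_metrics(reads):
    """reads: list of (sequence, phred_scores) pairs.

    Returns (mean_quality, n_fraction), one value per read position.
    """
    length = max(len(seq) for seq, _ in reads)
    mean_quality, n_fraction = [], []
    for i in range(length):
        # Only reads long enough to reach position i contribute to it.
        covering = [(seq[i], quals[i]) for seq, quals in reads if i < len(seq)]
        mean_quality.append(sum(q for _, q in covering) / len(covering))
        n_fraction.append(sum(1 for b, _ in covering if b == "N") / len(covering))
    return mean_quality, n_fraction


# Hypothetical reads, for illustration only.
reads = [("ACGN", [30, 30, 20, 2]), ("ACNT", [40, 30, 2, 10])]
print(per_base_metrics(reads))
```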

10
Q

Give an example of tools used in QC step

A
  • FastQC= checks FASTQ files for various QC metrics. Can run several files in parallel, reducing analysis time.
  • Others include PIQA and ShortRead, but these don’t allow parallelization and so are less suited to high-throughput testing
11
Q

What happens in Alignment stage

A

-Short reads produced by sequencing methods are mapped to the corresponding position within the reference genome.

12
Q

What do alignment algorithms need to do

A
  • Find the location in the reference genome that each read came from, ideally assigning each read a single unique position.
  • Must be able to tolerate some deviation from reference to allow for variation between the genomes.
13
Q

Give examples of alignment algorithms

A
  • BLAST (Basic local alignment search tool), used historically. Heuristic approach that is inefficient for larger NGS data sets.
  • BWT (Burrows-Wheeler Transform). Uses filtering and indexing to increase efficiency, reducing time and memory for alignment.
  • MAQ or Stampy, hash-based algorithms. Build hash tables to index sample data before scanning the reference data.
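For orientation, the transform itself can be sketched in a few lines: append a terminator, sort all rotations of the text, and read off the last column. This naive version only shows what the BWT is; real BWT-based aligners build it from suffix arrays and pair it with an index, which is not shown here. The function name and the choice of "$" as terminator are assumptions.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    text += terminator
    # Every cyclic rotation of the text, in lexicographic order.
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last character of each sorted rotation.
    return "".join(rotation[-1] for rotation in rotations)


print(bwt("banana"))  # annb$aa
```

The output tends to group identical characters together, which is what makes the transformed text compress and index well.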
14
Q

What does BWT do to align reads

A

-Filters first to exclude parts of the reference genome where there are no matches (k-mer index or pigeonhole principle).
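The filtering idea can be sketched as follows. By the pigeonhole principle, a read that matches with at most k mismatches can be cut into k+1 non-overlapping seeds, and at least one seed must match exactly. Looking seeds up in a k-mer index therefore gives a short list of candidate positions, and only those are checked in full. A plain dictionary stands in for the real index here; all names and sequences are made up.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read, reference, max_mismatches):
    """Filter step: reference starts where at least one seed matches exactly."""
    n_seeds = max_mismatches + 1
    seed_len = len(read) // n_seeds
    if seed_len == 0:
        raise ValueError("read too short for this many seeds")
    index = build_kmer_index(reference, seed_len)
    candidates = set()
    for s in range(n_seeds):
        offset = s * seed_len
        seed = read[offset:offset + seed_len]
        for hit in index.get(seed, []):
            start = hit - offset  # where the whole read would begin
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    return candidates


def align(read, reference, max_mismatches):
    """Verify step: keep candidates within the allowed mismatch count."""
    hits = []
    for start in sorted(candidate_positions(read, reference, max_mismatches)):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Hypothetical sequences: the read differs from reference[8:15] at one base.
print(align("TTGACGA", "ACGTACGTTTGACCA", max_mismatches=1))
```

This sketch handles mismatches only; indels would need a different verification step.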

15
Q

Why is it important to select alignment algorithm based on sequencing platform

A
  • Alignment algorithm is most accurate when error model employed is chosen according to seq technology.
  • e.g. Illumina platforms are more prone to mismatches (substitutions) than indels.
16
Q

Examples of error models employed by alignment algorithms.

A
  • Hamming distance= counts the positions at which the bases differ between the short read and the reference genome (substitutions only).
  • Edit distance= calculates the number of operations (substitutions, insertions, deletions) required to convert the short read into an exact match of the reference genome.
  • Weighted edit distance= same as edit distance but also assigns weights to given variants, allowing them to be differentiated from each other. E.g. base quality scores can be used to weight variants (helps improve the accuracy of the alignment).
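The first two error models can be sketched directly. The edit distance below is the standard dynamic-programming (Levenshtein) formulation with unit costs; the function names and example sequences are illustrative only.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("GATTACA", "GACTACA"))  # 1
print(edit_distance("GATTACA", "GATACA"))      # 1
```

A weighted variant would replace the unit costs with per-operation weights, for example a lower substitution cost at a low-quality base.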
17
Q

Variant calling steps

A
  • Pre-processing
  • Variant calling
  • Annotation
18
Q

What happens in variant pre-processing step

A
  • Data is refined prior to variant calling.
  • Removes duplicates, realigns and re-calibrates quality scores to give more accurate error scores
  • Picard tools is a package that can do this
19
Q

What happens in variant calling step (SNPs)

A
  • SNV and indel genotypes are called separately
  • Genotype is determined by calculating the proportion of reads in which an allele is observed (het= between 20-80%, hom= over 80%).
  • Requires high base quality to start with.
  • If base quality is lower (often with exome data, where read depth is lower) then probabilistic genotype calling methods are used; otherwise poor quality variants are simply filtered out. These methods are based on Bayes' theorem, predicting genotype likelihoods and giving confidence scores, which improves accuracy. Used by GATK and Samtools.
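The simple count-based rule above can be sketched like this. The 20% and 80% cut-offs come from the card; treating anything below 20% as homozygous reference, and the function and label names, are assumptions made for the example. It is not the probabilistic method used by GATK or Samtools.

```python
def call_genotype(ref_reads, alt_reads, het_min=0.20, hom_min=0.80):
    """Call a genotype from the fraction of reads supporting the alternate allele."""
    depth = ref_reads + alt_reads
    if depth == 0:
        return "no call"
    alt_fraction = alt_reads / depth
    if alt_fraction > hom_min:
        return "hom alt"
    if alt_fraction >= het_min:
        return "het"
    # Assumption: below the het threshold the site is treated as reference.
    return "hom ref"


print(call_genotype(50, 50))  # het
print(call_genotype(5, 95))   # hom alt
```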
20
Q

What happens in variant calling step (Indels)

A
  • Since NGS often uses short reads, indels larger than the read length (~150bp) are difficult to identify. e.g. GATK UnifiedGenotyper detected SNVs with high concordance (99.8%), but only 92% of indels
  • Pindel is an algorithm that identifies break points by determining where the corresponding end of the pair is mapped. Able to detect very large deletions but gives more false positives for smaller indels
21
Q

What happens in variant annotation step

A
  • Provides additional information on the variants identified. e.g. transcripts, gene symbols, HGVS nomenclature and consequence.
  • Data is obtained from various databases
  • Repeat regions are often masked to aid annotation (RepeatMasker).
22
Q

What factors affect NGS error rates

A

-Signal to noise levels, cross talk from nearby beads or clusters, homopolymer count, incomplete extension and position on read.

23
Q

NGS error patterns in Roche (454) and Ion Torrent

A
  • Large variance in signal intensity for homopolymer length, leading to high error rates in indel calling
  • Overall miscall error rate is ~5%
24
Q

NGS error patterns in Illumina

A
  • Overall miscall error rate is ~1%.
  • Errors occur when synthesis becomes desynchronized between different copies of the DNA template in the same cluster. Often the result of inverted repeats or GGC sequences.
  • Base calling becomes less accurate in later cycles as the extent of asynchrony is exacerbated with each cycle.
25
Q

What is a Phred score

A
  • Quality score that converts the estimated probability of an incorrect base call by the sequencer into a log scale.
  • Phred 10 (Q10)= 1 in 10 probability of a base called wrong (accuracy of base call 90%)
  • Phred 20 (Q20)= 1 in 100 probability of a base called wrong (accuracy of base call 99%)
  • Phred 30 (Q30)= 1 in 1000 probability of a base called wrong (accuracy of base call 99.9%)
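The three examples follow from the standard relationship Q = -10 x log10(P), where P is the probability that the base call is wrong. A small sketch of the conversion in both directions (function names are illustrative):

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score corresponding to error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```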
26
Q

Why are quality values important

A

-Used to reject low quality reads, trim low quality bases, improve alignment accuracy
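As a rough sketch of the first two uses, the functions below trim low-quality bases from the 3' end of a read and reject reads whose mean quality is too low. The Q20 thresholds and function names are assumptions, and dedicated trimming tools use more elaborate rules.

```python
def trim_3prime(seq, quals, min_q=20):
    """Remove bases from the 3' end until one has quality >= min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def passes_mean_quality(quals, min_mean=20):
    """Return True if the read's mean quality reaches the threshold."""
    return bool(quals) and sum(quals) / len(quals) >= min_mean


# Hypothetical read, for illustration only.
seq, quals = trim_3prime("GATTACA", [40, 40, 40, 38, 30, 12, 5])
print(seq, quals, passes_mean_quality(quals))
```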

27
Q

What is single/paired end sequencing

A
  • Single read= sequences a fragment in one direction
  • Paired read= sequences from both directions, creating a pair of reads.
  • Benefits of paired read= better at resolving structural rearrangements and improves assembly of repetitive regions, because the relative positions of the two reads in the genome are known.
  • Disadvantages of paired sequencing= more expensive, takes longer.
28
Q

Why is accurate sequence alignment important

A
  • Crucial for variant detection. Incorrect alignment leads to errors in SNP detection and genotype calling.
  • Alignment more difficult for NGS compared to Sanger as reads are shorter. If too short then they won’t accurately align.
29
Q

Why is depth of coverage important

A
  • A measure of the number of times that a specific genomic site is sequenced during a sequencing run.
  • Accuracy in NGS requires an adequate number of overlapping reads.
  • Coverage can be variable due to: library preparation (differential ligation of adaptors or amplification)
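A minimal sketch of the measurement itself: given aligned read intervals, count how many reads overlap each position in a region, then report the fraction of positions that reach a minimum depth. The half-open 0-based interval convention, the names and the example reads are assumptions.

```python
def depth_per_position(reads, region_start, region_end):
    """reads: (start, end) half-open intervals. Returns depth at each position."""
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Count only the part of the read that falls inside the region.
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


def fraction_at_or_above(depth, threshold):
    """Fraction of positions covered by at least `threshold` reads."""
    return sum(d >= threshold for d in depth) / len(depth)


# Hypothetical alignments, for illustration only.
depth = depth_per_position([(0, 5), (3, 8), (4, 10)], 0, 10)
print(depth, fraction_at_or_above(depth, 2))
```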
30
Q

How is a bioinformatic pipeline validated

A
  • Assess pipeline output against a truth set (e.g. Genome in a Bottle). Can then calculate sensitivity and specificity.
  • Sensitivity should be calculated from >10 individuals, including patients tested in house and validated by Sanger.
  • Tested over multiple runs to look at reproducibility.
  • Validation samples should be downsampled to test the limit of detection (to know if you can detect all your variants at stated limit, e.g. 20x).
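The comparison against a truth set can be sketched with sets of variants. Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). Counting true negatives as every assessed site that is neither called nor in the truth set is an assumption of this sketch, as are the names and the example variants.

```python
def compare_to_truth(called, truth, total_sites):
    """called, truth: sets of (chrom, pos, ref, alt); total_sites: sites assessed."""
    tp = len(called & truth)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    tn = total_sites - tp - fp - fn  # assumption: one possible call per site
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical variants, for illustration only.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr3", 10, "A", "T")}
print(compare_to_truth(called, truth, total_sites=1000))
```

Running the same comparison on downsampled data is one way to see at what depth sensitivity starts to fall.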