20.02.16 NGS - Bioinformatics Flashcards by Rachel Challis

What is a bioinformatic pipeline generally split into

Quality control
Sequence alignment
Variant Calling
Annotation

What is a BCL file

Base calls per cycle.
Binary file containing the base call and quality for each tile in each cycle
Raw file produced by Illumina platforms (the MiSeq is the exception, as it can output FASTQs directly).
Must be converted to FASTQs for bioinformatic analysis

What are FASTQ files

Text-based format for storing both a nucleotide sequence and its corresponding quality scores.
Generally the input file for most BI pipelines
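
A minimal sketch of reading one FASTQ record, assuming the usual four-line layout (header, sequence, separator, quality string) and Phred+33 quality encoding. The function name `parse_fastq_record` and the example read are invented for illustration.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (read_id, sequence, qualities)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a valid FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Assumption: Phred+33 encoding, so each score is the ASCII code minus 33.
    qualities = [ord(char) - 33 for char in quality]
    return header[1:], sequence, qualities


# Hypothetical record, one list element per line of the file.
record = ["@read1\n", "GATTACA\n", "+\n", "IIII5+!\n"]
read_id, sequence, qualities = parse_fastq_record(record)
print(read_id, sequence, qualities)
```

Each quality character maps to one base, which is why the sequence and quality lines must be the same length.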

What are BAM files

Binary Alignment/Map file.
Compressed binary version of the text-based SAM format, storing reads together with their alignment to the reference.
Formed when a set of FASTQs has been aligned to a reference genome using an alignment algorithm.
Can be used during analysis to visualise variants or to check quality/coverage.

What are CRAM files

A highly compressed alternative to BAM, achieved largely by storing only the differences from the reference sequence. Will become more common as we try to reduce the amount of data stored long-term.

What is a VCF file

Variant Call Format file, used to show the genetic variations in a sample (only the differences from the reference genome).
Produced when a variant calling program is applied to a BAM file.
Contains chromosome, position, reference and alternate alleles, quality scores and filter status.
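
A minimal sketch of pulling those fields out of a single VCF data line, assuming the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The function name `parse_vcf_line` and the example variant are invented for illustration; a real pipeline would normally use an established VCF library instead.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are comma-separated
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": filt,
        "info": info,
    }


# Hypothetical variant line, not taken from any real sample.
example = "chr1\t12345\t.\tA\tG\t50.0\tPASS\tDP=30"
variant = parse_vcf_line(example)
print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"], variant["filter"])
```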

What is an annotated VCF

A VCF that has had additional information added.

Purpose of QC stage of BI pipeline

To check the inherent quality of the data. Allows the user to check whether there are any major problems with the dataset.
Essential to improve alignment and downstream analysis.

Give some examples of the quality metrics checked in QC stage

Per base sequence quality, per sequence quality scores, per base N content, sequence duplication levels, overrepresented sequences, k-mer content.

Give an example of tools used in QC step

FastQC= checks FASTQ files for various QC metrics. Can run several files in parallel, reducing analysis time.
Others include PIQA and ShortRead, but these do not allow parallelization and are thus less suited to high-throughput testing.

What happens in Alignment stage

Short reads produced by sequencing methods are mapped to the corresponding position within the reference genome.

What do alignment algorithms need to do

Find the location in the reference genome that each read corresponds to, ideally assigning each read a single unique location.
Must be able to tolerate some deviation from the reference to allow for variation between genomes.

Give examples of alignment algorithms

BLAST (Basic Local Alignment Search Tool), used historically. Heuristic approach that is inefficient for larger NGS data sets.
BWT (Burrows-Wheeler Transform). Uses filtering and indexing to increase efficiency, reducing the time and memory needed for alignment.
MAQ or Stampy, hash-based algorithms. Build hash tables to index the sample data before scanning the reference data.

What does BWT do to align reads

Filters first to exclude parts of the reference genome where there are no matches (using a k-mer index or the pigeonhole principle).
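
The filtering idea can be sketched with a plain k-mer index. This illustrates the seed-and-filter step only and is not an implementation of the Burrows-Wheeler Transform itself. It relies on the pigeonhole principle: if a read is cut into more non-overlapping seeds than it has mismatches, at least one seed must match the reference exactly. The function names and sequences below are invented for illustration.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def candidate_positions(read, index, k):
    """Return reference start positions suggested by exact seed matches."""
    candidates = set()
    # Cut the read into non-overlapping seeds of length k and look each one up.
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            if hit - offset >= 0:
                candidates.add(hit - offset)
    return sorted(candidates)


reference = "ACGTACGTTAGCCGATAGGCTA"
index = build_kmer_index(reference, k=4)
# reference[8:20] with one mismatch (A -> C at read position 6, 0-based)
read = "TAGCCGCTAGGC"
print(candidate_positions(read, index, k=4))
```

Only the candidate positions need to be checked in full, so most of the reference is never compared against the read.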

Why is it important to select alignment algorithm based on sequencing platform

An alignment algorithm is most accurate when its error model is chosen according to the sequencing technology.
e.g. Illumina platforms are more prone to mismatches than to indels.

Examples of error models employed by alignment algorithms.

Hamming distance= the number of positions at which the bases differ between the reference genome and the short read sequence (substitutions only).
Edit distance= the number of operations (substitutions, insertions or deletions) required to convert the short read into an exact match of the reference genome.
Weighted edit distance= same as edit distance but also assigns weights to given variants, allowing them to be differentiated from each other. E.g. base quality scores can be used to weight variants (helps improve the accuracy of the alignment).
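
A minimal sketch of the first two error models, assuming plain strings for the read and the reference segment. The edit distance here is the standard Levenshtein distance with unit costs; the function names are invented for illustration, and real aligners use heavily optimised variants.

```python
def hamming_distance(read, reference):
    """Count mismatching positions between two equal-length sequences."""
    if len(read) != len(reference):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(a != b for a, b in zip(read, reference))


def edit_distance(read, reference):
    """Levenshtein distance via dynamic programming, one row at a time."""
    previous = list(range(len(reference) + 1))
    for i, read_base in enumerate(read, start=1):
        current = [i]
        for j, ref_base in enumerate(reference, start=1):
            substitution = previous[j - 1] + (read_base != ref_base)
            insertion = previous[j] + 1
            deletion = current[j - 1] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(hamming_distance("GATTACA", "GACTATA"))
print(edit_distance("GATTACA", "GATACA"))
```

Hamming distance cannot represent an indel at all, which is why it only suits platforms whose errors are mostly substitutions.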

Variant calling steps

Pre-processing
Variant calling
Annotation

What happens in variant pre-processing step

Data is refined prior to variant calling.
Removes duplicates, realigns reads and recalibrates quality scores to give more accurate error estimates.
Picard tools is a package that can do this
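
A simplified sketch of the duplicate-removal idea, assuming reads are dictionaries with hypothetical `chrom`, `start`, `strand` and `mean_quality` keys. Reads that map to the same position and strand are treated as duplicates and only the highest-quality one is kept. This is an illustration of the principle, not a description of how Picard itself decides.

```python
def remove_duplicates(reads):
    """Keep the highest-quality read for each (chrom, start, strand) key."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in best or read["mean_quality"] > best[key]["mean_quality"]:
            best[key] = read
    return list(best.values())


# Hypothetical aligned reads; r1 and r2 share a position and strand.
reads = [
    {"name": "r1", "chrom": "chr1", "start": 100, "strand": "+", "mean_quality": 30},
    {"name": "r2", "chrom": "chr1", "start": 100, "strand": "+", "mean_quality": 35},
    {"name": "r3", "chrom": "chr1", "start": 150, "strand": "-", "mean_quality": 28},
]
kept = remove_duplicates(reads)
print([read["name"] for read in kept])
```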

What happens in variant calling step (SNPs)

SNV and indel genotypes are called separately.
Genotype is determined by calculating the proportion of reads in which an allele is observed (het= between 20-80%, hom= over 80%).
Requires high base quality to start with.
If base quality is lower (often with exome data, where read depth is lower) then probabilistic genotype calling methods are used; otherwise poor quality variants are filtered out. These methods are based on Bayes' theorem to predict genotype likelihoods and give confidence scores, improving accuracy. Used by GATK and Samtools.
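
The threshold-based approach can be sketched as below, using the 20% and 80% cut-offs from the card. Treating anything under 20% as homozygous reference is an assumption made here for completeness, and the function name `call_genotype` is invented for illustration. The probabilistic methods used by GATK and Samtools are considerably more involved and are not shown.

```python
def call_genotype(ref_count, alt_count, het_min=0.20, hom_min=0.80):
    """Call a genotype from the fraction of reads supporting the alternate allele."""
    depth = ref_count + alt_count
    if depth == 0:
        return "no call"
    alt_fraction = alt_count / depth
    if alt_fraction > hom_min:
        return "hom alt"
    if alt_fraction >= het_min:
        return "het"
    # Assumption: below the heterozygous threshold the site is called reference.
    return "hom ref"


print(call_genotype(15, 15))
print(call_genotype(1, 29))
print(call_genotype(29, 1))
```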

What happens in variant calling step (Indels)

Since NGS often uses short reads, indels larger than the read length (~150bp) are difficult to identify. e.g. GATK UnifiedGenotyper detected SNVs with high concordance (99.8%), but only 92% of indels.
Pindel is an algorithm that identifies break points by determining where the corresponding end of the pair is mapped. Able to detect very large deletions but produces more false positives among smaller indels.

What happens in variant annotation step

Provides additional information on the variants identified. e.g. transcripts, gene symbols, HGVS nomenclature and consequence.
Data is obtained from various databases
Repeat regions are often masked to aid annotation (RepeatMasker).

What factors affect NGS error rates

Signal-to-noise levels, crosstalk from nearby beads or clusters, homopolymer count, incomplete extension and position on read.

NGS error patterns in Roche (454) and Ion Torrent

Large variance in signal intensity with homopolymer length leads to high error rates in indel calling.
Overall miscall error rate is ~5%

NGS error patterns in Illumina

Overall miscall error rate is ~1%.
Errors occur when synthesis becomes desynchronized between different copies of the DNA template in the same cluster. Often the result of inverted repeats or GGC sequences.
Base calling becomes less accurate in later cycles as the extent of asynchrony is exacerbated with each cycle.

What is a Phred score

Quality score that converts the estimated probability of an incorrect call by the sequencer at a given base to a log scale.
Phred 10 (Q10)= 1 in 10 probability of a base being called wrong (accuracy of base call 90%)
Phred 20 (Q20)= 1 in 100 probability of a base being called wrong (accuracy of base call 99%)
Phred 30 (Q30)= 1 in 1000 probability of a base being called wrong (accuracy of base call 99.9%)
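
The Q10/Q20/Q30 values follow from the standard Phred definition, Q = -10 * log10(P), where P is the estimated probability that the base call is wrong. A minimal sketch of the conversion in both directions (function names invented for illustration):

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call is wrong, given its Phred score."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for a given probability of an incorrect base call."""
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):g}%")
```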

Why are quality values important

Used to reject low quality reads, trim low quality bases and improve alignment accuracy.
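
A minimal sketch of trimming low quality bases from the 3' end of a read, assuming qualities are already decoded to Phred scores. The Q20 cut-off and the function name `trim_low_quality_tail` are choices made here for illustration; dedicated trimming tools typically use more robust schemes such as sliding windows.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Drop trailing bases whose quality is below min_quality."""
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


trimmed_sequence, trimmed_qualities = trim_low_quality_tail(
    "GATTACA", [40, 40, 40, 40, 20, 10, 0]
)
print(trimmed_sequence, trimmed_qualities)
```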

What is single/paired end sequencing

Single read= sequences a fragment in one direction
Paired read= sequences from both directions, creating a pair of reads.
Benefits of paired read= better at resolving structural rearrangements and improves assembly of repetitive regions. This is because it can identify the relative positions of reads in the genome.
Disadvantages of paired sequencing= more expensive, takes longer.

Why is accurate sequence alignment important

Crucial for variant detection. Incorrect alignment leads to errors in SNP detection and genotype calling.
Alignment is more difficult for NGS compared to Sanger as reads are shorter. If too short then they won't accurately align.

Why is depth of coverage important

A measure of the number of times that a specific genomic site is sequenced during a sequencing run.
Accuracy in NGS requires an adequate number of overlapping reads.
Coverage can be variable due to: library preparation (differential ligation of adaptors or amplification)
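
A minimal sketch of computing depth per position, assuming each aligned read is reduced to a hypothetical (start, end) interval in 0-based, end-exclusive coordinates within a small region. Real pipelines compute this from the BAM file with dedicated tools.

```python
def depth_per_position(read_intervals, region_length):
    """Count how many reads cover each position of a region."""
    depth = [0] * region_length
    for start, end in read_intervals:
        # Clip each read to the region before counting.
        for position in range(max(start, 0), min(end, region_length)):
            depth[position] += 1
    return depth


# Hypothetical read placements over a 10 bp region.
reads = [(0, 5), (2, 7), (4, 10)]
depth = depth_per_position(reads, region_length=10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

Looking at the per-position list as well as the mean shows where coverage dips below a required minimum even when the average looks adequate.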

How is a bioinformatic pipeline validated

Assess pipeline output against a truth set (e.g. Genome in a Bottle). Can then calculate sensitivity and specificity.
Sensitivity should be calculated from >10 individuals, including patients tested in house and validated by Sanger.
Tested over multiple runs to look at reproducibility.
Validation samples should be downsampled to test the limit of detection (to know if you can detect all your variants at the stated limit, e.g. 20x).
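
A minimal sketch of the sensitivity and specificity calculation, assuming variants are represented as (chrom, pos, ref, alt) tuples and that `total_sites` is the number of positions assessed, with at most one variant per site. The function name and all example variants are invented for illustration.

```python
def compare_to_truth(called, truth, total_sites):
    """Return (sensitivity, specificity) of a call set against a truth set."""
    true_pos = len(called & truth)
    false_pos = len(called - truth)
    false_neg = len(truth - called)
    # Assumption: every other assessed site is a correctly uncalled reference site.
    true_neg = total_sites - true_pos - false_pos - false_neg
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical call set and truth set over 1000 assessed sites.
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "T", "C")}
sensitivity, specificity = compare_to_truth(called, truth, total_sites=1000)
print(f"sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```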

(30 cards)