NGS Flashcards
What constitutes a bioinformatics pipeline?
A bioinformatics pipeline is composed of several software algorithms assembled into a workflow that processes raw sequencing data to generate a list of annotated sequence variants.
Explain what demultiplexing is, the file formats involved and what tool you could use to perform demultiplexing
Demultiplexing is the first step in the pipeline and is the process of assigning each read to the appropriate sample, based on the “barcode” or index used in the library preparation
A binary base call (BCL) file is input to a tool such as bcl2fastq.
BCL files are a binary representation of the raw sequencing data, and contain the base and the confidence in the call as a quality score
For every cycle of the sequencing run, a call is added for every cluster location identified on the flow cell (organised by lane and tile).
A FASTQ file is output. A FASTQ file is a text file in which each read is represented by four lines: an identifier line beginning with @ that carries information about the cluster on the flow cell, the sequence itself, a separator line beginning with +, and a quality line with one encoded quality value per base.
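The four-line FASTQ layout can be illustrated with a minimal sketch. The function name `parse_fastq`, the `FastqRecord` type and the example read are hypothetical, and Phred+33 quality encoding is assumed.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str       # identifier line without the leading "@"
    sequence: str      # base calls
    qualities: list    # Phred scores decoded from the quality line


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield one record per four-line block of FASTQ text."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at non-empty line {i + 1}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


# Made-up read for illustration only
example = [
    "@read1 1:N:0:ACGT",
    "GATTACA",
    "+",
    "IIIIII#",
]
records = list(parse_fastq(example))
```

In practice an established parser (for example the one in Biopython) would normally be preferred over hand-written code.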
Explain the process of adapter trimming, the file formats involved and what tool you could use to perform adapter trimming
Adapter trimming removes contaminating and over-represented known sequences from the reads, e.g. Illumina adapter sequences or unique molecular identifiers (UMIs).
Tools include Trim Galore, Cutadapt and fastp; UMICollapse can be used to deduplicate reads by UMI.
A FASTQ file is input and a trimmed FASTQ file is output
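The idea behind 3' adapter trimming can be sketched as follows. This is a simplified illustration using exact matching only; real tools such as Cutadapt also tolerate mismatches. The function name `trim_adapter` and the `min_overlap` parameter are hypothetical, and the adapter used is the common Illumina adapter prefix.

```python
def trim_adapter(sequence, qualities, adapter, min_overlap=3):
    """Remove a 3' adapter, including a partial adapter at the read end."""
    # Case 1: the full adapter occurs somewhere in the read
    idx = sequence.find(adapter)
    if idx == -1:
        # Case 2: a prefix of the adapter matches the end of the read;
        # try the longest possible overlap first
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                idx = len(sequence) - overlap
                break
    if idx == -1:
        return sequence, qualities  # no adapter found, read unchanged
    # Trim the sequence and its quality string to the same length
    return sequence[:idx], qualities[:idx]


ADAPTER = "AGATCGGAAGAGC"  # start of the Illumina adapter sequence

full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

The quality string is trimmed alongside the sequence so the output remains a valid FASTQ record.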
Explain the process of alignment, the file formats involved and what tool you could use to perform alignment
Alignment is the process of mapping sequenced fragments (reads) to the reference genome. A FASTQ file is input and a Sequence Alignment/Map (SAM) file is output. The SAM file contains the details of each read, including its alignment position and size, together with a header listing the names and sizes of the chromosomes in the reference genome to which the reads were aligned.
A commonly used tool is the Burrows-Wheeler Aligner (BWA-MEM).
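To show how the chromosome names and sizes and the per-read alignment details sit in a SAM file, here is a minimal sketch. The function name `parse_sam` and the example read are hypothetical; the field order follows the mandatory SAM columns (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL).

```python
def parse_sam(lines):
    """Split SAM text into reference lengths (from @SQ) and alignment records."""
    references, alignments = {}, []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            fields = line.split("\t")
            if fields[0] == "@SQ":
                # @SQ header lines carry the reference name (SN) and length (LN)
                tags = dict(f.split(":", 1) for f in fields[1:])
                references[tags["SN"]] = int(tags["LN"])
            continue
        cols = line.split("\t")
        alignments.append({
            "qname": cols[0], "flag": int(cols[1]), "rname": cols[2],
            "pos": int(cols[3]), "mapq": int(cols[4]), "cigar": cols[5],
            "seq": cols[9], "qual": cols[10],
        })
    return references, alignments


# Made-up header and read for illustration only
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:248956422",
    "read1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII",
]
references, alignments = parse_sam(example)
```

For real data a library such as pysam, or samtools on the command line, would usually be used instead.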
What four considerations are taken into account when choosing an alignment tool?
Accuracy.
Ability to handle errors, mismatches and gaps.
Speed (turn around time).
Processing requirements/compute should be reasonable.
Which approaches to alignment allow for the alignment of NGS data?
Fast, memory-efficient search strategies built on the Burrows-Wheeler Transform (BWT) and the FM index, as used by aligners such as BWA and Bowtie.
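A toy sketch of exact matching with a BWT and FM index is given below. It is not how BWA is implemented (production aligners use compressed, sampled indexes and allow mismatches and gaps); all function names are hypothetical and the suffix array is built naively for clarity.

```python
def bwt_from_suffix_array(text):
    """Return the suffix array and BWT of text, using "$" as terminator."""
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT character = the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def build_fm_index(bwt):
    """Build C (number of smaller characters) and Occ (running counts)."""
    alphabet = sorted(set(bwt))
    first_col, total = {}, 0
    for ch in alphabet:
        first_col[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, ch in enumerate(bwt):
        for a in alphabet:
            occ[a][i + 1] = occ[a][i] + (1 if a == ch else 0)
    return first_col, occ


def backward_search(pattern, suffix_array, first_col, occ):
    """Return sorted reference positions where pattern occurs exactly."""
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in first_col:
            return []
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


reference = "GATTACAGATTACA"  # toy reference sequence
sa, bwt = bwt_from_suffix_array(reference)
first_col, occ = build_fm_index(bwt)
hits = backward_search("TTAC", sa, first_col, occ)
misses = backward_search("GGG", sa, first_col, occ)
```

The search cost depends on the length of the read rather than the length of the reference, which is why this approach scales to whole genomes.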
What post-processing steps would you perform on a SAM file?
Reads are sorted into the correct order (e.g. by genomic coordinate) using a tool such as Picard, and the SAM file is compressed to a binary alignment map (BAM) file, i.e. a binary equivalent of the SAM file that contains identical information but requires less storage space.
Explain the process of variant calling, the file formats involved and what tool you could use to perform variant calling
Variant calling is the process by which the differences between the mapped reads and the reference sequence are analysed and classified as genuine variants or artefacts. A BAM file is input to produce a Variant Call Format (VCF) file
A VCF is a tab-delimited text file that describes the position of variants with respect to a reference and also stores genotype information. Eight columns are required: chromosome (CHROM), position (POS), ID, reference allele (REF), alternative allele (ALT), quality (QUAL), filter (FILTER, i.e. pass vs fail) and info (INFO, which can contain allele frequency and depth information).
Examples of variant callers include VarScan, Mutect2 and HaplotypeCaller.
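The eight required VCF columns can be read with a short sketch like the one below. The function name `parse_vcf_line` and the example variant are hypothetical; only the column order comes from the VCF format.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line):
    """Parse the eight fixed columns of one VCF data line into a dict."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])
    info = {}
    if record["INFO"] != ".":  # "." means no INFO annotations
        for entry in record["INFO"].split(";"):
            key, _, value = entry.partition("=")
            # Flags have no "=value" part; store them as True
            info[key] = value if value else True
    record["INFO"] = info
    return record


# Made-up variant for illustration only
example = "chr17\t43045712\t.\tG\tA\t50\tPASS\tDP=120;AF=0.48"
record = parse_vcf_line(example)
```

Header lines (those beginning with #) and the optional genotype columns are ignored in this sketch.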
What considerations are taken into account when selecting variant callers for germline vs somatic data?
A probabilistic/Bayesian approach is suited to calling germline variants; it calculates the likelihood of each possible genotype given the observed data at a nucleotide position. Heterozygous: an approximately 50:50 split between the ref and alt allele, or between two alt alleles.
Homozygous: approximately 100% alternative allele. Examples include FreeBayes and GATK.
A heuristic approach is suited to somatic variant calling because somatic variants may be present in only a small fraction of cells and so do not follow the expected germline allele ratios. Cut-offs for read depth, base quality, number of supporting reads and variant allele frequency (VAF) are used to identify variants. An example is VarScan.
GATK HaplotypeCaller is a germline caller that reassembles reads into candidate haplotypes and then genotypes them probabilistically, so it belongs with the probabilistic callers rather than the heuristic ones.
DeepVariant uses neural networks.
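The contrast between the two approaches can be sketched in a few lines. This is a deliberately simplified model, not the algorithm of any named caller: the germline part uses a binomial likelihood with flat priors, and every function name, the error rate and all thresholds are hypothetical values chosen for illustration.

```python
from math import comb


def genotype_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Binomial likelihood of the observed alt reads under each diploid genotype."""
    depth = ref_count + alt_count
    # Expected alt-read fraction for hom-ref, het and hom-alt genotypes
    expected_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    return {
        gt: comb(depth, alt_count) * p ** alt_count * (1 - p) ** ref_count
        for gt, p in expected_alt.items()
    }


def most_likely_genotype(ref_count, alt_count):
    """Pick the genotype with the highest likelihood (flat priors assumed)."""
    likelihoods = genotype_likelihoods(ref_count, alt_count)
    return max(likelihoods, key=likelihoods.get)


def passes_somatic_thresholds(depth, alt_count, min_depth=100,
                              min_alt_reads=5, min_vaf=0.02):
    """Heuristic filter: fixed cut-offs on depth, supporting reads and VAF."""
    vaf = alt_count / depth if depth else 0.0
    return depth >= min_depth and alt_count >= min_alt_reads and vaf >= min_vaf


het_call = most_likely_genotype(ref_count=14, alt_count=16)
hom_call = most_likely_genotype(ref_count=0, alt_count=30)
low_vaf_kept = passes_somatic_thresholds(depth=500, alt_count=25)  # VAF 5%
low_vaf_dropped = passes_somatic_thresholds(depth=500, alt_count=3)
```

A variant at 5% VAF would never be called heterozygous by the likelihood model, which is why threshold-based filtering is used for somatic data.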
Explain the process of CNV calling and what tool you could use to perform CNV calling
Copy number variant (CNV) calling identifies deletions or duplications of particular genomic regions and is normally based on comparing read depth across a region with the depth expected from reference samples. Tools include ExomeDepth and CNVkit.
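The read-depth comparison can be sketched as a log2 ratio per region. This is a minimal illustration, not the statistical model used by ExomeDepth or CNVkit; the function names, region names, depths and the `loss`/`gain` thresholds are all hypothetical.

```python
from math import log2


def log2_ratios(sample_depth, reference_depth):
    """Log2 ratio of normalised sample depth to normalised reference depth."""
    sample_total = sum(sample_depth.values())
    reference_total = sum(reference_depth.values())
    ratios = {}
    for region, depth in sample_depth.items():
        # Normalise by total depth so samples sequenced to different depths compare
        sample_fraction = depth / sample_total
        reference_fraction = reference_depth[region] / reference_total
        if sample_fraction == 0:
            ratios[region] = float("-inf")  # no reads at all in this region
        else:
            ratios[region] = log2(sample_fraction / reference_fraction)
    return ratios


def classify(ratio, loss=-0.5, gain=0.4):
    """Label a region using illustrative log2 ratio cut-offs."""
    if ratio <= loss:
        return "loss"
    if ratio >= gain:
        return "gain"
    return "neutral"


# Made-up mean depths per exon
sample = {"exon1": 100, "exon2": 50, "exon3": 100, "exon4": 100}
reference = {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100}
ratios = log2_ratios(sample, reference)
calls = {region: classify(r) for region, r in ratios.items()}
```

An ideal heterozygous deletion gives a log2 ratio of -1 and a single-copy duplication about 0.58, so the cut-offs sit between those values and 0.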
Explain the process of variant annotation and tools that can be used
Variant annotation takes the variants in an input VCF and adds information from external data sources to each one. Tools include Alamut Batch, Cancer Genome Interpreter and Alissa.
How is the overall quality of a run assessed?
Overall run-level metrics highlight how well the run has performed on a technical level in terms of read quality and signal purity. The most important metrics to note are the % bases ≥Q30, the % clusters passing filter (PF) and the cluster density. There may be validated thresholds for these metrics that are used to pass or fail a run.
The error rate and yield can also be reported.
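What is Q30?
Q30 is a Phred quality score of 30, meaning a 1 in 1,000 chance that a base call is wrong (99.9% base call accuracy). %Q30 is the percentage of bases with a quality score of 30 or above.
What are clusters PF?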
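The relationship between a Phred score and error probability, Q = -10 log10(P), and the %Q30 metric can be sketched as follows. The function names and the example quality strings are hypothetical, and Phred+33 encoding is assumed.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def percent_q30(quality_strings, offset=33):
    """Percentage of bases with a Phred score of 30 or above."""
    # Assumption: qualities are ASCII-encoded with the given offset (Phred+33)
    scores = [ord(c) - offset for qual in quality_strings for c in qual]
    if not scores:
        return 0.0
    return 100 * sum(s >= 30 for s in scores) / len(scores)


error_at_q30 = phred_to_error_probability(30)
# "I" encodes Q40 and "#" encodes Q2, so 6 of these 8 bases are >= Q30
example_percent = percent_q30(["IIII", "II##"])
```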
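The percentage of clusters that pass the Illumina chastity filter, which indicates the signal purity of each cluster. Over-clustered flow cells with many overlapping clusters lead to poor template generation and a lower %PF.
What is cluster density?
Cluster density is the number of clusters per unit area of the flow cell (K/mm²). Optimal cluster density results in optimal run quality, reads passing filter, Q30 scores and total data output.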
Under clustering leads to lower data output, while over clustering leads to poor run performance, lower Q30 scores, the possible introduction of sequencing artefacts, and lower total data output.