NGS Flashcards

1
Q

What constitutes a bioinformatics pipeline?

A

A bioinformatics pipeline is a set of software algorithms assembled into a workflow that processes raw sequencing data and generates a list of annotated sequence variants.

2
Q

Explain what demultiplexing is, the file formats involved and what tool you could use to perform demultiplexing

A

Demultiplexing is the first step in the pipeline: each read is assigned to the appropriate sample based on the “barcode” (index) sequence added during library preparation.
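
The assignment step can be sketched as follows. This is a minimal illustration, not how bcl2fastq is implemented: the function names, sample names, index sequences and the one-mismatch allowance are all assumptions chosen for the example.

```python
from typing import Dict


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read_index: str, sample_indexes: Dict[str, str],
                  max_mismatches: int = 1) -> str:
    """Return the sample whose index matches the read's index.

    A read is assigned only if exactly one sample index is within
    max_mismatches of the observed index; otherwise it is left unassigned.
    """
    matches = [
        sample
        for sample, index in sample_indexes.items()
        if len(index) == len(read_index)
        and hamming(index, read_index) <= max_mismatches
    ]
    return matches[0] if len(matches) == 1 else "Undetermined"


# Hypothetical sample sheet: sample name -> index sequence
SAMPLE_INDEXES = {"sampleA": "ATCACG", "sampleB": "CGATGT"}

exact = assign_sample("ATCACG", SAMPLE_INDEXES)       # perfect match
one_error = assign_sample("ATCACC", SAMPLE_INDEXES)   # one sequencing error
unknown = assign_sample("GGGGGG", SAMPLE_INDEXES)     # matches no sample
```

Allowing a mismatch rescues reads with a single sequencing error in the index, at the cost of requiring sample indexes that differ from each other at enough positions.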

The input is a set of binary base call (BCL) files, which are passed to a tool such as bcl2fastq.

BCL files are a binary representation of the raw sequencing data and contain each base call together with the confidence in that call as a quality score.

For every cycle of the sequencing run, a call is added for every cluster location identified on the flow cell (organised by lane and tile).

The output is a FASTQ file. A FASTQ file is a text file in which each read is represented by four lines: an identifier line (beginning with “@”) describing the read and the cluster it came from, the sequence itself, a separator line (beginning with “+”), and a quality line with one encoded quality value per base.
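
The four-line FASTQ layout can be illustrated with a small parser. This is a sketch under simple assumptions (uncompressed text, no line wrapping); the record name and read shown are made up for the example.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # line 1, without the leading "@"
    sequence: str    # line 2, the base calls
    quality: str     # line 4, one ASCII character per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield one record per four-line block of FASTQ text."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ text must contain a multiple of four lines")
    for i in range(0, len(lines), 4):
        header, seq, separator, qual = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record number {i // 4 + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield FastqRecord(header[1:], seq, qual)


# Made-up single-read example
EXAMPLE_FASTQ = """@SIM:1:FCX:1:15:6329:1045 1:N:0:ATCACG
GATTACA
+
IIIIHH#
"""

records = list(parse_fastq(EXAMPLE_FASTQ))
```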

3
Q

Explain the process of adapter trimming, the file formats involved and what tool you could use to perform adapter trimming

A

Adapter trimming removes contamination and over-represented known sequences, e.g. Illumina adapter sequences or unique molecular identifiers (UMIs), from the reads.

Tools include Trim Galore, Cutadapt, UMICollapse and fastp.

A FASTQ file is input and a trimmed FASTQ file is output.
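
A minimal sketch of 3' adapter trimming, assuming exact matching only. Real trimmers such as Cutadapt also tolerate mismatches and apply quality trimming; the function name, the adapter string and the `min_overlap` default are illustrative assumptions.

```python
from typing import Tuple


def trim_adapter(sequence: str, quality: str, adapter: str,
                 min_overlap: int = 3) -> Tuple[str, str]:
    """Remove a 3' adapter and everything after it (exact matches only)."""
    # Case 1: the whole adapter occurs somewhere in the read
    cut = sequence.find(adapter)
    if cut == -1:
        # Case 2: only the start of the adapter is present at the read end
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality  # no adapter found, read unchanged
    # Trim the quality string at the same position as the sequence
    return sequence[:cut], quality[:cut]


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence

full = trim_adapter("ACGTACGT" + ADAPTER + "TT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
```

The quality string is cut at the same position so that the trimmed FASTQ record stays internally consistent.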

4
Q

Explain the process of alignment, the file formats involved and what tool you could use to perform alignment

A

Alignment is the process of mapping sequenced fragments (reads) to the reference genome. A FASTQ file is input and a Sequence Alignment Map (SAM) file is output. The SAM file contains the details of each read, including its alignment position and size, together with a header listing the names and sizes of the chromosomes in the reference genome to which the alignment was made.
An example tool is the Burrows-Wheeler Aligner (BWA-MEM).
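
To make the SAM layout concrete, the sketch below reads the reference names and sizes from the `@SQ` header lines and a few of the mandatory columns from each alignment line. The helper name and the toy record are assumptions; production code would normally use a dedicated library.

```python
from typing import Dict, List, Tuple


def parse_sam(text: str) -> Tuple[Dict[str, int], List[dict]]:
    """Return (reference name -> length, list of alignment dicts)."""
    references: Dict[str, int] = {}
    alignments: List[dict] = []
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Header line; @SQ lines describe one reference sequence each
            if fields[0] == "@SQ":
                tags = dict(field.split(":", 1) for field in fields[1:])
                references[tags["SN"]] = int(tags["LN"])
            continue
        # Alignment line: the first 11 columns are mandatory
        alignments.append({
            "read_name": fields[0],
            "flag": int(fields[1]),
            "reference": fields[2],
            "position": int(fields[3]),      # 1-based leftmost position
            "mapping_quality": int(fields[4]),
            "cigar": fields[5],
            "sequence": fields[9],
            "quality": fields[10],
        })
    return references, alignments


# Toy SAM content (made-up reference length and read)
TOY_SAM = "\n".join([
    "\t".join(["@SQ", "SN:chr1", "LN:1000"]),
    "\t".join(["read1", "0", "chr1", "101", "60", "7M",
               "*", "0", "0", "GATTACA", "IIIIHH#"]),
])

references, alignments = parse_sam(TOY_SAM)
```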

5
Q

What four considerations are taken into account when choosing an alignment tool?

A

Accuracy.
Ability to handle errors, mismatches and gaps.
Speed (turnaround time).
Compute requirements should be reasonable.

6
Q

Which approaches to alignment allow for the alignment of NGS data?

A

A fast search strategy based on the Burrows-Wheeler Transform (BWT) and the FM-index, which allows reads to be matched against a compact index of the reference genome.
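
The idea can be sketched with a toy FM-index that supports exact matching by backward search. This is a teaching example on a short string, assuming a naive suffix sort and uncompressed count tables; real aligners use far more compact structures and also handle mismatches and gaps.

```python
from typing import Dict, List, Tuple


def build_fm_index(reference: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build a suffix array, first-column offsets and occurrence counts."""
    text = reference + "$"  # "$" is a terminator that sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort before c
    first: Dict[str, int] = {}
    total = 0
    for symbol in alphabet:
        first[symbol] = total
        total += text.count(symbol)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ: Dict[str, List[int]] = {symbol: [0] for symbol in alphabet}
    for ch in bwt:
        for symbol in alphabet:
            occ[symbol].append(occ[symbol][-1] + (ch == symbol))
    return suffix_array, first, occ


def find(pattern: str, suffix_array: List[int], first: Dict[str, int],
         occ: Dict[str, List[int]]) -> List[int]:
    """Return the 0-based start positions of exact matches of pattern."""
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for ch in reversed(pattern):   # extend the match one base to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
hits = find("TTAC", *index)
misses = find("GGG", *index)
```

Each step of the search costs the same regardless of reference length, which is why this approach scales to whole genomes.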

7
Q

What post-processing steps would you perform on a SAM file?

A

Reads are sorted (typically by genomic coordinate), e.g. with Picard, and the SAM file is compressed to a Binary Alignment Map (BAM) file, i.e. a binary equivalent of the SAM file that contains identical information but requires less storage space.

8
Q

Explain the process of variant calling, the file formats involved and what tool you could use to perform variant calling

A

Variant calling is the process by which differences between the mapped reads and the reference sequence are analysed and classified as genuine variants or artefacts. A BAM file is input and a Variant Call Format (VCF) file is output.

A VCF is a tab-delimited text file that describes the position of variants with respect to a reference and also stores genotype information. Eight columns are required: chromosome (CHROM), position (POS), identifier (ID), reference allele (REF), alternative allele(s) (ALT), quality (QUAL), filter status (FILTER, i.e. pass vs fail) and additional information (INFO, e.g. allele frequency and read depth).

Example tools are VarScan, Mutect2 and HaplotypeCaller.
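
The eight fixed columns can be illustrated by parsing a single VCF data line. This is a minimal sketch that ignores the header and any per-sample genotype columns; the function name and the variant shown are made up.

```python
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Split one VCF data line into its eight fixed columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least eight columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")  # several alt alleles are allowed
    info = {}
    if record["INFO"] != ".":                 # "." means no INFO entries
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # flags have no value
    record["INFO"] = info
    return record


# Made-up variant for illustration
EXAMPLE_LINE = "\t".join(
    ["chr1", "12345", ".", "A", "T", "50", "PASS", "DP=100;AF=0.5"]
)
variant = parse_vcf_line(EXAMPLE_LINE)
```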

9
Q

What considerations are taken into account when selecting variant callers for germline vs somatic data?

A

A probabilistic (Bayesian) approach is suited to calling germline variants: it calculates the likelihood of each possible genotype given the observed data at a nucleotide position. For a heterozygous variant, reads are expected to split roughly 50:50 between the ref and alt alleles (or between two different alt alleles); for a homozygous variant, close to 100% of reads carry the alt allele. Examples include FreeBayes and GATK.

A heuristic approach is suited to somatic variant calling because somatic variants may be present in only a small fraction of cells, so allele fractions do not follow germline expectations. Variants are identified by applying cut-offs for parameters such as minimum read depth, base quality, number of supporting reads and variant allele frequency (VAF). An example is VarScan.

GATK HaplotypeCaller is a germline caller that locally reassembles reads into candidate haplotypes before genotyping.

DeepVariant uses neural networks.
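
The contrast between the two approaches can be sketched in a few lines. Both functions are simplified illustrations rather than the models used by any named caller; the error rate, the flat priors and every threshold are assumptions chosen for the example.

```python
from math import comb
from typing import Dict


def genotype_posteriors(ref_reads: int, alt_reads: int,
                        error_rate: float = 0.01) -> Dict[str, float]:
    """Bayesian sketch: posterior probability of each diploid genotype.

    Assumes a binomial read-count model and equal prior probabilities.
    """
    # Expected fraction of alt-supporting reads under each genotype
    alt_fraction = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    depth = ref_reads + alt_reads
    likelihoods = {
        genotype: comb(depth, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        for genotype, p in alt_fraction.items()
    }
    total = sum(likelihoods.values())
    return {genotype: value / total for genotype, value in likelihoods.items()}


def passes_heuristic_filters(depth: int, alt_reads: int, min_depth: int = 20,
                             min_alt_reads: int = 4,
                             min_vaf: float = 0.05) -> bool:
    """Heuristic sketch: keep a candidate that clears fixed cut-offs."""
    if depth < min_depth or alt_reads < min_alt_reads:
        return False
    return alt_reads / depth >= min_vaf


het = genotype_posteriors(ref_reads=10, alt_reads=10)
hom = genotype_posteriors(ref_reads=0, alt_reads=20)
low_vaf = genotype_posteriors(ref_reads=188, alt_reads=12)  # VAF of 6%
low_vaf_kept = passes_heuristic_filters(depth=200, alt_reads=12)
```

In the last example the genotype model favours the reference genotype because a 6% allele fraction fits neither 50% nor 100%, while the threshold-based check retains the candidate.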

10
Q

Explain the process of CNV calling and what tool you could use to perform CNV calling

A

Copy number variant (CNV) calling identifies deletions or duplications of particular genomic regions and is normally based on comparing the read depth across a region with the depth expected from reference samples. Tools include ExomeDepth and CNVkit.
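
A read-depth comparison can be sketched as a log2 ratio per region. This is not how ExomeDepth or CNVkit are implemented: the median normalisation, the thresholds and the depth values are assumptions for illustration only.

```python
from math import log2
from statistics import median
from typing import List


def log2_ratios(sample_depths: List[float],
                reference_depths: List[float]) -> List[float]:
    """Log2 ratio of normalised sample depth to normalised reference depth."""
    sample_median = median(sample_depths)
    reference_median = median(reference_depths)
    ratios = []
    for sample, reference in zip(sample_depths, reference_depths):
        normalised = (sample / sample_median) / (reference / reference_median)
        # A region with no reads cannot be log-transformed
        ratios.append(log2(normalised) if normalised > 0 else float("-inf"))
    return ratios


def call_cnv(ratio: float, loss_threshold: float = -0.5,
             gain_threshold: float = 0.4) -> str:
    """Classify a region using hypothetical log2 ratio thresholds."""
    if ratio <= loss_threshold:
        return "deletion"
    if ratio >= gain_threshold:
        return "duplication"
    return "normal"


# Made-up mean depths for five regions
SAMPLE = [100, 50, 100, 100, 150]
REFERENCE = [100, 100, 100, 100, 100]

ratios = log2_ratios(SAMPLE, REFERENCE)
calls = [call_cnv(ratio) for ratio in ratios]
```

A heterozygous deletion halves the depth (log2 ratio of -1), while a single-copy duplication raises it by half (log2 ratio of about 0.58), which is why the two thresholds are not symmetric.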

11
Q

Explain the process of variant annotation and tools that can be used

A

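Variant annotation takes the variants in an input VCF and adds information from external data sources, such as the affected gene and transcript, the predicted consequence and the population frequency. Tools include Alamut Batch, Cancer Genome Interpreter and Alissa.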

12
Q

How is the overall quality of a run assessed?

A

Overall run-level metrics show how well the run has performed on a technical level in terms of read quality and signal purity. The most important metrics are the percentage of bases at or above Q30 (%≥Q30), the percentage of clusters passing filter (%PF) and the cluster density. There may be validated thresholds for these metrics that are used to pass or fail a run.

The error rate and yield can also be reported.

13
Q

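What is the Q30?

A

A Phred quality score of 30 corresponds to a base call accuracy of 99.9%, i.e. a 1 in 1,000 chance that the base has been called incorrectly. The Q30 metric is the percentage of bases with a quality score of 30 or higher.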
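
The relationship between a Phred score and the error probability, Q = -10 log10(P), can be checked with a short sketch. The function names are illustrative, and the quality string is assumed to use the Phred+33 ASCII encoding.

```python
def phred_to_error_probability(quality_score: float) -> float:
    """Convert a Phred quality score to the probability of a wrong call."""
    return 10 ** (-quality_score / 10)


def percent_q30(quality_string: str, offset: int = 33) -> float:
    """Percentage of bases in a FASTQ quality string with Q >= 30."""
    if not quality_string:
        raise ValueError("quality string is empty")
    scores = [ord(character) - offset for character in quality_string]
    return 100 * sum(score >= 30 for score in scores) / len(scores)


q30_error = phred_to_error_probability(30)  # 1 in 1,000
q20_error = phred_to_error_probability(20)  # 1 in 100
read_q30 = percent_q30("IIIIHH#")           # 6 of 7 bases are Q30 or above
```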

14
Q

What are clusters PF?

A

The percentage of clusters passing filter (%PF) indicates the signal purity from each cluster. Over-clustered flow cells with many overlapping clusters lead to poor template generation and a lower %PF.

15
Q

What is cluster density?

A

Cluster density is the number of clusters per unit area of the flow cell (usually reported in K/mm²). Optimal cluster density results in optimal run quality, reads passing filter, Q30 scores and total data output.
Under-clustering leads to lower data output, while over-clustering leads to poor run performance, lower Q30 scores, the possible introduction of sequencing artefacts and lower total data output.

16
Q

What sample quality metrics are assessed during QC?

A

Per base sequence quality represents the distribution of quality scores across all reads at each base position in the FASTQ file.
A warning is raised if the lower quartile is < 10 or the median is < 25 at any position.
A sample fails per base sequence quality if the lower quartile is < 5 or the median is < 20 at any position.
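
The warning and failure rules above can be sketched as a small classifier. The thresholds are the ones stated in the answer; the function names and the quartile method (Python's default in `statistics.quantiles`) are assumptions and may differ slightly from what a QC tool computes.

```python
from statistics import median, quantiles
from typing import List


def classify_position(scores: List[int]) -> str:
    """Classify the quality scores observed at one base position."""
    lower_quartile = quantiles(scores, n=4)[0]
    middle = median(scores)
    if lower_quartile < 5 or middle < 20:
        return "fail"
    if lower_quartile < 10 or middle < 25:
        return "warn"
    return "pass"


def classify_sample(scores_by_position: List[List[int]]) -> str:
    """A sample takes the worst result seen at any position."""
    results = {classify_position(scores) for scores in scores_by_position}
    if "fail" in results:
        return "fail"
    if "warn" in results:
        return "warn"
    return "pass"


# Made-up quality scores: one list per base position
GOOD = [[38] * 10, [37] * 10]
DECLINING = [[38] * 10, [22] * 10]
POOR = [[38] * 10, [18] * 10]

good_result = classify_sample(GOOD)
declining_result = classify_sample(DECLINING)
poor_result = classify_sample(POOR)
```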

Fold 80 base penalty (F80) is the fold of additional sequencing required to ensure that 80% of the target bases achieve the mean coverage. High F80 values indicate less uniform coverage and lower capture efficiency.
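
One common way to approximate F80 is to divide the mean depth by the depth that at least 80% of target bases reach. The sketch below assumes that approximation; the function name and depth values are hypothetical, and a metrics tool may calculate the value differently.

```python
from typing import List


def fold_80_penalty(depths: List[int]) -> float:
    """Mean depth divided by the depth reached by at least 80% of bases."""
    if not depths:
        raise ValueError("no depths supplied")
    ordered = sorted(depths)
    mean_depth = sum(ordered) / len(ordered)
    # At least 80% of values sit at or above this position in the sorted list
    depth_at_80_percent = ordered[int(0.2 * len(ordered))]
    if depth_at_80_percent == 0:
        return float("inf")
    return mean_depth / depth_at_80_percent


# Made-up per-base depths: same mean depth, different uniformity
UNIFORM = [100] * 10
UNEVEN = [20, 20, 50, 100, 100, 100, 100, 100, 100, 310]

uniform_f80 = fold_80_penalty(UNIFORM)
uneven_f80 = fold_80_penalty(UNEVEN)
```

Both examples have a mean depth of 100, but the uneven one would need twice as much sequencing for 80% of bases to reach that mean.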

GC/AT dropout reflects the degree of inadequate coverage of particular regions based on their GC or AT content.

17
Q

Explain the Illumina “sequencing by synthesis” NGS technology

A

Single-stranded fragments bind randomly to the inside surface of flow cell channels.
Unlabelled nucleotides and enzyme are added to initiate solid-phase bridge amplification.
The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
Denaturation of the strands leaves single stranded templates anchored to the substrate.
DNA replication is repeated until several million dense clusters of double-stranded DNA are generated on the flow cell.
The first sequencing cycle begins by adding four labelled reversible terminators, primers, and DNA polymerase.
After laser excitation, the emitted fluorescence from each cluster is captured and the first base is identified.
The next cycle repeats the incorporation of four labelled reversible terminators, primers, and DNA polymerase.
After laser excitation, the image is captured as before, and the identity of the second base is recorded.
The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

18
Q

What is long read sequencing in the context of ONT?

A

Oxford Nanopore Technologies (ONT) have developed a third-generation sequencing approach that utilises a transmembrane nanopore and the principles of electrophoresis to sequence DNA and RNA without the need for PCR.
Nanopore sequencing generates long reads, in contrast to the short reads obtained by Illumina SBS technology.
Long-read sequencing offers improved characterisation of copy number variants and the ability to distinguish between highly homologous regions.
ONT is not yet widely used in the NHS; however, many academic publications are based on data generated using ONT.

19
Q

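What is soft-filtering and what tools can be used?

A

Variants are flagged, rather than removed, based on factors affecting quality; the outcome is recorded in the FILTER column of the VCF.
Tools: GATK VariantFiltration (flags variants) and SelectVariants (subsets variants)
A VCF is input and a filtered VCF is output.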
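
Soft-filtering can be sketched as writing filter names into the FILTER field while keeping every record. The filter names, thresholds and record layout here are hypothetical and are not taken from any particular tool.

```python
from typing import List


def soft_filter(record: dict, min_depth: int = 20,
                min_qual: float = 30.0) -> dict:
    """Return a copy of the record with its FILTER field set."""
    failed: List[str] = []
    if record["DP"] < min_depth:
        failed.append("LowDepth")
    if record["QUAL"] < min_qual:
        failed.append("LowQual")
    updated = dict(record)  # the input record is left untouched
    # FILTER holds PASS or a semicolon-separated list of failed filters
    updated["FILTER"] = ";".join(failed) if failed else "PASS"
    return updated


# Made-up variant records
VARIANTS = [
    {"POS": 100, "QUAL": 60.0, "DP": 150},
    {"POS": 200, "QUAL": 12.0, "DP": 150},
    {"POS": 300, "QUAL": 10.0, "DP": 8},
]

filtered = [soft_filter(variant) for variant in VARIANTS]
```

All three records remain in the output, so a reviewer can still see why a variant was flagged.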

20
Q

What tools can be used to perform QC?

A

QC tools assemble QC metrics such as per base sequence quality.
Recommended tools: FastQC and MultiQC
Input: FASTQ
Output: HTML report

21
Q

What considerations are taken into account for selecting bioinformatics pipeline tools?

A

  1. Is the tool in active development? The GitHub repository should be updated regularly, with regular bug fixes.
  2. Is the tool open source? Open source tools are free to use and the code is readily available to download.
  3. How compatible is the tool with the laboratory equipment and the current pipeline? Tools should be written in interoperable programming languages and should not have too many dependencies that are difficult to fulfil.
  4. How widely used is the tool and does it have a good reputation? The tool should be peer reviewed and used by several other reputable laboratories.
  5. What do best practice guidelines recommend for tool usage? The Broad Institute have published best practice workflows that recommend tools for analysing NGS data.
  6. Does the tool perform more than one useful task? Choosing tools that output quality metrics is beneficial, as they provide important information regarding the quality of the data.

22
Q

What are in silico predictive bioinformatics tools? Give some examples.

A

In silico predictive software assesses the effect of amino acid substitutions on the structure or function of a protein. Missense analysis tools include Align GVGD, SIFT and PolyPhen.