6. NGS: read mapping Flashcards
What is the similarity/difference between assembly and read mapping?
Assembly: first genome of a species
- de novo assembly (high quality assembly: expensive/difficult)
- genome structure, gene inventory, etc
Read mapping: next genome(s) of that species (or a close relative)
- de novo assembly serves as reference: read mapping, re-sequencing, alignment
- cheaper: lower coverage required
- different questions
What is read mapping also known as?
AKA
mapping, alignment, re-sequencing
What is reference bias?
What is a possible solution?
Mapping bias / reference bias
- reads containing non-reference alleles
- might not be mapped or incorrectly mapped
- receive lower mapping scores (mismatches!)
- effect increases with genetic distance from the reference genome
- affects downstream analysis (e.g., population genomics)
solution
* map to multiple reference genomes
NGS read mapping - approaches?
1: (pre-process &) identify short matches (seeds)
* hash-table based (Sanger, NGS)
* BWT based (NGS)
2: extend seeds to longer alignments (e.g., DP)
What is a pan-genome?
collection of the common (core) and unique (accessory) genes or sequences present in a given species.
combines genetic information of all genomes sampled –> large and diverse range of genetic material
What is a suffix tree?
compressed trie containing all suffixes of the given text as their keys and positions in the text as their values.
define genotype
In a broad sense, the term “genotype” refers to the genetic makeup of an organism; in other words, it describes an organism’s complete set of genes. In a more narrow sense, the term can be used to refer to the alleles, or variant forms of a gene, that are carried by an organism.
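What is a hash table?
data structure that maps keys to values via a hash function, giving fast lookup by key
e.g., a dictionary (dict) in Python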
How is hashing used for read mapping?
- create hash table with substrings of length k called k-mers extracted from reference genome as keys, and their positions on the reference as values
- Then, some k-mers from each read are looked up in the hash table to find candidate locations of the read on the reference (hits)
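The two steps above can be sketched in Python. This is a minimal illustration, not how any particular mapper is implemented: the function names, the toy reference and the voting scheme (each hit votes for the read start position it implies) are assumptions made for the example.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Hash table: k-mer -> list of start positions in the reference."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k):
    """Look up every k-mer of the read; each hit votes for a read start position."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            votes[hit - offset] += 1
    # Best-supported candidate first; ties broken by position.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


reference = "ACGTACGTTAGCCGATAGCTAGGCTA"  # toy reference (made up)
index = build_kmer_index(reference, k=4)
hits = candidate_locations("TAGCCGATAG", index, k=4)
print(hits[0])  # (candidate start position, number of supporting k-mers)
```

Candidates with many supporting k-mers would then be passed to the extension step (e.g., DP alignment).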
In hash-table-based read mapping, what are the drawbacks of short k-mers? Of long k-mers?
What is the solution?
smaller templates (substrings, k-mers)
- allow mapping in the presence of more errors/variation
- are not very effective seeds (match too many regions)
longer templates
- are more specific seeds (match fewer regions)
- fail to match when an error/variant falls within the seed
- dilemma!
solution: spaced seeds
What are spaced seeds?
templates: optimized for a given read length and expected error/mismatch rate
pattern of relevant (must match) and irrelevant (don't care) positions in a biosequence, used for approximate string matching that allows for substitutions.
usually represented as a sequence of ones (relevant) and zeroes (irrelevant)
consecutive seeds vs. spaced seeds
consecutive seeds: one template, needs contiguous matches
spaced seeds: template allows mismatches at its irrelevant positions
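A small Python sketch of the difference, assuming the common convention that '1' marks a position that must match and '0' a position that is ignored. The template, sequences and function names are made up for illustration.

```python
def spaced_kmer(sequence, start, template):
    """Keep only the characters at the '1' positions of the template."""
    return "".join(
        sequence[start + i] for i, bit in enumerate(template) if bit == "1"
    )


def seed_matches(read, reference, ref_pos, template):
    """True if read and reference agree at every relevant position."""
    return spaced_kmer(read, 0, template) == spaced_kmer(reference, ref_pos, template)


template = "1101011"   # hypothetical template: positions 2 and 4 are ignored
reference = "ACGTACG"
read = "ACTTGCG"       # differs from the reference at positions 2 and 4

print(read == reference)                             # consecutive seed: no match
print(seed_matches(read, reference, 0, template))    # spaced seed: match
```

In a hash-table mapper the spaced k-mer, rather than the contiguous k-mer, would serve as the key.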
What is a BW matrix?
BW matrix:
rows are sorted cyclical rotations of string$ - equivalent to sorting the suffixes
What is BWT?
Burrows-Wheeler Transform
- full-text indexing approach
- suffix array-like index structure
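A naive Python sketch of the BW matrix and the transform read off from it (the last column of the sorted matrix). It assumes '$' sorts before every other character, which holds for ASCII. Real indexers avoid building the full matrix; this version is for illustration only.

```python
def bw_matrix(text):
    """Sorted cyclic rotations of text + '$'."""
    text = text + "$"
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    return sorted(rotations)


def bwt(text):
    """Burrows-Wheeler Transform: last column of the BW matrix."""
    return "".join(row[-1] for row in bw_matrix(text))


for row in bw_matrix("ACAACG"):
    print(row)
print(bwt("ACAACG"))  # GC$AAAC
```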
What are the advantages of BWT-based approaches for read-mapping?
- space efficient, can be efficiently compressed
- original string can be recovered
- can fit in RAM of most computers
- pre-computed indices can be stored & shared
What are the most widely used read mapper tools?
BWA, Bowtie2
What problem does mapping of long reads present and how can this be addressed?
- higher number of sequencing errors (and different types of errors)
- hash-based and BWT-based approaches
- longer reads: many short seeds per read
- higher error rates –> use chaining
- use representative seeds only (minimizers)
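A minimal Python sketch of minimizer selection: from every window of w consecutive k-mers, keep only the smallest one. Lexicographic order is an assumption made here for readability; long-read mappers typically order k-mers by a hash value instead. The function name and the toy sequence are made up.

```python
def minimizers(sequence, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minima."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Index of the smallest k-mer in this window (leftmost on ties).
        offset = min(range(w), key=lambda i: window[i])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


picked = minimizers("ACGTTGCA", k=3, w=3)
print(picked)
```

Because neighbouring windows often share their minimum, fewer seeds are stored than there are k-mers, and the same minimizers are picked from a read and from the reference wherever the two agree.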
What is SAM and BAM?
Alignment output
* compact file size
* analysis without loading the alignment into memory
* can be used by alignment viewers (IGV, samtools)
* SAM (Sequence Alignment/Map) format - human readable (sort of), can be parsed easily
* BAM format - same information but in binary format
In SAM format, what is the flag code?
bitwise flag: tells us whether the read is mapped, (un)paired, on the reverse strand, etc
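The flag is a sum of bits, each with its own meaning. A small Python decoder, using the bit values from the SAM specification (the short descriptions and the function name are my own wording):

```python
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """List the properties whose bit is set in the flag."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


print(decode_flag(99))  # 99 = 1 + 2 + 32 + 64
```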
What is mapping quality?
When is this relevant?
Origin? Usage?
mapping quality = Phred-scaled probability that the read is aligned to an incorrect position (higher value = more confident)
Relevant with repetitive genome regions, SNPs, sequencing errors
➡ reads map equally well to multiple positions
➡ reduced probability of assigning a read to the correct genomic location
➡ measure the confidence that a read actually comes from the position it is aligned to
first introduced with MAQ mapping software (2008)
different programs compute this differently! (and some don’t compute it at all)
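A short Python sketch of the Phred scaling, assuming the definition in the SAM specification (MAPQ = -10 * log10 of the probability that the mapping position is wrong, rounded to an integer). How a given mapper estimates that probability differs between programs, so values are not directly comparable across tools.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the reported position is wrong."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong):
    """Phred-scale an error probability and round to the nearest integer."""
    return round(-10 * math.log10(p_wrong))


for mapq in (10, 20, 30):
    print(mapq, mapq_to_error_probability(mapq))
```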
What is a single bp variant? What is an SNP?
SNPs
* a single bp change with a frequency of at least 1% within a population
* bi-allelic
others: rare, just called single bp variants
What can cause single bp variants?
What do we need to consider in order to resolve this?
What approaches?
Cause:
errors, paralogs, diploidy, polyploidy, …
considerations:
read depth, read quality, alignment quality, frequency and ratio of SNPs, number of equally good alignments per read, …
approaches
- allele counting
- probabilistic methods
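A toy Python sketch of allele counting at a single position, assuming a diploid sample. The depth and frequency thresholds and the function name are hypothetical, chosen only for illustration. Probabilistic methods replace such hard cutoffs with genotype likelihoods.

```python
from collections import Counter


def call_genotype_by_counting(bases, min_depth=8, min_fraction=0.2):
    """bases: base calls of all reads covering one position."""
    depth = len(bases)
    if depth < min_depth:
        return None  # too little coverage to make a call
    counts = Counter(bases)
    supported = [
        base for base, n in counts.most_common() if n / depth >= min_fraction
    ]
    if not supported:
        return None
    if len(supported) == 1:
        return (supported[0], supported[0])  # homozygous
    return tuple(sorted(supported[:2]))      # heterozygous (diploid assumption)


print(call_genotype_by_counting(list("AAAAAGGGGG")))
print(call_genotype_by_counting(list("AAAAAAAAAG")))
```

In the second call the single G falls below the frequency threshold and is treated as a likely sequencing error.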
What is variant calling?
Name one package and the tools to do this
Variant calling entails identifying single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels) from next-generation sequencing data
bcftools
- bcftools mpileup
* for every position, considers data from all reads covering the position
* calculates genotype likelihoods based on mapping qualities of the reads, base qualities, the probability of local misalignment, per-base alignment quality (BAQ)
- bcftools call
* calls the most likely genotype under the assumption of Hardy-Weinberg equilibrium, using allele frequencies estimated from the data or provided explicitly by the user
What is VCF?
Variant Call Format
* text based
* meta-information, genome variation by position
* can include info about structural variation
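A minimal Python sketch of reading one VCF data line, using the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). The example record is made up, and the parser ignores the meta-information header and any per-sample columns.

```python
def parse_vcf_line(line):
    """Parse the eight fixed columns of a VCF data line into a dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_dict[key] = value if value else True  # flags have no value
    return {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),      # several alternative alleles possible
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


example = "chr1\t12345\t.\tA\tG,T\t50.0\tPASS\tDP=14;AF=0.5,0.1;DB"  # hypothetical
record = parse_vcf_line(example)
print(record["pos"], record["ref"], record["alt"], record["info"])
```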
What are structural variants?
size?
relevant for?
etc
variation generally >50bp
~2% of the human genome
relevant for medicine, molecular biology, evolution
least studied type of variation, because they are very difficult to detect with short reads
What are different types of structural variants?
- insertions, deletions, duplications
- inversions, rearrangements
- copy number variations
Structural variants with:
short reads
long reads
genome alignment
short reads
- need to take into consideration distance and orientation of paired reads; split reads
- some specialized mappers exist
- none of them can identify all SV types
- limit inherent in the short reads
long reads
- in theory advantageous, span SV regions
- in practice: not very many methods yet available
- often specific for sequencing technology
genome alignment
- required: de novo assembly, spanning the SV
➡ useful & important! methods are still evolving!
Describe fasta format
- text-based format
- representing either Nts or AA using single-letter codes.
- begins with a single-line description
- followed by lines of sequence data
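A minimal Python sketch of a FASTA reader, assuming the usual convention that the description line starts with '>'. The record names and sequences are made up.

```python
def parse_fasta(lines):
    """Return {description: sequence} from an iterable of FASTA lines."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


fasta_text = """>seq1 toy nucleotide record
ACGTACGT
TTAGC
>seq2 toy protein record
MKVLA
"""
sequences = parse_fasta(fasta_text.splitlines())
print(sequences)
```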
Describe fastq format
- text-based format
- for storing biological sequence + corresponding quality scores.
- each sequence letter and its quality score are both encoded with a single ASCII character
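A small Python sketch of one FASTQ record (four lines: '@' header, sequence, '+' separator, quality string) and of decoding the quality characters. The offset of 33 (Phred+33) is an assumption that matches current Illumina and Sanger-style files; some older files use 64. The record is made up.

```python
def phred_scores(quality_string, offset=33):
    """Convert each ASCII quality character to its Phred score."""
    return [ord(char) - offset for char in quality_string]


def parse_fastq_record(lines):
    """Parse exactly four lines into (name, sequence, scores)."""
    header, sequence, separator, quality = (line.strip() for line in lines)
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, phred_scores(quality)


name, sequence, scores = parse_fastq_record(["@read1", "ACGT", "+", "!5I?"])
print(name, sequence, scores)
```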
EXAM QUESTION
Describe 2 structural variants and how to detect them during mapping of long-read sequencing data (2022)
- insertions
- deletions
- duplications
- inversions, rearrangements
- copy number variations
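long reads spanning an SV are split into sub-reads (split alignments) when mapped; analysis of the position and orientation of the pieces can then determine which of the SVs above is present (e.g., deletion: adjacent pieces map far apart on the reference; inversion: one piece maps to the opposite strand)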