6. NGS: read mapping Flashcards
What is the similarity/difference between assembly and read mapping?
Assembly: first genome of a species
- de novo assembly (high quality assembly: expensive/difficult)
- genome structure, gene inventory, etc
Read mapping: next genome(s) of that species (or a close relative)
- de novo assembly serves as reference: read mapping, re-sequencing, alignment
- cheaper: lower coverage required
- different questions
What is read mapping also known as?
AKA
mapping, alignment, re-sequencing
What is reference bias?
What is a possible solution?
Mapping bias / reference bias
- reads containing non-reference alleles
- might not be mapped or incorrectly mapped
- receive lower mapping scores (mismatches!)
- effect increases with genetic distance from the reference genome
- affects downstream analysis (e.g., population genomics)
solution
* map to multiple reference genomes
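A toy illustration of the scoring effect, assuming a simple ungapped mismatch count stands in for the mapping score (real mappers use more elaborate scoring; the sequences and the `mismatches` helper are made up for this sketch):

```python
def mismatches(read: str, reference: str, pos: int) -> int:
    """Count mismatches of `read` against `reference` starting at `pos` (no gaps)."""
    window = reference[pos:pos + len(read)]
    return sum(a != b for a, b in zip(read, window))


reference = "ACGTACGTTAGC"
ref_allele_read = "ACGTTAGC"  # carries the reference allele
alt_allele_read = "ACGATAGC"  # same locus, non-reference allele at the 4th base

# Both reads come from position 4, but the non-reference read scores worse.
print(mismatches(ref_allele_read, reference, 4))
print(mismatches(alt_allele_read, reference, 4))
```

The read with the non-reference allele picks up an extra mismatch purely because of the choice of reference, which is the bias described above.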
NGS read mapping - approaches?
1: (pre-process &) identify short matches
* hash-table based (Sanger, NGS)
* BWT based (NGS)
2: extend seeds to longer alignments (e.g., DP)
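Step 2 can be sketched with a small dynamic-programming routine. This is only an illustration, assuming plain edit distance (unit costs for mismatch, insertion, deletion) between the read and a candidate reference window; production mappers use banded or affine-gap variants. The function name and sequences are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via DP, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match / mismatch
        prev = curr
    return prev[-1]


reference = "ACGTACGTTAGC"
read = "ACGATAGC"
seed_hit = 4  # assumed position reported by the seeding step
window = reference[seed_hit:seed_hit + len(read)]
print(edit_distance(read, window))
```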
What is a pan-genome?
collection of all genes, both common (core) and unique (accessory), present in a given species.
combines genetic information of all genomes sampled –> large and diverse range of genetic material
What is a suffix tree?
compressed trie containing all suffixes of the given text as their keys and positions in the text as their values.
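A minimal sketch of the idea, assuming an uncompressed suffix trie built from nested dicts (a real suffix tree merges single-child chains into one edge and is built far more efficiently). `build_suffix_trie` and `find` are hypothetical names.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text + '$'; each leaf stores the suffix start."""
    text += "$"
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start  # multi-character key, cannot clash with a base
    return root


def find(trie: dict, pattern: str) -> list:
    """Return all start positions of `pattern` in the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []
        node = node[ch]
    positions, stack = [], [node]
    while stack:  # collect every leaf below the matched node
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


trie = build_suffix_trie("BANANA")
print(find(trie, "ANA"))
```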
Define genotype.
In a broad sense, the term “genotype” refers to the genetic makeup of an organism; in other words, it describes an organism’s complete set of genes. In a more narrow sense, the term can be used to refer to the alleles, or variant forms of a gene, that are carried by an organism.
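What is a hash table?
data structure that maps keys to values via a hash function, giving fast (average constant-time) lookup
e.g., a dictionary (dict) in Python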
How is hashing used for read mapping?
- create hash table with substrings of length k called k-mers extracted from reference genome as keys, and their positions on the reference as values
- Then, some k-mers from each read are looked up in the hash table to find candidate locations of the read on the reference (hits)
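The two steps above can be sketched as follows. This is a toy version, assuming exact k-mer matches and non-overlapping k-mers sampled from the read; `build_kmer_index` and `candidate_positions` are hypothetical names.

```python
from collections import defaultdict


def build_kmer_index(reference: str, k: int) -> dict:
    """Hash table: k-mer -> list of start positions on the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def candidate_positions(read: str, index: dict, k: int) -> list:
    """Look up non-overlapping k-mers of the read; return implied read starts."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset  # where the read would begin on the reference
            if start >= 0:
                candidates.add(start)
    return sorted(candidates)


reference = "ACGTACGTTAGC"
index = build_kmer_index(reference, 4)
print(candidate_positions("ACGTTAGC", index, 4))
```

The candidates are only hits; each one still has to be verified by the extension step.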
In hash-table based read mapping, what are the drawbacks of short k-mers? Of long k-mers?
What is the solution?
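smaller templates (substrings, k-mers)
- allow mapping in the presence of more errors/variation
- are not very effective seeds (match too many regions)
larger templates
- are more specific seeds (match fewer regions)
- fail to match when an error/variant falls inside the seed
- dilemma!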
solution: spaced seeds
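The dilemma shows up even on a toy example. The sketch below assumes exact seed matching and made-up sequences; `count_hits` is a hypothetical helper.

```python
def count_hits(seed: str, reference: str) -> int:
    """Number of exact occurrences of `seed` in `reference`."""
    return sum(reference.startswith(seed, i)
               for i in range(len(reference) - len(seed) + 1))


reference = "ACGTACGTTAGCACGA"
read = "ACGTTAGC"        # error-free read
read_error = "ACGTTAGA"  # same read with one substitution at the last base

print(count_hits(read[:3], reference))        # short seed: ambiguous
print(count_hits(read, reference))            # long seed: unique
print(count_hits(read_error, reference))      # long seed with an error: lost
print(count_hits(read_error[:3], reference))  # short seed still finds hits
```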
What are spaced seeds?
templates optimized for a given read length and expected error/mismatch rate
a pattern of relevant and irrelevant positions in a biosequence, used for approximate string matching that tolerates substitutions
usually represented as a sequence of ones (position must match) and zeroes (position is ignored), e.g., 1101
consecutive seeds vs. spaced seeds
consecutive seeds: one template, needs contiguous matches
spaced seeds: template allows mismatches at its ignored (zero) positions
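A small sketch of the difference, assuming the template is given as a string of ones and zeroes and that only the "1" positions form the hash key. The template `110111`, the sequences, and the function names are chosen for illustration only.

```python
from collections import defaultdict


def spaced_key(seq: str, template: str) -> str:
    """Keep only the characters at positions where the template has a '1'."""
    return "".join(c for c, t in zip(seq, template) if t == "1")


def build_spaced_index(reference: str, template: str) -> dict:
    """Hash table: spaced key -> start positions of matching windows."""
    index = defaultdict(list)
    span = len(template)
    for i in range(len(reference) - span + 1):
        index[spaced_key(reference[i:i + span], template)].append(i)
    return index


reference = "ACGTACGTTAGC"
template = "110111"
index = build_spaced_index(reference, template)

seed = "GTCAGC"  # differs from reference[6:12] = "GTTAGC" at the '0' position
print(seed in reference)                    # contiguous seed: no exact match
print(index[spaced_key(seed, template)])    # spaced seed: still hits
```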
What is a BW matrix?
BW matrix:
rows are sorted cyclical rotations of string$ - equivalent to sorting the suffixes
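A minimal sketch of building the matrix, assuming `$` is appended as a terminator that sorts before every other character (true for ASCII letters). `bw_matrix` is a hypothetical name, and materializing all rotations is only feasible for tiny strings.

```python
def bw_matrix(text: str) -> list:
    """Return the sorted cyclic rotations of text + '$'."""
    s = text + "$"
    rotations = [s[i:] + s[:i] for i in range(len(s))]
    return sorted(rotations)


for row in bw_matrix("banana"):
    print(row)
```

Each row starts with one suffix of `banana$` (everything up to and including `$`), which is why sorting rotations and sorting suffixes give the same order.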
What is BWT?
Burrows-Wheeler Transform
- full-text indexing approach
- suffix array-like index structure
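The transformed string is the last column of the sorted-rotation matrix, and it can be read directly off a suffix array. The sketch below assumes naive sorting of suffixes, which is fine for short strings only; real tools use much more efficient construction. Function names are hypothetical.

```python
def suffix_array(s: str) -> list:
    """Start positions of the suffixes of s in sorted order (naive)."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def bwt(text: str) -> str:
    """BWT of text + '$': the character preceding each sorted suffix."""
    s = text + "$"
    # s[i - 1] wraps to the final '$' when i == 0
    return "".join(s[i - 1] for i in suffix_array(s))


print(suffix_array("banana$"))
print(bwt("banana"))
```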
What are the advantages of BWT-based approaches for read-mapping?
- space efficient, can be efficiently compressed
- original string can be recovered
- can fit in RAM of most computers
- pre-computed indices can be stored & shared
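That the original string can be recovered can be checked with a naive inversion. This sketch assumes a unique `$` terminator that sorts first; it rebuilds the whole matrix column by column, so it is for illustration rather than practical use. `inverse_bwt` is a hypothetical name.

```python
def inverse_bwt(last_column: str) -> str:
    """Recover the text from its BWT by repeatedly prepending and sorting."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # prepend the BWT column to every row, then re-sort the rows
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith("$"))
    return original[:-1]  # drop the terminator


print(inverse_bwt("annb$aa"))
```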