Alignment Flashcards
Mention some pitfalls when constructing alignments
Repeats in the genome; poor reference quality; read errors; regions not in the reference; surprise sample.
What is wrong with simply searching in the reference?
It takes too much time/CPU
What is wrong with using BLAST?
It gives unnecessary output and finds only local alignments
What is the smart solution?
Find possible matches.
Do precise alignment for these.
Describe the principle for hash based algorithms
Index the reference: build a dict whose keys are k-mers and whose values are the positions where they occur.
Search the k-mers of your read against the k-mer keys (seeds).
Do an alignment in the area around each seed hit and report the best alignment.
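The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular aligner does it: the function names (build_index, hamming, map_read) and the toy sequences are made up, and the "alignment" step is simplified to an ungapped mismatch count instead of a real gapped alignment.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Count mismatching positions between two equally long strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k):
    """Seed with each k-mer of the read, then score the ungapped placement it implies."""
    best = None  # (mismatches, reference start)
    seen = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], []):
            start = hit - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in seen:
                continue
            seen.add(start)
            candidate = (hamming(read, reference[start:start + len(read)]), start)
            if best is None or candidate < best:
                best = candidate
    return best


# Toy data, invented for illustration: the read has one error relative to the reference.
reference = "ACGTTGCATGCCGTAAGCTTAGGC"
index = build_index(reference, k=4)
print(map_read("CATGCAGTAA", reference, index, k=4))
```

Only the placements suggested by a seed hit are scored, which is what saves time compared with scanning the whole reference.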
What are spaced seeds?
Using a longer k-mer but not requiring every position in it to match.
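A small sketch of the idea, assuming the seed is described by a mask string where '1' means "must match" and '0' means "ignored". The mask, the helper names and the sequences are all invented for illustration; a real tool would store the masked keys in a hash index rather than scanning the reference as done here.

```python
def apply_mask(kmer, mask):
    """Keep the bases under '1' in the mask and drop those under '0'."""
    return "".join(base for base, m in zip(kmer, mask) if m == "1")


def spaced_seed_hits(read, reference, mask):
    """Return reference positions whose masked window equals the masked start of the read."""
    span = len(mask)
    key = apply_mask(read[:span], mask)
    return [pos for pos in range(len(reference) - span + 1)
            if apply_mask(reference[pos:pos + span], mask) == key]


reference = "TTACGTACGATT"
read = "ACGAACG"  # differs from reference[2:9] at its middle base

print(spaced_seed_hits(read, reference, "1111111"))  # contiguous 7-mer seed
print(spaced_seed_hits(read, reference, "1110111"))  # spaced seed, middle base ignored
```

The contiguous seed misses the read because of the single mismatch, while the spaced seed still finds it as long as the mismatch falls on an ignored position.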
What are drawbacks of hash-based approaches?
Memory!
What is BWT?
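The Burrows-Wheeler transform: a reversible transformation of the genome. In practice it is built through a suffix array. Conceptually: append an end marker, form all rotations of the input, sort them lexicographically and output the last column. Repeated sequences sort next to each other, so identical characters cluster into runs that are easier to compress.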
It can be reversed by the use of LF-mapping.
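The transform and its reversal can be sketched as below. This is a naive version for small strings only: it sorts full rotations instead of using a suffix array, and it assumes the text contains no '$' and that '$' sorts before every other character (true for A, C, G, T). The function names are made up.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; '$' marks the end of the text."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last):
    """Rebuild the text by following the LF-mapping from the row that starts with '$'."""
    # A stable sort of the positions in L gives F: order[f_row] is the row in L
    # holding the same character occurrence that sits at f_row in F.
    order = sorted(range(len(last)), key=lambda i: last[i])
    lf = [0] * len(last)
    for f_row, l_row in enumerate(order):
        lf[l_row] = f_row
    row = 0  # row 0 is the rotation starting with '$', so L[0] is the last text character
    out = []
    for _ in range(len(last) - 1):
        out.append(last[row])  # the character preceding the current one in the text
        row = lf[row]          # jump to the row where that character is first
    return "".join(reversed(out))


transformed = bwt("ACAACG")
print(transformed)
print(inverse_bwt(transformed))
```

The text comes out back to front, which is why the collected characters are reversed at the end.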
How to look up in BWT?
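Recreate the F column (the sorted characters of L) and find the rows that start with the last base of the read. Keep the rows whose L character matches the second-to-last base, follow the LF-mapping to the corresponding rows in F, and repeat until the whole read has been matched. Then use the suffix array to find out where in the genome the remaining rows are.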
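A sketch of this backward search, under simplifying assumptions: the full suffix array and a full occurrence table are kept in memory (real tools store only samples of both), and the names build_fm_index, first, occ and find are invented for this example.

```python
def build_fm_index(text):
    """Return (suffix array, start row of each character in F, occurrence counts in L)."""
    text += "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    last = "".join(text[i - 1] for i in sa)  # text[-1] is '$' for the suffix at 0
    first = {}
    for row, i in enumerate(sa):
        first.setdefault(text[i], row)  # first row in F that starts with this character
    # occ[c][r] is the number of times c occurs in last[:r]
    occ = {c: [0] * (len(last) + 1) for c in first}
    for r, ch in enumerate(last):
        for c in occ:
            occ[c][r + 1] = occ[c][r] + (ch == c)
    return sa, first, occ


def find(read, sa, first, occ):
    """Match the read from its last base to its first; return sorted genome positions."""
    lo, hi = 0, len(sa)  # half-open range of rows still matching
    for base in reversed(read):
        if base not in first:
            return []
        lo = first[base] + occ[base][lo]  # LF-mapping applied to both range ends
        hi = first[base] + occ[base][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, first, occ = build_fm_index("ACAACG")
print(find("AC", sa, first, occ))
print(find("AAC", sa, first, occ))
print(find("GG", sa, first, occ))
```

Each base of the read costs only two table lookups, regardless of how long the genome is.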
Why is BWT clever?
You store very little data and calculate missing parts when you need them.
What does bwa mem do?
Uses multiple short seeds across the reads. Extends the seed if several matches are found. Mostly for longer reads.
Explain SAM/BAM formats
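The header lines come first (each starts with @). The rest are alignments, one per line, where you can see the read, info about the alignment in the flag (e.g. paired or not), where it maps, the mapping quality, the edit distance (NM tag) etc. BAM is the compressed binary version of SAM.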
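To make the fields concrete, here is a minimal sketch that splits one SAM alignment line into the pieces mentioned above. The example line is invented, the function names are made up, and only a few flag bits are decoded; for real files a library such as pysam is the usual choice.

```python
def is_header(line):
    """Header lines in SAM start with '@'."""
    return line.startswith("@")


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dict of selected fields."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    record = {
        "read_name": fields[0],
        "flag": flag,
        "paired": bool(flag & 0x1),
        "unmapped": bool(flag & 0x4),
        "reverse_strand": bool(flag & 0x10),
        "reference": fields[2],
        "position": int(fields[3]),  # 1-based leftmost mapping position
        "mapq": int(fields[4]),      # mapping quality
        "cigar": fields[5],
        "sequence": fields[9],
        "tags": {},
    }
    # Optional fields follow the 11 mandatory ones and look like TAG:TYPE:VALUE
    for tag in fields[11:]:
        name, kind, value = tag.split(":", 2)
        record["tags"][name] = int(value) if kind == "i" else value
    return record


# Invented example line, not taken from a real file
example = "read1\t99\tchr1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII\tNM:i:1"
record = parse_sam_line(example)
print(record["paired"], record["reference"], record["position"],
      record["mapq"], record["tags"]["NM"])
```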