sequence assembly Flashcards
1
Q
read number
A
- possible overlaps increases exponentially with read number
- for n reads, 2n2 - 2n possible overlaps
2
Q
assembly algorithms
A
- depends on length of reads
- greedy (phrap)
- overlap graph (OLC)
- limited by read number
- de bruijn (velevet)
- overcomes read number
- string graph (SGA)
- read length increasing so move back to phrap/olc
3
Q
Greedy approach (phrap)
A
- simple local method
- merge 2 sequences with largest overlap
- continue overlapping until no more possibilities
- can’t use with global information
- paired end reads/mate pairs
- essential for repetitive genomes
- crossmatch program (SW)
- HGP
4
Q
OLC
A
- overlap layout consensus
- find best match between suffix of one read and prefix of another
- dynamic programming to find optimal overlap alignment
- allow mismatches to accommodate for errors
- similarity threshold
- filter out fragments with insufficiently long common substrings
- local multiple alignments to derive consensus
- if fragment has multiple overlaps choose best one
- lay out to create best path
- long reads only (need to identify every overlap - computationally challenging
5
Q
k-mers
A
- short reads e.g. illumina
- short sequence of set length that is unique in the genome
- in alphabet of n symbols, nk k-mers exist
- k is predefined
- for full assembly:
- all k-mers in genome must be present and appear only once
- take first k-mer and shift along one base
6
Q
simple assembly
A
- sanger sequencing
- reads (k-mers) represented as nodes in graph
- edges represent alignments
- hamiltonian cycle to construct circular genome
- concatenate first 2 nucleotides in each read
- cover all nodes once and return to start
- circular DNA assumed (start anywhere)
7
Q
improve simple assembly
A
- chop reads into all possible k-mers
- nodes represent k-mers
- each successive node shifted along sequence by one nucleotide
- removes redundancy
- only one path exists traversing each node once
- doesn’t scale to large graphs
8
Q
de bruijn graph
A
- solution to simple assembly
- k-mer prefixes and suffixes represented as nodes
- each occurs once in the graph
- edges = whole k-mer sequence
- eulerian cycle trhough graph
- edge determines which path is correct where multiple overlaps exist
- only one path visiting each node once
9
Q
eulerian cycle
A
- visit every edge of a graph exactly once
- path visiting each node once possible if:
- all k-mers in genome present (unlikely)
- all k-mers error free (unlikely)
- each k-mer occurs once (difficult with repeats, use paired end reads)
- genome is single circular chromosome
- linear needs eulerian path
- can be adapted to overcome
10
Q
k-mer size
A
- limited by read size
- read size too long → not all k-mers present
- compromise between k-mer size and whoe genome representation
- illumina:
- 100-200bp reads, k-mer of 100+
- reads never have all possible 100-mers in genome
- longer k-mers overcome repeats
- trial and error
11
Q
de bruijn graph k-mers
A
- break each 100bp read into 46 overlapping 55-mers
- ensures nearly all 55-mers in genome detected
12
Q
velvet
A
- de novo assembly
- short read sequencing (illumina, 454)
- removes errors from short read sequences for high quality unique contigs
- paired end read and lon gread infromation to find repeated areas between contigs
- use of de bruijn
13
Q
velvet de bruijn graph
A
- cut reads into 50-mers overlapping by k-1 nucleotides
- small k-mers: increased ocnnectivity, more ambiguous repeats
- large k-mers: opposite
- compromise specificity and sensitivity
14
Q
velvet nodes
A
- each node attached to complementary twin node
- creates a block displaying the sequence
- last nucleotide of each k-mer
- last k-mer of block overlaps with first k-mer of its destination
- series of reverse complement k-mers
- branches where multiple overlap possibilities exist
15
Q
velvet hash table
A
- records ID of first read and its position for each k-mer
- recorded with reverse complement (sequencing in both directions)
- create node if distinct interruption point
- trace reads through graph
- create directed arcs where needed (interruptions)
- merge blocks when only one path possible
- if node only has one outoging arc, and destination node has onyl one ingoing arc
- prevent information loss