Lecture 17 - Assembly Flashcards
what does whole genome shotgun sequencing use
short reads sampled from all chromosomes of a given genome
how long are the short reads used for whole genome shotgun sequencing
100-250 bp
what makes assembly possible
over-sampling: each region is sequenced many times, so reads overlap with other reads
what is sequencing coverage
the number of times a specific region in a genome is sequenced
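Averaged over a whole genome, coverage is commonly estimated as total sequenced bases divided by genome length. A minimal sketch of that arithmetic; the function name and the numbers are made up for illustration, not taken from the lecture:

```python
def average_coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average coverage = (number of reads * read length) / genome length."""
    return num_reads * read_length / genome_length


# Hypothetical run: 2 million reads of 150 bp from a 5 Mbp genome.
print(average_coverage(2_000_000, 150, 5_000_000))  # 60.0
```

Coverage of any one region will vary around this average, since reads are sampled at random.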
what is de novo sequence assembly
the reconstruction of a sequence up to chromosome length, without reference to an existing genome
why do errors in assembly occur
genome sequences typically contain repeat regions, which make the placement of reads ambiguous
what are paired-end reads used for
to help resolve local repeats
what are paired-end reads
sequences from either end of a longer sequence of known length
why do paired ends help map reads over repetitive regions more precisely
the length of the fragment between the two ends is known, so one end can anchor the other across a repeat
why are paired ends used over longer reads
longer reads are more expensive and less accurate at the ends
how do greedy assembly methods work
joining best overlapping reads if consistent with existing assembly
what is the disadvantage of greedy assembly methods
the final result is not guaranteed to be optimal
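A toy sketch of the greedy idea: repeatedly merge the pair of sequences with the longest suffix-to-prefix overlap until no overlap of at least `min_overlap` remains. The function names, the threshold and the reads are assumptions for illustration. It ignores sequencing errors, reverse complements and reads contained inside other reads.

```python
def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_overlap)."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]


print(greedy_assemble(["ATGGCG", "GCGTGC", "TGCAAT"]))  # ['ATGGCGTGCAAT']
```

Each merge is chosen locally and never undone, which is why the result is not guaranteed to be optimal when repeats create competing overlaps.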
what do nodes represent in the context of graphs for assembly
sequences
what do edges represent in the context of graphs for assembly
directional overlap (the 3' end of one sequence overlaps the 5' end of the next)
what is the goal in sequence overlap when represented as a graph
to find a single, non-overlapping path that connects the nodes
what does overlap-layout consensus create
a graph with a node for each read and edges connecting overlapping reads
what do edges represent in overlap-layout-consensus
pairs of reads that overlap sufficiently well (e.g. at least 20 bp overlap)
what are the disadvantages of overlap-layout-consensus
may have high computational overhead for pairwise overlap calculations; lots of memory is required
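A small sketch of the overlap step: one node per read, and a directed edge wherever a suffix of one read matches a prefix of another by at least `min_overlap` bases. The names and toy reads are assumptions. Exact matching is used here, whereas real assemblers tolerate mismatches. The loop visits every ordered pair of reads, which is where the pairwise cost comes from.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_overlap)."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_overlap=20):
    """Map each read to a list of (overlapping read, overlap length) edges."""
    graph = {read: [] for read in reads}
    for a, b in permutations(reads, 2):  # n * (n - 1) ordered pairs
        n = suffix_prefix_overlap(a, b, min_overlap)
        if n:
            graph[a].append((b, n))
    return graph


# Toy reads are far shorter than real ones, so the threshold is lowered to 3.
toy = ["ATGGCG", "GCGTGC", "TGCAAT"]
print(build_overlap_graph(toy, min_overlap=3))
```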
what do De Bruijn Graphs (DBG) use
exact substrings of length k
what does DBG create
a graph of k-mers that overlap by k-1 letters
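A minimal sketch of building such a graph, with k-mers as nodes and an edge between k-mers that sit next to each other in a read (and so share k-1 letters). The function name and the toy read are assumptions. Some formulations instead use (k-1)-mers as nodes and k-mers as edges.

```python
def de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph.setdefault(kmer, set())
            if i + k < len(read):
                # The next k-mer starts one base later: overlap of k-1 letters.
                graph[kmer].add(read[i + 1:i + k + 1])
    return graph


for kmer, successors in de_bruijn_graph(["ATGGCGT"], k=3).items():
    print(kmer, "->", sorted(successors))
```

No read-against-read comparison is needed: each read is simply chopped into k-mers, and identical k-mers from different reads collapse into the same node.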
what are the differences between overlap-layout-consensus and DBG
- overlap-layout-consensus uses long reads while DBG uses k-mers
- the rules for overlap are different
- finding a path for overlap-layout-consensus is more difficult than finding a path for DBG
what should the sequencing depth be for DBG
~40x
if k-mers are of length 31 how many possible k-mers are there
4^31
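The arithmetic, worked out: with 4 possible bases at each of 31 positions there are 4^31 = 2^62 possible k-mers. The sketch below also packs a k-mer into an integer at 2 bits per base, to show that any 31-mer fits in a 64-bit word. The encoding table and function names are illustrative assumptions.

```python
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}  # 2 bits per base (assumed mapping)


def num_possible_kmers(k: int) -> int:
    return 4 ** k


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer into an integer, 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODING[base]
    return value


print(num_possible_kmers(31))            # 4611686018427387904
print(num_possible_kmers(31) == 2 ** 62)  # True
print(encode_kmer("T" * 31) < 2 ** 64)    # True
```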
what does the DBG graph look like if there are no repeat k-mers and no errors
a single long chain
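A toy check of the single-chain case: build the k-mer graph of an error-free sequence, then follow the unique successor of each node to read the sequence back. A repeated k-mer creates a branch or cycle and the walk is rejected. The function names and sequences are made up for illustration.

```python
def build_graph(sequence, k):
    """k-mers as nodes; an edge links each k-mer to the one starting one base later."""
    graph = {}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph.setdefault(kmer, set())
        if i + k < len(sequence):
            graph[kmer].add(sequence[i + 1:i + k + 1])
    return graph


def walk_single_chain(graph):
    """Return the sequence spelled by the chain, or raise if the graph is not one."""
    has_incoming = {t for successors in graph.values() for t in successors}
    starts = [node for node in graph if node not in has_incoming]
    if len(starts) != 1:
        raise ValueError("graph is not a single chain")
    node = starts[0]
    path, seen = [node], {node}
    while len(graph[node]) == 1:
        (node,) = graph[node]
        if node in seen:
            raise ValueError("graph contains a cycle")
        seen.add(node)
        path.append(node)
    if graph[node] or len(path) != len(graph):
        raise ValueError("graph branches or has unvisited nodes")
    # First k-mer in full, then the last letter of every following k-mer.
    return path[0] + "".join(n[-1] for n in path[1:])


print(walk_single_chain(build_graph("ATGGCGTACA", 3)))  # ATGGCGTACA
```

With a repeat, for example `build_graph("ATGATGC", 3)`, the k-mer `ATG` gets two successors and `walk_single_chain` raises `ValueError`.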