sequence assembly Flashcards

1
Q

read number

A
  • possible overlaps increases exponentially with read number
  • for n reads, 2n2 - 2n possible overlaps
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

assembly algorithms

A
  • depends on length of reads
  • greedy (phrap)
  • overlap graph (OLC)
    • limited by read number
  • de bruijn (velevet)
    • overcomes read number
  • string graph (SGA)
  • read length increasing so move back to phrap/olc
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Greedy approach (phrap)

A
  • simple local method
  • merge 2 sequences with largest overlap
  • continue overlapping until no more possibilities
  • can’t use with global information
    • paired end reads/mate pairs
    • essential for repetitive genomes
  • crossmatch program (SW)
  • HGP
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

OLC

A
  • overlap layout consensus
  • find best match between suffix of one read and prefix of another
  • dynamic programming to find optimal overlap alignment
  • allow mismatches to accommodate for errors
    • similarity threshold
  • filter out fragments with insufficiently long common substrings
  • local multiple alignments to derive consensus
  • if fragment has multiple overlaps choose best one
    • lay out to create best path
  • long reads only (need to identify every overlap - computationally challenging
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

k-mers

A
  • short reads e.g. illumina
  • short sequence of set length that is unique in the genome
  • in alphabet of n symbols, nk k-mers exist
    • k is predefined
  • for full assembly:
    • all k-mers in genome must be present and appear only once
    • take first k-mer and shift along one base
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

simple assembly

A
  • sanger sequencing
  • reads (k-mers) represented as nodes in graph
  • edges represent alignments
  • hamiltonian cycle to construct circular genome
  • concatenate first 2 nucleotides in each read
  • cover all nodes once and return to start
  • circular DNA assumed (start anywhere)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

improve simple assembly

A
  • chop reads into all possible k-mers
    • nodes represent k-mers
    • each successive node shifted along sequence by one nucleotide
  • removes redundancy
  • only one path exists traversing each node once
  • doesn’t scale to large graphs
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

de bruijn graph

A
  • solution to simple assembly
  • k-mer prefixes and suffixes represented as nodes
    • each occurs once in the graph
    • edges = whole k-mer sequence
  • eulerian cycle trhough graph
  • edge determines which path is correct where multiple overlaps exist
  • only one path visiting each node once
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

eulerian cycle

A
  • visit every edge of a graph exactly once
  • path visiting each node once possible if:
    • all k-mers in genome present (unlikely)
    • all k-mers error free (unlikely)
    • each k-mer occurs once (difficult with repeats, use paired end reads)
    • genome is single circular chromosome
      • linear needs eulerian path
  • can be adapted to overcome
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

k-mer size

A
  • limited by read size
    • read size too long → not all k-mers present
    • compromise between k-mer size and whoe genome representation
  • illumina:
    • 100-200bp reads, k-mer of 100+
    • reads never have all possible 100-mers in genome
  • longer k-mers overcome repeats
  • trial and error
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

de bruijn graph k-mers

A
  • break each 100bp read into 46 overlapping 55-mers
  • ensures nearly all 55-mers in genome detected
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

velvet

A
  • de novo assembly
  • short read sequencing (illumina, 454)
  • removes errors from short read sequences for high quality unique contigs
  • paired end read and lon gread infromation to find repeated areas between contigs
  • use of de bruijn
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

velvet de bruijn graph

A
  • cut reads into 50-mers overlapping by k-1 nucleotides
  • small k-mers: increased ocnnectivity, more ambiguous repeats
  • large k-mers: opposite
  • compromise specificity and sensitivity
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

velvet nodes

A
  • each node attached to complementary twin node
  • creates a block displaying the sequence
    • last nucleotide of each k-mer
    • last k-mer of block overlaps with first k-mer of its destination
  • series of reverse complement k-mers
  • branches where multiple overlap possibilities exist
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

velvet hash table

A
  • records ID of first read and its position for each k-mer
  • recorded with reverse complement (sequencing in both directions)
  • create node if distinct interruption point
  • trace reads through graph
  • create directed arcs where needed (interruptions)
  • merge blocks when only one path possible
    • if node only has one outoging arc, and destination node has onyl one ingoing arc
    • prevent information loss
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

velvet error removal

A
  • focus on topological features
  • first remove tips, then create bubbles
  • finish by removing erroneous connections
17
Q

tips

A
  • chain of nodes disconnected on one end
  • criteria for removal:
    • <2kbp (can be created by 2 nearby errors)
    • minority count - if tip less common than other possible path
18
Q

bubbles

A
  • multiple redundant paths that start and end at the same nodes
  • caused by small seuqencing errors
  • remove with tour bus algorithm
    • compares redundant paths with dynamic programming
    • merge if similar enough
19
Q

problems with velvet

A
  • memory needed to create graphs
  • data loss
    • loss of connectivity by chopping into k-mers
  • coverage
    • uneven coverage means low coverage areas lost in error removal