sequence assembly Flashcards

Question 1

Q

read number

Answer

A

possible overlaps increases exponentially with read number
for n reads, 2n² - 2n possible overlaps

Question 2

Q

assembly algorithms

Answer

A

depends on length of reads
greedy (phrap)
overlap graph (OLC)
- limited by read number
de bruijn (velevet)
- overcomes read number
string graph (SGA)
read length increasing so move back to phrap/olc

Question 3

Q

Greedy approach (phrap)

Answer

A

simple local method
merge 2 sequences with largest overlap
continue overlapping until no more possibilities
can’t use with global information
- paired end reads/mate pairs
- essential for repetitive genomes
crossmatch program (SW)
HGP

Question 4

Q

OLC

Answer

A

overlap layout consensus
find best match between suffix of one read and prefix of another
dynamic programming to find optimal overlap alignment
allow mismatches to accommodate for errors
- similarity threshold
filter out fragments with insufficiently long common substrings
local multiple alignments to derive consensus
if fragment has multiple overlaps choose best one
- lay out to create best path
long reads only (need to identify every overlap - computationally challenging

Question 5

Q

k-mers

Answer

A

short reads e.g. illumina
short sequence of set length that is unique in the genome
in alphabet of n symbols, n^k k-mers exist
- k is predefined
for full assembly:
- all k-mers in genome must be present and appear only once
- take first k-mer and shift along one base

Question 6

Q

simple assembly

Answer

A

sanger sequencing
reads (k-mers) represented as nodes in graph
edges represent alignments
hamiltonian cycle to construct circular genome
concatenate first 2 nucleotides in each read
cover all nodes once and return to start
circular DNA assumed (start anywhere)

Question 7

Q

improve simple assembly

Answer

A

chop reads into all possible k-mers
- nodes represent k-mers
- each successive node shifted along sequence by one nucleotide
removes redundancy
only one path exists traversing each node once
doesn’t scale to large graphs

Question 8

Q

de bruijn graph

Answer

A

solution to simple assembly
k-mer prefixes and suffixes represented as nodes
- each occurs once in the graph
- edges = whole k-mer sequence
eulerian cycle trhough graph
edge determines which path is correct where multiple overlaps exist
only one path visiting each node once

Question 9

Q

eulerian cycle

Answer

A

visit every edge of a graph exactly once
path visiting each node once possible if:
- all k-mers in genome present (unlikely)
- all k-mers error free (unlikely)
- each k-mer occurs once (difficult with repeats, use paired end reads)
- genome is single circular chromosome
  - linear needs eulerian path
can be adapted to overcome

Question 10

Q

k-mer size

Answer

A

limited by read size
- read size too long → not all k-mers present
- compromise between k-mer size and whoe genome representation
illumina:
- 100-200bp reads, k-mer of 100+
- reads never have all possible 100-mers in genome
longer k-mers overcome repeats
trial and error

Question 11

Q

de bruijn graph k-mers

Answer

A

break each 100bp read into 46 overlapping 55-mers
ensures nearly all 55-mers in genome detected

Question 12

Q

velvet

Answer

A

de novo assembly
short read sequencing (illumina, 454)
removes errors from short read sequences for high quality unique contigs
paired end read and lon gread infromation to find repeated areas between contigs
use of de bruijn

Question 13

Q

velvet de bruijn graph

Answer

A

cut reads into 50-mers overlapping by k-1 nucleotides
small k-mers: increased ocnnectivity, more ambiguous repeats
large k-mers: opposite
compromise specificity and sensitivity

Question 14

Q

velvet nodes

Answer

A

each node attached to complementary twin node
creates a block displaying the sequence
- last nucleotide of each k-mer
- last k-mer of block overlaps with first k-mer of its destination
series of reverse complement k-mers
branches where multiple overlap possibilities exist

Question 15

Q

velvet hash table

Answer

A

records ID of first read and its position for each k-mer
recorded with reverse complement (sequencing in both directions)
create node if distinct interruption point
trace reads through graph
create directed arcs where needed (interruptions)
merge blocks when only one path possible
- if node only has one outoging arc, and destination node has onyl one ingoing arc
- prevent information loss

Question 16

Q

velvet error removal

Answer

Study These Flashcards

A

focus on topological features
first remove tips, then create bubbles
finish by removing erroneous connections

Question 17

Q

tips

Answer

Study These Flashcards

A

chain of nodes disconnected on one end
criteria for removal:
- <2kbp (can be created by 2 nearby errors)
- minority count - if tip less common than other possible path

Question 18

Q

bubbles

Answer

Study These Flashcards

A

multiple redundant paths that start and end at the same nodes
caused by small seuqencing errors
remove with tour bus algorithm
- compares redundant paths with dynamic programming
- merge if similar enough

Question 19

Q

problems with velvet

Answer

Study These Flashcards

A

memory needed to create graphs
data loss
- loss of connectivity by chopping into k-mers
coverage
- uneven coverage means low coverage areas lost in error removal

sequence assembly Flashcards

(19 cards)