Algos Problem 3: Assembly of Genomes Flashcards
Hamiltonian path / cycle ?
Path: visits each vertex exactly once.
A Hamiltonian cycle (or Hamiltonian circuit) is a cycle that visits each vertex exactly once.
How do we represent graphs in computer
code / data structures?
Adjacency matrix [aka array]
(also called transition matrix for directed graphs)
(Q: anything special about the adjacency matrix for
undirected graphs?)
Adjacency list (queue)
When is a graph eulerian?
A Graph is Eulerian if It Contains an Eulerian Cycle.
Every Eulerian graph is balanced
A graph is balanced if indegree = outdegree for each node
Eulerian trail?
For the existence of Eulerian trails it is necessary that zero or two vertices have an odd degree.
If such a walk exists, the graph is called traversable or semi-eulerian.[4]
What is a heuristic?
In computer science, a heuristic is a problem-solving strategy or method that is not guaranteed to find the optimal solution, but is designed to find a satisfactory solution in a reasonable amount of time.
Sometimes there is more than one solution to a de bruijn graph for a genome assembly. What can you do to avoid the repeat problem?
DNA Sequencing with Read-pairs
- Multiple identical copies of genome
- Randomly cut genomes into large equally sized fragments of size InsertLength
- Generate read-pairs: two reads from the ends of each fragment (separated by a fixed distance)
Paired de Bruijn graph for a collection of paired k-mers:
– Represent every paired k-mer as an edge between its paired prefix and paired suffix.
– Glue ALL nodes with identical labels.
Contig?
A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA.[1] In bottom-up sequencing projects, a contig refers to overlapping sequence data
Contig: maximal non-branching paths (i.e. fragments) for which in(v) = out(v) = 1 for all nodes except first and last
Four Ridiculously Unrealistic Assumptions in genome assembly
- Perfect coverage of genome by reads (every k-merfrom the genome is represented by a read)
- Reads are error-free.
- Multiplicities of k-mers are known
- Distances between reads within read-pairs are exact.
also: We don’t know the orientation of reads (forward/reverse) Etc., etc., etc.
Errors in reads –> what in the graph?
Errors in Reads Lead to Bubbles in the
De Bruijn Graph