Genome analysis - de novo assembly Flashcards
What is a de novo genome assembly?
When you sequence and assemble a new genome that does not have a reference genome.
Why are we interested in de novo assembly?
- It gives an inventory of the genetic information in an organism and tells us what the organism can do and how it has evolved.
- It can serve as a reference for functional genomics and genome-wide association studies in the future.
What is the shotgun approach? What is needed for this approach?
The shotgun approach is a way of sequencing and assembling a de novo genome.
- Randomly fragment the DNA and sequence the fragments
- Find overlaps between the reads
- Assemble overlaps into contigs
- Assemble contigs into scaffolds.
To be able to find the overlaps we need high sequence coverage, especially if we are using short read sequencing.
What is sequence coverage (read depth)?
How many times each base has been sequenced.
What are contigs and scaffolds?
During de novo assembly we sequence fragments of the genome and find overlaps between the reads. Those overlapping reads are assembled into continuous, gap-free sequences called contigs.
Then, in the scaffolding step, the contigs are connected by large-insert (paired-end/mate-pair) reads, which generally originate from large DNA fragments or fosmid inserts of several kilobases in length. The ordered set of connected contigs is defined as a ‘scaffold’, which provides information about the relative positions and orientations of the contigs within the genome.
How can we know the expected number of contigs?
Number of contigs = N·e^(-c·o), where
N = number of reads
c = coverage = NL/G
o = 1 - T/L (the usable fraction of a read for overlap detection)
L = read length
T = minimum detectable overlap
G = genome size
This can also be used to estimate the coverage that your sequencing needs.
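The formula above can be sketched in Python (the numbers below are hypothetical, just to illustrate the calculation):

```python
import math

def expected_contigs(N, L, G, T):
    """Lander-Waterman-style estimate of the expected number of contigs.
    N = number of reads, L = read length, G = genome size,
    T = minimum detectable overlap."""
    c = N * L / G       # coverage (read depth)
    o = 1 - T / L       # usable fraction of a read for overlap detection
    return N * math.exp(-c * o)

# Hypothetical run: 1 Mb genome, 100 bp reads, 5x coverage, 30 bp min overlap
print(round(expected_contigs(N=50_000, L=100, G=1_000_000, T=30)))  # → 1510
```

Note how quickly the expected number of contigs drops as coverage rises: the exponent scales with c.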
How do the minimum detectable overlap, read length and coverage affect the expected number of contigs?
When the minimum detectable overlap is higher, more individual contigs are generated. Each read will contribute to a smaller portion of the genome because it has fewer overlaps with other reads. As a result, more contigs will be produced.
Longer reads will give fewer individual contigs because it is easier to find overlaps without needing as high coverage.
Higher coverage helps in finding overlaps, especially needed for short reads.
What are overlaps?
The more similar the end of one read is to the beginning of another, the more likely they are to have originated from overlapping stretches of the genome.
What are the general problems in de novo assemblies?
- Bias due to technology and/or sequence composition
- Sequencing errors
- Heterozygosity in the data within the genome
These problems have to be solved with experimental design or assembly algorithms.
Why do we need high coverage for the shotgun approach?
To find the overlaps between the reads. Longer reads also make it easier to find the overlaps, but longer reads generally have lower per-base sequencing quality.
What are the key computational challenges with using long vs short reads for assembly?
The key challenge with shotgun assembling long reads is that overcoming their higher error rate is computationally demanding.
Short reads make it harder to find the overlaps and limit the ability to resolve repeats. We also need higher coverage - we need to sequence more - to find the overlaps.
What is the benefit of using long vs short reads for assembling genomes?
Having longer reads will generally reduce the number of contigs because it is easier to find the overlaps. It gives higher continuity of the assembly, which we want, for example, when comparing our assembly to other genomes to find larger structural variations.
Short reads have high accuracy, high throughput and high resolution. Good to use if you are interested in the sequences of genes when the demands on continuity are not as high.
What is minimum detectable overlap?
How long the overlap needs to be to be detected. If this value is higher, more individual contigs are generated because fewer of the reads will overlap and fewer reads will be connected into continuous sequences.
What are greedy assembly algorithms?
The first attempt at solving the assembly problem.
These algorithms aim to assemble the genome by locally selecting the most promising overlaps based on simple parameters such as sequence similarity and overlap length and then merging the two fragments. This is repeated until no more merges can be done.
It chooses the most parsimonious explanation for the data.
What algorithms would you use for short reads vs long reads assembly?
If you have long reads, use overlap assemblers.
If you have short reads, use de Bruijn assemblers.
If you have both long and short reads, assemble with the read type that makes up the majority of your data and then correct the assembly with the other.
What is the long read assembly pipeline (overlaps graphs)?
Reads →
overlap (build overlap graphs) →
Layout (bundle stretches of overlaps into contigs, contig graph and determine the path through the graph) →
Consensus (pick most likely nucleotide sequence for each contig →
Contigs.
Explain how overlap graphs are constructed
An overlap graph is constructed such that the nodes are sequencing reads. We put an edge between two nodes if the end of one read overlaps with the beginning of the other read.
The overlap graph is a representation of the relationships between the reads, but we cannot tell the exact sequence just by looking at it because it usually ends up being big and messy.
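As a minimal sketch (naive all-against-all comparison, not how real assemblers scale), an overlap graph could be built like this:

```python
def suffix_prefix_overlap(a, b, min_olap):
    """Length of the longest suffix of read a matching a prefix of
    read b, or 0 if it is shorter than the minimum detectable overlap."""
    for olap in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-olap:] == b[:olap]:
            return olap
    return 0

def overlap_graph(reads, min_olap=3):
    """Nodes are reads; an edge (i, j) means read i's end overlaps
    read j's beginning, weighted by the overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olap = suffix_prefix_overlap(a, b, min_olap)
                if olap:
                    edges[(i, j)] = olap
    return edges

reads = ["ATCGTA", "CGTACC", "TACCGG"]
print(overlap_graph(reads))  # → {(0, 1): 4, (1, 2): 4}
```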
Explain the layout step of the overlap assembly pipeline
In the layout step we remove the edges of the overlap graph that are redundant; edges that can be inferred from other edges are removed as well. We then emit contigs from the non-branching stretches and determine the path through the graph.
What is a path in the context of overlap graphs?
A sequence of nodes such that from each node there is an edge to the next node in the sequence.
Solving the assembly is the problem of identifying a path through the graph.
There are Hamiltonian paths (visit each node exactly once) and Eulerian paths (visit each edge exactly once). The Eulerian path problem is the easier one to solve.
In an overlap graph we look for a Hamiltonian path; in a de Bruijn graph we look for an Eulerian path.
What is the consensus step of the overlap assembly pipeline?
In the consensus step we line up all the reads that make up a contig using multiple sequence alignment and choose the consensus nucleotide at each position for the assembly.
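A toy version of the consensus step (assuming the reads have already been gap-padded by the multiple sequence alignment, which is the hard part in practice):

```python
from collections import Counter

def consensus(aligned_reads):
    """Column-wise majority vote over aligned, gap-padded reads
    ('-' marks positions a read does not cover)."""
    length = max(len(r) for r in aligned_reads)
    result = []
    for i in range(length):
        counts = Counter(r[i] for r in aligned_reads
                         if i < len(r) and r[i] != '-')
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

# Three reads tiling one contig; the middle read has one error (G vs T)
print(consensus(["ATCGTAC--", "-TCGGACTA", "--CGTACT-"]))  # → ATCGTACTA
```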
How are overlaps found in the overlap assembly pipeline?
The overlaps are found by comparing all reads against each other to find regions of overlap. A seed-and-extend algorithm is used: choose a k-mer size, look for exact matches of that length, and then extend to both sides.
The overlaps found can be true, or they can be false, for example when the shared k-mer comes from the end of a repetitive sequence.
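A sketch of seed-and-extend (index the k-mers of one read, then greedily extend each exact seed match in both directions):

```python
def seed_and_extend(a, b, k=4):
    """Find exact k-mer seed matches between reads a and b, then extend
    each seed left and right while the bases keep matching.
    Returns (start in a, start in b, match length) tuples."""
    index = {}
    for i in range(len(a) - k + 1):          # index every k-mer of read a
        index.setdefault(a[i:i + k], []).append(i)
    hits = set()
    for j in range(len(b) - k + 1):          # scan read b's k-mers
        for i in index.get(b[j:j + k], []):
            lo = 0                           # extend to the left
            while i - lo > 0 and j - lo > 0 and a[i - lo - 1] == b[j - lo - 1]:
                lo += 1
            hi = k                           # extend to the right
            while i + hi < len(a) and j + hi < len(b) and a[i + hi] == b[j + hi]:
                hi += 1
            hits.add((i - lo, j - lo, lo + hi))
    return sorted(hits)

print(seed_and_extend("ATCGTACG", "CGTACGTT", k=4))  # → [(2, 0, 6)]
```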
Give examples of overlap assemblers
Celera and later Canu.
What is the short read assembly pipeline?
Error correction ((remove errors in our sequences) shrinks the assembly graph, reducing time and memory requirements) →
Graph construction (de Bruijn graph) →
Graph Cleaning →
Contig assembly →
Scaffolding →
Gap Filling.
Explain the error correction step of the short read pipeline.
The error rate of short reads is usually lower than for long reads but the beginning of the read usually has higher accuracy than the end.
In this step we count how many times each k-mer occurs across all reads; k-mers that contain errors should be very rare.
We do this because it shrinks the assembly graph, reduces time and reduces sensitivity to errors during the assembly.
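The k-mer counting can be sketched like this (toy reads; real error correctors use probabilistic or disk-based counters to handle billions of k-mers):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads. k-mers containing a
    sequencing error should occur far less often than true k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ATCGA", "ATCGA", "ATCGA", "ATGGA"]  # last read has an error (G vs C)
counts = kmer_counts(reads, k=3)
rare = {km for km, n in counts.items() if n == 1}
print(rare)  # the singleton k-mers all come from the erroneous read
```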
Explain how de Bruijn graphs are constructed.
De Bruijn assemblers use hash tables to find overlaps similarly to overlap assemblers, but they do not find the full overlaps because they do not extend the overlapping k-mers.
k-mers are nodes and adjacent k-mers are linked together by edges. Every node is a unique k-mer and every edge represents an overlap of length k-1.
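A minimal construction following that definition (unique k-mers as nodes, k-1 overlaps as edges):

```python
def de_bruijn(reads, k):
    """Nodes are the unique k-mers; an edge links k-mer a to k-mer b
    when a's last k-1 bases equal b's first k-1 bases."""
    nodes, edges = set(), set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return nodes, edges

# Two overlapping reads share k-mers, so the graph stays small
nodes, edges = de_bruijn(["ATCGT", "TCGTA"], k=3)
print(len(nodes), len(edges))  # → 4 3
```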
Why are de Bruijn graphs less messy than overlap graphs?
Because we do error correction before the assembly and each node is a unique k-mer so there is less redundancy.
What are tips in de bruijn graphs?
In de Bruijn graphs, “tips” are structures that represent the ends of branches or dead ends in the graph.
These are sequences of nodes that are not fully connected to the rest of the graph and have no outgoing edges, meaning they do not extend further.
Tips often indicate areas of the genome where sequencing coverage is low, where sequencing errors are present, or where the true sequence simply ends. During graph clean-up we remove tips as well as bubbles.
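One round of tip clipping might look like this (a simplified sketch; real assemblers also weigh tip length and coverage before removing anything):

```python
def remove_tips(adj):
    """Clip dead ends from a graph given as {node: set of successors}.
    A tip here is a node with no outgoing edges whose predecessors all
    branch, so removing it cannot break the main path."""
    preds = {}
    for u, vs in adj.items():
        for v in vs:
            preds.setdefault(v, set()).add(u)
    tips = [v for v in preds
            if not adj.get(v)                              # dead end
            and all(len(adj[u]) > 1 for u in preds[v])]    # predecessors branch
    for v in tips:
        for u in preds[v]:
            adj[u].discard(v)
        adj.pop(v, None)
    return adj

# Main path A->B->C->D plus an erroneous dead-end branch B->X
print(remove_tips({"A": {"B"}, "B": {"C", "X"}, "C": {"D"}}))
```

Note that D, the genuine end of the path, survives because its predecessor C does not branch.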
What are bubbles in de Bruijn graphs?
“Bubbles” are structures that represent alternative paths between two points in the graph. Bubbles occur when there are multiple possible routes through the graph that eventually reconverge at a common point.
They can arise due to genomic variations, such as single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), or sequencing errors.
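Simple two-path bubbles (e.g. a SNP between haplotypes) can be detected by looking for branches that reconverge one node later, as in this sketch:

```python
def find_bubbles(adj):
    """Report (start, path1, path2, end) for every pair of successors of
    a node that reconverge at a common node one step later."""
    bubbles = []
    for u, vs in adj.items():
        vs = sorted(vs)
        for i in range(len(vs)):
            for j in range(i + 1, len(vs)):
                for w in adj.get(vs[i], set()) & adj.get(vs[j], set()):
                    bubbles.append((u, vs[i], vs[j], w))
    return bubbles

# A branches into B and C, which reconverge at D: one bubble
print(find_bubbles({"A": {"B", "C"}, "B": {"D"}, "C": {"D"}}))  # → [('A', 'B', 'C', 'D')]
```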
Why would you use different assembly algorithms for long and short reads?
Overlap assemblers work well for long reads but are problematic for short reads: many more reads are needed to reach the same sequence coverage as with long reads, so computing all pairwise overlaps takes too long.
Another problem is that the overlaps are shorter, which makes it hard to tell the true ones from the false ones and pushes the required coverage (read depth) even higher. For de Bruijn assemblers, the k-mer length is a trade-off: longer k-mers usually give less complex graphs, but if the k-mers are too long and there are many sequencing errors, we will have a hard time finding overlaps because no similar k-mers are shared between reads.
What are the basic statistics for determining if your assembly is of good quality?
- number of contigs
- number of scaffolds
- largest contig
- total length of the assembly
- N50 - the contig length such that contigs of equal or greater length together contain 50% of the bases in the assembly. Sort the contigs from longest to shortest; N50 is the length of the contig at which the cumulative sum first reaches >= 50% of the total assembly length.
- L50 - the smallest number of contigs whose summed length makes up 50% of the assembly (or genome) size.
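Both statistics can be computed in a few lines:

```python
def n50_l50(contig_lengths):
    """N50: length of the contig at which the cumulative length of the
    sorted (longest-first) contigs first reaches 50% of the assembly.
    L50: how many contigs that takes."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    total = 0
    for i, length in enumerate(lengths, start=1):
        total += length
        if total >= half:
            return length, i

print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # total 300 → N50=70, L50=2
```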