4. NGS: Assembly Flashcards
online databases:
uniprot - sprot vs trembl?
- sprot: manually curated
- trembl: computer annotated
online databases:
uniprot?
uniprot: functionally annotated protein sequences
online databases:
Genbank
eg Arabidopsis thaliana?
sequence data obtained and submitted by scientists
2,700,417 Arabidopsis thaliana sequence entries
(full length and partial protein-coding genes, non-coding genes, nuclear fragments/chromosomes/scaffolds, organellar genes/genomic regions,…)
online databases:
genome resource?
complete set of protein-coding genes for one genome (nt, aa)
several exist (species specific, NCBI, ENSEMBL, UniProt, JGI …)
.fastq file?
4 rows?
- Raw and pre-processed data
- text-based format
- storing both a biological sequence and its corresponding quality scores.
- Both encoded with a single ASCII character for brevity.
row 1 - @ sequence identifier - location on flow cell
row 2 - raw sequence letters
row 3 - a plus sign ‘+’ (optionally followed by the sequence identifier again)
row 4 - quality of the read (=same length as sequence row)
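A minimal sketch of reading the four-row records described above. It assumes the common Phred+33 encoding (quality = ASCII code minus 33), which the card does not state; `parse_fastq` and `FastqRecord` are hypothetical names used for illustration only.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str        # row 1 without the leading '@'
    sequence: str          # row 2
    qualities: list[int]   # row 4 decoded to Phred scores


def parse_fastq(text: str, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per four rows; offset=33 is an assumed encoding."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("quality row must be as long as the sequence row")
        # Each quality character encodes one score for one base.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


records = list(parse_fastq("@read1\nACGT\n+\nII5!\n"))
```

Here 'I' decodes to 40, '5' to 20 and '!' to 0 under the assumed offset.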
Sequence data: coverage?
each genome position is sequenced multiple times
coverage: average number of times each base is covered by independent reads
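As a rough illustration of the definition above, average coverage can be estimated as total sequenced bases divided by genome size. This is a sketch under the assumption that all reads map somewhere in the genome; `mean_coverage` is a hypothetical helper name.

```python
def mean_coverage(read_lengths: list[int], genome_size: int) -> float:
    """Average number of times each base is covered: sum of read lengths / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return sum(read_lengths) / genome_size


# 500 reads of 100 bp over a 1,000 bp genome give 50x coverage.
example = mean_coverage([100] * 500, 1000)
```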
Sequence data - optimal coverage? Sanger? Illumina?
optimal coverage depends on objective, sample, sequencing technology, …
* Sanger: a coverage of 8x is sufficient
* Illumina: 50-100x coverage is typical
Principle of assembly
overlaps?
overlaps of identical sequence regions are assumed to originate from the same genomic location
this assumption is often not met!
- same sequence, different origins
- different sequence, same origin
What are the challenges of assembly?
- untrimmed poor-quality reads
- errors
- base calling errors
- deletion
- insertion
- low coverage, bad linkage
- unknown orientation
- contamination
- high heterozygosity
- polyploidy
- repeats !!!
What problem do repeats present in assembly?
if repeats < fragments: ok, can be assembled correctly
if:
repeats > fragments or
repeats >> fragments
reads can’t be matched correctly
and some repeats in eukaryotes are much larger than the size of reads! (eg transposable elements!)
What types of repeats?
- simple repeats
- tandem or dispersed gene families
- segmental duplications
- interspersed repeats (transposable elements)
- DNA transposons (corn Ac element)
- viral retrotransposons (yeast Ty, fly Copia elements)
- non-viral retrotransposons (SINEs, LINEs)
- polyploids!!
Assembly: general steps ?
- identify read overlaps, assemble into contigs
- determine order and orientation of contigs: scaffolds
- finish into a chromosome-level assembly
Assembly: different approaches for contigs?
greedy
OLC (overlap layout consensus)
de Bruijn
Assembly: greedy approach
what is generated?
what next?
what may this result in?
when to terminate?
generate a look-up table with the prefixes and suffixes of all sequence reads
Then, given a starting read, extend it with another read, every extension: based on greatest overlap
–> however, this may result in misassembly
terminate when conflicting information is found
* e.g., two or more reads could extend but do not overlap each other
* result >=1 contigs
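The greedy idea above can be sketched as follows. This is a simplified illustration, not a real assembler: it only extends to the right, compares reads directly instead of using a prefix/suffix look-up table, and treats a tie for the greatest overlap as the "conflicting information" that stops extension. The names `overlap` and `greedy_extend` and the `min_len` threshold are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_extend(seed: str, reads: list[str], min_len: int = 3) -> str:
    """Extend a starting read to the right, always taking the greatest overlap."""
    contig = seed
    unused = [r for r in reads if r != seed]
    while unused:
        scored = sorted(((overlap(contig, r, min_len), r) for r in unused), reverse=True)
        best_len, best_read = scored[0]
        if best_len == 0:
            break  # nothing overlaps: the contig ends here
        if len(scored) > 1 and scored[1][0] == best_len:
            break  # two equally good extensions: conflicting information
        contig += best_read[best_len:]  # append only the non-overlapping part
        unused.remove(best_read)
    return contig


contig = greedy_extend("ATGGCG", ["ATGGCG", "GCGTGC", "TGCAAT"])
```

With these three toy reads the seed is extended twice, giving ATGGCGTGCAAT.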
Assembly: greedy approach
usage?
results? problems?
Example?
still used to assemble organellar genomes from short-read shotgun data
- smaller size than nuclear genome
- single (circular) genome
- several hundred copies per cell
- generally good results (but problems with repeats!)
computer lab: NOVOPlasty
OLC algorithm
what is it? what type of data structure used and what do elements represent?
OLC algorithm: overlap graph
overlap (string) graph
* nodes represent sequence reads
* edge weights: prefix and suffix overlap
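A toy version of such an overlap graph, assuming exact suffix-prefix matches and a brute-force all-against-all comparison (real tools use suffix trees or dynamic programming, as noted elsewhere in these cards). `build_overlap_graph` and `min_len` are hypothetical names.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads: list[str], min_len: int = 3) -> dict[tuple[str, str], int]:
    """Nodes are reads; a directed edge (a, b) is weighted by the overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):  # every ordered pair: quadratic in read count
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph


graph = build_overlap_graph(["ATGGCG", "GCGTGC", "TGCAAT"])
```

The quadratic loop over all ordered pairs is also a small-scale picture of why this step becomes expensive with many short reads.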
OLC algo
goal and approach?
goal & approach
* find genome / contigs by traversing the graph
OLC
what can you say about computing the overlap graph? how is it done?
how is it updated?
compute overlap graph (huge! messy!)
find overlaps with suffix trees and/or dynamic programming
layout
* remove redundancy in the graph where possible
* compute contigs from parts of the overlap graph
consensus
* pick for each contig position one nucleotide
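Picking one nucleotide per contig position can be illustrated with a simple majority vote. This sketch assumes the reads are already laid out against each other and padded with '-' where they do not cover a position; `consensus` is a hypothetical helper, and real consensus callers also weigh base qualities.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Majority base per column; '-' marks positions a read does not cover."""
    width = max(len(r) for r in aligned_reads)
    result = []
    for col in range(width):
        bases = [r[col] for r in aligned_reads if col < len(r) and r[col] != "-"]
        if bases:
            # most_common(1) returns the most frequent base in this column
            result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)


# The third read carries a C where the other two reads agree on G.
seq = consensus(["ACGT--", "-CGTA-", "--CTAC"])
```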
OLC approach is suitable/not suitable for …?
computationally not feasible for NGS short-reads
* shorter reads & higher coverage
➡ too many pairwise comparisons
➡ too few unique overlaps
* higher error rates than Sanger, different error patterns
* repeats!!
OLC - problems?
multiple possible paths = multiple possible genomes
* contigs: set of longest contiguous segments which can unambiguously be identified - usually hundreds of thousands!
repeats!
* repeats that can’t be resolved are often left out
What was OLC first developed for and what does that mean for us?
- first developed and used for Sanger data
- updated for new long-read technologies! (Pacific Biosciences, Oxford Nanopore)
not so good for short reads of NGS
Best approach for NGS ?
many short reads –>
de Bruijn graph
De Bruijn graph - method?
- sequence reads
- break reads into k-mers (e.g., 4-mers)
- use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
- target sequence: path that visits each edge once…at least in theory!
= eulerian path (cycle)
(output: also contigs!)
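The construction steps above can be sketched like this, using the usual convention that k-mers are edges between their (k-1)-mer prefix and suffix. It is a simplification: each distinct k-mer becomes one edge, and only too-infrequent k-mers are dropped via an assumed `min_count` threshold. `build_de_bruijn` is a hypothetical name.

```python
from collections import Counter, defaultdict


def build_de_bruijn(reads: list[str], k: int, min_count: int = 1) -> dict[str, list[str]]:
    """Map each (k-1)-mer node to the nodes reachable through one k-mer edge."""
    kmer_counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    graph = defaultdict(list)
    for kmer, count in kmer_counts.items():
        if count >= min_count:  # rare k-mers are likely sequencing errors
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


dbg = build_de_bruijn(["ATGGC", "GGCGT"], k=3)
```

Raising `min_count` to 2 on this toy input keeps only the edge for GGC, the one k-mer seen in both reads.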
What could be the reasons for a bubble in a genome assembly graph?
biological:
- SNP
- heterozygosity
technological:
- sequencing errors
DBG - What are some more complex topological features?
- spur
- bubble
- frayed rope
Shotgun sequencing definition
Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome.
The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually.
A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome.
How to determine the best k-mer size
break reads into k-mers, plot their abundances, repeat for multiple values for k
* choose k with optimal separation of signal from noise
* choose k large enough to reduce the number of (redundant) nodes
* choose k small enough to reduce the number of subgraphs (gaps)
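The "plot their abundances" step amounts to building a k-mer abundance spectrum for each candidate k. A minimal sketch, with `kmer_spectrum` as a hypothetical helper; dedicated k-mer counting tools would be used on real data.

```python
from collections import Counter


def kmer_spectrum(reads: list[str], k: int) -> dict[int, int]:
    """Return {abundance: number of distinct k-mers seen that many times}."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return dict(sorted(Counter(counts.values()).items()))


reads = ["ATGGC", "GGCGT"]
# Repeat for several k values and compare the resulting histograms.
spectra = {k: kmer_spectrum(reads, k) for k in (3, 4)}
```

In real data, the low-abundance peak is mostly error k-mers (noise) and the main peak sits near the k-mer coverage (signal).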
Software using the de Bruijn approach
- data structures?
- errors?
- implementations?
- which to choose?
- auxiliary data structures store information about reads, k-mers, indices, positions, paired reads, etc
- handling of sequencing errors, repeats, etc
- incorporation of quality information
- different implementations (for genomes)
- Velvet, ABySS, AllPaths, Meraculous, SOAPdenovo, …
- different data structures!
- hash table, Bloom filter, FM index, etc
no single optimal implementation & parameter settings ➜ try different software/parameters
Def contig
set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region
Def linkage
- close location of genes or other DNA markers to each other on chromosomes.
- The closer the genes are to each other on a chromosome, the more likely they are linked or inherited together from parents to offspring
Def consensus in OLC
- pick for each contig position one nucleotide
Def haplotype
- A haplotype is a physical grouping of genomic variants (or polymorphisms) that tend to be inherited together.
- A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.
Def meta-data
- data about the data!
- from the same gene, individual, population, species
- more than one sequence
- more than one data type
What is scaffolding in genome assembly?
- infer order, orientation, distance of contigs
- link contigs with Ns into scaffolds
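Linking contigs with Ns can be pictured as simple string joining once order, orientation and gap estimates are known. In this sketch `build_scaffold` is a hypothetical helper, and using a fixed run of 100 Ns for gaps of unknown size is an assumed placeholder convention, not something stated in these cards.

```python
from typing import Optional


def build_scaffold(contigs: list[str], gaps: list[Optional[int]], unknown_gap: int = 100) -> str:
    """Join ordered, oriented contigs; gaps[i] is the estimated distance after contig i."""
    if not contigs or len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap estimate between each pair of contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        n = unknown_gap if gap is None else gap  # None = distance not known
        parts.append("N" * n)
        parts.append(contig)
    return "".join(parts)


scaffold = build_scaffold(["ACGT", "GGCC"], [3])
```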
How can scaffolding be achieved?
- with mate pairs
- with long reads
- with long-range linkage information
Scaffolding with long-range linkage information
example?
originally designed for?
what does it identify?
what not known?
example: Hi-C, chromosome conformation capture
* originally designed to study 3D genome structure
* identify genomic regions physically adjacent in 3D → closer in 1D
* can be used to scaffold a draft genome assembly
* exact distance & orientation of contigs not known
Scaffolding with long reads
how? 2 ways
problem?
- long reads can link ≥2 contigs
- align contigs and long reads OR
- use hybrid assemblers that use both short and long reads as input
- problem: long reads generally have much higher error rates!
What issues are there in finishing a genome assembly?
How can this be solved?
Finishing a genome assembly
* most short read assemblies remain as scaffolds
- no chromosomal assignment of scaffolds
- large gaps
* filling gaps and chromosomal-level assembly is expensive and time-consuming
Can be resolved with long reads: now more chromosome-level assemblies
How are assemblies evaluated?
- placement of paired reads (orientation, distance)
- computation of quality metrics
- numbers & lengths of contigs and scaffolds
- N50 value
- quantify expected gene content for a given lineage - BUSCO
What is the N50 value
used to evaluate assemblies
N50: 50% of the entire assembly is contained in contigs/scaffolds of at least this length
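A small worked sketch of the definition above: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. `n50` is a hypothetical helper name.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that contigs/scaffolds of length >= L hold at least 50% of the assembly."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached half of the total assembly size
            return length
    raise AssertionError("unreachable for non-empty input")


# Total is 290; 80 + 70 = 150 is the first running sum that reaches 145.
value = n50([80, 70, 50, 40, 30, 20])
```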
Nuclear genome:
how is it really, and how is this data represented ?
nuclear genome
* diploid
* homozygous & heterozygous positions
* haplotypes
* multiple linear chromosomes
currently available data
* collapsed sequence data, pseudohaploid
* information about heterozygous positions*
* fully resolved haplotypes*
*requires special software and/or long or linked reads!
what does scaffold length and quality depend on?
scaffold length & quality depend on repeat content
What are some analysis challenges for NGS data?
data management (storage)
algorithms - new ones are always needed
analysis pipelines - need more expertise
multiple genome assemblies from the same species
* intraspecific diversity!
* inventory vs. working with pan-genomes
missing & fragmented & erroneous genome regions
What is a Hamiltonian path ?
passes through every VERTEX exactly once
What is an Eulerian path?
passes through every EDGE exactly once
EXAM QUESTION
what is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons (2022)
- infer order, orientation, distance of contigs
- link contigs with Ns into scaffolds
EXAM QUESTION
How to use Hamiltonian path in sequence assembly, two problems (2020)
DBG
- break reads into k-mers (e.g., 4-mers) –> these become the edges
- break these into k-1-mers –> these become the nodes
- use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
- target sequence: path that visits each edge once…at least in theory!
= eulerian path or cycle - not hamiltonian! nodes can be reused.
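To make the edge-versus-node distinction concrete, here is a sketch of walking an Eulerian path with Hierholzer's algorithm, a standard method that these cards do not name. It assumes the graph actually has an Eulerian path; `eulerian_path` and `spell` are hypothetical names. The second example revisits node AT, which a Hamiltonian path would not allow.

```python
from collections import defaultdict


def eulerian_path(edges: list[tuple[str, str]]) -> list[str]:
    """Visit every edge exactly once (assumes such a path exists)."""
    graph = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for u, v in edges:
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, else anywhere (cycle).
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())         # dead end: node is finished
    return path[::-1]


def spell(path: list[str]) -> str:
    """Rebuild the sequence: first node plus the last letter of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


linear = spell(eulerian_path([("AT", "TG"), ("TG", "GG"), ("GG", "GC"), ("GC", "CG"), ("CG", "GT")]))
reused = spell(eulerian_path([("AT", "TG"), ("TG", "GA"), ("GA", "AT"), ("AT", "TC")]))
```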
How do OLC and de Bruijn assembly methods deal with repeats?
Why?
Both handle unresolvable repeats by essentially leaving them out.
Unresolvable repeats break the assembly into fragments.
Fragments are contigs (short for contiguous)