4. NGS: Assembly Flashcards
online databases:
UniProt - Swiss-Prot (sprot) vs TrEMBL?
- Swiss-Prot: manually curated
- TrEMBL: computationally annotated, not manually reviewed
online databases:
UniProt?
UniProt: functionally annotated protein sequences
online databases:
GenBank
e.g. Arabidopsis thaliana?
sequence data obtained and submitted by scientists
2,700,417 Arabidopsis thaliana sequence entries
(full length and partial protein-coding genes, non-coding genes, nuclear fragments/chromosomes/scaffolds, organellar genes/genomic regions,…)
online databases:
genome resource?
complete set of protein-coding genes for one genome (nt, aa)
several exist (species specific, NCBI, ENSEMBL, UniProt, JGI …)
.fastq file?
4 rows?
- Raw and pre-processed data
- text-based format
- stores both a biological sequence and its corresponding quality scores
- each sequence letter and each quality score is encoded as a single ASCII character for brevity
row 1 - ‘@’ followed by the sequence identifier (e.g. location on the flow cell)
row 2 - raw sequence letters
row 3 - a plus sign ‘+’ (optionally followed by the sequence identifier again)
row 4 - quality scores of the read (same length as the sequence row)
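The four-row layout can be illustrated with a small parser. This is a minimal sketch, not a replacement for a real FASTQ library: the names `FastqRecord` and `parse_fastq` are made up for this card, and the quality offset of 33 (Phred+33) is an assumption that is not stated above.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str       # row 1 without the leading '@'
    sequence: str         # row 2
    qualities: List[int]  # row 4 decoded to integer scores


def parse_fastq(lines: List[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield one record per group of four rows (offset 33 is an assumption)."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(rows), 4):
        header, sequence, plus, quality = rows[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at row {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality rows differ in length")
        # One ASCII character per base: subtract the offset to get the score.
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


example = ["@read1", "ACGT", "+", "II5!"]
record = next(parse_fastq(example))
print(record.identifier, record.sequence, record.qualities)
# read1 ACGT [40, 40, 20, 0]
```

The length check mirrors the rule that row 4 must be as long as row 2.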
Sequence data: coverage?
each genome position is sequenced multiple times
coverage: average number of times each base is covered by independent reads
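The definition can be turned into simple arithmetic: average coverage = number of reads x read length / genome size. The sketch below assumes all reads have the same length; the function names and the example numbers (a 1 Mb genome, 100 bp reads) are hypothetical.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is covered by reads."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical example: 500,000 reads of 100 bp on a 1,000,000 bp genome.
print(average_coverage(500_000, 100, 1_000_000))  # 50.0
print(reads_needed(50, 100, 1_000_000))           # 500000
```

This is only the average: individual positions can be covered far more or far less often.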
Sequence data - optimal coverage? Sanger? Illumina?
optimal coverage depends on objective, sample, sequencing technology, …
* Sanger: a coverage of 8x is sufficient
* Illumina: 50-100x coverage is typical
Principle of assembly
overlaps?
overlaps of identical sequence regions are assumed to originate from the same genomic location
this assumption is often not met!
- same sequence, different origins (e.g. repeats)
- different sequence, same origin (e.g. sequencing errors, heterozygosity)
What are the challenges of assembly?
- untrimmed poor-quality reads
- errors: base-calling errors, deletions, insertions
- low coverage, bad linkage
- unknown orientation
- contamination
- high heterozygosity
- polyploidy
- repeats !!!
What problem do repeats present in assembly?
if repeats < fragments: ok, can be assembled correctly
if repeats > fragments or repeats >> fragments: reads can’t be matched correctly
and some repeats in eukaryotes are much larger than the size of reads! (eg transposable elements!)
What types of repeats?
- simple repeats
- tandem or dispersed gene families
- segmental duplications
- interspersed repeats (transposable elements)
- DNA transposons (corn Ac element)
- viral retrotransposons (yeast Ty, fly Copia elements)
- non-viral retrotransposons (SINEs, LINEs)
- polyploids!!
Assembly: general steps ?
- identify read overlaps, assemble into contigs
- determine order and orientation of contigs: scaffolds
- finish into a chromosome-level assembly
Assembly: different approaches for contigs?
greedy
OLC (overlap layout consensus)
de Bruijn graph
Assembly: greedy approach
what is generated?
what next?
what may this result in?
when to terminate?
generate a look-up table with the prefixes and suffixes of all sequence reads
then, given a starting read, extend it with another read; every extension is chosen by greatest overlap
-> however, this may result in misassembly
terminate when conflicting information is found
* e.g., two or more reads could extend the contig but do not overlap each other
* result: one or more contigs
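The greedy idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular assembler works: it computes overlaps directly instead of using a prefix/suffix look-up table, extends only to the right, requires exact matches, and treats a tie between two different best-overlapping reads as the "conflict" that stops extension. The names `overlap`, `greedy_extend` and the `min_len` threshold are hypothetical.

```python
from typing import List


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_extend(reads: List[str], start: int = 0, min_len: int = 3) -> str:
    """Extend the starting read to the right, always taking the greatest overlap."""
    contig = reads[start]
    unused = [read for i, read in enumerate(reads) if i != start]
    while unused:
        scored = sorted(((overlap(contig, read, min_len), read) for read in unused),
                        reverse=True)
        best_len, best_read = scored[0]
        if best_len == 0:
            break  # no remaining read overlaps the contig end
        if len(scored) > 1 and scored[1][0] == best_len and scored[1][1] != best_read:
            break  # conflict: two different reads extend equally well
        contig += best_read[best_len:]
        unused.remove(best_read)
    return contig


print(greedy_extend(["ATGGCGT", "GCGTACC", "TACCTGA"]))  # ATGGCGTACCTGA
print(greedy_extend(["AAACGT", "CGTTT", "CGTGG"]))       # AAACGT (stops at the conflict)
```

Because each step only looks at the locally best overlap, a repeat can make the sketch join reads from different genomic locations, which is the misassembly risk noted above.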
Assembly: greedy approach
usage?
results? problems?
Example?
still used to assemble organellar genomes from short-read shotgun data
- smaller size than nuclear genome
- single (circular) genome
- several hundred copies per cell
- generally good results (but problems with repeats!)
computer lab: NOVOPlasty
OLC algorithm
what is it? what type of data structure used and what do elements represent?
OLC algorithm: overlap graph
overlap (string) graph
* nodes represent sequence reads
* edges connect overlapping reads; edge weights: length of the suffix-prefix overlap
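A toy overlap graph can be built as a dictionary that maps a directed edge (read a, read b) to the length of the overlap between the suffix of a and the prefix of b. This is a sketch only: it compares all read pairs naively with exact matching, and `build_overlap_graph`, `suffix_prefix_overlap` and `min_len` are hypothetical names chosen for this card.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads: List[str], min_len: int = 3) -> Dict[Tuple[str, str], int]:
    """Nodes are reads; a directed edge a -> b is weighted by the overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):  # every ordered pair of reads
        weight = suffix_prefix_overlap(a, b, min_len)
        if weight > 0:
            graph[(a, b)] = weight
    return graph


graph = build_overlap_graph(["ATGGCGT", "GCGTACC", "TACCTGA"])
for (a, b), weight in graph.items():
    print(a, "->", b, "overlap", weight)
# ATGGCGT -> GCGTACC overlap 4
# GCGTACC -> TACCTGA overlap 4
```

Following the edges ATGGCGT -> GCGTACC -> TACCTGA spells out the contig ATGGCGTACCTGA.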
OLC algo
goal and approach?
goal & approach
* find genome / contigs by traversing the graph
OLC
what can you say about computing the overlap graph? how is it done?
how is it updated?
compute overlap graph (huge! messy!)
find overlaps with suffix trees and/or dynamic programming
layout
* remove redundancy in the graph where possible
* compute contigs from parts of the overlap graph
consensus
* pick one nucleotide for each contig position
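Picking one nucleotide per contig position can be sketched as a majority vote over the reads laid out along the contig. The sketch assumes the layout is already known and that each read is padded with '-' to the full contig length; the function name `consensus` and the padding convention are assumptions for illustration, and ties are resolved arbitrarily.

```python
from collections import Counter
from typing import List


def consensus(aligned_reads: List[str], gap: str = "-") -> str:
    """Majority vote per column; rows must be padded to equal length."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(base for base in column if base != gap)
        if counts:  # skip columns that no read covers
            result.append(counts.most_common(1)[0][0])
    return "".join(result)


layout = [
    "ATGGCGT---",
    "---GCGTACC",
    "--GGCTTAC-",  # contains one base-calling error (T instead of G)
]
print(consensus(layout))  # ATGGCGTACC
```

The erroneous T in the third read is outvoted by the two other reads covering that position, which is why sufficient coverage matters for the consensus step.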
OLC approach is suitable/not suitable for …?
computationally not feasible for NGS short reads
* shorter reads & higher coverage
➡ too many pairwise comparisons
➡ too few unique overlaps
* higher error rates than Sanger, different error patterns
* repeats!!
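The "too many pairwise comparisons" point can be made concrete with rough arithmetic: with n reads there are n(n-1)/2 read pairs to check for overlaps. The numbers below are hypothetical (a 1 Mb genome, 800 bp Sanger reads, 100 bp Illumina reads); only the 8x and 50x coverage values come from the coverage card.

```python
import math


def read_pairs(genome_size: int, coverage: float, read_length: int) -> int:
    """Number of unordered read pairs for a given genome size, coverage and read length."""
    n = math.ceil(genome_size * coverage / read_length)  # reads needed for this coverage
    return n * (n - 1) // 2


sanger = read_pairs(1_000_000, 8, 800)     # assumed 800 bp reads
illumina = read_pairs(1_000_000, 50, 100)  # assumed 100 bp reads
print(sanger)    # 49995000
print(illumina)  # 124999750000
```

Under these assumptions the short-read data set needs roughly 2,500 times more pairwise comparisons for the same genome.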