4. NGS: Assembly Flashcards

1
Q

online databases:

uniprot - sprot vs trembl?

A
  • sprot: manually curated
  • trembl: computer annotated
2
Q

online databases:
uniprot?

A

uniprot: functionally annotated protein sequences

3
Q

online databases:

Genbank

eg Arabidopsis thaliana?

A

sequence data obtained and submitted by scientists

2,700,417 Arabidopsis thaliana sequence entries
(full length and partial protein-coding genes, non-coding genes, nuclear fragments/chromosomes/scaffolds, organellar genes/genomic regions,…)

4
Q

online databases:

genome resource?

A

complete set of protein-coding genes for one genome (nt, aa)

several exist (species specific, NCBI, ENSEMBL, UniProt, JGI …)

5
Q

.fastq file?

4 rows?

A
  • Raw and pre-processed data
  • text-based format
  • stores both a biological sequence and its corresponding quality scores
  • each base and each quality score is encoded as a single ASCII character for brevity

row 1 - @ sequence identifier - location on flow cell
row 2 - raw sequence letters
row 3 - a plus sign ‘+’ (optionally followed by more text)
row 4 - quality of the read (=same length as sequence row)
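
The four rows can be read back with a few lines of Python. This is only a sketch: the function name `parse_fastq_record` and the example read are made up, and the Phred+33 quality offset is an assumption (it is the common encoding for current Illumina data, but the card does not state it).

```python
def parse_fastq_record(lines):
    """Split one 4-row FASTQ record into identifier, sequence, and quality scores."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("row 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("row 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("rows 2 and 4 must have the same length")
    # Assumption: Phred+33 encoding, i.e. score = ASCII code of the character - 33.
    scores = [ord(char) - 33 for char in quality]
    return header[1:], sequence, scores


# Hypothetical record, not real sequencing data.
record = ["@read1", "GATTACA", "+", "IIII#5?"]
name, seq, scores = parse_fastq_record(record)
print(name, seq, scores)
```

Each quality character maps to one base, which is why rows 2 and 4 must have the same length.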

6
Q

Sequence data: coverage?

A

each genome position is sequenced multiple times

coverage: average number of times each base is covered by independent reads
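
The average coverage can be estimated as (number of reads × read length) / genome size. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is covered by reads."""
    return num_reads * read_length / genome_size


# Hypothetical run: 40 million reads of 150 bp on a 120 Mb genome.
print(mean_coverage(40_000_000, 150, 120_000_000))  # 50.0
```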

7
Q

Sequence data - optimal coverage? Sanger? Illumina?

A

optimal coverage depends on objective, sample, sequencing technology, …
* Sanger: a coverage of 8x is sufficient
* Illumina: 50-100x coverage is typical

8
Q

Principle of assembly

overlaps?

A

overlaps of identical sequence regions are assumed to originate from the same genomic location

this assumption is often not met!
- same sequence, different origins
- different sequence, same origin

9
Q

What are the challenges of assembly?

A
  • untrimmed poor-quality reads
  • errors
    • base calling errors
    • deletion
    • insertion
  • low coverage, bad linkage
  • unknown orientation
  • contamination
  • high heterozygosity
  • polyploidy
  • repeats !!!
10
Q

What problem do repeats present in assembly?

A

if repeats < fragments: ok, can be assembled correctly

if:
repeats > fragments or
repeats >> fragments
reads can’t be matched correctly

and some repeats in eukaryotes are much larger than the size of reads! (eg transposable elements!)

11
Q

What types of repeats?

A
  • simple repeats
  • tandem or dispersed gene families
  • segmental duplications
  • interspersed repeats (transposable elements)
    • DNA transposons (corn Ac element)
    • viral retrotransposons (yeast Ty, fly Copia elements)
    • non-viral retrotransposons (SINEs, LINEs)
  • polyploids!!
12
Q

Assembly: general steps ?

A
  • identify read overlaps, assemble into contigs
  • determine order and orientation of contigs: scaffolds
  • finish into a chromosome-level assembly
13
Q

Assembly: different approaches for contigs?

A

greedy

OLC (overlap layout consensus)

de bruijn

14
Q

Assembly: greedy approach

what is generated?

what next?

what may this result in?

when to terminate?

A

generate a look-up table with the prefixes and suffixes of all sequence reads

then, given a starting read, extend it with another read; every extension is chosen based on the greatest overlap

→ however, this may result in misassembly

terminate when conflicting information is found
* e.g., two or more reads could extend but do not overlap each other
* result >=1 contigs
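
A simplified sketch of the greedy idea, under several assumptions: it compares all pairs directly instead of using a prefix/suffix look-up table, it always merges the pair with the largest overlap rather than starting from a seed read, it stops when no overlap of at least `min_length` remains (it does not detect conflicting extensions), and it ignores reverse complements and sequencing errors. The function names and reads are made up.

```python
def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the two sequences with the greatest overlap."""
    contigs = list(reads)
    while True:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap_length(left, right, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return contigs  # no overlaps left: result is one or more contigs
        merged = contigs[i] + contigs[j][length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads.
print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))  # ['ATGGCGTGCAATC']
```

Because each merge is locally optimal only, a repeat that produces a large but wrong overlap leads directly to a misassembly.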

15
Q

Assembly: greedy approach

usage?

results? problems?

Example?

A

still used to assemble organellar genomes from short-read shotgun data

  • smaller size than nuclear genome
  • single (circular) genome
  • several hundred copies per cell
  • generally good results (but problems with repeats!)

computer lab: NOVOPlasty

16
Q

OLC algorithm

what is it? what type of data structure used and what do elements represent?

A

OLC algorithm: overlap graph

overlap (string) graph
* nodes represent sequence reads
* edge weights: prefix and suffix overlap
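
A toy overlap graph can be built by brute-force pairwise comparison. This is a sketch with hypothetical names and reads; it assumes exact (error-free) overlaps and stores each edge as a `(left read, right read)` key whose value is the overlap length.

```python
from itertools import permutations


def suffix_prefix_overlap(left, right, min_length):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def build_overlap_graph(reads, min_length=3):
    """Nodes are reads; a directed edge left -> right is weighted by overlap length."""
    graph = {}
    for left, right in permutations(reads, 2):
        length = suffix_prefix_overlap(left, right, min_length)
        if length > 0:
            graph[(left, right)] = length
    return graph


# Hypothetical reads.
for edge, weight in build_overlap_graph(["ATGGCGT", "GCGTGCA", "TGCAATC"]).items():
    print(edge, weight)
```

The all-against-all loop makes the cost grow with the square of the number of reads, which is one reason this approach struggles with very large short-read data sets.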

17
Q

OLC algo

goal and approach?

A

goal & approach
* find genome / contigs by traversing the graph

18
Q

OLC

what can you say about computing the overlap graph? how is it done?

how is it updated?

A

compute overlap graph (huge! messy!)
find overlaps with suffix trees and/or dynamic programming

layout
* remove redundancy in the graph where possible
* compute contigs from parts of the overlap graph

consensus
* pick for each contig position one nucleotide

19
Q

OLC approach is suitable/not suitable for …?

A

computationally not feasible for NGS short-reads
* shorter reads & higher coverage
➡ too many pairwise comparisons
➡ too few unique overlaps
* higher error rates than Sanger, different error patterns
* repeats!!

20
Q

OLC - problems?

A

multiple possible paths = multiple possible genomes
* contigs: set of longest contiguous segments which can unambiguously be identified - usually hundreds of thousands!

repeats!
* repeats that can’t be resolved are often left out

21
Q

What was OLC first developed for and what does that mean for us?

A
  • first developed and used for Sanger data
  • updated for new long-read technologies! (Pacific Biosciences, Oxford Nanopore)

not so good for short reads of NGS

22
Q

Best approach for NGS ?

A

many short reads –>
de Bruijn graph

23
Q

De Bruijn graph - method?

A
  • sequence reads
  • break reads into k-mers (e.g., 4-mers)
  • use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
  • target sequence: path that visits each edge once… at least in theory!
    = Eulerian path (cycle)

(output: also contigs!)
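
The method can be sketched on a toy example. Assumptions: each distinct k-mer becomes one edge between its two (k-1)-mers, the reads are error-free and repeat-free, k-mer frequency filtering is left out, and the Eulerian path is found with Hierholzer's algorithm. All function names and reads are made up.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: visit every edge exactly once."""
    out_edges = {node: list(targets) for node, targets in graph.items()}
    balance = defaultdict(int)  # out-degree minus in-degree
    for node, targets in out_edges.items():
        balance[node] += len(targets)
        for target in targets:
            balance[target] -= 1
    # Start at the node with one surplus outgoing edge, else anywhere (cycle).
    start = next((n for n, b in balance.items() if b == 1), next(iter(out_edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical reads, k = 4.
graph = build_de_bruijn_graph(["ATGGCGT", "GGCGTGC", "GTGCAAT"], k=4)
print(spell(eulerian_path(graph)))  # ATGGCGTGCAAT
```

With real data the graph contains branches from repeats, errors, and heterozygosity, so a single unambiguous path rarely exists and the output is a set of contigs.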

24
Q

What could be the reasons for a bubble in a genome assembly graph?

A

biological:
- SNP
- heterozygosity

technological:
- sequencing errors

25
DBG - What are some more complex topological features?
- spur
- bubble
- frayed rope
26
Shotgun sequencing definition
Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome. The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually. A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome.
27
How to determine the best k-mer size?
break reads into k-mers, plot their abundances, repeat for multiple values of k
* choose k with optimal separation of signal from noise
* choose k large enough to reduce the number of (redundant) nodes
* choose k small enough to reduce the number of subgraphs (gaps)
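
The abundance data behind such a plot can be computed as follows. This is a sketch with a hypothetical function name and toy reads; real pipelines typically use a dedicated k-mer counter rather than plain Python.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Map each abundance (times a k-mer was seen) to the number of distinct k-mers."""
    counts = Counter(
        read[start:start + k]
        for read in reads
        for start in range(len(read) - k + 1)
    )
    return Counter(counts.values())


# Hypothetical reads; the last one differs in its final base (e.g. an error).
reads = ["ATGGCGT", "ATGGCGT", "ATGGCGA"]
for k in (3, 4, 5):
    print(k, sorted(kmer_abundance_histogram(reads, k).items()))
```

K-mers seen only once or twice are often errors (noise), while the main peak reflects genuine genomic k-mers (signal); comparing the histograms for several k shows where the two separate best.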
28
Software using the de Bruijn approach - data structures? - errors? - implementations? - which to choose?
* auxiliary data structures store information about reads, k-mers, indices, positions, paired reads, etc.
* handling of sequencing errors, repeats, etc.
* incorporation of quality information
* different implementations (for genomes)
  - Velvet, ABySS, ALLPATHS, Meraculous, SOAPdenovo, ...
  - different data structures: hash table, Bloom filter, FM index, etc.

no single optimal implementation & parameter settings ➜ try different software/parameters
29
Def contig
set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region
30
Def linkage
- close location of genes or other DNA markers to each other on chromosomes
- the closer the genes are to each other on a chromosome, the more likely they are linked or inherited together from parents to offspring
31
Def consensus in OLC
* pick for each contig position one nucleotide
32
Def haplotype
- A haplotype is a physical grouping of genomic variants (or polymorphisms) that tend to be inherited together.
- A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.
33
Def meta-data
* data about the data!
* from the same gene, individual, population, species
  - more than one sequence
  - more than one data type
34
What is scaffolding in genome assembly?
* infer order, orientation, distance of contigs
* link contigs with Ns into scaffolds
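
Once order, orientation, and distances have been inferred, joining the contigs is simple string work. A minimal sketch, assuming the placements and gap sizes are already known; the function names, contigs, and gap size are made up.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (uppercase A, C, G, T, N only)."""
    return sequence.translate(COMPLEMENT)[::-1]


def build_scaffold(placements, gaps):
    """Join ordered (contig, orientation) pairs with runs of N of the given sizes."""
    if len(gaps) != len(placements) - 1:
        raise ValueError("need exactly one gap size between each pair of contigs")
    parts = []
    for index, (contig, orientation) in enumerate(placements):
        parts.append(contig if orientation == "+" else reverse_complement(contig))
        if index < len(gaps):
            parts.append("N" * gaps[index])
    return "".join(parts)


# Hypothetical contigs: the second one lies on the reverse strand, 5 bp away.
print(build_scaffold([("ATGC", "+"), ("GGTA", "-")], gaps=[5]))  # ATGCNNNNNTACC
```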
35
How can scaffolding be achieved?
- with mate pairs
- with long reads
- with long-range linkage information
36
Scaffolding with long-range linkage information example? originally designed for? what does it identify? what not known?
example: Hi-C, chromosome conformation capture
* originally designed to study 3D genome structure
* identify genomic regions physically adjacent in 3D → closer in 1D
* can be used to scaffold a draft genome assembly
* exact distance & orientation of contigs not known
37
Scaffolding with long reads: how (2 ways)? problem?
- long reads can link ≥2 contigs
  - align contigs and long reads, OR
  - use hybrid assemblers that use both short and long reads as input
- problem: long reads generally have much higher error rates!
38
What issues are there in finishing a genome assembly? How can this be solved?
Finishing a genome assembly
* most short-read assemblies remain as scaffolds
  - no chromosomal assignment of scaffolds
  - large gaps
* filling gaps and chromosome-level assembly is expensive and time-consuming

Can be resolved with long reads: now more chromosome-level assemblies
39
How are assemblies evaluated?
* placement of paired reads (orientation, distance)
* computation of quality metrics
  - numbers & lengths of contigs and scaffolds
  - N50 value
* quantify expected gene content for a given lineage
  - BUSCO
40
What is the N50 value?
used to evaluate assemblies
N50: 50% of the entire assembly is contained in contigs/scaffolds of at least this length
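
The definition translates into a short calculation: sort the lengths from longest to shortest and add them up until half of the total assembly length is reached. A sketch with a hypothetical function name and made-up contig lengths:

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least 50% of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Hypothetical contig lengths (total 300): 80 + 70 = 150 reaches 50%.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```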
41
Nuclear genome: how is it really, and how is this data represented ?
nuclear genome
* diploid
* homozygous & heterozygous positions
* haplotypes
* multiple linear chromosomes

currently available data
* collapsed sequence data, pseudohaploid
* information about heterozygous positions*
* fully resolved haplotypes*

*requires special software and/or long or linked reads!
42
what does scaffold length and quality depend on?
scaffold length & quality depend on repeat content
43
What are some analysis challenges for NGS data?
data management (storage)

algorithms
- always need new analysis pipelines
- need more expertise

multiple genome assemblies from the same species
* intraspecific diversity!
* inventory vs. working with pan-genomes

missing & fragmented & erroneous genome regions
44
What is a Hamiltonian path ?
passes through every VERTEX exactly once
45
What is an Eulerian path?
passes through every EDGE exactly once
46
EXAM QUESTION what is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons (2022)
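scaffold: contigs linked with Ns after inferring their order, orientation, and distance

why most assemblies consist of scaffolds:
* unresolvable repeats break the assembly into contigs, leaving gaps
* filling gaps and chromosome-level assembly is expensive and time-consuming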
47
EXAM QUESTION How to use Hamiltonian path in sequence assembly, two problems (2020)
DBG
* break reads into k-mers (e.g., 4-mers) → these become the edges
* break these into (k-1)-mers → these become the nodes
* use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
* target sequence: path that visits each edge once… at least in theory!
  = Eulerian path or cycle, not Hamiltonian! Nodes can be reused.
48
How do OLC and de Bruijn assembly methods deal with repeats? Why?
Both handle unresolvable repeats by essentially leaving them out. Unresolvable repeats break the assembly into fragments. Fragments are contigs (short for contiguous)