4. NGS: Assembly Flashcards

1
Q

online databases:

uniprot - sprot vs trembl?

A
  • sprot: manually curated
  • trembl: computer annotated
2
Q

online databases:
uniprot?

A

uniprot: functionally annotated protein sequences

3
Q

online databases:

Genbank

eg Arabidopsis thaliana?

A

sequence data obtained and submitted by scientists

2,700,417 Arabidopsis thaliana sequence entries
(full length and partial protein-coding genes, non-coding genes, nuclear fragments/chromosomes/scaffolds, organellar genes/genomic regions,…)

4
Q

online databases:

genome resource?

A

complete set of protein-coding genes for one genome (nt, aa)

several exist (species specific, NCBI, ENSEMBL, UniProt, JGI …)

5
Q

.fastq file?

4 rows?

A
  • Raw and pre-processed data
  • text-based format
  • storing both a biological sequence and its corresponding quality scores.
  • Each sequence letter and each quality score is encoded with a single ASCII character for brevity.

row 1 - @ sequence identifier - location on flow cell
row 2 - raw sequence letters
row 3 - a plus sign ‘+’ (optionally followed by more text)
row 4 - quality of the read (=same length as sequence row)
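
The four-row layout can be sketched as a small parser. This is only an illustration: the function name `read_fastq` and the example record are made up, and the quality row is assumed to use the Phred+33 offset (score = ASCII code minus 33), which is common but not universal.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, phred_scores) from a 4-rows-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality rows differ in length")
        # Assumption: Phred+33 encoding, one ASCII character per base.
        phred = [ord(c) - 33 for c in qual]
        yield header[1:], seq, phred


# Hypothetical single-record input, for illustration only.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(StringIO(example)))
print(records)
```

The length check mirrors the rule that row 4 has the same length as row 2.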

6
Q

Sequence data: coverage?

A

each genome position is sequenced multiple times

coverage: average number of times each base is covered by independent reads
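
As a rough worked example, average coverage is often estimated as (number of reads × read length) / genome size. The sketch below assumes that simple estimate; the function names and the numbers are illustrative, not taken from the course material.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is covered: N * L / G."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative numbers: 1 million reads of 150 bp on a 3 Mb genome.
print(mean_coverage(1_000_000, 150, 3_000_000))
print(reads_needed(50, 150, 3_000_000))
```

This is an average only; actual per-base coverage varies along the genome.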

7
Q

Sequence data - optimal coverage? Sanger? Illumina?

A

optimal coverage depends on objective, sample, sequencing technology, …
* Sanger: a coverage of 8x is sufficient
* Illumina: 50-100x coverage is typical

8
Q

Principle of assembly

overlaps?

A

overlaps of identical sequence regions are assumed to originate from the same genomic location

this assumption is often not met!
- same sequence, different origins
- different sequence, same origin

9
Q

What are the challenges of assembly?

A
  • untrimmed poor-quality reads
  • errors
    • base calling errors
    • deletion
    • insertion
  • low coverage, bad linkage
  • unknown orientation
  • contamination
  • high heterozygosity
  • polyploidy
  • repeats !!!
10
Q

What problem do repeats present in assembly?

A

if repeats < fragments: ok, can be assembled correctly

if:
repeats > fragments or
repeats ≫ fragments
reads can’t be matched correctly

and some repeats in eukaryotes are much larger than the size of reads! (eg transposable elements!)

11
Q

What types of repeats?

A
  • simple repeats
  • tandem or dispersed gene families
  • segmental duplications
  • interspersed repeats (transposable elements)
    • DNA transposons (corn Ac element)
    • viral retrotransposons (yeast Ty, fly Copia elements)
    • non-viral retrotransposons (SINEs, LINEs)
  • polyploids!!
12
Q

Assembly: general steps ?

A
  • identify read overlaps, assemble into contigs
  • determine order and orientation of contigs: scaffolds
  • finish into a chromosome-level assembly
13
Q

Assembly: different approaches for contigs?

A

greedy

OLC (overlap layout consensus)

de bruijn

14
Q

Assembly: greedy approach

what is generated?

what next?

what may this result in?

when to terminate?

A

generate a look-up table with the prefixes and suffixes of all sequence reads

then, given a starting read, extend it with another read; each extension is chosen by greatest overlap

–> however may result in misassembly

terminate when conflicting information is found
* e.g., two or more reads could extend but do not overlap each other
* result >=1 contigs
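
The greedy idea can be sketched as follows. This is a simplified illustration, not how any particular assembler works: it extends only to the right, recomputes overlaps instead of using a prefix/suffix look-up table, and treats a tie between the two best candidates as the "conflicting information" stop condition. The names `overlap` and `greedy_extend` and the reads are made up.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_extend(seed, reads, min_len=3):
    """Extend the seed to the right, always taking the read with the greatest overlap."""
    contig = seed
    unused = [r for r in reads if r != seed]
    while unused:
        scored = sorted(((overlap(contig, r, min_len), r) for r in unused), reverse=True)
        best_len, best_read = scored[0]
        if best_len == 0:
            break  # nothing overlaps any more
        if len(scored) > 1 and scored[1][0] == best_len:
            break  # simplified conflict rule: two equally good extensions
        contig += best_read[best_len:]
        unused.remove(best_read)
    return contig


# Toy reads, for illustration only.
reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
print(greedy_extend("ATGGCG", reads))
```

Because each step commits to the locally best overlap, a repeat longer than the overlap can send the extension down the wrong path, which is the misassembly risk noted above.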

15
Q

Assembly: greedy approach

usage?

results? problems?

Example?

A

still used to assemble organellar genomes from short-read shotgun data

  • smaller size than nuclear genome
  • single (circular) genome
  • several hundred copies per cell
  • generally good results (but problems with repeats!)

computer lab: NOVOPlasty

16
Q

OLC algorithm

what is it? what type of data structure used and what do elements represent?

A

OLC algorithm: overlap graph

overlap (string) graph
* nodes represent sequence reads
* edge weights: prefix and suffix overlap
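
A minimal sketch of such an overlap graph, assuming exact suffix/prefix matches and a made-up minimum overlap length. Real OLC assemblers tolerate mismatches and use faster indexing; the function names and reads here are illustrative.

```python
from itertools import permutations


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Nodes are reads; a directed edge (a, b) is weighted by the suffix/prefix overlap length."""
    graph = {}
    for a, b in permutations(reads, 2):  # every ordered pair: quadratic in the number of reads
        n = suffix_prefix_overlap(a, b, min_len)
        if n:
            graph[(a, b)] = n
    return graph


reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
graph = overlap_graph(reads)
for (a, b), weight in graph.items():
    print(a, "->", b, "overlap", weight)
```

The all-against-all loop makes the cost of pairwise comparison visible: it grows with the square of the number of reads.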

17
Q

OLC algo

goal and approach?

A

goal & approach
* find genome / contigs by traversing the graph

18
Q

OLC

what can you say about computing the overlap graph? how is it done?

how is it updated?

A

compute overlap graph (huge! messy!)
find overlaps with suffix trees and/or dynamic programming

layout
* remove redundancy in the graph where possible
* compute contigs from parts of the overlap graph

consensus
* pick for each contig position one nucleotide

19
Q

OLC approach is suitable/not suitable for …?

A

computationally not feasible for NGS short-reads
* shorter reads & higher coverage
➡ too many pairwise comparisons
➡ too few unique overlaps
* higher error rates than Sanger, different error patterns
* repeats!!

20
Q

OLC - problems?

A

multiple possible paths = multiple possible genomes
* contigs: set of longest contiguous segments that can be unambiguously identified - usually hundreds of thousands!

repeats!
* repeats that can’t be resolved are often left out

21
Q

What was OLC first developed for and what does that mean for us?

A
  • first developed and used for Sanger data
  • updated for new long-read technologies! (Pacific Biosciences, Oxford Nanopore)

not so good for short reads of NGS

22
Q

Best approach for NGS ?

A

many short reads –>
de Bruijn graph

23
Q

De Bruijn graph - method?

A
  • sequence reads
  • break reads into k-mers (e.g., 4-mers)
  • use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
  • target sequence: path that visits each edge once…at least in theory!
    = eulerian path (cycle)

(output: also contigs!)
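
The steps above can be sketched on a toy example. Assumptions: error-free reads, every distinct k-mer used exactly once as an edge between its (k-1)-mer prefix and suffix (multiplicities are ignored), and a graph that really has an Eulerian path. The function names are made up; the path is found with Hierholzer's algorithm.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer gives one edge prefix -> suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in a non-empty graph."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    nodes = set(graph) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next((n for n in sorted(nodes) if len(graph.get(n, [])) - in_deg[n] == 1),
                 min(nodes))
    edges = {n: list(v) for n, v in graph.items()}  # working copy, edges are consumed
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Turn a node path back into a sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]  # toy reads
contig = spell(eulerian_path(de_bruijn(reads, 4)))
print(contig)
```

With repeats or sequencing errors the graph branches, several Eulerian paths may exist, and real assemblers output contigs for the unambiguous stretches instead.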

24
Q

What could be the reasons for a bubble in a genome assembly graph?

A

biological:
- SNP
- heterozygosity

technological:
- sequencing errors

25
Q

DBG - What are some more complex topological features?

A
  • spur
  • bubble
  • frayed rope
26
Q

Shotgun sequencing definition

A

Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome.

The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually.

A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome.

27
Q

How to determine the best k-mer size?

A

break reads into k-mers, plot their abundances, repeat for multiple values for k
* choose k with optimal separation of signal from noise
* choose k large enough to reduce the number of (redundant) nodes
* choose k small enough to reduce the number of subgraphs (gaps)
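
A minimal sketch of the abundance profile that would be plotted for each k, assuming exact k-mer counting on in-memory reads. The function names and reads are illustrative; dedicated k-mer counters are used on real data sets.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    return Counter(read[i:i + k] for read in reads for i in range(len(read) - k + 1))


def abundance_histogram(reads, k):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts(reads, k).values())


reads = ["ACGTA", "ACGTT"]  # toy reads
for k in (3, 4):
    print(k, sorted(abundance_histogram(reads, k).items()))
```

In such a profile, k-mers seen only once or twice are typically errors (noise), while genuine genomic k-mers cluster around the sequencing coverage (signal).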

28
Q

Software using the de Bruijn approach
- data structures?
- errors?
- implementations?
- which to choose?

A
  • auxiliary data structures store information about reads, k-mers, indices, positions, paired reads, etc
  • handling of sequencing errors, repeats, etc
  • incorporation of quality information
  • different implementations (for genomes)
    • Velvet, ABySS, ALLPATHS, Meraculous, SOAPdenovo, …
    • different data structures!
      • hash table, Bloom filter, FM index, etc

no single optimal implementation & parameter settings ➜ try different software/parameters

29
Q

Def contig

A

set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region

30
Q

Def linkage

A
  • close location of genes or other DNA markers to each other on chromosomes.
  • The closer the genes are to each other on a chromosome, the more likely they are linked or inherited together from parents to offspring
31
Q

Def consensus in OLC

A
  • pick for each contig position one nucleotide
32
Q

Def haplotype

A
  • A haplotype is a physical grouping of genomic variants (or polymorphisms) that tend to be inherited together.
  • A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.
33
Q

Def meta-data

A
  • data about the data!
  • from the same gene, individual, population, species
  • more than one sequence
  • more than one data type
34
Q

What is scaffolding in genome assembly?

A
  • infer order, orientation, distance of contigs
  • link contigs with Ns into scaffolds
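
As an illustration of linking contigs with Ns, the sketch below assumes the order, orientation and estimated gap sizes have already been inferred and are passed in as a list. The layout format and function names are made up for this example.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string made of A, C, G, T, N."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(layout):
    """layout: ordered list of (contig, strand, gap_after); gaps are written as runs of N."""
    parts = []
    for i, (contig, strand, gap_after) in enumerate(layout):
        parts.append(contig if strand == "+" else reverse_complement(contig))
        if i < len(layout) - 1:  # no gap after the last contig
            parts.append("N" * gap_after)
    return "".join(parts)


# Hypothetical layout: second contig is on the reverse strand, estimated gap of 3 bp.
scaffold = build_scaffold([("ACGT", "+", 3), ("AAGC", "-", 0)])
print(scaffold)
```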
35
Q

How can scaffolding be achieved?

A
  • with mate pairs
  • with long reads
  • with long-range linkage information
36
Q

Scaffolding with long-range linkage information

example?
originally designed for?
what does it identify?
what not known?

A

example: Hi-C, chromosome conformation capture
* originally designed to study 3D genome structure
* identify genomic regions physically adjacent in 3D → closer in 1D
* can be used to scaffold a draft genome assembly
* exact distance & orientation of contigs not known

37
Q

Scaffolding with long reads

how? 2 ways
problem?

A
  • long reads can link ≥2 contigs
  • align contigs and long reads OR
  • use hybrid assemblers that use both short and long reads as input
  • problem: long reads generally have much higher error rates!
38
Q

What issues are there in finishing a genome assembly?

How can this be solved?

A

Finishing a genome assembly
* most short read assemblies remain as scaffolds
- no chromosomal assignment of scaffolds
- large gaps
* filling gaps and chromosomal-level assembly is expensive and time-consuming

Can be resolved with long reads: now more chromosome-level assemblies

39
Q

How are assemblies evaluated?

A
  • placement of paired reads (orientation, distance)
  • computation of quality metrics
    • numbers & lengths of contigs and scaffolds
    • N50 value
  • quantify expected gene content for a given lineage - BUSCO
40
Q

What is the N50 value

A

used to evaluate assemblies

N50: 50% of the entire assembly is contained in contigs/scaffolds of at least this length
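
A short worked sketch of that definition: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running sum first reaches half of the total. The function name and the lengths are illustrative.

```python
def n50(lengths):
    """Length L such that contigs of length >= L contain at least 50% of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contig lengths given")


# Toy assembly: total 290, half is 145, reached after 80 + 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

N50 measures contiguity only; a misassembled but long contig raises it, which is why it is combined with other checks such as paired-read placement and BUSCO.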

41
Q

Nuclear genome:

how is it really, and how is this data represented ?

A

nuclear genome
* diploid
* homozygous & heterozygous positions
* haplotypes
* multiple linear chromosomes

currently available data
* collapsed sequence data, pseudohaploid
* information about heterozygous positions*
* fully resolved haplotypes*

*requires special software and/or long or linked reads!

42
Q

what does scaffold length and quality depend on?

A

scaffold length & quality depend on repeat content

43
Q

What are some analysis challenges for NGS data?

A

data management (storage)
algorithms - always need new
analysis pipelines - need more expertise
multiple genome assemblies from the same species
* intraspecific diversity!
* inventory vs. working with pan-genomes
missing & fragmented & erroneous genome regions

44
Q

What is a Hamiltonian path ?

A

passes through every VERTEX exactly once

45
Q

What is an Eulerian path?

A

passes through every EDGE exactly once

46
Q

EXAM QUESTION

what is a scaffold? Why do most genome assemblies consist of scaffolds? Give 2 reasons (2022)

A
  • infer order, orientation, distance of contigs
  • link contigs with Ns into scaffolds
47
Q

EXAM QUESTION

How to use Hamiltonian path in sequence assembly, two problems (2020)

A

DBG

  • break reads into k-mers (e.g., 4-mers) –> these become the edges
  • break these into k-1-mers –> these become the nodes
  • use these to construct a DBG (drop k-mers that are too frequent or too infrequent)
  • target sequence: path that visits each edge once…at least in theory!
    = eulerian path or cycle - not hamiltonian! nodes can be reused.
48
Q

How do OLC and de Bruijn assembly methods deal with repeats?
Why?

A

Both handle unresolvable repeats by essentially leaving them out.

Unresolvable repeats break the assembly into fragments.

Fragments are contigs (short for contiguous)