4. NGS: Assembly Flashcards

Question

DBG - What are some more complex topological features?

Answer 1

- spur
- bubble
- frayed rope

Answer 2

Shotgun sequencing is a laboratory technique for determining the DNA sequence of an organism’s genome. The method involves randomly breaking up the genome into small DNA fragments that are sequenced individually. A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to reconstitute the genome.

Answer 3

break reads into k-mers, plot their abundances, repeat for multiple values of k
* choose k with optimal separation of signal from noise
* choose k large enough to reduce the number of (redundant) nodes
* choose k small enough to reduce the number of subgraphs (gaps)
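A minimal sketch of the k-mer abundance step, assuming error-free toy reads held in memory. The function names `kmer_counts` and `abundance_spectrum` are illustrative, not taken from any assembler. Real data would normally go through a dedicated k-mer counter, and the spectrum would be plotted rather than printed.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_spectrum(reads, k):
    """Map abundance -> number of distinct k-mers seen that many times."""
    return Counter(kmer_counts(reads, k).values())


# Toy reads (hypothetical data); repeat the count for several values of k.
reads = ["ACGTACGT", "CGTACGTA", "ACGTACGA"]
for k in (3, 4, 5):
    print(k, sorted(abundance_spectrum(reads, k).items()))
```

In a real spectrum, low-abundance k-mers are mostly sequencing errors (noise) and the main peak is genomic signal; comparing spectra across k is what the choice of k above refers to.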

Answer 4

* auxiliary data structures store information about reads, k-mers, indices, positions, paired reads, etc.
* handling of sequencing errors, repeats, etc.
* incorporation of quality information
* different implementations (for genomes)
  - Velvet, ABySS, AllPaths, Meraculous, SOAPdenovo, ...
  - different data structures!
  - hash table, Bloom filter, FM index, etc.
no single optimal implementation & parameter settings ➜ try different software/parameters

Answer 5

set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region

Answer 6

- close location of genes or other DNA markers to each other on chromosomes.
- The closer the genes are to each other on a chromosome, the more likely they are linked or inherited together from parents to offspring.

Answer 7

* pick one nucleotide for each contig position
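A minimal sketch of picking one nucleotide per position, assuming the reads are already aligned to the contig and padded with "-" where they do not cover a position. A simple majority vote is one possible rule; real assemblers may also weight by base quality. The function name `consensus` is illustrative.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column; '-' marks positions a read does not cover."""
    length = max(len(read) for read in aligned_reads)
    result = []
    for pos in range(length):
        column = [r[pos] for r in aligned_reads if pos < len(r) and r[pos] != "-"]
        if not column:
            result.append("N")  # no read covers this position
        else:
            result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# The second read carries a mismatch at position 2 (C instead of G).
print(consensus(["ACGT-", "ACCTA", "-CGTA"]))  # ACGTA
```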

Answer 8

- A haplotype is a physical grouping of genomic variants (or polymorphisms) that tend to be inherited together.
- A specific haplotype typically reflects a unique combination of variants that reside near each other on a chromosome.

Answer 9

* data about the data!
* from the same gene, individual, population, species
  - more than one sequence
  - more than one data type

Answer 10

* infer order, orientation, distance of contigs
* link contigs with Ns into scaffolds
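A minimal sketch of the linking step, assuming order, orientation and gap sizes have already been inferred. The `layout` format (contig name, strand, gap after) and the function names are hypothetical, chosen only for this illustration.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (uppercase A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def build_scaffold(contigs, layout):
    """Join contigs in the given order and orientation, padding gaps with Ns.

    layout: list of (contig_name, strand, gap_after) tuples; the gap of the
    last entry is ignored.
    """
    parts = []
    for i, (name, strand, gap_after) in enumerate(layout):
        seq = contigs[name]
        if strand == "-":
            seq = reverse_complement(seq)
        parts.append(seq)
        if i < len(layout) - 1:
            parts.append("N" * gap_after)
    return "".join(parts)


contigs = {"c1": "ACGT", "c2": "AAGG"}
print(build_scaffold(contigs, [("c1", "+", 3), ("c2", "-", 0)]))  # ACGTNNNCCTT
```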

Answer 11

- with mate pairs
- with long reads
- with long-range linkage information

Answer 12

example: Hi-C, chromosome conformation capture
* originally designed to study 3D genome structure
* identify genomic regions physically adjacent in 3D → closer in 1D
* can be used to scaffold a draft genome assembly
* exact distance & orientation of contigs not known

Answer 13

- long reads can link ≥2 contigs
- align contigs and long reads OR
- use hybrid assemblers that use both short and long reads as input
- problem: long reads generally have much higher error rates!

Answer 14

Finishing a genome assembly
* most short-read assemblies remain as scaffolds
  - no chromosomal assignment of scaffolds
  - large gaps
* filling gaps and chromosome-level assembly are expensive and time-consuming
Can be resolved with long reads: now more chromosome-level assemblies

Answer 15

* placement of paired reads (orientation, distance)
* computation of quality metrics
  - numbers & lengths of contigs and scaffolds
  - N50 value
* quantify expected gene content for a given lineage
  - BUSCO

Answer 16

used to evaluate assemblies
N50: 50% of the entire assembly is contained in contigs/scaffolds of at least this length
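A minimal sketch of the N50 computation as defined above: sort lengths from longest to shortest and return the length at which the running sum first reaches half of the total. The function name `n50` is illustrative.

```python
def n50(lengths):
    """Return the N50 of a list of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # at least 50% of the assembly is covered
            return length
    raise ValueError("no lengths given")


# Total is 290; 80 + 70 = 150 is the first running sum that reaches 145.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```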

Answer 17

nuclear genome
* diploid
* homozygous & heterozygous positions
* haplotypes
* multiple linear chromosomes
currently available data
* collapsed sequence data, pseudohaploid
* information about heterozygous positions*
* fully resolved haplotypes*
*requires special software and/or long or linked reads!

Answer 18

scaffold length & quality depends on repeat content

Answer 19

data management (storage)
algorithms
- always need new analysis pipelines
- need more expertise
multiple genome assemblies from the same species
* intraspecific diversity!
* inventory vs. working with pan-genomes
missing & fragmented & erroneous genome regions

Answer 20

passes through every VERTEX exactly once

Answer 21

passes through every EDGE exactly once
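A minimal sketch of the degree test for an edge-visiting (Eulerian) walk in a directed graph, assuming the edges are given as (source, target) pairs. It checks only the in/out-degree balance, which is necessary but not sufficient: the edges must also all lie in one connected piece, and that is not checked here. The function name is illustrative.

```python
from collections import Counter


def eulerian_degree_check(edges):
    """Classify a directed edge list by degree balance alone.

    Returns "cycle" if every node has in-degree == out-degree,
    "path" if exactly one node has one extra outgoing edge and one node
    has one extra incoming edge, and None otherwise.
    """
    balance = Counter()  # out-degree minus in-degree per node
    for src, dst in edges:
        balance[src] += 1
        balance[dst] -= 1
    unbalanced = sorted(v for v in balance.values() if v != 0)
    if not unbalanced:
        return "cycle"
    if unbalanced == [-1, 1]:
        return "path"
    return None


print(eulerian_degree_check([("A", "B"), ("B", "C"), ("C", "A")]))  # cycle
print(eulerian_degree_check([("A", "B"), ("B", "C")]))              # path
```

No comparably simple test is known for the vertex-visiting (Hamiltonian) case.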

Answer 22

* infer order, orientation, distance of contigs
* link contigs with Ns into scaffolds

Answer 23

DBG
* break reads into k-mers (e.g., 4-mers) --> these become the edges
* break these into (k-1)-mers --> these become the nodes
* use these to construct a DBG
  - (drop k-mers that are too (in)frequent)
* target sequence: path that visits each edge once…at least in theory! = Eulerian path or cycle
  - not Hamiltonian! nodes can be reused.
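A minimal sketch of the construction above on toy reads with 4-mers. It assumes error-free reads, keeps each distinct k-mer once (no abundance filtering, no repeat multiplicity), and assumes an Eulerian path exists. The walk uses Hierholzer's algorithm, which is one standard way to find such a path. All function names are illustrative.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a non-empty graph with an Eulerian path."""
    out_deg = {node: len(targets) for node, targets in graph.items()}
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one extra outgoing edge, if there is one.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def spell(path):
    """Spell the sequence: first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCGT", "GGCGTGCA"]  # hypothetical overlapping reads
graph = build_dbg(reads, 4)
print(spell(eulerian_path(graph)))  # ATGGCGTGCA
```

With repeats, several Eulerian paths can exist and the spelled sequence is no longer unique, which is the "at least in theory" caveat above.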

Answer 24

Both handle unresolvable repeats by essentially leaving them out. Unresolvable repeats break the assembly into fragments. These fragments are called contigs (short for contiguous).