Genome analysis - de novo assembly Flashcards
What is a de novo genome assembly?
When you sequence and assemble a new genome that does not have a reference genome.
Why are we interested in de novo assembly?
- It gives an inventory of the genetic information in an organism and tells us what the organism can do and how it has evolved.
- It can serve as a reference for functional genomics and genome-wide association studies in the future.
What is the shotgun approach? What is needed for this approach?
The shotgun approach is a way of sequencing and assembling a de novo genome.
- Randomly fragment the DNA and sequence the fragments
- Find overlaps between the reads
- Assemble overlaps into contigs
- Assemble contigs into scaffolds.
To be able to find the overlaps we need high sequence coverage, especially if we are using short read sequencing.
What is sequence coverage (read depth)?
How many times each base has been sequenced.
What are contigs and scaffolds?
During de novo assembly we sequence fragments of the genome and find overlaps between the reads. Those overlapping reads are assembled into continuous, gap-free sequences called contigs.
Then, in the scaffolding step, the contigs are connected by large-insert (paired-end/mate-pair) reads, which generally originate from large DNA fragments or fosmid inserts of several kilobases in length. The ordered set of connected contigs is defined as a ‘scaffold’, which provides information about the relative positions and orientations of the contigs within the genome.
How can we know the expected number of contigs?
Number of contigs = N·e^(-c·o), where
N = number of reads
c = coverage = NL/G
o = 1 - T/L (the usable fraction of a read for overlap detection)
L = read length
T = minimum detectable overlap
G = genome size
This can also be used to estimate the coverage that your sequencing needs.
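The formula above can be sketched in Python (the numbers below are hypothetical, just to illustrate the calculation):

```python
import math

def expected_contigs(N, L, G, T):
    """Lander-Waterman-style estimate of the expected number of contigs.
    N = number of reads, L = read length, G = genome size,
    T = minimum detectable overlap."""
    c = N * L / G       # coverage (read depth)
    o = 1 - T / L       # usable fraction of a read for overlap detection
    return N * math.exp(-c * o)

# Hypothetical run: 1 Mb genome, 100 bp reads, 5x coverage, 30 bp min overlap
print(round(expected_contigs(N=50_000, L=100, G=1_000_000, T=30)))  # → 1510
```

Note how quickly the expected number of contigs drops as coverage rises: the exponent scales with c.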
How do the minimum detectable overlap, read length and coverage affect the expected number of contigs?
When the minimum detectable overlap is higher, more individual contigs are generated. Each read will contribute to a smaller portion of the genome because it has fewer overlaps with other reads. As a result, more contigs will be produced.
Longer reads will give fewer individual contigs because it is easier to find overlaps without needing as high coverage.
Higher coverage helps in finding overlaps, especially needed for short reads.
What are overlaps?
The more similar the end of one read is to the beginning of another, the more likely they are to have originated from overlapping stretches of the genome.
What are the general problems in de novo assemblies?
- Bias due to technology and/or sequence composition
- Sequencing errors
- Heterozygosity in the data within the genome
These problems have to be solved with experimental design or assembly algorithms.
Why do we need high coverage for the shotgun approach?
To find the overlaps between the reads. Longer reads also make it easier to find the overlaps, but longer reads generally have lower per-base sequencing quality.
What are the key computational challenges with using long vs short reads for assembly?
The key challenge with shotgun assembling long reads is that overcoming their higher error rate is computationally demanding.
Short reads make it harder to find the overlaps and limit the ability to resolve repeats. We also need higher coverage - we need to sequence more - to find the overlaps.
What is the benefit of using long vs short reads for assembling genomes?
Having longer reads will generally reduce the number of contigs because it is easier to find the overlaps. It gives higher continuity of the assembly, which we want, for example, when comparing our assembly to other genomes to find larger structural variations.
Short reads have high accuracy, high throughput and high resolution. Good to use if you are interested in the sequences of genes when the demands on continuity are not as high.
What is minimum detectable overlap?
How long the overlap needs to be to be detected. If this value is higher, more individual contigs are generated because fewer of the reads will overlap and fewer reads will be connected into continuous sequences.
What are greedy assembly algorithms?
The first attempt at solving the assembly problem.
These algorithms aim to assemble the genome by locally selecting the most promising overlaps based on simple parameters such as sequence similarity and overlap length and then merging the two fragments. This is repeated until no more merges can be done.
It chooses the most parsimonious explanation for the data.
What algorithms would you use for short reads vs long reads assembly?
If you have long reads, use overlap assemblers.
If you have short reads, use de Bruijn assemblers.
If you have both long and short reads, assemble with the read type that makes up the majority of your data and then correct the assembly with the other.
What is the long read assembly pipeline (overlaps graphs)?
Reads →
overlap (build overlap graphs) →
Layout (bundle stretches of overlaps into contigs, contig graph and determine the path through the graph) →
Consensus (pick most likely nucleotide sequence for each contig →
Contigs.
Explain how overlap graphs are constructed
An overlap graph is constructed such that the nodes are sequencing reads. We put an edge between two nodes if the end of one read overlaps with the beginning of the other read.
The overlap graph is a representation of the relationships between the reads, but we cannot tell the exact sequence just by looking at it because it usually ends up being big and messy.
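As a minimal sketch (naive all-against-all comparison, not how real assemblers scale), an overlap graph could be built like this:

```python
def suffix_prefix_overlap(a, b, min_olap):
    """Length of the longest suffix of read a matching a prefix of
    read b, or 0 if it is shorter than the minimum detectable overlap."""
    for olap in range(min(len(a), len(b)), min_olap - 1, -1):
        if a[-olap:] == b[:olap]:
            return olap
    return 0

def overlap_graph(reads, min_olap=3):
    """Nodes are reads; an edge (i, j) means read i's end overlaps
    read j's beginning, weighted by the overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olap = suffix_prefix_overlap(a, b, min_olap)
                if olap:
                    edges[(i, j)] = olap
    return edges

reads = ["ATCGTA", "CGTACC", "TACCGG"]
print(overlap_graph(reads))  # → {(0, 1): 4, (1, 2): 4}
```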
Explain the layout step of the overlap assembly pipeline
In the layout step we remove the edges of the overlap graph that are redundant; edges that can be inferred from other edges are removed as well. We then emit contigs from the non-branching stretches and determine the path through the graph.
What is a path in the context of overlap graphs?
A sequence of nodes such that from each node there is an edge to the next node in the sequence.
Solving the assembly is the problem of identifying a path through the graph.
There are Hamiltonian paths (visit each node exactly once) and Eulerian paths (visit each edge exactly once). The Eulerian path problem is the easier one to solve.
In an overlap graph we look for a Hamiltonian path; in a de Bruijn graph we look for an Eulerian path.
What is the consensus step of the overlap assembly pipeline?
In the consensus step we line up all the reads that make up a contig using multiple sequence alignment and choose the consensus nucleotide at each position for the assembly.
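A toy version of the consensus step (assuming the reads have already been gap-padded by the multiple sequence alignment, which is the hard part in practice):

```python
from collections import Counter

def consensus(aligned_reads):
    """Column-wise majority vote over aligned, gap-padded reads
    ('-' marks positions a read does not cover)."""
    length = max(len(r) for r in aligned_reads)
    result = []
    for i in range(length):
        counts = Counter(r[i] for r in aligned_reads
                         if i < len(r) and r[i] != '-')
        result.append(counts.most_common(1)[0][0])
    return "".join(result)

# Three reads tiling one contig; the middle read has one error (G vs T)
print(consensus(["ATCGTAC--", "-TCGGACTA", "--CGTACT-"]))  # → ATCGTACTA
```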
How are overlaps found in the overlap assembly pipeline?
The overlaps are found by comparing all reads against each other to find regions of overlap. A seed-and-extend algorithm is used: choose a k-mer size, look for exact matches of that length, and then extend to both sides.
The overlaps found can be true, or they can be false, for example when the shared k-mer comes from the end of a repetitive sequence.
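A sketch of seed-and-extend (index the k-mers of one read, then greedily extend each exact seed match in both directions):

```python
def seed_and_extend(a, b, k=4):
    """Find exact k-mer seed matches between reads a and b, then extend
    each seed left and right while the bases keep matching.
    Returns (start in a, start in b, match length) tuples."""
    index = {}
    for i in range(len(a) - k + 1):          # index every k-mer of read a
        index.setdefault(a[i:i + k], []).append(i)
    hits = set()
    for j in range(len(b) - k + 1):          # scan read b's k-mers
        for i in index.get(b[j:j + k], []):
            lo = 0                           # extend to the left
            while i - lo > 0 and j - lo > 0 and a[i - lo - 1] == b[j - lo - 1]:
                lo += 1
            hi = k                           # extend to the right
            while i + hi < len(a) and j + hi < len(b) and a[i + hi] == b[j + hi]:
                hi += 1
            hits.add((i - lo, j - lo, lo + hi))
    return sorted(hits)

print(seed_and_extend("ATCGTACG", "CGTACGTT", k=4))  # → [(2, 0, 6)]
```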
Give examples of overlap assemblers
Celera and later Canu.
What is the short read assembly pipeline?
Error correction ((remove errors in our sequences) shrinks the assembly graph, reducing time and memory requirements) →
Graph construction (de Bruijn graph) →
Graph Cleaning →
Contig assembly →
Scaffolding →
Gap Filling.
Explain the error correction step of the short read pipeline.
The error rate of short reads is usually lower than for long reads but the beginning of the read usually has higher accuracy than the end.
In this step we count how many times each k-mer occurs across all reads; k-mers that contain errors should be very rare.
We do this because it shrinks the assembly graph, reduces time and reduces sensitivity to errors during the assembly.
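The k-mer counting can be sketched like this (toy reads; real error correctors use probabilistic or disk-based counters to handle billions of k-mers):

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads. k-mers containing a
    sequencing error should occur far less often than true k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

reads = ["ATCGA", "ATCGA", "ATCGA", "ATGGA"]  # last read has an error (G vs C)
counts = kmer_counts(reads, k=3)
rare = {km for km, n in counts.items() if n == 1}
print(rare)  # the singleton k-mers all come from the erroneous read
```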
Explain how de Bruijn graphs are constructed.
De Bruijn assemblers use hash tables to find overlaps similarly to overlap assemblers, but they do not find the full overlaps because they do not extend the overlapping k-mers.
k-mers are nodes and adjacent k-mers are linked together by edges. Every node is a unique k-mer and every edge represents an overlap of length k-1.
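A minimal construction following that definition (unique k-mers as nodes, k-1 overlaps as edges):

```python
def de_bruijn(reads, k):
    """Nodes are the unique k-mers; an edge links k-mer a to k-mer b
    when a's last k-1 bases equal b's first k-1 bases."""
    nodes, edges = set(), set()
    for read in reads:
        for i in range(len(read) - k + 1):
            nodes.add(read[i:i + k])
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    return nodes, edges

# Two overlapping reads share k-mers, so the graph stays small
nodes, edges = de_bruijn(["ATCGT", "TCGTA"], k=3)
print(len(nodes), len(edges))  # → 4 3
```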
Why are de Bruijn graphs less messy than overlap graphs?
Because we do error correction before the assembly and each node is a unique k-mer so there is less redundancy.
What are tips in de bruijn graphs?
In de Bruijn graphs, “tips” are structures that represent the ends of branches or dead ends in the graph.
These are sequences of nodes that are not fully connected to the rest of the graph and have no outgoing edges, meaning they do not extend further.
Tips often indicate areas of the genome where sequencing coverage is low, where sequencing errors are present, or where the true sequence simply ends. During graph clean-up we remove tips as well as bubbles.
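One round of tip clipping might look like this (a simplified sketch; real assemblers also weigh tip length and coverage before removing anything):

```python
def remove_tips(adj):
    """Clip dead ends from a graph given as {node: set of successors}.
    A tip here is a node with no outgoing edges whose predecessors all
    branch, so removing it cannot break the main path."""
    preds = {}
    for u, vs in adj.items():
        for v in vs:
            preds.setdefault(v, set()).add(u)
    tips = [v for v in preds
            if not adj.get(v)                              # dead end
            and all(len(adj[u]) > 1 for u in preds[v])]    # predecessors branch
    for v in tips:
        for u in preds[v]:
            adj[u].discard(v)
        adj.pop(v, None)
    return adj

# Main path A->B->C->D plus an erroneous dead-end branch B->X
print(remove_tips({"A": {"B"}, "B": {"C", "X"}, "C": {"D"}}))
```

Note that D, the genuine end of the path, survives because its predecessor C does not branch.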
What are bubbles in de Bruijn graphs?
“Bubbles” are structures that represent alternative paths between two points in the graph. Bubbles occur when there are multiple possible routes through the graph that eventually reconverge at a common point.
They can arise due to genomic variations, such as single nucleotide polymorphisms (SNPs), small insertions or deletions (indels), or sequencing errors.
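Simple two-path bubbles (e.g. a SNP between haplotypes) can be detected by looking for branches that reconverge one node later, as in this sketch:

```python
def find_bubbles(adj):
    """Report (start, path1, path2, end) for every pair of successors of
    a node that reconverge at a common node one step later."""
    bubbles = []
    for u, vs in adj.items():
        vs = sorted(vs)
        for i in range(len(vs)):
            for j in range(i + 1, len(vs)):
                for w in adj.get(vs[i], set()) & adj.get(vs[j], set()):
                    bubbles.append((u, vs[i], vs[j], w))
    return bubbles

# A branches into B and C, which reconverge at D: one bubble
print(find_bubbles({"A": {"B", "C"}, "B": {"D"}, "C": {"D"}}))  # → [('A', 'B', 'C', 'D')]
```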
Why would you use different assembly algorithms for long and short reads?
Overlap assemblers work well for long reads but are problematic for short reads: many more reads are needed to reach the same sequence coverage as with long reads, so computing all pairwise overlaps takes too long.
Another problem is that the overlaps are shorter, which makes it hard to tell the true ones from the false ones and pushes the required coverage (read depth) even higher. For de Bruijn assemblers, the k-mer length is a trade-off: longer k-mers usually give less complex graphs, but if the k-mers are too long and there are many sequencing errors, we will have a hard time finding overlaps because no similar k-mers are shared between reads.
What are the basic statistics for determining if your assembly is of good quality?
- number of contigs
- number of scaffolds
- largest contig
- total length of the assembly
- N50 - the contig length such that contigs of equal or greater length together contain 50% of the bases in the assembly. Sort the contigs from longest to shortest; N50 is the length of the contig at which the cumulative sum first reaches >= 50% of the total assembly length.
- L50 - the smallest number of contigs whose summed length makes up 50% of the assembly (or genome) size.
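Both statistics can be computed in a few lines:

```python
def n50_l50(contig_lengths):
    """N50: length of the contig at which the cumulative length of the
    sorted (longest-first) contigs first reaches 50% of the assembly.
    L50: how many contigs that takes."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    total = 0
    for i, length in enumerate(lengths, start=1):
        total += length
        if total >= half:
            return length, i

print(n50_l50([80, 70, 50, 40, 30, 20, 10]))  # total 300 → N50=70, L50=2
```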