Genome assembly Flashcards

Question 1

Q

Why do we sequence?

Answer

A

We are still sequencing new genomes
The sample can be a new individual
To study DNA-protein interactions
Metagenomics
Sequence new genome (no previous version)
Sequence new individuals - how do they differ from the reference?
Sequence population - look at variation across population
Sequence tumour cells and compare to ‘normal’ tissue – where are cancer mutations - time course?
Sequence transcripts: survey gene-space, also relative quantification by tissue / time / condition
Sequence as read-out to identify DNA-protein interaction (e.g. chromatin precipitation)
Metagenomic mixed-organism co-habiting population sequencing: genome fragments, transcripts or rRNAs to identify identity, relative abundance

Question 2

Q

What are the next-generation sequencing technologies?

Answer

A

-Illumina
-Oxford Nanopore
-PacBio

Question 3

Q

How do you get high quality in Illumina?

Answer

A

The reads are short, but the volume of reads you can get through is very large

Question 4

Q

What is the length of the reads in PacBio?

Answer

A

Shorter than Nanopore reads but longer than Illumina reads

Question 5

Q

How do you deal with high error rates in PacBio?

Answer

A

The raw error rate is very high. To solve that, the same sequence is read multiple times; because the errors are random, the repeated reads can be aligned to give a high-accuracy consensus

Question 6

Q

What are quality scores particularly important for?

Answer

A

If you are trying to find SNPs, you need the quality score to tell whether a difference is a sequencing error or an actual variant

Question 7

Q

What do we need quality scores for?

Answer

A

Quality scores are assigned to estimate confidence of a given base call
They are expressed as Phred scores
Aim for a quality score of 30 or higher
The quality scores are used for filtering and trimming of reads
Also used for assembly
Base quality scores are essential for variant calling to distinguish a true variant from a sequencing error

Question 8

Q

Where does the quality deteriorate?

Answer

A

Quality deteriorates towards the ends of reads

Question 9

Q

What effect does high AT or GC content have?

Answer

A

High AT or GC content reduces complexity and can lead to higher error rates

Question 10

Q

What is the formula for QV?

Answer

A

The quality value (QV) is related to the base call error probability by the formula
QV = -10 x log10(Pe), where Pe is the probability that the base call is an error
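A short sketch of the formula in Python may help when memorising the common values. The function names are illustrative and not taken from any sequencing toolkit.

```python
import math


def phred_from_error(p_error: float) -> float:
    """QV = -10 * log10(Pe), where Pe is the base call error probability."""
    return -10 * math.log10(p_error)


def error_from_phred(qv: float) -> float:
    """Invert the formula: Pe = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)


# Each step of 10 in QV means a ten-fold drop in error probability.
for qv in (10, 20, 30, 40):
    print(f"Q{qv}: error probability {error_from_phred(qv):g}")
```

So Q30, the usual target, corresponds to an error probability of 1 in 1000.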

Question 11

Q

What is base calling?

Answer

A

In Illumina sequencing:
Base calling algorithms turn raw intensities into A, T, C, G or N base calls

Question 12

Q

What is the Chastity Filter?

Answer

A

The usual method for base calling in Illumina systems is known as the Chastity Filter
The chastity filter calls a base if the highest intensity divided by the sum of the highest and second-highest intensities is no less than a threshold (usually 0.6). Otherwise the base is marked as N
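The ratio described in this card can be sketched as follows. This is only an illustration of the arithmetic: the dictionary of per-base intensities and the function names are hypothetical, and real Illumina software works on raw image data with additional steps.

```python
def chastity(intensities):
    """Brightest intensity divided by (brightest + second brightest)."""
    top, second = sorted(intensities.values(), reverse=True)[:2]
    if top + second == 0:
        return 0.0  # no signal at all
    return top / (top + second)


def call_base(intensities, threshold=0.6):
    """Return the brightest base, or 'N' when chastity is below the threshold."""
    if chastity(intensities) < threshold:
        return "N"
    return max(intensities, key=intensities.get)


clear_signal = {"A": 900, "C": 100, "G": 50, "T": 20}   # 900 / 1000 = 0.9
mixed_signal = {"A": 500, "C": 450, "G": 10, "T": 5}    # 500 / 950 is below 0.6
print(call_base(clear_signal), call_base(mixed_signal))
```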

Question 13

Q

What is FASTQ format?

Answer

A

The standard format for next-generation sequencing output
Most programs now rely on this format

Question 14

Q

What is used for quality scores in FASTQ?

Answer

A

ASCII characters are used for the quality scores, so there is a one-to-one association between each base character and its quality character
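As a minimal sketch, a quality string can be decoded by subtracting an offset from each character's ASCII code. The offset of 33 (Phred+33) is an assumption here; it is the common modern encoding, but some older files use 64.

```python
def decode_qualities(quality_string, offset=33):
    """Convert each ASCII quality character to its Phred score."""
    return [ord(char) - offset for char in quality_string]


# '!' is ASCII 33 (Q0), '5' is 53 (Q20), '?' is 63 (Q30), 'I' is 73 (Q40)
print(decode_qualities("!5?I"))  # → [0, 20, 30, 40]
```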

Question 15

Q

Describe the standard FASTQ output

Answer

A

4 lines per sequence
Line 1 begins with the @ character, a sequence ID and an optional description
Line 2 is the sequence
Line 3 begins with the + character, optionally followed by the same sequence ID and description
Line 4 encodes the quality values for the sequence letters in line 2 and must contain the same number of characters
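The four-line layout can be read with a few lines of Python. This is a sketch that assumes every record is exactly four lines (no wrapped sequences); the function name and the example record are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + layout")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], sequence, quality


example = io.StringIO("@read1 example description\nGATTACA\n+\nIIIII5!\n")
records = list(read_fastq(example))
print(records)
```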

Question 16

Q

What is depth of coverage useful for?

Answer

A

Sequencing errors are corrected by the depth of coverage: each base is covered by several overlapping sequence fragments, so a random error in one read is outvoted by the other reads

Question 17

Q

What was the depth of coverage in the Human Genome Project?

Answer

A

For the Human Genome Project, most of the genome was sequenced at 12X or greater coverage.
Each base was present in 12 reads on average.
Even with 12X coverage, approximately 1% of the genome was not accurately assembled
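A common back-of-envelope estimate of average coverage (not stated on the card itself) is the number of reads multiplied by read length, divided by genome size. The numbers in the example are invented purely to land on 12X.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_length / genome_size


# Hypothetical run: 1.2 million reads of 100 bp over a 10 Mb genome.
print(mean_coverage(1_200_000, 100, 10_000_000))  # → 12.0
```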

Question 18

Q

Describe paired end sequencing

Answer

A

You sequence from both ends, so you get two reads per fragment
The reads are shorter than the fragment
This gives you information on how far apart the two reads are in the genome

Question 19

Q

What do we do with repeats in paired end sequencing?

Answer

A

It is quite tricky to assemble a genome when you have repeats because you cannot tell which copy of the repeat a sequence came from
To solve that, you anchor the reads using other sequences that overlap with the sequence
If one read is unmappable because it falls in a very repetitive region, but the other is unique, you can again use that distance information to map both reads
One read can be mapped and the second can then be positioned within the repeat
With enough paired end reads the entire repeat can be mapped
With large repeats (LINE etc.) paired ends will not be able to map the entire repeat

Question 20

Q

Describe mate pair sequencing

Answer

A

Mate pairs are similar to paired ends but the insert size is much greater
Paired-end inserts are a few hundred bp but mate-pair inserts are several kb long
DNA is fragmented into 2-5 kb fragments and the ends are repaired with biotin-labelled dNTPs
The fragments are then circularised and fragmented again
Biotin-labelled fragments are captured, adapters are added and the fragments are sequenced from both ends; as with paired-end reads, the distance between the reads is known

Question 21

Q

What do you need for scaffolding?

Answer

A

-contig
-scaffold

Question 22

Q

What are contigs?

Answer

A

Contiguous sequence where base order is known
Assembled from sequence reads

Question 23

Q

What are scaffolds?

Answer

A

Genome sequence reconstructed from contigs and gaps
A gap is bridged where the two reads from the sequenced ends of at least one fragment (paired ends or mate pairs, depending on gap length) overlap with other reads in two different contigs
The approximate length of the fragments is known, so the number of bases between the contigs can be estimated

Question 24

Q

What is de novo sequencing?

Answer

A

The genome is sequenced and assembled for the first time so there is no reference.
When the human genome was first sequenced it was de novo.
De novo is the more difficult and challenging of the two methods.
De novo projects may use multiple technologies to sequence full genome

Question 25

Q

What is reference sequencing?

Answer

A

The genome has already been sequenced so a reference is available
For subsequent re-sequencing the reference can be used as a scaffold for the assembly

Question 26

Q

How many overlaps do you get per n reads?

Answer

A

For n reads there are 2n^2 - 2n possible overlaps
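Plugging numbers into the formula shows how quickly the count grows. This is just the arithmetic from the card; the function name is illustrative.

```python
def possible_overlaps(n):
    """Number of possible overlaps for n reads: 2n^2 - 2n."""
    return 2 * n**2 - 2 * n


print(possible_overlaps(3))          # → 12
print(possible_overlaps(1_000_000))  # → 1999998000000
```

With a million reads there are already about two trillion candidate overlaps, which is why later cards look at ways of shrinking the search space.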

Question 27

Q

Describe Greedy Approach - Phrap

Answer

A

The simplest assembly method
Finds two sequences with largest overlap and merges them
Repeats until no further assembly possible.
The choices made by the assembler are local and do not take into account the global relationship between reads
Limited to simple assemblies due to read lengths and local assembly method
Cannot easily use global information such as paired end reads/mate pairs, which help resolve repetitive genomes
Phrap uses the crossmatch program, which is a full implementation of the Smith-Waterman algorithm
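The greedy idea can be sketched in a few lines. This is a toy illustration, not how Phrap is implemented: it uses exact suffix/prefix matching instead of Smith-Waterman alignment, ignores reverse complements and quality scores, and all names are made up.

```python
def overlap_length(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_length, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap_length(a, b, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:
            return reads  # no further assembly possible
        merged = reads[best_i] + reads[best_j][best_length:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)


contigs = greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"])
print(contigs)  # → ['ATGGCGTGCAATC']
```

Each merge is chosen only on the current best overlap, which is the local decision-making the card warns about: a repeat can cause the wrong pair to be merged first.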

Question 28

Q

Describe Overlap Graph (OLC-Overlap Layout Consensus)

Answer

A

Find the best match between the suffix of one read and the prefix of another
Mismatches allowed in overlaps for sequencing errors
Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
Determine path through reads to create layout
Create local multiple alignments from the overlapping reads
Consensus derived from alignments
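The overlap step, with mismatches tolerated, might look like the sketch below. It only covers the first stage (overlap); layout and consensus are not shown. The function name, the mismatch limit and the minimum overlap length are assumptions chosen for the example.

```python
def best_overlap(a, b, min_length=4, max_mismatches=1):
    """Longest suffix of a matching a prefix of b with few mismatches.

    Returns (overlap_length, mismatches), or (0, 0) if none qualifies.
    """
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return length, mismatches
    return 0, 0


print(best_overlap("ACGTTGCA", "TGCAGGTA"))  # exact overlap TGCA → (4, 0)
print(best_overlap("ACGTTGCA", "TGGAGGTA"))  # one mismatch allowed → (4, 1)
```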

Question 29

Q

What is a k-mer?

Answer

A

K-mer - all the possible substrings of length k that are contained in a string
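A one-line Python sketch makes the definition concrete (the function name is illustrative):

```python
def kmers(sequence, k):
    """All substrings of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


print(kmers("ATGCG", 3))  # → ['ATG', 'TGC', 'GCG']
```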

Question 30

Q

How do you identify overlap?

Answer

A

Sort all k-mers in the reads (typically 16-24 bases) and index them
K-mer - all the possible substrings of length k that are contained in a string
Identify pairs of reads that share a k-mer
Extend to full alignment and discard if not >95% similar
This technique drastically reduces the search space and has been widely used
Even with this improvement the computational requirement to identify all possible overlaps from next-gen short reads is a significant limitation
OLC is suitable for Sanger sequencing reads (1 kb) and long PacBio reads (up to a few tens of kilobases)
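The indexing idea can be sketched as follows: build a table from each k-mer to the reads containing it, then only compare reads that share an entry. The example uses k = 4 so the toy reads stay short, rather than the 16-24 bases mentioned above, and the full-alignment and similarity check are left out. Names are illustrative.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs


reads = ["ACGTACGT", "TACGTTTT", "GGGGCCCC"]
print(candidate_pairs(reads, 4))  # → {(0, 1)}
```

Only the pair (0, 1) would go forward to full alignment; the third read is never compared with the others.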

Question 31

Q

Describe simple assembly

Answer

A

With Sanger sequencing, reads are represented as nodes in a graph and edges represent alignments between reads
By following a Hamiltonian cycle the genome can be constructed by concatenating each read in turn
Note this forms a circular genome
A Hamiltonian cycle visits all nodes (reads) once only and returns to the start position
However, this does not scale to the millions of reads from next-gen genome sequencing

Question 32

Q

How do k-mers improve assembly?

Answer

A

For any genome we can use the same approach to reconstruct it
For assembly ideally need all k-mers present in the genome to be assembled
Each k-mer should appear at most once in the genome
Genome can then theoretically be assembled by following graph through the k-mers
The larger the genome the larger the required k-mer
This is the basis of de Bruijn graph assembly

Question 33

Q

de Bruijn Graph

Answer

A

Split reads into all possible k-mers – removes redundancy in reads
Follow Hamiltonian cycle in which each successive node (k-mer) is shifted by one nucleotide
Use of k-mers means that even though an individual k-mer may overlap with more than one other there is only one overlap that provides a path through the graph that passes through each k-mer only once
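A toy version of this idea: split reads into their distinct k-mers, then search for an ordering in which each k-mer is the previous one shifted by one nucleotide. The search below is a brute-force sketch over all orderings and finds a path rather than a closed cycle; it is only workable for a handful of k-mers, which also illustrates why this formulation scales badly. Names and reads are made up.

```python
from itertools import permutations


def unique_kmers(reads, k):
    """Distinct k-mers from all reads, in order of first appearance."""
    seen = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            seen.setdefault(read[i:i + k], None)
    return list(seen)


def hamiltonian_path(kmer_nodes):
    """Try every ordering; accept one where successive k-mers overlap by k-1."""
    for order in permutations(kmer_nodes):
        if all(a[1:] == b[:-1] for a, b in zip(order, order[1:])):
            return list(order)
    return None


reads = ["ATGGC", "GGCGT", "GCGTA"]
nodes = unique_kmers(reads, 3)
path = hamiltonian_path(nodes)
sequence = path[0] + "".join(kmer[-1] for kmer in path[1:])
print(path, sequence)
```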

Question 34

Q

What is a Hamiltonian graph?

Answer

A

The Hamiltonian graph approach is used by numerous assemblers: SOAPdenovo, SGA and ABySS among others
Finding a path that visits every node exactly once is a nondeterministic polynomial time (NP)-complete problem, which becomes intractable as the number of nodes increases
As the size of the genome increases, the computation time required to solve the graph problem grows extremely rapidly
To compensate for this, assembly programs adjust and simplify the graph, for example by reducing branching nodes
An alternative approach used by other assemblers (Velvet, EULER, SPAdes etc.) is to use an Eulerian path.
This scales better to larger genomes

Question 35

Q

Eulerian Graph:

Answer

A

All k-mer prefixes and suffixes represented as nodes
Each prefix and suffix can only occur once in the graph. (Note they will be much longer than 2 nucleotides in a full genome assembly graph)
Edges represent k-mers having particular prefixes and suffixes
k-mer edge ATG has prefix AT and suffix TG
Perform Eulerian cycle through graph - visits every edge of the graph exactly once
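The construction can be sketched with 3-mers: each k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix, and the walk below visits every edge once using Hierholzer's algorithm. That algorithm is my choice for the illustration, not something named on the card, and the sketch finds an Eulerian path (open ends) rather than a closed cycle. The k-mers are chosen so that no prefix or suffix repeats.

```python
from collections import defaultdict


def de_bruijn_edges(kmer_list):
    """Map each (k-1)-mer prefix node to the suffix nodes it points at."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Visit every edge exactly once (Hierholzer's algorithm)."""
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    start = next(iter(graph))
    for node in sorted(graph):
        if len(graph[node]) - in_degree.get(node, 0) == 1:
            start = node  # node with one more edge out than in
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Concatenate the first node with the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


kmer_list = ["GCG", "ATG", "GTA", "TGG", "CGT", "GGC"]  # deliberately unordered
assembled = spell(eulerian_path(de_bruijn_edges(kmer_list)))
print(assembled)  # → ATGGCGTA
```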

Question 36

Q

Assembly requirements

Answer

A

Hamiltonian and Eulerian approaches have the same requirements in order to assemble a complete genome:
- Requirements: a path through the graph, visiting each edge once, is possible if:
  - The reads contain all k-mers in the genome (unlikely to occur). This ensures the graph is balanced: in a directed graph the number of edges in is the same as the number out
  - All k-mers are error free (next gen sequences contain errors)
  - Each k-mer occurs at most once in the genome (problem with repeats but paired end reads help to overcome this)
- Assembly programs adapt the method to compensate for these issues e.g. removing branches
- Low coverage areas will lead to multiple contigs
- Final stage of assembly is scaffolding, using paired end reads to join contigs

Question 37

Q

What is the significance of k-mers size in genome assembly?

Answer

A

Assembly requires presence of all (or nearly all) k-mers in genome
Illumina reads are approx. 100-200+ bp, so whole reads would act as k-mers of 100+
Reads will not contain all possible 100-mers etc. present in the genome, however deep the coverage
Assemblers therefore break each read into overlapping k-mers, e.g. 46 overlapping 55-mers for a 100 bp read
This ensures that nearly all 55-mers in the genome are detected
The k-mer size can be set when running the assembly, so different options can be tried, as the optimum depends on the genome sequence
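The count of overlapping k-mers in one read is read length minus k plus one, which is where the figure of 46 comes from. A minimal check (function name illustrative):

```python
def kmers_per_read(read_length, k):
    """Number of overlapping k-mers in a read: L - k + 1 (0 if k > L)."""
    return max(read_length - k + 1, 0)


print(kmers_per_read(100, 55))   # → 46
print(kmers_per_read(100, 100))  # → 1
```

A smaller k gives more k-mers per read and so a better chance of seeing every k-mer in the genome, at the cost of more k-mers that occur more than once.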

Question 38

Q

What are the stages of ABySS?

Answer

A

Unitig
- The initial assembly of sequences using a de Bruijn graph approach
Contig
- Paired-end reads aligned to the unitigs and the pair information is used to orient and merge overlapping unitigs
Scaffold
- Align mate-pair reads to the contigs to orient and join them into scaffolds
- “N” characters are inserted at any gaps in coverage and for unresolved repeats

Question 39

Q

Describe the unitig stage in assembly

Answer

A

The most resource demanding stage of the de Bruijn assembly, including memory requirement
All k-mers from the sequence reads are stored in a hash table. Additional information for each k-mer is also stored:
- Number of k-mer occurrences in the reads
- Presence or absence of possible neighbour k-mers in the de Bruijn graph

Question 40

Q

What is a Bloom filter?

Answer

A

A Bloom filter is a compact data structure for representing a set of elements that supports two operations:
- (1) inserting an element into the set (here the elements are the k-mers)
- (2) querying for the presence of an element in the set
Used by ABySS and reduces the memory requirement
The Bloom filter structure consists of a bit vector and one or more hash functions
The hash functions map each k-mer to a corresponding set of positions within the bit vector - the bit signature
A k-mer is added to the Bloom filter by setting all positions of its bit signature to one
Queried by testing if all positions of its bit signature are one
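A minimal Bloom filter along these lines is sketched below. It is not ABySS's implementation: the seeded SHA-256 hashing, the sizes, and the one-byte-per-bit storage are assumptions made for readability.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        """The bit signature: one position per hash function."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = 1

    def __contains__(self, item):
        return all(self.bits[position] for position in self._positions(item))


bloom = BloomFilter()
for kmer in ("ATG", "TGG", "GGC"):
    bloom.add(kmer)
print("ATG" in bloom)  # → True
```

A query for an inserted k-mer always returns True. A k-mer that was never inserted usually returns False, but can occasionally return True if other k-mers happened to set all of its positions; that false-positive rate is the price of the memory saving.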

Question 41

Q

Describe the filtering process of k-mers

Answer

A

To filter out the majority of k-mers caused by sequencing errors all k-mers with an occurrence count below a user-specified threshold are discarded
The optimum minimum count is typically 2-4
Retained k-mers are called solid k-mers
In the second pass through the reads, those that consist entirely of solid k-mers (solid reads) are extended left and right within the de Bruijn graph to create unitigs
During the read extension phase of assembly it’s possible for multiple solid reads to result in the same unitig
Avoided by using an additional tracking Bloom filter to record k-mers included in previous unitigs
A solid read is only extended if it has at least one k-mer that is not already in the tracking Bloom filter
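The counting and thresholding step can be sketched with an exact counter. This is a simplification: it uses Python's `Counter` rather than a memory-efficient structure, and the extension into unitigs and the tracking Bloom filter are not shown. Names and reads are illustrative.

```python
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Keep k-mers seen at least min_count times across all reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}


def is_solid_read(read, k, solid):
    """A solid read consists entirely of solid k-mers."""
    return all(read[i:i + k] in solid for i in range(len(read) - k + 1))


reads = ["ATGGC", "ATGGC", "ATGGA"]  # last base of the third read is an "error"
solid = solid_kmers(reads, k=3, min_count=2)
print(sorted(solid))                    # → ['ATG', 'GGC', 'TGG']
print(is_solid_read("ATGGA", 3, solid))  # → False
```

The k-mer GGA appears only once, so it is discarded and the read containing it is not treated as solid.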

Question 42

Q

What does the string graph give us?

Answer

A

Longer reads have enabled return to overlap graph approach
String graph uses same methodology as overlap graph but simplified
First, contained reads - reads that are substrings of some other read - are removed
The resulting graph, called a string graph, shares many properties with the de Bruijn graph without the need to break the reads into k-mers
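The removal of contained reads can be sketched as a simple substring test. This toy version compares every pair of reads directly and ignores reverse complements and sequencing errors; the function name is made up.

```python
def remove_contained(reads):
    """Drop duplicates and any read that is a substring of another read."""
    unique = list(dict.fromkeys(reads))
    return [
        read for read in unique
        if not any(read != other and read in other for other in unique)
    ]


reads = ["ATGGCGT", "GGCG", "GCGTGCA", "ATGGCGT"]
print(remove_contained(reads))  # → ['ATGGCGT', 'GCGTGCA']
```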

Question 43

Q

What is the FM index?

Answer

A

Theoretical work on efficiently constructing the string graph using the FM index led to memory-efficient assemblers for large genomes.
The FM index is based on the Burrows-Wheeler transform and the suffix array