Genome assembly Flashcards

1
Q

Why do we sequence?

A
  • We are still sequencing new genomes
  • The subject can be a new individual
  • To study DNA-protein interactions
  • Metagenomics
  • Sequence a new genome (no previous version)
  • Sequence new individuals: how do they differ from the reference?
  • Sequence a population: look at variation across the population
  • Sequence tumour cells and compare to ‘normal’ tissue: where are the cancer mutations? (possibly as a time course)
  • Sequence transcripts: survey gene-space, and quantify relative expression by tissue / time / condition
  • Sequence as a read-out to identify DNA-protein interactions (e.g. chromatin immunoprecipitation)
  • Metagenomics: sequence a mixed population of co-habiting organisms (genome fragments, transcripts or rRNAs) to determine identity and relative abundance
2
Q

What are the next gen sequencing technologies?

A

-Illumina
-Oxford Nanopore
-PacBio

3
Q

How do you get high quality in Illumina?

A

The reads are short, but the volume of reads you can get through is very large, so each base is covered by many reads

4
Q

What is the length of the reads in PacBio?

A

Shorter than Nanopore but longer than Illumina

5
Q

How do you deal with high error rates in PacBio?

A

Raw reads have a very high error rate. To solve that, the same molecule is sequenced multiple times; because the errors are random, you can align the passes and take the consensus, which gives high accuracy
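
The consensus idea can be illustrated with a minimal sketch. It assumes the passes are already aligned and of equal length (real consensus callers must handle insertions and deletions), and the example reads are made up:

```python
from collections import Counter

def consensus(passes):
    """Majority-vote consensus over equal-length, already aligned passes."""
    if len({len(p) for p in passes}) != 1:
        raise ValueError("passes must be aligned to the same length")
    # Take the most common base in each alignment column.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))

# Three hypothetical passes over the same molecule, each with one random error.
passes = ["ACGTTGCA", "ACGATGCA", "ACCTTGCA"]
print(consensus(passes))
```

Because the errors fall at different positions in each pass, every column still has a correct majority.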

6
Q

What are quality scores particularly important for?

A

If you are trying to find SNPs, you need the quality score to judge whether a difference is a sequencing error or a real variant

7
Q

What do we need quality scores for?

A
  • Quality scores are assigned to estimate confidence of a given base call
  • They are expressed as Phred scores
  • Aim for a quality score of 30 or higher (an error probability of 1 in 1,000 or lower)
  • The quality scores are used for filtering and trimming of reads
  • They are also used for assembly
  • Base quality scores are essential for variant calling to distinguish a true variant from a sequencing error
8
Q

Where does the quality deteriorate?

A

Quality deteriorates towards the ends of reads

9
Q

What effect does high AT or GC content have?

A

High AT or GC content reduces sequence complexity and can lead to higher error rates

10
Q

What is the formula for QV?

A
  • The quality value (QV) is related to the base call error probability by the formula
  • QV = -10 × log10(Pe), where Pe is the probability that the base call is an error
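
As a quick worked example of the formula, the sketch below converts in both directions. The function names are illustrative only:

```python
import math

def phred_to_error_prob(qv):
    """Invert QV = -10 * log10(Pe) to get the error probability."""
    return 10 ** (-qv / 10)

def error_prob_to_phred(pe):
    """Apply QV = -10 * log10(Pe)."""
    return -10 * math.log10(pe)

for qv in (10, 20, 30, 40):
    print(qv, phred_to_error_prob(qv))
```

Each step of 10 in QV is a tenfold drop in error probability, so QV 30 corresponds to one expected error per 1,000 base calls.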
11
Q

What is base calling?

A
  • In Illumina sequencing, base calling algorithms turn raw intensities into A, T, C, G or N base calls
12
Q

What is the Chastity Filter?

A
  • The usual method for base calling in Illumina systems is known as the Chastity Filter
  • The chastity filter calls a base if the highest intensity divided by the sum of the highest and second highest intensities is at least a threshold (usually 0.6). Otherwise the base is marked as N
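
The rule on this card can be sketched as follows. This is only an illustration of the ratio test as stated above, not Illumina's actual software; the intensity values and function names are made up:

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities."""
    top, second = sorted(intensities.values(), reverse=True)[:2]
    total = top + second
    return top / total if total else 0.0

def call_base(intensities, threshold=0.6):
    """Return the brightest base if it passes the chastity threshold, else N."""
    if chastity(intensities) >= threshold:
        return max(intensities, key=intensities.get)
    return "N"

clear_signal = {"A": 900, "C": 100, "G": 50, "T": 20}   # ratio 0.9
mixed_signal = {"A": 500, "C": 450, "G": 30, "T": 20}   # ratio about 0.53
print(call_base(clear_signal), call_base(mixed_signal))
```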
13
Q

What is FASTQ format?

A
  • The standard output format for next gen sequencing
  • Nearly all downstream programs now rely on that format
14
Q

What is used for quality scores in FASTQ?

A

Quality scores are encoded as ASCII characters, one per base, so each character of the sequence has a corresponding quality character
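
A minimal decoding sketch is shown below. It assumes the common Phred+33 encoding (ASCII code minus 33), which this card does not state; some older files use an offset of 64:

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=33 is an assumption (Phred+33 encoding).
    """
    return [ord(ch) - offset for ch in quality_string]

print(decode_qualities("II5+!"))  # [40, 40, 20, 10, 0]
```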

15
Q

Describe the standard FASTQ output

A
  • 4 lines per sequence
  • Line 1 begins with the @ character, a sequence ID and an optional description
  • Line 2 is the sequence
  • Line 3 begins with the + character, optionally followed by the same sequence ID and description
  • Line 4 encodes the quality values for the sequence letters in line 2 and must contain the same number of characters
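
The four-line layout can be read with a short parser. This is a sketch for strictly four-line records only (it does not handle wrapped sequence lines), and the example records are invented:

```python
import io

def read_fastq(handle):
    """Yield (id, description, sequence, quality) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        seq_id, _, description = header[1:].partition(" ")
        yield seq_id, description, sequence, quality

example = io.StringIO("@read1 demo\nACGT\n+\nII5+\n@read2\nGGCA\n+read2\n!!II\n")
records = list(read_fastq(example))
for record in records:
    print(record)
```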
16
Q

What is depth of coverage useful for?

A

Sequencing errors are eliminated through depth of coverage: each base is covered by several overlapping sequence fragments, so isolated errors are outvoted

17
Q

What was the depth of coverage in the Human Genome Project?

A
  • For the Human Genome Project, most of the genome was sequenced at 12X or greater coverage.
  • Each base was present in 12 reads on average.
  • Even with 12X coverage, approximately 1% of the genome was not accurately assembled
18
Q

Describe paired end sequencing

A
  • You sequence from both ends, so you get two reads per fragment
  • The reads are shorter than the fragment being sequenced
  • This tells you how far apart the two reads are in the genome
19
Q

What do we do with repeats in paired end sequencing?

A
  • It is quite tricky to assemble a genome when you have repeats because you can’t tell which copy of the repeat a read came from
  • To solve that, you anchor the reads using other sequences that overlap with them
  • If one read is unmappable because it falls in a very repetitive region, but the other is unique, you can use the known distance between the pair to map both reads
  • One read can be mapped and the second can then be positioned within the repeat
  • With enough paired end reads the entire repeat can be mapped
  • With large repeats (LINEs etc.) paired ends won’t be able to map the entire repeat
20
Q

Describe mate pair sequencing

A
  • Mate pairs are similar to paired ends but the insert length is much greater
  • Paired-end inserts are a few hundred bp, but mate-pair inserts are kilobases long
  • DNA is fragmented into 2-5 kb fragments and the ends are repaired with biotin-labelled dNTPs
  • The fragments are then circularised and fragmented again
  • Biotin-labelled fragments are captured, adapters are added and the fragments are sequenced from both ends; as with paired-end reads, the distance between the reads is known
21
Q

What do you need for scaffolding?

A

-contig
-scaffold

22
Q

What are contigs?

A
  • Contiguous sequence where base order is known
  • Assembled from sequence reads
23
Q

What are scaffolds?

A
  • Genome sequence reconstructed from contigs and gaps
  • A gap is bridged when the two end reads of at least one fragment (paired ends or mate pairs, depending on gap length) overlap with reads in two different contigs
  • The approximate length of the fragments is known, so the number of bases between contigs can be estimated
24
Q

What is de novo sequencing?

A
  • The genome is sequenced and assembled for the first time so there is no reference.
  • When the human genome was first sequenced it was de novo.
  • De novo is the more difficult and challenging of the two methods.
  • De novo projects may use multiple technologies to sequence the full genome
25
Q

What is reference sequencing?

A
  • The genome has already been sequenced so a reference is available
  • For subsequent re-sequencing the reference can be used as a scaffold for the assembly
26
Q

How many overlaps do you get per n reads?

A

For n reads there are 2n² - 2n possible overlaps

27
Q

Describe Greedy Approach - Phrap

A
  • The simplest assembly method
  • Finds two sequences with largest overlap and merges them
  • Repeats until no further assembly possible.
  • The choices made by the assembler are local and do not take into account the global relationship between reads
  • Limited to simple assemblies due to read lengths and local assembly method
  • Cannot easily use global information such as paired end reads/mate pairs, which help resolve repetitive genomes
  • Phrap uses the cross_match program, which is a full implementation of the Smith-Waterman algorithm
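
The greedy strategy can be sketched in a few lines. This is not Phrap: it assumes error-free reads and uses exact suffix/prefix matching in place of Smith-Waterman alignment, and the reads and the `min_len` parameter are made up for illustration:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    length = overlap(a, b, min_len)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no further assembly possible
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTGCA", "TGCAATC"]))
```

Each merge is chosen only from the current best overlap, which is the local decision-making the card describes.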
28
Q

Describe Overlap Graph (OLC-Overlap Layout Consensus)

A
  • Find the best match between the suffix of one read and the prefix of another
  • Mismatches allowed in overlaps for sequencing errors
  • Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
  • Determine path through reads to create layout
  • Create local multiple alignments from the overlapping reads
  • Consensus derived from alignments
29
Q

What is a k-mer?

A

k-mers are all the possible substrings of length k that are contained in a string
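
A short sketch of the definition, using a made-up sequence:

```python
def kmers(sequence, k):
    """Return every substring of length k, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("ATGGCGT", 3))  # ['ATG', 'TGG', 'GGC', 'GCG', 'CGT']
```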

30
Q

How do you identify overlap?

A
  • Sort all k-mers in the reads (typically 16-24 bases long) and index them
  • k-mers are all the possible substrings of length k that are contained in a string
  • Identify pairs of reads that share a k-mer
  • Extend to full alignment and discard if not >95% similar
  • This technique drastically reduces the search space and has been widely used
  • Even with this improvement the computational requirement to identify all possible overlaps from next-gen short reads is a significant limitation
  • OLC is suitable for Sanger sequencing reads (1 kb) and long PacBio reads (up to a few tens of kilobases)
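
The k-mer indexing step can be sketched as below. It uses a hash-based index rather than sorting, covers only the candidate-pair stage (the full alignment and 95% similarity check are left out), and the reads are invented:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    index = defaultdict(set)  # k-mer -> ids of the reads containing it
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return pairs

reads = ["ATGGCGTA", "GCGTACCA", "TTTTTTTT"]
print(candidate_pairs(reads, 4))  # {(0, 1)}
```

Only the pairs returned here would go on to full alignment, which is how the search space is reduced.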
31
Q

Describe simple assembly

A
  • With Sanger sequencing, reads are represented as nodes in a graph and edges represent alignments between reads
  • Following a Hamiltonian cycle reconstructs the genome by concatenating the reads in order
  • Note that this forms a circular genome
  • A Hamiltonian cycle visits every node (read) exactly once and returns to the start position
  • However, this does not scale to the millions of reads from next gen genome sequencing
32
Q

How do k-mers improve assembly?

A
  • For any genome we can use the same approach to reconstruct it
  • Ideally, assembly needs all k-mers present in the genome
  • Each k-mer should appear at most once in the genome
  • The genome can then theoretically be assembled by following a path through the k-mers in the graph
  • The larger the genome, the larger the k-mer required
  • This is the basis of de Bruijn graph assembly
33
Q

de Bruijn Graph

A
  • Split reads into all possible k-mers, which removes redundancy in the reads
  • Follow a Hamiltonian cycle in which each successive node (k-mer) is shifted by one nucleotide
  • Using k-mers means that, even though an individual k-mer may overlap with more than one other, only one choice of overlaps gives a path through the graph that passes through each k-mer exactly once
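
A toy version of this idea is sketched below. It assumes error-free reads, a linear rather than circular sequence (so it searches for a Hamiltonian path instead of a cycle), and a simple backtracking search that is only practical for tiny inputs. The reads are made up:

```python
def hamiltonian_path(nodes):
    """Find an ordering that uses every k-mer once, with each k-mer
    overlapping the next by k-1 bases (i.e. shifted by one nucleotide)."""
    def extend(path, remaining):
        if not remaining:
            return path
        for node in sorted(remaining):
            if path[-1][1:] == node[:-1]:
                result = extend(path + [node], remaining - {node})
                if result:
                    return result
        return None  # dead end, backtrack

    for start in sorted(nodes):
        result = extend([start], set(nodes) - {start})
        if result:
            return result
    return None

def spell(path):
    """Concatenate the path: first k-mer plus the last base of each later one."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGGC", "GGCGT", "CGTGA"]
k = 3
nodes = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
print(spell(hamiltonian_path(nodes)))
```

In this example ATG overlaps both TGG and TGA, but only the route through TGG visits every k-mer, which is the point made in the last bullet.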
34
Q

What is a Hamiltonian graph?

A
  • The Hamiltonian graph approach is used by numerous assemblers: SOAPdenovo, SGA and ABySS among others
  • Finding a path that visits every node exactly once is a nondeterministic polynomial time (NP)-complete problem
  • As the size of the genome (and so the number of nodes) increases, the computation time required to solve the graph problem grows extremely quickly
  • To compensate for this, assembly programs adjust and simplify the graph, for example by reducing branching nodes
  • An alternative approach used by other assemblers (Velvet, EULER, SPAdes etc.) is to use an Eulerian path
  • This scales better to larger genomes
35
Q

Eulerian Graph:

A
  • All k-mer prefixes and suffixes are represented as nodes
  • Each prefix and suffix can only occur once in the graph (note that they will be much longer than 2 nucleotides in a full genome assembly graph)
  • Edges represent k-mers having particular prefixes and suffixes
  • e.g. the k-mer edge ATG has prefix AT and suffix TG
  • Perform an Eulerian cycle through the graph, which visits every edge of the graph exactly once
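
A small sketch of this construction follows. It uses Hierholzer's algorithm (a standard way to find an Eulerian walk, not named on this card), assumes error-free k-mers from a linear sequence, and so finds an Eulerian path rather than a cycle. The k-mers are made up:

```python
from collections import defaultdict

def eulerian_path(kmer_list):
    """Nodes are (k-1)-mer prefixes/suffixes; each k-mer is one edge."""
    graph = defaultdict(list)
    out_degree = defaultdict(int)
    in_degree = defaultdict(int)
    for kmer in kmer_list:
        prefix, suffix = kmer[:-1], kmer[1:]
        graph[prefix].append(suffix)
        out_degree[prefix] += 1
        in_degree[suffix] += 1

    # Start at the node with one more outgoing than incoming edge, if any.
    nodes = sorted(set(out_degree) | set(in_degree))
    start = kmer_list[0][:-1]
    for node in nodes:
        if out_degree[node] - in_degree[node] == 1:
            start = node
            break

    # Hierholzer's algorithm: walk until stuck, then back up and splice.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    return path

def spell(path):
    return path[0] + "".join(node[-1] for node in path[1:])

kmer_edges = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGA"]
path = eulerian_path(kmer_edges)
print(path, spell(path))
```

Every k-mer is used exactly once as an edge, and the node TG is visited twice, which is allowed in an Eulerian walk but not in a Hamiltonian one.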
36
Q

Assembly requirements

A
  • Hamiltonian and Eulerian approaches have the same requirements in order to assemble a complete genome:
    • A path through the graph, visiting each edge once, is possible if:
      • The reads contain all k-mers in the genome (unlikely to occur). This ensures the graph is balanced: in a directed graph, the number of edges into each node is the same as the number out
      • All k-mers are error free (next gen sequences contain errors)
      • Each k-mer occurs at most once in the genome (a problem with repeats, but paired end reads help to overcome this)
    • Assembly programs adapt the method to compensate for these issues, e.g. by removing branches
    • Low coverage areas will lead to multiple contigs
    • The final stage of assembly is scaffolding, using paired end reads to join contigs
37
Q

What is the significance of k-mer size in genome assembly?

A
  • Assembly requires the presence of all (or nearly all) k-mers in the genome
  • Illumina reads are approx 100-200+ bp, so using whole reads would mean k-mers of 100+
  • The reads will not contain all possible 100-mers etc. present in the genome, however deep the coverage
  • Assemblers therefore break each read into overlapping k-mers, e.g. 46 overlapping 55-mers for a 100 bp read
  • This ensures that nearly all 55-mers in the genome are detected
  • The k-mer size can be set when running the assembly, so different values can be tried; the optimum depends on the genome sequence
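
The figure of 46 follows from the fact that a read of length L contains L - k + 1 overlapping k-mers. A quick check, with an illustrative function name:

```python
def kmers_per_read(read_length, k):
    """Number of overlapping k-mers in a read: L - k + 1 (0 if k > L)."""
    return max(read_length - k + 1, 0)

print(kmers_per_read(100, 55))  # 46
```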
38
Q

What are the stages of ABySS?

A
  • Unitig
    • The initial assembly of sequences using a de Bruijn graph approach
  • Contig
    • Paired-end reads aligned to the unitigs and the pair information is used to orient and merge overlapping unitigs
  • Scaffold
    • Align mate-pair reads to the contigs to orient and join them into scaffolds
    • “N” characters are inserted at any gaps in coverage and for unresolved repeats
39
Q

Describe the unitig stage of assembly

A
  • The most resource-demanding stage of de Bruijn assembly, particularly in memory
  • All k-mers from the sequence reads are stored in a hash table. Additional information for each k-mer is also stored:
    • Number of k-mer occurrences in the reads
    • Presence or absence of possible neighbour k-mers in the de Bruijn graph
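
The kind of table described here can be sketched with a plain dictionary. This is an illustration only, not ABySS's actual data layout; it ignores reverse complements, and the field names and reads are made up:

```python
from collections import Counter

def build_kmer_table(reads, k):
    """Map each k-mer to its count and to which neighbour k-mers exist."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    table = {}
    for kmer, count in counts.items():
        table[kmer] = {
            "count": count,
            # Bases that extend this k-mer to the right into another seen k-mer.
            "successors": {b for b in "ACGT" if kmer[1:] + b in counts},
            # Bases that extend it to the left into another seen k-mer.
            "predecessors": {b for b in "ACGT" if b + kmer[:-1] in counts},
        }
    return table

table = build_kmer_table(["ATGGC", "TGGCA"], 3)
print(table["TGG"])
```

Storing an entry for every k-mer in the reads is what makes this stage memory-hungry.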
40
Q

What is a bloom filter?

A
  • A Bloom filter is a compact data structure for representing a set of elements that supports two operations:
    • (1) inserting an element into the set (here the elements are k-mers)
    • (2) querying for the presence of an element in the set
  • Used by ABySS, where it reduces the memory requirement
  • The Bloom filter structure consists of a bit vector and one or more hash functions
  • The hash functions map each k-mer to a corresponding set of positions within the bit vector, called its bit signature
  • A k-mer is added to the Bloom filter by setting every position of its bit signature to one
  • A k-mer is queried by testing whether all positions of its bit signature are one
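
A minimal Bloom filter sketch is given below. The size, the number of hash functions and the use of seeded SHA-256 as the hash family are arbitrary choices for illustration, not what ABySS uses:

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        # One byte per position for readability; a real filter packs bits.
        self.bits = bytearray(size)

    def _signature(self, item):
        """Positions in the bit vector for this item (its bit signature)."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for position in self._signature(item):
            self.bits[position] = 1

    def __contains__(self, item):
        return all(self.bits[position] for position in self._signature(item))

bloom = BloomFilter()
for kmer in ("ATG", "TGG", "GGC"):
    bloom.add(kmer)
print("ATG" in bloom)
```

A query can return a false positive when other k-mers happen to have set the same positions, but never a false negative; that trade-off is what buys the memory saving.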
41
Q

Describe the filtering process of k-mers

A
  • To filter out the majority of k-mers caused by sequencing errors, all k-mers with an occurrence count below a user-specified threshold are discarded
  • The optimum threshold is typically 2-4
  • Retained k-mers are called solid k-mers
  • In the second pass through the reads, those that consist entirely of solid k-mers (solid reads) are extended left and right within the de Bruijn graph to create unitigs
  • During the read extension phase of assembly it’s possible for multiple solid reads to result in the same unitig
  • This is avoided by using an additional tracking Bloom filter to record k-mers included in previous unitigs
  • A solid read is only extended if it has at least one k-mer that is not already in the tracking Bloom filter
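
The counting and filtering steps can be sketched as follows. Exact counts in a `Counter` and a plain set stand in for the Bloom filters, the extension step is omitted, and the reads are invented:

```python
from collections import Counter

def solid_kmers(reads, k, min_count=2):
    """Keep k-mers whose occurrence count reaches the threshold."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, count in counts.items() if count >= min_count}

def solid_reads(reads, k, solid):
    """Keep reads made up entirely of solid k-mers."""
    return [
        read for read in reads
        if len(read) >= k
        and all(read[i:i + k] in solid for i in range(len(read) - k + 1))
    ]

reads = ["ATGGC", "ATGGC", "ATGGC", "ATGTC"]  # last read carries an error
solid = solid_kmers(reads, 3)
print(sorted(solid), solid_reads(reads, 3, solid))
```

The erroneous read introduces k-mers that occur only once, so they fall below the threshold and the read is not used for extension.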
42
Q

What does the string graph give us?

A
  • Longer reads have enabled a return to the overlap graph approach
  • The string graph uses the same methodology as the overlap graph, but simplified
  • First, contained reads (reads that are substrings of some other read) are removed
  • The resulting graph, called a string graph, shares many properties with the de Bruijn graph without the need to break the reads into k-mers
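
The first simplification step, removing contained reads, can be sketched like this. It uses a quadratic substring check and ignores reverse complements, so it is for illustration only; the reads are made up:

```python
def remove_contained(reads):
    """Drop reads that are substrings of another (equal or longer) read."""
    # Deduplicate while keeping input order, then consider longest reads first.
    unique = sorted(dict.fromkeys(reads), key=len, reverse=True)
    kept = []
    for read in unique:
        if not any(read in longer for longer in kept):
            kept.append(read)
    return kept

print(remove_contained(["ATGGCGT", "GGCG", "GCGTGCA", "GTGC"]))
```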
43
Q

What is the FM index?

A
  • Theoretical work on efficiently constructing the string graph using the FM index led to memory-efficient assemblers for large genomes.
  • The FM index is based on the Burrows-Wheeler transform and the suffix array
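
To show how the two ingredients relate, the sketch below builds a suffix array naively and reads the Burrows-Wheeler transform off it. It is a toy construction (sorting whole suffixes is far too slow for genomes), it stops short of the full FM index, and the `$` terminator and example string are conventions chosen here:

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def bwt(text, terminator="$"):
    """Burrows-Wheeler transform: the character before each sorted suffix."""
    text = text + terminator
    return "".join(text[i - 1] for i in suffix_array(text))

print(bwt("GATTACA"))  # ACTGA$TA
```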