De Novo Assembly Flashcards
What are mate-pairs (MP)? What types are there?
- sequencing the two ends of genomic fragments
- usually only reads mapping to unique positions are considered
- long sequences
- two fragments that are distal to each other in the genome and in the opposite orientation to that of a mate-paired fragment
- Paired-out
- mates map on 2 different contigs -> merge and assess gap
- Paired-in
- mates map on the same contig -> validate assembly, identify structural variations
- Single
- one read maps and its mate does not -> the mapped read is close to a gap
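The three mate-pair types above can be sketched as a small classifier. This is only an illustration: the function name `classify_mate_pair` and the convention that `None` means "unmapped" are assumptions, not part of any real aligner's API.

```python
from typing import Optional


def classify_mate_pair(contig_a: Optional[str], contig_b: Optional[str]) -> str:
    """Classify a mate pair by the contigs its two reads map to.

    Each argument is the name of the contig a read maps to,
    or None if that read does not map (hypothetical convention).
    """
    if contig_a is None and contig_b is None:
        return "unmapped"    # neither read maps: not informative
    if contig_a is None or contig_b is None:
        return "single"      # only one mate maps -> likely close to a gap
    if contig_a == contig_b:
        return "paired-in"   # same contig -> validate assembly, look for SVs
    return "paired-out"      # different contigs -> candidates for merging


examples = [("contig1", "contig1"), ("contig1", "contig2"), ("contig3", None)]
for a, b in examples:
    print(a, b, classify_mate_pair(a, b))
```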
What is a contig?
- series of overlapping DNA sequences
- merged, they form a contiguous fragment of DNA
What are structural variations (SV)?
- genomic differences involving large segments of DNA
- deletions
- insertions
- inversions
- can be normally occurring in the population or pathological
What is scaffolding?
- linking together a non-contiguous series of genomic sequences into a scaffold
- bridge gaps between contigs
What can paired-end and mate-pair sequencing be used for?
- scaffolding (and gap filling)
- validation of de novo genomic sequencing
- structural variations detection
- two things needed: a genome and the mate paired reads
What is physical coverage?
- average number of times a base is read or spanned by mate paired reads
- mate pairs obtained from long physical inserts would be most effective for scaffolding
What is sequence coverage?
- average number of times a base is read
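The difference between the two coverage definitions can be shown with a short calculation. This is a rough sketch: the function names and the example library (1,000,000 pairs, 100-base reads, 3,000-base inserts, 10 Mbp genome) are made-up values for illustration.

```python
def sequence_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times a base is actually read."""
    return n_reads * read_length / genome_size


def physical_coverage(n_pairs: int, insert_size: int, genome_size: int) -> float:
    """Average number of times a base is read or spanned by a pair of mates."""
    return n_pairs * insert_size / genome_size


# Hypothetical mate-pair library
n_pairs = 1_000_000
read_length = 100
insert_size = 3_000
genome_size = 10_000_000

print(sequence_coverage(2 * n_pairs, read_length, genome_size))  # 20.0
print(physical_coverage(n_pairs, insert_size, genome_size))      # 300.0
```

With the same number of reads, a longer insert raises physical coverage without changing sequence coverage, which is why long inserts are effective for scaffolding.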
What does paired-end mean?
- short sequence
- set of fragments read on both ends
Detection of structural variations with mate pair libraries
- insert length statistics for the identification of structural variations
- use of “broken” reads to identify points of insertion/deletion: a read should break where the junction occurs
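One way to use insert length statistics is to flag pairs whose mapped distance is far from the library's typical insert size. The sketch below is an assumption-heavy illustration, not a real SV caller: it uses the median and the median absolute deviation (MAD) as robust statistics, and the threshold of 3 scaled deviations and the labels are arbitrary choices. Mates mapping farther apart than expected hint at a deletion in the sample relative to the reference; mates mapping closer hint at an insertion.

```python
import statistics


def flag_discordant_pairs(insert_sizes, n_dev=3.0):
    """Return (index, size, label) for pairs with an unusual mapped distance.

    insert_sizes: observed distances between mapped mates (bases).
    n_dev: how many scaled MADs away from the median counts as unusual
           (arbitrary choice for this sketch).
    """
    median = statistics.median(insert_sizes)
    deviations = [abs(size - median) for size in insert_sizes]
    mad = statistics.median(deviations)
    scale = 1.4826 * mad if mad > 0 else 1.0  # MAD scaled to resemble a std dev
    flagged = []
    for index, size in enumerate(insert_sizes):
        if size > median + n_dev * scale:
            flagged.append((index, size, "deletion candidate"))
        elif size < median - n_dev * scale:
            flagged.append((index, size, "insertion candidate"))
    return flagged


# Made-up mapped distances for a library with ~3 kbp inserts
sizes = [3000, 3050, 2950, 3020, 2980, 5200, 3010, 1500]
print(flag_discordant_pairs(sizes))
```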
Sanger sequencing
- 1977, Frederick Sanger
- fluorescent dyes, first “automatic” DNA sequencers appeared
- quality deteriorates with length of fragment
- based on the possibility of creating many subfragments of the region to be sequenced, each in many copies
- all subfragments start at the same position
- termination occurs at a random base and falls into 4 classes (one per base), using dideoxy terminators
Variable loci, what are those? What is the difference between polymorphism and mutation?
- a locus could be found as a C in 60% of the individuals and as a T in 40%
- about 1 out of 500 bases
- polymorphism -> at least in 1% of population
- mutation -> less than 1%
What are the problems and solutions of whole genome sequencing?
- genomes are millions/billions of bases long
- Sanger -> reads 500-1000 bases long
- possible solutions:
- generate contiguous fragments and sequence them (unfeasible)
- shotgun approach (sequence random fragments)
Hierarchical shotgun assembly
- BACs (Bacterial Artificial Chromosomes) can host 150 kbp of DNA
- transferred into E. coli, they replicate
- BAC clones are sequenced independently, chosen either randomly or to obtain minimum overlap
- assembling is difficult: finding a Hamiltonian path is NP-hard
- Gaps:
- some regions could not be covered
- repeats make association of contigs difficult
Poisson distribution
- f(v) = (e^(-r) * r^v) / v!
- f(v): expected frequency with which a base is found v times
- r: redundancy, average coverage
- 1 - e^(-r): fraction of the genome covered at least one time
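The Poisson formula above can be evaluated directly. A minimal sketch, with function names chosen only for illustration:

```python
import math


def poisson_frequency(r: float, v: int) -> float:
    """f(v): expected fraction of bases found exactly v times at redundancy r."""
    return math.exp(-r) * r ** v / math.factorial(v)


def fraction_covered(r: float) -> float:
    """1 - f(0): expected fraction of the genome covered at least once."""
    return 1.0 - math.exp(-r)


for r in (1, 2, 5, 10):
    print(f"r={r:>2}  uncovered={poisson_frequency(r, 0):.5f}  "
          f"covered={fraction_covered(r):.5f}")
```

The uncovered fraction f(0) = e^(-r) shrinks quickly with coverage but never reaches zero, which is one reason gaps remain even at high redundancy.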
Better assembly strategies
- from shotgun reads to contigs
- Greedy algorithm
- no need to calculate all possible paths
- about 1/2 * n^2 possible overlaps to calculate, where n is the number of reads
- Process:
- pairwise alignments of all fragments
- choose two fragments with the largest overlap
- merge
- repeat until only one fragment left
- complexity increases as n^2
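The greedy process above can be sketched in a few lines. This is a toy version under simplifying assumptions: reads are error-free, all on the same strand, and overlaps are exact suffix/prefix matches rather than alignments. The function names and the example reads are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_len: int = 1):
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop fragments fully contained in another fragment
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_len, best_i, best_j = 0, None, None
        # Pairwise comparison of all fragments: this is the n^2 step
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    length = overlap(a, b, min_len)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # no overlaps left: remaining fragments are separate contigs
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Because the largest overlap is always taken first, repeats can mislead the greedy choice; the sketch makes no attempt to handle them.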
Scaffolding
- from contigs to scaffolds
- mate-pairs:
- pairing the ends can help merge contigs and resolve difficult repeated regions
- a complex genomic sequence cannot be completed using sequence overlaps alone
What are the uses of mate pairs?
- Scaffolding
- unique paired-out mates are quite useful in closing gaps
- Assembly validation [contigs]
- both mates align on the same contig, at a distance compatible with the size of the genomic insert
- Gap closure
Read assembly and Eulerian paths
- graph: reads are vertices and read overlaps are arcs (edges), with occurrence counts
- sequences occurring at a higher rate may be repeats (present more than once in the genome)
- Eulerian path: similar to the Hamiltonian path, but easier to compute
- De Bruijn graphs: built from kmers of length k, one-out-one-in
- if k is long enough, there shouldn’t be repeats (each kmer present only once)
De Bruijn graphs practical application
- shotgun library sequenced at high coverage (60x)
- fastq -> kmers are counted (kmers and frequency)
- start from any kmer, extend one position at a time by looking at the kmer list:
- no repeats -> only one possible next kmer is present
- kmers with few counts could be errors
- kmers with more counts could be repeats (more edges)
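The count-then-extend procedure above can be sketched as follows. This is a simplified illustration, not a real assembler: it extends only to the right, ignores the reverse strand, and stops at any branch. The function names, the `min_count` threshold, and the example reads (one of which carries a deliberate error) are assumptions.

```python
from collections import Counter


def count_kmers(reads, k: int) -> Counter:
    """Count every kmer of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend_right(seed: str, counts: Counter, min_count: int = 2) -> str:
    """Extend a seed kmer one base at a time while the next kmer is unambiguous.

    Assumes k >= 2. Kmers seen fewer than min_count times are treated as
    likely errors and ignored.
    """
    k = len(seed)
    contig = seed
    seen = {seed}
    while True:
        suffix = contig[-(k - 1):]
        candidates = [suffix + base for base in "ACGT"
                      if counts.get(suffix + base, 0) >= min_count]
        # Stop at a dead end, at a branch (possible repeat), or at a cycle
        if len(candidates) != 1 or candidates[0] in seen:
            break
        seen.add(candidates[0])
        contig += candidates[0][-1]
    return contig


reads = ["ATGGCGTACC", "GGCGTACCTGA", "ATGGCGTACCTGA",
         "ATGGCTTACC"]  # last read contains a sequencing error (G -> T)
counts = count_kmers(reads, 5)
print(extend_right("ATGGC", counts, min_count=2))
```

With `min_count=1` the erroneous kmer `TGGCT` is kept, the first step already sees two candidates, and the extension stops at the seed; this is the "more edges" situation described above.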
De Bruijn graphs and mate pairs
- same fragment could be sequenced from both ends
- inverted (reverse-complement) sequences should be counted as one
- best kmer length is 20-30 bases
- genomic sequences -> lots of repeats -> many branches
- using mate pair libraries could help identify the right path
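Counting a sequence and its inverted form as one is commonly done with a "canonical" kmer. The sketch below assumes that "inverted" means the reverse complement, and picks the lexicographically smaller of the two forms as the representative; the function names are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_canonical_kmers(reads, k: int) -> Counter:
    """Count kmers so that both strands contribute to the same entry."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


# The second read is the reverse complement of the first
counts = count_canonical_kmers(["ATGGC", "GCCAT"], 5)
print(counts)
```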