De Novo Assembly Flashcards
1
Q
What are mate pairs (MP)? What types are there?
A
- obtained by sequencing the two ends of genomic fragments
- usually only reads that map to unique positions are considered
- come from long fragments (long inserts)
- the two reads lie far apart in the genome, and their orientation is opposite to that of a paired-end fragment
- Paired-out
- mates map on 2 different contigs -> merge and assess gap
- Paired-in
- mates map on the same contig -> validate assembly, identify structural variations
- Single
- one read maps and its mate-pair does not -> close to a gap
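The three categories above can be sketched as a small classifier. This is only an illustration: it assumes each read of a pair has already been mapped and is represented by the name of its contig (or `None` if it did not map). The function name and the extra "unmapped" label are hypothetical and are not part of the original notes.

```python
from typing import Optional


def classify_mate_pair(contig_a: Optional[str], contig_b: Optional[str]) -> str:
    """Classify a mate pair by where its two reads map (None = unmapped)."""
    if contig_a is None and contig_b is None:
        return "unmapped"      # assumption: neither read maps, not informative
    if contig_a is None or contig_b is None:
        return "single"        # one read maps, its mate does not: close to a gap
    if contig_a == contig_b:
        return "paired-in"     # same contig: validate assembly, look for SVs
    return "paired-out"        # different contigs: candidate link for merging


print(classify_mate_pair("contig_1", "contig_2"))
```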
2
Q
What is a contig?
A
- series of overlapping DNA sequences
- once merged, they form one contiguous stretch of DNA
3
Q
What are Structural variations (SV)?
A
- genomic differences involving large segments of DNA
- deletions
- insertions
- inversions
- can be normally occurring in the population or pathological
4
Q
What is scaffolding?
A
- linking together a non-contiguous series of genomic sequences into a scaffold
- bridge gaps between contigs
5
Q
What can paired-end and mate-pair sequencing be used for?
A
- scaffolding (and gap filling)
- validation of de novo genomic sequencing
- structural variations detection
- two things are needed: a genome (assembly) and the mate-pair reads
6
Q
What is physical coverage?
A
- average number of times a base is read or spanned by mate paired reads
- mate pairs obtained from long physical inserts would be most effective for scaffolding
7
Q
What is sequence coverage?
A
- average number of times a base is read
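The difference between sequence coverage and physical coverage can be made concrete with a rough calculation. The sketch below assumes the usual approximations (total read bases divided by genome length, and total insert bases divided by genome length) and ignores edge effects. The function and parameter names are illustrative only.

```python
def sequence_coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times a base is read."""
    return n_reads * read_len / genome_len


def physical_coverage(n_pairs: int, insert_len: int, genome_len: int) -> float:
    """Average number of times a base is read or spanned by a pair."""
    return n_pairs * insert_len / genome_len


# Hypothetical library: 10,000 pairs, 100-base reads, 3,000-base inserts, 1 Mb genome
genome_len = 1_000_000
n_pairs = 10_000
print(sequence_coverage(2 * n_pairs, 100, genome_len))   # 2.0
print(physical_coverage(n_pairs, 3_000, genome_len))     # 30.0
```

With these made-up numbers each base is read only about twice but is spanned about thirty times, which is why long inserts are useful for scaffolding.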
8
Q
What does paired-end mean?
A
- short fragments (short inserts)
- set of fragments read from both ends
9
Q
Detection of structural variations with mate pair libraries
A
- insert length statistics for the identification of structural variations
- use of “broken” reads to identify insertion/deletion points: a read should appear broken (split) where the junction occurs
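A minimal sketch of the insert length idea: a pair whose mapped span is far from the expected insert size hints at a structural variation. It assumes the library mean and standard deviation are known (for example from library preparation) and that spans are measured on the reference genome. The function name, the threshold `k` and the labels are assumptions for illustration.

```python
from typing import List, Tuple


def flag_by_insert_size(spans: List[int], lib_mean: float, lib_sd: float,
                        k: float = 3.0) -> List[Tuple[int, str]]:
    """Return (pair index, label) for pairs whose span is more than k SDs off."""
    flagged = []
    for i, span in enumerate(spans):
        if span > lib_mean + k * lib_sd:
            # mates map too far apart: sequence may be deleted in the sample
            flagged.append((i, "possible deletion"))
        elif span < lib_mean - k * lib_sd:
            # mates map too close together: sample may carry an insertion
            flagged.append((i, "possible insertion"))
    return flagged


spans = [3000, 3100, 2900, 6000, 1000]
print(flag_by_insert_size(spans, lib_mean=3000, lib_sd=200))
```

In practice a single discordant pair is weak evidence. Calls are usually supported by several pairs pointing at the same region.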
10
Q
Sanger sequencing
A
- 1977, Frederick Sanger
- later, with fluorescent dyes, the first “automatic” DNA sequencers appeared
- quality deteriorates with length of fragment
- based on creating many subfragments of the region to be sequenced, each in many copies
- all subfragments start at the same position
- each subfragment terminates at a random base through a dideoxy terminator; there are 4 classes of terminator, one per base
11
Q
Variable loci, what are those? What is the difference between polymorphism and mutation?
A
- loci that differ between individuals, e.g. a locus that is a C in 60% of individuals and a T in 40%
- about 1 out of 500 bases
- polymorphism -> at least in 1% of population
- mutation -> less than 1%
12
Q
What are the problems and solutions of whole genome sequencing?
A
- genomes are millions to billions of bases long
- Sanger -> reads of 500 to 1000 bases
- possible solutions:
- generate contiguous fragments and sequence them (unfeasible)
- shotgun approach (sequence random fragments)
13
Q
Hierarchical shotgun assembly
A
- BACs (Bacterial Artificial Chromosomes) can host 150 kbp of DNA
- transferred into E. coli, they replicate
- BAC clones are sequenced independently, either randomly or to obtain minimum overlap
- assembling is difficult: it amounts to finding a Hamiltonian path (NP-hard)
- Gaps:
- some regions could not be covered
- repeats make association of contigs difficult
14
Q
Poisson distribution
A
- f(v) = (e^(-r) * r^v) / v!
- f(v): expected fraction of bases covered exactly v times
- r: redundancy (average coverage)
- 1 - e^(-r): fraction of the genome covered at least once
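The formula above can be checked numerically. This is a small sketch using only the standard library. The function names are illustrative, and the model assumes reads land uniformly at random on the genome.

```python
from math import exp, factorial


def poisson_fraction(v: int, r: float) -> float:
    """Expected fraction of bases covered exactly v times at average coverage r."""
    return exp(-r) * r ** v / factorial(v)


def fraction_covered(r: float) -> float:
    """Expected fraction of the genome covered at least once: 1 - f(0)."""
    return 1.0 - exp(-r)


for r in (1, 2, 5, 10):
    print(r, round(poisson_fraction(0, r), 6), round(fraction_covered(r), 6))
```

For example, at r = 5 about 0.993 of the genome is expected to be covered at least once, so some gaps remain even at fairly high redundancy.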
15
Q
Better assembly strategies
A
- from shotgun reads to contigs
- Greedy algorithm
- no need to calculate all possible paths
- about 1/2 * n^2 pairwise overlaps are computed for n reads
- Process:
- pairwise alignments of all fragments
- choose two fragments with the largest overlap
- merge
- repeat until only one fragment is left
- complexity increases as n^2
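The greedy process above can be sketched in a few lines. This is a toy version under strong assumptions: reads are error-free, all on the same strand, and only exact suffix/prefix overlaps are considered instead of full pairwise alignments. Unlike the card, the loop also stops when no overlap remains and returns the leftover fragments as separate contigs. The function names and `min_len` are illustrative.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 1) -> List[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_k, best_i, best_j = 0, -1, -1
        # all ordered pairs are compared, hence the n^2 behaviour
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: the rest stay as separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags


print(greedy_assemble(["ATGCG", "GCGTA", "GTAAC"]))  # ['ATGCGTAAC']
```

Because each step takes the locally best overlap, repeats can lead this kind of procedure to a wrong merge. The sketch makes no attempt to detect that.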