Lecture 4 Flashcards
Explain how Illumina reads are assembled
Look for overlaps between the read sequences (text), merge overlapping reads into contigs, and assemble the contigs
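A minimal sketch of the overlap idea, assuming error-free reads that are given in the right order. Real assemblers are far more involved. The function names, the `min_len` cutoff, and the example reads are made up for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join b onto the end of a using their overlap, or return None if none."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Hypothetical reads, already in left-to-right order (an assumption).
reads = ["ATGCGT", "CGTTAC", "TACGGA"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read)

print(contig)  # ATGCGTTACGGA
```

Each read shares 3 bases with the next one, so the three reads collapse into a single contig.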
What are contigs?
A contiguous (continuous) sequence built from overlapping reads
What is a read?
A DNA fragment that has been sequenced (represented as text)
Why do we still need to assess assembly quality even though the Human Genome Project is done?
Many organisms that are not well-studied model organisms still have unknown genomes
What are the 4 ways we measure how good an assembly is?
- % of reads assembled
- Number of contigs
- Length of contigs
- N50 → 50% of the assembly is contained in contigs greater than or equal to this length
How to find N50?
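- Put contigs in order by length (conventionally biggest to smallest)
- Add up all the contig lengths and divide by 2 → this number is NOT N50
- Go down the sorted list keeping a running total → N50 is the length of the contig where the running total reaches that half-total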
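The N50 steps above can be sketched as follows. This assumes contigs are sorted biggest to smallest; the function name and the example lengths are made up for illustration.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)  # biggest to smallest
    half_total = sum(lengths) / 2                   # NOT the N50 itself
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            # This contig is where half of the assembly is reached.
            return length
    raise ValueError("no contigs given")


# Hypothetical contig lengths: total = 290, half = 145.
# Running total: 80, then 150 (>= 145), so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```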
What factors limit assembly quality?
- Low coverage
- Difficult Sequences
- Low Accuracy
What does low coverage mean?
Parts of the genome are missing from the reads, so there are not enough overlaps
How to find average coverage?
(# of reads × read length) / genome size
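A quick worked example of the coverage formula. The function names and all numbers are hypothetical. The second function just rearranges the same formula to estimate how many reads a target coverage needs.

```python
import math


def average_coverage(num_reads, read_length, genome_size):
    """Average coverage = (# of reads x read length) / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Rearranged formula: reads required to hit a target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical run: 1,000,000 reads of 150 bases on a 5,000,000-base genome.
print(average_coverage(1_000_000, 150, 5_000_000))  # 30.0
print(reads_needed(30, 150, 5_000_000))             # 1000000
```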
What determines the number of reads in Illumina sequencing?
The # of clusters that the machine generates
What determines the read length in Illumina sequencing?
The # of pictures taken (one base is read per imaging cycle)
What is the recommended coverage for determining the complete sequence of a genome?
30x - 50x coverage is recommended
What are difficult sequences?
- Repeats → it is difficult to figure out how long the repeat region is
- Heterozygosity → some reads may have different bases at the same position
What is low accuracy?
Mistakes in sequencing (by the polymerase) lead to a bad assembly
How does high coverage help with checking for sequencing errors?
With many reads covering the same position, you can determine whether a base is correct or not
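One simple way to picture this is a majority vote at each position. This is a sketch under the assumption that the reads are already aligned and have equal length; it is not how any particular assembler does it. The function name and the reads are made up.

```python
from collections import Counter


def consensus(aligned_reads):
    """Pick the most common base at each position across aligned reads."""
    result = []
    for column in zip(*aligned_reads):  # one column per position
        base, _count = Counter(column).most_common(1)[0]
        result.append(base)
    return "".join(result)


# Hypothetical 5x coverage: two reads each carry one sequencing error.
reads = [
    "ACGTAC",
    "ACGTAC",
    "ACCTAC",  # error at position 3
    "ACGTAC",
    "ACGTAT",  # error at position 6
]
print(consensus(reads))  # ACGTAC
```

With only one or two reads per position, the errors could not be outvoted, which is why low coverage and low accuracy together are a problem.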