Genome assembly Flashcards
Why do we sequence?
- we are still sequencing new genomes
- can be a new individual
- for DNA-protein interactions
- metagenomics
- Sequence new genome (no previous version)
- Sequence new individuals - how do they differ from the reference?
- Sequence population - look at variation across population
- Sequence tumour cells and compare to ‘normal’ tissue – where are cancer mutations - time course?
- Sequence transcripts: survey gene-space, also relative quantification by tissue / time / condition
- Sequence as read-out to identify DNA-protein interaction (e.g. chromatin precipitation)
- Metagenomic sequencing of mixed, co-habiting populations of organisms: genome fragments, transcripts or rRNAs used to determine identity and relative abundance
What are the next-generation sequencing technologies?
- Illumina
- Oxford Nanopore
- PacBio
How do you get high quality in Illumina?
Reads are short, but the volume of reads you can get through is very large, so each base is covered many times
What is the length of the reads in PacBio?
Shorter than Oxford Nanopore but longer than Illumina
How do you deal with high error rates in PacBio?
The raw error rate is very high. To solve that, you sequence the same fragment multiple times; because the errors are random, you can align the repeated reads and take the consensus, which gives high accuracy
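The idea that random errors cancel out once repeated reads are aligned can be illustrated with a simple per-position majority vote. This is only a minimal sketch: it assumes the passes are already aligned and of equal length, and the function name `consensus` and the example reads are hypothetical. Real consensus callers use more elaborate models.

```python
from collections import Counter


def consensus(aligned_passes):
    """Majority-vote consensus over equal-length, already-aligned reads.

    Assumption: alignment has been done elsewhere, so column i of every
    pass refers to the same position in the molecule.
    """
    if not aligned_passes:
        raise ValueError("need at least one pass")
    length = len(aligned_passes[0])
    if any(len(p) != length for p in aligned_passes):
        raise ValueError("passes must be aligned to the same length")

    called = []
    for column in zip(*aligned_passes):
        # Take the most frequent base in this column.
        base, _count = Counter(column).most_common(1)[0]
        called.append(base)
    return "".join(called)


# Hypothetical passes over the same fragment, each with at most one error.
passes = [
    "ACGTTGCA",
    "ACGATGCA",  # random error at index 3
    "ACGTTGCA",
    "ACCTTGCA",  # random error at index 2
    "ACGTTGCA",
]
print(consensus(passes))  # prints ACGTTGCA
```

Because each error appears in only one pass, it is outvoted at its position by the other passes.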
What are quality scores particularly important for?
If you are trying to find SNPs, you need the quality score to tell whether a mismatch is a sequencing error or an actual variation
What do we need quality scores for?
- Quality scores are assigned to estimate confidence of a given base call
- Expressed as Phred scores
- Aim for a quality score of 30 or higher (an error probability of 1 in 1,000 or lower)
- The quality scores are used for filtering and trimming of reads
- Also used for assembly
- Base quality scores are essential for variant calling to distinguish a true variant from a sequencing error
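Filtering and trimming by quality can be sketched roughly as follows. This is an illustrative simplification, not the method of any particular trimming tool: the function names, the threshold of 30 and the example read are assumptions, and qualities are taken as a list of integer Phred scores.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=30):
    """Remove bases from the 3' end while their quality is below min_quality."""
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


def passes_filter(qualities, min_mean_quality=30):
    """Keep a read only if it is non-empty and its mean quality is high enough."""
    if not qualities:
        return False
    return sum(qualities) / len(qualities) >= min_mean_quality


# Hypothetical read whose quality drops towards the end.
seq = "ACGTACGT"
quals = [38, 37, 36, 35, 34, 22, 15, 8]

trimmed_seq, trimmed_quals = trim_low_quality_tail(seq, quals)
print(trimmed_seq)                   # prints ACGTA
print(passes_filter(trimmed_quals))  # prints True
```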
Where does the quality deteriorate?
Quality deteriorates towards the ends of reads
What does high AT or GC content do?
High AT or GC content reduces complexity and can lead to higher error rates
What is the formula for QV?
- The quality value (QV) is related to the base call error probability by the formula
- QV = -10 × log10(Pe), where Pe is the probability that the base call is an error
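As a worked example of the QV formula, the sketch below converts between error probability and quality value in both directions. The function names are illustrative assumptions, not part of any sequencing library.

```python
import math


def error_prob_to_qv(pe):
    """QV = -10 * log10(Pe), for an error probability 0 < Pe <= 1."""
    if not 0 < pe <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(pe)


def qv_to_error_prob(qv):
    """Inverse of the formula: Pe = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)


# An error probability of 1 in 1,000 corresponds to QV 30.
print(f"{error_prob_to_qv(0.001):.1f}")  # prints 30.0
# QV 20 corresponds to an error probability of 1 in 100.
print(f"{qv_to_error_prob(20):.4f}")     # prints 0.0100
```

Each step of 10 in QV is a tenfold drop in error probability: QV 10 is 1 in 10, QV 20 is 1 in 100, QV 30 is 1 in 1,000.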
What is base calling?
- In Illumina sequencing, base calling algorithms turn raw intensities into A, T, C, G or N base calls
What is the Chastity Filter?
- The usual method for base calling in Illumina systems is known as the Chastity Filter
- The Chastity Filter calls a base if the highest intensity divided by the sum of the highest and second-highest intensities is no less than a threshold (usually 0.6). Otherwise the base is marked as N
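The chastity rule described above can be sketched as follows. This is a simplified illustration of the ratio test only: the function names and the example intensities are hypothetical, and real Illumina software applies the test within a larger pipeline.

```python
def chastity(intensities):
    """Highest intensity divided by the sum of the two highest intensities.

    `intensities` maps each base ("A", "C", "G", "T") to its raw intensity.
    """
    ranked = sorted(intensities.values(), reverse=True)
    highest, second = ranked[0], ranked[1]
    total = highest + second
    if total <= 0:
        return 0.0  # no usable signal
    return highest / total


def call_base(intensities, threshold=0.6):
    """Call the brightest base if chastity >= threshold, otherwise N."""
    if chastity(intensities) >= threshold:
        return max(intensities, key=intensities.get)
    return "N"


# Clear signal: 900 / (900 + 100) = 0.9, so the base is called.
print(call_base({"A": 900, "C": 100, "G": 50, "T": 30}))   # prints A
# Ambiguous signal: 500 / (500 + 450) is about 0.53, below 0.6.
print(call_base({"A": 500, "C": 450, "G": 40, "T": 20}))   # prints N
```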
What is FASTQ format?
- The standard output format for next-generation sequencing
- Downstream programs now rely on this format
What do they use for quality scores in FASTQ?
Quality scores are stored as ASCII characters, so each base character is paired one-to-one with a quality character
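The character-to-score mapping can be sketched as below. It assumes the common Phred+33 encoding (the quality score is the ASCII code minus 33); the offset is not stated in these notes, and some older files used an offset of 64. The function names are illustrative.

```python
def decode_quality(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of integer Phred scores.

    Assumption: Phred+33 encoding, i.e. score = ASCII code - 33.
    """
    return [ord(ch) - offset for ch in quality_string]


def encode_quality(scores, offset=33):
    """Inverse operation: integer Phred scores back to ASCII characters."""
    return "".join(chr(score + offset) for score in scores)


print(decode_quality("I?5+!"))           # prints [40, 30, 20, 10, 0]
print(encode_quality([40, 30, 20, 10]))  # prints I?5+
```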
Describe the standard FASTQ output
- 4 lines per sequence
- Line 1 begins with the @ character, a sequence ID and an optional description
- Line 2 is the sequence
- Line 3 begins with the + character, optionally followed by the same sequence ID and description
- Line 4 encodes the quality values for the sequence letters in line 2 and must contain the same number of characters
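The four-line layout above can be read with a few lines of code. This is a minimal sketch that assumes strictly four lines per record (no wrapped sequence lines); the function name `read_fastq` and the example records are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (read_id, description, sequence, quality) from a FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")

        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not plus.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("lines 2 and 4 must have the same length")

        # The ID runs up to the first space; the rest is the description.
        read_id, _, description = header[1:].partition(" ")
        yield read_id, description, sequence, quality


example = (
    "@read1 sample description\n"
    "ACGTN\n"
    "+\n"
    "IIII!\n"
    "@read2\n"
    "GGCA\n"
    "+read2\n"
    "??5+\n"
)
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record)
```

The length check mirrors the rule that line 4 must contain the same number of characters as line 2.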
What is depth of coverage useful for?
Sequencing errors are corrected by the depth of coverage of overlapping sequence fragments: a random error in one read is outvoted by the other reads covering the same base
What was the depth of coverage in the Human Genome Project?
- For the Human Genome Project, most of the genome was sequenced at 12x or greater coverage
- Each base was present in 12 reads on average
- Even with 12x coverage, approximately 1% of the genome was not accurately assembled
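Average depth of coverage is commonly estimated as number of reads × read length / genome size. The sketch below applies that estimate. The numbers are purely hypothetical and chosen to give 12x; they are not the actual Human Genome Project read counts or read lengths.

```python
import math


def average_coverage(num_reads, read_length, genome_size):
    """Average depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / read_length)


# Hypothetical values: a 3 Gb genome and 500 bp reads.
genome_size = 3_000_000_000
read_length = 500

print(average_coverage(72_000_000, read_length, genome_size))  # prints 12.0
print(reads_needed(12, read_length, genome_size))              # prints 72000000
```

This is an average only: individual regions can be covered far more or far less deeply than the genome-wide figure.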