Theme 2 Flashcards
Define genome annotation
An overlay of biological information onto the genome sequence to predict and mark important features.
What features does genome annotation look for
Protein coding genes (by location)
RNA features (by location and function)
Protein function (by similar sequences)
Features of protein coding genes
Contained in an ORF
Have an initiation codon, usually ATG
Have a ribosome binding site
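The first two features above can be checked with a short script. This is a minimal sketch, not how any particular gene finder works: it assumes a forward-strand search only, ATG as the only initiation codon, and the helper name find_orfs is made up for illustration. It ignores ribosome binding sites entirely.

```python
# Standard stop codons (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    min_codons counts every codon in the ORF, including the start and stop codons.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3)) # end index is exclusive
                start = None                    # look for the next ORF
    return orfs

# Toy sequence, made up for illustration.
print(find_orfs("CCATGAAATAGCC"))  # [(2, 11)], i.e. ATGAAATAG
```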
What technology do we use to find gene locations
Gene finders
e.g. GeneMarkS, GLIMMER and Prodigal
How do we predict protein function from similar sequences
Compare the query sequence to a database
Good similarity means similar amino acid content in a similar order (the order must match)
<10% identical: similarity occurs by chance, so not related
10-35% identical: might have a related function
>35% identical: probably have a related function
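The identity bands above can be applied with a small calculation. A minimal sketch, assuming the two sequences have already been aligned to the same length (gaps written as "-"); the function names and the example peptides are made up for illustration.

```python
def percent_identity(a, b):
    """Percent of aligned positions holding the same residue (gaps never match)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

def interpret_identity(pct):
    """Map a percent identity onto the three bands used on this card."""
    if pct < 10:
        return "similarity by chance, not related"
    if pct <= 35:
        return "might have a related function"
    return "probably have a related function"

pct = percent_identity("MKTAYIAK", "MKTAHIAR")  # 6 of 8 positions match
print(pct, interpret_identity(pct))
```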
What is the most common tool for comparing sequences
BLAST
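p (blastp) = compare protein query to protein database
n (blastn) = compare nucleotide query to nucleotide database (a nucleotide query against a protein database is blastx)
The “expect” value (E-value) is the number of matches at least this good expected to occur by chance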
- Near 0 = good
- Above 0.1 = bad
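The 0.1 cut-off can be applied when reading BLAST results. A minimal sketch, assuming the hits were saved in BLAST's default 12-column tabular layout (output format 6), where the second column is the subject ID and the eleventh is the E-value; the function name filter_hits and the two example hits are made up for illustration.

```python
def filter_hits(tabular_text, max_evalue=0.1):
    """Keep (subject_id, evalue) pairs whose E-value is at or below max_evalue."""
    hits = []
    for line in tabular_text.strip().splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        subject = fields[1]               # column 2: subject sequence ID
        evalue = float(fields[10])        # column 11: E-value
        if evalue <= max_evalue:
            hits.append((subject, evalue))
    return hits

# Two invented hits: one near 0 (good), one above 0.1 (bad).
example = (
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t380\n"
    "query1\tsubjB\t30.0\t50\t35\t0\t1\t50\t1\t50\t2.5\t20\n"
)
print(filter_hits(example))
```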
Explain a genetic proof using virulence factors
A way of proving what a gene does
Compare 2 strains, one of which has a new virulence factor
- The virulent version of the gene is put into the avirulent strain. If the virulence factor is then observed, that specific gene or gene change is shown to be responsible for it.
What is read depth (coverage)
The average number of times each base in the genome is covered by reads.
Depth = (no. of reads x length of each read in bases)/estimated genome size
30x to 100x is enough to avoid gaps
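The depth formula can be worked through with numbers. A minimal sketch; the function names and the example run (one million 150 bp reads on a 5 Mb genome) are made up for illustration.

```python
import math

def read_depth(num_reads, read_length, genome_size):
    """Depth = (no. of reads x read length in bases) / estimated genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Rearranged formula: how many reads give the target depth."""
    return math.ceil(target_depth * genome_size / read_length)

print(read_depth(1_000_000, 150, 5_000_000))   # 30.0, i.e. 30x
print(reads_needed(100, 150, 5_000_000))       # reads for 100x
```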
Describe Sanger sequencing
- Dideoxy nucleotides (with radioactive marker)
- Normal deoxy nucleotides
- Primer
- ssDNA template
- DNA polymerase
As a new strand is made, incorporation of a dideoxy nucleotide terminates it, and the terminating base can be read from its marker.
Describe Illumina sequencing technology and the 2 main machines
Like Sanger, it reads bases as a new strand is made, but it uses “blocked” nucleotides. A photo is taken when a nucleotide is added, the nucleotide is then unblocked so the next can join, and another photo is taken.
HiSeq X Ten: for human genomes, 3 days, many 150 bp reads
MiSeq: for other jobs, 56 hours, fewer reads than above but longer (up to 300 bp)
What is FASTQ and a phred quality score
FASTQ stores raw sequence reads (before mapping) together with a quality score for every base; FASTA stores sequence only, with no quality scores.
Multi-FASTQ: one file listing all the reads
Phred quality score: a measure of confidence in each base call, written as one symbol per base
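The quality symbols can be turned back into numbers. A minimal sketch, assuming the common Phred+33 encoding (score = ASCII code of the symbol minus 33) and the standard relation Q = -10 x log10(P), where P is the probability that the base call is wrong; the function names and the example quality string are made up for illustration.

```python
def phred_scores(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """Probability the base call is wrong, from Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

scores = phred_scores("I5!")
print(scores)                              # [40, 20, 0]
print([error_probability(q) for q in scores])
```

Q20 means a 1 in 100 chance the base is wrong; Q40 means 1 in 10,000.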
Explain what a draft genome is
A sequence that has not been fully checked and annotated
Made up of contigs (unbroken consensus sequences)
A contig break is where reads do not overlap (but >30x depth usually prevents this)
How do you go from a draft to a closed genome
We combine the short-read draft genome from Illumina with long reads from another technology
The long reads span more than the longest repeated elements, so the repeats can be placed in the genome.
This combines the read quality of the short reads with the read length of the long reads
What are the technologies that can be used to make long reads
PacBio: Single-molecule sequencing with fluorophores on the nucleotides, so a fluorescent flash is recorded when a base is added
Nanopore: Pulls ssDNA through a nanopore. An electrical signal is measured as each base passes the sensor
PCR strategy: Design primers to amplify the gaps between contigs. Makes a ‘PCR amplicon’ which can be sequenced
Outline the HGP
Aimed to determine the entire sequence of human DNA to identify all the genes.
1990-2003
Covered 99% of the genome with an error rate of 1 in 100,000 bases.
Did not cover telomeres or centromeres
Found that:
- The human genome is about 3,164.7 million bases (roughly 3.2 billion) and has 20,000-25,000 genes
- Less than 2% of the genome encodes proteins