Theme 2 Flashcards
Define genome annotation
an overlay of biological information on to the genome sequence to predict and mark important features.
What features does genome annotation look for
Protein coding genes (by location)
RNA features (by location and function)
Protein function (by similar sequences)
Features of protein coding genes
Contained in an ORF
Have an initiation codon, usually ATG
Have a ribosome binding site
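A minimal sketch of the ORF idea above, assuming a forward-strand-only scan for ATG followed by an in-frame stop codon. The function name `find_orfs` and the `min_codons` cut-off are illustrative choices, not how GeneMarkS, GLIMMER or Prodigal actually work (real genefinders also score ribosome binding sites, codon usage and the reverse strand).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    start/end are 0-based, end is exclusive and includes the stop codon.
    min_codons is a hypothetical length filter (codons before the stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first initiation codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy example: ATG AAA TTT GGG TAA embedded in frame 2
print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

The toy sequence is far shorter than a real gene; the default of 30 codons is only there to show how short chance ORFs could be filtered out.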
What technology do we use to find gene locations
Genefinders
e.g. GeneMarkS, GLIMMER and Prodigal
How do we predict protein function from similar sequences
Compare query sequence to a database
Similar amino acid (aa) content in the same order indicates good similarity (the order must match)
<10% identical: similarity occurs by chance thus not related
10-35% identical: might have a related function
> 35% identical: probably have a related function
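The thresholds above can be sketched as a small calculation on two sequences that have already been aligned. This is an illustration only: the function names are hypothetical, and counting identity over non-gap columns is an assumption (tools differ in which denominator they use).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns where neither sequence has a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)

def interpret_identity(pct):
    """Apply the rule-of-thumb thresholds from the flashcard."""
    if pct < 10:
        return "similarity likely by chance, not related"
    if pct <= 35:
        return "might have a related function"
    return "probably have a related function"

pct = percent_identity("MKTA", "MKTG")  # 3 of 4 positions match
print(pct, interpret_identity(pct))
```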
What is the most common tool for comparing sequences
BLAST
p = compare protein query to protein database
n = compare nucleotide query to nucleotide database
The “expect” (E) value measures the likelihood of a match occurring by chance
- Near 0 = good
- Above 0.1 = bad
Explain a genetic proof using virulence factors
Proving what a gene does
Comparing 2 strains, one with a new virulence factor
- The virulent version of the gene is put into the avirulent strain. If the virulence factor can then be observed, that specific gene or gene change has been proven responsible for the virulence factor.
What is read depth (coverage)
The average number of times each base in the genome is covered by reads.
Depth = (no. of reads x length of each read in bases)/estimated genome size
30x to 100x is enough to avoid gaps
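A worked example of the depth formula, as a small sketch. The read count, read length and genome size below are made-up numbers chosen for illustration, and `reads_needed` is a hypothetical helper that just rearranges the same formula.

```python
import math

def read_depth(num_reads, read_length, genome_size):
    """Depth = (no. of reads x read length in bases) / estimated genome size."""
    return num_reads * read_length / genome_size

def reads_needed(target_depth, read_length, genome_size):
    """Rearranged formula: how many reads give the target depth."""
    return math.ceil(target_depth * genome_size / read_length)

# 1,000,000 reads of 150 bp over a 5 Mb bacterial-sized genome
print(read_depth(1_000_000, 150, 5_000_000))   # 30.0, i.e. 30x
print(reads_needed(30, 150, 5_000_000))        # 1000000
```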
Describe Sanger sequencing
- Dideoxy nucleotides (with radioactive marker)
- Normal deoxy nucleotides
- Primer
- ssDNA template
- DNA polymerase
As the new strand is made, a dideoxy nucleotide terminates it, and the terminating base can be read from its marker.
Describe illumina sequencing technology and the 2 main machines
Similar to Sanger but uses “blocked” nucleotides. A photo is taken when a nucleotide is added, then it is unblocked so the next can join, then another photo is taken.
HiSeqX10: for human genome, 3 days, many 150 bp reads
MiSeq: for other jobs, 56 hours, (fewer reads than above but) 300 bp reads
What is FASTQ and a phred quality score
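FASTQ file stores the sequence reads with a quality score for every base (the raw reads, before mapping or assembly). FASTA stores sequence only, without quality scores (e.g. the sequence after assembly).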
Multi-FASTQ: list of all the reads
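Phred quality score: Measure of the confidence of each base call, stored in the FASTQ file as one symbol per base. Q = -10 x log10(P), where P is the probability that the base call is wrong.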
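A small sketch of how the quality symbols relate to error probabilities. It assumes the common Phred+33 encoding (ASCII code minus 33), which is used by Sanger-style and current Illumina FASTQ files; older Illumina files used a different offset. The function names are illustrative.

```python
def decode_quality(qual_string, offset=33):
    """Convert a FASTQ quality line into a list of Phred scores (assumes Phred+33)."""
    return [ord(symbol) - offset for symbol in qual_string]

def phred_to_error_prob(q):
    """Invert Q = -10 * log10(P) to get the probability the base call is wrong."""
    return 10 ** (-q / 10)

scores = decode_quality("I5!")
print(scores)  # [40, 20, 0]
for q in scores:
    print(q, phred_to_error_prob(q))
```

So Q20 means a 1 in 100 chance the base is wrong, and Q40 means 1 in 10,000.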
Explain what a draft genome is
A sequence that has not been fully checked and annotated
Made up of contigs (an unbroken consensus sequence)
A contig break is where there is no overlap (but >30x depth usually prevents this)
How do you go from a draft to a closed genome
We combine the short-read draft genome from Illumina with long reads from another technology
The long reads span more than the longest repeated elements so we can locate them in the genome.
Combines the read quality of the short reads with the read length of the long reads
What are the technologies that can be used to make large reads
PacBio: Single molecule sequencing with fluorophores on nucleotides so that a fluorescent flash can be recorded when a base is added
Nanopore: Pulls ssDNA through a nanopore. A change in electrical signal is measured as each base passes the sensor
PCR strategy: Design primers to amplify the gaps between contigs. Makes a ‘PCR amplicon’ which can be sequenced
Outline the HGP
Aimed to determine the entire sequence of human DNA to identify all the genes.
1990-2003
Cover 99% of genome with error 1 in 100,000 bases.
Not telomeres, centromeres
Found that:
- Human genome is 3,164.7 million bases (about 3.2 billion) and has 20,000-25,000 genes
- Less than 2% encodes proteins
Why are eukaryotic genomes more difficult for gene identification
Promoter sequences not easily recognised and can be far from start site
Most genes are interrupted
Explain CpG islands
CG or CpG is frequently methylated in DNA which turns off genes by altering chromatin structure.
Normal C, if deaminated, becomes U, which is recognised as foreign to DNA and corrected by repair
Methylated C, if deaminated, becomes T, which is not corrected
Over time this mutation has accumulated, meaning there is less CpG in the genome than expected.
Promoters of genes which are switched on are not methylated, meaning that if a deamination occurs it is repaired.
Promoters are usually CpG dense because of this regulation ability.
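CpG depletion can be illustrated by comparing how many CG dinucleotides a sequence has with how many would be expected from its C and G counts alone. The observed/expected ratio below is a commonly used measure, but it is not from these notes, so treat the formula and the function name `cpg_obs_exp` as assumptions for illustration. A ratio well below 1 suggests CpG has been lost, and a ratio near or above 1 suggests a CpG-dense region such as an unmethylated promoter.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (number of CG x length) / (number of C x number of G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    observed = seq.count("CG")  # CG dinucleotides on this strand
    return observed * len(seq) / (c_count * g_count)

print(cpg_obs_exp("CGCGCGCG"))     # CpG dense toy sequence
print(cpg_obs_exp("CAGCAGCTGC"))   # has C and G but no CG dinucleotide
```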
What is junk DNA
Pseudogenes, mobile genetic elements, segmental duplications, small sequence repeats that may be remnants or contribute to chromosomal bulk
What are the 3 types of pseudogenes
Classical pseudogenes: arise by duplication of a gene's DNA. Contain introns.
Processed pseudogenes: arise by processed mRNA being reverse transcribed and integrated into DNA. Do not contain introns as the mRNA was already spliced.
Other pseudogenes: may be transcribed and have biological roles, e.g. acting as a microRNA decoy.
What is a pseudogene
Pseudogenes are non-functional DNA segments that resemble genes. They become inactive through mutation.
What is a mobile genetic element
DNA segments that can move or copy themselves to another position in the genome, which can alter expression/function.
Can be transposons and retrotransposons
What is a transposon
(cut and paste)
DNA segments which can move within/between chromosomes by encoding their own transposase
Only transposon remnants evident in human genome.
What is a retrotransposon
(copy and paste)
Move from one point to another in the genome via an RNA intermediate, which is reverse transcribed into ss cDNA, then converted into dsDNA and inserted into a new site.
Can be retroviral-like or non-retroviral
Explain retroviral like retrotransposons
Retroviral-like: endogenous retroviral element, ERV
From retrovirus infection
Explain non retroviral retrotransposons
Long interspersed nuclear element, LINE: have a promoter and encode a protein with combined endonuclease and reverse transcriptase (RT) activity.
Short interspersed nuclear element, SINE: do not encode any proteins of their own; they move using enzymes produced by other mobile elements, e.g. LINEs.
Retrotransposons in somatic tissue may:
Alter function
Contribute to disease
Change phenotype expressed
What is ENCODE
Encyclopaedia of DNA elements
Aims to catalogue all of the functional elements in human genome