Shane - Lecture 3 Flashcards
How many genes do humans have?
22,000 genes
How long did it take to sequence the full human genome?
About 10 years
How much of your genetic material is the exact same as a random stranger?
99% of it is identical
Why did it take so long to sequence the human genome?
Because we have 3 billion base pairs but only 22,000 genes
What is computational gene prediction?
Identifying which genes lie on a sequence of DNA, i.e. which regions of the uncharacterised sequence code for proteins
What information can be found via computational gene prediction?
(6)
Which regions code for protein
Which DNA strand encodes the gene
Which reading frame is used
Where does the gene start and end
Where are the exon-intron boundaries in eukaryotes
Where are the regulatory sequences for that gene
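Worked example (not from the lecture): a minimal Python sketch of how strand, reading frame, start and end can be reported for candidate genes. It assumes the simplest case, an intron-free ORF running from ATG to the first in-frame stop codon; the names find_orfs, reverse_complement and min_codons are hypothetical, and real gene finders do far more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan both strands and all three frames for ATG...stop ORFs.

    Returns (strand, frame, start, end) tuples. start and end are 0-based,
    end-exclusive offsets on the strand that was scanned, and the stop
    codon is included in the reported span.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG opens a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs
```

Calling find_orfs on a sequence and on its reverse complement should report the same ORF on opposite strands, which is one way to check that both strands are being scanned.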
What often acts as the start codon?
ATG
What are the benefits of gene finding on prokaryotes?
(3)
Small genomes
High coding density
No introns
What is the gene level accuracy of gene finding of prokaryotes?
99%
What are the characteristics of eukaryotic genes?
Large genomes
Low coding density
Intron/exon structure
What is the gene level accuracy of gene finding on eukaryotic genes?
About 50% accuracy
What are the problems associated with gene finding on prokaryotes?
(3)
Overlapping open reading frames
Very short genes - protein might be only a few dozen amino acids
Finding transcription start sites (TSS) and promoters
What is a TSS?
The point at which RNA polymerase starts transcribing
What are the four ways we can predict the location of genes in genomic sequences?
Searching by signal
Searching by content
Similarity-based methods
Comparative genomics
What is it called when searching by signal and content is done simultaneously?
Ab initio or intrinsic methods
What are intrinsic methods of gene prediction used for?
For looking for very specific features associated with genes
What is it called if similarity-based methods and comparative genomics are used together?
Extrinsic methods
What is meant by searching by signal gene prediction?
The analysis of a sequence signal involved in gene specification
What is meant by searching by content in gene prediction?
Looking for statistical properties of the sequence, such as codon bias, that correlate with coding regions
What is meant by similarity-based methods of gene prediction?
Use of similarity to known annotated sequences
What is meant by comparative genomics?
Aligning genomic sequences from different species
What is meant by extrinsic methods of gene prediction?
(2)
Is our unknown gene similar to other known gene sequences?
This relies on pre-existing gene information
How does ab initio gene finding work?
(4)
We input a DNA string of letters (A, C, G, T)
We get out an annotation of the string of letters showing for every nucleotide whether it is coding or non-coding
Red = stop and start codons
Blue = exons
Black = introns
Identifies coding exons of protein-coding genes
Give an example of one of the most common stop codons
TAA
How does searching by signal work?
There are four different signals found at different sites:
- translation start codon ATG
- 5’ splice donor site
- 3’ splice acceptor site
- translation stop codon - TAA, TAG, TGA
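Worked example (not from the lecture): a rough Python sketch of searching by signal, listing every position where one of the four signals could occur. The ATG and stop codons come from the card above; using GT for the donor site and AG for the acceptor site is an added assumption based on the canonical GT-AG intron rule, and scan_signals is a hypothetical name.

```python
import re

# Signal name -> regular expression for its minimal consensus.
# GT (donor) and AG (acceptor) are the canonical intron boundaries,
# assumed here for illustration only.
SIGNALS = {
    "start": "ATG",
    "donor": "GT",
    "acceptor": "AG",
    "stop": "TAA|TAG|TGA",
}


def scan_signals(seq):
    """Return sorted (position, signal_name, matched_text) tuples."""
    hits = []
    for name, pattern in SIGNALS.items():
        # A lookahead group reports overlapping occurrences as well.
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)
```

Such short patterns occur by chance all over a genome, so most hits are false positives; this is why signal searching is combined with content searching in ab initio methods.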
List the three stop codons
TAA
TAG
TGA
What can be used to look up the donor and acceptor splice sites of a sequence?
Consensus sequences can be used to find splice sites
What can be used to help identify a stop signal?
The Cs and Ts found running up to the stop
What does searching by content do?
Accurate prediction of exons depends on content-based features -> these can identify the type of exon
What are the three types of exons?
Initial exons
Internal exons
Terminal exons
What are initial exons?
Open reading frames delimited by a start site and 5’ donor site
What are internal exons?
Open reading frames delimited by a 3’ acceptor site and 5’ donor site
What are terminal exons?
Open reading frames delimited by 3’ acceptor site and stop codon
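Worked example (not from the lecture): the three exon definitions above can be written as a small Python lookup keyed on the pair of signals that delimit the open reading frame. The function name exon_type and the signal labels are hypothetical, chosen only for illustration.

```python
# (upstream signal, downstream signal) -> exon type, as defined on the cards.
EXON_TYPES = {
    ("start", "donor"): "initial",
    ("acceptor", "donor"): "internal",
    ("acceptor", "stop"): "terminal",
}


def exon_type(upstream_signal, downstream_signal):
    """Return the exon type for a pair of delimiting signals, or None."""
    return EXON_TYPES.get((upstream_signal, downstream_signal))
```

Any pair not covered by the three definitions returns None rather than a guess.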
Where is codon bias mostly found?
Found in exons more so than introns
What is codon bias?
The uneven usage of synonymous codons -> some codons for the same amino acid are used more frequently than others
How can codon bias be useful?
It can be used to differentiate between coding and non-coding regions, as codon usage in coding regions differs from the pattern seen in non-coding sequence
What are coding statistics?
A function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein
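Worked example (not from the lecture): one possible coding statistic, sketched in Python as a log-likelihood ratio built from codon bias. It assumes codon frequencies are learned from known coding sequences and compared against a uniform background of 1/64 per codon; codon_frequencies, coding_score and the pseudocount are hypothetical choices, not a specific published method.

```python
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    # The pseudocount keeps unseen codons from getting zero probability.
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(seq, coding_freq, frame=0):
    """Sum of log(P(codon | coding) / P(codon | uniform background)).

    Positive scores mean the codons look more like the training genes
    than like random sequence; codons with non-ACGT letters are skipped.
    """
    background = 1 / len(ALL_CODONS)
    score = 0.0
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:
            score += math.log(coding_freq[codon] / background)
    return score
```

A sequence rich in codons that are common in the training genes scores above zero, while one made of codons never seen in training scores below zero.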