13 - Whole genome analysis Flashcards
How has Illumina sequencing changed recently?
Reads used to be limited to 50 to 300 base pairs. With paired-end sequencing you can now get up to 600 bases per fragment (2x300).
The equipment is cheaper and you don’t need as high a concentration of DNA anymore
Why shouldn’t you want to find the sequence of a genome just for the heck of it?
Only annotated genomes are useful.
Most genomes are annotated automatically, what’s a drawback to this?
Automatic annotations are lower quality than manual ones. Humans make the highest quality annotations, but manual annotation just isn’t realistic for most cases.
Give some general steps of annotating a genome
- Generate ORFs from a completed sequence
- Do a homology search for different factors (metabolic pathways, frameshift detection, gene families etc.)
- Combine the above search with more specialized searches (e.g. DNA motifs, regulatory elements and repetitive sequences).
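The first step above (generating ORFs) can be sketched roughly as follows. This is only a minimal illustration, assuming a plain DNA string, ATG as the only start codon and the standard three stop codons; the function names and the `min_codons` parameter are made up for this example, and real annotation pipelines use much more careful models.

```python
# Minimal six-frame ORF scan (illustrative sketch, not a real gene finder).
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Return (strand, frame, start, end) for each ATG...stop ORF.

    start/end are 0-based, end-exclusive positions on the scanned strand.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open an ORF at the first ATG in this frame
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # close the ORF and keep scanning
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGCC"):
        print(orf)
```

The candidate ORFs found this way would then be passed on to the homology and specialized searches in the later steps.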
What are the two most common approaches to define gene structure?
Prediction based (ab initio): algorithms designed to find genes/gene structures based on nucleotide sequence and composition
Sequence similarity (evidence driven): alignment to mRNA sequences (ESTs) and proteins from the same species or related species; identification of domains and motifs.
These are often done in combination.
What is prediction based (ab initio) gene structure searching?
Using software (gene predictors) that uses mathematical models created from the accumulated knowledge about gene structure in a particular type of organism. Gene predictors provide a rapid way to conduct a preliminary analysis of raw genome data but have low accuracy and can’t deal with alternative splicing and other complex situations. They are more effective on prokaryotic genomes.
This uses what is already known about gene structure in general, rather than external evidence about the specific gene.
Describe evidence driven gene prediction
Utilizes external data to find genes and determine their precise boundaries and features, such as introns and alternative splicing patterns. Its ability to produce accurate gene models depends on the nature and quality of the data available.
What sort of elements might an ab initio prediction based search look for in a eukaryotic genome to find genes?
- Promoter regions
- 5’ UTR
- initial exon
- introns
- Protein coding regions
- 3’ UTR
- poly-A tail (in mRNA)
- Intergenic DNA
What is RepeatFinder / RepeatMasker?
A tool that can look for many kinds of repeated sequence in a raw genomic dataset.
It uses a comprehensive database of repeated DNA.
It can label (annotate) regions of repeats and mask them to exclude them from further analysis.
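To make the masking idea concrete, here is a small sketch. It does not reproduce how RepeatMasker itself works; it assumes the repeat intervals have already been found, and the function name `mask_repeats` and its parameters are hypothetical. Soft-masking writes repeats in lower case, hard-masking replaces them with N.

```python
def mask_repeats(seq, repeats, hard=False):
    """Mask repeat intervals in a DNA string.

    repeats: iterable of 0-based, end-exclusive (start, end) pairs
             (assumed to come from an earlier repeat search).
    hard:    if True replace repeat bases with 'N', else lower-case them.
    """
    chars = list(seq)
    for start, end in repeats:
        # Clamp the interval so out-of-range coordinates are ignored.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


if __name__ == "__main__":
    print(mask_repeats("ACGTACGTAC", [(2, 5)]))             # soft-masked
    print(mask_repeats("ACGTACGTAC", [(2, 5)], hard=True))  # hard-masked
```

Downstream tools can then skip the masked bases, which is what excluding repeats from further analysis amounts to in practice.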
What is tRNAScan?
A tool that can look for potential tRNA genes and annotate them in the genome sequence
What is gene ontology and how can it be used for genome analysis?
The Gene Ontology (GO) is a controlled vocabulary of terms to describe gene product characteristics in three domains: molecular function, biological process and cellular component (localization).
GO terms can be used to classify genes into functional categories (e.g. metabolism, stress related, immunity etc.) as well as by where in the cell their products are found (e.g. nucleus, membrane etc.)
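Classifying genes into categories can be pictured as inverting a gene-to-terms table. The sketch below assumes such a table is already available as a Python dict; the gene names and category labels are invented for illustration and are not real GO annotations.

```python
from collections import defaultdict


def group_by_category(gene_to_terms):
    """Invert {gene: [category, ...]} into {category: [gene, ...]}."""
    groups = defaultdict(list)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            groups[term].append(gene)
    # Sort gene lists so the output is stable and easy to compare.
    return {term: sorted(genes) for term, genes in groups.items()}


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    annotations = {
        "geneA": ["metabolism"],
        "geneB": ["immunity", "stress related"],
        "geneC": ["metabolism", "stress related"],
    }
    for category, genes in sorted(group_by_category(annotations).items()):
        print(category, genes)
```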
What is the Kyoto Encyclopedia of Genes and Genomes (KEGG)?
A database resource which can be used to assign metabolic pathways to the genes within the genome of the organism being studied.