Chapter 5: Genome Annotation Flashcards
Module 5
genome annotation
the process by which genes are located in a genome sequence
Module 5
Once an assembled genome sequence has been obtained, various methods can be employed to locate the genes that are present. These methods can be divided into
- those that involve simply inspecting the sequence, by eye or more frequently by computer
- those methods that locate genes by experimental analysis.
Module 5
open reading frames (ORFs)
- found in genes that code for proteins
- consisting of a series of codons that specify the amino acid sequence of the protein that the gene codes for
- begins with an initiation codon, usually (but not always) ATG
- ends with a termination codon, either TAA, TAG, or TGA
Module 5
ORF scanning or ab initio gene prediction
- involves searching a DNA sequence for ORFs that begin with an ATG and end with a termination triplet
- complicated by the fact that each DNA sequence has six reading frames, three in one direction and three in the reverse direction on the complementary strand
- Each strand has three reading frames, depending on which nucleotide is chosen as the starting position.
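The six reading frames described above can be enumerated with a short sketch. It assumes an uppercase DNA string containing only A, C, G, and T; the function names `reverse_complement` and `six_frames` are illustrative, not taken from any particular annotation tool.

```python
# Complement lookup for the four DNA bases (uppercase input is an assumption of this sketch).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Split a sequence into codons for all six reading frames.

    Keys are (strand, offset); trailing partial codons are dropped.
    """
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time from the chosen start.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            frames[(strand, offset)] = codons
    return frames
```

For example, `six_frames("ATGAAATAG")[("+", 0)]` gives the codons `ATG`, `AAA`, `TAG`, while offset 1 on the same strand reads `TGA`, `AAT`.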
Module 5
- The key to the success of ORF scanning is the frequency with which _____ _____ appear in the DNA sequence.
- If the DNA has a random sequence and a GC content of 50%, then each of the three termination codons will appear, on average, once every _____.
- If the GC content is greater than 50%, then the termination codons, being AT-rich, will occur less frequently, but one will still be expected every ______.
- This means that random DNA should not show many ORFs longer than _____ codons in length, especially if the presence of a starting ATG triplet is used as part of the definition of an ORF.
- Most genes, on the other hand, are longer than 50 codons: the average lengths are _____ codons for bacterial genes and approximately _____ codons for humans.
- termination codons: TAA, TAG, or TGA
- 4³ = 64 bp
- 100–200 bp
- 50
- 300–350
- 450
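The spacing figures above follow from simple base probabilities. The sketch below assumes a random sequence in which A and T are equally frequent, as are G and C; `expected_spacing` is an illustrative name.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def base_probabilities(gc_content):
    """Per-base probabilities for random DNA with the given GC fraction (0 to 1)."""
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    return {"A": at, "T": at, "G": gc, "C": gc}


def expected_spacing(codon, gc_content):
    """Mean distance in bp between occurrences of one specific triplet."""
    p = base_probabilities(gc_content)
    prob = p[codon[0]] * p[codon[1]] * p[codon[2]]
    return 1.0 / prob


if __name__ == "__main__":
    for gc in (0.5, 0.6):
        for stop in STOP_CODONS:
            print(gc, stop, round(expected_spacing(stop, gc), 1))
```

At 50% GC every triplet has probability (1/4)³, so each termination codon is expected once every 64 bp. At 60% GC the same calculation gives about 125 bp for TAA, which is consistent with the 100–200 bp range quoted above.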
ORF scanning, in its simplest form, therefore takes a figure of, say, _____ codons as the shortest length of a putative gene and records positive hits for all ORFs _____ than this.
- 100
- longer
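A minimal sketch of this simplest form of ORF scanning is shown below. It assumes uppercase DNA, requires a starting ATG, and treats the threshold as inclusive (ORFs of `min_codons` or more are reported), which is a choice made for the sketch; `find_orfs` is a hypothetical name.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end, codon_count) for each qualifying ORF.

    Coordinates are 0-based positions on the scanned strand; codon_count
    includes the ATG but not the termination codon.
    """
    hits = []
    strands = (("+", seq), ("-", seq.translate(COMPLEMENT)[::-1]))
    for strand, s in strands:
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        hits.append((strand, frame, start, i + 3, n_codons))
                    start = None  # look for the next ATG in this frame
    return hits
```

Lowering `min_codons` makes the scan more sensitive to short genes but, as the cards above imply, also lets through more chance ORFs from random sequence.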
Although ORF scans work well for bacterial genomes, they are less effective for locating genes in DNA sequences from higher eukaryotes. This is partly because
there is substantially more space between the real genes in a eukaryotic genome (for example, approximately 62% of the human genome is intergenic), which increases the chance of finding spurious ORFs.
Module 5
- The main problem with the human genome, and the genomes of higher eukaryotes in general, is that their genes are often split by _____ and so do not appear as continuous ORFs in the DNA sequence.
- Many _____ are shorter than 100 codons, some consisting of fewer than 50 codons, and continuing the reading frame into an intron usually leads to a termination sequence that appears to close the ORF
- Intron boundaries are marked by _____ _____
- introns
- exons
- consensus sequences
Module 5
initiation codon
ATG
Module 5
termination codon
TAA, TAG, or TGA
Module 5
Codon bias
- not all codons are used with equal frequency in the genes of a particular organism
- all organisms have a bias, which is different in different species
- The codon bias of the organism being studied is therefore written into the ORF-scanning software
- e.g. in human genes, leucine is most frequently coded by CTG and is only rarely specified by TTA or CTA
- the frequency with which a particular organism uses each of the available codons in its genes
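Codon bias can be measured by counting codons in genes that are already known. The sketch below assumes a list of in-frame coding sequences as uppercase strings; `codon_usage` is an illustrative name and the input in the usage note is made up.

```python
from collections import Counter


def codon_usage(coding_sequences):
    """Return the relative frequency of each codon across in-frame coding sequences."""
    counts = Counter()
    for cds in coding_sequences:
        # Read each sequence codon by codon from its first base.
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: n / total for codon, n in counts.items()}
```

For the toy input `["ATGCTGCTGTAA"]`, CTG accounts for half of the four codons. A table built this way from real genes is the kind of information an ORF scanner could use to prefer ORFs whose codon usage resembles that of the organism.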
Module 5
consensus sequence
the sequence shows the most frequent nucleotide at each position in all of the upstream exon–intron boundaries that are known
- Exon–intron boundaries can be searched for, as these have distinctive sequence features, via _____ _____
- The sequence of the upstream exon–intron boundary is usually described as 5ʹ-AG↓GTAAGT-3ʹ with the arrow indicating the precise _____ _____.
- only the GT immediately after the arrow is invariable: elsewhere in the sequence, nucleotides other than the ones shown are quite often found
- The downstream intron–exon boundary is even less well defined: 5ʹ-PyPyPyPyPyPyNCAG↓-3ʹ, where Py and N mean
- consensus sequence
- boundary point
- Py is one of the pyrimidine nucleotides (T or C) and N is any nucleotide
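The two consensus sequences above can be turned into a simple search, sketched here for uppercase DNA. The donor search insists on the invariable GT and counts matches at the other positions. The `min_matches` threshold and the function names are assumptions for illustration only.

```python
import re

DONOR_CONSENSUS = "AGGTAAGT"  # 5'-AG|GTAAGT-3', boundary after the first AG
# Lookahead so that overlapping PyPyPyPyPyPyNCAG sites are all reported.
ACCEPTOR = re.compile(r"(?=([CT]{6}[ACGT]CAG))")


def donor_candidates(seq, min_matches=6):
    """Return (boundary_position, matches) for candidate upstream exon-intron boundaries."""
    hits = []
    for i in range(len(seq) - len(DONOR_CONSENSUS) + 1):
        window = seq[i:i + len(DONOR_CONSENSUS)]
        if window[2:4] != "GT":  # the GT after the boundary is invariable
            continue
        matches = sum(a == b for a, b in zip(window, DONOR_CONSENSUS))
        if matches >= min_matches:
            hits.append((i + 2, matches))
    return hits


def acceptor_candidates(seq):
    """Return positions just after each PyPyPyPyPyPyNCAG match (the boundary point)."""
    return [m.start() + 10 for m in ACCEPTOR.finditer(seq)]
```

Because only the GT is invariable, a relaxed threshold finds more true boundaries at the cost of more false ones, which is why consensus searches are usually combined with other evidence.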
Module 5
- Upstream regulatory sequences can be used to locate the regions where genes _____.
- These regulatory sequences, like exon–intron boundaries, have _____ sequence features that they possess in order to carry out their role as recognition signals for the DNA-binding proteins involved in gene expression.
- As with exon–intron boundaries, the regulatory sequences are ____, more so in eukaryotes than in prokaryotes, and in eukaryotes not all genes have the same collection of regulatory sequences. Using these to locate genes is therefore problematic
- begin
- distinctive
- variable
Module 5
CpG islands
- upstream of many genes
- sequences of approximately 1 kb in which the GC content is greater than the average for the genome as a whole
- Some 40–50% of human genes are associated with an upstream CpG island
- distinctive, and when one is located in vertebrate DNA, a strong assumption can be made that a gene begins in the region immediately downstream
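The GC-content description above suggests a sliding-window scan, sketched below. It uses GC content alone, as the card does. The window size, step, and `margin` above the whole-sequence average are assumed parameters, and `gc_rich_windows` is a hypothetical name.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_rich_windows(seq, window=1000, step=100, margin=0.10):
    """Return start positions of windows whose GC fraction exceeds the
    whole-sequence average by at least `margin`."""
    baseline = gc_fraction(seq)
    starts = []
    for i in range(0, max(len(seq) - window, 0) + 1, step):
        if gc_fraction(seq[i:i + window]) >= baseline + margin:
            starts.append(i)
    return starts
```

A window reported here would only be a candidate. The assumption that a gene begins immediately downstream still needs to be checked against other evidence.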
If the genome is completely unstudied, then the accuracy of gene prediction will be lower, even though most gene prediction software includes
a machine learning function, so the computer becomes trained to recognize appropriate patterns of codon usage as it gradually builds up the genome annotation.
Module 5
Genes for noncoding RNAs such as rRNA and tRNA do not comprise _____ ______ _____ and hence will not be located by codon bias, exon–intron boundaries, or upstream regulatory sequences
- open reading frames
Module 5
intramolecular base pairing
- pattern that can occur in single-stranded DNA or, more commonly, in RNA
- It occurs when two regions of the same strand, usually complementary in nucleotide sequence when read in opposite directions, base-pair to form a secondary structure
Module 5
Noncoding RNA molecules have distinctive features, which can be used as an aid in their discovery in a genome sequence. The most important of these features is the ability to
- fold into a secondary structure
Module 5
Other noncoding RNA genes are less easy to locate because the RNAs take up structures that involve relatively little base pairing or the base pairing is not in a regular pattern. Three scanning approaches are used for location of the genes for these RNAs:
- scan DNA sequences for stem-loops (hairpins) to identify regions where noncoding RNA genes might be present
- scan for regulatory sequences associated with genes for noncoding RNAs
- In compact genomes, attention is directed toward regions that remain after a comprehensive search for protein-coding genes. Often these empty spaces are not empty at all, and a careful examination will reveal the presence of one or more noncoding RNA genes
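The first approach, scanning for stem-loops, can be sketched as a search for inverted repeats: a short stretch followed, after a loop, by its reverse complement. This simplified version works on uppercase DNA, requires perfect base pairing, and ignores the G–U pairs that RNA can also form. The stem and loop sizes and the name `find_stem_loops` are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_stem_loops(seq, stem=6, min_loop=3, max_loop=8):
    """Return (start, stem_length, loop_length) for perfect inverted repeats."""
    hits = []
    for i in range(len(seq) - 2 * stem - min_loop + 1):
        left = seq[i:i + stem]
        # The right arm must be the reverse complement of the left arm
        # for the two regions of the same strand to base-pair.
        target = left.translate(COMPLEMENT)[::-1]
        for loop in range(min_loop, max_loop + 1):
            j = i + stem + loop
            if seq[j:j + stem] == target:
                hits.append((i, stem, loop))
    return hits
```

Hits from a scan like this only mark regions where a noncoding RNA gene might be present, since short inverted repeats also occur by chance.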
Module 5
homologous
- derived by descent
- having the same relation, relative position, or structure
- similarity due to shared ancestry between a pair of structures or genes in different taxa
Module 5
homology
sequence conservation
Module 5
homology search
- DNA databases are searched to compare the test sequence with genes that have already been sequenced
- If the test sequence is part of a gene that has already been sequenced by someone else, then an identical match will be found
- the intention is to determine whether an entirely new sequence is similar to any known genes
- looks for evidence that the test and matching sequences are homologous
- to assign functions to newly discovered genes
- central to gene prediction because it enables the authenticity of tentative exon sequences located by ORF scanning to be tested
- If a tentative exon sequence gives one or more positive matches in a homology search, then it is
- but if it gives no match then
- probably a real exon
- its authenticity must remain in doubt until it is assessed by one or other of the experiment-based genome annotation techniques.
Module 5
Limitations of homology analysis
- does not necessarily imply the same function
- may not reveal function, e.g. orphans (genes whose function is unknown)
- suggests related function
- provides a starting point for hypothesis and research
Module 5
- Because of natural selection, the sequence similarities between related genomes are greatest within the _____ and lowest in the _____ _____.
- Therefore, when related genomes are compared, homologous genes are easily identified because they have high sequence similarity, and any ORF that does not have a clear homolog in the second genome can be discounted as almost certainly being a chance sequence and not a genuine gene because
- genes
- intergenic regions
- they have no equivalents in the related genomes
Module 5
synteny
- the conservation of blocks of gene order within two sets of chromosomes that are being compared with each other
- gene order is conserved
- greater in more closely related species
- facilitates the identification of genes
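Conserved gene order can be illustrated with a small sketch that looks for runs of genes appearing consecutively, in the same order, in two genomes. It assumes each genome is given as an ordered list of gene names with homologs sharing a name, and it ignores inverted blocks; `syntenic_blocks` and the gene names in the usage note are hypothetical.

```python
def syntenic_blocks(order_a, order_b, min_genes=2):
    """Return runs of genes that are adjacent and in the same order in both lists."""
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    blocks, current = [], []
    for gene in order_a:
        extends = (
            gene in pos_b
            and current
            and pos_b[gene] == pos_b[current[-1]] + 1
        )
        if extends:
            current.append(gene)
        else:
            # Close the run in progress and start a new one if the gene is shared.
            if len(current) >= min_genes:
                blocks.append(current)
            current = [gene] if gene in pos_b else []
    if len(current) >= min_genes:
        blocks.append(current)
    return blocks
```

With `order_a = ["g1", "g2", "g3", "g4", "g5"]` and `order_b = ["g4", "g5", "x", "g1", "g2", "g3"]`, the function reports two blocks, `g1`–`g3` and `g4`–`g5`, even though the blocks themselves have been rearranged.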
Module 5
How can homology be used to identify genes
- Compare an ORF sequence with a nucleotide sequence database
- electronically translate the ORF and use the polypeptide sequence to search a protein sequence database
- Based on the concept that gene sequences are conserved, true exons will often have related sequences in a database
- As more genes are added to a database, it becomes more likely that a related sequence will be found
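The "electronic translation" step can be sketched with the standard genetic code. The table is built from the conventional TCAG ordering of codons. The sketch assumes an uppercase, in-frame ORF and marks unrecognised codons as X, and `translate` is an illustrative name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second, and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(orf):
    """Translate an in-frame ORF into a one-letter polypeptide sequence,
    stopping at the first termination codon."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3], "X")  # X marks codons with unexpected characters
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

The resulting polypeptide string is what would be submitted to a protein sequence database search. Searching at the protein level can pick up related sequences whose nucleotide sequences have diverged at synonymous positions.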
Module 5
homolog
- A gene related to a second gene by descent from a common ancestral DNA sequence
- A morphological structure in one species related to that in a second species by descent from a common ancestral structure.
- may apply to the relationship between genes separated by a speciation event (see ortholog) or to the relationship between genes separated by a gene duplication event (see paralog)