Finals Flashcards
Functional genomics
The functional annotation of genes is a large field that utilizes extensive experimentation to describe the function and interactions of genes and gene products.
For functional annotation, what are BLAST and the InterPro software framework based on?
Sequence similarity
Examples of functional classification schemes
Gene Ontology (GO)
Enzyme Commission (EC) Numbers
Kyoto Encyclopedia of Genes & Genomes (KEGG) BRITE
How many classification schemes have been devised for protein structures and what are they?
Three:
SCOP (Structural Classification of Proteins)
CATH (Class, Architecture, Topology, Homologous superfamily)
FSSP (Families of structurally similar proteins)
What is the success or reliability of functional prediction influenced by?
Accuracy of the alignment of homologous characters in two or more sequences
What is the twilight zone?
The range in which sequence similarity between two protein sequences is 15-25%; the prediction that the two proteins are homologous (evolutionarily related) is only 10% reliable
What percent identity might occur between two protein sequences longer than 100 amino acids simply by chance?
10-20%
What is the reliability of prediction that two protein sequences are homologous when the sequence identity is above 30%?
90%
What percentage of the amino acids in a sequence determines the protein fold, which in turn determines the general structure of the protein?
3-4%
What is the likely sequence similarity of proteins with similar structure?
> 33%
What is the midnight zone?
Sequence identity is very low <15%, sequences are so different that the relationship is nearly invisible at sequence level, but may adopt very similar 3D structure
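Note: the identity thresholds on these cards can be turned into a quick check. The sketch below is only an illustration; the function names are hypothetical, identity is computed over gap-free aligned columns (one of several possible conventions), and the 25-30% band is left unlabelled because the cards do not name it.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns, skipping columns with a gap ('-')."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_zone(identity: float) -> str:
    """Map a percent identity onto the zones named on the cards."""
    if identity < 15:
        return "midnight zone"
    if identity <= 25:
        return "twilight zone"
    if identity <= 30:
        return "between zones (not named on the cards)"
    return "homology prediction about 90% reliable"


print(identity_zone(percent_identity("MKT-LLV", "MRTALIV")))
```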
What percentage of gene annotations in public databases are incorrect or misleading?
5-63%
How are the errors in gene annotations in public databases propagated?
Via analyses of new genomes
Where do the errors in gene annotations arise from?
They originate from various sources including genome assembly and gene prediction.
Genome assembly: Erroneous or incomplete genome assembly - Truncated or chimeric genes
Genes and gene function prediction: Single nucleotide errors
Which databases are the best-curated for protein functional annotations and why?
RefSeq
UniProt/SwissProt
They require multiple lines of experimentally derived evidence
Which sequence databases are integrated in the InterPro framework?
HAMAP, Panther, PIRSF, TIGRFAM
Which method to predict signal peptide is integrated in the InterPro framework?
SignalP
Which method to predict transmembrane region is integrated in the InterPro framework?
TMHMM
Which fingerprint databases are integrated in the InterPro framework?
PRINTS
Which motif databases are integrated in the InterPro framework?
ProSite
Which domain databases are integrated in the InterPro framework?
Gene3D, Pfam, ProDom, ProSite (Profile), SMART, Superfamily
The sensitivity of BLAST is comparable to what algorithm?
Smith-Waterman
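Note: for reference, a minimal sketch of Smith-Waterman local alignment scoring. The match, mismatch, and linear gap values are illustrative assumptions (real protein searches use a substitution matrix such as BLOSUM and affine gap penalties), and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```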
How can BLAST recognize distant homologues?
An iterative algorithm using a position specific score matrix is devised and implemented in PSI-BLAST. A matrix is reconstructed for individual iterations using sequences from previous iterations.
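Note: a rough sketch of how a position-specific score matrix can be derived from the sequences collected in one iteration. It assumes a gap-free alignment, a uniform background, and a flat pseudocount; PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts, so this only illustrates the idea. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Log-odds score per column and residue from equal-length aligned sequences."""
    background = 1.0 / len(alphabet)  # assumption: uniform background frequencies
    pssm = []
    for col in range(len(aligned[0])):
        counts = {res: pseudocount for res in alphabet}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({res: math.log2((counts[res] / total) / background)
                     for res in alphabet})
    return pssm


def pssm_score(pssm, seq):
    """Sum of column scores for a sequence of the same length as the matrix."""
    return sum(column[res] for column, res in zip(pssm, seq))


hits = ["ACD", "ACE", "ACD"]          # toy hits from a previous iteration
matrix = build_pssm(hits)
print(pssm_score(matrix, "ACD") > pssm_score(matrix, "WWW"))
```

In an iterative search, the matrix would be rebuilt from the hits of each round and used as the query for the next.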
What could lead to an erroneous transfer of function in BLAST-based annotation methods?
Homologues may align only over a small portion of their overall lengths.
Homologue may have been wrongly annotated in the first place.
Issues with BLAST-based annotation methods
- Distant homologues
- Homologues may only align over a small portion of the overall lengths
- Misannotated homologues
What is commonly used to predict orthologous proteins from KEGG databases?
BLAST
What can misannotation of homologous proteins also lead to in the case of orthologs?
Orthologs are predicted from KEGG databases, and misannotation may lead to erroneous predictions of metabolic pathways and protein families
A server that incorporates a curated domain-family database
PfamA
A server that incorporates a computationally-generated domain-family database
PfamB
How does Pfam generate clusters of domain families?
Defined by the program ADDA
How are clusters of domain families formed in ADDA?
From pairwise comparisons of profiles of domains inferred by penalizing splits and partial overlaps in pairwise, BLAST-aligned, protein-similarity matrix
How does the SMART domain database work?
Simple Modular Architecture Research Tool requires manual intervention during annotation and is linked to a database called STRING (Search Tool for the Retrieval of Interacting Genes)
How does the ProDom domain database work?
Comparing the results from PSI-BLAST against the UniProtKB database and inferring domain information from the resultant data
Which domain databases does ProDom complement?
Pfam, ProSite, SMART
How does the SUPERFAMILY resource for domains work?
It uses the SCOP classification scheme for inferred protein-domain superfamilies and assigns gene ontology (GO) terms to these families using Gene Ontology annotation
How does the Gene3D resource for domains work?
It combines both structural (CATH classification scheme) and functional information to annotate domains found in sequences in the databases UniProtKB, RefSeq, Ensembl. It clusters annotated superfamilies into functional subfamilies using GeMMA.
How does the ProSite motif database work?
Recognizes protein motifs using regular expressions and weight matrix profiles, augmented by the annotation rule database ProRule.
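Note: the regular-expression side of ProSite can be illustrated by translating a pattern written in PROSITE syntax into a Python regular expression. This is a simplified sketch: the converter name is hypothetical, the example pattern is illustrative, and only the common elements are handled (`x`, `[...]`, `{...}`, repeat counts, and the `<` / `>` terminal anchors outside brackets).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (simplified)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                      # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
toy_protein = "MKCAACAAALAAAAAAAAHAAAHGG"   # made-up sequence for illustration
print(regex, bool(re.search(regex, toy_protein)))
```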
What does the implementation of ProRule in ProSite do?
It increases the reliability by imposing rules, such as essential amino acids in the active sites of enzymes.
What are some clustering methods and databases?
Methods: OrthoMCL, InParanoid, MultiParanoid
Databases: OrthoDB, Clusters of Orthologous Groups of Proteins
What are the clustering methods and databases based on?
They use all-versus-all similarity matrices, created from pairwise alignments of protein sequences using algorithms such as BLAST, FASTA, and Smith-Waterman
What is the largest, publicly available, all-versus-all protein sequence similarity score matrix called?
Similarity Matrix of Proteins (SIMAP)
What is SIMAP2 limited to?
Proteins encoded in complete genome sequences (Not publicly available)
In which database is SIMAP2 employed?
eggNOG: evolutionary genealogy of genes: Non-supervised Orthologous Groups
How are the proteins assembled in eggNOG?
They are assembled into in-paralogous (as opposed to out-paralogous and orthologous) groups by comparing sequence similarities within and among clades
How are orthologous groups found from eggNOG?
Orthologous groups amongst the in-paralogous groups in eggNOG are then identified by creating and merging reciprocal best hits among three species
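Note: a toy sketch of reciprocal best hits and of linking them into triangles across three species. The input format (a dict of scores keyed by query and subject) and all function names are assumptions for illustration; the cards do not describe how eggNOG merges the triangles, so that step is not shown.

```python
def best_hits(scores):
    """Best-scoring subject for each query; scores maps (query, subject) -> score."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return {(a, b) for a, b in best_ab.items() if best_ba.get(b) == a}


def rbh_triangles(rbh_ab, rbh_bc, rbh_ac):
    """Triples (a, b, c) in which every pair is a reciprocal best hit."""
    return {(a, b, c)
            for a, b in rbh_ab
            for b2, c in rbh_bc
            if b2 == b and (a, c) in rbh_ac}


a_vs_b = {("a1", "b1"): 100, ("a1", "b2"): 50, ("a2", "b2"): 80}
b_vs_a = {("b1", "a1"): 100, ("b2", "a2"): 80, ("b2", "a1"): 40}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```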
How can clustering methods be improved to predict orthologues and paralogues?
Amino acid substitution models like BLOSUM can be replaced with models that better estimate phylogenetic distances, such as JTT, WAG and by reconciliation of a deduced phylogenetic tree of individual genes to the phylogenetic tree of species.
In which methods or databases is the reconciliation of a deduced phylogenetic tree of individual genes to the phylogenetic tree of species accounted for?
SYNERGY
PhIG (Phylogenetically inferred groups)
TreeFam
PANTHER
What are the problems with databases with phylogenomic annotation algorithms?
They provide a compelling option to rapidly detect protein function, but they are limited in:
- Their coverage of species and proteins
- Positioning the query sequence in the databases' phylogenetic trees (which are constructed using substitution models and taxonomic information) by means of sequence similarity searches seems questionable
- Even perfect positioning does not guarantee the accurate prediction of function for the query protein sequence, because homologous proteins do not always have the same function
How do we annotate proteins based on structure?
We compare the predicted folds of the gene products against structurally similar proteins in databases such as protein data bank (PDB)
What are the limitations of annotating protein functions based on structure?
- Only 60% of structurally similar proteins without significant sequence similarity share a binding site location, thus the function inferred from this comparison may not always be correct.
- Moreover, functional knowledge about a lot of the 3D structures of proteins in PDB is lacking as structural genomics initiatives are only directed at determining the 3D structures through high-throughput structure determination efforts.
- In convergent evolution, the same function is observed even with different folds, thus preventing the use of structural homologues to infer a function
What should be done to increase the accuracy of structure-based function prediction and why?
Conserved amino acids in active and binding sites need to be evaluated. That is because, for enzymes, catalytic residues and their locations within the protein and orientation within the active sites are usually conserved and are not associated with structural variation, thereby allowing the functional annotation of distantly related homologues.
How to identify the conserved residues to improve the accuracy of structure-based function prediction?
The identification of conserved residues in protein families is through multiple sequence alignment
Where can the functional classification of proteins be evaluated?
In the annual Critical Assessment of Function Annotation (CAFA) challenge
When is promising annotation achieved?
When using machine learning and supervised classification methods, and unsupervised clustering methods
In what databases are the results for experimentally evaluated and computationally predicted protein-protein interaction networks and protein-protein complexes found?
DIP
STRING
Where can machine learning and supervised classification methods, with unsupervised clustering methods be applied?
They can be applied to predict individual features of proteins (domain boundaries, subcellular location, conserved residues), to collectively predict a function with data integrated from different sources (structure, taxonomy, sequence, transcription, metabolic and protein-protein interaction networks),
or to enhance an existing homology-based annotation.
What does gene prediction (also called structural annotation or gene finding) mean?
It aims to identify the structural elements in a genomic region that represent a gene.
What do extrinsic methods for gene prediction do?
They align transcriptomic, protein sequence, and/or other evidence datasets to the genomic sequence for gene prediction
What do intrinsic methods for gene prediction do?
They use statistical patterns to identify gene regions in a genomic sequence
What is the predicted gene element data typically represented by?
A unified general feature format (GFF)
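Note: a GFF3 record is one tab-separated line with nine columns (seqid, source, type, start, end, score, strand, phase, attributes). The parser below is a minimal sketch; the class name, function name, and the example record are hypothetical, and features such as URL-escaping or multi-valued attributes are ignored.

```python
from typing import Dict, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_gff3_line(line: str) -> GffFeature:
    """Parse one feature line of a GFF3 file (not a comment or directive line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GFF3 feature line needs exactly nine columns")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for item in attrs.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return GffFeature(seqid, source, ftype, int(start), int(end),
                      score, strand, phase, attributes)


# Made-up record for illustration only.
record = "scaffold_1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=hypothetical"
print(parse_gff3_line(record))
```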
What is a general pipeline for gene prediction and functional annotation?
RNA-Seq reads -> Transcriptome assembly -> Transcript sequences (Protein sequences + Genome scaffolds) -> Gene prediction -> Gene annotation (InterPro: Domains, motifs, signal peptides) -> Post-processing
For extrinsic methods, how are genes predicted?
Based on the alignment success
For accurately predicting a gene structure with extrinsic methods, what sequences are preferred?
cDNA sequences
What is native alignment in the context of aligning an evidence dataset to a genomic sequence?
The evidence dataset consists of mRNA sequences that are typically derived from the same species under investigation and therefore match the genome sequence
What is trans-alignment in the context of aligning an evidence dataset to a genomic sequence?
The evidence dataset consists of protein sequences from closely related species, which are not expected to match the conceptually translated genomic sequences
What are the challenges for extrinsic methods?
- Alignment inaccuracies
- Fragmented nature of evidence (mRNA or protein sequences) data
- Splice variants from genes
Why is the Exonerate algorithm widely used for alignment in extrinsic methods of gene prediction?
- Process data relatively rapidly
- Align both protein and nucleotide sequences
Which aligners align evidence data accurately across exons and introns?
Pair HMM aligners, such as Pairagon and GeneWise
What is the disadvantage of using Pair HMM aligners?
Large computational time
Examples of alignment algorithms that use BLAST to produce seed alignments which are then extended using different dynamic programming variants such as Needleman-Wunsch or Smith-Waterman algorithms
EST_GENOME, AAT, Exonerate
What are consensus based methods in intrinsic gene prediction?
Consensus-based methods, also known as signal sensors, predict known nucleotide patterns in gene elements. These methods look for specific, well-defined sequences that indicate important functional sites in DNA, such as splice sites, start and stop codons, and the Kozak consensus sequence (related to the initiation of translation)
What sites do consensus based methods look for?
Well-known patterns in gene elements, such as the Kozak consensus sequence, start and stop codons, and splice sites
Which methods are used to recognize the signals in consensus based methods in intrinsic gene prediction?
Methods utilizing the Weight Matrix Method (WMM), such as the Position Weight Matrix (PWM), Weight Array Model (WAM), Maximal Dependence Decomposition (MDD), and Windowed Weight Array Model (WWAM)
How does the weight matrix method (WMM) work?
Calculates the signal probability and assumes that individual nucleotides are independent
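Note: a minimal sketch of the independence assumption. Each position gets its own nucleotide distribution, and the probability of a candidate signal is the product of the per-position probabilities (summed here in log space). The training sites are made-up donor-like sequences, and the pseudocount and function names are assumptions.

```python
import math


def train_wmm(sites, alphabet="ACGT", pseudocount=1.0):
    """Per-position nucleotide probabilities from equal-length example sites."""
    model = []
    for pos in range(len(sites[0])):
        counts = {nuc: pseudocount for nuc in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({nuc: counts[nuc] / total for nuc in alphabet})
    return model


def wmm_log_prob(model, seq):
    """Log probability of seq, treating every position as independent."""
    return sum(math.log(column[nuc]) for column, nuc in zip(model, seq))


sites = ["GTAAGT", "GTGAGT", "GTAAGA"]   # hypothetical training signals
model = train_wmm(sites)
print(wmm_log_prob(model, "GTAAGT") > wmm_log_prob(model, "CCCCCC"))
```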
How does the weight array model (WAM) work?
Assumes dependencies between adjacent nucleotides
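Note: a sketch of the adjacent-dependency idea. The first position uses a plain distribution, and every later position is conditioned on the nucleotide immediately before it. The training sites, pseudocount, and function names are assumptions for illustration.

```python
import math


def train_wam(sites, alphabet="ACGT", pseudocount=1.0):
    """Return (first-position probabilities, per-position conditional tables)."""
    first = {nuc: pseudocount for nuc in alphabet}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {nuc: count / total for nuc, count in first.items()}

    conditionals = []
    for pos in range(1, len(sites[0])):
        table = {prev: {nuc: pseudocount for nuc in alphabet} for prev in alphabet}
        for site in sites:
            table[site[pos - 1]][site[pos]] += 1
        for prev in alphabet:
            row_total = sum(table[prev].values())
            table[prev] = {nuc: count / row_total for nuc, count in table[prev].items()}
        conditionals.append(table)
    return first, conditionals


def wam_log_prob(model, seq):
    """Log probability with each position conditioned on its left neighbour."""
    first, conditionals = model
    log_p = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        log_p += math.log(conditionals[pos - 1][seq[pos - 1]][seq[pos]])
    return log_p


model = train_wam(["GTAAGT", "GTGAGT", "GTAAGA"])   # hypothetical training signals
print(wam_log_prob(model, "GTAAGT") > wam_log_prob(model, "CCCCCC"))
```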
How does maximal dependence decomposition (MDD) work?
Implements a decision tree of weight matrix methods (WMMs) and extends the dependency considerations across non-adjacent nucleotides
How does windowed weight array model (WWAM) work?
Assumes dependencies across three consecutive nucleotides and averages related conditional probabilities among five consecutive nucleotides
What are non-consensus based methods in intrinsic gene prediction?
Use nucleotide composition (content) to recognize gene elements and sequence areas (coding and non coding regions)
What is the most successful discriminator between coding and non-coding regions when predicting nucleotide by nucleotide in non-consensus intrinsic gene prediction?
Hidden Markov Models using hexamer sequence composition
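Note: the hexamer idea can be sketched as a log-odds score comparing hexamer frequencies learned from coding and non-coding training sequences. This is a deliberate simplification: it scores overlapping hexamers directly instead of running a full HMM, and the training sequences, pseudocount, and function names are made up for illustration.

```python
import math
from collections import Counter


def train_hexamer_model(seqs, pseudocount=1.0):
    """Count overlapping hexamers; returns (counts, smoothed total, pseudocount)."""
    counts = Counter(seq[i:i + 6] for seq in seqs for i in range(len(seq) - 5))
    total = sum(counts.values()) + pseudocount * 4 ** 6
    return counts, total, pseudocount


def hexamer_prob(model, hexamer):
    counts, total, pseudocount = model
    return (counts[hexamer] + pseudocount) / total


def coding_log_odds(seq, coding_model, noncoding_model):
    """Positive values favour the coding model, negative the non-coding model."""
    return sum(
        math.log(hexamer_prob(coding_model, seq[i:i + 6])
                 / hexamer_prob(noncoding_model, seq[i:i + 6]))
        for i in range(len(seq) - 5)
    )


coding = train_hexamer_model(["ATGGCCGCCGCCGCCGCC"])       # toy "coding" set
noncoding = train_hexamer_model(["ATATATATATATATATAT"])    # toy "non-coding" set
print(coding_log_odds("GCCGCCGCC", coding, noncoding))
```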
To extend the prediction capability of the single-nucleotide approach (HMMs with hexamer sequence composition that discriminate between coding and non-coding regions) to versatile gene elements or even complete gene structures, how are the prediction algorithms enhanced?
Three-period, fifth-order generalized HMMs (GHMMs):
Hexamer sequences are used + Together with the built-in knowledge of codon structure to ensure the preservation of a reading frame
Examples of programs using GHMM based three-period fifth-order Markov Chain model
GENSCAN, GeneMark-ES
Which Markov models are used to further improve predictions from GHMM based Markov Chain models?
Interpolated Markov Models (IMM) in which Markov models of different order are interpolated
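Note: a sketch of interpolation across model orders. The probability of the next nucleotide is a weighted mixture of estimates from order 0 up to the maximum order. Fixed weights are an assumption made here for brevity; actual gene finders typically choose the weights per context from the amount of training data. Function names and the toy sequence are hypothetical.

```python
from collections import Counter


def train_counts(seqs, max_order):
    """counts[k][(context, nucleotide)] for every order k from 0 to max_order."""
    counts = [Counter() for _ in range(max_order + 1)]
    for seq in seqs:
        for i, nuc in enumerate(seq):
            for k in range(min(i, max_order) + 1):
                counts[k][(seq[i - k:i], nuc)] += 1
    return counts


def order_k_prob(counts, k, context, nuc, alphabet="ACGT", pseudocount=1.0):
    """Smoothed estimate of P(nuc | last k nucleotides of context)."""
    if len(context) < k:
        raise ValueError("context is shorter than the model order")
    ctx = context[len(context) - k:]
    total = sum(counts[k][(ctx, n)] for n in alphabet)
    return (counts[k][(ctx, nuc)] + pseudocount) / (total + pseudocount * len(alphabet))


def interpolated_prob(counts, weights, context, nuc):
    """Mixture of the order-0..order-K estimates; weights should sum to 1."""
    return sum(w * order_k_prob(counts, k, context, nuc)
               for k, w in enumerate(weights))


counts = train_counts(["ACGACGACGACG"], max_order=2)
weights = [0.2, 0.3, 0.5]   # illustrative fixed weights for orders 0, 1, 2
print(interpolated_prob(counts, weights, "AC", "G"))
```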
Which gene finders implement interpolated Markov models?
AUGUSTUS, GlimmerHMM
How have ab initio prediction algorithms been enhanced?
They have been enhanced using information from syntenic (=colocalized) regions among multiple genomes. It is advisable to employ genomes from taxonomically closely related species.
How to create functional prediction models for Ab initio gene prediction?
Ab initio predictors have to be trained with reliable training datasets, which are specific to each genome
What to do if training data is not available for a specific genome while creating a functional prediction model for Ab initio gene prediction?
Parameter values for prediction models can be estimated by predicting genes first using suboptimal parameter values, and then by recalculating new values based on these predicted genes
What are suboptimal parameter values?
Copied from prediction models for closely related species;
Inferred from the structure of core eukaryotic genes
Obtained from unsupervised gene prediction programs (Such as GeneMark)
What was the first attempt to combine the prediction data from multiple sources?
Using the program COMBINER (Linear and statistical combinations of the prediction data from multiple sources)
What is the successor of COMBINER and what are its ab initio and evidence based on? How does it work?
JIGSAW.
ab initio: Internal support with GHMMs
evidence: Expresses external evidence of structural elements of a gene using feature vectors.
Feature vectors give a weighting coefficient to each prediction source, and dynamic programming (combined with decision trees) is used to establish optimal gene structures
Ensembl
Combined gene prediction program
Prefers evidence based over ab initio
High quality annotations at the cost of sensitivity
EVIGAN
Combined gene prediction program
Predicts gene structures using Dynamic Bayes networks
Estimated with Maximum Likelihood
GLEAN
Combined gene prediction program
Uses latent class analysis (LCA) algorithm to give consensus predictions
Gene structures are predicted from gene structural elements
MAKER2
Combined gene prediction program
Uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
What is the advantage of using MAKER2?
It can estimate the reliability of any prediction as it uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
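Note: a sketch of an AED calculation, assuming the commonly cited nucleotide-level definition AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases supported by evidence. That formula is not stated on these cards, so treat it as an assumption. Intervals are inclusive (start, end) pairs and the function names are hypothetical.

```python
def covered_positions(intervals):
    """Set of base positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(predicted_exons, evidence_exons):
    """0.0 means the prediction matches the evidence exactly; 1.0 means no support."""
    predicted = covered_positions(predicted_exons)
    evidence = covered_positions(evidence_exons)
    if not predicted or not evidence:
        return 1.0
    overlap = len(predicted & evidence)
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(predicted)
    return 1.0 - (sensitivity + specificity) / 2.0


print(annotation_edit_distance([(1, 100)], [(51, 150)]))
```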
EVM
Combined gene prediction program
Accommodates variable sets of gene prediction and evidence data, and allows manual weight adjustment for each data source
What are orthologues and paralogues inferred by in InParanoid and MultiParanoid?
Pairwise reciprocal best hits
What pairwise similarity matrix does MultiParanoid use?
InParanoid
What pairwise similarity matrix does InParanoid use?
BLAST based
How are orthologues and paralogues inferred in OrthoMCL?
Markov Clustering Algorithm
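Note: a compact sketch of the Markov clustering idea: alternate expansion (squaring a column-stochastic matrix, which simulates random walks) with inflation (raising entries to a power and renormalizing, which strengthens strong links). The inflation value, iteration count, self-loops, and cluster read-out are illustrative assumptions, and OrthoMCL's preprocessing of BLAST scores is not shown.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (columns of zeros stay zero)."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Cluster nodes of a symmetric similarity matrix; returns a set of frozensets."""
    n = len(similarity)
    # Self-loops are added so that every node keeps some probability mass.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = normalize_columns([[v ** inflation for v in row] for row in m])  # inflation
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return clusters


toy = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
print(markov_cluster(toy))
```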