Finals Flashcards

Question 1

Q

Functional genomics

Answer

A

The functional annotation of genes is a large field that uses extensive experimentation to describe the functions and interactions of genes and gene products

Question 2

Q

For functional annotation, what are BLAST and the InterPro software framework based on?

Answer

A

Sequence similarity

Question 3

Q

Examples of functional classification schemes

Answer

A

Gene Ontology (GO)
Enzyme Commission (EC) Numbers
Kyoto Encyclopedia of Genes & Genomes (KEGG) BRITE

Question 4

Q

How many classification schemes have been devised for protein structures and what are they?

Answer

A

Three:

SCOP (Structural Classification of Proteins)
CATH (Class, Architecture, Topology, Homologous superfamily)
FSSP (Families of structurally similar proteins)

Question 5

Q

What is the success or reliability of functional prediction influenced by?

Answer

A

Accuracy of the alignment of homologous characters in two or more sequences

Question 6

Q

What is the twilight zone?

Answer

A

The range in which sequence similarity between two protein sequences is 15-25%; here the reliability of the prediction that two proteins are homologous (evolutionarily related) is only 10%

Question 7

Q

What is the percent identity that might occur between two protein sequences of longer than 100 amino acids simply by chance?

Question 8

Q

What is the reliability of prediction that two protein sequences are homologous when the sequence identity is above 30%?

Question 9

Q

What percentage of the amino acids in a sequence determines the protein fold, which in turn determines the general structure of a protein?

Question 10

Q

What is the likely sequence similarity of proteins with similar structure?

Question 11

Q

What is the midnight zone?

Answer

A

Sequence identity is very low (<15%); the sequences are so different that the relationship is nearly invisible at the sequence level, but the proteins may still adopt very similar 3D structures
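
Worked example (a minimal sketch, not part of the original cards): the function below computes percent identity from two already-aligned sequences and labels it with the cutoffs quoted on these cards. It assumes the 15-25% twilight range can be read as percent identity; the function names and the "above twilight zone" label are illustrative only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # gap columns are not counted
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_zone(pct: float) -> str:
    """Label a percent identity using the cutoffs from the cards (assumed to apply to identity)."""
    if pct < 15:
        return "midnight zone"
    if pct <= 25:
        return "twilight zone"
    return "above twilight zone"


if __name__ == "__main__":
    pct = percent_identity("MKT-AYIAK", "MKTLAYLGK")
    print(round(pct, 1), identity_zone(pct))
```

The choice to skip gap columns is one common convention; other definitions divide by the full alignment length or by the shorter sequence, which changes the number.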

Question 12

Q

What percentage of gene annotations in public databases are incorrect or misleading?

Question 13

Q

How are the errors in gene annotations in public databases propagated?

Answer

A

Via analyses of new genomes

Question 14

Q

Where do the errors in gene annotations arise from?

Answer

A

They originate from various sources including genome assembly and gene prediction.

Genome assembly: Erroneous or incomplete genome assembly, leading to truncated or chimeric genes
Genes and gene function prediction: Single nucleotide errors

Question 15

Q

Which databases are the best-curated for protein functional annotations and why?

Answer

A

RefSeq
UniProt/SwissProt

They require multiple lines of experimentally derived evidence

Question 16

Q

Which sequence databases are integrated in the InterPro framework?

Answer

A

HAMAP, Panther, PIRSF, TIGRFAM

Question 17

Q

Which method for predicting signal peptides is integrated in the InterPro framework?

Question 18

Q

Which method for predicting transmembrane regions is integrated in the InterPro framework?

Question 19

Q

Which fingerprint databases are integrated in the InterPro framework?

Question 20

Q

Which motif databases are integrated in the InterPro framework?

Question 21

Q

Which domain databases are integrated in the InterPro framework?

Answer

A

Gene3D, Pfam, ProDom, ProSite (Profile), SMART, Superfamily

Question 22

Q

The sensitivity of BLAST is comparable to what algorithm?

Answer

A

Smith-Waterman
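
Worked example (a sketch, not from the original cards): Smith-Waterman is the exact dynamic-programming method for local alignment, whereas BLAST is a heuristic. The function below returns only the best local alignment score. The scoring values (match +2, mismatch -1, linear gap -1) are arbitrary illustrative choices; real protein searches use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # The 0 lets an alignment restart anywhere, which makes it local.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```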

Question 23

Q

How can BLAST recognize distant homologues?

Answer

A

An iterative algorithm using a position specific score matrix is devised and implemented in PSI-BLAST. A matrix is reconstructed for individual iterations using sequences from previous iterations.
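
Worked example (a toy sketch, not PSI-BLAST itself): the code below builds a position-specific score matrix from ungapped, equal-length sequences and rebuilds it each round from the sequences accepted in the previous round. The uniform background, the pseudocount, the score threshold and all function names are assumptions for illustration; the real program uses gapped alignments, sequence weighting and E-value cutoffs.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Log-odds score per column and residue against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_seqs) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(len(aligned_seqs[0])):
        counts = Counter(seq[col] for seq in aligned_seqs)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum of column scores for an ungapped segment of the same length."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


def iterate_profile(query, database, threshold, rounds=3):
    """Rebuild the matrix each round from the query plus the previous round's hits."""
    members = [query]
    for _ in range(rounds):
        pssm = build_pssm(members)
        hits = [s for s in database
                if len(s) == len(query) and score(pssm, s) >= threshold]
        new_members = [query] + hits
        if set(new_members) == set(members):
            break  # converged: no change in the accepted set
        members = new_members
    return members


if __name__ == "__main__":
    print(iterate_profile("ACD", ["ACE", "WWW"], threshold=0.0))
```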

Question 24

Q

What could lead to an erroneous transfer of function in BLAST-based annotation methods?

Answer

A

Homologues may align only over a small portion of their overall lengths.

The homologue may have been wrongly annotated in the first place.

Question 25

Q

Issues with BLAST-based annotation methods

Answer

A

Distant homologues
Homologues may only align over a small portion of the overall lengths
Misannotated homologues

Question 26

Q

What is commonly used to predict orthologous proteins from KEGG databases?

Question 27

Q

What can misannotation of homologous proteins also lead to in the case of orthologs?

Answer

A

Orthologs are predicted from KEGG databases, and misannotation may lead to erroneous predictions of metabolic pathways and protein families

Question 28

Q

A server that incorporates a curated domain-family database

Question 29

Q

A server that incorporates a computationally-generated domain-family database

Question 30

Q

How does Pfam generate clusters of domain families?

Answer

A

Defined by the program ADDA

Question 31

Q

How are clusters of domain families formed in ADDA?

Answer

A

From pairwise comparisons of profiles of domains, which are inferred by penalizing splits and partial overlaps in a pairwise, BLAST-aligned protein-similarity matrix

Question 32

Q

How does the SMART domain database work?

Answer

A

SMART (Simple Modular Architecture Research Tool) requires manual intervention during annotation and is linked to a database called STRING (Search Tool for the Retrieval of Interacting Genes)

Question 33

Q

How does the ProDom domain database work?

Answer

A

It compares the results of PSI-BLAST searches against the UniProtKB database and infers domain information from the resultant data

Question 34

Q

Which domain databases does ProDom complement?

Answer

A

Pfam, ProSite, SMART

Question 35

Q

How does the SUPERFAMILY resource for domains work?

Answer

A

It uses the SCOP classification scheme for inferred protein-domain superfamilies and assigns gene ontology (GO) terms to these families using Gene Ontology annotation

Question 36

Q

How does the Gene3D resource for domains work?

Answer

A

It combines both structural (CATH classification scheme) and functional information to annotate domains found in sequences in the databases UniProtKB, RefSeq, Ensembl. It clusters annotated superfamilies into functional subfamilies using GeMMA.

Question 37

Q

How does the ProSite motif database work?

Answer

A

Recognizes protein motifs using regular expressions and weight matrix profiles, augmented by the annotation rule database ProRule.
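
Worked example (a minimal sketch): PROSITE patterns map closely onto regular expressions. The converter below handles the common syntax elements (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends). The function names are illustrative, and rarer cases such as anchors inside brackets are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern string into a Python regular expression."""
    regex = pattern.rstrip(".")           # patterns end with a period
    regex = regex.replace("-", "")        # '-' only separates elements
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")       # any residue
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex


def find_motifs(sequence: str, pattern: str):
    """0-based start positions of non-overlapping matches."""
    return [m.start() for m in re.finditer(prosite_to_regex(pattern), sequence)]


if __name__ == "__main__":
    # N-glycosylation site pattern, written in PROSITE syntax
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_motifs("MKNGSAAN", "N-{P}-[ST]-{P}."))
```

A regular expression only reproduces the pattern part; the weight matrix profiles and the ProRule annotation rules mentioned on the card are not covered by this sketch.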

Question 38

Q

What does the implementation of ProRule in ProSite do?

Answer

A

It increases the reliability by imposing rules, such as essential amino acids in the active sites of enzymes.

Question 39

Q

What are some clustering methods and databases?

Answer

A

Methods: OrthoMCL, InParanoid, MultiParanoid
Databases: OrthoDB, Clusters of Orthologous Groups of Proteins

Question 40

Q

What are the clustering methods and databases based on?

Answer

A

They use all-versus-all similarity matrices, created from pairwise alignments of protein sequences using algorithms such as BLAST, FASTA, and Smith-Waterman

Question 41

Q

What is the largest, publicly available, all-versus-all protein sequence similarity score matrix called?

Answer

A

Similarity Matrix of Proteins (SIMAP)

Question 42

Q

What is SIMAP2 limited to?

Answer

A

Proteins encoded in complete genome sequences (Not publicly available)

Question 43

Q

In which database is SIMAP2 employed?

Answer

A

eggNOG: evolutionary genealogy of genes: Non-supervised Orthologous Groups

Question 44

Q

How are the proteins assembled in eggNOG?

Answer

A

They are assembled into in-paralogous (as opposed to out-paralogous and orthologous) groups by comparing sequence similarities within and among clades

Question 45

Q

How are orthologous groups found from eggNOG?

Answer

A

Orthologous groups amongst the in-paralogous groups in eggNOG are then identified by creating and merging reciprocal best hits among three species
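
Worked example (a simplified sketch): the code below finds reciprocal best hits between two species and then keeps only the triples that are reciprocal best hits across all three species pairs. It works on individual proteins rather than on in-paralogous groups, and the score dictionaries and function names are hypothetical, so it illustrates the idea rather than the eggNOG procedure itself.

```python
def best_hits(scores):
    """scores: {(query, subject): similarity}. Returns the best subject per query."""
    best = {}
    for (query, subject), value in scores.items():
        if query not in best or value > best[query][1]:
            best[query] = (subject, value)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def rbh_triangles(rbh_ab, rbh_bc, rbh_ac):
    """Triples (a, b, c) supported by reciprocal best hits in all three species pairs."""
    return {
        (a, b, c)
        for (a, b) in rbh_ab
        for (b2, c) in rbh_bc
        if b2 == b and (a, c) in rbh_ac
    }


if __name__ == "__main__":
    a_vs_b = {("a1", "b1"): 90, ("a1", "b2"): 40, ("a2", "b2"): 80}
    b_vs_a = {("b1", "a1"): 88, ("b2", "a2"): 79, ("b2", "a1"): 30}
    print(sorted(reciprocal_best_hits(a_vs_b, b_vs_a)))
```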

Question 46

Q

How can clustering methods be improved to predict orthologues and paralogues?

Answer

A

Amino acid substitution models like BLOSUM can be replaced with models that better estimate phylogenetic distances, such as JTT and WAG, and the deduced phylogenetic tree of individual genes can be reconciled with the phylogenetic tree of species.

Question 47

Q

In which methods or databases is the reconciliation of a deduced phylogenetic tree of individual genes to the phylogenetic tree of species accounted for?

Answer

A

SYNERGY
PhIG (Phylogenetically inferred groups)
TreeFam
PANTHER

Question 48

Q

What are the problems with databases with phylogenomic annotation algorithms?

Answer

A

They provide a compelling option to rapidly detect protein function, but they are limited in:

Their coverage of species and proteins
Using sequence similarity searches to position the query sequence in the databases' phylogenetic trees, which are constructed using substitution models and taxonomic information, seems questionable
Even perfect positioning does not guarantee the accurate prediction of function for the query protein sequence, because homologous proteins do not always have the same function

Question 49

Q

How do we annotate proteins based on structure?

Answer

A

We compare the predicted folds of the gene products against structurally similar proteins in databases such as the Protein Data Bank (PDB)

Question 50

Q

What are the limitations of annotating protein functions based on structure?

Answer

A

Only 60% of structurally similar proteins without significant sequence similarity share a binding site location, thus the function inferred from this comparison may not always be correct.
Moreover, functional knowledge about a lot of the 3D structures of proteins in PDB is lacking as structural genomics initiatives are only directed at determining the 3D structures through high-throughput structure determination efforts.
In convergent evolution, the same function is observed even with different folds, thus preventing the use of structural homologues to infer a function

Question 51

Q

What should be done to increase the accuracy of structure-based function prediction and why?

Answer

A

Conserved amino acids in active and binding sites need to be evaluated. That is because, for enzymes, catalytic residues and their locations within the protein and orientation within the active sites are usually conserved and are not associated with structural variation, thereby allowing the functional annotation of distantly related homologues.

Question 52

Q

How are the conserved residues identified to improve the accuracy of structure-based function prediction?

Answer

A

Conserved residues in protein families are identified through multiple sequence alignment

Question 53

Q

Where can the functional classification of proteins be evaluated?

Answer

A

In the annual Critical Assessment of Function Annotation (CAFA) challenge

Question 54

Q

When is promising annotation achieved?

Answer

A

When using machine learning and supervised classification methods, and unsupervised clustering methods

Question 55

Q

In what databases are the results for experimentally evaluated and computationally predicted protein-protein interaction networks and protein-protein complexes found?

Answer

A

DIP
STRING

Question 56

Q

Where can machine learning and supervised classification methods, with unsupervised clustering methods be applied?

Answer

A

They can be applied to predict individual features of proteins (domain boundaries, subcellular location, conserved residues), to collectively predict a function from data integrated from different sources (structure, taxonomy, sequence, transcription, metabolic and protein-protein interaction networks), or to enhance an existing homology-based annotation.

Question 57

Q

What does gene prediction or structural annotation or gene finding mean?

Answer

A

Aims to identify structural elements in a genomic region that represent a gene.

Question 58

Q

What do extrinsic methods for gene prediction do?

Answer

A

They align transcriptomic, protein sequence, and/or other evidence datasets to the genomic sequence for gene prediction

Question 59

Q

What do intrinsic methods for gene prediction do?

Answer

A

They use statistical patterns to identify gene regions in a genomic sequence

Question 60

Q

What is the predicted gene element data typically represented by?

Answer

A

A unified general feature format (GFF)
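
Worked example (a minimal sketch, assuming the GFF3 layout of nine tab-separated columns): the parser below turns one feature line into a small record. The class and function names are illustrative, and the sample line is made up.

```python
from typing import NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    type: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str   # "." when absent
    strand: str  # "+", "-" or "."
    phase: str   # "0", "1", "2" for CDS features, otherwise "."
    attributes: dict


def parse_gff3_line(line: str) -> Optional[GffFeature]:
    """Parse one GFF3 feature line; returns None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, found {len(cols)}")
    attributes = dict(
        item.split("=", 1) for item in cols[8].split(";") if "=" in item
    )
    return GffFeature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                      cols[5], cols[6], cols[7], attributes)


if __name__ == "__main__":
    example = "scaffold_1\tAUGUSTUS\tgene\t1000\t2500\t.\t+\t.\tID=gene1;Name=hypothetical"
    print(parse_gff3_line(example))
```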

Question 61

Q

What is a general pipeline for gene prediction and functional annotation?

Answer

A

RNA-Seq reads -> Transcriptome assembly -> Transcript sequences (Protein sequences + Genome scaffolds) -> Gene prediction -> Gene annotation (InterPro: Domains, motifs, signal peptides) -> Post-processing

Question 62

Q

For extrinsic methods, how are genes predicted?

Answer

A

Based on the alignment success

Question 63

Q

For accurately predicting a gene structure with extrinsic methods, what sequences are preferred?

Answer

A

cDNA sequences

Question 64

Q

What is native alignment in the context of aligning an evidence dataset to a genomic sequence?

Answer

A

mRNA sequences used as the evidence dataset are typically derived from the same species under investigation and match the genome sequence

Answer 53

A

Protein sequences used as the evidence dataset are from closely related species and are not expected to match the conceptually translated genomic sequences

Answer 54

A

Alignment inaccuracies
Fragmented nature of evidence (mRNA or protein sequences) data
Splice variants from genes

Answer 55

A

Process data relatively rapidly
Align both protein and nucleotide sequences

Answer 56

A

Pair HMM aligners, such as Pairagon and GeneWise

Answer 57

A

Large computational time

Answer 58

A

EST_GENOME, AAT, Exonerate

Answer 59

A

Consensus-based methods, also known as signal sensors, predict known nucleotide patterns in gene elements. These methods look for specific, well-defined sequences that indicate important functional sites in DNA, such as splice sites, start and stop codons, and the Kozak consensus sequence (related to the initiation of translation)

Answer 60

A

Well-known patterns in gene elements, such as the Kozak consensus sequence, start and stop codons, and splice sites

Answer 61

A

Methods utilizing the Weight Matrix Method (WMM), such as the Position Weight Matrix (PWM), Weight Array Model (WAM), Maximal Dependence Decomposition (MDD), and Windowed Weight Array Model (WWAM)

Answer 62

A

Calculates the signal probability and assumes that individual nucleotides are independent

Answer 63

A

Assumes dependencies between adjacent nucleotides
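
Worked example (a sketch under stated assumptions): the two assumptions on the cards above, independent positions versus dependence on the adjacent nucleotide, correspond to what is commonly called a weight matrix model and a weight array model. The code estimates both from a handful of made-up donor-site-like strings and compares the resulting signal probabilities. The training sites, the pseudocount and the function names are illustrative only.

```python
from collections import Counter

NUCS = "ACGT"


def train_wmm(sites, pseudo=1.0):
    """P(nucleotide at position i), each position treated as independent."""
    total = len(sites) + pseudo * len(NUCS)
    model = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        model.append({n: (counts[n] + pseudo) / total for n in NUCS})
    return model


def wmm_prob(wmm, seq):
    """Signal probability as a product of per-position probabilities."""
    p = 1.0
    for column, nucleotide in zip(wmm, seq):
        p *= column[nucleotide]
    return p


def train_wam(sites, pseudo=1.0):
    """P(nucleotide at i | nucleotide at i-1) for every position i >= 1."""
    model = []
    for i in range(1, len(sites[0])):
        pair_counts = Counter((site[i - 1], site[i]) for site in sites)
        prev_counts = Counter(site[i - 1] for site in sites)
        model.append({
            (prev, n): (pair_counts[(prev, n)] + pseudo)
                       / (prev_counts[prev] + pseudo * len(NUCS))
            for prev in NUCS for n in NUCS
        })
    return model


def wam_prob(wmm, wam, seq):
    """First position from the independent model, the rest conditioned on the neighbor."""
    p = wmm[0][seq[0]]
    for i in range(1, len(seq)):
        p *= wam[i - 1][(seq[i - 1], seq[i])]
    return p


if __name__ == "__main__":
    sites = ["GTAAGT", "GTGAGT", "GTAAGT", "GTAAGA"]  # made-up training sites
    wmm, wam = train_wmm(sites), train_wam(sites)
    print(wmm_prob(wmm, "GTAAGT"), wam_prob(wmm, wam, "GTAAGT"))
```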

Answer 64

A

Implements a decision tree of the weight matrix method (WMM) and extends the dependency considerations across non-adjacent nucleotides

Answer 65

A

Assumes dependencies across three consecutive nucleotides and averages related conditional probabilities among five consecutive nucleotides

Answer 66

A

Use nucleotide composition (content) to recognize gene elements and sequence areas (coding and non coding regions)

Answer 67

A

Hidden Markov Models using hexamer sequence composition

Answer 68

A

Three-period, fifth-order generalized HMMs (GHMMs):

Hexamer sequences are used together with built-in knowledge of codon structure to ensure the preservation of a reading frame
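
Worked example (a sketch of the content-scoring part only): a fifth-order Markov chain predicts each nucleotide from the previous five, so its parameters come from hexamer counts, and "three-period" means a separate set of counts for each codon position. The code below trains such counts on in-frame coding sequences and scores a query in a chosen frame. It does not model states, lengths or signals as a full GHMM would, and the training sequence and function names are made up.

```python
import math
from collections import Counter


def train_three_periodic(coding_seqs, order=5):
    """Count hexamers and their 5-mer contexts separately per codon position.

    Training sequences are assumed to start at the first base of a codon.
    """
    kmer_counts = [Counter(), Counter(), Counter()]
    context_counts = [Counter(), Counter(), Counter()]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            phase = i % 3  # codon position of the emitted nucleotide
            kmer_counts[phase][seq[i - order:i + 1]] += 1
            context_counts[phase][seq[i - order:i]] += 1
    return kmer_counts, context_counts


def log_likelihood(seq, model, frame=0, order=5, pseudo=1.0):
    """Log-likelihood of seq, where seq[0] sits at codon position `frame`."""
    kmer_counts, context_counts = model
    total = 0.0
    for i in range(order, len(seq)):
        phase = (i + frame) % 3
        numerator = kmer_counts[phase][seq[i - order:i + 1]] + pseudo
        denominator = context_counts[phase][seq[i - order:i]] + 4 * pseudo
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    model = train_three_periodic(["ATG" + "GCT" * 7 + "TAA"])
    query = "GCTGCTGCTGCT"
    print(log_likelihood(query, model, frame=0), log_likelihood(query, model, frame=1))
```

Scoring the same query in each of the three frames and keeping the highest value is one simple way to see how the codon-position counts help preserve the reading frame.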

Answer 69

A

GENSCAN, GeneMark-ES

Answer 70

A

Interpolated Markov Models (IMM) in which Markov models of different order are interpolated

Answer 71

A

AUGUSTUS, GlimmerHMM

Answer 72

A

It has been enhanced using information from syntenic (=colocalized) regions among multiple genomes. It is advisable to employ genomes from taxonomically closely related species.

Answer 73

A

Ab initio predictors have to be trained with reliable training datasets, which are specific to each genome

Answer 74

A

Parameter values for prediction models can be estimated by predicting genes first using suboptimal parameter values, and then by recalculating new values based on these predicted genes

Answer 75

A

Copied from prediction models for closely related species
Inferred from the structure of core eukaryotic genes
Obtained from unsupervised gene prediction programs (Such as GeneMark)

Answer 76

A

Using the program COMBINER (Linear and statistical combinations of the prediction data from multiple sources)

Answer 77

A

JIGSAW.
ab initio: Internal support with GHMMs
evidence: Expresses external evidence of structural elements of a gene using feature vectors.

Feature vectors give a weighting coefficient to each prediction source, and dynamic programming (combined with decision trees) is used to establish optimal gene structures

Answer 78

A

Combined gene prediction program
Prefers evidence based over ab initio
High quality annotations at the cost of sensitivity

Answer 79

A

Combined gene prediction program
Predicts gene structures using Dynamic Bayes networks
Estimated with Maximum Likelihood

Answer 80

A

Combined gene prediction program
Uses latent class analysis (LCA) algorithm to give consensus predictions
Gene structures are predicted from gene structural elements

Answer 81

A

Combined gene prediction program
Uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction

Answer 82

A

It can estimate the reliability of any prediction as it uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
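
Worked example (a sketch based on the commonly cited definition, which is an assumption here because the cards do not give the formula): AED is usually written as 1 - (SN + SP) / 2, where SN is the fraction of evidence nucleotides covered by the prediction and SP is the fraction of predicted nucleotides supported by evidence. A value of 0 means perfect agreement and 1 means no overlap. Function names and coordinates are illustrative.

```python
def covered_positions(intervals):
    """Set of positions covered by 1-based, inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(prediction, evidence):
    """AED = 1 - (sensitivity + specificity) / 2 at the nucleotide level."""
    predicted = covered_positions(prediction)
    supported = covered_positions(evidence)
    if not predicted or not supported:
        return 1.0  # nothing to compare: treat as no support
    overlap = len(predicted & supported)
    sensitivity = overlap / len(supported)
    specificity = overlap / len(predicted)
    return 1.0 - (sensitivity + specificity) / 2.0


if __name__ == "__main__":
    print(annotation_edit_distance([(1, 100)], [(51, 150)]))
```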

Answer 83

A

Combined gene prediction program
Accommodates the use of a variety of gene prediction and evidence data, and allows for manual weight adjustment of each data source
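
Worked example (a toy sketch, not the algorithm of any specific combiner): one simple way to picture manually weighted data sources is weighted per-base voting, where each source adds its weight to the positions it calls exonic and positions above a support threshold are kept. The source names, weights and threshold below are hypothetical, and exons within one source are assumed not to overlap.

```python
def weighted_consensus(predictions, weights, length, threshold=0.5):
    """Consensus exon intervals from weighted per-base votes.

    predictions: {source: [(start, end), ...]} with 1-based, inclusive intervals.
    weights: {source: weight}, set by hand.
    """
    total_weight = sum(weights[source] for source in predictions)
    support = [0.0] * (length + 1)  # index 0 unused
    for source, exons in predictions.items():
        for start, end in exons:
            for pos in range(start, end + 1):
                support[pos] += weights[source]

    consensus = []
    run_start = None
    for pos in range(1, length + 1):
        called = support[pos] / total_weight >= threshold
        if called and run_start is None:
            run_start = pos                      # a supported run begins
        elif not called and run_start is not None:
            consensus.append((run_start, pos - 1))  # the run ended at pos - 1
            run_start = None
    if run_start is not None:
        consensus.append((run_start, length))
    return consensus


if __name__ == "__main__":
    predictions = {"abinitio": [(10, 20)], "protein": [(15, 25)], "rnaseq": [(15, 20)]}
    weights = {"abinitio": 1, "protein": 2, "rnaseq": 5}
    print(weighted_consensus(predictions, weights, length=30))
```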

Answer 84

A

Pairwise reciprocal

Answer 85

A

InParanoid

Answer 86

A

BLAST based

Answer 87

A

Markov Clustering Algorithm
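
Worked example (a minimal pure-Python sketch): Markov clustering alternates expansion (squaring a column-stochastic matrix, which simulates random walks) with inflation (raising entries to a power and renormalizing, which strengthens strong connections). The inflation value, iteration count, cutoff and function names are illustrative choices, and real tools work on sparse similarity graphs built from sequence comparison scores.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(adjacency, inflation=2.0, iterations=50):
    """Cluster nodes of a weighted graph given as a square adjacency matrix."""
    n = len(adjacency)
    # Add self-loops, then make the matrix column-stochastic.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                   # expansion
        m = [[value ** inflation for value in row] for row in m]  # inflation
        m = normalize_columns(m)
    # Each row with non-zero entries lists the members of one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return clusters


if __name__ == "__main__":
    # Two disconnected triangles: nodes 0-2 and nodes 3-5.
    adjacency = [
        [0, 1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],
        [0, 0, 0, 1, 0, 1],
        [0, 0, 0, 1, 1, 0],
    ]
    print(mcl(adjacency))
```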