Finals Flashcards

1
Q

Functional genomics

A

The functional annotation of genes is a large field that uses extensive experimentation to describe the function and interactions of genes and gene products

2
Q

For functional annotation, what are BLAST and the InterPro software framework based on?

A

Sequence similarity

3
Q

Examples of functional classification schemes

A

Gene Ontology (GO)
Enzyme Commission (EC) Numbers
Kyoto Encyclopedia of Genes & Genomes (KEGG) BRITE

4
Q

How many classification schemes have been devised for protein structures and what are they?

A

Three:

SCOP (Structural Classification of Proteins)
CATH (Class, Architecture, Topology, Homologous superfamily)
FSSP (Families of structurally similar proteins)

5
Q

What is the success or reliability of functional prediction influenced by?

A

Accuracy of the alignment of homologous characters in two or more sequences

6
Q

What is the twilight zone?

A

The range in which the sequence similarity between two protein sequences is 15-25%. Here the reliability of the prediction that the two proteins are homologous (evolutionarily related) is only 10%

7
Q

What percent identity might occur between two protein sequences longer than 100 amino acids simply by chance?

A

10-20%

8
Q

What is the reliability of prediction that two protein sequences are homologous when the sequence identity is above 30%?

A

90%

9
Q

What percentage of the amino acids in a sequence determines the protein fold, and thus the general structure of the protein?

A

3-4%

10
Q

What is the likely sequence similarity of proteins with similar structure?

A

> 33%

11
Q

What is the midnight zone?

A

The range in which sequence identity is very low (<15%). The sequences are so different that their relationship is nearly invisible at the sequence level, yet they may adopt very similar 3D structures
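
The identity ranges on these cards can be turned into a small lookup. This is a minimal Python sketch, assuming identity is counted over aligned columns that contain no gap (one of several conventions in use). The function names are illustrative, and the 25-30% band, which the cards do not name, is reported as unclassified rather than guessed.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns without a gap character in either sequence.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def identity_zone(identity: float) -> str:
    """Map a percent identity to the zones named on the cards."""
    if identity < 15:
        return "midnight zone"
    if identity <= 25:
        return "twilight zone"
    if identity <= 30:
        # Assumption: the cards give no name for this band.
        return "unclassified"
    return "homology likely"
```

For example, `identity_zone(percent_identity("ACDE", "ACDF"))` classifies a toy alignment with three identical columns out of four.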

12
Q

What percentage of gene annotations in public databases are incorrect or misleading?

A

5-63%

13
Q

How are the errors in gene annotations in public databases propagated?

A

Via analyses of new genomes

14
Q

Where do the errors in gene annotations arise from?

A

They originate from various sources including genome assembly and gene prediction.

Genome assembly: Erroneous or incomplete genome assembly - Truncated or chimeric genes
Genes and gene function prediction: Single nucleotide errors

15
Q

Which databases are the best-curated for protein functional annotations and why?

A

RefSeq
UniProt/SwissProt

They require multiple lines of experimentally derived evidence

16
Q

Which sequence databases are integrated in the InterPro framework?

A

HAMAP, Panther, PIRSF, TIGRFAM

17
Q

Which method to predict signal peptide is integrated in the InterPro framework?

A

SignalP

18
Q

Which method to predict transmembrane region is integrated in the InterPro framework?

A

TMHMM

19
Q

Which fingerprint databases are integrated in the InterPro framework?

A

PRINTS

20
Q

Which motif databases are integrated in the InterPro framework?

A

ProSite

21
Q

Which domain databases are integrated in the InterPro framework?

A

Gene3D, Pfam, ProDom, ProSite (Profile), SMART, Superfamily

22
Q

The sensitivity of BLAST is comparable to what algorithm?

A

Smith-Waterman

23
Q

How can BLAST recognize distant homologues?

A

An iterative algorithm using a position specific score matrix is devised and implemented in PSI-BLAST. A matrix is reconstructed for individual iterations using sequences from previous iterations.

24
Q

What could lead to an erroneous transfer of function in BLAST-based annotation methods?

A

Homologues may align only over a small portion of their overall lengths.

The homologue may have been wrongly annotated in the first place.

25
Q

Issues with BLAST-based annotation methods

A
  1. Distant homologues
  2. Homologues may only align over a small portion of the overall lengths
  3. Misannotated homologues
26
Q

What is commonly used to predict orthologous proteins from KEGG databases?

A

BLAST

27
Q

What can misannotation of homologous proteins also lead to in the case of orthologs?

A

Orthologs are predicted from KEGG databases, and misannotation may lead to erroneous predictions of metabolic pathways and protein families

28
Q

A server that incorporates a curated domain-family database

A

PfamA

29
Q

A server that incorporates a computationally-generated domain-family database

A

PfamB

30
Q

How does Pfam generate clusters of domain families?

A

Defined by the program ADDA

31
Q

How are clusters of domain families formed in ADDA?

A

From pairwise comparisons of profiles of domains, which are inferred by penalizing splits and partial overlaps in a pairwise, BLAST-aligned protein-similarity matrix

32
Q

How does the SMART domain database work?

A

Simple Modular Architecture Research Tool requires manual intervention during annotation and is linked to a database called STRING (Search Tool for the Retrieval of Interacting Genes)

33
Q

How does the ProDom domain database work?

A

Comparing the results from PSI-BLAST against the UniProtKB database and inferring domain information from the resultant data

34
Q

Which domain databases does ProDom complement?

A

Pfam, ProSite, SMART

35
Q

How does the SUPERFAMILY resource for domains work?

A

It uses the SCOP classification scheme for inferred protein-domain superfamilies and assigns gene ontology (GO) terms to these families using Gene Ontology annotation

36
Q

How does the Gene3D resource for domains work?

A

It combines both structural (CATH classification scheme) and functional information to annotate domains found in sequences in the databases UniProtKB, RefSeq, Ensembl. It clusters annotated superfamilies into functional subfamilies using GeMMA.

37
Q

How does the ProSite motif database work?

A

Recognizes protein motifs using regular expressions and weight matrix profiles, augmented by the annotation rule database ProRule.

38
Q

What does the implementation of ProRule in ProSite do?

A

It increases the reliability by imposing rules, such as essential amino acids in the active sites of enzymes.

39
Q

What are some clustering methods and databases?

A

Methods: OrthoMCL, InParanoid, MultiParanoid
Databases: OrthoDB, Clusters of Orthologous Groups of Proteins

40
Q

What are the clustering methods and databases based on?

A

They use all-versus-all similarity matrices, built from pairwise alignments of protein sequences using algorithms such as BLAST, FASTA and Smith-Waterman
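
To make the idea of an all-versus-all matrix concrete, here is a minimal Python sketch that scores every pair of sequences with a bare-bones Smith-Waterman local alignment. The match, mismatch and gap values are assumptions chosen for readability (real tools use substitution matrices such as BLOSUM and affine gap penalties), and the three toy sequences are invented.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def all_versus_all(proteins):
    """Score every ordered pair, including self-comparisons."""
    names = sorted(proteins)
    return {(p, q): smith_waterman_score(proteins[p], proteins[q])
            for p in names for q in names}


# Invented toy sequences, for illustration only.
toy_proteins = {"p1": "MKTAYIAK", "p2": "MKTAYLAK", "p3": "GGGGSGGG"}
scores = all_versus_all(toy_proteins)
```

Clustering methods then operate on a matrix like `scores` rather than on the sequences themselves.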

41
Q

What is the largest, publicly available, all-versus-all protein sequence similarity score matrix called?

A

Similarity Matrix of Proteins (SIMAP)

42
Q

What is SIMAP2 limited to?

A

Proteins encoded in complete genome sequences (Not publicly available)

43
Q

In which database is SIMAP2 employed?

A

eggNOG: evolutionary genealogy of genes: Non-supervised Orthologous Groups

44
Q

How are the proteins assembled in eggNOG?

A

They are assembled into in-paralogous (as opposed to out-paralogous and orthologous) groups by comparing sequence similarities within and among clades

45
Q

How are orthologous groups found from eggNOG?

A

Orthologous groups amongst the in-paralogous groups in eggNOG are then identified by creating and merging reciprocal best hits among three species
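
The reciprocal-best-hit step can be sketched in a few lines of Python. This is a simplified two-species version (the card describes merging hits among three species), the score dictionaries stand in for parsed similarity-search output, and all protein names and scores are invented.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: {(query, subject): similarity score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Invented scores for two hypothetical species A and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b1"): 90}
b_vs_a = {("b1", "a1"): 195, ("b1", "a3"): 88, ("b2", "a2"): 118}
rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here `a3` finds `b1` as its best hit, but `b1` prefers `a1`, so the pair is not reciprocal and is dropped.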

46
Q

How can clustering methods be improved to predict orthologues and paralogues?

A

Amino acid substitution models like BLOSUM can be replaced with models that better estimate phylogenetic distances, such as JTT and WAG. In addition, a deduced phylogenetic tree of individual genes can be reconciled with the phylogenetic tree of species.

47
Q

In which methods or databases is the reconciliation of a deduced phylogenetic tree of individual genes to the phylogenetic tree of species accounted for?

A

SYNERGY
PhIG (Phylogenetically inferred groups)
TreeFam
PANTHER

48
Q

What are the problems with databases with phylogenomic annotation algorithms?

A

They provide a compelling option to rapidly detect protein function, but they are limited in:

  1. Their coverage of species and proteins
  2. Their use of sequence similarity searches to position the query sequence in the databases' phylogenetic trees, which were constructed using substitution models and taxonomic information, seems questionable
  3. Even perfect positioning does not guarantee the accurate prediction of function for the query protein sequence, because homologous proteins do not always have the same function
49
Q

How do we annotate proteins based on structure?

A

We compare the predicted folds of the gene products against structurally similar proteins in databases such as protein data bank (PDB)

50
Q

What are the limitations of annotating protein functions based on structure?

A
  1. Only 60% of structurally similar proteins without significant sequence similarity share a binding site location, thus the function inferred from this comparison may not always be correct.
  2. Moreover, functional knowledge about a lot of the 3D structures of proteins in PDB is lacking as structural genomics initiatives are only directed at determining the 3D structures through high-throughput structure determination efforts.
  3. In convergent evolution, the same function is observed even with different folds, thus preventing the use of structural homologues to infer a function
51
Q

What should be done to increase the accuracy of structure-based function prediction and why?

A

Conserved amino acids in active and binding sites need to be evaluated. That is because, for enzymes, catalytic residues and their locations within the protein and orientation within the active sites are usually conserved and are not associated with structural variation, thereby allowing the functional annotation of distantly related homologues.

52
Q

How to identify the conserved residues to improve the accuracy of structure-based function prediction?

A

The identification of conserved residues in protein families is through multiple sequence alignment

53
Q

Where can the functional classification of proteins be evaluated?

A

In the annual Critical Assessment of Function Annotation (CAFA) challenge

54
Q

When is promising annotation achieved?

A

When using machine learning and supervised classification methods, and unsupervised clustering methods

55
Q

In what databases are the results for experimentally evaluated and computationally predicted protein-protein interaction networks and protein-protein complexes found?

56
Q

Where can machine learning and supervised classification methods, with unsupervised clustering methods be applied?

A

They can be applied to predict individual features of proteins (domain boundaries, subcellular location, conserved residues), or to collectively predict a function from data integrated from different sources (structure, taxonomy, sequence, transcription, metabolic and protein-protein interaction networks).

They can also be applied to enhance an existing homology-based annotation.

57
Q

What does gene prediction or structural annotation or gene finding mean?

A

It aims to identify the structural elements in a genomic region that represent a gene.

58
Q

What do extrinsic methods for gene prediction do?

A

They align transcriptomic, protein sequence, and/or other evidence datasets to the genomic sequence for gene prediction

59
Q

What do intrinsic methods for gene prediction do?

A

They use statistical patterns to identify gene regions in a genomic sequence

60
Q

What is the predicted gene element data typically represented by?

A

A unified general feature format (GFF)
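
As an illustration of what a GFF record holds, the Python sketch below splits one GFF3-style line into its nine tab-separated columns. The example line (scaffold name, source, coordinates, attributes) is invented, and the parser is deliberately minimal: it does not handle comment lines, directives or escaped characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GffFeature:
    seqid: str
    source: str
    type: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    score: str       # "." when no score is given
    strand: str
    phase: str
    attributes: dict


def parse_gff3_line(line):
    """Parse one feature line with the nine GFF3 columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields
    attributes = {}
    for item in attr_text.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key] = value
    return GffFeature(seqid, source, ftype, int(start), int(end),
                      score, strand, phase, attributes)


# Invented example record.
example_line = "scaffold_1\tAUGUSTUS\texon\t1300\t1500\t0.98\t+\t.\tID=exon1;Parent=mRNA1"
feature = parse_gff3_line(example_line)
```

The `Parent` attribute is what links exons to their transcript, so a gene structure can be rebuilt from a flat list of features.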

61
Q

What is a general pipeline for gene prediction and functional annotation?

A

RNA-Seq reads -> Transcriptome assembly -> Transcript sequences (Protein sequences + Genome scaffolds) -> Gene prediction -> Gene annotation (InterPro: Domains, motifs, signal peptides) -> Post-processing

62
Q

For extrinsic methods, how are genes predicted?

A

Based on the alignment success

63
Q

For accurately predicting a gene structure with extrinsic methods, what sequences are preferred?

A

cDNA sequences

64
Q

What is native alignment in the context of aligning an evidence dataset to a genomic sequence?

A

The evidence dataset consists of mRNA sequences, which are typically derived from the same species under investigation and match the genome sequence

65
Q

What is trans-alignment in the context of aligning an evidence dataset to a genomic sequence?

A

The evidence dataset consists of protein sequences from closely related species, which are not expected to match the conceptually translated genomic sequences exactly

66
Q

What are the challenges for extrinsic methods?

A
  1. Alignment inaccuracies
  2. Fragmented nature of evidence (mRNA or protein sequences) data
  3. Splice variants from genes
67
Q

Why is the Exonerate algorithm widely used for alignment in extrinsic methods of gene prediction?

A
  1. It processes data relatively rapidly
  2. It aligns both protein and nucleotide sequences
68
Q

Which aligners align evidence data accurately across exons and introns?

A

Pair HMM aligners, such as Pairagon and GeneWise

69
Q

What is the disadvantage of using Pair HMM aligners?

A

Large computational time

70
Q

Examples of alignment algorithms that use BLAST to produce seed alignments which are then extended using different dynamic programming variants such as Needleman-Wunsch or Smith-Waterman algorithms

A

EST_GENOME, AAT, Exonerate

71
Q

What are consensus based methods in intrinsic gene prediction?

A

Consensus-based methods, also known as signal sensors, predict known nucleotide patterns in gene elements. These methods look for specific, well-defined sequences that indicate important functional sites in DNA, such as splice sites, start and stop codons, and the Kozak consensus sequence (related to the initiation of translation)

72
Q

What sites do consensus based methods look for?

A

Well-known patterns in gene elements, such as the Kozak consensus sequence, start and stop codons, and splice sites

73
Q

Which methods are used to recognize the signals in consensus based methods in intrinsic gene prediction?

A

Methods building on the Weight Matrix Method (WMM), such as the Position Weight Matrix (PWM), Weight Array Model (WAM), Maximal Dependence Decomposition (MDD) and Windowed Weight Array Model (WWAM)

74
Q

How does the weight matrix method (WMM) work?

A

Calculates the signal probability and assumes that individual nucleotides are independent
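
The independence assumption means the probability of a signal is simply the product of per-position nucleotide probabilities. A minimal Python sketch follows, assuming equal-length aligned example sites and a pseudocount of 1 to avoid zero probabilities. The four donor-like training sites are invented.

```python
import math


def train_wmm(sites, pseudocount=1.0):
    """Per-position nucleotide probabilities from equal-length aligned sites."""
    model = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({base: counts[base] / total for base in "ACGT"})
    return model


def wmm_log_prob(model, seq):
    """Log probability of seq: positions are treated as independent,
    so the per-position log probabilities are simply added."""
    return sum(math.log(column[base]) for column, base in zip(model, seq))


# Invented training sites, for illustration only.
training_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
wmm = train_wmm(training_sites)
```

A candidate such as `"GTAAGT"` then scores higher under `wmm_log_prob` than an unrelated hexamer such as `"CCCCCC"`.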

75
Q

How does the weight array model (WAM) work?

A

Assumes dependencies between adjacent nucleotides
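
Dependence between adjacent nucleotides can be modelled by conditioning each position on the one before it. The Python sketch below is a first-order version under that assumption, with a pseudocount of 1 and invented training sites chosen so that the last two positions are correlated.

```python
import math


def train_wam(sites, pseudocount=1.0):
    """First-position probabilities plus, for each later position,
    P(base | previous base) estimated at that position."""
    first = {b: pseudocount for b in "ACGT"}
    for site in sites:
        first[site[0]] += 1
    total = sum(first.values())
    first = {b: c / total for b, c in first.items()}

    transitions = []
    for pos in range(1, len(sites[0])):
        counts = {p: {b: pseudocount for b in "ACGT"} for p in "ACGT"}
        for site in sites:
            counts[site[pos - 1]][site[pos]] += 1
        table = {}
        for prev_base in "ACGT":
            row_total = sum(counts[prev_base].values())
            table[prev_base] = {b: counts[prev_base][b] / row_total for b in "ACGT"}
        transitions.append(table)
    return first, transitions


def wam_log_prob(model, seq):
    """Log probability of seq under the position-specific first-order model."""
    first, transitions = model
    logp = math.log(first[seq[0]])
    for pos in range(1, len(seq)):
        logp += math.log(transitions[pos - 1][seq[pos - 1]][seq[pos]])
    return logp


# Invented sites in which C is followed by A and G is followed by T.
wam_sites = ["ACA", "ACA", "AGT", "AGT"]
wam = train_wam(wam_sites)
```

A model with independent positions would score `"ACA"` and `"ACT"` identically on these sites, because A and T are equally common at the last position. The conditional tables let `wam_log_prob` prefer `"ACA"`.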

76
Q

How does maximal dependence decomposition (MDD) work?

A

Implements a decision tree of weight matrix models (WMMs) and extends the dependency considerations to non-adjacent nucleotides

77
Q

How does windowed weight array model (WWAM) work?

A

Assumes dependencies across three consecutive nucleotides and averages related conditional probabilities among five consecutive nucleotides

78
Q

What are non-consensus based methods in intrinsic gene prediction?

A

Use nucleotide composition (content) to recognize gene elements and sequence areas (coding and non coding regions)

79
Q

What is the most successful discriminator between coding and non-coding regions when predicting nucleotide by nucleotide in non-consensus intrinsic gene prediction?

A

Hidden Markov Models using hexamer sequence composition
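
Hexamer composition amounts to asking how likely each base is given the five bases before it. The Python sketch below shows only that statistic, as a plain fifth-order Markov chain log-odds score rather than a full HMM. The training strings are invented toy data, and unseen contexts fall back to a uniform 0.25 by assumption.

```python
import math
from collections import defaultdict


def train_fifth_order(sequences, pseudocount=1.0):
    """Count which base follows each 5-base context (i.e. hexamer counts)."""
    counts = defaultdict(lambda: {b: pseudocount for b in "ACGT"})
    for seq in sequences:
        for i in range(5, len(seq)):
            counts[seq[i - 5:i]][seq[i]] += 1
    return dict(counts)


def conditional_prob(model, context, base):
    """P(base | context); uniform when the context was never observed."""
    row = model.get(context)
    if row is None:
        return 0.25
    return row[base] / sum(row.values())


def coding_log_odds(seq, coding_model, noncoding_model):
    """Positive values favour the coding model, negative the non-coding one."""
    score = 0.0
    for i in range(5, len(seq)):
        context, base = seq[i - 5:i], seq[i]
        score += math.log(conditional_prob(coding_model, context, base)
                          / conditional_prob(noncoding_model, context, base))
    return score


# Invented toy training data.
coding_model = train_fifth_order(["ATGGCCGCCGCCGCCGCC"])
noncoding_model = train_fifth_order(["ATATATATATATATATAT"])
```

A gene finder would embed such scores in the coding and non-coding states of its model instead of applying them to whole sequences as done here.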

80
Q

How are prediction algorithms enhanced to extend the single-nucleotide approach (HMMs using hexamer sequence composition to discriminate between coding and non-coding regions) to versatile gene elements or even complete gene structures?

A

Three-period, fifth-order generalized HMMs (GHMMs):

Hexamer sequences are used together with built-in knowledge of codon structure to ensure that the reading frame is preserved

80
Q

Examples of programs using GHMM based three-period fifth-order Markov Chain model

A

GENSCAN, GeneMark-ES

81
Q

Which Markov models are used to further improve predictions from GHMM based Markov Chain models?

A

Interpolated Markov Models (IMM) in which Markov models of different order are interpolated

82
Q

Which gene finders implement an interpolated Markov model?

A

AUGUSTUS, GlimmerHMM

83
Q

How have ab initio prediction algorithms been enhanced?

A

They have been enhanced using information from syntenic (colocalized) regions among multiple genomes. It is advisable to use genomes from taxonomically closely related species.

84
Q

How to create functional prediction models for Ab initio gene prediction?

A

Ab initio predictors have to be trained with reliable training datasets, which are specific to each genome

85
Q

What to do if training data is not available for a specific genome while creating a functional prediction model for Ab initio gene prediction?

A

Parameter values for prediction models can be estimated by predicting genes first using suboptimal parameter values, and then by recalculating new values based on these predicted genes

86
Q

What are suboptimal parameter values?

A

Copied from prediction models for closely related species;
Inferred from the structure of core eukaryotic genes
Obtained from unsupervised gene prediction programs (Such as GeneMark)

87
Q

What was the first attempt to combine the prediction data from multiple sources?

A

Using the program COMBINER (Linear and statistical combinations of the prediction data from multiple sources)

88
Q

What is the successor of COMBINER and what are its ab initio and evidence based on? How does it work?

A

JIGSAW.
ab initio: Internal support with GHMMs
evidence: Expresses external evidence of structural elements of a gene using feature vectors.

Feature vectors give a weighting coefficient to each prediction source, and dynamic programming (combined with decision trees) is used to establish optimal gene structures

89
Q

Ensembl

A

Combined gene prediction program
Prefers evidence based over ab initio
High quality annotations at the cost of sensitivity

90
Q

EVIGAN

A

Combined gene prediction program
Predicts gene structures using Dynamic Bayes networks
Estimated with Maximum Likelihood

91
Q

GLEAN

A

Combined gene prediction program
Uses latent class analysis (LCA) algorithm to give consensus predictions
Gene structures are predicted from gene structural elements

92
Q

MAKER2

A

Combined gene prediction program
Uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction
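
Annotation edit distance is commonly described at the nucleotide level as AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases supported by evidence. The Python sketch below follows that description with inclusive 1-based intervals. It is an illustration of the formula, not MAKER2's own code, and the function names are made up.

```python
def covered_positions(intervals):
    """Set of base positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def annotation_edit_distance(prediction, evidence):
    """0.0 means perfect agreement with the evidence, 1.0 means none."""
    pred = covered_positions(prediction)
    evid = covered_positions(evidence)
    if not pred or not evid:
        return 1.0
    overlap = len(pred & evid)
    sensitivity = overlap / len(evid)   # evidence bases that were predicted
    specificity = overlap / len(pred)   # predicted bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0
```

For instance, a prediction covering bases 1-100 compared with evidence covering 51-150 shares half of each, giving an AED of 0.5.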

93
Q

What is the advantage of using MAKER2?

A

It can estimate the reliability of any prediction as it uses annotation edit distance (AED) to estimate the share of evidence data for consensus prediction

94
Q

EVM

A

Combined gene prediction program
Accommodates variable gene prediction and evidence data, and allows manual adjustment of the weight given to each data source

95
Q

What are orthologues and paralogues inferred by in InParanoid and MultiParanoid?

A

Pairwise reciprocal best hits

96
Q

What pairwise similarity matrix does MultiParanoid use?

A

InParanoid

97
Q

What pairwise similarity matrix does InParanoid use?

A

BLAST based

98
Q

How are orthologues and paralogues inferred in OrthoMCL?

A

Markov Clustering Algorithm
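
Markov clustering alternates two steps on a column-normalized similarity matrix: expansion (matrix squaring, which spreads flow along paths) and inflation (element-wise powering, which strengthens strong connections). The pure-Python sketch below is a minimal version under assumed defaults (inflation 2, added self-loops of weight 1) and is not OrthoMCL's implementation. The toy matrix describes two groups of three proteins with no similarity between the groups.

```python
def normalize_columns(m):
    """Scale each column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def markov_cluster(similarity, inflation=2.0, iterations=50, tol=1e-9):
    n = len(similarity)
    # Add self-loops, then turn similarities into column-wise probabilities.
    m = normalize_columns([[float(similarity[i][j]) + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iterations):
        # Expansion: square the matrix.
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: raise entries to a power and renormalize the columns.
        inflated = normalize_columns([[v ** inflation for v in row]
                                      for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j])
                     for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row that retains flow defines a cluster of the columns it reaches.
    clusters = set()
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Invented toy matrix: proteins 0-2 are similar to each other, as are 3-5.
toy_similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
toy_clusters = markov_cluster(toy_similarity)
```

In practice the input would be a matrix of BLAST-derived similarities, and higher inflation values yield finer clusters.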