Bioinformatics Flashcards

Question 1

Q

Define ortholog, paralog, and homolog.

Answer

A

Homolog: A gene related to a second gene by descent from a common ancestral DNA sequence. The term homolog may apply to the relationship between genes separated by the event of speciation (see ortholog) or to the relationship between genes separated by the event of genetic duplication (see paralog).

Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs is critical for reliable prediction of gene function in newly sequenced genomes. (See also Paralogs.)

Speciation is the origin of a new species capable of making a living in a new way from the species from which it arose. As part of this process, the new species has also acquired some barrier to genetic exchange with the parent species.

Paralogs are genes related by duplication within a genome. Orthologs normally retain the same function in the course of evolution, whereas paralogs tend to evolve new functions, even if these are related to the original one.
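The definitions above can be turned into a small rule: two homologous genes are orthologs if their last common ancestor in the gene tree is a speciation event, and paralogs if it is a duplication event. The sketch below assumes a toy gene tree whose internal nodes are already labeled with the event type. The Node class, the classify_pair function, and the gene names are all hypothetical and used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A node in a toy gene tree (hypothetical structure for illustration)."""
    name: str
    event: Optional[str] = None  # "speciation" or "duplication"; None for leaves
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes from the root down to the named leaf, or None if absent."""
    if not root.children and root.name == leaf_name:
        return [root]
    for child in root.children:
        sub_path = path_to(child, leaf_name)
        if sub_path is not None:
            return [root] + sub_path
    return None


def classify_pair(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify two homologs by the event at their last common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a = path_to(root, gene_a)
    path_b = path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise ValueError("both genes must be leaves of the tree")
    last_common = None
    for node_a, node_b in zip(path_a, path_b):
        if node_a is not node_b:
            break
        last_common = node_a
    return "orthologs" if last_common.event == "speciation" else "paralogs"


# Hypothetical history: an ancestral gene duplicated into "alpha" and "beta"
# copies, and afterwards the lineage split into human and mouse.
tree = Node("ancestral_gene", "duplication", [
    Node("alpha_ancestor", "speciation", [Node("human_alpha"), Node("mouse_alpha")]),
    Node("beta_ancestor", "speciation", [Node("human_beta"), Node("mouse_beta")]),
])

print(classify_pair(tree, "human_alpha", "mouse_alpha"))  # orthologs
print(classify_pair(tree, "human_alpha", "human_beta"))   # paralogs
```

Note that human_alpha and mouse_beta also come out as paralogs here, because the duplication that separates them predates the speciation.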

Question 2

Q

What is the STRING database used for?

Answer

A

In molecular biology, STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a biological database and web resource of known and predicted protein-protein interactions.

Protein-protein interaction networks are an important ingredient for the system-level understanding of cellular processes. Such networks can be used for filtering and assessing functional genomics data and for providing an intuitive platform for annotating structural, functional and evolutionary properties of proteins. Exploring the predicted interaction networks can suggest new directions for future experimental research and provide cross-species predictions for efficient interaction mapping.

Like many other databases that store protein association knowledge, STRING imports data from experimentally derived protein-protein interactions through literature curation. Furthermore, STRING also stores computationally predicted interactions from: (i) text mining of scientific texts, (ii) interactions computed from genomic features, and (iii) interactions transferred from model organisms based on orthology. [7]

All predicted or imported interactions are benchmarked against a common reference of functional partnership as annotated by KEGG (Kyoto Encyclopedia of Genes and Genomes).
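To make the idea of benchmarking against a common reference concrete, the sketch below computes, for each evidence source, the fraction of its predicted pairs that also appear in a reference set of known partnerships. This is a deliberately simplified illustration, not STRING's actual scoring procedure. The protein names, the source labels, and the precision_by_source function are all hypothetical.

```python
from typing import Dict, FrozenSet, Set

Pair = FrozenSet[str]


def pair(a: str, b: str) -> Pair:
    """Build an unordered protein pair, so (A, B) and (B, A) compare equal."""
    return frozenset((a, b))


def precision_by_source(
    predictions: Dict[str, Set[Pair]], reference: Set[Pair]
) -> Dict[str, float]:
    """Fraction of each source's predicted pairs that occur in the reference."""
    scores = {}
    for source, pairs in predictions.items():
        if not pairs:
            scores[source] = 0.0
            continue
        scores[source] = len(pairs & reference) / len(pairs)
    return scores


# Hypothetical reference set of functional partnerships.
reference = {pair("P1", "P2"), pair("P2", "P3"), pair("P3", "P4")}

# Hypothetical predictions from the three computational sources listed above.
predictions = {
    "text_mining": {pair("P1", "P2"), pair("P1", "P5")},
    "genomic_features": {
        pair("P2", "P3"), pair("P3", "P4"), pair("P4", "P5"), pair("P2", "P1"),
    },
    "orthology_transfer": {pair("P3", "P4")},
}

scores = precision_by_source(predictions, reference)
for source, score in sorted(scores.items()):
    print(source, score)
```

Scoring every source against the same reference is what makes the sources comparable with one another.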

Question 3

Q

What is UniProt?

Answer

A

UniProt is a comprehensive, high-quality and freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature.

UniProt Knowledgebase (UniProtKB) is a protein database partially curated by experts, consisting of two sections: UniProtKB/Swiss-Prot (containing reviewed, manually annotated entries) and UniProtKB/TrEMBL (containing unreviewed, automatically annotated entries).[11]

UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. It combines information extracted from scientific literature and biocurator-evaluated computational analysis. The aim of UniProtKB/Swiss-Prot is to provide all known relevant information about a particular protein. Annotation is regularly reviewed to keep up with current scientific findings. The manual annotation of an entry involves detailed analysis of the protein sequence and of the scientific literature.[14]

Sequences from the same gene and the same species are merged into the same database entry. Differences between sequences are identified, and their cause documented (for example alternative splicing, natural variation, incorrect initiation sites, incorrect exon boundaries, frameshifts, unidentified conflicts). A range of sequence analysis tools is used in the annotation of UniProtKB/Swiss-Prot entries. Computer predictions are manually evaluated, and relevant results selected for inclusion in the entry. These predictions include post-translational modifications, transmembrane domains and topology, signal peptides, domain identification, and protein family classification.[14][15]
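The merging step described above can be pictured as a group-by on (species, gene), followed by a check for conflicting sequences. The sketch below is only a rough illustration under that assumption. The record fields, gene names, and the merge_records function are hypothetical, and in the real database the cause of each difference is documented by curators rather than merely flagged.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def merge_records(records: List[dict]) -> Dict[Tuple[str, str], dict]:
    """Merge sequence records that share a species and gene into one entry."""
    grouped = defaultdict(list)
    for record in records:
        grouped[(record["species"], record["gene"])].append(record)

    entries = {}
    for key, group in grouped.items():
        distinct_sequences = sorted({record["sequence"] for record in group})
        entries[key] = {
            "sources": [record["source"] for record in group],
            "sequences": distinct_sequences,
            # More than one distinct sequence means a difference to document.
            "has_differences": len(distinct_sequences) > 1,
        }
    return entries


# Hypothetical input records; sequences are made-up short peptides.
records = [
    {"species": "Homo sapiens", "gene": "GENE_A",
     "source": "submission_1", "sequence": "MKTAYIAK"},
    {"species": "Homo sapiens", "gene": "GENE_A",
     "source": "submission_2", "sequence": "MKTAYLAK"},
    {"species": "Mus musculus", "gene": "GENE_A",
     "source": "submission_3", "sequence": "MKTAYIAK"},
]

entries = merge_records(records)
for (species, gene), entry in entries.items():
    print(species, gene, entry["sources"], entry["has_differences"])
```

The two human records collapse into a single entry flagged for review, while the mouse record stays separate because it comes from a different species.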

Relevant publications are identified by searching databases such as PubMed. The full text of each paper is read, and information is extracted and added to the entry. Annotation arising from the scientific literature includes, but is not limited to:[10][14][15]

Protein and gene names
Function
Enzyme-specific information such as catalytic activity, cofactors and catalytic residues
Subcellular location
Protein-protein interactions
Pattern of expression
Locations and roles of significant domains and sites
Ion-, substrate- and cofactor-binding sites
Protein variant forms produced by natural genetic variation, RNA editing, alternative splicing, proteolytic processing, and post-translational modification