Lecture 4: What is Bioinformatics? Flashcards
What is Bioinformatics?
‘Bioinformatics is the acquisition, archiving, and interpretation (analysis) of molecular biology information’.
Bioinformatics is a multidisciplinary science at the crossroads of Biology, Computer Science, Statistics, Mathematical Modelling, and Systems Science.
What is acquisition in regard to bioinformatics?
Acquisition (Analytical Platforms):
-DNA, RNA, Protein Sequence
-Metabolites
-Molecular Structures
What is archiving in regard to bioinformatics?
Archiving (Biological Databases):
-DNA, RNA, Protein Sequence
-Metabolites
-Molecular Structures
What is interpretation in regard to bioinformatics?
Interpretation (Data Analysis):
-Computational Genomics
-Gene Function Annotation
-Molecular Pathway Annotation
-And more …
What is a biological database?
A large, persistent collection of systematically organised data, managed by software that can retrieve and update records.
Biological databases are central to many bioinformatics applications.
Biological databases provide the opportunity to access and systematically search a wide variety of biological data for an increasingly broad range of organisms
Types of biological data include
-Genomic, transcriptomic, and protein sequences.
-Genomic annotation, e.g. genes, transcription factor binding sites, gene function, pathways
-Phenotypes
-Protein Structure
-And more…
What are some key concepts of biological databases?
To be easily identifiable, ALL records (gene/protein/metabolite names, sequences, data, etc.) need to have a UNIQUE IDENTIFIER, aka ACCESSION NUMBER.
In your analyses/reports you will need to cite the unique identifier/accession number so the reader knows which data/organism/gene (etc.) you are really working with.
An identifier unambiguously identifies a biological entity.
To be easily understood it is a good idea to have information presented in a FIXED FORMAT/VOCABULARY. This helps us (and computers) to read and extract the information we need.
In molecular biology there are two frequently used sequence formats, GENBANK and FASTA.
FASTA is a de facto standard for any raw sequence.
GENBANK is the flat file format for gene sequences.
Question: The sequence of a gene was updated in the Entrez database. What will happen to the gene identifier?
It remains unchanged.
Not all databases handle versions the same way. Ensembl appends the version number to the ID, so that the ID is formed of two components:
the gene identifier and
the version of the gene.
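As a rough illustration of the two-component ID described above, the sketch below splits a versioned identifier into its ID and version. It assumes the two parts are joined by a dot (as in Ensembl stable IDs); the function name split_versioned_id and the example version number are made up for illustration.

```python
def split_versioned_id(versioned_id):
    """Split 'ID.version' into (ID, version).

    Returns (versioned_id, None) when no numeric version suffix is present.
    Assumes the identifier and version are joined by a single dot.
    """
    stable_id, sep, version = versioned_id.rpartition(".")
    if not sep or not version.isdigit():
        # No dot, or the part after the dot is not a number: treat as unversioned.
        return versioned_id, None
    return stable_id, int(version)


# Illustrative identifier; the version number here is only an example.
gene_id, gene_version = split_versioned_id("ENSG00000139618.2")
```

Citing the ID together with its version tells the reader exactly which revision of the record was used.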
How do I choose the right biological data(base)?
1.Type of Biological System
+Cell culture
*Animal model
-Human
2.Level of organisation
+Organelles
*Single Cells
-Tissues
3.Scope, depth and breadth of coverage
*Biased or partial, e.g. Candidate gene
-Comprehensive, e.g. Omics data
4.Genesis
+Computational predictions
*Experimental data
5.Levels of Curation
+Raw/archival data, e.g. SRA
-Curated data, e.g. RefSeq
6.Types of Curation
+Computationally curated, e.g. UniProt
*Community curated, e.g. GO
-Expert reviewed, e.g. RefSeq
Summary
Biological databases store different types of biological data, e.g. sequence, bibliography, graphs, etc.
Two key concepts allow storage, sharing and unambiguous interpretation of data:
1.Unique Identifiers
2.Fixed Formats and Vocabularies
Biological databases can be characterised by six attributes:
1-Biological System
2-Level of Organisation
3-Scope and Coverage
4-Genesis
5-Curation
6-Types of Curation
Biological databases covered in this lecture
PubMed – Bibliographic database
GenBank – Gene-centric sequence database
UniProt – Integrative portal focused on protein data
Gene Ontology – Gene function database
KEGG – Metabolic pathways database
Summary of characteristics of bibliographical DB
Credibility of a source: For a journal, being indexed in MEDLINE, Scopus, PubMed and Web of Science requires meeting rigorous review and selection criteria.
Convenience: Find information in all journals by doing a single search
Permanency: PubMed provides a unique identifier for each paper and a permanent URL for sharing and citing unambiguously.
Impact: Web of Science and Scopus provide the number of citations of papers and the Impact Factor of journals (number of citations / year).
Summary of banks
GenBank is a flat file database with the following characteristics:
-Human readable format (GenBank)
-Archival in nature
-Reflective of the submitter's point of view (subjective)
-Redundant (multiple copies)
UniProt is a protein-focused database that combines two databases, SwissProt and TrEMBL:
-SwissProt is manually annotated and reviewed
-TrEMBL is automatically annotated and not reviewed
FASTA is a machine-interpretable format that consists of:
-A greater than symbol “>” for every new entry, followed by a unique identifier (name), and
-The sequence on the following line(s)
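The FASTA layout described above can be read with a few lines of code. The following is a minimal sketch, not a full parser: it assumes the identifier is the first word after ">" and that a sequence may be wrapped over several lines. The function name parse_fasta and the records in the example are hypothetical.

```python
def parse_fasta(text):
    """Return a dict mapping each record identifier to its sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the identifier is the first word after ">".
            header = line[1:].strip()
            current_id = header.split()[0] if header else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the record that is currently open.
            records[current_id] += line
    return records


# Made-up records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2
GGGCCC
"""

records = parse_fasta(example)
```

Here records maps "seq1" to the joined sequence "ATGGCCTTAA" and "seq2" to "GGGCCC". Real analyses usually rely on an established library rather than a hand-written parser.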