HC 9 - Information Management: Public Biological Databases Flashcards
Lecture 9
Why is it important to visualize and store knowledge and data in a structured way?
-Reuse
-New hypotheses and experiments
Types of databases for omics and clinical
In-house and public
Why are databases good value for money, but difficult to establish and fund?
Databases ensure that expensive data are not lost (repeating experiments is costly) > but they are hard to fund, because they do not themselves produce new information/insights
-Database maintenance costs money
Information management
-Collection and storage
-Management > organisation, annotation, curation, integration
-Standardization > minimum information standards, FAIR principle
-Distribution and sharing > you cannot simply access human data (privacy)
-GDPR
What is a database?
-Computational archive to store and organize data > to easily query/retrieve data
-Consists of hardware and software
-To organize data in structured records
-Allows discovery of new information (data mining, machine learning, artificial intelligence)
FAIR principle
-Findable
-Accessible
-Interoperable
-Reusable
Query
A question asked of a database
For structure, what is important for a database?
Not different kinds of information in a column
Why are commas important in databases?
They are separators for columns
Each field contains a maximum of … categories of data
1
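The two cards above (commas as column separators, one category of data per field) can be illustrated with a small comma-separated table. This is only a sketch: the column names and the two records are chosen for illustration and do not come from the lecture.

```python
import csv
import io

# Illustrative table: each comma separates two columns,
# and each field holds exactly one category of data.
raw = (
    "gene_symbol,chromosome,organism\n"
    "TP53,17,Homo sapiens\n"
    "BRCA1,17,Homo sapiens\n"
)

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["gene_symbol"], row["chromosome"], row["organism"])
```

Putting two kinds of information in one field (for example "TP53 chr17") would make the column impossible to query cleanly.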
Relational database
Linked tables
> searching for certain labels which appear in the different tables
> linkage due to similar labels
> more complex databases like Gene Ontology
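A minimal sketch of linked tables, using Python's built-in sqlite3 module. The table names, column names and gene labels (GENE_A, GENE_B) are hypothetical; the point is only that a shared label (here gene_id) links the two tables.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, symbol TEXT);
    CREATE TABLE annotation (
        gene_id INTEGER REFERENCES gene(gene_id),
        term TEXT
    );
    INSERT INTO gene VALUES (1, 'GENE_A'), (2, 'GENE_B');
    INSERT INTO annotation VALUES
        (1, 'DNA binding'), (1, 'nucleus'), (2, 'ATPase activity');
    """
)

# The query follows the shared gene_id label from one table to the other.
rows = conn.execute(
    """
    SELECT gene.symbol, annotation.term
    FROM gene
    JOIN annotation ON gene.gene_id = annotation.gene_id
    WHERE gene.symbol = ?
    ORDER BY annotation.term
    """,
    ("GENE_A",),
).fetchall()

for symbol, term in rows:
    print(symbol, term)
```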
types of databases
Primary
-Consist of raw data
-Sequences, structures
Secondary (composite)
-Data from analysis or treatment of the primary data
-Protein families, metabolic pathways
Literature database
-not biological
-pubmed/ncbi
Online book, like GeneReviews
-expert authored
-peer-reviewed disease descriptions
Non-redundant database
All sequences which are uploaded for a gene reduced to one consensus sequence
> like RefSeq: one sequence per gene
> search on subset
> duplicate entries are removed
Redundant
Multiple sequences for the same gene
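The "duplicate entries are removed" step can be sketched as follows. This is a simplification: it only drops exact duplicates (ignoring case) and does not build a consensus sequence as a curated resource such as RefSeq would. The entry names and sequences are made up.

```python
def remove_exact_duplicates(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    unique = []
    for name, sequence in records:
        key = sequence.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, sequence))
    return unique


# Hypothetical redundant submissions for the same gene.
submissions = [
    ("entry_1", "ATGGCC"),
    ("entry_2", "atggcc"),  # same sequence as entry_1, different case
    ("entry_3", "ATGGCA"),
]

non_redundant = remove_exact_duplicates(submissions)
print(non_redundant)
```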
The database storage doubles every …, and what is the consequence?
Every 18 months > queries need to be repeated.
- Therefore, check database regularly
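To get a feeling for what doubling every 18 months means, the growth factor can be computed directly. This sketch assumes idealized exponential growth with a constant doubling time, which real databases only approximate.

```python
def growth_factor(months, doubling_time_months=18.0):
    """Factor by which the database has grown after `months`,
    assuming a constant doubling time."""
    return 2 ** (months / doubling_time_months)


# After 3 years (36 months) there have been two doublings.
print(growth_factor(36))
```

Under this assumption a query result that is a few years old may cover only a fraction of the records available today.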
Most searched organisms in GenBank are ..
Mammals, model organisms, fish, plants, microbes
Types of queries in GenBank and results
-Search sequence by name
-Search sequence by similarity: enter sequence
Results
-Accession code
-Version
-Unique identifier
-Comment (free text)
-Sequence features
-Links to other databases
The primary accession code is …
a unique unchanging identifier assigned to each GenBank sequence record: used when citing information from GenBank
> can be used for other databases
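GenBank identifiers are commonly written as accession.version, where the accession stays the same and the version number increases when the sequence changes. A small sketch of splitting the two parts; the helper name is hypothetical and the identifier is used only as an illustration.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when no '.version' suffix is present.
    """
    accession, separator, version = identifier.partition(".")
    return accession, (int(version) if separator else None)


print(split_accession("U49845.1"))
print(split_accession("U49845"))
```

When citing a record, the accession alone points to the record; adding the version pins down the exact sequence that was used.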
Publication rules for scientific journals
-Describe where the data is found in the database
> Data deposit (the sequence)
Because
-Analysis can be validated by other researchers with possible new methods
-But sometimes the description is incomplete or it is already processed and not raw data
FTP client
For downloading files from a database
E-utilities
-Set of 8 server-side programs
-Interface to the Entrez query and database system at NCBI
> can be used by requesting a URL in a specific format
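A sketch of what "a URL in a specific format" can look like for the ESearch utility. The code only builds the URL and does not contact NCBI; the helper name, the search term and the retmax value are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; urlencode escapes spaces and brackets."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return BASE + "esearch.fcgi?" + query


url = esearch_url("nucleotide", "BRCA1[gene]")
print(url)
```

Pasting such a URL into a browser, or requesting it from a script, returns the matching record identifiers.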
How to find public databases?
-Large database providers: NCBI, EBI, SIB
-Nucleic Acids Research
-GeneCards
-Wikipedia list of biological databases
Criteria for inclusion in NAR
-Thoroughly curated
-Of interest to a wide variety of biologists
-Comprehensiveness of coverage
-Degree of added value (because of manual curation)
-Maintained for a long period of time
XML output
Machine-readable output: structured so that a computer can process the data
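A minimal sketch of processing XML output with Python's standard library. The snippet imitates the general shape of an ESearch result (a count plus a list of ids); the id values are made up and real output contains more elements.

```python
import xml.etree.ElementTree as ET

# Hand-written example in the style of an ESearch response; ids are fictional.
xml_text = """
<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>
"""

root = ET.fromstring(xml_text)
count = int(root.findtext("Count"))
ids = [element.text for element in root.findall("./IdList/Id")]

print(count, ids)
```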
What is important to check with a database for reliability?
When was it last updated?
Parts of databases
-Raw Data
> human genome (Ensembl), gene expression (Gene Expression Omnibus), protein sequences (UniProt), protein structures (PDB), compounds (ChEBI)
-High level databases
> pathways (Reactome), protein interactions (STRING), Mendelian inheritance (OMIM)
GeneCards is a …
Hub to other databases: easily retrieve information about specific genes and proteins
Minimum Information Standards: what is MIBBI?
Minimum Information for Biological and Biomedical Investigations
Six most critical elements contributing towards MIAME
> Data
-Raw data for each hybridization
-Final processed normalized data for set of hybridizations in study
Metadata
-The essential sample annotation including experimental factors and values (e.g. compound and dose)
-The experimental design, including sample data relationships
-Sufficient annotation of the array
-Essential laboratory and data processing protocols
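The six elements above can be treated as a checklist. The sketch below checks a submission, represented as a plain dictionary, for missing elements. The key names and the example values are hypothetical and are not an official MIAME or GEO format.

```python
# Hypothetical key names for the six MIAME elements listed above.
MIAME_ELEMENTS = (
    "raw_data",
    "processed_data",
    "sample_annotation",
    "experimental_design",
    "array_annotation",
    "protocols",
)


def missing_miame_elements(submission):
    """Return the elements that are absent or empty in the submission."""
    return [name for name in MIAME_ELEMENTS if not submission.get(name)]


# Made-up, incomplete submission.
example = {
    "raw_data": "raw_files.tar",
    "processed_data": "normalized_matrix.txt",
    "sample_annotation": {"compound": "drug_x", "dose": "10 uM"},
    "protocols": "protocols.pdf",
}

missing = missing_miame_elements(example)
print(missing)
```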
Which well known database encourages submitters to supply MIAME compliant data?
Gene Expression Omnibus (GEO)
GEO submission:
-Manufacturer (platform description)
-Fill in the raw data
-How do the samples relate to each other
>GEO groups experiments
Why the FAIR data principles?
To enhance the reusability of data
> specific emphasis on enhancing ability of computers to automatically find and use data
FAIR data principles: Findable
Data should be easy to find for both humans and computers, based on their metadata description
> (meta)data are assigned a globally unique identifier
> data is described by rich metadata
> Metadata clearly and explicitly include the identifier of the data it describes
> (meta) data are registered or indexed in a searchable resource
FAIR: accessible
-(meta)data are retrievable by identifier using standardized communication protocol
>protocol is open, free and universally implementable
>protocol allows for authentication and authorisation when required
-Metadata should remain accessible even when the data are no longer available, e.g. so that the authors can be asked for the data by email
Human data should be accessible with the use of:
Contracts to protect privacy; this makes research harder
FAIR: interoperable
Data can be combined with other datasets and computer systems
- data interoperation is a non-trivial problem and the “I” will require the most creative effort
-(meta)data use formal, accessible, shared and broadly applicable language for knowledge representation
FAIR: reusable
For use in future research and further processing
- good annotation of which processing steps have been done with the data to come to conclusions
Challenge of nomenclature
How do you assign and maintain correct names of biological objects across databases?
> gene may have several alternative names/symbols
> gene names are not always consistently used for different organisms
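One common way to cope with alternative names is to map every alias to a single official symbol before comparing databases. A minimal sketch; the alias table is a tiny hand-made illustration, not an export from a nomenclature database.

```python
# Illustrative alias table: official symbol -> alternative names.
ALIASES = {
    "TP53": ["p53", "LFS1"],
}


def build_lookup(aliases):
    """Map every official symbol and alias (lower-cased) to the official symbol."""
    lookup = {}
    for official, alternatives in aliases.items():
        lookup[official.lower()] = official
        for alternative in alternatives:
            lookup[alternative.lower()] = official
    return lookup


LOOKUP = build_lookup(ALIASES)


def normalize(name):
    """Return the official symbol, or None when the name is unknown."""
    return LOOKUP.get(name.strip().lower())


print(normalize("P53"), normalize("unknown_gene"))
```

Lower-casing is a simplification: the same symbol can be written with different capitalization in different organisms, so a real mapping would also need to be organism-specific.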
Example of curation, annotation and provenance: Biocuration
Manually checking and correcting
-interpretation and integration of information relevant for biology
-goals
>accurate and reliable representation of biological knowledge, easily accessible and a basis for computational analysis
UniProt consists of….
-Swiss-Prot (manually reviewed)
-TrEMBL (automatically annotated, not manually reviewed)
Gene Ontology
Common language for annotation of genes
GO objectives
-Represent categories used to classify specific parts of our biological knowledge: biological process (network/pathway), molecular function (activity, function), cellular component (location, in which complex)
-Common knowledge applicable to any organism
-GO terms for gene annotation of any species > comparison across species
Examples of the three Gene Ontologies
-Molecular Function: carbohydrate binding, ATPase activity
-Biological Process: mitosis, purine metabolism
-Cellular component: nucleus, telomere, RNA pol II holoenzyme
GO-terms, GO-id, definitions
Term: short, like DNA binding transcription factor activity
GO-id: unique identifier of the term, e.g. GO:0003700 (a molecular function term)
Definition: description as text
Ontology
-Vocabulary of terms
-Definitions
-Defined logical relationships to each other
Ontology (network) structure
Nodes: terms in ontology
Edges: relationships between concepts
Kinds of relationships in GO structure: hierarchical directed acyclic graph (arrows from specific to less specific)
-is a
-part of
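The directed acyclic graph can be sketched as a dictionary in which each term points to its less specific parents, labelled with the relationship type. The graph below is a heavily simplified toy example and does not reproduce the actual edges in GO.

```python
# Toy graph: term -> list of (relationship, parent term).
TOY_GRAPH = {
    "telomere": [("part_of", "chromosome")],
    "chromosome": [("is_a", "organelle")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, graph):
    """Collect all less specific terms reachable by following the arrows."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in graph.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("telomere", TOY_GRAPH)))
```

Because the arrows run from specific to less specific, a gene annotated with a specific term is implicitly annotated with all of that term's ancestors.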
An ontology is a representation of something …
we know about
TCGA
The Cancer Genome Atlas
-public omics data
-sign contracts for use of data
OMIM
Knowledgebase of human genes and phenotypes
> Mendelian Inheritance
> consequences of types of mutations
Database consistency: differences between databases
-Overlapping / complementary content
-Databases reflect expertise and interest of the groups that maintain them
-developed for different applications
Inconsistencies between databases
-in data
-in metadata
-in links between databases
Error propagation
Database errors can propagate to other databases or scientific literature
> propagation occurs when erroneous information is used to annotate another biological entity (in another database)
Privacy (GDPR)
Not much public clinical data available
> regulation for data protection (contracts etcetera)