HC 9 - Information Management: Public Biological Databases Flashcards
Lecture 9
Why is it important to visualize and store knowledge and data in a structured way?
-Reuse
-New hypotheses and experiments
Types of databases for omics and clinical
In-house and public
Why are databases good value for money, but difficult to establish and fund?
Databases ensure that expensive data are not lost (experiments are expensive to repeat) > but they are hard to fund, because a database by itself yields no new information or insights
-Database maintenance costs money
Information management
-Collection and storage
-Management > organisation, annotation, curation, integration
-Standardization > minimum information standards, FAIR principle
-Distribution and sharing > human data cannot simply be accessed by anyone (privacy)
-GDPR
What is a database?
-Computational archive to store and organize data > to easily query/retrieve data
-Consists of hardware and software
-To organize data in structured records
-Allows discovery of new information (data mining, machine learning, artificial intelligence)
FAIR principle
-Findable
-Accessible
-Interoperable
-Reusable
Query
A question posed to a database
For structure, what is important for a database?
Do not put different kinds of information in one column
Why are commas important in databases?
They are separators for columns
Each field contains a maximum of … categories of data
1
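A minimal sketch of the cards above on structure, assuming a made-up three-column table; the column names (gene, organism, length_bp) and all values are illustrative placeholders, not real records. Commas separate the columns, and each field holds exactly one category of data.

```python
import csv
import io

# Hypothetical comma-separated table: one category of data per field.
raw = """gene,organism,length_bp
geneA,Homo sapiens,1200
geneB,Mus musculus,900
"""

def read_records(text):
    """Parse comma-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_records(raw)
for rec in records:
    # Every column can be addressed by name because no field mixes categories.
    print(rec["gene"], rec["organism"], rec["length_bp"])
```

If organism and length were stored together in one field, the reader could no longer retrieve them separately by column name.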
Relational database
Linked tables
> searching for certain labels which appear in the different tables
> linkage due to similar labels
> more complex databases like Gene Ontology
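The idea of linked tables can be sketched with Python's built-in sqlite3 module. The two tables, their column names, and the rows are hypothetical; the shared gene_id label is what links them.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by gene_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (gene_id TEXT PRIMARY KEY, symbol TEXT);
CREATE TABLE annotations (gene_id TEXT, term TEXT);
INSERT INTO genes VALUES ('g1', 'geneA'), ('g2', 'geneB');
INSERT INTO annotations VALUES
    ('g1', 'DNA repair'), ('g1', 'nucleus'), ('g2', 'transport');
""")

def terms_for(symbol):
    """Return the annotation terms for a gene symbol by joining both tables."""
    rows = conn.execute(
        "SELECT a.term FROM genes g "
        "JOIN annotations a ON g.gene_id = a.gene_id "
        "WHERE g.symbol = ? ORDER BY a.term",
        (symbol,),
    ).fetchall()
    return [row[0] for row in rows]

print(terms_for("geneA"))
```

The query itself never mentions g1 or g2: the join follows the matching label from one table to the other.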
Types of databases
Primary
-Consist of raw data
-Sequences, structures
Secondary (composite)
-Data from analysis or treatment of the primary data
-Protein families, metabolic pathways
Literature database
-not biological
-PubMed/NCBI
Online book, like GeneReviews
-expert authored
-peer-reviewed disease descriptions
Non-redundant database
All sequences uploaded for a gene are reduced to one consensus sequence
> like RefSeq: one sequence per gene
> search on subset
> duplicate entries are removed
Redundant
Multiple sequences for the same gene
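A toy sketch of the difference between redundant and non-redundant collections. The entries are invented, and picking the longest sequence as the single representative is only an assumption made for illustration; it is not how RefSeq curates its records.

```python
def remove_duplicates(entries):
    """Drop entries whose (gene, sequence) pair has already been seen."""
    seen = set()
    unique = []
    for gene, seq in entries:
        if (gene, seq) not in seen:
            seen.add((gene, seq))
            unique.append((gene, seq))
    return unique

def one_per_gene(entries):
    """Keep one representative per gene (assumption: the longest sequence)."""
    best = {}
    for gene, seq in entries:
        if gene not in best or len(seq) > len(best[gene]):
            best[gene] = seq
    return best

# Hypothetical redundant collection: several submissions for geneA.
redundant = [
    ("geneA", "ATGGCC"),
    ("geneA", "ATGGCC"),      # exact duplicate entry
    ("geneA", "ATGGCCTTA"),   # second, longer submission for the same gene
    ("geneB", "ATGAAA"),
]

print(len(remove_duplicates(redundant)))
print(one_per_gene(redundant))
```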
The amount of data stored in the database doubles every … and therefore:
Every 18 months > queries need to be repeated.
- Therefore, check database regularly
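To make the doubling concrete, here is a small sketch that assumes smooth exponential growth with an 18-month doubling time. The function name and the starting value of 100 are illustrative, not figures from the lecture.

```python
def projected_size(initial_size, months, doubling_months=18):
    """Size after `months`, assuming it doubles every `doubling_months`."""
    return initial_size * 2 ** (months / doubling_months)

# A database holding 100 (arbitrary units) today:
print(projected_size(100, 18))  # after one doubling period
print(projected_size(100, 36))  # after two doubling periods
```

A search that was run three years ago therefore covered only about a quarter of what the database holds now, which is why queries should be repeated.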
Most searched organisms in GenBank are ..
Mammals, model organisms, fish, plants, microbes
Types of queries in GenBank and results
-Search sequence by name
-Search sequence by similarity: enter sequence
Results
-Accession code
-Version
-Unique identifier
-Comment (free text)
-Sequence features
-Links to other databases
The primary accession code is …
a unique unchanging identifier assigned to each GenBank sequence record: used when citing information from GenBank
> can be used for other databases
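GenBank records are commonly written in the form accession.version, where the accession stays fixed and the version number increases when the sequence is updated. A minimal sketch of splitting such an identifier; the helper name is hypothetical and the string below only illustrates the format.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version or None)."""
    accession, _, version = identifier.partition(".")
    # The version is optional: a bare accession has no '.N' suffix.
    return accession, int(version) if version else None

print(split_accession("U49845.1"))
print(split_accession("U49845"))
```

Citing the accession identifies the record; adding the version pins down exactly which sequence was used.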
Publication rules for scientific journals
-Describe where the data is found in the database
> Data deposit (the sequence)
Because
-Analysis can be validated by other researchers with possible new methods
-But sometimes the description is incomplete, or the deposited data are already processed rather than raw
FTP client
For downloading files from a database
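What an FTP client does can be sketched with Python's standard ftplib module. The function name and the ftp_factory parameter are assumptions added here (the parameter only exists so the function can be tried without a network connection); host and paths must be supplied by the user.

```python
from ftplib import FTP

def download_file(host, remote_path, local_path, ftp_factory=FTP):
    """Download one file over anonymous FTP and return the local path."""
    with ftp_factory(host) as ftp:
        ftp.login()  # no arguments: anonymous login
        with open(local_path, "wb") as out:
            # RETR streams the remote file; each chunk is written to disk.
            ftp.retrbinary(f"RETR {remote_path}", out.write)
    return local_path
```

A call would look like download_file(host, remote_path, "local_copy.dat"), with the host name and remote path taken from the database's own download instructions.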
E-utilities
-Set of 8 server-side programs
-Interface to the Entrez query and database system of NCBI
> can be used by requesting a URL in a fixed format
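A sketch of what such a URL looks like for ESearch, one of the E-utilities. The base address and the db, term and retmax parameters follow NCBI's public documentation as far as I know; the helper name and the search term are illustrative. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez database and a search term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}esearch.fcgi?{query}"

# Spaces and brackets in the term are percent-encoded automatically.
print(esearch_url("nucleotide", "BRCA1[gene] AND human[orgn]"))
```

Pasting the printed URL into a browser, or fetching it from a script, returns the matching record identifiers.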