HC 9 - Information Management: Public Biological Databases Flashcards

Lecture 9

1
Q

Why is it important to visualize and store knowledge and data in a structured way?

A

-Reuse
-New hypotheses and experiments

2
Q

Types of databases for omics and clinical data

A

In-house and public

3
Q

Why are databases good value for money, but difficult to establish and fund?

A

Databases ensure expensive data is not lost (repeating experiments is expensive) > but they are hard to fund, because a database by itself yields no new information/insights
-Database maintenance costs money

4
Q

Information management

A

-Collection and storage
-Management > organisation, annotation, curation, integration
-Standardization > minimum information standards, FAIR principle
-Distribution and sharing > human data cannot be accessed freely (privacy)
-GDPR

5
Q

What is a database?

A

-Computational archive to store and organize data > to easily query/retrieve data
-Consists of hardware and software
-To organize data in structured records
-Allows discovery of new information (data mining, machine learning, artificial intelligence)

6
Q

FAIR principle

A

-Findable
-Accessible
-Interoperable
-Reusable

7
Q

Query

A

A question (request for data) put to a database

8
Q

For structure, what is important for a database?

A

No mixing of different kinds of information in one column

9
Q

Why are commas important in databases?

A

They act as separators between columns (e.g. in comma-separated files)
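
A minimal sketch of why the comma matters, using Python's standard csv module. The gene names and values are made up for illustration. A comma inside a value has to be quoted, otherwise it would be read as the start of a new column.

```python
import csv
import io

# Hypothetical comma-separated records: one kind of information per column.
# The description of GENE_A contains a comma, so that field is quoted.
raw_text = (
    "gene,description,length\n"
    'GENE_A,"kinase, putative",120\n'
    "GENE_B,transporter,85\n"
)

# DictReader uses the first line as column names and splits every
# following line on the commas that are not inside quotes.
rows = list(csv.DictReader(io.StringIO(raw_text)))

for row in rows:
    print(row["gene"], "|", row["description"], "|", row["length"])
```

Without the quotes around the description, the first record would have four fields instead of three and no longer line up with the header.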

10
Q

Each field contains a maximum of … categories of data

A

1

11
Q

Relational database

A

Linked tables
> searching for certain labels which appear in the different tables
> linkage due to similar labels
> more complex databases like Gene Ontology
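
The idea of linked tables can be sketched with Python's built-in sqlite3 module. The table names, column names and records below are hypothetical; the point is only that a shared label (here gene_id) links the two tables.

```python
import sqlite3

# In-memory database with two tables that share the label "gene_id".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (gene_id TEXT PRIMARY KEY, symbol TEXT)")
conn.execute("CREATE TABLE annotations (gene_id TEXT, term TEXT)")

conn.executemany(
    "INSERT INTO genes VALUES (?, ?)",
    [("G1", "GENE_A"), ("G2", "GENE_B")],
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [("G1", "DNA binding"), ("G1", "nucleus"), ("G2", "ATPase activity")],
)

# The JOIN follows the shared gene_id from one table into the other.
query = """
SELECT genes.symbol, annotations.term
FROM genes
JOIN annotations ON genes.gene_id = annotations.gene_id
WHERE genes.symbol = ?
ORDER BY annotations.term
"""
results = conn.execute(query, ("GENE_A",)).fetchall()
print(results)
conn.close()
```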

12
Q

Types of databases

A

Primary
-Consist of raw data
-Sequences, structures
Secondary (composite)
-Data from analysis or treatment of the primary data
-Protein families, metabolic pathways
Literature database
-not biological
-PubMed/NCBI
Online book, like GeneReviews
-expert authored
-peer-reviewed disease descriptions

13
Q

Non-redundant database

A

All sequences uploaded for a gene are reduced to one consensus sequence
> like RefSeq: one sequence per gene
> search on subset
> duplicate entries are removed
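
A toy sketch of going from a redundant to a non-redundant set. It is a simplification and not how RefSeq is actually built: the most frequently submitted sequence per gene is kept as a stand-in for a consensus. Gene names and sequences are made up.

```python
from collections import Counter, defaultdict

# Hypothetical redundant submissions: several sequences for the same gene.
submissions = [
    ("GENE_A", "ATGCCGTA"),
    ("GENE_A", "ATGCCGTA"),
    ("GENE_A", "ATGCCGTT"),
    ("GENE_B", "ATGAAATG"),
]

# Group the submitted sequences by gene.
by_gene = defaultdict(list)
for gene, sequence in submissions:
    by_gene[gene].append(sequence)

# Simplifying assumption: keep the most common sequence for each gene.
non_redundant = {
    gene: Counter(sequences).most_common(1)[0][0]
    for gene, sequences in by_gene.items()
}

print(len(submissions), "entries reduced to", len(non_redundant))
```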

14
Q

Redundant

A

Multiple sequences for the same gene

15
Q

The amount of data stored in databases doubles every …, which means:

A

Every 18 months > queries need to be repeated.
- Therefore, check database regularly
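
A small worked example of what an 18-month doubling time implies, assuming idealized exponential growth: size(t) = size(0) * 2^(t/18), with t in months. The function names are illustrative.

```python
def growth_factor(months, doubling_time_months=18.0):
    """Factor by which the database has grown after `months` months."""
    return 2.0 ** (months / doubling_time_months)


def fraction_new(months, doubling_time_months=18.0):
    """Share of the current content that was added in the last `months` months."""
    return 1.0 - 1.0 / growth_factor(months, doubling_time_months)


for months in (12, 18, 36):
    print(months, round(growth_factor(months), 2), round(fraction_new(months), 2))
```

Under this assumption, a query that is one year old has not seen roughly a third of the records that are in the database today, which is why searches need to be repeated.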

16
Q

Most searched organisms in GenBank are ..

A

Mammals, model organisms, fish, plants, microbes

17
Q

Types of queries in GenBank and results

A

-Search sequence by name
-Search sequence by similarity: enter sequence
Results
-Accession code
-Version
-Unique identifier
-Comment (free text)
-Sequence features
-Links to other databases

18
Q

The primary accession code is …

A

a unique unchanging identifier assigned to each GenBank sequence record: used when citing information from GenBank
> can be used for other databases
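
GenBank identifiers are commonly written as accession.version: the accession stays the same, while the version number goes up when the sequence is updated. A minimal sketch of splitting the two parts; the identifier AB123456.2 is a made-up example and the function name is illustrative.

```python
def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is None when the identifier has no '.version' suffix.
    """
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)


# Hypothetical identifiers, used only to show the format.
print(split_accession("AB123456.2"))
print(split_accession("AB123456"))
```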

19
Q

Publication rules for scientific journals

A

-Describe where the data is found in the database
> Data deposit (the sequence)
Because
-Analysis can be validated by other researchers with possible new methods
-But sometimes the description is incomplete or it is already processed and not raw data

20
Q

FTP client

A

For downloading files from a database

21
Q

E-utilities

A

-Set of 8 server-side programs
-Interface to the Entrez query and database system of NCBI
> can be used by entering a URL in a fixed format
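
A sketch of what such a URL looks like for ESearch, one of the E-utilities. The code only builds the URL and does not contact NCBI. The search term and the helper name build_esearch_url are illustrative; db, term and retmax are ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL for database `db` and search term `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("pubmed", "biological databases")
print(url)
```

Pasting the printed URL into a browser returns the search result as XML.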

22
Q

How to find public databases?

A

-Large database providers: NCBI, EBI, SIB
-Nucleic Acids Research
-GeneCards
-Wikipedia list of biological databases

23
Q

Criteria for inclusion in NAR

A

-Thoroughly curated
-Of interest to a wide variety of biologists
-Comprehensiveness of coverage
-Degree of added value (because of manual curation)
-Maintained for long period of time

24
Q

XML output

A

Structured output that a computer can process (machine-readable data)
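
A minimal sketch of a program reading XML output, using Python's standard xml.etree.ElementTree. The XML text is a hand-written sample, loosely modelled on a search result with a count and a list of IDs; the IDs are made up.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real database output contains more fields.
xml_text = """<eSearchResult>
  <Count>2</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
  </IdList>
</eSearchResult>"""

root = ET.fromstring(xml_text)

# The tags tell the program what each value means, so no guessing is needed.
count = int(root.findtext("Count"))
ids = [element.text for element in root.findall("./IdList/Id")]

print(count, ids)
```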

25
Q

What is important to check with a database for reliability?

A

When was it last updated?

26
Q

Parts of databases

A

-Raw Data
> human genome (Ensembl), gene expression (Gene Expression Omnibus), protein sequences (UniProt), protein structures (PDB), compounds (ChEBI)
-High level databases
> pathways (Reactome), protein interactions (STRING), Mendelian inheritance (OMIM)

27
Q

GeneCards is a …

A

Hub to other databases: easily retrieve information about specific genes and proteins

28
Q

Minimal Information Standards: what is MIBBI?

A

Minimum Information for Biological and Biomedical Investigations

29
Q

Six most critical elements contributing towards MIAME

A

> Data
-Raw data for each hybridization
-Final processed normalized data for set of hybridizations in study
Metadata
-The essential sample annotation including experimental factors and values (e.g. compound and dose)
-The experimental design incl sample data relationships
-Sufficient annotation of the array
-Essential laboratory and data processing protocols

30
Q

Which well known database encourages submitters to supply MIAME compliant data?

A

Gene Expression Omnibus (GEO)

31
Q

GEO submission:

A

-Manufacturer (platform description)
-Fill in the raw data
-How do the samples relate to each other
> GEO groups experiments

32
Q

Why the FAIR data principles?

A

To enhance reusability of data
> specific emphasis on enhancing ability of computers to automatically find and use data

33
Q

FAIR data principles: Findable

A

Data should be easy to find for both humans and computers, based on their metadata description
> (meta)data are assigned globally unique identifiers
> data is described by rich metadata
> Metadata clearly and explicitly include the identifier of the data it describes
> (meta) data are registered or indexed in a searchable resource

34
Q

FAIR: accessible

A

-(meta)data are retrievable by identifier using standardized communication protocol
>protocol is open, free and universally implementable
>protocol allows for authentication and authorisation when required
-Metadata should be accessible even when the data are no longer available: the data can then be requested from the authors by email

35
Q

Human data should be accessible with the use of:

A

Contracts that protect privacy; this makes research harder

36
Q

FAIR: interoperable

A

Can be combined with other datasets and computer systems
- data interoperation is a non-trivial problem and the “I” will require the most creative effort
-(meta)data use formal, accessible, shared and broadly applicable language for knowledge representation

37
Q

FAIR: reusable

A

For use in future research and further processing
- good annotation of which processing steps have been done with the data to come to conclusions

38
Q

Challenge of nomenclature

A

How do you assign and maintain correct names of biological objects across databases?
> gene may have several alternative names/symbols
> gene names are not always consistently used for different organisms
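
One common way to deal with alternative names is to map every known alias to a single official symbol before comparing records. A small sketch follows. TP53 with the alias p53 is a well-known real example; GENE_B and its aliases are made up, and the lookup table is illustrative rather than taken from any database.

```python
# Official symbol -> known alternative names (illustrative table).
ALIASES = {
    "TP53": ["p53"],
    "GENE_B": ["geneB", "GB1"],
}

# Reverse lookup: every name (official or alias, case-insensitive)
# points to the official symbol.
TO_OFFICIAL = {}
for official, alternatives in ALIASES.items():
    for name in [official, *alternatives]:
        TO_OFFICIAL[name.lower()] = official


def normalize(name):
    """Return the official symbol for `name`, or None if it is unknown."""
    return TO_OFFICIAL.get(name.lower())


print(normalize("P53"), normalize("gb1"), normalize("unknown_gene"))
```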

39
Q

Example of curation, annotation and provenance: Biocuration

A

Manually checking and correcting
-interpretation and integration of information relevant for biology
-goals
> accurate and reliable representation of biological knowledge, easily accessible and a basis for computational analysis

40
Q

UniProt consists of….

A

-Swiss-Prot (manually reviewed)
-TrEMBL (automatically annotated, not manually reviewed)

41
Q

Gene Ontology

A

Common language for annotation of genes

42
Q

GO objectives

A

-Represent categories used to classify specific parts of our biological knowledge: biological process (network/pathway), molecular function (activity, function), cellular component (location, in which complex)
-Common knowledge applicable to any organism
-GO terms for gene annotation of any species > comparison across species

43
Q

Examples of the three Gene Ontologies

A

-Molecular Function: carbohydrate binding, ATPase activity
-Biological Process: mitosis, purine metabolism
-Cellular component: nucleus, telomere, RNA pol II holoenzyme

44
Q

GO-terms, GO-id, definitions

A

Term: short, like DNA binding transcription factor activity
GO-id: identifier of the term, listed together with its ontology (e.g. molecular function)
Definition: description as text

45
Q

Ontology

A

-Vocabulary of terms
-Definitions
-Defined logical relationships to each other

46
Q

Ontology (network) structure

A

Nodes: terms in ontology
Edges: relationships between concepts

47
Q

Kinds of relationships in GO structure: hierarchical directed acyclic graph (arrows from specific to less specific)

A

-is a
-part of
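
A toy sketch of such a directed acyclic graph, with edges pointing from the more specific term to the less specific one. The terms and relationships are a simplified illustration and are not copied from the actual Gene Ontology. Because a term can have more than one parent, the structure is a graph rather than a tree.

```python
# Each term maps to a list of (relationship, parent term) pairs.
PARENTS = {
    "nuclear chromosome": [("is_a", "chromosome"), ("part_of", "nucleus")],
    "chromosome": [("is_a", "organelle")],
    "nucleus": [("is_a", "organelle")],
    "organelle": [],
}


def ancestors(term, parents):
    """Collect every less specific term reachable from `term`."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


print(sorted(ancestors("nuclear chromosome", PARENTS)))
```

A gene annotated with a specific term is implicitly annotated with all of its ancestors as well, which is what makes comparisons at different levels of detail possible.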

48
Q

Ontology is the representation of something

A

We know about

49
Q

TCGA

A

The Cancer Genome Atlas
-public and omics data
-sign contracts for use of data

50
Q

OMIM

A

Knowledgebase of human genes and phenotypes
> Mendelian Inheritance
> consequences of types of mutations

51
Q

Database consistency: differences between databases

A

-Overlapping / complementary content
-Databases reflect expertise and interest of the groups that maintain them
-Developed for different applications

52
Q

Inconsistencies between databases

A

-in data
-in metadata
-in links between databases

53
Q

Error propagation

A

Database errors can propagate to other databases or scientific literature
> propagation when using erroneous information for annotation of other biological entity (in another database)

54
Q

Privacy (GDPR)

A

Not much public clinical data available
> regulation for data protection (contracts etcetera)