Week 8 (Lecture 14) - Databases Flashcards by Kate pline

Database

a structured collection of data held in computer storage
• especially one that incorporates software to make it accessible in a variety of ways
• any large collection of information

Database management

the organization and manipulation of data in a database

Database management systems (DBMS)

a software package that provides all the functions required for database management

Database system

a database together with a database management system

What is a database?

a collection of data…

structured
• searchable (index) 
--> table of contents
• updated periodically (release)
--> new edition
• cross-referenced (hyperlinks)
--> links with other databases

A database includes

tools (software) necessary for
• access
• updating
• information insertion
• information deletion
etc

Database storage management

• flat files
• relational databases

Flat file

various means to encode a database model (most commonly a table) as a single file
can be a plain text file or a binary file
usually no structural relationships between the records
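
As a minimal sketch of the flat-file idea, the snippet below stores one small table as plain text, one record per line, and reads it back with Python's `csv` module. The column names, accession strings, and rows are illustrative assumptions, not data from the lecture.

```python
import csv
import io

# Hypothetical flat file: a single table, one record per line,
# with no structural links between the records.
FLAT_FILE = """accession,organism,length
X00001,Homo sapiens,1200
X00002,Mus musculus,950
"""

def read_flat_file(text):
    """Parse comma-separated text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_flat_file(FLAT_FILE)
for record in records:
    print(record["accession"], record["organism"], record["length"])
```

Every value comes back as a string, and nothing in the file itself says how one record relates to another.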

Relational database

a database that has a collection of tables of data items, all of which is formally described and organized according to the relational model
data in a single table represents a relation
tables may have additionally defined relationships with each other
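
A small sketch of the relational idea, using Python's built-in `sqlite3` module with an in-memory database. The two tables, their columns, and the rows are hypothetical; the point is only that a defined relationship (here `organism_id`) lets one query combine both tables.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE sequence ("
    "accession TEXT PRIMARY KEY, "
    "organism_id INTEGER REFERENCES organism(id), "
    "length INTEGER)"
)
conn.executemany(
    "INSERT INTO organism VALUES (?, ?)",
    [(1, "Homo sapiens"), (2, "Mus musculus")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("X00001", 1, 1200), ("X00002", 2, 950), ("X00003", 1, 400)],
)

# The relationship between the tables is used to join them in one query.
rows = conn.execute(
    "SELECT s.accession, o.name FROM sequence AS s "
    "JOIN organism AS o ON o.id = s.organism_id "
    "ORDER BY s.accession"
).fetchall()
print(rows)
```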

Why biological databases?

• exponential growth in biological data
• data are no longer published in a conventional manner, but directly submitted to databases
-- genomic sequences
-- 3D structures
-- 2D gel analysis
-- MS analysis
-- microarrays

• essential tools for biological research
– the only way to publish massive amounts of data without using all the paper in the world

The first database that emerged concentrated on

collecting and annotating nucleotide and protein sequences generated by the early sequencing techniques

Number of different biological databases

more than 1000

Size of databases - variable

<100 Kb to >20 Gb
• DNA: >20 Gb
• protein: 1 Gb
• 3D structure: 5 Gb

Update frequency

anywhere from daily, to annually, to seldom, to effectively never

• usually accessible through the web

Some databases in the field of molecular biology

AATDB
AceDB
ACUTS
ADB
AFDB

Categories of databases for life sciences

sequence (DNA, protein)
genomics
mutation/polymorphism
protein domain/family
proteomics (2D gel, mass spectrometry)
3D structure
metabolic networks
regulatory networks
bibliography
expression (microarrays…)
specialized

NCBI

GenBank is maintained at the National Center for Biotechnology Information
• Maryland, USA

EMBL

European Molecular Biology Laboratory
• at the European Bioinformatics Institute
• Cambridge, UK

DDBJ

DNA Data Bank of Japan
• at National Institute of Genetics
• Mishima, Japan

Objectives of these databases

EMBL, GenBank, DDBJ

to ensure that DNA sequence information is stored in a way that is PUBLICLY and FREELY accessible
and can be retrieved and used by other researchers in the future

Literature databases

Bookshelf
PubMed
PubMed Central
OMIM

Bookshelf

a collection of searchable biomedical books linked to PubMed

• literature database

PubMed

allows searching by author names, journal titles, and a new Preview/Index option
• provides access to over 12 million MEDLINE citations dating back to the mid-1960s
• includes History and Clipboard options which may enhance your search session
• literature database

OMIM

Online Mendelian Inheritance in Man
• database of human genes and genetic disorders (also OMIA)
• literature database

PubMed - Medline

• covers the fields of
-- medicine
-- nursing
-- dentistry
-- veterinary medicine
-- public health
-- preclinical sciences
• citations from ~5,200 journals worldwide in 37 languages (60 languages for older journals)
• contains over 20 million citations since 1948
• contains links to biological databases and some journals
• new records added to PreMEDLINE daily

PubMed - literature searching

• can find papers on a given subject
• can find papers on a specific gene
• can find papers related to a given paper
• can switch between literature and sequence databases
• has links to publishers' websites to view full text of articles
• PubMed has free full-text copies

NCBI search

enter search in the query box and hit "Go"

The syntax (for Entrez)

1. Boolean operators - in uppercase
2. Boolean operators are read left-to-right unless parentheses are used
3. quotation marks
4. asterisk

Boolean operators

• AND
• OR
• NOT

AND is the default

Entrez processes all Boolean operators in a

left-to-right sequence
• the order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses ()
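
The effect of left-to-right processing can be sketched with Python sets. This is a toy model of the behaviour described on the card, not Entrez's implementation; the index, terms, and document IDs are all hypothetical.

```python
# Hypothetical index: search term -> set of IDs of documents containing it.
INDEX = {
    "kinase": {1, 2, 3},
    "human": {2, 3, 4},
    "mouse": {3, 5},
}

def apply_op(left, op, right):
    """Combine two result sets with one Boolean operator."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "NOT":
        return left - right
    raise ValueError(f"unknown operator: {op}")

# kinase OR human AND mouse, read left to right:
# (kinase OR human) AND mouse
left_to_right = apply_op(
    apply_op(INDEX["kinase"], "OR", INDEX["human"]), "AND", INDEX["mouse"]
)

# kinase OR (human AND mouse): parentheses change the order
with_parentheses = apply_op(
    INDEX["kinase"], "OR", apply_op(INDEX["human"], "AND", INDEX["mouse"])
)

print(sorted(left_to_right))     # [3]
print(sorted(with_parentheses))  # [1, 2, 3]
```

The same three terms and two operators give different result sets depending only on grouping, which is why parentheses matter in longer queries.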

Quotation marks

the term inside the quotation marks is read as one phrase

Asterisk

extends the search to all terms that start with the letters before the asterisk
• e.g. dia* --> diaphragm, dial, diameter
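
A minimal sketch of trailing-asterisk matching, treating it as a simple prefix test. The function name and vocabulary are hypothetical, and a real search engine may cap or rank the expanded terms differently.

```python
def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms matched by a pattern.

    A trailing asterisk matches every term starting with the preceding
    letters; without an asterisk only the exact term matches.
    """
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(term for term in vocabulary if term.startswith(prefix))
    return sorted(term for term in vocabulary if term == pattern)

# Hypothetical vocabulary of indexed terms.
VOCAB = ["diaphragm", "dial", "diameter", "disease", "dna"]
matches = expand_wildcard("dia*", VOCAB)
print(matches)  # ['dial', 'diameter', 'diaphragm']
```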

History feature

allows you to combine any of your past queries

Limits feature

allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc

Similarity between documents is measured by...

the words they have in common
• which words are considered?
• what is the weight of each word?
• how do we calculate a similarity score of 2 articles?

Relationships between sequences are computed with

BLAST

Relationships between articles are computed with

• MeSH terms
• shared keywords

Relationships between DNA and protein sequences rely on

accession numbers

Relationships between sequences and MEDLINE articles rely on both

• shared keywords
• the mention of accession numbers in the articles
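
One way to picture the second link is a pattern search for accession numbers in article text. The sketch below assumes only the common nucleotide accession shapes of one letter plus five digits or two letters plus six digits; other formats exist and are not covered. The sample sentence and the accession strings in it are made up for illustration.

```python
import re

# Assumed accession shapes: 1 letter + 5 digits, or 2 letters + 6 digits.
ACCESSION_RE = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

def find_accessions(text):
    """Return accession-like strings in the order they appear in the text."""
    return ACCESSION_RE.findall(text)

# Hypothetical abstract text.
abstract = "Sequences were deposited under accession numbers X12345 and AB123456."
found = find_accessions(abstract)
print(found)  # ['X12345', 'AB123456']
```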

Computation of related articles - words considered

• remove stopwords - uninformative
• stem words
• words from the abstract are "text words"
• words from the title are put in twice
• words from the MeSH terms
-- US National Library of Medicine
-- vocabulary used for indexing articles
-- consistent way to retrieve information
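
The steps above can be sketched roughly as follows. The stopword list is a tiny illustrative one, `crude_stem` is a deliberately naive stand-in for a real stemming algorithm, and treating each MeSH term as a single whole entry is an assumption; none of this is the actual procedure used for PubMed.

```python
import re
from collections import Counter

STOPWORDS = {"the", "of", "in", "a", "and", "for", "is"}  # illustrative only

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stopwords, stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]

def document_terms(title, abstract, mesh_terms):
    """Count terms: abstract words once, title words twice, MeSH terms once."""
    counts = Counter(tokenize(abstract))
    for term in tokenize(title):
        counts[term] += 2
    for term in mesh_terms:
        counts[term.lower()] += 1  # assumption: MeSH terms kept whole
    return counts

# Hypothetical article fields.
terms = document_terms(
    title="Sequencing of kinase genes",
    abstract="The kinase genes in a mouse genome",
    mesh_terms=["Mice"],
)
print(terms)
```

Here "kinase" ends up with a count of 3: once from the abstract and twice from the title.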

Global weight - greater if the word is

less frequent in the whole database

Local weight - greater if the word is

more frequent in the document
• longer document isn't favored
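
The cards give only the direction of each weight, not a formula. As a sketch under that caveat, the snippet below assumes an inverse-document-frequency form for the global weight and a length-normalized count for the local weight; both formulas, the function names, and the numbers are assumptions chosen to show the stated behaviour.

```python
import math

def global_weight(term, doc_frequency, n_docs):
    """Assumed form: rarer terms across the database get a larger weight."""
    return math.log(n_docs / doc_frequency[term])

def local_weight(term, doc_terms):
    """Assumed form: the term's share of all terms in the document,
    so a longer document is not favored just for being longer."""
    return doc_terms.count(term) / len(doc_terms)

# Hypothetical database statistics: number of documents containing each term.
DOC_FREQUENCY = {"kinase": 10, "cell": 500}
N_DOCS = 1000

rare = global_weight("kinase", DOC_FREQUENCY, N_DOCS)
common = global_weight("cell", DOC_FREQUENCY, N_DOCS)

short_doc = ["kinase", "cell"]
long_doc = ["kinase", "cell"] * 10  # ten times longer, same proportions

print(rare > common)
print(local_weight("kinase", short_doc), local_weight("kinase", long_doc))
```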

Computation of related articles

weight of each word

Similarity score of 2 articles

sum of weights of all common words

Weight of 1 pair of common words

local weight 1 × local weight 2 × global weight
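
A short worked example of the product above, summed over the words two articles share. All weights are made-up numbers picked to keep the arithmetic easy to follow.

```python
def similarity(local_a, local_b, global_weight):
    """Sum, over the words both articles contain, of
    local weight in A * local weight in B * global weight."""
    common = set(local_a) & set(local_b)
    return sum(local_a[w] * local_b[w] * global_weight[w] for w in common)

# Hypothetical local weights per article and global weights per word.
article_1 = {"kinase": 2.0, "mouse": 1.0, "genome": 1.0}
article_2 = {"kinase": 1.0, "genome": 3.0, "human": 2.0}
global_w = {"kinase": 0.5, "mouse": 1.0, "genome": 2.0, "human": 1.5}

score = similarity(article_1, article_2, global_w)
print(score)  # kinase: 2*1*0.5 = 1.0, genome: 1*3*2 = 6.0, total 7.0
```

Words that appear in only one article ("mouse", "human") contribute nothing to the score.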

The higher the similarity score

the closer the 2 articles

Similarity scores are

pre-computed

Database entries have

well-defined file formats
• this is important so that data can be read by the computer and extracted automatically

All database entries have a few things in common

• accession number
• description

Accession number

a unique identifier assigned when the entry is originally added to the database - and should not change

Description

• name of the gene
• organism it came from
• who sequenced it and when
• some info about structure and function
• whether there are papers describing research related to it

Small errors in file formats are a prime reason why people have problems

getting bioinformatics software to work
• there are surprisingly many formats - so many that specialist software has been written just to convert sequences from one format to another
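
To give a feel for what such a converter does, here is a minimal sketch that turns FASTA-style text (a `>` header line followed by sequence lines) into one tab-separated line per sequence. FASTA is used here as an example format of my choosing, the card does not name one; the function name and sample sequences are hypothetical.

```python
def fasta_to_tab(fasta_text):
    """Convert FASTA-style text to 'header<TAB>sequence' lines."""
    records = []
    header, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            # A small format error: sequence data with no header above it.
            raise ValueError("sequence data found before any '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return "\n".join(f"{h}\t{s}" for h, s in records)

# Hypothetical input with one sequence wrapped over two lines.
FASTA = """>seq1 hypothetical example
ACGTAC
GTTA
>seq2 another example
GGCC
"""
tabbed = fasta_to_tab(FASTA)
print(tabbed)
```

The `ValueError` branch shows how a single missing header line is enough to stop a strict parser.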

Database search strategies

• general search principles - not limited to sequence or to biology
• start with broad keywords and narrow the search using more specific terms
• try variants of spelling, numbers, etc.
• search many databases

Week 8 (Lecture 14) - Databases Flashcards

(53 cards)