Week 8 (Lecture 14) - Databases Flashcards

1
Q

Database

A

a structured collection of data held in computer storage
• especially one that incorporates software to make it accessible in a variety of ways
• any large collection of information

2
Q

Database management

A

the organization and manipulation of data in a database

3
Q

Database management systems (DBMS)

A

a software package that provides all the functions required for database management

4
Q

Database system

A

a database together with a database management system

5
Q

What is a database?

a collection of data…

A
• structured
• searchable (index)
--> table of contents
• updated periodically (release)
--> new edition
• cross-referenced (hyperlinks)
--> links with other databases
6
Q

A database includes

A
tools (software) necessary for
• access
• updating
• information insertion
• information deletion
etc
7
Q

Database storage management

A
  • flat files
  • relational databases

8
Q

Flat file

A
  • various means to encode a database model (most commonly a table) as a single file
  • can be a plain text file or a binary file
  • usually no structural relationships between the records
9
Q

Relational database

A
  • a database that has a collection of tables of data items, all of which is formally described and organized according to the relational model
  • data in a single table represents a relation
  • tables may have additionally defined relationships with each other
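
A rough sketch contrasting the two storage approaches from the cards above, using only the Python standard library. The gene and organism records, the table names, and the column names are all made up for illustration.

```python
import csv
import io
import sqlite3

# Flat file: one table encoded as plain text, with no links between records.
# (Hypothetical toy data; the IDs are invented.)
flat_text = (
    "gene_id,gene_name,organism\n"
    "G1,geneA,Homo sapiens\n"
    "G2,geneB,Mus musculus\n"
)
flat_rows = list(csv.DictReader(io.StringIO(flat_text)))

# Relational: two tables, where genes.organism_id refers to organisms.id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organisms (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE genes ("
    "id TEXT PRIMARY KEY, name TEXT, "
    "organism_id INTEGER REFERENCES organisms(id))"
)
conn.executemany(
    "INSERT INTO organisms VALUES (?, ?)",
    [(1, "Homo sapiens"), (2, "Mus musculus")],
)
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("G1", "geneA", 1), ("G2", "geneB", 2)],
)

# The defined relationship lets a query combine both tables.
joined = conn.execute(
    "SELECT genes.name, organisms.name FROM genes "
    "JOIN organisms ON genes.organism_id = organisms.id "
    "ORDER BY genes.id"
).fetchall()
```

In the flat file the organism name is repeated in every record; in the relational version it is stored once and reached through the relationship between the tables.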
10
Q

Why biological databases?

A
• exponential growth in biological data
• data are no longer published in a conventional manner, but directly submitted to databases
-- genomic sequences
-- 3D structures
-- 2D gel analysis
-- MS analysis
-- microarrays

• essential tools for biological research
– the only way to publish massive amounts of data without using all the paper in the world

11
Q

The first databases that emerged concentrated on

A

collecting and annotating nucleotide and protein sequences generated by the early sequencing techniques

12
Q

Number of different biological databases

A

more than 1000

13
Q

Size of databases - variable

A

<100 Kb to >20 Gb
• DNA: >20 Gb
• protein: 1 Gb
• 3D structure: 5 Gb

14
Q

Update frequency

A

varies widely: daily, annually, seldom, or never ("forget about it")

• usually accessible through the web

15
Q

Some databases in the field of molecular biology

A
  • AATDB
  • AceDB
  • ACUTS
  • ADB
  • AFDB
16
Q

Categories of databases for life sciences

A
  • sequence (DNA, protein)
  • genomics
  • mutation/polymorphism
  • protein domain/family
  • proteomics (2D gel, mass spectrometry)
  • 3D structure
  • metabolic networks
  • regulatory networks
  • bibliography
  • expression (microarrays…)
  • specialized
17
Q

NCBI

A

National Center for Biotechnology Information
• maintains GenBank
• Bethesda, Maryland, USA

18
Q

EMBL

A

European Molecular Biology Laboratory
• its nucleotide sequence database is maintained at the European Bioinformatics Institute (EBI)
• Hinxton, near Cambridge, UK

19
Q

DDBJ

A

DNA Data Bank of Japan
• at National Institute of Genetics
• Mishima, Japan

20
Q

Objectives of these databases

EMBL, GenBank, DDBJ

A
  • to ensure that DNA sequence information is stored in a way that is PUBLICLY and FREELY accessible
  • and can be retrieved and used by other researchers in the future
21
Q

Literature databases

A
  • Bookshelf
  • PubMed
  • PubMed Central
  • OMIM
22
Q

Bookshelf

A

a collection of searchable biomedical books linked to PubMed

• literature database

23
Q

PubMed

A

allows searching by author names, journal titles, and a new Preview/Index option
• provides access to over 12 million MEDLINE citations dating back to the mid-1960s
• includes History and Clipboard options which may enhance your search session
• literature database

24
Q

OMIM

A

Online Mendelian Inheritance in Man
• database of human genes and genetic disorders (see also OMIA, Online Mendelian Inheritance in Animals)
• literature database

25
PubMed - Medline
covers the fields of
• medicine
• nursing
• dentistry
• veterinary medicine
• public health
• preclinical sciences

• citations from ~5,200 worldwide journals in 37 languages (60 languages for older journals)
• contains over 20 million citations since 1948
• contains links to biological databases and some journals
• new records added to PreMEDLINE daily
26
PubMed - literature searching
• can find papers on a given subject
• can find papers on a specific gene
• can find papers related to a given paper
• can switch between literature and sequence databases
• has links to publishers' websites to view the full text of articles
• links to free full-text copies (e.g. in PubMed Central)
27
NCBI search
enter search in the query box and hit "Go"
28
The syntax (for Entrez)
1. Boolean operators - in uppercase
2. Boolean operators are read left to right; parentheses change the order
3. quotation marks
4. asterisk
29
Boolean operators
• AND
• OR
• NOT
• AND is the default
30
Entrez processes all Boolean operators in a
left-to-right sequence
• the order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses ()
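
A small sketch of why the processing order matters, modelling each search term as a set of matching record IDs. The terms `a`, `b`, `c` and the IDs are hypothetical; this only illustrates the left-to-right rule and is not how Entrez is implemented.

```python
# Hypothetical sets of record IDs matching each single search term.
hits = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}}

# "a OR b AND c" processed strictly left to right means (a OR b) AND c.
# Python's own operator precedence differs, so the grouping is explicit here.
left_to_right = (hits["a"] | hits["b"]) & hits["c"]

# "a OR (b AND c)": the parentheses force b AND c to be evaluated first.
with_parens = hits["a"] | (hits["b"] & hits["c"])

print(left_to_right)  # {4}
print(with_parens)    # {1, 2, 3, 4}
```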
31
Quotation marks
the term inside the quotation marks is read as one phrase
32
Asterisk
extends the search to all terms that start with the letters before the asterisk
• e.g. dia* --> diaphragm, dial, diameter
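
The asterisk rule can be sketched as simple prefix matching. The function name `truncation_search` and the word list are made up for this example.

```python
def truncation_search(pattern, vocabulary):
    """Return the terms that start with the text before the trailing asterisk."""
    prefix = pattern.rstrip("*")
    return [term for term in vocabulary if term.startswith(prefix)]


# Hypothetical vocabulary of indexed terms.
terms = ["diaphragm", "dial", "diameter", "dna", "radial"]
matches = truncation_search("dia*", terms)
print(matches)  # ['diaphragm', 'dial', 'diameter']
```

Note that "radial" is not matched: the prefix must appear at the start of the term.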
33
History feature
allows you to combine any of your past queries
34
Limits feature
allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc
35
Similarity between documents is measured by...
the words they have in common
• which words are considered?
• what is the weight of each word?
• how do we calculate a similarity score of 2 articles?
36
Relationships between sequences are computed with
BLAST
37
Relationships between articles are computed with
• MeSH terms
• shared keywords
38
Relationships between DNA and protein sequences rely on
accession numbers
39
Relationships between sequences and MEDLINE articles rely on both
• shared keywords
• the mention of accession numbers in the articles
40
Computation of related articles - words considered
• remove stopwords - uninformative
• stem words
• words from the abstract are "text words"
• words from the title are counted twice
• words from the MeSH terms
-- US National Library of Medicine
-- vocabulary used for indexing articles
-- consistent way to retrieve information
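
The word-selection steps above could look roughly like this. The stopword list, the toy stemmer, and the choice to keep each MeSH term as one whole phrase are assumptions made for the sketch, not details given in the lecture.

```python
from collections import Counter

# Tiny hypothetical stopword list; real lists are much longer.
STOPWORDS = {"the", "of", "in", "a", "and", "is"}


def naive_stem(word):
    # Toy stemmer: strips a trailing "s". Real stemmers are far more careful.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def article_terms(title, abstract, mesh_terms):
    """Count the words considered for one article."""
    counts = Counter()
    # Title words are put in twice, abstract words once.
    for text, times in ((title, 2), (abstract, 1)):
        for word in text.lower().split():
            if word not in STOPWORDS:
                counts[naive_stem(word)] += times
    # Assumption: each MeSH term is kept as a single phrase.
    for term in mesh_terms:
        counts[term.lower()] += 1
    return counts


terms = article_terms(
    "Kinases in yeast", "the kinase is essential", ["Protein Kinases"]
)
```

Here "kinase" ends up with a count of 3: twice from the title (after stemming "kinases") and once from the abstract.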
41
Global weight - greater if the word is
less frequent in the whole database
42
Local weight - greater if the word is
more frequent in the document
• a longer document isn't favored
43
Computation of related articles
weight of each word
44
Similarity score of 2 articles
sum of weights of all common words
45
Weight of 1 pair of common words
local weight 1 × local weight 2 × global weight
46
The higher the similarity score
the closer the 2 articles
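
A minimal sketch of the scoring idea in the cards above: for every word two articles share, multiply the two local weights by the global weight, then sum. The concrete formulas are assumptions chosen only to match the stated properties: the local weight is the word count divided by document length (so a longer document isn't favored), and the global weight is log(N / document frequency) (so rarer words weigh more). The actual weighting used by PubMed is not given in the lecture.

```python
import math


def similarity(doc1, doc2, doc_freq, n_docs):
    """Sum local1 * local2 * global over the words common to both documents.

    doc1, doc2: lists of (already filtered and stemmed) words.
    doc_freq:   number of documents in the database containing each word.
    n_docs:     total number of documents in the database.
    """
    score = 0.0
    for word in set(doc1) & set(doc2):
        local1 = doc1.count(word) / len(doc1)  # assumed local weight
        local2 = doc2.count(word) / len(doc2)
        global_weight = math.log(n_docs / doc_freq[word])  # assumed global weight
        score += local1 * local2 * global_weight
    return score


# Hypothetical toy data.
doc_a = ["kinase", "yeast"]
doc_b = ["kinase", "human"]
freq = {"kinase": 10, "yeast": 50, "human": 500}
score_ab = similarity(doc_a, doc_b, freq, n_docs=1000)
```

Two articles with no words in common get a score of 0; the more rare words they share, the higher the score.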
47
Similarity scores are
pre-computed
48
Database entries have
well-defined file formats
• this is important so that data can be read by the computer and extracted automatically
49
All database entries have a few things in common
• accession number
• description
50
Accession number
a unique identifier assigned when the entry is originally added to the database - and should not change
51
Description
• name of the gene
• organism it came from
• who sequenced it and when
• some info about structure and function
• whether there are papers describing research related to it
52
Small errors in file formats are a prime reason why people have problems
getting bioinformatics software to work
• there are surprisingly many formats - so many that specialist software has been written just to convert sequences from one format to another
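
To illustrate why a well-defined format matters, here is a rough sketch of a parser for FASTA-style text, a common sequence format chosen for this example (the cards do not name a specific format). The accession number and description in the sample record are invented. The parser treats the first word of the header as the accession number, and a small format error such as sequence data before any header makes it fail.

```python
def parse_fasta(text):
    """Parse FASTA-style text into a list of records (dicts)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: first word is the accession, the rest the description.
            accession, _, description = line[1:].partition(" ")
            records.append(
                {"accession": accession, "description": description, "sequence": ""}
            )
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1]["sequence"] += line
        else:
            raise ValueError("sequence data before the first header line")
    return records


# Hypothetical record; the accession and description are made up.
example = ">X00001 hypothetical gene, made-up organism\nATGC\nGGTA\n"
records = parse_fasta(example)
```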
53
Database search strategies
• general search principles - not limited to sequence or to biology
• start with broad keywords and narrow the search using more specific terms
• try variants of spelling, numbers, etc.
• search many databases