Week 8 (Lecture 14) - Databases Flashcards

1
Q

Database

A

a structured collection of data held in computer storage
• especially one that incorporates software to make it accessible in a variety of ways
• any large collection of information

2
Q

Database management

A

the organization and manipulation of data in a database

3
Q

Database management systems (DBMS)

A

a software package that provides all the functions required for database management

4
Q

Database system

A

a database together with a database management system

5
Q

What is a database?

a collection of data…

A
• structured
• searchable (index)
--> table of contents
• updated periodically (release)
--> new edition
• cross-referenced (hyperlinks)
--> links with other databases
6
Q

A database includes

A
tools (software) necessary for
• access
• updating
• information insertion
• information deletion
etc
7
Q

Database storage management

A
  • flat files
  • relational databases

8
Q

Flat file

A
  • various means to encode a database model (most commonly a table) as a single file
  • can be a plain text file or a binary file
  • usually no structural relationships between the records
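
A minimal sketch of a plain-text flat file, assuming a tab-separated table with a header row. The field names and records are made up for illustration and do not come from any real database.

```python
import csv
import io

# Hypothetical flat file: one table, one record per line, tab-separated.
FLAT_FILE = (
    "accession\torganism\tlength\n"
    "X00001\tHomo sapiens\t1200\n"
    "X00002\tMus musculus\t950\n"
)

def read_flat_file(text):
    """Return the records of a tab-separated flat file as a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

records = read_flat_file(FLAT_FILE)
# Each record is an independent row; nothing links one record to another.
human = [r["accession"] for r in records if r["organism"] == "Homo sapiens"]
print(human)
```

Every query has to scan the whole file, which is one reason larger collections tend to move to a relational database.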
9
Q

Relational database

A
  • a database that has a collection of tables of data items, all of which are formally described and organized according to the relational model
  • data in a single table represents a relation
  • tables may have additionally defined relationships with each other
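
A minimal sketch of two related tables, using Python's built-in sqlite3 module with an in-memory database. The table names, column names, and rows are hypothetical, chosen only to show a relationship defined between tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sequence (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(id),  -- relationship between tables
    length      INTEGER
);
INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO sequence VALUES
    ('X00001', 1, 1200), ('X00002', 2, 950), ('X00003', 1, 430);
""")

# Follow the relationship: which sequences belong to Homo sapiens?
rows = conn.execute("""
SELECT s.accession, o.name
FROM sequence AS s JOIN organism AS o ON s.organism_id = o.id
WHERE o.name = 'Homo sapiens'
ORDER BY s.accession
""").fetchall()
print(rows)
conn.close()
```

The organism name is stored once and referenced by id, which a single flat file cannot express.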
10
Q

Why biological databases?

A
• exponential growth in biological data
• data are no longer published in a conventional manner, but directly submitted to databases
-- genomic sequences
-- 3D structures
-- 2D gel analysis
-- MS analysis
-- microarrays

• essential tools for biological research
– the only way to publish massive amounts of data without using all the paper in the world

11
Q

The first databases that emerged concentrated on

A

collecting and annotating nucleotide and protein sequences generated by the early sequencing techniques

12
Q

Number of different biological databases

A

more than 1000

13
Q

Size of databases - variable

A

<100 Kb to >20 Gb
• DNA: >20 Gb
• protein: 1 Gb
• 3D structure: 5 Gb

14
Q

Update frequency

A

ranges from daily, to annually, to seldom, to never (abandoned)

• usually accessible through the web

15
Q

Some databases in the field of molecular biology

A
  • AATDB
  • AceDB
  • ACUTS
  • ADB
  • AFDB
16
Q

Categories of databases for life sciences

A
  • sequence (DNA, protein)
  • genomics
  • mutation/polymorphism
  • protein domain/family
  • proteomics (2D gel, mass spectrometry)
  • 3D structure
  • metabolic networks
  • regulatory networks
  • bibliography
  • expression (microarrays…)
  • specialized
17
Q

NCBI

A

National Center for Biotechnology Information
• maintains GenBank
• Bethesda, Maryland, USA

18
Q

EMBL

A

European Molecular Biology Laboratory
• its nucleotide sequence database is maintained at the European Bioinformatics Institute (EMBL-EBI)
• Hinxton, near Cambridge, UK

19
Q

DDBJ

A

DNA Data Bank of Japan
• at National Institute of Genetics
• Mishima, Japan

20
Q

Objectives of these databases

EMBL, GenBank, DDBJ

A
  • to ensure that DNA sequence information is stored in a way that is PUBLICLY and FREELY accessible
  • and can be retrieved and used by other researchers in the future
21
Q

Literature databases

A
  • Bookshelf
  • PubMed
  • PubMed Central
  • OMIM
22
Q

Bookshelf

A

a collection of searchable biomedical books linked to PubMed

• literature database

23
Q

PubMed

A

allows searching by author names, journal titles, and a new Preview/Index option
• provides access to over 12 million MEDLINE citations dating back to the mid-1960s
• includes History and Clipboard options which may enhance your search session
• literature database

24
Q

OMIM

A

Online Mendelian Inheritance in Man
• database of human genes and genetic disorders (see also OMIA, Online Mendelian Inheritance in Animals)
• literature database

25
Q

PubMed - Medline

A
covers the fields of 
• medicine
• nursing
• dentistry
• veterinary medicine
• public health
• preclinical sciences
  • citations from ~5,200 worldwide journals in 37 languages (60 languages for older journals)
  • contains over 20 million citations since 1948
  • contains links to biological databases and some journals
  • new records added to PreMEDLINE daily
26
Q

PubMed - literature searching

A
  • can find papers on a given subject
  • can find papers on a specific gene
  • can find papers related to a given paper
  • can switch between literature and sequence databases
  • has links to publishers’ websites to view full text of articles
  • PubMed Central has free full-text copies
27
Q

NCBI search

A

enter search in the query box and hit “Go”

28
Q

The syntax (for Entrez)

A
  1. Boolean operators - in uppercase
  2. Boolean operators are read left to right; parentheses change the order
  3. quotation marks
  4. asterisk
29
Q

Boolean operators

A
  • AND
  • OR
  • NOT

• AND is the default

30
Q

Entrez processes all Boolean operators in a

A

left-to-right sequence
• the order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses ()
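
A toy model of left-to-right processing, assuming each search term maps to a set of document IDs. The function name, index, and terms are hypothetical; this sketches the rule stated on the card and is not Entrez's actual implementation.

```python
def evaluate_left_to_right(tokens, index):
    """Evaluate [term, OP, term, OP, term, ...] strictly left to right.

    `index` maps each term to the set of document IDs that contain it.
    """
    result = set(index.get(tokens[0], set()))
    for i in range(1, len(tokens) - 1, 2):
        op, term = tokens[i], tokens[i + 1]
        docs = index.get(term, set())
        if op == "AND":
            result &= docs
        elif op == "OR":
            result |= docs
        elif op == "NOT":
            result -= docs
        else:
            raise ValueError(f"unknown operator: {op}")
    return result

INDEX = {"kinase": {1, 2}, "human": {2, 3}, "mouse": {3, 4}}

# kinase OR human AND mouse, read left to right: (kinase OR human) AND mouse
left_to_right = evaluate_left_to_right(
    ["kinase", "OR", "human", "AND", "mouse"], INDEX
)
# Parentheses change the grouping: kinase OR (human AND mouse)
inner = evaluate_left_to_right(["human", "AND", "mouse"], INDEX)
grouped = INDEX["kinase"] | inner
print(left_to_right, grouped)
```

The two results differ, which is why grouping concepts in parentheses matters for mixed AND/OR queries.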

31
Q

Quotation marks

A

the term inside the quotation marks is read as one phrase

32
Q

Asterisk

A

extends the search to all terms that start with the letters before the asterisk
• e.g. dia* --> diaphragm, dial, diameter
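
A small sketch of the asterisk as a prefix match, assuming a fixed vocabulary list to search in. The function name and vocabulary are made up for illustration.

```python
def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms matched by a trailing-asterisk pattern."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(t for t in vocabulary if t.startswith(prefix))
    # Without an asterisk, only the exact term matches.
    return sorted(t for t in vocabulary if t == pattern)

VOCABULARY = ["diaphragm", "dial", "diameter", "diet", "radial"]
matches = expand_wildcard("dia*", VOCABULARY)
print(matches)
```

Note that "radial" contains "dia" but does not start with it, so it is not matched.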

33
Q

History feature

A

allows you to combine any of your past queries

34
Q

Limits feature

A

allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc

35
Q

Similarity between documents is measured by…

A

the words they have in common
• which words are considered?
• what is the weight of each word?
• how do we calculate a similarity score of 2 articles?

36
Q

Relationships between sequences are computed with

A

BLAST

37
Q

Relationships between articles are computed with

A

MeSH terms

• shared keywords

38
Q

Relationships between DNA and protein sequences rely on

A

accession numbers

39
Q

Relationships between sequences and MEDLINE articles rely on both

A
  • shared keywords
  • the mention of accession numbers in the articles

40
Q

Computation of related articles - words considered

A
• remove stopwords - uninformative
• stem words
• words from the abstract are "text words"
• words from the title are put in twice
• words from the MeSH terms
-- US National Library of Medicine
-- vocabulary used for indexing articles
-- consistent way to retrieve information
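
The word-selection steps above can be sketched as follows. The stopword list and the suffix-stripping stemmer are deliberately crude stand-ins, assumed here for illustration only; the notes do not say which stopword list or stemming algorithm is actually used.

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}

def crude_stem(word):
    """Toy stemmer: strip a common suffix. Not a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, drop stopwords, stem what is left."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOPWORDS]

def article_words(title, abstract, mesh_terms):
    """Collect the words of one article; title words are put in twice."""
    words = tokenize(title) * 2 + tokenize(abstract)
    for term in mesh_terms:
        words += tokenize(term)
    return words

words = article_words(
    "Sequencing of kinases", "The kinase binds DNA", ["Protein Kinases"]
)
print(words)
```

Stemming lets "kinase" and "kinases" count as the same word, and doubling the title gives title words more influence.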
41
Q

Global weight - greater if the word is

A

less frequent in the whole database

42
Q

Local weight - greater if the word is

A

more frequent in the document

• longer document isn’t favored

43
Q

Computation of related articles

A

weight of each word

44
Q

Similarity score of 2 articles

A

sum of weights of all common words

45
Q

Weight of 1 pair of common words

A
local weight 1 × local weight 2 × global weight
46
Q

The higher the similarity score

A

the closer the 2 articles
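
A minimal sketch of the score described on the preceding cards: the sum, over all common words, of local weight 1 times local weight 2 times global weight. The exact weight formulas are not given in the notes, so the log-based global weight and the length-normalized local weight below are assumptions chosen only to match the stated properties.

```python
import math
from collections import Counter

def global_weight(word, documents):
    """Assumed form: log(N / number of documents containing the word).

    Rarer words get a larger weight. The word must occur in `documents`.
    """
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def local_weight(word, doc):
    """Assumed form: count / document length, so long documents are not favored."""
    return Counter(doc)[word] / len(doc)

def similarity(doc_a, doc_b, documents):
    """Sum over common words of local weight 1 * local weight 2 * global weight."""
    common = set(doc_a) & set(doc_b)
    return sum(
        local_weight(w, doc_a) * local_weight(w, doc_b) * global_weight(w, documents)
        for w in common
    )

# Hypothetical articles, already reduced to word lists.
DOCS = [
    ["kinase", "human", "cell"],
    ["kinase", "mouse", "cell"],
    ["plant", "root", "cell"],
    ["virus", "genome", "rna"],
]
score_close = similarity(DOCS[0], DOCS[1], DOCS)
score_far = similarity(DOCS[0], DOCS[2], DOCS)
print(score_close > score_far)
```

The first two articles share the rarer word "kinase" as well as "cell", so they score higher than a pair that shares only the common word "cell".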

47
Q

Similarity scores are

A

pre-computed

48
Q

Database entries have

A

well-defined file formats

• this is important so that data can be read by the computer and extracted automatically

49
Q

All database entries have a few things in common

A
  • accession number
  • description

50
Q

Accession number

A

a unique identifier assigned when the entry is originally added to the database - and should not change

51
Q

Description

A
  • name of the gene
  • organism it came from
  • who sequenced it and when
  • some info about structure and function
  • whether there are papers describing research related to it
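
As an illustration of extracting these two fields automatically, here is a sketch that splits a FASTA-style header line into accession number and description. It assumes the simple layout ">accession description"; real databases use several header layouts, and the entry shown is made up.

```python
def parse_header(line):
    """Split a FASTA-style header into (accession, description)."""
    if not line.startswith(">"):
        raise ValueError("header line must start with '>'")
    accession, _, description = line[1:].strip().partition(" ")
    return accession, description

# Hypothetical entry; the accession and description are invented.
accession, description = parse_header(">X00001 example kinase gene, Homo sapiens")
print(accession, "|", description)
```

A header missing the leading ">" is rejected, a small example of how a minor format error can stop software from reading an entry.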
52
Q

Small errors in file formats are a prime reason why people have problems

A

getting bioinformatics software to work
• there are surprisingly many formats - so many that specialist software has been written just to convert sequences from one format to another

53
Q

Database search strategies

A
  • general search principles - not limited to sequence or to biology
  • start with broad keywords and narrow the search using more specific terms
  • try variants of spelling, numbers, etc.
  • search many databases