Databases (Week 2) Flashcards

Question 1

Q

Define “Bioinformatics”

Answer

A

Oxford: Management info system for molecular biology + practical applications

NCBI: Field of science in which biology, computer science + information technology
merge into 1 discipline. 3 sub-disciplines:
✓ Development of new algorithms + statistics to assess relationships among
members of large data sets
✓ Analysis + interpretation of data types including nucleotide + amino acid
sequences
✓ Development + implementation of tools that enable efficient access +
management of different types of information

Question 2

Q

Difference between bioinformaticist and bioinformatician

Answer

A

Bioinformaticist: An expert who knows how to use bioinformatics tools + write
interfaces for effective use of the tools. Designs + implements tools + makes use
of complex algorithms

Bioinformatician: A trained individual who knows how to use bioinformatics tools
without a deeper understanding. Most biological scientists have a basic
understanding of underlying algorithms.

Question 3

Q

Difference between information storage with regards to single sequences,
features and annotations, as well as collections. Know examples thereof

Question 4

Q

Define “database”

Answer

A

Database: A comprehensive collection of related data organized for convenient access, generally stored on a computer.

Question 5

Q

5.1 Raw Data (8)

Answer

A

Sequences = uploaded + maintained by those who submitted them
RAW
Redundant (same sequence can appear in more than 1 record)
Level of info = sparse
Can represent incomplete records
Quality of data = unknown
Longevity can be indefinite
No automatic linking with other databases

Question 6

Q

5.2 Curated Data (6)

Answer

A

Maintained by a 3rd party, who did not necessarily generate the data
Originates from a primary database
Non-redundant (duplicate records merged)
Created to extract extra information
Represents complete records
Linking with other databases
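
A minimal sketch of the redundant vs non-redundant idea, assuming made-up raw records. The accessions and the `make_non_redundant` helper are hypothetical, and real curation does far more than merge exact duplicates.

```python
# Hypothetical raw submissions: the same sequence was submitted twice.
raw_records = [
    {"accession": "RAW0001", "sequence": "ATGGCC"},
    {"accession": "RAW0002", "sequence": "ATGGCC"},
    {"accession": "RAW0003", "sequence": "ATGTTT"},
]


def make_non_redundant(records):
    """Group records by sequence so each distinct sequence appears once."""
    curated = {}
    for rec in records:
        curated.setdefault(rec["sequence"], []).append(rec["accession"])
    # Keep links back to the primary records each entry was derived from.
    return [{"sequence": seq, "sources": accs} for seq, accs in curated.items()]


curated_records = make_non_redundant(raw_records)
print(curated_records)
```

Three raw records become two curated entries, each still pointing to the primary records it came from.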

Question 7

Q

5.3 Specialist Data (4)

Answer

A

Contains mix of Primary + Derived data of only select group or single species
Maintained + updated by unofficial collaborators
Contains bulk Whole Genome Sequencing data which can be considered as primary
Contains reference sequences derived from multiple sequences
Restricted public access

Question 8

Q

Difference between level of scope and level of curation of databases

Answer

A

Level of scope: Single, collection + features and annotations [SCFA]

Level of curation: Raw, curated and specialist [RCS]

Question 9

Q

Examples of raw, curated and specialist databases

Question 10

Q

8.1 Direct query (& PROBLEM)

Answer

A

Know exactly what is wanted
Each record in database has unique value = Accession number
Returns results ONLY for that query, unless the database contains links to similar data

Problem: few databases share the same accession numbers + use the same format
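
A minimal sketch of a direct query, assuming a tiny in-memory "database" keyed by accession number. The records and the `direct_query` helper are hypothetical; the point is only that one accession returns at most one record.

```python
# Hypothetical mini-database: accession number -> record (made-up entries).
RECORDS = {
    "ACC0001": {"organism": "Homo sapiens", "sequence": "ATGGCC"},
    "ACC0002": {"organism": "Mus musculus", "sequence": "ATGGCA"},
}


def direct_query(accession):
    """Return the single record with this accession, or None if absent."""
    return RECORDS.get(accession)


hit = direct_query("ACC0001")
# An identifier written in another database's format finds nothing here.
miss = direct_query("acc-1")
print(hit, miss)
```

The second lookup shows the problem on this card: an accession only works in the database that issued it.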

Question 11

Q

8.2 Indirect query (4)

Answer

A

Generally know what is wanted
Can use different text queries such as gene names, organisms, products of genes
Multiple entries for 1 sequence = you will receive multiple results
Metadata such as authors + date of publication can also be used
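
A minimal sketch of an indirect query, assuming a few made-up records held in memory. The field names and the `indirect_query` helper are hypothetical; a real database would search indexed fields rather than scan every value.

```python
# Hypothetical records; several entries can describe the same gene.
RECORDS = [
    {"accession": "ACC0001", "gene": "geneA", "organism": "Homo sapiens", "author": "Smith"},
    {"accession": "ACC0002", "gene": "geneA", "organism": "Homo sapiens", "author": "Jones"},
    {"accession": "ACC0003", "gene": "geneB", "organism": "Mus musculus", "author": "Smith"},
]


def indirect_query(text):
    """Return every record in which any field contains the text (case-insensitive)."""
    needle = text.lower()
    return [r for r in RECORDS if any(needle in str(v).lower() for v in r.values())]


by_gene = indirect_query("geneA")    # two entries for one gene -> two results
by_author = indirect_query("smith")  # metadata works as a search term too
print([r["accession"] for r in by_gene], [r["accession"] for r in by_author])
```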

Question 12

Q

8.3 Way to search with multiple queries:

Answer

A

• NCBI’s Entrez system: links out to other databases that might contain
different data for your match
  • Returns matches from multiple databases
• Boolean operators
  • AND, OR + NOT
  • Written in CAPITALS
  • Can be used multiple times in 1 query
  • ….. AND …. OR …. NOT …. AND …. OR
• NCBI’s Entrez system: certain info of records is individually indexed,
allowing you to search specifically for it
  • Text qualifiers or indexed terms (4)
    • Sequence length [SLEN]
    • Organism [ORGN]
    • Features [FKEY]
    • Properties [PROP]
      • gbdiv – GenBank division
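
A minimal sketch of how such a query could be put together, assuming the nucleotide database (`nuccore`) and made-up search values. The `term` helper is hypothetical. Nothing is sent to NCBI; the code only builds the query string and an E-utilities `esearch` URL.

```python
from urllib.parse import urlencode


def term(value, field):
    """Attach an Entrez field qualifier, e.g. term("Homo sapiens", "ORGN")."""
    return f"{value}[{field}]"


# Boolean operators are written in CAPITALS and can be chained.
query = " ".join([
    term("Homo sapiens", "ORGN"),
    "AND", term("1000:2000", "SLEN"),  # sequence length range (illustrative values)
    "AND", term("CDS", "FKEY"),        # records annotated with a CDS feature
    "NOT", term("gbdiv_est", "PROP"),  # exclude the EST GenBank division
])
print(query)
# Homo sapiens[ORGN] AND 1000:2000[SLEN] AND CDS[FKEY] NOT gbdiv_est[PROP]

# The same query as an esearch URL (built only, not requested).
url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?" + urlencode(
    {"db": "nuccore", "term": query}
)
print(url)
```

The same query string can be pasted straight into the NCBI search box.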

Question 13

Q

Data type refers to
AND
Single means

Answer

A

Data type: refers to the different formats

Single: means it is one data entry with more than one data type

Question 14

Q

9.1 FASTA Text format

Answer

A

FASTA (*.fa or *.fas or *.fasta)
• Simple text-based file format
• Edit with Notepad or any other basic text editor

• Used to download sequence info from records

NO OTHER DATA PRESENT
NO SEQUENCE FEATURES OR RECORD FEATURES
Only 2 parts per record
1st line preceded with ‘>’ to denote name of record
2nd line onwards = sequence itself (can wrap over several lines)
ONLY IUPAC nucleotide or amino acid letters
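
A minimal sketch of reading FASTA text, assuming two made-up records. The `parse_fasta` helper is hypothetical. Real files often wrap the sequence over several lines, so the sketch joins them.

```python
def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


example = """>seq1 hypothetical example record
ATGGCCATTG
TAATGGGCCG
>seq2 another made-up record
MKTAYIAKQR
"""
records = list(parse_fasta(example))
print(records)
```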

Question 15

Q

9.2 GenBank Text format

Answer

A

flat file (*.gb)

Complex file format that preserves ALL sequence information
Sequence features + metadata
Not readily editable
Plain text, so it opens in a text editor, but the strict layout makes editing by hand error-prone
Allows interactive views of sequences when using programs that can accommodate them
Multiple lines per record; # of lines depends on how much data is available
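
A minimal sketch of the flat-file layout, assuming a made-up and heavily shortened record. The `parse_genbank` helper is hypothetical and handles only this simple case; real work is better left to an established parser.

```python
# Made-up record for illustration only; real records carry far more fields.
example = """\
LOCUS       EXAMPLE1                  20 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Made-up record for illustration only.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..20
                     /organism="Homo sapiens"
ORIGIN
        1 atggccattg taatgggccg
//
"""


def parse_genbank(text):
    """Split one record into top-level sections and extract the sequence."""
    sections, current = {}, None
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if line and not line[0].isspace():  # keyword in column 1 starts a section
            current, _, rest = line.partition(" ")
            sections[current] = [rest.strip()]
        elif current is not None:  # indented line continues the current section
            sections[current].append(line.strip())
    # ORIGIN lines mix position numbers and spaces with bases: keep letters only.
    origin = sections.get("ORIGIN", [])
    sequence = "".join(ch for ln in origin for ch in ln if ch.isalpha())
    return sections, sequence.upper()


sections, sequence = parse_genbank(example)
print(list(sections), sequence)
```

Unlike FASTA, the features and metadata stay with the sequence in one record.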
