Databases (Week 2) Flashcards
- Define “Bioinformatics”
Oxford: Management info system for molecular biology + practical applications
NCBI: Field of science in which biology, computer science + information technology
merge into 1 discipline. 3 subtypes
✓ Development of new algorithms + statistics to assess relationships among
data sets
✓ Analysis + interpretation of data types including nucleotide + amino acid
sequences
✓ Development + implementation of tools that enable efficient access +
management of different information
- Difference between bioinformaticist and bioinformatician
Bioinformaticist: An expert who knows how to use bioinformatics tools + write
interfaces for effective use of tools. Designs + implements tools + makes use
of complex algorithms
Bioinformatician: A trained individual who knows how to use bioinformatics tools
without a deeper understanding. Most biological scientists have a basic
understanding of underlying algorithms.
- Difference between information storage with regards to single sequences,
features and annotations, as well as collections. Know examples thereof
- Define “database”
Database: A comprehensive collection of related data organized for convenient access, generally stored on a computer.
5.1 Raw Data (8)
- Sequences = uploaded + maintained by those who submitted them
- RAW
- Redundant (same sequence can appear in more than one record)
- Level of info = sparse
- Can represent incomplete records
- Quality of data = unknown
- Longevity can be indefinite
- No automatic linking with other databases
5.2 Curated Data (6)
- 3rd party that maintains it, who did not necessarily generate data
- Originates from a primary database
- Non-redundant (duplicate entries merged into one record)
- Created to extract extra information
- Represents complete records
- Linking with other databases
5.3 Specialist Data (4)
- Contains a mix of Primary + Derived data of only a select group or single species. Maintained + updated by unofficial collaborators
- Contains bulk Whole Genome Sequencing data which can be considered as primary
- Contains reference sequences derived from multiple sequences
- Restricted public access
- Difference between level of scope and level of curation of databases
Level of scope: Single, collection + features and annotations [SCFA]
Level of curation: Raw, curated and specialist [RCS]
- Examples of raw, curated and specialist databases
8.1 Direct query (& PROBLEM)
- Know exactly what is wanted
- Each record in database has unique value = Accession number
- Returns results ONLY for that query, unless the database contains links to similar data
Problem: few databases share the same accession numbers + use the same format
8.2 Indirect query (4)
- Generally know what is wanted
- Can use different text queries such as gene names, organisms, products of genes
- If there are multiple entries for 1 sequence, you will receive multiple results
- Metadata such as authors, date of publication can also be used
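The difference between the two query styles can be sketched in Python. This is only an illustrative toy: the records, accession strings and field names below are invented for the example and do not come from any real database.

```python
# Toy in-memory "database"; every record below is invented for illustration.
RECORDS = [
    {"accession": "TOY0001", "gene": "geneA", "organism": "Species one", "author": "Smith"},
    {"accession": "TOY0002", "gene": "geneA", "organism": "Species two", "author": "Jones"},
    {"accession": "TOY0003", "gene": "geneB", "organism": "Species one", "author": "Smith"},
]


def direct_query(records, accession):
    """Direct query: look up the single record with this exact accession number."""
    for record in records:
        if record["accession"] == accession:
            return record
    return None  # no partial matches: the accession must match exactly


def indirect_query(records, text):
    """Indirect query: return every record where any field contains the text."""
    needle = text.lower()
    return [
        record
        for record in records
        if any(needle in str(value).lower() for value in record.values())
    ]


one_hit = direct_query(RECORDS, "TOY0002")     # exactly one record (or None)
many_hits = indirect_query(RECORDS, "geneA")   # gene name: several records
by_author = indirect_query(RECORDS, "smith")   # metadata also works as a query
```

The direct query needs the exact accession and returns at most one record, while the text query can return several records for the same gene or author.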
8.3 Way to search with multiple queries:
• NCBI’s Entrez system: link out to other databases that might contain different data of your match
• Return matches from multiple databases
• Boolean Operators
- AND, OR + NOT
- CAPITALS
- Multiple times in 1 query
- ….. AND …. OR …. NOT …. AND …. OR
• The NCBI’s Entrez system: certain info of records individually indexed allowing you to search specifically for them
• Text qualifiers or Indexed terms (4)
- Sequence length [SLEN]
- Organism [ORGN]
- Features [FKEY]
- Properties [PROP], e.g. gbdiv (GenBank division)
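As a rough sketch of how Boolean operators and field qualifiers combine into one query string, the Python below assembles a query as plain text. The helper names (`qualified`, `combine`) are hypothetical, the search values are made-up examples, and nothing is sent to NCBI; the exact syntax Entrez accepts for ranges and property values should be checked against the NCBI help pages.

```python
def qualified(term, field):
    """Attach an indexed-field qualifier, e.g. qualified('Homo sapiens', 'ORGN')."""
    # Quote multi-word terms so they are searched as one phrase.
    if " " in term:
        term = f'"{term}"'
    return f"{term}[{field}]"


def combine(first, *rest):
    """Join terms with Boolean operators; rest holds (operator, term) pairs."""
    query = first
    for operator, term in rest:
        operator = operator.upper()  # Boolean operators are written in CAPITALS
        if operator not in {"AND", "OR", "NOT"}:
            raise ValueError(f"unknown Boolean operator: {operator}")
        query = f"{query} {operator} {term}"
    return query


# Illustrative values only (assumed, not taken from the notes).
query = combine(
    qualified("Homo sapiens", "ORGN"),
    ("and", qualified("1000:2000", "SLEN")),  # sequence length range
    ("not", qualified("gbdiv_est", "PROP")),  # exclude one GenBank division
)
print(query)
```

The same operator can be used several times in one call, which mirrors the "multiple times in 1 query" point above.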
- What “data type” refers to AND what “single” means
Data type: different formats
Single: one data entry with more than one data type
9.1 FASTA Text format
FASTA (*.fa or *.fas or *.fasta)
• Simple text-based file format
• Edit with Notepad or any other basic text editor
• Used to download sequence info from records
- NO OTHER DATA PRESENT
- NO SEQUENCE FEATURES OR RECORD FEATURES
- Only 2 lines per record
- 1st line preceded with ‘>’ to denote name of record
- 2nd line sequence itself
- ONLY IUPAC nucleotide or amino acid letters
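Because the format is this simple, a reader for it fits in a few lines. The following is a minimal Python sketch, not a replacement for an established parser; the function name `parse_fasta` and the example records are invented for illustration. As a precaution it joins sequence lines together, since some files wrap a long sequence over several lines.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line starts a new record; store the previous one first.
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:].strip()
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence line belonging to the current record
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


# Invented example data, not real sequences.
EXAMPLE = """>toy_record_1 invented example
ATGCGTAC
>toy_record_2 invented example
GGCATTAA
CCGT
"""

fasta_records = parse_fasta(EXAMPLE)
```

Each tuple holds only a name and a sequence, which reflects the point above that no features or other record data survive in this format.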
9.2 GenBank Text format
flat file (*.gb)
- Complex file format that preserves ALL sequence information
- Sequence features + Metadata
- Not readily editable
- Not meant to be hand-edited in a basic text editor (it is still a plain-text flat file)
- Allows interactive views of sequences when using programs that can accommodate them
- Multiple lines per record; number of lines depends on how much data is available
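To show what "multiple lines per record" looks like in practice, here is a small Python sketch that splits one flat-file record into its top-level sections and pulls out the sequence. It assumes the usual layout in which section keywords start in the first column, the sequence follows an ORIGIN line, and `//` ends the record. The record itself is invented, `parse_genbank` is a hypothetical helper, and real files contain many more fields than this handles.

```python
def parse_genbank(text):
    """Split one GenBank-style record into top-level sections plus its sequence."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if line and not line[0].isspace():
            # Top-level keywords start in the first column.
            keyword, _, rest = line.partition(" ")
            current = keyword
            sections[current] = [rest.strip()]
        elif current is not None:
            sections[current].append(line.strip())  # continuation line
    # ORIGIN lines hold position numbers and space-separated sequence blocks.
    sequence = "".join(
        ch for line in sections.get("ORIGIN", []) for ch in line if ch.isalpha()
    )
    return sections, sequence


# Invented record for illustration only.
EXAMPLE = """LOCUS       TOY0001                  24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Invented example record.
ACCESSION   TOY0001
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="Example organism"
ORIGIN
        1 atgcgtacgg cattaaccgt atgc
//
"""

sections, sequence = parse_genbank(EXAMPLE)
```

Unlike the FASTA case, the metadata and feature lines are kept alongside the sequence, so the number of lines grows with the amount of annotation.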