L2 Flashcards
What is the major challenge of the genomics era?
To store and handle terabytes (TB) of sequence data through the establishment and use of computer databases
What is a database?
A computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria
What are databases made of?
Computer hardware and software for data management
What should each record (entry) in a database contain?
A number of fields that hold the actual data items.
What is the process of making a query?
The process by which a user specifies a particular piece of information (a value) to be found in a particular field and expects the computer to retrieve the whole data record.
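A minimal sketch of a field-based query, assuming records are held as Python dictionaries. The field names, accession strings, and values below are made up purely for illustration.

```python
# Hypothetical records: each dictionary is one entry, each key is a field.
records = [
    {"accession": "X00001", "organism": "Homo sapiens", "length": 1200},
    {"accession": "X00002", "organism": "Mus musculus", "length": 950},
    {"accession": "X00003", "organism": "Homo sapiens", "length": 430},
]


def query(entries, field, value):
    """Return every whole record whose given field equals the given value."""
    return [entry for entry in entries if entry.get(field) == value]


# Specify a value ("Homo sapiens") for one field ("organism");
# the whole matching records come back, not just the matched field.
hits = query(records, "organism", "Homo sapiens")
for hit in hits:
    print(hit["accession"], hit["length"])
```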
What is knowledge discovery?
A function of biological databases which refers to the identification of connections between pieces of information that were not known when the information was first entered
Types of databases
- flat file format
- relational database management
- object-oriented database management systems
What is a flat file format?
A long text file that contains many entries separated by a delimiter, such as a vertical bar (|)
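As a rough illustration, the sketch below parses a tiny flat file in Python. Entries are separated by a vertical bar, as in the card above. The comma between fields, the field names, and the sample values are assumptions made for this example only.

```python
# Hypothetical flat file: entries separated by "|", fields by "," (assumed).
flat_file_text = (
    "X00001,Homo sapiens,ATGGCC|"
    "X00002,Mus musculus,ATGTTT|"
    "X00003,Homo sapiens,ATGAAA"
)

FIELD_NAMES = ["accession", "organism", "sequence"]


def parse_flat_file(text, entry_delimiter="|", field_delimiter=","):
    """Split the text into entries, then each entry into named fields."""
    entries = []
    for raw_entry in text.split(entry_delimiter):
        raw_entry = raw_entry.strip()
        if not raw_entry:
            continue  # ignore empty chunks, e.g. after a trailing delimiter
        values = raw_entry.split(field_delimiter)
        entries.append(dict(zip(FIELD_NAMES, values)))
    return entries


entries = parse_flat_file(flat_file_text)
# Searching a flat file means scanning every entry in turn.
human = [e["accession"] for e in entries if e["organism"] == "Homo sapiens"]
print(human)
```

The full scan in the last step is one reason plain flat files become slow to search as they grow.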
What are database management systems?
Sophisticated computer software programs for organizing, searching and accessing data
What are relational databases?
They make use of a set of tables to organize data. They are created using a programming language known as Structured Query Language (SQL)
Each table in a relational database is also called
A relation, which is made up of columns and rows. Columns represent individual fields. Rows represent the values in those fields
How is a query executed in a relational database?
The system selects linked data items from different tables and combines the information into one report
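The idea of combining linked items from different tables can be sketched with Python's built-in `sqlite3` module. The two tables, their column names, and the rows are hypothetical and exist only to show a join. Real biological databases have far richer schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Two hypothetical tables linked through the organism_id column.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT
);
CREATE TABLE seq_entry (
    accession   TEXT PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    length      INTEGER
);
INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO seq_entry VALUES
    ('X00001', 1, 1200),
    ('X00002', 2, 950),
    ('X00003', 1, 430);
""")

# The query pulls linked items from both tables into one report.
rows = conn.execute(
    """
    SELECT seq_entry.accession, organism.name, seq_entry.length
    FROM seq_entry
    JOIN organism ON seq_entry.organism_id = organism.organism_id
    WHERE organism.name = ?
    ORDER BY seq_entry.accession
    """,
    ("Homo sapiens",),
).fetchall()

for row in rows:
    print(row)
conn.close()
```

Storing the organism name once and linking to it by key avoids repeating the same text in every sequence row.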
What are primary databases?
They are archives of raw protein or DNA sequence data submitted by the scientific community
Examples of primary databases
GenBank, Protein Data Bank (PDB)
What databases is the International Nucleotide Sequence Database Collaboration made up of?
-GenBank
-European Molecular Biology Laboratory (EMBL)
-DNA Data Bank of Japan (DDBJ)
What is GenBank?
The most complete collection of annotated nucleic acid sequence data for almost every known organism
GenBank consists of
- DNA
- mRNA
- cDNA
- ESTs
What is the GenPept database for?
Protein sequences, the majority of which are conceptual translations from DNA sequences
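To make "conceptual translation" concrete, here is a small sketch that translates a DNA string into protein. It assumes the standard genetic code, the forward strand, and a reading frame starting at the first base. It is only an illustration and does not reproduce how database records are actually generated.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA codon by codon, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # "X" = unknown codon
        if amino_acid == "*":  # stop codon ends the translation
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCAAATGA"))
```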
What are the two ways to search for sequences in GenBank?
- using text-based keywords
- using molecular sequences to search by sequence similarity using BLAST
Functional divisions in GenBank
- EST
- GSS
- WGS
- ENV
EST (expressed sequence tags)
Contains short single-pass cDNA reads. They represent what is expressed in a given tissue at a particular developmental stage
GSS (genome survey sequences)
Contains genomic sequences derived from random single-pass reads
WGS (whole genome shotgun sequence)
Contains sequences generated by a whole genome shotgun approach, which gives large coverage with the caveat of large amounts of unassembled sequence
ENV (environmental samples)
Contains sequences normally derived from a metagenomic sample
Other functional divisions in GenBank
- PRI (primate sequences)
- ROD (rodent sequences)
- MAM (other mammalian sequences)
- PLN (plant, fungal and algal sequences)
- VRL (viral sequences)
Similarity in data contained in GenBank, EMBL, DDBJ
Similarity: the data entered is identical; information such as species and sequence length is entered via structured fields
What is an accession number?
A unique identifier for a sequence record, usually one letter followed by five digits or two letters followed by six digits (e.g., U12344)
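The two patterns described in the card above can be checked with a short regular expression. This sketch covers only those two patterns and is not a validator for every accession format in use. The function name and the test strings other than U12344 are made up for illustration.

```python
import re

# One letter + five digits, or two letters + six digits (as described above).
ACCESSION_PATTERN = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_accession(text):
    """Return True if the text matches one of the two described patterns."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


for candidate in ["U12344", "AB123456", "U1234", "AB12345"]:
    print(candidate, looks_like_accession(candidate))
```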
What is a GI number?
A GenInfo Identifier, a number assigned to each version of a sequence. If a sequence changes in any way, a new GI number will be assigned
What is the FASTA file format?
It is a sequence format because it contains plain sequence information
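A FASTA record is a definition line beginning with `>` followed by one or more lines of plain sequence. The sketch below reads such text into a dictionary. The record names and sequences are hypothetical, and the parser is deliberately minimal (it keeps only the first word of each definition line as the identifier).

```python
# Hypothetical FASTA text with two records.
fasta_text = """>seq1 hypothetical example record
ATGGCCAAA
TGA
>seq2 another hypothetical record
ATGTTT
"""


def parse_fasta(text):
    """Return {identifier: sequence} from FASTA-formatted text."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)  # sequence may span many lines
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


sequences = parse_fasta(fasta_text)
for seq_id, sequence in sequences.items():
    print(seq_id, len(sequence))
```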
What are secondary databases?
They contain computationally processed sequence information derived from primary databases
Example of a secondary database
SWISS-PROT. It provides detailed sequence annotation that includes structure, function and protein family assignment
Specialized databases
Serve a specific research community or focus on a particular organism
Specialized databases include
- FlyBase
- WormBase
- TAIR
- EcoCyc
- SGD
What is Entrez?
A biological database retrieval system built on cross-referencing between NCBI databases
Disadvantages of biological databases
- overreliance on sequence information and related annotations without understanding the reliability of the information
- there can be many errors in sequence databases
- there are high levels of redundancy in primary sequence databases
- annotations of genes can also be false or incomplete