Access to Sequenced Data and Related Information Flashcards
Library of related information
- collection and preservation, easy access, standardized data presentation, minimized redundancy, data independence, management, updating, and organizing data into knowledge
BIOLOGICAL DATABASES
3 Main Nucleotide Sequence Databases
GenBank
European Nucleotide Archive
DNA Database of Japan
National Center for Biotechnology Information (NCBI) of the National Institutes of Health (NIH) in Bethesda, Maryland
GenBank
European Molecular Biology Laboratory (EMBL)-Bank Nucleotide Sequence Database at the European Bioinformatics Institute (EBI) in Hinxton, England
European Nucleotide Archive
National Institute of Genetics in Mishima, Japan
DNA Database of Japan
Other Common Biological Databases
PubMed
UCSC Genome Browser
Ensembl
FlyBase
UniProt
WormBase
Gene Ontology
RCSB Protein Data Bank
TAIR (The Arabidopsis Information Resource)
Rice Genome Annotation Project
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Integration of Biological Databases
Challenges:
1. Database architecture = similar structure
2. How to access the data and what can be accessed (data surfing)
3. Naming system (S. cerevisiae RAD24 = rad17 in S. pombe)
4. Clash of concepts = definitions of terms (definition of GENE)
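The naming-system challenge can be illustrated with a small Python sketch. It assumes a hand-built alias table that maps each (organism, symbol) pair to a shared key; the key `GROUP_1` and the function `same_gene` are made up for illustration and do not come from any real database.

```python
# Assumed alias table: (organism, local symbol) -> made-up shared key.
ALIASES = {
    ("S. cerevisiae", "RAD24"): "GROUP_1",
    ("S. pombe", "rad17"): "GROUP_1",
}

def same_gene(a, b):
    """Return True only if both names resolve to the same shared key."""
    key_a = ALIASES.get(a)
    key_b = ALIASES.get(b)
    return key_a is not None and key_a == key_b

# Different symbols, same shared key.
print(same_gene(("S. cerevisiae", "RAD24"), ("S. pombe", "rad17")))
# Similar-looking symbols, but one has no entry in this toy table.
print(same_gene(("S. cerevisiae", "RAD17"), ("S. pombe", "rad17")))
```

Comparing raw symbols alone would pair the wrong genes here, which is why an integrated system needs some explicit mapping between naming systems.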
Integration of Biological Databases
Approaches:
Link Integration
View Integration
Data Warehousing
Integration of Biological Databases
Approach wherein:
▪ researchers begin their query with one data source and then follow hypertext links to related information in other data sources
▪ Vulnerable to naming clashes and ambiguities, updates, researcher-dependent
Link Integration
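A minimal Python sketch of link integration, assuming two toy in-memory sources. The record contents, the identifier `P-0001`, and the function `follow_links` are hypothetical and only stand in for hypertext links between real databases.

```python
# Hypothetical toy sources; each record carries a link into the next source.
GENE_DB = {"RAD24": {"organism": "S. cerevisiae", "protein_link": "P-0001"}}
PROTEIN_DB = {"P-0001": {"description": "example protein record"}}

def follow_links(gene_symbol):
    """Start in one source and follow its cross-reference to the next one."""
    trail = []
    gene = GENE_DB.get(gene_symbol)
    if gene is None:
        # A naming clash or an outdated symbol breaks the chain at the start.
        return trail
    trail.append(("gene", gene_symbol))
    protein_id = gene["protein_link"]
    if protein_id in PROTEIN_DB:
        trail.append(("protein", protein_id))
    return trail

print(follow_links("RAD24"))
print(follow_links("rad17"))
```

The second call returns an empty trail because the starting source does not know that name, which mirrors the vulnerability to naming clashes noted on the card.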
Integration of Biological Databases
Approach wherein:
▪ leaves the information in its source databases but builds an environment around the databases that makes them all seem to be part of one large system
▪ does not perform as well as the source databases
View Integration
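A minimal Python sketch of view integration, assuming two toy sources with different field names. The sources, the adapter functions, and `unified_view` are hypothetical; the point is only that the data stays in its source and is combined at query time.

```python
# Two hypothetical sources; the data is never copied out of them.
SOURCE_A = [{"symbol": "RAD24", "species": "S. cerevisiae"}]
SOURCE_B = [{"gene_name": "rad17", "organism": "S. pombe"}]

def query_a(organism):
    """Adapter: query source A and present results in the shared shape."""
    return [{"gene": r["symbol"], "organism": r["species"], "source": "A"}
            for r in SOURCE_A if r["species"] == organism]

def query_b(organism):
    """Adapter: query source B and present results in the shared shape."""
    return [{"gene": r["gene_name"], "organism": r["organism"], "source": "B"}
            for r in SOURCE_B if r["organism"] == organism]

def unified_view(organism):
    """Ask every source at request time so they look like one system."""
    results = []
    for adapter in (query_a, query_b):
        results.extend(adapter(organism))
    return results

print(unified_view("S. pombe"))
```

Every request fans out to all sources, which is one plausible reason such a system responds more slowly than querying a single source directly.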
Integration of Biological Databases
Approach wherein:
▪ brings all the data under one roof in a single database
▪ keeping the data warehouse up to date is difficult
Data Warehousing
What technique transforms the contents of multiple source databases into a common data model and then integrates the source data into a single large database?
Data warehouse technique
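A minimal Python sketch of the data warehouse technique, assuming two toy sources and an in-memory SQLite database as the single large database. The table name `gene`, the function names, and the record layouts are hypothetical.

```python
import sqlite3

# Two hypothetical sources that describe genes with different field names.
SOURCE_A = [{"symbol": "RAD24", "species": "S. cerevisiae"}]
SOURCE_B = [{"gene_name": "rad17", "organism": "S. pombe"}]

def to_common(record, source):
    """Transform one source record into the common (name, organism, source) model."""
    if source == "A":
        return (record["symbol"], record["species"], "A")
    return (record["gene_name"], record["organism"], "B")

def build_warehouse():
    """Copy every transformed record into one database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gene (name TEXT, organism TEXT, source TEXT)")
    rows = [to_common(r, "A") for r in SOURCE_A]
    rows += [to_common(r, "B") for r in SOURCE_B]
    conn.executemany("INSERT INTO gene VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

conn = build_warehouse()
print(conn.execute("SELECT name, organism FROM gene ORDER BY name").fetchall())
```

Because the warehouse holds copies, any later change in a source is invisible until the load step is run again, which is the update problem noted on the Data Warehousing card.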
Types of Biological Data
Genomic Databases
RNA Databases
Protein Databases
Genomic Databases (3)
Sequence-Tagged Sites (STSs)
Genome Survey Sequences (GSSs)
High-Throughput Genomic Sequence (HTGS)
Which genomic database?
= short (typically 500 base pairs long) genomic landmark sequences
Sequence-Tagged Sites (STSs)
Which genomic database?
= consist of sequences that are genomic in origin
Genome Survey Sequences (GSSs)
Which genomic database?
= contains unfinished DNA sequences from sequencing centers
High-Throughput Genomic Sequence (HTGS)
Which RNA database?
= contain sequence data on “single-pass” cDNA sequences
Expressed Sequence Tags (ESTs)
Which RNA database?
= (unique gene) gene-oriented clusters created by grouping ESTs into nonredundant sets
UniGene
Which protein database?
= the most comprehensive, centralized protein sequence catalog
UniProt (aka Universal Protein Resource)