BIOINFORMATICS Flashcards
Concerned with knowledge and the flow of knowledge in biological systems using computational methods in genetics and genomics
BIOINFORMATICS
Study of genomes (the complete set of an organism's genes and DNA)
Genomics
Study of proteomes (the full set of proteins an organism expresses)
Proteomics
A collection of related information that is:
○ Structured
○ Searchable → index
○ Updated periodically
○ Cross-referenced → hyperlinks
DATABASES
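The four properties on this card can be sketched in a few lines of Python. This is only a toy illustration: the record IDs, field names, and values are invented, and real biological databases use far richer schemas.

```python
# Hypothetical records: the IDs, fields, and values are invented for illustration.
RECORDS = {
    "REC001": {"name": "hemoglobin alpha", "keywords": ["oxygen", "blood"], "xrefs": ["REC002"]},
    "REC002": {"name": "hemoglobin beta", "keywords": ["oxygen", "blood"], "xrefs": ["REC001"]},
    "REC003": {"name": "insulin", "keywords": ["hormone", "blood"], "xrefs": []},
}


def build_index(records):
    """Searchable: map each keyword to the sorted list of record IDs that use it."""
    index = {}
    for rec_id, rec in records.items():
        for keyword in rec["keywords"]:
            index.setdefault(keyword, []).append(rec_id)
    return {kw: sorted(ids) for kw, ids in index.items()}


def follow_xrefs(records, rec_id):
    """Cross-referenced: resolve a record's links to the names of the linked records."""
    return [records[x]["name"] for x in records[rec_id]["xrefs"]]


INDEX = build_index(RECORDS)
print(INDEX["oxygen"])
print(follow_xrefs(RECORDS, "REC001"))
```

Each record has the same fields (structured), the keyword index makes lookups fast (searchable), and the `xrefs` field plays the role of a hyperlink (cross-referenced). Rebuilding the index after adding records would correspond to a periodic update.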
○ These are programs that keep the database working behind the scenes
○ Computerized data-keeping system
Tier 1: Database management system
○ Facilitates communications between applications or databases
○ Extracts information from either local or remote databases
Tier 2: Middleware layer
○ Enables users to access the database from anywhere without the need for downloading or installing any code
○ The one that we see: the graphical user interface (GUI).
Tier 3: Web interface
CLASSIFICATION OF DATABASES
1. Scope of data coverage
give me the 2
● Comprehensive
● Specialized
CLASSIFICATION OF DATABASES
2. Methods of biocuration
give me the 2
● Expert-curated (RefSeq)
● Community-curated (Gene Wiki)
CLASSIFICATION OF DATABASES
3. Level of biocuration
give me the 3
● Primary
● Secondary
● Composite
CLASSIFICATION OF DATABASES
4. Type of data managed
give me the 3
● DNA/RNA/Protein
● Disease
● Nomenclature/Literature
● Information on sequence or structure alone
● Experimentally derived data submitted directly
● Archival in nature
PRIMARY DATABASE
● Combines a variety of primary databases, allowing an ‘all-in-one’ search across multiple resources
COMPOSITE DATABASE
● Derived from primary databases
● Based on analysis of the data from the primary database
SECONDARY DATABASE
“Google” of bioinformatics
COMPOSITE DATABASE
● The most widely used is PubMed
● Contains entries for >11 million abstracts of scientific publications
LITERATURE DATABASE
● GenBank, EMBL-Bank, and DDBJ exchange data to ensure comprehensive worldwide coverage; accession numbers are managed consistently between the three centers
NUCLEIC ACID DATABASE
● Contains publicly available DNA sequences from >100,000 organisms
● Also contains derived protein sequences, and annotations describing biological, structural, and other relevant features
GENBANK
● Contains nucleotide sequences from all public sources.
● Accessible through Sequence Retrieval System (SRS), which allows keyword searching.
● Sequence similarity search tools: BLAST, Blitz, FASTA
EMBL
● Contains curated data on everything that has to do with proteins, motifs, and interactions with other substances.
PROTEIN DATABASE
● >18,000 macromolecular structures of proteins, peptides, viruses, protein/NA complexes, nucleic acids, and carbohydrates.
● Determined by X-ray diffraction and NMR.
PROTEIN DATA BANK
○ Curated database focusing on high level of annotation (sequence, function, structure, post-translational modifications, variants) of proteins.
○ Non-redundant and reviewed.
SWISS-PROT
○ Computer-annotated supplement to SWISS-PROT.
○ Redundant and unreviewed.
TrEMBL
● Secondary database of protein families, domains, and functional sites that contains manually curated information.
● Provides tools for analysis of protein sequences and motifs.
PROSITE
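PROSITE motifs are written in a compact pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats). A minimal sketch of how such a pattern could be turned into a Python regular expression follows. The helper name `prosite_to_regex` is hypothetical, the example pattern is a C2H2 zinc-finger-style pattern used only for illustration, and rarer syntax (such as anchors inside brackets) is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern into a Python regex (simplified sketch)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative pattern and a made-up sequence built to contain one match.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
REGEX = prosite_to_regex(PATTERN)
SEQUENCE = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
MATCH = re.search(REGEX, SEQUENCE)
print(REGEX)
print(MATCH.start(), MATCH.group())
```

Scanning a sequence with the resulting regex reports where the motif occurs, which is the basic idea behind pattern-based motif searches.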
● Protein family fingerprints (groups/motifs).
● Detects distant relatives of large and highly divergent protein superfamilies by looking at conserved regions in alignments.
PRINTS
● Protein families and domains represented as multiple sequence alignments.
PFAM
PFAM
___ : Automatically generated, low-quality (LQ) entries
Pfam-B
PFAM
___ : Manually curated, high-quality (HQ) entries
Pfam-A
● Collection of ungapped multiple alignments of segments of related protein sequences (blocks)
● For: protein family classification, protein structure prediction
BLOCKS
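To make "ungapped multiple alignment of segments" concrete, the sketch below pulls the gap-free column ranges out of a small gapped alignment. This is an illustration only and is not how the BLOCKS database itself was built; the function name `ungapped_blocks`, the `min_width` parameter, and the three sequences are all made up for the example.

```python
def ungapped_blocks(alignment, min_width=3):
    """Return (start, end) column ranges where no aligned sequence has a gap.

    `end` is exclusive, so alignment rows can be sliced with seq[start:end].
    """
    length = len(alignment[0])
    blocks, start = [], None
    # Iterate one column past the end so a block touching the last column is closed.
    for col in range(length + 1):
        gap_free = col < length and all(seq[col] != "-" for seq in alignment)
        if gap_free and start is None:
            start = col
        elif not gap_free and start is not None:
            if col - start >= min_width:
                blocks.append((start, col))
            start = None
    return blocks


# Made-up aligned sequences of equal length; "-" marks a gap.
ALIGNMENT = [
    "MKTAY-IAKQR",
    "MKTA-WIAKQR",
    "MRTAYWIA-QR",
]
BLOCKS = ungapped_blocks(ALIGNMENT, min_width=2)
print(BLOCKS)
print([[seq[s:e] for seq in ALIGNMENT] for s, e in BLOCKS])
```

Each returned range is a short segment in which every sequence is aligned residue to residue with no insertions or deletions.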
● Contains data regarding structures of nucleic acids and proteins.
STRUCTURAL DATABASES
Easy-to-use website for aligning sequences in FASTA files.
MULTALIN
Translates DNA or RNA sequences into their protein sequences.
EXPASY
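The translation step itself can be sketched in Python using the standard genetic code. This is a minimal single-frame sketch and not the Expasy implementation; the `translate` function and its `frame` parameter are names chosen for this example, stop codons are shown as `*`, and unrecognized codons become `X`.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq, frame=0):
    """Translate a DNA or RNA sequence in one forward reading frame (0, 1, or 2).

    A trailing partial codon is ignored.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


DNA = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
PROTEIN = translate(DNA)
print(PROTEIN)
print(translate("AUGGCC"))
```

Calling `translate` with `frame=1` or `frame=2` shifts the reading frame; translating the reverse complement as well would cover all six frames.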
Provides a prediction of the protein structure.
I-TASSER