Intro To Bioinformatics Flashcards
What is bioinformatics?
The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970 for the study of informatic processes in biotic systems
- Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline (NCBI, 2009)
- In bioinformatics, computer databases are used to store, retrieve and assist in understanding biological information
Give examples of biological data in bioinformatics
- DNA (genome) = sequence, pathway
- RNA (transcriptome) = structure, interaction
- Protein (proteome) = evolution, mutations
What can bioinformatics be used for in the analysis of DNA?
- simple sequence analysis
- Gene finding
- regulatory regions
- whole genome annotations
- comparative genomics (analysis between species and strains)
What can bioinformatics be used for in the analysis of RNA?
- Splice variants
- Tissue-specific expression
- structure
- single gene analysis (various cloning techniques)
- Experimental data involving thousands of genes simultaneously
- DNA chips, microarrays and expression array analysis
What can bioinformatics be used for in the analysis of proteins?
- homology
- conserved domains/regions
- structure determination (molecular modeling): 2D, 3D and quaternary structure
- protein function
- Analysis often involves 2D gels and mass spectrometers
What are the major Nucleotide Sequence databases?
- GenBank: National Center for Biotechnology Information (NCBI)
- GenBank is the NIH genetic sequence database. It is part of the International Nucleotide Sequence Database Collaboration (INSDC), which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI
- EMBL: European Molecular Biology Laboratory
- The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is the European equivalent of the U.S.’s GenBank
- DDBJ: DNA Data Bank of Japan
- The DNA Data Bank of Japan (DDBJ), which is based at Japan’s National Institute of Genetics, is the third in the trio of major nucleotide sequence databases
What are the major protein sequence databases?
- UniProt: Universal Protein Resource
- PIR: Protein Information Resource databases
- Swiss-Prot
- ExPASy
Describe UniProt: Universal Protein Resource
UniProt is a single database that combines the information of the major international protein databases maintained by the European Bioinformatics Institute (EBI), Cambridge, UK; the Protein Information Resource (PIR) at Georgetown University Medical Center (GUMC) and the National Biomedical Research Foundation (NBRF), Washington, D.C.; and the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland
Describe the PIR: Protein Information Resource Databases
PIR grew out of the Atlas of Protein Sequence and Structure (1965-1978), which was edited by Margaret Dayhoff
Describe Swiss-Prot
Swiss-Prot is the major European protein sequence database, from the Swiss Institute of Bioinformatics
Describe ExPASy: Expert Protein Analysis System
- It is the new Swiss Institute of Bioinformatics (SIB) Resource Portal, which provides access to scientific databases and software tools in different areas of the life sciences, including proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics, etc.
What is the query sequence?
A sequence, either amino acid or nucleotide, chosen by the user for use in a BLAST search
- A query sequence can be typed or pasted into the query window on the search form
- BLAST searches require a minimum query sequence length of 15 nucleotides or amino acids
- A query sequence can be entered as FASTA, bare sequence, or an identifier (accession number or GenInfo Identifier, gi)
What is an alignment?
A presentation of two compared sequences showing the regions of greatest statistical similarity
What is the score value?
The score value is a measure of the quality of the alignment between the query sequence and the search results
- The higher the score, the better the alignment
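To make the idea of a score concrete, here is a minimal Python sketch that scores a pair of sequences that have already been aligned. The function name and the scoring values (+1 per match, -1 per mismatch, -2 per gap position) are assumptions chosen for illustration. BLAST itself uses substitution matrices and more elaborate gap penalties.

```python
def simple_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # a gap in either sequence is penalized
        elif a == b:
            score += match      # identical residues raise the score
        else:
            score += mismatch   # differing residues lower the score
    return score


# A perfect alignment scores higher than one with a gap and a mismatch.
print(simple_alignment_score("GATTACA", "GATTACA"))
print(simple_alignment_score("GATTACA", "GAT-ACC"))
```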
What is the E-value?
The E-value is the expectation value
- It is the number of different alignments with scores equal to or better than the observed alignment score that are expected to occur in a database search by chance
- The lower the E-value, the more significant the match
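As a rough illustration of why a higher score gives a lower E-value, the expected number of chance hits for a bit score S' is commonly approximated as E = m × n × 2^(-S'), where m is the query length and n is the total database length. The sketch below applies that approximation directly. The function name and the example lengths are assumptions, and real BLAST reports use length corrections that are not modeled here.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical search: a 100-residue query against 1,000,000 residues.
for score in (20, 30, 40):
    print(score, expect_value(score, 100, 1_000_000))
```

Each additional bit of score halves the E-value, and doubling the database size doubles it.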
What is genome annotation?
- Obtaining biological information from unprocessed sequence data
- The ultimate goal is to create a labeled genome, where biological information is linked to sequence
- There are two types: structural and functional annotation
What is structural genome annotation?
The identification of genomic elements (genes and other important sequences)
- ORFs and their localization
- gene structure
- coding regions
- location of regulatory motifs
What is functional genome annotation?
It consists of attaching biological function to genomic elements
- biochemical function
- biological function
- regulation and interactions involved
- expression
The basic level of annotation is to use BLAST to find similarities and then annotate the genome based on them
Functions of gene prediction software include:
- to identify genes within a long DNA sequence
- A DNA sequence that codes for amino acids shouldn’t contain any in-frame stop codons before its end
- Such a coding region is called an open reading frame
- Because each DNA strand can be read in three reading frames and there are two DNA strands, the computer must analyze a given DNA sequence in six different reading frames
- Examples of tools that can be used are the Open Reading Frame Finder (ORF Finder) at NCBI and GENSCAN
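The six reading frames can be enumerated with a few lines of Python. This is only a sketch of the idea, not how ORF Finder or GENSCAN is implemented. The function names are hypothetical, and the input is assumed to contain only the letters A, C, G and T.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Map frame labels (+1, +2, +3, -1, -2, -3) to lists of codons."""
    frames = {}
    strands = (("+", dna.upper()), ("-", reverse_complement(dna)))
    for sign, seq in strands:
        for offset in range(3):
            # Step through the strand three bases at a time,
            # dropping any incomplete codon at the end.
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("ATGAAATAG").items():
    print(label, codons)
```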
What are the diagnostic features of a gene? They include the presence of…
- Open reading frame (ORF)
- Start codon (Met start codon, ATG)
- Stop codon (TGA, TAG or TAA)
- Terminator sequence (prokaryotes)
- TATA box (eukaryotes)
- Shine-Dalgarno sequence (prokaryotes)
- Kozak sequence (eukaryotes)
- Poly(A) addition signal (eukaryotes)
- Intron/exon boundaries.
- CpG islands.
What is the open reading frame (ORF)?
ORF analysis is effective for bacterial genomic DNA sequences
- It is not so effective for analyzing the DNA of eukaryotes (particularly higher eukaryotes), due to the intron/exon structure of eukaryotic genes
- More sophisticated gene identification software is often required for eukaryotes
ORF-based gene finders are sometimes referred to as ab initio methods
- This is because they attempt to predict genes based only on knowledge of gene structure
What are the criteria for judging a good ORF?
An ORF should begin with a start codon (methionine residue) and end with an in-frame stop codon
- It must be of reasonable size (the longer the better). Long ORFs are unlikely to occur by chance and thus signify potential genes. Short amino acid sequences are probably not real ORFs
- It should end with an in-frame stop codon. Many stop codons close together suggest that an ORF is not present
NB: in the lab session, ORF analysis will be used to search for prokaryotic genes
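The criteria above (start codon, in-frame stop codon, reasonable length) can be sketched as a simple scan. The function name and the default threshold of 30 codons are assumptions for illustration, not a recommended cutoff, and only the three forward-strand frames are scanned to keep the example short.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=30):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start and end are 0-based positions; end is exclusive and
    includes the stop codon. frame is 1, 2 or 3.
    """
    dna = dna.upper()
    orfs = []
    for offset in range(3):
        start = None
        for i in range(offset, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # coding codons before the stop
                if n_codons >= min_codons:     # keep only reasonably long ORFs
                    orfs.append((start, i + 3, offset + 1))
                start = None                   # look for the next ATG
    return orfs


# Hypothetical toy sequence: ATG, five GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"
print(find_orfs(toy, min_codons=5))
```

Raising `min_codons` makes the scan stricter, which reflects the point that short candidates are more likely to arise by chance.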
Describe sequence alignment software
Computer analysis can be employed to determine whether a newly sequenced gene is similar to one that is already known and stored in a database
- The two most popular sequence alignment programs are BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All)
- The BLAST program searches a nucleic acid or protein database to find sequences matching or similar to the one being tested. The BLAST approach first looks for similar segments (high-scoring segment pairs, HSPs) between the query sequence and a database sequence
- Next, it evaluates the statistical significance of any matches that were found
- Finally, it reports only those matches that satisfy a user-selectable threshold of significance
What is the emphasis of sequence alignment software?
The emphasis of this tool is to find regions of sequence similarity
These can yield clues about the structure and function of this novel sequence and about its evolutionary history and homology with other sequences in the databases
- Regions of similarity detected via this type of alignment tool can be either local, where the similarity is confined to one portion of the sequences, or global, where the alignment spans the full length of the sequences, so that similarity can be detected even across otherwise dissimilar stretches
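The difference between local and global alignment can be illustrated with a small dynamic-programming sketch. A global score covers both sequences end to end, while a local score is the best score over any pair of subsequences. The function name and scoring values are assumptions. BLAST uses a faster heuristic rather than filling this full table.

```python
def dp_alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b (global by default, local if asked)."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps must be paid for.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(table[i - 1][j - 1] + pair,   # align two residues
                       table[i - 1][j] + gap,        # gap in b
                       table[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)                  # local: restart freely
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Two hypothetical sequences that share only the segment GATTACA.
x, y = "CCCCGATTACA", "GATTACAGGGG"
print("local:", dp_alignment_score(x, y, local=True))
print("global:", dp_alignment_score(x, y))
```

The local score rewards the shared segment alone, while the global score is pulled down by the unrelated flanking residues.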
What is BLASTP used to search?
BLASTP is used to search a protein database using a protein query
What is BLASTX used to search?
BLASTX is used to search a protein database using a translated nucleotide query
It compares the 6-frame translations of a DNA query to a protein database
What is tblastn used to search?
Used to search a translated nucleotide database using a protein query
What is tblastx used to search?
Used to search a translated nucleotide database using a translated nucleotide query. It compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each sequence comparison is comparable to a BLASTP search)
What is FASTA used to search?
Compares a DNA query to a DNA database, or a protein query to a protein database
What is FASTX used to search?
Compares a translated DNA query to a protein database
What is TFASTA used to search?
Compares a protein query to a translated DNA database
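The program cards above can be summarized as a lookup from query type and database type to the programs described. This is a study aid only. The table and function names are hypothetical, and the table lists just the programs covered in these cards.

```python
# (query type, database type) -> programs from the cards above
PROGRAMS = {
    ("protein", "protein"): ["BLASTP", "FASTA"],
    ("nucleotide", "protein"): ["BLASTX", "FASTX"],        # query is translated
    ("protein", "nucleotide"): ["tblastn", "TFASTA"],      # database is translated
    ("nucleotide", "nucleotide"): ["tblastx", "FASTA"],    # tblastx translates both
}


def suggest_programs(query_type, database_type):
    """Return the programs from the cards that fit a query/database pair."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    return PROGRAMS[key]


print(suggest_programs("nucleotide", "protein"))
```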
What are the types of search format for BLAST query input?
The BLAST ‘Search’ box accepts a number of different types of input and automatically determines the format. Accepted input types are:
- FASTA
- Bare Sequence
- Identifiers
Describe FASTA as a search format
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is….
- Blank lines aren’t allowed in the middle of FASTA input
- Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes
- However, a single hyphen or dash can be used to represent a gap of indeterminate length
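The FASTA rules above (a ">" description line followed by sequence lines, with no blank lines in the middle) can be checked with a short parser. This is a minimal sketch, not the parser BLAST uses. The function name and the sample record are hypothetical and made up for illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    description, chunks = None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            raise ValueError("blank lines are not allowed inside FASTA input")
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is None:
            raise ValueError("FASTA input must start with a '>' description line")
        else:
            chunks.append(line)   # sequence data may span several lines
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Hypothetical record, invented for this example.
sample = """>example_protein made-up record for illustration
MKTAYIAKQR
QISFVKSHFS
"""
print(parse_fasta(sample))
```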
Describe Bare sequence as a search format
- This may be just lines of sequence data, without the FASTA definition line, e.g.
QIKDLLGDGGGGSSSSUHTVRFVRDD
- It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept Flatfile report
- Blank lines aren’t allowed in the middle of bare sequence input
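As a sketch of how sequence interspersed with numbers and spaces reduces to plain sequence, the snippet below strips digits and whitespace from a flatfile-style line. The function name and the sample line are assumptions made up for illustration. BLAST does this cleanup itself when it reads the input.

```python
def clean_bare_sequence(text):
    """Remove digits and whitespace, keeping only the sequence characters."""
    return "".join(ch for ch in text if not (ch.isdigit() or ch.isspace()))


# Hypothetical line in the style of a flatfile sequence section.
line = "        1 gatcctccat atacaacggt"
print(clean_bare_sequence(line))
```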
Describe identifiers as a search format
- Normally these are simply accession, accession.version or gi’s (e.g., p01013, AAA68881.1, 129295)
- However, a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted
- These NCBI sequence identifiers have a very specific syntax. The identifier may consist of only one token (i.e., word)
- Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier aren’t allowed)
- Examples of illegal input are: ACCENSION P01013 AAA68881.1 gi|129295
Explain sequence homology
Analysis of the DNA sequences of homologous genes provides clues to the evolutionary relationships between organisms
- Two species that are closely related to each other will have DNA (or amino acid) sequences that are more similar to each other than if they are more distantly related
- Such sequence analyses can be used to construct family trees of organisms
- Bacteria from different species can exchange DNA sequences (horizontal transfer), making it more difficult to establish relationships among bacteria based on their DNA sequences
- Multiple sequence alignment software, like CLUSTAL W and COBALT (Constraint-based Multiple Alignment Tool), is used by scientists to study the phylogenetic relationships between species
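The idea that closely related species have more similar sequences can be quantified with a simple distance: the proportion of aligned sites that differ (often called the p-distance). The sketch below assumes the sequences are already aligned. The function name and the three short sequences are hypothetical, and tools such as CLUSTAL W use more elaborate models.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differences = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue                  # skip sites with a gap
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Hypothetical aligned sequences from three species.
species_1 = "ACGTACGTAC"
species_2 = "ACGTACGTAT"
species_3 = "ACGAACCTTT"
print(p_distance(species_1, species_2))   # small distance: close relatives
print(p_distance(species_1, species_3))   # larger distance: more distant
```

A table of such pairwise distances is the kind of input from which a tree of the species can be built.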
Explain what a phylogenetic tree is
A phylogenetic tree or evolutionary tree is a branching diagram or “tree”
- It shows the inferred evolutionary relationships among various biological species or other entities
- It is based upon similarities and differences in the species’ physical and/or genetic characteristics
- The taxa are joined together at the tree “trunk”, which represents their common ancestor, and organisms that have arisen from it are placed at the ends (tips) of the tree “branches”
- Closely related groups are located on branches close to one another
What does BLAST do?
BLAST is the most widely used software in bioinformatics research
Its main function is to compare a sequence of interest, the query sequence, to the sequences in a large database
Explain the Mascot search engine from Matrix Science
- A powerful database search engine
- Integrates all of the proven methods of database searching
- Peptide mass fingerprint, Sequence Query and MS/MS Ion Search
- Uses any nucleic or amino acid database in FASTA format:
- to identify, characterize and quantify proteins
What are the challenges in bioinformatics?
Biological Redundancy and multiplicity:
- Different sequences with similar structures
- Organisms with similar genes
- Multiple functions of single genes
- Grouping of genes in pathways
Sequence redundancy in genomes
Significance of relationships and similarities
Signal vs Noise
Lack of data