Intro To Bioinformatics Flashcards by Jhaunte Braithwaite

What is bioinformatics?

The term bioinformatics was coined by Paulien Hogeweg and Ben Hesper in 1970 for the study of informatic processes in biotic systems

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline (NCBI, 2009)
In bioinformatics, computer databases are used to store, retrieve and assist in understanding biological information

Give examples of biological data in bioinformatics

DNA (genome): sequence, pathway
RNA (transcriptome): structure, interaction
Protein (proteome): evolution, mutations

What are the applications of bioinformatics in the analysis of DNA?

simple sequence analysis
Gene finding
regulatory regions
whole genome annotations
comparative genomics (analysis between species and strains)

What are the applications of bioinformatics in the analysis of RNA?

Splice variants
Tissue-specific expression
structure
single gene analysis (various cloning techniques)
Experimental data involving thousands of genes simultaneously
DNA chips, micro-arrays and expression array analysis

What are the applications of bioinformatics in the analysis of proteins?

homology
conserved domains/regions
structure determination (molecular modeling): 2D, 3D & quaternary structure
protein function
Analysis often involves 2D gels and mass spectrometry

What are the major Nucleotide Sequence databases?

GenBank: National Center for Biotechnology Information (NCBI)
- GenBank is the NIH genetic sequence database. It is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI
EMBL: European Molecular Biology Laboratory
- The European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database is the European equivalent of the U.S.’s GenBank
DDBJ: DNA Data Bank of Japan
- The DNA Data Bank of Japan (DDBJ), which is based at Japan’s National Institute of Genetics, is the third in the trio of major nucleotide sequence databases

What are the major protein sequence databases?

UniProt: Universal Protein Resource
PIR: Protein Information Resource Databases
Swiss-Prot
ExPASy

Describe UniProt: Universal Protein Resource

UniProt is a single database that combines the information of the major international protein databases, maintained by the European Bioinformatics Institute (EBI), Cambridge, UK; the Protein Information Resource (PIR) at Georgetown University Medical Center (GUMC) and the National Biomedical Research Foundation (NBRF), Washington, D.C.; and the Swiss Institute of Bioinformatics (SIB), Geneva, Switzerland

Describe the PIR: Protein Information Resource Databases

PIR grew out of the Atlas of Protein Sequence and Structure (1965-1978), which was edited by Margaret Dayhoff

Describe Swiss-Prot

Swiss-Prot is the major European protein sequence database, from the Swiss Institute of Bioinformatics

Describe ExPASy: Expert Protein Analysis System

ExPASy is the Swiss Institute of Bioinformatics (SIB) Resource Portal, which provides access to scientific databases and software tools in different areas of life sciences, including proteomics, genomics, phylogeny, systems biology, population genetics and transcriptomics

What is the query sequence?

A sequence, either amino acid or nucleotide, chosen by the user to use in a BLAST search

A query sequence can be typed or pasted into the query window on the search form
BLAST searches require a minimum query sequence length of 15 nucleotides or amino acids
A query sequence can be given as FASTA, as a bare sequence, or as an identifier (accession number or GenInfo Identifier, gi)

What is an alignment ?

A presentation of two compared sequences showing the regions of greatest statistical similarity

What is the score value ?

The score value is a measure of the quality of the alignment between the query sequence and the search results

-the higher the score, the better the alignment

What is the E-value?

The E-value refers to the expectation value

The number of different alignments with scores equivalent to or better than the observed alignment score that are expected to occur in a database search by chance
The lower the E value, the better the match
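
As a rough illustration of why a higher score gives a lower E-value, the sketch below uses the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name and the example numbers are assumptions made for illustration; BLAST itself works with corrected ("effective") lengths, so the values it reports will differ.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Assumption: raw lengths are used here, whereas BLAST applies
    edge-effect corrections to both lengths before computing E.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 250-residue query against 50 million database residues.
weak_hit = expect_value(bit_score=30.0, query_len=250, db_len=50_000_000)
strong_hit = expect_value(bit_score=60.0, query_len=250, db_len=50_000_000)

print(f"bit score 30 -> E = {weak_hit:.3g}")
print(f"bit score 60 -> E = {strong_hit:.3g}")
```

Each additional bit of score halves the E-value, and doubling the database size doubles it, which is why the same alignment looks less significant in a larger database.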

What is genome annotation?

Obtaining biological information from unprocessed sequence data
The ultimate goal is to create a labeled genome, where biological information is linked to sequence
There are two types: structural and functional annotation

What is structural genome annotation?

The identification of genomic elements (genes and other important sequences)

ORFs and their localization
gene structure
coding regions
location of regulatory motifs

What is functional genome annotation?

Consists of attaching biological function to genomic elements

biochemical function
biological function
regulation and interactions involved
expression

The basic level of annotation is using BLAST for finding similarities, and then annotating genomes based on that

Functions of gene prediction software include:

To identify genes within a long DNA sequence
A DNA sequence that codes for amino acids should not contain any internal stop codons
Such a coding region is called an open reading frame (ORF)
Because each DNA strand can be read in three reading frames and there are two DNA strands, the computer must analyze a given DNA sequence in six different reading frames
Examples of tools that can be used are Open Reading Frame Finder (ORF Finder) at NCBI and GENSCAN
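
The six reading frames mentioned above can be sketched in a few lines: three codon offsets on the given strand and three on its reverse complement. This is a minimal illustration, not how ORF Finder or GENSCAN is implemented; the function names and the frame labels (+1 to +3, -1 to -3) are assumptions, and only unambiguous A/C/G/T bases are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def six_frames(dna: str) -> dict[str, list[str]]:
    """Split a DNA sequence into codons for each of the six reading frames."""
    frames = {}
    for sign, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through the strand three bases at a time, dropping any
            # incomplete codon at the end.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames


for label, codons in six_frames("ATGAAATAG").items():
    print(label, " ".join(codons))
```

Each codon list could then be scanned for start and stop codons, or translated with a codon table, to look for open reading frames.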

The diagnostic features of a gene include the presence of…

Open reading frame (ORF)
Start codon (Met start codon, ATG)
Stop codon (TGA, TAG or TAA)
Terminator sequence (prokaryotes)
TATA box (eukaryotes)
Shine-Dalgarno sequence (prokaryotes)
Kozak sequence (eukaryotes)
Poly-A addition signal (eukaryotes)
Intron/exon boundaries
CpG islands

What is the open reading frame (ORF)?

ORF finding is effective for the analysis of bacterial genomic DNA sequences

It is not so effective for analyzing the DNA of eukaryotes (particularly higher eukaryotes)
- due to the intron/exon structure of eukaryotic genes
- More sophisticated gene identification software is often required for eukaryotes

ORF-based gene finders are sometimes referred to as ab initio methods
- because they attempt to predict genes based only on knowledge of gene structure

What are the criteria for judging a good ORF?

An ORF should begin with a start codon (methionine residue) and end with an in-frame stop codon

It must be of reasonable size (the longer the better). Long ORFs are unlikely to occur by chance and thus signify potential genes. Short amino acid sequences are probably not genuine ORFs
Many stop codons close together suggest that an ORF is not present
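
The criteria above (ATG start, in-frame stop, reasonable length) translate fairly directly into a scan over all six reading frames. The following is a simplified sketch suited to intron-free, prokaryote-style sequence; the function name, the `min_codons` parameter, its default of 30 and the toy input are illustrative assumptions, not settings taken from ORF Finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(dna: str, min_codons: int = 30) -> list[tuple[str, int, str]]:
    """Return (strand, frame, sequence) for every ORF of at least min_codons.

    An ORF here runs from the first ATG in a frame to the next in-frame stop
    codon. Its length counts the ATG but not the stop codon.
    """
    dna = dna.upper()
    strands = (("+", dna), ("-", dna.translate(_COMPLEMENT)[::-1]))
    orfs = []
    for strand_name, seq in strands:
        for frame in range(3):
            start = None  # position of the ATG that opened the current ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand_name, frame + 1, seq[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence (made up): one short ORF embedded in flanking bases.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

Raising `min_codons` is the simplest way to apply the "longer is better" rule, since short ATG-to-stop stretches arise frequently by chance.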

NB: in the lab session, ORF will be used to search for Prokaryotic genes

Describe sequence alignment software

Computer analysis can be employed to determine whether a newly sequenced gene is similar to one already known and stored in a database

The two most popular sequence alignment programs are BLAST (Basic Local Alignment Search Tool) and FASTA (FAST-All)
The BLAST program searches a sequence database (nucleic acid or protein) to find sequences that match or are similar to the one being tested. The BLAST approach:

- first looks for similar segments (high-scoring segment pairs, HSPs) between the query sequence and a database sequence
- next evaluates the statistical significance of any matches that were found
- finally reports only those matches that satisfy a user-selectable threshold of significance

What is the emphasis of sequence alignment software?

The emphasis of this tool is to find regions of sequence similarity

These can yield clues about the structure and function of this novel sequence and about its evolutionary history and homology with other sequences in the databases

- Regions of similarity detected by this type of alignment tool can be either local, where the similarity is confined to one region of the sequences, or global, where the alignment spans the entire length of both sequences
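
To make the local/global distinction concrete, the sketch below scores the same pair of sequences two ways: with the Smith-Waterman recurrence (local, scores may reset to zero) and with the Needleman-Wunsch recurrence (global, every position must be aligned or gapped). These are the textbook dynamic-programming methods, not BLAST's faster heuristic, and the scoring values (match +2, mismatch -1, gap -2) and example sequences are arbitrary assumptions.

```python
def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # The 0 lets an alignment start afresh anywhere in the matrix.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def global_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """End-to-end alignment score (Needleman-Wunsch, linear gap penalty)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Made-up sequences sharing one conserved block (ACGTACGT) in unrelated flanks.
x = "TTTTACGTACGTCCCC"
y = "GGGGACGTACGTAAAA"
print("local :", local_score(x, y))
print("global:", global_score(x, y))
```

The local score rewards the shared block alone, while the global score is pulled down by the mismatched flanks, which is why local alignment is preferred for database searching.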

What is BLASTP used to research?

BLASTP is used to search a protein database using a protein query

What is BLASTX used to research?

BLASTX is used to search a protein database using a translated nucleotide query. It compares the 6-frame translations of a DNA query to a protein database

What is tblastn used to research?

Used to search a translated nucleotide database using a protein query

What is tblastx used to research?

Used to search a translated nucleotide database using a translated nucleotide query. It compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each sequence is comparable to BLASTP searches)

What is FASTA used to research?

Compares a DNA query to DNA database, or a protein query to protein database

What is FASTX used to research?

Compares a translated DNA query to a protein database

What is TFASTA used to research?

Compares a protein query to a translated DNA database

What are the types of Search Format, Blast query input?

The BLAST ‘Search’ box accepts a number of different types of input and automatically determines the format. Accepted input types are:
1. FASTA
2. Bare sequence
3. Identifiers

Describe FASTA as a search format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (>) symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is....
- Blank lines aren’t allowed in the middle of FASTA input
- Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes
- However, a single hyphen or dash can be used to represent a gap of indeterminate length
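
The layout just described (a ">" description line followed by lines of sequence data) is simple enough to parse by hand. Below is a minimal sketch; the function name and the two sample records are made up for illustration, and unlike the BLAST form this version silently skips blank lines rather than rejecting them.

```python
from typing import Iterator


def parse_fasta(text: str) -> Iterator[tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines instead of failing
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        yield header, "".join(chunks)


sample = """>seq1 hypothetical example record
ACGTACGTAC
GGTTAA
>seq2 second made-up record
MKVLA-GG
"""
for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```

For real work, an established parser such as Biopython's `SeqIO.parse(handle, "fasta")` is usually the safer choice.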

Describe Bare sequence as a search format

- This may be just lines of sequence data, without the FASTA definition line, e.g. QIKDLLGDGGGGSSSSUHTVRFVRDD
- It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report
- Blank lines aren’t allowed in the middle of bare sequence input

Describe identifiers as a search format

- Normally these are simply accessions, accession.versions or gi’s (e.g., p01013, AAA68881.1, 129295)
- However, a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted
- These NCBI sequence identifiers have a very specific syntax. The identifier may consist of only one token (i.e., word)
- Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier aren’t allowed)
- Examples of illegal input are: ACCESSION P01013 AAA68881.1 gi|129295
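
The way a search form might tell the three input types apart can be sketched with a simple heuristic. Everything here is an assumption for illustration: the function name, the return labels and especially the regular expression, which is a loose stand-in and not NCBI's actual identifier grammar. It only encodes the rules stated on these cards: a leading ">" means FASTA, a single identifier-like token means an identifier, and anything else (including input containing spaces) is treated as bare sequence.

```python
import re

# Loose illustrative pattern: "gi|<digits>", plain digits, or a short letter
# prefix followed by digits and an optional ".version" suffix.
_IDENTIFIER = re.compile(r"gi\|\d+|\d+|[A-Za-z]{1,6}_?\d+(\.\d+)?")


def classify_query(text: str) -> str:
    """Guess whether the input is 'fasta', 'identifier' or 'bare sequence'."""
    if text.lstrip().startswith(">"):
        return "fasta"
    # fullmatch means any space, inside or around the token, fails the test.
    if _IDENTIFIER.fullmatch(text):
        return "identifier"
    return "bare sequence"


for query in ["p01013", "AAA68881.1", "gi|129295",
              "QIKDLLGDGGGGSSSSUHTVRFVRDD", "ACCESSION P01013",
              ">demo\nACGT"]:
    print(repr(query), "->", classify_query(query))
```

Because the pattern requires digits, a pure letter string such as a protein sequence falls through to "bare sequence", and a multi-word string never qualifies as an identifier.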

Explain sequence homology

Analysis of DNA sequences of homologous genes to provide clues to the evolutionary relationships between organisms
- Two species that are closely related to each other will have DNA (or amino acid) sequences that are more similar to each other than if they are more distantly related
- Such sequence analyses can be used to construct family trees of organisms
- Bacteria from different species can exchange DNA sequences (horizontal transfer), making it more difficult to establish relationships among bacteria based on their DNA sequences
- Multiple sequence alignment software, like CLUSTAL W & COBALT (Constraint-based Multiple Alignment Tool), is used by scientists to study the phylogenetic relationships between species
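
The idea that closely related species have more similar sequences can be quantified as percent identity between already-aligned sequences. The sketch below is a simplified illustration: the function name, the species labels and the short aligned sequences are all made up, and columns containing a gap ("-") are simply skipped, which is only one of several possible conventions.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percentage of identical residues over the gap-free aligned columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # assumption: ignore columns with a gap in either sequence
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical pre-aligned fragments of one homologous gene.
aligned = {
    "species_A": "ATGGCTAAGCTT",
    "species_B": "ATGGCTAAGCTA",
    "species_C": "ATGACTCAG-TA",
}
for (name1, seq1), (name2, seq2) in combinations(aligned.items(), 2):
    print(f"{name1} vs {name2}: {percent_identity(seq1, seq2):.1f}% identical")
```

Pairwise values like these (or the corresponding distances) are the kind of input from which tree-building methods infer which species group together.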

Explain what a phylogenetic tree is

A phylogenetic tree or evolutionary tree is a branching diagram or “tree”
- It shows the inferred evolutionary relationships among various biological species or other entities
- It is based upon similarities and differences in the species’ physical and/or genetic characteristics
- The common ancestor forms the tree “trunk”, and the taxa that have arisen from it are placed at the ends (tips) of the tree “branches”
- Closely related groups are located on branches close to one another

What does BLAST do?

BLAST is the most widely used software in bioinformatics research. Its main function is to compare a sequence of interest, the query sequence, with the sequences in a large database

Explain the Mascot search engine from Matrix Science

- A powerful database search engine
- Integrates all of the proven methods of database searching: Peptide Mass Fingerprint, Sequence Query and MS/MS Ion Search
- Uses any nucleic or amino acid database in FASTA format to identify, characterize and quantify proteins

What are the challenges in bioinformatics?

Biological redundancy and multiplicity:
- Different sequences with similar structures
- Organisms with similar genes
- Multiple functions of single genes
- Grouping of genes in pathways
Sequence redundancy in genomes
Significance of relationships and similarities
Signal vs noise
Lack of data

Intro To Bioinformatics Flashcards

(40 cards)