Bioinformatics Flashcards
Given a gene, how would you find information about it?
Find its sequence (EMBL, DDBJ)
Literature databases
Genomics databases (MIM)
Gene expression databases (NCBI GEO)
Interaction databases (IntAct, BIND)
Metabolic pathway databases (ENZYME, KEGG, Reactome)
Mutation/polymorphism databases (dbSNP)
What is a database?
A data collection that is structured, searchable, updatable, cross-linked and publicly available.
Why does BLAST work?
Similar sequences tend to have similar function
Similar sequences tend to be evolutionarily related
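How can you be sure your BLAST match is significant?
E-value: the number of matches scoring at least this well that you would expect by chance (for small values, roughly equal to the probability of a chance match)
E = m × n × 2^(-S)
m - number of nucleotides your sequence was compared against (database size)
n - number of nucleotides in your sequence
S - match (bit) score; 2^(-S) gets smaller as the sequences get more similar, so E gets smaller too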
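A minimal sketch of the E-value calculation, assuming the formula is read as E = m × n × 2^(-S) with S the bit score. The function name and the example numbers are illustrative only and do not come from real BLAST output.

```python
def blast_e_value(m: int, n: int, bit_score: float) -> float:
    """Expected number of chance hits: E = m * n * 2^(-S).

    m: total residues searched (database size), n: query length,
    bit_score: the match score S in bits.
    """
    return m * n * 2.0 ** (-bit_score)


# Hypothetical search: a 100-nucleotide query against 1,000,000 nucleotides.
for s in (20, 40, 60):
    print(f"bit score {s}: E = {blast_e_value(1_000_000, 100, s):.3g}")
```

Each extra bit of score halves E, which is why a high-scoring match is very unlikely to be a chance hit even in a large database.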
BLAST for aa sequences
BLASTp
BLAST for nucleotide sequences
BLASTn
What does BLASTx do?
Translates your nucleotide sequence in all 6 frames and BLASTs it against an aa sequence database
What does tBLASTn do?
BLASTs your aa sequence against a nucleotide sequence database that has been translated in all 6 frames
What does tBLASTx do?
Translates your nucleotide sequence in all 6 frames into aa and BLASTs it against a nucleotide database that has also been translated in all 6 frames
Good for distantly related sequences.
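The six-frame translation that BLASTx, tBLASTn and tBLASTx rely on can be sketched as follows. This is an illustration under stated assumptions, not how BLAST itself is implemented: it assumes the standard genetic code, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; '*' is a stop, 'X' an unrecognised codon.

    A trailing partial codon is dropped.
    """
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Return the translations of frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```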
MegaBLAST
Quicker than BLASTn but less sensitive; best for highly similar sequences
Use this for everything unless looking for distantly related sequences (use tBLASTx for that)
PSI-BLAST
Very sensitive, iterative protein BLAST that builds a position-specific scoring matrix, so it takes into account that some regions are more conserved than others. Slow to run.
What is special about multiple sequence alignments?
Can reveal subtle conservation of genome features, because conserved regions evolve/change more slowly. Alignments of more than 3 sequences can show evolutionary relationships.
E.g. demographic and ecological histories of populations - gene flow, size changes, natural selection, migrations.
Local vs global alignments
Global - end to end alignments
Local - specific regions of sequence
Common mismatch scoring schemes
Nucleotide match/mismatch scores
Aa substitution matrices (BLOSUM, PAM)
How are most multiple alignments done?
Built up from pairwise alignments. Mismatch and gap scores are used to find the best-scoring alignment, using a technique called dynamic programming.
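A minimal sketch of dynamic programming for a single pairwise alignment, covering both the global (Needleman-Wunsch) and local (Smith-Waterman) cases. The match, mismatch and gap values are arbitrary assumptions for illustration; real tools use substitution matrices such as BLOSUM or PAM and more elaborate gap penalties.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Best pairwise alignment score by dynamic programming.

    local=False: global, end-to-end alignment (Needleman-Wunsch).
    local=True:  best-matching sub-regions only (Smith-Waterman).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment pays for gaps at the ends of the sequences.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


x, y = "TTTGATTACATTT", "CCCGATTACACCC"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

The local score only counts the shared GATTACA core, while the global score is pulled down by the mismatched flanks.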
Multiple alignment programs (built on pairwise alignment)
ClustalW - global alignment, sequences up to about 20 kb long
MUSCLE - global and local, up to about 100 kb long
Mauve - global, up to about 10 Mb long
Uses of sequence databases in bioinformatics
Retrieve a known gene sequence
Find info on a gene
Compare a sequence to others in the DB
Submit a sequence to be stored with the rest
Find how many genes an organism has
Why is it harder to do gene prediction in humans vs bacteria?
Bacteria have specific and well understood promoter sequences (easy to identify). Each protein-coding sequence is one contiguous ORF.
Human promoters are less well understood and more complex (harder to identify). Protein-coding sequence is divided into exons and spliced variably.
Why would you want to know the GC content of a sequence?
Higher GC content generally goes with longer protein-coding regions.
It determines melting temp for PCR.
Different organisms have different GC content.
Useful in mapping exon-rich regions.
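A small sketch of computing GC content and using it for a rough primer melting temperature. The melting temperature here uses the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) in degrees C, which is an assumption added for illustration and only a coarse estimate for short oligonucleotides; the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature (deg C) for a short primer."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "ATGCGCGCTAGCTAGGCT"  # made-up primer sequence
print(f"GC content: {gc_content(primer):.2f}")
print(f"Approximate Tm: {wallace_tm(primer)} C")
```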
Which genes are more homologous, this or that?
You can’t quantify homology. It is a conceptual framework to define the evolutionary relationship between two genes. You can quantify similarity. If they come from different species you can look at orthology.
Why is bioinformatics needed?
Small and large scale analysis
New lab techniques
Single gene -> whole genome
Collection/storage of data
Manipulation of data
Egs of sequence databases
EMBL
DDBJ
GenBank
What do genomics databases contain
Info about gene chromosomal location
Nomenclature
Links to sequence databases
E.g. MIM
What is an isoform?
An alternative form of a gene product (transcript or protein) from the same gene, e.g. produced by alternative splicing
Egs of gene expression databases
NCBI GEO
How to remove vector sequence from a DNA sample sequence
Run it against a vector sequence database, e.g. UniVec, and trim the matching regions
How to choose the most likely translation result
Usually the longest ORF
Starts with Met
Ends in a stop codon
No stops within the sequence
Confirm with promoter prediction
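The "longest ORF" rule can be sketched as a scan over all six frames for the longest stretch that starts with ATG (Met), ends in a stop codon and has no stop in between. This is a simplified illustration with hypothetical function names; it ignores alternative start codons and does not replace tools such as NCBI ORF Finder.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orfs_in_strand(seq: str):
    """Yield (start, end) of ATG...stop ORFs in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first Met opens the ORF
            elif start is not None and codon in STOP_CODONS:
                yield start, i + 3  # the stop codon closes it
                start = None


def longest_orf(dna: str):
    """Return (strand, start, end, sequence) of the longest ORF, or None.

    Coordinates for the '-' strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    best = None
    for strand, seq in (("+", dna), ("-", dna.translate(COMPLEMENT)[::-1])):
        for start, end in orfs_in_strand(seq):
            if best is None or end - start > best[2] - best[1]:
                best = (strand, start, end, seq[start:end])
    return best


print(longest_orf("CCATGAAATTTTAGGATGTAA"))
```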
Egs of gene prediction software
GeneMark
GENSCAN
Translators and promoter prediction software
NCBI ORF Finder
Promoter 2.0 Prediction Server
Protein sequence databases
UniProt
GenPept
RefSeq
Database of 3D structures
Protein Data Bank (PDB)
Protein domain / family databases integrated into what site
InterPro
What is a motif?
Sequence of aa encoding a certain molecular function
Short = motif; long = functional domain
Short linear motifs
Unrelated proteins sharing a functional feature are likely to contain similar motifs
Etc
Classification of motifs
Modification
Ligand
Targeting
Cleavage
What is a regular expression?
A pattern that specifies which aa are allowed at each position of a motif
Used by PROSITE
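A hedged sketch of how a PROSITE-style pattern maps onto an ordinary regular expression. The conversion rules used here are the basic ones (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, < and > for the sequence ends); the function name is made up, and edge cases such as anchors inside square brackets are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")                       # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")    # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")     # x(2,4) = 2 to 4 residues
    regex = regex.replace("<", "^").replace(">", "$")     # N- and C-terminal anchors
    return regex


# N-glycosylation site pattern: Asn, not Pro, Ser or Thr, not Pro.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

protein = "MKNGSAPLNPTQ"  # made-up sequence for illustration
for hit in re.finditer(regex, protein):
    print(hit.start(), hit.group())
```

In the made-up sequence only NGSA matches; NPTQ is rejected because Pro follows the Asn, which shows how a pattern states exactly what is allowed at each position.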
BioEdit analysis for cloning
Nucleotide composition
Six-frame translation
Determine ORF
Length of insert/DNA
Restriction enzyme (RE) mapping
Transition vs transversion
A transition is purine to purine or pyrimidine to pyrimidine (e.g. A to G, T to C)
A transversion is the opposite: purine to pyrimidine or pyrimidine to purine (e.g. A to C, G to T)
(twice as many transversions are possible, but roughly twice as many transitions occur)
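A short sketch that classifies substitutions and counts them between two aligned sequences. The function names and the example sequences are illustrative assumptions; positions with gaps or ambiguous bases are simply skipped.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("ref and alt are the same base")
    same_class = ({ref, alt} <= PURINES) or ({ref, alt} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1: str, seq2: str) -> tuple:
    """Count (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical, gapped or ambiguous position
        if classify_substitution(a, b) == "transition":
            ts += 1
        else:
            tv += 1
    return ts, tv


# Of the 12 possible substitutions, 4 are transitions and 8 are transversions.
kinds = [classify_substitution(a, b) for a, b in permutations("ACGT", 2)]
print(kinds.count("transition"), kinds.count("transversion"))
print(count_ts_tv("AACCGGTT", "GACTGCTA"))
```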
Types of sequence formats
FASTA
GenBank
NEXUS
PHYLIP
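Of these formats, FASTA is the simplest: a header line starting with ">" followed by one or more lines of sequence. A minimal parser sketch follows; the function name and the records are made up for this example, and real projects would normally use an established library instead.

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()  # header text without the '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 demo record
ACGTAC
GTTT
>seq2
GGCC
"""
for name, seq in parse_fasta(example).items():
    print(name, len(seq), seq)
```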
Types of sequence viewers
SeaView
AliView
Mesquite
MEGA
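What is an open reading frame?
A string of in-frame codons, each specifying an amino acid, with no stop codon in between
Starts with ATG (Met), or occasionally an alternative start codon such as GTG (normally Val)
Ends with a stop codon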
Gene prediction software
GeneMark
GENSCAN
Microbial Gene Prediction System
Glimmer
What are promoters?
DNA sequence involved in regulating transcription
Types of promoters
- core
- proximal
- distal
Functions of promoters
- integrate info about cell conditions and alter rate of transcription in response
- different components responsible for different parts of expression pattern
Tasks of bioinformatics
- identify promoter regions
- find TFBS and TFBS modules in a sequence
- discover novel TFBS motifs
- construct TFBS and their motifs
- analysis of expression data
How to represent TFBS motifs
- consensus sequence
- position weight matrix
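Both representations can be derived from a set of aligned binding sites, as in this sketch. The sites are invented for illustration, and the pseudocount and the uniform background frequency of 0.25 are assumptions; databases such as TRANSFAC and JASPAR distribute curated matrices.

```python
import math


def count_matrix(sites: list) -> list:
    """Count each base at each position of equal-length aligned sites."""
    counts = [{base: 0 for base in "ACGT"} for _ in range(len(sites[0]))]
    for site in sites:
        for pos, base in enumerate(site.upper()):
            counts[pos][base] += 1
    return counts


def consensus(counts: list) -> str:
    """Most frequent base at each position (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=column.get) for column in counts)


def position_weight_matrix(counts: list, pseudocount: float = 1.0,
                           background: float = 0.25) -> list:
    """Log2 odds of each base at each position relative to the background."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + 4 * pseudocount
        pwm.append({base: math.log2((column[base] + pseudocount) / total / background)
                    for base in "ACGT"})
    return pwm


def score_site(pwm: list, site: str) -> float:
    """Sum the per-position weights for a candidate site."""
    return sum(column[base] for column, base in zip(pwm, site.upper()))


sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]  # hypothetical aligned sites
counts = count_matrix(sites)
pwm = position_weight_matrix(counts)
print(consensus(counts))
print(round(score_site(pwm, "TATAAT"), 2), round(score_site(pwm, "GCGCGC"), 2))
```

Unlike the consensus, the matrix keeps the weaker preferences, so a site that differs from the consensus at one tolerated position still scores well.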
Databases of TFBS motifs
Transfac
Jaspar
What is phylogenetic footprinting?
Use of comparative genomics to infer functional genomic regions from conservation
What does phylogenetic footprinting require?
- comparison of correctly identified orthologous promoter regions
- conserved function across species
- species sufficiently diverged to reduce passive conservation
POSSUM workflow
- set of co-expressed genes
- automated sequence retrieval from Ensembl
- phylogenetic footprinting
- detection of TFBS
- statistical significance of binding sites
What are methods of miRNA identification based on?
- targets tend to be located in the 3’ UTR
- some are complementary to the target RNA
What is a motif?
A sequence of amino acids encoding a particular molecular function
What is PROSITE?
Database of patterns (regular expressions) and profiles describing protein families, domains and functional sites, e.g. enzyme active sites
Advantages of regular expressions
- memorable to humans
- computationally fast
- standardized in scripting languages
- can describe a motif very well
Disadvantages of regular expressions
- over-predict (many chance matches)
- motif may vary in other lineages
- do not capture weaker preferences
- easy to make poor representation
Example methods for finding protein functional domains
Matrix/profile
Hidden Markov model
Sequence clustering