Bioinformatics Flashcards
Given a gene, how would you find information about it?
Find the sequence (EMBL, DDBJ)
Literature databases
Genomics databases (MIM)
Gene expression databases (NCBI GEO)
Interaction databases (IntAct, BIND)
Metabolic pathway databases (ENZYME, KEGG, Reactome)
Mutation/polymorphism databases (dbSNP)
What is a database?
A collection of data that is structured, searchable, updatable, cross-linked and publicly available.
Why does BLAST work?
Similar sequences tend to have similar function
Similar sequences tend to be evolutionarily related
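How can you be sure your BLAST match is significant?
E-value (the number of matches this good expected by chance; for small values it is roughly equal to the probability of a chance match)
E = m × n × 2^(-S)
m - number of nucleotides your sequence was compared against
n - number of nucleotides in your sequence
2^(-S) - 2 to the power of minus the match score (gets smaller as the sequences get more similar, so a smaller E means a more significant match).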
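A minimal sketch of the E-value arithmetic, assuming the usual form E = m × n × 2^(-S) where S is the normalized (bit) score. The function name and the example numbers are illustrative only, not taken from any BLAST library.

```python
def blast_e_value(m: int, n: int, bit_score: float) -> float:
    """Expected number of chance hits, assuming E = m * n * 2^(-S).

    m: total length of the database searched (illustrative parameter name)
    n: length of the query sequence
    bit_score: normalized match score S
    """
    return m * n * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: 100-nt query against a 1,000,000-nt database.
    for score in (20, 40, 60):
        print(score, blast_e_value(1_000_000, 100, score))
```

Each extra bit of score halves E, which is why a high-scoring match ends up with a very small E-value.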
BLAST for amino acid (protein) sequences
BLASTp
BLAST for nucleotide sequences
BLASTn
What does BLASTx do?
BLASTs a nucleotide sequence, translated in all 6 frames, against an amino acid sequence database
What does tBLASTn do?
BLASTs an amino acid sequence against a nucleotide sequence database that has been translated in all 6 frames
What does tBLASTx do?
BLASTs your nucleotide sequence, translated into amino acids in all six frames, against database nucleotide sequences translated into amino acids in all six frames
Good for distantly related sequences.
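The "translated in all 6 frames" step used by BLASTx, tBLASTn and tBLASTx can be sketched as follows. This is an illustration under stated assumptions (the standard genetic code, unambiguous A/C/G/T input, hypothetical function names), not how BLAST itself is implemented.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translate(seq: str) -> dict:
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translate("ATGGCC").items():
        print(name, protein)
```

Stop codons appear as `*` in the output; a real search tool would handle them and ambiguous bases more carefully.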
MegaBLAST
Quicker than BLASTn but less sensitive
Use this for everything unless looking for distantly related sequences (use tBLASTx for that)
PSI-BLAST
A very sensitive BLAST that takes into account that some regions are more conserved than others. Takes a long time to run.
What is special about multiple sequence alignments?
They can reveal subtle conservation of genome features, because these areas evolve/change more slowly. Alignments of more than 3 sequences can show evolutionary relationships.
E.g. demographic and ecological histories of populations: gene flow, size changes, natural selection, migrations.
Local vs global alignments
Global - end to end alignments
Local - specific regions of sequence
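To make the "specific regions" idea concrete, here is a small sketch of a local alignment score in the style of Smith-Waterman. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best score of any pair of sub-regions of a and b.

    Cell scores are clamped at zero, so a poor stretch never drags down
    a good region elsewhere; this clamp is what makes the alignment local.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACGT" is found despite unrelated flanking sequence.
    print(local_alignment_score("TTTACGTTTT", "GGGACGTGGG"))
```

A global alignment of the same two sequences would also have to pay for the mismatched flanks, which is why local alignment suits finding a shared region inside otherwise unrelated sequences.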
Common mismatch scoring schemes
Nucleotide mismatch scores
Amino acid mismatch matrices (BLOSUM, PAM)
How are most multiple alignments done?
Multiple alignments are built up from pairwise alignments. Mismatch scores are used to find the best-scoring alignment, using a technique called dynamic programming.
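Dynamic programming for a single pairwise global alignment can be sketched as below, in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap -1) are assumptions; real tools use substitution matrices and more elaborate gap penalties, and this sketch covers only the pairwise building block, not the multiple-alignment step.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for an end-to-end alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a prefix aligned to gaps only
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align two residues
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

The table stores the best score for every pair of prefixes, so each cell is computed once from its three neighbours instead of enumerating every possible alignment.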
Pairwise alignment methods
ClustalW - global alignment, for sequences up to about 20 kb long
MUSCLE - global and local, up to about 100 kb long
Mauve - global, up to about 10 Mb long
Uses of sequence databases in bioinformatics
Retrieve a known gene sequence
Find information on a gene
Compare a sequence to others in the database
Submit a sequence to be stored with the rest
Find how many genes an organism has
Why is it harder to do gene prediction in humans vs bacteria?
Bacteria have specific and well-understood promoter sequences (easy to identify). Each protein-coding sequence is one contiguous ORF.
Human promoters are complex and less well understood (harder to identify). Protein-coding sequence is divided into exons and spliced variably.
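Because a bacterial protein-coding sequence is one contiguous ORF, a naive scan for ATG followed by an in-frame stop codon already finds candidates. The sketch below assumes the forward strand only, ATG as the sole start codon, and a hypothetical `min_codons` cutoff; it would not work on spliced human genes, where the coding sequence is split across exons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2):
    """Return (start, end) index pairs of ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; ORF length is counted
    in codons from the start codon through the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] != "ATG":
                i += 3
                continue
            # Walk codon by codon until an in-frame stop or the sequence end.
            j = i + 3
            while j + 3 <= len(seq) and seq[j:j + 3] not in STOP_CODONS:
                j += 3
            if j + 3 > len(seq):
                break                  # no in-frame stop codon was found
            end = j + 3
            if (end - i) // 3 >= min_codons:
                orfs.append((i, end))
            i = end
    return sorted(orfs)


if __name__ == "__main__":
    dna = "CCATGAAATAGCC"
    for start, end in find_orfs(dna):
        print(start, end, dna[start:end])
```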
Why would you want to know the GC content of a sequence?
Higher GC content is generally associated with longer protein-coding regions.
It affects the melting temperature for PCR.
Different organisms have different GC content.
It is useful in mapping exon-rich regions.
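A short sketch of the GC-content uses listed above. The melting temperature uses the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough estimate for short primers; the function names and window sizes are illustrative.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rough melting temperature in degrees C for a short primer."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def windowed_gc(seq: str, window: int, step: int) -> list:
    """GC content in successive windows, e.g. to spot GC-rich regions."""
    return [gc_content(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    print(gc_content("ATGC"))
    print(wallace_tm("ATGC"))
    print(windowed_gc("AAAAGGGG", window=4, step=4))
```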
Which genes are more homologous, this or that?
You can’t quantify homology. It is a conceptual framework for defining the evolutionary relationship between two genes. You can quantify similarity. If the genes come from different species you can look at orthology.
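Similarity, unlike homology, can be given a number. One common measure is percent identity over an existing alignment; the sketch below assumes two already-aligned, equal-length strings with `-` for gaps, and counts identity over all alignment columns (other conventions for the denominator exist).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns where both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns that are gaps in both sequences.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    print(percent_identity("ACGT", "ACGA"))
    print(percent_identity("AC-T", "ACGT"))
```

A high percent identity is evidence for homology, but the number itself measures similarity only.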
Why is bioinformatics needed?
Small- and large-scale analysis
New lab techniques
Single -> whole genome
Collection/storage of data
Manipulation of data
Examples of sequence databases
EMBL
DDBJ
GenBank
What do genomics databases contain?
Information about a gene's chromosomal location
Nomenclature
Links to sequence databases
E.g. MIM