Part 2 - Lecture 5/6 - Substitution Scoring Matrices Flashcards
What gives us an alignment and a score for similarity for an entire sequence?
global pairwise alignment
What gives us the alignment and score for parts of sequences?
local pairwise alignment
What can precisely indicate interesting residues of nucleotides and amino acids?
multiple sequence alignment (MSA)
What does MSA depend on?
accuracy in pairwise alignment - which depends on scoring
What are the desirable features of a scoring matrix?
-we can think in terms of mutation and selection
-should we think about nucleotides and amino acids differently, and which properties matter the most
If you start with a high confidence alignment what do you get?
-have no gaps or spaces
-hopefully see very few mismatches
-can call these sequences related sequences
How do you calculate the ways to choose k elements from a set of n elements?
nCk = n! / (k! (n-k)!)
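As a quick check of the formula, here is a minimal Python sketch. The helper name n_choose_k and the 4-sequence example are illustrative assumptions, not part of the lecture; the standard library's math.comb computes the same quantity.

```python
import math

def n_choose_k(n: int, k: int) -> int:
    """Number of ways to choose k elements from n, via n! / (k! (n-k)!)."""
    if k < 0 or k > n:
        return 0
    return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Hypothetical example: number of residue pairs in one alignment column
# when the alignment contains 4 sequences.
pairs_in_column = n_choose_k(4, 2)
```

The same count is what you need when tallying pairs per column in an alignment.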
How do you move from substitution counts to probabilities?
take the number of pairs of nucleotides in a column, multiply by the number of columns, and divide the counts by that product
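The normalization can be sketched in Python as follows. This is one possible reading of the card, assuming a gap-free alignment in which every column has the same number of sequences; the function name and the toy columns are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def substitution_probabilities(columns):
    """Turn pair counts into probabilities.

    columns: list of strings, one string per alignment column,
             one character per sequence (assumed gap-free, equal length).
    """
    counts = Counter()
    for col in columns:
        for a, b in combinations(col, 2):
            # Treat (A, G) and (G, A) as the same unordered pair.
            counts[tuple(sorted((a, b)))] += 1
    n = len(columns[0])
    # Pairs per column (n choose 2) times the number of columns.
    total = (n * (n - 1) // 2) * len(columns)
    return {pair: c / total for pair, c in counts.items()}

# Toy data (not from the lecture): 3 sequences, 3 columns.
toy_columns = ["AAA", "AAG", "CCC"]
probs = substitution_probabilities(toy_columns)
```

Because every pair in every column is counted exactly once, the resulting probabilities sum to 1.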
Which exons are important?
the first and last exons
What happens to the part of a gene after a stop codon?
it will be part of the mRNA post splicing but will not be expressed
Why make a substitution scoring matrix?
had to do this because there was no way to make databases of genes down to a single nucleotide position of DNA
Why could you have a long untranslated part of a gene?
RNA polymerase not starting at some position
What happens more frequently, has less of an effect on function, and is less penalized in a scoring matrix than a transversion?
transition
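For reference, a transition swaps a purine for a purine (A, G) or a pyrimidine for a pyrimidine (C, T); any other nucleotide change is a transversion. A small Python sketch of that classification, with a hypothetical function name, could look like this:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a: str, b: str) -> str:
    """Classify an aligned DNA pair as match, transition, or transversion."""
    a, b = a.upper(), b.upper()
    valid = PURINES | PYRIMIDINES
    if a not in valid or b not in valid:
        raise ValueError("expected DNA bases A, C, G, or T")
    if a == b:
        return "match"
    # Both purines or both pyrimidines: the change stays within one class.
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"
```

A scoring matrix can then assign a milder penalty to pairs classified as transitions.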
In a real scoring matrix why are values scaled so that the highest entry is 100?
makes things easier to calculate
How do amino acids affect protein structure?
hydrophobic residues go inside and hydrophilic residues go outside, which affects shape; not all parts of the protein are important; need to pay attention to H bonding and to acidic, basic, polar, and nonpolar character; will the amino acids be able to play the same role in a chemical sense
If you have two unrelated sequences that are i.i.d., then pN is what, and what is the expected number of matches?
pN=0.25; 1/4 of the positions are matches
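The 1/4 figure follows from summing the squared base frequencies. A minimal Python sketch, assuming uniform base frequencies and a made-up alignment length of 100 positions:

```python
def match_probability(freqs):
    """Chance that two independent draws give the same base: sum of p^2."""
    return sum(p * p for p in freqs.values())

# Assumption: all four nucleotides are equally likely (pN = 0.25).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
p_match = match_probability(uniform)

# Hypothetical alignment length, chosen only for illustration.
alignment_length = 100
expected_matches = p_match * alignment_length
```

With non-uniform base frequencies the same function gives a chance match rate above 1/4.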
What is the null hypothesis for two sequences S1 and S2?
S1 and S2 have no more similarity than expected by chance
What is the alternative hypothesis for two sequences S1 and S2?
S1 and S2 are more similar than expected by chance, i.e. they seem related
What is testing hypotheses equivalent to?
comparing models; allows us to compare two models which describe relationships between two factors
What is the probability for two sequences by chance under H0?
What is the probability for two sequences by chance under H1, the alternative hypothesis?
What is the likelihood ratio and what does its value represent?
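the ratio of the probability of the sequences under the related model (H1) to their probability under the unrelated model (H0); e.g. a value of 5 means that the sequences are 5X more likely to have arisen from our related model than our unrelated model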
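A hedged Python sketch of such a ratio for two aligned DNA sequences is given below. It assumes positions are independent, so the ratio is a product of per-position terms. The probabilities P_BASE, Q_MATCH, and Q_MISMATCH are made-up illustrative values, not numbers from the lecture.

```python
# Assumed unrelated model (H0): each base is drawn independently, p = 0.25.
P_BASE = 0.25
# Assumed related model (H1): joint probability of an aligned pair.
# 4 match pairs * 0.2 + 12 mismatch pairs * (0.2 / 12) = 1.0
Q_MATCH = 0.2
Q_MISMATCH = 0.2 / 12

def likelihood_ratio(s1: str, s2: str) -> float:
    """P(alignment | H1) / P(alignment | H0), one factor per position."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    ratio = 1.0
    for a, b in zip(s1, s2):
        q = Q_MATCH if a == b else Q_MISMATCH
        ratio *= q / (P_BASE * P_BASE)
    return ratio
```

A ratio above 1 favors the related model and a ratio below 1 favors the unrelated model.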
What does it mean if our starting data is symmetric?
-no species are ancestors of others
-substitutions are not all symmetric in their biological rates - dinucleotides are not in equilibrium
How did we use our original scoring scheme?
-add scores corresponding to different alignment positions
In the original scoring scheme what scores were good and bad positions given?
good positions - positive score
bad positions - negative score
What is mu or u in the original scoring scheme?
the relative weight of matches and mismatches
How do you factor the likelihood ratio to emphasize individual positions?
Is any information lost by taking the log of likelihood ratios?
No information is lost by taking the log of likelihood ratios
Why is it better to add logs than multiply ratios of probabilities?
it is easier computationally for computers and humans
What is the log likelihood scoring scheme?
just take the log of each term in the matrix
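As an illustration, the sketch below builds a small log-odds matrix in Python and scores an alignment by summing entries. The probabilities are assumed values chosen only so that the joint probabilities sum to 1, and log base 2 is an arbitrary choice.

```python
import math

BASES = "ACGT"
P_BASE = 0.25          # assumed background frequency of each base
Q_MATCH = 0.2          # assumed joint probability of each matching pair
Q_MISMATCH = 0.2 / 12  # assumed joint probability of each mismatching pair

# Log-odds score for every ordered pair of bases.
log_odds = {
    (a, b): math.log2((Q_MATCH if a == b else Q_MISMATCH) / (P_BASE * P_BASE))
    for a in BASES
    for b in BASES
}

def alignment_score(s1: str, s2: str) -> float:
    """Sum of per-position log-odds scores for two aligned sequences."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(log_odds[(a, b)] for a, b in zip(s1, s2))
```

With these assumed values, matches score above zero and mismatches score below zero, and the summed score equals the log of the product of the per-position ratios.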
What does a score of more than zero mean for the log value?
is a high confidence alignment
Should match scores always be equal?
no they do not have to be
Should there only be one scoring matrix?
no, because it depends on which species we have; the rates and types of changes vary between species over generations due to evolution
Can we make different scoring matrices for different situations?
yes, you can begin with high confidence alignments that correspond to different time periods
What can we use a scoring matrix to get?
pairwise alignment
What can we use a pairwise alignment to get?
a MSA or multiple sequence alignment
What do we use our MSA or multiple sequence alignment as the basis of?
a scoring matrix
What can be said about function inference using sequence similarity?
(1) it works very well, (2) we can have a problem of drift in biases, and (3) if the problem is recognized and persists it may be inherent to genomics