5 - Motifs, profiles and PSSMs Flashcards
How do you write regular expressions for motifs?
x - any AA allowed at that position (a highly variable alignment column)
[] - any one of the two or more AAs observed in that column (e.g. [ST] = Ser or Thr)
{} - any AA except the disallowed AAs listed
(n) - number of times the preceding element is repeated (place after any of the above)
(n,m) - repeat range for the preceding element: n is the minimum, m the maximum
Cap it off at the ends with < (amino terminus) and > (carboxyl terminus)
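For concreteness, here is a minimal Python sketch that converts a PROSITE-style pattern into a standard regular expression. prosite_to_regex is a hypothetical helper, and it assumes simple single-letter patterns with '-'-separated elements:

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a PROSITE-style pattern to a Python regular expression.

    A minimal sketch: handles x, [..], {..}, (n), (n,m), < and >.
    """
    regex = pattern
    regex = regex.replace("<", "^").replace(">", "$")    # N-/C-terminal anchors
    regex = regex.replace("-", "")                       # PROSITE separates elements with '-'
    regex = regex.replace("x", ".")                      # wildcard position
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)     # disallowed residues
    regex = re.sub(r"\((\d+(,\d+)?)\)", r"{\1}", regex)  # repeat counts (n) or (n,m)
    return regex

# Toy example: an N-glycosylation-like pattern
print(prosite_to_regex("N-{P}-[ST]-{P}"))  # -> N[^P][ST][^P]
print(re.search(prosite_to_regex("N-{P}-[ST]-{P}"), "AKNVSAA") is not None)  # True
```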
What is Prosite?
A database of protein domains, families and functional sites.
PROSITE stores patterns (regular expressions) as well as profiles, which are more complex and more sensitive than simple regular expressions.
What are PSSMs?
Position Specific Scoring Matrix (used in PSI-BLAST)
- Builds a profile using log-odds scores
What is the formula for calculating the log-odds score for amino acids?
What does it mean when it is positive? Negative?
w = ln(f/p)
f: frequency of the AA in the alignment column
p: background frequency of the AA in the database (or the whole multiple alignment)
Positive: the AA is more frequent in the alignment column than its background frequency (reward)
Negative: the AA is less frequent in the alignment column than its background frequency (penalty for that residue in the alignment)
Many amino acids are simply not observed in a given alignment column and so would have undefined w values (ln(0/p) = -∞)
What is the solution to this?
Add pseudocounts (fake counts) for unobserved amino acids.
Then, for a given alignment column:
f = (n + b)/(N + B) = corrected frequency of AA
n: observed counts of AA
b: pseudocount
N: total counts of all AA in column
B: total pseudocounts added in that column
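A small Python sketch tying the two formulas together, assuming background-weighted pseudocounts b = β·p (one common choice; other pseudocount schemes exist):

```python
import math
from collections import Counter

def pssm_column_scores(column, background, beta=1.0):
    """Log-odds scores w = ln(f/p) for one alignment column,
    with pseudocounts: f = (n + b) / (N + B)."""
    counts = Counter(column)   # n: observed counts per AA
    N = len(column)            # N: total counts of all AAs in the column
    B = beta                   # B: total pseudocounts in the column
    scores = {}
    for aa, p in background.items():
        b = beta * p                             # pseudocount for this AA
        f = (counts.get(aa, 0) + b) / (N + B)    # corrected frequency
        scores[aa] = math.log(f / p)             # log-odds, now always defined
    return scores

# Toy example: a 10-sequence column that is mostly Leucine.
# Background frequencies are illustrative only (and truncated to 3 AAs for brevity).
background = {"L": 0.10, "A": 0.08, "G": 0.07}
w = pssm_column_scores(list("LLLLLLLLAG"), background)
print({aa: round(s, 2) for aa, s in w.items()})  # L scores strongly positive
```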
REVIEW PROBLEM SET 2!
What is the most easily accessible PSSM-based searching method? List its steps
PSI-BLAST
1) Takes a protein sequence query and performs gapped BLASTP against the database
2) Hits with E < a user-defined threshold (default = 0.001) are used to build a multiple protein alignment
3) The protein alignment is used to make a PSSM of the same length as the query sequence (though with no position-specific gap scores)
4) The profile is compared to the protein database using a slight modification of the gapped BLAST program (seeks local alignments)
5) Assess the statistical significance of the local alignments with the profile (using the same statistics as gapped BLAST); return to step 2 until you hit...
6) Convergence = no new sequences are found with E < threshold. CELEBRATE!
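A schematic sketch of that loop in Python (not the real NCBI implementation; blastp_search, build_alignment and build_pssm are hypothetical helpers, and hits are assumed to carry an .evalue attribute):

```python
def psi_blast(query, database, e_threshold=0.001, max_iters=20):
    hits = blastp_search(query, database)                # 1) initial gapped BLASTP
    included = {h for h in hits if h.evalue < e_threshold}
    for _ in range(max_iters):
        alignment = build_alignment(query, included)     # 2) multiple alignment of hits
        pssm = build_pssm(alignment, length=len(query))  # 3) PSSM, same length as query
        hits = blastp_search(pssm, database)             # 4-5) profile vs. database
        new = {h for h in hits if h.evalue < e_threshold} - included
        if not new:                                      # 6) convergence: no new hits
            break
        included |= new
    return included
```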
Define the following PSSM / Sequence LOGO terms:
- Uncertainty
- Information
Uncertainty: H = -Σ f log(f)
f: frequency of each AA at the site (summed over all AAs)
H: uncertainty, measured in bits (log₂) or nats (ln), depending on the log base
Information: The decrease in uncertainty given by seeing the data
Information = R = H(before) - H(after)
The information content R also gives the height of a column in a sequence LOGO.
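A quick Python illustration of both quantities for a single alignment column (toy data; H(before) assumes all 20 AAs are equally likely a priori):

```python
import math
from collections import Counter

def uncertainty(column, base=2):
    """Shannon uncertainty H = -sum f*log(f) of one alignment column.
    base=2 gives bits; base=math.e gives nats."""
    n = len(column)
    return -sum((c / n) * math.log(c / n, base) for c in Counter(column).values())

# Information = decrease in uncertainty after seeing the data.
# Before: all 20 AAs equally likely -> H(before) = log2(20) ≈ 4.32 bits.
h_before = math.log2(20)
h_after = uncertainty(list("LLLLLLLLAG"))  # a mostly-Leucine toy column
print(round(h_before - h_after, 2))        # R = logo column height, in bits
```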
Define Shannon entropy
Named after Boltzmann's H-theorem, Shannon defined the entropy H (Greek letter Eta) of a discrete random variable X with possible values {x1, ..., xn} and probability mass function P(X) as:
H(X) = -Σ P(xi) ln P(xi)
This quantity should be understood as the amount of randomness (uncertainty) in the random variable X.
What are Markov models? How do they differ from Hidden Markov Models?
A stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.
E.g. Monopoly, Snakes and Ladders, and dice games, but not card games (where the cards already played change the odds)!
HMMs: include states AND SYMBOLS. There are transition probabilities (between states) as well as emission probabilities (the probability of seeing each symbol in a given state).
You see the symbol but the states are hidden.
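A toy two-state HMM in Python to make the states/symbols distinction concrete (all probabilities made up for illustration):

```python
# Two hidden states emitting observable symbols "A" and "G".
states = ["Match", "Insert"]
transition = {                  # P(next state | current state)
    "Match":  {"Match": 0.9, "Insert": 0.1},
    "Insert": {"Match": 0.5, "Insert": 0.5},
}
emission = {                    # P(symbol | state)
    "Match":  {"A": 0.7, "G": 0.3},
    "Insert": {"A": 0.2, "G": 0.8},
}

# Probability that the hidden path Match -> Match emits "AG",
# starting in Match with probability 1:
p = emission["Match"]["A"] * transition["Match"]["Match"] * emission["Match"]["G"]
print(p)  # 0.7 * 0.9 * 0.3 = 0.189
```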
Why bother with HMMs?
- PSSMs do not deal easily with insertions and deletions
- Multiple alignments are built in an ad hoc manner by optimizing the gap-opening and gap-extension penalties
- HMMs are a natural framework where insertions/deletions are dealt with explicitly in a probabilistic model
How do you train an HMM?
- Give it lots of examples of the data
- From the frequencies of various symbols (eg. bases or amino acids, insertions or deletions) it can build the emission and transition probabilities that are augmented by prior information (eg. PAM matrices)
- Both emission and transition probabilities are conditional
Once it has seen enough data, it can optimize itself by adjusting these emission and transition probabilities to maximize the probability of generating the data it was trained on.
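A minimal sketch of supervised training by counting, assuming labelled (state, symbol) example paths and a flat pseudocount standing in for prior information:

```python
from collections import Counter, defaultdict

def train_hmm(labelled_paths, pseudocount=1.0):
    """Estimate emission/transition probabilities from labelled examples.

    labelled_paths: list of [(state, symbol), ...] sequences.
    """
    t_counts = defaultdict(Counter)  # state -> next-state counts
    e_counts = defaultdict(Counter)  # state -> emitted-symbol counts
    for path in labelled_paths:
        for (s, _), (s_next, _) in zip(path, path[1:]):
            t_counts[s][s_next] += 1
        for s, sym in path:
            e_counts[s][sym] += 1
    def normalise(counts):
        # Pseudocounts added only over observed outcomes, for brevity.
        return {s: {k: (v + pseudocount) / (sum(c.values()) + pseudocount * len(c))
                    for k, v in c.items()}
                for s, c in counts.items()}
    return normalise(t_counts), normalise(e_counts)

# Toy usage: one labelled path of (state, symbol) pairs
t, e = train_hmm([[("Match", "A"), ("Match", "G"), ("Insert", "A")]])
print(t["Match"])  # transition probabilities out of Match
```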
What is the law of total probability and how does it relate to HMMs?
The probability of an event occurring that can only be caused by n different (mutually exclusive) circumstances C is:
P(E) = P(E,C1) + P(E,C2) + P(E,C3) + ... + P(E,Cn)
therefore:
P(E) = Σn P(E,Cn)
In sequence alignment, this is relevant when there are multiple paths through which the model could have generated the sequence: you sum P(E,Cn) over all of them. When only a single path is possible, you just use the multiplication theorem (multiply the probabilities along that path).
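A short sketch that sums the joint probability over every hidden path (the law of total probability), with the multiplication theorem applied within each path. It reuses the toy two-state HMM from above, with a made-up start distribution:

```python
from itertools import product

states = ["Match", "Insert"]
transition = {"Match": {"Match": 0.9, "Insert": 0.1},
              "Insert": {"Match": 0.5, "Insert": 0.5}}
emission = {"Match": {"A": 0.7, "G": 0.3},
            "Insert": {"A": 0.2, "G": 0.8}}
start = {"Match": 1.0, "Insert": 0.0}  # assume we always start in Match

def path_prob(path, symbols):
    """Multiplication theorem: joint probability P(symbols, one specific path)."""
    p = start[path[0]] * emission[path[0]][symbols[0]]
    for prev, cur, sym in zip(path, path[1:], symbols[1:]):
        p *= transition[prev][cur] * emission[cur][sym]
    return p

# Total probability: sum the joint over every possible hidden path Cn.
symbols = "AG"
print(sum(path_prob(c, symbols) for c in product(states, repeat=len(symbols))))  # 0.245
```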