10. HMMs Flashcards
What is a sequence logo?
A sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).
A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and the diversity of the sequences.
What is meant by statistical modelling of an MSA?
Give some examples of the different types of datasets we may encounter.
Advantages?
For further analysis, we need to describe position-specific information (conservation) statistically –> statistically model the alignment.
Types of datasets:
- single domain proteins
- multiple domain proteins, all with same architecture
- multiple domain proteins with different architectures –> use shared/homologous domain only!
What are the applications of statistical modelling of MSAs?
Application:
- find additional family members: add/score additional sequences
Alignment of gene/protein families:
position-specific information is important - why?
What do we need for further analysis?
How can different position specific information be visualised?
position-specific information is important
* different levels of conservation in different regions
* different functional/structural constraints
* different selective pressures in different regions
We need a statistical model - to describe statistically for further analysis
(e.g., to add/score additional sequences)
sequence logo
What are the advantages of statistically modelling MSAs, eg with a profile HMM?
- probabilistic interpretation of results
- creating accurate sequence alignment
- alignment of additional family members to a profile HMM:
  - fast (faster than eg T-Coffee)
  - gives MSAs of good quality
- Profile HMMs turn an MSA into a position-specific scoring system suitable for searching databases for (even remotely) homologous sequences
- Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis
What would be the sequence profile for this MSA:
seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT
and what could we use the sequence profile for?
col A C T G
1 0.8 0.2 0.0 0.0
2 0.0 0.6 0.0 0.4
3 0.0 0.0 0.0 1.0
4 0.0 0.6 0.2 0.2
5 0.2 0.0 0.8 0.0
use the profile to
- describe the alignment
- score / align another sequence to the MSA
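A minimal Python sketch of how the profile above can be derived from the MSA, assuming an ungapped alignment and plain relative frequencies (no pseudocounts). The function name `sequence_profile` is hypothetical, chosen only for illustration.

```python
from collections import Counter

def sequence_profile(msa, alphabet="ACTG"):
    """Return one dict of relative residue frequencies per alignment column."""
    n_seqs = len(msa)
    profile = []
    for column in zip(*msa):  # zip(*msa) yields the alignment column by column
        counts = Counter(column)
        profile.append({nt: counts[nt] / n_seqs for nt in alphabet})
    return profile

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
profile = sequence_profile(msa)

# Print in the same layout as the table: col A C T G
for pos, freqs in enumerate(profile, start=1):
    print(pos, " ".join(f"{freqs[nt]:.1f}" for nt in "ACTG"))
```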
What is a PSSM?
Position-specific scoring matrix (PSSM)
- for each nt/aa, compute a score for each MSA column
- score for aligning “D”, “M”, …? to a specific column
(related to scoring matrix for alignment eg BLOSUM)
Calculate a PSSM for this MSA
seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT
with this scoring matrix
scoring matrix
A C G T
A 1
C -1 1
G -1 -1 1
T -1 -1 -1 1
(is this a realistic matrix to use?)
realistically would use a substitution matrix eg BLOSUM
here basic matrix for simplicity
PSSM:
rows = MSA columns (pos)
columns = nucleotides (A, C, G, T)
1st row, 1st column (score for A at position 1; the column contains 4 A and 1 C):
(4/5 × 1) + (1/5 × -1) = 0.6
pos A C G T
1 0.6 -0.6 -1 -1
2 -1 0.2 -0.2 -1
3 -1 -1 1 -1
4 -1 0.2 -0.6 -0.6
5 -0.6 -1 -1 0.6
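The PSSM entries can be reproduced with a short Python sketch. It assumes, as in the worked entry, that each score is the average substitution score of a nucleotide against all residues in the column, using the simple +1/-1 matrix. The names `pssm` and `match_mismatch` are hypothetical.

```python
def match_mismatch(a, b):
    """Toy scoring matrix from the card: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1

def pssm(msa, score, alphabet="ACGT"):
    """For each column, average the score of each letter against the column's residues."""
    n_seqs = len(msa)
    rows = []
    for column in zip(*msa):
        rows.append({x: sum(score(x, y) for y in column) / n_seqs for x in alphabet})
    return rows

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
matrix = pssm(msa, match_mismatch)

for pos, row in enumerate(matrix, start=1):
    print(pos, " ".join(f"{row[nt]:+.1f}" for nt in "ACGT"))
```

Swapping `match_mismatch` for a lookup into a real substitution matrix (eg BLOSUM for proteins) would follow the same pattern.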
What is PSI-BLAST?
PSI-BLAST = Position-Specific Iterated BLAST (Basic Local Alignment Search Tool)
PSI-BLAST derives a position-specific scoring matrix (PSSM) or profile from the multiple sequence alignment of sequences detected above a given score threshold using protein–protein BLAST.
(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program [10].
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm [10,12] can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale [13], and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments [14] remain applicable to profile alignments [10].
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
How can a sequence profile be used as a generative model?
eg
col A C T G
1 0.8 0.2 0.0 0.0
2 0.0 0.6 0.0 0.4
3 0.0 0.0 0.0 1.0
4 0.0 0.6 0.2 0.2
5 0.2 0.0 0.8 0.0
what is the probability of
- ACGCT
- CGGTA
- TCGCT
to find new members of the gene family
by calculating probability of a sequence being generated from this profile
profile as a generative model
- ACGCT 0.8×0.6×1.0×0.6×0.8 = 0.2304
- CGGTA 0.2×0.4×1.0×0.2×0.2 = 0.0032
- TCGCT 0.0×0.6×1.0×0.6×0.8 = 0
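The same calculation as a small Python sketch, assuming columns are independent so that the sequence probability is the product of per-column frequencies. `PROFILE` copies the table above. The function name `sequence_probability` is hypothetical.

```python
import math

# Profile from the card, one dict per alignment column
PROFILE = [
    {"A": 0.8, "C": 0.2, "T": 0.0, "G": 0.0},
    {"A": 0.0, "C": 0.6, "T": 0.0, "G": 0.4},
    {"A": 0.0, "C": 0.0, "T": 0.0, "G": 1.0},
    {"A": 0.0, "C": 0.6, "T": 0.2, "G": 0.2},
    {"A": 0.2, "C": 0.0, "T": 0.8, "G": 0.0},
]

def sequence_probability(seq, profile):
    """Probability that the profile generates seq (same length, no gaps)."""
    if len(seq) != len(profile):
        raise ValueError("a plain profile only models sequences of its own length")
    return math.prod(column[residue] for column, residue in zip(profile, seq))

for seq in ("ACGCT", "CGGTA", "TCGCT"):
    print(seq, sequence_probability(seq, PROFILE))
```

A single zero frequency (T in column 1) forces the whole product to zero, however well the rest of the sequence fits.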
What can we do if we create a generative model, but the true diversity of a family is not yet captured?
–> pseudocounts
use pseudocounts: adjust the frequency of (rare) amino acid occurrences in a given alignment column
* Laplace’s rule
* background frequency pseudocounts
* pseudocounts based on substitution matrices
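A minimal sketch of Laplace's rule in Python, assuming the usual form in which one pseudocount is added to every residue count before normalising. It reuses the nucleotide MSA from the earlier cards. The function name `laplace_profile` and the `pseudocount` parameter are hypothetical.

```python
from collections import Counter

def laplace_profile(msa, alphabet="ACGT", pseudocount=1):
    """Column frequencies with a constant pseudocount added to every residue count."""
    denominator = len(msa) + pseudocount * len(alphabet)
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({x: (counts[x] + pseudocount) / denominator for x in alphabet})
    return profile

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
smoothed = laplace_profile(msa)

# Column 1: A = (4+1)/(5+4), C = (1+1)/(5+4), G = T = (0+1)/(5+4)
print(smoothed[0])
```

With this smoothing, T at position 1 gets frequency 1/9 instead of 0, so a sequence such as TCGCT no longer has probability zero.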
a sequence profile can’t be used to model gaps, deletions etc.
how can we make a more general model which can also deal with gaps and deletions?
more general model: sequences with deletions (gaps)
* add deletion edges
* adjust transition probabilities so edges sum to 1 (per node)
can generate / model shorter sequences
- where deletion edges are present! (large/complex model!)
cannot generate / model longer sequences
(not suited to model sequences of different lengths –> need more general model –> profile HMM)
We have a generative model from a sequence profile that can also deal with gaps and deletions, but is not suited to longer sequences.
How can we improve on this and create a more general model?
use profile HMMs
model sequence length heterogeneity
(without adding a huge number of transitions)
- deletions: nodes (states)
- longer deletions: connect deletion nodes
- insertions: nodes (states)
- longer insertions: self-loops on insertion nodes
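The topology can be sketched in Python by listing the allowed transitions. This assumes the common textbook layout with begin (B) and end (E) states, in which every state may move to the next match state, the next delete state, or the current insert state. Some implementations omit the insert-to-delete and delete-to-insert edges. The function name `profile_hmm_transitions` is hypothetical.

```python
def profile_hmm_transitions(length):
    """List allowed (source, target) transitions for `length` match states."""
    edges = []
    for k in range(length + 1):
        # Position 0 is the begin state B; it has an insert state I0 but no delete state.
        sources = ["B" if k == 0 else f"M{k}", f"I{k}"]
        if k > 0:
            sources.append(f"D{k}")
        for src in sources:
            edges.append((src, f"I{k}"))  # insertion; (Ik, Ik) is the self-loop
            if k < length:
                edges.append((src, f"M{k + 1}"))  # continue with the next match state
                edges.append((src, f"D{k + 1}"))  # skip the next column; (Dk, Dk+1) chains deletions
            else:
                edges.append((src, "E"))  # after the last column, go to the end state
    return edges

edges = profile_hmm_transitions(5)
print(len(edges))  # grows linearly with the number of match states (9 * length + 3)
```

Under these assumptions the number of transitions grows linearly with model length, whereas adding a direct deletion edge for every possible skip would grow quadratically.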
What are the elements of an HMM?
What are the most successful applications in bioinformatics?
Elements
- states: hidden
- transitions
- transition probabilities
- outcomes / emissions
- emission probabilities
most successful applications of HMMs in bioinformatics
- gene prediction
- analysis of gene/protein families
many applications beyond bioinformatics!
How can an HMM be used to predict gene structure?
We:
1. train (generate) the model with suitable data (known protein-coding genes, as many as possible, long, short, etc.). Training changes the probabilities but not the topology
2. given a DNA sequence,
- where does it most likely contain gene features?
- what probability is associated with this result?
–> use HMM after genome assembly to predict protein-coding genes for structural genome annotation
We use an HMM to predict gene structure:
what is hidden/wanted? what is observed?
observed: DNA sequence
hidden/wanted: exons, introns, UTRs etc.
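To illustrate how the hidden labels can be recovered from the observed DNA, here is a toy Python sketch of the Viterbi algorithm, a standard way to find the most probable state path. The two-state model ("exon", "intron") and every probability in it are made-up values for illustration only. Real gene finders use many more states and trained parameters.

```python
import math

def viterbi(observed, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its log-probability."""
    # scores[i][s]: best log-probability of any path that ends in state s at position i
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observed[0]]) for s in states}]
    back = []
    for symbol in observed[1:]:
        prev, col, ptr = scores[-1], {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            col[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][symbol])
            ptr[s] = best
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, scores[-1][last]

# Hypothetical toy parameters (all non-zero, so the logarithms are defined)
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.1, "intron": 0.9}}
EMIT = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path, log_prob = viterbi("GCGCGCATATAT", STATES, START, TRANS, EMIT)
print(path, log_prob)
```

The returned path is the answer to "where does it most likely contain gene features?", and the log-probability is the value associated with that result.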
How can HMMs be used for alignment?
Suppose we have an MSA of a protein family.
- We can generate and train a profile HMM.
- We can then compare protein sequences eg from a database to the profile HMM
given a protein sequence,
* where does it most likely contain amino acids corresponding to (conserved) alignment columns?
* what probability is associated with this result?
–> find sequences that match the profile HMM
We use an HMM for alignment.
what is hidden/wanted? what is observed?
–> profile HMM
observed: protein (or DNA) sequence
hidden / wanted: MSA match, insert, deletion