10. HMMs Flashcards by Stevie Davies

What is a sequence logo?

A sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).

A sequence logo is created from a collection of aligned sequences and depicts the consensus sequence and diversity of the sequences
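The letter stack heights in a logo are usually derived from the information content of each alignment column. A minimal sketch, assuming a DNA alphabet and ignoring the small-sample correction that logo tools normally apply (the function name is illustrative):

```python
import math

def column_information(column, alphabet="ACGT"):
    """Information content in bits of one alignment column (no small-sample correction)."""
    n = len(column)
    entropy = 0.0
    for symbol in alphabet:
        p = column.count(symbol) / n
        if p > 0:
            entropy -= p * math.log2(p)
    # Maximum possible entropy minus observed entropy
    return math.log2(len(alphabet)) - entropy

conserved = column_information("GGGGG")  # fully conserved column
mixed = column_information("ACGT")       # every nucleotide equally frequent
print(conserved, mixed)
```

A fully conserved DNA column reaches the maximum of 2 bits; a column with all four nucleotides at equal frequency carries 0 bits.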

What is meant by statistical modelling of an MSA?

Give some examples of the different types of datasets we may encounter.

Advantages?

For further analysis, we need to describe position-specific information (conservation) statistically –> statistically model the alignment.

Types of datasets:
- single domain proteins
- multiple domain proteins, all with same architecture
- multiple domain proteins with different architectures –> use shared/homologous domain only!

What are the applications of statistical modelling of MSAs?

Application:
- find additional family members: add/score additional sequences

Alignment of gene/protein families:

position-specific information is important - why?

What do we need for further analysis?

How can different position specific information be visualised?

position-specific information is important
* different levels of conservation in different regions
* different functional/structural constraints
* different selective pressures in different regions

We need a statistical model - to describe statistically for further analysis
(e.g., to add/score additional sequences)

sequence logo

What are the advantages of statistically modelling MSAs, eg with a profile HMM?

probabilistic interpretation of results
creating accurate sequence alignment
alignment of additional family members to a profile HMM:
- fast (faster than eg Tcoffee)
- gives MSAs of good quality

-Profile HMMs turn MSA into position-specific scoring system suitable for searching databases for (even remotely) homologous sequences

Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis

What would be the sequence profile for this MSA:

seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT

and what could we use the sequence profile for?

col A C T G
1 0.8 0.2 0.0 0.0
2 0.0 0.6 0.0 0.4
3 0.0 0.0 0.0 1.0
4 0.0 0.6 0.2 0.2
5 0.2 0.0 0.8 0.0

use the profile to
- describe the alignment
- score / align another sequence to the MSA
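The profile above is just the per-column relative frequency of each nucleotide. A minimal sketch of that calculation for the five example sequences (function and variable names are illustrative):

```python
from collections import Counter

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]

def sequence_profile(alignment, alphabet="ACGT"):
    """Relative frequency of each nucleotide in each alignment column."""
    profile = []
    for column in zip(*alignment):  # zip(*...) iterates over columns
        counts = Counter(column)
        profile.append({nt: counts[nt] / len(column) for nt in alphabet})
    return profile

profile = sequence_profile(msa)
for pos, freqs in enumerate(profile, start=1):
    print(pos, freqs)
```

Note that this sketch stores the columns in A, C, G, T order, while the table on the card lists them as A, C, T, G.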

What is a PSSM?

Position-specific scoring matrix (PSSM)

for each nt/aa, compute a score for each MSA column
score for aligning “D”, “M”, …? to a specific column

(related to scoring matrix for alignment eg BLOSUM)

Calculate a PSSM for this MSA

seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT

with this scoring matrix

scoring matrix
A C G T
A 1
C -1 1
G -1 -1 1
T -1 -1 -1 1

(is this a realistic matrix to use?)

realistically would use a substitution matrix eg BLOSUM
here basic matrix for simplicity

PSSM:
rows = number of MSA columns (pos)
columns = Nts

1st row 1st column:
(4/5 * 1) + (1/5 * -1) = 0.6

pos A C G T
1 0.6 -0.6 -1 -1
2 -1 0.2 -0.2 -1
3 -1 -1 1 -1
4 -1 0.2 -0.6 -0.6
5 -0.6 -1 -1 0.6
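The same calculation can be sketched in code: each PSSM entry is the average score of one nucleotide against all residues observed in that column, using the simple +1/-1 matrix from the card. The function names and the sum-of-columns scoring of a new sequence are illustrative assumptions:

```python
msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]

def match_score(a, b):
    """Toy scoring matrix from the card: +1 for identity, -1 otherwise."""
    return 1 if a == b else -1

def build_pssm(alignment, alphabet="ACGT"):
    """One row per MSA column, one entry per nucleotide."""
    n = len(alignment)
    pssm = []
    for column in zip(*alignment):
        row = {nt: sum(match_score(nt, observed) for observed in column) / n
               for nt in alphabet}
        pssm.append(row)
    return pssm

def score_sequence(seq, pssm):
    """Score an ungapped sequence by summing its per-column PSSM entries."""
    return sum(row[nt] for nt, row in zip(seq, pssm))

pssm = build_pssm(msa)
for pos, row in enumerate(pssm, start=1):
    print(pos, [row[nt] for nt in "ACGT"])
print(score_sequence("ACGCT", pssm))
```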

What is PSI-BLAST?

PSI (position-specific iterated) -BLAST

PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) derives a position-specific scoring matrix (PSSM) or profile from the multiple sequence alignment of sequences detected above a given score threshold using protein–protein BLAST.

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program [10].
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm [10,12] can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale [13], and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments [14] remain applicable to profile alignments [10].
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.

How can a sequence profile be used as a generative model?

eg
col A C T G
1 0.8 0.2 0.0 0.0
2 0.0 0.6 0.0 0.4
3 0.0 0.0 0.0 1.0
4 0.0 0.6 0.2 0.2
5 0.2 0.0 0.8 0.0

what is prob of
- ACGCT
- CGGTA
- TCGCT

to find new members of the gene family
by calculating probability of a sequence being generated from this profile

profile as a generative model
- ACGCT 0.8×0.6×1.0×0.6×0.8 = 0.2304
- CGGTA 0.2×0.4×1.0×0.2×0.2 = 0.0032
- TCGCT 0.0×0.6×1.0×0.6×0.8 = 0
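The three products above can be reproduced with a short sketch that multiplies the per-column probabilities of an ungapped sequence (names are illustrative; the profile is the one from the card):

```python
import math

profile = [
    {"A": 0.8, "C": 0.2, "T": 0.0, "G": 0.0},
    {"A": 0.0, "C": 0.6, "T": 0.0, "G": 0.4},
    {"A": 0.0, "C": 0.0, "T": 0.0, "G": 1.0},
    {"A": 0.0, "C": 0.6, "T": 0.2, "G": 0.2},
    {"A": 0.2, "C": 0.0, "T": 0.8, "G": 0.0},
]

def sequence_probability(seq, profile):
    """Probability that the profile generates seq (columns treated as independent)."""
    if len(seq) != len(profile):
        raise ValueError("a plain profile can only generate sequences of its own length")
    return math.prod(column[nt] for nt, column in zip(seq, profile))

for seq in ("ACGCT", "CGGTA", "TCGCT"):
    print(seq, sequence_probability(seq, profile))
```

The length check makes the limitation explicit: without extra states, the profile cannot score shorter or longer sequences.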

What can we do if we create a generative model, but the true diversity of a family is not yet captured?

–> pseudocounts

use pseudocounts: adjust the frequency of (rare) amino acid occurrences in a given alignment column
* Laplace’s rule
* background frequency pseudocounts
* pseudocounts based on substitution matrices
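As an illustration of the first option, Laplace's rule adds one count per symbol to every column, so no residue ever gets probability 0. A minimal sketch on the nucleotide MSA used in the earlier cards (function names are illustrative):

```python
import math
from collections import Counter

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]

def laplace_profile(alignment, alphabet="ACGT"):
    """Column frequencies with Laplace's rule: (count + 1) / (N + alphabet size)."""
    n = len(alignment)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({nt: (counts[nt] + 1) / (n + len(alphabet)) for nt in alphabet})
    return profile

def sequence_probability(seq, profile):
    """Probability of an ungapped sequence under the profile."""
    return math.prod(column[nt] for nt, column in zip(seq, profile))

laplace = laplace_profile(msa)
print(laplace[0])                               # T in column 1 is no longer 0
print(sequence_probability("TCGCT", laplace))   # small, but greater than 0
```

With pseudocounts, TCGCT is unlikely rather than impossible, which matters when the training alignment has not yet captured the true diversity of the family.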

a sequence profile can’t be used to model gaps, deletions etc.

how can we make a more general model which can also deal with gaps and deletions?

more general model: sequences with deletions (gaps)
* add deletion edges
* adjust transition probabilities so edges sum to 1 (per node)

can generate / model shorter sequences
- where deletion edges are present! (large/complex model!)

cannot generate / model longer sequences

(not suited to model sequences of different lengths –> need more general model –> profile HMM)

We have a generative model from a sequence profile that can also deal with gaps and deletions, but is not suited to longer sequences.

How can we improve on this and create a more general model?

use profile HMMs

model sequence length heterogeneity
(without adding a huge number of transitions)
- deletions: nodes (states)
- longer deletions: connect deletion nodes
- insertions: nodes (states)
- longer insertions: self-loops on insertion nodes

What are the elements of an HMM?

What are the most successful applications in bioinformatics?

Elements
- states: hidden
- transitions
- transition probabilities
- outcomes / emissions
- emission probabilities

most successful applications of HMMs in bioinformatics
- gene prediction
- analysis of gene/protein families
many applications beyond bioinformatics!

How can an HMM be used to predict gene structure?

We:
1. train (generate) the model with suitable data (known protein-coding genes, as many as possible, long, short, etc.). Training changes the probabilities, but not the topology
2. given a DNA sequence,
- where does it most likely contain gene features?
- what probability is associated with this result?

–> use HMM after genome assembly to predict protein-coding genes for structural genome annotation

We use an HMM to predict gene structure:

what is hidden/wanted? what is observed?

observed: DNA sequence

hidden/wanted: exons, introns, UTRs etc.

How can HMMs be used for alignment?

Consider we have a MSA of a protein family.

We can generate and train a profile HMM.
We can then compare protein sequences eg from a database to the profile HMM

given a protein sequence,
* where does it most likely contain amino acids corresponding to (conserved) alignment columns?
* what probability is associated with this result?

–> find sequences that match the profile HMM

We use an HMM for alignment.

what is hidden/wanted? what is observed?

–> profile HMM

observed: protein (or DNA) sequence

hidden / wanted: MSA match, insert, deletion

Define profile HMM.

What do they capture?

Statistical description of conserved MSA regions

captures position-specific
- scores for residues
- gap penalties

What is the training of a profile HMM based on?

example algorithm?

requires (good, sufficient!) training data

training, based on a “seed” alignment (Baum-Welch Algorithm)
- weighting training sequences
- estimating transition probabilities
- estimating emission probabilities

–> changes probabilities, but not topology

In which two ways can a profile HMM be used to analyse sequences? give example algorithms for both

scoring (Forward algorithm)
- database searching: does a sequence belong to a family?

detecting (Viterbi algorithm, DP)
- aligning a sequence to a profile HMM
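The difference between the two algorithms can be sketched on a toy two-state HMM rather than a full profile HMM. All states and probabilities below are made up for illustration: Forward sums over every hidden path (a score for the whole sequence), while Viterbi keeps only the single most probable path (the decoding or alignment).

```python
# Toy HMM; every parameter here is an illustrative assumption, not from the lecture.
states = ("exon", "intron")
start = {"exon": 0.5, "intron": 0.5}
trans = {"exon": {"exon": 0.9, "intron": 0.1},
         "intron": {"exon": 0.2, "intron": 0.8}}
emit = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(obs, path):
    """Probability of one specific hidden path together with the observations."""
    p = start[path[0]] * emit[path[0]][obs[0]]
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][x]
    return p

def forward(obs):
    """Scoring: total probability of obs, summed over all hidden paths."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit[s][x] * sum(f[r] * trans[r][s] for r in states) for s in states}
    return sum(f.values())

def viterbi(obs):
    """Decoding: (probability, path) of the single most probable hidden path."""
    v = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for x in obs[1:]:
        new_v = {}
        for s in states:
            prob, path = max(((v[r][0] * trans[r][s], v[r][1]) for r in states),
                             key=lambda t: t[0])
            new_v[s] = (prob * emit[s][x], path + [s])
        v = new_v
    return max(v.values(), key=lambda t: t[0])

obs = "GCATT"
print(forward(obs))
print(viterbi(obs))
```

Practical implementations work with log probabilities to avoid numerical underflow on long sequences; this sketch multiplies raw probabilities for readability.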

Profile HMM for sequence alignment:

What are the different states?

What do arrows represent?

states:
* M: match state
* I: insert state
* D: delete state
* B: beginning
* E: end

arrows: transition

Profile HMM for sequence alignment:

What are the different types of emissions ?

Match:
emits a single residue according to the frequency of amino acids in the alignment column (probability for all AAs/Nts adds up to 1)

Insertion:
emits one or more residues according to the background frequency of amino acids (probability for all AAs/Nts adds up to 1)

Deletion:
emits no amino acids but a gap

Profile HMM for sequence alignment:

How long is the model?

the model is as long as the length of conserved positions!

Would not include a region from an MSA that constitutes an insertion in one sequence but not present in most other sequences.
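One way to decide which MSA columns become match states is a gap-fraction cutoff. A minimal sketch, assuming a threshold of 0.5 (the threshold, function name, and example alignment are illustrative assumptions, not taken from the lecture):

```python
def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as conserved (match) positions."""
    n = len(alignment)
    keep = []
    for index, column in enumerate(zip(*alignment)):
        gap_fraction = sum(ch in "-." for ch in column) / n
        if gap_fraction < max_gap_fraction:
            keep.append(index)
    return keep

msa = ["AC-GT",
       "AC-GT",
       "ACTGT",   # only this sequence has a residue in the third column
       "A--GT"]

columns = match_columns(msa)
print(columns, "-> model length", len(columns))
```

Here the third column is mostly gaps, so it is modelled as an insertion relative to the profile and the model has four match positions.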

What is HMMer? Pfam?

HMMer: an implementation of profile HMMs for biological sequence analysis

HMMer is the basis for the Pfam database
- 1997: protein databases increasingly large
- newly added proteins are new members of existing protein (domain) families
- classification & management of protein sequences: organize into (domain) families
- 20,000 families, integrated into InterPro

HMMer architecture?

- M, D, I: match, deletion, insertion
- B, E: dummy begin/end states of the model
- N-, C-terminal states, unaligned
- S, T: start/termination states
- J: joining unaligned sequence states

HMMer architecture? global vs local alignment?

- global/_local_ alignment with respect to the sequence
- _global_/local alignment with respect to the model
- single/_multiple_ hit alignments

global: if the probability of N going back on itself is 0 --> forces the sequence to match at the beginning
If the B transition is 1: forces global alignment

How are insertions in a sequence (compared to the profile HMM) denoted?

gap symbol: .
at least one sequence has an insertion, compared to the profile HMM

How are deletions in a sequence (compared to the profile HMM) denoted?

gap symbol: -
the sequence is missing a column that the HMM expects to be there

Name 5 HMMer programs

- hmmbuild: build a profile HMM from an alignment
- hmmscan: search one or more sequences against an HMM database
- hmmsearch: search a sequence database with a profile HMM
- hmmalign: align sequences to an HMM profile
- jackhmmer: iteratively search sequence(s) against a protein database

What does the HMMer program hmmscan do?

hmmscan: search one or more sequences against an HMM database

What does the HMMer program hmmsearch do?

hmmsearch: search a sequence database with a profile HMM

What does the HMMer program hmmalign do?

hmmalign: align sequences to an HMM profile

What does the HMMer program jackhmmer do?

jackhmmer: iteratively search sequence(s) against a protein database
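The five programs differ mainly in what they take as first and second argument. A minimal sketch that only assembles the basic command lines (it does not run HMMER; all file names are hypothetical placeholders, and options are omitted):

```python
import shlex

def hmmer_commands(seed_alignment="seed.sto", hmm_file="family.hmm",
                   seq_db="proteins.fasta", new_seqs="candidates.fasta",
                   hmm_db="profiles.hmm", query="query.fasta"):
    """Basic argument order of five HMMER programs (file names are placeholders)."""
    return {
        # profile HMM out, alignment in
        "hmmbuild": ["hmmbuild", hmm_file, seed_alignment],
        # profile HMM against a sequence database
        "hmmsearch": ["hmmsearch", hmm_file, seq_db],
        # align sequences to an existing profile HMM
        "hmmalign": ["hmmalign", hmm_file, new_seqs],
        # sequence(s) against an HMM database (the database must be prepared with hmmpress first)
        "hmmscan": ["hmmscan", hmm_db, query],
        # iterative search of sequence(s) against a sequence database
        "jackhmmer": ["jackhmmer", query, seq_db],
    }

commands = hmmer_commands()
for argv in commands.values():
    print(shlex.join(argv))
```

Each list could be passed to `subprocess.run` on a machine where HMMER is installed; that step is left out here so the sketch stays self-contained.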

What is the output of hmmsearch like?

BLAST-like output & E-values

What are the disadvantages of HMMs for sequence analysis?

disadvantages
* requires (good, sufficient!) training data
* cannot model dependencies (RNA!)

Same as all alignment methods: each column is considered independent of the others. But there are dependencies because of protein structure. There are some methods that do model them, but they are not part of this lecture.

HMM vs Markov Chain?

Markov Model or Markov Chain? A Markov chain is the simplest type of Markov model: all states are observable and probabilities converge over time. Hidden Markov Models are similar to Markov chains, but they have hidden states.

Name the three different types of scores (model parameters) represented in a profile HMM

Match emission, insert emission, and state transition scores.

What is profile HMM?

Profile analysis is a tool used for aligning multiple sequences and identifying known sequence domains in unannotated sequences. It uses position-specific scoring. Profile HMMs estimate the frequencies of residues at given positions from observations in training data.

What is transition probability and emission probability?

Transition probability is the probability of transitioning from one state to another. If there are no gaps (insertions or deletions), the transition probability from one match state to the next match state is 1. An emission probability is associated with each match state: it is the probability of a specific residue being in a given position.

What are the advantages of using profile HMMs compared to other similar methods?

Compared to other database searching methods like BLAST that use position-independent scoring, using a profile is more sensitive. Searching sequences in a database is more computationally expensive than matching a profile. It requires less training data compared to non-probabilistic models.

Name the two most successful applications of HMMs in bioinformatics that were discussed in the lecture and list three differences regarding their (profile) HMM.

Gene Prediction
- Training data to create profile HMM: DNA sequences of known protein coding genes
- Hidden States: Genomic features such as internal exons, promoter, intergenic region, terminal exons, …
- Application Focus: Structural genome annotation

Analysis of Gene/Protein Families
- Training data to create profile HMM: Multiple sequence alignment of a core set of family members
- Hidden States: Match, Insertion, Deletion
- Application Focus: evolutionary analysis

EXAM QUESTION Define (p?)HMM, covariance model, 1 similarity and 1 difference (2022)

EXAM QUESTION what are hidden states, states, transition and emission probabilities of HMM for prokaryotic sequences (2022)

depends on whether for gene prediction or for gene family prediction?

Define HMM

A Hidden Markov Model (HMM) is a statistical model that represents a system containing hidden states where the system evolves over time. It is "hidden" because the state of the system is not directly visible to the observer; instead, the observer can only see some output that depends on the state. Markov models are characterized by the Markov property, which states that the future state of a process only depends on its current state, not on the sequence of events that preceded it.