10. HMMs Flashcards

1
Q

What is a sequence logo?

A

A sequence logo is a graphical representation of the sequence conservation of nucleotides (in a strand of DNA/RNA) or amino acids (in protein sequences).

A sequence logo is created from a collection of aligned sequences and depicts both the consensus sequence and the diversity of the sequences.
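
The height of a logo column is commonly drawn as its information content in bits. A minimal sketch of that calculation, assuming a DNA alphabet and ignoring the small-sample correction that logo tools usually apply (the toy alignment and the name `column_information` are illustrative assumptions):

```python
import math
from collections import Counter

def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column: log2(N) - entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - entropy

# Toy alignment (assumption, chosen only for illustration).
msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
for pos, column in enumerate(zip(*msa), start=1):
    print(pos, round(column_information(column), 3))
```

A fully conserved column reaches the maximum of 2 bits for DNA; a column with all four nucleotides at equal frequency has 0 bits.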

2
Q

What is meant by statistical modelling of an MSA?

Give some examples of the different types of datasets we may encounter.

Advantages?

A

For further analysis, we need to describe position-specific information (conservation) statistically –> statistically model the alignment.

Types of datasets:
- single domain proteins
- multiple domain proteins, all with same architecture
- multiple domain proteins with different architectures –> use shared/homologous domain only!

3
Q

What are the applications of statistical modelling of MSAs?

A

Application:
- find additional family members: add/score additional sequences

4
Q

Alignment of gene/protein families:

position-specific information is important - why?

What do we need for further analysis?

How can different position specific information be visualised?

A

position-specific information is important
* different levels of conservation in different regions
* different functional/structural constraints
* different selective pressures in different regions

We need a statistical model that describes the position-specific information, so that the alignment can be analysed further
(e.g., to add/score additional sequences)

Visualisation: sequence logo

5
Q

What are the advantages of statistically modelling MSAs, eg with a profile HMM?

A
  • probabilistic interpretation of results
  • creating accurate sequence alignments
  • alignment of additional family members to a profile HMM:
    • fast (faster than e.g. T-Coffee)
    • gives MSAs of good quality
  • profile HMMs turn an MSA into a position-specific scoring system suitable for searching databases for (even remotely) homologous sequences
  • profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis
6
Q

What would be the sequence profile for this MSA:

seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT

and what could we use the sequence profile for?

A

col  A    C    T    G
1    0.8  0.2  0.0  0.0
2    0.0  0.6  0.0  0.4
3    0.0  0.0  0.0  1.0
4    0.0  0.6  0.2  0.2
5    0.2  0.0  0.8  0.0

use the profile to
- describe the alignment
- score / align another sequence to the MSA
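
The profile is just the relative frequency of each nucleotide per alignment column. A short sketch of that calculation for the MSA in this card (the function name `sequence_profile` is an illustrative assumption):

```python
from collections import Counter

def sequence_profile(msa, alphabet="ACGT"):
    """Return one dict per alignment column mapping residue -> relative frequency."""
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        profile.append({res: counts[res] / len(column) for res in alphabet})
    return profile

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
profile = sequence_profile(msa)
for pos, freqs in enumerate(profile, start=1):
    print(pos, freqs)
```

Note that the dicts are keyed by nucleotide, so the column order of the printed table (A C T G in the card) does not matter.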

7
Q

What is a PSSM?

A

Position-specific scoring matrix (PSSM)

  • for each nt/aa, compute a score for each MSA column
  • i.e. the score for aligning a given residue (“D”, “M”, …) to a specific column

(related to a scoring matrix for alignment, e.g. BLOSUM)

8
Q

Calculate a PSSM for this MSA

seq1 ACGCA
seq2 ACGCT
seq3 ACGCT
seq4 AGGGT
seq5 CGGTT

with this scoring matrix

scoring matrix
    A   C   G   T
A   1
C  -1   1
G  -1  -1   1
T  -1  -1  -1   1

(is this a realistic matrix to use?)

A

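Realistically we would use a substitution matrix, e.g. BLOSUM;
here a basic match/mismatch matrix is used for simplicity.

PSSM:
rows = MSA columns (pos)
columns = nts

Each entry is the average score of aligning that nt against all residues in the MSA column.

1st row, 1st column (A in column 1: 4 of 5 sequences have A, 1 has C):
(4/5 * 1) + (1/5 * -1) = 0.6

pos  A     C     G     T
1    0.6  -0.6  -1    -1
2   -1     0.2  -0.2  -1
3   -1    -1     1    -1
4   -1     0.2  -0.6  -0.6
5   -0.6  -1    -1     0.6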
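
The same calculation can be sketched in a few lines, assuming the simple +1/-1 match/mismatch matrix from the question (the function name `pssm` and its parameters are illustrative assumptions):

```python
from collections import Counter

def pssm(msa, alphabet="ACGT", match=1.0, mismatch=-1.0):
    """Expected score of aligning each residue to each column of the MSA."""
    rows = []
    for column in zip(*msa):
        counts = Counter(column)
        n = len(column)
        row = {}
        for x in alphabet:
            # Average the pairwise score of x against every residue in the column.
            row[x] = sum(
                (counts[y] / n) * (match if x == y else mismatch) for y in alphabet
            )
        rows.append(row)
    return rows

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]
matrix = pssm(msa)
for pos, row in enumerate(matrix, start=1):
    print(pos, {x: round(s, 2) for x, s in row.items()})
```

With a +1/-1 matrix each score reduces to 2f - 1, where f is the frequency of the nucleotide in that column; with a real substitution matrix the inner `match if x == y else mismatch` lookup would be replaced by the matrix entry for (x, y).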

9
Q

What is PSI-BLAST?

A

PSI-BLAST (Position-Specific Iterated BLAST) derives a position-specific scoring matrix (PSSM), or profile, from the multiple sequence alignment of sequences detected above a given score threshold using protein–protein BLAST.

(1) PSI-BLAST takes as an input a single protein sequence and compares it to a protein database, using the gapped BLAST program [10].
(2) The program constructs a multiple alignment, and then a profile, from any significant local alignments found. The original query sequence serves as a template for the multiple alignment and profile, whose lengths are identical to that of the query. Different numbers of sequences can be aligned in different template positions.
(3) The profile is compared to the protein database, again seeking local alignments. After a few minor modifications, the BLAST algorithm [10,12] can be used for this directly.
(4) PSI-BLAST estimates the statistical significance of the local alignments found. Because profile substitution scores are constructed to a fixed scale [13], and gap scores remain independent of position, the statistical theory and parameters for gapped BLAST alignments [14] remain applicable to profile alignments [10].
(5) Finally, PSI-BLAST iterates, by returning to step (2), an arbitrary number of times or until convergence.
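
The iterate-until-convergence idea in steps (2) to (5) can be sketched with a deliberately tiny toy. This is not the PSI-BLAST algorithm: it assumes ungapped DNA sequences of equal length, a uniform background, Laplace pseudocounts and a fixed log-odds threshold instead of local alignment and E-values. All names (`build_pssm`, `score`, `iterative_search`) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # toy DNA alphabet; real PSI-BLAST works on proteins

def build_pssm(seqs, pseudocount=1.0):
    """Log-odds profile (vs. a uniform background) from equal-length sequences."""
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / 0.25)
            for r in ALPHABET
        })
    return pssm

def score(seq, pssm):
    """Ungapped score of a sequence against the profile."""
    return sum(col[r] for col, r in zip(pssm, seq))

def iterative_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from the current hits until the hit set stops changing."""
    hits = {query}
    for _ in range(max_rounds):
        pssm = build_pssm(sorted(hits))
        new_hits = {query} | {s for s in database if score(s, pssm) >= threshold}
        if new_hits == hits:  # convergence: no hits gained or lost
            break
        hits = new_hits
    return hits

database = ["ACGCA", "ACGGA", "TTTTT"]
print(sorted(iterative_search("ACGCT", database, threshold=2.0)))
```

In this toy, a sequence that scores below the threshold against the query alone can be picked up in a later round, once closer relatives have broadened the profile.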

10
Q

How can a sequence profile be used as a generative model?

e.g.
col  A    C    T    G
1    0.8  0.2  0.0  0.0
2    0.0  0.6  0.0  0.4
3    0.0  0.0  0.0  1.0
4    0.0  0.6  0.2  0.2
5    0.2  0.0  0.8  0.0

What is the probability of
- ACGCT
- CGGTA
- TCGCT

A

to find new members of the gene family
by calculating probability of a sequence being generated from this profile

profile as a generative model
- ACGCT 0.8×0.6×1.0×0.6×0.8 = 0.2304
- CGGTA 0.2×0.4×1.0×0.2×0.2 = 0.0032
- TCGCT 0.0×0.6×1.0×0.6×0.8 = 0
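
A small sketch of the same calculation, treating each column as an independent emission (the names `PROFILE` and `sequence_probability` are illustrative assumptions; `math.prod` needs Python 3.8 or later):

```python
import math

# Profile from this card: one dict per column, residue -> probability.
PROFILE = [
    {"A": 0.8, "C": 0.2, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.6, "G": 0.4, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.6, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.0, "G": 0.0, "T": 0.8},
]

def sequence_probability(seq, profile=PROFILE):
    """Probability that the profile generates `seq`: product of column probabilities."""
    if len(seq) != len(profile):
        raise ValueError("an ungapped profile only models sequences of its own length")
    return math.prod(col[res] for col, res in zip(profile, seq))

for seq in ("ACGCT", "CGGTA", "TCGCT"):
    print(seq, sequence_probability(seq))
```

A single zero in any column forces the whole product to 0, which is why TCGCT is rejected outright even though four of its five positions match the consensus.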

11
Q

What can we do if we create a generative model, but the true diversity of a family is not yet captured?

A

–> pseudocounts

use pseudocounts: adjust the frequency of (rare) amino acid occurrences in a given alignment column, so that residues not yet observed in a column do not get probability 0
* Laplace’s rule
* background frequency pseudocounts
* pseudocounts based on substitution matrices
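
As a sketch of the simplest option, Laplace's rule adds one count for every residue in every column before normalising. The example below assumes a DNA alphabet and a small toy alignment; the function name `laplace_profile` is illustrative.

```python
from collections import Counter

def laplace_profile(msa, alphabet="ACGT", pseudocount=1.0):
    """Column frequencies with `pseudocount` added to every residue count."""
    profile = []
    for column in zip(*msa):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile

msa = ["ACGCA", "ACGCT", "ACGCT", "AGGGT", "CGGTT"]  # toy alignment (assumption)
smoothed = laplace_profile(msa)

# A sequence with an unseen residue in column 1 no longer gets probability 0.
prob = 1.0
for col, res in zip(smoothed, "TCGCT"):
    prob *= col[res]
print(prob)
```

With 5 sequences and 4 nucleotides, the frequency of A in column 1 changes from 4/5 to (4 + 1)/(5 + 4) = 5/9, and the unseen T gets 1/9 instead of 0.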

12
Q

a sequence profile can’t be used to model gaps, deletions etc.

how can we make a more general model which can also deal with gaps and deletions?

A

more general model: sequences with deletions (gaps)
* add deletion edges
* adjust transition probabilities so edges sum to 1 (per node)

can generate / model shorter sequences
- where deletion edges are present! (large/complex model!)

cannot generate / model longer sequences

(not suited to model sequences of different lengths –> need more general model –> profile HMM)

13
Q

We have a generative model from a sequence profile that can also deal with gaps and deletions, but is not suited to longer sequences.

How can we improve on this and create a more general model?

A

use profile HMMs

model sequence length heterogeneity
(without adding a huge number of transitions)
- deletions: nodes (states)
- longer deletions: connect deletion nodes
- insertions: nodes (states)
- longer insertions: self-loops on insertion nodes
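
To see why dedicated delete and insert states avoid "a huge number of transitions", the sketch below compares edge counts. It assumes a textbook-style profile HMM in which match, insert and delete states at each position all connect to the next position (HMMER's own architecture omits the insert/delete cross transitions); both function names are hypothetical.

```python
def skip_edge_model_edges(length):
    """Chain of `length` nodes plus begin/end, with an edge i -> j for every i < j."""
    nodes = length + 2
    return nodes * (nodes - 1) // 2

def profile_hmm_transitions(length):
    """Enumerate transitions of a profile HMM with M, I and D states."""
    transitions = []
    for k in range(length + 1):
        # States at position k: begin/match, insert, and (from position 1) delete.
        sources = [f"M{k}" if k else "B", f"I{k}"]
        if k >= 1:
            sources.append(f"D{k}")
        # Targets: next match (or end), the insert self-loop state, next delete.
        targets = [f"M{k + 1}" if k < length else "E", f"I{k}"]
        if k < length:
            targets.append(f"D{k + 1}")
        transitions.extend((s, t) for s in sources for t in targets)
    return transitions

for length in (5, 50, 200):
    print(length, skip_edge_model_edges(length), len(profile_hmm_transitions(length)))
```

The deletion-edge model grows quadratically with the number of columns, while the profile HMM grows linearly (9 transitions per position plus a constant in this variant) and can additionally model insertions of any length through the self-loops.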

14
Q

What are the elements of an HMM?

What are the most successful applications in bioinformatics?

A

Elements
- states: hidden
- transitions
- transition probabilities
- outcomes / emissions
- emission probabilities

most successful applications of HMMs in bioinformatics
- gene prediction
- analysis of gene/protein families
many applications beyond bioinformatics!

15
Q

How can an HMM be used to predict gene structure?

A

We:
1. train (generate) the model with suitable data (known protein-coding genes, as many as possible, long, short, etc.). Training changes the probabilities, but not the topology.
2. given a DNA sequence,
- where does it most likely contain gene features?
- what probability is associated with this result?

–> use HMM after genome assembly to predict protein-coding genes for structural genome annotation

16
Q

We use an HMM to predict gene structure:

what is hidden/wanted? what is observed?

A

observed: DNA sequence

hidden/wanted: exons, introns, UTRs etc.
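
The split into observed and hidden can be made concrete with a toy decoder. The sketch below uses only two hidden states ("coding", "noncoding") with made-up probabilities in which coding regions are GC-rich; real gene finders use many more states (exons, introns, UTRs, ...) and trained parameters. It recovers the most probable hidden path with the Viterbi algorithm in log space.

```python
import math

STATES = ("noncoding", "coding")
# All probabilities below are invented for illustration.
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}

def viterbi(observed):
    """Return the most probable hidden state path for the observed DNA sequence."""
    # scores[s] = best log probability of any path that ends in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][observed[0]]) for s in STATES}
    backpointers = []
    for symbol in observed[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (scores[best_prev] + math.log(TRANS[best_prev][s])
                             + math.log(EMIT[s][symbol]))
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

dna = "ATATTA" + "GCGCGGCCGC" + "TATAAT"
for base, state in zip(dna, viterbi(dna)):
    print(base, state)
```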

17
Q

How can HMMs be used for alignment?

A

Suppose we have an MSA of a protein family.

  • We can generate and train a profile HMM.
  • We can then compare protein sequences, e.g. from a database, to the profile HMM

given a protein sequence,
* where does it most likely contain amino acids corresponding to (conserved) alignment columns?
* what probability is associated with this result?

–> find sequences that match the profile HMM

18
Q

We use an HMM for alignment.

what is hidden/wanted? what is observed?

A

–> profile HMM

observed: protein (or DNA) sequence

hidden / wanted: MSA match, insert, deletion

19
Q

Define profile HMM.

What do they capture?

A

Statistical description of conserved MSA regions

captures position-specific
- scores for residues
- gap penalties

20
Q

What is the training of a profile HMM based on?

example algorithm?

A

requires (good, sufficient!) training data

training, based on a “seed” alignment (Baum-Welch Algorithm)
- weighting training sequences
- estimating transition probabilities
- estimating emission probabilities

–> changes probabilities, but not topology

21
Q

In which two ways can a profile HMM be used to analyse sequences? give example algorithms for both

A

scoring (Forward algorithm)
- database searching: does a sequence belong to a family?

detecting (Viterbi algorithm, DP)
- aligning a sequence to a profile HMM
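
Scoring with the Forward algorithm means summing the probability of the sequence over all possible hidden paths, rather than taking only the best path as Viterbi does. A minimal sketch on a generic two-state HMM with invented parameters (a real profile HMM also has silent delete states, which need extra bookkeeping not shown here):

```python
STATES = ("s1", "s2")
# Invented toy parameters, for illustration only.
START = {"s1": 0.5, "s2": 0.5}
TRANS = {"s1": {"s1": 0.8, "s2": 0.2}, "s2": {"s1": 0.3, "s2": 0.7}}
EMIT = {
    "s1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def forward(observed, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Total probability of `observed`, summed over all hidden state paths."""
    # f[s] = probability of the prefix seen so far, with the last symbol emitted by s
    f = {s: start[s] * emit[s][observed[0]] for s in states}
    for symbol in observed[1:]:
        f = {
            s: emit[s][symbol] * sum(f[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(f.values())

print(forward("GCGC"))
print(forward("ATAT"))
```

For long sequences the products underflow, so practical implementations work with log probabilities or rescale `f` at each step.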

22
Q

Profile HMM for sequence alignment:

What are the different states?

What do arrows represent?

A

states:
* M: match state
* I: insert state
* D: delete state
* B: beginning
* E: end

arrows: transition

23
Q

Profile HMM for sequence alignment:

What are the different types of emissions ?

A

Match:
emits a single residue according to the frequency of amino acids in the alignment column (probability for all AAs/Nts adds up to 1)

Insertion:
emits one or more residues according to the background frequency of amino acids (probability for all AAs/Nts adds up to 1)

Deletion:
emits no amino acids but a gap

24
Q

Profile HMM for sequence alignment:

How long is the model?

A

The model is as long as the number of conserved positions (columns) in the MSA!

It would not include a region of the MSA that is an insertion in one sequence and absent from most other sequences.
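
One way to decide which columns count as conserved is a gap-fraction rule: a column becomes a match state only if fewer than half of the sequences have a gap there. The sketch below assumes that heuristic and a toy alignment; the threshold and the name `match_columns` are illustrative.

```python
def match_columns(msa, max_gap_fraction=0.5, gap_chars="-."):
    """Indices of alignment columns that would become match states."""
    n = len(msa)
    keep = []
    for idx, column in enumerate(zip(*msa)):
        gaps = sum(ch in gap_chars for ch in column)
        if gaps / n < max_gap_fraction:
            keep.append(idx)
    return keep

# Toy alignment: column 2 is an insertion present in only one sequence.
msa = [
    "AC-GT",
    "AC-GT",
    "ACTGT",
    "AC-GA",
]
columns = match_columns(msa)
print("match columns:", columns, "model length:", len(columns))
```

Here the alignment has 5 columns but the model has length 4; the extra T in the third sequence would be handled by an insert state.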

25
Q

What is HMMer?

Pfam?

A

HMMer: an implementation of profile HMMs for biological sequence analysis

HMMer is the basis for the Pfam database
- 1997: protein databases were becoming increasingly large
- newly added proteins are new members of existing protein (domain) families
- classification & management of protein sequences: organize them into (domain) families
- 20,000 families, integrated into InterPro

26
Q

HMMer architecture?

A

M, D, I: match, deletion, insertion states
B, E: dummy begin/end states of the model
N, C: N-/C-terminal unaligned sequence states
S, T: start/termination states
J: joining state (unaligned sequence between multiple hits)

27
Q

HMMer architecture?

global vs local alignment?

A

Alignment can be
- global/local with respect to the sequence
- global/local with respect to the model
- single/multiple hit

global:
If the probability of the N state looping back on itself is 0 –> forces the sequence to match from its beginning
If the transition from B to the first match state has probability 1 –> forces global alignment with respect to the model

28
Q

How are insertions in a sequence (compared to the profile HMM) denoted?

A

gap symbol: .
at least one sequence has an insertion, compared to the profile HMM

29
Q

How are deletions in a sequence (compared to the profile HMM) denoted?

A

gap symbol: -
the sequence is missing a column that the HMM expects to be there

30
Q

Name 5 HMMer programs

A

hmmbuild: build a profile HMM from an alignment

hmmscan: search one or more sequences against an HMM database

hmmsearch: search a sequence database with a profile HMM

hmmalign: align sequences to an HMM profile

jackhmmer: iteratively search sequence(s) against a protein database
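
As a sketch of how these programs are typically called, the snippet below builds the command lines from Python. The argument order follows each program's usual positional usage; all file names are hypothetical, and nothing is executed unless `run` is called on a machine where HMMER is installed.

```python
import shutil
import subprocess

# Hypothetical file names, chosen only for illustration.
COMMANDS = {
    # hmmbuild <hmmfile_out> <msafile>
    "hmmbuild": ["hmmbuild", "family.hmm", "seed_alignment.sto"],
    # hmmsearch <hmmfile> <seqdb>
    "hmmsearch": ["hmmsearch", "family.hmm", "proteins.fasta"],
    # hmmscan <hmmdb> <seqfile>; the HMM database must first be indexed with hmmpress
    "hmmscan": ["hmmscan", "hmm_database.hmm", "query.fasta"],
    # hmmalign <hmmfile> <seqfile>
    "hmmalign": ["hmmalign", "family.hmm", "members.fasta"],
    # jackhmmer <seqfile> <seqdb>
    "jackhmmer": ["jackhmmer", "query.fasta", "proteins.fasta"],
}

def run(program):
    """Run one HMMER command if the program is installed; otherwise just report it."""
    argv = COMMANDS[program]
    if shutil.which(argv[0]) is None:
        print("not installed, would run:", " ".join(argv))
        return None
    return subprocess.run(argv, check=True)
```

For example, `run("hmmbuild")` followed by `run("hmmsearch")` would build a profile from the seed alignment and then search the sequence file with it. Note the mirror image between hmmsearch (profile as query, sequences as database) and hmmscan (sequences as query, profiles as database).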

31
Q

What does the HMMer program hmmscan do?

A

hmmscan: search one or more sequences against an HMM database

32
Q

What does the HMMer program hmmsearch do?

A

hmmsearch: search a sequence database with a profile HMM

33
Q

What does the HMMer program hmmalign do?

A

hmmalign: align sequences to an HMM profile

34
Q

What does the HMMer program jackhmmer do?

A

jackhmmer: iteratively search sequence(s) against a protein database

35
Q

What is the output of hmmsearch like?

A

BLAST-like output, including E-values

36
Q

What are the disadvantages of HMMs for sequence analysis?

A

disadvantages
* requires (good, sufficient!) training data
* cannot model dependencies between positions (RNA!)

As with all alignment methods, each column is considered independent of the others.

But there are dependencies because of protein structure. Some methods do model them, but they are not part of this lecture.

37
Q

HMM vs Markov Chain?

A

Markov model or Markov chain?

A Markov chain is the simplest type of Markov model: all states are observable, and (for many chains) the state probabilities converge over time to a stationary distribution.

Hidden Markov models are similar to Markov chains, but their states are hidden: only the emissions are observed.
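
The convergence of a Markov chain can be illustrated by repeatedly applying the transition matrix to a starting distribution. The two-state matrix below is an invented toy example, and `step` is a hypothetical helper.

```python
def step(distribution, transition):
    """Advance a state distribution by one step of the chain."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]

# Toy transition matrix (assumption): row i holds the probabilities of leaving state i.
TRANSITION = [[0.9, 0.1], [0.5, 0.5]]

distribution = [0.0, 1.0]  # start with certainty in the second state
for _ in range(100):
    distribution = step(distribution, TRANSITION)
print(distribution)
```

For this matrix the distribution approaches (5/6, 1/6) regardless of the starting state, since that is the distribution left unchanged by one more step.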

38
Q

Name the three different types of scores (model parameters) represented in a profile HMM

A

Match emission, insert emission, and state transition scores.

39
Q

What is profile HMM?

A

Profile analysis is a tool for aligning multiple sequences and for identifying known sequence domains in unannotated sequences. It uses position-specific scoring. Profile HMMs estimate the frequencies of residues at given positions from observations in the training data.

40
Q

What is transition probability and emission probability?

A

The transition probability is the probability of moving from one state to another. If there are no gaps (insertions or deletions), the probability of going from one match state to the next match state is 1.

An emission probability is associated with each match state (insert states have emission probabilities too): it is the probability of a specific residue occurring at the given position.

41
Q

What are the advantages of using profile HMMs compared to other similar methods?

A

Compared to other database searching methods such as BLAST, which use position-independent scoring, searching with a profile is more sensitive.

Searching sequences against a database is more computationally expensive than matching them against a profile.

It requires less training data than non-probabilistic models.

42
Q

Name the two most successful applications of HMMs in bioinformatics that were discussed in the lecture and list three differences regarding their (profile) HMM.

A

Gene Prediction
- Training data to create profile HMM: DNA sequences of known protein coding genes
- Hidden States: Genomic features such as internal exons, promoters, intergenic regions, terminal exons, …
- Application Focus: Structural genome annotation

Analysis of Gene/Protein Families
- Training data to create profile HMM: Multiple sequence alignment of a core set of family members
- Hidden States: Match, Insertion, Deletion
- Application Focus: evolutionary analysis

43
Q

EXAM QUESTION

Define (p?)HMM and covariance model; give 1 similarity and 1 difference (2022)

A
44
Q

EXAM QUESTION

what are hidden states, states, transition and emission probabilities of HMM for prokaryotic sequences (2022)

A

Depends on whether the HMM is used for gene prediction or for gene family analysis?

45
Q

Define HMM

A

A Hidden Markov Model (HMM) is a statistical model that represents a system containing hidden states where the system evolves over time.

It is “hidden” because the state of the system is not directly visible to the observer;

instead, the observer can only see some output that depends on the state.

Markov models are characterized by the Markov property, which states that the future state of a process only depends on its current state, not on the sequence of events that preceded it.
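
The definition can be illustrated by sampling from a small HMM: the hidden path follows a Markov chain, and only the emitted symbols would be visible to an observer. All parameters and the name `sample_hmm` are invented for this sketch.

```python
import random

STATES = ["s1", "s2"]
# Invented toy parameters, for illustration only.
START = {"s1": 0.5, "s2": 0.5}
TRANS = {"s1": {"s1": 0.8, "s2": 0.2}, "s2": {"s1": 0.3, "s2": 0.7}}
EMIT = {
    "s1": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "s2": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}

def sample_hmm(length, rng):
    """Generate (hidden state path, observed sequence) of the given length."""
    hidden, observed = [], []
    state = rng.choices(STATES, weights=[START[s] for s in STATES])[0]
    for _ in range(length):
        hidden.append(state)
        symbols = list(EMIT[state])
        # The emission depends only on the current hidden state.
        observed.append(
            rng.choices(symbols, weights=[EMIT[state][x] for x in symbols])[0]
        )
        # Markov property: the next state depends only on the current state.
        state = rng.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
    return hidden, "".join(observed)

hidden, observed = sample_hmm(12, random.Random(0))
print("hidden:  ", hidden)
print("observed:", observed)
```

Decoding (Viterbi) and scoring (Forward) are the reverse problems: given only `observed`, infer the likely `hidden` path or the total probability of the sequence.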