Kursusgang 10 (Time series models) Flashcards
Why use a time series model?
It can model dependencies in the input, which is useful when inputs are no longer i.i.d.
It is suited to speech, handwriting and DNA sequences, where the dependent units are, respectively, words in a sequence, pen movements and base pairs.
What are some prominent examples of time series models?
Autoregressive (AR)
Moving Average (MA)
ARMA
Recurrent Neural Networks
Transformers
Hidden Markov Models
What is a Markov chain?
The defining characteristic of a Markov Chain is the Markov Property, which states that the future state depends only on the current state and not on the sequence of states that preceded it.
It is typically represented by a set of states, s = {s_1, s_2, …, s_N}, and a transition matrix that defines the probabilities of moving from one state to another, A = {a_{ij}}, a_{ij} = P(q_{t+1} = s_j | q_t = s_i), 1 ≤ i, j ≤ N.
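As a minimal sketch, a Markov chain can be simulated by repeatedly sampling the next state from the row of A indexed by the current state. The two-state "weather" chain below is a hypothetical example, not from the slides:

```python
import numpy as np

# Hypothetical 2-state chain: state 0 = "rain", state 1 = "sun".
# Row i of A is the distribution over the next state given current state i.
A = np.array([[0.7, 0.3],   # a_ij = P(q_{t+1} = s_j | q_t = s_i)
              [0.4, 0.6]])

rng = np.random.default_rng(0)

def sample_chain(A, start, T, rng):
    """Sample a state sequence of length T using the Markov property:
    each next state depends only on the current one."""
    states = [start]
    for _ in range(T - 1):
        states.append(int(rng.choice(len(A), p=A[states[-1]])))
    return states

path = sample_chain(A, start=0, T=10, rng=rng)
```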
What is a hidden Markov model?
In a hidden Markov model, the states in a markov chain are hidden, meaning they cannot be observed directly.
Instead, each hidden state produces observable outputs, M, according to a probability distribution, known as the observation probability distribution, B = {b_j(k)}, where b_j(k) = P(O_t = v_k | q_t = s_j), 1 ≤ j ≤ N, 1 ≤ k ≤ M.
The initial state distribution is denoted by π = {π_i}, 1 ≤ i ≤ N.
For convenience, use the notation λ = (A, B, π).
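A toy λ = (A, B, π) can be written down and sampled from directly; the matrices below are hypothetical examples chosen for illustration. Note that a user of the model would only see O, never q:

```python
import numpy as np

# Hypothetical lambda = (A, B, pi): N = 2 hidden states, M = 3 symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # transition probabilities a_ij
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # observation probabilities b_j(k)
pi = np.array([0.6, 0.4])                           # initial distribution pi_i

rng = np.random.default_rng(1)

def sample_hmm(A, B, pi, T, rng):
    """Generate a hidden state path q and an observation sequence O from lambda."""
    q = [int(rng.choice(len(pi), p=pi))]
    O = [int(rng.choice(B.shape[1], p=B[q[0]]))]
    for _ in range(T - 1):
        q.append(int(rng.choice(len(A), p=A[q[-1]])))
        O.append(int(rng.choice(B.shape[1], p=B[q[-1]])))
    return q, O

q, O = sample_hmm(A, B, pi, T=8, rng=rng)   # only O is observable in practice
```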
What are the three basic problems of hidden Markov models?
Scoring:
* Given an observation sequence O = {o_1, o_2, …, o_T} and a model λ = (A, B, π), how does one compute P(O | λ), the probability of the observation sequence?
Solution: Forward-backward algorithm
Matching:
* Given an observation sequence O = {o_1, o_2, …, o_T} and a model λ = (A, B, π), how should a state sequence q = {q_1, q_2, …, q_T} be chosen to be an optimum in some sense?
Solution: Viterbi Algorithm
Training:
* Given an observation sequence O = {o_1, o_2, …, o_T} and an untrained model λ = (A, B, π), how should the model parameters be adjusted to maximize P(O | λ)?
Solution: Baum-Welch Re-estimation procedures.
Explain the scoring problem of hidden Markov models.
Given an observation sequence O = {o_1, o_2, …, o_T} and a model λ = (A, B, π), how does one compute P(O | λ), the probability of the observation sequence?
A naive approach considers all N^T possible state sequences of length T. This requires on the order of 2T * N^T computations, which quickly becomes infeasible even for small N and T.
What is the forward-backward algorithm?
The forward-backward algorithm is a technique used in hidden Markov models to calculate the posterior probabilities of hidden states given an observed sequence.
Mathematically define the forward algorithm in the forward-backward algorithm.
The forward variable α_t(i) = P(o_1o_2…o_t, q_t = i | λ) is the probability of the partial observation sequence until time t and for state at time t being i, given the model λ.
α_t(i) can be solved inductively by
* Initialisation, for 1 ≤ i ≤ N:
α_1(i) = π_i * b_i(o_1)
* Induction, for 1 ≤ t ≤ T-1, 1 ≤ i ≤ N:
α_{t+1}(i) = [ \sum_{j=1}^N α_t(j) * a_{ji} ] * b_i(o_{t+1})
* Termination:
P(O | λ) = \sum_{i=1}^N α_T(i)
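The three steps above can be sketched in NumPy; the two-state, three-symbol model is a hypothetical example. The recursion costs O(N^2 * T) instead of the naive O(T * N^T):

```python
import numpy as np

# Hypothetical model: N = 2 hidden states, M = 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])            # a_ij
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])  # b_j(k)
pi = np.array([0.6, 0.4])                           # pi_i

def forward(A, B, pi, O):
    """Forward pass: alpha[t, i] = P(o_1 ... o_t, q_t = i | lambda)."""
    T, N = len(O), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]                      # initialisation
    for t in range(T - 1):                          # induction
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    return alpha, float(alpha[-1].sum())            # termination: P(O | lambda)

O = [0, 2, 1]
alpha, p = forward(A, B, pi, O)
```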
Mathematically define the backward algorithm in the forward-backward algorithm.
Used for the scoring problem in hidden Markov models.
The backward variable β_t(i) = P(o_{t+1}o_{t+2}…o_T | q_t = i, λ) is the probability of the partial observation sequence from time t+1 to the end, given state i at time t and model λ. Similarly, β_t(i) can be solved inductively by
* Initialisation, for 1 ≤ i ≤ N:
β_T(i) = 1
* Induction, for t = T-1, T-2, …, 1, 1 ≤ i ≤ N:
β_t(i) = \sum_{j=1}^N a_{ij} * b_j(o_{t+1}) * β_{t+1}(j)
* Termination:
P(O | λ) = \sum_{i=1}^N π_i * b_i(o_1) * β_1(i)
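The backward recursion can be sketched the same way, again with a hypothetical two-state, three-symbol model; the termination line recovers P(O | λ):

```python
import numpy as np

# Hypothetical model: N = 2 hidden states, M = 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def backward(A, B, pi, O):
    """Backward pass: beta[t, i] = P(o_{t+1} ... o_T | q_t = i, lambda)."""
    T, N = len(O), len(pi)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                  # initialisation: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                  # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    p = float((pi * B[:, O[0]] * beta[0]).sum())    # termination: P(O | lambda)
    return beta, p

O = [0, 2, 1]
beta, p = backward(A, B, pi, O)
```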
Explain the matching problem of hidden Markov models
Given an observation sequence O = {o_1, o_2, …, o_T} and a model λ = (A, B, π), how should a state sequence q = {q_1, q_2, …, q_T} be chosen to be an optimum in some sense?
In applications like speech recognition or gene prediction, the hidden states represent the underlying structure we want to infer.
One optimality criterion is to choose the states q_i that are individually most likely at each time t,
q_t^* = argmax_{1 ≤ i ≤ N} [α_t(i) * β_t(i)] / (\sum_{i=1}^N α_t(i) * β_t(i)).
However, the individually most likely states may not even form a valid sequence, e.g. if a_{ij} = 0 for two consecutively chosen states i and j.
Another optimality criterion is to find the single best state sequence (path), i.e., to maximize P(q , O | λ), which gives rise to the Viterbi algorithm.
What does the Viterbi algorithm do?
It is related to the matching problem in hidden Markov models. It finds the most likely sequence of hidden states for a given observation sequence by maximizing the joint probability of transitions and emissions.
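A log-space sketch of the Viterbi recursion, using a hypothetical two-state, three-symbol model (working with log probabilities avoids numerical underflow for long sequences):

```python
import numpy as np

# Hypothetical model: N = 2 hidden states, M = 3 observation symbols.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])

def viterbi(A, B, pi, O):
    """Return the single best state path q* maximizing P(q, O | lambda)."""
    T, N = len(O), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, N))            # best log joint probability ending in state i
    psi = np.zeros((T, N), dtype=int)   # backpointers to the best predecessor
    delta[0] = np.log(pi) + logB[:, O[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA      # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, O[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                  # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

O = [0, 2, 1]
```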
Explain the training problem of hidden Markov models
Given an observation sequence O = {o_1, o_2, …, o_T} and an untrained model λ = (A, B, π), how should the model parameters be adjusted to maximize the likelihood of a given observation sequence, P(O | λ)?
There is no efficient algorithm for global optimisation. This problem is crucial when the hidden Markov model's parameters are unknown and need to be learned from data.
A solution to this is the Baum-Welch re-estimation, which refines the parameters iteratively to locally maximize P(O | λ).
Explain the Baum-Welch re-estimation algorithm
It is a special case of the expectation-maximization algorithm that computes probabilities using the current model, λ, and then refines λ to λ* such that P(O | λ) is locally maximized.
It uses α and β from the forward-backward algorithm.
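A single re-estimation step can be sketched as follows. The sketch is self-contained (the forward-backward pass is inlined) and the starting model and observation sequence are hypothetical examples:

```python
import numpy as np

# Hypothetical starting model lambda = (A, B, pi) and observation sequence O.
A  = np.array([[0.7, 0.3], [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
pi = np.array([0.6, 0.4])
O  = [0, 2, 1, 1, 0, 2]

def baum_welch_step(A, B, pi, O):
    """One re-estimation step lambda -> lambda* for a single sequence O.

    E-step: forward-backward pass under the current lambda, then the
    posteriors gamma_t(i) and xi_t(i, j). M-step: re-estimated (A, B, pi).
    Also returns P(O | lambda) under the *current* model.
    """
    O = np.asarray(O)
    T, N, M = len(O), len(pi), B.shape[1]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, O[0]]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, O[t + 1]]
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, O[t + 1]] * beta[t + 1])
    P_O = float(alpha[-1].sum())
    gamma = alpha * beta / P_O                      # P(q_t = i | O, lambda)
    xi = (alpha[:-1, :, None] * A[None, :, :]       # P(q_t = i, q_{t+1} = j | O, lambda)
          * (B[:, O[1:]].T * beta[1:])[:, None, :]) / P_O
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.array([gamma[O == k].sum(axis=0) for k in range(M)]).T
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new, pi_new, P_O
```

Iterating this step never decreases P(O | λ), which is the EM guarantee behind the local maximization.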
Can a hidden Markov model have continuous densities?
Yes. Replace the discrete observation probabilities, b_j(k) by a continuous PDF, b_j(x). The PDF is often represented as a mixture of Gaussians.
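A sketch of such an emission density b_j(x) as a mixture of 1-D Gaussians; all mixture parameters below are hypothetical:

```python
import numpy as np

def gmm_emission(x, c, mu, sigma):
    """b_j(x) for one state j as a mixture of 1-D Gaussians.
    c are the mixture weights (summing to 1), mu the means, sigma the std devs."""
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(c * pdf))

# Hypothetical two-component mixture for a single state.
c     = np.array([0.3, 0.7])
mu    = np.array([-1.0, 2.0])
sigma = np.array([0.5, 1.0])
```

In the forward-backward and Viterbi recursions, b_j(o_t) is then simply replaced by b_j(x_t) evaluated at the continuous observation x_t.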