CHAPTER 4: Inference for discrete time Markov chains Flashcards
A weather problem
We focus on the problem of constructing, fitting and checking a model for daily rainfall at a specific location. As a first step, given observed rainfall amounts, r_1, …, r_n say, on a sequence of consecutive days, we suppose that daily rainfalls are the realised values of a sequence of random variables R = (R_1, …, R_n).
[Figure: observed rainfall at Snoqualmie Falls, Januaries 1948–1983]
The fact that on some days there is no rain and on others a positive amount suggests that it might be useful to consider first the succession of dry days and wet days, and later model the amount of rain falling on a day conditionally on the day being wet. Let
X_i = 1 if R_i > 0, and X_i = 0 otherwise,
so that X_i is an indicator variable for day i being wet.
We seek a simple model for the X_i. It is possible that precipitation patterns vary seasonally, so initially we concentrate on modelling only within a limited period during which conditions can be expected to remain reasonably stable. For this reason we concentrate on the sequence of X_i during Januaries.
Independence model:
Snoqualmie Falls
- The simplest model supposes that the X_i are independent and identically distributed; that is, they form a Bernoulli sequence.
- i.e. whether a day is wet or dry is independent of whether the previous day was wet (W) or dry (D).
- So there is one parameter: p for the probability of a wet day, 1 − p for a dry day.
* If p = P(X_i = 1), then the likelihood of p based on observations x_1, …, x_n is
L(p) = p^{Σ_{i=1}^n x_i} (1 − p)^{n − Σ_{i=1}^n x_i}.
(i.e. the probability of a particular sequence is p^{number of wet days} (1 − p)^{number of dry days})
Over the Januaries 1948-1983 there were 791 wet days out of 1116 days altogether.
Thus Σ_{i=1}^{1116} x_i = 791,
and the log-likelihood is
l(p) = 791 log(p) + 325 log(1 − p).
[Figure: the log-likelihood l(p) plotted against p. The curve is skewed, with its peak at p ≈ 0.71 (the proportion of wet days); a horizontal line intersecting it at log-likelihood just above −675 gives a suitable likelihood region / approximate 95% confidence interval.]
The MLE
p̂ = Σ_{i=1}^n x_i / n = 791/1116 = 0.71
and an approximate 95% confidence interval for p obtained from the log-likelihood is (0.68, 0.74)
Independence model:
Snoqualmie Falls
NOTES
In fact in this case we can find a confidence interval directly, since under our assumptions Σ_{i=1}^n X_i has a Binomial Bin(1116, p) distribution, so that the estimated standard error of p̂ is √(p̂(1 − p̂)/n), and an approximate 95% confidence interval based on the Normal approximation is p̂ ± 2 × e.s.e.
(i.e. plus or minus two estimated standard errors)
Here the two intervals, likelihood-based and Binomial-based, are numerically the same to 2 dp.
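As a quick check, a minimal R sketch of the MLE and of both intervals (variable names here are our own):

  # Independence (Bernoulli) model: 791 wet days out of n = 1116.
  n_wet <- 791; n <- 1116
  p_hat <- n_wet / n                  # MLE, ~0.709

  # Binomial / Normal-approximation interval: p_hat +/- 2 e.s.e.
  ese <- sqrt(p_hat * (1 - p_hat) / n)
  p_hat + c(-2, 2) * ese              # ~ (0.68, 0.74)

  # Likelihood-based interval: p with l(p) within qchisq(0.95, 1)/2 ~ 1.92
  # of the maximum log-likelihood.
  l <- function(p) n_wet * log(p) + (n - n_wet) * log(1 - p)
  p_grid <- seq(0.6, 0.8, by = 1e-4)
  range(p_grid[l(p_grid) >= l(p_hat) - qchisq(0.95, 1) / 2])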
Independence model:
Snoqualmie Falls
Model adequacy: is the independence assumption adequate?
- We might suspect not, because large weather systems may take several days to pass a point, and whilst they are doing so the weather conditions at the point will tend to persist, undermining the independence assumption.
- For a rough check, consider the number of dry days followed by wet days. Under independence, the expected number of such pairs in the data is
E(# dry–wet pairs) = 36 × 30 × (1 − p)p,
since there are 36 Januaries in the data-set and 30 overlapping pairs of consecutive days in each. Similarly for the other combinations of wet and dry. To estimate these expected numbers we replace p by p̂.
* Given the table of the numbers of pairs of days (Table 1), we have more D–D and W–W pairs than expected:

Table 1: Observed (expected under independence) counts of overlapping pairs of days.

                         Today
                  Wet            Dry            Total
Yesterday  Wet    643 (542.6)    128 (223.0)     771
           Dry    123 (223.0)    186 ( 91.6)     309
           Total  766            314            1080

(Counts are actual; values in parentheses are estimates from the model.)
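A minimal R sketch reproducing the expected counts in Table 1 under independence (names are our own):

  p <- 791/1116                            # estimated P(wet day)
  n_pairs <- 36 * 30                       # 36 Januaries x 30 pairs = 1080
  probs <- c(wet = p, dry = 1 - p)
  round(n_pairs * outer(probs, probs), 1)  # e.g. ~542.6 wet-wet, ~91.6 dry-dry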
* We might consider carrying out a goodness of fit test. The Pearson X² statistic
X² = Σ (observed − expected)²/expected
is a natural choice of test statistic for such a test, since large values of X² will indicate discrepancy from our expectations under the independence assumption. For a standard 2×2 contingency table, the observed value of X² would be compared with a χ²_1 distribution, which is the approximate distribution of X² under the hypothesis of independence in a contingency table.
For such a table the underlying model is a Multinomial distribution with four categories, and the basis for use of this distribution is a sampling scheme in which each of the items whose frequencies are recorded in the table can fall independently into the four cells with prescribed probabilities.
Is that appropriate for Table 1?
In Table 1 the items counted in the four cells are pairs of days. From the way the data have been collected it is clear that consecutive pairs of days overlap; the second day of one is the first day of the next (within each January). Thus our null hypothesis of independence and identical distribution of days does not translate immediately into the standard Multinomial basis for an X² goodness of fit test. We need an extension of the standard theory to the case when items are collected in such a way that each is dependent on its predecessor.
(We don't have independence between overlapping pairs, so the Multinomial basis is not applicable.)
If instead we consider non-overlapping consecutive pairs, then the individual items being classified would be independent under our hypothesis and the χ² theory would apply. Frequencies from non-overlapping pairs are given in Table 2:
Table 2: Observed (expected under independence, using p̂) counts of non-overlapping pairs of days.

                         Today
                  Wet            Dry            Total
Yesterday  Wet    325 (271.3)     55 (111.5)     380
           Dry     66 (111.5)     94 ( 45.8)     160
           Total  391            149             540
A standard X² test applied to Table 2 (with expected values calculated from the marginal totals in the table rather than from our estimate p̂ = 0.709) gives X² = 108. For comparison, the 95% quantile of χ²_1 is 3.84. There is overwhelming evidence against the independence assumption even with this partial view of the data: wetness or dryness has a tendency to persist from one day to another to a greater extent than the independence theory would predict.
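For reference, a minimal R sketch of this test on Table 2; chisq.test computes the expected values from the marginal totals and, for a 2×2 table, applies Yates' continuity correction by default, which gives a value close to the 108 quoted above:

  tab <- matrix(c(325, 55,
                   66, 94),
                nrow = 2, byrow = TRUE,
                dimnames = list(yesterday = c("Wet", "Dry"),
                                today     = c("Wet", "Dry")))
  chisq.test(tab)   # X-squared ~ 108 on 1 df; the p-value is tiny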
This points to the need for the analysis of dependent variables: we need models that allow a variable's value today to affect its value tomorrow. A discrete time Markov chain is such a model.
Estimation of transition probabilities
for n states
- We now assume that we have observations of a process which we think might be well modelled by a Markov chain on a known state space S, but with unknown transition matrix P.
- We estimate P by using the likelihood:
Suppose we have observations x_0, x_1, …, x_n modelled as successive states of a Markov chain X_0, X_1, …, X_n with transition probabilities p_ij.
So the probability of observing a given sequence of states is given by the transition probabilities:
P(X_0 = x_0, …, X_n = x_n)
= P(X_n = x_n | X_{n−1} = x_{n−1}, …, X_0 = x_0) P(X_{n−1} = x_{n−1}, …, X_0 = x_0)
= P(X_n = x_n | X_{n−1} = x_{n−1}) P(X_{n−1} = x_{n−1} | X_{n−2} = x_{n−2}, …, X_0 = x_0) × P(X_{n−2} = x_{n−2}, …, X_0 = x_0)
= P(X_n = x_n | X_{n−1} = x_{n−1}) P(X_{n−1} = x_{n−1} | X_{n−2} = x_{n−2}) … × P(X_1 = x_1 | X_0 = x_0) × P(X_0 = x_0)
= P(X_0 = x_0) p_{x_0 x_1} p_{x_1 x_2} … p_{x_{n−1} x_n}
by the Markov property. This is the likelihood L for the parameters P = (p_ij) given the data.
Let n_ij = the number of transitions observed from state i to state j. So:
L = L(P) = P(X_0 = x_0) ∏_{i,j} p_ij^{n_ij}
log-likelihood for P:
l(P) = log P(X_0 = x_0) + Σ_{i,j} n_ij log(p_ij)
As usual, values of the p_ij giving large l are plausible in the light of the data.
Estimation of transition probabilities
L = L(P) = P(X_0 = x_0) ∏_{i,j} p_ij^{n_ij}
log-likelihood for P:
l(P) = log P(X_0 = x_0) + Σ_{i,j} n_ij log(p_ij)
The maximum likelihood estimates of the transition probabilities are obtained by maximizing l(P) subject to P being a transition matrix. This means that we will need to apply the constraint that the sum of the entries in each row is 1; we can do this either using Lagrange multipliers (see MAS211) or by, for example,
writing the last term in each row in terms of the others as we did in Example 9: if S = {1, 2, …, N} then p_{i,N} = 1 − (p_{i,1} + p_{i,2} + … + p_{i,N−1}).
In some cases we may assume a particular form for the transition matrix in which the constraint that the row sums are 1 is already built in (see the examples of parametrized matrices below).
- One issue: how should we treat the factor P(X_0 = x_0), which need not depend on the parameters p_ij? There are several options:
- Argue conditionally: define the (conditional) likelihood as P(X_n = x_n, … | X_0 = x_0); the argument above then gives a log-likelihood without the log P(X_0 = x_0) term.
- Assume the chain is in equilibrium, so that we can take P(X_0 = x_0) to be the probability of finding the chain in state x_0 under its stationary distribution.
- When n is large, l is likely to be dominated by its second term, so the precise treatment of the first term is unimportant and we simply ignore it.
Estimation of transition probabilities
ARGUING CONDITIONALLY
L = L(P) = P(X_0 = x_0) ∏_{i,j} p_ij^{n_ij}
log-likelihood for P:
l(P) = log P(X_0 = x_0) + Σ_{i,j} n_ij log(p_ij)
Suppose the state space S is finite. Then, arguing conditionally on X_0 = x_0, we wish to maximize
Σ_{i,j} n_ij log(p_ij)
subject to Σ_j p_ij = 1,
for i = 1, …, |S|.
Form the Lagrangian function
L(P, λ) = Σ_{i,j} n_ij log(p_ij) + Σ_i λ_i (Σ_j p_ij − 1).
Then
∂L/∂p_ij = n_ij/p_ij + λ_i,   i, j = 1, …, |S|.   (23)
Setting these derivatives to zero gives
p̂_ij = n_ij/(−λ_i),
and thus, using Σ_j p̂_ij = 1, we get λ_i = −n_i. and hence
p̂_ij = n_ij/n_i.   (24)
Here we use n_i. to mean Σ_j n_ij
= the total number of transitions out of state i.
Similarly, we will later use n_.j to mean Σ_i n_ij (the total number of transitions into state j), and n_.. to mean Σ_{i,j} n_ij
= the total number of transitions observed.
e.g. if we observe 25 transitions out of state i and 6 of them are to state j, the estimate for the transition probability is p̂_ij = 6/25.
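A minimal R sketch of (24) (the function name and the toy sequence below are our own):

  # Estimate a transition matrix from an observed state sequence x:
  # count the transitions n_ij, then divide each row by its total n_i.
  estimate_P <- function(x, states = sort(unique(x))) {
    n <- table(factor(x[-length(x)], levels = states),   # from-states
               factor(x[-1],         levels = states))   # to-states
    n / rowSums(n)                                       # p_ij-hat = n_ij / n_i.
  }
  estimate_P(c("D", "D", "W", "W", "W", "D", "W", "W"))  # toy data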
Example 13. Snoqualmie Falls
Given the transition count data, we fit a 2-state Markov chain model to the sequences of wet (W) and dry (D) days. Under this model the log-likelihood has the form
l(P) = log P(X_0 = x_0) + Σ_{i,j} n_ij log(p_ij)
for each January; for the full likelihood we must consider the relationship between the years. Transition probabilities are given for the month of January, and we assume the underlying behaviour is the same in each year: we can suppose the years are independent and that the transition probabilities {p_ij} are the same over the years (e.g. 1948, 1949, …).
As a first model we therefore:
1) assume independence from one year to another, and
2) assume stationarity over the years.
Let n_ij^(k) = the observed transition count from i to j in year k.
Then the likelihood of P based on the observed sequences of W/D days, conditionally on the first observation in January of each year, is
L(P) = ∏_k ∏_{i,j} p_ij^{n_ij^(k)},
giving log-likelihood
l(P) = Σ_k Σ_{i,j} n_ij^(k) log(p_ij)
= Σ_{i,j} n_ij log(p_ij),
where n_ij = Σ_k n_ij^(k).
The maximization used above therefore applies directly, and from Table 1 gives the estimated transition matrix shown in Table 3 below. The estimated probabilities are quite different from those based on the independence model, which would be (0.29, 0.71) in each row.
Examples: parametrized transition matrices (with the row-sum constraint built in)

2×2:
[ 1−α    α  ]
[  β    1−β ]

3×3, general (six free parameters):
[ α   β   1−α−β ]
[ γ   δ   1−γ−δ ]
[ ε   ζ   1−ε−ζ ]

3×3, one parameter p:
[  0     p    1−p ]
[ 1−p    0     p  ]
[  p    1−p    0  ]

General N×N:
[ p_11  p_12  …  p_1N ]
[ p_21  p_22  …  p_2N ]
[  …     …        …   ]
[ p_N1  p_N2  …  p_NN ]
Uncertainty in these transition probability estimates:
we consider the conditional likelihood function, and to allow for the constraint on the parameters in a given row we can reparametrize the transition matrix.
Table 3: Estimated transition matrix; Januaries 1948–1983 (MLEs p̂_ij = n_ij/n_i.)

                        Today
                  Dry               Wet
Yesterday  Dry    0.602 (186/309)   0.398 (123/309)
           Wet    0.166 (128/771)   0.834 (643/771)
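For instance, a minimal R sketch reproducing Table 3 from the Table 1 counts (names our own):

  n <- matrix(c(186, 123,
                128, 643), nrow = 2, byrow = TRUE,
              dimnames = list(yesterday = c("Dry", "Wet"),
                              today     = c("Dry", "Wet")))
  round(n / rowSums(n), 3)   # rows: (0.602, 0.398) and (0.166, 0.834)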
For the uncertainty we argue conditionally on the value observed on 1 January of each year.
* To allow for the constraint: reparametrize the transition matrix in terms of the first |S| − 1 probabilities in each row and look for a likelihood region;
e.g. for |S| = 2 this is easy.
Alternatively: use an asymptotic approximation.
Testing the model: so far we have considered only pairs of days, so we check the fit against features of the data we haven't used to fit the model.
Example 14. Snoqualmie Falls: persistence of spells
The Markov chain model in the example has been fitted by considering pairs of days. A natural question to ask is how well it represents features of the data other than those used in its fitting.
Lengths of wet/dry spells, January 1983 (1 = wet, 0 = dry):
1111111111 0 111111111 0 111111 0 1 00
There were 3 dry spells whose beginnings and ends fell within this month, each lasting one day (the final dry spell overlaps 31 January).
Over the 36 years, the average length of dry spells that did not overlap 1 January or 31 January is 2.21 days, and the standard deviation of the lengths is 1.64 days.
Using the Markov property we can calculate the expected duration of dry spells under the model and compare the observed values with the calculated ones, giving a rough check of the model:
- Let T denote the duration of a dry spell.
* To observe a dry spell there must be a wet day followed by a dry day, so we start numbering time from the first dry day following a wet day. Then, by the Markov property, the spell continues with probability p_00 on each subsequent day, so
P(T = t) = p_00^{t−1}(1 − p_00),   t = 1, 2, …,
i.e. T is Geometric with parameter 1 − p_00. With p̂_00 = 0.602 this gives
E(T) = 1/(1 − p_00) = 1/0.398 ≈ 2.51 days and sd(T) = √p_00/(1 − p_00) ≈ 1.95 days,
to be compared with the observed mean 2.21 days and standard deviation 1.64 days.
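A quick numerical check of these model values in R (names our own):

  p00 <- 186/309                  # estimated P(dry -> dry) from Table 3
  c(mean = 1/(1 - p00),           # E(T) for T ~ Geometric(1 - p00), ~2.51
    sd   = sqrt(p00)/(1 - p00))   # sd(T), ~1.95
  # compare with the observed mean 2.21 and sd 1.64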
TEST FOR A SPECIFIED P
Suppose the data are modelled by a Markov chain with unknown transition matrix P, and let P* be a specified transition matrix.
Null hypothesis:
H_0: P = P*
alternative H_1: P is unrestricted
Use the log-likelihood ratio test statistic
W = −2(l(P*) − l(P̂)),
i.e. 2 × (maximum log-likelihood − log-likelihood at the assumed null hypothesis value).
Note: large values of W discredit H_0.
If the hypothesis is true, the distribution of W is approximately χ² with d − s degrees of freedom, where
d = the number of entries in the transition matrix that are not fixed at zero (e.g. s² if all entries are free), and
s = the size of the state space, i.e. the number of rows in the transition matrix (since each row sums to 1, only all but one element of each row are free).
** Unless we are assuming a particular form for the transition matrix, d here will be the number of entries in the matrix, i.e. s², so the number of degrees of freedom will be s² − s.
The test statistic W is given by
W = −2(l(P*) − l(P̂)) = 2 Σ_{i,j} n_ij log[n_ij/(n_i. p*_ij)],
where p*_ij is the (i, j) entry of P*. In words, W is 2 × the sum over all pairs of states of
(# transitions from i to j) × log[(# transitions from i to j)/((# transitions out of i) × p*_ij)].
This follows because the unrestricted MLE of p_ij is p̂_ij = n_ij/n_i. (# transitions from i to j divided by the total # out of i), so that l(P̂) − l(P*) = Σ_{i,j} n_ij log(p̂_ij/p*_ij).
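A minimal R sketch of W for a given matrix of transition counts and a specified P* (the function name is our own; 0 log 0 is treated as 0, as noted in the example below):

  W_stat <- function(n, Pstar) {
    e <- rowSums(n) * Pstar                    # expected counts n_i. * p*_ij
    terms <- ifelse(n > 0, n * log(n / e), 0)  # convention: 0 log 0 = 0
    2 * sum(terms)
  }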
Example: Pseudo-random numbers
Sequence of random digits produced by a computer’s random number generator: for example
5,3,1,8,7,9,0,6,4,5,…
Are they independent, or is there a tendency for certain digits to follow others?
Given data for 200 random integers between 1 and 10, generated in R:
Under H_0 (successive digits independent, each digit equally likely to come next), every entry of the specified transition matrix P* is 1/10.
We count the numbers of transitions between all possible pairs of states in the given data set.
The MLE for each entry of the unrestricted transition matrix is found as before, e.g.
p̂_{1,1} = (# observed 1→1 transitions)/(sum of all entries in the top row of the count table) = 1/15 for these data.
(Find the MLE of the transition matrix as the number of transitions from i to j divided by the total number of transitions out of i.)
We test
H_0: the 10×10 transition matrix has every entry equal to 1/10,
against H_1: the transition matrix does not have all entries equal to 1/10.
With these data we obtain W = 98.308.
The degrees of freedom are s² − s = 100 entries − 10 determined by the row-sum constraints = 90, so under H_0 the expected value of W would be about 90.
1 - pchisq(98.308, 90) in R gives 0.25.
So there is no evidence against H_0.
* Some entries of the MLE transition matrix were 0; treat 0 log 0 as 0.
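A sketch of the whole test in R on a freshly generated sequence (W_stat is the sketch above; the particular sequence behind W = 98.308 is not reproduced here, so the value of W will differ):

  set.seed(1)                            # hypothetical data; any seed will do
  x <- sample(1:10, 200, replace = TRUE)
  n <- unclass(table(factor(head(x, -1), levels = 1:10),
                     factor(tail(x, -1), levels = 1:10)))  # 10x10 counts n_ij
  W <- W_stat(n, matrix(1/10, 10, 10))
  1 - pchisq(W, df = 10^2 - 10)          # p-value on 90 degrees of freedom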
- In the example above, if we were given the total number, n_i., of digits i seen in the data then, under H_0, we would expect to see n_i. p*_ij = n_i./10 transitions i → j. The observed number is n_ij, so W is a comparison of observed and expected frequencies.
Writing n_i. p*_ij = e_ij, we can express W as
W = 2 Σ_{i,j} n_ij log(n_ij/e_ij) = −2 Σ_{i,j} n_ij log(1 − (n_ij − e_ij)/n_ij).
This gives numerical values close to those from the Pearson X² statistic for comparing observed and expected frequencies,
X² = Σ_{i,j} (n_ij − e_ij)²/e_ij.
In fact, expanding the log term, log(n_ij/e_ij) = log(1 + (n_ij − e_ij)/e_ij) ≈ (n_ij − e_ij)/e_ij − (n_ij − e_ij)²/(2e_ij²), and summing (using Σ n_ij = Σ e_ij) shows that W ≈ X², so the two statistics are asymptotically equivalent in large samples.
The example above uses a W test to check for independence against a fully specified P*, but it can be used in the same way to test any other simple hypothesis about the transition probabilities P.
Goodness of fit: testing independence
COMPOSITE HYPOTHESIS
Earlier, we wished to test independence against the alternative of Markov dependence in a model for the Snoqualmie Wet/Dry data.
(we wanted a Pearson X² test, but the pairs of days overlapped)
We now have a general approach.
The difference from the last example is that H_0, the hypothesis of independence, is now composite: it does not specify the value of P completely, only that the rows of P are identical.
The null hypothesis is: p_ij = q_j for each i, where {q_j} is a probability distribution on the state space S;
i.e. each day is wet or dry without reference to the previous day;
i.e. the rows of P are identical, so each entry depends only on its column.
The alternative hypothesis is that P is an unrestricted transition matrix, e.g. for two states
P = [ α   1−α ]
    [ β   1−β ]
Under H_0, P is determined by q = s − 1 parameters.
Without restriction, P is determined by s(s − 1) parameters.
TEST STATISTIC for testing composite hypotheses:
W = −2(l(P̃) − l(P̂))
(we replace P*, the specific H_0 value, by P̃, the restricted MLE assuming H_0 is true).
Under H_0, W has approximately a χ² distribution on (d − s) − q = s(s − 1) − (s − 1) = (s − 1)² degrees of freedom (the difference between the numbers of free parameters under the two hypotheses).
RESTRICTED MLEs p̃_ij (under H_0):
the log-likelihood is
l = Σ_{i,j} n_ij log(p_ij) = Σ_{i,j} n_ij log(q_j) = Σ_j n_.j log(q_j),
so that p̃_ij = q̂_j = n_.j/n_..
TEST STATISTIC:
W = 2[ Σ_{i,j} n_ij log(n_ij/n_i.) − Σ_j n_.j log(n_.j/n_..) ]
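A minimal R sketch applying this to the overlapping-pair counts from Table 1 (names our own; with (s − 1)² = 1 degree of freedom the 95% cutoff is qchisq(0.95, 1) = 3.84):

  n  <- matrix(c(643, 128,
                 123, 186), nrow = 2, byrow = TRUE)  # Table 1 counts
  ni <- rowSums(n); nj <- colSums(n); nn <- sum(n)   # n_i., n_.j, n_..
  W  <- 2 * (sum(n * log(n / ni)) - sum(nj * log(nj / nn)))
  W                      # far above qchisq(0.95, 1) = 3.84
  1 - pchisq(W, df = 1)  # effectively zero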