CHAPTER 3: Useful ideas and methods for inference Flashcards

1
Q

Theorem 10 (CLT for iid variables).

A

If random variables X1, . . . , Xn are independent and
identically distributed with mean µ and variance σ^2 < ∞, then

(Σ_{i=1}^n X_i − nµ) / (σ√n) → Z ∼ N(0, 1) in distribution, as n → ∞.
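A quick simulation can make this concrete. Below is a minimal R sketch (the Exponential(1) population, sample size and replication count are illustrative assumptions, not part of the theorem):

```r
# Illustrative CLT check: standardized sums of iid Exponential(1) draws
set.seed(1)                                  # reproducibility
n <- 50                                      # observations per sum
reps <- 10000                                # number of simulated sums
mu <- 1; sigma <- 1                          # mean and sd of Exponential(1)
z <- replicate(reps, (sum(rexp(n, rate = 1)) - n * mu) / (sigma * sqrt(n)))
c(mean(z), sd(z))                            # should be close to 0 and 1
hist(z, breaks = 50, freq = FALSE)           # histogram close to the N(0,1) density
curve(dnorm(x), add = TRUE)
```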

2
Q

Theorem 11 (CLT for iid random vectors).

A

If random vectors X_1, . . . , X_n are independent and identically distributed with mean vector µ and finite variance-covariance matrix Σ, then

(Σ_{i=1}^n X_i − nµ) / √n → Z ∼ N(0, Σ) in distribution, as n → ∞.

Here the limit Z is a random vector with a multivariate Normal distribution.

3
Q

CLT notes:

A
  1. The → in these theorems denotes convergence in distribution.
  2. A quantity that converges in distribution to a Normal distribution is often said to be asymptotically Normal as n → ∞.
  3. The Central Limit Theorem remains true for dependent and/or non-identically distributed random variables/vectors under suitable conditions.
4
Q

Likelihood and inference

A

We wish to draw conclusions about an unknown parameter θ, one- or multi-dimensional, on the basis of data x and a model f_X.

The sample observations x_1, . . . , x_n = x are modelled as the values of random variables X_1, . . . , X_n = X.

The probability density function (probability function in the discrete case) f_X of X depends on the unknown parameter θ.

5
Q

Definition 12:

LIKELIHOOD

A

The likelihood of θ based on observed data x is defined to be the function
of θ:
L(θ) = L(θ; x) = f_X(x; θ).

*In the discrete case, for each θ, L(θ) gives the probability of observing the data x if θ is
the true parameter (provided f is from the correct family of distributions).

  • L(θ) serves as a measure of how plausible θ is as the value that generated the observed
    data x.
  • In the continuous case, measurements are made only to a bounded precision, and the probability density function is proportional to the probability of finding the random variable in a small interval around the observed value.
6
Q

Ratio of likelihoods

A

The ratio L(θ_1)/L(θ_2) measures how plausible θ_1 is relative to θ_2 as the value generating
the data

7
Q

maximum likelihood

A

θˆ is the most plausible value; that is, the value of θ for which
L(θˆ) = max_θ L(θ).

It is called the maximum likelihood estimate.

8
Q

Relative Likelihood

A

All values of θ for which the relative likelihood
RL(θ) = L(θ)/L(θˆ)

is not too different from 1 are plausible in the light of the observed x.

(This is the ratio L(θ_1)/L(θ_2) with θ_2 taken to be the value θˆ that maximizes the likelihood.)

9
Q

log-likelihood

A

It is often convenient to work with, and plot, the likelihood on a log scale.

log-likelihood is defined to be
l(θ) = log L(θ).

  • For independent observations the likelihood is a product of terms; taking logs turns the product into a sum.
  • Statements about relative likelihoods become statements about differences of log-likelihoods.
  • Densities involving exponentials (such as the exponential family) are easier to handle on the log scale.
10
Q

likelihood regions

A

Thus values of θ plausible in the light of the data (or consistent with the data) are those
contained in sets of the form
{θ : l(θ) > l(θˆ) − c}
for suitable constants c

*In the 1-dimensional case these sets are typically intervals (likelihood intervals).

  • The value θˆ is the maximum likelihood estimator (mle) of θ: the value within
    the parameter space – the set of permissible values of the parameter – maximizing
    L(θ). Its dependence on the data x can be made explicit by writing θˆ(x).

*For inferences about θ, only relative values of the likelihood matter, so we
can neglect constants (factors not depending on θ) and use whatever version of L or
l is convenient.

*If we re-parametrize to φ = g(θ) where g is a continuous invertible function, then the
likelihood L changes in the obvious way: if L1 denotes the likelihood with respect to φ,
then L1(φ) = L(g^−1(φ)). Also, most usefully, φˆ = g(θˆ);
that is, the maximum likelihood estimate is invariant under such transformations.
11
Q

Indep and log

A

Suppose the X_i are independent. Then

L(θ) = ∏_{i=1}^n f_{X_i}(x_i; θ)

and

l(θ) = Σ_{i=1}^n log f_{X_i}(x_i; θ),

where f_{X_i} denotes the density function of X_i.
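For instance, a minimal R sketch (the data vector x and the exponential model with mean parameter θ are illustrative assumptions) showing the log-likelihood as a sum of log densities:

```r
# Log-likelihood of the mean theta for an assumed exponential sample x
x <- c(1.2, 0.4, 3.1, 2.0, 1.7)                       # illustrative data
loglik <- function(theta) sum(dexp(x, rate = 1/theta, log = TRUE))
loglik(2)                                             # evaluate at theta = 2
```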

12
Q

likelihood equation(s)

A

θˆ may be found as the solution of the likelihood equation(s)

∂L(θ)/∂θ= 0

or equivalently,
∂l(θ)/∂θ = 0

i.e. the usual first-order condition for the maximum of the function.

13
Q

EXAMPLE:

A random sample of observations x_1, . . . , x_n from an exponential distribution with unknown mean θ > 0. (For example, we could observe a Poisson process until we have n occurrences, and let x_i be the i-th inter-occurrence time.)

A

The probability density function for each observation is
f_{Xi}(x; θ) =
{(1/θ)e^{−x/θ} x ≥ 0
{0 x < 0

so that
l(θ) = −n(log θ + x̄/θ) if min x_i ≥ 0,
and l(θ) = −∞ otherwise.

Since
∂l/∂θ = n(−1/θ + ¯x/θ^2),

the maximum likelihood estimator is ˆθ = ¯x.

Recall that the usual parametrization of the exponential distribution uses the rate parameter λ = 1/θ, so that the density is instead written
f_{X_i}(x; λ) = λ e^{−λx} for x ≥ 0, and 0 for x < 0.

If we write down the log likelihood for λ, we get
l(λ) = n(log λ − λx¯), and maximizing this
gives λˆ = 1/x¯ = 1/ˆθ as expected.

A likelihood interval
would be found in this case by finding the values of θ for which
l(ˆθ) − l(θ) =
n (¯x/θ − 1 − log(¯x/θ)) < c.

Evidently numerical or graphical solution would be needed.

Figure: plot of l(θ) based on a sample of size n = 10 with x̄ = 2.3; a skewed hump peaked at 2.3. Values of θ whose log-likelihood is within about 2 of the maximum are plausible estimates of the parameter, given the small sample.
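A minimal R sketch of this example (the simulated sample is an illustrative assumption): the mle is the sample mean, and the endpoints of a likelihood interval are found numerically from l(θˆ) − l(θ) = c.

```r
# Exponential sample: mle and a likelihood interval found numerically
set.seed(1)
x <- rexp(10, rate = 1/2.3)                  # simulated sample, true mean 2.3
n <- length(x)
xbar <- mean(x)                              # mle of theta
drop <- function(theta) n * (xbar/theta - 1 - log(xbar/theta))   # l(mle) - l(theta)
c_cut <- 1.92                                # cut-off c (see the likelihood-region card later)
lower <- uniroot(function(t) drop(t) - c_cut, c(1e-3, xbar))$root
upper <- uniroot(function(t) drop(t) - c_cut, c(xbar, 50 * xbar))$root
c(mle = xbar, lower = lower, upper = upper)
```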

14
Q

Example 9. Markov chain
We consider a two state Markov chain (Xn), as in Example 4 but with state space S = {1, 2},
with transition matrix

[1 − θ θ ]
[ φ 1 − φ]

We assume that the chain is in equilibrium, and we consider finding the likelihood for the
parameters θ = (θ, φ).

A

The stationary distribution here is (φ/(θ+φ), θ/(θ+φ)).

Imagine we observe X_0 = 2, X_1 = 1. Because we assume the chain is in equilibrium, we have

P(X_0 = 2) = θ/(θ+φ)
so

P(X_0 = 2, X_1 = 1) =
[θ/(θ + φ)]φ

Hence this expression also gives us the likelihood of (θ, φ) given our observation, and we can write
L(θ, φ; x) = θφ/(θ + φ).

Or imagine we observed the sequence of states 2, 1, 1, 2, 2, 2. Then our likelihood becomes
L(θ, φ; x) = [θ/(θ + φ)] φ (1 − θ) θ (1 − φ)(1 − φ) = θ^2 φ(1 − θ)(1 − φ)^2 / (θ + φ).

Plotting θ against φ and contouring by the value of the likelihood:

For the two-observation case the likelihood increases as θ increases and as φ increases: starting in state 2 is more probable when θ is large, and the transition 2 → 1 is more probable when φ is large. The contours resemble reciprocal curves.

For the six-state sequence the contour plot carries more information, with the maximum near θˆ ≈ 0.57 and φˆ ≈ 0.26.

(The likelihood is found by multiplying the stationary-distribution probability of the initial state by the transition probabilities of the successive states.)
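A minimal R sketch of this calculation (the helper function lik is hypothetical; the grid and the contour call are just one way to visualize the surface):

```r
# Likelihood of (theta, phi) for the observed sequence 2, 1, 1, 2, 2, 2 (chain in equilibrium)
lik <- function(theta, phi) {
  P <- matrix(c(1 - theta, theta,
                phi,       1 - phi), nrow = 2, byrow = TRUE)   # transition matrix
  stat <- c(phi, theta) / (theta + phi)                        # stationary distribution
  states <- c(2, 1, 1, 2, 2, 2)
  L <- stat[states[1]]                                         # P(X_0 = 2)
  for (i in 2:length(states)) L <- L * P[states[i - 1], states[i]]
  L
}
lik(0.57, 0.26)                                # near the maximum quoted in the notes
th <- ph <- seq(0.01, 0.99, length.out = 99)   # grid for a contour plot
contour(th, ph, outer(th, ph, Vectorize(lik)), xlab = "theta", ylab = "phi")
```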

15
Q

Approximating the log likelihood

A

Take a Taylor expansion of l(θ) about its maximum; at the maximum the first derivative vanishes.

It turns out that in many cases the log-likelihood can usefully be approximated by a quadratic function of θ, so it can be summarized by the position of the maximum and the curvature there.

16
Q

Example 10. Exponential sample continued

A

(As n grows it becomes less plausible that values of θ a fixed distance away from the maximum generated the data.)

Figure: log relative likelihood against θ for several sample sizes; the curves become narrower and more sharply peaked as n increases.

The figure shows the log relative likelihoods from samples of sizes n = 10, 20, 40 and 80 from the exponential distribution. Each sample had mean x̄ = 2.3.

Evidently as n increases
the log-likelihood becomes more peaked around its maximum. Thus it becomes less and less plausible that values of θ a fixed distance away from the maximum generated the data.

The curvature of l at ˆθ is measured by minus the second derivative −∂^2l/∂θ^2:

−∂^2l(θ)/∂θ^2=
n(2x¯/θ^3−1/θ^2)

which reduces at θ = ˆθ to n/x¯^2
increasing with n.

(The second derivative itself is negative at a peak, so minus the second derivative is positive.)

17
Q

DEF 13 observed info

A

For 1-dimensional θ the function J(θ) = −∂^2l/∂θ^2
is called the observed
information about θ in the sample.

For p-dimensional θ the observed information is a matrix with components
J(θ)_rs =
−∂^2l(θ)/(∂θ_r∂θ_s)

*In the 1-dimensional case we will usually find J(θˆ) > 0; for multi-dimensional θ the matrix J(θˆ) will usually be positive definite.

18
Q

log-likelihood approximated by a quadratic

A

For most likelihoods, not just the one in the example, it’s true that close to ˆθ the log
likelihood is well approximated by a quadratic function of θ:
l(θ) − l(ˆθ) ≈
(1/2)(θ − ˆθ)^2 ∂^2l(ˆθ)/∂θ^2

= −1/2(θ − ˆθ)^2 J(ˆθ)

This is only useful if ‘close’ includes the values of θ that are plausible. Usually, this is
increasingly true as the amount of information increases, for example as n increases in the i.i.d. case.

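A minimal R sketch comparing the exact log relative likelihood with its quadratic approximation, using the exponential example (the values n = 10 and x̄ = 2.3 are taken from the notes and are illustrative):

```r
# Exact log relative likelihood vs the quadratic approximation, exponential example
n <- 10; xbar <- 2.3
exact <- function(theta) n * (log(xbar/theta) - xbar/theta + 1)    # l(theta) - l(thetahat)
quad  <- function(theta) -0.5 * (theta - xbar)^2 * n / xbar^2      # -(1/2)(theta - mle)^2 J(mle)
curve(exact, from = 1, to = 5, xlab = "theta", ylab = "log relative likelihood")
curve(quad, add = TRUE, lty = 2)               # close agreement near the maximum
```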

19
Q

How uncertain are findings from the likelihood?

A

The mle θˆ will generally take different values for different data x.

20
Q

SAMPLING VARIABILITY

A

CONSIDER five samples, each of size 20, from the exponential distribution with mean θ = 2.3

Sampling variability is addressed by thinking of θˆ as a random variable θˆ(X) (a function of the random vector X).

Let θ_0 denote the value of θ in the distribution from which the X_i were generated (the true value of θ), and denote the maximum likelihood estimate of θ based on a random sample of size n by

θˆ_n = θˆ(X_1, . . . , X_n).

Under repeated sampling θˆ differs from θ_0 by an amount which for large n is Normally distributed. Moreover, we can find its variance from the log-likelihood.

21
Q

DEF 14 expected information funct/matrix

A

For 1-dimensional θ define the expected information function about θ by
I(θ) = −E(∂^2 l(θ; X)/∂θ^2),

and for vector θ correspondingly define the expected information matrix as the matrix with components
I(θ)_rs = −E(∂^2 l(θ; X)/(∂θ_r ∂θ_s)).

The expectations here are with respect to the variation in X


22
Q

Key Fact 1 (Asymptotic Normality of mles in the iid case)

A

for θ of dimension p ≥ 1 we have the following result:

In the random sample case, under mild conditions, as sample size n → ∞,
I(θ_0)^{1/2}(θˆ_n(X) − θ_0) → N_p(0, 1_p)
in distribution, where
N_p(0, 1_p) denotes the multivariate Normal distribution with covariance matrix the p-dimensional unit matrix 1_p.

i.e. for large n, approximately
θˆ ∼ N_p(θ_0, I(θ_0)^{−1})
(written with a dot over the ∼ to denote "approximately distributed as").

*Variants: θˆ ∼ N_p(θ_0, I(θˆ)^{−1}) and θˆ ∼ N_p(θ_0, J(θˆ)^{−1}), approximately.

These follow from continuity of I or J and the fact that θˆ approximates θ_0 more and more closely as n increases. They are useful since θˆ replaces the unknown θ_0 in the (co)variances, simplifying calculations. For example, in the 1-dimensional case this is just
θˆ − θ_0 ∼ N(0, 1/J(θˆ)), approximately,

giving an approximate 95% confidence interval.

23
Q

95% CI

A

the approximate 95% confidence interval for θ_0 is

(θˆ − 1.96 √(1/J(θˆ)), θˆ + 1.96 √(1/J(θˆ)))

There is some evidence that this interval based on observed information has better coverage properties than the corresponding interval based on expected information I.

24
Q

Example 11.

A
Exponential sample continued
From Examples 8 and 10, we have
∂l/∂θ = n(−1/θ + ¯x/θ^2)
and
−∂^2 l(θ)/∂θ^2 = n(2x̄/θ^3 − 1/θ^2).

Hence J(θ) = n(2x̄/θ^3 − 1/θ^2).

Since the expected value of X̄ is also θ, we have
I(θ) = n(2θ/θ^3 − 1/θ^2) = n/θ^2,

and as ˆθ = ¯x both
I(ˆθ) and J(ˆθ) are equal to n/(x¯^2)

Hence an approximate 95% confidence interval for θ is
(x̄ − 1.96 x̄/√n, x̄ + 1.96 x̄/√n).
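A minimal R sketch of this interval (the simulated sample is an illustrative assumption):

```r
# Approximate 95% confidence interval for the exponential mean, based on J(mle) = n / xbar^2
set.seed(2)
x <- rexp(40, rate = 1/2.3)                   # simulated data, true mean 2.3
n <- length(x)
xbar <- mean(x)                               # mle of theta
se <- xbar / sqrt(n)                          # sqrt(1 / J(mle))
c(lower = xbar - 1.96 * se, upper = xbar + 1.96 * se)
```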

25
Q

Likelihood Ratio Tests

A

(a) Likelihood Ratio Test for a simple null hypothesis

Likelihood Ratio Test for the Simple Hypothesis H0

(b) Generalized Likelihood Ratio Test

26
Q

(a) Likelihood Ratio Test for a simple null hypothesis

A

Suppose θ∗ is a specific value and we wish to test the hypothesis H_0 : θ_0 = θ∗ against the alternative H_1: θ_0 ≠ θ∗.

The relative likelihood
RL(θ∗) = L(θ∗)/L(θˆ)
is useful for this, because:

RL(θ∗) small suggests evidence against H_0

RL(θ∗) close to 1 suggests H_0 plausible

Thus, since
log RL(θ∗) = l(θ∗) − l(θˆ):

l(θ∗) − l(θˆ) well below 0 suggests evidence against H_0;

l(θ∗) − l(θˆ) close to 0 suggests H_0 is plausible.

Equivalently, since we prefer to work with positive values, define W = −2(l(θ∗) − l(θˆ)). Then:

W well above 0 suggests evidence against H_0;

W close to 0 suggests H_0 is plausible.

27
Q

Key Fact 2 (Wilks’ Theorem I: Asymptotic χ2 distribution of W).

A

In the random sample case, when the true value of θ is θ∗ (i.e. H_0 is true),
W = −2(l(θ∗) − l(θˆ_n)) = −2 log RL(θ∗) → χ^2_p

in distribution as n → ∞, where p is the dimension of θ.

  • This tells us that when H_0 is true, the probability of observing a value of W larger than a particular w is
    p_obs = P(W ≥ w | H_0) ≈ P(χ^2_p ≥ w).

E.g. p = 2 for a two-state Markov chain with parameters θ = (α, β), or for a Normal distribution with unknown mean and variance.

28
Q

Likelihood Ratio Test for the Simple Hypothesis H0,

A

Likelihood Ratio Test for the Simple Hypothesis H_0:

  1. From the data calculate the observed value w of the test statistic W.
  2. Find (from χ^2 tables or via a computer) the probability p_obs = P(χ^2_p ≥ w).
  3. Interpret p_obs as a measure of the weight of evidence in the data against H_0, in the sense that the smaller p_obs, the more surprising the observed data would be if H_0 were true (and therefore the stronger the evidence against H_0).
    * E.g. we specify the values of α and β in a vector, i.e. we check one specific value θ∗.
    * A small p-value means stronger evidence AGAINST H_0. (See the R sketch below.)
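A minimal R sketch of the procedure for the exponential example (the sample and the null value θ∗ = 2 are illustrative assumptions):

```r
# Likelihood ratio test of the simple hypothesis H0: theta = theta_star, exponential sample
set.seed(3)
x <- rexp(25, rate = 1/2.3)
n <- length(x)
xbar <- mean(x)
theta_star <- 2                                        # hypothesised value (illustrative)
loglik <- function(theta) -n * (log(theta) + xbar/theta)
w <- -2 * (loglik(theta_star) - loglik(xbar))          # observed value of W
p_obs <- pchisq(w, df = 1, lower.tail = FALSE)         # P(chi^2_1 >= w)
c(W = w, p_obs = p_obs)
```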
29
Q

key fact 2: likelihood regions

A

Key Fact 2 also gives us a way to choose the constant c in likelihood regions

{θ_0 : l(θ_0) > l(θˆ) − c}.

If we choose 2c = χ^2_{p, 0.95}, then
P(l(θ_0) > l(θˆ) − c) = P(l(θˆ) − l(θ_0) < c) ≈ P(χ^2_p < 2c) = 0.95,

when θ_0 is the true parameter value, so the likelihood interval with this c is an approximate 95% confidence interval. For example, when p = 1 we have χ^2_{1, 0.95} = 3.84, so we can choose c = 1.92, sometimes approximated as c = 2.
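For example, the cut-off for p = 1 can be checked directly in R:

```r
# Cut-off c for an approximate 95% likelihood region when p = 1
qchisq(0.95, df = 1) / 2     # 3.84 / 2 = 1.92, often rounded up to 2
```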

30
Q

(b) Generalized Likelihood Ratio Test

A

We might want to test, for example, whether α and β are equal: many different parameter values satisfy such a hypothesis (rather than one specific value), i.e. the hypothesis specifies a set of values that θ might lie in.

Suppose that θ is p-dimensional with values in a set Θ ⊆ R^p, and suppose that we wish to test a null hypothesis H_0 that the true value θ_0 belongs to a subspace Θ_0 of Θ, where Θ_0 is q-dimensional with q < p.

The alternative hypothesis H_1 is that θ_0 ∈ Θ \ Θ_0.

E.g. for α = β we have p = 2 and q = 1, since determining one parameter determines the other.

31
Q

(b) Generalized Likelihood Ratio Test

TEST STATISTIC

A
Consider the statistic

GLR = max_{θ ∈ Θ_0} RL(θ) = [max_{θ ∈ Θ_0} L(θ)] / [max_{θ ∈ Θ} L(θ)] = L(θ̃)/L(θˆ),

where θˆ is the usual mle (the global mle) and θ̃ is the value of θ maximizing L within Θ_0 (the restricted mle).

*GLR is the largest value that the relative likelihood can take for θ within Θ_0 (i.e. over values of θ that satisfy our hypothesis).

  • Denominator: the likelihood maximized over unconstrained θ.
    Numerator: the likelihood maximized over θ satisfying the hypothesis Θ_0, e.g. α = β.

*Note that when Θ_0 is a single point, −2 log GLR reduces to the W used in part (a) above.

32
Q

VALUES OF GLR

A
  • If H_0 is true, then the global maximum of L is likely to occur close to Θ_0, so θ̃ and θˆ are likely to nearly coincide and GLR to take a value close to 1.
  • On the other hand, if H_1 is true, then the maximum of L within Θ_0 is likely to be considerably less than the global maximum, so GLR will tend to be considerably smaller than 1.

This suggests that a test statistic for H_0 could be based on GLR.

−2 log GLR reduces to the W above in the special case where Θ_0 is a single point.

33
Q

Key Fact 3 (Wilks’ Theorem II: Asymptotic χ2 distribution of W).

A

In the random sample case, when the true value of the parameter satisfies θ_0 ∈ Θ_0,
W = −2(l(θ̃) − l(θˆ)) = −2 log GLR → χ^2_{p−q}

in distribution as n → ∞,

where p is the dimension of Θ, the full parameter space, and q is the dimension of the restricted parameter space Θ_0.

(p − q, the difference in dimension between Θ and Θ_0, gives the degrees of freedom of the χ^2, i.e. the number of constraints imposed by restricting to Θ_0.)

34
Q

Generalized Likelihood Ratio Test for the Composite Hypothesis H0,

A
  1. From the data calculate the observed value w of the test statistic W = −2 log GLR.
  2. Find (from χ^2 tables or via a computer) the p-value P(χ^2_{p−q} ≥ w).
  3. Interpret the p-value as a measure of the weight of evidence in the data against H_0, in the sense that the smaller the p-value, the stronger the evidence against H_0.
    * E.g. the restriction α = β costs one degree of freedom, reducing the dimension from p = 2 to q = 1.
    * When Θ_0 is a single point then q = 0 and the test reduces to the version in (a).
35
Q

EXAMPLE 12: Two exponential samples

Imagine we have two samples from exponential distributions with possibly different means: X1,X2,…,Xn from an exponential distribution with mean φ and Y1,Y2,…,Yn from an exponential distribution with mean ψ. We are interested in whether φ = ψ;

A

θ = (φ,ψ),

the null hypothesis that
φ = ψ
can be written as
θ ∈ Θ0, where Θ_0 is the line ψ = φ.

The log likelihood given data x and y is
l(φ, ψ; x, y) = −n(log φ + log ψ + x̄/φ + ȳ/ψ).

(By independence of the two samples we can add the two log-likelihoods together.)

It is easy to see that θˆ = (x̄, ȳ) (obtained by setting φ = x̄ and ψ = ȳ).

Thus
l(θˆ) = −n(log x̄ + log ȳ + 2)
(the log-likelihood at its maximum).

The restricted mle assuming φ = ψ is
θ̃ = ((x̄ + ȳ)/2, (x̄ + ȳ)/2)

(the pooled mean is used for both components of θ̃ because the two sample sizes are equal; θˆ uses the individual sample means).

Substituting this common value for both φ and ψ gives
l(θ̃) = −n(2 log((x̄ + ȳ)/2) + 2x̄/(x̄ + ȳ) + 2ȳ/(x̄ + ȳ))
(the log-likelihood at the restricted mle).

So we have

W = −2(l(θ̃) − l(θˆ)) = 2n(2 log((x̄ + ȳ)/2) − log x̄ − log ȳ).

With n = 10, x̄ = 2.3 and ȳ = 1.7, we get W = 0.455, which is to be compared with χ^2_1.
Since pchisq(0.455, 1) in R gives 0.5, the p-value P(χ^2_1 ≥ 0.455) is also 0.5, so there is no evidence in this case against H_0.
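A minimal R sketch reproducing this calculation from the quoted summary values:

```r
# Generalized likelihood ratio test of phi = psi for two exponential samples
n <- 10; xbar <- 2.3; ybar <- 1.7                      # summary values from the example
W <- 2 * n * (2 * log((xbar + ybar) / 2) - log(xbar) - log(ybar))
p_value <- pchisq(W, df = 1, lower.tail = FALSE)       # compare with chi^2_1
c(W = W, p_value = p_value)                            # W is about 0.455, p-value about 0.5
```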
36
Q

Estimated standard error

A

The estimated standard error (ese) of an estimator is the estimated standard deviation of its sampling distribution, typically of the form
ese = sqrt(estimated variance / n).

E.g. for a sample proportion p̂, ese = sqrt(p̂(1 − p̂)/n).

For example, 227 successes out of n = 1000 trials gives p̂ = 0.227.
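A minimal R sketch for the proportion case quoted above (assuming 227 refers to the number of successes in 1000 trials):

```r
# Estimated standard error of a sample proportion
n <- 1000
p_hat <- 227 / n
ese <- sqrt(p_hat * (1 - p_hat) / n)
c(p_hat = p_hat, ese = ese)                   # ese is about 0.013
```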

37
Q

A random sample of observations x_1, . . . , x_n from an exponential distribution with unknown mean θ > 0.

A

The probability density function for each observation is
f_{Xi}(x; θ) =
{(1/θ)e^{−x/θ} x ≥ 0
{0 x < 0