Chapter 4 Flashcards by Shannon Quinn

s^2=?

sum from i=1 to n{(xi-xbar)^2/n}

How well did you know this?

Not at all

Perfectly

Why is inference not straight-forward in non-conjugate problems? Why are non-conjugate priors then used?(2,2)

not using a conjugate prior distribution can cause many basic
problems such as plotting the posterior density or determining posterior moments. But
having to use conjugate priors is far too restrictive for many real data analyses: (i) our
prior beliefs may not be captured using a conjugate prior; (ii) most models for complex
data do not have conjugate priors.

How well did you know this?

Not at all

Perfectly

Lags and correlations in MCMC.(2)

We know that the simulator rnorm() produces independent realisations and so the
(sample) correlation between say consecutive values corr(xi,xi+1) will be almost zero.
This is also the case for correlations at all positive lags. Finally the lag 0 autocorrelation
corr(xi,xi) must be one (by definition).

How well did you know this?

Not at all

Perfectly

What is key to MCMC. What is this?(2)

This alternate sampling from conditional distributions defines a bivariate Markov chain, and the above is an intuitive explanation for why f (x,y) is its stationary distribution. Thus being able to simulate easily from conditional distributions is key to this methodology.

Further: We have already seen that dealing with conditional posterior distributions is straightfor-
ward when the prior is semi-conjugate, so let’s assume that simulating from f (y|x) and f (x|y) is straightforward. The key problem with using either of the above methods is that, in general, we can’t simulate from the marginal distribution, f (x) and f (y).
For the moment, suppose we can simulate from the marginal distribution for X, that is, we have an X = x from f (x). We can now simulate a Y = y from f (y|x) to give a pair
(x,y) from the bivariate density. Given that this pair is from the bivariate density, the y value must be from the marginal f (y), and so we can simulate an X = x′ from f (x|y) to give a new pair (x′,y) also from the joint density. But now x′ is from the marginal f (x), and so we can simulate a Y = y′ from f (y|X = x′) to give a new pair (x′,y′) also from
the joint density. And we can keep going.

How well did you know this?

Not at all

Perfectly

Outline the Gibbs sampler.(4)

Suppose we want to generate realisations from the posterior density π(θ|x), where θ =(θ1,θ2,…,θp)^T, and that we can simulate from the full conditional distributions (FCDs) π(θi|θ1,…,θi−1,θi+1,…,θp,x) = π(θi|·), i = 1,2,…,p.
The Gibbs sampler follows the following algorithm:
1. Initialise the iteration counter to j = 1.
Initialise the state of the chain to θ(0) = (θ(0)
1 ,…,θ(0)p )^T.
2. Obtain a new value θ(j) from θ(j−1) by successive generation of values
θ(j)1 ∼π(θ1|θ2(j−1),θ(j−1)3 ,…,θ(j−1)p ,x)
θ(j)2 ∼π(θ2|θ(j)1 ,θ(j−1)3 ,…,θ(j−1)p ,x)
… … …
θ(j)p ∼π(θp|θ(j)1 ,θ(j)2 ,…,θ(j)p−1,x)
3. Change counter j to j + 1, and return to step 2.

How well did you know this?

Not at all

Perfectly

What is the burn-in period and how would you determine one?(2)

How long it takes before simulations appear to be from the same distribution
The most effective method is simply to look at a trace plot of the posterior sample and detect the point after which the realisations look to be from the same distribution.

How well did you know this?

Not at all

Perfectly

Two major issues that arise with Gibbs sample.(2)

Convergence thus need to determine burn-in

Autocorrelation hence may require thinning

How well did you know this?

Not at all

Perfectly

What are two major issues that arise with Gibbs sample output?Provide a strategy to ensure sampling really is from the stationary distribution(2,3)

Convergence thus need to determine burn-in
Autocorrelation hence may require thinning.
Strategy
1. Determine the burn-in period, where Gibbs sampler has reached its stationary distribution. This may involve thinning the posterior sample as slowly snaking trace plots may be due to high autocorrelations rather than a lack of convergence.
2. After this, determine the level of thinning needed to obtain a posterior sample whose autocorrelations are roughly zero.
3. Repeat steps 1 and 2 several times using different initial values to make sure that the sample really is from the stationary distribution of the chain, that is, from the
posterior distribution.

How well did you know this?

Not at all

Perfectly

What is thinning and how would you determine this? What happens if you thin too much?(3)

Thinning here means not taking every realisation by say taking say every mth realisation.
In general, an appropriate level of thinning is determined by the largest lag m at which any of the variables have a non-negligible autocorrelation.
If doing this leaves a (thinned) posterior sample which is too small then the original Gibbs sampler should be re-run (after convergence) for a sufficiently large number of iterations until the thinned sample is of the required size.

How well did you know this?

Not at all

Perfectly

Accuracy of mu-bar in un-autocorrelated MCMC output for moving average process?Wb for autoregressive process?(3)

This means that the moving average(MA) model does not uses the past forecasts to predict the future values whereas it uses the errors from the past forecasts. While, the autoregressive model(AR) uses the past forecasts to predict future values.

Note r(k) = Corr(μ(j),μ(j+k))

If the MCMC output is un-autocorrelated then the accuracy of ̄μ is roughly ±2sμ/√N.
the accuracy of ̄μ is roughly ±2sμ/√N{1 −r(1)}2-due to autocorrelation. the amount of information in the data is equivalent to a random sample
with size Nef f = N{1 −r(1)}^2 more complicated for high order autocorrelations hence use of thinning
It’s worth noting that, in general, MCMC
output with positive autocorrelations has Nef f < N. Also sometimes MCMC output with some negative autocorrelations can have Nef f > N.

How well did you know this?

Not at all

Perfectly

The asymptotic posterior distribution about the mean and precision
(µ, τ)
T using a random sample from a normal N(µ, 1/τ) distribution is…

µ|x ∼ N(¯x, s^2/n), τ|x ∼ N{1/s^2, 2/(ns^4)}, independently

How well did you know this?

Not at all

Perfectly

Posterior mean and sd distributions for MCMC sample.(2)

M∼N(¯µ, s^2µ/N), Σ^−2∼ N{1/s^2µ, 2/(Ns^4µ)}, independently.

How well did you know this?

Not at all

Perfectly

Approx 95% HDI for M. What about sigma?(2)

µ¯ ± (z0.025sµ)/√N~ µ¯ ±(2sµ)/√N
sµ ± sµ*sqrt(2/N).

Fairly accurate even for non-normal looking distributions, using large enough N for MCMC gives asymptotic properties hence why these are the results.

How well did you know this?

Not at all

Perfectly

What is semi-conjugacy?(1)

Notice that, since µ and τ independent a priori, µ|τ ∼ N(b, 1/c). Therefore, given τ, the normal prior for µ is conjugate. Similarly, τ|µ ∼ Ga(g, h) and so, given µ, the
gamma prior for τ is conjugate. Therefore, both conditional priors (for µ|τ and τ|µ) are
conjugate. Such priors are called semi-conjugate.

How well did you know this?

Not at all

Perfectly

What can you use a converged and thinned MCMC sample to do?(4)

Obtain the posterior distribution for any (joint) functions of the parameters, such as σ = 1/√τ or (θ1=µ−τ, θ2=e^(µ+τ/2)^T
Look at bivariate posterior distributions via scatter plots
Look at univariate marginal posterior distributions via histograms or boxplots
Obtain numerical summaries such as the mean, standard deviation and confidence intervals for single variables and correlations between variables.

How well did you know this?

Not at all

Perfectly

Which MCMC would you use for non-standard distributions?(1)

Study These Flashcards

Metropolis-Hastings.

Gibbs can only be used if the full conditional distributions(FCDs) are standard.

Describe the metropolis Hastings algorithm.(5)

Study These Flashcards

Suppose we want to simulate realisations from the posterior density π(θ|x) and all of the
FCDs are non-standard. Suppose further that we have a proposal distribution with density
q(θ∗|θ), which is easy to simulate from. This distribution gives us a way of proposing new values θ∗ from the current value θ.
Consider the following algorithm:
1. Initialise the iteration counter to j = 1, and initialise the chain to θ(0).
2. Generate a proposed value θ∗using the proposal distribution q(θ∗|θ^(j−1)).
3. Evaluate the acceptance probability α(θ
^(j−1), θ∗) of the proposed move
4. Set θ^(j) = θ∗ with probability α(θ^(j−1), θ∗), and set theta^(j) = θ^(j−1) otherwise.
5. Change the counter from j to j + 1 and return to step 2.

Commonly used symmetric chains proposal distributions for M-H sampling.(3)

Study These Flashcards

The simplest case is the Metropolis sampler and uses a symmetric proposal distribution, that is, one with
q(θ∗|θ) = q(θ|θ∗), ∀ θ, θ∗.

random walk proposal
normal random walk proposal
uniform random walk proposal

What are commonly used independence chain proposal distributions for M-H sampling.(1)

Study These Flashcards

In this case, the proposal is formed independently of the position of the chain, and so
q(θ∗|θ) =f(θ∗) for some density f (·).

What are hybrid methods? Give two examples.(2)

Study These Flashcards

Componentwise transitions

- Metropolis within Gibbs

What are componentwise transitions?(1)

Study These Flashcards

Note: need to memorise
Given a posterior distribution with full conditional distributions that are awkward to sample from directly, we can define a Metropolis-Hastings scheme for each full conditional distribution, and apply them to each component in turn for each iteration. This is like
the Gibbs sampler, but each component update is a Metropolis-Hastings update, rather than a direct simulation from the full conditional distribution. Each of these steps will require its own proposal distribution.

What is Metropolis with Gibbs?(1)

Study These Flashcards

Note: need to memorise
Given a posterior distribution with full conditional distributions, some of which may be simulated from directly, and others which have Metropolis-Hastings updating schemes, the Metropolis within Gibbs algorithm goes through each in turn, and simulates directly from the full conditional, or carries out a Metropolis-Hastings update as necessary. This algorithm is, in fact, just the componentwise algorithm but uses the full conditional distributions as the proposal distributions when they are easy to simulate from.

What is a random walk proposal?(1)

Study These Flashcards

Proposed value theta* depends on current theta (theta=theta+w, where w is a random px2 vector from thr zero mean density f(.), which is symmetric about the mean and independent of the state of the chain)
We simulate proposal values through innovation w, setting theta=theta+w
Obvious choices are uniform or normal distributions.

What does it mean if the chain is too cold? What about hot?(1)

Study These Flashcards

Random walk proposal if innovation w variance is too low most proposed values will be accepted but will move very slowly around the space-too cold
If the variance of the innovation is too large not many will be accepted but when they are will correspond to large moves (optimal acceptance rate 0.234-between 20-30% okay, needs to be tuned)

What does a symmetric normal random walk proposal take the form of? What would tuning involve?(1)

theta*|theta~N(theta,k^2) for some k>0 innovation Symmetric as theta*=theta+w where w~N(0,k^2) If symmetric can drop proposal distribution for acceptance probability. Choosing a value for the cov matrix, if posterior is normally distributed (as is the case for latge samples the optimal has been 2.38^2/p*Var(theta|x) however in practise Var(theta) can be used from a first MCMC run and if this does not converge quickly use it to inform next choice for Var(theta|x) estimation.

What does a symmetric uniform random walk proposal take the form of?(1)

theta*|theta~U(theta-a,theta+a) for some innovation a>0 this is symmetric as theta*=theta+w where w~U(-a,a) density symmetric about 0. Also proposal ratio: q(theta,theta*)/q(theta*/theta)=1/(2a)/1(2a)=1

Chapter 4 Flashcards

(26 cards)