4. Generalised linear models, Maximum likelihood Flashcards
Show the probability density, likelihood function, log-likelihood, maximum likelihood estimate µ̂ of µ, and MLE θ̂ of θ = T(µ):
And explain how they are related / work together.
The maximum likelihood estimate µ̂ is the parameter vector that maximises the likelihood L_x(µ) = f_µ(x) (equivalently, the log-likelihood) for the observed data x; the MLE of θ = T(µ) is then the plug-in estimate θ̂ = T(µ̂).
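One conventional way to write these objects down and how they chain together (x is the observed data, µ the unknown parameter vector, as in the card above):

```latex
\begin{aligned}
\text{probability density:}\quad & f_\mu(x) && \text{function of } x \text{ for a fixed parameter } \mu\\
\text{likelihood:}\quad & L_x(\mu) = f_\mu(x) && \text{same formula, read as a function of } \mu \text{ for the fixed observed } x\\
\text{log-likelihood:}\quad & \ell_x(\mu) = \log L_x(\mu) && \\
\text{MLE of } \mu\text{:}\quad & \hat\mu = \arg\max_\mu \ell_x(\mu) && \text{(equivalently } \arg\max_\mu L_x(\mu)\text{)}\\
\text{MLE of } \theta = T(\mu)\text{:}\quad & \hat\theta = T(\hat\mu) && \text{(plug-in estimate)}
\end{aligned}
```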
What do we use a likelihood function for?
We are looking for a way to estimate the parameters of the probability density function.
The likelihood function is the same formula as the probability density, but with the roles reversed: we don't know the parameters, we know the data, so we read it as a function of the parameters with the data held fixed.
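A minimal numerical illustration of that switch of viewpoint, using a binomial coin-flip model (the model and the numbers are purely illustrative):

```python
import numpy as np
from scipy.stats import binom

n = 10   # number of coin flips
k = 7    # observed number of heads

# Density view: fix the parameter p and look at the probability of each possible outcome x.
p_fixed = 0.5
density_over_x = binom.pmf(np.arange(n + 1), n, p_fixed)
# Over all possible data x the density sums to 1; the likelihood over p does not have to.
print(density_over_x.sum())

# Likelihood view: fix the observed data k and evaluate the same formula over candidate p.
p_grid = np.linspace(0.01, 0.99, 99)
likelihood_over_p = binom.pmf(k, n, p_grid)

# The likelihood is maximised near p = k/n = 0.7, the MLE.
print(p_grid[np.argmax(likelihood_over_p)])
```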
Why do we have a log likelihood?
It is much easier to work with logs: the likelihood of an i.i.d. sample is a product of densities, and the log turns that product into a sum, which is much nicer to differentiate (and more stable numerically).
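A small sketch of that point for a normal model with known variance (the data and model choice are illustrative): the log-likelihood is a plain sum over observations, and its maximiser matches the closed-form MLE, the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)   # illustrative sample

def log_likelihood(mu, x, sigma=1.0):
    # The product of normal densities becomes a sum of log-densities.
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2 * sigma**2))

mu_grid = np.linspace(0.0, 4.0, 401)
ll = [log_likelihood(m, x) for m in mu_grid]

# For this model the MLE of mu is the sample mean; the grid maximiser agrees.
print(mu_grid[np.argmax(ll)], x.mean())
```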
What are the advantages of MLE?
- Automatic estimate without further statistical assumptions.
- Excellent frequentist properties: nearly unbiased in large samples.
- Bayesian justification: the likelihood is the data's entire contribution to Bayes' rule,
so with a flat prior the posterior is proportional to the likelihood and the MLE is the posterior mode (shown below).
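To show the Bayesian justification concretely (a standard argument; g denotes the prior density):

```latex
g(\mu \mid x) \;=\; \frac{g(\mu)\,L_x(\mu)}{\int g(\mu')\,L_x(\mu')\,d\mu'}
\;\propto\; g(\mu)\,L_x(\mu),
\qquad
\text{flat prior } g(\mu) \equiv c
\;\Rightarrow\;
g(\mu \mid x) \propto L_x(\mu)
\;\Rightarrow\;
\arg\max_\mu g(\mu \mid x) \;=\; \arg\max_\mu L_x(\mu) \;=\; \hat\mu .
```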
How can we measure how good the MLE is?
We can use the Fisher information. It tells us how much information the data carry about the parameter, and hence how precise (low-variance) we can expect the MLE to be.
What are the downsides of MLE?
MLE estimates can be extremely off if estimated on little data (see the small simulation sketch after this list).
▶ The MLE of the Bernoulli parameter of a coin flip based on a sample of one
will always be exactly 0 or 1.
▶ With large numbers of parameters, θ̂ = T(µ̂) may be off
even if each component of µ̂ is well estimated.
▶ Next week: James-Stein estimator
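A small simulation sketch of the first point (the true parameter and sample sizes are arbitrary): with a single flip the Bernoulli MLE can only be 0 or 1, no matter what the true p is.

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = 0.6

# The Bernoulli MLE is the sample mean; with one flip it is always exactly 0 or 1.
single_flip_mles = [rng.binomial(1, true_p, size=1).mean() for _ in range(10)]
print(single_flip_mles)   # only 0.0s and 1.0s, never anything near 0.6

# With more data the MLE concentrates around the true parameter.
print(rng.binomial(1, true_p, size=1000).mean())
```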
What is the score function?
It is the derivative of the log-likelihood function with respect to the parameters. By the chain rule, this equals the derivative of the (non-log) likelihood divided by the likelihood itself.
The score function indicates how much the log-likelihood changes if you vary the
parameter estimate by an infinitesimal amount, given data.
To find the MLE we set the score function to 0 and solve.
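A short worked example of that (Bernoulli model with n i.i.d. flips and k observed heads; a standard computation):

```latex
\ell_x(p) = k\log p + (n-k)\log(1-p),
\qquad
\dot\ell_x(p) = \frac{k}{p} - \frac{n-k}{1-p},
\qquad
\dot\ell_x(\hat p) = 0 \;\Rightarrow\; \hat p = \frac{k}{n}.
```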
What does the Fisher information tell us?
It is the expectation of the squared derivative of the log-likelihood (the squared score). If the Fisher information is high, the log-likelihood is sharply peaked around the MLE, so the estimate is pinned down tightly by the data and has small variance; repeating the experiment with similar data would give similar estimates. If the Fisher information is low, the log-likelihood is nearly flat and many parameter values fit the data about equally well.
What can we use the Fisher information for?
We can use it to gain some insight into the accuracy of the MLE:
in a large sample the MLE is approximately normally distributed with variance 1/(Fisher information).
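A quick simulation sketch of that large-sample claim for the Bernoulli model, where the Fisher information of a single observation is 1/(p(1-p)), so n flips give an approximate MLE variance of p(1-p)/n (all the numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
true_p, n, reps = 0.3, 200, 5000

# The MLE of p in each replicated experiment is the proportion of heads.
mles = rng.binomial(n, true_p, size=reps) / n

fisher_info_per_obs = 1.0 / (true_p * (1.0 - true_p))
predicted_var = 1.0 / (n * fisher_info_per_obs)   # = p(1-p)/n

print(mles.var(), predicted_var)   # the two variances should be close
```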
What is the conditionality principle?
Inference should be based on the experiment that was actually performed, conditioning on ancillary statistics, rather than averaging over experiments that could have been performed but were not.
What is an ancillary statistic?
A statistic that contains “no direct information by itself”, but describes the experiment
that was performed.
▶ Sample size
▶ Marginals of a contingency table
What is the observed Fisher information?
Fisher preferred the observed Fisher information to the (expected) Fisher information, because the latter involves an expectation over hypothetical data; he would rather compute the expression on the data actually observed. It gives a better, more case-specific idea of the accuracy of the estimate (this is still somewhat debated).
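In symbols (one-parameter case, under the usual regularity conditions), the contrast is:

```latex
\mathcal{I}(\mu) \;=\; E_\mu\!\big[\dot\ell_x(\mu)^2\big] \;=\; -\,E_\mu\!\big[\ddot\ell_x(\mu)\big]
\quad\text{(expected Fisher information)},
\qquad
I(\hat\mu) \;=\; -\,\ddot\ell_x(\hat\mu)
\quad\text{(observed Fisher information, computed on the observed data } x\text{)}.
```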
What is randomisation?
In an experiment (trial) comparing two treatments A and B,
participants should be randomly assigned to either treatment A or treatment B.
▶ Participants may have any number of confounding traits favouring a positive or
negative outcome, regardless of the treatment. We can’t control for all of them.
▶ By assigning participants randomly, the effects of the confounders should even out.
▶ This enables us to conclude that any observed effect is in fact
due to the variables we’re testing.
▶ “Forced frequentism, with the statistician imposing
his or her preferred probability mechanism upon the data.”
The problem is that you need large and expensive studies. Even so, this is the gold standard in medicine.
What is permutation?
▶ Much of Fisher’s methodology depends on normal sampling assumptions.
▶ Permutation testing is a non-parametric alternative.
▶ To test significance in a two-sample comparison (see the sketch after this list):
▶ Pool all items in the two samples.
▶ Randomly partition them into two parts and compute test statistic
(e.g., difference of means).
▶ Construct empirical distribution of test statistic.
▶ Very similar to the bootstrap.
▶ Application: Testing performance of NLP systems by BLEU score.
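A minimal sketch of the procedure just described, using the difference of means as the test statistic (the sample data and the number of permutations are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two illustrative samples (e.g. per-document scores from two NLP systems).
a = rng.normal(0.30, 0.05, size=40)
b = rng.normal(0.27, 0.05, size=40)

observed = a.mean() - b.mean()

# Pool all items, then repeatedly repartition them at random
# and recompute the test statistic to build its empirical null distribution.
pooled = np.concatenate([a, b])
n_a, n_perm = len(a), 10_000
perm_stats = np.empty(n_perm)
for i in range(n_perm):
    perm = rng.permutation(pooled)
    perm_stats[i] = perm[:n_a].mean() - perm[n_a:].mean()

# Two-sided p-value: how often a random repartition is at least as extreme.
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(observed, p_value)
```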