Topic 2: Maximum-likelihood estimation, exponential families and generalised linear models Flashcards
Describe Likelihood
Likelihood: the data x are held fixed and the parameter µ is varied; the likelihood L_x(µ) = f_µ(x) is the density formula viewed as a function of µ.
Use case: estimating model parameters.
Differences (likelihood vs. density) here:
https://docs.google.com/document/d/1fpirIac-cSF-w1Z-1xi0UEVf9y86SHFUZtsMj75snJ4/edit?tab=t.0
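A minimal sketch of this idea (my own illustration, assuming a Normal(µ, 1) model; the data values are made up): the data stay fixed while the likelihood is evaluated across candidate values of µ.
```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 3.8, 4.9, 5.5])        # fixed observed data
mu_grid = np.linspace(2, 8, 601)                # candidate parameter values

# Likelihood L_x(mu) = product of densities f_mu(x_i), one value per candidate mu
likelihood = np.array([norm.pdf(x, loc=mu, scale=1.0).prod() for mu in mu_grid])

print("mu with highest likelihood:", mu_grid[likelihood.argmax()])  # close to x.mean()
```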
Describe maximum likelihood estimation
The MLE (maximum likelihood estimate) is the value of µ in the parameter space Ω that maximises the log-likelihood function l_x(µ). We work with the log-likelihood because it is easier to maximise and has the same maximiser as the likelihood itself.
hat µ = argmax_{µ ∈ Ω} l_x(µ),   where l_x(µ) = log f_µ(x)
https://docs.google.com/document/d/19IRKCg6HzfxRyHde8y36wg64piSMiMNNaDxv-jDk7mU/edit?tab=t.0
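A minimal numerical-MLE sketch (my own example, assuming a Normal(µ, σ) model; not taken from the linked notes): maximise the log-likelihood by minimising its negative.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([4.2, 5.1, 3.8, 4.9, 5.5])        # observed data

def neg_log_likelihood(params):
    mu, log_sigma = params                      # log-parametrise sigma to keep it positive
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                        # agree with x.mean() and x.std(ddof=0)
```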
Describe the advantages and disadvantages of MLE
Advantages of MLE:
- It gives an estimate automatically, without further statistical assumptions
- It has excellent frequentist properties: in large samples the estimate is nearly unbiased (on average close to the true parameter value)
- It can still be given a Bayesian justification
https://docs.google.com/document/d/1jzW0lrx0r_XYJVHobjtU2kqwUsgsnqfvK5dxrYR2TZI/edit?tab=t.0
Disadvantages
- MLE estimates can be badly off when based on little data
- With large numbers of parameters, hat θ = T(hat µ) may be off even if each component of hat µ is well estimated
Describe Fisher information
It is defined to be the variance of the score function.
https://docs.google.com/document/d/1lRA0qLss_qqWXMDJhTpjgoauScTDoFTUhS_Kq0mInL0/edit?tab=t.0
The larger the Fisher information, the smaller the variance of the MLE.
- If the Fisher information is high, the MLE hat θ is very sensitive to the data: the estimate carries a lot of information about the data, and if the data were very different, the estimate would also be different
- If the Fisher information is low, the data do not tell us much about the parameter
- The Fisher information is the negative expected curvature of the log-likelihood; high Fisher information means the log-likelihood surface is sharply peaked around the MLE (see the formula below)
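In symbols (a standard statement of the one-parameter case, not copied from the linked notes), writing I(θ) for the full-sample Fisher information:
$$
\mathcal{I}(\theta) \;=\; \mathrm{Var}_\theta\!\big(\dot{l}_x(\theta)\big)
\;=\; E_\theta\!\left[\left(\tfrac{\partial}{\partial\theta}\log f_\theta(x)\right)^{2}\right]
\;=\; -\,E_\theta\!\left[\tfrac{\partial^{2}}{\partial\theta^{2}}\log f_\theta(x)\right],
\qquad
\mathrm{Var}(\hat{\theta}) \;\approx\; \frac{1}{\mathcal{I}(\theta)} .
$$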
Describe observed Fisher information
Fisher recommended using the observed information because it gives a better, more specific idea of the accuracy of hat θ.
It is also easier to calculate: you only need the second derivative of the log-likelihood at your estimate, with no integrals (expectations) to evaluate.
The observed Fisher information acts as an approximate ancillary, enjoying both of the virtues claimed by Fisher: it is more relevant than the unconditional (expected) Fisher information, and it is usually easier to calculate.
https://docs.google.com/document/d/1HBFyTvQ6zgD6_fuReRFXEeBmd9BSlb5z8MB63gQODu4/edit?tab=t.0
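A minimal sketch (my own illustration, assuming an i.i.d. Poisson model with made-up counts): the observed information is minus the second derivative of the log-likelihood at the MLE, approximated here by a finite difference rather than any integral.
```python
import numpy as np

x = np.array([3, 7, 4, 6, 5, 2, 4])             # observed counts
theta_hat = x.mean()                             # Poisson MLE

def log_lik(theta):
    return np.sum(x * np.log(theta) - theta)     # dropping the constant log(x!) term

h = 1e-4                                         # finite-difference step
second_deriv = (log_lik(theta_hat + h) - 2 * log_lik(theta_hat) + log_lik(theta_hat - h)) / h**2
observed_info = -second_deriv
print(observed_info, len(x) / theta_hat)         # analytic value sum(x)/theta_hat^2 = n/theta_hat at the MLE
```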
Describe the score function
The score function is the derivative of the log-likelihood function.
The score function indicates how much the log-likelihood changes if you vary the parameter by an infinitesimal amount, given data x.
https://docs.google.com/document/d/1s55QLkPMBiJWs7osOnIKHf3In5x30vafckipx7wRmmw/edit?tab=t.0
When we compute the maximum likelihood estimate, we set the score function equal to 0 and solve, as in the formula below.
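In symbols (one-parameter case, using the log-likelihood notation from the MLE card):
$$
u_x(\mu) \;=\; \frac{\partial}{\partial \mu}\, l_x(\mu),
\qquad
E_\mu\!\left[u_x(\mu)\right] = 0,
\qquad
u_x(\hat{\mu}) = 0 \ \text{ at the MLE } \hat{\mu}.
$$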
Describe conditional inference
An ancillary statistic describes the experiment that was actually performed (e.g. the sample size), not information about quantities such as the parameters.
When doing conditional inference, we condition on the ancillary statistics.
Advantages:
- inference that is more relevant to the observed data
- simpler calculations
Disadvantages:
- possible loss of information
Describe permutation
Much of Fisher's methodology was criticised for its dependence on normal sampling assumptions; permutation testing does not require them. Permutation testing is frequentist.
When it happens: After data has already been collected, during statistical analysis
Purpose: To assess the statistical significance of an observed effect by comparing it to a reference distribution
Key Idea: Tests the null hypothesis by generating an empirical distribution of a test statistic
Example: Observed data points are randomly shuffled between two groups to see if the observed difference in means is unusual
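A minimal permutation-test sketch (my own toy data, not from the source): shuffle the group labels many times and see how often a shuffled difference in means is as extreme as the observed one.
```python
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])
group_b = np.array([4.2, 4.6, 4.0, 4.9, 4.4])
observed_diff = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a, n_perm = len(group_a), 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)                        # reshuffle the group labels
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))   # two-sided p-value
print(observed_diff, p_value)
```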
Describe randomisation
Ensure unbiased groups.
Experimental randomisation almost guarantees that confounding factors such as age and weight will be well-balanced between the treatment groups. Fisher’s randomised clinical trial was and is the gold standard for statistical inference in medical trials.
When it happens: At the experimental design stage, before data is collected
Purpose: To ensure unbiased assignment of participants to groups, balancing confounding factors
Key Idea: Creates a controlled probability structure for valid inference (frequentist framework)
Example: Participants are randomly assigned to either treatment group A or B
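A minimal randomisation sketch (hypothetical participant list of my own): random assignment happens at the design stage, before any outcome data are collected.
```python
import numpy as np

rng = np.random.default_rng(42)
participants = [f"participant_{i}" for i in range(1, 21)]

shuffled = rng.permutation(participants)         # random order breaks any systematic pattern
treatment_a, treatment_b = shuffled[:10], shuffled[10:]
print("Group A:", list(treatment_a))
print("Group B:", list(treatment_b))
```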
Describe univariate and multivariate families of distributions
Univariate families have only one random variable, and the families are closely related to one another; the five most famous ones are:
- Normal
- Poisson
- Binomial
- Gamma
- Beta
Multivariate families are those with multiple random variables (a random vector).
Describe the 5 familiar univariate densities
Normal: heights and weights of people; real-life scenarios where we expect the data to be normally distributed.
Poisson: the number of events happening in an interval, e.g. the number of customers arriving during an hour.
Binomial: success/failure outcomes, e.g. how many tails we get in a fixed number of coin flips.
Gamma: models continuous variables that are always positive and have skewed distributions, e.g. rainfall.
Beta: tells you about the underlying probability of success itself; a natural candidate for modelling continuous data on the unit interval [0, 1]. The choice of two parameters (v_1, v_2) provides a variety of possible shapes.
https://docs.google.com/document/d/1X_2VY5bDT50yFchEcqB9Wbpnjwtul2Q4EAJgWfstv-E/edit?tab=t.0
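A minimal sampling sketch of the five families (the parameter values are arbitrary choices of mine, picked only to mirror the examples above):
```python
from scipy import stats

samples = {
    "normal":   stats.norm(loc=170, scale=10).rvs(size=5, random_state=0),   # e.g. heights
    "poisson":  stats.poisson(mu=4).rvs(size=5, random_state=0),             # event counts per hour
    "binomial": stats.binom(n=10, p=0.5).rvs(size=5, random_state=0),        # tails in 10 coin flips
    "gamma":    stats.gamma(a=2, scale=3).rvs(size=5, random_state=0),       # positive, skewed (rainfall)
    "beta":     stats.beta(a=2, b=5).rvs(size=5, random_state=0),            # values in [0, 1]
}
for name, draws in samples.items():
    print(name, draws)
```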
Describe the Schur complements
When working with the covariance matrix of the multivariate normal distribution, we need to invert it. Schur complements allow us to invert a large matrix efficiently: a way to break the problem of inverting a large matrix into smaller, more manageable pieces (see the block formula below).
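The standard block-inversion identity (assuming D and the Schur complement S are invertible):
$$
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},
\qquad
S = A - B D^{-1} C \ \text{ (the Schur complement of } D\text{)},
$$
$$
M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} B D^{-1} \\ -D^{-1} C S^{-1} & \; D^{-1} + D^{-1} C S^{-1} B D^{-1} \end{pmatrix}.
$$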
Describe the multivariate normal distribution and its properties
Normal distributions can be univariate, but they can also be multivariate (weather, height, age, etc.). You have a p-dimensional space (p = number of variables/predictors) and a random vector x = (x_1, x_2, x_3, ..., x_p)', whose mean collects the expectations of all the random variables.
The expectation is: µ = E(x) = (E(x_1), E(x_2), ..., E(x_p))'
https://docs.google.com/document/d/1vEzJwb0Twovt4ZWPp4xcUTOhN6mG_wsGLia7umFyvDA/edit?tab=t.0
The p×p covariance matrix is: Σ = cov(x) = E[(x − µ)(x − µ)'], with entries Σ_ij = cov(x_i, x_j).
Therefore, the multivariate normal distribution x ~ N_p(µ, Σ) is specified by the expectation (µ) and the covariance matrix (Σ).
The conditional distributions are themselves normal.
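A standard statement of the conditional property, for a partition of x into two sub-vectors (the block notation is my own labelling of that split); the conditional covariance is exactly the Schur complement of Σ_22:
$$
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\;\Longrightarrow\;
x_1 \mid x_2 \sim \mathcal{N}\!\big( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \big).
$$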
Describe Fisher information (multiparameter case)
In parametric models, the Fisher information measures how much information an observed dataset carries about an unknown parameter; it quantifies the precision of the MLE.
High Fisher information = more precise estimate
Low Fisher information = less precise estimate
When the sample size is large, the distribution of the MLE approaches a multivariate normal distribution.
The score function is now the gradient of the log-likelihood function (its expectation is still 0):
∇l_x(θ) = (∂l_x(θ)/∂θ_1, ..., ∂l_x(θ)/∂θ_p)'
https://docs.google.com/document/d/1A2HOvCBQsEh59-j0khRLxYR-gUU3FPRZqJkOvoousdQ/edit?tab=t.0
The Fisher information is the covariance matrix of the gradient of the log-likelihood function, and it can be calculated in two ways:
- original: the covariance of the score
- alternative: the negative expected second derivative (Hessian) of the log-likelihood
(A small simulation check of the two forms is sketched below.)
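A minimal simulation sketch (my own example, assuming an i.i.d. Poisson(theta) model) checking that the two ways of computing the Fisher information agree:
```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 50, 20_000

scores = np.empty(reps)
neg_second_derivs = np.empty(reps)
for r in range(reps):
    x = rng.poisson(theta, size=n)
    # log-likelihood l(theta) = sum(x)*log(theta) - n*theta + const
    scores[r] = x.sum() / theta - n              # first derivative (score)
    neg_second_derivs[r] = x.sum() / theta**2    # minus the second derivative

print("variance of score:                  ", scores.var())
print("mean negative second derivative:    ", neg_second_derivs.mean())
print("analytic Fisher information n/theta:", n / theta)
```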
Describe multinomial distribution
The multinomial distribution models outcomes falling into one of several categories (cloudy, rainy, sunny, etc.).
Being "multi" means it has k > 2 categories.
A multinomial logistic regression would e.g. predict for whether the animal in a picture is a cat, dog or horse.
The observations take on only a finite number of discrete values, say we have L=4 possible outcomes: (new, success), (new, failure), (old, success), (old, failure).
With n independent observations falling into the L categories with probabilities (π_1, ..., π_L) and counts (x_1, ..., x_L): Pr(x_1, ..., x_L) = n! / (x_1! ··· x_L!) · π_1^{x_1} ··· π_L^{x_L}
https://docs.google.com/document/d/1YNnEhPa_3cwXr1-xR9B8U13SDTAHqWXznkULUVckdfM/edit?tab=t.0
Just as there is a close relation between the binomial distribution and the Poisson, there is also a close relation between the multinomial and the Poisson: independent Poisson counts, conditioned on their total, follow a multinomial distribution (see the sketch below).
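A minimal simulation sketch of that relation (the Poisson means and the fixed total are arbitrary choices of mine): conditioning independent Poisson counts on their total reproduces multinomial counts.
```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 5.0, 3.0])          # Poisson means for L = 3 categories
reps = 200_000
counts = rng.poisson(mu, size=(reps, 3))
totals = counts.sum(axis=1)

n = 10                                   # condition on a fixed total
conditional = counts[totals == n]        # Poisson draws whose total equals n
multinomial = rng.multinomial(n, mu / mu.sum(), size=len(conditional))

print("conditional Poisson mean counts:", conditional.mean(axis=0))
print("multinomial mean counts:        ", multinomial.mean(axis=0))
print("theory n * mu / sum(mu):        ", n * mu / mu.sum())
```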