Topic 2: Maximum-likelihood estimation, exponential families and generalised linear models Flashcards

1
Q

Describe Likelihood

A

Likelihood: the data x are held fixed and we vary the parameter µ. For a density f_µ(x), the likelihood is L_x(µ) = f_µ(x), viewed as a function of µ.

Use case: estimating model parameters.

Differences here:
https://docs.google.com/document/d/1fpirIac-cSF-w1Z-1xi0UEVf9y86SHFUZtsMj75snJ4/edit?tab=t.0
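As a concrete sketch (with made-up Poisson counts), the likelihood can be evaluated on a grid of candidate µ values while the data x stay fixed:

```python
import numpy as np
from math import factorial

x = [3, 7, 4, 6, 5]                  # fixed observed counts
mus = np.linspace(1.0, 10.0, 91)     # candidate parameter values

def lik(m):
    # Poisson likelihood of the whole sample at parameter m
    return np.prod([np.exp(-m) * m**xi / factorial(xi) for xi in x])

L = np.array([lik(m) for m in mus])
best = mus[np.argmax(L)]
print(best)   # 5.0, the sample mean
```

The grid maximum lands on the sample mean, which is the Poisson MLE.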

2
Q

Describe maximum likelihood estimation

A

The maximum likelihood estimate (MLE) is the value of µ in the parameter space Ω that maximises the log-likelihood function l_x(µ):

hat µ = argmax_{µ ∈ Ω} l_x(µ), where l_x(µ) = log L_x(µ)
https://docs.google.com/document/d/19IRKCg6HzfxRyHde8y36wg64piSMiMNNaDxv-jDk7mU/edit?tab=t.0
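A minimal numerical sketch (hypothetical exponential data): maximise the log-likelihood with a generic optimiser and compare against the closed-form MLE, which for the exponential rate is 1/x̄:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1000)   # true rate lambda = 0.5

def neg_loglik(lam):
    # -l_x(lambda) for Exp(lambda): l = n log(lambda) - lambda * sum(x)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_loglik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs closed form, nearly identical
```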

3
Q

Describe the Advantages of MLE and Disadvantages

A

Advantages of MLE:

  • It is automatic: given the model, no further statistical assumptions or tuning choices are needed
  • Its frequentist properties are good: in large samples the estimate is nearly unbiased (on average the estimate is close to the true parameter value)
  • It can also be given a Bayesian justification

https://docs.google.com/document/d/1jzW0lrx0r_XYJVHobjtU2kqwUsgsnqfvK5dxrYR2TZI/edit?tab=t.0

Disadvantages:

  • MLE estimates can be far off when computed from little data
  • With large numbers of parameters, hat θ = T(hat µ) may be off even if each component of hat µ is well estimated
4
Q

Describe Fisher information

A

The Fisher information is defined as the variance of the score function:

I(µ) = Var_µ(l̇_x(µ)) = −E_µ[l̈_x(µ)]

https://docs.google.com/document/d/1lRA0qLss_qqWXMDJhTpjgoauScTDoFTUhS_Kq0mInL0/edit?tab=t.0

The larger the Fisher information, the smaller the variance of the MLE.

  • If the Fisher information is high, the MLE hat θ is very sensitive to the data
  • meaning that the data carry A LOT OF INFORMATION about the parameter
    • if the data were very different, the estimate would also be different
  • If the Fisher information is low, the data don’t tell us much about the parameter
  • Larger Fisher information implies a smaller (asymptotic) variance of the MLE: Var(hat θ) ≈ 1/I(µ)
  • The Fisher information is the negative expected curvature of the log-likelihood; high Fisher information means the log-likelihood surface is sharply peaked around its maximum
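A quick simulation sketch (Poisson model, made-up settings): for n iid Poisson(µ) observations the analytic Fisher information is n/µ, and the empirical variance of the score across simulated datasets should match it:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, reps = 4.0, 50, 20000
x = rng.poisson(mu, size=(reps, n))       # many simulated datasets

# score at the true mu: d/dmu [sum(x) log(mu) - n mu] = sum(x)/mu - n
scores = x.sum(axis=1) / mu - n
emp_var = scores.var()
print(emp_var, n / mu)                    # both close to 12.5
```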
5
Q

Describe observed Fisher
information

A

Fisher recommended using the observed information, J(hat θ) = −l̈_x(hat θ), as it gives a better, more case-specific idea of the accuracy of hat θ.

It is also easier to calculate: you only need the second derivative of the log-likelihood at your estimate, with no integrals over the sample space.

The observed Fisher information acts as an approximate ancillary, enjoying both of the virtues claimed by Fisher: it is more relevant than the unconditional (expected) Fisher information, and it is usually easier to calculate.

https://docs.google.com/document/d/1HBFyTvQ6zgD6_fuReRFXEeBmd9BSlb5z8MB63gQODu4/edit?tab=t.0

6
Q

Describe the score function

A

The score function is the derivative of the log-likelihood function:

l̇_x(µ) = ∂ l_x(µ) / ∂µ

It indicates how much the log-likelihood changes if you vary the parameter by an infinitesimal amount, given data x.

https://docs.google.com/document/d/1s55QLkPMBiJWs7osOnIKHf3In5x30vafckipx7wRmmw/edit?tab=t.0

To compute the maximum likelihood estimate we set the score function equal to 0 and solve for µ.
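Solving score = 0 is often done numerically; a minimal Newton–Raphson sketch on a made-up Poisson sample (where the root is known to be the sample mean) is:

```python
import numpy as np

x = np.array([2, 5, 3, 4, 6, 1, 3])       # made-up counts
mu = 1.0                                   # starting guess
for _ in range(20):                        # Newton steps on the score
    score = x.sum() / mu - len(x)          # first derivative of the log-lik
    hess = -x.sum() / mu**2                # second derivative
    mu -= score / hess                     # Newton update toward score = 0
print(mu, x.mean())                        # both 3.42857...
```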

7
Q

Describe conditional inference

A

Ancillary statistic: a statistic that contains no direct information about the parameters by itself, but describes the experiment that was performed (e.g. the sample size).

When doing conditional inference, we condition on the ancillary statistics.

Advantages:
- inference is based on the data relevant to the experiment actually performed
- it often simplifies the calculations

Disadvantages:
- possible loss of information

8
Q

Describe permutation

A

Much of Fisher’s methodology faced criticism for its dependence on normal sampling assumptions. Permutation testing is a frequentist method that avoids them.

When it happens: After data has already been collected, during statistical analysis

Purpose: To assess the statistical significance of an observed effect by comparing it to a reference distribution

Key Idea: Tests the null hypothesis by generating an empirical distribution of a test statistic

Example: Observed data points are randomly shuffled between two groups to see if the observed difference in means is unusual
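A minimal sketch of this shuffling procedure with two small made-up groups:

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])   # group A (hypothetical data)
b = np.array([4.2, 4.5, 4.0, 4.8, 4.3])   # group B
obs = a.mean() - b.mean()                  # observed difference in means

pooled = np.concatenate([a, b])
n_perm, count = 10000, 0
for _ in range(n_perm):
    perm = rng.permutation(pooled)         # shuffle labels between groups
    diff = perm[:5].mean() - perm[5:].mean()
    if abs(diff) >= abs(obs):              # at least as extreme as observed
        count += 1
p_value = count / n_perm
print(p_value)                             # small: the difference is unusual
```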

9
Q

Describe randomisation

A

Randomisation ensures unbiased groups.

Experimental randomisation almost guarantees that confounding factors such as age and weight will be well-balanced between the treatment groups. Fisher’s randomised clinical trial was and is the gold standard for statistical inference in medical trials.

When it happens: At the experimental design stage, before data is collected

Purpose: To ensure unbiased assignment of participants to groups, balancing confounding factors

Key Idea: Creates a controlled probability structure for valid inference (frequentist framework)

Example: Participants are randomly assigned to either treatment group A or B

10
Q

Describe univariate and multivariate families of distributions

A

Univariate families have a single random variable, and the families are all interrelated; the five most famous are:
- Normal
- Poisson
- Binomial
- Gamma
- Beta

Multivariate families have multiple random variables (a random vector).

11
Q

Describe the 5 familiar univariate densities

A

Normal: heights and weights of people; real-life scenarios where we would expect the data to be (approximately) normally distributed.

Poisson: number of events happening in an interval, e.g. the number of customers arriving in an hour.

Binomial: number of successes in a fixed number of independent success/failure trials, e.g. how many times we get tails from a coin.

Gamma: models continuous variables that are always positive and have skewed distributions, e.g. rainfall.

Beta: tells you about the underlying probability of success itself; a natural candidate for modelling continuous data on the unit interval [0, 1]. The choice of two parameters (v_1, v_2) provides a variety of possible shapes.

https://docs.google.com/document/d/1X_2VY5bDT50yFchEcqB9Wbpnjwtul2Q4EAJgWfstv-E/edit?tab=t.0
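All five families are available in scipy.stats; a small sketch evaluating each density or mass function at one illustrative point (the parameter values are made up):

```python
from scipy import stats

print(stats.norm(loc=170, scale=10).pdf(180))   # height density at 180 cm
print(stats.poisson(mu=3).pmf(5))               # 5 arrivals when the mean is 3
print(stats.binom(n=10, p=0.5).pmf(4))          # 4 tails in 10 fair flips
print(stats.gamma(a=2, scale=3).pdf(4.0))       # skewed positive variable
print(stats.beta(a=2, b=5).pdf(0.3))            # success probability on [0, 1]
```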

12
Q

Describe the Schur complements

A

When working with the covariance matrix of the multivariate normal distribution, we need to invert it. Schur complements let us invert a large matrix efficiently: they break the problem of inverting a large matrix down into smaller, more manageable pieces.
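A small numerical sketch of the blockwise-inversion identity behind this: for M = [[A, B], [C, D]], the top-left block of M⁻¹ equals the inverse of the Schur complement A − B D⁻¹ C:

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 5))
M = M @ M.T + 5 * np.eye(5)            # symmetric positive definite, like a covariance
A, B = M[:2, :2], M[:2, 2:]            # partition M into 2x2 blocks
C, D = M[2:, :2], M[2:, 2:]

schur = A - B @ np.linalg.inv(D) @ C   # Schur complement of D in M
top_left = np.linalg.inv(M)[:2, :2]    # top-left block of the full inverse
print(np.allclose(np.linalg.inv(schur), top_left))   # True
```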

13
Q

Describe multivariate normal
distribution and its properties

A

Normal distributions can be univariate, but they can also be multivariate (weather, height, age etc.). You have a p-dimensional random vector x = (x_1, x_2, …, x_p), where p is the number of variables/predictors.

The expectation is:
µ = E[x] = (E[x_1], …, E[x_p])
https://docs.google.com/document/d/1vEzJwb0Twovt4ZWPp4xcUTOhN6mG_wsGLia7umFyvDA/edit?tab=t.0

The p×p covariance matrix is:
Σ = E[(x − µ)(x − µ)^T]

Therefore the multivariate normal distribution, x ~ N_p(µ, Σ), is specified by the expectation (µ) and the covariance (Σ).

The conditionals (and marginals) are themselves normal.
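Because the conditionals are normal, their parameters follow directly from the partitioned µ and Σ (this is also where the Schur complement appears); a sketch with a made-up 3-dimensional covariance, conditioning on the last coordinate:

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# condition on x3 = 4.0: partition into first two coordinates vs the last
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]
x2 = np.array([4.0])

cond_mean = mu[:2] + S12 @ np.linalg.inv(S22) @ (x2 - mu[2:])
cond_cov = S11 - S12 @ np.linalg.inv(S22) @ S21   # Schur complement of S22
print(cond_mean)   # shifted toward the observed x3
print(cond_cov)    # 2x2 conditional covariance
```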

14
Q

Describe fisher information

A

In parametric models, Fisher information measures how much information an observed dataset carries about an unknown parameter; it quantifies the precision of the MLE.
High Fisher information = more precise estimate
Low Fisher information = less precise estimate

When the sample size is large, the MLE approaches a multivariate normal distribution centred at the true parameter, with covariance given by the inverse Fisher information.

In the multiparameter case the score function is the gradient of the log-likelihood, ∇ l_x(µ) (it still has expectation 0 under the model, and we still set it to 0 at the MLE).
https://docs.google.com/document/d/1A2HOvCBQsEh59-j0khRLxYR-gUU3FPRZqJkOvoousdQ/edit?tab=t.0

The Fisher information is the covariance matrix of the gradient of the log-likelihood, and can be calculated in two ways:
- original: the covariance of the score
- alternative: the negative expected second derivative (Hessian) of the log-likelihood

15
Q

Describe multinomial distribution

A

Multinomial (cloudy, rainy, sunny etc.).
“Multi” means it has k > 2 categories.
A multinomial logistic regression would e.g. predict whether the animal in a picture is a cat, dog or horse.

The observations take on only a finite number of discrete values; say we have L = 4 possible outcomes: (new, success), (new, failure), (old, success), (old, failure).

For n observations with category counts (x_1, …, x_L) and probabilities (π_1, …, π_L):
P(x_1, …, x_L) = n! / (x_1! ⋯ x_L!) · π_1^{x_1} ⋯ π_L^{x_L}
https://docs.google.com/document/d/1YNnEhPa_3cwXr1-xR9B8U13SDTAHqWXznkULUVckdfM/edit?tab=t.0

Just as there is a close relation between the binomial distribution and the Poisson, there is also a close relation between the multinomial and the Poisson.

16
Q

Describe exponential family

A

A family of distributions is said to belong to the exponential family if the probability density function (or probability mass function for discrete distributions) can be written in the form

g_λ(x) = e^{λ^T y − γ(λ)} g_0(x), with y = t(x)

https://docs.google.com/document/d/1QZ41DBPmOIhYYwh4XXqrqyp0ifOtvToq9PkJ-pDp6vk/edit?tab=t.0

λ: the natural (canonical) parameter

γ(λ): the normaliser; it makes the pdf integrate (or sum) to 1, so the expression really is a pdf

y = t(x): the sufficient statistic

g_0(x): the carrier density; it carries the information about which kind of pdf we are working with
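As a sketch, the Poisson pmf can be checked against its exponential-family form: with λ = log µ, y = x, γ(λ) = e^λ and g_0(x) = 1/x!, the two expressions agree:

```python
from math import factorial, exp, log

mu = 3.0
lam = log(mu)                 # natural parameter lambda = log(mu)
gamma = exp(lam)              # normaliser gamma(lambda) = e^lambda = mu
for x in range(6):
    g0 = 1.0 / factorial(x)                 # carrier density
    ef = exp(lam * x - gamma) * g0          # exponential-family form
    direct = exp(-mu) * mu**x / factorial(x)  # usual Poisson pmf
    print(x, ef, direct)                    # the two forms agree
```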

17
Q

Describe logistic regression

A

The logit is the inverse of the sigmoid function: it takes a probability between 0 and 1 and returns a real value.
The logit is the natural parameter, and the underlying distribution is the binomial distribution.

We model the probability of k successes in n trials (e.g. mouse deaths) given an independent success probability π (the proportion of deaths).

We assume that the logit is a linear function of the dose we give the mice, e.g.
logit(π) = log(π / (1 − π)) = α_0 + α_1 · dose
https://docs.google.com/document/d/1UTHw9J1YMWFLBw5K-_gGSEhw-ybnOtqbrrQBsHDY5jE/edit?tab=t.0

To find the alphas, we use maximum likelihood to estimate the parameters.

And since the sigmoid is the inverse of the logit, we can write it like this:
π = 1 / (1 + e^{−(α_0 + α_1 · dose)})
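A minimal sketch of fitting such a dose-response model by Newton–Raphson (IRLS) on made-up mouse data:

```python
import numpy as np

# hypothetical dose-response data: dose, number of mice, number of deaths
dose   = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n      = np.array([10, 10, 10, 10, 10])
deaths = np.array([1, 3, 5, 7, 9])

X = np.column_stack([np.ones_like(dose), dose])   # intercept + dose
alpha = np.zeros(2)
for _ in range(25):                        # Newton-Raphson on the log-likelihood
    pi = 1.0 / (1.0 + np.exp(-X @ alpha))  # sigmoid, inverse of the logit
    score = X.T @ (deaths - n * pi)        # gradient of the log-likelihood
    W = np.diag(n * pi * (1 - pi))         # binomial variance weights
    alpha += np.linalg.solve(X.T @ W @ X, score)
print(alpha)   # alpha_1 > 0: higher dose, higher death probability
```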

18
Q

Describe poisson regression

A

Poisson regression is good at smoothing raw bin counts: rather than just reporting the raw counts, it models the underlying pattern of the data.

https://docs.google.com/document/d/12cHHWcJPUn4Vml2e34hC8b34K5RyB8VUJlAPysgfZkk/edit?tab=t.0
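A sketch of this smoothing on made-up bin counts, modelling log µ_i as a quadratic in the bin centre and fitting by Newton steps (with the canonical log link these are Fisher-scoring steps):

```python
import numpy as np

z = np.linspace(-2, 2, 9)                        # bin centres (hypothetical)
counts = np.array([1, 2, 4, 7, 10, 8, 5, 2, 1])  # raw bin counts
X = np.column_stack([np.ones_like(z), z, z**2])  # quadratic on the log scale

beta = np.array([np.log(counts.mean()), 0.0, 0.0])  # stable starting point
for _ in range(50):                     # Newton steps for the Poisson log-lik
    mu = np.exp(X @ beta)               # fitted means under the log link
    score = X.T @ (counts - mu)
    beta += np.linalg.solve(X.T @ (X * mu[:, None]), score)

smooth = np.exp(X @ beta)
print(np.round(smooth, 1))              # smoothed version of the raw counts
```

With an intercept in the model, the fitted counts sum exactly to the observed total.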

19
Q

Describe generalised linear model

A

Generalised linear models are useful for regression when the response distribution is not normal. The link function equates a (transformed) parameter of the response distribution to a linear function of the predictors.

Generalised linear models are thus a principled way to apply regression to quantities that are not normally distributed.
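The link relation and the standard link choices can be summarised as:

```latex
g(\mu_i) = x_i^\top \beta, \qquad
\begin{cases}
g(\mu) = \mu & \text{identity link (normal linear regression)}\\
g(\mu) = \log \mu & \text{log link (Poisson regression)}\\
g(\mu) = \log \tfrac{\mu}{1-\mu} & \text{logit link (logistic regression)}
\end{cases}
```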

20
Q

What is deviance

A

The deviance between two densities, f_1(x) and f_2(x), is a measure of how different the two distributions are; it is defined as twice the Kullback–Leibler divergence:

D(f_1, f_2) = 2 E_{f_1}[ log( f_1(x) / f_2(x) ) ]