Topic 2: Maximum-likelihood estimation, exponential families and generalised linear models Flashcards
Describe Likelihood
Likelihood: the data x are held fixed and the parameter µ is varied; the likelihood L_x(µ) = f_µ(x) is the density formula viewed as a function of µ.
Use case: estimating model parameters.
Differences (likelihood vs. density) here:
https://docs.google.com/document/d/1fpirIac-cSF-w1Z-1xi0UEVf9y86SHFUZtsMj75snJ4/edit?tab=t.0
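A minimal sketch of this idea (my own illustration, assuming a Normal(µ, 1) model; the data values are made up): the data stay fixed while the likelihood is evaluated across candidate values of µ.
```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 3.8, 4.9, 5.5])        # fixed observed data
mu_grid = np.linspace(2, 8, 601)                # candidate parameter values

# Likelihood L_x(mu) = product of densities f_mu(x_i), one value per candidate mu
likelihood = np.array([norm.pdf(x, loc=mu, scale=1.0).prod() for mu in mu_grid])

print("mu with highest likelihood:", mu_grid[likelihood.argmax()])  # close to x.mean()
```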
Describe maximum likelihood estimation
The MLE (maximum likelihood estimate) is the value of µ in the parameter space Ω that maximises the log-likelihood function l_x(µ). We work with the log-likelihood because it is easier to maximise and has the same maximiser as the likelihood itself.
hat µ = argmax_{µ ∈ Ω} l_x(µ),   where l_x(µ) = log f_µ(x)
https://docs.google.com/document/d/19IRKCg6HzfxRyHde8y36wg64piSMiMNNaDxv-jDk7mU/edit?tab=t.0
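A minimal numerical-MLE sketch (my own example, assuming a Normal(µ, σ) model; not taken from the linked notes): maximise the log-likelihood by minimising its negative.
```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([4.2, 5.1, 3.8, 4.9, 5.5])        # observed data

def neg_log_likelihood(params):
    mu, log_sigma = params                      # log-parametrise sigma to keep it positive
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                        # agree with x.mean() and x.std(ddof=0)
```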
Describe the advantages and disadvantages of MLE
Advantages of MLE:
- It gives an estimate automatically, without further statistical assumptions
- It has excellent frequentist properties: in large samples the estimate is nearly unbiased (on average close to the true parameter value)
- It can still be given a Bayesian justification
https://docs.google.com/document/d/1jzW0lrx0r_XYJVHobjtU2kqwUsgsnqfvK5dxrYR2TZI/edit?tab=t.0
Disadvantages
- MLE estimates can be badly off when based on little data
- With large numbers of parameters, hat θ = T(hat µ) may be off even if each component of hat µ is well estimated
Describe Fisher information
It is defined to be the variance of the score function.
https://docs.google.com/document/d/1lRA0qLss_qqWXMDJhTpjgoauScTDoFTUhS_Kq0mInL0/edit?tab=t.0
The larger the Fisher information, the smaller the variance of the MLE.
- If the Fisher information is high, the MLE hat θ is very sensitive to the data: the estimate carries a lot of information about the data, and if the data were very different, the estimate would also be different
- If the Fisher information is low, the data do not tell us much about the parameter
- The Fisher information is the negative expected curvature of the log-likelihood; high Fisher information means the log-likelihood surface is sharply peaked around the MLE (see the formula below)
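In symbols (a standard statement of the one-parameter case, not copied from the linked notes), writing I(θ) for the full-sample Fisher information:
$$
\mathcal{I}(\theta) \;=\; \mathrm{Var}_\theta\!\big(\dot{l}_x(\theta)\big)
\;=\; E_\theta\!\left[\left(\tfrac{\partial}{\partial\theta}\log f_\theta(x)\right)^{2}\right]
\;=\; -\,E_\theta\!\left[\tfrac{\partial^{2}}{\partial\theta^{2}}\log f_\theta(x)\right],
\qquad
\mathrm{Var}(\hat{\theta}) \;\approx\; \frac{1}{\mathcal{I}(\theta)} .
$$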
Describe observed Fisher information
Fisher recommended using the observed information because it gives a better, more specific idea of the accuracy of hat θ.
It is also easier to calculate: you only need the second derivative of the log-likelihood at your estimate, with no integrals (expectations) to evaluate.
The observed Fisher information acts as an approximate ancillary, enjoying both of the virtues claimed by Fisher: it is more relevant than the unconditional (expected) Fisher information, and it is usually easier to calculate.
https://docs.google.com/document/d/1HBFyTvQ6zgD6_fuReRFXEeBmd9BSlb5z8MB63gQODu4/edit?tab=t.0
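A minimal sketch (my own illustration, assuming an i.i.d. Poisson model with made-up counts): the observed information is minus the second derivative of the log-likelihood at the MLE, approximated here by a finite difference rather than any integral.
```python
import numpy as np

x = np.array([3, 7, 4, 6, 5, 2, 4])             # observed counts
theta_hat = x.mean()                             # Poisson MLE

def log_lik(theta):
    return np.sum(x * np.log(theta) - theta)     # dropping the constant log(x!) term

h = 1e-4                                         # finite-difference step
second_deriv = (log_lik(theta_hat + h) - 2 * log_lik(theta_hat) + log_lik(theta_hat - h)) / h**2
observed_info = -second_deriv
print(observed_info, len(x) / theta_hat)         # analytic value sum(x)/theta_hat^2 = n/theta_hat at the MLE
```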
Describe the score function
The score function is the derivative of the log-likelihood function.
The score function indicates how much the log-likelihood changes if you vary the parameter by an infinitesimal amount, given data x.
https://docs.google.com/document/d/1s55QLkPMBiJWs7osOnIKHf3In5x30vafckipx7wRmmw/edit?tab=t.0
When we compute the maximum likelihood estimate, we set the score function equal to 0 and solve, as in the formula below.
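In symbols (one-parameter case, using the log-likelihood notation from the MLE card):
$$
u_x(\mu) \;=\; \frac{\partial}{\partial \mu}\, l_x(\mu),
\qquad
E_\mu\!\left[u_x(\mu)\right] = 0,
\qquad
u_x(\hat{\mu}) = 0 \ \text{ at the MLE } \hat{\mu}.
$$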
Describe conditional inference
An ancillary statistic describes the experiment that was actually performed (e.g. the sample size), not information about quantities such as the parameters.
When doing conditional inference, we condition on the ancillary statistics.
Advantages:
- inference that is more relevant to the observed data
- simpler calculations
Disadvantages:
- possible loss of information
Describe permutation
Much of Fisher's methodology was criticised for its dependence on normal sampling assumptions; permutation testing does not require them. Permutation testing is frequentist.
When it happens: After data has already been collected, during statistical analysis
Purpose: To assess the statistical significance of an observed effect by comparing it to a reference distribution
Key Idea: Tests the null hypothesis by generating an empirical distribution of a test statistic
Example: Observed data points are randomly shuffled between two groups to see if the observed difference in means is unusual
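A minimal permutation-test sketch (my own toy data, not from the source): shuffle the group labels many times and see how often a shuffled difference in means is as extreme as the observed one.
```python
import numpy as np

rng = np.random.default_rng(0)
group_a = np.array([5.1, 4.8, 6.0, 5.5, 5.9])
group_b = np.array([4.2, 4.6, 4.0, 4.9, 4.4])
observed_diff = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a, n_perm = len(group_a), 10_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)                        # reshuffle the group labels
    perm_diffs[i] = shuffled[:n_a].mean() - shuffled[n_a:].mean()

p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))   # two-sided p-value
print(observed_diff, p_value)
```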
Describe randomisation
Ensure unbiased groups.
Experimental randomisation almost guarantees that confounding factors such as age and weight will be well-balanced between the treatment groups. Fisher’s randomised clinical trial was and is the gold standard for statistical inference in medical trials.
When it happens: At the experimental design stage, before data is collected
Purpose: To ensure unbiased assignment of participants to groups, balancing confounding factors
Key Idea: Creates a controlled probability structure for valid inference (frequentist framework)
Example: Participants are randomly assigned to either treatment group A or B
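A minimal randomisation sketch (hypothetical participant list of my own): random assignment happens at the design stage, before any outcome data are collected.
```python
import numpy as np

rng = np.random.default_rng(42)
participants = [f"participant_{i}" for i in range(1, 21)]

shuffled = rng.permutation(participants)         # random order breaks any systematic pattern
treatment_a, treatment_b = shuffled[:10], shuffled[10:]
print("Group A:", list(treatment_a))
print("Group B:", list(treatment_b))
```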
Describe univariate and multivariate families of distributions
Univariate families have only one random variable, and the families are closely related to one another; the five most famous ones are:
- Normal
- Poisson
- Binomial
- Gamma
- Beta
Multivariate families are those with multiple random variables (a random vector).
Describe the 5 familiar univariate densities
Normal: heights and weights of people; real-life scenarios where we expect the data to be normally distributed.
Poisson: the number of events happening in an interval, e.g. the number of customers arriving during an hour.
Binomial: success/failure outcomes, e.g. how many tails we get in a fixed number of coin flips.
Gamma: models continuous variables that are always positive and have skewed distributions, e.g. rainfall.
Beta: tells you about the underlying probability of success itself; a natural candidate for modelling continuous data on the unit interval [0, 1]. The choice of two parameters (v_1, v_2) provides a variety of possible shapes.
https://docs.google.com/document/d/1X_2VY5bDT50yFchEcqB9Wbpnjwtul2Q4EAJgWfstv-E/edit?tab=t.0
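A minimal sampling sketch of the five families (the parameter values are arbitrary choices of mine, picked only to mirror the examples above):
```python
from scipy import stats

samples = {
    "normal":   stats.norm(loc=170, scale=10).rvs(size=5, random_state=0),   # e.g. heights
    "poisson":  stats.poisson(mu=4).rvs(size=5, random_state=0),             # event counts per hour
    "binomial": stats.binom(n=10, p=0.5).rvs(size=5, random_state=0),        # tails in 10 coin flips
    "gamma":    stats.gamma(a=2, scale=3).rvs(size=5, random_state=0),       # positive, skewed (rainfall)
    "beta":     stats.beta(a=2, b=5).rvs(size=5, random_state=0),            # values in [0, 1]
}
for name, draws in samples.items():
    print(name, draws)
```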
Describe the Schur complements
When working with the covariance matrix of the multivariate normal distribution, we need to invert it. Schur complements allow us to invert a large matrix efficiently: a way to break the problem of inverting a large matrix into smaller, more manageable pieces (see the block formula below).
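The standard block-inversion identity (assuming D and the Schur complement S are invertible):
$$
M = \begin{pmatrix} A & B \\ C & D \end{pmatrix},
\qquad
S = A - B D^{-1} C \ \text{ (the Schur complement of } D\text{)},
$$
$$
M^{-1} = \begin{pmatrix} S^{-1} & -S^{-1} B D^{-1} \\ -D^{-1} C S^{-1} & \; D^{-1} + D^{-1} C S^{-1} B D^{-1} \end{pmatrix}.
$$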
Describe the multivariate normal distribution and its properties
Normal distributions can be univariate, but they can also be multivariate (weather, height, age, etc.). You have a p-dimensional space (p = number of variables/predictors) and a random vector x = (x_1, x_2, x_3, ..., x_p)', whose mean collects the expectations of all the random variables.
The expectation is: µ = E(x) = (E(x_1), E(x_2), ..., E(x_p))'
https://docs.google.com/document/d/1vEzJwb0Twovt4ZWPp4xcUTOhN6mG_wsGLia7umFyvDA/edit?tab=t.0
The p×p covariance matrix is: Σ = cov(x) = E[(x − µ)(x − µ)'], with entries Σ_ij = cov(x_i, x_j).
Therefore, the multivariate normal distribution x ~ N_p(µ, Σ) is specified by the expectation (µ) and the covariance matrix (Σ).
The conditional distributions are themselves normal.
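A standard statement of the conditional property, for a partition of x into two sub-vectors (the block notation is my own labelling of that split); the conditional covariance is exactly the Schur complement of Σ_22:
$$
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\;\Longrightarrow\;
x_1 \mid x_2 \sim \mathcal{N}\!\big( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \big).
$$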
Describe Fisher information (multiparameter case)
In parametric models, the Fisher information measures how much information an observed dataset carries about an unknown parameter; it quantifies the precision of the MLE.
High Fisher information = more precise estimate
Low Fisher information = less precise estimate
When the sample size is large, the distribution of the MLE approaches a multivariate normal distribution.
The score function is now the gradient of the log-likelihood function (its expectation is still 0):
∇l_x(θ) = (∂l_x(θ)/∂θ_1, ..., ∂l_x(θ)/∂θ_p)'
https://docs.google.com/document/d/1A2HOvCBQsEh59-j0khRLxYR-gUU3FPRZqJkOvoousdQ/edit?tab=t.0
The Fisher information is the covariance matrix of the gradient of the log-likelihood function, and it can be calculated in two ways:
- original: the covariance of the score
- alternative: the negative expected second derivative (Hessian) of the log-likelihood
(A small simulation check of the two forms is sketched below.)
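A minimal simulation sketch (my own example, assuming an i.i.d. Poisson(theta) model) checking that the two ways of computing the Fisher information agree:
```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 3.0, 50, 20_000

scores = np.empty(reps)
neg_second_derivs = np.empty(reps)
for r in range(reps):
    x = rng.poisson(theta, size=n)
    # log-likelihood l(theta) = sum(x)*log(theta) - n*theta + const
    scores[r] = x.sum() / theta - n              # first derivative (score)
    neg_second_derivs[r] = x.sum() / theta**2    # minus the second derivative

print("variance of score:                  ", scores.var())
print("mean negative second derivative:    ", neg_second_derivs.mean())
print("analytic Fisher information n/theta:", n / theta)
```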
Describe multinomial distribution
The multinomial distribution models outcomes falling into one of several categories (cloudy, rainy, sunny, etc.).
Being "multi" means it has k > 2 categories.
A multinomial logistic regression would e.g. predict for whether the animal in a picture is a cat, dog or horse.
The observations take on only a finite number of discrete values, say we have L=4 possible outcomes: (new, success), (new, failure), (old, success), (old, failure).
With n independent observations falling into the L categories with probabilities (π_1, ..., π_L) and counts (x_1, ..., x_L): Pr(x_1, ..., x_L) = n! / (x_1! ··· x_L!) · π_1^{x_1} ··· π_L^{x_L}
https://docs.google.com/document/d/1YNnEhPa_3cwXr1-xR9B8U13SDTAHqWXznkULUVckdfM/edit?tab=t.0
Just as there is a close relation between the binomial distribution and the Poisson, there is also a close relation between the multinomial and the Poisson: independent Poisson counts, conditioned on their total, follow a multinomial distribution (see the sketch below).
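A minimal simulation sketch of that relation (the Poisson means and the fixed total are arbitrary choices of mine): conditioning independent Poisson counts on their total reproduces multinomial counts.
```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([2.0, 5.0, 3.0])          # Poisson means for L = 3 categories
reps = 200_000
counts = rng.poisson(mu, size=(reps, 3))
totals = counts.sum(axis=1)

n = 10                                   # condition on a fixed total
conditional = counts[totals == n]        # Poisson draws whose total equals n
multinomial = rng.multinomial(n, mu / mu.sum(), size=len(conditional))

print("conditional Poisson mean counts:", conditional.mean(axis=0))
print("multinomial mean counts:        ", multinomial.mean(axis=0))
print("theory n * mu / sum(mu):        ", n * mu / mu.sum())
```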