MLE Flashcards

1
Q

Likelihood and maximum likelihood

A
The likelihood L(θ) is the probability (or density) of the observed data x, viewed as a function of the parameter θ: L(θ) = f_θ(x). The maximum likelihood estimate θ̂ is the value of θ that maximizes L(θ), or equivalently the log-likelihood ℓ(θ) = log L(θ).
2
Q

Maximum likelihood estimation

A

“The MLE algorithm is automatic: in theory, and almost in practice, a single numerical algorithm produces θ̂ without further statistical input. This contrasts with unbiased estimation, for instance, where each new situation requires clever theoretical calculations.

The MLE enjoys excellent frequentist properties. In large-sample situations, maximum likelihood estimates tend to be nearly unbiased, with the least possible variance. Even in small samples, MLEs are usually quite efficient, within say a few percent of the best possible performance.

The MLE also has reasonable Bayesian justification.

Downsides:
▶ MLE estimates can be extremely off if estimated on little data.
▶ The MLE of the Bernoulli parameter of a coin flip based on a sample of one will always be exactly 0 or 1.
▶ With large numbers of parameters, θ̂ = T(μ̂) may be off even if each component of μ̂ is well estimated.
▶ Next week: the James–Stein estimator.

Most MLEs require numerical maximization, as for the gamma model (see the sketch below).

There is a downside to maximum likelihood estimation that remained nearly invisible in classical applications: it is dangerous to rely upon in problems involving large numbers of parameters”
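
A minimal numerical-MLE sketch for a gamma model (assuming NumPy and SciPy; the data and starting values are made up for illustration):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import gamma

    # Hypothetical data; in practice x is the observed sample.
    rng = np.random.default_rng(0)
    x = rng.gamma(shape=3.0, scale=2.0, size=100)

    # Negative log-likelihood of a gamma(shape, scale) model.
    def neg_log_lik(params):
        shape, scale = params
        if shape <= 0 or scale <= 0:
            return np.inf
        return -np.sum(gamma.logpdf(x, a=shape, scale=scale))

    # No closed form exists for the gamma MLE, so maximize numerically.
    res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
    shape_hat, scale_hat = res.x

Nelder–Mead avoids derivatives here; any standard optimizer over the (shape, scale) region serves the same purpose.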

3
Q

Score function

A

“The score function is the derivative of the log-likelihood function.
The score function indicates how much the log-likelihood changes if you vary the parameter estimate by an infinitesimal amount, given data x. Note: When we compute the maximum-likelihood estimate, we solve for a score of 0!”
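
A standard worked example (not from the card): for x_1, …, x_n i.i.d. N(θ, 1), in LaTeX notation,

    \ell(\theta) = -\tfrac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2 + \text{const},
    \qquad
    \dot{\ell}(\theta) = \sum_{i=1}^{n} (x_i - \theta),

and setting the score \dot{\ell}(\theta) to zero gives \hat{\theta} = \bar{x}, the sample mean.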

4
Q

Fisher information

A

“The variance of the score is called the Fisher information! If the Fisher information is high, then the MLE θ̂ is very sensitive to the data, meaning it describes the data well.

His [Fisher’s] paradigm-shifting work concerned the favorable inferential properties of the MLE, and in particular its achievement of the Fisher information bound.

That means the estimate carries a lot of information about the data: if the data were very different, the estimate would also be very different.
▶ Low Fisher information means the parameter doesn’t tell us much about the data.
▶ Larger Fisher information implies smaller variance of the MLE.
▶ The Fisher information is the negative expected curvature of the log-likelihood. High Fisher information means the log-likelihood surface is sharply peaked.”
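
In symbols (standard definitions, not specific to this card), writing \dot{\ell} and \ddot{\ell} for the first and second derivatives of the log-likelihood:

    \mathcal{I}(\theta) = \operatorname{Var}_\theta\big(\dot{\ell}(\theta)\big)
                        = -\,\mathbb{E}_\theta\big[\ddot{\ell}(\theta)\big].

For example, n Bernoulli(θ) trials give \mathcal{I}(\theta) = n / (\theta(1-\theta)).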

5
Q

Cramér–Rao lower bound / Fisher information bound

A

“The MLE is generally not unbiased, but its bias is small (of order 1/n, compared with a standard deviation of order 1/√n), making the comparison with unbiased estimates and the Cramér–Rao bound appropriate.

Asymptotically, the MLE has variance at least as small as that of the best unbiased estimate of θ!”
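
The bound in standard form (a textbook statement, not verbatim from the card): for any unbiased estimator \tilde{\theta} of θ based on n i.i.d. observations,

    \operatorname{Var}(\tilde{\theta}) \;\ge\; \frac{1}{n\,\mathcal{I}(\theta)},

and the MLE attains it asymptotically: \hat{\theta} \,\dot{\sim}\, \mathcal{N}\big(\theta,\; 1/(n\,\mathcal{I}(\theta))\big).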

6
Q

Observed Fisher information

A

“Computed from the negative second derivative of the log-likelihood function with respect to the parameter(s), evaluated at the MLE θ̂ for the data actually observed.

Fisher recommends using the observed information, as it gives a better, more specific idea of the accuracy of θ̂.

Key differences from (expected) Fisher information:

Fisher information is the negative expected value of the second derivative of the log-likelihood (averaged over all possible data).
Observed Fisher information is calculated directly from the observed data, without averaging.

Limitations:

May vary significantly depending on the observed sample.
Less stable in small-sample scenarios.”
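
Side by side, in standard notation, the expected (Fisher) information averages over data while the observed information plugs in the data at hand:

    \mathcal{I}(\theta) = -\,\mathbb{E}_\theta\big[\ddot{\ell}(\theta)\big]
    \qquad\text{vs.}\qquad
    I(\hat{\theta}) = -\,\ddot{\ell}(\hat{\theta} \mid x),

giving the approximate standard error \operatorname{se}(\hat{\theta}) \approx 1 / \sqrt{I(\hat{\theta})}.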
7
Q

Randomisation

A

“▶ In an experiment (trial) comparing two treatments A and B, participants should be randomly assigned to either treatment A or treatment B.

▶ Participants may have any number of confounding traits favouring a positive or negative outcome, regardless of the treatment. We can’t control for all of them.

▶ By assigning participants randomly, the effects of the confounders should even out.

▶ This enables us to conclude that any observed effect is in fact due to the variables we’re testing.

▶ “Forced frequentism, with the statistician imposing his or her preferred probability mechanism upon the data.””

8
Q

Permutation

A

“▶ Much of Fisher’s methodology depends on normal sampling assumptions.

▶ Permutation testing is a non-parametric alternative.

▶ To test significance in a two-sample comparison:
▶ Pool all items from the two samples.
▶ Randomly partition them into two parts of the original sample sizes and compute the test statistic (e.g., difference of means).
▶ Repeat many times to construct the empirical distribution of the test statistic, then compare the observed statistic against it.

▶ Very similar to the bootstrap.
▶ Application: Testing performance of NLP systems by BLEU score”
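
A minimal sketch of the two-sample permutation test above, with difference of means as the test statistic (NumPy assumed; the two samples are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    a = np.array([2.1, 2.5, 3.0, 2.8])   # hypothetical sample A
    b = np.array([1.9, 2.0, 2.2, 2.4])   # hypothetical sample B

    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])

    # Empirical distribution of the statistic under random repartitions.
    n_perm = 10_000
    stats = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        stats[i] = perm[:len(a)].mean() - perm[len(a):].mean()

    # Two-sided p-value: fraction of repartitions at least as extreme.
    p_value = np.mean(np.abs(stats) >= abs(observed))

Unlike the bootstrap, the resampling here is without replacement: each iteration is a repartition of the pooled data.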

9
Q

Logistic regression

A

“Logistic regression is a special case of generalized linear models (GLMs): a GLM with a binomial response distribution and a logit link.
It is a specialized technique for regression analysis of count or proportion data.

Advantages of the logit transformation:

▶ λ isn’t restricted to the range [0, 1], so model (8.5) never verges on forbidden territory.
▶ It allows the exploitation of exponential family properties.

Link Function: The logit link ensures a linear relationship between the predictors and the log-odds of the outcome.”
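
A minimal fitting sketch (assuming statsmodels; the data are simulated for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))   # inverse logit of the linear predictor
    y = rng.binomial(1, p)                   # binary (0/1) response

    X = sm.add_constant(x)
    # Logistic regression as a GLM: binomial family, (default) logit link.
    fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    print(fit.params)   # estimated intercept and slope, on the log-odds scale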

10
Q

Deviance

A

“Saturated Model: A model that uses as many parameters as there are observations, perfectly fitting the data. Its log-likelihood is the highest possible.
Fitted Model: The GLM with the estimated parameters.
Deviance quantifies the difference between the fitted model and the saturated model in terms of their log-likelihoods. Smaller deviance indicates better fit.

The smaller the residual deviance, the closer the fitted model is to the saturated model.

In linear regression (a special case of GLMs), the deviance reduces to the residual sum of squares (RSS). In GLMs, the deviance generalizes this concept for non-normal error distributions and link functions.

The deviance equals twice the Kullback–Leibler divergence.

GLM maximum likelihood fitting is “least total deviance” in the same way that ordinary linear regression is least sum of squares.”
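
In standard notation:

    D = 2\,\big[\ell(\text{saturated}) - \ell(\text{fitted})\big],

which for a normal-theory (Gaussian) GLM reduces to D = \sum_{i=1}^{n} (y_i - \hat{\mu}_i)^2, the residual sum of squares.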

11
Q

Generalized linear models

A

“Regression analysis, either in its classical form or in modern formulations, requires covariate information x to put the various cases into some sort of geometrical relationship.

Logistic regression is a special case of generalized linear models: a GLM with a binomial response distribution and a logit link.

Relation to exponential families:

GLMs extend ordinary linear regression, that is, least squares curve fitting, to situations where the response variables are binomial, Poisson, gamma, beta, or in fact any exponential family form.

The main point is that all the information from a p-parameter GLM is summarized in the p-dimensional vector z, no matter how large N may be, making it easier both to understand and to analyze. We have now reduced the N-parameter model (8.20)–(8.21) to the p-parameter exponential family (8.24), with p usually much smaller than N, in this way avoiding the difficulties of high-dimensional estimation.”
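
A sketch of that reduction in standard GLM notation (not verbatim from the book): with responses y_i from an exponential family, linear predictor \eta_i = x_i^\top \beta, and canonical link, the log-likelihood depends on the data only through

    z = X^\top y \in \mathbb{R}^p,

so the p-dimensional z is sufficient for β no matter how large N is.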

12
Q

Three levels of statistical modeling

A

“The inner circle of Figure 8.3 represents normal theory, the preferred venue of classical applied statistics. Exact inferences—t-tests, F distributions, most of multivariate analysis—were feasible within the circle. Outside the circle was a general theory based mainly on asymptotic (large-sample) approximations involving Taylor expansions and the central limit theorem.

Exponential family theory, the second circle in Figure 8.3, unified the special cases into a coherent whole. It has a “partly exact” flavor, with some ideal counterparts to normal theory—convex likelihood surfaces, least deviance regression—but with some approximations necessary, as in (8.30). Even the approximations, though, are often more convincing than those of general theory, exponential families’ fixed-dimension sufficient statistics making the asymptotics more transparent.”

13
Q

Poisson regression

A

“Poisson regression is a type of generalized linear model (GLM) used to model count or event data, where the response variable y is assumed to follow a Poisson distribution. It is often used when the outcome represents the number of events occurring in a fixed interval of time, space, or some other unit.

The third most-used member of the GLM family, after normal-theory least squares and logistic regression, is Poisson regression.

N independent Poisson variates y_i are observed, with their means μ_i modeled through the covariates.

Chapter 12 demonstrates some other examples of Poisson density estimation. In general, Poisson GLMs reduce density estimation to regression model fitting, a familiar and flexible inferential technology.”
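
A minimal fitting sketch (assuming statsmodels; the counts are simulated for illustration):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 2, size=200)
    mu = np.exp(0.3 + 0.8 * x)   # log link: log(mu) is linear in x
    y = rng.poisson(mu)          # observed counts

    X = sm.add_constant(x)
    # Poisson regression as a GLM: Poisson family, (default) log link.
    fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
    print(fit.params)   # estimated intercept and slope, on the log scale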
