Statistical Concepts Flashcards
What is the goal of Maximum Likelihood Estimation (MLE)?
The goal of MLE is to find the parameters of a probability distribution that make the observed data most probable. It ‘fits’ a distribution to the data by maximizing the likelihood function.
What is the likelihood function?
The likelihood function measures how probable the observed data is under a given set of model parameters. It is written L(θ) = P(X | θ) and is read as a function of θ, with the data X held fixed.
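A minimal sketch of this, assuming i.i.d. Bernoulli coin flips (the data and names here are hypothetical):

    import numpy as np

    # Hypothetical data: 10 coin flips, 1 = heads
    flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

    def likelihood(theta, data):
        # L(theta) = product over i of P(x_i | theta) for a Bernoulli model
        return np.prod(theta ** data * (1 - theta) ** (1 - data))

    print(likelihood(0.5, flips))  # probability of the data if the coin is fair
    print(likelihood(0.7, flips))  # higher: 7/10 heads makes theta = 0.7 more likely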
Why do we use the log-likelihood function instead of the likelihood function?
Because the likelihood of independent observations is a product of probabilities, taking the log turns that product into a sum, which is easier to differentiate and avoids numerical underflow. Since the logarithm is monotonic, maximizing the log-likelihood yields the same parameters as maximizing the likelihood.
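A quick numerical sketch of the underflow point, assuming 1,000 i.i.d. observations each with probability 0.01:

    import numpy as np

    probs = np.full(1000, 0.01)    # 1000 observations, each with probability 0.01
    print(np.prod(probs))          # 0.0 -- the product underflows to zero
    print(np.sum(np.log(probs)))   # about -4605.17 -- the log-likelihood stays finite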
How do we find the MLE of a model with parameters θ?
To find the MLE, we take the derivative of the log-likelihood with respect to each parameter, set it to zero, and solve. When no closed-form solution exists, we maximize the log-likelihood numerically.
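A minimal sketch of the numerical route, assuming i.i.d. exponential data (the closed-form answer, rate = 1/mean, comes from setting the derivative to zero):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    data = rng.exponential(scale=2.0, size=1000)   # hypothetical sample, true rate = 0.5

    def neg_log_likelihood(lam):
        # Exponential log-likelihood: n*log(lam) - lam*sum(x)
        return -(len(data) * np.log(lam) - lam * data.sum())

    result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10), method="bounded")
    print(result.x)          # numeric MLE
    print(1 / data.mean())   # closed form from setting the derivative to zero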
What does it mean to fit a distribution to data using MLE?
It means estimating the best parameters for a probability distribution that most likely generated the observed data.
What is an example of a distribution where MLE is used?
MLE is commonly used to estimate the parameters of the Normal distribution, but it applies equally to the Poisson, Exponential, and many other probability distributions.
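For the Poisson distribution, for instance, setting the derivative of the log-likelihood to zero gives an estimate of λ equal to the sample mean. A quick check on hypothetical count data:

    import numpy as np

    rng = np.random.default_rng(1)
    counts = rng.poisson(lam=3.0, size=10_000)  # hypothetical count data
    print(counts.mean())  # the Poisson MLE for lambda is simply the sample mean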
What are the MLE estimates for observed data assumed to follow a normal distribution?
For a normal distribution, the MLE of the mean is the sample average, and the MLE of the variance is the average squared deviation from that mean (dividing by n, not n - 1, so it is slightly biased).
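A minimal sketch on hypothetical data (note ddof=0: the MLE divides by n):

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # hypothetical sample

    mu_hat = x.mean()          # MLE for the mean
    var_hat = x.var(ddof=0)    # MLE for the variance: divides by n, not n - 1
    print(mu_hat, var_hat)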
What are some applications of MLE?
MLE is used in machine learning, statistical inference, financial modeling, and natural language processing.
What assumption does MLE rely on?
MLE assumes that the observed data follows a known probability distribution with unknown parameters.
How is MLE related to Bayesian estimation?
MLE finds the most likely parameters without prior knowledge, while Bayesian estimation incorporates prior beliefs through Bayes’ theorem.
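In symbols: θ_MLE = argmax_θ P(X | θ), while θ_MAP = argmax_θ P(X | θ) P(θ); the extra factor P(θ) is the prior.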
What is the goal of Maximum Likelihood Estimation (MLE) in classification?
MLE aims to maximize the likelihood of the correct class label given the input data.
What is the loss function derived from the negative log-likelihood in a classification problem?
Cross-entropy loss.
What does the cross-entropy loss measure?
It measures how different the predicted probability distribution is from the true one, penalizing incorrect predictions more strongly.
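A minimal sketch with a one-hot target (all values hypothetical):

    import numpy as np

    y_true = np.array([0.0, 1.0, 0.0])   # one-hot true label, class 1
    y_pred = np.array([0.1, 0.7, 0.2])   # predicted probabilities

    cross_entropy = -np.sum(y_true * np.log(y_pred))
    print(cross_entropy)  # about 0.357; grows sharply as y_pred[1] shrinks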
Why is softmax used in classification neural networks?
Softmax converts raw scores (logits) into a valid probability distribution, making them interpretable for categorical classification.
How is softmax mathematically defined?
Softmax for class c: p_c = exp(z_c) / Σ_j exp(z_j), where the z values are the logits; the outputs are positive and sum to 1.
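A minimal implementation sketch; subtracting the maximum logit before exponentiating is a standard stability trick that leaves the result unchanged:

    import numpy as np

    def softmax(z):
        # Shift by the max for numerical stability; the ratio is unaffected
        e = np.exp(z - np.max(z))
        return e / e.sum()

    logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits
    p = softmax(logits)
    print(p, p.sum())  # probabilities that sum to 1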
What is the relationship between softmax and cross-entropy?
Softmax outputs probabilities, and cross-entropy measures the difference between these and the true labels, forming the standard classification loss.
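Composing the two also simplifies algebraically: the cross-entropy of softmax(z) against true class c is -z_c + log Σ_j exp(z_j), which is why libraries typically fuse the two operations for numerical stability. A quick check with hypothetical logits:

    import numpy as np
    from scipy.special import logsumexp

    z = np.array([2.0, 1.0, 0.1])    # hypothetical logits
    c = 0                            # true class index
    loss = -z[c] + logsumexp(z)      # cross-entropy of softmax(z) against class c
    print(loss)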
What assumption about data leads to using softmax + cross-entropy?
We assume a categorical distribution over classes, making softmax + cross-entropy the natural choice under MLE.
Why does softmax ensure a valid probability distribution?
Softmax ensures all outputs are positive and sum to 1, making them interpretable as probabilities.
Break down the steps for MLE in a classification problem.
1. Assume a categorical distribution: given input x, the true label is represented as a one-hot probability distribution (a point mass, sometimes loosely called a Dirac delta), i.e., probability 1 for the correct class and 0 for all others.
2. Under MLE, we want to maximize the joint probability of the correct labels across all data points.
3. We build a model p(y | x, θ) (e.g., a neural network) that outputs a raw score (logit) for each class.
4. We apply softmax so the outputs can be interpreted as a probability distribution over the classes.
5. Maximizing the log-likelihood is then equivalent to minimizing the cross-entropy loss, which pulls the predicted distribution as close as possible to the one-hot target; see the sketch after this list.
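A minimal end-to-end sketch of these five steps in NumPy (all data and names hypothetical; a real model would learn θ by gradient descent on this loss):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    # Step 1: one-hot targets for 4 hypothetical samples, 3 classes
    labels = np.array([0, 2, 1, 0])
    y = np.eye(3)[labels]

    # Step 3: a stand-in "model": random logits instead of a trained network
    rng = np.random.default_rng(3)
    logits = rng.normal(size=(4, 3))

    # Step 4: softmax turns logits into per-class probabilities
    p = softmax(logits)

    # Steps 2 and 5: the negative log-likelihood of the correct labels
    # is exactly the mean cross-entropy loss we would minimize
    loss = -np.mean(np.sum(y * np.log(p), axis=1))
    print(loss)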