Machine Learning and AI Flashcards
What is probabilistic machine learning?
A probabilistic machine learning model assigns probabilities to its outcomes, reflecting how likely the model considers each prediction to be correct
What are the two types of uncertainty, and their definitions?
Aleatoric uncertainty:
Uncertainty due to inherent randomness in the data; it cannot be reduced by collecting more data
Epistemic uncertainty:
Uncertainty due to a lack of knowledge; it can be reduced with more data or a better model
How can we define a statistical model?
p(Y|θ)
Where Y is the dataset and θ are the model parameters
What is another name for Gaussian distribution?
Univariate normal distribution
What does it mean when samples are identically distributed?
They all come from the same probability distribution
Why might we see a capital Pi (∏) where we might expect a capital Sigma (∑)?
Capital Pi indicates a product of the elements, while capital Sigma indicates a sum. For example, the likelihood of n i.i.d. samples is a product: p(Y|θ) = ∏ p(yi|θ)
Exercise: try maximum likelihood for different distributions (Lecture 1 slides, page 29)
I did it and I'm a good boy!!
What are the three types of prior?
Non- informative
Weakly informative
Informative
When should we use a Non-Informative Prior?
When we have essentially no prior knowledge. Such a prior can be a normal distribution with a constant mean and a very large variance.
p(θ) is then approximately the same for all values of θ
When should we use a Weakly Informative Prior?
When we have only rough prior knowledge. Such a prior can be a normal distribution with a constant mean and a moderately large variance.
p(θ) is then roughly the same for all plausible values of θ
When should we use Informative Priors?
When we have strong prior knowledge. Such a prior can be a normal distribution with a constant mean and a very small variance.
Why is the Poisson distribution more suitable for count data?
The Poisson distribution is more suitable for count data because it naturally handles discrete, non-negative values and aligns with the way events are expected to occur over fixed intervals, often capturing the mean-variance relationship seen in such data.
How do we calculate the likelihood for Bayesian linear regression?
Likelihood: p(y|X, σ^2, w) = N(y|Xw, σ^2 I)
How can we calculate a posterior in the context of Bayesian linear regression?
Posterior: p(w|y, X, σ^2) = N(w|wn, Σn)
What does Bayes’ rule state in the context of updating beliefs?
Bayes’ rule states that the posterior (our updated belief after seeing data) is proportional to the likelihood (how well the model explains the data) multiplied by the prior (our initial belief about the parameters).
How is the posterior mathematically expressed using Bayes’ rule?
It is expressed as:
P(θ|D) ∝ P(D|θ) P(θ)
where θ are the parameters and D is the data.
Why do we take the logarithm of the posterior, likelihood, and prior?
Taking the logarithm simplifies the calculations because it converts products into sums, making the math easier to handle, especially for optimization.
What does the equation look like after taking the log of Bayes’ rule? Write this down, the answer is on the back
log P(θ|D) = log P(D|θ) + log P(θ) + const, where the constant (−log P(D)) does not depend on θ
How does a Gaussian likelihood relate to Ordinary Least Squares (OLS)?
When the likelihood is Gaussian, maximizing the likelihood (or minimizing the negative log likelihood) is equivalent to minimizing the sum of squared errors, which is exactly what OLS does.
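A minimal sketch of this equivalence (the data, names like gaussian_nll, and σ² are my own illustrative choices): with w-independent constants dropped, the Gaussian negative log likelihood is the sum of squared errors scaled by 1/(2σ²), so both objectives share the same minimizer.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    w_true = np.array([1.5, -2.0])
    y = X @ w_true + 0.1 * rng.normal(size=50)

    def gaussian_nll(w, sigma2=0.01):
        # Negative log likelihood of y under N(Xw, sigma^2 I),
        # with w-independent constants dropped.
        resid = y - X @ w
        return resid @ resid / (2 * sigma2)

    def sse(w):
        # Ordinary least squares objective: sum of squared errors.
        resid = y - X @ w
        return resid @ resid

    # gaussian_nll = sse / (2 * sigma2), so both have the same minimizer:
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(w_ols)  # close to w_true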
What role does a Gaussian prior play in model building?
A Gaussian prior on the parameters implies that we expect the parameters to be centered around zero and not be too large. Its logarithm introduces a penalty for large parameter values, serving as a regularization term.
What is regularization, why is it important, and how is it represented in a Bayesian framework?
Regularization is a technique that penalizes overly complex models (or large parameter values) to prevent overfitting. In the Bayesian framework, the log prior acts as a regularization term by discouraging large parameter values.
How do the log likelihood and log prior together affect model optimization?
They create a balance: the log likelihood term ensures the model fits the data well (like OLS), while the log prior term keeps the parameter values reasonable (like regularization), leading to a model that is both accurate and generalizable.
How can this Bayesian interpretation be connected to methods like Ridge Regression?
Ridge Regression can be seen as a special case where the Gaussian prior (with a penalty on the size of the parameters) is applied. The regularization term in Ridge Regression is equivalent to the negative log of a Gaussian prior, thus deriving Ridge Regression from a Bayesian perspective.
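A hedged sketch of that connection (the data, σ², and prior variance τ² are illustrative assumptions): the MAP estimate under a zero-mean Gaussian prior is exactly the ridge solution with penalty λ = σ²/τ².

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=100)

    sigma2 = 1.0         # assumed noise variance of the likelihood
    tau2 = 10.0          # assumed variance of the Gaussian prior on w
    lam = sigma2 / tau2  # the ridge penalty implied by the prior

    # MAP estimate under a N(0, tau^2 I) prior = ridge solution:
    # w = (X^T X + lam * I)^{-1} X^T y
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_map)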
What is the basic equation assumed for each observation in a Bayesian linear model?
yi = wxi + εi, where wxi is the linear part and
εi is the error term
Why is the normal distribution usually assumed for the error term in a linear model?
The normal distribution is used because it simplifies the math and suits continuous data; however, its support is all real numbers, including negative values.
Why might a linear model with normally distributed errors be unsuitable for modeling the number of cases (compared to days)?
Because the number of cases is count data (discrete and non-negative), and the normal distribution is not ideal for representing count data.
What alternative models might be better suited for count data?
Models like the Poisson or negative binomial regression are more appropriate for count data.
How can we generalize a linear model for a given data set?
We need a link function, which lets us relate non-normally distributed data to a linear combination of the predictors.
* In the case of the Poisson distribution, the expectation (or mean) is the rate parameter λ:
λi = wxi
* The identity link above is not appropriate, as λ must be greater than 0, so instead:
λi = f(wxi)
where f(·) is called an activation function in the ML literature (e.g. f = exp, which guarantees λi > 0)
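A small illustrative sketch (w and x are made up): passing the linear predictor through exp(·) keeps every rate λi positive, whereas the identity link does not.

    import numpy as np

    w = 0.3
    x = np.array([-2.0, 0.0, 1.0, 3.0])

    lam_identity = w * x     # identity link: negative for x = -2, invalid rate
    lam_exp = np.exp(w * x)  # exp(.) keeps every lambda_i > 0

    # Valid Poisson counts can then be drawn from the model:
    rng = np.random.default_rng(2)
    counts = rng.poisson(lam_exp)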
Define the design matrix
-Represents the input data where each row is a data point and each column is a feature
-Usual notation is X
Define the Weight Vector
-Represents the coefficients of the model.
-Each row is a model coefficient.
-Denoted as w
Define the output vector
Represents the observed values (dependent variables), one per row
-Denoted as y
Define the Posterior Covariance matrix
-Represents the uncertainty in the posterior belief about w (looks like an error function)
-Becomes smaller as more data is observed
-Formula:
Σn = (Σ0^-1 + σ^-2X^TX)^-1
Define the prior covariance matrix
-Represents the uncertainty in the prior belief about w
-Larger values indicate higher uncertainty
-Denoted as Σ0
Define the Posterior Mean Vector (wn)
-Represents the updated estimate of w after observing data
-Formula:
wn = Σn(Σ0^-1w0 + σ^-2X^Ty)
Define the Likelihood Noise (σ^2I)
-Represents the noise in the observed data
-Size: n×n
-Formula:
p(y|X,w,σ^2) = N(y|Xw, σ^2I)
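A minimal numpy sketch of the update formulas from the last few cards (the data, noise level, and prior settings are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 50, 2
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -0.5])
    sigma2 = 0.25
    y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

    w0 = np.zeros(d)           # prior mean
    Sigma0 = 10.0 * np.eye(d)  # prior covariance (large = uncertain)

    # Sigma_n = (Sigma_0^-1 + sigma^-2 X^T X)^-1
    Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
    # w_n = Sigma_n (Sigma_0^-1 w_0 + sigma^-2 X^T y)
    w_n = Sigma_n @ (np.linalg.inv(Sigma0) @ w0 + X.T @ y / sigma2)

    print(w_n)      # posterior mean, close to w_true
    print(Sigma_n)  # posterior covariance, shrinks as n grows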
What is logistic regression?
-A classification model that predicts probabilities instead of strictly defined class predictions.
-e.g. There is a 75% chance this image is a cat and 25% this image is a dog
Formula is:
θi = 1 / (1 + e^-ηi)
where ηi = w0 + w1xi
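A tiny sketch of this formula (the weights and inputs are illustrative):

    import numpy as np

    def sigmoid(eta):
        # theta = 1 / (1 + exp(-eta)) squashes the linear predictor
        # into a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-eta))

    w0, w1 = -1.0, 2.0            # illustrative weights
    x = np.array([-2.0, 0.0, 2.0])
    theta = sigmoid(w0 + w1 * x)  # e.g. P(image is a cat) for each input
    print(theta)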
What is the difference between Discriminative and Generative Classification models?
Discriminative models learn the boundaries between categories while generative models learn how each category is distributed
How are weights defined in Bayesian logistic regression, and why?
The weights are treated as random variables with probability distributions rather than point values
-This accounts for the uncertainty in classification
-The posterior must be approximated using methods such as Markov Chain Monte Carlo or Laplace approximation.
Why might we use the Naive Bayes Classifier?
It's simple and fast: it naively assumes the features are conditionally independent given the class, so each feature's distribution can be estimated separately and predictions are cheap to compute.
What is the probability formula for the Naive Bayes Classifier?
p(y=c|X) = p(X|y=c) × p(y=c) / p(X)
What are the two types of Naive Bayes Classifiers?
Bernoulli is for binary data (yes/no) and Multinomial is used for categorical data, e.g. document classification
Week 3
Why would we choose to use a Laplace approximation?
The posterior distribution is often complex and does not have a closed-form solution (it cannot be solved exactly)
The Laplace approximation simplifies the posterior by approximating it as a Gaussian distribution
What is Bayesian Linear Regression?
Similar to linear regression, but the weights are represented as a probability distribution
Why is Laplace Approximation used in Bayesian inference?
In Bayesian inference, the posterior distribution often lacks a closed-form solution, making exact inference difficult. Laplace Approximation provides a way to approximate this posterior using a Gaussian distribution centered around the mode, simplifying computations.
How is the posterior computed in Bayesian Linear Regression?
The posterior is determined by combining prior knowledge about the model parameters with the likelihood of the observed data. Given that both the prior and likelihood follow a normal distribution, the result is another normal distribution whose center (mean) is a weighted combination of prior beliefs and observed data.
What is the likelihood function in Bayesian Linear Regression?
The likelihood function represents how probable the observed data is given specific model parameters. In Bayesian Linear Regression, it assumes that the observed data follows a normal distribution around the model’s predictions, with some amount of noise.
What challenge arises in Bayesian Logistic Regression that requires approximation?
Unlike Bayesian Linear Regression, Bayesian Logistic Regression does not yield a closed-form posterior due to the non-Gaussian likelihood (Bernoulli-distributed output). This necessitates approximations such as Laplace Approximation, Variational Inference, or Markov Chain Monte Carlo.
How does the Occam Factor influence model selection?
The Occam Factor penalizes complex models by reducing their posterior probability, favouring simpler models that fit the data well without excessive parameters.
What is the fundamental goal of Laplace Approximation?
The goal of Laplace Approximation is to approximate a complex, continuous posterior distribution with a Gaussian distribution by expanding it around the mode using a second-order Taylor series expansion.
How does Laplace Approximation work in 1D?
In one-dimensional problems, Laplace Approximation finds the highest probability point of the distribution and fits a Gaussian curve around it. The shape of this curve is determined by how quickly the probability changes near this peak.
How is Laplace Approximation extended to multiple dimensions?
In higher dimensions, the concept remains the same, but instead of a single value, the curvature of the probability distribution is represented using a matrix (called the Hessian matrix). This helps define the shape and spread of the Gaussian approximation.
How is a Gaussian approximation derived using Taylor expansion?
The probability function is expanded as a quadratic function around its highest probability point using Taylor series. This quadratic form closely resembles a Gaussian curve, allowing it to be used as an approximation.
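A hedged 1D sketch of the procedure (the Gamma-like target density is an illustrative choice, not from the slides): find the mode, measure the curvature there, and read off the Gaussian's variance.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Unnormalized log target: log p(x) = 3 log x - x (a Gamma-like density).
    def log_p(x):
        return 3.0 * np.log(x) - x

    # 1. Find the mode by maximizing log_p.
    res = minimize_scalar(lambda x: -log_p(x), bounds=(1e-6, 20.0), method="bounded")
    mode = res.x  # analytically, the mode is x = 3

    # 2. Curvature at the mode: finite-difference second derivative.
    h = 1e-4
    d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2

    # 3. Laplace approximation: N(mode, -1/d2).
    var = -1.0 / d2  # analytically, variance = 3 for this target
    print(mode, var)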
What are the limitations of Laplace Approximation?
Ineffective for multimodal distributions (assumes unimodality).
Less reliable for small datasets where Gaussian approximation may be poor.
Only applies to real variables unless modified.
May overlook global features of the posterior.
What is the Bayesian Information Criterion (BIC)?
BIC is a statistical measure used to compare models. It balances how well a model fits the data with how complex it is, discouraging overly complicated models that may not generalize well.
How does BIC compare models?
BIC approximates model evidence by balancing fit quality and complexity. A lower BIC value indicates a better trade-off between explanatory power and complexity.
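For reference, the standard formula (not on the original card) is BIC = k ln n − 2 ln L̂, where k is the number of parameters, n the number of data points, and L̂ the maximized likelihood; the k ln n term is the complexity penalty, and the model with the lowest BIC is preferred.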
How does the Occam Factor influence model selection?
The Occam Factor favors simpler models by reducing their probability if they have too many parameters. This ensures models do not become overly complex without a significant improvement in accuracy.
Why is Laplace Approximation useful in Bayesian inference?
It transforms intractable posteriors into manageable Gaussian distributions, simplifying computations while retaining Bayesian principles.
Week 4
What is information in the context of Information Theory?
Information measures the degree of surprise when observing a specific outcome of a random variable. A certain event provides no information, while an unlikely event provides more information.
How is entropy related to code length in data compression?
Entropy provides a lower bound on the number of bits needed to encode a random variable. By assigning shorter codes to more probable outcomes and longer codes to rare outcomes, we can achieve efficient compression.
How does entropy differ between uniform and non-uniform distributions?
A uniform distribution, where all outcomes are equally likely, has the highest entropy because the uncertainty is maximized. A non-uniform distribution, where some outcomes are more probable than others, has lower entropy.
How does probability relate to information content?
Information content is inversely proportional to probability. The less likely an event is, the more information it provides when observed. This is represented as the negative logarithm of the probability.
What is entropy in Information Theory?
Entropy quantifies the average amount of information contained in a random variable. It represents the expected surprise of an outcome and is calculated as the sum of probabilities of all outcomes multiplied by their information content
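A short sketch computing this (the distributions are made-up examples), which also shows the uniform case maximizing entropy:

    import numpy as np

    def entropy(p):
        # H(X) = -sum_x p(x) log2 p(x), in bits; 0 log 0 is treated as 0.
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0 bits (maximum)
    print(entropy([0.9, 0.05, 0.03, 0.02]))   # skewed: much lower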
How does entropy extend to continuous variables?
For continuous variables, entropy is generalized to differential entropy, which measures uncertainty over a continuous range of values. It differs from discrete entropy by a constant term that depends on bin size.
What is Shannon’s Noiseless Coding Theorem?
This theorem states that the entropy of a source is the minimum number of bits required on average to encode its messages without losing information.
How does entropy relate to thermodynamics?
Originally a concept in physics, entropy in thermodynamics measures disorder in a system. Statistical mechanics later connected this to information theory, showing entropy as a measure of uncertainty in state descriptions.
What is Maximum Entropy?
The principle of Maximum Entropy states that, given limited knowledge about a system, the probability distribution that best represents this knowledge is the one with the highest entropy, meaning it assumes the least additional information.
What is Conditional Entropy?
Conditional entropy measures the remaining uncertainty of a variable after another related variable is known. It quantifies how much additional information is needed to describe the first variable given the second.
How does entropy change when considering multiple variables?
The joint entropy of two variables is the sum of their individual entropies minus their mutual information. The more related they are, the lower the conditional entropy.
What is Relative Entropy or Kullback-Leibler (KL) Divergence?
KL Divergence measures how different one probability distribution is from another. It quantifies the extra information needed to encode samples from one distribution using another distribution’s code.
How does KL Divergence relate to Bayesian analysis?
In Bayesian statistics, KL Divergence is used to measure how much a posterior distribution differs from a prior distribution after incorporating new evidence.
What is the significance of KL Divergence being non-symmetric?
KL Divergence is not a true distance metric because swapping the two distributions changes the value. It specifically measures how much information is lost when using an approximate distribution instead of the true one.
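A quick numeric sketch of the asymmetry (p and q are made-up distributions):

    import numpy as np

    def kl(p, q):
        # KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0.
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    p = [0.7, 0.2, 0.1]
    q = [0.4, 0.4, 0.2]
    print(kl(p, q), kl(q, p))  # different values: KL is not symmetric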
How is convexity related to KL Divergence?
Convexity ensures that KL Divergence satisfies certain mathematical properties, including always being non-negative and only equaling zero when two distributions are identical.
What is Mutual Information?
Mutual Information quantifies the reduction in uncertainty of one variable given knowledge of another. It measures how much knowing one variable tells us about another.
How does Mutual Information relate to KL Divergence?
Mutual Information can be defined using KL Divergence between the joint distribution of two variables and the product of their marginal distributions. This captures how dependent the two variables are.
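A small sketch of that identity (the 2×2 joint table is a made-up example):

    import numpy as np

    # Joint distribution of two binary variables (rows: X, columns: Y).
    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    indep = np.outer(px, py)  # the joint if X and Y were independent

    # I(X;Y) = KL(joint || product of marginals)
    mi = np.sum(joint * np.log(joint / indep))
    print(mi)  # > 0, so the variables are dependent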
How is Mutual Information useful in machine learning?
Mutual Information is used in feature selection, clustering, and other tasks where understanding the dependency between variables is critical for model accuracy.
Week 5 Slides
Why is approximation needed in Bayesian inference?
Bayesian inference often requires calculating posterior distributions, which can be intractable for complex models. Approximation methods, such as variational inference and Laplace approximation, help make these calculations feasible.
What are the two main types of approximation methods?
Deterministic methods: Laplace approximation and variational inference.
Stochastic methods: Markov Chain Monte Carlo (MCMC), which relies on sampling.
What is Laplace Approximation?
Laplace Approximation fits a Gaussian distribution to the posterior by centering it at the mode (most probable value). It simplifies complex distributions but may be inaccurate for multimodal or small datasets.
What is Variational Inference?
Variational Inference approximates a complex posterior distribution with a simpler, more manageable one by optimizing an objective function to minimize the difference between the true and approximate distributions.
How does Variational Inference work?
It selects an approximate distribution from a tractable family and optimizes its parameters to make it as close as possible to the true posterior. This converts the inference problem into an optimization problem.
What is the Evidence Lower Bound (ELBO)?
ELBO is an alternative function used for optimization in Variational Inference. Instead of directly minimizing the difference between distributions, ELBO provides a lower bound on the model’s evidence, guiding the optimization process.
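One common way to write this (not on the original card): ELBO(q) = E_q[log p(D, θ)] − E_q[log q(θ)], and log p(D) = ELBO(q) + KL(q(θ) || p(θ|D)); since log p(D) is fixed, maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.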
What is Jensen’s Inequality and how does it relate to Variational Inference?
Jensen’s Inequality states that for a convex function, the function’s value at an average is less than or equal to the average of the function’s values. It helps derive ELBO and justify its use in variational optimization.
What is Kullback-Leibler (KL) Divergence?
KL Divergence measures how different two probability distributions are. In Variational Inference, it quantifies how much the approximation differs from the true posterior and serves as a key optimization target.
What are Forward and Reverse KL Divergence?
Forward KL (zero-avoiding): Ensures the approximate distribution covers all probable regions of the true distribution.
Reverse KL (zero-forcing): Leads to an approximation that fits tightly around the mode, ignoring low-probability areas.
What is the Mean Field Approximation?
The Mean Field Approximation simplifies Variational Inference by assuming that the variables in the model are independent, allowing for a factorized form of the approximate distribution.
What is the Block Coordinate Ascent algorithm?
This algorithm iteratively updates each variable’s approximate distribution while keeping others fixed, gradually improving the overall approximation until ELBO converges.
What is Parametric Variational Inference?
Instead of assuming a factorized form, Parametric Variational Inference restricts the approximate distribution to a specific parametric family, optimizing its parameters to best match the true posterior.
How does Variational Inference compare to MCMC?
Variational Inference is faster and provides deterministic approximations but may introduce bias.
MCMC is more accurate but computationally expensive due to extensive sampling.
What are the practical challenges of Variational Inference?
It may oversimplify complex distributions.
The choice of approximate family can limit accuracy.
Optimization can be difficult in high-dimensional settings.
What is the overall goal of Variational Inference?
The goal is to efficiently approximate intractable posterior distributions by maximizing ELBO, minimizing KL divergence, and finding a balance between speed and accuracy in probabilistic modeling.
Week 6 Slides
What is the main limitation of Variational Inference?
Variational inference may result in a poor approximation q(θ) of the true posterior p(θ), leading to inaccurate results. The choice of q(θ) can strongly impact performance, and the approach may struggle with complex, multi-modal posteriors.
What is Monte Carlo Integration?
Monte Carlo Integration is a technique for approximating integrals using sample averages. It is particularly useful in probabilistic inference, where computing expectations analytically is difficult. The fundamental idea is to approximate an integral by drawing random samples and averaging their function values.
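A minimal sketch (the integrand and sample size are illustrative choices): estimate E[f(x)] by averaging f over random samples; the error shrinks like 1/√N.

    import numpy as np

    rng = np.random.default_rng(4)
    samples = rng.normal(size=100_000)  # x ~ N(0, 1)

    # E[f(x)] is approximated by the sample average of f; here f(x) = x^2,
    # whose true expectation under N(0, 1) is 1.
    estimate = np.mean(samples**2)
    print(estimate)  # approx 1.0; error shrinks like 1/sqrt(N)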
What is Importance Sampling?
Importance sampling is a method used to approximate expectations or integrals when direct sampling from the target distribution p(x) is difficult. Instead, samples are drawn from an easier-to-sample proposal distribution q(x), and the results are reweighted by the ratio p(x) / q(x) to estimate the true expectation.
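A hedged sketch (the target N(3, 1) and proposal N(0, 2) are illustrative; the proposal's heavier tails cover the target's support, as the next card requires):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.normal(loc=0.0, scale=2.0, size=100_000)  # samples from q = N(0, 2)

    # Reweight by p(x)/q(x) to estimate E_p[x] for the target p = N(3, 1).
    w = stats.norm.pdf(x, loc=3.0, scale=1.0) / stats.norm.pdf(x, loc=0.0, scale=2.0)
    estimate = np.mean(w * x)
    print(estimate)  # approx 3.0, the mean of the target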
What is a key assumption in Importance Sampling?
A key assumption in Importance Sampling is that the proposal distribution q(x) must be non-zero wherever p(x) is non-zero. If q(x) does not sufficiently cover the support of p(x), the variance of the estimator can be very high, leading to poor performance.
When is Importance Sampling inefficient?
Importance Sampling becomes inefficient when the proposal distribution q(x) does not closely match the target distribution p(x). If q(x) has lighter tails than p(x), it may rarely sample regions where p(x) has significant probability mass, leading to large importance weights and high variance in estimates.
What is Rejection Sampling?
Rejection sampling is a technique to generate samples from a target distribution p(x) by using a proposal distribution q(x). A sample x0 ~ q(x) is drawn and accepted with probability p(x0) / (k q(x0)), where k is a constant ensuring the comparison function k q(x) bounds the target distribution.
How does Rejection Sampling differ from Importance Sampling?
Rejection sampling only accepts some samples from the proposal distribution q(x), discarding others based on an acceptance criterion. Importance sampling, in contrast, uses all samples but reweights them to account for differences between p(x) and q(x).
What is Markov Chain Monte Carlo (MCMC)?
MCMC is a class of algorithms used to sample from high-dimensional probability distributions by constructing a Markov chain whose stationary distribution is the target distribution p(x). Unlike Importance Sampling, MCMC does not require a proposal distribution that closely matches p(x), making it effective in high-dimensional spaces.
What is the Markov property in MCMC?
The Markov property states that the future state of a Markov chain depends only on the present state, not on past states. This property is crucial for constructing MCMC algorithms where each sample is generated based on the previous sample.
What is the Metropolis Algorithm?
The Metropolis algorithm is an MCMC method that generates a new candidate state using a symmetric proposal distribution (e.g., a Gaussian centered on the current state). The candidate is accepted with probability proportional to the ratio of the target distribution values at the new and old states; otherwise, the current state is retained.
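A compact sketch of a random-walk Metropolis sampler (the target, proposal scale, and burn-in length are illustrative assumptions):

    import numpy as np

    # Unnormalized target density: N(2, 1) up to a constant.
    def unnorm_p(x):
        return np.exp(-0.5 * (x - 2.0) ** 2)

    rng = np.random.default_rng(6)
    x = 0.0
    samples = []
    for _ in range(20_000):
        x_new = x + rng.normal(scale=1.0)      # symmetric Gaussian proposal
        ratio = unnorm_p(x_new) / unnorm_p(x)  # target ratio (constants cancel)
        if rng.uniform() < ratio:              # accept with prob min(1, ratio)
            x = x_new
        samples.append(x)

    print(np.mean(samples[5_000:]))  # approx 2 after discarding burn-in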
How does the Metropolis-Hastings Algorithm generalize the Metropolis Algorithm?
The Metropolis-Hastings algorithm extends the Metropolis algorithm to allow for asymmetric proposal distributions, q(x'|x) ≠ q(x|x'). It adjusts the acceptance probability to account for the asymmetry in proposals, ensuring detailed balance is maintained.
What is Gibbs Sampling?
Gibbs Sampling is a special case of the Metropolis-Hastings algorithm where each variable is sampled in turn, conditioned on the current values of all other variables. This method is especially useful when the full conditional distributions are easy to sample from.
Why is Gibbs Sampling useful in Bayesian Inference?
Gibbs Sampling is useful in Bayesian inference because it allows efficient sampling from complex joint distributions by iteratively updating each variable using its full conditional distribution. This makes it particularly effective for high-dimensional models.
What are the main advantages and disadvantages of MCMC methods?
Advantages:
Can handle high-dimensional and complex distributions.
Does not require an explicitly known normalization constant for p(x).
Disadvantages:
Requires many iterations to achieve convergence.
Generated samples are not independent (autocorrelated).
Computationally expensive compared to direct sampling methods when feasible.