Machine Learning and AI Flashcards
What is probabilistic machine learning?
A probabilistic machine learning model assigns probabilities to its outcomes, reflecting how likely the model considers each prediction to be correct
What are the two types of uncertainty, and their definitions?
Aleatoric uncertainty:
Uncertainty due to inherent randomness in the data; it cannot be reduced by collecting more data
Epistemic uncertainty:
Uncertainty due to a lack of knowledge; it can be reduced with more data or a better model
How can we define a statistical model?
p(Y|θ)
Where Y is the dataset and θ are the model parameters
What is another name for Gaussian distribution?
Univariate normal distribution
What does it mean when samples are identically distributed?
They all come from the same probability distribution
Why might we see a capital Pi (∏) where we might expect a capital Sigma (∑)?
Capital Pi indicates a product of the elements, while capital Sigma indicates a sum. For example, the likelihood of n i.i.d. samples is a product: p(Y|θ) = ∏ p(yi|θ)
Exercise: try maximum likelihood for different distributions (Lecture 1 slides, page 29)
I did it and I'm a good boy!!
What are the three types of prior?
Non- informative
Weakly informative
Informative
When should we use a Non-Informative Prior?
When we have essentially no prior knowledge. Such a prior can be a normal distribution with a constant mean and a very large variance.
p(θ) is then approximately the same for all values of θ
When should we use a Weakly Informative Prior?
When we have only rough prior knowledge. Such a prior can be a normal distribution with a constant mean and a moderately large variance.
p(θ) is then roughly the same for all plausible values of θ
When should we use Informative Priors?
When we have strong prior knowledge. Such a prior can be a normal distribution with a constant mean and a very small variance.
Why is the Poisson distribution more suitable for count data?
The Poisson distribution is more suitable for count data because it naturally handles discrete, non-negative values and aligns with the way events are expected to occur over fixed intervals, often capturing the mean-variance relationship seen in such data.
How do we calculate the likelihood for Bayesian linear regression?
Likelihood: p(y|X, σ^2, w) = N(y|Xw, σ^2 I)
How can we calculate a posterior in the context of Bayesian linear regression?
Posterior: p(w|y, X, σ^2) = N(w|wn, Σn)
What does Bayes’ rule state in the context of updating beliefs?
Bayes’ rule states that the posterior (our updated belief after seeing data) is proportional to the likelihood (how well the model explains the data) multiplied by the prior (our initial belief about the parameters).
How is the posterior mathematically expressed using Bayes’ rule?
It is expressed as:
P(θ|D) ∝ P(D|θ) P(θ)
where θ are the parameters and D is the data.
Why do we take the logarithm of the posterior, likelihood, and prior?
Taking the logarithm simplifies the calculations because it converts products into sums, making the math easier to handle, especially for optimization.
What does the equation look like after taking the log of Bayes’ rule? Write this down, the answer is on the back
log P(θ|D) = log P(D|θ) + log P(θ) + const, where the constant (−log P(D)) does not depend on θ
How does a Gaussian likelihood relate to Ordinary Least Squares (OLS)?
When the likelihood is Gaussian, maximizing the likelihood (or minimizing the negative log likelihood) is equivalent to minimizing the sum of squared errors, which is exactly what OLS does.
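A minimal sketch of this equivalence (the data, names like gaussian_nll, and σ² are my own illustrative choices): with w-independent constants dropped, the Gaussian negative log likelihood is the sum of squared errors scaled by 1/(2σ²), so both objectives share the same minimizer.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    w_true = np.array([1.5, -2.0])
    y = X @ w_true + 0.1 * rng.normal(size=50)

    def gaussian_nll(w, sigma2=0.01):
        # Negative log likelihood of y under N(Xw, sigma^2 I),
        # with w-independent constants dropped.
        resid = y - X @ w
        return resid @ resid / (2 * sigma2)

    def sse(w):
        # Ordinary least squares objective: sum of squared errors.
        resid = y - X @ w
        return resid @ resid

    # gaussian_nll = sse / (2 * sigma2), so both have the same minimizer:
    w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    print(w_ols)  # close to w_true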
What role does a Gaussian prior play in model building?
A Gaussian prior on the parameters implies that we expect the parameters to be centered around zero and not be too large. Its logarithm introduces a penalty for large parameter values, serving as a regularization term.
What is regularization, why is it important, and how is it represented in a Bayesian framework?
Regularization is a technique that penalizes overly complex models (or large parameter values) to prevent overfitting. In the Bayesian framework, the log prior acts as a regularization term by discouraging large parameter values.
How do the log likelihood and log prior together affect model optimization?
They create a balance: the log likelihood term ensures the model fits the data well (like OLS), while the log prior term keeps the parameter values reasonable (like regularization), leading to a model that is both accurate and generalizable.
How can this Bayesian interpretation be connected to methods like Ridge Regression?
Ridge Regression can be seen as a special case where the Gaussian prior (with a penalty on the size of the parameters) is applied. The regularization term in Ridge Regression is equivalent to the negative log of a Gaussian prior, thus deriving Ridge Regression from a Bayesian perspective.
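A hedged sketch of that connection (the data, σ², and prior variance τ² are illustrative assumptions): the MAP estimate under a zero-mean Gaussian prior is exactly the ridge solution with penalty λ = σ²/τ².

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=100)

    sigma2 = 1.0         # assumed noise variance of the likelihood
    tau2 = 10.0          # assumed variance of the Gaussian prior on w
    lam = sigma2 / tau2  # the ridge penalty implied by the prior

    # MAP estimate under a N(0, tau^2 I) prior = ridge solution:
    # w = (X^T X + lam * I)^{-1} X^T y
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    print(w_map)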
What is the basic equation assumed for each observation in a Bayesian linear model?
yi = wxi + εi, where wxi is the linear part and
εi is the error term
Why is the normal distribution usually assumed for the error term in a linear model?
The normal distribution is used because it simplifies the math and suits continuous data; however, its support is all real numbers, including negative values.
Why might a linear model with normally distributed errors be unsuitable for modeling the number of cases (compared to days)?
Because the number of cases is count data (discrete and non-negative), and the normal distribution is not ideal for representing count data.
What alternative models might be better suited for count data?
Models like the Poisson or negative binomial regression are more appropriate for count data.
How can we generalize a linear model for a given data set?
We need a link function, which lets us relate non-normally distributed data to a linear combination of the predictors.
* In the case of the Poisson distribution, the expectation (or mean) is the rate parameter λ:
λi = wxi
* The identity link above is not appropriate, as λ must be greater than 0, so instead:
λi = f(wxi)
where f(·) is called an activation function in the ML literature (e.g. f = exp, which guarantees λi > 0)
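A small illustrative sketch (w and x are made up): passing the linear predictor through exp(·) keeps every rate λi positive, whereas the identity link does not.

    import numpy as np

    w = 0.3
    x = np.array([-2.0, 0.0, 1.0, 3.0])

    lam_identity = w * x     # identity link: negative for x = -2, invalid rate
    lam_exp = np.exp(w * x)  # exp(.) keeps every lambda_i > 0

    # Valid Poisson counts can then be drawn from the model:
    rng = np.random.default_rng(2)
    counts = rng.poisson(lam_exp)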
Define the design matrix
-Represents the input data where each row is a data point and each column is a feature
-Usual notation is X
Define the Weight Vector
-Represents the coefficients of the model.
-Each row is a model coefficient.
-Denoted as w
Define the output vector
Represents the observed values (dependent variables), one per row
-Denoted as y
Define the Posterior Covariance matrix
-Represents the uncertainty in the posterior belief about w (looks like an error function)
-Becomes smaller as more data is observed
-Formula:
Σn = (Σ0^-1 + σ^-2X^TX)^-1
Define the prior covariance matrix
-Represents the uncertainty in the prior belief about w
-Larger values indicate higher uncertainty
-Denoted as Σ0
Define the Posterior Mean Vector (wn)
-Represents the updated estimate of w after observing data
-Formula:
wn = Σn(Σ0^-1w0 + σ^-2X^Ty)
Define the Likelihood Noise (σ^2I)
-Represents the noise in the observed data
-Size: n×n
-Formula:
p(y|X,w,σ^2) = N(y|Xw, σ^2I)
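A minimal numpy sketch of the update formulas from the last few cards (the data, noise level, and prior settings are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    n, d = 50, 2
    X = rng.normal(size=(n, d))
    w_true = np.array([1.0, -0.5])
    sigma2 = 0.25
    y = X @ w_true + np.sqrt(sigma2) * rng.normal(size=n)

    w0 = np.zeros(d)           # prior mean
    Sigma0 = 10.0 * np.eye(d)  # prior covariance (large = uncertain)

    # Sigma_n = (Sigma_0^-1 + sigma^-2 X^T X)^-1
    Sigma_n = np.linalg.inv(np.linalg.inv(Sigma0) + X.T @ X / sigma2)
    # w_n = Sigma_n (Sigma_0^-1 w_0 + sigma^-2 X^T y)
    w_n = Sigma_n @ (np.linalg.inv(Sigma0) @ w0 + X.T @ y / sigma2)

    print(w_n)      # posterior mean, close to w_true
    print(Sigma_n)  # posterior covariance, shrinks as n grows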
What is logistic regression?
-A classification model that predicts probabilities instead of strictly defined class predictions.
-e.g. There is a 75% chance this image is a cat and 25% this image is a dog
Formula is:
θi = 1 / (1 + e^-ηi)
where ηi = w0 + w1xi
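A tiny sketch of this formula (the weights and inputs are illustrative):

    import numpy as np

    def sigmoid(eta):
        # theta = 1 / (1 + exp(-eta)) squashes the linear predictor
        # into a probability in (0, 1).
        return 1.0 / (1.0 + np.exp(-eta))

    w0, w1 = -1.0, 2.0            # illustrative weights
    x = np.array([-2.0, 0.0, 2.0])
    theta = sigmoid(w0 + w1 * x)  # e.g. P(image is a cat) for each input
    print(theta)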
What is the difference between Discriminative and Generative Classification models?
Discriminative models learn the boundaries between categories while generative models learn how each category is distributed
How are weights defined in Bayesian logistic regression, and why?
The weights are treated as random variables with probability distributions rather than point values
-This accounts for the uncertainty in classification
-The posterior must be approximated using methods such as Markov Chain Monte Carlo or Laplace approximation.
Why might we use the Naive Bayes Classifier?
It's simple and fast: it naively assumes the features are conditionally independent given the class, so each feature's distribution can be estimated separately and predictions are cheap to compute.
What is the probability formula for the Naive Bayes Classifier?
p(y=c|X) = p(X|y=c) × p(y=c) / p(X)
What are the two types of Naive Bayes Classifiers?
Bernoulli is for binary data (yes/no) and Multinomial is used for categorical data, e.g. document classification
Week 3
Why would we choose to use a Laplace approximation?
The posterior distribution is often complex and does not have a closed-form solution (it cannot be solved exactly)
The Laplace approximation simplifies the posterior by approximating it as a Gaussian distribution
What is Bayesian Linear Regression?
Similar to linear regression, but the weights are represented as a probability distribution
Why is Laplace Approximation used in Bayesian inference?
In Bayesian inference, the posterior distribution often lacks a closed-form solution, making exact inference difficult. Laplace Approximation provides a way to approximate this posterior using a Gaussian distribution centered around the mode, simplifying computations.
How is the posterior computed in Bayesian Linear Regression?
The posterior is determined by combining prior knowledge about the model parameters with the likelihood of the observed data. Given that both the prior and likelihood follow a normal distribution, the result is another normal distribution whose center (mean) is a weighted combination of prior beliefs and observed data.
What is the likelihood function in Bayesian Linear Regression?
The likelihood function represents how probable the observed data is given specific model parameters. In Bayesian Linear Regression, it assumes that the observed data follows a normal distribution around the model’s predictions, with some amount of noise.
What challenge arises in Bayesian Logistic Regression that requires approximation?
Unlike Bayesian Linear Regression, Bayesian Logistic Regression does not yield a closed-form posterior due to the non-Gaussian likelihood (Bernoulli-distributed output). This necessitates approximations such as Laplace Approximation, Variational Inference, or Markov Chain Monte Carlo.
How does the Occam Factor influence model selection?
The Occam Factor penalizes complex models by reducing their posterior probability, favouring simpler models that fit the data well without excessive parameters.
What is the fundamental goal of Laplace Approximation?
The goal of Laplace Approximation is to approximate a complex, continuous posterior distribution with a Gaussian distribution by expanding it around the mode using a second-order Taylor series expansion.
How does Laplace Approximation work in 1D?
In one-dimensional problems, Laplace Approximation finds the highest probability point of the distribution and fits a Gaussian curve around it. The shape of this curve is determined by how quickly the probability changes near this peak.
How is Laplace Approximation extended to multiple dimensions?
In higher dimensions, the concept remains the same, but instead of a single value, the curvature of the probability distribution is represented using a matrix (called the Hessian matrix). This helps define the shape and spread of the Gaussian approximation.
How is a Gaussian approximation derived using Taylor expansion?
The probability function is expanded as a quadratic function around its highest probability point using Taylor series. This quadratic form closely resembles a Gaussian curve, allowing it to be used as an approximation.
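A hedged 1D sketch of the procedure (the Gamma-like target density is an illustrative choice, not from the slides): find the mode, measure the curvature there, and read off the Gaussian's variance.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Unnormalized log target: log p(x) = 3 log x - x (a Gamma-like density).
    def log_p(x):
        return 3.0 * np.log(x) - x

    # 1. Find the mode by maximizing log_p.
    res = minimize_scalar(lambda x: -log_p(x), bounds=(1e-6, 20.0), method="bounded")
    mode = res.x  # analytically, the mode is x = 3

    # 2. Curvature at the mode: finite-difference second derivative.
    h = 1e-4
    d2 = (log_p(mode + h) - 2 * log_p(mode) + log_p(mode - h)) / h**2

    # 3. Laplace approximation: N(mode, -1/d2).
    var = -1.0 / d2  # analytically, variance = 3 for this target
    print(mode, var)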
What are the limitations of Laplace Approximation?
Ineffective for multimodal distributions (assumes unimodality).
Less reliable for small datasets where Gaussian approximation may be poor.
Only applies to real variables unless modified.
May overlook global features of the posterior.
What is the Bayesian Information Criterion (BIC)?
BIC is a statistical measure used to compare models. It balances how well a model fits the data with how complex it is, discouraging overly complicated models that may not generalize well.
How does BIC compare models?
BIC approximates model evidence by balancing fit quality and complexity. A lower BIC value indicates a better trade-off between explanatory power and complexity.
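For reference, the standard formula (not on the original card) is BIC = k ln n − 2 ln L̂, where k is the number of parameters, n the number of data points, and L̂ the maximized likelihood; the k ln n term is the complexity penalty, and the model with the lowest BIC is preferred.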
How does the Occam Factor influence model selection?
The Occam Factor favors simpler models by reducing their probability if they have too many parameters. This ensures models do not become overly complex without a significant improvement in accuracy.
Why is Laplace Approximation useful in Bayesian inference?
It transforms intractable posteriors into manageable Gaussian distributions, simplifying computations while retaining Bayesian principles.
Week 4
What is information in the context of Information Theory?
Information measures the degree of surprise when observing a specific outcome of a random variable. A certain event provides no information, while an unlikely event provides more information.
How is entropy related to code length in data compression?
Entropy provides a lower bound on the number of bits needed to encode a random variable. By assigning shorter codes to more probable outcomes and longer codes to rare outcomes, we can achieve efficient compression.
How does entropy differ between uniform and non-uniform distributions?
A uniform distribution, where all outcomes are equally likely, has the highest entropy because the uncertainty is maximized. A non-uniform distribution, where some outcomes are more probable than others, has lower entropy.
How does probability relate to information content?
Information content is inversely proportional to probability. The less likely an event is, the more information it provides when observed. This is represented as the negative logarithm of the probability.
What is entropy in Information Theory?
Entropy quantifies the average amount of information contained in a random variable. It represents the expected surprise of an outcome and is calculated as the sum of probabilities of all outcomes multiplied by their information content
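A short sketch computing this (the distributions are made-up examples), which also shows the uniform case maximizing entropy:

    import numpy as np

    def entropy(p):
        # H(X) = -sum_x p(x) log2 p(x), in bits; 0 log 0 is treated as 0.
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log2(p[nz]))

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: 2.0 bits (maximum)
    print(entropy([0.9, 0.05, 0.03, 0.02]))   # skewed: much lower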
How does entropy extend to continuous variables?
For continuous variables, entropy is generalized to differential entropy, which measures uncertainty over a continuous range of values. It differs from discrete entropy by a constant term that depends on bin size.
What is Shannon’s Noiseless Coding Theorem?
This theorem states that the entropy of a source is the minimum number of bits required on average to encode its messages without losing information.
How does entropy relate to thermodynamics?
Originally a concept in physics, entropy in thermodynamics measures disorder in a system. Statistical mechanics later connected this to information theory, showing entropy as a measure of uncertainty in state descriptions.
What is Maximum Entropy?
The principle of Maximum Entropy states that, given limited knowledge about a system, the probability distribution that best represents this knowledge is the one with the highest entropy, meaning it assumes the least additional information.
What is Conditional Entropy?
Conditional entropy measures the remaining uncertainty of a variable after another related variable is known. It quantifies how much additional information is needed to describe the first variable given the second.
How does entropy change when considering multiple variables?
The joint entropy of two variables is the sum of their individual entropies minus their mutual information. The more related they are, the lower the conditional entropy.
What is Relative Entropy or Kullback-Leibler (KL) Divergence?
KL Divergence measures how different one probability distribution is from another. It quantifies the extra information needed to encode samples from one distribution using another distribution’s code.
How does KL Divergence relate to Bayesian analysis?
In Bayesian statistics, KL Divergence is used to measure how much a posterior distribution differs from a prior distribution after incorporating new evidence.
What is the significance of KL Divergence being non-symmetric?
KL Divergence is not a true distance metric because swapping the two distributions changes the value. It specifically measures how much information is lost when using an approximate distribution instead of the true one.
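A quick numeric sketch of the asymmetry (p and q are made-up distributions):

    import numpy as np

    def kl(p, q):
        # KL(p || q) = sum_x p(x) log(p(x)/q(x)); assumes q > 0 wherever p > 0.
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return np.sum(p[nz] * np.log(p[nz] / q[nz]))

    p = [0.7, 0.2, 0.1]
    q = [0.4, 0.4, 0.2]
    print(kl(p, q), kl(q, p))  # different values: KL is not symmetric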
How is convexity related to KL Divergence?
Convexity ensures that KL Divergence satisfies certain mathematical properties, including always being non-negative and only equaling zero when two distributions are identical.
What is Mutual Information?
Mutual Information quantifies the reduction in uncertainty of one variable given knowledge of another. It measures how much knowing one variable tells us about another.
How does Mutual Information relate to KL Divergence?
Mutual Information can be defined using KL Divergence between the joint distribution of two variables and the product of their marginal distributions. This captures how dependent the two variables are.
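A small sketch of that identity (the 2×2 joint table is a made-up example):

    import numpy as np

    # Joint distribution of two binary variables (rows: X, columns: Y).
    joint = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
    px = joint.sum(axis=1)
    py = joint.sum(axis=0)
    indep = np.outer(px, py)  # the joint if X and Y were independent

    # I(X;Y) = KL(joint || product of marginals)
    mi = np.sum(joint * np.log(joint / indep))
    print(mi)  # > 0, so the variables are dependent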
How is Mutual Information useful in machine learning?
Mutual Information is used in feature selection, clustering, and other tasks where understanding the dependency between variables is critical for model accuracy.
Week 5 Slides
Why is approximation needed in Bayesian inference?
Bayesian inference often requires calculating posterior distributions, which can be intractable for complex models. Approximation methods, such as variational inference and Laplace approximation, help make these calculations feasible.
What are the two main types of approximation methods?
Deterministic methods: Laplace approximation and variational inference.
Stochastic methods: Markov Chain Monte Carlo (MCMC), which relies on sampling.
What is Laplace Approximation?
Laplace Approximation fits a Gaussian distribution to the posterior by centering it at the mode (most probable value). It simplifies complex distributions but may be inaccurate for multimodal or small datasets.
What is Variational Inference?
Variational Inference approximates a complex posterior distribution with a simpler, more manageable one by optimizing an objective function to minimize the difference between the true and approximate distributions.
How does Variational Inference work?
It selects an approximate distribution from a tractable family and optimizes its parameters to make it as close as possible to the true posterior. This converts the inference problem into an optimization problem.
What is the Evidence Lower Bound (ELBO)?
ELBO is an alternative function used for optimization in Variational Inference. Instead of directly minimizing the difference between distributions, ELBO provides a lower bound on the model’s evidence, guiding the optimization process.
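One common way to write this (not on the original card): ELBO(q) = E_q[log p(D, θ)] − E_q[log q(θ)], and log p(D) = ELBO(q) + KL(q(θ) || p(θ|D)); since log p(D) is fixed, maximizing the ELBO is equivalent to minimizing the KL divergence to the true posterior.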
What is Jensen’s Inequality and how does it relate to Variational Inference?
Jensen’s Inequality states that for a convex function, the function’s value at an average is less than or equal to the average of the function’s values. It helps derive ELBO and justify its use in variational optimization.
What is Kullback-Leibler (KL) Divergence?
KL Divergence measures how different two probability distributions are. In Variational Inference, it quantifies how much the approximation differs from the true posterior and serves as a key optimization target.
What are Forward and Reverse KL Divergence?
Forward KL (zero-avoiding): Ensures the approximate distribution covers all probable regions of the true distribution.
Reverse KL (zero-forcing): Leads to an approximation that fits tightly around the mode, ignoring low-probability areas.
What is the Mean Field Approximation?
The Mean Field Approximation simplifies Variational Inference by assuming that the variables in the model are independent, allowing for a factorized form of the approximate distribution.
What is the Block Coordinate Ascent algorithm?
This algorithm iteratively updates each variable’s approximate distribution while keeping others fixed, gradually improving the overall approximation until ELBO converges.
What is Parametric Variational Inference?
Instead of assuming a factorized form, Parametric Variational Inference restricts the approximate distribution to a specific parametric family, optimizing its parameters to best match the true posterior.
How does Variational Inference compare to MCMC?
Variational Inference is faster and provides deterministic approximations but may introduce bias.
MCMC is more accurate but computationally expensive due to extensive sampling.
What are the practical challenges of Variational Inference?
It may oversimplify complex distributions.
The choice of approximate family can limit accuracy.
Optimization can be difficult in high-dimensional settings.
What is the overall goal of Variational Inference?
The goal is to efficiently approximate intractable posterior distributions by maximizing ELBO, minimizing KL divergence, and finding a balance between speed and accuracy in probabilistic modeling.
Week 6 Slides
What is the main limitation of Variational Inference?
Variational inference may result in a poor approximation q(θ) of the true posterior p(θ), leading to inaccurate results. The choice of q(θ) can strongly impact performance, and the approach may struggle with complex, multi-modal posteriors.
What is Monte Carlo Integration?
Monte Carlo Integration is a technique for approximating integrals using sample averages. It is particularly useful in probabilistic inference, where computing expectations analytically is difficult. The fundamental idea is to approximate an integral by drawing random samples and averaging their function values.
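A minimal sketch (the integrand and sample size are illustrative choices): estimate E[f(x)] by averaging f over random samples; the error shrinks like 1/√N.

    import numpy as np

    rng = np.random.default_rng(4)
    samples = rng.normal(size=100_000)  # x ~ N(0, 1)

    # E[f(x)] is approximated by the sample average of f; here f(x) = x^2,
    # whose true expectation under N(0, 1) is 1.
    estimate = np.mean(samples**2)
    print(estimate)  # approx 1.0; error shrinks like 1/sqrt(N)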
What is Importance Sampling?
Importance sampling is a method used to approximate expectations or integrals when direct sampling from the target distribution p(x) is difficult. Instead, samples are drawn from an easier-to-sample proposal distribution q(x), and the results are reweighted by the ratio p(x) / q(x) to estimate the true expectation.
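A hedged sketch (the target N(3, 1) and proposal N(0, 2) are illustrative; the proposal's heavier tails cover the target's support, as the next card requires):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    x = rng.normal(loc=0.0, scale=2.0, size=100_000)  # samples from q = N(0, 2)

    # Reweight by p(x)/q(x) to estimate E_p[x] for the target p = N(3, 1).
    w = stats.norm.pdf(x, loc=3.0, scale=1.0) / stats.norm.pdf(x, loc=0.0, scale=2.0)
    estimate = np.mean(w * x)
    print(estimate)  # approx 3.0, the mean of the target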
What is a key assumption in Importance Sampling?
A key assumption in Importance Sampling is that the proposal distribution q(x) must be non-zero wherever p(x) is non-zero. If q(x) does not sufficiently cover the support of p(x), the variance of the estimator can be very high, leading to poor performance.
When is Importance Sampling inefficient?
Importance Sampling becomes inefficient when the proposal distribution q(x) does not closely match the target distribution p(x). If q(x) has lighter tails than p(x), it may rarely sample regions where p(x) has significant probability mass, leading to large importance weights and high variance in estimates.
What is Rejection Sampling?
Rejection sampling is a technique to generate samples from a target distribution p(x) by using a proposal distribution q(x). A sample x0 ~ q(x) is drawn and accepted with probability p(x0) / (k q(x0)), where k is a constant ensuring the comparison function k q(x) bounds the target distribution.
How does Rejection Sampling differ from Importance Sampling?
Rejection sampling only accepts some samples from the proposal distribution q(x), discarding others based on an acceptance criterion. Importance sampling, in contrast, uses all samples but reweights them to account for differences between p(x) and q(x).
What is Markov Chain Monte Carlo (MCMC)?
MCMC is a class of algorithms used to sample from high-dimensional probability distributions by constructing a Markov chain whose stationary distribution is the target distribution p(x). Unlike Importance Sampling, MCMC does not require a proposal distribution that closely matches p(x), making it effective in high-dimensional spaces.
What is the Markov property in MCMC?
The Markov property states that the future state of a Markov chain depends only on the present state, not on past states. This property is crucial for constructing MCMC algorithms where each sample is generated based on the previous sample.
What is the Metropolis Algorithm?
The Metropolis algorithm is an MCMC method that generates a new candidate state using a symmetric proposal distribution (e.g., a Gaussian centered on the current state). The candidate is accepted with probability proportional to the ratio of the target distribution values at the new and old states; otherwise, the current state is retained.
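A compact sketch of a random-walk Metropolis sampler (the target, proposal scale, and burn-in length are illustrative assumptions):

    import numpy as np

    # Unnormalized target density: N(2, 1) up to a constant.
    def unnorm_p(x):
        return np.exp(-0.5 * (x - 2.0) ** 2)

    rng = np.random.default_rng(6)
    x = 0.0
    samples = []
    for _ in range(20_000):
        x_new = x + rng.normal(scale=1.0)      # symmetric Gaussian proposal
        ratio = unnorm_p(x_new) / unnorm_p(x)  # target ratio (constants cancel)
        if rng.uniform() < ratio:              # accept with prob min(1, ratio)
            x = x_new
        samples.append(x)

    print(np.mean(samples[5_000:]))  # approx 2 after discarding burn-in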
How does the Metropolis-Hastings Algorithm generalize the Metropolis Algorithm?
The Metropolis-Hastings algorithm extends the Metropolis algorithm to allow for asymmetric proposal distributions, q(x'|x) ≠ q(x|x'). It adjusts the acceptance probability to account for the asymmetry in proposals, ensuring detailed balance is maintained.
What is Gibbs Sampling?
Gibbs Sampling is a special case of the Metropolis-Hastings algorithm where each variable is sampled in turn, conditioned on the current values of all other variables. This method is especially useful when the full conditional distributions are easy to sample from.
Why is Gibbs Sampling useful in Bayesian Inference?
Gibbs Sampling is useful in Bayesian inference because it allows efficient sampling from complex joint distributions by iteratively updating each variable using its full conditional distribution. This makes it particularly effective for high-dimensional models.
What are the main advantages and disadvantages of MCMC methods?
Advantages:
Can handle high-dimensional and complex distributions.
Does not require an explicitly known normalization constant for p(x).
Disadvantages:
Requires many iterations to achieve convergence.
Generated samples are not independent (autocorrelated).
Computationally expensive compared to direct sampling methods when feasible.