AI2 Flashcards

Question

# Uncertainty in AI Give real-world applications of uncertainty in AI.

Answer 1

- **Autonomous Vehicles:** Predicting traffic behavior under varying conditions - **Medical Diagnostics:** Handling noisy data & ambiguous symptoms - **Finance:** Forecasting under market volatility - **Computer Vision & Robotics:** Many vision & robotics systems still operate with deterministic algorithms - Huge scope for handling uncertainty in object detection, dynamic models & parameters

Answer 2

- Helps predict something by assuming a straight-line relationship between features *(x)* & outputs *(y)* - Lets us work out explicitly the epistemic & aleatoric components of the uncertainty

Answer 3

fw(x) = w0 + w1(x1) + ... + wd(xd) - *d* = no. of features - ***w*** = (w0,...,wd) = parameters - *x1 ... xd* = feature values of the input data - *w0* = y-intercept

Answer 4

L(w) = 1/n * (Σ {1-n} (yi - fw(xi))^2)

Answer 5

**y ≃ X w** - ***y*** - Output vector: the target outputs - ***X*** - Design matrix: contains all features, with rows as data points & columns as features - ***w*** - Weight vector: stores all weights

Answer 6

L(w) = 1/n * (Σ (yi - (**w**^T)xi)^2) - *n*: Number of data points in your dataset - *yi*: The actual value for the ith data point - *xi*: Feature vector of the ith data point; multiplying xi by **w**^T gives a weighted sum, which is the predicted value

Answer 7

L(w) = 1/n ||**y - Xw**||^2 - *y*: A vector of actual target value of size n - *w*: A vector of weights of size d - ***X***: matrix of size n * d - *n* : No. of data points (Rows of X) - *d* : No. of features (Columns of X) - Each row, *xi* represents the feature vector of a single data point.

Answer 8

More efficient when working with large datasets because they use matrix operations instead of looping through each data point

Answer 9

The sum of the squares of the components of a vector **z**

Answer 10

Minimising the MSE (Mean Squared Error)

Answer 11

- Gradient Descent - Closed Form Solution

Answer 12

L(w) = 1/n ||**y - Xw**||^2 - **y**: True Output - **X**: Design Matrix - **w**: Weight Vector

Answer 13

1) Compute error **y - Xw** 2) Square each error 3) Add them all up 4) Divide by n (no. of points)
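
A minimal sketch of the four steps above in plain Python, assuming the data fit in nested lists. The function name `mean_squared_error` and the toy data are illustrative assumptions, not part of the original cards.

```python
def mean_squared_error(X, y, w):
    """Mean squared error L(w) = (1/n) * ||y - Xw||^2 for plain Python lists."""
    n = len(y)
    total = 0.0
    for row, target in zip(X, y):
        prediction = sum(w_j * x_j for w_j, x_j in zip(w, row))  # w^T x_i
        error = target - prediction   # step 1: compute the error
        total += error ** 2           # steps 2-3: square it and add it up
    return total / n                  # step 4: divide by n


# Hypothetical toy data: the first column of X is a constant 1 for the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
print(mean_squared_error(X, y, [1.0, 2.0]))  # perfect fit, prints 0.0
print(mean_squared_error(X, y, [0.0, 2.0]))  # every prediction is off by 1, prints 1.0
```

A vectorised library would replace the explicit loop with one matrix operation, which is the efficiency point made in the earlier card.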

Answer 14

1) Compute (X^T)X 2) Compute (X^T)y 3) Compute ((X^T)X)^(-1) - Find determinant - Calculate ((X^T)X)^(-1) 4) Compute ŵ - ŵ = ((X^T)X)^(-1) * (X^T)y

Answer 15

X = | x1 x2 | | y1 y2 | det(X) = (x1 * y2 ) - (x2 * y1)
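
The closed-form steps and the 2x2 determinant can be sketched together for the simplest case: one feature plus an intercept, so that (X^T)X is a 2x2 matrix. The helper names `det2`, `inverse2` and `closed_form_fit` are hypothetical, and the data are made up for illustration.

```python
def det2(m):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]: a*d - b*c."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inverse2(m):
    """Inverse of a 2x2 matrix via its determinant; fails if it is singular."""
    d = det2(m)
    if d == 0:
        raise ValueError("matrix is singular, so (X^T X)^(-1) does not exist")
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def closed_form_fit(xs, ys):
    """Least-squares weights (w0, w1) for y ~ w0 + w1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    xtx = [[n, sx], [sx, sxx]]                          # step 1: X^T X
    xty = [sum(ys), sum(x * t for x, t in zip(xs, ys))]  # step 2: X^T y
    inv = inverse2(xtx)                                  # step 3: (X^T X)^(-1)
    w0 = inv[0][0] * xty[0] + inv[0][1] * xty[1]         # step 4: w = inv * X^T y
    w1 = inv[1][0] * xty[0] + inv[1][1] * xty[1]
    return w0, w1


# Points lying on y = 1 + 2x, so the fit should recover roughly (1, 2).
print(closed_form_fit([0.0, 1.0, 2.0], [1.0, 3.0, 5.0]))
```

For more than a couple of features, explicitly inverting (X^T)X is usually avoided in favour of a linear solver; the sketch follows the card's steps for clarity only.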

Answer 16

yi = (**(w * )**^T)xi + εi - **εi:** - Random noise independent of all other variables - Has zero mean - Variance Var(εi) = σ^2, ∀i ∈ [**n**]

Answer 17

**y = X(w * ) + e** - **y**: Outputs - **X**: Inputs - (w * ): The true weights (UNKNOWN BUT FIXED) - e: Random noise

Answer 18

1) ŵ - (w * ) 2) ((X^T)X)^(-1) * (X^T) * e

Answer 19

1) ŵ - (w * ) = ((X^T)X)^(-1) * (X^T) * y - (w * ) 2) = ((X^T)X)^(-1) * (X^T) * (X * (w * ) + e) - (w * ) (*Plugged in y*) 3) = ((X^T)X)^(-1) * (X^T) * X(w * ) + ((X^T)X)^(-1) * (X^T) * e - (w * ) 4) = ((X^T)X)^(-1) * (X^T) * e - Since *((X^T)X)^(-1) (X^T) X = I (the identity matrix) & (w * ) - (w * ) = 0*

Answer 20

**y = X(w * ) + ε** - Aleatoric uncertainty is irreducible - For a given input x, aleatoric uncertainty affects predictions because of the noise in ε

Answer 21

(ŵ - (w * ))^T * x = (e^T) * X((X^T)X)^(-1) * x - Reflects uncertainty about the parameter (w * ), caused by insufficient or incomplete data *- Epistemic uncertainty comes from the estimated weight ŵ being different from the true weight (w * ).*

Answer 22

- Collecting more data - Improving the model - Adding more relevant features

Answer 23

((w * ) - ŵ )^T * x

Answer 24

Variance from epistemic uncertainty decreases as the dataset size n grows

Answer 25

- Most Prevalent Form - Learning with a teacher - Teacher: expected output, label, class, etc - Solve 2 Types of problems: Classification & Regression

Answer 26

Learn a mapping from inputs {x1,...,xn} to corresponding targets {y1,..,yn} - Learn a function from X => Y - E.g. Training a regression model where X is house features & Y is price

Answer 27

- Learning without a teacher - Find hidden structures - Clustering ==> Group inputs based on similar properties - Agent learns patterns in the input without any explicit feedback

Answer 28

- Focuses on modelling the probability distribution over data - Fit a probabilistic model to the same input samples {x1,...,xn} e.g. fitting a Gaussian distribution to a dataset

Answer 29

- Bayesians represent uncertainty by treating θ as a random variable with a prior distribution. - Parameters & data are often continuous-valued ## Footnote D = data θ = Parameters p(θ) = The Prior PDF

Answer 30

*P*(θ|D) = (*P*(D|θ)*P*(θ))/(*P*(D)) where: - *P*(D|θ) => Likelihood of data given parameters - *P*(θ) => Prior (Belief about parameters before observing data) - *P*(D) => ∫ *P*(D|θ)*P*(θ) *d*θ **Marginal Likelihood / Evidence**

Answer 31

- Bayesian approach to parameter estimation. - The most probable value of θ is where the posterior takes its maximum.

Answer 32

It incorporates a prior belief over parameter values

Answer 33

- *P*(D) has no influence on MAP, therefore making it easier to compute - MAP returns a point estimate

Answer 34

**θ MAP = arg max{θ} *P*(θ|X)** **= arg max{θ} log*P*(X∣θ)+log*P*(θ)** - MAP maximises the posterior distribution to find the most probable parameters

Answer 35

**Overfit** - Leading to poor estimates

Answer 36

Incorporates priors to mitigate overfitting

Answer 37

Gaussian X ~ Normal (μ,σ^2) - ML (Maximum Likelihood) for mean μ is: **μ'{ML} = 1/N (Σ {1-N} xi)**

Answer 38

- When N is Large

Answer 39

- The first term vanishes with large n - Var(θ) is of order 1/n - With finite n, the 2 terms trade off

Answer 40

Systematic Deviation: **((θ * ) - E[θ'] )^2** θ' = Estimate | θ * = true parameter

Answer 41

- Variability in the estimate: E[(θ' - E[θ'])^2]

Answer 42

E[(θ' - θ * )^2] = E[(θ')^2] - 2E[θ'](θ * ) + (θ * )^2 = E[(θ')^2] - (E[θ'])^2 + ((θ * ) - E[θ'])^2 (*Rearrange and complete the square*) = ((θ * ) - E[θ'])^2 + E[(θ' - E[θ'])^2]

Answer 43

Bias[θ']^2 + Var[θ']
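
The decomposition can be checked numerically. The sketch below assumes a hypothetical estimator θ' that takes a few discrete values with known probabilities; the function name `decompose` and the numbers are illustrative assumptions only.

```python
def decompose(values, probs, theta_true):
    """Return (mse, bias_squared, variance) for a discrete estimator theta'."""
    mean = sum(p * v for v, p in zip(values, probs))                      # E[theta']
    mse = sum(p * (v - theta_true) ** 2 for v, p in zip(values, probs))   # E[(theta' - theta*)^2]
    bias_sq = (theta_true - mean) ** 2                                    # (theta* - E[theta'])^2
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probs))    # E[(theta' - E[theta'])^2]
    return mse, bias_sq, variance


# Hypothetical estimator: takes the values 1, 2, 4 with probabilities 0.25, 0.5, 0.25.
mse, bias_sq, variance = decompose([1.0, 2.0, 4.0], [0.25, 0.5, 0.25], theta_true=2.0)
print(mse, bias_sq + variance)  # the two numbers should agree
```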

Answer 44

Distribution for a binary random variable (2 possible values)

Answer 45

Describes how well a set of parameters explains the observed data

Answer 46

**Independent & Identically distributed** - Each data point does not influence or depend on any other data point - Each data point follows the same probability distribution

Answer 47

*P*(**x**;θ) = Π {1-n} *P*(xi;θ)

Answer 48

Likelihood is the Probability Mass Function of the observed data *L*(θ|**x**) = *L*(θ|x1,x2,...,xn) = P**x**(**x**;θ)

Answer 49

- Gives the probability that a discrete random variable takes a specific value. - It applies to discrete distributions like the binomial or Poisson distribution. e.g. *P*(X = x) = ((λ^x)e^(-λ))/x! **Poisson distribution.**

Answer 50

*P*(X = x) = nCx p^x (1-p)^(n-x) | nCx = n choose x

Answer 51

*P*(X = x) = ((λ^x)e^(-λ))/x!

Answer 52

Likelihood is the probability density function of the observed data *L*(θ|**x**) = *L*(θ|x1,x2,...,xn) = f**x**(**x**;θ)

Answer 53

- Describes the probability distribution of a continuous random variable - Does not give probabilities directly but instead represents the relative likelihood of different outcomes - e.g. *P*(a <= X <= b) = ∫ {a to b} f(x) dx

Answer 54

- f(x) is the PDF, which describes how probability is distributed over values of X **Conditions:** - f(x) >= 0, for all values of X - Total area under the curve must be 1 *MEANS INTEGRAL IS ALWAYS 1*

Answer 55

f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)^2 / (2σ^2)) ## Footnote μ = mean σ^2 = variance σ = Standard deviation

Answer 56

Can be either a scalar parameter OR a vector of parameters

Answer 57

Describes the chance of an event occurring, assuming a known distribution | - Predicting events - e.g. Chance of picking a black ball

Answer 58

- **Input**: Observations (data points) - **Output**: Parameter values of the model that explain the data - **Model**: Probability distribution - **Task**: Search for the best parameters of the model to fit the distribution

Answer 59

1) Select relevant / Given Likelihood Function 2) Turn it into a Log-Likelihood Function 3) Differentiate and solve for the desired value
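
As a worked instance of the three steps above, here is the standard derivation for a Bernoulli model. The choice of the Bernoulli distribution and the notation $\hat{\theta}_{MLE}$ (written θ'{MLE} elsewhere in these cards) are assumptions made for illustration, with observations $x_1,\dots,x_n \in \{0,1\}$ taken as i.i.d.

```latex
\begin{align*}
% Step 1: likelihood of i.i.d. Bernoulli observations
L(\theta \mid x_1,\dots,x_n) &= \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} \\
% Step 2: log-likelihood (the product becomes a sum)
\ell(\theta) &= \Big(\sum_{i=1}^{n} x_i\Big)\log\theta
              + \Big(n - \sum_{i=1}^{n} x_i\Big)\log(1-\theta) \\
% Step 3: differentiate, set to zero, solve for theta
\frac{d\ell}{d\theta} &= \frac{\sum_{i} x_i}{\theta} - \frac{n - \sum_{i} x_i}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i
\end{align*}
```

The result is the fraction of observed successes, which matches the intuition that the best-fitting coin bias is the observed proportion of heads.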

Answer 60

A Maximum Likelihood Estimator of the parameter θ , denoted as Θ'{MLE} is a Random Variable Θ'{MLE} = Θ'{MLE}(X) *Whose values are given by θ' when X =x*

Answer 61

As it depends on the random variable X

Answer 62

- Denoted as θ'{MLE} - The value of θ that maximizes the likelihood function - θ'{MLE} = arg max{θ} *L*(θ|x)

Answer 63

Finding θ'{MLE} , the value that maximises the function

Answer 64

- Closed Form - Exhaustive Search - Optimization Algorithms

Answer 65

Directly solve for the MLE using calculus - Only rarely can we compute it directly

Answer 66

- Tries all possible θ values - Suitable for low dimensional problems (Grid Search )

Answer 67

A more general, scalable approach - Uses algorithms like Gradient Descent to find the MLE

Answer 68

A function that maps a set of events into a number representing the cost of that event occurring

Answer 69

**Logarithmic Transformation**: Turns products into sums, making differentiation easier **Numerical Stability**: Prevents underflow when dealing with small probabilities.

Answer 70

- **Convention**: Optimisation software is usually written for minimisation problems - **Convenience**: Logs turn multiplication into addition & make differentiation easier - **Numerical Stability**: A product of small probabilities can underflow to zero, causing computational issues due to machine precision limits.

Answer 71

Finding the best solution from among the set of all feasible solutions.

Answer 72

- Constructing a model - Determining the problem type - Selecting an optimisation algorithm

Answer 73

Constructing a model involves designing a mathematical or computational representation of a system based on the problem you are trying to solve. - The model is trained using historical data

Answer 74

- Before applying optimisation, it's essential to classify the problem correctly e.g. {Regression,Classification,Clustering,Optimisation Problems} - Identifying the problem type helps in selecting the appropriate model and optimisation method.

Answer 75

- Optimisation algorithms adjust the model’s parameters to minimise error (loss function) or maximise performance. e.g. {Gradient Descent & Simulated Annealing} - The choice depends on the problem type, data size, and computational constraints.

Answer 76

Goal of ML: to infer the function that maps inputs to outputs so that it can predict the correct output

Answer 77

Yes, the success of the learning depends in a highly non-trivial way on how this is done.

Answer 78

- Given training data, the process of finding appropriate parameters of a model is often defined as optimising an objective function ## Footnote Optimize an objective function to find the best parameters for a model

Answer 79

Dividing samples into cluster or groups such that: - Samples within the same group are as similar as possible - Samples in different groups are as different as possible ## Footnote Optimize to group similar data points together

Answer 80

States that the gradient of the cost function must be zero at minimum

Answer 81

- For a function g(w) of an N-dimensional independent variable, the optimisation problem is: arg min{w} g(w)

Answer 82

A local minimum w * satisfies the **first-order necessary condition for optimality:** ***∇{w}g(w * ) = 0{N x 1}*** *Equation*

Answer 83

The point satisfying the condition

Answer 84

- minimum - maximum - saddle point

Answer 85

The equation of first-order necessary condition can be written as a system of N first order equations: **"The gradient of f(x) is equal to zero:"** **∂f/∂x₁ = 0, ∂f/∂x₂ = 0, ..., ∂f/∂xₙ = 0** **"The partial derivatives of the Lagrangian function must be zero:"** **∂ℒ/∂x₁ = 0, ∂ℒ/∂x₂ = 0, ..., ∂ℒ/∂xₙ = 0, ∂ℒ/∂λ₁ = 0, ..., ∂ℒ/∂λₘ = 0**

Answer 86

A first-order iterative optimisation algorithm for finding a local minimum of a differentiable cost function **KEY IDEA:** Employ the negative gradient at each step to decrease the cost function

Answer 87

- **Direction:** Determined by the gradient at the point - **Magnitude:** Called step size or learning rate

Answer 88

- **Initialisation:** Start at any value of parameter θ - **Repeat:** Change the parameter θ in the direction that decreases the cost *J*(θ) - **Until:** The decrease in cost with each step is very small. ## Footnote GD works by iteratively adjusting the parameters in the direction that reduces the cost function. The learning rate η controls the step size
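
The initialise / repeat / until loop can be sketched for a single parameter as follows. The function name `gradient_descent`, the default learning rate, the tolerance and the quadratic cost are all assumptions chosen for illustration.

```python
def gradient_descent(cost, grad, theta0, learning_rate=0.1, tol=1e-16, max_steps=10_000):
    """Minimise a 1-D cost J(theta) by repeatedly stepping against its gradient."""
    theta = theta0                      # initialisation: start anywhere
    previous = cost(theta)
    for _ in range(max_steps):
        theta = theta - learning_rate * grad(theta)   # step in the negative gradient direction
        current = cost(theta)
        if previous - current < tol:    # until: the decrease in cost is very small
            break
        previous = current
    return theta


# Hypothetical cost J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3).
theta_min = gradient_descent(
    cost=lambda t: (t - 3.0) ** 2,
    grad=lambda t: 2.0 * (t - 3.0),
    theta0=0.0,
)
print(theta_min)  # should be very close to 3
```

A learning rate that is too large can make the cost increase instead of decrease, in which case this loop stops immediately; smaller rates converge more slowly but more safely.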

Answer 89

Predict which category something belongs to. **GOAL:** Learning to determine the most likely class that an input pattern belongs to **Formally:** Model posterior probs of class membership conditioned on input features ## Footnote E.g. Based on how many hours a student has studied, we want to predict if they will pass or fail

Answer 90

The probability of an event after considering the evidence

Answer 91

A problem with 2 possible outcomes

Answer 92

Model that gives us a probability - *P*{i} = *P*(label|x{i}) - P >= 0.5, predict PASS, else FAIL ## Footnote Predicts the probability something belongs to class 1

Answer 93

- log-odds (logit) - Sigmoid Function

Answer 94

A way to turn a probability into a number that can be used in a *linear model*

Answer 95

Describes the likelihood of an event occurring

Answer 96

Ratio of the probability of an event occurring to the probability of it not occurring

Answer 97

odds = p/(1-p)

Answer 98

logit(*p*) = log (*p*/(*1-p*)) = log(*p*) - log(1-*p*) = θ0 + θ1(xi,1) + θ2(xi,2) +...+θd(xi,d) = θ0 + Σ {1-d} (θj(xi,j)) = (**θ**^T)**x**i ## Footnote θ0 = Intercept θ1,...,θd = Coefficients (Weights assigned to each feature) p = P(Yi = 1|xi) Yi = Target Variable (Dependent Variable)

Answer 99

- A function that maps any real number to a value between 0 & 1 - Used to convert log-odds into a probability

Answer 100

- Maps any real number to a value between 0 & 1 - Outputs 0.5 when the input is 0 - Approaches 1 as the input becomes very large - Approaches 0 as the input becomes very negative

Answer 101

**GENERAL**: σ(x) = 1/(1+e^(-(θ^T)**x**i)) OR **STANDARD**: σ(x) = 1/(1+e^(−x)) ## Footnote Maps the logit to a probability
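
The relationship between log-odds and the sigmoid can be sketched in a few lines, assuming scalar inputs. The function names `logit` and `sigmoid` are illustrative; the branch for negative inputs is an added numerical safeguard, not something stated in the cards.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1): log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Map any real number z to a value between 0 and 1 (inverse of logit)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)           # equivalent form that avoids overflow for very negative z
    return e / (1.0 + e)


print(sigmoid(0.0))             # 0.5 when the input is 0
print(sigmoid(logit(0.8)))      # recovers roughly 0.8
```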

Answer 102

It is used to find the best parameters for the regression model, by maximising the likelihood of the observed data ## Footnote Method to find the best weights for the model so it predicts the observed data as accurately as possible

Answer 103

Need to estimate d+1 unknown parameters θ

Answer 104

MLE finds the set of parameters for which the likelihood of the observed data is largest

Answer 105

If the model predicts a 70% chance of passing for a student who studied 10hrs & the student actually passed, then the likelihood is high

Answer 106

Estimate the unknown parameters θ of the logistic regression model

Answer 107

*L*(θ|y,x) = ∏ {i} *P*(yi|xi;θ) **For Binary**: *P*(yi|xi;θ) = σ(xi)^(yi) * (1-σ(xi))^(1-yi) **For Log-Likelihood Function** ℓ(θ) = log *L*(θ|y,x) = Σ {i} [yi(log σ(xi)) + (1-yi)log(1-σ(xi))] *Estimation: Find θ'{MLE} that minimises the negative log-likelihood*: **θ'{MLE} = arg min{θ} - ℓ(θ)** ## Footnote σ = sigmoid function

Answer 108

Adjust the parameter to make the observed data as likely as possible

Answer 109

Likelihood is the product of the probabilities of the observed outcomes

Answer 110

- Measures how wrong the model's predictions are. - Want to minimise this to make the model more accurate

Answer 111

*J*(θ) = -ℓ(θ) = - Σ[yi(log σ(xi)) + (1-yi)(log(1-σ(xi)))] ## Footnote *J*(θ) = Cost Function to be minimised σ(xi) = Sigmoid Function yi = Binary Target label (0 or 1)

Answer 112

- Minimise *J*(θ) to find the optimal parameters θ' - Use Gradient Descent

Answer 113

(d*J*(θ))/(dθ{j}) = Σ {1-n} [σ(xi) - yi]xij ## Footnote Gradient is used for iterative optimisation
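
A minimal sketch of gradient descent using this gradient, in plain Python. The names `fit_logistic` and `sigmoid`, the learning rate, the step count and the hours-studied data are all assumptions for illustration; each input row is assumed to start with a constant 1 so that θ0 acts as the intercept.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.01, steps=5000):
    """Minimise the negative log-likelihood J(theta) by gradient descent."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for row, label in zip(X, y):
            p = sigmoid(sum(t * x for t, x in zip(theta, row)))  # sigma(theta^T x_i)
            for j, x_ij in enumerate(row):
                grad[j] += (p - label) * x_ij                    # (sigma(x_i) - y_i) * x_ij
        theta = [t - learning_rate * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical data: hours studied (second column) and pass (1) / fail (0).
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]]
y = [0, 0, 1, 0, 1, 1]
theta = fit_logistic(X, y)
print(sigmoid(theta[0] + theta[1] * 1.0))  # predicted pass probability after 1 hour
print(sigmoid(theta[0] + theta[1] * 6.0))  # predicted pass probability after 6 hours
```

The data are deliberately not perfectly separable; with separable data the weights would keep growing and there would be no finite maximum likelihood estimate.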

Answer 114

- **Binary Targets**: Outcome variable has 2 possible values - **Independent Observations**: Each data point is independent of the others - **Low Multicollinearity**: Features are not highly correlated with each other - **Linearity**: Linear relationship between features and log-odds - **Sufficient Sample Size**: Demands a large sample for accurate estimation ## Footnote If 2 features are highly correlated, the model might not work well

Answer 115

Helps calculate the probability of an event based on prior information

Answer 116

Uses Bayes' Rule, but simplifies the problem by assuming that the features are independent of each other, which makes the computation easier

Answer 117

*P*(C|**x**) = (*P*(**x**|C)*P*(C))/*P*(**x**) C is a random variable representing the class | Calculates probabilities of each class based on available data ## Footnote *P*(C|**x**) = Posterior prob of class C given feature vector **x** *P*(**x**|C) = Likelihood of feature vector **x** given class C *P*(C) = Prior probability of class C *P*(**x**) = Marginal likelihood (constant with respect to C) **x** = observed feature vector C = Random variable representing the class label

Answer 118

Features are conditionally independent given the class label | - Efficient for classification tasks ## Footnote - Makes the model fast and scalable, though symptoms in reality may be correlated - Reduces the complexity of problems, making calculations easier - Can calculate the likelihood for each feature separately

Answer 119

*P*(**x**|C) = *P*(x1,x2,...,xd|C) = ∏ {1-d} (*P*(xi|C))

Answer 120

- Prior probability of each class *P*(C) - Conditional probabilities *P*(xi|C) for each feature xi given class C - For a given class C, the posterior probability is: *P*(C|**x**) = (*P*(C) ∏ {1-d} *P*(xi|C))/*P*(**x**)

Answer 121

The model is trained by calculating the prior & the likelihoods

Answer 122

C' = arg max {C} *P*(C) ∏ {1-d} *P*(xi|C) ## Footnote We calculate the posterior probabilities for each class & choose the class with the highest value
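
Training (counting priors and per-feature likelihoods) and prediction (the arg max above) can be sketched for categorical features as follows. The function names `train_naive_bayes` and `predict` and the tiny spam-style dataset are hypothetical, and no smoothing is applied, which a practical implementation would normally add.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Estimate P(C) and P(x_i | C) by counting (no smoothing)."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    feature_counts = defaultdict(Counter)      # key: (feature index, class)
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, c)][value] += 1

    def likelihood(i, value, c):
        return feature_counts[(i, c)][value] / class_counts[c]

    return priors, likelihood


def predict(row, priors, likelihood):
    """Return arg max over C of P(C) * prod_i P(x_i | C)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, value in enumerate(row):
            score *= likelihood(i, value, c)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical features: (contains "offer", contains "meeting").
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "no"), ("yes", "yes")]
labels = ["spam", "spam", "ham", "ham", "ham"]
priors, likelihood = train_naive_bayes(rows, labels)
print(predict(("yes", "no"), priors, likelihood))
print(predict(("no", "yes"), priors, likelihood))
```

The denominator *P*(**x**) is never computed because it is the same for every class and so does not change which class wins.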

Answer 123

- Simple & easy to implement - Works well with large datasets - Performs well with categorical & continuous features - Fast training & predictions ## Footnote Naive Bayes is popular because it is quick to train & works well with both categorical & continuous data

Answer 124

- "NAIVE" assumption of feature independence rarely holds in practice - May perform poorly with highly correlated features - Sensitive to imbalanced datasets ## Footnote Simplicity can be a drawback when real-world data doesn't follow the independence assumption

Answer 125

Naive Bayes is widely used in tasks such as spam filtering, sentiment analysis & text classification because it performs well with text data & large datasets.

Answer 126

**Gaussian Naive Bayes**: Assumes that *P*(**x**|C) (DATA) follows a normal distribution **Multinomial Naive Bayes**: Used for discrete count data, such as text classification; *P*(**x**|C) is computed directly from frequencies of occurrence **Bernoulli Naive Bayes**: Assumes binary/boolean features (0 or 1)

Answer 127

Focus on modeling how the data is generated - Model *P*(X|Y) & *P*(Y) to compute *P*(Y|X) - e.g. Naive Bayes

Answer 128

Directly model the decision boundary - Directly model *P*(Y|X) - e.g. Logistic Regression

Answer 129

- Simple yet effective classification algorithm - Performs well in practical situations - Assumptions often don't hold in real applications, but the method still works well - Computationally efficient & works well for certain problems

Answer 130

1) Pick/Use a Likelihood Function 2) Use the Log-Likelihood Function 3) Differentiate & solve for P (probabilities)

Answer 131

- Interpretability - Coefficients show how each feature affects the probability

Answer 132

- Assume a linear relationship in log-odds, which may not always hold

Answer 133

P(A,B|C) = P(A|C) * P(B|C) ## Footnote Assumption simplifies probability calculations significantly, allowing for efficient classification

Answer 134

How to quantify / measure information?

Answer 135

"Alice is a Student" - Has more info since "Student" is more specific than "Person"

Answer 136

Information = "SURPRISE" in probability terms

Answer 137

- Has close connections with & applications in: - Machine Learning - Stats - Engineering - Audio/Video/Image compression (ORIGINALLY)

Answer 138

- The meaning / semantics of data doesn't matter - Instead, INFO is measured by Probability: - Rare events contain more info - Common Events contain less info

Answer 139

Measures info content of an event

Answer 140

- If an event is certain, it gives no new info - The rarer an event, the more surprising it is - If 2 independent events happen, the total info is the sum

Answer 141

Ix(x) = -logb[*P*x(x)] = logb(1/*P*x(x)) - if P(x) is low then surprise is high - if P(x) is high, surprise is low - if Px(x) = 1 then Ix(x) = 0 ## Footnote X is a random variable with Probability Mass Function Px(x) P(x) = probability of event x

Answer 142

b = 2 => bits b = e => nats b = 10 => dits ## Footnote They determine the units of self information

Answer 143

logit(A) = I(¬A) - I(A) - Information of event A is related to the information of its complement ¬A

Answer 144

Quantifies the uncertainty in a random variable X

Answer 145

H(X) = E [Ix(X)] = -Σ{1-m} P(X = xi) logb *P*(X = xi) = E[logb (1/*P*x(X))] = - E[logb (*P*x(X))] - If all outcomes are equally likely, entropy is high (MAX UNCERTAINTY) - If one outcome is very likely, entropy is low (LOW UNCERTAINTY)

Answer 146

H(X) = -Σ{1-m} *P*(xi) logb(*P*(xi))
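
A short sketch of this formula, assuming the distribution is given as a list of probabilities that sum to 1. The function name `entropy` and the example distributions are illustrative; terms with zero probability are skipped, following the usual 0 log 0 = 0 convention.

```python
import math


def entropy(probs, base=2):
    """H(X) = -sum_i P(x_i) * log_b P(x_i); zero-probability terms contribute 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


print(entropy([0.5, 0.5]))   # fair coin: maximum uncertainty for 2 outcomes
print(entropy([0.9, 0.1]))   # biased coin: lower uncertainty
```

Passing `base=math.e` would give the result in nats instead of bits.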

Answer 147

- Y Axis : Entropy - X Axis : Probability

Answer 148

*P*x(.) = *P*(.)

Answer 149

P(X = x) = *P*(x)

Answer 150

*P*x,y(.,.) = *P*(.,.)

Answer 151

P(X=x, Y=y) = *P*(x,y)

Answer 152

Px|y(.|.) = *P*(.|.)

Answer 153

P(X=x|Y=y) = *P*(x|y) ## Footnote Deterministic value. But if we write *P*(x|Y), this is a random variable since Y is a random variable

Answer 154

Measures uncertainty in 2 random variables ## Footnote A measure of the uncertainty associated with a set of variables

Answer 155

H(X,Y) = -E[log*P*(X,Y)] = -Σ{x∈Rx}Σ{y∈Ry} *P*(x,y) log*P*(x,y)

Answer 156

Measures uncertainty of Y given X ## Footnote Quantifies the uncertainty of the outcome of random variable Y given the outcome of another random variable X

Answer 157

H(Y|X) = -E [log *P*(Y|X)] = -Σ{x∈Rx}Σ{y∈Ry} *P*(x,y) log *P*(y|x) ## Footnote Don't use this directly; we rewrite it to avoid computing *P*(y|x)

Answer 158

Using the definition of conditional probability, *P*(y|x) = (*P*(x,y))/(*P*(x)) Therefore: -Σ{x∈Rx}Σ{y∈Ry} P(x,y) log P(y|x) = -Σ{x∈Rx}Σ{y∈Ry} P(x,y) log (*P*(x,y)/*P*(x)) = -Σ{x∈Rx}Σ{y∈Ry} P(x,y) [log(*P*(x,y)) - log*P*(x)] = H(X,Y) - H(X) SO H(Y|X) = H(X,Y) - H(X)

Answer 159

How different 2 probability distributions are ## Footnote Quantifies the distance between 2 probability distributions

Answer 160

D{KL}(P||Q) = Σ{x∈Rx} *P*(x) log (*P*(x)/*Q*(x)) = **E** [log (*P*(x)/*Q*(x))] ## Footnote **E** [log (*P*(x)/*Q*(x))] - The expectation is taken under *P*; individual log terms can be negative where *P*(x) < *Q*(x), but the total is always >= 0 (Gibbs' inequality)

Answer 161

- 0 log (0/Q) = 0 - *P* log(*P*/0) = ∞

Answer 162

- D{KL}(*P*||*Q*) >= 0 - D{KL} = 0 if *P*(x) = *Q*(x) - Not symmetric: D{KL}(*P*||*Q*) != D{KL}(*Q*||*P*) - Swapping *P* & *Q* won't lead to the same answer

Answer 163

For 2 Discrete Distributions P & Q: H(P,Q) = - Σ *P*(x)logQ(x) = H(*P*) + D{KL}(*P*||*Q*)
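
The identity H(P,Q) = H(P) + D{KL}(P||Q) and the asymmetry of the KL divergence can be checked numerically. This is a sketch using natural logarithms; the function names and the two example distributions are assumptions, and returning infinity when *Q*(x) = 0 but *P*(x) > 0 follows the convention *P* log(*P*/0) = ∞.

```python
import math


def entropy(p):
    """H(P) = -sum P(x) log P(x), in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(x) log(P(x) / Q(x)), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # convention: 0 * log(0 / Q) = 0
        if qi == 0:
            return math.inf       # convention: P * log(P / 0) = infinity
        total += pi * math.log(pi / qi)
    return total


def cross_entropy(p, q):
    """H(P, Q) = -sum P(x) log Q(x), in nats."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue
        if qi == 0:
            return math.inf
        total -= pi * math.log(qi)
    return total


P = [0.5, 0.5]
Q = [0.9, 0.1]
print(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q))  # should agree
print(kl_divergence(P, Q), kl_divergence(Q, P))               # differ: not symmetric
```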

Answer 164

Cross-entropy quantifies how well the predicted probabilities match the actual labels

Answer 165

I(X;Y) = Σ{x} Σ{y} *P*(x,y) log (*P*(x,y)/(*P*(x) *P*(y) ))

Answer 166

Mutual Information measures the divergence of modelling the joint probability distribution *P*(x,y) as the product of the marginals *P*x*P*y

Answer 167

When X & Y are Independent: P(x,y) = P(x)P(y) => I(X;Y) = D{KL} (P{x,y} || PxPy) = 0

Answer 168

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(Y|X) - H(X|Y) ## Footnote The equations reveal how mutual information quantifies the reduction in uncertainty about one random variable given knowledge of another
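
One of these identities, I(X;Y) = H(X) + H(Y) - H(X,Y), can be verified on a small joint distribution. The sketch assumes the joint is given as a table `joint[x][y]` whose entries sum to 1; the function names and the example tables are illustrative.

```python
import math


def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(X;Y) in bits from a joint table joint[x][y] = P(x, y)."""
    px = [sum(row) for row in joint]             # marginal P(x)
    py = [sum(col) for col in zip(*joint)]       # marginal P(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total


joint = [[0.4, 0.1],
         [0.2, 0.3]]
h_x = entropy([sum(row) for row in joint])
h_y = entropy([sum(col) for col in zip(*joint)])
h_xy = entropy([p for row in joint for p in row])
print(mutual_information(joint), h_x + h_y - h_xy)  # should agree
```

For an independent joint such as `[[0.25, 0.25], [0.25, 0.25]]` the function returns 0, matching the independence property on the earlier card.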

Answer 169

- **NON-NEGATIVE**: I(X;Y) >= 0 - **Symmetric**: I(X;Y) = I(Y;X) - **Measures Statistical Dependence**: - I(X;Y) = 0 iff X & Y are independent - I(X;Y) increases with the dependence between X & Y & with their individual entropies H(X) & H(Y) - I(X;X) = H(X) - H(X|X) = H(X) - 0 = H(X)

Answer 170

- Decide which features of the data to use for classification - Feature with Highest MI with class label are the informative features

Answer 171

- Simple & interpretable (easy for humans to understand) - Extendable: Random Forest & Gradient Boosting are ensembles of decision trees

Answer 172

- Credit Card Fraud Detection - Customer Service Automation - Pancake Recipe Optimisation

Answer 173

- **Root OR Internal Node**: Represents a feature - **Leaf Node**: Represents the target value (class label OR value) - **Branch**: Represents a decision rule

Answer 174

- **Classification Trees**: Target variable takes categorical values (e.g. Male/Female) - **Regression Trees**: Target variables takes continuous values (e.g. Temperature)

Answer 175

1) Find the best rule to split the data 2) Repeat until each partition is homogeneous (PURE CLASS LABEL)

Answer 176

- Gini Index (Gini Impurity) - Information Gain

Answer 177

- Used in CART (classification And Regression Trees)

Answer 178

I{G} = 1 - Σ {1-j} (pi)^2 | pi is the fraction of items labeled with class *i* in the dataset ## Footnote For every feature does this, to find the most informative feature
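
A minimal sketch of the Gini impurity for a list of class labels. The function name `gini_impurity` and the example labels are assumptions for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """I_G = 1 - sum_i p_i^2, where p_i is the fraction of items with class i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


print(gini_impurity(["pass", "pass", "fail", "fail"]))  # evenly mixed: 0.5
print(gini_impurity(["pass", "pass", "pass", "pass"]))  # pure node: 0.0
```

When choosing a split, this would be evaluated on each child partition and combined as a size-weighted average, with the lowest value preferred.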

Answer 179

Measures the reduction in entropy after splitting a dataset based on a feature

Answer 180

- Feature that maximises info gain is used for the split

Answer 181

IG(Y,X) = H(Y) - H(Y|X) | Y => Represents target X => Represents a feature of the input sample ## Footnote X & Y are random Variables H(Y) => Entropy of Labels H(Y|X) => Conditional Entropy after splitting on feature X
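
Information gain can be sketched by computing H(Y|X) as the size-weighted average entropy of the labels within each value of the feature. The function names and the toy pass/fail data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """IG(Y, X) = H(Y) - H(Y|X) for one categorical feature X."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)       # split the labels by feature value
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())  # H(Y|X)
    return entropy(labels) - conditional


labels = ["pass", "pass", "fail", "fail"]
print(information_gain(["high", "high", "low", "low"], labels))  # feature separates the classes
print(information_gain(["a", "b", "a", "b"], labels))            # feature tells us nothing
```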

Answer 182

- Quantifies the improvement in classifying labels after using a feature to split the dataset - Feature that maximises info gain is chosen for the split

Answer 183

Overfitting - Don't want too many leaves - Even if the leaf nodes don't achieve 0 entropy

Answer 184

- Early Stopping (Limit tree depth/ min leaf size) - Post-Pruning (Remove unnecessary branches)

Answer 185

- **Unstable**: Small change in data, will create Different Tree - **Greedy Search**: No Backtracking - **Lower Accuracy**: Less Accurate than other models

Answer 186

- **Random Forest**: Combines multiple decision trees to reduce overfitting & improve accuracy - **Gradient Boosting**: Builds an ensemble of trees sequentially, where each tree corrects errors made by the previous tree

Answer 187

Mutual Information

Answer 188

- Decision tree learning algorithms recursively use mutual info to select the feature that shares the most info with the dependent variable - The selected feature is used to split the data

Answer 189

- Searching through all subsets is computationally too demanding - Greedy approach is feasible instead

Answer 190

Choose a subset of features S from the initial set F that maximises the mutual info between the target variable Y & the subset of selected features S: arg max {S⊆F} I(Y;S) ## Footnote - Select only relevant features before training the classifier

Answer 191

Greedy approach is feasible instead

Answer 192

- Start with all features as candidates & an empty selected set - Pick the feature maximising mutual information with the target - Remove it from the candidate set & repeat until K features are chosen

Answer 193

f{max} = arg max {Xi∈F} (I(Y;Xi) - β Σ {Xs∈S} I(Xs;Xi)) | F = candidate features, S = already selected features

Answer 194

How sure we are about something

Answer 195

Limits or Constraints

Answer 196

A classifier learns well if it can make correct predictions on new examples, not just ones it has seen before

Answer 197

TO: - Understand how machine learning works - Create better training methods for models - Develop better ways to check if a classifier is reliable

Answer 198

- If a classifier only memorises the training data, it will fail on new data - Want a classifier that can generalise (works on unseen data)

Answer 199

1) Divide samples into train & test sets 2) Train on the train set 3) Test on the test set ## Footnote Test Set (Used to see how many errors are made) Train Set (Used to teach the classifier)

Answer 200

- Not a perfect method: it doesn't tell us how good the classifier really is - Need a more mathematical way to measure reliability

Answer 201

Overfitting: When a classifier memorises training data instead of learning patterns. It is bad because a good classifier should work on new data, not just old data

Answer 202

Design a better learning algorithm - Ask questions like: - What is a good pruning criterion? - Why are large margins good? - What other algos are likely to get good results? ## Footnote The questions are important to make machine learning models better

Answer 203

- Input space - Input set {e.g. pics of cats & dogs}

Answer 204

- Output space - Output set (E.g. 1 = cat ; 0 = dog)

Answer 205

c : 𝒳 => 𝒴 - Looks at Input & predicts a label

Answer 206

All samples are drawn independently & identically distributed (i.i.d.) from some unknown distribution *D* over 𝒳 x 𝒴

Answer 207

- A random i.i.d. sample set S = (x1,y1), (x2,y2),...,(xn,yn) - We have n examples, where each x is an input & each y is the correct answer

Answer 208

- Data comes from an unknown pattern - Data points are randomly picked & independent

Answer 209

- Actual probability that the classifier makes a mistake/error.

Answer 210

cD = P{(x,y) ~ D} [c(x) ≠ y]

Answer 211

We don't know the true distribution *D*, therefore we don't know the value of cD

Answer 212

Depends on c: if c is deterministic (fixed), then cD is deterministic. If c is random then cD is random

Answer 213

Estimated Error using training sample

Answer 214

ĉS = (1/n) * Σ [c(xi) ≠ yi] from i = 1 to n | Avg. of the errors in the training set ## Footnote This is the fraction of mistakes in the observed sample

Answer 215

Binomial Distribution

Answer 216

P(n * ĉS = k | cD) = (n choose k) * (cD^k) * (1 - cD)^(n-k) ## Footnote This gives the probability of making exactly k mistakes in n trials

Answer 217

Describes the likelihood of different error counts

Answer 218

How Likely different outcomes are

Answer 219

- Classifier makes n predictions, the no. of mistakes follows a Binomial Distribution - True error rate = cD e.g. - Flipping a biased coin n times where: Heads = Error (prob cD) Tail = Not Error (prob 1 - cD) PMF tells us the probability of getting k ERRORS

Answer 220

PMF is narrow (probability mass is concentrated near 0)

Answer 221

PMF is wider

Answer 222

More predictable classifier (Low uncertainty)

Answer 223

Less predictable (High Uncertainty)

Answer 224

Tells us the probability of getting at most k mistakes

Answer 225

Bin(n, k, cD) = P(ĉS ≤ k/n | cD) ## Footnote Sum of the PMF values up to the given argument k (no. of errors)

Answer 226

- Sums up probs of making 0 to k mistakes - We ask what is the prob of k or fewer mistakes (Not Just k)

Answer 227

To solve for cD, we set δ, where δ = 1 - (confidence level) e.g. if the confidence level is 95% then δ = 0.05

Answer 228

- Pick a confidence level - Use this to find an upper bound on the classifier's true error cD

Answer 229

Bin(n, k, δ) = max {P : Bin (n, k, P) >= δ} ## Footnote Means we find the highest true error rate cD that still gives us at most k mistakes in n trials with probability at least δ

Answer 230

1) Looks at different possible error rates cD 2) Find which pass the Binomial test for being >= δ 3) Take the max (highest) of the accepted values
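
The search described in these steps can be sketched numerically. Because the probability of at most k errors decreases as the error rate grows (for k < n), the largest error rate that still passes the test can be found by bisection rather than by trying every value. The function names `binomial_cdf` and `binomial_upper_bound`, the tolerance and the example numbers are assumptions for illustration; `math.comb` requires Python 3.8 or later.

```python
import math


def binomial_cdf(n, k, p):
    """Probability of at most k errors in n trials when the true error rate is p."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def binomial_upper_bound(n, k, delta, tol=1e-9):
    """Largest error rate p with binomial_cdf(n, k, p) >= delta, found by bisection."""
    if k >= n:
        return 1.0                       # every error rate passes the test
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binomial_cdf(n, k, mid) >= delta:
            lo = mid                     # mid still passes, so the bound is at least mid
        else:
            hi = mid                     # mid fails, so the bound is below mid
    return lo


# Hypothetical test result: 0 mistakes on 100 examples, delta = 0.05 (95% confidence).
print(binomial_upper_bound(100, 0, 0.05))
```

With zero observed mistakes the answer has a closed form, 1 - δ^(1/n), which is a convenient sanity check for the numerical search.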

Answer 231

P(cD ≤ Bin(n, n * ĉS, δ)) ≥ 1 - δ

Answer 232

- Probability our estimate bound is correct is at least 1 - δ. - Provides a probabilistic guarantee on the true error rate based on observed data - Derived under assumption of i.i.d

Answer 233

- Guarantees we are not underestimating the classifier's true error - The more data (n) we have, the tighter the bound becomes

Answer 234

2 Ways: - Numerical method to compute Bin(n, n * ĉS, δ) - Approximate upper bound of Bin(n, n * ĉS, δ)

Answer 235

- Compute Bin(n, n * ĉS, δ) accurately - Gives the tightest possible bound

Answer 236

- Compute Upper Bound of Bin(n, n * ĉS, δ) - Faster, but loses some tightness

Answer 237

P(cD ≤ ĉS + sqrt( (ln(1/δ)) / (2n) )) ≥ 1 - δ | Formula gives an upper bound on the classifier's true error ## Footnote cD = Real error rate of the classifier ĉS = observed error on test set n = No. of test examples δ = 1 - confidence level
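
The bound is simple enough to evaluate directly. In this sketch the function name `approx_error_bound` and the example numbers are assumptions; `delta` is 1 minus the confidence level, as defined on the earlier card.

```python
import math


def approx_error_bound(observed_error, n, delta):
    """Upper bound on the true error: observed error + sqrt(ln(1/delta) / (2n))."""
    return observed_error + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


# Hypothetical test result: 10% observed error at 95% confidence (delta = 0.05).
for n in (100, 1000, 10000):
    print(n, approx_error_bound(0.10, n, 0.05))
```

The printed values shrink towards the observed error as n grows, which illustrates why larger test sets give tighter bounds.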

Answer 238

- Upper bound approximations, which are simpler to compute & simpler to interpret

Answer 239

- For large n, the extra term becomes small, meaning we get a tight bound

Answer 240

ONLY WORKS IF TEST SET IS TRULY INDEPENDENT - If we use the training set, this bound does NOT work

Answer 241

1) Help us certify how well a classifier has learned 2) Probabilistic 3) Large test sets give tighter bounds 4) Assume the data points are independent 5) Useful in both theory & practice

Answer 242

- Bound is tight, classifier is very reliable - Bound is wide, need more data to be confident

Answer 243

- No guarantee the error is exactly what we calculated - Tells us the bound is correct with probability at least 1-δ

Answer 244

- More data reduces uncertainty - Gives more precise estimate

Answer 245

If the test data is not random or independent, then the bound is not valid ## Footnote Has to be i.i.d

Answer 246

- Bound is simple & well understood - Better than just using a test set accuracy %, which can be misleading

Answer 247

A way to estimate the true error rate of a classifier when we only have the training set

Answer 248

- In real life, we often have very little labelled data - Might not have enough data for a test set - Need a way to estimate the true error without a test set

Answer 249

- Only works if the test set is independent - If we use the training set, the classifier has already seen the data - Errors on the training set are not random anymore

Answer 250

- Training errors are no longer independent, because the classifier has already seen the data ## Footnote E.g. Imagine a student memorises answers for a test: - Check their performance on the test, they will do great - True ability might be worse

Answer 251

- Uses Prior Belief - Before seeing any training data we assume some classifiers are more likely than others - Defines a probability distribution over classifiers P(c) - Represents our initial guess about which classifiers are good

Answer 252

- If a classifier is too complex, it is less likely to generalise well - Simpler models are more likely to generalise - Build a bound that takes complexity into account

Answer 253

P(cD ≤ Bin(n, n * ĉS, δ * P(c))) ≥ 1 - δ ## Footnote Adjusts the test set bound using a prior probability P(c) for the classifier

Answer 254

- About how good classifier is - Confidence bound is adjusted based on P(c)

Answer 255

P(cD ≤ ĉS + sqrt( (log(1/P(c)) + log(1/δ)) / (2n) )) ≥ 1 - δ
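
To see how the prior P(c) enters the bound above, here is a minimal sketch. The function name `occam_razor_bound` and the example values are illustrative assumptions, not taken from the course material.

```python
from math import log, sqrt

def occam_razor_bound(train_error, n, delta, prior):
    """Upper bound on the true error from the training error and the prior P(c)."""
    return train_error + sqrt((log(1 / prior) + log(1 / delta)) / (2 * n))

# Same training error and data size, two different prior bets on the classifier
likely_classifier = occam_razor_bound(0.10, 1000, 0.05, prior=0.5)
unlikely_classifier = occam_razor_bound(0.10, 1000, 0.05, prior=0.001)
```

A classifier the prior considered unlikely pays a larger log(1/P(c)) penalty, so its bound is looser; with P(c) = 1 the expression reduces to the approximate test set bound.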

Answer 256

- Occam's Razor Bound is weaker than the Test Set Bound - But it is the best we can do without a test set

Answer 257

Depends on the Self-Information log(1/P(c)) of the classifier with respect to prior bets

Answer 258

- Each classifier has its own distribution - Bound ensures we don't pick a bad classifier by accident - Size of δP(c) depends on our initial guess P(c) - Better Prior = Tighter Bound

Answer 259

cD ≤ ĉS + sqrt( (ln(1/δ)) / (2n) ) ## Footnote For 95% confidence, set δ = 0.05

Answer 260

Bias-Variance Analysis

Answer 261

Estimate the true parameter θ*

Answer 262

Construct an estimator θ'n for an unknown parameter (θ * ) given observed data S. ## Footnote Index n is to indicate that we obtained this estimate from n examples

Answer 263

S = {(xi,yi) , i ∈ [n]} => Generated from an unknown distribution that has some true parameter => Consists of Inputs & Outputs

Answer 264

θ'n is a random variable because it depends on the dataset, which contains randomness

Answer 265

- Estimating the probability of a coin flip {Bernoulli Distribution} - Estimating parameters of Linear Regression Model : θ'n = ((X^T X )^-1) (X^T)y

Answer 266

We generate multiple training sets, each will give a different estimate θ'n.

Answer 267

Low Bias: Estimates are close to True Parameter High Bias: Estimates are Far from True Parameter Low Variance: Estimates are Close together High Variance: Estimates are Spread out

Answer 268

Low Bias & Low Variance is what we want, but practically there are usually trade-offs.

Answer 269

Difference between the expected estimate & the ground truth (TRUE PARAMETER) Definition: Bias of θ'n is Bias(θ'n) = E[θ'n] - θ *

Answer 270

if E[θ'n] = θ * , for all θ *

Answer 271

On the Ground Truth {True Parameter}

Answer 272

High Bias dominates the error

Answer 273

High Variance dominates the error

Answer 274

- Measures how much estimates fluctuate. - Definition: Variance of θ'n is Var(θ'n) = E[(θ'n - E[θ'n])^2].

Answer 275

- Sum of (bias)^2 & Variance **E[(θ' - θ * )^2] = (θ * - E[θ'])^2 + E[(θ' - E[θ'])^2]**

Answer 276

θ'{MLE} = (Σxi) / n | i is 1 to n ## Footnote MLE is the sample mean of Bernoulli trials

Answer 277

θ'{MAP} = ((Σxi) + α - 1) / (n + α + β - 2) | i is 1 to n ## Footnote Maximiser of the posterior distribution of θ when the prior distribution is p(θ) = Beta( α , β)

Answer 278

Bias(θ'{MLE}) = E[ (Σxi) / n] - θ = (nθ)/n - θ = θ - θ = 0 | Since Bias = 0, the MLE is unbiased ## Footnote E[(Σxi) / n] becomes (nθ)/n as each xi has expectation θ, so E[Σxi] = nθ

Answer 279

Bias(θ'{MAP}) = E[θ'{MAP}] - θ = ((nθ + α - 1) / (n + α + β - 2)) - θ ## Footnote MAP is **BIASED** estimator toward prior mean, especially for small n, but becomes unbiased as n grows larger

Answer 280

Var(θ'{MLE}) = Var((Σxi) / n) = (Var (Σxi))/n^2 = nθ(1-θ)/n^2 = θ(1-θ)/n

Answer 281

Var(θ'{MAP}) = (1/(n + α + β - 2)^2) * Var(ΣXi) = nθ(1-θ) /((n + α + β - 2)^2) ## Footnote - An informative prior (α , β) reduces the variance - For very large n, the influence of α & β diminishes - For small n, α & β have to be informative
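
The bias and variance formulas for the Bernoulli MLE and MAP estimators can be compared numerically. This is a minimal sketch that plugs values into the expressions from these cards; the function names and the example values of θ, n, α and β are made up for illustration.

```python
def mle_stats(theta, n):
    """(bias, variance, MSE) of the MLE (sample mean) for Bernoulli(theta)."""
    bias = 0.0
    var = theta * (1 - theta) / n
    return bias, var, bias**2 + var

def map_stats(theta, n, alpha, beta):
    """(bias, variance, MSE) of the MAP estimate under a Beta(alpha, beta) prior."""
    denom = n + alpha + beta - 2
    bias = (n * theta + alpha - 1) / denom - theta
    var = n * theta * (1 - theta) / denom**2
    return bias, var, bias**2 + var

# Prior Beta(5, 5) is centred on 0.5
good_prior = map_stats(theta=0.5, n=10, alpha=5, beta=5)  # prior matches the truth
bad_prior = map_stats(theta=0.9, n=10, alpha=5, beta=5)   # prior is far from the truth
baseline = mle_stats(theta=0.9, n=10)
```

When the prior is centred on the true θ, MAP has lower MSE than the MLE; when the prior is far off and n is small, the squared bias outweighs the variance reduction and the MLE wins.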

Answer 282

- Want an estimator that has low bias & low variance (usually cannot be achieved simultaneously) - MLE has low Bias if n is large, but has high variance - MAP has high Bias, but low Variance - Bayesian estimators incorporate prior information, which introduces bias but reduces variance - Bayesian posterior mean is biased but can achieve lower MSE than the frequentist estimator - Best choice depends on sample size, prior knowledge & application needs

Answer 283

- MAP is Bayesian - Bias-Variance is Frequentist

Answer 284

Find estimator f'(x) that approximates f(x) using the training set

Answer 285

y = f(x) + ∈ | ∈ = Random noise {mean = 0 & variance = σ^2} f(x) = Actual Relationship

Answer 286

Predictions are stable but far from the true function High Bias: Simple Linear model makes similar predictions, but they are all wrong

Answer 287

S = {(xi,yi)}

Answer 288

Each model fits the training data well but gives different results - High variance = High degree polynomial perfectly fits each data set but varies widely

Answer 289

The performance of a model at a fixed test point x

Answer 290

The difference between the true function f(x) and the expected model prediction over all possible training sets: **Bias^2 = (f(x) - E{s}[f'(x;S)])^2** | E.g. all predictions are systematically too low or too high, so the model has bias ## Footnote f(x) => True function that generated the data f'(x;S) => Model's prediction at x given training set S E{s}[f'(x;S)] => Expected prediction over different training sets

Answer 291

How much the model's predictions vary for different training sets S: **Variance = E{s}[(f'(x;S) - E{s}[f'(x;S)])^2]** | - Captures the sensitivity of the model to different training sets ## Footnote High variance means the model is sensitive to small changes in the data

Answer 292

Represents the irreducible noise: the variance in y that is independent of x: **Irreducible Noise = E{∈}[∈^2] = σ^2** ## Footnote Inherent randomness in data cannot be reduced

Answer 293

**Start with:** E{s}[(f'(x;S) - f(x))^2] **Add/Sub the Mean:** = E{s}[(f'(x;S) - E{s}[f'(x;S)] + E{s}[f'(x;S)] - f(x))^2] **Expand the Square:** = (E{s}[f'(x;S)] - f(x))^2 + E{s}[(f'(x;S) - E{s}[f'(x;S)])^2] + Cross-term **Eliminate Cross-term:** Cross-term = 2 (E{s}[f'(x;S)] - f(x)) * E{s}[f'(x;S) - E{s}[f'(x;S)]] = 0, because E{s}[f'(x;S) - E{s}[f'(x;S)]] = 0

Answer 294

- Includes Noise E{s,∈}[(y - f'(x;S))^2] =E{s}[(f(x) - f'(x;S))^2] + E{∈}[∈^2] =(E{s}[f'(x;S)] - f(x))^2 + E{s}[(f'(x;S) - E{s}[f'(x;S)])^2] + σ^2 = Bias^2 + Variance + σ^2

Answer 295

**Bias**: Measures the error due to approximating the true function (High Bias can occur with overly simple models [underfitting]) **Variance**: Measures the variability of prediction f'(x;S) around its mean (High Variance occurs with overly complex models [Overfitting]) **Noise**: Irreducible error inherent in the data generation process **Tradeoff**: Increasing the model complexity typically reduces bias but increases variance & vice versa

Answer 296

A model with too few parameters will have high bias and low variance

Answer 297

A model with too many parameters will have low bias & high variance

Answer 298

- First line of defense against overfitting (Reducing Variance) - Adding a penalty term to the cost function to reduce variance

Answer 299

MAP estimation of the parameters, which can reduce variance at the expense of bias

Answer 300

Penalty term: λ * ||θ||^2 Cost function: J(θ) = - log P(D|θ) + λ * ||θ||^2 - **Working backwards, minimising this cost amounts to finding the θ that maximises the posterior (MAP)**

Answer 301

Idea: Train multiple models on different random subsets of data & average the predictions (Repeatedly resample the training set, train a separate model on each sample and average the predictions)
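
The resample, train, average loop can be sketched as follows. This is a minimal sketch under stated assumptions: the "model" is a one-parameter least-squares line through the origin chosen only to keep the example short, and the names `fit_slope`, `bagged_slopes` and `bagged_predict` are hypothetical.

```python
import random

def fit_slope(points):
    """Least-squares slope w for y ≈ w * x (no intercept)."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, y in points)
    return sxy / sxx

def bagged_slopes(points, n_models=50, seed=0):
    """Train one model per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        slopes.append(fit_slope(sample))
    return slopes

def bagged_predict(slopes, x):
    """Average the predictions of all bootstrap models at input x."""
    return sum(w * x for w in slopes) / len(slopes)

# Noisy-looking toy data scattered around y = 2x (values made up for illustration)
data = [(1, 2.1), (2, 3.8), (3, 6.3), (4, 7.9), (5, 10.2)]
slopes = bagged_slopes(data)
prediction = bagged_predict(slopes, 3)
spread = max(slopes) - min(slopes)  # disagreement between the bootstrap models
```

The spread of the individual bootstrap models is one rough way to gauge how much the fit depends on the particular training sample.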

Answer 302

Reduces variance

Answer 303

Epistemic Uncertainty is due to lack of knowledge. Bootstrap helps estimate this uncertainty.

Answer 304

Random Variables

Answer 305

Conditional dependencies

Answer 306

E.g Markov Random Fields These graphs define dependencies but do not capture causality

Answer 307

Directed edges to represent cause-effect relationships ## Footnote A.K.A Bayesian Networks

Answer 308

Probabilistic graphical model that uses the direction of edges to represent the cause-effect relationship & Bayes' theorem for probabilistic inference ## Footnote Helps make data-driven decisions under uncertainty

Answer 309

- **Graphical Representation:** Visual representation of joint probability distributions of different random variables - **Powerful:** Captures complex relationships between random variables (CAN REPRESENT CAUSAL STRUCTURES) - **Combine data & prior knowledge:** Uses both historical knowledge & new observations - **Generative Approach:** Able to generate new data similar to existing data.

Answer 310

- Requires prior knowledge (Needs a known probability distribution) - Computationally Intractable (Large network requires significant computations).

Answer 311

Simple Bayesian network for classification - Target Variable Y is CLASS LABEL - Input Variable X = {X1,X2,...,Xn} is EVIDENCE

Answer 312

- **Inference:** Given evidence, compute the probability of other variables - **Training:** Learn Model Parameters - **Structure Determination:** Identify what is connected to what (How variables connect)

Answer 313

Bayesian network is a directed, acyclic graph (DAG) - Node => Random Variables - Directed Edges => Conditional Dependencies - Directed Edges represent directed influence (direct cause)

Answer 314

P(Xi|Parents(Xi))

Answer 315

- Conditional Distribution can be represented as a conditional probability table (CPT) CPT is the distribution over Xi for each combination of parent values (TABLE THAT SPECIFIES PROBABILITY)

Answer 316

**Full Joint Distribution:** P(X{1},X{2},...,X{n}) = P(X{1})P(X{2}|X{1})...P(X{n}|X{1},...,X{n-1}) **Limited-Dependence Assumption:** P(X{1},X{2},...,X{n}) = Π{i=1..n} P(X{i}|Parents(X{i}))

Answer 317

Reduces complexity by assuming conditional independences

Answer 318

Compact representation of the joint probability in terms of conditional distributions

Answer 319

- Direct Cause - Indirect Cause - Common Cause - Common Effect

Answer 320

**A** => **B** - A directly influences B E.g. Smoking (A) directly causes lung Damage (B)

Answer 321

**A** =>**B**=>**C** - A affects B, then B affects C E.g. Smoking (A) leads to tar buildup (B), which then causes lung cancer (C)

Answer 322

**B** <= **A** => **C** - A influences both B & C E.g. Genetic Mutation (A) may cause both high blood pressure (B) & Heart Disease (C)

Answer 323

**A** , **B** => **C** - Both A & B contribute to causing C E.g. Smoking (A) & exposure to asbestos (B) both increase the risk of lung cancer (C)

Answer 324

An edge represents the conditional dependence between the parent node & the child.

Answer 325

The problem of finding out the "Cause" variable when we only observe the "effect" variable

Answer 326

Observed: The ones we have knowledge about Unobserved: Ones we don't observe

Answer 327

P(B) = Σ{A} P(B|A) P(A)

Answer 328

P(B|A) = P(A∩B) / P(A) = P(B)P(A|B)/P(A) | This is called diagnosis

Answer 329

P(A) = P(A|B) OR P(A,B) = P(A)P(B)

Answer 330

Two random variables A & B are conditionally independent given a third random variable C: (A ⫫ B) | C iff P(A,B|C) = P(A|C)P(B|C) | reduces the number of parameters required

Answer 331

P(A,B|C) = P(A|C)P(B|C) Total Prob: P(A,B) = ∑{C} P(A∣C)P(B∣C)P(C)

Answer 332

P(C∣A,B)=P(C∣B) Total Prob: P(C | A) = Σ{B} P(C | B) * P(B | A)

Answer 333

P(A,B,C)=P(B)P(C)P(A∣B,C) If A is observed then B & C become dependent: P(B∣A,C)= P(B∣C)P(A∣B,C)/P(A∣C) ## Footnote B & C cause A

Answer 334

A node is independent of its non-descendants given its parents

Answer 335

A node's Parents + Children + Children's Parents

Answer 336

NO: a standard Bayesian network is a DAG, so a net of up to 3 nodes only allows for the 4 standard structures

Answer 337

P(A,B,C) = P(A)P(B|A)P(C|B)

Answer 338

P(A,B,C) = P(A)P(B)P(C|A,B) ## Footnote If neither C nor any of its descendants is observed, then A & B are independent (IF WE CONFIRM ONE CAUSE OF AN OBSERVATION, IT REDUCES THE NEED TO INVOKE AN ALTERNATIVE CAUSE)

Answer 339

P (A,B,C) = P(A)P(B|A)P(C|A) | If A is observed then B & C are independent

Answer 340

A node is independent of its ancestors given its parents

Answer 341

Means of computing probabilities in a Bayesian Network ## Footnote A.K.A QUERY => Answering questions about the underlying probability distribution

Answer 342

- **Diagnosis**: Given an effect, infer a cause - **Prediction**: Given a cause, infer an effect

Answer 343

- **Classification**: Finding the most probable class - **Decision Making**: Computing expected utility of different actions

Answer 344

- **Evidence**: E = {E1,...,En} (Observed/Known Data) - **Query**: X = {X1,...,Xn} (Variables we want to infer) - **Non-evidence**: Y = {Y1,...,Yn} (Neither known nor wanted variables that must be dealt with) ## Footnote The union of these variables is the complete set of variables of a Bayesian Network

Answer 345

**Unconditional:** Compute P(X), the marginal probability **Conditional:** Compute P(X|E) given Evidence

Answer 346

Find the most probable value for query X given evidence E = e MAP(X|E =e) = arg max{x} P (X = x | E=e) ## Footnote - We want to classify an object into category Y - Used in ML Classification

Answer 347

- Computes the posterior probability distribution / the exact value of P(X|E) P(X|E) = P(X,E)/P(E) = α * P(X,E) = α * Σ{Y} P(X,E,Y) where α = 1/P(E) | Means computing the precise probability ## Footnote α => Normalisation Constant Using marginalisation we sum over non-evidence variables

Answer 348

- Process of summing over non-evidence Variables **Calculations:** - Discrete variable, we sum over their possible values - Continuous Variable, we Integrate instead

Answer 349

Formed by calculating the subset of a larger probability distribution of a collection of random variables

Answer 350

If there are n non-evidence variables & each has m values, the complexity is **O(nm^n)** | This is exponential, making large networks impractical ## Footnote n => No. of non-evidence Vars m => No. of values each Var can take

Answer 351

- Infer the posterior prob by marginalisation of the joint distribution - We compute the exact value of P(X|E) using : (Σ{y} P(X,E,Y)) / P(E)
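
Inference by enumeration can be sketched on the three-node chain A => B => C. This is a minimal sketch: the probability values in the tables are made up for illustration, and the function names are hypothetical. The query is A, the evidence is C, and B is the non-evidence variable that gets summed out.

```python
# CPTs for the chain A -> B -> C (all numbers are illustrative assumptions)
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.8, False: 0.2},    # P_B_given_A[a][b]
               False: {True: 0.1, False: 0.9}}
P_C_given_B = {True: {True: 0.6, False: 0.4},    # P_C_given_B[b][c]
               False: {True: 0.05, False: 0.95}}

def joint(a, b, c):
    """P(A, B, C) = P(A) P(B|A) P(C|B), the chain factorisation."""
    return P_A[a] * P_B_given_A[a][b] * P_C_given_B[b][c]

def posterior_A_given_C(c):
    """P(A | C = c): sum out B, then normalise by P(C = c)."""
    unnormalised = {a: sum(joint(a, b, c) for b in (True, False))
                    for a in (True, False)}
    evidence = sum(unnormalised.values())        # P(C = c), so alpha = 1 / evidence
    return {a: p / evidence for a, p in unnormalised.items()}

posterior = posterior_A_given_C(True)  # P(A = True | C = True) = 0.147 / 0.2205 = 2/3
```

Observing the effect C raises the probability of the cause A from the prior 0.3 to 2/3, which is the diagnostic direction of inference.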

Answer 352

1) Inference in a Bayesian Network answers diagnostic & predictive questions 2) Exact Inference can be performed using enumeration & marginalisation 3) Computational complexity is HIGH, so more efficient algorithms exist

Answer 353

- Not about Prediction - But about choice & decision-making, such as planning & assigning - Trying to find the optimal solution - To Minimise/Maximise something - The process of navigating from the start state to a goal state by transitioning through intermediate states.

Answer 354

Computational problem where the goal is to find a solution that transforms an initial state into a goal state, by exploring a space of possible solutions

Answer 355

**State Space**: All possible states **Start State**: Where search begins **Goal State**: Desired state the agent is looking for **Goal Test**: Whether the goal state is achieved or not **Successor Function**: Given a certain state, how to get to the next state **Solution**: A sequence of actions which transforms the start state to a goal state

Answer 356

A graph to represent all possible states & transitions between them - *Nodes represent states* - *Arcs/Edges represent transitions {From state where can you go}*

Answer 357

Process of solving the search problem can be abstracted as a search tree - * Start state is root node* - *Children corresponds to successors* - *Nodes show states, but correspond to plans that achieve those states* - **Shows all the possible paths from start to end/goal nodes**

Answer 358

1) Start with initial state (*Root node in the search tree*) 2) Expand a node (*Find all possible next moves*) 3) Pick which node to expand next (*Depends on the search strategy*) 4) Check if it's a Goal State: - **YES** *(Return Solution)* - **NO** *(Continue Expanding)* 5) Repeat until we find the solution or run out of nodes

Answer 359

Method for solving search problems by expanding nodes in a systematic way ## Footnote We explore the state space step by step

Answer 360

- When solving problems, we often have many choices at each step - Tree search helps organise these choices into a structure - Ensures we don't miss any solutions & find the best one when necessary

Answer 361

- Explore deepest nodes first before backtracking - Keep looking at the child (left node) until reaching the leaf, which is either a goal or not ## Footnote Uses a Stack to keep track of nodes (LIFO)

Answer 362

- Explores all the nodes at the current depth/level before going deeper - Guaranteed to find the shortest path (fewest steps) - Good/ideal for shallow goals Order of expansion is left to right, unless stated otherwise | Uses a Queue to keep track of nodes (FIFO) ## Footnote Uses more memory than DFS as it needs to store all nodes at a level - Slow for deeper goals
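
The only difference between the two strategies is which end of the frontier is expanded next. The sketch below is a minimal illustration, assuming a small made-up tree; the function name `tree_search` and the `strategy` parameter are hypothetical.

```python
from collections import deque

def tree_search(start, is_goal, successors, strategy="bfs"):
    """Return the path found from start to a goal, or None if the frontier empties."""
    frontier = deque([[start]])  # each entry is a path (a plan), not just a state
    while frontier:
        # BFS takes the oldest path (FIFO queue); DFS takes the newest (LIFO stack)
        path = frontier.popleft() if strategy == "bfs" else frontier.pop()
        state = path[-1]
        if is_goal(state):
            return path
        children = list(successors(state))
        if strategy == "dfs":
            children.reverse()  # so the leftmost child ends up on top of the stack
        for child in children:
            frontier.append(path + [child])
    return None

# Toy search tree (made up): the goal G is reachable at depth 2 and at depth 3
tree = {"S": ["A", "B"], "A": ["C", "D"], "B": ["G"], "C": [], "D": ["G"], "G": []}
bfs_path = tree_search("S", lambda s: s == "G", lambda s: tree[s], "bfs")
dfs_path = tree_search("S", lambda s: s == "G", lambda s: tree[s], "dfs")
```

On this tree BFS returns the shallow path S, B, G, while DFS dives down the left branch first and returns S, A, D, G.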

Answer 363

We expand goal state, rather than finding it

Answer 364

1) Start at the initial state 2) Pick one path & follow it all the way down 3) If you reach a dead-end, backtrack & try another path 4) Repeat until you find the goal state

Answer 365

Problems where the goal is to find a valid assignment of values to variables while satisfying a set of constraints.

Answer 366

- Planning - Identification

Answer 367

- Care about the Goal itself, not path DON'T CARE ABOUT ORDER ONLY THE FINAL GOAL

Answer 368

- Sequence of actions ONLY CARE ABOUT THE PATH TO THE GOAL

Answer 369

Identification problems have constraints to be satisfied, but there are no preferences. - Has constraints to be met - No Preferences (No Objectives) - No maximise/Minimise required

Answer 370

Hard Constraints, which legal solutions cannot violate

Answer 371

Soft Constraints, where we need to optimise *DON'T NEED TO MINIMISE/MAXIMISE BUT STILL THERE*

Answer 372

- **A set of variables *(X1,X2,...)*** *Things we need to assign values to* - **A Domain for each Variable *(D1,D2,...)*** *Possible values for each variable* - **Set of Constraints** *Rules that limit which values can be assigned*

Answer 373

***When every variable has a value***

Answer 374

***Not all variables have a value***

Answer 375

**Standard Search Problems:** - State is a "black-box": *Arbitrary data structures* - Goal Test can be any function over states - Focuses on planning (Finding Sequences) - Path to solution is important - Explores every possible solution 1-by-1 - Uses methods like DFS, BFS, A* **CSPs:** - State is defined by variables - Values of variables are given by the domain - Goal Test is a set of constraints specifying the allowed combinations of the values of variables *Allowed domain values for the variables that meet the constraints* - The Final Goal is most important - Constraint Based Pruning: Prunes the search space by eliminating infeasible values early. - Uses methods like: - Backtracking Search (uses pruning) - Forward Checking (eliminating inconsistent values early) - Local Search

Answer 376

Visual representation of relationships between variables & constraints - Easier to visualise - Helps solve problems using graph-based techniques | - Node = Variable - Edge/Arc = Constraints

Answer 377

Square to represent a constraint, & connect all the variables involved

Answer 378

- Something that needs to be assigned a value (from its DOMAIN) - Represents the unknowns you are trying to determine in a given problem.

Answer 379

Set of possible values that each variable can take

Answer 380

- Defines a relationship between variables - Restricts the possible values that can be assigned to them

Answer 381

Consider: - How to create a constraint - What we are trying to solve/ the task

Answer 382

Set Variables: - **Something related to goals / problem** *e.g. n-queens: goal is to assign a column for each queen, s.t. no 2 queens threaten each other* - **May need to consider the size of the domain** *e.g. sudoku, each variable has a domain size of 9 (1-9)* *e.g. Optimisation Problem, variables are real numbers, the domain is infinite, but you may have to discretise (**Continuous to Discrete**) it for practical computation* - **How the constraints can be expressed.** *e.g. sudoku, the constraint could be that each row, column & subgrid must contain all numbers 1-9 without repetition. **GLOBAL CONSTRAINTS GOVERN THE RELATIONSHIPS BETWEEN ALL VARS***

Answer 383

**Finite Domains**: - Discrete - Has limited set of values - E.g Sudoku, Minesweeper, & Map Colouring **Infinite Domains**: - Discrete / Continuous - E.g. Variables that involve time/ Numbers (REAL) - Example scenarios: Optimisation Problems, Temperature in different rooms etc.

Answer 384

**Unary**: - One Variable - e.g. a = 1 **Binary**: - It's between 2 Variables - e.g. a != b **High-Order Constraints**: - Involves 3+ variables

Answer 385

They eliminate invalid assignments early to avoid brute-force search

Answer 386

If a CSP has n variables and the size of each domain is d, then there are O(d^n) complete assignments. - The larger the search space, the more time and memory are needed to explore all possibilities. - For large d & n values, it becomes practically impossible to exhaustively explore all assignments - Brute-force becomes inefficient for large n & d values: the naive method of searching all possible combinations of assignments is exponentially slow as the no. of variables increases - As the size of the problem grows, intractability increases and the time to solve problems grows quickly **INFINITE DOMAINS:** - Search space is UNBOUNDED, & there's no simple way to count the possible solutions, making them difficult to search effectively

Answer 387

YES , they become **Constrained Optimisation Problems**

Answer 388

- Assignment Problems (*Who teaches which class*) - Timetabling Problems (*Class is offered when & where*) - Hardware configuration - Transportation Scheduling - Factory Scheduling - Circuit layout - Fault diagnosis

Answer 389

- Constraint graph - A puzzle where the digits of numbers are represented by letters. - Each letter represents a unique digit - GOAL is to find the digits s.t. a given equation is verified

Answer 390

- Generate all complete assignments - Test each assignment in turn - Then return the first one that satisfies all constraints

Answer 391

- Needs to store all d^n complete assignments Therefore: - Exponential growth in memory - Generates assignments that fail the constraints (wastes computation)

Answer 392

In CSPs, states are defined by the values assigned so far *{PARTIAL ASSIGNMENTS}* - **Initial State:** the empty assignment {} - **Successor Function:** assign a value to an unassigned variable - **Goal:** Current assignment is complete & satisfies all the constraints

Answer 393

- Explores all the shallowest nodes before going deeper - Solution or Goal state is ALWAYS in the bottom layer - BFS needs to traverse and explore all nodes to find the solution *{PARTIAL ASSIGNMENTS}*

Answer 394

- Explore the deepest nodes, until your reach the goal node - Do not need to explore all nodes - Need to consider constraints as we explore

Answer 395

- Only consider values which do not conflict with previous assignments - A tie is broken alphabetically & numerically - May have to do computation like checking the constraints, i.e. "*Incremental goal test*"

Answer 396

- Variable assignments are commutative, so fix the ordering - One variable is assigned at each layer - The order of assignments doesn't affect the correctness of the solution, but it can impact efficiency

Answer 397

DFS method with 2 additional things: 1) Check constraints as you go 2) Consider one variable at a layer

Answer 398

- A,B,C have domain {0,1,2} - Constraint: A < B < C - Start with empty assignment - Explore different assignments, checking the deeper node first; if it doesn't work, go back and choose a different assignment
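
The example above can be traced with a short backtracking search. This is a minimal sketch: the function names `consistent` and `backtrack` are hypothetical, and variables are tried in the fixed order A, B, C with values in ascending order.

```python
def consistent(assignment):
    """Check A < B < C on whichever of the variables are assigned so far."""
    a, b, c = (assignment.get(v) for v in "ABC")
    if a is not None and b is not None and not a < b:
        return False
    if b is not None and c is not None and not b < c:
        return False
    return True

def backtrack(assignment, variables=("A", "B", "C"), domain=(0, 1, 2)):
    """Depth-first search that assigns one variable per layer and checks constraints as it goes."""
    if len(assignment) == len(variables):
        return assignment  # complete and consistent: a solution
    var = variables[len(assignment)]  # fixed ordering of variables
    for value in domain:
        candidate = {**assignment, var: value}
        if consistent(candidate):  # incremental goal test
            result = backtrack(candidate, variables, domain)
            if result is not None:
                return result
    return None  # every value failed: backtrack to the previous layer

solution = backtrack({})  # {'A': 0, 'B': 1, 'C': 2}
```

Here A=0 is tried first, B=0 is rejected immediately because it breaks A < B, and the search continues with B=1 and then C=2 without ever building a full invalid assignment.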

Answer 399

- **Filtering** - **Ordering** ## Footnote These are methods to speed up searches

Answer 400

Can we detect inevitable failure early? - If an assignment fails a constraint, you can cancel it Keeps track of domains for unassigned variables and crosses off bad/infeasible options

Answer 401

Forward Checking

Answer 402

Cross off values of neighbouring variables that violate a constraint when added to the existing assignment When a variable is assigned, cross off anything that is now violated in all its neighbours' domains e.g.: Vars: A,B,C with domain {0,1,2} Constraints: B>A>C Assign A = 0, domains of B & C are reduced to match the constraints B new domain is {1,2} C new domain is {} => NOT LEGAL as the domain of C is empty
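
The B > A > C example can be reproduced with a small pruning routine. This is a minimal sketch, assuming binary constraints stored as a dictionary from variable pairs to predicates; the name `forward_check` and that representation are illustrative choices.

```python
def forward_check(domains, var, value, constraints):
    """Return new domains after assigning var = value, pruning the neighbours' values."""
    new_domains = {v: list(d) for v, d in domains.items()}
    new_domains[var] = [value]
    for (x, y), allowed in constraints.items():
        if x == var:    # var is the first argument of the constraint
            new_domains[y] = [w for w in new_domains[y] if allowed(value, w)]
        elif y == var:  # var is the second argument of the constraint
            new_domains[x] = [w for w in new_domains[x] if allowed(w, value)]
    return new_domains

# B > A and A > C, written as two binary constraints
constraints = {("B", "A"): lambda b, a: b > a,
               ("A", "C"): lambda a, c: a > c}
domains = {v: [0, 1, 2] for v in "ABC"}

pruned = forward_check(domains, "A", 0, constraints)
dead_end = any(len(d) == 0 for d in pruned.values())  # an empty domain means A = 0 cannot work
```

After assigning A = 0 the domain of B shrinks to [1, 2] and the domain of C becomes empty, so the failure is detected before B or C is ever assigned.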

Answer 403

- Chooses which variable to assign first/next, rather than going alphabetically - Consider the Var with the min number of values to explore [e.g. fewest legal values left in domain] e.g.: Vars: A,B,C with domain {0,1,2} Constraints: A <= B < C Assign A = 0, domains of B & C are reduced to match the constraints B new domain is {0,1,2} C new domain is {1,2} Assign C next as it has the fewest remaining values Repeat previous steps to find a solution

Answer 404

Vars: A,B,C with domain {0,1,2} Constraints: A <= B < C & A + B + C = 3 Assign A first with 0 (Alphabetical & Numerical Assignment) Then pick the next variable with the smallest new domain, where the assignment of A meets the constraints Repeat the assignment until you find a solution; if a domain becomes {} the solution/state is NOT LEGAL

Answer 405

- Systematically search the space of assignments in a constructive way - Start with empty assignment - Assign a value to an unassigned variable & deal with constraints until a solution is found

Answer 406

- Space too big & even infinite - In a reasonable time, systematic search may fail to consider enough of the space to give meaningful results

Answer 407

Does not systematically search the space, but finds solutions quickly on average - Start with an arbitrary complete assignment (*Constraints can be violated / Assignment can be invalid*) - Try to improve the assignment iteratively

Answer 408

- Randomly generate a complete assignment - While solution not found/criterion not met, follow the next steps: - S1) Explore neighbours of the assignment (Randomly choose assignment, look at neighbours) - S2) Choose the assignment that violates the fewest constraints & repeat the above steps (*If the neighbours do not change or are worse than the current one, you don't move*) EACH ITERATION REDUCES THE NUMBER OF CONSTRAINTS VIOLATED Repeat the steps until the criterion / all constraints are met or there are no valid options left to move to.

Answer 409

**Hill Climbing:** Vars: A,B,C with domain {0,1,2} CONSTRAINTS: A <= B < C - Randomly generated initial assignment (e.g: A=1, B=1, C=1) - While solution not found / stop criterion not met: S1) Explore all neighbours of the assignment: (e.g: (A=2, B=1, C=1), (A=1, B=1, C=2)) S2) Choose the assignment that violates the fewest constraints (A=1, B=1, C=2) - First instance is not a solution as it doesn't meet the constraint (B < C) - If a neighbour is a better solution, and its neighbours are not better, then it is the solution the method gives.
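
The hill climbing example above can be sketched as code. This is a minimal sketch under one stated assumption: a neighbour is any assignment that differs from the current one in exactly one variable. The function names are hypothetical, and the start point is passed in rather than drawn at random so the run is repeatable.

```python
def violations(a):
    """Number of violated constraints among A <= B and B < C."""
    return int(not a["A"] <= a["B"]) + int(not a["B"] < a["C"])

def neighbours(a, domain=(0, 1, 2)):
    """All assignments that change the value of exactly one variable."""
    for var in a:
        for value in domain:
            if value != a[var]:
                yield {**a, var: value}

def hill_climb(start, max_iters=100):
    """Move to the best neighbour until no neighbour violates fewer constraints."""
    current = dict(start)
    for _ in range(max_iters):
        if violations(current) == 0:
            return current  # all constraints satisfied
        best = min(neighbours(current), key=violations)
        if violations(best) >= violations(current):
            return current  # stuck on a local optimum or plateau
        current = best
    return current

result = hill_climb({"A": 1, "B": 1, "C": 1})  # {'A': 1, 'B': 1, 'C': 2}
```

From A=1, B=1, C=1 (one violation, B < C) the best neighbour is A=1, B=1, C=2 with zero violations, so the search stops after a single move.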

Answer 410

No, the search can get stuck on a local maximum/minimum where some constraints are violated. This is where the neighbours of the solution are not better than the current one, so it gets stuck at that solution and will not reach the global maximum/minimum

Answer 411

Seeking the best / optimum solution (violates no constraints), not just a solution that violates a few constraints. Maximises/Minimises the objective function to find the best solution

Answer 412

- **Variables** *e.g. (X = {x1,x2,x3,...,xn})* - **Domain D(xi)** *Possible values for each variable* - **Constraints (C1,...,Cn)** *Defining relationships between variables* - **Objective function arg min/max f(X)** *What we aim to Max/Min*

Answer 413

**Hard**: Must be satisfied to be valid **Soft**: Can be violated but assigned a cost to violation & aims to minimise them

Answer 414

**Not Always** Due to optimisation problems sometimes having continuous search space

Answer 415

Yes, work for both discrete and continuous search spaces

Answer 416

- Fast and Efficient - Deal with problems where the search is difficult to represent / formulate - Can be used in an online setting when the problem changes

Answer 417

- Hill Climbing - Simulated Annealing - Population-based Local searches

Answer 418

Jumping out of local Optima

Answer 419

- No Guarantee to be COMPLETE or OPTIMAL - Can get stuck on local maxima & plateaus (RUN FOREVER IF NOT PROPERLY FORMULATED)

Answer 420

- Rapidly find good solutions by Improving over bad initial state (GREEDY) - Lower Time & Space Complexity compared to search algorithms - No Requirements of problem-specific heuristics (UNINFORMED) - Start from Candidate Solution, instead of building up step-by-step (UNLIKELY BUT POSSIBLE, SOLUTION PICKED AT RANDOM ) Run algorithm for a Maximum number of iterations m Variants of Hill climbing can sort out getting stuck on local maxima/ plateaus

Answer 421

- Stochastic Hill Climbing - First - Choice Hill Climbing - Random - Restart Hill Climbing

Answer 422

- Variant that randomly selects a neighbour that involves an uphill move - Probability of picking a specific move can depend on the steepness - Converges slower than steepest ascent but can find higher solutions

Answer 423

- Randomly generates a single SUCCESSOR neighbour solution & moves to it if it's better than the current solution - If NOT UPHILL, keeps randomly generating solutions until there is an uphill move - If after a MAX num of tries OR generating all neighbours it hasn't found an UPHILL move, it gives up & assumes that it's now at the optimal solution. - Time complexity of this is lower as not all neighbours need to be generated before 1 is picked (GOOD WHEN THERE ARE LOADS OF NEIGHBOURS FOR EACH SOLUTION)

Answer 424

- Generates a series of different hill climbing searches of the same problem from random initial states - Stops when goal is found - Can be parallelised / threaded easily so does not take much time on modern computers - RARE to have to wait for this to happen

Answer 425

Random - Restart Hill Climbing It generates a solution if one exists, as eventually random start solution will be OPTIMAL SOLUTION

Answer 426

- Find near-optimal solutions in reasonable time (High chance of going to GLOBAL MAX, but only as good as the formulation of the Optimisation Problem) - Avoids getting stuck in poor LOCAL MAX & PLATEAU by combining EXPLORATION & EXPLOITATION (many solutions we concurrently work with at the end of EXPLORATION)

Answer 427

- Not guaranteed to be complete OR OPTIMAL (SENSITIVE TO FORMULATION) - NOT reliable = can't guarantee completeness - Time & Space Complexity is problem- & representation-dependent

Answer 428

min/max f(x) => objective function(s) to minimise/maximise, s.t. g_i(x) <= 0, i = 1,...,m and h_j(x) = 0, j = 1,...,n => feasibility constraints ==> x is the DESIGN VARIABLE (can be anything); the SEARCH SPACE is the space of all possible x values

Answer 429

is the space of all possible x values ALLOWED by constraints in CANDIDATE SOLUTIONS

Answer 430

Explicitly mentioned CANNOT BE ASSUMED

Answer 431

Rules that must be in place by the problem definitions in order for solutions to be CONSIDERED FEASIBLE e.g: x,y > 0

Answer 432

- Takes the design variables as input - Outputs a NUMERICAL value that the problem aims to MINIMISE OR MAXIMISE - CAN have multiple objective functions in a formulation - Defines the cost or quality of a solution

Answer 433

-Constraints that design variables must satisfy for the solution to be FEASIBLE -Usually depicted by functions that take the design variables as input and output a numeric value. -They specify the values that these functions are allowed to take for the solution to be feasible. -There may be zero or more constraints in a problem DEFINES THE FEASIBILITY OF THE SOLUTION
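
To make the pieces of a formulation concrete, here is a minimal sketch with a made-up problem. The objective, the two constraints and the candidate list are all hypothetical; the constraints are written in the `g_i(x) <= 0` form.

```python
def objective(x):
    """Cost to MINIMISE for a design variable x = (a, b)."""
    a, b = x
    return a ** 2 + b ** 2

# Each constraint is a function g with the convention g(x) <= 0.
constraints = [
    lambda x: 1 - x[0] - x[1],  # a + b >= 1
    lambda x: -x[0],            # a >= 0
]

def is_feasible(x):
    """A solution is feasible only if every constraint is satisfied."""
    return all(g(x) <= 0 for g in constraints)

# Hypothetical candidate solutions from the search space.
candidates = [(0.2, 0.2), (0.5, 0.5), (1.0, 1.0), (-1.0, 3.0)]
feasible = [x for x in candidates if is_feasible(x)]
best = min(feasible, key=objective)
```

Here `(0.2, 0.2)` has the lowest cost but violates `a + b >= 1`, so the best feasible candidate is `(0.5, 0.5)`.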

Answer 434

DO NOT keep track of the paths or states that have been visited. NOT systematic, but PROS are: - Use very little memory - Find reasonable solutions in large or infinite state spaces

Answer 435

Optimisation Algorithms that operate by searching from initial state to neighbouring states

Answer 436

TO find & Reach Global Maximum

Answer 437

TO find & Reach Global Minimum

Answer 438

Does not look beyond the immediate neighbours of the current state

Answer 439

-Representation -Initialisation Procedure -Neighbourhood Operator

Answer 440

How to store the design variables of the problem. Should facilitate the application of the initialisation procedure

Answer 441

How to pick initial solution. USUALLY RANDOM, Can Be Heuristic

Answer 442

How to generate Neighbourhood Solutions (INCREMENT/STEP SIZE)

Answer 443

Completeness: No. Depends on the problem formulation & the design of the algorithm (CAN GET STUCK ON LOCAL MINIMA). Optimality: Not optimal (CAN GET STUCK ON LOCAL MINIMA). Time: O(mnp), where m = MAX no. of iterations and n = MAX no. of neighbours, EACH taking O(p) to generate. Space: O(nq + r) ==> r is a constant, so ==> O(nq), where n = MAX no. of neighbours, each variable takes O(q), and r represents the space to generate the neighbours sequentially (NEGLIGIBLE COMPARED TO n & q)
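
The loop structure behind the O(mnp) bound can be sketched as below: at most m iterations, each generating and comparing up to n neighbours. This is an illustrative steepest-ascent version for a maximisation problem; `hill_climb`, `neighbours` and `f` are assumed names.

```python
def hill_climb(initial, neighbours, f, max_iters=1000):
    """Steepest-ascent hill climbing that maximises f."""
    current = initial
    for _ in range(max_iters):            # at most m iterations
        candidates = neighbours(current)  # up to n neighbours per iteration
        if not candidates:
            break
        best = max(candidates, key=f)
        if f(best) <= f(current):         # no uphill move: local maximum/plateau
            break
        current = best
    return current

# Hypothetical toy problem: maximise f(x) = -(x - 3)^2 over the integers.
f = lambda x: -(x - 3) ** 2
step = lambda x: [x - 1, x + 1]
print(hill_climb(-10, step, f))  # 3
```

Only the current solution and its neighbours are held in memory at any time, which is where the O(nq) space bound comes from.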

Answer 444

1) Start at the initial state 2) Expand all possible moves at the first level 3) Move to the next level & repeat 4) Continue until the goal is found
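
The level-by-level procedure above can be sketched with a FIFO queue. This is a minimal version on a made-up graph; the `visited` set is an added assumption to avoid re-expanding states, and `bfs`, `successors` and `graph` are illustrative names.

```python
from collections import deque

def bfs(start, goal, successors):
    """Breadth-first search returning the first path found to the goal."""
    frontier = deque([[start]])  # queue of paths, shallowest first
    visited = {start}
    while frontier:
        path = frontier.popleft()
        state = path[-1]
        if state == goal:
            return path
        for nxt in successors(state):  # expand all moves at this level
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal not reachable

# Hypothetical state graph.
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"],
         "D": ["F"], "E": ["F"], "F": []}
print(bfs("A", "F", lambda s: graph[s]))  # ['A', 'B', 'D', 'F']
```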

Answer 445

NOT JUST FOR FUN - *Formal model of decision-making between agents* - Each player has a set of possible actions - Resulting outcome depends on the combination of everyone's choices - Each player has a preference over outcomes

Answer 446

Typically an **ADVERSARIAL SEARCH** problem where your opponent has something to say about your strategy

Answer 447

- Deterministic - Stochastic - No. of players: Single/2/Multiple - Zero-sum - Non-zero-sum - Perfect Information - Imperfect Information

Answer 448

- The next state is completely determined by the current state & action - NO RANDOMNESS: YOU KNOW EXACTLY WHAT WILL HAPPEN - E.g: Tic-Tac-Toe, Chess, Go

Answer 449

Next State is uncertain, actions have random outcomes e.g. Poker, mahjong

Answer 450

- One (NO Adversary) *e.g. Solitaire* - Two *e.g. Chess, Go* - 2+ *e.g. Poker, Mahjong*

Answer 451

Games where one player wins and another loses *Utilities are negatives of each other* ## Footnote e.g. Chess, go, poker

Answer 452

Games where players can both gain (win) & lose *Where cooperation is allowed/ required* ## Footnote e.g Prisoner's Dilemma

Answer 453

Players can see everything, so it's more straightforward as all the information is available ## Footnote e.g. Tic-Tac-Toe, Chess, Go

Answer 454

Parts of the game state are hidden, so it requires strategies like belief states & probabilistic reasoning. HARD TO SEE NEXT OPTIONS OR STATES ## Footnote E.g. Poker, Mahjong (in Poker you don't know anyone else's cards)

Answer 455

**States: S** - Initial state / root of the game tree **Action: A** - Returns all the possible legal moves from state s **Transition Function: S × A => S** - Given state S & action A, this returns the new state after the move - From the current state, how to get to another state **Terminal Test: S => (true,false)** - Boolean that checks if it's an end state - Leaves of the game tree, where we assign final utility values **Players: P = (1,...,N)** - Alternates through players - Returns whose turn is next/current **Utilities: S{terminal} × P => R** - AKA Objective Function - The final numeric value of the terminal state S from the perspective of a player P - Scores that minimax uses to propagate values up the tree to decide players' best moves
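
As an illustration of how these components fit together, here is a sketch of a tiny made-up game: players alternately take 1 or 2 stones, and whoever takes the last stone wins. The class `Nim`, its method names and the +1/-1 utilities are assumptions chosen for this example.

```python
class Nim:
    """State = (stones_left, player_to_move), with players 0 and 1."""

    def __init__(self, stones):
        self.stones = stones

    def initial_state(self):          # States: root of the game tree
        return (self.stones, 0)

    def player(self, state):          # Players: whose turn it is
        return state[1]

    def actions(self, state):         # Actions: legal moves from this state
        return [k for k in (1, 2) if k <= state[0]]

    def result(self, state, action):  # Transition function: S x A -> S
        return (state[0] - action, 1 - state[1])

    def is_terminal(self, state):     # Terminal test: S -> {True, False}
        return state[0] == 0

    def utility(self, state, player):  # Utilities: S_terminal x P -> R
        winner = 1 - state[1]          # the player who just moved took the last stone
        return 1 if winner == player else -1

game = Nim(3)
s = game.initial_state()              # (3, 0)
s = game.result(s, 2)                 # player 0 takes 2 -> (1, 1)
s = game.result(s, 1)                 # player 1 takes 1 -> (0, 0)
```

After these two moves the state is terminal and player 1 has won, so the utilities are +1 for player 1 and -1 for player 0 (zero-sum).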

Answer 456

To find a strategy (policy) which recommends a move for each state, so that the player ends up with the max achievable utility *Whatever state we are in we want the computer to tell us the best next state to go to*

Answer 457

Best achievable outcome (utility) from that state *IS IT A GOOD STATE OR NOT*

Answer 458

- Draw the current state (initial state) & assign a player - Draw the successors of the initial state for the other player - Draw successors for all states until reaching final/terminal states (leaves) - Then, working backwards from leaves to root, assign each state a utility value *Remember the utility is based on a player's perspective and the other player will affect them negatively*

Answer 459

Utility of the terminal state that is reached when both players play optimally from that node *(PLAY BEST MOVES FOR THEMSELVES)*

Answer 460

- One player has to maximise utility - Another player has to minimise the Utility of the other player

Answer 461

**function** minimax_val(*state*) **returns** its minimax value
  **if** *state* is a terminal state: **return** its utility
  **if** *state* is for agent MAX to take an action: **return** max_val(*state*)
  **if** *state* is for agent MIN to take an action: **return** min_val(*state*)
**function** max_val(*state*) **returns** its minimax value v
  initialise v = NEGATIVE INFINITY
  **for** each successor of *state*: v = max(v, minimax_val(successor))
  **return** v
**function** min_val(*state*) **returns** its minimax value v
  initialise v = POSITIVE INFINITY
  **for** each successor of *state*: v = min(v, minimax_val(successor))
  **return** v

Answer 462

Game tree has alternating layers of Max and Min nodes: - Max nodes (your turn) choose the highest value. - Min nodes (opponent's turn) choose the lowest value. - Start from the leaf nodes (end of game states with known values). Recursively propagate values up the tree: - At each node, pick the max or min of the child values depending on its type. At the root, you get the best possible value assuming: - You play optimally, and - Your opponent also plays optimally. Key Notes: - Acts like Depth-First Search (DFS): - Explore all paths left to right, assign values bottom-up. - At the root: - Initially assign the value of the leftmost child. - As you explore other children, only update if a better (max/min) value is found.

Answer 463

- Explores all the successors to find the max value the min player will choose - Explores successors & find the path to the best solution depending on minimax

Answer 464

Usually used in *deterministic, zero-sum games* - Our win is our adversary's loss - Player 1 maximises its own utility and the other player minimises it. Search: - A state-space search tree - Players alternate turns (MAX/MIN) - Compute each node's minimax value (*Best achievable utility against an optimal adversary (ASSUMES OPPONENT WILL PLAY OPTIMALLY)*)

Answer 465

Minimax uses DFS (*so it explores the deepest nodes first*) **Time:** - Complexity: O(b^m) - b = branching factor (*no. of successors a node may have*) - m = max depth **Space:** - Complexity: O(bm) This is computationally expensive: it will take too much time and space to search using minimax. Solving game trees completely is infeasible in most cases: - Deep searches require all the nodes to be evaluated, and searching millions of nodes takes too much time - Bad branches will still be explored, wasting time and memory - Storing all the explored nodes will take a lot of memory

Answer 466

Purpose: To reduce the number of nodes explored in the game tree during Minimax search without affecting the final decision. How it works: During exploration, some branches are cut off if they cannot influence the final decision. This happens when a node proves to be worse than a previously examined move and therefore doesn't need further exploration.

Answer 467

If the parent has a constraint and the successor has its own constraint, and the two ranges have no overlapping values, the branch can be pruned. E.g. the root gets v >= 3 from another branch; exploring the next successor you get v <= 2. There are no overlapping values between the two ranges. If there were, it would check the next leaf node to see if the root's successor value will change (new min value)

Answer 468

The minimax value

Answer 469

**The best value that MAX can guarantee** - It represents a lower bound: MAX will never accept a move worse than this

Answer 470

The upper bound on the value that the maximising player can be assured of (the best value that MIN can guarantee)

Answer 471

- Compute min_val(node) - The goal for MIN is to minimise the score, so it chooses the smallest possible value among its children

Answer 472

- While exploring children of a node, we find a node that never gives us a better value than α - As a result we stop searching the children

Answer 473

**Pruning has no effect on the minimax value of the root** - Doesn't change the final value - Speeds up computation, best moves stay the same **Good child ordering improves effectiveness** - Evaluates the best move first, so pruning becomes much more effective *Best move picked first, pruning occurs earlier and more often* *Worst move first, pruning occurs later and not as frequently* **Complexity with perfect ordering: O(b^(m/2))** - Can look twice as deep as your opponent - Improved efficiency - Can explore a greater depth than the opponent / normal minimax - Full search of many games is hopeless, *e.g. chess, Go*, due to the exponential growth of possibilities
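
A quick arithmetic check of what O(b^m) versus O(b^(m/2)) means in practice. The branching factor of 35 is only an illustrative chess-like figure, and the function names are made up; constants hidden by the O-notation are ignored.

```python
def minimax_nodes(b, m):
    """Rough node count for plain minimax: b^m."""
    return b ** m

def alphabeta_best_case_nodes(b, m):
    """Rough node count for alpha-beta with perfect ordering: b^(m/2), m even."""
    return b ** (m // 2)

b, m = 35, 8  # hypothetical branching factor and depth
print(minimax_nodes(b, m))              # 2251875390625
print(alphabeta_best_case_nodes(b, m))  # 1500625
# With the same budget, perfect ordering reaches twice the depth:
print(alphabeta_best_case_nodes(b, 2 * m) == minimax_nodes(b, m))  # True
```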

Revision for AI 2 exam