AI2 Flashcards

Revision for AI 2 exam

1
Q

Uncertainty In AI

Define Uncertainty:

A

Situatuions where the outcome or state of a system is not fully known

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Uncertainty In AI

What are Sources of Uncertianty?

A

Data Uncertainty: Noisy/Incomplete Data
- e.g. missing / messy info
Modele Uncertainty: Limited or incorrect model assumptions
- e.g. if computer instructions are not great may lead to wrong results
Enviromental Uncertainty
- e.g. World changes (Unpredictable Weather)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Uncertainty In AI

What is Predictive Uncertainty?

simple, easy to understand

A

Example:
Imagine a computer tries to predict how tall a person will be
- Guess will be TOO high or TOO Low

The Gap/Difference between Actual value and guess is its Predictive Uncertainty

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

Uncertainty In AI

What is Predictive Uncertainty?

More scientific or using proper terminology

A

Suppose we trained a model fŵ(x) (Guess), ŵ (is the parameters of the model obtained on training set ) and (x,y) is random test example

**It is the distribution of the Residual y - fŵ(x) **

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Uncertainty In AI

What are the 2 types of Uncertainty?

A

-Aleatoric
-Epistemic

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

Uncertainty In AI

What is Epistemic Uncertainty?

A
  • Uncertainty due to lack of Knowledge
  • Reducible with more data OR improved Models

example: Insuffiicient training data for a machine learing model

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

Uncertainty In AI

What is Aleastoric Uncertainty?

A
  • Intrinsic / Natural Randomness in data or process
  • Thus Uncertainty cannot change
  • If too many data point or feature overlap could be this type (on a graph)

Example: Sesnor noise on autonomous Vehicles

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Uncertianity In AI

What is the equation that the residual error can be decomposed as?

A

y - fŵ(x) = y - g(w * ) + g(w * ) - fŵ(x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Uncertianity In AI

What is the Goal of this:
y - fŵ(x) = y - g(w * ) + g(w * ) - fŵ(x)?

A

Break difference between the actual value (y) & the AI’s prediction (fŵ(x)) into 2 types of uncertainty

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Uncertainty in AI

What is the Aleatoric Uncertainty in:
y - fŵ(x) = y - g(w * ) + g(w * ) - fŵ(x)

A

** y - g(w * ) **
- Error of the best possible model (g(w * ))
- g (true model) and w (true parameters) are not known to use

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

Uncertainty in AI

What is the Epistemic Uncertainty in:
y - fŵ(x) = y - g(w * ) + g(w * ) - fŵ(x)

A

** g(w * ) - fŵ(x)**
- Error of our model fŵ(x) VS. g(w * )

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Uncertainty in AI

What is model family?

A

The best possible type modle to use (e.g Linear Regression / Neural Network)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

Uncertianty in AI

What happens to the equation when we know the model family?

A

y - fŵ(x) = y - f(w * ) + f(w * ) - fŵ(x)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

Does w * = ŵ?

A

Not typically beacuse we only see a finite amount of data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

Uncertainty in AI

How can total error be split?

A
  • Randomness from enivroment
  • Lack of knowledge about the world
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Uncertainty in AI

What is the Bayesian Perspective?

A
  • Uncertainty is represented as probability distribution (prior belief [incorporate prior knowledge]) over a set of parameter values of the model, updated based on new data
  • Computer starts with a guess (prior) & gets smater when it sees new info
  • Treats model parameters (w) as random
  • Data is seen as non-random
  • Guesses the uncertainties explicitly (Probabiliites )
  • e.g. starting a problem with a map
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Uncertainty in AI

What is the Frequentist Prespective?

A
  • Uncertainty is based on long-run frequencies of outocmes
  • Data is treated as random, drawn from some unknown distrubution & model parameters are non-random
  • Data used to estimate parameters
  • Treats model as fixed, but unknown
  • Data is RANDOM & we estimate parameters from the DATA

Solves problems without knowing how big it is

Won’t say how confident it is, instead gives a range

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

Uncertainty in AI

What is the difference between Bayesian & Frequntist in Interpretation of Probability?

A

-Bayesian: Probability as a degree of belief
-Frequentist: Probability as the long-run frequency of events

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

Uncertainty in AI

What is the difference between Bayesian & Frequntist in Handling of Uncertainty?

A

-Bayesian: Quantifies and update epistemic uncertainty
-Frequentist: Captured by confidence intervals or p-values, not beliefs

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

Uncertainty in AI

What is the difference between Bayesian & Frequntist in Use of Data?

A

-Bayesian: Incorporates prior knowledge & continuously updates
-Frequentist: Relies on observed data to estimate fixed model parameters

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

Uncertainty in AI

What are the reasons to handle uncertainty in AI?

A
  • Improved Decision Making
  • Better Model Generalization
  • Robustness & Reliability
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

Uncertainty in AI

What is Improved Decision-Making?

A

Accounts for possible errors or unknowns
example:
- if uncertain, then patient should see doctor
- if uncertain, robot should avoid obstacle, defer action, , move closer & collect more data
- id uncertian, raise query to the humans (active learning)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

Uncertainty in AI

What is Better Model Generalization?

A

Avoids overfititng to noisy data

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

Uncertainty in AI

What is Robustness and Reliabilty?

A
  • Rdeucing epistemic uncertainty facilitates performance under unpredictable condtions
  • Quantifies confidence & accuracy of your model help the human users take approriate decisions
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
# Uncertainty in AI Give real world Application of uncertianity in AI?
- **Autonomous Vehicles:** Predicting traffic behavior under varying conditions - **Medical Diagnostics:** Handling noisy data & ambiguous symptoms - **Finance:** Forecasting under market volatility - **Computer Vision & Robotics:** Many vision & robotics systems still operate with dereministic algorithms - Huge scope to handle uncertainty handlings in object detections, dynamic models & parameters
26
# Uncertainty in AI What is Linear Regression?
- Helps predict something by assumung a straight line relationship between features *(x)* & outputs *(y)* - Work out explicitly the epistemic & aleatoric components of the uncertainty
27
# Uncertainty in AI What is the Linear Regression Model?
fw(x) = w0 + w1(x1) + ...+ wd(xd) *d* - no. of features ***w*** = (w0,...,wd) - parameters *x1 ... xd* = Feature values of the input data *w0* = y Intercept
28
# Uncertainty in AI What is the loss function/ Mean Sqaure Error?
L(w) = 1/n * (Σ (yi - fw(xi))^2) i = 0
29
# Uncertainty in AI What is Vectorising Linear Regression?
**y ≃ X w** - ***y*** - Output vector : the target outputs - ***X*** - Design Matrix : Contains all features, with rows as data point & columns as features - ***w*** - Weigh vector: stores all weights
30
# Uncertainty in AI What is the Mean Square Error Equation?
L(w) = * (Σ (yi -( **w**^T)xi)^2) - *n* : Number of data points in your dataset - *yi*: The actual value for the ith data point - *xi*: Predicted value for the ith data point, Multiply xi with **w**^T, it calculates a weighted sum
31
# Uncertainty in AI What is Vectorized From in MSE?
L(w) = 1/n ||**y - Xw**||^2 - *y*: A vector of actual target value of size n - *w*: A vector of weights of size d - ***X***: matrix of size n * d - *n* : No. of data points (Rows of X) - *d* : No. of features (Columns of X) - Each row, *xi* represents the feature vector of a single data point.
32
# Uncertainty in AI Why is Vectorised Form of MSE better than normal MSE?
More efficent when working large datasets because they use matrix operation instead of looping through each data point
33
# Uncertianty in AI Does ||**y - Xw**||^2 == Σ (yi -( **w**^T)xi)^2
YES
34
# Uncertainty in AI What is ||z||^2?
The sum of the squares of the components of a vector **z**
35
# Uncertainty in AI How to find ŵ?
Minimisng the MSE (Mean Square Error)
36
# Uncertainty in AI What are the methods to Minimise MSE?
- Gradient Descent - Closed Form Solution
37
# Uncertainty in AI What is the Formula for Minimising Loss?
L(w) = 1/n ||**y - Xw**||^2 - **y**: True Output - **X**: Design Matrix - **w**: Weight Vector
38
# Uncertainty in AI Give the Steps for minimising the loss?
1) Compute error **y - Xw** 2) Sqaure each error 3) Add them all up 4) Divide by n (No. of point)
39
# Uncertainty in AI What are the Closed Form Solution steps?
1) Compute (X^T)X 2) Compute (X^T)y 3) Compute ((X^T)X)^(-1) - Find Deteminant - Calculate ((X^T)X)^(-1) 4) Compute ŵ - ŵ = ((X^T)X)^(-1) * (X^T)y
40
# Uncertainty in AI How to find the Determinant?
X = | x1 x2 | | y1 y2 | det(X) = (x1 * y2 ) - (x2 * y1)
41
# Uncertainty in AI What is the eqution if we assume the True Model is Linear?
yi = (**(w * )**^T)xi + εi - **εi:** - Random noise independent of all other variables - Has 0-mean - Variance Var(εi) = σ^2, ∀i ∈ [**d**]
42
# Uncertainty in AI What is the the assumed Linear equation for True model if it was vectorized?
y = **W (x * ) + e** - y: Outputs - **X**: Inputs - (x * ): The True weights (UNKNOWN BUT FIXED) - e: Random Noise
43
# Uncertainty in AI What are the equations for the differnece between estimated weigths & true weights?
1) ŵ - (w * ) 2) ((X^T)X)^(-1) * (X^T) * e
44
# Uncertainty in AI How to get from ŵ - (w * ) to ((X^T)X)^(-1) * (X^T) * e ?
1) ŵ - (w * ) = ((X^T)X)^(-1) * (X^T) * y - (w * ) 2) = ((X^T)X)^(-1) * (X^T) * (X * (w * ) + e) - (w * ) (*Plugged in y*) 3) = ((X^T)X)^(-1) * (X^T) * X(w * ) + ((X^T)X)^(-1) * (X^T) * e - (w * ) 4) = ((X^T)X)^(-1) * (X^T) * e - Since *((X^T)X)^(-1) (X^T) X = 1 & w - w = 0*
45
# Uncertainty in AI What is Aleatoric Uncertainty in Linear Regression?
y = **X(w * ) + ε - Aleatoric Uncertainty is Irreducible - For a given input x, aleatoric uncertainty afects predictions beacuse of the noise in ε
46
# Uncertainty in AI What is Epistemic Uncertainty in Linear Regression?
(ŵ - (w * ))^T * x = -(e^T) * X((X^T))^(-1) * x - Reflects uncertainty about the parameter (w * ), caused by insufficent or incomplete data *- Epistemic uncertainty comes from the estimated weight ŵ being different from the true weight (w ).*
47
# Uncertainty in AI How is Epistemic Uncertainty Reducible in Linear Regression?
- Collecting more data - Improving the model - Adding more relevant features
48
# Uncertainty in AI How can the ERROR from this uncertainty can be expressed in Epistemic Uncertainty?
((w * ) - ŵ )^T * x
49
# Uncertainty in AI What happens to Variance of Epistemic Uncertainty when the dataset grows?
Variance from Epistemic Uncertainty decreases as the dataset n grows
50
# Bayesain, MAP , MLE Estimation What is Supervised Learning? | YEAR ONE ANSWER
- Most Prevalent Form - Learning with a teacher - Teacher: expected output, label, class, etc - Solve 2 Types of problems: Classification & Regression
51
# Bayesain, MAP , MLE Estimation What is Supervised Learning? | YEAR TWO ANSWER
Learn a mapping from inputs {x1,...,xn} to corresponding targets {y1,..,yn} - Learn a function from X => Y - E.g. Training a regression model where X is house features & Y is price
52
# Bayesain, MAP , MLE Estimation What is Unsupervised Learning ? | YEAR ONE ANSWER
- Learning without a teacher - Find Hidden Structures Clustering ==> Group inputs based on similar properties Agent Learns patterns in the input without any explicit feedback
53
# Bayesain, MAP , MLE Estimation What Unsupervised Learning? | YEAR TWO ANSWER
- Focuses on modelling probability distribution over data - Fit Probabalistic model to the same input samples {x1,...,xn} e.g. Gaussian distribution in a dataset
54
# Bayesain, MAP , MLE Estimation What is Bayesain Estimation?
- Bayesians represnet uncertainty by treating θ as a random variable with a prior distribution. - Parameters & data are often continuous-valued ## Footnote D = data θ = Parameters p(θ) = The Prior PDF
55
# Bayesain, MAP , MLE Estimation What is the Poisterior Distribution of θ give D?
*P*(θ|D) = (*P*(D|θ)*P*(θ))/(*P*(D)) where: - *P*(D|θ) => Likelihood of data given parameters - *P*(θ) => Prior (Belief about Parameters befor obesrving data) - *P*(D) => ∫ *P*(D|θ)*P*(θ) *d*θ **Marginal Likelihood / Evidence**
56
# Bayesain, MAP , MLE Estimation Does Bayesian Estimation Operate through optimisation to obtain best parameter values?
NO
57
# Bayesain, MAP , MLE Estimation What is Maximum A Posteriori (MAP) Estimation?
- Bayesian approach to parameter estimation. - The most probable value of θ is where the postriori takes it maximum.
58
# Bayesain, MAP , MLE Estimation How is MAP conceptually a Bayesain Approach?
It incorporates a prior belief over parameter values
59
# Bayesain, MAP , MLE Estimation How is MAP Estimation different to FULL Bayesian?
- *P*(D) has no influence on MAP, therefore making it easier to compute - Map Returns a point-estiamte
60
# Bayesain, MAP , MLE Estimation Explain the MAP equation?
**θ MAP = arg max{θ} *P*(θ|X)** **= arg max{θ} log*P*(X∣θ)+log*P*(θ)** - Map Maximises the posterior distirbution to dind the most probable parameters
61
# Bayesain, MAP , MLE Estimation What can using ML (Maxmium Likelihood) on small database do?
**Overfit ** - Leading to poor estimates
62
# Bayesain, MAP , MLE Estimation How does MAP avoid the ML (Maxmium Likelihood) problem with small datasets?
Incorporates Priors to mitgate the overfitting
63
# Bayesain, MAP , MLE Estimation Do Both MAP & ML (Maxmium Likelihood) use optimisation to find the best paramete values?
YES
64
# Bayesain, MAP , MLE Estimation Give an Example of ML (Maxmium Likelihood):
Gaussian X ~ Normal (μ,σ^2) - ML (Maxmium Likelihood) for mean μ is : μ**'{ML} = 1/N (Σ{1-N} xi)**
65
# Bayesain, MAP , MLE Estimation When is ML (Maxmium Likelihood) GOOD?
- When N is Large
66
# Bayesain, MAP , MLE Estimation What is the consequence of ML (Maxmium Likelihood) being asymptotically unbiased?
- The first term vanished with Large n - Var(θ) is of Order 1/n - With finite n ,the 2 terms trade off
67
# Bayesain, MAP , MLE Estimation What is Bias [θ']^2 in ML (Maxmium Likelihood)?
Systematic Deviation: **((θ * ) - E[θ'] )^2** θ' = Estimate | θ * = true parameter
68
# Bayesain, MAP , MLE Estimation What is Variance [θ'] in ML (Maxmium Likelihood)?
-Variablility in Estimate E[θ' - E[θ']^2]
69
# Bayesain, MAP , MLE Estimation What is the MSE of ML (Maxmium Likelihood)? | Prove it
E[(θ' - θ * )^2] = = E[θ']^2] - 2E[θ'](θ * ) + (θ * )^2 = E[θ']^2] + (θ * - E[θ'] )^2 - E[θ * ])^2 (*Rearrange and Complete the square*) = ((θ * ) - E[θ'] )^2 + E[θ' - E[θ']^2]
70
# Bayesain, MAP , MLE Estimation What is MSE for ML (Maxmium Likelihood)?
Bias[θ']^2 + Var[θ']
71
# Bayesain, MAP , MLE Estimation What is the Bernoulli Distribution?
Distribution for a Binary (2) Random Value
72
# Bayesain, MAP , MLE Estimation What is the purpose of the Likelihood function?
Describes how well a set of parameters explain observed data
73
# Bayesain, MAP , MLE Estimation What is i.i.d?
**Independent & Identically distributed** - Each data point does not influence or depend on any other data point - Each data point follows the same probability distribution
74
# Bayesain, MAP , MLE Estimation What is the Probablity of the sequence, given θ?
*P*(**x**,θ) = Π {1-n} *P*(xi;θ)
75
# Bayesain, MAP , MLE Estimation What does it mean if the distribution is discrete?
Likelihood is the Probability Mass Function of the observed data *L*(θ|**x**) = *L*(θ|x1,x2,...,xn) = P**x**(**x**;θ)
76
# Bayesain, MAP , MLE Estimation What is PMF?
- Gives the probability that a discrete random variable takes a specific value. - It applies to discrete distributions like the binomial or Poisson distribution. e.g (X = x) = ((λ^x)e^(-λ))/x! **Poisson distribution.**
77
# Bayesain, MAP , MLE Estimation What is the equation for Binomial Distribution?
*P*(X =x) = nCx p^k (1-p)^(1-k) | nCX = n choose x
78
# Bayesain, MAP , MLE Estimation What is the Equation for Poission Distribution?
*P*(X = x) = ((λ^x)e^(-λ))/x!
79
# Bayesain, MAP , MLE Estimation What does it mean if the distribution is continuous?
Likelihood is the probability density function of observed data *L*(θ|**x**) = *L*(θ|x1,x2,...,xn) = F**x**(**x**;θ)
80
# Bayesain, MAP , MLE Estimation What is PDF?
-Describes the probability distribution of a continuous random variable -Does not give probabilities directly but instead represents the likelihood of different outcomes - e.g. *P*(a <= X <= b ) ∫ {b,a} f(x) dx
81
# Bayesain, MAP , MLE Estimation What is f(x) and the condtion for it in PDF?
-f(x) is PDF which describes how probability is distrubuted over values of X **Conditions:** - f(x) >= 0 , for all values of X - Total Area under curve must be 1 *MEANS INTEGRAL IS ALWAYS 1*
82
# Bayesain, MAP , MLE Estimation Give the Normal Distribution PDF equation:
f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)^2 / (2σ^2)) ## Footnote μ = mean σ^2 = variance σ = Standard deviation
83
# Bayesain, MAP , MLE Estimation What can θ be in Likelihood Function?
Can be both Scalar parameters OR Vector of parameters
84
# Bayesain, MAP , MLE Estimation What is Probability?
Describes the chance of an event occuring, assuming a know distribution | - Predicting events - e.g. Chance of picking a black ball
85
# Bayesain, MAP , MLE Estimation
86
# Bayesain, MAP , MLE Estimation What is the general Procedure of MLE?
- **Input**: Observation (data points) -** Ouptut **: Parameter values of the model that explain data -**Model**: Probability distribution -**Task**: Search for the best parameters of the model to fit distribution
87
# Bayesain, MAP , MLE Estimation Give me a the step to do Maximum Likelihood
1) Select relevant / Given Likelihood Function 2) Turn it into a Log-Likelihood Function 3) Differentiate and solve for the desired value
88
# Bayesain, MAP , MLE Estimation What is the Maximum Likelihood Estimator?
A Maximum Likelihood Estimator of the parameter θ , denoted as Θ'{MLE} is a Random Variable Θ'{MLE} = Θ'{MLE}(X) *Whose values are given by θ' when X =x*
89
# Bayesain, MAP , MLE Estimation Why is Θ'{MLE} a Random Variable
As it depends on the random variable x
90
# Bayesain, MAP , MLE Estimation What is the Maximum Likelihood Estimate | DIfferent to Estimator
- Denoted as θ'{MLE} - The value of θ that maximizes the likelihood function - θ'{MLE} = arg max{θ} *L*(θ|x)
91
# Bayesain, MAP , MLE Estimation What is the Optimisation Problem caused by in θ'{MLE} = arg max{θ} *L*(θ|x)?
Finding θ'{MLE} , the value that maximises the function
92
# Bayesain, MAP , MLE Estimation How can we solve the optimization problem formulated in θ'{MLE} = arg max{θ} *L*(θ|x)?
- Closed Form - Exhaustive Search - Optimization Algorithms
93
# Bayesain, MAP , MLE Estimation What is Closed Form?
Directly solve for MLE using calculus - Very Rarely we can compute it directly
94
# Bayesain, MAP , MLE Estimation Whats is 'Exhaustive' Search?
- Tries all possible θ values - Suitable for low dimensional problems (Grid Search )
95
# Bayesain, MAP , MLE Estimation What are Optimization Algorithms?
A more general Scalable approach - Uses algorithms like Gradinent Descent to find MLE
96
# Bayesain, MAP , MLE Estimation What is the Cost Function? A.K.A: Objective Function Loss Function
A function that maps a set of events into a number representing the cost of that event occuring
97
# Bayesain, MAP , MLE Estimation What is Logarithmic Transformation & Numerical Stability?
**Logarithmic Transformation**: Turns product into sums, making differentiation easier **Numerical Stability**: Prevents underflow when dealingwith small probabilities..
98
# Bayesain, MAP , MLE Estimation What are the reasons to use the Negative Logarithm?
- **Convention**: Software for minimization problems - **Convenience**: Logs simplify multiplication into addition & diffentiation easier - **Numerical Stability**: Product of small probabilities can converge to zero, causing compuational issues due to machine precision limits.
99
# Bayesain, MAP , MLE Estimation What is Optimisation?
Finding the best solution from among the set of all feasible solutions.
100
# Bayesain, MAP , MLE Estimation What are Optimisation Procedure?
- Construct a model - Detemining the Problem Type - Selecting an Optimisation Algorithm
101
# Bayesain, MAP , MLE Estimation What happens in Constructing a Model?
Constructing a model involves designing a mathematical or computational representation of a system based on the problem you are trying to solve. -Model is trained using historical data
102
# Bayesain, MAP , MLE Estimation What happens in Determining the Problem Type?
- Before applying optimisation, it's essential to classify the problem correctly e.g. {Regression,Classification,Clustering,Optimisation Problems} - Identifying the problem type helps in selecting the appropriate model and optimisation method.
103
# Bayesain, MAP , MLE Estimation What happens in Selecting an Optimisation Algorithm?
- Optimisation algorithms adjust the model’s parameters to minimise error (loss function) or maximise performance. e.g. {Gradient Descent & Simlated Annealing} - Depends on the problem type, data size, and computational constraints.
104
# Bayesain, MAP , MLE Estimation What is Machine Learning (ML) Goal?
Goal of ML: is to infer the function that maps input to outputs so that it can predict the correct output
105
# Bayesain, MAP , MLE Estimation Is Optimisation often used towards the ML Goal?
Yes, the success of the learning depends in a highly non trival way on how this is done.
106
# Bayesain, MAP , MLE Estimation What is Supervised Learning?
- Given training data, the process of fimding appropriate parameters of a model is often defined as optimising an objective function ## Footnote Optimize an objective function to find the best parameters for a model
107
# Bayesain, MAP , MLE Estimation What is Unsupervised Learning?
Dividing samples into cluster or groups such that: - Samples within the same group are as similar as possible - Samples in different groups are as different as possible ## Footnote Optimize to group similar data points together
108
# Bayesain, MAP , MLE Estimation What is First-Order Optimality Condition?
States that the gradient of the cost function must be zero at minimum
109
# Bayesain, MAP , MLE Estimation What is the equation for First-Order Optimality Condtiion?
- For function g(w) of N-dimesional independent variable, the optimisation problem is: arg min{w} g(w)
110
# Bayesain, MAP , MLE Estimation What is Local Minimum Condtion?
A w * is the local minimum if it satisfies the **first-order necessary condtion for optimality:** ***∇{w}g(w * ) = 0{N x 1}*** *Equation*
111
# Bayesain, MAP , MLE Estimation What is a Stationary Point?
The point satisfying the condition
112
# Bayesain, MAP , MLE Estimation What are examples of Stationary Points?
- minimum - maximum - saddle point
113
# Bayesain, MAP , MLE Estimation What is the First-Order Optimality Condition?
The equation of first-order necessary condition can be written as a system of N first order equations: **"The gradient of f(x) is equal to zero:"** **∂f/∂x₁ = 0, ∂f/∂x₂ = 0, ..., ∂f/∂xₙ = 0** **"The partial derivatives of the Lagrangian function must be zero:"** **∂ℒ/∂x₁ = 0, ∂ℒ/∂x₂ = 0, ..., ∂ℒ/∂xₙ = 0, ∂ℒ/∂λ₁ = 0, ..., ∂ℒ/∂λₘ = 0**
114
# Bayesain, MAP , MLE Estimation What is Gradient Descent?
A first order iterative operation algorithm for finding a local minimum of a differentiable cost function **KEY IDEA:** Employ negative gradient at each step to decrease the cost function
115
# Bayesain, MAP , MLE Estimation What are the Two Ingredients of Gradient Descent?
- **Direction:** Detemined by the gradient at the point - **Magnitude:** Called step Size or Leaening Rate
116
# Bayesain, MAP , MLE Estimation What is the Steps For Gradient Descent?
- **Initalisation:** Start at any value of parameter θ - **Repeat:** Change the parameter θ in the direction that decreases the cost *J*(θ) - **Until:** The decrease in cost with each step is very small. ## Footnote GD, works by iteratively adjusting the parameters in the direction that reduces the cost function The Learning rate η controls step size
117
# Bayesain, MAP , MLE Estimation What are Classification Problems?
Predict which category something belongs to. **GOAL:** Learning to determine the most likely class that an input pattern belongs to **Formally:** Model Posterior Probs of class membership conditioned on input features ## Footnote E.g. Based on how many hours a student has studies, we want to predict if they will pass or fail
118
# Bayesain, MAP , MLE Estimation What is Posterior Probabilities?
The probability od an event after considering the evidence
119
# Bayesain, MAP , MLE Estimation What is Binary Classification?
A problem with 2 possible outcomes
120
# Bayesain, MAP , MLE Estimation What is Logistic Regression? ## Footnote A.K.A : Logit Regression
Model that give us a probability -* P*{i} = *P*(lable|x{i}) - P >= 0.5, predict PASS, else FAIL ## Footnote Predicts the probability something belongs to class 1
121
# Bayesain, MAP , MLE Estimation What is the Approaches for Logistic Regression?
- log-odds (logit) - Sigmoid Function
122
# Bayesain, MAP , MLE Estimation What is the Logit?
A way to turn probability into numbers that can be used in *Linear Model*
123
# Bayesain, MAP , MLE Estimation What is Probability? | Logit
Describes the likelihood of an ecent occurring
124
# Bayesain, MAP , MLE Estimation What does ODDs mean?
Ratio of the probability of an event occuring to the probability of it not occuring
125
# Bayesain, MAP , MLE Estimation What are Odds Formula?
odds = p/(p-1)
126
# Bayesain, MAP , MLE Estimation What is the Equation for Logit?
logit(*p*) = log (*p*/(*p-1*)) = log(*p*) - log(1-*p*) = θ0 + θ1(xi,1) + θ2(xi,2) +...+θd(xi,d) = θ0 + Σ {1-j} (θj(xi,d)) = (**θ**^T)**x**i ## Footnote θ0 = Intercept θ1,...,θd = Coefficeints (Weights assigned ti each feature) p = P(Yi = 1|xi) Yi = Target Variable (Dependent Variable)
127
# Bayesain, MAP , MLE Estimation What is the Sigmoid Function?
- A function that maps any real number to a value between 0 & 1 - Used to convert log-odds into a probability
128
# Bayesain, MAP , MLE Estimation What are the properties of the Sigmoid Function?
- Maps any real Num to a value between 0 & 1 - Output 0.5 when input is 0 - Approaches 1 as input is very large - Appraches 0 as it becomes very small
129
# Bayesain, MAP , MLE Estimation What is the Sigmoid Function?
**GENERAL**: σ(x) = 1/(1+e((-θ^T)**x**i)) OR **STANDARD**: σ(x)= 1/(1+e^(−x)) ## Footnote Maps the logit to a probability
130
# Bayesain, MAP , MLE Estimation How is MLE used in Logistic Regression Model?
It is used to find the best parameter for Regression Model, by maximising the likelihood of the observed data ## Footnote Best method to find the best weights for the model so it predicts the observerd data acurrately as possible
131
# Bayesain, MAP , MLE Estimation What happens in Solving Logisitic Regression?
Need to estimate d+1 unknow parameter θ
132
# Bayesain, MAP , MLE Estimation What is the Mehtod?
MLE, finds the set of parameters for which of the obeserved data is largest
133
# Bayesain, MAP , MLE Estimation Give of example of MLE of Logistic Regression:
Model predicts 70% chance of passing for student who study 10hrs & then actually passed, then likelihood is high
134
# Bayesain, MAP , MLE Estimation What is the Goal MLE of Logistic Regresion?
Estimate the unknow parameters θ of the logistic regression model
135
# Bayesain, MAP , MLE Estimation What is the Likelihood Function for MLE of Logistic Regression?
*L*(θ|y,x) = ∏ {i} *P*(yi|xi;0) **For Binary**: *P*(yi|xi;θ) = σ (xi)^(yi) * (1-σ(xi))^(1-yi) **For Log-Likelihood Function** ℓ(θ) = log *L*(θ| y,x) = Σ {i} [yi(log σ(xi)) + (1-yi)log(1-σ(xi))] *Estimation: Find θ'{MLE} that mins the negative log-likelihood* = **θ'{MLE} = arg min{θ} - ℓ(θ)** ## Footnote σ = sigmoid function
136
# Bayesain, MAP , MLE Estimation How to Maximise the Likelihood in MLE of Logistic Regression?
Adjust the parameter to make the observed data as likely as possible
137
# Bayesain, MAP , MLE Estimation What are of Product Probabilities? | MLE of Linear Regression
Likelihood is the product of the probabilities of the observed outcomes
138
# Bayesain, MAP , MLE Estimation What is the Cost Function for Logistic Regression?
- Measures how worng the model's predictions are. - Want to minimise this to make the model more accurate
139
# Bayesain, MAP , MLE Estimation What is the equation for Negative Log-Likelihoood as Cost Function?
*J*(θ) = -ℓ(θ) = - Σ[yi(log σ(xi)) + (1-yi)(log(1-σ(xi)))] ## Footnote *J*(θ) = Cost Function to be minimised σ(xi) = Sigmoid Function yi = Binary Target label (0 or 1)
140
# Bayesain, MAP , MLE Estimation What is the Optimisation for the Cost Function -VE Log-Likelihood?
- Minimise *J*(θ) to fins the optimal parameters θ' - Use Gradient Descent
141
# Bayesain, MAP , MLE Estimation What is Gradient of the Cost Function?
(d*J*(θ))/(dθ{j}) = Σ {1-n} [σ(xi) - yi]xij ## Footnote Gradient is used for iterative optimisation
142
# Bayesain, MAP , MLE Estimation What are the Main Assumption in Logistic Regression?
- **Binary Targets**: Outcome variable ha 2 Possible values - **Independent Observations**: Each data point is independent of the others - **Low Multicollinearity**: Features are not highly correlated with each other - **Linearity**: Linear relationship between feature and log-odds - **Sufficient Sample Size**: Demands a large sample for accurate estimation ## Footnote If 2 Features are highly correlated, the model might not work well
143
# Bayesain, MAP , MLE Estimation What is Bayes Rule?
Helps calculate the probability of an event based on prior information
144
# Bayesain, MAP , MLE Estimation What is Naice Bayes Classifier?
Use Bayes' Rule, but it simplfies the problem by assuming that the features are independent of each other, which makes the computation easier
145
# Bayesain, MAP , MLE Estimation What is Bayes' Rule Theorem? | Give what each variable means
*P*(C|**x**) = (*P*(**x**|C)*P*(C))/*P*(**x**) C is a random variable represent the Class | Calculates probabilites of class based on avaliable data ## Footnote *P*(C|**x**)= Posterior prob of class C given feature vector X *P*(**x**|C)= Likelihood of feature vector x given class C *P*(C) = Prior Probability of class C *P*(**x**) = Marginal Likelihood (constant with respect to C) **x ** = observed feature vector C = Random Variable representing the class label
146
# Bayesain, MAP , MLE Estimation What is the Naive Bayes Assumption? | Give example of why it is good?
Features are conditionally indepentdent given the class label | - Efficent for classifcation tasks ## Footnote - Makes the model fast and scalable, through symptoms in reality maybe correlated - Reduces the complexity of problems, making calcualtions easier - Can calculate the likelihood for each feature
147
# Bayesain, MAP , MLE Estimation What is the Naive Bayes Assumption Equation
*P*(**x**|C) = *P*(x1,x2,...,xd|C) = ∏ {1-d} (*P*(xi|C))
148
# Bayesain, MAP , MLE Estimation What is involved in training Naive Bayes model?
- Prior probability of each class *P*(C) - Conditional probabilites *P*(xi|C) for each feature xi given class C - For a given class C, the posterior probability is: *P*(C|x) = (*P*(C) ∏{1-n}*P*(xi|c))/*P*(**x**)
149
# Bayesain, MAP , MLE Estimation How to train the model?
The model is trained by calculating the prior & the likelihoods
150
# Bayesain, MAP , MLE Estimation What is the formual for classification in Naive Bayes ?
C' = arg max {C} *P*(C) ∏ {1-n}*P*(Xi|C) ## Footnote We calculate the posterior probabilties fore each class & choose the class with highest val
151
# Bayesain, MAP , MLE Estimation Advantages of Naive Bayes?
- Simple & Easy to implement - Works well with large datasets - Performs well with categorical & continuous features - Fast training & predictions ## Footnote Naive is popular beacuse it is quick to train & wirks well with bith categorical continuos data
152
# Bayesain, MAP , MLE Estimation Disadvantages of Naive Bayes?
- "NAIVE" assumption od feature independence is rarely in practice - May perform poorly with highly correlated features - Sensitive to imbalanced datasets ## Footnote Simplicity can be a drawbacks when real-world data doesn't follow the independence assumption
153
# Bayesain, MAP , MLE Estimation What is Naive Bayes used for?
Naive Bayes is widely used in tasks such as spam filtering, sentiment analysis & Text Classification because it perfomrs well with text data & large data sets.
154
# Bayesain, MAP , MLE Estimation What are the different types of Naive Bayes Classification based on the distribution of the features?
**Gaussian Naive Bayes**: Assumes that *P*(**x**|C) (DATA) follows a normal distribution **Multinomial Naive Bayes**: Used for discrete count data, such as text classification; *P*(**x**|C) are computed directly from frequencies of occurence **Bernoulli Naive Bayes**: Assumes binary/boolean features (0 or 1)
155
# Bayesain, MAP , MLE Estimation What are Generative Classifiers?
Focus on modeling how the data is generated - Model *P*(X|Y) & *P*(Y) to compute *P*(Y|X) - e.g. Naive Bayes
156
# Bayesain, MAP , MLE Estimation What is DIscriminative Classifiers?
Directly model the decision boundary - Directly model *P*(X|Y) - e.g. Logisitic Regression
157
# Bayesain, MAP , MLE Estimation Summarise Naive Classifier:
- simple yet effeictive classification algorithm - Performas well in practical situations - Assumptions doesn't often hold in real application, but method works effectively well - Computationally Efficent & works well for certain problems
158
# Bayesain, MAP , MLE Estimation What are the steps for Maximum Likelihood?
1) Pick/Use a Likelihood Function 2) Use the Log-Likelihood Function 3) Differentiate & solve for P(probabilites)
159
# Bayesain, MAP , MLE Estimation What are Pros of Logistic Regression?
- Interpretability - Coefficent shows how feature affects probability
160
# Bayesain, MAP , MLE Estimation What are Cons of Logisitc Regression?
- Assume a linear relationship in log-odds, which may not always hold
161
# Bayesain, MAP , MLE Estimation What is the Conditional Independece Assumption?
P(A,B|C) = P(A|C) * P(B|C) ## Footnote Assumption simplifies probability calculations significantly, allowing for efficient classification
162
# Information Theory What is Information Theory?
How to quantify / measure information?
163
# Information Theory Which has more info: *"Alice is a person"* OR *"Alice is a Student"*
"Alice is a Student" - Has more info since "Student" is more specific than "Person"
164
# Information Theory How to quantify the information?
Information = "SUPRISE" in probability terms
165
# Information Theory Can Logistic Regression be interprted in terms of information measure?
YES
166
# Information Theory What application does Information Theory have?
- In close Connections - Application Machine Learning - Stats -Engineering - Audio/Video/Image compression (ORIGINALLY)
167
# Information Theory What is the meaning of the data?
- The meaning / semantics of data doesn't matter - Instead, INFO is measured by Probability: - Rare events contain more info - Common Events contain less info
168
# Information Theory What is Self Information?
Measures info content of an event
169
# Information Theory What are the 3 key rules that define Self Information?
- If an event is certain, it gives no new info - The rarer an event, the more suprising it is - 2 Independent events happen, total info is the sum
170
# Information Theory What is the Definition of Self Information?
Ix(x) = -logb[*P*x(x)] = logb(1/(*P*)x(x)) - if P(x) is low then high suprise - if P(x) is high , suprise is low - if Px(x) = 1 then Ix(x) = 0 ## Footnote X is random Variable Probability Mass Function Px(x) P(x) probability of event x
171
# Information Theory What is b base of log?
b = 2 => bits b = e => nats b = 10 => dits ## Footnote They determine the units of self information
172
# Information Theory What is the Realtionship in Self Infromation and Explain it?
logit(A) = I(¬A) - I(A) - Information of event A is related to the information of its complement ¬A
173
# Information Theory What is Entropy?
Quantifies the uncertainty in a random variable X
174
# Information Theory What is Definition of Entropy and give the Interpretation?
H(X) = E [Ix(X)] = -Σ{1-m} P(X = xi) logb *P*(X = xi) = E[logb (1/*P*x(x))] = - E[logb (*P*x(X))] - All outcomes are equally likely, High Entropy (MAX UNCERTAINTY) - One outcome is very likely, entropy is low (LOW UNCERTAINTY)
175
# Information Theory What is the Standard Entropy Equation?
H(X) = -Σ{1-m} *P*(xi) logb(*P*(xi))
176
# Information Theory What is the Entropy of one outcome?
0
177
# Information Theory What are on the graph of Entropy?
- Y Axis : Entropy - X Axis : Probability
178
# Information Theory What is the Notation for Marginal PMF of X? | Simpiflied Notations
*P*x(.) = *P*(.)
179
# Information Theory What is the Notation for Marginal Probability of "X = x": | Simpiflied Notations
P(X = x) = *P*(x)
180
# Information Theory What is the Notation for Joint PMF? | Simpiflied Notations
*P*x,y(.,.) = *P*(.,.)
181
# Information Theory What is the Notation for Joint PMF at (x,y)? | Simpiflied Notations
P(X=x, Y=y) = *P*(x,y)
182
# Information Theory What is the Notation for Conditional PMF? | Simpiflied Notations
Px|y(.|.) = *P*(.|.)
183
# Information Theory What is the Notation for Conditional PMF at (x,y)? | Simpiflied Notations
P(X=x|Y=y) = *P*(x|y) ## Footnote Deterministic value But if we write *P*(x|Y), this is Random variable since Y is random variable
184
# Information Theory Whats is Joint Entropy?
Measures uncertainty in 2 Random Variables ## Footnote A measure of the uncertainty associared with a set of variables
185
# Information Theory What is the Definintion of Joint Entropy?
H(X,Y) = -E[log*P*(X,Y)] = -Σ{x∈Rx}Σ{y∈Ry} *P*(x,y) log*P*(x,y)
186
# Information Theory What is Conditional Entorpy
Measures uncertainty of Y given X ## Footnote Quantifies uncertainuty of the outcome of random variable Y given the outcome of another random variable X
187
# Information Theory What is the Equation for Conditional Entropy?
H(Y|X) = -E [log *P*(Y|X)] = -Σ{x∈Rx}Σ{y∈Ry} *P*(x,y) log *P*(y|x) ## Footnote Don't use this directly ,we re-write it to avoid computing *P*(y|x)
188
# Information Theory WHat is Chain Rule for Conditional Entropy?
Using Bayes rule identity, where *P*(y|x) = (*P*(x,y))/(*P*(x)) Therefore: -Σ{x∈Rx}Σ{y∈Ry} P(x,y) log P(y|x) = -Σ{x∈Rx}Σ{y∈Ry} P(x,y) log (*P*(x,y))/(*P*(x)) = -Σ{x∈Rx}Σ{y∈Ry} P(x,y) [log(*P*(x,y)) - log*P*(x)] = H(X,Y) - H(X) SO H(X|Y) = H(X,Y) - H(X)
189
# Information Theory What is Relative Entropy (Kullback-Leibler DIvergence) **KL**
How different 2 probability distribution are ## Footnote Quantifies the distance between 2 probability distributions
190
# Information Theory What is the Equation for KL?
D{KL}(P||Q) = Σ{x∈Rx} *P*(x) log (*P*(x)/*Q*(x)) = **E** [log (*P*(x))/*Q*(x)] ## Footnote **E** [log (*P*(x))/*Q*(x)] - Fraction is always bigger than 1, so its always +VE
191
# Information Theory What are Conventions of KL?
- 0 log (0/Q) = 0 - *P* log(*P*/0) = ∞
192
# Information Theory What are the Properties of KL?
- D{KL} (*P*||*Q*) >= 0 - D{KL} = 0 if *P*(x) = *Q*(x) - Not Symmetric : D{KL}(*P*||*Q*) != (*Q*||*P*) - Swapping wasn't leas to same answer
193
# Information Theory What is Cross Entropy Equations?
For 2 Discrete Distributions P & Q: H(P,Q) = - Σ *P*(x)logQ(x) = H(*P*) + D{KL}(*P*||*Q*)
194
# Information Theory What is Cross Entropy?
Cross-entropy quantifies how well the predicted probabilities match the actual labels
195
# Information Theory How can Mutual Information be expressed in KL Divergence?
I(X;Y) = Σ{x} Σ{y} *P*(x,y) log (*P*(x,y)/(*P*(x) *P*(y) ))
196
# Information Theory What is the Interpretation of Mutual Information in KL Divergence?
Mutual Information measures the divergence of modelling the joint probability distribution *P*(x,y) as the product marginals *P*x*P*y
197
# Information Theory What is the Special Case for Mutual Information & KL Divergence?
When X & Y are Independent: P(x,y) = P(x)P(y) => I(X;Y) = D{KL} (P{x,y} || PxPy) = 0
198
# Information Theory What are the Alternative Forms of Mutual Information?
I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y) = H(X,Y) - H(Y|X) - H(X|Y) ## Footnote The equations reveal how mutal information quantifies the reduction in uncertainty about one random variable given knowledge of another
199
# Information Theory What are the Properties of Mutual Information?
- **NON-NEGATIVE**: I(X:Y) >=0 - **Symmetric **: I(X;Y) = I(Y;X) - **Measures Statistical Dependence**: - I(X;Y)= 0 iff X & Y Are Independent - I(X;Y) increases with the dependence between X & Y & with their individual entropies H(X) & H(Y) - I(X;X) = H(X) - H(X|X) = H(X) + 0 = H(X)
200
# Information Theory In Practice What do we use Mutual Information for?
- Decide which features of the data to use for classification - Feature with Highest MI with class label are the informative features
201
# Decision Trees & Feature Selection Why should you use decision trees?
- Simple & Interpretable (Easy For humans to understand) - Extendable (Random Forest & Gradient Boosting) are ensembles of decision trees
202
# Decision Trees & Feature Selection What are Application of Decision Tree?
- Credit Card Fraud Detection - Customer Service Automation - Pancake Recipe Optimisation
203
# Decision Trees & Feature Selection What is the Predictive Model Structure for Decision Tree Learning?
- **Root OR Internal Node**: Represents a Feature - **Leaf Node**: Represents the target value (Class Lable OR values) - **Branch**: Represnts a decision rule
204
# Decision Trees & Feature Selection What are Types of Decision Trees?
- **Classification Trees**: Target variable takes categorical values (e.g. Male/Female) - **Regression Trees**: Target variables takes continuous values (e.g. Temperature)
205
# Decision Trees & Feature Selection How to construct a Decision Tree?
1) Find the best rule to split the data 2) Repeat until each partition is homogenous (PURE CLASS LABEL)
206
# Decision Trees & Feature Selection How to find the best rules to split the sample in decision Trees?
- Gini Index (Gini Impurity) - Information Gain
207
# Decision Trees & Feature Selection What is Gini Index?
- Used in CART (classification And Regression Trees)
208
# Decision Trees & Feature Selection What is the Equation for Gini Index?
I{G} = 1 - Σ {1-j} (pi)^2 | pi is the fraction of items labeled with class *i* in the dataset ## Footnote For every feature does this, to find the most informative feature
209
# Decision Trees & Feature Selection What is Information Gain?
Measures the reduction in entropy after splitting a dataset based on a feature
210
# Decision Trees & Feature Selection What are the Key Idea for Information Gain?
- Feature that maximises info gian is used for the split
211
# Decision Trees & Feature Selection What is the Equation for Information Gain?
IG(Y,X) = H(Y) - H(Y|X) | Y => Represents target X => Represents a feature of the input sample ## Footnote X & Y are random Variables H(Y) => Entropy of Labels H(Y|X) => Conditional Entropy after splitting on feature X
212
# Decision Trees & Feature Selection What is the Interpretation of Information Gain?
- Quantifies the improvement in classifying labels after using a feature to split the dataset - Feature that maximises info gain is chosen for the split
213
# Decision Trees & Feature Selection What is the Problem for allowing the tree to keep growing ?
Overfitting - Don't want too many leafs - Even if the leaf nodes don't achieve 0 entropy
214
# Decision Trees & Feature Selection What can we do to prevent Overfitting?
- Early Stopping (Limit tree depth/ min leaf size) - Post-Pruning (Remove unnecessary branches)
215
# Decision Trees & Feature Selection What are limitations of Decision Trees?
- **Unstable**: Small change in data, will create Different Tree - **Greedy Search**: No Backtracking - **Lower Accuracy**: Less Accurate than other models
216
# Decision Trees & Feature Selection What are the Solutions for the Limitations of Decision Trees?
- **Random Forest**: Combines multiple decision trees to reduce overfitting & improve accuracy - **Gradient Boosting**: Builds an esemble of trees sequentially, where each tree corrects error made by previous tree
217
# Decision Trees & Feature Selection What is Information Gain Actually?
Mutual Information
218
# Decision Trees & Feature Selection What is the Interpretation of Mutual Info and Information Gain?
- Decision Tree learning algorithm recusively use mutual info to select the feature that share the most info with dependent variable - Selected Feature is used to split the data
219
# Decision Trees & Feature Selection Why not use mutual info to slect a set of relevant feature before apllying classifier?
- Searching through all subsets is computationally too demanding - Greedy approach is feasible instead
220
# Decision Trees & Feature Selection How to is Mutual Information Feature Selection done?
Choose a subset of feature S from the intial set F that maximises the mutual info between the target variable Y & the subset of selected features S: arg max {S⊆F} I(Y;S) ## Footnote - Select only relevant feature before training classifier
221
# Decision Trees & Feature Selection What to use instead of Mutual Information based Feature Selection?
Greedy Approach is feasiable instead
222
# Decision Trees & Feature Selection What are the steps for Greedy Feature Selection Algorithm?
- Start with all features - Pick a Feature Maximising Mutual Information - Remove from candidate set & repeat until K feature are choosen
223
# Decision Trees & Feature Selection What is the Key Formula Greedy Feature Selection?
f{max} = arg max {x∈F} (I(Y;X)-β Σ {xs∈F} I(Xs;Xi)))
224
# Test Set Bound & Occam's Razor Bound What is Confidence?
How sure we are about something
225
# Test Set Bound & Occam's Razor Bound What is a Bound?
Limits or Constraints
226
# Test Set Bound & Occam's Razor Bound What is Predition Ability?
A classifier learn well if it can make correct predictions on new exa,ples, not just ones it has seen before
227
# Test Set Bound & Occam's Razor Bound Does Prediction Abiltiy = Learning?
YES
228
# Test Set Bound & Occam's Razor Bound Why should we study prediticion theroy ?
TO: - Understand hoe machine learning works - Create a better training method fo models - Develop better ways to check if classifier is reliable
229
# Test Set Bound & Occam's Razor Bound Why is Prediction Theory Important?
- Classifier only memorises the training data, will fail on new data - Watnt a classifier that can genralise (works on unseen data)
230
# Test Set Bound & Occam's Razor Bound What is the standard technique that we we can verify learning success?
1) Divide samples into train & test set 2) Train on train set 3) Test on test set ## Footnote Test Set (Used to see how many errors are made) Train Set (Used to teach classifier is working)
231
# Test Set Bound & Occam's Razor Bound What's wrong with standard technique?
- Not a perfect method doesn't tell us how good the classfier really is - Need a more mathematical way to measure reliablity
232
# Test Set Bound & Occam's Razor Bound What is Overfitting and why is it bad?
Overfitting: When classifier memorises training data instead of learning pattern It is bad as a good classifier should work on new data, not just old data
233
# Test Set Bound & Occam's Razor Bound How to avoid overfitting?
Design a better learing algorithm - Ask question, like: - what is a goof pruing criterion - Why are large margin good - What other algos are likey to get good results ## Footnote The questions are important to make machine learning models better
234
# Test Set Bound & Occam's Razor Bound What is 𝒳 ?
- Input space - Input set {e.g. pics of cats & dogd}
235
What is 𝒴?
- Output space - Output et (E.g. 1 = cat ; 0 = dog)
236
# Test Set Bound & Occam's Razor Bound What is c?
c : 𝒳 => 𝒴 - Looks at Input & predicts a label
237
# Test Set Bound & Occam's Razor Bound What is Basic Assumption?
All samples are drawn independently & indelitcally distibributed from some unknow Distribution *D* over 𝒳 x 𝒴
238
# Test Set Bound & Occam's Razor Bound What is S = {(xi,yi)} (1-n) ~ *D* ^n?
- A random i.i.d sample set S = (x1,y1), (x2,y2),...,(xn,yn) - We have n examples, where each x is an imput & each y is correct answer
239
# Test Set Bound & Occam's Razor Bound What is the Assumption we make for Basic Assumption?
- Data comes from unknow pattern - Data points ate randomly pickes & idependent
240
# Test Set Bound & Occam's Razor Bound What is True Error?
- Actual probability that the classfier makes a mistake/Error.
241
# Test Set Bound & Occam's Razor Bound What is the formula for True Error (cD)?
cD = P(X, Y from D) [c(x) ≠ y]
242
# Test Set Bound & Occam's Razor Bound What is the problem with cD / True error?
We don't the True Distribution, therefore we don't know the value of cD
243
# Test Set Bound & Occam's Razor Bound Is cD Determinsitic or Random Variable?
Depends on c, if c is determinstic (fixed), then cD is deterministic. If c is Random then cD is random
244
# Test Set Bound & Occam's Razor Bound What is Observed Error?
Estimated Error using training sample
245
# Test Set Bound & Occam's Razor Bound What is the Observed Error Formula?
ĉS = (1/n) * Σ [c(xi) ≠ yi] from i = 1 to n | Avg. of the errors in the training set ## Footnote This is the fraction of mistakes in the observed sample
246
# Test Set Bound & Occam's Razor Bound What distribution does the No. of mistakes follow?
Binomial Distribution
247
# Test Set Bound & Occam's Razor Bound What is the Binomial Probability Mass Function (PMF) for test set errors?
P(n * ĉS = k | cD) = (n choose k) * (cD^k) * (1 - cD)^(n-k) ## Footnote This gives the probability of making exactly k mistakes in n trials
248
# Test Set Bound & Occam's Razor Bound What is Binomial PMF?
Decribes the Likelihood of diifferent error counts
249
# Test Set Bound & Occam's Razor Bound What is PMF?
How Likely different outcomes are
250
# Test Set Bound & Occam's Razor Bound What are application to classifer error?
- Classifier Makes n predictions, the No.of misatkes follow Binomial Distribution - True error rate = cD e.g. - Flipping a baised coin n times wher: Heads = Error (prob cD) Tail = Not Error (prob 1 -cD) PMF tells us the probability of getting k ERRORS
251
# Test Set Bound & Occam's Razor Bound What does a small cD mean?
PMF is Normal (Prob is near 0)
252
# Test Set Bound & Occam's Razor Bound What does large cD mean?
PMF is wider
253
# Test Set Bound & Occam's Razor Bound What does Steeper curve mean on Binomial PMF graph?
More predictable classifier (Low uncertainty)
254
# Test Set Bound & Occam's Razor Bound What does Flat Curve mean on Binomial PMF graph?
Less Predicatble (High Uncertainty)
255
# Test Set Bound & Occam's Razor Bound What is Binomial BMF?
Tells us the probability of getting at most k mistakes
256
# Test Set Bound & Occam's Razor Bound What is the formaula for Cumulative Distribution?
Bin(n, k, cD) = P(ĉS ≤ k/n | cD) ## Footnote Sum of the values up to the No. of Arguments (No. of Errors)
257
# Test Set Bound & Occam's Razor Bound What does Binomial CMF mean?
- Sums up probs of making 0 to k mistakes - We ask what is the prob of k or fewer mistakes (Not Just k)
258
# Test Set Bound & Occam's Razor Bound What do we use CDF for?
To solve for cD, setting confidence level δ where delta is: 1- (confidence level) e.g. confidence level is 95% then δ = 0.05
259
# Test Set Bound & Occam's Razor Bound What is the Key Idea fo CMF Test Set Bound?
- Pick confidence level - Use this find upper bound on the classfier's true error cD
260
# Test Set Bound & Occam's Razor Bound Give the Inverted CDF:
Bin(n, k, δ) = msx {P : Bin (n, k, P) >= δ} ## Footnote Means we find the highest true error rate cD that still gives us at most k mistakes in n trails, with confidence δ
261
# Test Set Bound & Occam's Razor Bound What are the steps for finding the max valid cD?
1) Looks at different possible error rates cD 2) Find which pass the Binomial test for being >= δ 3) Take the max (highest) of the accepted values
262
# Test Set Bound & Occam's Razor Bound What is the Theroem for Test Set Bound?
P(cD ≤ Bin(n, n * ĉS, δ)) ≥ 1 - δ
263
# Test Set Bound & Occam's Razor Bound Explain the Theroem of Test Set Bound:
- Probability our estimate bound is correct is at least 1 - δ. - Provides a probabilistic guarantee on the true error rate based on observed data - Derived under assumption of i.i.d
264
# Test Set Bound & Occam's Razor Bound Why is the Theroem Powerful?
- Guarantees we are not understimating classfier's true error - More data (n) we havem tighter bound becomes
265
# Test Set Bound & Occam's Razor Bound How to use test set bound?
2 Ways: - Numerical method to compute Bin(n, n * ĉS, δ) - Approximate upper bound of Bin(n, n * ĉS, δ)
266
# Test Set Bound & Occam's Razor Bound What is Numerical Method?
- Compute Bin(n, n * ĉS, δ) accurately - Give tightness possible bound
267
# Test Set Bound & Occam's Razor Bound What is Approximate Upper Bound Method?
- Compute Upper Bound of Bin(n, n * ĉS, δ) - Faster, but loses some tightness
268
# Test Set Bound & Occam's Razor Bound What is the Equation simplified forms of the bound?
P(cD ≤ ĉS + sqrt( (ln(1/δ)) / (2n) )) ≥ 1 - δ | Formula gives upper bound on the classofer's true error ## Footnote cD = Real error rate of the classifier ĉS = observed error on test set n = No. of test examples δ = Confidence level
269
# Test Set Bound & Occam's Razor Bound Explain what the simplified forms of the bound is?
- Upper bound approximation, which are simplier to compute & simpler to interpret
270
# Test Set Bound & Occam's Razor Bound How does Simplfied Bound help?
- large n, extra term becomes small, meaning we get tight bound
271
# Test Set Bound & Occam's Razor Bound What is a precaution/ Warning for Simplified Bound?
ONLY WORKS IF TEST SET IS TRULY INDEPENDENT - If we use the training, this bound doesn NOT work
272
# Test Set Bound & Occam's Razor Bound What are the Key Takeaways of Test Set Bound?
1) Help us Certify how well a classifier has learned 2) Probabilistic 3) Large Test Set give tighter bounds 4) Assume the data point are Independent 5) Useful in Both Theory
273
# Test Set Bound & Occam's Razor Bound Explain how it helps us Certify how well a classifier has learned?
- Bound is tight, classifier is very relaible - Bound is wide, need more data to be confident
274
# Test Set Bound & Occam's Razor Bound What does Probabilistic mean?
- No guarantee the error is exaclty what we calculated - Tells us the bound is correct in 1-δ case
275
# Test Set Bound & Occam's Razor Bound Explain why large Test Sets give tight bounds?
- More data reduces uncertainty - Gives more precise estimate
276
# Test Set Bound & Occam's Razor Bound Explain the Assumption that the data points are independent?
Test data is not Random or Independent then Bound is not valid ## Footnote Has to be i.i.d
277
# Test Set Bound & Occam's Razor Bound How is Test Set Bound Useful in Practice and in Theory?
- Bound is simple & well understood - Better than just using a test set accuracy %, can be misleading
278
# Test Set Bound & Occam's Razor Bound What is Traing Set Bound?
A way to estimate the true error rate of a classifier when we only have the training set
279
# Test Set Bound & Occam's Razor Bound Why is creating similar confidence bound using traing set only , important?
- Real life, er often have very little labelled data - Might not have enough data for a test set - Need a way estimate true error without test
280
# Test Set Bound & Occam's Razor Bound What is the Challenge with Test Set Bound?
- Only works if the test set is independent - If we use the training set, the classifeir has already seen data - Errors in trinig set are not Random anymore
281
# Test Set Bound & Occam's Razor Bound What is the Problem if S is the Training set?
- Training errors are no longer independent, because the the classifier have already seen the data ## Footnote E.g. Imagine a student memorises answers for test: - Check their performance on the test, they will one do grtat - True ability might be worse
282
# Test Set Bound & Occam's Razor Bound What is the Key Idea of Occam's Razor Bound?
- Uses Prior Belief - Before seeing any training data we assume one classfier are more likely than others - Defines a probability distibution over classifer P(c) - Represents our initial guess about which classfiers are good
283
# Test Set Bound & Occam's Razor Bound Why does having a Prior help in Occam's Razor bound?
- Classifeir too complex, less likely yo generalise well - Simpler models are more likely to generalise - Build a bound that takes complexity into account
284
# Test Set Bound & Occam's Razor Bound What is the Formula for Occam's Razor bound?
P(cD ≤ Bin(n, n * ĉS, δ * P(c))) ≥ 1 - δ ## Footnote Adjusts the test set bound using a prior probability P(c) for the classifier
285
# Test Set Bound & Occam's Razor Bound What is P(c)?
- About how good classifier is - Confidence bound is adjusted based on P(c)
286
# Test Set Bound & Occam's Razor Bound What is the simplified Occam's Razor Bound?
P(cD ≤ ĉS + sqrt( (log(1/P(c)) + log(1/δ)) / (2n) )) ≥ 1 - δ
287
# Test Set Bound & Occam's Razor Bound How is the Occam's bound Relative to Test Set Bound?
- Occam's Razor Bound is Weeaker than Test Set - But Best we can do without a test set
288
# Test Set Bound & Occam's Razor Bound What does tightness depend on in Occam's Razor Bound?
Dpends on Self-Information log(1/(P(c)) of the classifier with respect to prior bets
289
# Test Set Bound & Occam's Razor Bound What are key Insights for Occam's Razor Bound?
- Each classifer has its own distribution - Bound ensures we don't pick a bad classifier by accident - Size of δP(c) depends on our intial guess P(c) - Better Prior = Tighter Bound
290
# Test Set Bound & Occam's Razor Bound How do you compute an upper bound on true error with 95% confidence?
cD ≤ ĉS + sqrt( (ln(1/δ)) / (2n) ) ## Footnote For 95% confidence, set δ = 0.05
291
# Bias Variance & Uncertianty Quantification What is MSE used for?
Bias-Variance Analysis
292
# Bias Variance & Uncertianty Quantification What is the Goal for Bias-Varaince Analysis?
Estimate the true parameter θ*
293
# Bias Variance & Uncertianty Quantification What is the Goal for Statistical Inferene/ Estimation setup?
Cnstruct an estimator θ'n for an unknown parameter (θ * ) given observed data S. ## Footnote Index n is to indicare that we obtianed this estimate from n examples
294
# Bias Variance & Uncertianty Quantification What is S?
S = {(xi,yi) , i ∈ [n]} => Genreated from unknow distribution that has some true parameter => Consits of Inputs & Outputs
295
# Bias Variance & Uncertianty Quantification What is the Key Point in Statistical Inference?
θ'n is a random variable because it depends on the dataset, which contains randomness
296
# Bias Variance & Uncertianty Quantification Give examples of Statistical Inference:
- Estimating the probability of a coin flip {Beronoulli Distribution} - Estimating parameters of Linear Regression Model : θ'n = ((X^T X )^-1) (X^T)y
297
# Bias Variance & Uncertianty Quantification What is the Intuition for Statistical Inference?
We generate multiple training sets, each will give a different estimate θ'n.
298
# Bias Variance & Uncertianty Quantification Explain what Low & High Variance & Bias is?
Low Bias: Estimates are close to True Parameter High Bias: Estimates are Far from True Parameter Low Variance: Estimates are Close together High Variance: Estimates are Spread out
299
# Bias Variance & Uncertianty Quantification What Variance and Bias do we want?
Low Bias & Low Variance is what we want, but practically their are usually trade offs.
300
# Bias Variance & Uncertianty Quantification What is the Bias of an Estimator?
Difference in the estimates & the ground truth (TRUE PARAMETER) Definition: Bias of θ'n is Bias(θ'n) = E[θ'n] - θ *
301
# Bias Variance & Uncertianty Quantification When is an Estimator unbiased?
if E[θ'n] = θ * , for all θ *
302
# Bias Variance & Uncertianty Quantification When there are many points where does the estimator converge?
On the Ground Truth {True Parameter}
303
# Bias Variance & Uncertianty Quantification What is Underfitting in Bias Variance?
High Bias dominates the error
304
# Bias Variance & Uncertianty Quantification What is Overfitting in Bias Variance?
High Variance dominates the error
305
# Bias Variance & Uncertianty Quantification What is Variance of an Estimator?
- Measures how much estimates flucuate. - Definition: Varaince of θ'n is Var(θ'n) = E[( θ'n - θ * )].
306
# Bias Variance & Uncertianty Quantification Does variance directly depend on the true Parameter θ * ?
NO
307
# Bias Variance & Uncertianty Quantification What is the MSE using Bias-Variance?
- Sum of (bias)^2 & Variance **E[(θ' - θ * )^2] = (θ * - E[θ'])^2 + E(θ' - E[θ * ]^2)**
308
# Bias Variance & Uncertianty Quantification What is the MLE for Estimator θ ?
θ'{MLE} = (Σxi) / n | i is 1 to n ## Footnote MLE is the sample mean of Bernouli trails
309
# Bias Variance & Uncertianty Quantification What is the MAP Estimator for θ?
θ'{MAP} = ((Σxi)+ α + β) / (n + α +β - 2) | i is 1to n ## Footnote Maximiser of the posterior distribution of θ when the prior distribution is p(θ) = Beta( α , β)
310
# Bias Variance & Uncertianty Quantification Can all Estimators can be analysed in the frequentist framework?
Yes
311
# Bias Variance & Uncertianty Quantification What is the Bias of MLE?
Bias(θ'{MLE}) = E[ (Σxi) / n] - θ = (nθ)/n - θ = θ - θ = 0 | Due to Bias = 0 it is MLE is Unbiased ## Footnote (Σxi) / n becomes (nθ)/n as xi has expectation of θ so it becomes nθ
312
# Bias Variance & Uncertianty Quantification What is the Bias of MAP?
Bias(θ'{MAP}) = E[θ'{MAP}] - θ = ((nθ + α - 1) / (n + α + β - 2)) - θ ## Footnote MAP is **BIASED** estimator toward prior mean, especially for small n, but becomes unbiased as n grows larger
313
# Bias Variance & Uncertianty Quantification What is the Variance of MLE?
Var(θ'{MLE}) = Var((Σxi) / n) = (Var (Σxi))/n^2 = θ(1-θ)/n
314
# Bias Variance & Uncertianty Quantification What is the Variance of MAP Estimator?
Var(θ'{MAP}) = (1/(n + α + β - 2)^2) * Var(ΣXi) = nθ(1-θ) /((n + α + β - 2)^2) ## Footnote informative prior (α , β) variance reduces Very Large n Diminishes α & β Small n, α & β have to be informative
315
# Bias Variance & Uncertianty Quantification What are Implications of Bias-Variance Analysis?
- Want an estimator that has Low bias & variance (Cannot be acheived simultaneously) - MLE has low Bias if n is large, but has high variance - MAP has high Bias, but low Variance - Bayesian estimator incorporate prior information, introduces bias but reduces Variance - Bayesian posterior mean is biased bu acheives lower than MSE than frequentist Estimator - Best choice depends on sample size, prior knowledge & application needs
316
# Bias Variance & Uncertianty Quantification What Framework is MAP & What is Bais-Variance?
- Map is Bayesian - Bias-Variance is Frequentist
317
# Bias Variance & Uncertianty Quantification What is the Goal for the Regression Problem setup?
Find estimator f'(x) that approximates f(x) using the training set
318
# Bias Variance & Uncertianty Quantification What is the True function of Regression Problem setup?
y = f(x) + ∈ | ∈ = Random noise {mean = 0 & variance = σ^2} f(x) = Actual Relationship
319
# Bias Variance & Uncertianty Quantification What does it mean to have Low Variance & High Bias in Regression Bias-Variance?
Prediction are stable but far from true function High Bias: Simple Linear model makes similar predictions, but they are all wrong
320
# Bias Variance & Uncertianty Quantification What is the Dataset (S) for Regression?
S = {(xi,yi)}
321
# Bias Variance & Uncertianty Quantification What does it mean to have High Variance & Low Bias in Regression Bias-Variance?
Each model fits trainig data well but gives different results - High variance = High degree polynomial perfectly fits each data set but varies widely
322
# Bias Variance & Uncertianty Quantification What is the quantity we want to analyse in Regression Bias-Variance?
The Performance of a models at a fixed test point x
323
# Bias Variance & Uncertianty Quantification What is Bias in Regression Bias-Variance?
The difference between the true fucntion f(x) and the expected model prediction ocer all possible training sets: **Bias^2 = (f(x) - E{s}[f'(x;S)])^2** | E.g All predictions are systematically too low or high, model has bias ## Footnote f(x) => True function that generated the data f'(x;S) => Model's prediction at x given training set S E{s}[f'(x;S)] => Expected prediction over different training sets
324
# Bias Variance & Uncertianty Quantification What is Variance in Regression Bias-Variance?
How much the model's predictions vary for different training sets S: **Variance = E{s}[(f'(x;S) - E{s}[f'(x:S)])^2]** | - Captures the sensitivity of the model to different training sets ## Footnote High variance means the model is sesnitve to small changes in the data
325
# Bias Variance & Uncertianty Quantification What is Noise in Regression Bias-Variance?
Represents the irreducilbe noise, is the variance in y that is independent of x: **Irreducible Noise = E{∈}[∈^2] = σ^2 ** ## Footnote Inherent Randomness in data cannot be reduced
326
# Bias Variance & Uncertianty Quantification How to decompes the Error while Ignoring Noise?
**Start with:** E{s}[(f'(x;S) - f(x))^2] **Add/Sub the Mean:** = E{s}[(f'(x;S) - E{s}[f'(x;S)] + E{s}[f'(x;S)] - f(x))^2] **Expand the Square:** = (E{s}[f'(x;S)] - f(x))^2 + E{s}[(f'(x;S) - E{s}[f'(x;S)])^2] + Cross-term **Eliminate Cross-term:** = E{s}[f'(x;S) - E{s}[f'(x;S)]] = 0
327
# Bias Variance & Uncertianty Quantification What is the Total Error in Regression Bias-Variance?
- Includes Noise E{s,∈}[(y - f'(x;S))^2] =E{s}[(f(x) - f'(x;S))^2] + E{∈}[∈^2] =(E{s}[f'(x;S)] - f(x))^2 + E{s}[(f'(x;S) - E{s}[f'(x;S)])^2] + σ^2 = Bias^2 + Variance + σ^2
328
# Bias Variance & Uncertianty Quantification What are the Bias-Variance Tradeoff?
**Bias**: Measure the error due to approximating the true function (High Bias can occur with overly simple models [underfitting]) **Variance**: Measures the variability of prediction f'(x;S) around its mean (High Variance occurs with overly complex models [Overfitting]) **Noise**: Irreducible error inherent in the data generation process **Tradeoff**: Increasing the model complexity typically reduces bias but increase variance & vice versa
329
# Bias Variance & Uncertianty Quantification What is Underfitting in Regression?
A model with too few parameters will have bais and low variance
330
# Bias Variance & Uncertianty Quantification What is Overfitting in Regression?
A model wth too many parameters will have low bias & high variance
331
# Bias Variance & Uncertianty Quantification What is Regularisation?
- First Line of defense agaisnt overfitting (Reducing Varinace) - Adding a penalty term to the cost function to reduce variance
332
# Bias Variance & Uncertianty Quantification What is Regularisation equivalent to in likelihood bases training?
Map estimation of the parameters, which can reduce variance in expense of bias
333
# Bias Variance & Uncertianty Quantification What is the added Regularisation Term?
λ * ||θ||^2 J(θ) = - logP(D|θ) +λ * ||θ||^2 - **Working backwards using this amounts to finding the maximising θ**
334
# Bias Variance & Uncertianty Quantification What is Bootstrapping?
Idea: Train Multiple models on different random subsets of data & average the prediction (Repeatedly resample the training set, train separate model on each sample and avg the preditions)
335
# Bias Variance & Uncertianty Quantification What is the benefit of Bootstrapping ensembles?
Reduces variance
336
# Bias Variance & Uncertianty Quantification How can Bootstrap be used for Estimating Epistemic Uncertainty?
Epistemic Uncertainty is due to lack of knowledge. Bootstrap helps estimate this uncertainty.
337
# Bias Variance & Uncertianty Quantification
338
# Bayesian Networks - Representations What do the Node represent in Probabilistic Graphical model?
Random Variables
339
# Bayesian Networks - Representations What do the Edges represent in Probabilistic Graphical model?
Conditional dependencies
340
# Bayesian Networks - Representations What are undirected graphical models?
E.g Markov Random Fields These graphs define dependencies but do not capture causality
341
# Bayesian Networks - Representations What are directed graphical models?
Direct edges to represnt cause-effect relationships ## Footnote A.K.A Bayesian Networks
342
# Bayesian Networks - Representations What are Bayesian Networks?
probabilistic graphical model that uses the direction of edges to represent the cause-effect relationship & bayes' therorem for probabilistic inference ## Footnote Helps make data driven deciions under uncertianity
343
# Bayesian Networks - Representations What are the advantages of Bayesian Networks?
- **Graphical Representation:** Visual representation of joint probability distibutions of different random variables - **Powerful:** Captures complex between random variables (CAN REPRESENT CASUAL STRUCTURES) - **Combine data & prior knowledge:** Uses both historical knowledge & new observations - **Generative Approach:** Able to generate new data similar to existing data.
344
# Bayesian Networks - Representations What are the Disadvantages of Bayesian Networks?
- Requires prior knowledge (Needs a known probability distribution) - Computationally Intractable (Large network requires significant computations).
345
# Bayesian Networks - Representations What is the Naive Bayes as Bayesian Network?
Simple Bayesian network for classification - Target Variable Y is CLASS LABEL - Input Variable X = {X1,X2,...,Xn} is EVIDENCE
346
# Bayesian Networks - Representations What are the 3 Main tasks needed with Bayesian Networks?
- **Inference:** Given evidmnce compute probability of other variables - **Training:** Learn Model Parameters - **Structure Determination:** Identitfy what is connected to what (How variables connect)
347
# Bayesian Networks - Representations How to represent the joint probability distributions of random variables?
Bayesian network is a directed, acyclic graph (DAG) - Node => Random Variables - Directed Edges => Conditional Dependencies - Directed Edges represent directed influence (direct cause)
348
# Bayesian Networks - Representations Give a conditional distribution for each node given its parents:
P(Xi|Parents(Xi))
349
# Bayesian Networks - Representations Bayesian Networks and Discrete Random Variables?
- Conditional Distribution can be represented as a conditional table (CPT) CPT is the distribution over Xi for each combination of parent values (TABLE THAT SPECIFIES PROBABLILTY)
350
# Bayesian Networks - Representations What is the formal meaning of a Bayesian Network?
**Full Joint Distribution:** P(X{1}X{2},...,X{n}) = P(X{1})P(X{2}|X{1}) ...p(Xn|X{1},..X{n-1}) **Limited-Dependence Assumption:** P(X{1}X{2},...,X{n}) = Π {1-n} P(Xi|Parents(Xi))
351
# Bayesian Networks - Representations What are the advantages of Limited Dependence assumption?
Reduces complexity by assuming conditional independences
352
# Bayesian Networks - Representations What is the Essence of a Bayesain Network?
Compact representation of joint probabaility in terms of condtional distribution
353
# Bayesian Networks - Representations What are the 4 Probabilistic Realtionship Standard Structures?
- Direct Cause - Indirect Cause - Common Cause - Common Effect
354
# Bayesian Networks - Representations What is Direct Cause? | Give an example
**A** => **B** - A directly influences B E.g. Smoking (A) directly causes lung Damage (B)
355
# Bayesian Networks - Representations What is Indirect Cause? | Give an Example
**A** =>**B**=>**C** - A affects B, then B affects C E.g. Smoking (A) leads to tar buildup (B), which then causes lung cancer (C)
356
# Bayesian Networks - Representations What is Common Cause?
**B** <= **A** => **C** - A influences both B & C E.g. Genetic Mutation (A) may cause both high blood pressure (B) & Heart Disease (C)
357
# Bayesian Networks - Representations What is Common Effect?
**A** , **B** => **C** - Both A & B Contribute to causing C E.g. Smoking (A) & exposure to asbestos (B) both increase the risk of lung cancer
358
# Bayesian Networks - Representations What is the probabilistic Relationship of Direct Cause?
An edges represents the conditional dependence between the parent node & the child.
359
# Bayesian Networks - Representations In Direct Cause what is Inference?
The problem of finding out the "Cause" variable when we only observe the "effect" variable
360
# Bayesian Networks - Representations What are observed & unobserved variables?
Observed: The ones we have knowledge about Unobserved: Ones we don't observe
361
# Bayesian Networks - Representations What is the Marginal Probability for Direct Cause?
P(B) = Σ {A} P(A|B) P(A)
362
# Bayesian Networks - Representations What is the conditional probability definition for direct cause?
P(B|A) = P(A∩B) / P(A) = P(B)P(A|B)/P(A) | This called diagnosis
363
# Bayesian Networks - Representations When are 2 sets of variables are Independent?
P(A) = P(A|B) OR P(A,B) = P(A)P(B)
364
# Bayesian Networks - Representations What is conditional Indepence?
Two random variables A & B are conditionally independet given a third random variable C: (A ⫫ B) | C iff P(A,B|C) = P(A|C)P(B|C) | reduces the numper of parameters required
365
# Bayesian Networks - Representations What is the probability of Common Cause?
P(A,B|C) = P(A|C)P(B|C) Total Prob: P(A,B|C) = ∑{C} P(A∣C)P(B∣C)P(C)
366
# Bayesian Networks - Representations What is the Probaility for Indirect Cause?
P(C∣A)=P(C∣B)P(B∣A) Total Prob: P(C | A) = Σ{B} P(C | B) * P(B | A)
367
# Bayesian Networks - Representations What is the Probability for Common Effect?
P(A∣B,C)=P(A∣B)+P(A∣C)−P(A∣B,C) If A is observed then B & C Become Dependent: P(B∣A,C)= P(B∣C)P(A∣B,C)/P(A∣C) ## Footnote B & C cause A
368
# Bayesian Networks - Representations What is Markov Condition?
A node is Independent of its non-descendents given its parent
369
# Bayesian Networks - Representations What is Markov Blanket?
A node's Parent + Childern + Children's Parents
370
# Bayesian Networks - Representations Are loops Allowed in Bayesian Networks?
NO DAG is standard Bayesian Network, so only a net of up to 3 nodes only allow for the 4 Standard structures
371
# Bayesian Networks - Representations What is Joint Distribution of a Indirect Cause?
P(A,B,C) = P(A)P(B|A)P(C|B)
372
# Bayesian Networks - Representations What is the Joint Probability distribution of Common effect?
P(A,B,C) = P(A)P(B)P(C|A,B) ## Footnote if C nor any of its descendents are observed then A & B are independent (IF WE CONFIRM ONE CAUSE OF OBSERVATION, REDUCES THE NEED TO INVOKE ALT CAUSE)
373
# Bayesian Networks - Representations What is the Joint Probability of Common Cause?
P (A,B,C) = P(A)P(B|A)P(C|A) | If A is observed then B & C are independent
374
# Bayesian Networks - Representations What is d-Separation?
A node is independent of its ancestors given its parent
375
# Bayesian Networks - Learning & Inference What is Inference in Bayesian Networks?
Means of computing probabilties in a Bayesian Network ## Footnote A.K.A QUERY => Answering questions about the underlying probability distribution
376
# Bayesian Networks - Learning & Inference What 2 Question does Inference answer?
- **Diagnosis**: Given an effect, infer a cause - **Prediction**: Given a cause, infer an effect
377
# Bayesian Networks - Learning & Inference What are the 2 Applications of Inference?
- **Classification**: Finding the most probable class - **Decision Making**: Computing expected utility of different actions
378
# Bayesian Networks - Learning & Inference What are the 3 Categories of Variables?
- **Evidence**: E = {E1,...,En} (Observed/Know Data) - **Query**: X = {X1,...,Xn} (Variables we want to infer) - **Non-evidence**: Y = {Y1,...,Yn} (Neither known nor wanted variable that must be dealt with) ## Footnote The Union of these variables is the complete set of variables of a Bayesian Network
379
# Bayesian Networks - Learning & Inference What is Unconditional & Condtional Probability Inference?
**Unconditional:** Compute P(X) marginal Probability **Conditional:** Comput P(X|E) given Evdience
380
# Bayesian Networks - Learning & Inference What is MAP Inference?
Find the most probable value for query X given evidence E = e MAP(X|E =e) = arg max{x} P (X = x | E=e) ## Footnote - We want to classify an object into category Y - Used in ML Classification
381
# Bayesian Networks - Learning & Inference What is Exact Inference?
- Computes the posterior probability distribution/ The exact value of P(X|E) P(X|E) = P(X,E)/P(E) = α * P(X,E) = α * Σ {Y} P (X,E,Y) where α = 1/P(E) | Means Computing preciese Probability ## Footnote a => Normalisation Constant Using marginalisation we sum over non-evidence variables
382
# Bayesian Networks - Learning & Inference What is Marginalisation & the Calculations?
- Process of summing over non-evidence Variables **Calculations:** - Discrete variable, we sum over their possible values - Continuous Variable, we Integrate instead
383
# Bayesian Networks - Learning & Inference What is the Marginal Distribution?
Formed by calculating the subset of a larger probability distribution of a collection of random variables
384
# Bayesian Networks - Learning & Inference What is the Computational Complexity of Exact Inference?
If there are n non-evdince variables & each has m values, the complexity is **O(nm^n)** | The Exponential, making large networks impractical ## Footnote n => No.non-evidence Vars m => No. of values each Vars can take
385
# Bayesian Networks - Learning & Inference What is Inference by Enumeration?
- Infer the posterior prob by marginalisation of the joint distribution - We comput exact value of P(X|E) using : (Σ{y} P(X,E,Y)) / P(E)
386
# Bayesian Networks - Learning & Inference Summarise Exact Inference?
1) Inference in Bayesain Network answers diagnostic & predicitve questions 2) Exact Inference can be performed using enumeration & marginalisation 3) Computational complexity is HIGH, so more efficent Algorithms Exist
387
# Search & Constraint Satisfication Problems What is a Search? ## Footnote Simple
- Not about Prediction - But about choice & decision-making, such as planning & assigning - Trying to find the optimal solution - To Minimise/Maximise something - The process of navigating from the start state to a goal state by transitioning through intermediate states.
388
# Search & Constraint Satisfication Problems What is a Search? ## Footnote Leccture Definition
Computational problem where the goal is to find a solution, that transforms an inital state into a goal state, by exploring a space of possible possible solutions
389
# Search & Constraint Satisfication Problems What does a search problem Consists of?
**State Space**: All possible states **Start State**: Where search begins **Goal State**: Desired state the agent is looking for **Goal Test**: Whether the goal state is achieved or not **Successor Function**: Given a certian state, how to get to the next state **Solution**: is a sequence of actions which transforms the start state to a goal state
390
# Search & Constraint Satisfication Problems What is a State space graph?
A graph to represent all possible states & transitions between them - *Nodes represent states* - *Arcs/Edges represent transitions {From state where can you go}*
391
# Search & Constraint Satisfication Problems What is a Search Tree?
Process of solving the search problem can be abstracted as a search tree - * Start state is root node* - *Children corresponds to successors* - *Nodes show states, but correspond to plans that achieve those states* - **Shows all the possible paths from start to end/goal nodes**
392
# Search & Constraint Satisfication Problems Explain the General Search Process for Search Tree?
1) Start with initial state (*Root nide in the search tree*) 2) Expand a node (*Find all possible next moves*) 3) Pick which node to expand next (*Depends on the search stategy*) 4) Check if it's a Goal State: - **YES** *(Return Solution)* - **NO** *(Continue Expanding)* 5) Repeat until we find the solution or run out of nodes
393
# Search & Constraint Satisfication Problems What is a Search Tree?
Method for solving search problems by expanding nodes in a systematic way ## Footnote We explore the state space step by step
394
# Search & Constraint Satisfication Problems What is the need for a Search Tree?
- Solving problems, often we have many choices at each step - Tress search help organise these choices into a structure - Ensure we don't miss any solutions & find the best one when neccessary
395
# Search & Constraint Satisfication Problems What is Depth First Search DFS?
- Explore deepest nodes first before backtracking - Keeping looking at child (left node) until reaching the leaf which is either a goal or not ## Footnote Uses a Stack Keep track of nodes (LIFO)
396
# Search & Constraint Satisfication Problems What is Breadth First Search BFS?
- Explores all the nodes at the current depth/level before going deeper - Guarantees to find the shortest path - Good/ideal for shallow goals Order of expandtion is left to right, unless stated otherwise | Uses a Queue to keep track of node (FIFO) ## Footnote Uses more memory than DFS as needs to store all nodes at a level - Slow for deeper goals
397
# Search & Constraint Satisfication Problems When BFS successful?
We expand goal state, rather than finding it
398
# Search & Constraint Satisfication Problems Step-by-Step DFS?
1) Start at the intial state 2) Pick one path & follow it all the way down 3) If you reach a dead-end, backtrack & try another path 4) Repeat until you find the goal state
399
# Constraint Satisfication Problems What are CSPs?
Problems where the goal is the to find a valid assignment of values to variables while satisfying a set of constraints.
400
# Constraint Satisfication Problems What are the types of Searh Problems?
- Planning - Identification
401
# Constraint Satisfication Problems What are Identification Search Problems?
- Care about the Goal itself, not path DON'T CARE ABOUT ORDER ONLY THE FINAL GOAL
402
# Constraint Satisfication Problems What are Planning search problems?
- Sequence of actions ONLY CARE ABOUT THE PATH TO THE GOAL
403
# Constraint Satisfication Problems Explain CSPs?
Identification problems have cosntraints to be satisfied, but there are no preferences. - Has constraints to be met - No Preferences (No Objectives) - No maximise/Minimise required
404
# Constraint Satisfication Problems Constraints in CSPs are?
Hard Constraints, which legal solutions cannot violate
405
# Constraint Satisfication Problems Preferences in CSPs are?
Soft Constraints, where we need to optimise *DON'T NEED TO MINIMISE/MAXMISE BUT STILL THERE*
406
# Constraint Satisfication Problems What do CSPs consist of?
- ** A set of variables *(X1,X2,...)* ** *Things we need to assign values to* - **A Domain for each Variable *(D1,D2,...)*** *Possible values for each variavle* - **Set of Constraints** *Rules that limit which values can be assigned*
407
# Constraint Satisfication Problems In CSPs when is an assignment complete?
***When every variable has a value***
408
# Constraint Satisfication Problems In CSPs, what is a partial assignment?
***Not all variables have a value***
409
# Constraint Satisfication Problems Explain differences between Standard Search Problems vs CSPs?
**Standard Search Problems:** - State is a "black-box" :*Arbitrary data structures* - Goal Test can be any function over states - Focuses on planning (Finding Sequences) - Path to solution is important - Explores every possible soultion 1-by-1 - Uses methods like DFS,BFS/A* **CSPs:** - State is defiend by variables - Value of variables are given by the domain - Goal Test is a set of constriants specifying the allowed combinations of the values of variables *Allowed Domain values for a variables that meet the constraints* - The Final Goal is most important - Constraint Based Pruning: Prunes the search space by eliminating infeasible values early. - Uses methods like: - Backtracking Search(uses pruning) - Forwards Checking (eliminating inconsistent values early ) - Local Search
410
# Constraint Satisfaction Problems What are Constraint Graphs?
Visual representation of relationships between variables & constraints - Easier to visualise - Helps solves problems using graph base techniques | - Node = Variable - Edge/Arc = Constraints
411
# Constraint Satisfaction Problems What do you do when a constraint relates to more than two variables?
Square to represent a constraint, & connect all the variables involved
412
# Constraint Satisfaction Problems What are Variables in CSPs
- Something that need to be assigned a value (DOMAIN) - Represent the unknowns you are trying to determine in a given problem.
413
# Constraint Satisfaction Problems What are Values in CSPs?
Set of possible value that each variable can take
414
# Constraint Satisfaction Problems What are Constraints in CSPs?
- Defines a realtionship between values - Restricts the possible values that can be assigned to them
415
# Constraint Satisfaction Problems What should we consider when selecting the domain (IF NOT GIVEN)?
Consider: - How to create a constraint - What we are trying to solve/ the task
416
# Constraint Satisfaction Problems How to pick a variable for CSPs?
Set Variables: - ** Something related to goals / problem** *e.g. n-queens : goal is to assign a column for each queen, s.t. 2 queens don't threaten each other* - ** May need to consider the size of the domain** *e.g. sudoku, each variable has a domain size of 9 (1-9)* *e.g Optimisation Problem, variables are real numbers, the domain is infinite, but u may have to discretize (**Continuous to Discrete**) it for practical computation* - **How the constraints can be expressed.** *e.g. sudoku, constraint is each could be that each row, column & subgrid must contain all numbers 1-9 without repetiton. **GLOBAL CONSTRAINTS GOVERNS THE REALTIONSHIPS BETWEEN ALL VARS***
417
# Constraint Satisfaction Problems What are the variety of variables in CSPs?
**Finite Domains**: - Discrete - Has limited set of values - E.g Sudoku, Minesweeper, & Map Colouring **Infinite Domains**: - Discrete / Continuous - E.g. Variables that involve time/ Numbers (REAL) - Example scenarios: Optimisation Problems, Temperature in different rooms etc.
418
# Constraint Satisfaction Problems What are the Variety of Constraints in CSPs?
**Unary**: - One Variable - e.g. a = 1 **Binary**: - It's between 2 Variables - e.g. a != b **High-Order Constraints**: - Involves 3+ variables
419
# Constraint Satisfaction Problems Why use CSPs?
They elminate invlaid assignments early to avoid brute-force search
420
# Constraint Satisfaction Problems Why are CSPs difficult search problems?
If CSP has n Variables, size of each domain is d, then there are O(d^n) complete assignments. - Larger the search space the more time and memory are needed to explore all possibilie. - Large d & n values, becomes practically impossible to exhaustively explore all assignments - Brute-force become inefficeint for large n & d values, naive search method to search all possible combinations of assignments is exponentially slow as no.of variables increases - Size of problem grow, intractability increases, time to solve problems grows quickly **INFINTE DOMAINS:** - Search space is UNBOUNDED, & there's no simple way to count the possible solutions, making them difficult to search effective
420
# Constraint Satisfaction Problems Can CSP problems consider preferences?
YES , they become **Constrained Optimisation Problems**
421
# Constraint Satisfaction Problems What are Real World CSPs?
- Assignment Problems (*Who teache which class*) - Timetabling Problems (*Class is offered when & where*) - Hardware configuration - Transportation Scheduling - Factory Scheduling - Circuit layout - Fault diagnosis
422
# Solving CSPs What are Cryptarithmetic?
- Constraint graph - A puzzle where the digits o numbers are represented by letters. - Each Letter Represents a unique digit - GOAL is to find the digits s.t a given equation is verified
423
# Solving CSPs by Systematical Search What is the Generate & Test Method / Exhaustive generate-and-test algorithm?
- Generate all complete assignments - Test each assignment in turn - Then return the first one that satisifes all constraints
424
# Solving CSPs by Systematical Search What is wrong with Generate & Test method?
- needs to store all d^n complete assignements Therefore: - Exponential Growth in memory - Generate failed constraints (wastes computation)
425
# Solving CSPs by Systematical Search How are CSPs Solved using Standard Search formula?
In CSPs, states are defined by the values assigned so far *{PARTIAL ASSIGNMENTS}* - **Inital State:** the empty assignment{} - **Successor Function:** assign a value to an unassigned variable - **Goal:** Current Assignment is complete & satisfies all the constraints
426
# Solving CSPs by Systematical Search How to solve CSPs by BFS?
- Explores all the shallowest nodes before going deeper - Solution or Goal state is ALWAYS in the bottom layer - BFS needs to traverse and explore all nodes to find the solution *{PARTIAL ASSIGNMENTS}*
427
# Solving CSPs by Systematical Search How to solve CSPs by DFS?
- Explore the deepest nodes, until your reach the goal node - Do not need to explore all nodes - Need to consider constraints as we explore
428
# Solving CSPs by Systematical Search How do you check constraints as you go?
- Only consider values which do not conflict previous assignments - A tie is broken alphabetically & numerically - May have to do computation like check the constraints, i.e. "*Incremental goal test*"
429
# Solving CSPs by Systematical Search What is Consider one variable at a layer?
- Variable Assignment are commutative, so fix ordering - One Variable changes at a time each at each layer - The order of assignments doesn't affect the correctness of the solution, but it can impact efficiency
430
# Solving CSPs by Systematical Search What is Backtracking?
DFS method with 2 additional things: 1) Check constraints as you go 2) Consider one variable at a layer
431
# Solving CSPs by Systematical Search Give an Example of Backtracking:
- A,B,C have domain {0,1,2} - Constraint: A < B < C - Start with empty assignment - Explore Different assignments checking the deeper node first, if it doesnt work go back and choose a different assignment
432
# Solving CSPs by Systematical Search How to improve Backtracking?
- **Filtering** - **Ordering** ## Footnote These are methods to speed up searches
433
# Solving CSPs by Systematical Search What does Filtering do in Backtracking?
Can we detect inevitable failure early - If assignment fails a constraint, you can cancle it Keeps track of domain for unassigned variables and crosses off bad/infeasible options
434
# Solving CSPs by Systematical Search Give an Example of Filtering:
Forward Checking
435
# Solving CSPs by Systematical Search What is Forward Checking?
Cross of values of neighbouring variables that violate a constraint when added to the exisiting assignment Assigns a variable, cross off anything that is now violated on all its neighbours' domains e.g.: Vars: A,B,C with domain {0,1,2} Constraints: B>A>C Assign A is 0, Domain of B & C are reduced to match the constraints B new domain is {1,2} C new domain is {} => NOT LEGAL as C is Empty
436
# Solving CSPs by Systematical Search What does Ordering do in Backtracking?
- Assigns first/next rather than alphabetically - Consider the Var with min number of values to explore [e.g. fewest legal values left in domain] e.g.: Vars: A,B,C with domain {0,1,2} Constraints: A <= B < C Assign A is 0, Domain of B & C are reduced to match the constraints B new domain is {0,1,2} C new domain is {1,2} Assign C as it has the fewest variable Repeat previous steps to find a solution
437
# Solving CSPs by Systematical Search Give an Example for Backtracking + Forward Checking + Ordering:
Vars: A,B,C with domain {0,1,2} Constraints: A <= B < C & A + B + C = 3 Assign A first with 0 (Alphabetical & Numerical Assignment) Then pick the next state with the smallest new domain and where assignment of A meets the constraints Repeat the Assignment until u find a solution if domain become {} the solution/ state is NOT LEGAL
438
# Solving CSPs by Local Search What is a Tree Search?
- Systematically search the space of assignment in a constructive way - Start with empty assignment - Assign a value to an unassigned variables & deal with constraints until a solution is found
439
# Solving CSPs by Local Search What is the Problem with Tree Search?
- Space too big & even infinite - In a reasonable time, systematic search may fail to consider enough of the space to give meaningful results
440
# Solving CSPs by Local Search What is Local Search Method?
Not Systmematically Search the space, but find solutions quickly on average - Start with arbitrary complete assignment (*Constraint can be violated/ Assignment can be invalid*) - Try to improve the assignment iteratively
441
# Solving CSPs by Local Search How are local searches used for CSPs?
- Randomly generate a complete assignment - Solution not found/criterion not met then follow the next steps: - S1) Explore neighbours of the assignment (Randomly choose assignment, look at neighbour) - S2) Choose the assignment that violates the fewest constraints & Repeat the above steps (*If Neighbour does not change or are worse than current one you don't move*) EACH ITERATION REDUCES THE NUMBER OF CONSTRAINTS VIOLATED Repeat the steps until criterion/ all constriants met or no valid options left to move to.
442
# Solving CSPs by Local Search Example of A Local Search:
**Hill Climbing:** Var: A,B,C with domain {0,1,2} CONSTRAINTS: A <= B < C - Randomly Generated intial assignment (e.g: A=1, B=1, C=1) - When solution not found/ stop criterion not met: S1) Explore all neighbours of assignment: (e.g: (A=2, B=1, C=1)( A=1, B=1, C=2)) S2) Choose the assignment that violates the fewest constraints ( A=1, B=1, C=2) - First instance not a solution as it doesnt meet constraint (B < C) - Neighbour is a better solution, and its neigbours are not better, then it is the solution the method gives.
443
# Solving CSPs by Local Search In Local Search is the solution given the best?
No, the solution can get stuck on a local maxima/minima where some constraints are violated. This is where the neighbours of the solution are not better than current one it gets stuck at the that solution and will not reach gloabl maxima/minima
444
# Solving CSPs by Local Search What is Optimisation?
Seeking the best/ Optimum solution (Violates no Constraints), not just a solution that violates a few constraint. Maximises/Minimises the objective function to find the best solution
445
# Solving CSPs by Local Search Give the Format / Definition of a Optimisation Problem:
- **Variables** *e.g.( X = {x1,x2,x3,...,xn})* - **Domain D(xi)** *Possible values for each variable* - **Constraints (C1,...,Cn) ** *Defining Realationships between variables* - **Objective function arg min/max f(X) ** *What we aim to Max/Min*
446
# Solving CSPs by Local Search What are Hard & Soft Constraints?
**Hard**: Must be satisfied to be valid **Soft**: Can be violated but assigned a cost to violation & aims to minimise them
447
# Solving CSPs by Local Search Can Tree Search Methods work for optimisation problems?
**Not Always** Due to optimisation problems sometimes having continuous search space
448
# Solving CSPs by Local Search Can Local Search Methods be Effective for Optimisation Problems?
Yes, work for both discrete and continuous search spaces
449
# Solving CSPs by Local Search Summarise Local Search for Optimisation Problems:
- Fast and Efficient - Deal with problems where the search is difficult to represent / formulate - Can be used in an online settign when the problem changes
450
# Solving CSPs by Local Search Give Examples of Local Search Methods:
- Hill Climbing - Simulated Annealing - Population-based Local searches
451
# Solving CSPs by Local Search What is Population-based Search helpful for?
Jumping out of local Optima
452
# Solving CSPs by Local Search Cons of Hill Climbing:
- No Guarantee to be COMPLETE or OPTIMAL - Can get stuck on local maxima & plateaus (RUN FOREVER IF NOT PROPERLY FORMULATED)
453
# Solving CSPs by Local Search Pros of Hill Climbing:
- Rapidly find good solutions by Improving over bad initial state (GREEDY) - Lower Time & Space Complexity compared to search algorithms - No Requirements of problem-specific heuristics (UNINFORMED) - Start from Candidate Solution, instead of building up step-by-step (UNLIKELY BUT POSSIBLE, SOLUTION PICKED AT RANDOM ) Run algorithm for a Maximum number of iterations m Variants of Hill climbing can sort out getting stuck on local maxima/ plateaus
454
# Solving CSPs by Local Search Variants of Hill Climbing:
- Stochastic Hill Climbing - First - Choice Hill Climbing - Random - Restart Hill Climbing
455
# Solving CSPs by Local Search Stochastic Hill Climbing :
- Variants Randomly selects a neighbour that involves an uphill move - Probability of picking a specific move can depend on the steepness - Converges slower than steepest ascent but can find higher solutions
456
# Solving CSPs by Local Search First - Choice Hill Climbing
-Randomly generates a single SUCCESSOR neighbour solution & moves to it if it's better than current solution - No UPHILL, keeps randomly generating solutions until there is an uphill move - After MAX num of tries OR generating all neighbours, hasn't found UPHILL move, it gives up & assumes that it's now at Optimal solution. - Time complexity of this is lower as not all neighbours need to be generates before 1 is picked (GOOD WHEN THERE ARE LOADS OF NEIGHBOURS FOR EACH SOLUTIONS)
457
# Solving CSPs by Local Search Random - Restart Hill Climbing:
- Generates a series of different hill climbing searches of the same problem from random to initial states - Stops when goal is found - Can be parallelised / Threaded easily so does not take much time on modern Computers - RARE to have to wait for this to happen
458
# Solving CSPs by Local Search What is the only Complete variant of Hill Climbing?
Random - Restart Hill Climbing It generates a solution if one exists, as eventually random start solution will be OPTIMAL SOLUTION
459
# Solving CSPs by Local Search Pros of Simulating Annealing:
- Find near Optimal Solutions in reasonable time (High Chance of going to GLOBAL MAX, But only Good as the formulation of Optimisation Problem) - Avoids getting stuck in poor LOCAL MAX & PLATEAU by combining EXPLORATION & EXLIOTATION (many solutions we concurrently work with at the end of EXPLORATION )
460
# Solving CSPs by Local Search Cons of Simulating Annealing:
- Not Guaranteed to be complete OR OPTIMAL (SENSITIVE TO FORMUALTION) - NOT reliable = can't guarantee completeness - Time & Space Complexity is problem & Representation- dependent
461
# Solving CSPs by Local Search Formulating Optimisation Problems:
min / max f(x) => min/max objective functions s.t g_i(x) <= 0, i = 1,...,m h_i(x) <= 0, i = 1,...,n => Feasibility Constraint ==> x is the DESIGN VARAIBLE (can be anything) SEARCH SPACE is space of all possible x values
462
# Solving CSPs by Local Search What is Search Space of a Problem:
is the space of all possible x values ALLOWED by constraints in CANDIDATE SOLUTIONS
463
# Solving CSPs by Local Search Define Explicit Constraint:
Explicitly mentioned CANNOT BE ASSUMED
464
# Solving CSPs by Local Search Define Implicit Constraint:
Rules that must be in place by the problem definitions in order for solutions to be CONSIDERED FEASIBLE e.g: x,y > 0
465
# Solving CSPs by Local Search What is the objective function?
-Takes design variables as an Input - Outputs a NUMERICAL value that problem aims to MININMISE OR MAXIMISE CAN have Multiple Objective functions in Formulation Defines the Cost or Quality of a Solution
466
# Solving CSPs by Local Search What are Constraints in Formulating Operation Problems?
-Constraints that design variables must satisfy for the solution to be FEASIBLE -Usually depicted by functions that take the design variables as input and output a numeric value. -They specify the values that these functions are allowed to take for the solution to be feasible. -There may be zero or more constraints in a problem DEFINES THE FEASIBILITY OF THE SOLUTION
467
# Solving CSPs by Local Search Benefits of Local Search?
DO NOT keep track of the paths or States that have been visited NOT systematic ,but PROS are: - Use very little Memory - Find Reasonable solutions in Large or Infinite State Space
468
# Solving CSPs by Local Search What are Local Search Algorithms?
Optimisation Algorithms that operate by searching from initial state to neighbouring states
469
# Solving CSPs by Local Search Purpose Of Hill Climbing?
TO find & Reach Global Maximum
470
# Solving CSPs by Local Search Purpose of Gradient Descent
TO find & Reach Global Minimum
471
# Solving CSPs by Local Search Why is Hill Climbing Greedy?
Does not look beyond the immediate neighbours of the current state
472
# Solving CSPs by Local Search What are the 3 Components of Hill Climbing we must design?
-Representation -Initialisation Procedure -Neighbourhood Operator
473
# Solving CSPs by Local Search What is Representation?
How to store Design variables in the problem(s) Should facilitate the Application of the Initialisation Procedure
474
# Solving CSPs by Local Search What is Initialisation Procedure?
How to pick initial solution. USUALLY RANDOM, Can Be Heuristic
475
# Solving CSPs by Local Search What is Neighbourhood Operator?
How to generate Neighbourhood Solutions (INCREMENT/STEP SIZE)
476
# Solving CSPs by Local Search Performance of Hill Climbing:
Completeness: No, Depends on problem formulation & design of the algorithms (GET STUCK ON LOCAL MINIMA) OPTIMALITY: Not Optimal, (GET STUCK ON LOCAL MINIMA) Time: O(mnp) m = MAX no. of iteration, n = MAX no. of neighbours, EACH take O(p) to generate Space: O(nq+r) ==> r is a constant so ==> O(nq) n = MAX no. of neighbours, Variable takes O(q) and r represents the space to generate the neighbours sequentially(NEGLIGIBLE COMPARED TO n & q)
477
# Search & Constraint Satisfication Problems Step-by-Step BFS?
1) Start at the intial start 2) Expand all possible moves at the first level 3) Move to the next level & repeat 4) Continue until the goal is found
478
# Games What are the main ideas for Games?
NOT JUST FOR FUN - *Formal model of decision-making between agents* - Each player has a set of possible actions - Resulting outcome depends on the combination of everyone's choices - Each player has a preference over outcomes
479
# Game What type of search is a Game?
Typically a **ADVERSARIAL SEARCH** problem where your opponent has soemthing to say about your strategy
480
# Game Give Types of Games:
- Deterministic - Stochastic - No.of player : Single/2/Multiple - Zero-sum - Non-zero-sum - Perfect Information - Imperfect Information
481
# Game What does Deterministic mean?
- The next state completely determined by the current state & Action - NO RANDOMNESS IF U KNWO EXACTLY WHAT WILL HAPPEN - E.g: Tic-Tac-Toe, Chess, Go
482
# Game What does Stochastic mean?
Next State is uncertain, actions have random outcomes e.g. Poker, mahjong
483
# Game How many players can a game have?
- One (NO Adversary) *e.g. Soliatre* - Two *e.g. Chess, Go* - 2+ *e.g. Poker, Mahjong*
484
# Game What are Zero Sum games?
Games where one player wins and another loses *Utilities are negatives of each other* ## Footnote e.g. Chess, go, poker
485
# Game What are non-zero-sum games?
Games where players can both gain (win) & lose *Where cooperation is allowed/ required* ## Footnote e.g Prisoner's Dilemma
486
# Game What is Perfect Information?
Plauers can see everything, so its more strightforward as all the information is avaliable ## Footnote e.g. Tic-Tac-Toe, Chess ,Go
487
# Game What is imperfect Information?
Parts of the game state is hidden, therefore requires strategies like belief states & probalisitic reasoning HARD TO SEE NEXT OPTIONS OR STATES ## Footnote E.g. Poker, Mahjong Poker (Don't know anyone else's cards)
488
# Game Give the Formalisation of a Game:
**States: S** - Intial State/ Root of game tree **Action: A** - Returns all the possible legal moves from State s **Transition Function: S * A => S** - Given State S & Action A, this returns the new state after the move. - From Current state how to get to another state **Terminal Test: S => (true,false)** - Boolean that checks if its an end state - Leaves of game tree, where we assigns final utility values **Players: P = (1,...,N)** - Alternates through players - Return who's turn is next/ currently **Utilities: S{terminal} X P => R** - AKA Objective Function - The final Numeric value of the terminal state S from perpective of a player P - Scores that minimax uses to progate values up a tree to decide players best moves
489
# Game What is the Goal for a search Algorithm?
To find a strategy (Policy) which recommends a move for each state, in order that they can end up with the max achievable utiltity *Whatever state we are in we want the computer to tell us the best next state to go to*
490
# Game What is value of state?
Best achievable outcome (utility) from that state *IS IT A GOOD STATE OR NOT*
491
# Game How to draw a Game Tree?
- Draw the Current State (Initial State) & assign a player - Draw successor for intial state for other player - Draw successor for all States until reaching Final/ Terminal states/ leaves - Then working backwards from leaves to root assign each state a utility value *Remember the utility is based on a player's perspective and the other player will affect them negatively*
492
# Minimax What is minimax value of a node/state?
Utility of the terminal to which both player play optimally from that node *(PLAY BEST MOVES FOR THEMSELVES)*
493
# Minimax What is the process of 2 player game for Minimax?
- One player has to maximise utility - Another player has to minimise the Utility of the other player
494
# Minimax Give the Function for minimax:
**function** minimax_val (state) **return **its minimax value if *state* is a terminal state **return** its utility if *state* is for agent MAX to take an action **return** max_val(*state*) if *state* is for agent MIN to take an action **return** min_val(*state*) **function** max_val (state) **return **its minimax value v intialise v = NEGATIVE INFINATE **for** each successor of *state* v = max (v,minimax_val(successor)) **return** v **function** min_val (state) **return **its minimax value v intialise v = INFINATE **for** each successor of *state* v = max (v,minimax_val(successor)) **return** v
495
# Minimax Explain how minimax algorithm works?
Game tree has alternating layers of Max and Min nodes: - Max nodes (your turn) choose the highest value. - Min nodes (opponent's turn) choose the lowest value. - Start from the leaf nodes (end of game states with known values). Recursively propagate values up the tree: - At each node, pick the max or min of the child values depending on its type. At the root, you get the best possible value assuming: - You play optimally, and - Your opponent also plays optimally. Key Notes: - Acts like Depth-First Search (DFS): - Explore all paths left to right, assign values bottom-up. - At the root: - Initially assign the value of the leftmost child. - As you explore other children, only update if a better (max/min) value is found.
496
# Minimax Explain the Implementation of Minimax?
- Explores all the successors to find the max value the min player will choose - Explores successors & find the path to the best solution depending on minimax
497
# Minimax Give a Summary for Minimax:
Usually Used in *Deterministic, zero-sum games* - Our win is our adversary's loss - Player 1 maxes its own utlity and other player minimises it Search: - A state-space serch tree - Players Alternate turns (MAX/MIN) - Compute each node's minimax value (*Best achievalbe utility against an optimal adversary (ASSUMES OPPONENT WILL PLAY OPTIMALLY)*)
498
# Minimax What is the Computational Complexity of Minimax?
Minimax use DFS (*So explores the deepest nodes first*) **Time:** - complexity: O(b^m) - b = branching factor (*no of successors a branch may have*) - d = Max depth **Space:** - Complexity: O(bm) This computatuonally expensive, it will take too much time and space to search using minimax Solving game Trees are completely infeasible in most cases: - Deep Searches, requires all the nodes to be evaluated, and searching millions of nodes takes too much time - Bad or bad branches will still be explored wasting time and memory - Storing all the explored nodes will take alot of memory
499
# Game Tree Pruning How does Game Tree Pruining Work?
Purpose: To reduce the number of nodes explored in the game tree during Minimax search without affecting the final decision. How it works: During exploration, some branches are cut off if they cannot influence the final decision. This happens when a node proves to be worse than a previously examined move and therefore doesn't need further exploration.
500
# Game Tree Search How can node be worse?
If the parent has a constraint and the succesor has its own cosntraint and the domain has no overlapping values e.g. root from another branch get v>= 3 Exploring the next succssor you get v<=2 There are no overlapping values in the domains of either, If there were it would check the next leaf node to see it the roots successor value will change(New min value)
501
# Game Tree Search What is v in minimax?
The minimax value
502
# Game Tree Pruning What is Alpha in Alpha-Beta Pruning?
**The best value that MAX can guarantee** - It represents a lower bound, MAX will never accept a move worse than ths
503
# Game Tree Search What is Beta in Alph-Beta Pruning?
The Upperbound of the minimum value that the maximising player is assured off
504
# Game Tree Search In Alpha Beta Pruning, What happens at MIN Nodes?
- Compute min_val(node) - The Goal for MIN is to mimimize the socre, s.t. it chooses the smallest possible values among its children
505
# Game Tree Pruning When do we Prune, during Alpha-Beta Pruning?
- While exploring children of a node, we find a node that never gives us a better value than α - As a result we stop searching the children
506
# Game Tree Pruning What are the Properties of Alpha-Beta Pruning?
**Pruning has not effect on Minimax value for the root** - Doesnt change final value - Sppeds up Computation, best moves stay the same **Good Child ordering improves effectivness** -Evalutes best move first, pruning becomes much more effective *Best move picked first, pruning occurs earlier and more often* *Worst move first, pruning occurs later and not as frequent* **Complexity of perfect ordering : O(b^(m/2))** - Can look twice as deep as your opponent - Improved efficeincy - Can explore a greater depth than opponent/ normal minimax - Full Serach of many games is hopless, *i.e. chess, go* due to the exponential growth of possibilites