AI 2 Flashcards

1
Q

What is a maximum likelihood estimate?

A

The parameter value under which the observed outcome is most likely to occur. For example, if we have a likelihood function that takes the shape of a normal distribution when the likelihood of the observed successes/tries is plotted against each candidate probability, the peak of the curve marks the most likely parameter value and represents the MLE

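A minimal sketch in Python (the 7-successes-in-10-tries data and the grid search are assumptions for illustration): scan the binomial likelihood and read off its peak.

import numpy as np
from scipy.stats import binom

successes, tries = 7, 10                 # hypothetical observed data
p_grid = np.linspace(0.001, 0.999, 999)  # candidate success probabilities
likelihood = binom.pmf(successes, tries, p_grid)
p_mle = p_grid[np.argmax(likelihood)]    # the peak of the likelihood curve
print(p_mle)                             # ~0.7, i.e. successes/tries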
2
Q

What is a cost function, and what is the cost function usually for a given likelihood function L(θ|x)?

A

A cost function measures the discrepancy between an estimate of a function and the actual function. The cost function is usually -log(L(θ|x))

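Continuing the hypothetical binomial sketch above, a quick check that minimising -log(L) recovers the same estimate as maximising L:

import numpy as np
from scipy.stats import binom

p_grid = np.linspace(0.001, 0.999, 999)
cost = -np.log(binom.pmf(7, 10, p_grid))  # cost = -log(L(θ|x))
print(p_grid[np.argmin(cost)])            # ~0.7, same as the MLE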
3
Q

What is supervised learning?

A

Taking a dataset of input/output pairs and creating a function that describes them with as little discrepancy as possible. This is used for classifying or predicting data.

4
Q

What is unsupervised learning?

A

Structuring data with multiple dimensions but no specific output. Unsupervised objectives may be clustering, dimensionality reduction or anomaly detection

5
Q

How do we define a stationary point (w*) on a multidimensional function?

A

∂g/∂w₁(w*) = 0
∂g/∂w₂(w*) = 0
⋮
∂g/∂wₙ(w*) = 0
∴ if the point is stationary, the partial derivative with respect to each variable is 0

6
Q

What value does gradient descent use for the direction of updating the parameters?

A

The negative of the gradient of the cost function

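A minimal sketch (the quadratic cost g and the learning rate are assumptions, not from the card): each step moves against the gradient.

import numpy as np

def grad(w):
    # analytic gradient of g(w) = (w1-3)² + (w2+1)²
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)           # initial parameters
lr = 0.1                  # assumed learning rate
for _ in range(100):
    w = w - lr * grad(w)  # step in the direction of the NEGATIVE gradient
print(w)                  # approaches the minimum at (3, -1)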
7
Q

What equation expresses the conditional probability P(yᵢ=1|xᵢ) of the binary dependent variable yᵢ being 1, given the parameters θ, in logistic regression?

What are the odds?

A

σ(xᵢ) = 1/( 1+exp(−θ⋅xᵢ) )
where θ is the parameter vector for the independent variable vector xᵢ

odds = exp(θ⋅xᵢ)

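A quick numeric check (θ and xᵢ are made-up values): the sigmoid gives the probability, and exp(θ⋅xᵢ) matches p/(1−p).

import numpy as np

theta = np.array([0.5, -1.2])   # hypothetical parameters
x_i = np.array([1.0, 2.0])      # hypothetical input vector

p = 1.0 / (1.0 + np.exp(-theta @ x_i))  # P(yᵢ=1 | xᵢ)
odds = np.exp(theta @ x_i)              # exp(θ⋅xᵢ)
print(p, odds, p / (1 - p))             # the last two agree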
8
Q

How do you write the function that represents the chance of a point having y=1, based on a dataset of input vectors and coefficients?

A

hθ(X) = P(Y=1|X;θ) = 1/( 1+exp(−θ⋅X) )
where capital X and Y represent the inputs and outputs of all the points in the training set

The ';' in X;θ means the probability is conditioned on X and parameterised by θ

9
Q

What is the likelihood function for logistic regression?

A

L(θ|y;x) = P(Y|X;θ) = Π(i=1→N) P(yᵢ|xᵢ;θ)
where P(yᵢ|xᵢ;θ) is hθ(xᵢ) if yᵢ=1 and 1−hθ(xᵢ) if yᵢ=0, i.e. hθ(xᵢ)^yᵢ · (1−hθ(xᵢ))^(1−yᵢ)

10
Q

How do we calculate the cost function for logistic regression?

A

−log(L(θ|y;x)) = −Σ(i=1→N) [ yᵢ log(hθ(xᵢ)) + (1−yᵢ) log(1−hθ(xᵢ)) ]
(the yᵢ and 1−yᵢ coefficients ensure that only the relevant term is used and the other is 0)

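A small sketch of the cost on a made-up three-point dataset (X, y and θ are assumptions):

import numpy as np

def h(theta, X):
    return 1.0 / (1.0 + np.exp(-X @ theta))  # hθ(x) for every row of X

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])  # bias column then feature
y = np.array([0.0, 0.0, 1.0])                       # binary outputs
theta = np.array([-2.0, 1.0])                       # hypothetical parameters

p = h(theta, X)
cost = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(cost)   # the negative log-likelihood of θ on this data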
11
Q

What do we need in the dataset for logistic regression to work?

A
  • Binary output
  • Independent variables
  • Variables have low multicollinearity (variables unrelated)
  • There is a linear relationship between the log odds and the variables, i.e. θ is made up of constant coefficients: log( p/(1−p) ) = θ₀ + θ₁x₁ + …
  • Large sample size. Rule of thumb: the sample needs more than 10 × (number of variables) / p cases, where p is the probability of the rarer y outcome, e.g. p = 0.1 if y=0 has a 10% chance
12
Q

What are the three axioms of measuring the information of a given event?

A
  • An event with probability of 100% yields no information
  • The less probable an event is, the more information it yields
  • If two independent events are measured separately, the information gathered is the sum of both informations
13
Q

How is information from an event measured?

A

Iₓ(x) = logₐ[ 1/Pₓ(x) ]
where Pₓ(x) is the probability of x being the value it is.
Information calculated with a=2 is measured in bits, with a=e in natural units (nats), and with a=10 in dits, bans or hartleys

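A two-line check (the event probability 0.25 is an assumption): the same information in bits and in nats.

import math

p = 0.25                 # hypothetical event probability
print(math.log2(1 / p))  # 2.0 bits: as informative as two fair coin flips
print(math.log(1 / p))   # the same quantity in nats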
14
Q

How do you calculate logit(x) (log-odds)?

A

logit(p) = log( p/(1−p) ) = log(p) − log(1−p) = log(P(x)) − log(P(¬x))
where p is the probability of event x

15
Q

What is entropy and how is it calculated (discrete data)?

A

Entropy is the uncertainty in a random variable X.
H(X) = E[Iₓ(x)] = −Σ(i=1→n) P(X=xᵢ) · logₐ(P(X=xᵢ))

xᵢ ranges over all the values X can take; for example, a die would have xᵢ ∈ {1,…,6}

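A sketch using the die example from the card:

import math

probs = [1 / 6] * 6  # P(X = xᵢ) for a fair die, xᵢ ∈ {1,…,6}
H = -sum(p * math.log2(p) for p in probs)
print(H)             # log₂(6) ≈ 2.585 bits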
16
Q

What is joint entropy and how is it calculated?

A

Uncertainty of two variables X and Y being the values they are.
H(X,Y) = −Σ(i=1→n) Σ(j=1→m) P(X=xᵢ, Y=yⱼ) · logₐ(P(X=xᵢ, Y=yⱼ))

17
Q

What is conditional entropy and how is it calculated?

A

Uncertainty of a random variable Y given the outcome of another random variable X
H(Y|X) = −Σ(i=1→n) Σ(j=1→m) P(X=xᵢ, Y=yⱼ) · logₐ(P(Y=yⱼ | X=xᵢ))
H(Y|X) = H(X,Y) − H(X)

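A sketch computing both the joint entropy from the previous card and the conditional entropy, from a small assumed joint table, checking H(Y|X) = H(X,Y) − H(X):

import numpy as np

pxy = np.array([[0.3, 0.1],   # assumed p(x, y) for x, y ∈ {0, 1}
                [0.1, 0.5]])
px = pxy.sum(axis=1)          # marginal p(x)

H_xy = -np.sum(pxy * np.log2(pxy))                       # joint entropy H(X,Y)
H_x = -np.sum(px * np.log2(px))                          # H(X)
H_y_given_x = -sum(pxy[i, j] * np.log2(pxy[i, j] / px[i])
                   for i in range(2) for j in range(2))  # H(Y|X)
print(H_y_given_x, H_xy - H_x)                           # the two agree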
18
Q

What is relative entropy (Kullback-Leibler divergence) and how is it calculated?

A

A measure of how well one distribution Q approximates another distribution P.
Dₖₗ(P||Q) = Σ(x=1→n) P(x)·log[ P(x)/Q(x) ]
where P and Q are two probability distributions over the same discrete random variable X

probably don't need in exam

19
Q

What is:
* Cross entropy
* Jensen-Shannon divergence

A

Cross entropy: H(P,Q) = H(P) + Dₖₗ(P||Q)
Jensen-Shannon divergence: JSD(P||Q) = 0.5·Dₖₗ(P||M) + 0.5·Dₖₗ(Q||M)
where M = 0.5(P+Q)

probably don't need this in exam

20
Q

What is mutual information and how is it calculated?

A

How much two distributions share information or how much knowing one of the variables reduces uncertainty about the other
I(X;Y) = Σ(x=1→n) Σ(y=1→m) p(x,y)·log( p(x,y) / (p(x)·p(y)) )

21
Q

How can we find mutual information based on entropy values?

A

I(X;Y) = H(X) - H(X|Y)
OR
I(X;Y) = H(Y) - H(Y|X)

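A sketch checking both routes on the same kind of assumed joint table: the double-sum definition against H(Y) − H(Y|X):

import numpy as np

pxy = np.array([[0.3, 0.1],   # assumed p(x, y) for x, y ∈ {0, 1}
                [0.1, 0.5]])
px = pxy.sum(axis=1)          # marginal p(x)
py = pxy.sum(axis=0)          # marginal p(y)

I_def = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))         # definition
H_y = -np.sum(py * np.log2(py))
H_y_given_x = -sum(pxy[i, j] * np.log2(pxy[i, j] / px[i])
                   for i in range(2) for j in range(2))
print(I_def, H_y - H_y_given_x)                          # the two agree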
22
Q

What is information gain?

A

The reduction in uncertainty about the target variable Y when splitting on node X while traversing the decision tree. Written as IG(Y,X) and is the same as the mutual information of X and Y: I(X;Y)

23
Q

How are nodes chosen to be higher or lower in the decision tree?

A

Characteristics (A) with higher information gain about the random variable (Y) are chosen to be higher. This is done by calculating H(Y), then with all the characteristics (A, B, C, …) calculating the conditional entropy after each split by characteristic ( H(Y|A), H(Y|B), … ), and selecting the one with the smallest conditional entropy, i.e. the greatest information gain H(Y) − H(Y|A).

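A sketch on a made-up six-row dataset (features A and B and target y are assumptions): compute the information gain for each characteristic and pick the larger.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, target):
    # IG(Y, X) = H(Y) - H(Y|X), with H(Y|X) as a weighted sum over splits
    h_cond = sum((feature == v).mean() * entropy(target[feature == v])
                 for v in np.unique(feature))
    return entropy(target) - h_cond

A = np.array([0, 0, 1, 1, 1, 0])   # hypothetical characteristic
B = np.array([0, 1, 0, 1, 0, 1])   # hypothetical characteristic
y = np.array([0, 0, 1, 1, 1, 0])   # target; here it tracks A exactly
print(info_gain(A, y), info_gain(B, y))  # A wins, so A sits higher in the tree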
24
Q

What is feature selection and how does it work?

A

There may be many characteristics, so only the top x are selected to form the decision tree. This is done by repeatedly adding the characteristic with the highest [mutual information with the target − c · Σ mutual information with the already-chosen variables] until x variables are chosen.

25
Q

How do we calculate p(x|y)?

A

p(x,y)/p(y)

26
Q

What is full joint distribution and how do we calculate it in a Bayesian network?

A

It is a measure of how likely a given combination of values of the variables is, accounting for how the variables depend on each other
P(X₁, X₂, X₃, … Xₙ) = Π(i=1→n) P(Xᵢ | parents(Xᵢ))

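A tiny sketch for an assumed two-node network A→B (A is B's only parent): the joint is the product of each node's conditional given its parents.

p_a = {True: 0.2, False: 0.8}                  # P(A), assumed
p_b_given_a = {True: {True: 0.9, False: 0.1},  # P(B|A), assumed
               False: {True: 0.3, False: 0.7}}

def full_joint(a, b):
    # P(A, B) = P(A) · P(B | parents(B)) = P(A) · P(B | A)
    return p_a[a] * p_b_given_a[a][b]

# sanity check: the factorised joint sums to 1 over all assignments
print(sum(full_joint(a, b) for a in (True, False) for b in (True, False)))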
27
Q

What is conditional independence?

A

When two variables A and B are independent given C.
P(A,B|C) = P(A|C)*P(B|C)

28
Q

What is the Markov condition of a Bayesian network?

A

Each random variable X is conditionally independent of its non-descendants, given its parents

29
Q

Which are the only 4 structures allowed in standard Bayesian networks?

A
  • Direct cause: A→B
  • Indirect cause: A→B→C
  • Common cause: A←B→C
  • Common effect: A→B←C
30
Q

What are constraints and preferences in search problems?

A

Constraints are rules that legal solutions have to abide by. Preferences are values that need to be minimised to make a solution better.

31
Q

What is a constraint satisfaction problem?

A

A problem with no preferences, where a solution is any combination of values that satisfies all the constraints

32
Q

What makes up a CSP?

A
  • Variables e.g. squares on a chess board
  • Domain e.g. {0,1} denoting if the square contains a queen or not
  • Constraints e.g. no more than one queen on a row, column or diagonal
33
Q

What are two methods that speed up solving a CSP when using backtracking?

A
  • Filtering: can we detect inevitable failure early? E.g. forward checking which, after selecting a value for a variable, removes from the domains of the other variables all values that violate the constraints given the value chosen (see the sketch below)
  • Ordering: choosing the next variable to be assigned by picking the one that has the smallest domain
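A sketch of the filtering idea under an assumed all-different constraint (the helper name forward_check is hypothetical): after assigning a value, prune it from every other domain and fail early if a domain empties.

def forward_check(domains, var, value):
    # returns pruned copies of the domains, or None on inevitable failure
    new = {v: set(d) for v, d in domains.items()}
    new[var] = {value}
    for other in domains:
        if other != var:
            new[other].discard(value)  # value would now violate all-different
            if not new[other]:
                return None            # a domain emptied: detect failure early
    return new

domains = {'A': {1, 2}, 'B': {1, 2}, 'C': {1, 2, 3}}
print(forward_check(domains, 'A', 1))  # 1 removed from B's and C's domains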
34
Q

What is local search to solve a CSP?

A
  • Randomly choose values for each variable, possibly violating the constraints
  • While constraints are violated:
  • Choose a variable that violates a constraint and change it to the value that causes the fewest violations

Can get stuck in local optima

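A sketch of min-conflicts local search on 8-queens (one queen per column; the variable for each column is its row). The step cap is there because, as the card says, it can get stuck.

import random

N = 8

def conflicts(rows, col, row):
    # number of other queens attacking a queen at (col, row)
    return sum(1 for c in range(N) if c != col and
               (rows[c] == row or abs(rows[c] - row) == abs(c - col)))

rows = [random.randrange(N) for _ in range(N)]  # random start, may violate
for _ in range(1000):
    bad = [c for c in range(N) if conflicts(rows, c, rows[c]) > 0]
    if not bad:
        break                                   # all constraints satisfied
    col = random.choice(bad)                    # a variable in violation
    rows[col] = min(range(N), key=lambda r: conflicts(rows, col, r))
print(rows)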
35
Q

What are the attributes of a game?

A
  • States (S)
  • Actions (A)
  • Transition function S x A -> S
  • Terminal test: S -> {true, false}
  • Players: P = {1, …, n}
  • Utilities: e.g. a utility function that takes a terminal state and a player and gives the value of the game for that player
36
Q

What is the value of a state?

A

The value of the best possible outcome from the next move of that state

37
Q

What is the minimax algorithm in a two-player zero-sum (outcomes are win/loss, loss/win or draw) game?

A
  • Generate a tree with all possible outcomes
  • Assign the correct values to terminal states
  • At each stage, assign values based on whose turn it would be and what they would pick (either the min or the max of the possible values)
  • Make the decisions best suited to the player's goal (see the sketch below)
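A sketch on a made-up game tree given as nested lists (leaves are terminal utilities for the maximising player):

def minimax(node, maximising):
    if isinstance(node, int):  # terminal test: leaves carry utilities
        return node
    values = [minimax(child, not maximising) for child in node]
    return max(values) if maximising else min(values)

tree = [[3, 12], [2, 8], [14, 1]]  # root is max's turn, then min's
print(minimax(tree, True))         # 3: max picks the branch min spoils least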
38
Q

How do you do alpha-beta pruning in the exam?

A

When backing values up the tree, only check down a path if, based on the current and previous min/max decisions, the propagated value could still change the outcome; otherwise prune that path
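A sketch of the same idea in code, on the same kind of made-up nested-list tree as the minimax sketch: a branch is abandoned as soon as the current bounds show its value cannot change what is propagated up.

def alphabeta(node, alpha, beta, maximising):
    if isinstance(node, int):   # terminal test: leaves carry utilities
        return node
    if maximising:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break           # prune: min above will never allow this
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break               # prune: max above will never allow this
    return value

tree = [[3, 12], [2, 8], [14, 1]]
print(alphabeta(tree, float('-inf'), float('inf'), True))  # 3, same as minimax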