AI 2 Flashcards

1
Q

What is a maximum likelihood estimate?

A

The parameter value under which the observed outcome is most likely to occur. For example, if we have a likelihood function that takes the shape of a normal distribution when the likelihood of the observed successes/tries is plotted against each candidate probability, the peak of the curve marks the most likely parameter value and represents the MLE

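A minimal sketch in Python (the 7-successes-in-10-tries data and the grid search are assumptions for illustration): scan the binomial likelihood and read off its peak.

import numpy as np
from scipy.stats import binom

successes, tries = 7, 10                 # hypothetical observed data
p_grid = np.linspace(0.001, 0.999, 999)  # candidate success probabilities
likelihood = binom.pmf(successes, tries, p_grid)
p_mle = p_grid[np.argmax(likelihood)]    # the peak of the likelihood curve
print(p_mle)                             # ~0.7, i.e. successes/tries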
2
Q

What is a cost function, and what is the cost function usually for a given likelihood function L(θ|x)?

A

A cost function measures the discrepancy between an estimate of a function and the actual function. The cost function is usually -log(L(θ|x))

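Continuing the hypothetical binomial sketch above, a quick check that minimising -log(L) recovers the same estimate as maximising L:

import numpy as np
from scipy.stats import binom

p_grid = np.linspace(0.001, 0.999, 999)
cost = -np.log(binom.pmf(7, 10, p_grid))  # cost = -log(L(θ|x))
print(p_grid[np.argmin(cost)])            # ~0.7, same as the MLE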
3
Q

What is supervised learning?

A

Taking a dataset of input/output pairs and creating a function that describes them with as little discrepancy as possible. This is used for classifying or predicting data.

4
Q

What is unsupervised learning?

A

Structuring data with multiple dimensions but no specific output. Unsupervised objectives may be clustering, dimensionality reduction or anomaly detection

5
Q

How do we define a stationary point (w*) on a multidimensional function?

A

∂g/∂w₁(w*) = 0
∂g/∂w₂(w*) = 0
⋮
∂g/∂wₙ(w*) = 0
∴ if the point is stationary, the partial derivative with respect to each variable is 0

6
Q

What value does gradient descent use for the direction of updating the parameters?

A

The negative of the gradient of the cost function

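A minimal sketch (the quadratic cost g and the learning rate are assumptions, not from the card): each step moves against the gradient.

import numpy as np

def grad(w):
    # analytic gradient of g(w) = (w1-3)² + (w2+1)²
    return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])

w = np.zeros(2)           # initial parameters
lr = 0.1                  # assumed learning rate
for _ in range(100):
    w = w - lr * grad(w)  # step in the direction of the NEGATIVE gradient
print(w)                  # approaches the minimum at (3, -1)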
7
Q

What equation expresses the conditional probability P(yᵢ=1|xᵢ) of the binary dependent variable yᵢ being 1, given the parameters θ, in logistic regression?

What are the odds?

A

σ(xᵢ) = 1/( 1+exp(−θ⋅xᵢ) )
where θ is the parameter vector for the independent variable vector xᵢ

odds = exp(θ⋅xᵢ)

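A quick numeric check (θ and xᵢ are made-up values): the sigmoid gives the probability, and exp(θ⋅xᵢ) matches p/(1−p).

import numpy as np

theta = np.array([0.5, -1.2])   # hypothetical parameters
x_i = np.array([1.0, 2.0])      # hypothetical input vector

p = 1.0 / (1.0 + np.exp(-theta @ x_i))  # P(yᵢ=1 | xᵢ)
odds = np.exp(theta @ x_i)              # exp(θ⋅xᵢ)
print(p, odds, p / (1 - p))             # the last two agree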
8
Q

How do you write the function that represents the chance of a point having y=1, based on a dataset of input vectors and coefficients?

A

hθ(X) = P(Y=1|X;θ) = 1/( 1+exp(−θ⋅X) )
where capital X and Y represent the inputs and outputs of all the points in the training set

The ';' in X;θ means the probability is conditioned on X and parameterised by θ

9
Q

What is the likelihood function for logistic regression?

A

L(θ|y;x) = P(Y|X;θ) = Π(i=1→N) P(yᵢ|xᵢ;θ)
where P(yᵢ|xᵢ;θ) is hθ(xᵢ) if yᵢ=1 and 1−hθ(xᵢ) if yᵢ=0, i.e. hθ(xᵢ)^yᵢ · (1−hθ(xᵢ))^(1−yᵢ)

10
Q

How do we calculate the cost function for logistic regression?

A

−log(L(θ|y;x)) = −Σ(i=1→N) [ yᵢ log(hθ(xᵢ)) + (1−yᵢ) log(1−hθ(xᵢ)) ]
(the yᵢ and 1−yᵢ coefficients ensure that only the relevant term is used and the other is 0)

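A small sketch of the cost on a made-up three-point dataset (X, y and θ are assumptions):

import numpy as np

def h(theta, X):
    return 1.0 / (1.0 + np.exp(-X @ theta))  # hθ(x) for every row of X

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5]])  # bias column then feature
y = np.array([0.0, 0.0, 1.0])                       # binary outputs
theta = np.array([-2.0, 1.0])                       # hypothetical parameters

p = h(theta, X)
cost = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(cost)   # the negative log-likelihood of θ on this data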
11
Q

What do we need in the dataset for logistic regression to work?

A
  • Binary output
  • Independent variables
  • Variables have low multicollinearity (variables unrelated)
  • There is a linear relationship between the log odds and the variables, i.e. θ is made up of constant coefficients: log( p/(1−p) ) = θ₀ + θ₁x₁ + …
  • Large sample size. Rule of thumb: the sample needs more than 10 × (number of variables) / p cases, where p is the probability of the rarer y outcome, e.g. p = 0.1 if y=0 has a 10% chance
12
Q

What are the three axioms of measuring the information of a given event?

A
  • An event with probability of 100% yields no information
  • The less probable an event is, the more information it yields
  • If two independent events are measured separately, the information gathered is the sum of both informations
13
Q

How is information from an event measured?

A

Iₓ(x) = logₐ[ 1/Pₓ(x) ]
where Pₓ(x) is the probability of x being the value it is.
Information calculated with a=2 is measured in bits, with a=e in natural units (nats), and with a=10 in dits, bans or hartleys

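A two-line check (the event probability 0.25 is an assumption): the same information in bits and in nats.

import math

p = 0.25                 # hypothetical event probability
print(math.log2(1 / p))  # 2.0 bits: as informative as two fair coin flips
print(math.log(1 / p))   # the same quantity in nats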
14
Q

How do you calculate logit(x) (log-odds)?

A

logit(p) = log( p/(1−p) ) = log(p) − log(1−p) = log(P(x)) − log(P(¬x))
where p is the probability of event x

15
Q

What is entropy and how is it calculated (discrete data)?

A

Entropy is the uncertainty in a random variable X.
H(X) = E[Iₓ(x)] = −Σ(i=1→n) P(X=xᵢ) · logₐ(P(X=xᵢ))

xᵢ ranges over all the values X can take; for example, a die would have xᵢ ∈ {1,…,6}

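A sketch using the die example from the card:

import math

probs = [1 / 6] * 6  # P(X = xᵢ) for a fair die, xᵢ ∈ {1,…,6}
H = -sum(p * math.log2(p) for p in probs)
print(H)             # log₂(6) ≈ 2.585 bits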
16
Q

What is joint entropy and how is it calculated?

A

Uncertainty of two variables X and Y being the values they are.
H(X,Y) = −Σ(i=1→n) Σ(j=1→m) P(X=xᵢ, Y=yⱼ) · logₐ(P(X=xᵢ, Y=yⱼ))

17
Q

What is conditional entropy and how is it calculated?

A

Uncertainty of a random variable Y given the outcome of another random variable X
H(Y|X) = −Σ(i=1→n) Σ(j=1→m) P(X=xᵢ, Y=yⱼ) · logₐ(P(Y=yⱼ | X=xᵢ))
H(Y|X) = H(X,Y) − H(X)

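A sketch computing both the joint entropy from the previous card and the conditional entropy, from a small assumed joint table, checking H(Y|X) = H(X,Y) − H(X):

import numpy as np

pxy = np.array([[0.3, 0.1],   # assumed p(x, y) for x, y ∈ {0, 1}
                [0.1, 0.5]])
px = pxy.sum(axis=1)          # marginal p(x)

H_xy = -np.sum(pxy * np.log2(pxy))                       # joint entropy H(X,Y)
H_x = -np.sum(px * np.log2(px))                          # H(X)
H_y_given_x = -sum(pxy[i, j] * np.log2(pxy[i, j] / px[i])
                   for i in range(2) for j in range(2))  # H(Y|X)
print(H_y_given_x, H_xy - H_x)                           # the two agree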
18
Q

What is relative entropy (Kullback-Leibler divergence) and how is it calculated?

A

A measure of how well one distribution Q approximates another distribution P.
Dₖₗ(P||Q) = Σ(x=1→n) P(x)·log[ P(x)/Q(x) ]
where P and Q are two probability distributions over the same discrete random variable X

probably don't need in exam

19
Q

What is:
* Cross entropy
* Jensen-Shannon divergence

A

Cross entropy: H(P,Q) = H(P) + Dₖₗ(P||Q)
Jensen-Shannon divergence: JSD(P||Q) = 0.5·Dₖₗ(P||M) + 0.5·Dₖₗ(Q||M)
where M = 0.5(P+Q)

probably don't need this in exam

20
Q

What is mutual information and how is it calculated?

A

How much two distributions share information or how much knowing one of the variables reduces uncertainty about the other
I(X;Y) = Σ(x=1→n) Σ(y=1→m) p(x,y)·log( p(x,y) / (p(x)·p(y)) )

21
Q

How can we find mutual information based on entropy values?

A

I(X;Y) = H(X) - H(X|Y)
OR
I(X;Y) = H(Y) - H(Y|X)

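A sketch checking both routes on the same kind of assumed joint table: the double-sum definition against H(Y) − H(Y|X):

import numpy as np

pxy = np.array([[0.3, 0.1],   # assumed p(x, y) for x, y ∈ {0, 1}
                [0.1, 0.5]])
px = pxy.sum(axis=1)          # marginal p(x)
py = pxy.sum(axis=0)          # marginal p(y)

I_def = sum(pxy[i, j] * np.log2(pxy[i, j] / (px[i] * py[j]))
            for i in range(2) for j in range(2))         # definition
H_y = -np.sum(py * np.log2(py))
H_y_given_x = -sum(pxy[i, j] * np.log2(pxy[i, j] / px[i])
                   for i in range(2) for j in range(2))
print(I_def, H_y - H_y_given_x)                          # the two agree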
22
Q

What is information gain?

A

The reduction in uncertainty about the target variable Y when splitting on node X while traversing the decision tree. Written as IG(Y,X) and is the same as the mutual information of X and Y: I(X;Y)

23
Q

How are nodes chosen to be higher or lower in the decision tree?

A

Characteristics (A) with higher information gain about the random variable (Y) are chosen to be higher. This is done by calculating H(Y), then with all the characteristics (A, B, C, …) calculating the conditional entropy after each split by characteristic ( H(Y|A), H(Y|B), … ), and selecting the one with the smallest conditional entropy, i.e. the greatest information gain H(Y) − H(Y|A).

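A sketch on a made-up six-row dataset (features A and B and target y are assumptions): compute the information gain for each characteristic and pick the larger.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature, target):
    # IG(Y, X) = H(Y) - H(Y|X), with H(Y|X) as a weighted sum over splits
    h_cond = sum((feature == v).mean() * entropy(target[feature == v])
                 for v in np.unique(feature))
    return entropy(target) - h_cond

A = np.array([0, 0, 1, 1, 1, 0])   # hypothetical characteristic
B = np.array([0, 1, 0, 1, 0, 1])   # hypothetical characteristic
y = np.array([0, 0, 1, 1, 1, 0])   # target; here it tracks A exactly
print(info_gain(A, y), info_gain(B, y))  # A wins, so A sits higher in the tree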
24
Q

What is feature selection and how does it work?

A

There may be many characteristics, so only the top x are selected to form the decision tree. This is done by repeatedly adding the characteristic with the highest [mutual information with the target − c · Σ mutual information with the already-chosen variables] until x variables are chosen.

25
Q

How do we calculate p(x|y)?

A

p(x,y)/p(y)

26
Q

What is full joint distribution and how do we calculate it in a Bayesian network?

A

It is a measure of how likely a given combination of values of the variables is, accounting for how the variables depend on each other
P(X₁, X₂, X₃, … Xₙ) = Π(i=1→n) P(Xᵢ | parents(Xᵢ))

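A tiny sketch for an assumed two-node network A→B (A is B's only parent): the joint is the product of each node's conditional given its parents.

p_a = {True: 0.2, False: 0.8}                  # P(A), assumed
p_b_given_a = {True: {True: 0.9, False: 0.1},  # P(B|A), assumed
               False: {True: 0.3, False: 0.7}}

def full_joint(a, b):
    # P(A, B) = P(A) · P(B | parents(B)) = P(A) · P(B | A)
    return p_a[a] * p_b_given_a[a][b]

# sanity check: the factorised joint sums to 1 over all assignments
print(sum(full_joint(a, b) for a in (True, False) for b in (True, False)))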
27
Q

What is conditional independence?

A

When two variables A and B are independent given C.
P(A,B|C) = P(A|C)*P(B|C)

28
Q

What is the Markov condition of a Bayesian network?

A

Each random variable X is conditionally independent of its non-descendants, given its parents

29
Q

Which are the only 4 structures allowed in standard Bayesian networks?

A
  • Direct cause: A→B
  • Indirect cause: A→B→C
  • Common cause: A←B→C
  • Common effect: A→B←C
30
Q

What are constraints and preferences in search problems?

A

Constraints are rules that legal solutions have to abide by. Preferences are values that need to be minimised to make a solution better.

31
Q

What is a constraint satisfaction problem?

A

A problem with no preferences, where a solution is any combination of values that satisfies all the constraints

32
Q

What makes up a CSP?

A
  • Variables e.g. squares on a chess board
  • Domain e.g. {0,1} denoting if the square contains a queen or not
  • Constraints e.g. no more than one queen on a row, column or diagonal
33
Q

What are two methods that speed up solving a CSP when using backtracking?

A
  • Filtering: can we detect inevitable failure early? E.g. forward checking which, after selecting a value for a variable, removes from the domains of the other variables all values that violate the constraints given the value chosen (see the sketch below)
  • Ordering: choosing the next variable to be assigned by picking the one that has the smallest domain
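A sketch of the filtering idea under an assumed all-different constraint (the helper name forward_check is hypothetical): after assigning a value, prune it from every other domain and fail early if a domain empties.

def forward_check(domains, var, value):
    # returns pruned copies of the domains, or None on inevitable failure
    new = {v: set(d) for v, d in domains.items()}
    new[var] = {value}
    for other in domains:
        if other != var:
            new[other].discard(value)  # value would now violate all-different
            if not new[other]:
                return None            # a domain emptied: detect failure early
    return new

domains = {'A': {1, 2}, 'B': {1, 2}, 'C': {1, 2, 3}}
print(forward_check(domains, 'A', 1))  # 1 removed from B's and C's domains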
34
Q

What is local search to solve a CSP?

A
  • Randomly choose values for each variable, possibly violating the constraints
  • While constraints are violated:
  • Choose a variable that violates a constraint and change it to the value that causes the fewest violations

Can get stuck in local optima

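A sketch of min-conflicts local search on 8-queens (one queen per column; the variable for each column is its row). The step cap is there because, as the card says, it can get stuck.

import random

N = 8

def conflicts(rows, col, row):
    # number of other queens attacking a queen at (col, row)
    return sum(1 for c in range(N) if c != col and
               (rows[c] == row or abs(rows[c] - row) == abs(c - col)))

rows = [random.randrange(N) for _ in range(N)]  # random start, may violate
for _ in range(1000):
    bad = [c for c in range(N) if conflicts(rows, c, rows[c]) > 0]
    if not bad:
        break                                   # all constraints satisfied
    col = random.choice(bad)                    # a variable in violation
    rows[col] = min(range(N), key=lambda r: conflicts(rows, col, r))
print(rows)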
35
Q

What are the attributes of a game?

A
  • States (S)
  • Actions (A)
  • Transition function S x A -> S
  • Terminal test: S -> {true, false}
  • Players: P = {1, …, n}
  • Utilities: e.g. a utility function that takes a terminal state and a player and gives the value of the game for that player
36
Q

What is the value of a state?

A

The value of the best possible outcome from the next move of that state

37
Q

What is the minimax algorithm in a two-player zero-sum (outcomes are win/loss, loss/win or draw) game?

A
  • Generate a tree with all possible outcomes
  • Assign the correct values to terminal states
  • At each stage, assign values based on whose turn it would be and what they would pick (either the min or the max of the possible values)
  • Make the decisions best suited to the player's goal (see the sketch below)
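A sketch on a made-up game tree given as nested lists (leaves are terminal utilities for the maximising player):

def minimax(node, maximising):
    if isinstance(node, int):  # terminal test: leaves carry utilities
        return node
    values = [minimax(child, not maximising) for child in node]
    return max(values) if maximising else min(values)

tree = [[3, 12], [2, 8], [14, 1]]  # root is max's turn, then min's
print(minimax(tree, True))         # 3: max picks the branch min spoils least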
38
Q

How do you do alpha-beta pruning in the exam?

A

When backing values up the tree, only check down a path if, based on the current and previous min/max decisions, the propagated value could still change the outcome; otherwise prune that path
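A sketch of the same idea in code, on the same kind of made-up nested-list tree as the minimax sketch: a branch is abandoned as soon as the current bounds show its value cannot change what is propagated up.

def alphabeta(node, alpha, beta, maximising):
    if isinstance(node, int):   # terminal test: leaves carry utilities
        return node
    if maximising:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break           # prune: min above will never allow this
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break               # prune: max above will never allow this
    return value

tree = [[3, 12], [2, 8], [14, 1]]
print(alphabeta(tree, float('-inf'), float('inf'), True))  # 3, same as minimax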