AI 2 Flashcards
What is a maximum likelihood estimate?
The parameter value that makes the observed outcome most likely. For example, if we have a likelihood function with the shape of a normal-looking curve, plotting observed successes/tries against each candidate probability, the peak of the curve is the parameter value most likely to have produced the data and represents the MLE.
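As a sketch, the MLE can be found numerically by scanning the likelihood curve for its peak. The 7-successes-in-10-tries numbers here are assumed for illustration; analytically the binomial MLE is k/n.

```python
import math

# Sketch (assumed numbers): 7 successes in 10 trials, binomial likelihood
# L(p) = C(10,7) * p^7 * (1-p)^3. A grid search finds the peak of the curve.
k, n = 7, 10

def likelihood(p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

grid = [i / 1000 for i in range(1, 1000)]
mle = max(grid, key=likelihood)
print(mle)  # 0.7 — the peak matches the analytic MLE k/n
```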
What is a cost function and what is the cost function usually for a given likelihood function L(θ|x)
A cost function measures the discrepancy between an estimate of a function and the actual function. For a given likelihood function L(θ|x), the cost function is usually -log(L(θ|x)).
What is supervised learning?
Taking a dataset of input/output pairs and creating a function that describes them with as little discrepancy as possible. This is used for classifying or predicting data.
What is unsupervised learning?
Structuring data with multiple dimensions but no specific output. Unsupervised objectives may be clustering, dimensionality reduction or anomaly detection
How do we define a stationary point (w*) on a multidimensional function?
∂g/∂w₁ (w*) = 0
∂g/∂w₂ (w*) = 0
…
∂g/∂wₙ (w*) = 0
∴ if the point is stationary, the derivative operator with respect to each variable is 0
What value does gradient descent use for the direction of updating the parameters?
The negative of the gradient of the cost function
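A minimal sketch of the update rule w ← w − lr·∇g(w), using an assumed toy cost g(w) = (w₁−3)² + (w₂+1)² whose stationary point is (3, −1):

```python
# Sketch: gradient descent on the assumed cost g(w) = (w1-3)^2 + (w2+1)^2.
# Each step moves in the direction of the NEGATIVE gradient.
def grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w = [0.0, 0.0]
lr = 0.1
for _ in range(200):
    g = grad(w)
    w = [w[0] - lr * g[0], w[1] - lr * g[1]]
print([round(x, 3) for x in w])  # converges toward the stationary point (3, -1)
```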
What equation expresses the conditional probability of the binary dependent variable yᵢ being 1, given the parameters θ, in logistic regression? P(yᵢ=1|xᵢ)
P(yᵢ=1|xᵢ) = σ(θ⋅xᵢ) = 1 / (1 + exp(-θ⋅xᵢ))
where θ is the parameter vector for the independent variable vector xᵢ
What are the odds?
odds = P(yᵢ=1|xᵢ) / P(yᵢ=0|xᵢ) = exp(θ⋅xᵢ)
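A quick sketch checking that the odds implied by the sigmoid equal exp(θ⋅xᵢ). The θ and xᵢ values are arbitrary assumptions:

```python
import math

# Sketch: sigmoid probability and odds for one point (theta, x are assumed).
theta = [0.5, -0.25]
x = [1.0, 2.0]          # x[0] = 1 acts as the bias/intercept term
z = sum(t * xi for t, xi in zip(theta, x))   # theta . x
p = 1 / (1 + math.exp(-z))                   # P(y=1 | x) = sigmoid(z)
odds = p / (1 - p)                           # should equal exp(theta . x)
print(round(odds - math.exp(z), 10))  # 0.0 — the two odds expressions agree
```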
How is the function that represents the chance of a point being y=1 based on a dataset of input vectors and coefficients written?
hθ(X) = P(Y=1|X;θ) = 1 / (1 + exp(-θ⋅X))
where capital X and Y represent the input and outputs of all the points in the training set
the semicolon in P(Y=1|X;θ) indicates that θ is treated as a parameter of the distribution rather than a random variable
What is the likelihood function for logistic regression?
L(θ|y;x) = P(Y|X;θ) = Π(i=1>N) P(yᵢ|xᵢ;θ)
where P(yᵢ |xᵢ ;θ) is hθ(xᵢ) if y=1 and 1-hθ(xᵢ) if y=0
How do we calculate the cost function for logistic regression?
−log(L(θ|y;x)) = −Σ(i=1>N) [ yᵢ log(hθ(xᵢ)) + (1−yᵢ) log(1−hθ(xᵢ)) ]
(the yᵢ and 1−yᵢ coefficients ensure that only the relevant term is used and the other is 0)
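The cost above can be computed directly; a minimal sketch with an assumed tiny dataset of three points:

```python
import math

# Sketch of the negative log-likelihood cost (assumed 3-point dataset).
h = [0.9, 0.2, 0.7]   # model outputs h_theta(x_i)
y = [1, 0, 1]         # true labels
cost = -sum(yi * math.log(hi) + (1 - yi) * math.log(1 - hi)
            for yi, hi in zip(y, h))
print(round(cost, 4))  # ≈ 0.6852, i.e. -(ln 0.9 + ln 0.8 + ln 0.7)
```

Only the term matching each label survives: for yᵢ=0 the first term is zeroed and log(1−hθ(xᵢ)) is used instead.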
What do we need in the dataset for logistic regression to work?
- Binary output
- Independent variables
- Variables have low multicollinearity (variables unrelated)
- There is a linear relationship between the log odds and the variables i.e. theta is made up of constant coefficients log( p/(1-p) ) = θ₀ + θ₁x₁ + …
- Large sample size. Rule of thumb: N > 10k / p, where k is the number of independent variables and p is the smaller probability of the two y outcomes (for example p = 0.1 if y=0 occurs 10% of the time)
What are the three axioms of measuring the information of a given event?
- An event with probability of 100% yields no information
- The less probable an event is, the more information it yields
- If two events are measured separately, the information gathered is the sum of both informations
How is information from an event measured?
Iₓ(x) = logₐ [ 1/(Pₓ(x)) ]
where Pₓ(x) is the probability of x being the value it is
information calculated with a=2 is called bits, a=e is called natural units or nats and a=10 is called dits, bans or hartleys
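A minimal sketch of self-information in bits (a = 2), illustrating the axioms: a certain event carries no information, rarer events carry more, and information from independent events adds:

```python
import math

# Sketch: self-information I(x) = log2(1 / P(x)) in bits.
def information_bits(p):
    return math.log2(1 / p)

# A certain event (p = 1) yields 0 bits; a fair-coin outcome yields 1 bit.
print(information_bits(1.0), information_bits(0.5))  # 0.0 1.0
# Additivity: two independent coin flips (p = 0.25) yield 1 + 1 = 2 bits.
print(information_bits(0.25))  # 2.0
```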
How do you calculate logit(x) (log-odds)?
logit(p) = log( p/(1−p) ) = log(p) − log(1−p), i.e. the log of the odds: the probability of the event over the probability of its complement
What is entropy and how is it calculated (discrete data)?
Entropy is the uncertainty in a random variable X.
E[Iₓ(x)] = -Σ(i=1>n) P(X=xᵢ) * logₐ(P(X=xᵢ))
xᵢ ranges over all the values that X could take; for example, a die would give xᵢ = 1–6
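A sketch computing the entropy of a fair six-sided die (assumed example, a = 2):

```python
import math

# Sketch: entropy of a fair die, H(X) = -sum p * log2(p) over all outcomes.
probs = [1 / 6] * 6
H = -sum(p * math.log2(p) for p in probs)
print(round(H, 3))  # 2.585 — log2(6) bits, the maximum for 6 outcomes
```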
What is joint entropy and how is it calculated?
Uncertainty of two variables X and Y being the values they are.
H(X,Y) = -Σ(i=1>n) Σ(j=1>n) P(X=xᵢ, Y=yⱼ) * logₐ(P(X=xᵢ, Y=yⱼ))
What is conditional entropy and how is it calculated?
Uncertainty of a random variable Y given the outcome of another random variable X
H(Y|X) = -Σ(i=1>n) Σ(j=1>n) P(X=xᵢ, Y=yⱼ) * logₐ(P(Y=yⱼ | X=xᵢ))
H(Y|X) = H(X,Y) - H(X)
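A sketch verifying the identity H(Y|X) = H(X,Y) − H(X) on an assumed 2×2 joint distribution:

```python
import math

# Sketch: check H(Y|X) = H(X,Y) - H(X) on an assumed joint table.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {0: 0.5, 1: 0.5}  # marginal distribution of X

H_XY = -sum(p * math.log2(p) for p in joint.values())
H_X = -sum(p * math.log2(p) for p in px.values())
# Direct definition: H(Y|X) = -sum p(x,y) * log2 p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())

print(abs(H_XY - H_X - H_Y_given_X) < 1e-9)  # True — the chain rule holds
```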
What is relative entropy (Kullback-Leibler divergence) and how is it calculated?
A measure of how well a distribution Q approximates a distribution P.
Dₖₗ(P||Q) = Σ(x=1>n) P(x)*log[ P(x)/Q(x) ]
where P and Q are two probability distributions over the same discrete random variable X
probably don't need in exam
What is:
* Cross entropy
* Jensen-Shannon divergence
Cross entropy: H(P,Q) = H(P) + Dₖₗ(P||Q) (here P and Q are distributions — not to be confused with the joint entropy H(X,Y))
Jensen-Shannon divergence: JSD(P||Q) = 0.5·Dₖₗ(P||M) + 0.5·Dₖₗ(Q||M)
where M = 0.5(P+Q)
probably don't need this in exam
What is mutual information and how is it calculated?
How much two distributions share information or how much knowing one of the variables reduces uncertainty about the other
I(X;Y) = Σ(x=1>n) Σ(y=1>n) p(x,y)*log( p(x,y) / p(x)p(y) )
How can we find mutual information based on entropy values?
I(X;Y) = H(X) - H(X|Y)
OR
I(X;Y) = H(Y) - H(Y|X)
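A sketch computing mutual information both ways, from the direct definition and from the entropy identity, on an assumed 2×2 joint distribution:

```python
import math

# Sketch: mutual information computed two ways on an assumed joint table.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}

# Direct definition: I(X;Y) = sum p(x,y) * log2[ p(x,y) / (p(x)p(y)) ]
I_direct = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

# Via entropies: I(X;Y) = H(Y) - H(Y|X)
H_Y = -sum(p * math.log2(p) for p in py.values())
H_Y_given_X = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())
print(abs(I_direct - (H_Y - H_Y_given_X)) < 1e-9)  # True — both routes agree
```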
What is information gain?
Information gained when traversing the decision tree from node X to Y. Written as IG(Y,X) and is the same as mutual information of X and Y: I(X;Y)
How are nodes chosen to be higher or lower in the decision tree?
Characteristics (A) with higher information gain with respect to the target variable (Y) are placed higher. This is done by calculating H(Y), then for each characteristic (A, B, C, …) calculating the conditional entropy after splitting on it ( H(Y|A), H(Y|B), … ), and selecting the one with the smallest conditional entropy, i.e. the greatest information gain IG(Y,A) = H(Y) − H(Y|A).
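A sketch of this selection rule: the feature with the highest information gain (lowest conditional entropy) wins. The toy dataset and feature names A, B are assumptions:

```python
import math

# Sketch: pick the split with the highest information gain H(Y) - H(Y|feature).
Y = [1, 1, 1, 0, 0, 0]   # target labels
A = [1, 1, 1, 0, 0, 0]   # feature A predicts Y perfectly
B = [1, 0, 1, 0, 1, 0]   # feature B is nearly uninformative

def entropy(labels):
    n = len(labels)
    counts = {v: labels.count(v) for v in set(labels)}
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def cond_entropy(labels, feature):
    # H(Y|feature) = sum over splits of (split weight) * (entropy within split)
    n = len(labels)
    h = 0.0
    for v in set(feature):
        subset = [y for y, f in zip(labels, feature) if f == v]
        h += len(subset) / n * entropy(subset)
    return h

ig = {name: entropy(Y) - cond_entropy(Y, f) for name, f in [("A", A), ("B", B)]}
print(max(ig, key=ig.get))  # A — the perfectly predictive feature is chosen first
```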
What is feature selection and how does it work?
There may be many characteristics, so only the top x are selected to form the decision tree. Characteristics are added greedily: at each step, add the one with the highest [mutual information with the target − c × Σ mutual information with the already-chosen variables] until x variables are chosen (relevance minus redundancy).
How do we calculate p(x|y)?
p(x,y)/p(y)
What is full joint distribution and how do we calculate it in a Bayesian network?
It is a measure of how likely each combination of the variables' values is, taking into account how the variables depend on each other
P(X₁, X₂, X₃, … Xₙ) = Π(i=1>n) P(Xᵢ| parents(Xᵢ) )
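A sketch of the factorisation on an assumed two-node network A → B (A has no parents; B's only parent is A):

```python
# Sketch: full joint via the Bayesian-network factorisation P(A,B) = P(A)*P(B|A).
# The probability tables below are assumed for illustration.
P_A = {True: 0.3, False: 0.7}
P_B_given_A = {True: {True: 0.9, False: 0.1},
               False: {True: 0.2, False: 0.8}}

def joint(a, b):
    return P_A[a] * P_B_given_A[a][b]   # product over each node given its parents

total = sum(joint(a, b) for a in (True, False) for b in (True, False))
print(round(total, 10))  # 1.0 — a valid joint distribution sums to one
```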
What is conditional independence?
When two variables A and B are independent given C.
P(A,B|C) = P(A|C)*P(B|C)
What is the Markov condition of a Bayesian network?
Each random variable X is conditionally independent of its non-descendants, given its parents
Which are the only 4 standard structures allowed in standard Bayesian networks?
- Direct cause: A>B
- Indirect Cause: A>B>C
- Common cause: A<B>C
- Common effect: A>B<C
What are constraints and preferences in search problems?
Constraints are rules that legal solutions have to abide by. Preferences are values that need to be minimised to make a solution better.
What is a constraint satisfaction problem?
A problem where a solution is a combination that satisfies all constraints and has no preferences
What makes up a CSP?
- Variables e.g. squares on a chess board
- Domain e.g. {0,1} denoting if the square contains a queen or not
- Constraints e.g. no more than one queen on a row, column or diagonal
what are two methods that speed up solving a CSP when using backtracking?
- Filtering: detect inevitable failure early, e.g. forward checking, which, after assigning a value to a variable, removes from the domains of the other variables all values that violate the constraints given the values chosen so far
- Ordering: choose the next variable to assign as the one with the smallest remaining domain (minimum remaining values)
What is local search to solve a CSP?
- Randomly choose values for each variable, possibly violating the constraints
- While constraints are violated:
- Choose a variable that violates a constraint and change it to the value that has the least violations
Can get stuck
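A sketch of this loop (min-conflicts) on 4-queens; the formulation — one queen per column, the variable for each column being its row — is an assumption for illustration:

```python
import random

# Sketch: min-conflicts local search for 4-queens (one queen per column).
def conflicts(rows, col):
    # Number of constraint violations for the queen in `col`.
    return sum(1 for c in range(len(rows)) if c != col and
               (rows[c] == rows[col] or
                abs(rows[c] - rows[col]) == abs(c - col)))

def min_conflicts(n=4, max_steps=1000, seed=0):
    rng = random.Random(seed)
    rows = [rng.randrange(n) for _ in range(n)]   # random start, may violate constraints
    for _ in range(max_steps):
        bad = [c for c in range(n) if conflicts(rows, c) > 0]
        if not bad:
            return rows                           # all constraints satisfied
        col = rng.choice(bad)                     # pick a variable in conflict
        # Move it to the row with the fewest violations.
        rows[col] = min(range(n),
                        key=lambda r: conflicts(rows[:col] + [r] + rows[col+1:], col))
    return None                                   # stuck — local search can fail

print(min_conflicts())
```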
What are the attributes of a game?
- States (S)
- Actions (A)
- Transition function S x A -> S
- Terminal test: S -> {true, false}
- Players: P = {1, …, n}
- Utilities, e.g. a terminal function that takes a terminal state and a player and gives a value of the game for that player
What is the value of a state?
The value of the best outcome achievable by the player whose turn it is, starting from that state
What is the minimax algorithm in a two-player zero-sum (outcomes are win/loss, loss/win or draw) game?
- Generate a tree with all possible outcomes
- Assign correct values to terminal states
- At each stage, assign values based on whose turn it would be and what they would pick (either the min or the max of the children's values)
- Make the decision best suited to the player's goal
How to do alpha-beta pruning in exam?
During the backpropagation stage, only explore a branch if, given the current and previous min/max decisions (the α and β bounds), its value could still change the outcome; otherwise prune it.