Week 1 Flashcards
Linear modelling
Learning a linear relationship between attributes and responses.
What does this mean?
t = f(x;a)
The response t is given by a function f that acts on the input x and is controlled by a parameter a.
What parameter is known as the intercept?
w0 in
f(x) = w0 + w1*x
What does the squared loss function describe?
How much accuracy we are losing through the use of a certain function to model a phenomenon.
What is this?
Ln(y, f(x;w0,w1))
The squared loss function, Ln = (y - f(x;w0,w1))^2, telling us how much accuracy we lose by using f(x;w0,w1) to model y.
How do you calculate the average loss across a whole dataset?
L =
1/N *
SUM(n=1 to N) of the squared loss function evaluated at the n-th datapoint xn.
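A minimal numpy sketch of this average loss for f(x) = w0 + w1*x; the data and weight values below are made up purely for illustration.

```python
import numpy as np

# Made-up toy data (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

w0, w1 = 1.0, 2.0      # intercept and slope
pred = w0 + w1 * x     # f(x_n; w0, w1) for every datapoint

# Average squared loss: (1/N) * sum_n (t_n - f(x_n))^2
L = np.mean((t - pred) ** 2)
print(L)
```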
argmin means…
Find the argument that minimises.
Bias-variance tradeoff
The tradeoff between making a model flexible enough to fit the data closely (low bias, but high variance and a risk of overfitting) and keeping it simple enough to generalise well (low variance, but higher bias).
Validation set
A second, held-out dataset (not used for training) on which the predictive performance of the model is evaluated.
K-fold cross-validation
Splits the data into K equally sized blocks. Each block is used once as the validation set, with the other K-1 blocks as the training set.
LOOCV (abbreviation)
Leave-One-Out Cross-Validation
What is LOOCV?
A type of K-fold cross-validation where K=N.
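A small sketch of how the K blocks could be formed with plain numpy (index splitting only, no model fitting); the variable names are illustrative, and setting K = N gives LOOCV.

```python
import numpy as np

N, K = 12, 4                          # K = N would give LOOCV
indices = np.arange(N)
blocks = np.array_split(indices, K)   # K (roughly) equally sized blocks

for k, val_idx in enumerate(blocks):
    # Block k is the validation set; the other K-1 blocks form the training set.
    train_idx = np.concatenate([b for i, b in enumerate(blocks) if i != k])
    print(f"fold {k}: validate on {val_idx}, train on {train_idx}")
```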
0! ==
1
What is a prerequisite for multiplying an n x m matrix A and a q x r matrix B?
A*B is possible if…
m == q
So the number of columns of the first matrix needs to be equal to the number of rows in the second matrix.
(X*w)^T can be simplified to…
(w^T) * (X^T)
(ABCD)^T can be simplified to…
( (AB) (CD) )^T
( (AB) (CD) )^T can be simplified to…
(CD)^T * (AB)^T
(CD)^T * (AB)^T can be simplified to…
D^T * C^T * B^T * A^T
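A quick numeric check of this transpose rule with random matrices (shapes chosen arbitrarily so that the product is defined):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((5, 2))

lhs = (A @ B @ C @ D).T
rhs = D.T @ C.T @ B.T @ A.T
print(np.allclose(lhs, rhs))   # True
```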
What is the partial derivative with respect to w of:
w^T * x
x
What is the partial derivative with respect to w of:
x^T * w
x
What is the partial derivative with respect to w of:
w^T * w
2w
What is the partial derivative with respect to w of:
w^T * c*w
2cw (for a scalar c; for a symmetric matrix C the result is 2Cw)
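A numeric sanity check of these gradient identities against central finite differences (scalar c case; the vectors and the constant below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(3)
x = rng.standard_normal(3)
c = 2.5
eps = 1e-6

def num_grad(f, w):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda w: w @ x, w), x))                 # d/dw w^T x = x
print(np.allclose(num_grad(lambda w: w @ w, w), 2 * w))             # d/dw w^T w = 2w
print(np.allclose(num_grad(lambda w: c * (w @ w), w), 2 * c * w))   # d/dw c w^T w = 2cw
```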
Multiplying a scalar by an identity matrix results in..
a matrix with the scalar value on each diagonal element.
The inverse of a matrix that only has values on the diagonal, is
Another diagonal matrix where each diagonal element is the reciprocal (1/value) of the corresponding element in the original.
How do we write the optimum value of w?
^w
w with a little hat on it (ŵ)
^w ==
(The formula)
(X^T * X) ^-1 * X^T * t
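A small numpy sketch of this formula on made-up data (a column of 1s for the intercept plus one feature); np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

# Made-up data: first column of 1s (intercept), second column the feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])

# w_hat = (X^T X)^-1 X^T t, solved as a linear system for numerical stability.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(w_hat)   # [w0, w1]
```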
Linear regression
Supervised learning where t is a real number.
Linear classification
Supervised learning where t is an element from a finite set.
In supervised learning, each dataset is of the form x,t (with t in R in regression). What is the goal?
We look for a hypothesis f such that t = f(x); we want the model to predict the response t for new, unseen inputs x.
Overfitting
When the model comes up with a hypothesis that is too complex: it fits the existing data very well, but predicts poorly on new data.
What does regularisation add to supervised learning?
The objective doesn’t only minimise the loss, but also keeps the weights small: a penalty is added for larger weights.
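A minimal sketch of one common form, assuming an L2 (squared-weight) penalty with strength lam; this may differ in detail from the exact objective used in the lecture.

```python
import numpy as np

def regularised_loss(w, X, t, lam):
    """Average squared loss plus an L2 penalty on the weights (illustrative form)."""
    residuals = t - X @ w
    return np.mean(residuals ** 2) + lam * (w @ w)

# Made-up example: larger weights cost more for the same fit.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
print(regularised_loss(np.array([1.0, 2.0]), X, t, lam=0.1))
```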
What is the central question in the generative modelling problem?
Can we build a model that could generate a dataset like ours?
What does the equation
f(x;w) = w^T * x
do?
It computes a predicted response (essentially the label) for every input vector x, as a weighted sum of the attributes in x.
What does the italics N mean?
Example: N(0, sig^2)
It means ‘normal distribution with mean 0 and variance sig^2’.
For a Gaussian variable, the most likely point corresponds to…
the mean
Give the name (left side) of the function for the joint density of t over all datapoints in a dataset.
p(t | x, w, sig^2)
p(t | x, w, sig^2) =
PRODUCT(n=1 to N) p(t.n | x.n, w, sig^2)
Give the formula for log L, the log likelihood:
-(N/2) * log(2pi) -
N * log(sig) -
(1/(2 * sig^2)) *
SUM(n=1 to N) of (t.n - w^T * x.n)^2
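A direct numpy translation of this log-likelihood formula, with made-up toy data and an arbitrary choice of sig:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy inputs (intercept column + feature)
t = np.array([1.0, 3.1, 4.9])                         # toy responses
w = np.array([1.0, 2.0])
sig = 0.5
N = len(t)

log_L = (-(N / 2) * np.log(2 * np.pi)
         - N * np.log(sig)
         - (1 / (2 * sig**2)) * np.sum((t - X @ w) ** 2))
print(log_L)
```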
Give the Bernoulli distribution
P(X=x) = q^x * ((1-q)^(1-x)), for x in {0, 1}
IID (abbreviation)
independent and identically distributed
When is a matrix A negative definite?
If x^T * A * x < 0 for all vectors of real values x.
How do we actually show negative definiteness?
By showing that
- (1/sig^2) * z^T * X^T * X * z < 0
for every non-zero vector z. This holds because z^T * X^T * X * z = (Xz)^T * (Xz) = ||Xz||^2 >= 0 (and > 0 when Xz is non-zero), so multiplying by -(1/sig^2) makes it negative.
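A numeric illustration (not a proof): for a full-rank toy X, every eigenvalue of -(1/sig^2) * X^T * X comes out negative, which is another way to see negative definiteness.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))   # toy design matrix, full column rank with high probability
sig = 0.5

H = -(1 / sig**2) * (X.T @ X)      # the matrix whose definiteness we check
print(np.linalg.eigvalsh(H))       # all eigenvalues negative -> negative definite
```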
What is M with a line on it?
The expected value of the squared error between estimated parameter values and the true values.
Give the formula for M with the line on it.
M_ = B^2 + V
with B= bias and V = variance
A function of a random variable is…
itself a random variable.
How can you check if two random variables x and y are independent?
Check if p(x, y) = p(x) * p(y).
How do you check conditional independence?
p(x,y|z) = p(x|z) * p(y|z)
If you know the probability of the outcomes of a random variable X, how do you calculate the expected value E(X)?
Multiply each value of X by its probability and add all these products.
E(X) = SUM of all ( x * P(X=x))
What should you mention when multiplying probabilities of variables?
That the variables are independent.
How do you calculate the variance of a random variable?
Use the formula
var(X) = E(X^2) - E(X)^2.
How do you calculate E(X^2)?
The probabilities stay the same as in E(X), but each value of x is replaced by its square: E(X^2) = SUM of all (x^2 * P(X=x)).
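A tiny numeric sketch for a made-up discrete distribution, computing E(X), E(X^2) and var(X) = E(X^2) - E(X)^2:

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0])     # possible outcomes (made up)
probs  = np.array([0.2, 0.5, 0.3])     # their probabilities (sum to 1)

E_X   = np.sum(values * probs)         # E(X)   = sum_x x * P(X=x)
E_X2  = np.sum(values**2 * probs)      # E(X^2): same probabilities, squared values
var_X = E_X2 - E_X**2                  # var(X) = E(X^2) - E(X)^2
print(E_X, var_X)
```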
What can you say about the hypothesis/ the weight vector in a model with data in matrix X with N rows and 1 column?
The data matrix usually has one feature value per column, plus an extra column of 1s per row for the intercept. Here X has only that column of 1s, so there are no features in this matrix X.
The weight vector contains one value per column, so here only w0. Then w.T * x.n = w0 * 1 = w0: the hypothesis is just a single number, the same for every datapoint xn.
What is the loss function of the least-squares regression problem?
(1/N) * SUM(n=1 to N) (t.n - w.T*x.n)^2
The derivative of a sum of terms is equal to…
the sum of the derivatives of those terms.
The derivative of (ax+b)^q = …
q * (ax+b)^(q-1) * a
What would the log of the likelihood be:
PRODUCT(n=1 to N) of r^x.n * ((1-r)^(1-x.n))
SUM(n=1 to N) of x.n*log(r) + (1-x.n)*log(1-r)
When you are asked to compute the maximum likelihood estimate of a Bernoulli parameter, what do you do?
Take the derivative of the log likelihood with respect to the parameter, i.e. ∂ log L / ∂ r. If there is a sum in log L, it stays in the derivative. Terms of the form a*log(r) become a/r in the derivative (and a*log(1-r) become -a/(1-r), by the chain rule).
Equate the derivative to zero and solve for r.
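Worked out for the Bernoulli log likelihood above (a standard derivation, using r for the parameter as in the earlier card):

```latex
\frac{\partial \log L}{\partial r}
  = \sum_{n=1}^{N} \frac{x_n}{r} - \sum_{n=1}^{N} \frac{1 - x_n}{1 - r} = 0
\;\Longrightarrow\;
(1 - r)\sum_{n} x_n = r \sum_{n} (1 - x_n)
\;\Longrightarrow\;
\hat{r} = \frac{1}{N}\sum_{n=1}^{N} x_n .
```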
Multiplying a negative definite matrix by a negative constant gives…
a positive definite matrix.