Week 1 Flashcards
Linear modelling
Learning a linear relationship between attributes and responses.
What does this mean?
t = f(x;a)
The response t is given by a function f that acts on the input x and is controlled by a parameter a.
What parameter is known as the intercept?
w0 in
f(x) = w0 + w1*x
What does the squared loss function describe?
How much accuracy we are losing through the use of a certain function to model a phenomenon.
What is this?
Ln(y, f(x;w0,w1))
The squared loss function, Ln = (y - f(x;w0,w1))^2, telling us how much accuracy we lose by using f(x;w0,w1) to model y.
How do you calculate the average loss across a whole dataset?
L =
1/N *
SUM(n=1 to N) of the squared loss function evaluated at the n-th datapoint xn.
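A minimal numpy sketch of this average loss for f(x) = w0 + w1*x; the data and weight values below are made up purely for illustration.

```python
import numpy as np

# Made-up toy data (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

w0, w1 = 1.0, 2.0      # intercept and slope
pred = w0 + w1 * x     # f(x_n; w0, w1) for every datapoint

# Average squared loss: (1/N) * sum_n (t_n - f(x_n))^2
L = np.mean((t - pred) ** 2)
print(L)
```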
argmin means…
Find the argument that minimises.
Bias-variance tradeoff
The tradeoff between making a model flexible enough to fit the data closely (low bias, but high variance and a risk of overfitting) and keeping it simple enough to generalise well (low variance, but higher bias).
Validation set
A second, held-out dataset (not used for training) on which the predictive performance of the model is evaluated.
K-fold cross-validation
Splits the data into K equally sized blocks. Each block is used once as the validation set, with the other K-1 blocks as the training set.
LOOCV (abbreviation)
Leave-One-Out Cross-Validation
What is LOOCV?
A type of K-fold cross-validation where K=N.
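A small sketch of how the K blocks could be formed with plain numpy (index splitting only, no model fitting); the variable names are illustrative, and setting K = N gives LOOCV.

```python
import numpy as np

N, K = 12, 4                          # K = N would give LOOCV
indices = np.arange(N)
blocks = np.array_split(indices, K)   # K (roughly) equally sized blocks

for k, val_idx in enumerate(blocks):
    # Block k is the validation set; the other K-1 blocks form the training set.
    train_idx = np.concatenate([b for i, b in enumerate(blocks) if i != k])
    print(f"fold {k}: validate on {val_idx}, train on {train_idx}")
```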
0! ==
1
What is a prerequisite for multiplying an n x m matrix A and a q x r matrix B?
A*B is possible if…
m == q
So the number of columns of the first matrix needs to be equal to the number of rows in the second matrix.
(X*w)^T can be simplified to…
(w^T) * (X^T)
(ABCD)^T can be simplified to…
( (AB) (CD) )^T
( (AB) (CD) )^T can be simplified to…
(CD)^T * (AB)^T
(CD)^T * (AB)^T can be simplified to…
D^T * C^T * B^T * A^T
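A quick numeric check of this transpose rule with random matrices (shapes chosen arbitrarily so that the product is defined):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
D = rng.standard_normal((5, 2))

lhs = (A @ B @ C @ D).T
rhs = D.T @ C.T @ B.T @ A.T
print(np.allclose(lhs, rhs))   # True
```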
What is the partial derivative with respect to w of:
w^T * x
x
What is the partial derivative with respect to w of:
x^T * w
x
What is the partial derivative with respect to w of:
w^T * w
2w
What is the partial derivative with respect to w of:
w^T * c*w
2cw (for a scalar c; for a symmetric matrix C the result is 2Cw)
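A numeric sanity check of these gradient identities against central finite differences (scalar c case; the vectors and the constant below are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(3)
x = rng.standard_normal(3)
c = 2.5
eps = 1e-6

def num_grad(f, w):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

print(np.allclose(num_grad(lambda w: w @ x, w), x))                 # d/dw w^T x = x
print(np.allclose(num_grad(lambda w: w @ w, w), 2 * w))             # d/dw w^T w = 2w
print(np.allclose(num_grad(lambda w: c * (w @ w), w), 2 * c * w))   # d/dw c w^T w = 2cw
```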
Multiplying a scalar by an identity matrix results in..
a matrix with the scalar value on each diagonal element.
The inverse of a matrix that only has values on the diagonal, is
Another diagonal matrix where each diagonal element is the reciprocal (1/value) of the corresponding element in the original.
How do we write the optimum value of w?
^w
w with a little hat on it (ŵ)
^w ==
(The formula)
(X^T * X) ^-1 * X^T * t
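A small numpy sketch of this formula on made-up data (a column of 1s for the intercept plus one feature); np.linalg.solve is used rather than forming the inverse explicitly.

```python
import numpy as np

# Made-up data: first column of 1s (intercept), second column the feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
t = np.array([1.1, 2.9, 5.2, 6.8])

# w_hat = (X^T X)^-1 X^T t, solved as a linear system for numerical stability.
w_hat = np.linalg.solve(X.T @ X, X.T @ t)
print(w_hat)   # [w0, w1]
```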
Linear regression
Supervised learning where t is a real number.
Linear classification
Supervised learning where t is an element from a finite set.
In supervised learning, each dataset is of the form x,t (with t in R in regression). What is the goal?
We look for a hypothesis f such that t = f(x); we want the model to predict the response t for new, unseen inputs x.
Overfitting
When the model comes up with a hypothesis that is too complex: it fits the existing data very well, but predicts poorly on new data.
What does regularisation add to supervised learning?
The objective doesn’t only minimise the loss, but also keeps the weights small: a penalty is added for larger weights.
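A minimal sketch of one common form, assuming an L2 (squared-weight) penalty with strength lam; this may differ in detail from the exact objective used in the lecture.

```python
import numpy as np

def regularised_loss(w, X, t, lam):
    """Average squared loss plus an L2 penalty on the weights (illustrative form)."""
    residuals = t - X @ w
    return np.mean(residuals ** 2) + lam * (w @ w)

# Made-up example: larger weights cost more for the same fit.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([1.0, 3.0, 5.0])
print(regularised_loss(np.array([1.0, 2.0]), X, t, lam=0.1))
```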
What is the central question in the generative modelling problem?
Can we build a model that could generate a dataset like ours?
What does the equation
f(x;w) = w^T * x
do?
It computes a predicted response (essentially the label) for every input vector x, as a weighted sum of the attributes in x.
What does the italics N mean?
Example: N(0, sig^2)
It means ‘normal distribution with mean 0 and variance sig^2’.
For a Gaussian variable, the most likely point corresponds to…
the mean
Give the name (left side) of the function for the joint density of t over all datapoints in a dataset.
p(t | x, w, sig^2)
p(t | x, w, sig^2) =
PRODUCT(n=1 to N) p(t.n | x.n, w, sig^2)
Give the formula for log L, the log likelihood:
-(N/2) * log(2pi) -
N * log(sig) -
(1/(2 * sig^2)) *
SUM(n=1 to N) of (t.n - w^T * x.n)^2
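A direct numpy translation of this log-likelihood formula, with made-up toy data and an arbitrary choice of sig:

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # toy inputs (intercept column + feature)
t = np.array([1.0, 3.1, 4.9])                         # toy responses
w = np.array([1.0, 2.0])
sig = 0.5
N = len(t)

log_L = (-(N / 2) * np.log(2 * np.pi)
         - N * np.log(sig)
         - (1 / (2 * sig**2)) * np.sum((t - X @ w) ** 2))
print(log_L)
```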
Give the Bernoulli distribution
P(X=x) = q^x * ((1-q)^(1-x)), for x in {0, 1}
IID (abbreviation)
independent and identically distributed
When is a matrix A negative definite?
If x^T * A * x < 0 for all vectors of real values x.
How do we actually show negative definiteness?
By showing that
- (1/sig^2) * z^T * X^T * X * z < 0
for every non-zero vector z. This holds because z^T * X^T * X * z = (Xz)^T * (Xz) = ||Xz||^2 >= 0 (and > 0 when Xz is non-zero), so multiplying by -(1/sig^2) makes it negative.
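A numeric illustration (not a proof): for a full-rank toy X, every eigenvalue of -(1/sig^2) * X^T * X comes out negative, which is another way to see negative definiteness.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 3))   # toy design matrix, full column rank with high probability
sig = 0.5

H = -(1 / sig**2) * (X.T @ X)      # the matrix whose definiteness we check
print(np.linalg.eigvalsh(H))       # all eigenvalues negative -> negative definite
```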
What is M with a line on it?
The expected value of the squared error between estimated parameter values and the true values.
Give the formula for M with the line on it.
M_ = B^2 + V
with B= bias and V = variance
A function of a random variable is…
itself a random variable.
How can you check if two random variables x and y are independent?
Check if p(x, y) = p(x) * p(y).
How do you check conditional independence?
p(x,y|z) = p(x|z) * p(y|z)
If you know the probability of the outcomes of a random variable X, how do you calculate the expected value E(X)?
Multiply each value of X by its probability and add all these products.
E(X) = SUM of all ( x * P(X=x))
What should you mention when multiplying probabilities of variables?
That the variables are independent.
How do you calculate the variance of a random variable?
Use the formula
var(X) = E(X^2) - E(X)^2.
How do you calculate E(X^2)?
The probabilities stay the same as in E(X), but each value of x is replaced by its square: E(X^2) = SUM of all (x^2 * P(X=x)).
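A tiny numeric sketch for a made-up discrete distribution, computing E(X), E(X^2) and var(X) = E(X^2) - E(X)^2:

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0])     # possible outcomes (made up)
probs  = np.array([0.2, 0.5, 0.3])     # their probabilities (sum to 1)

E_X   = np.sum(values * probs)         # E(X)   = sum_x x * P(X=x)
E_X2  = np.sum(values**2 * probs)      # E(X^2): same probabilities, squared values
var_X = E_X2 - E_X**2                  # var(X) = E(X^2) - E(X)^2
print(E_X, var_X)
```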
What can you say about the hypothesis/ the weight vector in a model with data in matrix X with N rows and 1 column?
The data matrix usually has one feature value per column, plus an extra column of 1s per row for the intercept. Here X has only that column of 1s, so there are no features in this matrix X.
The weight vector contains one value per column, so here only w0. Then w.T * x.n = w0 * 1 = w0: the hypothesis is just a single number, the same for every datapoint xn.
What is the loss function of the least-squares regression problem?
(1/N) * SUM(n=1 to N) (t.n - w.T*x.n)^2
The derivative of a sum of terms is equal to…
the sum of the derivatives of those terms.
The derivative of (ax+b)^q = …
q * (ax+b)^(q-1) * a
What would the log of the likelihood be:
PRODUCT(n=1 to N) of r^x.n * ((1-r)^(1-x.n))
SUM(n=1 to N) of x.n*log(r) + (1-x.n)*log(1-r)
When you are asked to compute the maximum likelihood estimate of a Bernoulli parameter, what do you do?
Take the derivative of the log likelihood with respect to the parameter, i.e. ∂ log L / ∂ r. If there is a sum in log L, it stays in the derivative. Terms of the form a*log(r) become a/r in the derivative (and a*log(1-r) become -a/(1-r), by the chain rule).
Equate the derivative to zero and solve for r.
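Worked out for the Bernoulli log likelihood above (a standard derivation, using r for the parameter as in the earlier card):

```latex
\frac{\partial \log L}{\partial r}
  = \sum_{n=1}^{N} \frac{x_n}{r} - \sum_{n=1}^{N} \frac{1 - x_n}{1 - r} = 0
\;\Longrightarrow\;
(1 - r)\sum_{n} x_n = r \sum_{n} (1 - x_n)
\;\Longrightarrow\;
\hat{r} = \frac{1}{N}\sum_{n=1}^{N} x_n .
```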
Multiplying a negative definite matrix by a negative constant gives…
a positive definite matrix.