Week 6 Flashcards
Overfitting
Fitting the data seen so far so well that a small in-sample error no longer indicates a small out-of-sample error.
Deterministic noise
The part of the target function that even the best hypothesis in the hypothesis set cannot capture.
Stochastic noise
Random noise that cannot be modeled.
State two differences between deterministic and stochastic noise
1) If we generate the same data again, the deterministic noise stays the same, but the stochastic noise changes.
2) Different models capture different parts of the target function, so deterministic noise depends on the learning model you use; stochastic noise does not.
The variance of the stochastic noise is captured by the variable…
sigma_squared
What is the cause of overfitting?
Noise
Name two cures for overfitting:
1) Regularization
2) Validation
Regularization
Attempts to minimize Eout by estimating and constraining the penalty term in
Eout(h) = Ein(h) + overfit penalty
Validation
Estimates the out-of-sample error directly
validation set
A subset from the data that is not used in training.
When is a set no longer a test set?
When it affects the learning process in any way.
How is the validation set created?
The data set D is divided into a training set of size N-K and a validation set of size K. A final hypothesis is learned by the algorithm using the training set; the validation error is then computed on the validation set.
What is the rule of thumb for determining K in validation?
K = N/5
Use 80% for training and 20% for validation.
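A minimal sketch of this K = N/5 rule of thumb (the `split_train_val` helper and the 100-point toy data set are made up for illustration):

```python
import random

def split_train_val(data, seed=0):
    """Shuffle and split a data set: the last fifth (K = N/5) becomes the validation set."""
    data = list(data)
    random.Random(seed).shuffle(data)
    K = len(data) // 5            # rule of thumb: K = N/5
    return data[:-K], data[-K:]   # training set of size N-K, validation set of size K

train, val = split_train_val(range(100))
# len(train) == 80, len(val) == 20
```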
Cross validation estimate
The average value of the error made by each g_n on its own validation set.
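As an illustration, a hypothetical leave-one-out sketch in which each g_n is a constant model fitted with the n-th point left out (the function name and data are made up):

```python
def loocv_constant(ys):
    """Leave-one-out cross-validation estimate for a constant-model fit."""
    N = len(ys)
    errors = []
    for n in range(N):
        train = ys[:n] + ys[n + 1:]
        g_n = sum(train) / (N - 1)          # g_n: fit without the n-th point
        errors.append((g_n - ys[n]) ** 2)   # e_n: error of g_n on its validation point
    return sum(errors) / N                  # E_cv = average of the e_n

print(loocv_constant([1.0, 2.0, 3.0]))  # -> 1.5
```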
What does H_theta denote?
The polynomials of degree d (d with a tilde above it)
What is theta(x)? (z)
The x-vector (e.g. (1, x).T) extended with x^2 … x^d (d with a tilde above it)
When are we overfitting?
When the algorithm tries to learn from the noise instead of from the pattern.
What causes overfitting in an example without noise?
Deterministic noise: even with the best hypothesis we cannot approximate the target function perfectly.
How can you express E.out(g.D) in 3 parts?
var + bias + sigma_squared
What is the var when splitting E.out(g.D) into 3 parts?
var = E.D,x [ ( g.D(x) - average g(x) )^2 ]
What is the bias when splitting E.out(g.D) into 3 parts?
bias = E.x [ ( average g(x) - f(x) )^2 ]
What is the sigma_squared when splitting E.out(g.D) into 3 parts?
sigma_squared = E.epsilon,x [ ( epsilon(x) )^2 ]
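A hypothetical numeric sketch of this three-part decomposition: repeatedly fit a constant hypothesis h(x) = b to two noiseless samples of f(x) = x^2 and estimate var and bias over many data sets (all names and values are made up; sigma_squared = 0 here because the data are noiseless):

```python
import random

random.seed(0)
f = lambda x: x * x
xs_test = [i / 50 for i in range(-50, 51)]   # evaluation grid on [-1, 1]

hypotheses = []
for _ in range(2000):                        # many data sets D of size 2
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    b = (f(x1) + f(x2)) / 2                  # least-squares constant fit on D
    hypotheses.append(b)

g_bar = sum(hypotheses) / len(hypotheses)    # average hypothesis over all D
var = sum((b - g_bar) ** 2 for b in hypotheses) / len(hypotheses)
bias = sum((g_bar - f(x)) ** 2 for x in xs_test) / len(xs_test)
# E.out(g.D) ~ bias + var + sigma_squared, with sigma_squared = 0 here
```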
How do you compute the expected in-sample error in linear regression with noise?
sigma_squared * (1 - (d+1)/N)
How do you compute the expected out-of-sample error in linear regression with noise?
sigma_squared * (1 + (d+1)/N)
How do you compute the expected generalization error in linear regression with noise?
2 * sigma_squared * ((d+1)/N)
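These three formulas can be checked against each other in a small sketch (sigma_squared = 1 and d = 2 are made-up illustrative values):

```python
sigma_sq, d = 1.0, 2   # hypothetical noise variance and input dimension

def expected_ein(N):   # expected in-sample error
    return sigma_sq * (1 - (d + 1) / N)

def expected_eout(N):  # expected out-of-sample error
    return sigma_sq * (1 + (d + 1) / N)

def expected_gen(N):   # expected generalization error
    return 2 * sigma_sq * (d + 1) / N

# The generalization error is exactly E.out - E.in, and it shrinks like 1/N:
assert abs(expected_gen(30) - (expected_eout(30) - expected_ein(30))) < 1e-12
```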
What is H0, essentially?
The set of all hypotheses of the form h(x) = b
What is H1, essentially?
The set of all hypotheses of the form h(x) = ax + b
Diff. linear classification and linear regression:
Classification = discrete labels (binary, or multi-class)
Regression = real numbers
Diff. Logistic regression and linear regression:
Logistic regression = a real number between 0 and 1 (a probability)
Linear regression = any real number.
What does linear regression use to measure the distance between h(x) and f(x)?
Mean Square Error (MSE)
What is the formula for the mean squared error?
E.in = 1/N * sum_{i=1}^{N} (h(x.i) - y.i)^2
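A minimal sketch of this formula (the hypothesis h and the data points are made up):

```python
def mse(h, xs, ys):
    """E.in = 1/N * sum over i of (h(x_i) - y_i)^2."""
    N = len(xs)
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / N

h = lambda x: 2 * x                  # hypothetical hypothesis
print(mse(h, [1, 2, 3], [2, 4, 7]))  # one residual of 1 over N = 3 -> E.in = 1/3
```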
What is h(x) in 1) linear regression, 2) perceptron and 3) logistic regression?
1) h(x) = s
2) h(x) = sign(s)
3) h(x) = theta(s)
for s = w.T * x
Give the logistic regression algorithm (2 steps):
For every time step, do
1) compute the gradient
2) Update the weights with fixed learning rate eta:
w(t+1) = w(t) - eta * E.in gradient
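A hedged sketch of these two steps for logistic regression with cross-entropy error, assuming the standard gradient -(1/N) * sum of y_n * x_n / (1 + exp(y_n * w.x_n)); the data and eta below are made-up illustrative values:

```python
import math

def gradient_Ein(w, X, y):
    """Batch gradient of the cross-entropy in-sample error."""
    N, d = len(X), len(w)
    g = [0.0] * d
    for xn, yn in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, xn))   # signal s = w.T * x
        coef = -yn / (1 + math.exp(yn * s))
        for j in range(d):
            g[j] += coef * xn[j] / N
    return g

def gd_step(w, X, y, eta=0.1):
    """One update: w(t+1) = w(t) - eta * gradient of E.in at w(t)."""
    g = gradient_Ein(w, X, y)
    return [wi - eta * gi for wi, gi in zip(w, g)]

w = gd_step([0.0, 0.0], X=[[1, 2], [1, -1]], y=[+1, -1])
```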
Stochastic gradient descent
Does not use all examples to compute the gradient of E.in; it uses the error on a single data point (or a small batch of them).
How do you represent the XOR function in ANDs and ORs?
f(X) = (not h1 AND h2) OR (h1 AND not h2)
+1 if exactly one of h1, h2 equals +1
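This decomposition can be checked with a small truth-table sketch over ±1-valued inputs (the function name is made up):

```python
def XOR(h1, h2):
    # (NOT h1 AND h2) OR (h1 AND NOT h2): +1 iff exactly one input is +1
    return +1 if (h1 == -1 and h2 == +1) or (h1 == +1 and h2 == -1) else -1

for a in (+1, -1):
    for b in (+1, -1):
        print(a, b, XOR(a, b))
```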
What does adding more nodes per hidden layer do to the approximation and generalization of the MLP?
Approximation improves,
generalization gets worse.