Ch4 Training Models Flashcards

1
Q

What is the most common performance measure for a regression model and what is its formula?

A

The most common performance measure for a regression model is the mean squared error (MSE). Note that the MSE is a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. Its formula is:

MSE(θ) = (1/m) Σ_{i=1..m} (θ^T x^(i) − y^(i))^2
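For illustration, a minimal NumPy sketch of computing the MSE of a linear model (the tiny synthetic data and variable names are assumptions for the example, not from the book):

import numpy as np

def mse(X_b, y, theta):
    # mean squared error of the linear model h(x) = theta^T x
    m = len(y)
    errors = X_b @ theta - y
    return (errors ** 2).sum() / m

# tiny synthetic example: X_b already contains a bias column of 1s
X_b = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(mse(X_b, y, np.array([1.0, 2.0])))   # 0.0, since y = 1 + 2x exactly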

2
Q

What is the closed-form solution (in other words, a mathematical equation that gives the result directly) of the linear regression model? This equation is also called the Normal Equation.

A

θ̂ = (X^T X)^(-1) X^T y

In this equation, θ̂ (theta hat) is the value of θ (a column vector) that minimizes the cost function (most often the MSE), X is the matrix of feature values (with a bias column of 1s), and y is the column vector of target values containing y^(1) to y^(m).

Note: the Normal Equation may not work if the matrix X^T X is not invertible, for example when m < n (fewer training instances than features) or when some features are redundant (linearly dependent).
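A minimal NumPy sketch of the Normal Equation on synthetic data (the data-generating function and variable names are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))                        # 100 instances, 1 feature
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)      # y = 4 + 3x + Gaussian noise

X_b = np.c_[np.ones((100, 1)), X]                   # add x0 = 1 (bias term) to each instance
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y  # (X^T X)^(-1) X^T y
print(theta_hat)                                    # should be close to [4, 3]
# np.linalg.pinv(X_b) @ y is a safer alternative when X^T X is not invertible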

3
Q

What is the computational complexity of finding the parameters, using the normal equation, for a linear regression model?

A

The Normal Equation inverts X^T X, an (n+1)×(n+1) matrix (where n is the number of features), which costs roughly O(n^2.4) to O(n^3) depending on the implementation. Scikit-Learn's LinearRegression class uses an SVD approach that is about O(n^2): if you double the number of features, you multiply the computation time by roughly 4. Both approaches are linear in the number of training instances. Note, however, that once the model is trained, making a prediction is only O(n); training is the computationally hard part.
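For comparison, a minimal sketch of fitting Scikit-Learn's LinearRegression (the synthetic data is an illustrative assumption):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 3))                        # 100 instances, 3 features
y = 4 + X @ np.array([3.0, -2.0, 0.5]) + rng.standard_normal(100)

lin_reg = LinearRegression()
lin_reg.fit(X, y)                                   # training cost grows roughly quadratically with n
print(lin_reg.intercept_, lin_reg.coef_)
print(lin_reg.predict(X[:2]))                       # prediction is only O(n) per instance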

4
Q

What is the general idea of Gradient Descent?

A

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. It tweaks the parameters iteratively in order to minimize a cost function.

The algorithm measures the local gradient of the error function with regard to the parameter vector theta, and it goes in the direction of descending gradient (maximum downward slope). Once the gradient is zero, you have reached a minimum!

An important parameter of Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too low, it will take a long time to converge to a solution. On the other hand, if it is too high, you may overshoot the optimal solution and make the algorithm diverge.

5
Q

What are the three main variants of Gradient Descent?

A

1) Batch Gradient Descent
2) Stochastic Gradient Descent
3) Mini-batch Gradient Descent

6
Q

What is the main idea of Batch Gradient Descent?

A

Batch Gradient Descent computes the gradient (the partial derivative of the cost function) with regard to each model parameter θ, using the whole training set at every step. If MSE is the cost function and the learning rate is η, the update rule is:

θ_next = θ − η ∇θ MSE(θ)

Note that at every step it evaluates ∇θ MSE(θ) over the full training set, which makes it very slow on large datasets. The learning rate η is also really important: if it is too high, the solution may bounce up and down without ever converging, and if it is too low, it will take much more time to find the solution.
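A minimal NumPy sketch of Batch Gradient Descent for linear regression with the MSE cost (the learning rate, iteration count, and synthetic data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)
X_b = np.c_[np.ones((100, 1)), X]                   # add bias column

eta = 0.1                                           # learning rate η
n_iterations = 1000
m = len(y)

theta = rng.standard_normal(2)                      # random initialization
for _ in range(n_iterations):
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y) # gradient of the MSE over the full training set
    theta = theta - eta * gradients                 # step in the direction of descending gradient
print(theta)                                        # close to [4, 3]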

7
Q

What is the main idea of Stochastic Gradient Descent?

A

The update rule is the same as for Batch Gradient Descent. For example, when using the MSE in a linear regression problem:

θ_next = θ − η ∇θ MSE(θ)

But Stochastic Gradient Descent (SGD) picks a single random instance from the training set at every step and computes the gradient based only on that instance. After one iteration it discards that instance, picks another one at random, and recomputes the gradient.

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
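In practice you can use Scikit-Learn's SGDRegressor, which trains a linear model by Stochastic Gradient Descent — a minimal sketch with illustrative hyperparameter values and synthetic data:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.01, random_state=42)
sgd_reg.fit(X, y)                                   # one random instance per gradient step
print(sgd_reg.intercept_, sgd_reg.coef_)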

8
Q

What is the main idea of mini-batch Gradient Descent?

A

Mini-batch Gradient Descent is a compromise between Batch Gradient Descent and Stochastic Gradient Descent. The update rule is the same as for Batch Gradient Descent. For example, when using the MSE in a linear regression problem:

θ_next = θ − η ∇θ MSE(θ)

Mini-batch Gradient Descent computes the gradient on small random subsets of the training set called mini-batches. After each iteration it discards the current mini-batch, selects another one (randomly or from a pre-set schedule), and recomputes the gradient.

It can be regarded as a stochastic approximation of gradient descent optimization, since it replaces the actual gradient (calculated from the entire data set) by an estimate thereof (calculated from a randomly selected subset of the data).
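A minimal NumPy sketch of Mini-batch Gradient Descent (the batch size, learning rate, epoch count, and synthetic data are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(100)
X_b = np.c_[np.ones((100, 1)), X]

eta, n_epochs, batch_size, m = 0.05, 100, 20, len(y)
theta = rng.standard_normal(2)
for _ in range(n_epochs):
    indices = rng.permutation(m)                            # reshuffle the training set each epoch
    for start in range(0, m, batch_size):
        batch = indices[start:start + batch_size]           # one small random mini-batch
        gradients = (2 / len(batch)) * X_b[batch].T @ (X_b[batch] @ theta - y[batch])
        theta = theta - eta * gradients
print(theta)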

9
Q

Why is a polynomial regression model capable of finding relationships between features?

A

PolynomialFeatures adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a^2, a^3, b^2, and b^3, but also the combinations ab, a^2 b, and a b^2.

Note that plain linear regression cannot capture such relationships between features, since those interaction terms are of degree 2 and above. Also beware of the combinatorial explosion of the number of features!
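A minimal Scikit-Learn sketch of polynomial regression with PolynomialFeatures (the quadratic synthetic data is an illustrative assumption):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(100)  # quadratic data + noise

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                      # adds x^2 as an extra feature
lin_reg = LinearRegression().fit(X_poly, y)
print(lin_reg.intercept_, lin_reg.coef_)            # roughly 2 and [1, 0.5]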

10
Q

Imagine you have a function, y = 0.5 x1^2 + x1 + 2 + Gaussian noise, generating random data points. What would happen if you used a regression model of degree 1? Of degree 100?

A

The high-degree (100) polynomial regression model would severely overfit the training data, while the linear (degree-1) model would underfit it.

11
Q

Imagine you have a function, y = 0.5 x1^2 + x1 + 2 + Gaussian noise, generating random data points. What would happen if you plotted the learning curves against the training set size for a regression model of degree 1? Of degree 10? Note: a learning curve here plots the model's error (the RMSE in this case) on the training set and on the validation set as a function of the training set size.

A

For the underfitting (degree-1) model, the error on the training data goes up as instances are added until it reaches a plateau, at which point adding new instances to the training set does not make the average error much better or worse; the validation error settles at a similarly high plateau.

The overfitting (degree-10) model's curves look similar, but there are two important differences:

1) The error on the training data is much lower.
2) There is a big gap between the training curve and the validation curve, which is a sign of overfitting.
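A sketch of plotting learning curves with Scikit-Learn's learning_curve helper (the estimator, data, and cross-validation settings are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.standard_normal(100)

train_sizes, train_scores, valid_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="neg_root_mean_squared_error")
plt.plot(train_sizes, -train_scores.mean(axis=1), "r-+", label="training RMSE")
plt.plot(train_sizes, -valid_scores.mean(axis=1), "b-", label="validation RMSE")
plt.xlabel("Training set size")
plt.legend()
plt.show()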

12
Q

What is the ridge regression?

A

Ridge Regression is a regularized version of Linear Regression: a regularization term is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. The cost function becomes:

Cost(θ) = MSE(θ) + α (1/2) Σ_{i=1..n} θ_i^2

Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to use the unregularized performance measure to evaluate the model’s performance. The hyperparameter α controls how much you want to regularize the model.
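A minimal Scikit-Learn sketch of Ridge Regression (the alpha value and synthetic data are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = 2 * rng.random((100, 3))
y = 4 + X @ np.array([3.0, -2.0, 0.5]) + rng.standard_normal(100)

ridge_reg = Ridge(alpha=1.0)                        # alpha is the regularization strength α
ridge_reg.fit(X, y)
print(ridge_reg.intercept_, ridge_reg.coef_)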

13
Q

What do we need to do before performing Ridge Regression?

A

It is important to scale the data before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.
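A minimal sketch of putting the scaler and the regularized model in a Pipeline, so the scaling statistics are learned from the training set only (the data and split are illustrative assumptions):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((200, 2)) * np.array([1.0, 1000.0])  # two features on very different scales
y = X[:, 0] + 0.003 * X[:, 1] + rng.standard_normal(200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)                         # scaler mean/std come from the training set only
print(model.score(X_test, y_test))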

14
Q

What does regularization tend to do?

A

It tends to increase the bias but lower the variance.

15
Q

What is the closed form solution of the Ridge regression?

A

θ̂ = (X^T X + αA)^(-1) X^T y

A is the (n+1)×(n+1) identity matrix, except with a 0 in the top-left cell, which corresponds to the bias term. Note that θ̂ denotes the parameter estimate. The hyperparameter α controls how much you want to regularize the model.

16
Q

What is the Lasso regression and what its formula?

A

The Least Absolute Shrinkage and Selection Operator Regression (usually simply called Lasso Regression) is a regularized version of Linear Regression. The cost function is:

Cost(θ) = MSE(θ) + α Σ_{i=1..n} |θ_i|
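A minimal Scikit-Learn sketch showing how Lasso drives the weights of useless features toward zero (alpha and the synthetic data are illustrative assumptions):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)   # only the first two features matter

lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
print(lasso_reg.coef_)                              # weights of the three useless features end up (near) zero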

17
Q

What is the difference between the ridge regression and lasso regression?

A

They are both regularized versions of linear regression. Ridge uses an L2 penalty (the sum of the squared parameters) while Lasso uses an L1 penalty (the sum of the absolute values of the parameters).

18
Q

What is the most important characteristic of the Lasso Regression?

A

It tends to eliminate the weights of the least important features (i.e., set them to zero).

19
Q

What is the difference between the L1 norm and L2 norm.

A

L2 norm:

||θ||_2 = sqrt(Σ_i θ_i^2)

L1 norm:

||θ||_1 = Σ_i |θ_i|

20
Q

How can we keep Gradient Descent from bouncing around the optimum at the end of training when using Lasso?

A

You need to gradually reduce the learning rate during training (it will still bounce around the optimum but the steps will get smaller and smaller, so it will converge).

21
Q

What is Elastic net regression?

A

It is a middle ground between Ridge Regression and Lasso Regression: its regularization term is a mix of both Ridge's and Lasso's terms, controlled by a mix ratio r (r = 0 is equivalent to Ridge, r = 1 to Lasso).
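In Scikit-Learn the mix ratio corresponds to the l1_ratio parameter of ElasticNet — a minimal sketch with illustrative hyperparameters and synthetic data:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.standard_normal(200)

elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)   # l1_ratio = 0 behaves like Ridge, 1 like Lasso
elastic_net.fit(X, y)
print(elastic_net.coef_)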

22
Q

When should you use plain Linear Regression, Ridge, Lasso, or Elastic Net?

A

It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain linear regression. Ridge is a good default, but if you suspect that only a few features are useful, you should prefer Lasso or Elastic Net because they tend to reduce the useless features' weights down to zero.

23
Q

What do we mean by early stopping when regularizing a model, and why should we use it?

A

Early stopping is a different way to regularize iterative learning algorithms such as Gradient Descent: you stop training as soon as the validation error reaches a minimum.

We use it to prevent the model from overfitting the training data. Geoffrey Hinton called early stopping a "beautiful free lunch".
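A sketch of early stopping with a warm-started SGDRegressor (the epoch count, learning-rate settings, and synthetic data are illustrative assumptions):

import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.standard_normal(200)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,   # warm_start keeps the weights between fit() calls
                       penalty=None, learning_rate="constant", eta0=0.0005, random_state=42)
best_error, best_model = float("inf"), None
for epoch in range(500):
    sgd_reg.fit(X_train, y_train)                   # continues training for one more epoch
    val_error = mean_squared_error(y_valid, sgd_reg.predict(X_valid))
    if val_error < best_error:
        best_error, best_model = val_error, deepcopy(sgd_reg)    # remember the model with the lowest validation error
print(best_error)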

24
Q

What is the formula for logistic regression?

A

p̂ = σ(x^T θ) = 1 / (1 + exp(−x^T θ))

Here σ is the logistic (sigmoid) function; the bias term θ_0 plays the role of the intercept and the remaining entries of θ weight the features.

25
Q

What is a logistic regression model (or logit model)?

A

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1 and the sum adding to one.

26
Q

Explain the cost function of logistic regression for a single training instance.

A

c(θ) = −log(p̂)       if y = 1
c(θ) = −log(1 − p̂)   if y = 0

This cost function makes sense because −log(t) grows very large when t approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance (or close to 1 for a negative instance). On the other hand, −log(t) is close to 0 when t is close to 1, so the cost is close to 0 when the estimated probability is close to the target.

27
Q

What is the logistic regression cost function (log loss)?

A

Cost(θ) = −(1/m) Σ_{i=1..m} [ y^(i) log(p̂^(i)) + (1 − y^(i)) log(1 − p̂^(i)) ]

Note: y^(i) is the target value for the i-th instance; it is either 0 or 1.
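A minimal NumPy sketch of computing the log loss from predicted probabilities (the example numbers are illustrative):

import numpy as np

def log_loss(y, p_hat, eps=1e-15):
    # average cross-entropy between binary targets y and predicted probabilities p_hat
    p_hat = np.clip(p_hat, eps, 1 - eps)            # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(log_loss(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))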

28
Q

What is the closed-form solution for the logistic regression cost function?

A

There is no known closed-form equation to compute the value of θ that minimizes the cost function.

The good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough).

29
Q

What is the softmax regression, or multinomial logistic regression?

A

It is Logistic Regression generalized to support multiple classes directly, without having to train and combine multiple binary classifiers.

The idea is simple: given an instance x, the Softmax Regression model computes a score for each class, estimates the probability of each class from those scores, and bases its prediction on the class with the highest estimated probability.

30
Q

What is the softmax function (or multinomial logistic function)?

A

p̂_k = σ(s(x))_k = exp(s_k(x)) / Σ_{j=1..K} exp(s_j(x))   (the softmax function)

s_k(x) = x^T θ^(k)   (the softmax score for class k)

Where K is the number of classes, s(x) is the vector containing the scores of each class for the instance x, and σ(s(x))_k is the estimated probability that the instance x belongs to class k, given the scores of each class for that instance.

Note, once you have computed the score of every class for the instance x, you can estimate the probability that the instance belongs to class k by running the scores through the softmax function.
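A minimal NumPy sketch of the softmax function applied to a vector of class scores (the score values are illustrative):

import numpy as np

def softmax(scores):
    # turn a vector of class scores s(x) into class probabilities
    exps = np.exp(scores - scores.max())            # subtract the max for numerical stability
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())                           # the highest score gets the highest probability; probabilities sum to 1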

31
Q

What do we need to keep in mind when using Softmax Regression?

A

The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput), so it should be used only with mutually exclusive classes, such as different types of plants. You cannot use it to recognize multiple people in one picture.

32
Q

Softmax Regression uses the cross-entropy cost function. What is its formula, and what does it mean?

A

Cost(θ) = −(1/m) Σ_{i=1..m} Σ_{k=1..K} y_k^(i) log(p̂_k^(i))

In this equation, y_k^(i) is the target probability that the i-th instance belongs to class k. In general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or not.

The objective is to have a model that estimates a high probability for the target class and a low probability for the other classes. The cross-entropy cost function penalizes the model when it estimates a low probability for a target class.

33
Q

Which linear regression training algorithm can you use if you have a training set with millions of features?

A

You can use Stochastic Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch Gradient Descent if the training set fits in memory. But you cannot use the Normal Equation or the SVD approach, because the computational complexity grows quickly (more than quadratically) with the number of features.

34
Q

Suppose the features in your training set have very different scales. Which algorithms might suffer from this, and how? What can you do about it?

A

If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. To solve this, you should scale the data before training the model. Note also that regularized models may converge to a suboptimal solution if the features are not scaled.

35
Q

Can Gradient Descent get stuck in a local minimum when training a logistic regression model?

A

Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model, because the cost function is convex and has only one (global) minimum.

36
Q

Do all gradient descent algorithms lead to the same model, provided you let them run long enough?

A

If the optimization problem is convex (as with linear regression or logistic regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum.

37
Q

Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

A

One possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.

38
Q

Is it a good idea to stop mini-batch gradient descent immediately when the validation error goes up?

A

Due to their random nature, neither Stochastic GD nor Mini-batch GD is guaranteed to make progress at every single training iteration. So if you stop training as soon as the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals and, when it has not improved for a long time, revert to the best saved model.

39
Q

Which GD algorithm will reach the vicinity of the optimal solution the fastest? Which will actually converge? how can you make the others converge as well?

A

Stochastic Gradient Descent has the fastest training iterations, since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum. However, only Batch GD will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will keep bouncing around the optimum unless you gradually reduce the learning rate.

40
Q

Suppose you are using polynomial regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

A

You are likely overfitting the training set. Three possible solutions are:

1) Try a model with fewer degrees of freedom.
2) Regularize the model, for example using Ridge (L2 penalty) or Lasso (L1 penalty).
3) Increase the size of the training set.

41
Q

Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

A

If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has high bias. You should try reducing the regularization hyperparameter α.

Note: the regularization hyperparameter is there to reduce overfitting, and this situation hints that the model is underfitting instead.

42
Q

Why would you want to use Ridge Regression instead of plain Linear Regression?

A

A model with some regularization typically performs better than a model without any regularization, so you should generally prefer ridge regression over plain linear regression.

43
Q

Why would you want to use Lasso instead of Ridge regression?

A

Lasso uses an L1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for the most important ones. It is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.

44
Q

Why would you want to use Elastic Net instead of Lasso?

A

Elastic Net is generally preferred over Lasso, since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances).

45
Q

Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

A

Since these are not mutually exclusive classes (a picture can be, for example, both outdoor and nighttime), you should train two Logistic Regression classifiers.