Linear Regression Flashcards

1
Q

What is gradient descent?

A

an algorithm that tweaks parameters iteratively (individually) in order to minimize a cost function

2
Q

What is batch gradient descent?

A

instead of computing gradients individually, it computes them all in one go by using the whole training set at each iteration

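Sketched below is batch gradient descent for linear regression in plain NumPy; a toy illustration, not canonical code (eta, n_epochs, and the synthetic data are my own choices):

    import numpy as np

    # toy data: y = 4 + 3x + noise
    rng = np.random.default_rng(42)
    X = 2 * rng.random((100, 1))
    y = 4 + 3 * X + rng.normal(size=(100, 1))

    X_b = np.c_[np.ones((100, 1)), X]  # add bias term x0 = 1
    eta = 0.1                          # learning rate
    n_epochs = 1000
    m = len(X_b)

    theta = rng.normal(size=(2, 1))    # random initialization
    for _ in range(n_epochs):
        # "batch": the gradient is computed over the WHOLE training set
        gradients = 2 / m * X_b.T @ (X_b @ theta - y)
        theta -= eta * gradients       # step in the direction opposite the gradient

    print(theta)                       # should land close to [[4], [3]]
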
3
Q

what are the downsides of batch gradient descent?

A

because it uses the whole set at each step, it is very slow on large sets

4
Q

what does a higher learning rate mean with gradient descent?

A

if it is too high, the algorithm overshoots the minimum and may diverge, jumping further away at every step and failing to find a good solution

5
Q

what does a lower learning rate mean with gradient descent?

A

the algorithm needs many more iterations to converge, so training takes much longer

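A tiny self-contained illustration of both effects, minimizing f(x) = x² (gradient 2x); the step counts and rates are arbitrary picks of mine:

    # gradient descent on f(x) = x**2
    def descend(eta, steps=20, x=1.0):
        for _ in range(steps):
            x -= eta * 2 * x   # gradient of x**2 is 2x
        return x

    print(descend(eta=1.1))    # |x| grows every step: diverges, no good solution
    print(descend(eta=0.01))   # converges toward 0, but very slowly
    print(descend(eta=0.3))    # a sensible rate gets close to 0 quickly
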
6
Q

how does gradient descent perform with features with different scales?

A

the cost function becomes an elongated bowl, so gradient descent zigzags and takes longer to reach the minimum, making the algorithm slower

7
Q

how does gradient descent perform with features with same scales?

A

it goes directly to the minimum without jumping around, making the algorithm faster

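One common fix is standardizing the features first; a minimal sketch with sklearn's StandardScaler (the toy matrix is made up):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # two features on wildly different scales
    X = np.array([[1.0, 2000.0], [2.0, 3000.0], [3.0, 1000.0]])

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
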
8
Q

which is better to use for Linear Regression on a large dataset: Gradient Descent or the Normal Equation? Why?

A

Gradient Descent, because it is faster there: the Normal Equation needs the whole training set at once and gets very slow as the number of features grows

9
Q

out of gradient descent and normal equation, which is faster, why?

A

Gradient descent; its stochastic and mini-batch variants can handle instances one (or a few) at a time, while the Normal Equation has to process the entire training set in one go.

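For comparison, the Normal Equation solves for the optimal weights in one shot, theta = inverse(X^T X) X^T y; a NumPy sketch on toy data:

    import numpy as np

    rng = np.random.default_rng(0)
    X = 2 * rng.random((100, 1))
    y = 4 + 3 * X + rng.normal(size=(100, 1))
    X_b = np.c_[np.ones((100, 1)), X]  # add bias column

    # closed-form solution; inverting X_b.T @ X_b is the part that
    # gets expensive as the number of features grows
    theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    print(theta_best)  # close to [[4], [3]]
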
10
Q

what is a cost function?

A

a function that measures how poorly the model fits the training data (for linear regression, typically the mean squared error); training means finding the parameters that minimize it

11
Q

what is stochastic gradient descent?

A

as opposed to batch gd, which uses the whole training set at each step, sgd picks a random instance at each step and computes the gradient on that single instance, making it much faster and better suited to big training sets

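A NumPy sketch of SGD, including a simple learning schedule that gradually lowers the learning rate (later cards cover how fast to lower it; t0 and t1 are arbitrary schedule constants of my own):

    import numpy as np

    rng = np.random.default_rng(42)
    X = 2 * rng.random((100, 1))
    y = 4 + 3 * X + rng.normal(size=(100, 1))
    X_b = np.c_[np.ones((100, 1)), X]
    m = len(X_b)

    n_epochs, t0, t1 = 50, 5, 50

    def learning_schedule(t):
        return t0 / (t + t1)  # decays over time: big steps early, small steps late

    theta = rng.normal(size=(2, 1))
    for epoch in range(n_epochs):
        for i in range(m):
            idx = rng.integers(m)                 # pick ONE random instance
            xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
            gradients = 2 * xi.T @ (xi @ theta - yi)
            eta = learning_schedule(epoch * m + i)
            theta -= eta * gradients
    print(theta)
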
12
Q

which gd algorithm is better for large sets?

A

sgd because it handles instances one at a time, instead of using the whole set at each step like bgd

13
Q

what happens when you reduce the sgd’s learning rate too slowly?

A

the algorithm keeps jumping around the minimum for a long time, and may end up with a suboptimal solution if training stops too early

14
Q

what happens when you reduce the sgd’s learning rate too quickly?

A

it may get stuck in a local minimum, or end up frozen halfway to the minimum

15
Q

what is mini-batch gradient descent?

A

it computes the gradients on small random sets of instances called mini-batches; a middle ground between batch gd (the whole set) and sgd (a single instance)

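A sketch of the mini-batch variant in the same NumPy setup (batch_size and the iteration count are my own picks):

    import numpy as np

    rng = np.random.default_rng(42)
    X = 2 * rng.random((100, 1))
    y = 4 + 3 * X + rng.normal(size=(100, 1))
    X_b = np.c_[np.ones((100, 1)), X]
    m, batch_size, eta = len(X_b), 16, 0.1

    theta = rng.normal(size=(2, 1))
    for _ in range(200):
        idx = rng.choice(m, size=batch_size, replace=False)  # small random set
        xb, yb = X_b[idx], y[idx]
        gradients = 2 / batch_size * xb.T @ (xb @ theta - yb)
        theta -= eta * gradients
    print(theta)
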
16
Q

what is polynomial regression?

A

when the data is too complex for a straight line, you can add powers of each feature as new features, then train a linear regression model on this extended set of features

17
Q

what is good about polynomial regression?

A

it can find relationships between features (the added polynomial terms include combinations of features), and it can be used when a straight line won't fit the data

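A sketch with sklearn's PolynomialFeatures (degree 2 here; with more than one input feature it would also add cross-terms like x1·x2, which is how it captures relationships between features):

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    # toy quadratic data: y = 0.5x^2 + x + 2 + noise
    rng = np.random.default_rng(42)
    X = 6 * rng.random((100, 1)) - 3
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

    poly = PolynomialFeatures(degree=2, include_bias=False)
    X_poly = poly.fit_transform(X)        # columns: [x, x^2]
    model = LinearRegression().fit(X_poly, y)
    print(model.intercept_, model.coef_)  # roughly 2, [1, 0.5]
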
18
Q

what are some regularized linear regression models?

A

ridge, lasso, elastic net

19
Q

what is ridge regression?

A

a regularized version of linear regression that forces the learning algorithm not only to fit the data but also to keep the model weights as small as possible

20
Q

what happens if ridge's α = 0?

A

it is linear regression

21
Q

what happens if ridge's α is very large?

A

all weights end up very close to zero, resulting in a flat line

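A sketch of the two extremes with sklearn's Ridge, whose alpha parameter plays the role of α (sklearn actually advises LinearRegression instead of alpha=0; it is set that way here purely to illustrate):

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.random((50, 3))
    y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

    print(Ridge(alpha=0).fit(X, y).coef_)    # alpha=0: plain linear regression
    print(Ridge(alpha=1e6).fit(X, y).coef_)  # huge alpha: weights shrink toward 0 (flat model)
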
22
Q

what is an important step before using ridge regression?

A

scaling the data (e.g., with sklearn's StandardScaler), because ridge regression is sensitive to the scale of the input features

23
Q

what should you do before ridge regression?

A

scale your features so they share a similar range; the penalty shrinks all weights together, so unscaled features would be penalized unevenly

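A sketch of the usual way to bake that step in, via an sklearn Pipeline (toy data of my own):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = np.c_[rng.random(50), 1000 * rng.random(50)]  # second feature on a much bigger scale
    y = X[:, 0] + 0.001 * X[:, 1] + rng.normal(scale=0.1, size=50)

    # scaling happens automatically before ridge ever sees the data
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    model.fit(X, y)
    print(model.predict(X[:3]))
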
24
Q

what is lasso regression?

A

a regularized version of linear regression that tends to eliminate the weights of the least important features (it sets them to zero), automatically performing feature selection and outputting a sparse model

25
Q

what is good about lasso?

A

it automatically performs feature selection

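A sketch of that selection happening with sklearn's Lasso (the third feature below contributes nothing to y):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # feature 3 is pure noise

    lasso = Lasso(alpha=0.1).fit(X, y)
    print(lasso.coef_)  # the third coefficient is typically driven to exactly 0: a sparse model
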
26
Q

what is elastic net?

A

it is in between ridge and lasso: its regularization term is a mix of both penalties, controlled by a mix ratio r

27
Q

what happens if elastic net's r = 0?

A

it is ridge

28
Q

what happens if elastic net's r = 1?

A

it is lasso

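In sklearn the mix ratio r is called l1_ratio; a sketch of the two extremes (for l1_ratio=0 sklearn actually recommends Ridge, this is just to illustrate):

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.default_rng(0)
    X = rng.random((100, 3))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

    ridge_like = ElasticNet(alpha=0.1, l1_ratio=0.0)  # r = 0: pure ridge penalty
    lasso_like = ElasticNet(alpha=0.1, l1_ratio=1.0)  # r = 1: pure lasso penalty
    print(ridge_like.fit(X, y).coef_)
    print(lasso_like.fit(X, y).coef_)
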
29
Q

when should you use ridge?

A

as a default for linear regression

30
Q

when should you use lasso or elastic net?

A

when you suspect only a few features are useful, because both tend to reduce the weights of useless features to (or close to) zero

31
Q

when should you choose elastic net over lasso?

A

when the number of features is larger than the number of instances, or when several features are strongly correlated; in both cases lasso may behave erratically

32
Q

why would you want to use ridge regression instead of linear regression?

A

a model with some regularization typically performs better than one with none, so you should use ridge as a default over plain linear regression

33
Q

why would you want to use lasso instead of ridge regression?

A

lasso leads to a sparse model, which is a way to perform feature selection automatically; this is useful if you suspect that only a few features actually matter. When you are not sure, you should use ridge regression.

34
Q

why would you want to use elastic net instead of lasso?

A

elastic net is generally preferred over lasso, because lasso may behave erratically when the number of features exceeds the number of instances or when several features are strongly correlated; if you want lasso-like behaviour without the erratic side, use elastic net with its mix ratio r set close to 1

35
Q

when would you choose two logistic regression classifiers over one softmax regression classifier? and vice versa

A

two logistic regression classifiers when the classes are not mutually exclusive,

one softmax regression classifier when they are.

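For reference, a NumPy sketch of the softmax function itself, which turns one score per class into probabilities summing to 1, so exactly one class wins (hence mutually exclusive classes):

    import numpy as np

    def softmax(scores):
        # subtract the max for numerical stability; it doesn't change the result
        exps = np.exp(scores - scores.max())
        return exps / exps.sum()

    print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.66, 0.24, 0.10]; sums to 1
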
36
Q

what suffers from features having different scales?

A

GD algorithms

37
Q

Can GD get stuck in a local minimum when training a logistic regression model?

A

no, because logistic regression's cost function (the log loss) is convex

38
Q

what does it mean when there is a large gap between the training error and the validation error?

A

if the validation error is much higher than the training error, your model is overfitting the training set

39
Q

what does it mean when your validation error is much higher than your training error?

A

the model is overfitting
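A sketch of how to see that gap with sklearn's learning_curve, using a deliberately overfitting model (a degree-15 polynomial, my own choice):

    import numpy as np
    from sklearn.model_selection import learning_curve
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = 6 * rng.random((100, 1)) - 3
    y = 0.5 * X[:, 0] ** 2 + X[:, 0] + 2 + rng.normal(size=100)

    model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring="neg_mean_squared_error",
        train_sizes=np.linspace(0.1, 1.0, 5))

    # a training error far below the validation error means overfitting
    print(-train_scores.mean(axis=1))
    print(-val_scores.mean(axis=1))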