Linear Regression Flashcards
What is gradient descent?
an optimization algorithm that iteratively tweaks the model's parameters in order to minimize a cost function
What is batch gradient descent?
instead of computing each partial derivative individually, it computes them all in one go, using the whole training set at every iteration
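A minimal numpy sketch of batch gradient descent fitting a linear model (the toy data, learning rate, and iteration count here are illustrative, not from any particular source):

```python
import numpy as np

# toy data (illustrative): y is roughly 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]  # prepend the bias feature x0 = 1
eta = 0.1                          # learning rate (illustrative value)
n_iterations = 1000
m = len(X_b)

theta = rng.normal(size=(2, 1))    # random initialization
for _ in range(n_iterations):
    # gradient of the MSE cost, computed over the WHOLE training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4], [3]]
```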
what are the downsides to batch gradient descent?
because it uses the whole training set at every step, it is very slow on large training sets
what happens with too high a learning rate in gradient descent?
the algorithm may overshoot the minimum and diverge, failing to find a good solution
what happens with too low a learning rate in gradient descent?
the algorithm needs many more iterations to converge, so training takes much longer
how does gradient descent perform with features with different scales?
it takes much longer to reach the minimum because it zigzags down an elongated cost surface, making the algorithm slower
how does gradient descent perform with features with same scales?
it goes directly to the minimum without jumping around, making the algorithm faster
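One common way to put features on the same scale first is scikit-learn's StandardScaler; a minimal sketch with made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy features on very different scales (illustrative values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)
```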
what algorithm is better to use with Linear Regression out of -> Gradient Descent or Normal Equation when you have a larger dataset? Why?
Gradient Descent, because the Normal Equation computes the inverse of X^T X, which gets very slow as the number of features grows; gradient descent scales much better on large datasets
out of gradient descent and normal equation, which is faster, why?
Gradient descent (in its stochastic form), because it can handle instances one at a time instead of crunching the whole training set in one heavy matrix computation.
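For comparison, a sketch of the Normal Equation in numpy; the inversion of X^T X is the part that scales badly with the number of features (data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # bias column

# closed form: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # close to [[4], [3]]
```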
what is a cost function?
a measure of how badly the model fits the training data (for linear regression, typically the mean squared error); training works by minimizing it
what is stochastic gradient descent?
as opposed to batch gd, which uses the whole set at each step, sgd picks a random instance at each step and computes the gradients on that single instance, making it much faster and better suited to big sets
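A minimal numpy sketch of stochastic gradient descent, including a simple decaying learning schedule (the schedule constants t0 and t1 are illustrative); the next few cards about reducing the learning rate refer to how fast such a schedule decays:

```python
import numpy as np

rng = np.random.default_rng(1)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

t0, t1 = 5, 50  # illustrative schedule constants

def learning_schedule(t):
    # decay too slowly and theta keeps jumping around;
    # decay too quickly and training can freeze early
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))
for epoch in range(50):
    for i in range(m):
        idx = rng.integers(m)                       # pick ONE random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]   # keep 2-D shapes
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta = theta - learning_schedule(epoch * m + i) * gradients
```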
which gd algorithm is better for large sets?
sgd because it handles instances one at a time, instead of using the whole set at each step like bgd
what happens if you reduce sgd’s learning rate too slowly?
it jumps around the minimum for ages and may end up with a suboptimal solution
what happens if you reduce sgd’s learning rate too quickly?
it may get stuck in a local minimum, or end up frozen halfway to the minimum
what is mini-batch gradient descent?
it computes the gradients on small random sets of instances called mini-batches, sitting between sgd (one instance) and bgd (the whole set)
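A sketch of the mini-batch update, again with illustrative toy data and hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
theta = rng.normal(size=(2, 1))

batch_size = 20  # illustrative
for _ in range(1000):
    idx = rng.permutation(len(X_b))[:batch_size]    # small random subset
    X_mb, y_mb = X_b[idx], y[idx]
    gradients = 2 / batch_size * X_mb.T @ (X_mb @ theta - y_mb)
    theta = theta - 0.05 * gradients
```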
what is polynomial regression?
when the data is too complex for a straight line, you can add powers of each feature as new features, then train a linear regression model on this extended feature set
what is good about polynomial regression?
it can find relationships between features (via combination terms), and it can be used when a straight line won't fit the data
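A sketch of polynomial regression with scikit-learn's PolynomialFeatures (quadratic toy data; the degree is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.normal(size=(100, 1))  # quadratic, not linear

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                 # adds x^2 as a new feature
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)           # roughly 2, [1, 0.5]
```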
what are some regularized linear regression models?
ridge, lasso, elastic net
what is ridge regression?
a regularized version of linear regression that forces the learning algorithm to fit the data, but also keep the model weights as small as possible
what happens if ridge's alpha = 0?
it is linear regression
what happens if ridge's alpha is very large?
all weights end up very close to zero, resulting in a flat line
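A minimal scikit-learn Ridge sketch showing where alpha goes (data and alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.random((50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# alpha=0 behaves like plain linear regression; a very large alpha
# pushes all weights toward zero (in practice, scale X first)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(ridge.coef_)
```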
what is an important step before using ridge regression?
scale the data! ridge regression is sensitive to the scale of the input features
what is lasso regression?
a regularized version of linear regression that tends to eliminate the weights of the least important features, automatically performing feature selection and outputting a sparse model
what is good about lasso?
it automatically performs feature selection
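A sketch of Lasso producing a sparse model on toy data where only the first feature matters (alpha is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.random((100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 is useful

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # weights of the useless features are driven to exactly 0
```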
what is elastic net?
it sits in between ridge and lasso: a mix of both regularization terms, controlled by a mix ratio r
what happens if elastic net's r = 0?
it is equivalent to ridge
what happens if elastic net's r = 1?
it is equivalent to lasso
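In scikit-learn, the mix ratio r from these cards corresponds to ElasticNet's l1_ratio parameter; a minimal sketch with illustrative values:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.random((100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=100)

# l1_ratio plays the role of r: 0 is all ridge penalty, 1 is all lasso
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```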
when should you use ridge?
as a default for linear regression
when should you use lasso or elastic net?
when you suspect only a few features are actually useful, because they both tend to reduce the useless features' weights to zero
when should you choose elastic net over lasso?
when the number of features is greater than the number of training instances, or when several features are strongly correlated, because in those cases lasso may behave erratically
why would you want to use ridge regression instead of linear regression?
a model with some regularization typically performs better than one without, so ridge is a better default than plain linear regression
why would you want to use lasso instead of ridge regression?
lasso leads to a sparse model, which is a way to perform feature selection automatically; that's good if you suspect that only a few features actually matter. when you are not sure, prefer ridge regression.
why would you want to use elastic net instead of lasso?
elastic net is generally preferred over lasso, since lasso may behave erratically when features outnumber instances or when several features are strongly correlated; if you want lasso-like behaviour without the erratic behaviour, use elastic net with its mix ratio r set close to 1
when would you choose two logistic regression classifiers over one softmax regression classifier, and vice versa?
use separate logistic regression classifiers when the classes are not mutually exclusive (an instance can belong to several at once);
use softmax regression when the classes are mutually exclusive.
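In scikit-learn, LogisticRegression trained on more than two mutually exclusive classes performs softmax (multinomial) regression with the default solver; a sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 mutually exclusive classes

softmax_clf = LogisticRegression(max_iter=1000)
softmax_clf.fit(X, y)
print(softmax_clf.predict_proba(X[:1]))    # class probabilities sum to 1
```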
what suffers from features having different scales?
GD algorithms
Can GD get stuck in a local minimum when training a logistic regression model?
no, because the logistic regression cost function is convex, so gradient descent is guaranteed to find the global minimum (given a reasonable learning rate and enough time)
what does it mean when there is a large gap between the training error and the validation error?
if the validation error is much higher than the training error, the model is overfitting the training set