Linear Regression Flashcards
What is gradient descent?
an optimization algorithm that iteratively tweaks the model's parameters in order to minimize a cost function
What is batch gradient descent?
instead of computing each partial derivative individually, it computes them all in one go, using the whole training set at every iteration
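A minimal numpy sketch of batch gradient descent fitting a linear model (the toy data, learning rate, and iteration count here are illustrative, not from any particular source):

```python
import numpy as np

# toy data (illustrative): y is roughly 4 + 3x plus noise
rng = np.random.default_rng(42)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))

X_b = np.c_[np.ones((100, 1)), X]  # prepend the bias feature x0 = 1
eta = 0.1                          # learning rate (illustrative value)
n_iterations = 1000
m = len(X_b)

theta = rng.normal(size=(2, 1))    # random initialization
for _ in range(n_iterations):
    # gradient of the MSE cost, computed over the WHOLE training set
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    theta = theta - eta * gradients

print(theta)  # should end up close to [[4], [3]]
```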
what are the downsides to batch gradient descent?
because it uses the whole training set at every step, it is very slow on large training sets
what happens with too high a learning rate in gradient descent?
the algorithm may overshoot the minimum and diverge, failing to find a good solution
what happens with too low a learning rate in gradient descent?
the algorithm needs many more iterations to converge, so training takes much longer
how does gradient descent perform with features with different scales?
it takes much longer to reach the minimum because it zigzags down an elongated cost surface, making the algorithm slower
how does gradient descent perform with features with same scales?
it goes directly to the minimum without jumping around, making the algorithm faster
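One common way to put features on the same scale first is scikit-learn's StandardScaler; a minimal sketch with made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy features on very different scales (illustrative values)
X = np.array([[1.0, 2000.0],
              [2.0, 3000.0],
              [3.0, 1000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and unit variance
print(X_scaled)
```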
what algorithm is better to use with Linear Regression out of -> Gradient Descent or Normal Equation when you have a larger dataset? Why?
Gradient Descent, because the Normal Equation computes the inverse of X^T X, which gets very slow as the number of features grows; gradient descent scales much better on large datasets
out of gradient descent and normal equation, which is faster, why?
Gradient descent (in its stochastic form), because it can handle instances one at a time instead of crunching the whole training set in one heavy matrix computation.
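For comparison, a sketch of the Normal Equation in numpy; the inversion of X^T X is the part that scales badly with the number of features (data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]  # bias column

# closed form: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)  # close to [[4], [3]]
```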
what is a cost function?
a measure of how badly the model fits the training data (for linear regression, typically the mean squared error); training works by minimizing it
what is stochastic gradient descent?
as opposed to batch gd, which uses the whole set at each step, sgd picks a random instance at each step and computes the gradients on that single instance, making it much faster and better suited to big sets
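A minimal numpy sketch of stochastic gradient descent, including a simple decaying learning schedule (the schedule constants t0 and t1 are illustrative); the next few cards about reducing the learning rate refer to how fast such a schedule decays:

```python
import numpy as np

rng = np.random.default_rng(1)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
m = len(X_b)

t0, t1 = 5, 50  # illustrative schedule constants

def learning_schedule(t):
    # decay too slowly and theta keeps jumping around;
    # decay too quickly and training can freeze early
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))
for epoch in range(50):
    for i in range(m):
        idx = rng.integers(m)                       # pick ONE random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]   # keep 2-D shapes
        gradients = 2 * xi.T @ (xi @ theta - yi)
        theta = theta - learning_schedule(epoch * m + i) * gradients
```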
which gd algorithm is better for large sets?
sgd because it handles instances one at a time, instead of using the whole set at each step like bgd
what happens if you reduce sgd’s learning rate too slowly?
it jumps around the minimum for ages and may end up with a suboptimal solution
what happens if you reduce sgd’s learning rate too quickly?
it may get stuck in a local minimum, or end up frozen halfway to the minimum
what is mini-batch gradient descent?
it computes the gradients on small random sets of instances called mini-batches, sitting between sgd (one instance) and bgd (the whole set)
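A sketch of the mini-batch update, again with illustrative toy data and hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)
X = 2 * rng.random((100, 1))
y = 4 + 3 * X + rng.normal(size=(100, 1))
X_b = np.c_[np.ones((100, 1)), X]
theta = rng.normal(size=(2, 1))

batch_size = 20  # illustrative
for _ in range(1000):
    idx = rng.permutation(len(X_b))[:batch_size]    # small random subset
    X_mb, y_mb = X_b[idx], y[idx]
    gradients = 2 / batch_size * X_mb.T @ (X_mb @ theta - y_mb)
    theta = theta - 0.05 * gradients
```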
what is polynomial regression?
when the data is too complex for a straight line, you can add powers of each feature as new features, then train a linear regression model on this extended feature set
what is good about polynomial regression?
it can find relationships between features (via combination terms), and it can be used when a straight line won't fit the data
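A sketch of polynomial regression with scikit-learn's PolynomialFeatures (quadratic toy data; the degree is illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = 6 * rng.random((100, 1)) - 3
y = 0.5 * X**2 + X + 2 + rng.normal(size=(100, 1))  # quadratic, not linear

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)                 # adds x^2 as a new feature
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)           # roughly 2, [1, 0.5]
```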
what are some regularized linear regression models?
ridge, lasso, elastic net
what is ridge regression?
a regularized version of linear regression that forces the learning algorithm to fit the data, but also keep the model weights as small as possible
what happens if ridge's alpha = 0?
it is linear regression
what happens if ridge's alpha is very large?
all weights end up very close to zero, resulting in a flat line
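A minimal scikit-learn Ridge sketch showing where alpha goes (data and alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.random((50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# alpha=0 behaves like plain linear regression; a very large alpha
# pushes all weights toward zero (in practice, scale X first)
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(ridge.coef_)
```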
what is an important step before using ridge regression?
scale the data! ridge regression is sensitive to the scale of the input features
what is lasso regression?
a regularized version of linear regression that tends to eliminate the weights of the least important features, automatically performing feature selection and outputting a sparse model
what is good about lasso?
it automatically performs feature selection
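A sketch of Lasso producing a sparse model on toy data where only the first feature matters (alpha is illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
X = rng.random((100, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only feature 0 is useful

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # weights of the useless features are driven to exactly 0
```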
what is elastic net?
it sits in between ridge and lasso: a mix of both regularization terms, controlled by a mix ratio r
what happens if elastic net's r = 0?
it is equivalent to ridge
what happens if elastic net's r = 1?
it is equivalent to lasso
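In scikit-learn, the mix ratio r from these cards corresponds to ElasticNet's l1_ratio parameter; a minimal sketch with illustrative values:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
X = rng.random((100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=100)

# l1_ratio plays the role of r: 0 is all ridge penalty, 1 is all lasso
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```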
when should you use ridge?
as a default for linear regression
when should you use lasso or elastic net?
when you suspect only a few features are actually useful, because they both tend to reduce the useless features' weights to zero
when should you choose elastic net over lasso?
when the number of features is greater than the number of training instances, or when several features are strongly correlated, because in those cases lasso may behave erratically
why would you want to use ridge regression instead of linear regression?
a model with some regularization typically performs better than one without, so ridge is a better default than plain linear regression
why would you want to use lasso instead of ridge regression?
lasso leads to a sparse model, which is a way to perform feature selection automatically; that's good if you suspect that only a few features actually matter. when you are not sure, prefer ridge regression.
why would you want to use elastic net instead of lasso?
elastic net is generally preferred over lasso, since lasso may behave erratically when features outnumber instances or when several features are strongly correlated; if you want lasso-like behaviour without the erratic behaviour, use elastic net with its mix ratio r set close to 1
when would you choose two logistic regression classifiers over one softmax regression classifier, and vice versa?
use separate logistic regression classifiers when the classes are not mutually exclusive (an instance can belong to several at once);
use softmax regression when the classes are mutually exclusive.
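In scikit-learn, LogisticRegression trained on more than two mutually exclusive classes performs softmax (multinomial) regression with the default solver; a sketch on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)          # 3 mutually exclusive classes

softmax_clf = LogisticRegression(max_iter=1000)
softmax_clf.fit(X, y)
print(softmax_clf.predict_proba(X[:1]))    # class probabilities sum to 1
```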
what suffers from features having different scales?
GD algorithms
Can GD get stuck in a local minimum when training a logistic regression model?
no, because the logistic regression cost function is convex, so gradient descent is guaranteed to find the global minimum (given a reasonable learning rate and enough time)
what does it mean when there is a large gap between the training error and the validation error?
if the validation error is much higher than the training error, the model is overfitting the training set