Linear Regression Flashcards
Linear regression
models the relationship between an independent (explanatory) variable $X$ and a real-valued dependent variable $Y$.
Intercept of the line
this is the constant term of the linear equation
$Y = B_0 + B_1 X$
$B_0$ is the intercept: even when $X = 0$, the prediction is still $B_0$.
$B_1$ is the slope of the line: the increase in $Y$ per unit increase in $X$.
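A minimal sketch (with made-up data) fitting this line with numpy.polyfit:

```python
# Minimal sketch: fit Y = B_0 + B_1 * X on made-up data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns the highest-degree coefficient first
print(f"intercept B_0 = {b0:.2f}, slope B_1 = {b1:.2f}")
```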
R^2
measures how well the linear regression fits the data, on a scale from 0 to 1; it takes both the variance of $Y$ and the residual error into account.
It says how much of the variance in $Y$ the model explains.
For simple linear regression, $R^2$ is the squared correlation between $X$ and $Y$.
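A quick sketch (same kind of made-up data) showing two equivalent ways to get $R^2$ for a simple linear regression:

```python
# Sketch: R^2 as 1 - SS_res/SS_tot, and as the squared correlation (simple regression only).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, deg=1)
y_hat = b0 + b1 * x

ss_res = np.sum((y - y_hat) ** 2)      # variation not explained by the line
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation of Y
print(1 - ss_res / ss_tot)             # R^2
print(np.corrcoef(x, y)[0, 1] ** 2)    # squared correlation, same value here
```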
hypothesis testing
uses the standard error of the estimated coefficients
Evaluate how likely it would be to obtain a model as extreme as the one computed if there were no real relationship.
Null hypothesis: there is no linear relation (the slope is 0).
The model assumes a normally distributed noise term added to the linear relation.
Look at the p-values for the intercept and the slope: a p-value < 0.05 means such an estimate would be very unlikely under the null hypothesis, so the coefficient is considered significant at the 95% level.
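A sketch using statsmodels (an assumed library choice, with made-up data) to get standard errors and p-values for both coefficients:

```python
# Sketch: p-values for intercept and slope with statsmodels OLS.
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

X = sm.add_constant(x)        # adds the intercept column
results = sm.OLS(y, X).fit()
print(results.params)         # [intercept, slope]
print(results.bse)            # standard errors of the coefficients
print(results.pvalues)        # p-values for H0: coefficient = 0
```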
standard error
the standard error of a statistic is the standard deviation of its sampling distribution, or an estimate of that standard deviation.
For example, the standard error of the mean is a measure of the dispersion of sample means around the population mean.
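One common estimate for the standard error of the mean: $SE_{\bar{x}} = s / \sqrt{n}$, where $s$ is the sample standard deviation and $n$ the sample size.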
Residual
A residual is the vertical distance between a data point and the regression line. Each data point has one residual. They are positive if they are above the regression line and negative if they are below the regression line. In other words, the residual is the error that isn’t explained by the regression line.
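A tiny sketch (made-up data) computing residuals from a fitted line:

```python
# Sketch: residuals = observed y minus fitted y.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.2, 1.9, 3.2, 3.8])
b1, b0 = np.polyfit(x, y, deg=1)
residuals = y - (b0 + b1 * x)   # positive above the line, negative below
print(residuals)
```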
Confounder
A predictor can be significant in a simple linear regression (p < 0.05) but not significant in a multiple linear regression (p > 0.05).
In statistics, a confounder is a variable that influences both the dependent variable and the independent variable, causing a spurious association. The independent variables are the explanatory variables $X$; the dependent variable is the response $Y$ that we predict.
If two correlated variables are both used to predict $Y$, the accuracy might not be impacted, but the coefficients associated with each variable might no longer be meaningful (see the sketch below).
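A sketch (simulated data) of two nearly identical predictors: the fit stays accurate, but the individual coefficients become hard to interpret:

```python
# Sketch: collinear predictors -> good R^2 but unstable, hard-to-interpret coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)      # x2 is almost a copy of x1
y = 3 * x1 + rng.normal(scale=0.1, size=200)    # only x1 truly matters

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
print(model.coef_)          # the effect may be split arbitrarily between x1 and x2
print(model.score(X, y))    # R^2 remains high regardless
```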
Categorical variables
use dummy variables coded 0 and 1
Don't use more values (0, 1, 2, ...), as that would imply an ordinal relationship, which isn't the case !
A binary categorical variable can be added to the multiple linear model as a single dummy: it then also shifts the intercept for that category ! (See the sketch below.)
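A sketch (hypothetical categories) of dummy encoding with pandas:

```python
# Sketch: dummy (0/1) encoding of a categorical predictor.
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice"],
                   "y": [1.0, 2.0, 1.5, 3.0]})
dummies = pd.get_dummies(df["city"], drop_first=True)  # drop one level to avoid redundancy
print(dummies)   # 0/1 columns, one per remaining category
```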
non-linear
a polynomial term (e.g. a squared variable, $X^2$) can be added to introduce nonlinearity; the model stays linear in its coefficients (see the sketch below)
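A sketch (simulated data) adding a squared term, i.e. fitting a degree-2 polynomial:

```python
# Sketch: y ~ b0 + b1*x + b2*x^2, still linear in the coefficients.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=50)

b2, b1, b0 = np.polyfit(x, y, deg=2)   # highest-degree coefficient first
print(b0, b1, b2)
```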
Adjusted R^2
unlike plain $R^2$, which never decreases when predictors are added, adjusted $R^2$ penalizes extra predictors and goes down if adding them does not bring much gain
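One common form, with $n$ observations and $p$ predictors: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$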
Occam's razor
choose the simplest model if a higher degree or more variables does not bring much gain.
bias-variance tradeoff
The bias of the method is the error caused by the simplifying assumptions built into the method.
The variance of the method is how much the model will change based on the sampled data.
The irreducible error is error in the data itself, so no model can capture this error.
There is a tradeoff between the bias and variance of a model. High-variance methods are accurate on the training set but overfit noise in the data, so they don't generalize well to new data. High-bias methods are too simple to fit the training data closely, but are better at generalizing to new test data.
Generalize a model
cross-validation (k-fold: split the data into K folds, train on K-1 folds and test on the remaining one, repeat for each fold; try different parameters or models and take the one with the best average performance; see the sketch at the end)
step-wise selection (forward selection: start with models using one predictor, find the best one-predictor model based on a performance metric, then, keeping that predictor fixed, move to models with two predictors, and so on.
backward selection: the opposite of the above; start with a model containing all predictors and remove them one by one.
Not guaranteed to find the best subset of predictors.)
regularization (Lasso and Ridge: Lasso drives some coefficients to 0 and thus performs variable selection at the same time; Ridge shrinks coefficients without setting them to 0, penalizing models that lean too heavily on many variables; see the sketch below)
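A sketch (simulated data) combining 5-fold cross-validation with Lasso and Ridge from scikit-learn; the alpha values are arbitrary illustrations:

```python
# Sketch: k-fold cross-validation of Lasso and Ridge regression.
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=100)

for model in (Lasso(alpha=0.1), Ridge(alpha=1.0)):
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated R^2
    print(type(model).__name__, scores.mean())

print(Lasso(alpha=0.1).fit(X, y).coef_)  # Lasso drives some coefficients exactly to 0
```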