Linear Regression Flashcards
Regression
- what is it used for?
a mathematical tool to:
1. analyse the direction and strength of the relationship between the variable of interest (target) and other known variables
2. predict the value of the unknown variable (target) based on its past values and other variables
Linear Regression
linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory/independent variables X
- simple: 1 explanatory variable
- multiple: >1 explanatory variable
Supervised Learning
training set contains the actual outcome (target)
- analyses training data and produces an inferred function, used to map new inputs to predicted outcomes
Variable type
independent
- explanatory
- control
- input
- predictor
dependent
- response
- outcome
- target
Pearson Correlation
quantifies dependence between 2 variables
- r (coefficient of correlation) : extent of interdependence
- value indicates strength
- sign indicates direction
- ranges from -1 (total negative correlation) to +1 (total positive correlation); 0 = no linear correlation
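A minimal from-scratch sketch of the coefficient (pure Python; function name `pearson_r` is my own, not from a library):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # covariance term (numerator) and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, r close to 1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear but decreasing, r close to -1
```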
Types of relationships between 2 variables
- strong positive linear relationship
- positive linear relationship
- perfect negative linear relationship
- perfect parabolic relationship
- negative curvilinear relationship
- no relationship
Regression vs correlation
regression allows the use of MULTIPLE independent variables, while correlation measures the relationship between only 2 variables
Statistical significance
- regression: evidence-based analysis
- provides a mathematical way to find out the simultaneous impact of multiple independent variables on the dependent variable
Application of Regression
Analytical (explanatory models)
- analyse if an increase in police force has an impact on crime rate
Predictive
- given a person's lifestyle, predict their % body fat
Types of data used in regression
Cross sectional data : collected at one point in time
Time series: collected over a period of time
Pooled: combination of both cross-sectional and time series data (e.g. a population survey repeated over 5 years)
Simple linear regression analysis model
𝑦 = 𝛽0 +𝛽1𝑥
y = predicted/dependent variable
x = predictor/independent variable
𝛽0 = y-intercept
𝛽1 = slope / beta coefficient of x
For predictions:
𝑦_𝑖 = 𝑦̂_𝑖 + 𝜀_𝑖 = 𝛽_0 + 𝛽_1 𝑥_𝑖 + 𝜀_𝑖
Error term 𝜀_𝑖 is the difference between the actual value 𝑦_𝑖 and the predicted value 𝑦̂_𝑖
Determine best fit line
sum of squared distances
- penalises or magnifies large errors
- cancels effect of sign
minimise sum of squared errors
- method of least squares : minimises the sum of squares of the vertical distance between data and fitted line
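The least-squares estimates for simple linear regression have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A small sketch (pure Python; `fit_simple_ols` is my own name):

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx  # line passes through the point of means
    return b0, b1

b0, b1 = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
print(b0, b1)
```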
Assessing goodness of fitted line
coefficient of determination (R square)
- amount of variation in Y that is explained by the regression line
- higher value, better regression line
= SSR/SST (regression sum of squares over total sum of squares) = 1 − SSE/SST
- takes values [0,1]
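R² can be computed directly from observed and predicted values via the equivalent form 1 − SSE/SST (a sketch; `r_squared` is my own name):

```python
def r_squared(ys, preds):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    my = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
    sst = sum((y - my) ** 2 for y in ys)                # total variation in Y
    return 1 - sse / sst

print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect fit
print(r_squared([1, 2, 3], [2, 2, 2]))  # predicting the mean explains nothing
```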
what is multiple linear regression
- increase accuracy: use more than 1 independent variable to estimate the dependent variable
- allows us to use more of the information available to estimate the dependent variable
- the best MLR accounts for the largest proportion of the variation in the dependent variable with the fewest number of independent variables
parsimonious model
accomplishes a desired level of explanation or prediction with as few predictor variables as possible
multiple regression equation
𝒀=𝛽_0+𝛽_1 𝑥_1+𝛽_2 𝑥_2+…+𝛽_𝑘 𝑥_𝑘
model significance using F-test and coefficient (variable) significance using T-test
adjusted R squared
considers the number of variables being included in the model and penalises for overfitting
- decreases if added variables are not significant or have multicollinearity with other predictors
𝑅̅^2 = 1 − (1 −𝑅^2) (𝑛−1)/(𝑛−𝑝−1)
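The formula above translates directly to code, with n observations and p predictors (a sketch; `adjusted_r2` is my own name):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# with 20 observations, R^2 = 0.9 shrinks slightly after penalising 3 predictors
print(adjusted_r2(0.9, 20, 3))
```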
MLR model significance
f-test: statistical tests for checking model significance in regression
- H0: a model with no independent variables (intercept-only model) fits the data as well as your model
- H1: your model fits the data better than the intercept-only model
Prob(F-statistic) = p-value
- if < significance level: reject the null hypothesis
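The overall F-statistic can be expressed in terms of R², n observations, and p predictors, which makes the link between model fit and model significance explicit (a sketch; `f_statistic` is my own name):

```python
def f_statistic(r2, n, p):
    """Overall F-statistic for H0: all p slope coefficients are zero.

    F = (explained variation per predictor) / (unexplained variation per
    residual degree of freedom) = (R^2 / p) / ((1 - R^2) / (n - p - 1)).
    """
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# a high R^2 on a reasonable sample size gives a large F, i.e. a tiny p-value
print(f_statistic(0.9, 20, 3))
```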
MLR coefficient significance
t-Test: statistical tests for checking coefficient (variable) significance
- H0: coefficient is 0; the variable has no relationship with the target
- H1: coefficient is not 0; the variable helps explain the target, and the regression model fits better
P>|t| = p-value
- if < significance level: reject the null hypothesis
estimate accuracy of prediction model
Mean Squared Error (MSE) (variance of error) = (1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)²
Root Mean Squared Error (RMSE) (std. dev. of error) = √MSE = √((1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)²)
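Both metrics in a few lines (pure Python; `mse`/`rmse` are my own names):

```python
import math

def mse(ys, preds):
    """Mean of squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def rmse(ys, preds):
    """Square root of MSE; same units as y, so easier to interpret."""
    return math.sqrt(mse(ys, preds))

print(mse([1, 2, 3], [1, 2, 5]))   # one prediction is off by 2
print(rmse([1, 2, 3], [1, 2, 5]))
```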
Assumptions for Linear Regression
LINE
Linearity of relationship
Independence of errors
Normality of error distribution
Equal Variance of errors
Residual Plots
𝑦_𝑖 = 𝛽_0 + 𝛽_1 𝑥_1 + … + 𝛽_𝑗 𝑥_𝑗 + 𝜀_𝑖
Residual (a.k.a. error 𝜀_𝑖)
= Observed Value of Y – Predicted Value of Y
= 𝑦_𝑖 − 𝑦̂_𝑖
Residual plot: predicted values 𝑦̂_𝑖 (x-axis) vs residual values 𝜀_𝑖 (y-axis)
if points are along the prediction line, how do residual plots look like?
- symmetrically distributed, tending to cluster towards the middle of the residual plot
- no clear patterns
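The residuals themselves are trivial to compute once predictions exist; plotting them against the predicted values (e.g. with matplotlib, assumed available but omitted here) is what reveals patterns. A sketch (`residuals` is my own name):

```python
def residuals(ys, preds):
    """Residuals e_i = observed y_i minus predicted y_hat_i."""
    return [y - p for y, p in zip(ys, preds)]

# a well-behaved fit leaves small residuals with no systematic sign pattern
print(residuals([3.1, 4.9, 7.0, 9.0], [3.0, 5.0, 7.0, 9.0]))
```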
Multi-collinearity
- high level of correlation between independent variables.
- regression coefficients often become less reliable (or unstable) as the degree of correlation between the independent variables increases
Multicollinearity problems
- hide significant variables
- causes parameter estimates to become unstable by increasing their variance
Checking for multicollinearity
- pairwise correlation among predictors
- compute the Variance Inflation Factor for each variable used in the regression model (remove predictors with VIF values > 5 or 10; or use domain knowledge to decide which predictors depend on another)
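VIF for predictor j is 1/(1 − R²_j), where R²_j comes from regressing column j on the other predictors. A sketch assuming NumPy is available (`vif` is my own name):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the n x p predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        pred = A @ coef
        r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))  # blows up as r2 -> 1 (high collinearity)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2])))           # independent columns: VIF near 1
x3 = x1 + 0.01 * rng.normal(size=100)            # x3 nearly duplicates x1
print(vif(np.column_stack([x1, x3])))            # both VIFs far above 10
```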