Linear Regression Flashcards
Regression
- what is it used for?
a mathematical tool to:
1. analyse the direction and strength of the relationship between the variable of interest (target) and other known variables
2. predict the value of the unknown variable (target) based on its past values and other variables
Linear Regression
linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory/independent variables X
- simple: 1 explanatory variable
- multiple: >1 explanatory variable
Supervised Learning
training set contains the actual outcome (target)
- analyses training data and produces an inferred function, used to map new inputs to predicted outcomes
Variable type
independent
- explanatory
- control
- input
- predictor
dependent
- response
- outcome
- target
Pearson Correlation
quantifies dependence between 2 variables
- r (coefficient of correlation) : extent of interdependence
- value indicates strength
- sign indicates direction
- ranges from -1 (total negative correlation) to +1 (total positive correlation); 0 = no linear correlation
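A minimal from-scratch sketch of the coefficient (pure Python; function name `pearson_r` is my own, not from a library):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length samples."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # covariance term (numerator) and the two standard-deviation terms
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, r close to 1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear but decreasing, r close to -1
```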
Types of relationships between 2 variables
- strong positive linear relationship
- positive linear relationship
- perfect negative linear relationship
- perfect parabolic relationship
- negative curvilinear relationship
- no relationship
Regression vs correlation
regression allows the use of MULTIPLE independent variables, while correlation measures the relationship between only 2 variables
Statistical significance
- regression: evidence-based analysis
- provides a mathematical way to find out the simultaneous impact of multiple independent variables on the dependent variable
Application of Regression
Analytical (explanatory models)
- analyse if an increase in police force has an impact on crime rate
Predictive
- given a person's lifestyle, predict their % body fat
Types of data used in regression
Cross sectional data : collected at one point in time
Time series: collected over a period of time
Pooled: combination of both cross-sectional and time series data (e.g. a population survey repeated over 5 years)
Simple linear regression analysis model
𝑦 = 𝛽0 +𝛽1𝑥
y = predicted/dependent variable
x = predictor/independent variable
𝛽0 = y-intercept
𝛽1 = slope / beta coefficient of x
For predictions:
𝑦_𝑖 = 𝑦̂_𝑖 + 𝜀_𝑖 = 𝛽_0 + 𝛽_1 𝑥_𝑖 + 𝜀_𝑖
Error term 𝜀_𝑖 is the difference between the actual value 𝑦_𝑖 and the predicted value 𝑦̂_𝑖
Determine best fit line
sum of squared distances
- penalises or magnifies large errors
- cancels effect of sign
minimise sum of squared errors
- method of least squares : minimises the sum of squares of the vertical distance between data and fitted line
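The least-squares estimates for simple linear regression have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A small sketch (pure Python; `fit_simple_ols` is my own name):

```python
def fit_simple_ols(xs, ys):
    """Least-squares estimates of intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
         sum((x - mx) ** 2 for x in xs)
    b0 = my - b1 * mx  # line passes through the point of means
    return b0, b1

b0, b1 = fit_simple_ols([1, 2, 3, 4], [3, 5, 7, 9])  # data lies exactly on y = 1 + 2x
print(b0, b1)
```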
Assessing goodness of fitted line
coefficient of determination (R square)
- amount of variation in Y that is explained by the regression line
- higher value, better regression line
= SSR/SST (regression sum of squares over total sum of squares) = 1 − SSE/SST
- takes values [0,1]
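R² can be computed directly from observed and predicted values via the equivalent form 1 − SSE/SST (a sketch; `r_squared` is my own name):

```python
def r_squared(ys, preds):
    """Coefficient of determination: R^2 = 1 - SSE/SST."""
    my = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained variation
    sst = sum((y - my) ** 2 for y in ys)                # total variation in Y
    return 1 - sse / sst

print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect fit
print(r_squared([1, 2, 3], [2, 2, 2]))  # predicting the mean explains nothing
```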
what is multiple linear regression
- increase accuracy: use more than 1 independent variable to estimate the dependent variable
- allows us to use more of the information available to estimate the dependent variable
- the best MLR accounts for the largest proportion of the variation in the dependent variable with the fewest number of independent variables
parsimonious model
accomplishes a desired level of explanation or prediction with as few predictor variables as possible
multiple regression equation
𝒀=𝛽_0+𝛽_1 𝑥_1+𝛽_2 𝑥_2+…+𝛽_𝑘 𝑥_𝑘
model significance using F-test and coefficient (variable) significance using T-test
adjusted R squared
considers the number of variables being included in the model and penalises for overfitting
- decreases if added variables are not significant or have multicollinearity with other predictors
𝑅̅^2 = 1 − (1 −𝑅^2) (𝑛−1)/(𝑛−𝑝−1)
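The formula above translates directly to code, with n observations and p predictors (a sketch; `adjusted_r2` is my own name):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# with 20 observations, R^2 = 0.9 shrinks slightly after penalising 3 predictors
print(adjusted_r2(0.9, 20, 3))
```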
MLR model significance
f-test: statistical tests for checking model significance in regression
- H0: a model with no independent variables (intercept-only model) fits the data as well as your model
- H1: your model fits the data better than the intercept-only model
Prob(F-statistic) = p-value
- if < significance level: reject the null hypothesis
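The overall F-statistic can be expressed in terms of R², n observations, and p predictors, which makes the link between model fit and model significance explicit (a sketch; `f_statistic` is my own name):

```python
def f_statistic(r2, n, p):
    """Overall F-statistic for H0: all p slope coefficients are zero.

    F = (explained variation per predictor) / (unexplained variation per
    residual degree of freedom) = (R^2 / p) / ((1 - R^2) / (n - p - 1)).
    """
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# a high R^2 on a reasonable sample size gives a large F, i.e. a tiny p-value
print(f_statistic(0.9, 20, 3))
```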
MLR coefficient significance
t-Test: statistical tests for checking coefficient (variable) significance
- H0: coefficient is 0; the variable has no relationship with the target
- H1: coefficient is not 0; the variable helps explain the target, and the regression model fits better
P>|t| = p-value
- if < significance level: reject the null hypothesis
estimate accuracy of prediction model
Mean Squared Error (MSE) (variance of error) = (1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)²
Root Mean Squared Error (RMSE) (std. dev. of error) = √MSE = √((1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)²)
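Both metrics in a few lines (pure Python; `mse`/`rmse` are my own names):

```python
import math

def mse(ys, preds):
    """Mean of squared prediction errors."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def rmse(ys, preds):
    """Square root of MSE; same units as y, so easier to interpret."""
    return math.sqrt(mse(ys, preds))

print(mse([1, 2, 3], [1, 2, 5]))   # one prediction is off by 2
print(rmse([1, 2, 3], [1, 2, 5]))
```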
Assumptions for Linear Regression
LINE
Linearity of relationship
Independence of errors
Normality of error distribution
Equal Variance of errors
Residual Plots
𝑦_𝑖 = 𝛽_0 + 𝛽_1 𝑥_1 + … + 𝛽_𝑗 𝑥_𝑗 + 𝜀_𝑖
Residual (a.k.a. error 𝜀_𝑖)
= Observed Value of Y – Predicted Value of Y
= 𝑦_𝑖 − 𝑦̂_𝑖
Residual plot: predicted values 𝑦̂_𝑖 (x-axis) vs residual values 𝜀_𝑖 (y-axis)
if points are along the prediction line, how do residual plots look like?
- symmetrically distributed, tending to cluster towards the middle of the residual plot
- no clear patterns
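The residuals themselves are trivial to compute once predictions exist; plotting them against the predicted values (e.g. with matplotlib, assumed available but omitted here) is what reveals patterns. A sketch (`residuals` is my own name):

```python
def residuals(ys, preds):
    """Residuals e_i = observed y_i minus predicted y_hat_i."""
    return [y - p for y, p in zip(ys, preds)]

# a well-behaved fit leaves small residuals with no systematic sign pattern
print(residuals([3.1, 4.9, 7.0, 9.0], [3.0, 5.0, 7.0, 9.0]))
```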
Multi-collinearity
- high level of correlation between independent variables.
- regression coefficients often become less reliable (or unstable) as the degree of correlation between the independent variables increases
Multicollinearity problems
- hide significant variables
- causes parameter estimates to become unstable by increasing their variance
Checking for multicollinearity
- pairwise correlation among predictors
- compute the Variance Inflation Factor for each variable used in the regression model (remove predictors with VIF values > 5 or 10; or use domain knowledge to decide which predictors depend on another)
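VIF for predictor j is 1/(1 − R²_j), where R²_j comes from regressing column j on the other predictors. A sketch assuming NumPy is available (`vif` is my own name):

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the n x p predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + remaining predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        pred = A @ coef
        r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))  # blows up as r2 -> 1 (high collinearity)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
print(vif(np.column_stack([x1, x2])))           # independent columns: VIF near 1
x3 = x1 + 0.01 * rng.normal(size=100)            # x3 nearly duplicates x1
print(vif(np.column_stack([x1, x3])))            # both VIFs far above 10
```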