Linear Regression Flashcards

1
Q

Regression
- what is it used for?

A

a mathematical tool to:
1. analyse the direction and strength of the relationship between the variable of interest (target) and other known variables
2. predict the value of the unknown variable (target) based on its past values and other variables
2
Q

Linear Regression

A

linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory/independent variables X

  • simple: 1 explanatory variable
  • multiple: >1 explanatory variable
3
Q

Supervised Learning

A

training set contains the actual outcome (target)
- analyses the training data and produces an inferred function, which is used to map new inputs to outputs

4
Q

Variable type

A

independent
- explanatory
- control
- input
- predictor

dependent
- response
- outcome
- target

5
Q

Pearson Correlation

A

quantifies dependence between 2 variables
- r (coefficient of correlation) : extent of interdependence
- value indicates strength
- sign indicates direction
- ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation); 0 = no linear correlation

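As a quick illustration (the data below is made up for this sketch, not taken from the cards), Pearson's r can be computed with NumPy:

```python
import numpy as np

# Hypothetical sample data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson r: covariance of x and y divided by the product of their std devs
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))  # close to +1: strong positive linear relationship
```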
6
Q

Types of relationships between 2 variables

A
  • strong positive linear relationship
  • positive linear relationship
  • perfect negative linear relationship
  • perfect parabolic relationship
  • negative curvilinear relationship
  • no relationship
7
Q

Regression vs correlation

A

regression allows the use of MULTIPLE independent variables, while correlation measures association between only 2 variables

8
Q

Statistical significance

A
  • regression: evidence-based analysis
  • provides a mathematical way to find the simultaneous impact of multiple independent variables on the dependent variable
9
Q

Application of Regression

A

Analytical (explanatory models)
- analyse if an increase in police force has an impact on the crime rate
Predictive
- given a person's lifestyle, predict % body fat

10
Q

Types of data used in regression

A

Cross-sectional data: collected at one point in time
Time series: collected over a period of time
Pooled: combination of both cross-sectional and time series data (e.g. a population survey over 5 years)

11
Q

Simple linear regression analysis model

A

𝑦 = 𝛽_0 + 𝛽_1 𝑥

y = predicted/dependent variable
x = predictor/independent variable
𝛽_0 = y-intercept
𝛽_1 = slope / beta coefficient of x

For predictions:
𝑦_𝑖 = 𝑦̂_𝑖 + 𝜀_𝑖 = 𝛽_0 + 𝛽_1 𝑥_𝑖 + 𝜀_𝑖

The error term 𝜀_𝑖 is the difference between the actual value 𝑦_𝑖 and the predicted value 𝑦̂_𝑖.

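A minimal sketch of fitting this model with NumPy's least-squares solver. The data and the rough underlying relationship (y ≈ 1 + 2x plus noise) are assumptions for illustration:

```python
import numpy as np

# Hypothetical data; true relationship is roughly y = 1 + 2x plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

# Least-squares estimates of beta0 (intercept) and beta1 (slope)
X = np.column_stack([np.ones_like(x), x])           # design matrix [1, x]
beta0, beta1 = np.linalg.lstsq(X, y, rcond=None)[0]

y_hat = beta0 + beta1 * x    # predicted values y-hat_i
residuals = y - y_hat        # error terms epsilon_i
```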
12
Q

Determine best fit line

A

sum of squared distances
- penalises or magnifies large errors
- cancels effect of sign

minimise sum of squared errors
- method of least squares: minimises the sum of squared vertical distances between the data points and the fitted line

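The least-squares solution for a simple linear fit also has a closed form, sketched below with hypothetical data (the formula for 𝛽_1 is the standard one: covariance term over the variance term of x):

```python
import numpy as np

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Closed-form least-squares estimates:
# beta1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
x_bar, y_bar = x.mean(), y.mean()
beta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0 = y_bar - beta1 * x_bar

# The minimised quantity: sum of squared errors of the fitted line
sse = np.sum((y - (beta0 + beta1 * x)) ** 2)
```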
13
Q

Assessing goodness of fitted line

A

coefficient of determination (R²)
- proportion of the variation in Y that is explained by the regression line
- higher value → better-fitting regression line
- R² = SSR/SST (regression sum of squares over total sum of squares)

  • takes values in [0, 1]
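A small sketch of the SSR/SST computation. The observed values and the fitted values (from a hypothetical OLS line y ≈ 0.09 + 1.97x) are made up for illustration:

```python
import numpy as np

y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])          # observed values
y_hat = np.array([2.06, 4.03, 6.0, 7.97, 9.94])  # fitted values from an OLS line

sst = np.sum((y - y.mean()) ** 2)      # total variation in Y
ssr = np.sum((y_hat - y.mean()) ** 2)  # variation explained by the regression line
r_squared = ssr / sst                  # close to 1: line explains most variation
```

Note that SSR/SST equals R² when the fitted values come from a least-squares fit with an intercept (then SST = SSR + SSE).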
14
Q

what is multiple linear regression

A
  • increase accuracy: use more than 1 independent variable to estimate the dependent variable
  • allows us to use more of the information available to estimate the dependent variable
  • the best MLR accounts for the largest proportion of the variation in the dependent variable with the fewest number of independent variables
15
Q

parsimonious model

A

accomplishes a desired level of explanation or prediction with as few predictor variables as possible

16
Q

multiple regression equation

A

𝑦 = 𝛽_0 + 𝛽_1 𝑥_1 + 𝛽_2 𝑥_2 + … + 𝛽_𝑘 𝑥_𝑘

model significance using F-test and coefficient (variable) significance using T-test

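A minimal sketch of fitting a multiple regression with NumPy, using made-up data with two predictors (the arrays and the exact relationship y = 1 + 2x₁ + 0.5x₂ are assumptions for illustration):

```python
import numpy as np

# Hypothetical data: y depends on two predictors (k = 2)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 0.5 * x2   # exact relationship, no noise

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x1), x1, x2])
betas = np.linalg.lstsq(X, y, rcond=None)[0]  # [beta0, beta1, beta2]
```

NumPy only gives the coefficient estimates; in practice a statistics library such as statsmodels additionally reports the F-test (Prob(F-statistic)) and the per-coefficient t-tests (P>|t|) mentioned above.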
17
Q

adjusted R squared

A

considers the number of variables included in the model and penalises for overfitting
- decreases when added variables are not significant or have multicollinearity with other predictors

𝑅̅^2 = 1 − (1 − 𝑅^2)(𝑛 − 1)/(𝑛 − 𝑝 − 1)

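The formula is straightforward to evaluate; the numbers below (R² = 0.95, n = 30 rows, p = 4 predictors) are hypothetical:

```python
# Hypothetical values: R^2 = 0.95 from a model with n = 30 rows, p = 4 predictors
n, p, r_squared = 30, 4, 0.95

# Adjusted R^2 penalises for the number of predictors
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(round(adj_r_squared, 3))  # slightly below the raw R^2
```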
18
Q

MLR model significance

A

F-test: statistical test for checking model significance in regression

  • H0: a model with no independent variables (intercept-only model) fits the data as well as your model
  • H1: your model fits the data better than the intercept-only model

Prob(F-statistic) = p-value
- if < significance level: reject the null hypothesis

19
Q

MLR coefficient significance

A

t-test: statistical test for checking coefficient (variable) significance

  • H0: coefficient = 0, no relationship
  • H1: coefficient ≠ 0, the variable helps explain the model; the regression model fits better

P>|t| = p-value
- if < significance level: reject the null hypothesis

20
Q

estimate accuracy of prediction model

A

Mean Squared Error (MSE) (variance of error):
MSE = (1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)^2

Root Mean Squared Error (RMSE) (std. dev. of error):
RMSE = √MSE = √((1/𝑛) ∑(𝑦_𝑖 − 𝑦̂_𝑖)^2)

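Both metrics are a few lines of plain Python; the observed and predicted values below are hypothetical:

```python
import math

# Hypothetical observed vs predicted values
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]

n = len(y)
mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / n  # mean squared error
rmse = math.sqrt(mse)                                      # same units as y
print(mse, rmse)
```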
21
Q

Assumptions for Linear Regression

A

LINE

Linearity of the relationship
Independence of errors
Normality of error distribution
Equal Variance of errors

22
Q

Residual Plots

A

𝑦 = 𝛽_0 + 𝛽_1 𝑥_1 + … + 𝛽_𝑗 𝑥_𝑗 + 𝜀_𝑖

Residual (a.k.a. error 𝜀_𝑖)
= Observed Value of Y − Predicted Value of Y
= 𝑦_𝑖 − 𝑦̂_𝑖

Residual plot: predicted values 𝑦̂_𝑖 (x-axis) vs residual values 𝜀_𝑖 (y-axis)

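Computing the residuals is one subtraction; the observed and fitted values below are hypothetical:

```python
import numpy as np

# Hypothetical observed and fitted values
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])
y_hat = np.array([2.1, 4.0, 6.0, 8.0, 9.9])

residuals = y - y_hat  # epsilon_i = observed - predicted
# A residual plot is a scatter of y_hat (x-axis) vs residuals (y-axis),
# e.g. matplotlib's plt.scatter(y_hat, residuals); look for no clear pattern
```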
23
Q

if points are along the prediction line, what do the residual plots look like?

A
  • symmetrically distributed, tending to cluster towards the middle of the residual plot
  • no clear patterns
24
Q

Multi-collinearity

A
  • high level of correlation between independent variables
  • regression coefficients often become less reliable (or unstable) as the degree of correlation between the independent variables increases
25
Q

Multicollinearity problems

A
  • hide significant variables
  • causes parameter estimates to become unstable by increasing their variance
26
Q

Checking for multicollinearity

A
  • pairwise correlation among predictors
  • compute the Variance Inflation Factor (VIF) for each variable used in the regression model (remove predictors with VIF values > 5 or 10; OR use domain knowledge to decide which predictors depend on one another)
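VIF for predictor j is 1/(1 − R²_j), where R²_j comes from regressing x_j on the other predictors. A sketch with NumPy, using made-up predictors where x2 is nearly a copy of x1 (the data and the helper function `vif` are assumptions for illustration):

```python
import numpy as np

def vif(X, j):
    """Variance Inflation Factor of column j of predictor matrix X."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # regress x_j on the rest
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    fitted = A @ coef
    r2 = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Hypothetical predictors: x2 is nearly a copy of x1 (high collinearity),
# x3 is independent of both
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
# vif(X, 0) and vif(X, 1) are large (collinear pair); vif(X, 2) is near 1
```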