Linear Regression Flashcards
Regression
- what is it used for?
a mathematical tool to:
1. analyse the direction and strength of the relationship between the variable of interest (target) and other known variables
2. predict the value of the unknown variable (target) based on its past values and other variables
Linear Regression
linear approach for modelling the relationship between a scalar dependent variable y and one or more explanatory/independent variables X
- simple: 1 explanatory variable
- multiple: >1 explanatory variable
Supervised Learning
training set contains actual outcome (target)
- analyses training data and produces an inferred function, used to map inputs to outputs
Variable type
independent
- explanatory
- control
- input
- predictor
dependent
- response
- outcome
- target
Pearson Correlation
quantifies dependence between 2 variables
- r (coefficient of correlation) : extent of interdependence
- magnitude (absolute value) indicates strength
- sign indicates direction
- ranges from -1 (total negative correlation) to +1 (total positive correlation); 0 = no linear correlation
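As a rough illustration of how r is computed, the sketch below uses the standard formula (sum of co-deviations divided by the square root of the product of the squared deviations). The function name `pearson_r` and the data are hypothetical, chosen only for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: hours studied vs. exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]

r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1: strong positive linear relationship

# Reversing one variable flips the sign but not the strength.
r_reversed = pearson_r(hours, scores[::-1])
print(round(r_reversed, 3))
```

The sign of the result gives the direction and its magnitude gives the strength, matching the bullets above.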
Types of relationships between 2 variables
- strong positive linear relationship
- positive linear relationship
- perfect negative linear relationship
- perfect parabolic relationship
- negative curvilinear relationship
- no relationship
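A small sketch of why "0 = no linear correlation" is not the same as "no relationship": a perfect parabolic relationship that is symmetric about the mean of x gives r = 0, while a perfect negative line gives r = -1. The helper `pearson_r` and the data are hypothetical, written only for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient r for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
perfect_negative = [-2 * x + 1 for x in xs]  # straight line, negative slope
perfect_parabola = [x ** 2 for x in xs]      # exact curve, symmetric about x = 0

r_line = pearson_r(xs, perfect_negative)
r_parabola = pearson_r(xs, perfect_parabola)

print(r_line)      # perfect negative linear relationship
print(r_parabola)  # y is fully determined by x, yet r shows no linear trend
```

So a value of r near 0 rules out a linear relationship only; plotting the data is still needed to spot curvilinear ones.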
Regression vs correlation
regression allows use of MULTIPLE independent variables while correlation is only between 2 variables
Statistical significance
- regression: evidence-based analysis
- provides a mathematical way to find out the simultaneous impact of multiple independent variables on the dependent variable
Application of Regression
Analytical (explanatory models)
- analyse whether an increase in the police force has an impact on the crime rate
Predictive
- given a person's lifestyle, predict their % body fat
Types of data used in regression
Cross sectional data : collected at one point in time
Time series: collected over a period of time
Pooled: combination of both cross-sectional and time series data (e.g. population survey over 5 years)
Simple linear regression analysis model
y = β_0 + β_1 x
y = predicted/dependent variable
x = predictor/independent variable
β_0 = y-intercept
β_1 = slope / beta coefficient of x
For predictions, each actual value splits into a predicted value plus an error:
y_i = ŷ_i + ε_i
    = β_0 + β_1 x_i + ε_i
Error term ε_i = y_i − ŷ_i: the difference between the actual value y_i and the predicted value ŷ_i
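To make the model and the error term concrete, here is a minimal sketch. The coefficients and data are assumed values picked for illustration, not estimates from a real data set, and the name `predict` is hypothetical.

```python
# Assumed (hypothetical) coefficients, not estimated from data.
beta0 = 2.0   # y-intercept
beta1 = 0.5   # slope

def predict(x):
    """Predicted value y_hat = beta0 + beta1 * x."""
    return beta0 + beta1 * x

# Hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.7, 2.9, 3.4, 4.2]

y_hat = [predict(x) for x in xs]
# Error term for each observation: actual minus predicted.
residuals = [y - yh for y, yh in zip(ys, y_hat)]

for x, y, yh, e in zip(xs, ys, y_hat, residuals):
    print(f"x={x}  actual={y}  predicted={yh}  error={e:+.2f}")
```

Adding each error back to its predicted value recovers the actual value, which is exactly the decomposition y_i = ŷ_i + ε_i.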
Determine best fit line
sum of squared distances
- penalises or magnifies large errors
- cancels effect of sign
minimise sum of squared errors
- method of least squares: minimises the sum of squares of the vertical distances between the data points and the fitted line
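For a single predictor, the least-squares line has a well-known closed form: slope = S_xy / S_xx and intercept = mean(y) − slope × mean(x). The sketch below applies it to hypothetical data; the function names `fit_simple_ols` and `sse` are made up for this example.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) minimising the sum of squared vertical errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1

def sse(xs, ys, beta0, beta1):
    """Sum of squared errors between the data and the line beta0 + beta1 * x."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data lying roughly on a line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_ols(xs, ys)
print(f"intercept={beta0:.2f}, slope={beta1:.2f}")

# Any other line should give a larger (or equal) sum of squared errors.
print(sse(xs, ys, beta0, beta1) <= sse(xs, ys, beta0 + 0.1, beta1))
print(sse(xs, ys, beta0, beta1) <= sse(xs, ys, beta0, beta1 - 0.05))
```

Squaring the errors is what makes the sign irrelevant and lets one large miss outweigh several small ones.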
Assessing goodness of fitted line
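coefficient of determination (R squared, R²)
- proportion of the variation in Y that is explained by the regression line
- R² = SSR/SST, where SSR is the sum of squares explained by the regression and SST is the total sum of squares
- takes values in [0, 1]
- higher value, better regression line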
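A short sketch of the R² calculation, assuming the line was fitted by least squares with an intercept (under that assumption SSR/SST equals 1 − SSE/SST, where SSE is the sum of squared errors). The data and variable names are hypothetical.

```python
# Hypothetical data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least-squares fit for one predictor (closed form).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
beta1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
beta0 = mean_y - beta1 * mean_x
y_hat = [beta0 + beta1 * x for x in xs]

# SST: total variation of y around its mean.
sst = sum((y - mean_y) ** 2 for y in ys)
# SSR: variation explained by the regression line.
ssr = sum((yh - mean_y) ** 2 for yh in y_hat)
# SSE: variation left unexplained (squared errors).
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))

r_squared = ssr / sst
print(f"R^2 = {r_squared:.4f}")
print(f"1 - SSE/SST = {1 - sse / sst:.4f}")  # same value for a least-squares fit
```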
Multiple linear regression (MLR)
- increase accuracy: use more than 1 independent variable to estimate the dependent variable
- allows us to use more of the information available to estimate the dependent variable
- the best MLR accounts for the largest proportion of the variation in the dependent variable with the fewest number of independent variables
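One possible way to fit a model with more than one independent variable is to solve the normal equations (XᵀX)b = Xᵀy. The sketch below does this in plain Python for two predictors; `solve_linear_system` and `fit_mlr` are hypothetical helper names, and the data are generated from a known equation so the result can be checked. In practice a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_mlr(rows, ys):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coeffs = fit_mlr(rows, ys)
print([round(c, 6) for c in coeffs])  # intercept, then one coefficient per predictor
```

Each coefficient describes the effect of its predictor with the other predictor held fixed, which is how MLR uses more of the available information than a single-variable fit.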
parsimonious model
accomplishes a desired level of explanation or prediction with as few predictor variables as possible