06.a Linear Regression Flashcards
In linear regression what is the output variable called and what are the input variables called
Output variable = dependent variable
Input variable = independent variable
What is linear regression
Linear regression is an analytical technique used to model the relationship between one or more input variables and a continuous outcome variable. A key assumption is that the relationship between each input variable and the outcome variable is linear. Although this assumption may appear restrictive, it is often possible to transform the input or outcome variables (for example, by taking a logarithm) so that the relationship between the transformed variables is linear.
What is the theory behind linear regression
Regression analysis attempts to explain the influence that a set of variables has on the outcome of another variable of interest. Regression analysis is a useful explanatory tool that can identify the input variables that have the greatest statistical influence on the outcome.
Name four common use cases for Linear Regression
Real estate
Demand forecasting
Medical correlation
Engineering
What is the relationship expression for Linear Regression
Y = β0 + β1X1 + β2X2 +…+ βnXn + ε
Y = dependent variable (continuous outcome variable)
Xj = independent variables (input variables, j = 1, 2, …, n)
β0 = intercept (the value of Y when each Xj equals 0)
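β1, …, βn = regression coefficients (βj is the change in Y for a one-unit increase in Xj, with the other inputs held fixed)
ε = error term (the difference between the observed data and the model)
Y and Xj are known from a historical dataset; the task is to estimate the regression coefficients (β0, β1, …, βn).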
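As a rough illustration of the expression above, the sketch below simulates a dataset from known coefficients and checks that lm() recovers them approximately. The "true" values (3, 2, -1.5), the sample size and the variable names are all invented for this example.

```r
# Simulated data: the coefficients below are made up for illustration only.
set.seed(42)
n   <- 200
x1  <- runif(n, 0, 10)
x2  <- runif(n, 0, 5)
eps <- rnorm(n, mean = 0, sd = 1)        # the error term
y   <- 3 + 2 * x1 - 1.5 * x2 + eps       # Y = b0 + b1*X1 + b2*X2 + error

sim <- data.frame(y, x1, x2)

# Estimate b0, b1 and b2 from the data alone.
fit <- lm(y ~ x1 + x2, data = sim)
est <- coef(fit)                          # named vector: (Intercept), x1, x2
print(round(est, 2))
```

The estimates should land close to, but not exactly on, the values used to generate the data, because of the random error term.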
What does OLS mean and what does it do
OLS stands for Ordinary Least Squares. The goal is to find the line that best approximates the relationship between the outcome variable and the input variables. Given the observed data points, OLS chooses the line that minimises the sum of the squared vertical distances between each point and the line.
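To make the "minimise the sum of squares" idea concrete, here is a minimal sketch that solves the least-squares problem directly with the normal equations and compares the result with lm(). It uses R's built-in mtcars data (weight against miles per gallon), which is an arbitrary choice for the example.

```r
x <- mtcars$wt
y <- mtcars$mpg

# Design matrix: a column of 1s for the intercept, then the input variable.
X <- cbind(1, x)

# Normal equations: solve (X'X) b = X'y for b.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Sum of squared vertical distances for a candidate coefficient vector b.
rss <- function(b) sum((y - X %*% b)^2)

# lm() should give the same coefficients.
fit <- lm(mpg ~ wt, data = mtcars)
print(cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit))))

# Nudging the intercept away from the OLS solution increases the RSS.
print(c(at_solution = rss(beta_hat), nudged = rss(beta_hat + c(0.1, 0))))
```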
Name three graph types which are used in Linear Regression
Scatter plot - to visualise the linear relationship
Box plot - to help spot outliers
Density plot - to see the distribution of the predictor variable. Ideally it is close to a normal distribution (a bell-shaped curve) and not skewed to the left or right.
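A short sketch of the three plot types in base R follows. The mtcars columns are only an example dataset; any numeric predictor and outcome would do.

```r
# Three panels side by side.
op <- par(mfrow = c(1, 3))

# Scatter plot: does the relationship look roughly linear?
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg", main = "Scatter plot")

# Box plot: points beyond the whiskers are drawn individually.
b <- boxplot(mtcars$hp, main = "Box plot of hp")

# Density plot: is the predictor roughly bell shaped, or skewed?
d <- density(mtcars$wt)
plot(d, main = "Density of wt")

par(op)  # restore the previous plot layout

# Values that boxplot() flagged as outliers.
print(b$out)
```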
What is the correlation coefficient in Linear Regression
The correlation coefficient (r) quantifies both the strength and direction of the linear relationship between two measurement variables on a scatterplot. +1 = perfect positive (uphill) relationship, 0 = no linear correlation, -1 = perfect negative (downhill) relationship.
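The sketch below shows the two extremes with cor() and then recomputes r by hand from the usual Pearson formula, assuming that is the correlation coefficient the card refers to. The small vectors are invented for the example.

```r
x <- c(1, 2, 3, 4, 5)

r_up   <- cor(x,  2 * x + 1)    # exact increasing line
r_down <- cor(x, -3 * x + 10)   # exact decreasing line
print(c(r_up = r_up, r_down = r_down))

# Pearson's r by hand: covariation of x and y scaled by their spreads.
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson(mtcars$wt, mtcars$mpg)
r_builtin <- cor(mtcars$wt, mtcars$mpg)
print(c(manual = r_manual, builtin = r_builtin))
```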
What does Anscombe’s quartet show
The four data sets in Anscombe’s quartet are designed to have approximately the same linear regression line (as well as nearly identical means, standard deviations, and correlations) but look very different when plotted. This illustrates the pitfalls of relying solely on a fitted model, rather than also plotting the data, to understand the relationship between variables.
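R ships the quartet as the built-in anscombe data frame (columns x1..x4 and y1..y4), so the claim can be checked directly. This is a minimal sketch; the helper names are just for this example.

```r
# Fit y_i ~ x_i for each of the four data sets.
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# One row per data set: intercept and slope.
coefs <- t(sapply(fits, function(f) unname(coef(f))))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 2))

# Correlation of each (x_i, y_i) pair.
cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
print(round(cors, 2))

# The plots tell a very different story from the summary numbers.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}
par(op)
```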
Where is a good place to start in R when assessing if variables have correlation
Use a scatterplot matrix (for example, base R's pairs() function) to plot every pair of variables against each other, then look for correlations.
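One way to do this in base R is sketched below, assuming pairs() is an acceptable stand-in for whichever scatterplot-matrix function the course used. The four mtcars columns are an arbitrary selection.

```r
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Scatterplot matrix: one panel for every pair of variables.
pairs(vars)

# The matching correlation matrix puts numbers on what the panels show.
cm <- cor(vars)
print(round(cm, 2))
```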
What is R squared
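R squared is the coefficient of determination: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
It ranges from 0 to 1, where 1 means the model explains all of the variance and 0 means it explains none.
It compares the variation left unexplained by the model (the residual sum of squares) with the total variation in the dependent variable: R squared = 1 - (residual sum of squares / total sum of squares).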
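The definition can be checked by hand against the value lm() reports. This sketch uses mtcars purely as example data.

```r
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # variation the model leaves unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in the outcome

r2_manual <- 1 - ss_res / ss_tot
r2_lm     <- summary(fit)$r.squared
print(c(manual = r2_manual, lm = r2_lm))

# With a single input variable, R squared equals the squared correlation.
r2_from_cor <- cor(mtcars$wt, mtcars$mpg)^2
print(r2_from_cor)
```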
What is the syntax in R for linear regression
results = lm(y ~ x1 + x2, data = set1)
What is a quick function in R to draw a straight line onto a scatter plot
abline(), for example abline(fit) where fit is an lm object with a single input variable
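A minimal usage sketch: abline() draws on top of an existing plot, so the scatter plot has to come first. The dataset, colours and line styles are arbitrary choices.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Draw the scatter plot first, then add lines to it.
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")

# Passing a fitted single-predictor model draws its regression line.
abline(fit, col = "red", lwd = 2)

# abline() also accepts an explicit intercept/slope, or h= / v= for
# horizontal and vertical reference lines.
abline(h = mean(mtcars$mpg), lty = 2)

line_coefs <- coef(fit)   # the intercept and slope that abline(fit) used
print(line_coefs)
```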
What is Heteroscedasticity
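It means the spread of the data around the line of best fit is not constant. A typical case is the spread growing as the values get larger, which gives a fan shape in the residuals. The opposite is homoscedasticity, where the spread is roughly constant.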
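The difference is easiest to see on simulated data. In the sketch below, one outcome has noise that grows with x and the other has constant noise; every number and name is invented for the illustration.

```r
set.seed(1)
n <- 500
x <- runif(n, 1, 10)

y_het <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)   # noise grows with x
y_hom <- 2 + 3 * x + rnorm(n, sd = 2)         # constant noise

fit_het <- lm(y_het ~ x)
fit_hom <- lm(y_hom ~ x)

# Crude check: residual spread for large x divided by spread for small x.
spread_ratio <- function(fit, x) {
  r <- residuals(fit)
  sd(r[x > median(x)]) / sd(r[x <= median(x)])
}
ratio_het <- spread_ratio(fit_het, x)
ratio_hom <- spread_ratio(fit_hom, x)
print(c(heteroscedastic = ratio_het, homoscedastic = ratio_hom))

# Residuals against fitted values: a fan shape suggests heteroscedasticity.
op <- par(mfrow = c(1, 2))
plot(fitted(fit_het), residuals(fit_het), main = "Spread grows")
plot(fitted(fit_hom), residuals(fit_hom), main = "Spread constant")
par(op)
```

A ratio well above 1 points to spread that increases with x, while a ratio near 1 is consistent with constant spread.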
What is Multicollinearity
Multicollinearity means that two or more input variables are strongly correlated with each other. This makes the individual coefficient estimates unstable and harder to interpret in a linear regression model.
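One common diagnostic, not mentioned on the cards themselves, is the variance inflation factor (VIF): regress each input on the other inputs and compute 1 / (1 - R squared). The sketch below does this by hand on simulated data in which x2 is deliberately built as a near copy of x1; the vif() helper is a hypothetical name defined here, not a base R function.

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to the other inputs

# Hypothetical helper: VIF of one input given a matrix of the other inputs.
vif <- function(target, others) {
  r2 <- summary(lm(target ~ others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif(x1, cbind(x2, x3))
vif_x3 <- vif(x3, cbind(x1, x2))
print(c(cor_x1_x2 = cor(x1, x2), vif_x1 = vif_x1, vif_x3 = vif_x3))
```

A VIF close to 1 suggests an input is not explained by the others, while a large value (rules of thumb vary) flags an input that is largely redundant.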