Week 7: Linear Regression 2 Flashcards
What are the 4 assumptions for linear regression?
1) Linearity
2) Independence
3) Constant variance (Homoscedasticity)
4) Normality
Why is it always good to plot y against x before doing linear regression?
1) Help us visualise the relationship and identify any trends/patterns
2) Identify outliers
3) Check the linearity assumption
4) Understand the data
What is an “influential observation”?
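An observation whose removal would noticeably change the fitted regression line (its slope or intercept). It is often an outlier with an extreme x value (high leverage), but not every outlier is influential.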
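A minimal sketch of the idea, using made-up numbers: fitting a least-squares line with and without one extreme point shows how a single observation can pull the slope. The helper fit_line and the data are illustrative assumptions, not course material.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Five points that lie exactly on the line y = x (hypothetical data).
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]

slope_without, _ = fit_line(xs, ys)

# Add one point far to the right that does not follow the trend.
slope_with, _ = fit_line(xs + [20], ys + [5])

print("slope without the extra point:", slope_without)
print("slope with the extra point:   ", slope_with)
```

The extra point sits far from the other x values, so it has high leverage; because it also departs from the trend, it drags the fitted slope well below 1.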
What is multi-collinearity and how do you remedy it?
When 2 or more independent variables in a regression model are highly correlated with each other. A common remedy is to remove one of the correlated variables.
What is the point of multiple linear regression (MLR)?
To increase the accuracy of estimates by introducing more independent variables
What determines the best MLR? What do you use to measure it?
The one that accounts for the largest proportion of variation in the dependent variable while using the fewest independent variables
Use adjusted R-squared
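As a rough illustration of why adjusted R-squared suits this comparison, the sketch below applies the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the numbers are assumptions chosen for the example.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: penalises R-squared for each extra predictor."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Two hypothetical models with the same plain R-squared on 50 observations.
plain_r2 = 0.90
few_predictors = adjusted_r2(plain_r2, n_samples=50, n_predictors=3)
many_predictors = adjusted_r2(plain_r2, n_samples=50, n_predictors=10)

print(few_predictors, many_predictors)
```

With the same plain R-squared, the model with fewer predictors gets the higher adjusted value, which matches the "most variation explained with the fewest variables" criterion.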
Explain this line of code:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101)
You are using the train_test_split function to split the dataset into training and test sets.
X is the input data, and y is the output or target data.
test_size=0.2 specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.
random_state=101 ensures that the same random split will be generated each time the code is run, which allows for reproducibility of the results.
The function returns four variables:
X_train - the training input data
X_test - the testing input data
y_train - the training output data
y_test - the testing output data
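A small runnable sketch of the same call, assuming scikit-learn and NumPy are installed. The toy arrays are made up; only the train_test_split arguments come from the flashcard.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 input columns, 1 target value per row.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Same random_state, so the second split should match the first one.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=101
)

print(X_train.shape, X_test.shape)          # 8 training rows, 2 test rows
print(np.array_equal(X_test, X_test2))      # identical test sets
```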
Explain this code:
X_test.drop("Price", axis=1, inplace=True)
What if inplace=False?
It drops the "Price" column from the X_test DataFrame (axis=1 means columns). inplace=True modifies X_test itself rather than returning a new DataFrame.
If inplace=False (the default), drop returns a new DataFrame and X_test is left unchanged, so the result is lost unless you assign it to a variable.
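The difference can be seen on a tiny DataFrame. This is a sketch assuming pandas is installed; the column values are invented for illustration.

```python
import pandas as pd

X_test = pd.DataFrame({"Price": [250, 310], "Rooms": [3, 4]})

# inplace=False (the default): a new DataFrame is returned, X_test is untouched.
dropped = X_test.drop("Price", axis=1)
cols_after_copy = list(X_test.columns)

# inplace=True: X_test itself is modified and the call returns None.
result = X_test.drop("Price", axis=1, inplace=True)
cols_after_inplace = list(X_test.columns)

print(cols_after_copy, list(dropped.columns), result, cols_after_inplace)
```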
What does R^2 measure?
The proportion of variance in the dependent variable that can be explained by the independent variables
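One way to see this definition in action is to compute R^2 by hand as 1 - SS_res / SS_tot. The sketch below uses made-up actual and predicted values; the function name is an assumption.

```python
def r_squared(y_actual, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_actual) / len(y_actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_actual, y_pred))
    ss_tot = sum((a - mean_y) ** 2 for a in y_actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual values and model predictions.
y_actual = [1, 2, 3, 4]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_actual, y_pred)
print(r2)
```

Here the unexplained variation (SS_res) is small relative to the total variation (SS_tot), so R^2 comes out close to 1.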
How do you check that y is linearly related to x? (1st assumption of LR)
1) Plot it out using a scatter plot
2) Calculate the correlation coefficient
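A hedged sketch of the second check, computing the Pearson correlation coefficient from its definition. The data are invented; the second data set is included to show why the scatter plot is still worth doing, since a strong curved relationship can give a coefficient near 0.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Perfectly linear relationship: y = 2x.
r_linear = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# Perfectly curved relationship: y = x^2, symmetric around x = 0.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_linear, r_curved)
```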
What does “residuals” mean?
The differences between the actual and predicted values: y_actual - y_hat (predicted y)
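To make the definition concrete, the sketch below fits a straight line with NumPy's polyfit and subtracts the predictions from the actual values. The data points are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical observations.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_actual = np.array([2.0, 4.0, 5.0, 8.0])

# Degree-1 polyfit returns the least-squares slope and intercept.
slope, intercept = np.polyfit(x, y_actual, 1)
y_hat = slope * x + intercept

# Residual = actual value minus predicted value, one per observation.
residuals = y_actual - y_hat
print(residuals)
```

For a least-squares line with an intercept, the residuals sum to (numerically) zero; plotting them against x or y_hat is a common way to check the constant variance assumption.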
How do you conclude that 2 independent variables are multicollinear using the VIF?
VIF of 1 = not correlated.
VIF between 1 and 5 = moderately correlated.
VIF greater than 5 = highly correlated.
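A sketch of how a VIF can be computed from scratch, assuming NumPy is available: regress each independent variable on the others and take VIF = 1 / (1 - R^2). The function name vif and the synthetic data are assumptions for illustration only.

```python
import numpy as np


def vif(X):
    """Return one VIF per column of X, using VIF_j = 1 / (1 - R_j^2)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    factors = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        factors.append(1 / (1 - r2))
    return factors


# Synthetic data: x3 is almost a copy of x1, x2 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this setup x1 and x3 should come out well above 5 while x2 stays near 1. In practice, statsmodels offers variance_inflation_factor in statsmodels.stats.outliers_influence for the same purpose.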