Week 7: Linear Regression 2 Flashcards

Question 1

Q

What are the 4 assumptions for linear regression?

Answer

A

1) Linearity (the relationship between x and y is linear)
2) Independence (of the errors)
3) Constant variance of the errors (Homoscedasticity)
4) Normality (of the errors)

Question 2

Q

Why is it always good to plot y against x before doing linear regression?

Answer

A

1) Helps us visualise the relationship and identify any trends or patterns
2) Identifies outliers
3) Checks the linearity assumption
4) Helps us understand the data

Question 3

Q

What is an “influential observation”?

Answer

A

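An observation whose removal would noticeably change the fitted regression line (its slope or intercept). It is often an outlier with an extreme x value (high leverage), but not every outlier is influential.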
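
A minimal sketch of what "influential" means in practice: fit the line with and without one extreme point and compare the slopes. The data and the helper fit_line are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data: five points on y = 2x, plus one extreme point at x = 20.
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 6, 8, 10, 5]

_, slope_all = fit_line(xs, ys)                # fitted on all six points
_, slope_without = fit_line(xs[:-1], ys[:-1])  # extreme point removed

print("slope with the extreme point:   ", slope_all)
print("slope without the extreme point:", slope_without)
```

The slope changes a lot when the last point is dropped, which is why that point would be called influential here.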

Question 4

Q

What is multi-collinearity and how do you remedy it?

Answer

A

When 2 or more independent variables in a regression model are highly correlated with each other. A common remedy is to remove one of the correlated variables.

Question 5

Q

What is the point of multiple linear regression?

Answer

A

To increase the accuracy of estimates by using more than one independent variable

Question 6

Q

What determines the best MLR? What do you use to measure it?

Answer

A

The one that accounts for the largest proportion of variation in the dependent variable while using the fewest independent variables

Use adjusted R-squared, which penalises adding independent variables that do not improve the fit
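
As a rough illustration, adjusted R-squared can be computed from R-squared with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of independent variables. The function name and the numbers below are hypothetical.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical case: 3 extra predictors lift R^2 only from 0.80 to 0.81.
small_model = adjusted_r_squared(0.80, n_samples=30, n_predictors=2)
large_model = adjusted_r_squared(0.81, n_samples=30, n_predictors=5)

print("2 predictors:", small_model)
print("5 predictors:", large_model)
```

In this made-up case the larger model has the higher R-squared but the lower adjusted R-squared, so the smaller model would be preferred.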

Question 7

Q

Explain this line of code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101)

Answer

A

You are using the train_test_split function to split the dataset into training and test sets.

X is the input data, and y is the output or target data.

test_size=0.2 specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.

random_state=101 ensures that the same random split will be generated each time the code is run, which allows for reproducibility of the results.

The function returns four variables:

X_train - the training input data
X_test - the testing input data
y_train - the training output data
y_test - the testing output data
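
A small sketch of the split on made-up data, assuming scikit-learn is installed. The 10-row dataset is invented so the 80/20 split is easy to count.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows with one feature each, and 10 targets.
X = [[i] for i in range(10)]
y = [2 * i for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)

# Same random_state, so a second call reproduces the same split.
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.2, random_state=101
)

print(len(X_train), len(X_test))  # 8 training rows, 2 test rows
print(X_test == X_test2)          # the two splits match
```

Rows of X and y stay paired after the split, so each y_test value still belongs to the matching X_test row.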

Question 8

Q

Explain this code:
X_test.drop("Price", axis=1, inplace=True)

What if inplace=False?

Answer

A

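It drops the "Price" column from the X_test DataFrame. axis=1 means a column (not a row) is dropped. inplace=True means X_test itself is modified rather than a new DataFrame being returned.

If inplace=False (the default), X_test is left unchanged and drop returns a new DataFrame without the column, so the result is lost unless it is assigned to a variable.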
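
The difference can be seen on a tiny made-up DataFrame, assuming pandas is installed. The column names other than "Price" are invented for illustration.

```python
import pandas as pd

# Made-up data; "Area" and "Rooms" are hypothetical column names.
X_test = pd.DataFrame({"Area": [50, 80], "Rooms": [2, 3], "Price": [200, 320]})

# inplace=False (the default): X_test is unchanged, a new DataFrame is returned.
dropped_copy = X_test.drop("Price", axis=1, inplace=False)
columns_after_copy = list(X_test.columns)

# inplace=True: X_test itself is modified and the call returns None.
result = X_test.drop("Price", axis=1, inplace=True)
columns_after_inplace = list(X_test.columns)

print(columns_after_copy)
print(columns_after_inplace)
```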

Question 9

Q

What does R^2 measure?

Answer

A

The proportion of variance in the dependent variable that can be explained by the independent variables
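
One common way to compute it is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. The sketch below uses made-up actual and predicted values, and the function name is hypothetical.

```python
def r_squared(y_actual, y_predicted):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_actual) / len(y_actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in y_actual)
    return 1 - ss_res / ss_tot

# Made-up values for illustration.
y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.5, 5.5, 6.5, 9.5]

score = r_squared(y_actual, y_predicted)
# Always predicting the mean of y explains none of the variation.
baseline = r_squared(y_actual, [6.0, 6.0, 6.0, 6.0])

print(score, baseline)
```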

Question 10

Q

How do you check that y is linearly related to x? (1st assumption of LR)

Answer

A

1) Plot it out using a scatter plot
2) Calculate the correlation coefficient
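
A minimal sketch of the second check, computing the Pearson correlation coefficient by hand on made-up data. The function name pearson_r is an assumption. The second dataset shows why the scatter plot is still needed: a clearly curved relationship can have a correlation of zero.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data that is roughly y = 2x: r should be close to 1.
r_linear = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])

# y = x^2 on symmetric x values: strongly related, but not linearly.
r_curved = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_linear, r_curved)
```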

Question 11

Q

What does “residuals” mean?

Answer

A

The difference between each actual value and its predicted value: y_actual - y_hat (where y_hat is the predicted y)
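
For example, with made-up actual and predicted values, the residuals are one subtraction per observation:

```python
# Made-up values for illustration.
y_actual = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.5, 5.5, 6.5, 9.5]

# One residual per observation: actual minus predicted.
residuals = [y - y_pred for y, y_pred in zip(y_actual, y_hat)]

print(residuals)  # [0.5, -0.5, 0.5, -0.5]
```

A positive residual means the model predicted too low for that observation, and a negative one means it predicted too high.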

Question 12

Q

How do you conclude that 2 independent variables are multicollinear using the VIF?

Answer

A

VIF = 1: not correlated.
VIF between 1 and 5: moderately correlated.
VIF greater than 5: highly correlated, so the variables are treated as multicollinear.
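
A sketch of where the number comes from, assuming the usual definition VIF = 1 / (1 - R^2), where R^2 comes from regressing one independent variable on the others. With only two independent variables, that R^2 is just their squared correlation, which is the case handled here. The data and function name are made up.

```python
def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - r^2) for a model with exactly two predictors."""
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    s11 = sum((a - mean1) ** 2 for a in x1)
    s22 = sum((b - mean2) ** 2 for b in x2)
    s12 = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2))
    r_squared = s12 ** 2 / (s11 * s22)
    # Note: perfectly collinear predictors give r_squared == 1 and a division by zero.
    return 1 / (1 - r_squared)

# Made-up housing-style data.
area = [50, 60, 80, 100, 120]
rooms = [2, 2, 3, 4, 5]      # rises almost in step with area
age = [10, 30, 20, 30, 10]   # nearly unrelated to area

vif_high = vif_two_predictors(area, rooms)
vif_low = vif_two_predictors(area, age)

print("area vs rooms:", vif_high)
print("area vs age:  ", vif_low)
```

For models with more than two independent variables, statsmodels provides variance_inflation_factor in statsmodels.stats.outliers_influence.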

(12 cards)