Week 7: Linear Regression 2 Flashcards

1
Q

What are the 4 assumptions for linear regression?

A

1) Linearity (y is linearly related to x)
2) Independence (of the errors)
3) Constant variance of the errors (homoscedasticity)
4) Normality (of the errors)

2
Q

Why is it always good to plot y against x before doing linear regression?

A

1) Help us visualise the relationship and identify any trends/patterns
2) Identify outliers
3) Check the linearity assumption
4) Understand the data

3
Q

What is an “influential observation”?

A

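A data point that substantially changes the fitted regression line (slope or intercept) when it is included or removed. It is often an outlier with an extreme x value, but not every outlier is influential.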
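
A minimal sketch of the idea, using made-up numbers (the data below are hypothetical, not from the course): fit the line with and without one extreme point and compare the slopes.

```python
import numpy as np

# Hypothetical data: five points exactly on y = 2x, plus one extreme point.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 5.0])

# Fit y = slope * x + intercept with all points, then without the last one.
slope_all, intercept_all = np.polyfit(x, y, 1)
slope_trim, intercept_trim = np.polyfit(x[:-1], y[:-1], 1)

print(f"slope with the point:    {slope_all:.3f}")
print(f"slope without the point: {slope_trim:.3f}")
```

If removing a single point changes the slope this much, that point is influential.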

4
Q

What is multicollinearity and how do you remedy it?

A

When 2 or more independent variables in a regression model are highly correlated with each other. A common remedy is to remove one of the correlated variables.

5
Q

What is the point of multiple linear regression (MLR)?

A

To increase the accuracy of the estimates by using more than one independent variable to explain the dependent variable

6
Q

What determines the best MLR? What do you use to measure it?

A

The one that accounts for the largest proportion of variation in the dependent variable while using the fewest independent variables

Use adjusted R-squared, which penalises extra independent variables that add little explanatory power
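
A small sketch of the standard adjusted R-squared formula, 1 - (1 - R^2)(n - 1)/(n - p - 1). The function name and the example numbers are made up for illustration.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R^2 for n observations and p independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical comparison: model B has a slightly higher R^2,
# but needs 10 variables instead of 2.
model_a = adjusted_r2(0.80, n=50, p=2)
model_b = adjusted_r2(0.81, n=50, p=10)

print(f"model A: {model_a:.3f}")
print(f"model B: {model_b:.3f}")
```

Here model A would be preferred: its adjusted R-squared is higher even though its plain R-squared is lower.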

7
Q

Explain this line of code:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101)

A

You are using the train_test_split function to split the dataset into training and test sets.

X is the input data, and y is the output or target data.

test_size=0.2 specifies that 20% of the data will be used for testing, and the remaining 80% will be used for training.

random_state=101 ensures that the same random split will be generated each time the code is run, which allows for reproducibility of the results.

The function returns four variables:

X_train - the training input data
X_test - the testing input data
y_train - the training output data
y_test - the testing output data
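
A quick way to see this for yourself, assuming scikit-learn is installed. The toy arrays are invented for illustration; the call is the same as on the card.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 2 input columns.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101
)
print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows

# Same random_state, same split.
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=101
)
print(np.array_equal(X_test, X_test_again))
```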

8
Q

Explain this code:
X_test.drop("Price", axis=1, inplace=True)

What if inplace=False?

A

It drops the "Price" column from the X_test DataFrame (axis=1 means columns). inplace=True means that X_test itself is modified rather than a new DataFrame being returned

If inplace=False (the default), drop returns a new DataFrame and leaves X_test unchanged, so the result is lost unless it is assigned to a variable
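
A short sketch of the difference, assuming pandas is installed. The "Area" and "Rooms" columns and all values are made up for illustration.

```python
import pandas as pd

X_test = pd.DataFrame({"Area": [50, 80], "Rooms": [2, 3], "Price": [100, 160]})

# inplace=False (the default): a new DataFrame is returned, X_test is unchanged.
dropped = X_test.drop("Price", axis=1)
columns_after_default = list(X_test.columns)

# inplace=True: X_test itself is modified and the call returns None.
result = X_test.drop("Price", axis=1, inplace=True)
columns_after_inplace = list(X_test.columns)

print(columns_after_default)
print(columns_after_inplace)
print(result)
```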

9
Q

What does R^2 measure?

A

The proportion of variance in the dependent variable that can be explained by the independent variables
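
As a rough illustration, R^2 can be computed by hand as 1 - SS_res / SS_tot. The data below are invented, and np.polyfit stands in for whatever regression routine you actually use.

```python
import numpy as np

# Hypothetical data lying close to a straight line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)     # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation in y
r2 = 1.0 - ss_res / ss_tot

print(f"R^2 = {r2:.4f}")
```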

10
Q

How do you check that y is linearly related to x? (1st assumption of LR)

A

1) Plot it out using a scatter plot
2) Calculate the correlation coefficient
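
A small sketch of step 2 with invented data, which also shows why step 1 still matters: a correlation coefficient near 0 only rules out a linear relationship, not a curved one.

```python
import numpy as np

x = np.arange(-3, 4, dtype=float)  # -3, -2, ..., 3
y_linear = 2 * x + 1               # perfectly linear in x
y_curved = x ** 2                  # clearly related to x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

print(f"linear: r = {r_linear:.2f}")
print(f"curved: r = {r_curved:.2f}")
```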

11
Q

What does “residuals” mean?

A

The differences between the actual and predicted values: y_actual - y_hat (where y_hat is the predicted y)

12
Q

How do you conclude that 2 independent variables are multicollinear using the VIF?

A

As a rule of thumb, for the VIF of each independent variable:
1 = not correlated.
Between 1 and 5 = moderately correlated.
Greater than 5 = highly correlated.
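
A minimal sketch of how a VIF can be computed by hand: regress one independent variable on the others and take 1 / (1 - R^2). The helper name vif and the simulated data are assumptions for illustration; libraries such as statsmodels provide a ready-made version.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns, then 1 / (1 - R^2)."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(target)), others])  # add an intercept
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Simulated data: x3 is almost a copy of x1, x2 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

In this setup x1 and x3 should come out well above 5, while x2 should stay close to 1.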
