L4 - Other Regression Models, Data Splits, Comparing Models Flashcards

1
Q

How can we improve on the simple linear regression model?

A

By using polynomial regression. Adding higher-order terms of the input lets the model capture curved (non-linear) relationships that a straight line cannot.
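
As a minimal sketch of the idea, the snippet below fits a straight line and a quadratic to the same curved data with NumPy's `polyfit` and compares their training errors. The data and the helper name `mean_squared_error` are made up for illustration.

```python
import numpy as np

# Hypothetical data that follows a curve: y = 1 + 2x + 0.5x^2 (no noise).
x = np.linspace(-3.0, 3.0, 21)
y = 1.0 + 2.0 * x + 0.5 * x**2

def mean_squared_error(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, degree)      # least-squares fit
    predictions = np.polyval(coeffs, x)    # evaluate the fitted polynomial
    return float(np.mean((y - predictions) ** 2))

linear_mse = mean_squared_error(1)     # simple linear regression
quadratic_mse = mean_squared_error(2)  # polynomial regression, order 2

print(f"linear MSE:    {linear_mse:.4f}")
print(f"quadratic MSE: {quadratic_mse:.4f}")
```

Because the data here is exactly quadratic, the order-2 fit leaves essentially no error, while the straight line cannot follow the curve.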

2
Q

When using polynomial regression, what do we need to be careful of?

A

We must ensure that the model is not overfit.

We want to learn from the data, not learn the data.

3
Q

What is the ‘Order of Polynomial’?

A

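The highest power of the input variable in the polynomial regression equation (also called the degree). A polynomial of order d in one input has d + 1 terms, counting the intercept.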
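
For reference, a polynomial regression of order d in a single input x is usually written as below. The symbols y, x, β and ε are the common textbook notation, assumed here rather than taken from the card.

```latex
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon
```

With d = 1 this reduces to simple linear regression.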

4
Q

What is the relationship between polynomial regression complexity and the order of polynomial?

A

As the order increases, so does the model complexity.

5
Q

Is polynomial regression linear?

A

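Yes. It is linear in its coefficients (Beta), even though the fitted curve is non-linear in the input x.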
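
One way to see this: the powers of x can be precomputed as columns of a design matrix, after which the fit is an ordinary linear least-squares problem in the coefficients. A minimal sketch with NumPy, using made-up data:

```python
import numpy as np

# Made-up data generated from y = 1 + 2x + 0.5x^2.
x = np.linspace(-3.0, 3.0, 21)
y = 1.0 + 2.0 * x + 0.5 * x**2

# Design matrix with columns [1, x, x^2]: non-linear in x ...
X = np.vander(x, N=3, increasing=True)

# ... but the model y = X @ beta is linear in beta, so ordinary
# linear least squares recovers the coefficients.
beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(beta)
```

The solver never needs to know that the second and third columns are powers of the first; it only sees a linear combination of columns.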

6
Q

What determines whether the model is non-linear or not?

A

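The order of the coefficients (Beta), not the order of x. If every Beta appears only to the first power, the model is linear, whatever powers of x it contains.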

7
Q

What is the Bayesian Information Criterion (BIC)?

A

A score for comparing candidate versions of a model fitted to the same data, where the versions differ in their number of parameters.

E.g. Model A with 2 parameters, Model A with 4 parameters, etc.

8
Q

What does BIC measure?

A

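Complexity and error. It rewards a good fit to the data (low error) and penalises each additional parameter (higher complexity).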
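
For reference, the standard definition is shown below, where k is the number of estimated parameters, n is the number of observations, and \hat{L} is the maximised likelihood of the model. The notation is assumed here, not taken from the card.

```latex
\mathrm{BIC} = k \ln(n) - 2 \ln(\hat{L})
```

The first term grows with model complexity; the second term shrinks as the fit improves.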

9
Q

Using BIC, how can we establish the best performing model?

A

The best performing version of the model is the one with the lowest BIC score, i.e. the lowest point on the BIC graph.
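
A minimal sketch of this selection, assuming Gaussian errors so that BIC can be computed from the residual sum of squares as n ln(RSS / n) + k ln(n), which is valid up to a constant shared by all models on the same data. The data, the function name `bic_for_degree`, and the choice to count only the polynomial coefficients in k are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy data whose true relationship is quadratic.
x = np.linspace(-3.0, 3.0, 50)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.5, size=x.size)

def bic_for_degree(degree):
    """BIC of a polynomial fit, assuming Gaussian errors.

    Uses n * ln(RSS / n) + k * ln(n), dropping a constant that is
    the same for every model fitted to this data.
    """
    n = x.size
    k = degree + 1  # assumption: count only the polynomial coefficients
    coeffs = np.polyfit(x, y, degree)
    rss = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    return n * np.log(rss / n) + k * np.log(n)

# Score each candidate order, then pick the one with the lowest BIC.
bic_scores = {degree: bic_for_degree(degree) for degree in range(1, 7)}
best_degree = min(bic_scores, key=bic_scores.get)

for degree, score in bic_scores.items():
    print(f"degree {degree}: BIC = {score:.2f}")
print("lowest BIC at degree", best_degree)
```

Plotting `bic_scores` against the degree would give the BIC graph the card refers to.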

10
Q

What is the purpose of train/test splits?

A

A split partitions the dataset into training and test subsets. We train the model on the training set and analyse its performance on the test set.

This indicates how well the model will perform on unseen data.
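
A minimal sketch of such a split using only the standard library. The function name `train_test_split`, the 20% test fraction, and the stand-in data are assumptions for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and split off a held-out test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # First n_test shuffled rows become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))  # stand-in for 10 rows of a dataset
train_rows, test_rows = train_test_split(data, test_fraction=0.2)

print("train:", train_rows)
print("test: ", test_rows)
```

The model would be fitted on `train_rows` only, and its error measured on `test_rows`.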

11
Q

What is another name for the test set?

A

Holdout set.

12
Q

What are the 2 main issues with conducting train/test splits?

A

Data Leakage - Test set data can leak into the training set and influence the training of the model, e.g. through duplicated rows.

Weak Model - If the model performs weakly, we have to re-split the data.
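
As one illustration of the duplicate-data case, the sketch below removes exact duplicate rows before splitting, so the same row cannot end up in both subsets. The rows and the split sizes are made up, and this only addresses exact duplicates, not other forms of leakage.

```python
import random

# Made-up rows; (2, "b") appears twice.
rows = [(1, "a"), (2, "b"), (3, "c"), (2, "b"), (4, "d"), (5, "e")]

# Drop exact duplicates BEFORE splitting, keeping first occurrences.
unique_rows = list(dict.fromkeys(rows))

random.Random(0).shuffle(unique_rows)
test_rows, train_rows = unique_rows[:2], unique_rows[2:]

# With duplicates removed, no row can sit in both subsets.
leaked = set(train_rows) & set(test_rows)
print("rows shared by train and test:", leaked)
```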

13
Q

What is the purpose of Cross-Validation?

A

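Protect the model against overfitting. Evaluating on several different train/test splits gives a more reliable estimate of performance on unseen data than a single split.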

14
Q

How does K-fold Cross-Validation work?

A
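  1. Shuffle the dataset and split it into K groups (folds).
  2. Take one group as the test set and the remaining K-1 groups as the training set.
  3. Train and evaluate the model, then repeat until each of the K groups has been the test set once.
  4. Summarise model performance across the folds, e.g. by averaging the K error scores.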
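
The steps above can be sketched with the standard library alone. The function names `k_fold_indices` and `cross_validate` are hypothetical, and the "model" is a deliberate placeholder that always predicts the training mean, so the focus stays on the fold logic.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(targets, k=5):
    """Return the mean squared error measured on each fold."""
    folds = k_fold_indices(len(targets), k)
    fold_errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [targets[i] for i in range(len(targets)) if i not in held_out]
        # Placeholder "model": always predict the training mean.
        prediction = mean(train)
        fold_errors.append(mean((targets[i] - prediction) ** 2 for i in test_idx))
    return fold_errors

targets = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
errors = cross_validate(targets, k=5)
print("per-fold MSE:", errors)
print("mean CV error:", mean(errors))
```

Every row is used for testing exactly once and for training K-1 times, which is what distinguishes this from repeating a single random split.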
15
Q

What are the 2 sampling methods for balancing datasets?

A

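Random - Can lead to non-equal group sizes, i.e. some classes end up over- or under-represented in a split.

Stratified - Keeps the class proportions of the full dataset in every split, which ensures cross-validation is a close approximation of the generalisation error.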
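
A minimal sketch of stratified fold assignment, using only the standard library. The function name `stratified_folds` and the labels are made up; libraries such as scikit-learn offer their own implementation (`StratifiedKFold`), which this does not reproduce.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign row indices to k folds, class by class, round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out evenly so every fold keeps the class ratio.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

labels = ["pos"] * 4 + ["neg"] * 8   # 1:2 class ratio overall
for fold in stratified_folds(labels, k=4):
    print(Counter(labels[i] for i in fold))
```

With purely random assignment, a fold could by chance contain no "pos" rows at all; dealing out each class separately rules that out.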
