Chapter 5 - Overfitting and its Avoidance Flashcards

1
Q

What is overfitting?

A

The tendency of data mining procedures to tailor models to the training data, at the expense of generalization to previously unseen data points.

2
Q

What is a table model and what is the problem with it?

A

A table model memorizes the training data and performs NO GENERALIZATION. This is exactly its problem: the model can reproduce the known training data with 100% accuracy, but it has no predictive power on unseen data.

3
Q

What is generalization?

A

This is a property of a model or modeling process, whereby the model applies to data that was not used to build the model.

4
Q

What is a fitting graph?

A

This is an analytical tool for recognizing overfitting. It plots the accuracy of a model, on both the training data and the holdout data, as a function of model complexity.
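
To make this concrete, here is a minimal sketch of drawing a fitting graph (my own illustration, assuming scikit-learn and matplotlib; the chapter itself is tool-agnostic), using tree depth as the complexity axis:

```python
# Sketch: a fitting graph, using decision-tree depth as the complexity knob.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

depths = range(1, 21)
train_acc, hold_acc = [], []
for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=0).fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))  # keeps rising with complexity
    hold_acc.append(tree.score(X_hold, y_hold))     # rises, then falls: overfitting

plt.plot(depths, train_acc, label="training accuracy")
plt.plot(depths, hold_acc, label="holdout accuracy")
plt.xlabel("model complexity (tree depth)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```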

5
Q

What is holdout data?

A

Data not used in building the model, but for which we do know the actual value of the target variable.

6
Q

How does the error rate of holdout data react to an increase in data points?

A

For the table model, the holdout error rate never decreases as more data points are memorized, because the training and holdout sets never overlap: no holdout instance can be looked up in the table, so holdout performance stays stuck at the base rate.

7
Q

What is the base rate?

A

The base rate is the performance of a trivial classifier that always predicts the majority class. Its error rate can be calculated as, e.g., the percentage of churn cases in the population (when churners are the minority); a table model's holdout performance is stuck at this base rate.
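
As a quick illustration (a hypothetical example, not from the book), the base rate falls out of simple counting:

```python
# Sketch: base rate of a majority-class predictor on hypothetical churn labels.
import numpy as np

y = np.array([0] * 90 + [1] * 10)           # 1 = churn (the minority class)
base_rate_accuracy = np.mean(y == 0)        # always predict "no churn"
base_rate_error = np.mean(y == 1)           # equals the share of churn cases
print(base_rate_accuracy, base_rate_error)  # 0.9 0.1
```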

8
Q

Why does performance degrade due to overfitting?

A

As a model is allowed to become more complex, it can pick up harmful spurious correlations: patterns that hold in the training data only by chance and therefore produce incorrect generalizations to new data.

9
Q

What is the holdout evaluation method?

A

This is a way of guarding against overfitting by dividing the available data into training data and test data (the holdout data): the model is built on the training data only and then evaluated on the holdout data.
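
A minimal sketch of the method (assuming scikit-learn; any train/test split works the same way):

```python
# Sketch: holdout evaluation -- fit on training data, report on holdout data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
# Hold out 30% of the data; the model never sees it during training.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", model.score(X_train, y_train))  # optimistic
print("holdout accuracy: ", model.score(X_hold, y_hold))    # honest estimate
```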

10
Q

What are the cons of a holdout set?

A
It yields only a single estimate of performance, making us too dependent on one particular split; we could simply have been lucky with a favorable test set.
11
Q

What is cross-validation?

A

Cross-validation is a solution to the problems of a single holdout set. It produces not one number but the mean and the variance of the estimated performance; the variance is critical for assessing confidence in the performance estimate.
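
For instance (a sketch assuming scikit-learn's cross_val_score; the card names no tool), the mean and spread come straight from the per-fold scores:

```python
# Sketch: cross-validation yields one score per fold, so we can report
# both the mean and the spread (variance / standard deviation).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

print("per-fold accuracies:", scores)
print(f"mean = {scores.mean():.3f}, std = {scores.std():.3f}")
```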

12
Q

What are the pros of cross-validation?

A
  1. Makes better use of a limited dataset.
  2. Computes its estimates over all the data, by performing multiple splits and systematically swapping out samples for testing.

13
Q

How does cross-validation work?

A
  1. Split the dataset into k partitions, called folds (typically k = 5 or 10).
  2. Iterate training and testing k times.
  3. In each iteration, use (k-1)/k of the data for training and the remaining 1/k for testing.
  4. Calculate the average accuracy over the k folds (see the sketch below).
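
A hand-rolled sketch of those four steps (scikit-learn is assumed only for the example data and the index splits):

```python
# Sketch: k-fold cross-validation spelled out step by step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

k = 5  # step 1: k partitions (folds)
accuracies = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    # steps 2-3: each iteration trains on (k-1)/k and tests on the remaining 1/k
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print("average accuracy over", k, "folds:", np.mean(accuracies))  # step 4
```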
14
Q

Do you trust performance measured on a training set, and why?

A

No, due to the high chance of overfitting. Possible safeguards are cross-validation and the holdout method.

15
Q

Explain the learning curve.

A

This is a plot of generalization performance against the amount of training data used. Learning curves are typically steep initially and flatten out as the dataset grows, because the marginal advantage of additional data decreases.
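
A sketch of plotting one (assuming scikit-learn's learning_curve helper; any loop over growing training sizes would do):

```python
# Sketch: a learning curve -- generalization performance vs. training-set size.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 10), cv=5)

plt.plot(sizes, test_scores.mean(axis=1))  # typically steep at first, then flat
plt.xlabel("number of training instances")
plt.ylabel("cross-validated accuracy")
plt.show()
```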

16
Q

What is the difference between learning curves and fitting graphs?

A

A learning curve shows generalization performance ONLY ON TESTING DATA, plotted against THE AMOUNT OF TRAINING DATA.
A fitting graph shows performance on BOTH TRAINING AND TESTING DATA, plotted against model COMPLEXITY.

Fitting graphs are generally shown for a fixed amount of training data.

17
Q

Compare classification trees and linear logistic regression

A

Classification trees:
* Will overfit more on smaller datasets.
* Can work better for large datasets.
Logistic regression:
* Usually leads to better performance on smaller datasets.

18
Q

What is the deal with tree induction?

A

Main problems:

  1. The tree keeps growing to fit the training data, creating pure leaf nodes.
  2. The result is an overly complex tree that overfits the data.

Solutions:

  1. Stop growing the tree before it gets too complex.
  2. Grow the tree until it is too large, then “prune” it back to reduce its size and complexity.

Methods for these solutions:

  1. Limit tree size by specifying a minimum number of instances that must be present in a leaf.
  2. “Prune” an overly large tree (cut off leaves and branches, and replace them with leaves); see the sketch below.
    * By estimating whether replacing a set of leaves or a branch with a leaf would reduce accuracy; if not, prune.
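
Both methods map onto standard knobs of tree learners; here is a sketch assuming scikit-learn, where min_samples_leaf stops growth early and ccp_alpha applies cost-complexity pruning (one specific pruning technique) to cut a grown tree back:

```python
# Sketch: two ways to keep tree induction from overfitting.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# 1. Stop growing early: require a minimum number of instances per leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# 2. Grow fully, then prune back: cost-complexity pruning replaces branches
#    with leaves when doing so costs little accuracy.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

print("leaves (stopped early):", stopped.get_n_leaves())
print("leaves (pruned):      ", pruned.get_n_leaves())
```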