Session 4 Flashcards

1
Q

A fitting graph shows

A

the accuracy (or error rate) of a model as a function of model complexity

2
Q

Fitting graph

Generally, there will be more overfitting as…

A

one allows the model to be more complex

3
Q

Complexity is a measure of

A

the flexibility of a model

4
Q

If the model is a mathematical function, complexity is measured by

A

the number of parameters

5
Q

If the model is a tree, complexity is measured by

A

the number of nodes

6
Q

To look at overfitting, you need to look at the

A

holdout data (and compare performance there with performance on the training data)

7
Q

Why does a procedure that grows trees until the leaves are pure generally tend to overfit?

A
  • If allowed to grow without bound, decision trees can fit any data to arbitrary precision
  • The complexity of a tree lies in the number of nodes
8
Q

Ways to avoid overfitting for tree induction

A
  • Stop growing the tree before it gets too complex
  • Prune back a tree that is too large (reduce its size)

9
Q

Tuning model parameters

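To tune parameters, a nested holdout set is split off from the training data and used to compare the choices. Only when all choices are made is the model tested on the test set. This “nested” test set is often called the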

A

“validation” set (to differentiate from the final test set).
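
A minimal sketch of such a three-way split, assuming simple random sampling. The function name and the 60/20/20 proportions are illustrative assumptions, not part of the course material.

```python
import random

def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]                       # touched only once, at the very end
    validation = rows[n_test:n_test + n_val]   # used to compare parameter choices
    training = rows[n_test + n_val:]           # used to fit each candidate model
    return training, validation, test

train, val, test = three_way_split(range(100))
```

Parameter choices are compared on `val`; `test` is used only after all choices are made.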

10
Q

Cross-validation

A

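the dataset is first shuffled, then divided into k partitions (folds), for example ten; each fold in turn serves as the holdout data while the model is trained on the remaining folds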
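
The shuffle-and-partition step can be sketched as follows, assuming ten folds. The function name `k_fold_indices` is a hypothetical helper, not a library call.

```python
import random

def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (training, holdout) index lists, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)          # shuffle first
    folds = [indices[i::k] for i in range(k)]     # then divide into k partitions
    for i in range(k):
        holdout = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, holdout

splits = list(k_fold_indices(50, k=10))
```

Each row lands in the holdout data exactly once, so averaging the k holdout scores uses every row for testing.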

11
Q

How to find which variables are the most important for the model?

A

There are multiple ways of determining how important a variable is:

  • (Weighted) sum of the information gain over all splits in which the variable is used (tree-based models)
  • Difference in model performance with and without using that variable (all models)
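
A minimal sketch of the information gain of a single split, the quantity that the first approach sums over a tree. The toy data and function names are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]   # separates the classes completely
useless = ["a", "b", "a", "b"]   # carries no information about the class
gain_perfect = information_gain(perfect, labels)
gain_useless = information_gain(useless, labels)
```

Here `gain_perfect` is one bit and `gain_useless` is zero, so the first variable would rank as the more important one.
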
12
Q

“Random Forest”

A

is a tree-based model that uses multiple decision trees simultaneously

A “Random Forest” model is equivalent to a decision tree if:

  • We set the parameter “number of trees” to 1; and
  • We set the parameter “subset ratio” to 1

RapidMiner has an operator called “Weight by Tree Importance” that calculates the variable importance based on information gain (among others).
This operator only works with models of type “Random Forest”.

13
Q

A learning curve is…

A

a plot of the generalization performance (testing data) against the amount of training data

14
Q

Learning curve

Generalization performance improves as…

A

more training data are available

Steep initially, but then marginal advantage of more data decreases

15
Q

The ROC graph shows

A

the entire space of performance possibilities for a given model, independent of class balance

16
Q

A ROC curve can be constructed from

A

the scores of a ranking classifier, by selecting different thresholds (each threshold gives one point on the curve)
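
A minimal sketch of this construction, assuming labels coded as 1 (positive) and 0 (negative) and higher scores meaning "more likely positive". The scores are made-up toy values.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4]
labels = [1, 0, 1, 0]
points = roc_points(scores, labels)
# points == [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Lowering the threshold moves the point up and to the right, from (0, 0) to (1, 1).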

17
Q

The area under the ROC curve

A

is the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case
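
This interpretation can be checked directly by counting pairs. The sketch below assumes labels coded as 1 and 0 and counts ties as half a win, which is a common convention rather than something stated in the card.

```python
def auc_by_pairs(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_by_pairs([0.9, 0.8, 0.7, 0.4], [1, 0, 1, 0])
# auc == 0.75: three of the four positive/negative pairs are ranked correctly
```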

18
Q

AUC is useful when

A

a single number is needed to summarize performance, or when nothing is known about the operating conditions

19
Q

Comparing models

We can compare the performance of different models by looking

A

at their ROC curves.

We can choose the optimal model for each threshold

20
Q

The expected value framework is …

A

an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.

21
Q

The expected value framework combines:

A
  1. Structure of the problem
  2. Elements of the analysis that can be extracted from the data
  3. Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
22
Q

The benefit/cost matrix

A

summarizes the benefits and costs of each potential outcome, always comparing with a base scenario
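
A minimal sketch of how a benefit/cost matrix enters an expected value calculation: each outcome's probability, estimated from the data, is multiplied by its value, taken from business knowledge. All counts and monetary values below are hypothetical, and the base scenario (not acting) is assumed to be worth 0.

```python
def expected_value(counts, values):
    """Sum of p(outcome) * value(outcome) over all outcomes."""
    total = sum(counts.values())
    return sum(counts[outcome] / total * values[outcome] for outcome in counts)

# Hypothetical confusion-matrix counts on holdout data (extracted from the data).
counts = {"TP": 50, "FP": 50, "FN": 10, "TN": 890}
# Hypothetical benefit/cost matrix relative to not acting (from business knowledge).
values = {"TP": 99.0, "FP": -1.0, "FN": 0.0, "TN": 0.0}

ev = expected_value(counts, values)  # expected value per case, about 4.9
```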