Session 4 Flashcards
A fitting graph shows
the accuracy (or error rate) of a model as a function of model complexity
Fitting graph
Generally, there will be more overfitting as…
one allows the model to be more complex
Complexity is a measure of
the flexibility of a model
If the model is a mathematical function, complexity is measured by
the number of parameters
If the model is a tree, complexity is measured by
the number of nodes
To check for overfitting, you look only at the
holdout data
Generally: A procedure that grows trees until the leaves are pure tends to overfit
- If allowed to grow without bound, decision trees can fit any data to arbitrary precision
- The complexity of a tree lies in the number of nodes
Ways to avoid overfitting for tree induction
- Stop growing the tree before it gets too complex
- Prune back a tree that is too large (reduce its size)
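A minimal sketch of the first strategy (pre-pruning) on toy 1-D data, using a hypothetical depth limit as the stopping rule; the split rule is a simple midpoint, not the information-gain split a real tree inducer would use:

```python
def majority(ys):
    # leaf prediction: the most common class label
    return max(set(ys), key=ys.count)

def grow_tree(xs, ys, depth=0, max_depth=2):
    """Grow a toy 1-D decision tree; stop early (pre-pruning) when
    max_depth is reached or the leaf is already pure."""
    if depth >= max_depth or len(set(ys)) == 1:
        return majority(ys)                      # leaf node
    threshold = sum(xs) / len(xs)                # toy split point (real trees maximize information gain)
    left  = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:                    # degenerate split: make a leaf instead
        return majority(ys)
    return {"split": threshold,
            "left":  grow_tree([x for x, _ in left],  [y for _, y in left],  depth + 1, max_depth),
            "right": grow_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth)}

def count_nodes(tree):
    # the complexity of a tree lies in its number of nodes
    if not isinstance(tree, dict):
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])
```

Lowering `max_depth` caps the node count, and with it the model's complexity.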
Tuning model parameters
Parameter choices are made using a “nested” holdout set taken from the training data; this nested test set is often called the
“validation” set (to differentiate it from the final test set). Only after all choices are made is the model evaluated on the final test set.
Cross-validation
the dataset is first shuffled, then divided into ten partitions (folds); each fold serves once as the test set while the model is trained on the remaining nine
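The shuffle-then-partition procedure can be sketched with the standard library alone (`k=10` matching the ten partitions above; the `seed` parameter is just for reproducibility):

```python
import random

def cross_validation_folds(data, k=10, seed=0):
    """Shuffle the dataset, then divide it into k partitions (folds).
    Each fold serves once as the holdout set; the rest is training data."""
    data = list(data)
    random.Random(seed).shuffle(data)            # step 1: shuffle
    folds = [data[i::k] for i in range(k)]       # step 2: divide into k partitions
    for i in range(k):
        holdout = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, holdout
```

Every record appears in exactly one holdout set across the k iterations.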
How to find which variables are the most important for the model?
There are multiple ways of determining how important a variable is:
- (Weighted) sum of the information gain of each split in which the variable is used (tree-based models)
- Difference in model performance with and without using that variable (all models)
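The second method (drop-column importance) can be sketched as follows. The scoring function here is a hypothetical stand-in model (predict 1 when the sum of the chosen variables is positive), not any particular learner:

```python
def accuracy(rows, labels, variables):
    """Toy stand-in model: predict 1 when the sum of the chosen
    variables is positive. Any trained model could go here."""
    correct = 0
    for row, y in zip(rows, labels):
        pred = 1 if sum(row[v] for v in variables) > 0 else 0
        correct += (pred == y)
    return correct / len(rows)

def drop_column_importance(rows, labels, variables):
    """Importance of a variable = performance with all variables
    minus performance without that variable."""
    baseline = accuracy(rows, labels, variables)
    return {v: baseline - accuracy(rows, labels, [w for w in variables if w != v])
            for v in variables}
```

A variable that carries no signal gets importance near zero; dropping an informative one costs accuracy.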
“Random Forest”
is a tree-based model that uses multiple decision trees simultaneously
A “Random Forest” model is equivalent to a decision tree if:
- We set the parameter “number of trees” to 1; and
- We set the parameter “subset ratio” to 1
RapidMiner has an operator called “Weight by Tree Importance” that calculates variable importance based on information gain (among other measures)
This operator only works with models of type “Random Forest”
A learning curve is…
a plot of the generalization performance (testing data) against the amount of training data
Learning curve
Generalization performance improves as…
more training data are available
Learning curves are steep initially, but the marginal advantage of more data then decreases
The ROC graph shows
the entire space of performance possibilities for a given model, independent of class balance
A ROC curve can be constructed from
applying different classification thresholds to the scores of a ranking classifier
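Sweeping the threshold over a ranking classifier's scores yields one (false positive rate, true positive rate) point per threshold; a minimal sketch (assuming both classes are present in the labels):

```python
def roc_points(scores, labels):
    """Build ROC points by using each distinct score as a classification
    threshold on a ranking classifier's output (labels are 1/0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                        # classify nothing as positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))      # (false positive rate, true positive rate)
    return points
```

The curve always runs from (0, 0) to (1, 1); a perfect ranker passes through (0, 1).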
The area under the ROC curve
is the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case
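That probabilistic reading of AUC can be computed directly by comparing every positive/negative pair (ties conventionally count as half; this pairwise sketch is O(n²), unlike the sort-based computation a library would use):

```python
def auc(scores, labels):
    """AUC = probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (labels are 1/0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```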
AUC is useful when
a single number is needed to summarize performance, or when nothing is known about the operating conditions
Comparing models
We can compare the performance of different models by looking
at their ROC curves.
We can choose the optimal model for each threshold
The expected value framework is …
an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.
The expected value framework combines:
- Structure of the problem
- Elements of the analysis that can be extracted from the data
- Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
The benefit/cost matrix
summarizes the benefits and costs of each potential outcome, always comparing with a base scenario
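Combining the two gives the expected value computation: probabilities of each outcome estimated from the data, multiplied by benefits/costs acquired from business knowledge. The marketing-style numbers below are purely hypothetical:

```python
def expected_value(probabilities, benefits):
    """Expected value = sum over outcomes of p(outcome) * benefit(outcome).
    Probabilities come from the data; benefits and costs come from
    other sources such as business knowledge, relative to a base scenario."""
    assert abs(sum(probabilities.values()) - 1.0) < 1e-9   # outcomes must be exhaustive
    return sum(p * benefits[outcome] for outcome, p in probabilities.items())

# Hypothetical example: targeting a customer who responds with p = 0.1,
# with a benefit of 99 per responder and a cost of 1 per non-responder.
ev = expected_value({'respond': 0.1, 'ignore': 0.9},
                    {'respond': 99, 'ignore': -1})
```

A positive expected value (here 9.0) means targeting beats the base scenario on average.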