Session 4 Flashcards
A fitting graph shows
the accuracy (or error rate) of a model as a function of model complexity
Fitting graph
Generally, there will be more overfitting as…
one allows the model to be more complex
Complexity is a measure of
the flexibility of a model
If the model is a mathematical function, complexity is measured by
the number of parameters
If the model is a tree, complexity is measured by
the number of nodes
To check for overfitting, you look only at the
holdout data
Generally: A procedure that grows trees until the leaves are pure tends to overfit
- If allowed to grow without bound, decision trees can fit any data to arbitrary precision
- The complexity of a tree lies in the number of nodes
Ways to avoid overfitting for tree induction
- Stop growing the tree before it gets too complex
- Prune back a tree that is too large (reduce its size)
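A minimal sketch of the first strategy (pre-pruning) on toy 1-D data, using a hypothetical depth limit as the stopping rule; the split rule is a simple midpoint, not the information-gain split a real tree inducer would use:

```python
def majority(ys):
    # leaf prediction: the most common class label
    return max(set(ys), key=ys.count)

def grow_tree(xs, ys, depth=0, max_depth=2):
    """Grow a toy 1-D decision tree; stop early (pre-pruning) when
    max_depth is reached or the leaf is already pure."""
    if depth >= max_depth or len(set(ys)) == 1:
        return majority(ys)                      # leaf node
    threshold = sum(xs) / len(xs)                # toy split point (real trees maximize information gain)
    left  = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
    if not left or not right:                    # degenerate split: make a leaf instead
        return majority(ys)
    return {"split": threshold,
            "left":  grow_tree([x for x, _ in left],  [y for _, y in left],  depth + 1, max_depth),
            "right": grow_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth)}

def count_nodes(tree):
    # the complexity of a tree lies in its number of nodes
    if not isinstance(tree, dict):
        return 1
    return 1 + count_nodes(tree["left"]) + count_nodes(tree["right"])
```

Lowering `max_depth` caps the node count, and with it the model's complexity.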
Tuning model parameters
Parameter choices are made using a “nested” holdout set taken from the training data; this nested test set is often called the
“validation” set (to differentiate it from the final test set). Only after all choices are made is the model evaluated on the final test set.
Cross-validation
the dataset is first shuffled, then divided into ten partitions (folds); each fold serves once as the test set while the model is trained on the remaining nine
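The shuffle-then-partition procedure can be sketched with the standard library alone (`k=10` matching the ten partitions above; the `seed` parameter is just for reproducibility):

```python
import random

def cross_validation_folds(data, k=10, seed=0):
    """Shuffle the dataset, then divide it into k partitions (folds).
    Each fold serves once as the holdout set; the rest is training data."""
    data = list(data)
    random.Random(seed).shuffle(data)            # step 1: shuffle
    folds = [data[i::k] for i in range(k)]       # step 2: divide into k partitions
    for i in range(k):
        holdout = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, holdout
```

Every record appears in exactly one holdout set across the k iterations.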
How to find which variables are the most important for the model?
There are multiple ways of determining how important a variable is:
- (Weighted) sum of the information gain of each split in which the variable is used (tree-based models)
- Difference in model performance with and without using that variable (all models)
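The second method (drop-column importance) can be sketched as follows. The scoring function here is a hypothetical stand-in model (predict 1 when the sum of the chosen variables is positive), not any particular learner:

```python
def accuracy(rows, labels, variables):
    """Toy stand-in model: predict 1 when the sum of the chosen
    variables is positive. Any trained model could go here."""
    correct = 0
    for row, y in zip(rows, labels):
        pred = 1 if sum(row[v] for v in variables) > 0 else 0
        correct += (pred == y)
    return correct / len(rows)

def drop_column_importance(rows, labels, variables):
    """Importance of a variable = performance with all variables
    minus performance without that variable."""
    baseline = accuracy(rows, labels, variables)
    return {v: baseline - accuracy(rows, labels, [w for w in variables if w != v])
            for v in variables}
```

A variable that carries no signal gets importance near zero; dropping an informative one costs accuracy.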
“Random Forest”
is a tree-based model that uses multiple decision trees simultaneously
A “Random Forest” model is equivalent to a decision tree if:
- We set the parameter “number of trees” to 1; and
- We set the parameter “subset ratio” to 1
RapidMiner has an operator called “Weight by Tree Importance” that calculates variable importance based on information gain (among other measures)
This operator only works with models of type “Random Forest”
A learning curve is…
a plot of the generalization performance (testing data) against the amount of training data
Learning curve
Generalization performance improves as…
more training data are available
Learning curves are steep initially, but the marginal advantage of more data then decreases
The ROC graph shows
the entire space of performance possibilities for a given model, independent of class balance
A ROC curve can be constructed from
applying different classification thresholds to the scores of a ranking classifier
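Sweeping the threshold over a ranking classifier's scores yields one (false positive rate, true positive rate) point per threshold; a minimal sketch (assuming both classes are present in the labels):

```python
def roc_points(scores, labels):
    """Build ROC points by using each distinct score as a classification
    threshold on a ranking classifier's output (labels are 1/0)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                        # classify nothing as positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))      # (false positive rate, true positive rate)
    return points
```

The curve always runs from (0, 0) to (1, 1); a perfect ranker passes through (0, 1).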
The area under the ROC curve
is the probability that the model will rank a randomly chosen positive case higher than a randomly chosen negative case
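That probabilistic reading of AUC can be computed directly by comparing every positive/negative pair (ties conventionally count as half; this pairwise sketch is O(n²), unlike the sort-based computation a library would use):

```python
def auc(scores, labels):
    """AUC = probability that a randomly chosen positive case scores
    higher than a randomly chosen negative case (labels are 1/0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```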
AUC is useful when
a single number is needed to summarize performance, or when nothing is known about the operating conditions
Comparing models
We can compare the performance of different models by looking
at their ROC curves.
We can choose the optimal model for each threshold
The expected value framework is …
an analytical tool that is extremely helpful in organizing thinking about data-analytic problems.
The expected value framework combines:
- Structure of the problem
- Elements of the analysis that can be extracted from the data
- Elements of the analysis that need to be acquired from other sources (e.g., business knowledge)
The benefit/cost matrix
summarizes the benefits and costs of each potential outcome, always comparing with a base scenario
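Combining the two gives the expected value computation: probabilities of each outcome estimated from the data, multiplied by benefits/costs acquired from business knowledge. The marketing-style numbers below are purely hypothetical:

```python
def expected_value(probabilities, benefits):
    """Expected value = sum over outcomes of p(outcome) * benefit(outcome).
    Probabilities come from the data; benefits and costs come from
    other sources such as business knowledge, relative to a base scenario."""
    assert abs(sum(probabilities.values()) - 1.0) < 1e-9   # outcomes must be exhaustive
    return sum(p * benefits[outcome] for outcome, p in probabilities.items())

# Hypothetical example: targeting a customer who responds with p = 0.1,
# with a benefit of 99 per responder and a cost of 1 per non-responder.
ev = expected_value({'respond': 0.1, 'ignore': 0.9},
                    {'respond': 99, 'ignore': -1})
```

A positive expected value (here 9.0) means targeting beats the base scenario on average.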