Session 3 Flashcards
We want to sell a particular life insurance product. We have an attractive offer, but targeting each customer incurs a cost. How should we proceed?
- Define target
- Collect data
- Build a model
- Predict outcomes
How to choose at each step which of the attributes to use to segment the population?
General rule: the resulting groups should be as pure as possible, i.e., homogeneous with respect to the target variable.
The concept of information provides a way to…
… quantify the amount of surprise for an event measured in bits.
Intuition
Rare (low-probability) events are more surprising, and therefore carry more information, than common (high-probability) events.
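To make the intuition concrete, here is a minimal sketch that measures surprise as the negative base-2 logarithm of an event's probability, which is the standard definition of information in bits. The function name `surprise_bits` and the example probabilities are illustrative assumptions, not part of the original cards.

```python
import math


def surprise_bits(p: float) -> float:
    """Information content (surprise) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)


# Hypothetical probabilities, chosen only for illustration.
common = surprise_bits(0.5)       # e.g. a fair coin landing heads
rare = surprise_bits(1 / 1024)    # a much less likely event

print(f"common event: {common} bits, rare event: {rare} bits")
```

The common event carries 1 bit and the rare one 10 bits: the lower the probability, the higher the surprise.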
Entropy
- Disorder corresponds to how mixed (impure) a segment is
- Entropy is zero at minimum disorder (all members belong to the same class)
- Entropy is one at maximal disorder for a two-class target measured in bits (members equally distributed among classes)
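The two extremes above can be checked with a small sketch of Shannon entropy, computed as the sum over classes of -p * log2(p), where p is the share of the segment in each class. The helper name `entropy` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical segments, for illustration only.
pure = ["yes", "yes", "yes", "yes"]      # minimum disorder
mixed = ["yes", "no", "yes", "no"]       # two classes, evenly split
four_way = ["a", "b", "c", "d"]          # four classes, evenly split

print(entropy(pure), entropy(mixed), entropy(four_way))
```

The pure segment has entropy 0 and the evenly split two-class segment has entropy 1. With four equally likely classes the maximum is 2 bits, so the "entropy is one" ceiling applies to the two-class case.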
Information Gain
- Information gain (IG) measures the change in entropy due to any amount of new information being added
- Information gain measures how much an attribute decreases entropy over the whole segmentation it creates
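As a sketch of the second definition, information gain can be computed as the entropy of the parent segment minus the size-weighted average entropy of the child segments produced by a split. The function names and the toy "buy"/"no" labels below are illustrative assumptions.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children) -> float:
    """Parent entropy minus the size-weighted entropy of the child segments."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical data: 4 buyers and 4 non-buyers in the parent segment.
parent = ["buy"] * 4 + ["no"] * 4
perfect_split = [["buy"] * 4, ["no"] * 4]
useless_split = [["buy", "buy", "no", "no"], ["buy", "buy", "no", "no"]]

print(information_gain(parent, perfect_split))
print(information_gain(parent, useless_split))
```

A split that produces pure children recovers the full 1 bit of parent entropy, while a split whose children are as mixed as the parent gains nothing.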
How to choose at each step which of the attributes to use to segment the population?
Rule: choose the variable that provides the most information gain with respect to the target variable
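The rule can be sketched as a loop over candidate attributes that keeps the one with the highest information gain. The record layout, the attribute names (`homeowner`, `region`), and the target name (`bought`) are hypothetical and exist only for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_for_attribute(rows, attribute, target) -> float:
    """Information gain from segmenting rows by each value of one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    parent = [row[target] for row in rows]
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(parent) - weighted


def best_attribute(rows, attributes, target) -> str:
    """Pick the attribute with the highest information gain on the target."""
    return max(attributes, key=lambda a: gain_for_attribute(rows, a, target))


# Hypothetical customer records, for illustration only.
rows = [
    {"homeowner": "yes", "region": "north", "bought": "yes"},
    {"homeowner": "yes", "region": "south", "bought": "yes"},
    {"homeowner": "no", "region": "north", "bought": "no"},
    {"homeowner": "no", "region": "south", "bought": "no"},
]
choice = best_attribute(rows, ["homeowner", "region"], "bought")
print(choice)
```

In this toy data `homeowner` separates buyers from non-buyers perfectly, while `region` leaves both groups evenly mixed, so `homeowner` is selected.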
Do decision trees evaluate the information gain of all the variables at each split?
Yes
Can we use the same variable to split the data more than once?
Yes
How is the split done for continuous variables (e.g., income)?
Different thresholds are tested; the threshold with the highest IG is used
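A minimal sketch of the threshold search follows. It assumes the candidate thresholds are the midpoints between consecutive distinct sorted values, which is a common convention but not something the cards specify. The income figures and labels are made up for illustration.

```python
import math
from collections import Counter


def entropy(labels) -> float:
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the split 'value <= threshold' with highest IG."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent_entropy = entropy(labels)
    best = (None, 0.0)
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate identical values
        threshold = (low + high) / 2  # assumed candidate: midpoint
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = parent_entropy - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Hypothetical incomes (in thousands) and purchase outcomes.
incomes = [20, 25, 30, 60, 70, 80]
bought = ["no", "no", "no", "yes", "yes", "yes"]
threshold, gain = best_threshold(incomes, bought)
print(threshold, gain)
```

Here the search settles on a threshold of 45, midway between 30 and 60, because it is the only candidate that yields two pure segments.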
The confusion matrix
The confusion matrix tabulates actual classes against predicted classes, making a model's performance easy to inspect
True Positives (TP)
actual positives correctly predicted as positive
True Negatives (TN)
actual negatives correctly predicted as negative
False Positives (FP)
actual negatives incorrectly predicted as positive
False Negatives (FN)
actual positives incorrectly predicted as negative
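The four cells can be counted directly from paired actual and predicted labels, as in the sketch below. The function name `confusion_counts`, the `positive` parameter, and the sample labels are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary target."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # actual positive, predicted positive
        elif a != positive and p != positive:
            tn += 1  # actual negative, predicted negative
        elif a != positive and p == positive:
            fp += 1  # actual negative, predicted positive
        else:
            fn += 1  # actual positive, predicted negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical outcomes and predictions, for illustration only.
actual = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "no", "yes"]
counts = confusion_counts(actual, predicted)
print(counts)
```

The four counts always sum to the number of cases, which is a quick sanity check on any confusion matrix.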