Foundation Flashcards
How does Predictive Modelling work?
It predicts an OUTCOME (target) based on a set of INPUTS
What are the 2 types of Prediction?
Classification, Estimation
What criterion must be met to use “Classification”?
Target must be CATEGORICAL
Hint: “Class” in “Classification” => Categorical
What criterion must be met to use “Estimation”?
Target must be CONTINUOUS (numerical)
Hint: “Estimate” => Numbers, hence continuous
In what proportions should we split the data?
70% TRAINING, 30% TESTING
Which node in SPSS Modeler can we use to split the data?
Partition node
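Why is it necessary to SPLIT the data?
Aim of a predictive model: it should be accurate on UNSEEN data, not only on the data it was trained on
What is UNSEEN data? It is the TESTING data! The model never sees it during training, so it shows how well the model works on new records.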
Why do we use a “Seed” in IBM SPSS?
Records (data rows) are RANDOMLY assigned to either training or testing data. The seed fixes the starting point of that random selection.
Why remember the exact seed number?
To ensure that the SAME records are assigned to training/testing data on every run, instead of a different random split each time
This further ensures that the model’s result is consistent, as the same records are chosen to be training and testing data respectively.
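The effect of a fixed seed can be sketched in plain Python. This is only an illustration of the idea, not how the Partition node is implemented; the function name `partition` and the seed values are made up for the example.

```python
import random

def partition(records, train_fraction=0.7, seed=1234):
    """Shuffle a copy of the records with a fixed seed, then cut at train_fraction."""
    rng = random.Random(seed)          # private generator driven only by the seed
    shuffled = list(records)           # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))             # stand-in for 100 data rows

# Same seed twice: the same records land in training and testing both times.
train_a, test_a = partition(records, seed=1234)
train_b, test_b = partition(records, seed=1234)

print(len(train_a), len(test_a))       # sizes of the 70/30 split
print(train_a == train_b)              # the split is reproducible
```

Changing the seed generally gives a different split, which is why results can only be compared fairly when the same seed is reused.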
What does CART stand for?
Classification And Regression Tree
When to use a Regression Tree?
When the type of prediction is ESTIMATION (the target is numerical)
What are the 2 impurity measures for Classification?
Gini Index, Entropy
What is the impurity measure for Regression?
Sum of Squared Error (SSE)
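As a small worked sketch of SSE as an impurity measure (the numbers and the split are invented for illustration): a good split leaves each child node with values close to its own mean, so the total SSE drops.

```python
def sse(values):
    """Sum of squared differences between each value and the node's mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

# Hypothetical numerical targets in a parent node and its two child nodes.
parent = [10.0, 12.0, 30.0, 32.0]
left, right = [10.0, 12.0], [30.0, 32.0]

parent_sse = sse(parent)              # spread around the single parent mean
split_sse = sse(left) + sse(right)    # spread around each child's own mean

print(parent_sse, split_sse)
```

The lower SSE after the split means the child nodes are more homogeneous than the parent.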
Is lower or higher GINI better?
Lower! Because Gini shows impurity. Gini = 1 - purity, where purity is the sum of the squared class proportions in the node.
Why is lower GINI more desirable?
The nodes are more homogeneous = better prediction = better model
State the 4 kinds of nodes in a decision tree
Parent node, Child node, Root node, Leaf node
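A minimal sketch of the two classification impurity measures, using the standard definitions (Gini = 1 minus the sum of squared class proportions; entropy = minus the sum of p * log2(p)). The example labels are made up.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]    # homogeneous node
mixed = ["yes", "yes", "no", "no"]     # 50/50 node, worst case for 2 classes

print(gini(pure), gini(mixed))
print(entropy(pure), entropy(mixed))
```

Both measures are 0 for a pure node and largest when the classes are evenly mixed.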
What is the main difference between a Confusion Matrix and the Analysis node, which gives accuracy (correct/wrong)?
Confusion Matrix breaks the results down into True Positive, False Positive, True Negative and False Negative, so a rate can be calculated for each. It provides more in-depth detail.
Conversely, the Analysis node only states Correct/Wrong.
Why is a Confusion Matrix useful?
By analysing the True/False Positives and Negatives, we can judge how severe each kind of error is (depending on context). E.g. in the medical field, we would rather have MORE “False Positives” than “False Negatives”.
Why? The consequence of a False Negative (a missed illness) is severe! So the hint is to look at the consequences in the given context.
Formula for False Positive Hit Rate
FP / (FP + TN)
Formula for False Negative Hit Rate
FN / (FN + TP)
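The two formulas can be checked with a short sketch that counts the four cells of a confusion matrix from actual vs. predicted labels. The function name `confusion_counts` and the sample labels are invented for the example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (TP, FP, TN, FN) for a two-class problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1        # predicted positive, really positive
        elif p == positive:
            fp += 1        # predicted positive, really negative
        elif a == positive:
            fn += 1        # predicted negative, really positive
        else:
            tn += 1        # predicted negative, really negative
    return tp, fp, tn, fn

actual    = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

tp, fp, tn, fn = confusion_counts(actual, predicted)
false_positive_rate = fp / (fp + tn)   # FP / (FP + TN)
false_negative_rate = fn / (fn + tp)   # FN / (FN + TP)
accuracy = (tp + tn) / len(actual)     # the single correct/wrong figure

print(tp, fp, tn, fn)
print(false_positive_rate, false_negative_rate, accuracy)
```

Accuracy alone hides the fact that the two error rates differ here, which is the extra detail the confusion matrix provides.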
What is the thing we have to take EXTRA note of in decision trees?
Overfitting
State the 2 signs of overfitting
- Training accuracy % is much higher than Testing accuracy %
- Leaf node has only 1 sample -> Gini = 0 -> 100% pure (too good to be true: the tree is TOO specialized in the training data)
How can we prevent Overfitting?
- Set stopping rules to limit tree growth in IBM SPSS Modeler (“Overfitting Prevention”)
- Prune the tree until it no longer overfits (look for the condition that gives the MINIMUM ERROR RATE)
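The idea of a stopping rule can be sketched as a simple check made before each split. The parameter names `max_depth` and `min_records` and their default values are hypothetical, chosen only for illustration; they are not the actual SPSS Modeler settings.

```python
def should_stop(labels, depth, max_depth=5, min_records=10):
    """Decide whether a node should become a leaf instead of being split further."""
    if len(set(labels)) <= 1:
        return True                    # node is already pure, nothing to gain
    if depth >= max_depth:
        return True                    # tree is as deep as we allow
    if len(labels) < min_records:
        return True                    # too few records to justify another split
    return False

print(should_stop(["a", "b"] * 10, depth=2))   # 20 mixed records, shallow: keep splitting
print(should_stop(["a", "b"], depth=1))        # only 2 records: stop
```

Rules like these keep the tree from growing leaves that hold a single training record.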