Foundation Flashcards

Question 1

Q

How does Predictive Modelling work?

Answer

A

It predicts an OUTCOME (target) based on a set of INPUTS

Question 2

Q

What are the 2 types of Prediction?

Answer

A

Classification, Estimation

Question 3

Q

What criterion must be met to use “Classification”?

Answer

A

Target must be CATEGORICAL
Hint: “Class” in “Classification” => Categorical

Question 4

Q

What criterion must be met to use “Estimation”?

Answer

A

Target must be CONTINUOUS (numerical)
Hint: “Estimate” => numbers, hence continuous

Question 5

Q

In what proportions should we split the data?

Answer

A

70% TRAINING, 30% TESTING (a common convention, not a fixed rule)
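
As an illustration outside SPSS Modeler, a 70/30 split could be sketched in Python roughly as follows. The function name `split_records` and the 100 made-up record IDs are assumptions for this example only.

```python
import random

def split_records(records, train_fraction=0.7, seed=0):
    """Shuffle a copy of the records, then cut it into training and testing parts."""
    rng = random.Random(seed)      # local generator; the seed value is arbitrary
    shuffled = list(records)       # copy, so the caller's list is left unchanged
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

records = list(range(100))         # 100 made-up record IDs
train, test = split_records(records)
print(len(train), len(test))       # 70 30
```

Every record lands in exactly one of the two parts, so no record is used for both training and testing.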

Question 6

Q

Using what node in SPSS Modeler can we split the data?

Answer

A

Partition node

Question 7

Q

Why is it necessary to SPLIT data?

Answer

A

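Aim of a predictive model: it should be accurate on UNSEEN data, not only on the data it was trained on

What is UNSEEN data? It is the TESTING data! The model never sees it during training, so it shows how the model performs on new records.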

Question 8

Q

Why do we use a “Seed” in IBM SPSS?

Answer

A

It sets the starting point for the RANDOM selection of records (data rows) as either training or testing data.

Question 9

Q

Why remember the exact seed number?

Answer

A

To ensure that a rerun does not make a NEW random selection of records for training/testing data: the same seed reproduces the same split.

This in turn keeps the model’s results consistent, as the same records are chosen as training and testing data respectively.
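
The effect of reusing a seed can be sketched in plain Python. This only illustrates the idea with Python's own random generator; it is not how SPSS Modeler assigns partitions internally, and the `partition` function and seed values are hypothetical.

```python
import random

def partition(records, seed, train_fraction=0.7):
    """Label each record as Training or Testing using a seeded random draw."""
    rng = random.Random(seed)
    return ["Training" if rng.random() < train_fraction else "Testing"
            for _ in records]

records = list(range(20))              # 20 made-up record IDs
first = partition(records, seed=1234)
second = partition(records, seed=1234)
other = partition(records, seed=99)

print(first == second)   # True: same seed, same assignment
print(first == other)    # a different seed will usually give a different assignment
```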

Question 10

Q

What does CART stand for?

Answer

A

Classification And Regression Tree

Question 11

Q

When should a Regression Tree be used?

Answer

A

When the type of prediction is ESTIMATION (the target is numerical)

Question 12

Q

What are the 2 impurity measures for Classification?

Answer

A

Gini Index, Entropy
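
Both measures can be computed from the class labels in a node. The sketch below uses the standard textbook definitions (Gini = 1 minus the sum of squared class proportions, entropy in bits); the function names and sample labels are made up for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    n = len(labels)
    return sum((count / n) * log2(n / count) for count in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]    # one class only
mixed = ["yes", "yes", "no", "no"]     # evenly mixed, two classes

print(gini(pure), entropy(pure))       # 0.0 0.0
print(gini(mixed), entropy(mixed))     # 0.5 1.0
```

A pure node scores 0 on both measures; an evenly mixed two-class node scores the maximum for two classes.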

Question 13

Q

What is the impurity measure for Regression?

Answer

A

Sum of Squared Error (SSE)
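
As a small worked example, SSE for a node can be taken as the sum of squared differences between each target value and the node's mean. The function name and the numbers below are made up for illustration.

```python
def sse(values):
    """Sum of squared differences between each value and the mean of the node."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

tight = [10.0, 10.0, 11.0, 9.0]    # target values close to their mean
spread = [2.0, 18.0, 10.0, 10.0]   # same mean, far more spread out

print(sse(tight))    # 2.0
print(sse(spread))   # 128.0
```

The node with values close together has the lower SSE, which is what a regression tree looks for when choosing a split.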

Question 14

Q

Is lower or higher GINI better?

Answer

A

Lower! Gini measures impurity: Gini = 1 - purity, where purity is the sum of the squared class proportions in the node.

Question 15

Q

Why is lower GINI more desirable?

Answer

A

The nodes are more homogeneous = better prediction = better model

Question 16

Q

State the 4 kinds of nodes in a decision tree

Answer

A

Parent node, Child node, Root node, Leaf node

Question 17

Q

What is the main difference between a Confusion Matrix and an Analysis node that gives accuracy (correct/wrong)?

Answer

A

A Confusion Matrix breaks the results down into True Positive, False Positive, True Negative and False Negative, so a hit rate can be calculated for each. It provides more in-depth detail.

Conversely, the Analysis node only states Correct/Wrong.
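
The difference can be sketched in Python: the four counts of a confusion matrix carry more detail than a single accuracy figure. The function name `confusion_counts`, the "yes"/"no" labels and the sample predictions are made up for this example.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, FP, TN and FN for a two-class problem."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

actual    = ["yes", "yes", "yes", "no", "no", "no",  "no", "no"]
predicted = ["yes", "no",  "yes", "no", "no", "yes", "no", "no"]

counts = confusion_counts(actual, predicted)
accuracy = (counts["TP"] + counts["TN"]) / len(actual)

print(counts)     # {'TP': 2, 'FP': 1, 'TN': 4, 'FN': 1}
print(accuracy)   # 0.75
```

The accuracy of 0.75 alone does not reveal that one error is a false positive and the other a false negative; the four counts do.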

Question 18

Q

Why is Confusion Matrix useful?

Answer

A

By analysing the True/False Positives and True/False Negatives, we can judge how severe each kind of error is (depending on context) alongside its hit rate. E.g. in the medical field, we would rather have MORE “False Positives” than “False Negatives”.

Why? The consequence of a missed case is severe! So the hint is to look at the consequences in the given context.

Question 19

Q

Formula for False Positive Hit Rate

Answer

A

FP / (FP + TN)

Question 20

Q

Formula for False Negative Hit Rate

Answer

A

FN / (FN + TP)
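
Both rate formulas can be checked with a short Python sketch, reading each denominator as a sum: FP + TN is every actual negative, FN + TP is every actual positive. The function names and counts are made up for illustration.

```python
def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted positive."""
    return fp / (fp + tn)

def false_negative_rate(fn, tp):
    """Share of actual positives that were wrongly predicted negative."""
    return fn / (fn + tp)

print(false_positive_rate(fp=10, tn=90))   # 0.1
print(false_negative_rate(fn=5, tp=15))    # 0.25
```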

Question 21

Q

What do we have to take EXTRA note of in decision trees?

Answer

A

Overfitting

Question 22

Q

State the 2 signs of overfitting

Answer

A

Training accuracy is much higher than testing accuracy
A leaf node has only 1 sample -> Gini = 0 -> 100% pure (not realistic: the tree is TOO specialized to the training data)
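
The two signs can be turned into a rough check in Python. The function name, the sample figures and the 10-point accuracy gap used as a threshold are all assumptions for illustration, not values from SPSS Modeler.

```python
def overfitting_signs(train_acc, test_acc, leaf_sizes, max_gap=0.10):
    """Return the warning signs found; max_gap is an arbitrary illustrative threshold."""
    signs = []
    if train_acc - test_acc > max_gap:
        signs.append("training accuracy well above testing accuracy")
    if any(size == 1 for size in leaf_sizes):
        signs.append("leaf node with a single record")
    return signs

# Hypothetical tree A: large accuracy gap and single-record leaves
print(overfitting_signs(0.99, 0.72, leaf_sizes=[1, 1, 3, 40]))

# Hypothetical tree B: similar accuracies and well-populated leaves
print(overfitting_signs(0.85, 0.83, leaf_sizes=[25, 30, 45]))   # []
```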

Question 23

Q

How can we prevent Overfitting?

Answer

A

Set stopping rules to limit tree growth in IBM SPSS Modeler (“Overfitting Prevention”)
Prune the tree until it no longer overfits (look for the condition that gives the MINIMUM ERROR RATE!)

(23 cards)