Session 3 Flashcards

1
Q

We want to sell a particular life insurance product and have an attractive offer, but targeting each customer incurs a cost. How should we proceed?

A
  1. Define target
  2. Collect data
  3. Build a model
  4. Predict outcomes
2
Q

How do we choose, at each step, which attribute to use to segment the population?

A

General rule: the resulting groups should be as pure as possible, i.e., homogeneous with respect to the target variable.

3
Q

The concept of information provides a way to…

A

… quantify the amount of surprise of an event, measured in bits.

4
Q

Intuition behind information

A

Rare (low-probability) events are more surprising and therefore carry more information than common (high-probability) events.
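
The intuition above can be checked numerically. The sketch below assumes the standard definition of self-information, -log2(p), which the cards do not state explicitly; the function name and the two probabilities are illustrative only.

```python
import math

def self_information(p: float) -> float:
    """Surprise of an event with probability p, in bits: -log2(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -math.log2(p)

# Illustrative probabilities (not from the cards).
rare = self_information(0.01)    # a rare event
common = self_information(0.9)   # a common event

print(rare > common)  # True: the rare event carries more bits
```

An event with probability 0.5 carries exactly 1 bit under this definition, which is why a fair coin flip is the usual reference point.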

5
Q

Entropy

A
  • Disorder corresponds to how mixed (impure) a segment is
  • Entropy is zero at minimum disorder (all members belong to the same class)
  • Entropy is one at maximal disorder for a two-class target (members split equally between the classes); with k equally represented classes the maximum is log2(k)
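
A minimal sketch of these properties, assuming the usual Shannon entropy with base-2 logarithms (the cards do not give the formula). The function name and the sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes present in the segment.
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

pure = ["yes", "yes", "yes", "yes"]    # minimum disorder
mixed = ["yes", "yes", "no", "no"]     # maximum disorder for two classes

print(entropy(pure))   # 0.0
print(entropy(mixed))  # 1.0
```

With four equally represented classes the same function returns 2.0, so the value of one as the maximum applies to the two-class case.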
6
Q

Information Gain

A
  • Information gain (IG) measures the change in entropy due to any amount of new information being added
  • Information gain measures how much an attribute decreases entropy over the whole segmentation it creates
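
As a rough illustration, information gain can be computed as the parent segment's entropy minus the size-weighted entropy of the child segments an attribute creates. This is the common textbook formulation and is assumed here; the attribute names and the tiny data set are made up.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def information_gain(attribute_values, labels):
    """Parent entropy minus the weighted entropy of the child segments."""
    n = len(labels)
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    weighted = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

# Hypothetical data: did the customer buy the product?
bought = ["yes", "yes", "no", "no"]
owns_home = ["y", "y", "n", "n"]   # separates the classes perfectly
region = ["a", "b", "a", "b"]      # leaves each child as mixed as the parent

print(information_gain(owns_home, bought))  # 1.0
print(information_gain(region, bought))     # 0.0
```

In this toy example `owns_home` would be preferred for the split, since it removes all of the entropy while `region` removes none.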
7
Q

How do we choose, at each step, which attribute to use to segment the population?

A

Rule: choose the variable that provides the most information gain with respect to the target variable

8
Q

Do decision trees evaluate the information gain of all the variables at each split?

A

Yes

9
Q

Can we use the same variable to split the data more than once?

A

Yes

10
Q

How is the split done for continuous variables (e.g., income)?

A

Different thresholds are tested; the threshold with the highest IG is used
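
One way to sketch this search: try the midpoint between each pair of consecutive distinct values and keep the one with the highest information gain. Using midpoints as candidates is an assumption (real tools may choose candidates differently), and the income figures are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def best_threshold(values, labels):
    """Return (threshold, gain) for the best split of the form value <= threshold."""
    n = len(labels)
    parent = entropy(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical incomes (in thousands) and purchase outcomes.
income = [20, 30, 40, 60, 80, 100]
bought = ["no", "no", "no", "yes", "yes", "yes"]

print(best_threshold(income, bought))  # (50.0, 1.0)
```

If no candidate improves on the parent segment, the function returns `(None, 0.0)`, meaning the variable offers no useful split.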

11
Q

The confusion matrix

A

The confusion matrix tabulates actual against predicted classes, allowing visualization of the performance of a model

12
Q

True Positives (TP)

A

actual positives correctly predicted as positive

13
Q

True Negatives (TN)

A

actual negatives correctly predicted as negative

14
Q

False Positives (FP)

A

actual negatives incorrectly predicted as positive

15
Q

False Negatives (FN)

A

actual positives incorrectly predicted as negative
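
The four outcomes on cards 12 to 15 can be counted directly from paired actual and predicted labels. The sketch below is illustrative only: the function name, the `positive` parameter, and the sample labels are all hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary target."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # actual positive, predicted positive
        elif a != positive and p != positive:
            tn += 1   # actual negative, predicted negative
        elif a != positive and p == positive:
            fp += 1   # actual negative, predicted positive
        else:
            fn += 1   # actual positive, predicted negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

# Made-up labels for six customers.
actual = ["yes", "yes", "no", "no", "no", "yes"]
predicted = ["yes", "no", "no", "yes", "no", "yes"]

print(confusion_counts(actual, predicted))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```

Arranging these four counts in a 2x2 grid, actual class against predicted class, gives the confusion matrix itself.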

16
Q

RapidMiner Studio

A

is commercial software that provides an integrated environment for machine learning and business analytics

  • It is a good tool for teaching basic data science concepts
  • It is free to use for small data sets (up to 10,000 observations)