5.3 Predictive Analytics (EN) Flashcards
Given the decision tree below and a test set with 20 observations, what is the accuracy of this model?
incorrect = 7
correct = 13
accuracy = number of correct predictions / number of observations
= 13/20
= 0.65 (65%)
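A minimal Python sketch of this computation; the labels below are made up to reproduce 13 correct out of 20:

```python
# Accuracy = fraction of predictions that match the true labels.
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Illustrative labels: 13 matches, 7 mismatches -> 0.65
y_true = [1] * 13 + [0] * 7
y_pred = [1] * 20
print(accuracy(y_true, y_pred))  # 0.65
```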
You are in the process of building a decision tree for the dataset below. In the first step, you identify attribute “Color” as the best possible attribute to split the instances in the root node of the tree. As such, you end up with the so-called “decision stump” below. You are using the misclassification error as the impurity measure for constructing the tree.
Suppose that you want to further improve the tree and therefore look into how to further split “Internal Node 3”. What is the resulting impurity when you split “Internal Node 3”, based on the best attribute available?
1/4 (?)
The Color = blue node holds 8 of the 20 instances; the best second attribute gives 3/8 true.
Misclassification error = 1 - max p(i|t), i.e., one minus the proportion of the majority class in the node.
A decision stump is a one-level decision tree: a single split at the root with leaf nodes directly beneath it.
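A sketch of that impurity measure in code; the class counts are placeholders, not the exam dataset:

```python
# Misclassification error of a node: 1 minus the majority-class proportion.
def misclassification_error(class_counts):
    total = sum(class_counts)
    return 1 - max(class_counts) / total

# Illustrative node with 6 instances of one class and 2 of the other:
print(misclassification_error([6, 2]))  # 1 - 6/8 = 0.25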
You are constructing a decision tree for the data set below. You use the misclassification error as the impurity metric for building the tree.
Error(t) = 1 - max p(i|t)
What is the impurity gain of the split when using the best attribute for the first split when building the tree?
What is the misclassification error of the split when using the best attribute for the first split when building the tree?
Impurity gain: 4/20 (?)
Misclassification error of the split: 8/20 (?)
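For the method itself, a self-contained sketch: gain = error of the parent minus the weighted error of the children (the counts below are placeholders, not this exam's dataset):

```python
# Gain of a split under the misclassification-error impurity measure.
def node_error(counts):
    return 1 - max(counts) / sum(counts)

def split_gain(parent, children):
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * node_error(c) for c in children)
    return node_error(parent) - weighted

# Illustrative: parent [10, 10] split into children [8, 0] and [2, 10]:
print(split_gain([10, 10], [[8, 0], [2, 10]]))  # 0.5 - 0.1 = 0.4
```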
In linear regression, the parameter coefficients are chosen in such a way that the
sum of squared residuals or errors is maximized.
sum of squared residuals or errors is minimized.
product of squared residuals or errors is minimized.
product of squared residuals or errors is maximized.
sum of squared residuals or errors is minimized.
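A minimal sketch with NumPy (illustrative data): np.linalg.lstsq returns the coefficients that minimize the sum of squared residuals.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

X = np.column_stack([np.ones_like(x), x])      # intercept + slope columns
beta, ssr, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # fitted [intercept, slope]
print(ssr)   # the minimized sum of squared residuals
```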
The higher the Area under the ROC curve (AUC) the
better the performance
worse the performance
better the performance
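A small sketch assuming scikit-learn is installed; the labels and scores are made up. A higher AUC means the model ranks positives above negatives more often.

```python
from sklearn.metrics import roc_auc_score

y_true   = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]  # positives mostly ranked higher
scores_b = [0.8, 0.7, 0.2, 0.1]   # positives ranked below negatives

print(roc_auc_score(y_true, scores_a))  # 0.75 -> better performance
print(roc_auc_score(y_true, scores_b))  # 0.0  -> worse performance
```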
Given the following Gains chart: …
With a total client base of 10,000 people and 5000 responders on a marketing campaign, if we target 8000 clients with the highest scores from our model, we expect to reach:
1250 responders.
5000 responders.
2500 responders.
4750 responders.
4750 responders.
x-axis = fraction contacted, y-axis = fraction of responders reached
8000 contacted / 10,000 total = 80% contacted on the x-axis → chart value (0.8, 0.95)
0.95 × 5000 responders = 4750
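As a quick check in code, a sketch with this card's numbers (the 0.95 gains value is read off the chart):

```python
total_clients    = 10_000
total_responders = 5_000
contacted        = 8_000

pct_contacted = contacted / total_clients  # 0.8 on the x-axis
gains_value   = 0.95                       # y-value read from the chart at x = 0.8
print(gains_value * total_responders)      # 4750 responders reached
```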
Which of the following statements is NOT CORRECT about the k-nearest neighbor classifier?
It is intuitive and easy to understand.
It has a large computing power requirement.
It needs a value for k which should be determined upfront.
It is unaffected by the presence of irrelevant variables.
It is unaffected by the presence of irrelevant variables.
k-NN is sensitive to irrelevant variables: they add noise to the distance calculations and can distort which neighbors count as "nearest".
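A toy sketch of that effect with made-up points: one noisy, irrelevant feature is enough to flip which point is "nearest".

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query    = [1.0, 1.0]
neighbor = [1.1, 1.0]  # truly close on the relevant features
other    = [3.0, 1.0]  # truly far

print(euclidean(query, neighbor) < euclidean(query, other))  # True

# Append an irrelevant, noisy feature and the ordering flips:
print(euclidean(query + [0.0], neighbor + [5.0])
      < euclidean(query + [0.0], other + [0.2]))             # False
```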
When the cut-off is set at its minimum (e.g., 0), then
the sensitivity becomes 1 and the specificity becomes 1.
the sensitivity becomes 1 and the specificity becomes 0.
the sensitivity becomes 0 and the specificity becomes 1.
the sensitivity becomes 0 and the specificity becomes 0.
the sensitivity becomes 1 and the specificity becomes 0.
When the cutoff is set at its minimum (e.g., 0), every instance is classified as positive. All actual positives are then caught (sensitivity = 1), while every actual negative is misclassified (specificity = 0).
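A minimal sketch with illustrative scores and labels:

```python
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.1, 0.5]

y_pred = [1 if s >= 0.0 else 0 for s in scores]  # cutoff 0: all positive

tp = sum(p == 1 and t == 1 for p, t in zip(y_pred, y_true))
tn = sum(p == 0 and t == 0 for p, t in zip(y_pred, y_true))
fn = sum(p == 0 and t == 1 for p, t in zip(y_pred, y_true))
fp = sum(p == 1 and t == 0 for p, t in zip(y_pred, y_true))

print(tp / (tp + fn))  # sensitivity = 1.0
print(tn / (tn + fp))  # specificity = 0.0
```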
Consider a data set with 100% good customers and 0% bad customers. This data set has an entropy of
0
0.5
1
10
0
Entropy is a measure of impurity, often used for choosing splits in decision trees.
The dataset has 100% good customers and 0% bad customers, so p(good) = 1 and p(bad) = 0.
H = -1·log2(1) - 0·log2(0) = 0 (taking 0·log2(0) = 0)
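A minimal sketch of this entropy formula (treating 0·log2(0) as 0):

```python
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0, 0.0]))  # pure node -> 0.0
```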
Confusion matrix (rows = predicted, columns = real classes):
              real +   real -
predicted +     23       16
predicted -     55        6
The classification accuracy is 29/100, the error rate is 71/100, the sensitivity is 23/78 and the specificity is 6/22.
The classification accuracy is 29/100, the error rate is 71/100, the sensitivity is 6/22 and the specificity is 23/78.
The classification accuracy is 71/100, the error rate is 29/100, the sensitivity is 23/78 and the specificity is 6/22.
The classification accuracy is 71/100, the error rate is 29/100, the sensitivity is 6/22 and the specificity is 23/78.
The classification accuracy is 29/100, the error rate is 71/100, the sensitivity is 23/78 and the specificity is 6/22.
True Positives (TP): 23
False Positives (FP): 16
False Negatives (FN): 55
True Negatives (TN): 6
Classification accuracy = (TP + TN) / total = (23 + 6) / 100 = 29/100
Error rate = (FP + FN) / total = (16 + 55) / 100 = 71/100
Sensitivity (recall) = TP / actual positives = 23 / (23 + 55) = 23/78
Specificity = TN / actual negatives = 6 / (6 + 16) = 6/22
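The same four metrics in one small sketch, using the counts from this card:

```python
def classification_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy":    (tp + tn) / total,
        "error rate":  (fp + fn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

print(classification_metrics(tp=23, fp=16, fn=55, tn=6))
# accuracy 29/100, error 71/100, sensitivity 23/78, specificity 6/22
```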
Which statement is NOT CORRECT?
In terms of advantages, decision trees are easy to interpret and understand, assuming they are not too big.
Decision trees are non-parametric, because no assumptions of normality, symmetric distributions, or independence are needed.
Decision trees are very robust with respect to outliers.
Decision trees are often referred to as stable classifiers since they are very insensitive to changes in the training data.
Decision trees are often referred to as stable classifiers since they are very insensitive to changes in the training data.
Decision trees can in fact be sensitive to changes in the training data, so they are not stable classifiers.
(Non-parametric = makes no assumptions about the underlying distribution of the data.)
Netflix decision tree
Weather = sunny; Tired = no → No Netflix: 3 (1)
Weather = sunny; Tired = yes → Netflix: 5 (2)
Weather = rainy; Homework = no → Netflix: 2
Weather = rainy; Homework = yes; Tired = no → No Netflix: 2 (1)
Weather = rainy; Homework = yes; Tired = yes → Netflix: 2 (3)
The classification accuracy is 0.35, the error rate is 0.65, the sensitivity is 0.5, the specificity is 0.8.
The classification accuracy is 0.35, the error rate is 0.65, the sensitivity is 0.8, the specificity is 0.5.
The classification accuracy is 0.65, the error rate is 0.35, the sensitivity is 0.5, the specificity is 0.8.
The classification accuracy is 0.65, the error rate is 0.35, the sensitivity is 0.8, the specificity is 0.5.
The classification accuracy is 0.65, the error rate is 0.35, the sensitivity is 0.8, the specificity is 0.5.
TP = 8
FP = 5
TN = 5
FN = 2
Classification accuracy = (TP + TN) / total = (8 + 5) / 20 = 13/20 = 0.65
Error rate = (FP + FN) / total = (5 + 2) / 20 = 7/20 = 0.35
Sensitivity (recall) = TP / actual positives = 8 / (8 + 2) = 0.8
Specificity = TN / actual negatives = 5 / (5 + 5) = 0.5
To avoid overfitting from happening when building a decision tree, various strategies can be adopted. One option is to split the data into a training set and a validation set. The optimal tree is then chosen where the
training set error is maximal.
validation set error is minimal.
training set error is minimal.
validation set error is maximal.
validation set error is minimal.
By using a validation set, you can evaluate different tree sizes and select the one that provides the best performance on unseen data, thus avoiding overfitting.
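A sketch of that strategy, assuming scikit-learn is available and using max_depth as the tree-size knob (the data here is synthetic):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_depth, best_err = None, 1.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_err = 1 - tree.score(X_val, y_val)   # error on unseen validation data
    if val_err < best_err:
        best_depth, best_err = depth, val_err

print(best_depth, best_err)  # tree size with minimal validation-set error
```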
Which is the easiest decision to make when building a decision tree?
splitting decision
stopping decision
assignment decision
assignment decision (each leaf is simply assigned the majority class of its instances)
Using the classification error to build a decision tree, the gain of the employment split is:
0.35.
0.33.
0.4.
0.5.
0.4
Total: 20 instances, 10 churn and 10 no churn → parent error = 1 - 10/20 = 0.5
Employed = yes: 8 instances, 0 churn → error = 0
Employed = no: 12 instances, 10 churn and 2 no churn → error = 1 - 10/12 = 2/12
Weighted error after the split = (8/20)·0 + (12/20)·(2/12) = 2/20 = 0.1
Gain = 0.5 - 0.1 = 0.4
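The same arithmetic as a quick code check (counts taken from this card):

```python
parent_error   = 1 - max(10, 10) / 20                         # 0.5
yes_error      = 1 - max(8, 0) / 8                            # 0.0
no_error       = 1 - max(10, 2) / 12                          # 2/12
weighted_error = (8 / 20) * yes_error + (12 / 20) * no_error  # 0.1
print(parent_error - weighted_error)                          # gain = 0.4
```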
Consider a data set with 50% good customers and 50% bad customers. This data set has an entropy of
0
0.5
1
10
1
Two classes, each with proportion 0.5:
H = -0.5·log2(0.5) - 0.5·log2(0.5)
H = -0.5·(-1) - 0.5·(-1) = 0.5 + 0.5 = 1
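A one-line check, matching the entropy sketch earlier:

```python
import math
print(-sum(p * math.log2(p) for p in [0.5, 0.5]))  # 1.0
```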