Midterm Flashcards

1
Q

What is the classification accuracy rate?

A

The proportion of correctly predicted instances out of all instances in your data.

Formula: S/N, where S is the number of accurately classified examples and N is the total number of examples.
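A minimal sketch of the S/N computation in plain Python (the labels below are made up for illustration):

```python
# Classification accuracy rate = S / N.
y_true = ["stay", "switch", "stay", "stay", "switch"]    # actual classes (hypothetical)
y_pred = ["stay", "switch", "switch", "stay", "switch"]  # model predictions (hypothetical)

S = sum(1 for actual, predicted in zip(y_true, y_pred) if actual == predicted)
N = len(y_true)
print(f"accuracy = {S}/{N} = {S / N:.2f}")  # accuracy = 4/5 = 0.80
```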

2
Q

Why can classification accuracy be misleading?

A

It may show high accuracy on training data, which does not reflect the model’s performance on unseen data.

High training accuracy may indicate overfitting.

3
Q

What do we call the examples that were not used to induce the model?

A

Testing data.

Testing data is crucial for evaluating model performance on unseen data.

4
Q

What are the two main data partitions used in model training?

A
  • Training data
  • Testing data
5
Q

What is generalization accuracy?

A

An estimation of how well your model predicts the class of examples from a different data set.

Also known as test accuracy.

6
Q

What is the learning curve?

A

A graphical representation showing how model accuracy improves as the training set size increases.

X-axis: sample size of training data; Y-axis: accuracy of the model on testing data.
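One way to draw such a curve with scikit-learn's learning_curve helper (a sketch; the data set and classifier are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)  # any labeled data set works here

# Cross-validated test accuracy at increasing training-set sizes.
sizes, _, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5)

plt.plot(sizes, test_scores.mean(axis=1), marker="o")
plt.xlabel("training set size")  # x-axis: sample size of training data
plt.ylabel("test accuracy")      # y-axis: accuracy on held-out data
plt.show()
```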

7
Q

True or False: More data generally improves model performance.

A

True.

More data allows the model to learn better and reduces the risk of overfitting.

8
Q

What happens to model accuracy as training data increases?

A

Model accuracy generally increases until it plateaus.

This indicates diminishing returns on accuracy with additional data.

9
Q

What is one drawback of splitting data into training and testing sets?

A

It limits the amount of data available for training and testing, which can affect model performance.

Insufficient data can lead to non-representative samples.

10
Q

What is a common solution to avoid over-optimistic evaluation in model testing?

A

Use a sufficiently large dataset to ensure representativeness after splitting.

This helps maintain data integrity for both training and testing phases.

11
Q

What is the relationship between the size of the training data and the expected model performance?

A

Larger training data generally leads to better model performance.

More data helps the model generalize better to unseen data.

12
Q

What is the drawback of partitioning data for training and testing?

A

Losing some data for the induction and testing process.

This can lead to a less reliable model if the dataset is small.

13
Q

Why is more data desirable in model training?

A

To maintain reliability and avoid issues from limited data when making training and testing cuts.

A larger dataset helps in achieving better generalization.

14
Q

What is cross validation?

A

A model evaluation technique used to approximate generalization accuracy; it is an evaluation procedure, not a way of building the final predictive model.

It involves partitioning data into subsets for training and testing.

15
Q

How does cross validation improve model evaluation?

A

By conducting multiple experiments, it reduces the chance of bias from a single training/testing split.

This is especially useful when working with limited data.

16
Q

What are the steps in performing 10-fold cross validation?

A
  1. Partition data into 10 folds.
  2. Hold one fold out for testing.
  3. Use the remaining nine folds for training.
  4. Repeat for each fold.

Each portion of data serves as both training and testing at different times.
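A sketch of these steps with scikit-learn's KFold (the data set and learner are placeholders):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)   # step 1: partition into 10 folds

accuracies = []
for train_idx, test_idx in kf.split(X):                 # steps 2-4: rotate the held-out fold
    model = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])  # train on 9 folds
    accuracies.append(model.score(X[test_idx], y[test_idx]))          # test on held-out fold

print(f"10-fold CV accuracy: {np.mean(accuracies):.3f}")
```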

17
Q

What is the benefit of averaging the results in cross validation?

A

It mitigates the effects of outliers and provides a more reliable accuracy estimate.

Averaging across folds helps smooth out inaccuracies from any one fold.

18
Q

What is the potential disadvantage of increasing the number of folds in cross validation?

A

It can lead to very small testing sets, which may not be representative of the entire dataset.

This diminishes the effectiveness of the cross validation process.

19
Q

What happens in leave-one-out cross validation?

A

One record is held out as the test set while the rest are used for training.

This method can lead to very small testing sets, especially with limited data.

20
Q

What is a key consideration when using limited data in cross validation?

A

Each model induced will be similar, but care must be taken to ensure the test set is adequately sized.

Smaller datasets could lead to biased results if the test set is too small.

21
Q

True or False: Cross validation is used for building predictive models.

A

False.

Cross validation is primarily an evaluation technique.

22
Q

Fill in the blank: Cross validation aims to approximate _______.

A

generalization accuracy.

This is crucial for assessing model performance on unseen data.

23
Q

What is the main purpose of cross validation in model evaluation?

A

To assess the performance of a model using different subsets of data

24
Q

True or False: Cross validation is an inducing technique for models.

A

False

25
Q

In the context of cross validation, what does partitioning a small set of data allow for?

A

It allows for experimentation without inducing a model

26
Q

What happens to the model’s performance when using cross validation?

A

It helps mitigate outliers by averaging results

27
Q

Fill in the blank: In cross validation, you never use the same experiments for both ______ and ______.

A

training, testing

28
Q

What is the significance of having a satisfactory cross validation accuracy?

A

It indicates the model is likely to perform well

29
Q

What is an example of model parameters mentioned in the text?

A

Max depth of 5, min sample leaves of 50

30
Q

What are the two main phases of model building as discussed?

A

Inducing a model and evaluating the model

31
Q

How does cross validation improve the evaluation process when data is limited?

A

It allows for evaluation without a separate training and testing split

32
Q

What happens to the model’s structure when evaluated on testing data versus cross validation?

A

The model remains largely the same in both evaluations

33
Q

What does each fold in cross validation consist of?

A

A separate training set and testing set

34
Q

True or False: All data is used for both training and testing in each fold of cross validation.

A

False

35
Q

What can be concluded about the evaluation techniques discussed?

A

They serve to assess model performance effectively

36
Q

When is cross validation typically applied in the model building process?

A

During the evaluation phase of the model

37
Q

What is the formula for calculating classification accuracy?

A

S/N: S is the number accurately classified by the model, and N is the total number of examples.

38
Q

What is the difference between training accuracy and test accuracy?

A

Training accuracy is the model’s performance on training examples; test accuracy is the model’s performance on out-of-sample data.

39
Q

What is the common practice for partitioning data for model training and testing?

A

It is common to use ⅔ of the data for training and ⅓ for testing.
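In scikit-learn the split is one call (a sketch; the iris data stands in for any labeled data set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# test_size=1/3 gives the conventional 2/3 train / 1/3 test partition.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))  # 100 50
```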

40
Q

What is a learning curve in predictive analytics?

A

It characterizes how test accuracy improves as the training set size increases.

41
Q

What is Cross Validation (CV)?

A

CV is an experiment that provides a good approximation of generalization performance for a model.

42
Q

What are the steps involved in N-Fold Cross Validation?

A
  • Randomly partition data into N equally sized sets (folds)
  • Perform N experiments of model building and evaluation
  • Hold out one fold as the test set in each experiment
  • Induce a model from the remaining folds
  • Evaluate performance on the test set
  • Average the performance of the N experiments
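The whole procedure above is available as a single scikit-learn call (a sketch under the same placeholder assumptions as earlier examples):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# cv=10 runs the ten hold-one-fold-out experiments; mean() averages them.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(f"{scores.mean():.3f} (+/- {scores.std():.3f})")
```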
43
Q

What is overfitting in predictive modeling?

A

Overfitting occurs when a model captures not only regularities in the data but also peculiarities, undermining its predictive performance.

44
Q

What is the purpose of a validation set?

A

A validation set is used to decide which subtrees to prune in a model.

45
Q

What happens when training error decreases while validation error increases?

A

It indicates that the model is likely overfitting the training data.

46
Q

Define precision in the context of classification models.

A

Precision is the ratio of true positives to the total predicted positives: True Positives/(True Positives + False Positives).

47
Q

Define recall in the context of classification models.

A

Recall is the ratio of true positives to the total actual positives: True Positives/(True Positives + False Negatives).
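Both definitions in a few lines of Python (the confusion-matrix counts are hypothetical):

```python
# Hypothetical confusion-matrix counts.
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30/40 = 0.75: of predicted positives, how many are real
recall    = tp / (tp + fn)  # 30/50 = 0.60: of actual positives, how many were found
print(precision, recall)
```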

48
Q

What is the trade-off between precision and recall?

A

As precision increases, recall tends to decrease.

49
Q

What is a Lift Chart used for?

A

A Lift Chart is used to determine if a model is better at ranking customers than random ranking.

50
Q

What does the Receiver Operating Characteristic (ROC) curve illustrate?

A

The ROC curve illustrates the performance of a binary classifier as its discrimination threshold varies.

51
Q

What does the area under the ROC curve (AUC) indicate?

A

AUC summarizes the overall performance of a model; a value of 1.0 indicates perfect performance, while 0.5 indicates random guessing.
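A sketch using scikit-learn's metrics (labels and scores below are made up):

```python
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                    # actual classes (hypothetical)
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted P(positive)

fpr, tpr, thresholds = roc_curve(y_true, y_score)     # one point per threshold
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")  # 1.0 perfect, 0.5 random
```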

52
Q

What is the role of Class Probability Estimation (CPE)?

A

CPE shows the probability that a given example will belong to a certain class.

53
Q

Fill in the blank: The training set is used to grow a tree to its _______.

A

[max size]

54
Q

True or False: The validation set is the same as the test set.

A

False

55
Q

What is the benchmark for classification accuracy?

A

The base rate, which classifies all examples to the majority class.

56
Q

What is the importance of evaluating model performance on test samples?

A

To detect overfitting and ensure the model generalizes well to unseen data.

57
Q

What is an example or instance in the context of data mining?

A

A fact that typically includes a set of attributes and an output variable.

58
Q

What is a data set?

A

A set of examples.

59
Q

What is training data?

A

Data used to induce (train) a model.

60
Q

What are attributes in data mining?

A

Independent variables.

61
Q

What is the target variable in data mining?

A

The dependent variable.

62
Q

What is the purpose of analyzing customer data in predictive analytics?

A

To induce patterns common among customers who have terminated or extended their contracts.

63
Q

Define ‘pattern’ in the context of data mining.

A

A conclusion drawn from data that predicts an outcome based on certain conditions.

64
Q

What does induction or inductive learning refer to?

A

A method or algorithm used to induce a pattern from a set of examples.

65
Q

What is linear regression in data mining?

A

An induction algorithm that predicts a dependent variable based on independent variables.

66
Q

What is a model in data mining?

A

A general pattern induced from data that describes the data in concise form.

67
Q

What is the objective of a predictive model?

A

To estimate or predict an unknown value.

68
Q

What is supervised learning?

A

Model induction followed by inference using the model to predict.

69
Q

Define unsupervised learning in data mining.

A

Clustering/segmentation that organizes instances into cohesive groups without predicting an unknown value.

70
Q

What type of questions can data mining answer regarding customer behavior?

A
  • What products are commonly bought together?
  • What is a customer likely to buy next?
  • How likely is a customer to respond to a marketing campaign?
71
Q

What does classification refer to in data mining?

A

A predictive model where the target variable is discrete (categorical).

72
Q

What does a classification model provide as a by-product?

A

The probability that the case belongs to each category.

73
Q

What is a classification tree?

A

A classification model that includes a set of IF/THEN rules.

74
Q

What is regression in data mining?

A

A predictive model that predicts the value of a numerical variable.

75
Q

What is clustering/segmentation analysis?

A

Unsupervised learning that identifies distinct groups of similar instances.

76
Q

What is the purpose of association rules in data mining?

A

To find relations among attributes in the data that frequently co-occur.

77
Q

What is sequence analysis in data mining?

A

Finding patterns in time-stamped data.

78
Q

Fill in the blank: A learner in data mining is also known as a _______.

A

[induction algorithm].

79
Q

True or False: Supervised learning is used to predict unknown values.

A

True.

80
Q

True or False: Unsupervised learning requires labeled data.

A

False.

81
Q

What is a model?

A

A concise description of a pattern (relationship) that exists in the data.

82
Q

What do classification models predict?

A

They predict (estimate) an unknown value of interest, which is a categorical variable.

83
Q

Examples of classification tasks include:

A
  • Customer retention (CRM)
  • Marketing
  • Risk management
  • Financial trading
84
Q

What is a classification tree?

A

A predictive model represented as a tree that is used for classification tasks.

85
Q

Why are classification trees popular?

A

They are easy to understand, computationally fast to induce from data, and are the basis of high-performing modeling techniques.

86
Q

What do non-terminal nodes in a classification tree represent?

A

Tests on an attribute.

87
Q

What do terminal nodes (leaves) in a classification tree provide?

A

A prediction and a distribution over the classes.

88
Q

In a classification tree, what is the outcome when a leaf node is reached?

A

A class prediction is made.

89
Q

How are rules extracted from a classification tree?

A

Each path from the root to a leaf node constitutes a rule.

90
Q

What is the classification tree model used for in tax compliance?

A

To predict whether an incoming tax report is noncompliant.

91
Q

What is the purpose of partitioning in classification tree induction?

A

To create subgroups that are purer with respect to the class than the original group.

92
Q

What are good predictors in classification tree induction?

A

Attributes that help partition the examples into purer sub-groups.

93
Q

What is Information Gain (IG)?

A

A measure that captures how informative an attribute is for distinguishing between instances of different classes.

94
Q

What does entropy measure in the context of classification trees?

A

The impurity in a dataset.

95
Q

What is a classification tree induction algorithm?

A

An algorithm used to construct decision trees from datasets.

96
Q

Fill in the blank: A classification model predicts a categorical variable, known as a _______.

A

[class]

97
Q

True or False: Classification trees are computationally slow to induce from data.

A

False

98
Q

What is a subtree in a classification tree?

A

A branching from a node that captures predictive patterns for a sub-population.

99
Q

What is the goal of partitioning customers in classification tree induction?

A

To achieve increasingly purer class distribution in subgroups.

100
Q

What are some examples of popular tree induction algorithms?

A
  • ID3
  • C4.5
  • CART
101
Q

What is the first step in applying a classification tree to predict a class?

A

Start from the root of the tree.

102
Q

What is the significance of the average monthly pay and age in a classification tree?

A

They are attributes used to make decisions at each node.

103
Q

What does a classification tree model predict regarding customer behavior?

A

Whether a customer will switch or stay.

104
Q

What does Information Gain (IG) quantify?

A

The gain from splitting the population into groups based on purity.

105
Q

What does Entropy measure?

A

Impurity in a group of examples.

106
Q

What is the relationship between Entropy and predictive accuracy?

A

Higher entropy indicates higher uncertainty about class membership.

107
Q

How is Entropy calculated?

A

Entropy = -Σ(Pi * log2(Pi)) where Pi is the proportion of class i.

108
Q

What is the formula for Information Gain?

A

Information Gain = Impurity(parent) – Weighted Avg. Impurity(children).
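A sketch of both formulas (cards 107-108) on a hypothetical split of 20 examples:

```python
from math import log2

def entropy(proportions):
    """Impurity of a group, given the proportion of each class (zeros skipped)."""
    return -sum(p * log2(p) for p in proportions if p > 0)

# Parent: 10 stay / 10 switch. A split produces children of sizes 12 and 8.
parent = entropy([10/20, 10/20])             # 1.0 bit: maximal impurity
left   = entropy([9/12, 3/12])               # purer subgroup of 12 examples
right  = entropy([1/8, 7/8])                 # purer subgroup of 8 examples

ig = parent - (12/20 * left + 8/20 * right)  # Impurity(parent) - weighted avg. children
print(f"information gain = {ig:.3f}")        # ~0.296
```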

109
Q

What is the goal of recursive partitioning in classification trees?

A

To improve predictive accuracy by creating purer subgroups.

110
Q

What are some stopping rules for tree partitioning?

A
  • Maximum purity reached
  • All attributes used along the path
  • No information gain from additional splits
111
Q

What is the objective of recursive partitioning?

A

To predict with high certainty.

112
Q

What is a potential risk of recursive partitioning?

A

Finding incidental patterns in small subgroups that do not generalize.

113
Q

What are key characteristics of classification trees?

A
  • High-variance inductive technique
  • Computationally cheap
  • Easy for stakeholders to understand
114
Q

What are the attributes considered in the basketball prediction example?

A
  • Game location (Home/Away)
  • Starting time
  • Player positions and roles
  • Opponent’s center height
115
Q

What is a regression tree?

A

A model built using recursive partitioning for predicting numerical variables.

116
Q

Fill in the blank: Entropy captures how _______ are the sub-groups compared to the original group.

A

[pure]

117
Q

True or False: The order of attributes split on in classification trees does not matter.

A

False

At each split the attribute with the highest information gain is chosen, so the order matters.

118
Q

What does a classification tree model aim to achieve at prediction time?

A

Predict with high certainty.

119
Q

What is the objective of model evaluation?

A

To determine how good the model is in predictive performance

This includes understanding the model’s accuracy and appropriateness for various objectives.

120
Q

What does the classification accuracy rate measure?

A

Proportion of examples whose class is predicted accurately by the model

Calculated as S/N, where S is the number of examples accurately classified and N is the total number of examples.

121
Q

What is the consequence of measuring classification accuracy on training data?

A

It tends to result in an over-optimistic estimation of the model’s future performance

This is because the model is evaluated on the same data it was trained on.

122
Q

What should examples used to evaluate the model be?

A

Examples that were not used to induce the model and whose class is known

This ensures accurate assessment of the model’s predictive capabilities.

123
Q

What is the common practice for splitting data into training and test sets?

A

2/3 of examples for training and 1/3 for testing

This ensures a balance between training the model and evaluating its performance.

124
Q

What is test accuracy?

A

An estimation of how well a model induced from training data will predict the class of examples in the population

It is also known as generalization accuracy.

125
Q

What is N-fold cross-validation?

A

An experiment to approximate the generalization performance of a model by partitioning data into N equally-sized sets

It helps in evaluating the model’s performance by averaging results across multiple training and test sets.

126
Q

How does N-fold cross-validation work?

A
  1. Partition data into N folds
  2. Perform N experiments, each time holding out one fold as the test set
  3. Average the performance results of all experiments

This method provides a reliable estimate of the model’s performance.

127
Q

What are the advantages of N-fold cross-validation for small samples?

A

It allows for a training set size very similar to the original sample, leading to a model that is likely very similar to the one induced from the complete sample

This minimizes discrepancies in model performance between small and full datasets.

128
Q

What is a learning curve?

A

Characterizes how test accuracy improves as the training set size increases

Particularly relevant for methods like classification trees and neural networks.

129
Q

What is the implication of using a smaller training set?

A

It may lead to an overly pessimistic evaluation of the model’s performance

If learning has not plateaued, the model may not perform as well as it could with a larger training set.

130
Q

What happens when the test set is too small?

A

It may not be representative of the population

This can compromise the accuracy of the model’s evaluation.

131
Q

True or False: Overfitting cannot be detected if we evaluate the model using the training data.

A

True

Evaluating on training data only shows that the model improves as it expands, without revealing overfitting.

132
Q

What is the purpose of cross-validation?

A

To approximate how well a model will perform when applied to the population

This involves using multiple folds to ensure a robust evaluation.

133
Q

What is cross-validation?

A

A technique used to evaluate the performance of a model by partitioning data into training and test sets.

134
Q

How can overfitting be detected?

A

By evaluating model performance on a representative test sample.

135
Q

What is overfitting?

A

When a model performs well on training data but poorly on unseen data due to excessive complexity.

136
Q

Why is measuring prediction error on the training set insufficient?

A

It does not reveal whether the model has overfitted the training data.

137
Q

What happens to generalization performance as a model expands?

A

Generalization performance may decrease even if training performance increases.

138
Q

What is the purpose of a validation set?

A

To decide which sub-trees to prune after growing the tree using the training set.

139
Q

How is pruning performed on classification trees?

A

Bottom-up: a subtree is pruned if the pruned tree’s performance on the validation set is no worse than that of the unpruned tree.

140
Q

What is underfitting?

A

When a model is too simple to capture the complex patterns in the data.

141
Q

What is precision in the context of model evaluation?

A

The proportion of true positive predictions among all positive predictions made by the model.

142
Q

What is recall (True Positive Rate)?

A

The proportion of actual positive cases that are correctly predicted by the model.

143
Q

What does a confusion matrix show?

A

The different types of errors that the model makes and their frequency.
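A sketch with scikit-learn (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # actual classes (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # model predictions (hypothetical)

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```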

144
Q

What is a benchmark for a model’s classification accuracy rate?

A

The majority base rate, which is the proportion of examples from the majority class.

145
Q

What are asymmetric error costs?

A

Costs that differ based on the type of error made by a classifier.

146
Q

How can cost-sensitive evaluation improve model assessment?

A

By considering the actual costs of different types of errors rather than treating all errors equally.

147
Q

What is Class Probability Estimation (CPE)?

A

The estimated probability that an example belongs to a certain class provided by classification models.

148
Q

What is the significance of ranking customers by predicted probability of response?

A

It helps in targeting the most likely responders in marketing campaigns.

149
Q

What is the main goal of using a classification model in direct marketing?

A

To decide which customers to target for a campaign based on historical data.

150
Q

What is the relationship between training accuracy and test accuracy?

A

Training accuracy may be high while test accuracy may be low if the model overfits.

151
Q

Fill in the blank: Overfitting is particularly common in _______.

A

classification tree models.

152
Q

True or False: A high classification accuracy always indicates a useful model.

A

False

Accuracy must be judged against the majority base rate and the costs of different error types.

153
Q

What is the purpose of using a model to rank customers for targeting?

A

To predict probability of response and rank customers by their likelihood to respond.

154
Q

What does the y-axis represent in a lift chart?

A

The number (or percent) of responses.

155
Q

What does the x-axis represent in a lift chart?

A

The number of solicitations (or percent of solicitations out of the total number of customers).

156
Q

True or False: Lift charts can help determine whether a predictive model is better at ranking customers than random ranking.

A

True

157
Q

What is represented by the straight line in a lift chart?

A

Random ranking of customers.

158
Q

Why are most lift charts concave?

A

As more customers are targeted, the incremental gain in responses tends to decrease.

159
Q

What does ‘lift’ refer to in the context of lift charts?

A

The improvement in response rates achieved by using the model compared to random selection.
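A sketch of the cumulative-response computation behind a lift chart, on synthetic scores (all numbers here are simulated):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
scores = rng.uniform(size=n)              # model's predicted response probabilities
responded = rng.uniform(size=n) < scores  # simulated actual responses

order = np.argsort(-scores)                          # target highest-scored customers first
cum_model = np.cumsum(responded[order])              # y-axis: responses captured
cum_random = responded.mean() * np.arange(1, n + 1)  # straight line: random ranking

k = 200                                              # after 200 solicitations...
print(cum_model[k - 1], round(cum_random[k - 1]))    # ...the model captures far more responders
```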

160
Q

How can lift charts be evaluated?

A

By comparing the lift of different classifiers to determine which is better for ranking customers.

161
Q

What is a profit lift chart?

A

A chart that factors in targeting costs and revenue, plotting cumulative profit against the number of solicitations.

162
Q

What is the shape of a profit lift chart curve typically, and why?

A

It typically rises and then falls: beyond some point, the cost of soliciting additional, less responsive customers outweighs the incremental revenue (diminishing returns).

163
Q

What does the area under the ROC curve (AUC) indicate?

A

It summarizes the classifier’s overall ranking performance in a single number: 1.0 indicates perfect performance, while 0.5 indicates random guessing.

164
Q

What is precision in the context of customer prediction models?

A

The proportion of predicted buyers that are actually buyers.

165
Q

What is recall in customer prediction models?

A

The proportion of actual buyers that are predicted as such by the model.

166
Q

Fill in the blank: A lift chart allows us to diagnose the effectiveness of a model at ranking customers by the likelihood they belong to an important class (e.g., _______ or switchers).

A

[buyers]

167
Q

What is the tradeoff between precision and recall when increasing the threshold for targeting customers?

A

Increasing the threshold generally increases precision but decreases recall.

168
Q

What is the classification accuracy rate?

A

The rate at which the model correctly predicts the class of customers.

169
Q

What is the importance of estimating performance on an out-of-sample set?

A

It provides an unbiased assessment of the model’s performance.

170
Q

What is the significance of the Precision/Recall Curve (PRC)?

A

It shows the tradeoff between precision and recall for different thresholds.
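The curve can be traced with scikit-learn (a sketch reusing hypothetical labels and scores):

```python
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

# One (precision, recall) pair per threshold; raising the threshold
# generally trades recall away for precision.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(thresholds, precision, recall)))
```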

171
Q

What are the two possibilities for performance estimation?

A

Partitioning data into train/test sets or using cross-validation.

172
Q

What should be considered when measuring performance in relation to business objectives?

A

The alignment between business objectives and the metrics being measured.

173
Q

What is the recommendation for targeting customers with costly incentives?

A

Strategies with high precision are more desirable.

174
Q

What is the role of confusion matrix in performance measurement?

A

It helps calculate costs of errors when error costs are asymmetric and known.

175
Q

What is Machine Learning primarily used for?

A

Predictive techniques for business decisions.

176
Q

What impact has Machine Learning had on business over the last two decades?

A

It has significantly improved predictions of future behaviors, values, and trends.

177
Q

What types of data are commonly used in Machine Learning?

A
  • Consumer behavior data
  • Financial data
  • Employee data
  • Health care data
  • Oil & gas, energy data
178
Q

What are some examples of consumer behavior data?

A
  • GPS
  • Internet use (weblogs)
  • Social media postings
  • Online purchases
179
Q

What kinds of predictions can companies make using Machine Learning?

A
  • Likelihood of customer response to products
  • Loan default probabilities
  • Fraudulent credit transactions detection
  • Employee satisfaction and retention predictions
  • Health predictions (e.g., diabetes risk)
180
Q

What characterizes Machine Learning as a general-purpose technology?

A

It finds patterns in data and informs a wide variety of problems.

181
Q

How does Machine Learning differ from traditional statistical models?

A

Machine Learning can handle various data types and patterns beyond just numerical data.

182
Q

What is the goal of this course on Machine Learning?

A
  • Develop understanding of ML fundamentals
  • Identify opportunities for business value
  • Evaluate ML solutions rigorously
183
Q

What is WEKA?

A

An award-winning Java-based machine learning tool with a graphical user interface.

184
Q

What are the course requirements for this Machine Learning course?

A
  • Textbook and readings
  • Class notes
  • Individual/group assignments
  • In-class quizzes
  • Final Exam
185
Q

What is the purpose of predictive models in Machine Learning?

A

To find relationships in data and predict unknown or future values.

186
Q

Fill in the blank: A _______ predictive model uses conditions to predict customer behavior.

A

rule-based

187
Q

What are major application areas for predictive modeling?

A
  • Marketing
  • Finance and Risk Management
  • Healthcare
  • Fraud Detection
  • Cyber Security
188
Q

What role does predictive analytics play in data-driven healthcare?

A

It produces a list of possible causes based on patient information.

189
Q

True or False: Machine Learning is only applicable in the finance sector.

A

False

It is a general-purpose technology applied in marketing, healthcare, fraud detection, cyber security, and more.

190
Q

What is a common use of predictive analytics in finance?

A

Credit risk scoring.

191
Q

What is the significance of the FICO Score?

A

It is a measure of credit risk.

192
Q

What has led to the explosion of machine learning applications in recent years?

A

Advances in technology and the availability of large amounts of data; the impact of machine learning on practice has increased significantly over the past 5 years.

193
Q

Why didn’t the significant impact of machine learning occur 20 years ago?

A

The specific reasons are not detailed, but advancements in technology and data availability are implied.

194
Q

What is fact-based decision-making?

A

Decisions made by analysis, often considered the best kind of decisions.

195
Q

Who emphasized the importance of fact-based decision-making?

A

Jeff Bezos

196
Q

What challenge did the telecom firm Telco face?

A

700K customers switched to competitors once their contracts expired.

197
Q

What can machine learning predictions inform in marketing campaigns?

A

They can inform and benefit the campaign strategies.

198
Q

What is the foundation of any machine learning project?

A

Careful and thoughtful problem formulation.

199
Q

Who should be included in problem formulation for machine learning projects?

A

Problem owners and domain experts.

200
Q

What are the two key functions of data preparation?

A
  • Identifying informative data
  • Data cleaning, correction, and representation
201
Q

What is an example of a predictor that may be useful for predicting churn?

A

Customer demographics, experience with the firm, recent life changes.

202
Q

What percentage of an overall machine learning project can data preparation consume?

A

Can be up to 80% of the overall project’s time.

203
Q

What is a critical question to evaluate a machine learning model?

A

How good is the model?

204
Q

What should be estimated before deploying a model?

A

The expected impact of the modeling solution on relevant business objectives.

205
Q

What is essential to consider when evaluating a model?

A

The context and relevant measures.

206
Q

Is machine learning a magic wand?

A

No, it offers a set of methodologies that must be used correctly.

207
Q

What can lead to poor predictions despite high accuracy in a model?

A

The implicit assumption that past patterns will be valid in the future.

208
Q

What is a potential issue with predictive models based on historical data?

A

They may not perform well if the economic conditions change.

209
Q

What must training data represent?

A

The data to which the model will be applied.

210
Q

What are some challenges associated with machine learning?

A
  • Ethical challenges
  • Privacy challenges
211
Q

Fill in the blank: Machine learning offers a set of _______.

A

[methodologies]

212
Q

True or False: Data preparation is not a resource-intensive process.

A

False

Data preparation can consume up to 80% of an overall project’s time.

213
Q

What challenges do managers face regarding algorithms?

A

Managers ought to be diligent about the risks posed by algorithms

Algorithms can exhibit bias, which is a significant concern in predictive modeling.

214
Q

What types of data are relevant for predictive modeling?

A

Data from our social media interactions, emails, homes (like Nest), and GPS information

These data sources are crucial for creating effective predictive models.

215
Q

What should be assessed alongside the benefits of modeling?

A

How the modeling will be perceived and any potential resistance

Understanding perception and resistance is vital for successful implementation.

216
Q

What is integral to a business proposition involving predictive analytics?

A

Monetization of data

This strategy can help acquire significant data that is hard for competitors to replicate.

217
Q

What is the ‘data race’?

A

The competition among entities to acquire and utilize data effectively

This race is critical for businesses looking to leverage predictive analytics.

218
Q

What should managers consider about the risks of algorithms?

A

Managers should consider the potential for bias in algorithms

It is essential to mitigate these biases to ensure fair outcomes.

219
Q

Fill in the blank: The __________ of data is crucial for predictive analytics.

A

monetization

Monetization strategies can drive the acquisition of valuable data.

220
Q

What might go wrong with predictive modeling?

A

Algorithms can exhibit bias

Bias can lead to inaccurate predictions and reinforce existing inequalities.

221
Q

What devices/apps collect potentially valuable data?

A

Examples include smart home devices, social media platforms, and GPS applications

These tools can provide insights that enhance predictive modeling.

222
Q

What is the sampling method used in bagging?

A

Draw N samples with replacement from the original training data set.
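Sampling with replacement in a couple of lines (numpy; the tiny data set is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # a tiny training set: instances 0..9

# Drawing WITH replacement: an instance can appear more than once in the
# sample, and some instances may not appear at all.
sample = rng.choice(data, size=len(data), replace=True)
print(sample)
```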

223
Q

What does ‘with replacement’ mean in sampling?

A

Once an instance is drawn, it is placed back into the pool.

224
Q

What does ‘without replacement’ mean in sampling?

A

Once an instance is drawn, it is removed from the pool.

225
Q

Can an instance be drawn to the same sample more than once in bagging?

A

Yes, because sampling is done with replacement.

226
Q

What is bagged trees?

A

An ensemble method that builds multiple trees from different samples of the data.

227
Q

What is the process for making predictions with an ensemble of models?

A

Each tree generates a prediction, and the predictions are combined to produce a single prediction.

228
Q

How are predictions combined in an ensemble?

A

By majority vote.
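A sketch of bagged trees in scikit-learn (the data set is a placeholder; scikit-learn combines the trees by averaging their predicted probabilities, which acts like a majority vote):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# 100 trees, each induced from a bootstrap sample of the training data;
# their predictions are combined into a single prediction.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
bag.fit(X, y)
print(bag.predict(X[:5]))
```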

229
Q

Why might bagging improve predictive accuracy?

A

It reduces the risk of overfitting by averaging multiple models.

230
Q

What is the effect of outliers on an ensemble’s prediction?

A

An ensemble’s prediction can be adversely affected by outliers.

231
Q

What is the probability of not selecting an outlier in a single draw?

A

999/1000, for a training set of 1,000 examples that contains a single outlier.

232
Q

What is the probability of not drawing the outlier at all in 1000 draws?

A

(999/1000)^1000 ≈ 0.367.

233
Q

What is the probability that a sample includes at least one copy of the outlier?

A

1 − 0.367 ≈ 0.632.

234
Q

What is the likelihood of the first 60 samples including the outlier?

A

(0.632)^60 × (0.367)^40 — the probability that each of the first 60 samples includes the outlier and each of the remaining 40 does not.

235
Q

How many combinations exist for 60 samples including the outlier and 40 samples not including it?

A

100!/(60! * 40!) = 1.37E+28.

236
Q

What is the overall probability that the outlier is in 60 samples?

A

1.37E+28 × (0.632)^60 × (0.367)^40 — the binomial probability that exactly 60 of the 100 samples include the outlier (roughly 0.06).
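The arithmetic from cards 231-236 can be checked directly (using exact values rather than the rounded 0.632/0.367):

```python
from math import comb

p_out = (999/1000) ** 1000  # one sample misses the outlier: ~0.367
p_in = 1 - p_out            # one sample includes it: ~0.632

print(comb(100, 60))                         # ~1.37e+28 combinations
print(comb(100, 60) * p_in**60 * p_out**40)  # ~0.065: outlier in exactly 60 of 100 samples
```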

237
Q

What is a key benefit of bagging?

A

It reduces the risk of overfitting by filtering outliers.

238
Q

Is the Bagging Model more effective at improving accuracy with large or small data sets?

A

Bagging is more effective with larger data sets.

239
Q

What is the diminishing effect of outliers in bagging?

A

Bagging diminishes the adverse effects of outliers on the final model’s prediction.

240
Q

What is a key advantage of bagged classification trees?

A

They can capture complex patterns, and their predictions are less likely to be undermined by overfitting

Bagged trees improve stability and accuracy by combining multiple trees.

241
Q

What is a disadvantage of bagged classification trees?

A

Less simple model: a ‘black box’ that is not as comprehensible as a single tree model

This complexity can hinder interpretability.

242
Q

What are the implications of having a small number of examples and many attributes in labor data?

A

Increases the risk of overfitting

The relationship between the number of attributes and the risk of overfitting is critical in model training.

243
Q

What are the two necessary conditions for any modeling technique to overfit?

A
  • The presence of accidental patterns (e.g., outliers) in the training data
  • The availability of attributes that allow the model to capture these patterns

Outliers can distort the learning process, while too many attributes can lead to complex models that do not generalize well.

244
Q

How does Random Forest reduce the risk of overfitting?

A

It combines alleviating the effect of outliers and reducing the risk that certain features contribute to overfitting

Random Forest addresses both issues by using a subset of attributes for each tree.

245
Q

What is the key difference between Random Forest and bagging?

A

In Random Forest, only a subset of randomly selected attributes is considered at each split

This approach helps prevent the same attribute from being used to fit accidental patterns across multiple trees.

246
Q

What is the rationale behind randomly removing attributes in Random Forest?

A

It is less likely that the same attribute will be used to fit an accidental pattern in the data by most trees in the ensemble

This randomness can enhance the model’s robustness.

247
Q

In a Random Forest model, how are attributes selected?

A

4-6 attributes are randomly selected to be considered at each split

This selection process is crucial for the model’s performance.
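A sketch of the attribute subsampling in scikit-learn’s RandomForestClassifier (the data set and the value 5, from the 4-6 range above, are placeholders):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)  # 30 attributes

# Bootstrap samples as in bagging, plus: only max_features randomly chosen
# attributes are considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features=5, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy; evaluate on held-out data in practice
```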

248
Q

What should be considered when determining a good number of trees to use in an ensemble?

A

The trade-off between computational efficiency and model accuracy

More trees can lead to better performance but also increase computation time.

249
Q

True or False: Bagging or Random Forest can improve a classification technique that does not tend to fit the data too well.

A

False

Bagging and Random Forests mainly reduce overfitting; they offer little benefit for a technique that does not fit the training data too closely in the first place.

250
Q

What is typically higher, a model’s training accuracy or its test accuracy?

A

A model’s training accuracy is typically higher than its test accuracy

This reflects that a model fits the training data better than unseen data.

251
Q

True or False: A model’s training accuracy is always the same as the model’s test accuracy.

A

False

Training and test accuracies usually differ.

252
Q

When comparing the performances of two classification models, what does higher training accuracy imply?

A

It does not necessarily imply better predictive performance

Higher training accuracy can indicate overfitting.

253
Q

What should be ensured about a model’s test accuracy rate when evaluating its predictive accuracy?

A

It should be higher than the rate of the majority class

This provides a useful benchmark for model performance.

254
Q

In a predictive model for customer classification, which is a recommended practice?

A

Select the model with the highest test (out-of-sample) accuracy

Test accuracy, not training accuracy, indicates how well the model will generalize.

255
Q

True or False: A model can be evaluated strictly by its performance on a training set.

A

False

Evaluation should focus on out-of-sample representative data.

256
Q

What does classification tree pruning aim to improve?

A

A classification tree’s out-of-sample predictive performance

This is achieved by removing sub-trees that overfit the training data.

257
Q

When comparing classification models for credit risk, what is a relevant measure?

A

Classification accuracy rate

This is pertinent if the costs of misclassifying good and bad risks are equivalent.

258
Q

What does it indicate if a model’s training accuracy is higher than its test accuracy?

A

Some overfitting has likely occurred

Overfitting captures patterns in the training data that do not generalize.

259
Q

Fill in the blank: Overfitting occurs when a model captures patterns that are _______.

A

idiosyncratic to the training data

This leads to improved training performance at the cost of test performance.

260
Q

Which statement about training and test accuracies is generally true?

A

Training accuracy is often higher than test accuracy

This is a common phenomenon in machine learning models.