Midterm Flashcards

(260 cards)

1
Q

What is the classification accuracy rate?

A

The proportion of instances whose class the model predicts correctly, out of all instances in your data.

Formula: S/N, where S is the number of accurately classified examples and N is the total number of examples.
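A minimal sketch of this formula in Python (the labels and predictions are illustrative):

```python
# Classification accuracy = S/N (illustrative data).
y_true = [1, 0, 1, 1, 0, 1]  # known classes
y_pred = [1, 0, 0, 1, 0, 1]  # the model's predictions

S = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified examples
N = len(y_true)                                  # total number of examples
print(S / N)  # 5/6, about 0.83
```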

2
Q

Why can classification accuracy be misleading?

A

It may show high accuracy on training data, which does not reflect the model’s performance on unseen data.

High training accuracy may indicate overfitting.

3
Q

What do we call the examples that were not used to induce the model?

A

Testing data.

Testing data is crucial for evaluating model performance on unseen data.

4
Q

What are the two main data partitions used in model training?

A
  • Training data
  • Testing data
5
Q

What is generalization accuracy?

A

An estimation of how well your model predicts the class of examples from a different data set.

Also known as test accuracy.

6
Q

What is the learning curve?

A

A graphical representation showing how model accuracy improves as the training set size increases.

X-axis: sample size of training data; Y-axis: accuracy of the model on testing data.
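A sketch of how such a curve could be produced, assuming scikit-learn and matplotlib are installed (the synthetic dataset and tree classifier are illustrative choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

sizes = [50, 100, 250, 500, 1000, 1500]
accs = [DecisionTreeClassifier(random_state=0)
        .fit(X_tr[:m], y_tr[:m])
        .score(X_te, y_te)          # test accuracy at each training size
        for m in sizes]

plt.plot(sizes, accs, marker="o")   # typically rises, then plateaus
plt.xlabel("Training set size")
plt.ylabel("Test accuracy")
plt.show()
```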

7
Q

True or False: More data generally improves model performance.

A

True.

More data allows the model to learn better and reduces the risk of overfitting.

8
Q

What happens to model accuracy as training data increases?

A

Model accuracy generally increases until it plateaus.

This indicates diminishing returns on accuracy with additional data.

9
Q

What is one drawback of splitting data into training and testing sets?

A

It limits the amount of data available for training and testing, which can affect model performance.

Insufficient data can lead to non-representative samples.

10
Q

What is a common solution to avoid over-optimistic evaluation in model testing?

A

Use a sufficiently large dataset to ensure representativeness after splitting.

This helps maintain data integrity for both training and testing phases.

11
Q

What is the relationship between the size of the training data and the expected model performance?

A

Larger training data generally leads to better model performance.

More data helps the model generalize better to unseen data.

12
Q

What is the drawback of partitioning data for training and testing?

A

Some data is lost to each phase: examples held out for testing cannot be used for induction, and vice versa.

This can lead to a less reliable model if the dataset is small.

13
Q

Why is more data desirable in model training?

A

To maintain reliability and avoid issues from limited data when making training and testing cuts.

A larger dataset helps in achieving better generalization.

14
Q

What is cross validation?

A

A model evaluation technique used to approximate generalization accuracy; it is not itself a technique for building the final predictive model.

It involves partitioning data into subsets for training and testing.

15
Q

How does cross validation improve model evaluation?

A

By conducting multiple experiments, it reduces the chance of bias from a single training/testing split.

This is especially useful when working with limited data.

16
Q

What are the steps in performing 10-fold cross validation?

A
  1. Partition data into 10 folds.
  2. Hold one fold out for testing.
  3. Use the remaining folds for training.
  4. Repeat for each fold.

Each portion of data serves as both training and testing at different times.
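A from-scratch sketch of these steps in Python (`induce` and `evaluate` are hypothetical callables standing in for any induction algorithm and accuracy measure; assumes numpy arrays):

```python
import numpy as np

def ten_fold_cv(X, y, induce, evaluate, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle, then
    folds = np.array_split(idx, k)                         # partition into k folds
    scores = []
    for i in range(k):
        test = folds[i]                                    # hold one fold out
        train = np.concatenate(folds[:i] + folds[i + 1:])  # train on the rest
        model = induce(X[train], y[train])
        scores.append(evaluate(model, X[test], y[test]))   # repeat for each fold
    return float(np.mean(scores))                          # average across folds
```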

17
Q

What is the benefit of averaging the results in cross validation?

A

It mitigates the effects of outliers and provides a more reliable accuracy estimate.

Averaging across folds helps smooth out inaccuracies from any one fold.

18
Q

What is the potential disadvantage of increasing the number of folds in cross validation?

A

It can lead to very small testing sets, which may not be representative of the entire dataset.

This diminishes the effectiveness of the cross validation process.

19
Q

What happens in leave-one-out cross validation?

A

One record is held out as the test set while the rest are used for training.

Each test set contains exactly one record, so the method is typically reserved for very limited data.

20
Q

What is a key consideration when using limited data in cross validation?

A

Each model induced will be similar, but care must be taken to ensure the test set is adequately sized.

Smaller datasets could lead to biased results if the test set is too small.

21
Q

True or False: Cross validation is used for building predictive models.

A

False.

Cross validation is primarily an evaluation technique.

22
Q

Fill in the blank: Cross validation aims to approximate _______.

A

generalization accuracy.

This is crucial for assessing model performance on unseen data.

23
Q

What is the main purpose of cross validation in model evaluation?

A

To assess the performance of a model using different subsets of data

24
Q

True or False: Cross validation is an inducing technique for models.

A

False

25
In the context of cross validation, what does partitioning a small set of data allow for?
It allows for multiple evaluation experiments without committing to a final induced model
26
What happens to the model's performance when using cross validation?
It helps mitigate outliers by averaging results
27
Fill in the blank: In cross validation, you never use the same experiments for both ______ and ______.
training, testing
28
What is the significance of having a satisfactory cross validation accuracy?
It indicates the model is likely to perform well
29
What is an example of model parameters mentioned in the text?
Max depth of 5; minimum of 50 samples per leaf
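A sketch of these settings in scikit-learn terms (an assumption: the course tool is WEKA, but the parameters map directly):

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,          # the tree may grow at most 5 levels deep
    min_samples_leaf=50,  # every leaf must cover at least 50 examples
)
```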
30
What are the two main phases of model building as discussed?
Inducing a model and evaluating the model
31
How does cross validation improve the evaluation process when data is limited?
It allows for evaluation without a separate training and testing split
32
What happens to the model's structure when evaluated on testing data versus cross validation?
The model remains largely the same in both evaluations
33
What does each fold in cross validation consist of?
A separate training set and testing set
34
True or False: All data is used for both training and testing in each fold of cross validation.
False
35
What can be concluded about the evaluation techniques discussed?
They serve to assess model performance effectively
36
When is cross validation typically applied in the model building process?
During the evaluation phase of the model
37
What is the formula for calculating classification accuracy?
S/N: S is the number accurately classified by the model, and N is the total number of examples.
38
What is the difference between training accuracy and test accuracy?
Training accuracy is the model’s performance on training examples; test accuracy is the model’s performance on out-of-sample data.
39
What is the common practice for partitioning data for model training and testing?
It is common to use ⅔ of the data for training and ⅓ for testing.
40
What is a learning curve in predictive analytics?
It characterizes how test accuracy improves as the training set size increases.
41
What is Cross Validation (CV)?
CV is an experiment that provides a good approximation of generalization performance for a model.
42
What are the steps involved in N-Fold Cross Validation?
* Randomly partition data into N equally sized sets (folds)
* Perform N experiments of model building and evaluation
* Hold out one fold as the test set in each experiment
* Induce a model from the remaining folds
* Evaluate performance on the test set
* Average the performance of the N experiments
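The same experiment via a library helper, assuming scikit-learn is available (the dataset is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores)         # one accuracy per fold (the N experiments)
print(scores.mean())  # averaged generalization estimate
```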
43
What is overfitting in predictive modeling?
Overfitting occurs when a model captures not only regularities in the data but also peculiarities, undermining its predictive performance.
44
What is the purpose of a validation set?
A validation set is used to decide which subtrees to prune in a model.
45
What happens when training error decreases while validation error increases?
It indicates that the model is likely overfitting the training data.
46
Define precision in the context of classification models.
Precision is the ratio of true positives to the total predicted positives: True Positives/(True Positives + False Positives).
47
Define recall in the context of classification models.
Recall is the ratio of true positives to the total actual positives: True Positives/(True Positives + False Negatives).
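Both ratios computed from illustrative confusion-matrix counts:

```python
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # 40/50 = 0.80: predicted positives that are real
recall = tp / (tp + fn)     # 40/60, about 0.67: actual positives that are found
print(precision, recall)
```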
48
What is the trade-off between precision and recall?
As precision increases, recall tends to decrease.
49
What is a Lift Chart used for?
A Lift Chart is used to determine if a model is better at ranking customers than random ranking.
50
What does the Receiver Operating Characteristic (ROC) curve illustrate?
The ROC curve illustrates the performance of a binary classifier as its discrimination threshold varies.
51
What does the area under the ROC curve (AUC) indicate?
AUC summarizes the overall performance of a model; a value of 1.0 indicates perfect performance, while 0.5 indicates random guessing.
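A minimal sketch, assuming scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]  # model's class-1 probabilities

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random
```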
52
What is the role of Class Probability Estimation (CPE)?
CPE shows the probability that a given example will belong to a certain class.
53
Fill in the blank: The training set is used to grow a tree to its _______.
[max size]
54
True or False: The validation set is the same as the test set.
False
55
What is the benchmark for classification accuracy?
The base rate: the accuracy obtained by classifying all examples to the majority class.
56
What is the importance of evaluating model performance on test samples?
To detect overfitting and ensure the model generalizes well to unseen data.
57
What is an example or instance in the context of data mining?
A fact that typically includes a set of attributes and an output variable.
58
What is a data set?
A set of examples.
59
What is training data?
Data used to induce (train) a model.
60
What are attributes in data mining?
Independent variables.
61
What is the target variable in data mining?
The dependent variable.
62
What is the purpose of analyzing customer data in predictive analytics?
To induce patterns common among customers who have terminated or extended their contracts.
63
Define 'pattern' in the context of data mining.
A conclusion drawn from data that predicts an outcome based on certain conditions.
64
What does induction or inductive learning refer to?
A method or algorithm used to induce a pattern from a set of examples.
65
What is linear regression in data mining?
An induction algorithm that predicts a dependent variable based on independent variables.
66
What is a model in data mining?
A general pattern induced from data that describes the data in concise form.
67
What is the objective of a predictive model?
To estimate or predict an unknown value.
68
What is supervised learning?
Model induction followed by inference using the model to predict.
69
Define unsupervised learning in data mining.
Clustering/segmentation that organizes instances into cohesive groups without predicting an unknown value.
70
What type of questions can data mining answer regarding customer behavior?
* What products are commonly bought together?
* What is a customer likely to buy next?
* How likely is a customer to respond to a marketing campaign?
71
What does classification refer to in data mining?
A predictive model where the target variable is discrete (categorical).
72
What does a classification model provide as a by-product?
The probability that the case belongs to each category.
73
What is a classification tree?
A classification model that includes a set of IF/THEN rules.
74
What is regression in data mining?
A predictive model that predicts the value of a numerical variable.
75
What is clustering/segmentation analysis?
Unsupervised learning that identifies distinct groups of similar instances.
76
What is the purpose of association rules in data mining?
To find relations among attributes in the data that frequently co-occur.
77
What is sequence analysis in data mining?
Finding patterns in time-stamped data.
78
Fill in the blank: A learner in data mining is also known as a _______.
[induction algorithm].
79
True or False: Supervised learning is used to predict unknown values.
True.
80
True or False: Unsupervised learning requires labeled data.
False.
81
What is a model?
A concise description of a pattern (relationship) that exists in the data.
82
What do classification models predict?
They predict (estimate) an unknown value of interest, which is a categorical variable.
83
Examples of classification tasks include:
* Customer retention (CRM)
* Marketing
* Risk management
* Financial trading
84
What is a classification tree?
A predictive model represented as a tree that is used for classification tasks.
85
Why are classification trees popular?
They are easy to understand, computationally fast to induce from data, and are the basis of high-performing modeling techniques.
86
What do non-terminal nodes in a classification tree represent?
Tests on an attribute.
87
What do terminal nodes (leaves) in a classification tree provide?
A prediction and a distribution over the classes.
88
In a classification tree, what is the outcome when a leaf node is reached?
A class prediction is made.
89
How are rules extracted from a classification tree?
Each path from the root to a leaf node constitutes a rule.
90
What is the classification tree model used for in tax compliance?
To predict whether an incoming tax report is noncompliant.
91
What is the purpose of partitioning in classification tree induction?
To create subgroups that are purer with respect to the class than the original group.
92
What are good predictors in classification tree induction?
Attributes that help partition the examples into purer sub-groups.
93
What is Information Gain (IG)?
A measure that captures how informative an attribute is for distinguishing between instances of different classes.
94
What does entropy measure in the context of classification trees?
The impurity in a dataset.
95
What is a classification tree induction algorithm?
An algorithm used to construct decision trees from datasets.
96
Fill in the blank: A classification model predicts a categorical variable, known as a _______.
[class]
97
True or False: Classification trees are computationally slow to induce from data.
False
98
What is a subtree in a classification tree?
A branching from a node that captures predictive patterns for a sub-population.
99
What is the goal of partitioning customers in classification tree induction?
To achieve increasingly purer class distribution in subgroups.
100
What are some examples of popular tree induction algorithms?
* ID3
* C4.5
* CART
101
What is the first step in applying a classification tree to predict a class?
Start from the root of the tree.
102
What is the significance of the average monthly pay and age in a classification tree?
They are attributes used to make decisions at each node.
103
What does a classification tree model predict regarding customer behavior?
Whether a customer will switch or stay.
104
What does Information Gain (IG) quantify?
The reduction in impurity achieved by splitting the population into subgroups.
105
What does Entropy measure?
Impurity in a group of examples.
106
What is the relationship between Entropy and predictive accuracy?
Higher entropy indicates higher uncertainty about class membership.
107
How is Entropy calculated?
Entropy = -Σ(Pi * log2(Pi)) where Pi is the proportion of class i.
108
What is the formula for Information Gain?
Information Gain = Impurity(parent) – Weighted Avg. Impurity(children).
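Both formulas in a small Python sketch (the class counts are illustrative):

```python
import math

def entropy(counts):
    """Impurity of a group, given its class counts, e.g. [9, 5]."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = [9, 5]              # e.g. 9 "stay" vs. 5 "switch"
children = [[7, 1], [2, 4]]  # the two subgroups after a candidate split

n = sum(parent)
weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
ig = entropy(parent) - weighted  # Impurity(parent) - weighted avg. of children
print(round(ig, 3))
```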
109
What is the goal of recursive partitioning in classification trees?
To improve predictive accuracy by creating purer subgroups.
110
What are some stopping rules for tree partitioning?
* Maximum purity reached
* All attributes used along the path
* No information gain from additional splits
111
What is the objective of recursive partitioning?
To predict with high certainty.
112
What is a potential risk of recursive partitioning?
Finding incidental patterns in small subgroups that do not generalize.
113
What are key strengths of classification trees?
* Computationally cheap to induce
* Easy for stakeholders to understand
* Highly expressive, able to capture complex patterns (though this makes trees a high-variance inductive technique that can overfit)
114
What are the attributes considered in the basketball prediction example?
* Game location (Home/Away)
* Starting time
* Player positions and roles
* Opponent's center height
115
What is a regression tree?
A model built using recursive partitioning for predicting numerical variables.
116
Fill in the blank: Entropy captures how much _______ the sub-groups are compared to the original group.
[purer]
117
True or False: The order of attributes split on in classification trees does not matter.
False
118
What does a classification tree model aim to achieve at prediction time?
Predict with high certainty.
119
What is the objective of model evaluation?
To determine how good the model is in terms of predictive performance. Note: this includes understanding the model's accuracy and appropriateness for various objectives.
120
What does the classification accuracy rate measure?
Proportion of examples whose class is predicted accurately by the model. Note: calculated as S/N, where S is the number of examples accurately classified and N is the total number of examples.
121
What is the consequence of measuring classification accuracy on training data?
It tends to result in an over-optimistic estimation of the model's future performance. Note: this is because the model is evaluated on the same data it was trained on.
122
What should examples used to evaluate the model be?
Examples that were not used to induce the model and whose class is known. Note: this ensures accurate assessment of the model's predictive capabilities.
123
What is the common practice for splitting data into training and test sets?
2/3 of examples for training and 1/3 for testing. Note: this ensures a balance between training the model and evaluating its performance.
124
What is test accuracy?
An estimation of how well a model induced from training data will predict the class of examples in the population. Note: it is also known as generalization accuracy.
125
What is N-fold cross-validation?
An experiment to approximate the generalization performance of a model by partitioning data into N equally-sized sets. Note: it helps in evaluating the model's performance by averaging results across multiple training and test sets.
126
How does N-fold cross-validation work?
1. Partition data into N folds
2. Perform N experiments, each time holding out one fold as the test set
3. Average the performance results of all experiments
Note: this method provides a reliable estimate of the model's performance.
127
What are the advantages of N-fold cross-validation for small samples?
It allows for a training set size very similar to the original sample, leading to a model that is likely very similar to the one induced from the complete sample. Note: this minimizes discrepancies in model performance between small and full datasets.
128
What is a learning curve?
Characterizes how test accuracy improves as the training set size increases. Note: particularly relevant for methods like classification trees and neural networks.
129
What is the implication of using a smaller training set?
It may lead to an overly pessimistic evaluation of the model's performance. Note: if learning has not plateaued, the model may not perform as well as it could with a larger training set.
130
What happens when the test set is too small?
It may not be representative of the population. Note: this can compromise the accuracy of the model's evaluation.
131
True or False: Overfitting cannot be detected if we evaluate the model using the training data.
True. Note: evaluating on training data only shows that the model improves as it expands, without revealing overfitting.
132
What is the purpose of cross-validation?
To approximate how well a model will perform when applied to the population. Note: this involves using multiple folds to ensure a robust evaluation.
133
What is cross-validation?
A technique used to evaluate the performance of a model by partitioning data into training and test sets.
134
How can overfitting be detected?
By evaluating model performance on a representative test sample.
135
What is overfitting?
When a model performs well on training data but poorly on unseen data due to excessive complexity.
136
Why is measuring prediction error on the training set insufficient?
It does not reveal whether the model has overfitted the training data.
137
What happens to generalization performance as a model expands?
Generalization performance may decrease even if training performance increases.
138
What is the purpose of a validation set?
To decide which sub-trees to prune after growing the tree using the training set.
139
How is pruning performed on classification trees?
Bottom-up: prune a subtree if the pruned tree's performance on the validation set is not worse than that of the unpruned tree.
140
What is underfitting?
When a model is too simple to capture the complex patterns in the data.
141
What is precision in the context of model evaluation?
The proportion of true positive predictions among all positive predictions made by the model.
142
What is recall (True Positive Rate)?
The proportion of actual positive cases that are correctly predicted by the model.
143
What does a confusion matrix show?
The different types of errors that the model makes and their frequency.
144
What is a benchmark for a model's classification accuracy rate?
The majority base rate, which is the proportion of examples from the majority class.
145
What are asymmetric error costs?
Costs that differ based on the type of error made by a classifier.
146
How can cost-sensitive evaluation improve model assessment?
By considering the actual costs of different types of errors rather than treating all errors equally.
147
What is Class Probability Estimation (CPE)?
The estimated probability that an example belongs to a certain class provided by classification models.
148
What is the significance of ranking customers by predicted probability of response?
It helps in targeting the most likely responders in marketing campaigns.
149
What is the main goal of using a classification model in direct marketing?
To decide which customers to target for a campaign based on historical data.
150
What is the relationship between training accuracy and test accuracy?
Training accuracy may be high while test accuracy may be low if the model overfits.
151
Fill in the blank: Overfitting is particularly common in _______.
classification tree models.
152
True or False: A high classification accuracy always indicates a useful model.
False.
153
What is the purpose of using a model to rank customers for targeting?
To predict probability of response and rank customers by their likelihood to respond.
154
What does the y-axis represent in a lift chart?
The number (or percent) of responses.
155
What does the x-axis represent in a lift chart?
The number of solicitations (or percent of solicitations out of the total number of customers).
156
True or False: Lift charts can help determine whether a predictive model is better at ranking customers than random ranking.
True.
157
What is represented by the straight line in a lift chart?
Random ranking of customers.
158
Why are most lift charts concave?
As more customers are targeted, the incremental gain in responses tends to decrease.
159
What does 'lift' refer to in the context of lift charts?
The improvement in response rates achieved by using the model compared to random selection.
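A sketch of the data behind a lift chart (numpy only; the scores and outcomes are illustrative):

```python
import numpy as np

probs = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # model scores
responded = np.array([1, 1, 0, 1, 0, 1, 0, 0])              # actual outcomes

order = np.argsort(-probs)                 # target best prospects first
model_curve = np.cumsum(responded[order])  # y-axis: cumulative responses
random_line = responded.sum() * np.arange(1, len(probs) + 1) / len(probs)
print(model_curve)  # concave if the model ranks well
print(random_line)  # the straight line for random ranking
```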
160
How can lift charts be evaluated?
By comparing the lift of different classifiers to determine which is better for ranking customers.
161
What is a profit lift chart?
A chart that factors in targeting costs and revenue, plotting cumulative profit against the number of solicitations.
162
What is the shape of a profit lift chart curve typically, and why?
It typically rises while the most likely responders are targeted, peaks, and then declines as the cost of additional solicitations outweighs the incremental revenue.
163
What does the area under the ROC curve (AUC) indicate?
A single number summarizing a classifier's ranking performance across all thresholds: 1.0 indicates perfect performance, while 0.5 indicates random guessing.
164
What is precision in the context of customer prediction models?
The proportion of predicted buyers that are actually buyers.
165
What is recall in customer prediction models?
The proportion of actual buyers that are predicted as such by the model.
166
Fill in the blank: A lift chart allows us to diagnose the effectiveness of a model at ranking customers by the likelihood they belong to an important class (e.g., _______ or switchers).
buyers
167
What is the tradeoff between precision and recall when increasing the threshold for targeting customers?
Increasing the threshold generally increases precision but decreases recall.
168
What is the classification accuracy rate?
The rate at which the model correctly predicts the class of customers.
169
What is the importance of estimating performance on an out-of-sample set?
It provides an unbiased assessment of the model's performance.
170
What is the significance of the Precision/Recall Curve (PRC)?
It shows the tradeoff between precision and recall for different thresholds.
171
What are the two possibilities for performance estimation?
Partitioning data into train/test sets or using cross-validation.
172
What should be considered when measuring performance in relation to business objectives?
The alignment between business objectives and the metrics being measured.
173
What is the recommendation for targeting customers with costly incentives?
Strategies with high precision are more desirable.
174
What is the role of confusion matrix in performance measurement?
It helps calculate costs of errors when error costs are asymmetric and known.
175
What is Machine Learning primarily used for?
Predictive techniques for business decisions.
176
What impact has Machine Learning had on business over the last two decades?
It has significantly improved predictions of future behaviors, values, and trends.
177
What types of data are commonly used in Machine Learning?
* Consumer behavior data
* Financial data
* Employee data
* Health care data
* Oil & gas, energy data
178
What are some examples of consumer behavior data?
* GPS
* Internet use (weblogs)
* Social media postings
* Online purchases
179
What kinds of predictions can companies make using Machine Learning?
* Likelihood of customer response to products
* Loan default probabilities
* Fraudulent credit transactions detection
* Employee satisfaction and retention predictions
* Health predictions (e.g., diabetes risk)
180
What characterizes Machine Learning as a general-purpose technology?
It finds patterns in data and informs a wide variety of problems.
181
How does Machine Learning differ from traditional statistical models?
Machine Learning can handle various data types and patterns beyond just numerical data.
182
What is the goal of this course on Machine Learning?
* Develop understanding of ML fundamentals
* Identify opportunities for business value
* Evaluate ML solutions rigorously
183
What is WEKA?
An award-winning Java-based machine learning tool with a graphical user interface.
184
What are the course requirements for this Machine Learning course?
* Textbook and readings
* Class notes
* Individual/group assignments
* In-class quizzes
* Final Exam
185
What is the purpose of predictive models in Machine Learning?
To find relationships in data and predict unknown or future values.
186
Fill in the blank: A _______ predictive model uses conditions to predict customer behavior.
rule-based
187
What are major application areas for predictive modeling?
* Marketing
* Finance and Risk Management
* Healthcare
* Fraud Detection
* Cyber Security
188
What role does predictive analytics play in data-driven healthcare?
It produces a list of possible causes based on patient information.
189
True or False: Machine Learning is only applicable in the finance sector.
False
190
What is a common use of predictive analytics in finance?
Credit risk scoring.
191
What is the significance of the FICO Score?
It is a measure of credit risk.
192
What has led to the explosion of machine learning applications in recent years?
Advances in technology and data availability; machine learning's impact on practice has increased significantly over the past 5 years.
193
Why didn't the significant impact of machine learning occur 20 years ago?
The specific reasons are not detailed, but advancements in technology and data availability are implied.
194
What is fact-based decision-making?
Decisions made by analysis, often considered the best kind of decisions.
195
Who emphasized the importance of fact-based decision-making?
Jeff Bezos
196
What challenge did the telecom firm Telco face?
700K customers switched to competitors once their contracts expired.
197
What can machine learning predictions inform in marketing campaigns?
They can inform and benefit the campaign strategies.
198
What is the foundation of any machine learning project?
Careful and thoughtful problem formulation.
199
Who should be included in problem formulation for machine learning projects?
Problem owners and domain experts.
200
What are the two key functions of data preparation?
* Identifying informative data
* Data cleaning, correction, and representation
201
What is an example of a predictor that may be useful for predicting churn?
Customer demographics, experience with the firm, recent life changes.
202
What percentage of an overall machine learning project can data preparation consume?
Can be up to 80% of the overall project's time.
203
What is a critical question to evaluate a machine learning model?
How good is the model?
204
What should be estimated before deploying a model?
The expected impact of the modeling solution on relevant business objectives.
205
What is essential to consider when evaluating a model?
The context and relevant measures.
206
Is machine learning a magic wand?
No, it offers a set of methodologies that must be used correctly.
207
What can lead to poor predictions despite high accuracy in a model?
The implicit assumption that past patterns will be valid in the future.
208
What is a potential issue with predictive models based on historical data?
They may not perform well if the economic conditions change.
209
What must training data represent?
The data to which the model will be applied.
210
What are some challenges associated with machine learning?
* Ethical challenges
* Privacy challenges
211
Fill in the blank: Machine learning offers a set of _______.
[methodologies]
212
True or False: Data preparation is not a resource-intensive process.
False
213
What challenges do managers face regarding algorithms?
Managers ought to be diligent about the risks posed by algorithms. Note: algorithms can exhibit bias, which is a significant concern in predictive modeling.
214
What types of data are relevant for predictive modeling?
Data from our social media interactions, emails, homes (like Nest), and GPS information. Note: these data sources are crucial for creating effective predictive models.
215
What should be assessed alongside the benefits of modeling?
How the modeling will be perceived and any potential resistance. Note: understanding perception and resistance is vital for successful implementation.
216
What is integral to a business proposition involving predictive analytics?
Monetization of data. Note: this strategy can help acquire significant data that is hard for competitors to replicate.
217
What is the 'data race'?
The competition among entities to acquire and utilize data effectively. Note: this race is critical for businesses looking to leverage predictive analytics.
218
What should managers consider about the risks of algorithms?
Managers should consider the potential for bias in algorithms. Note: it is essential to mitigate these biases to ensure fair outcomes.
219
Fill in the blank: The __________ of data is crucial for predictive analytics.
monetization. Note: monetization strategies can drive the acquisition of valuable data.
220
What might go wrong with predictive modeling?
Algorithms can exhibit bias. Note: bias can lead to inaccurate predictions and reinforce existing inequalities.
221
What devices/apps collect potentially valuable data?
Examples include smart home devices, social media platforms, and GPS applications. Note: these tools can provide insights that enhance predictive modeling.
222
What is the sampling method used in bagging?
Each bootstrap sample is formed by drawing N instances, with replacement, from the original training data set.
223
What does 'with replacement' mean in sampling?
Once an instance is drawn, it is placed back into the pool.
224
What does 'without replacement' mean in sampling?
Once an instance is drawn, it is removed from the pool.
225
Can an instance be drawn to the same sample more than once in bagging?
Yes, because sampling is done with replacement.
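The two sampling modes side by side (a numpy sketch; the data is a stand-in for training instances):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for N training instances

with_repl = rng.choice(data, size=len(data), replace=True)  # bagging's draw
print(with_repl)      # duplicates possible; some instances never appear

without_repl = rng.choice(data, size=5, replace=False)      # drawn items removed
print(without_repl)   # all distinct
```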
226
What is bagged trees?
An ensemble method that builds multiple trees from different samples of the data.
227
What is the process for making predictions with an ensemble of models?
Each tree generates a prediction, and the predictions are combined to produce a single prediction.
228
How are predictions combined in an ensemble?
By majority vote.
229
Why might bagging improve predictive accuracy?
It reduces the risk of overfitting by averaging multiple models.
230
What is the effect of outliers on an ensemble's prediction?
A single tree's prediction can be adversely affected by outliers; the ensemble's combined prediction is less susceptible, since an outlier appears in only some of the bootstrap samples.
231
What is the probability of not selecting an outlier in a single draw?
999/1000.
232
What is the probability of not drawing the outlier at all in 1000 draws?
(999/1000)^1000 ≈ 0.368.
233
What is the probability that a sample includes at least one copy of the outlier?
0.632.
234
What is the likelihood of the first 60 samples including the outlier?
The probability is (0.632)^60 * (0.368)^40.
235
How many combinations exist for 60 samples including the outlier and 40 samples not including it?
100!/(60! * 40!) ≈ 1.37E+28.
236
What is the overall probability that the outlier is in 60 samples?
Approximately 0.065: 1.37E+28 * (0.632)^60 * (0.368)^40.
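The arithmetic above reproduced in Python (1000 training instances, one outlier, 100 bootstrap samples of 1000 draws each):

```python
from math import comb

p_absent = (999 / 1000) ** 1000  # one sample avoids the outlier entirely
p_present = 1 - p_absent         # one sample contains at least one copy

# Exactly 60 of the 100 bootstrap samples containing the outlier:
p_60 = comb(100, 60) * p_present**60 * p_absent**40
print(p_absent, p_present, p_60)  # roughly 0.368, 0.632, 0.065
```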
237
What is a key benefit of bagging?
It reduces the risk of overfitting by filtering outliers.
238
Is the Bagging Model more effective at improving accuracy with large or small data sets?
Bagging is more effective with larger data sets.
239
What is the diminishing effect of outliers in bagging?
Bagging diminishes the adverse effects of outliers on the final model's prediction.
240
What is a key advantage of bagged classification trees?
They can capture complex patterns, and their predictions are less likely to be undermined by overfitting. Note: bagged trees improve stability and accuracy by combining multiple trees.
241
What is a disadvantage of bagged classification trees?
Less simple model: a 'black box' that is not as comprehensible as a single tree model. Note: this complexity can hinder interpretability.
242
What are the implications of having a small number of examples and many attributes in labor data?
It increases the risk of overfitting. Note: the relationship between the number of attributes and the risk of overfitting is critical in model training.
243
What are the two necessary conditions for any modeling technique to overfit?
* The presence of outliers
* The availability of attributes that allow capturing these patterns
Note: outliers can distort the learning process, while too many attributes can lead to complex models that do not generalize well.
244
How does Random Forest reduce the risk of overfitting?
It combines alleviating the effect of outliers and reducing the risk that certain features contribute to overfitting. Note: Random Forest addresses both issues by using a subset of attributes for each tree.
245
What is the key difference between Random Forest and bagging?
In Random Forest, only a subset of randomly selected attributes is considered at each split. Note: this approach helps prevent the same attribute from being used to fit accidental patterns across multiple trees.
246
What is the rationale behind randomly removing attributes in Random Forest?
It is less likely that the same attribute will be used by most trees in the ensemble to fit an accidental pattern in the data. Note: this randomness can enhance the model's robustness.
247
In a Random Forest model, how are attributes selected?
4-6 attributes are randomly selected to be considered at each split. Note: this selection process is crucial for the model's performance.
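Expressed in scikit-learn terms (an assumption; the '4-6 attributes' above corresponds to the max_features parameter):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,  # number of trees in the ensemble
    max_features=5,    # attributes randomly considered at each split
)
```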
248
What should be considered when determining a good number of trees to use in an ensemble?
The trade-off between computational efficiency and model accuracy. Note: more trees can lead to better performance but also increase computation time.
249
True or False: Bagging or Random Forest can improve a classification technique that does not tend to fit the data too well.
True. Note: these techniques are designed to enhance model performance and reduce overfitting.
250
What is typically higher, a model’s training accuracy or its test accuracy?
A model's training accuracy is typically higher than its test accuracy. Note: this reflects that a model fits the training data better than unseen data.
251
True or False: A model’s training accuracy is always the same as the model’s test accuracy.
False. Note: training and test accuracies usually differ.
252
When comparing the performances of two classification models, what does higher training accuracy imply?
It does not necessarily imply better predictive performance. Note: higher training accuracy can indicate overfitting.
253
What should be ensured about a model’s test accuracy rate when evaluating its predictive accuracy?
It should be higher than the rate of the majority class. Note: this provides a useful benchmark for model performance.
254
In a predictive model for customer classification, which is a recommended practice?
Select the model with the highest test accuracy. Note: out-of-sample accuracy, not training accuracy, indicates how well the model will generalize.
255
True or False: A model can be evaluated strictly by its performance on a training set.
False. Note: evaluation should focus on out-of-sample, representative data.
256
What does classification tree pruning aim to improve?
A classification tree's out-of-sample predictive performance. Note: this is achieved by removing sub-trees that overfit the training data.
257
When comparing classification models for credit risk, what is a relevant measure?
Classification accuracy rate. Note: this is pertinent if the costs of misclassifying good and bad risks are equivalent.
258
What does it indicate if a model’s training accuracy is higher than its test accuracy?
Some overfitting has likely occurred. Note: overfitting captures patterns in the training data that do not generalize.
259
Fill in the blank: Overfitting occurs when a model captures patterns that are _______.
idiosyncratic to the training data. Note: this leads to improved training performance at the cost of test performance.
260
Which statement about training and test accuracies is generally true?
Training accuracy is often higher than test accuracy. Note: this is a common phenomenon in machine learning models.