Midterm Flashcards

(260 cards)

1
Q

What is the classification accuracy rate?

A

The proportion of instances whose class the model predicts correctly, out of all instances in your data.

Formula: S/N, where S is the number of accurately classified examples and N is the total number of examples.
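A minimal sketch of this formula in Python (the labels and predictions are illustrative):

```python
# Classification accuracy = S/N (illustrative data).
y_true = [1, 0, 1, 1, 0, 1]  # known classes
y_pred = [1, 0, 0, 1, 0, 1]  # the model's predictions

S = sum(t == p for t, p in zip(y_true, y_pred))  # correctly classified examples
N = len(y_true)                                  # total number of examples
print(S / N)  # 5/6, about 0.83
```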

2
Q

Why can classification accuracy be misleading?

A

It may show high accuracy on training data, which does not reflect the model’s performance on unseen data.

High training accuracy may indicate overfitting.

3
Q

What do we call the examples that were not used to induce the model?

A

Testing data.

Testing data is crucial for evaluating model performance on unseen data.

4
Q

What are the two main data partitions used in model training?

A
  • Training data
  • Testing data
5
Q

What is generalization accuracy?

A

An estimation of how well your model predicts the class of examples from a different data set.

Also known as test accuracy.

6
Q

What is the learning curve?

A

A graphical representation showing how model accuracy improves as the training set size increases.

X-axis: sample size of training data; Y-axis: accuracy of the model on testing data.
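A sketch of how such a curve could be produced, assuming scikit-learn and matplotlib are installed (the synthetic dataset and tree classifier are illustrative choices):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, random_state=0)

sizes = [50, 100, 250, 500, 1000, 1500]
accs = [DecisionTreeClassifier(random_state=0)
        .fit(X_tr[:m], y_tr[:m])
        .score(X_te, y_te)          # test accuracy at each training size
        for m in sizes]

plt.plot(sizes, accs, marker="o")   # typically rises, then plateaus
plt.xlabel("Training set size")
plt.ylabel("Test accuracy")
plt.show()
```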

7
Q

True or False: More data generally improves model performance.

A

True.

More data allows the model to learn better and reduces the risk of overfitting.

8
Q

What happens to model accuracy as training data increases?

A

Model accuracy generally increases until it plateaus.

This indicates diminishing returns on accuracy with additional data.

9
Q

What is one drawback of splitting data into training and testing sets?

A

It limits the amount of data available for training and testing, which can affect model performance.

Insufficient data can lead to non-representative samples.

10
Q

What is a common solution to avoid over-optimistic evaluation in model testing?

A

Use a sufficiently large dataset to ensure representativeness after splitting.

This helps maintain data integrity for both training and testing phases.

11
Q

What is the relationship between the size of the training data and the expected model performance?

A

Larger training data generally leads to better model performance.

More data helps the model generalize better to unseen data.

12
Q

What is the drawback of partitioning data for training and testing?

A

Some data is lost to each phase: examples held out for testing cannot be used for induction, and vice versa.

This can lead to a less reliable model if the dataset is small.

13
Q

Why is more data desirable in model training?

A

To maintain reliability and avoid issues from limited data when making training and testing cuts.

A larger dataset helps in achieving better generalization.

14
Q

What is cross validation?

A

A model evaluation technique used to approximate generalization accuracy; it is not itself a technique for building the final predictive model.

It involves partitioning data into subsets for training and testing.

15
Q

How does cross validation improve model evaluation?

A

By conducting multiple experiments, it reduces the chance of bias from a single training/testing split.

This is especially useful when working with limited data.

16
Q

What are the steps in performing 10-fold cross validation?

A
  1. Partition data into 10 folds.
  2. Hold one fold out for testing.
  3. Use the remaining folds for training.
  4. Repeat for each fold.

Each portion of data serves as both training and testing at different times.
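A from-scratch sketch of these steps in Python (`induce` and `evaluate` are hypothetical callables standing in for any induction algorithm and accuracy measure; assumes numpy arrays):

```python
import numpy as np

def ten_fold_cv(X, y, induce, evaluate, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))  # shuffle, then
    folds = np.array_split(idx, k)                         # partition into k folds
    scores = []
    for i in range(k):
        test = folds[i]                                    # hold one fold out
        train = np.concatenate(folds[:i] + folds[i + 1:])  # train on the rest
        model = induce(X[train], y[train])
        scores.append(evaluate(model, X[test], y[test]))   # repeat for each fold
    return float(np.mean(scores))                          # average across folds
```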

17
Q

What is the benefit of averaging the results in cross validation?

A

It mitigates the effects of outliers and provides a more reliable accuracy estimate.

Averaging across folds helps smooth out inaccuracies from any one fold.

18
Q

What is the potential disadvantage of increasing the number of folds in cross validation?

A

It can lead to very small testing sets, which may not be representative of the entire dataset.

This diminishes the effectiveness of the cross validation process.

19
Q

What happens in leave-one-out cross validation?

A

One record is held out as the test set while the rest are used for training.

Each test set contains exactly one record, so the method is typically reserved for very limited data.

20
Q

What is a key consideration when using limited data in cross validation?

A

Each model induced will be similar, but care must be taken to ensure the test set is adequately sized.

Smaller datasets could lead to biased results if the test set is too small.

21
Q

True or False: Cross validation is used for building predictive models.

A

False.

Cross validation is primarily an evaluation technique.

22
Q

Fill in the blank: Cross validation aims to approximate _______.

A

generalization accuracy.

This is crucial for assessing model performance on unseen data.

23
Q

What is the main purpose of cross validation in model evaluation?

A

To assess the performance of a model using different subsets of data

24
Q

True or False: Cross validation is an inducing technique for models.

A

False

25
In the context of cross validation, what does partitioning a small set of data allow for?
It allows for multiple evaluation experiments without committing to a final induced model
26
What happens to the model's performance when using cross validation?
It helps mitigate outliers by averaging results
27
Fill in the blank: In cross validation, you never use the same experiments for both ______ and ______.
training, testing
28
What is the significance of having a satisfactory cross validation accuracy?
It indicates the model is likely to perform well
29
What is an example of model parameters mentioned in the text?
Max depth of 5; minimum of 50 samples per leaf
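A sketch of these settings in scikit-learn terms (an assumption: the course tool is WEKA, but the parameters map directly):

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,          # the tree may grow at most 5 levels deep
    min_samples_leaf=50,  # every leaf must cover at least 50 examples
)
```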
30
What are the two main phases of model building as discussed?
Inducing a model and evaluating the model
31
How does cross validation improve the evaluation process when data is limited?
It allows for evaluation without a separate training and testing split
32
What happens to the model's structure when evaluated on testing data versus cross validation?
The model remains largely the same in both evaluations
33
What does each fold in cross validation consist of?
A separate training set and testing set
34
True or False: All data is used for both training and testing in each fold of cross validation.
False
35
What can be concluded about the evaluation techniques discussed?
They serve to assess model performance effectively
36
When is cross validation typically applied in the model building process?
During the evaluation phase of the model
37
What is the formula for calculating classification accuracy?
S/N: S is the number accurately classified by the model, and N is the total number of examples.
38
What is the difference between training accuracy and test accuracy?
Training accuracy is the model’s performance on training examples; test accuracy is the model’s performance on out-of-sample data.
39
What is the common practice for partitioning data for model training and testing?
It is common to use ⅔ of the data for training and ⅓ for testing.
40
What is a learning curve in predictive analytics?
It characterizes how test accuracy improves as the training set size increases.
41
What is Cross Validation (CV)?
CV is an experiment that provides a good approximation of generalization performance for a model.
42
What are the steps involved in N-Fold Cross Validation?
* Randomly partition data into N equally sized sets (folds)
* Perform N experiments of model building and evaluation
* Hold out one fold as the test set in each experiment
* Induce a model from the remaining folds
* Evaluate performance on the test set
* Average the performance of the N experiments
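The same experiment via a library helper, assuming scikit-learn is available (the dataset is synthetic, for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores)         # one accuracy per fold (the N experiments)
print(scores.mean())  # averaged generalization estimate
```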
43
What is overfitting in predictive modeling?
Overfitting occurs when a model captures not only regularities in the data but also peculiarities, undermining its predictive performance.
44
What is the purpose of a validation set?
A validation set is used to decide which subtrees to prune in a model.
45
What happens when training error decreases while validation error increases?
It indicates that the model is likely overfitting the training data.
46
Define precision in the context of classification models.
Precision is the ratio of true positives to the total predicted positives: True Positives/(True Positives + False Positives).
47
Define recall in the context of classification models.
Recall is the ratio of true positives to the total actual positives: True Positives/(True Positives + False Negatives).
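Both ratios computed from illustrative confusion-matrix counts:

```python
tp, fp, fn = 40, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # 40/50 = 0.80: predicted positives that are real
recall = tp / (tp + fn)     # 40/60, about 0.67: actual positives that are found
print(precision, recall)
```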
48
What is the trade-off between precision and recall?
As precision increases, recall tends to decrease.
49
What is a Lift Chart used for?
A Lift Chart is used to determine if a model is better at ranking customers than random ranking.
50
What does the Receiver Operating Characteristic (ROC) curve illustrate?
The ROC curve illustrates the performance of a binary classifier as its discrimination threshold varies.
51
What does the area under the ROC curve (AUC) indicate?
AUC summarizes the overall performance of a model; a value of 1.0 indicates perfect performance, while 0.5 indicates random guessing.
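A minimal sketch, assuming scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]                # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]  # model's class-1 probabilities

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random
```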
52
What is the role of Class Probability Estimation (CPE)?
CPE shows the probability that a given example will belong to a certain class.
53
Fill in the blank: The training set is used to grow a tree to its _______.
[max size]
54
True or False: The validation set is the same as the test set.
False
55
What is the benchmark for classification accuracy?
The base rate: the accuracy obtained by classifying all examples to the majority class.
56
What is the importance of evaluating model performance on test samples?
To detect overfitting and ensure the model generalizes well to unseen data.
57
What is an example or instance in the context of data mining?
A fact that typically includes a set of attributes and an output variable.
58
What is a data set?
A set of examples.
59
What is training data?
Data used to induce (train) a model.
60
What are attributes in data mining?
Independent variables.
61
What is the target variable in data mining?
The dependent variable.
62
What is the purpose of analyzing customer data in predictive analytics?
To induce patterns common among customers who have terminated or extended their contracts.
63
Define 'pattern' in the context of data mining.
A conclusion drawn from data that predicts an outcome based on certain conditions.
64
What does induction or inductive learning refer to?
A method or algorithm used to induce a pattern from a set of examples.
65
What is linear regression in data mining?
An induction algorithm that predicts a dependent variable based on independent variables.
66
What is a model in data mining?
A general pattern induced from data that describes the data in concise form.
67
What is the objective of a predictive model?
To estimate or predict an unknown value.
68
What is supervised learning?
Model induction followed by inference using the model to predict.
69
Define unsupervised learning in data mining.
Clustering/segmentation that organizes instances into cohesive groups without predicting an unknown value.
70
What type of questions can data mining answer regarding customer behavior?
* What products are commonly bought together?
* What is a customer likely to buy next?
* How likely is a customer to respond to a marketing campaign?
71
What does classification refer to in data mining?
A predictive model where the target variable is discrete (categorical).
72
What does a classification model provide as a by-product?
The probability that the case belongs to each category.
73
What is a classification tree?
A classification model that includes a set of IF/THEN rules.
74
What is regression in data mining?
A predictive model that predicts the value of a numerical variable.
75
What is clustering/segmentation analysis?
Unsupervised learning that identifies distinct groups of similar instances.
76
What is the purpose of association rules in data mining?
To find relations among attributes in the data that frequently co-occur.
77
What is sequence analysis in data mining?
Finding patterns in time-stamped data.
78
Fill in the blank: A learner in data mining is also known as a _______.
[induction algorithm].
79
True or False: Supervised learning is used to predict unknown values.
True.
80
True or False: Unsupervised learning requires labeled data.
False.
81
What is a model?
A concise description of a pattern (relationship) that exists in the data.
82
What do classification models predict?
They predict (estimate) an unknown value of interest, which is a categorical variable.
83
Examples of classification tasks include:
* Customer retention (CRM)
* Marketing
* Risk management
* Financial trading
84
What is a classification tree?
A predictive model represented as a tree that is used for classification tasks.
85
Why are classification trees popular?
They are easy to understand, computationally fast to induce from data, and are the basis of high-performing modeling techniques.
86
What do non-terminal nodes in a classification tree represent?
Tests on an attribute.
87
What do terminal nodes (leaves) in a classification tree provide?
A prediction and a distribution over the classes.
88
In a classification tree, what is the outcome when a leaf node is reached?
A class prediction is made.
89
How are rules extracted from a classification tree?
Each path from the root to a leaf node constitutes a rule.
90
What is the classification tree model used for in tax compliance?
To predict whether an incoming tax report is noncompliant.
91
What is the purpose of partitioning in classification tree induction?
To create subgroups that are purer with respect to the class than the original group.
92
What are good predictors in classification tree induction?
Attributes that help partition the examples into purer sub-groups.
93
What is Information Gain (IG)?
A measure that captures how informative an attribute is for distinguishing between instances of different classes.
94
What does entropy measure in the context of classification trees?
The impurity in a dataset.
95
What is a classification tree induction algorithm?
An algorithm used to construct decision trees from datasets.
96
Fill in the blank: A classification model predicts a categorical variable, known as a _______.
[class]
97
True or False: Classification trees are computationally slow to induce from data.
False
98
What is a subtree in a classification tree?
A branching from a node that captures predictive patterns for a sub-population.
99
What is the goal of partitioning customers in classification tree induction?
To achieve increasingly purer class distribution in subgroups.
100
What are some examples of popular tree induction algorithms?
* ID3
* C4.5
* CART
101
What is the first step in applying a classification tree to predict a class?
Start from the root of the tree.
102
What is the significance of the average monthly pay and age in a classification tree?
They are attributes used to make decisions at each node.
103
What does a classification tree model predict regarding customer behavior?
Whether a customer will switch or stay.
104
What does Information Gain (IG) quantify?
The reduction in impurity achieved by splitting the population into subgroups.
105
What does Entropy measure?
Impurity in a group of examples.
106
What is the relationship between Entropy and predictive accuracy?
Higher entropy indicates higher uncertainty about class membership.
107
How is Entropy calculated?
Entropy = -Σ(Pi * log2(Pi)) where Pi is the proportion of class i.
108
What is the formula for Information Gain?
Information Gain = Impurity(parent) – Weighted Avg. Impurity(children).
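Both formulas in a small Python sketch (the class counts are illustrative):

```python
import math

def entropy(counts):
    """Impurity of a group, given its class counts, e.g. [9, 5]."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

parent = [9, 5]              # e.g. 9 "stay" vs. 5 "switch"
children = [[7, 1], [2, 4]]  # the two subgroups after a candidate split

n = sum(parent)
weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
ig = entropy(parent) - weighted  # Impurity(parent) - weighted avg. of children
print(round(ig, 3))
```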
109
What is the goal of recursive partitioning in classification trees?
To improve predictive accuracy by creating purer subgroups.
110
What are some stopping rules for tree partitioning?
* Maximum purity reached
* All attributes used along the path
* No information gain from additional splits
111
What is the objective of recursive partitioning?
To predict with high certainty.
112
What is a potential risk of recursive partitioning?
Finding incidental patterns in small subgroups that do not generalize.
113
What are key strengths of classification trees?
* Computationally cheap to induce
* Easy for stakeholders to understand
* Highly expressive, able to capture complex patterns (though this makes trees a high-variance inductive technique that can overfit)
114
What are the attributes considered in the basketball prediction example?
* Game location (Home/Away)
* Starting time
* Player positions and roles
* Opponent's center height
115
What is a regression tree?
A model built using recursive partitioning for predicting numerical variables.
116
Fill in the blank: Entropy captures how much _______ the sub-groups are compared to the original group.
[purer]
117
True or False: The order of attributes split on in classification trees does not matter.
False
118
What does a classification tree model aim to achieve at prediction time?
Predict with high certainty.
119
What is the objective of model evaluation?
To determine how good the model is in terms of predictive performance. Note: this includes understanding the model's accuracy and appropriateness for various objectives.
120
What does the classification accuracy rate measure?
Proportion of examples whose class is predicted accurately by the model. Note: calculated as S/N, where S is the number of examples accurately classified and N is the total number of examples.
121
What is the consequence of measuring classification accuracy on training data?
It tends to result in an over-optimistic estimation of the model's future performance. Note: this is because the model is evaluated on the same data it was trained on.
122
What should examples used to evaluate the model be?
Examples that were not used to induce the model and whose class is known. Note: this ensures accurate assessment of the model's predictive capabilities.
123
What is the common practice for splitting data into training and test sets?
2/3 of examples for training and 1/3 for testing. Note: this ensures a balance between training the model and evaluating its performance.
124
What is test accuracy?
An estimation of how well a model induced from training data will predict the class of examples in the population. Note: it is also known as generalization accuracy.
125
What is N-fold cross-validation?
An experiment to approximate the generalization performance of a model by partitioning data into N equally-sized sets. Note: it helps in evaluating the model's performance by averaging results across multiple training and test sets.
126
How does N-fold cross-validation work?
1. Partition data into N folds
2. Perform N experiments, each time holding out one fold as the test set
3. Average the performance results of all experiments
Note: this method provides a reliable estimate of the model's performance.
127
What are the advantages of N-fold cross-validation for small samples?
It allows for a training set size very similar to the original sample, leading to a model that is likely very similar to the one induced from the complete sample. Note: this minimizes discrepancies in model performance between small and full datasets.
128
What is a learning curve?
Characterizes how test accuracy improves as the training set size increases. Note: particularly relevant for methods like classification trees and neural networks.
129
What is the implication of using a smaller training set?
It may lead to an overly pessimistic evaluation of the model's performance. Note: if learning has not plateaued, the model may not perform as well as it could with a larger training set.
130
What happens when the test set is too small?
It may not be representative of the population. Note: this can compromise the accuracy of the model's evaluation.
131
True or False: Overfitting cannot be detected if we evaluate the model using the training data.
True. Note: evaluating on training data only shows that the model improves as it expands, without revealing overfitting.
132
What is the purpose of cross-validation?
To approximate how well a model will perform when applied to the population. Note: this involves using multiple folds to ensure a robust evaluation.
133
What is cross-validation?
A technique used to evaluate the performance of a model by partitioning data into training and test sets.
134
How can overfitting be detected?
By evaluating model performance on a representative test sample.
135
What is overfitting?
When a model performs well on training data but poorly on unseen data due to excessive complexity.
136
Why is measuring prediction error on the training set insufficient?
It does not reveal whether the model has overfitted the training data.
137
What happens to generalization performance as a model expands?
Generalization performance may decrease even if training performance increases.
138
What is the purpose of a validation set?
To decide which sub-trees to prune after growing the tree using the training set.
139
How is pruning performed on classification trees?
Bottom-up: prune a subtree if the pruned tree's performance on the validation set is not worse than that of the unpruned tree.
140
What is underfitting?
When a model is too simple to capture the complex patterns in the data.
141
What is precision in the context of model evaluation?
The proportion of true positive predictions among all positive predictions made by the model.
142
What is recall (True Positive Rate)?
The proportion of actual positive cases that are correctly predicted by the model.
143
What does a confusion matrix show?
The different types of errors that the model makes and their frequency.
144
What is a benchmark for a model's classification accuracy rate?
The majority base rate, which is the proportion of examples from the majority class.
145
What are asymmetric error costs?
Costs that differ based on the type of error made by a classifier.
146
How can cost-sensitive evaluation improve model assessment?
By considering the actual costs of different types of errors rather than treating all errors equally.
147
What is Class Probability Estimation (CPE)?
The estimated probability that an example belongs to a certain class provided by classification models.
148
What is the significance of ranking customers by predicted probability of response?
It helps in targeting the most likely responders in marketing campaigns.
149
What is the main goal of using a classification model in direct marketing?
To decide which customers to target for a campaign based on historical data.
150
What is the relationship between training accuracy and test accuracy?
Training accuracy may be high while test accuracy may be low if the model overfits.
151
Fill in the blank: Overfitting is particularly common in _______.
classification tree models.
152
True or False: A high classification accuracy always indicates a useful model.
False.
153
What is the purpose of using a model to rank customers for targeting?
To predict probability of response and rank customers by their likelihood to respond.
154
What does the y-axis represent in a lift chart?
The number (or percent) of responses.
155
What does the x-axis represent in a lift chart?
The number of solicitations (or percent of solicitations out of the total number of customers).
156
True or False: Lift charts can help determine whether a predictive model is better at ranking customers than random ranking.
True.
157
What is represented by the straight line in a lift chart?
Random ranking of customers.
158
Why are most lift charts concave?
As more customers are targeted, the incremental gain in responses tends to decrease.
159
What does 'lift' refer to in the context of lift charts?
The improvement in response rates achieved by using the model compared to random selection.
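A sketch of the data behind a lift chart (numpy only; the scores and outcomes are illustrative):

```python
import numpy as np

probs = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])  # model scores
responded = np.array([1, 1, 0, 1, 0, 1, 0, 0])              # actual outcomes

order = np.argsort(-probs)                 # target best prospects first
model_curve = np.cumsum(responded[order])  # y-axis: cumulative responses
random_line = responded.sum() * np.arange(1, len(probs) + 1) / len(probs)
print(model_curve)  # concave if the model ranks well
print(random_line)  # the straight line for random ranking
```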
160
How can lift charts be evaluated?
By comparing the lift of different classifiers to determine which is better for ranking customers.
161
What is a profit lift chart?
A chart that factors in targeting costs and revenue, plotting cumulative profit against the number of solicitations.
162
What is the shape of a profit lift chart curve typically, and why?
It typically rises while the most likely responders are targeted, peaks, and then declines as the cost of additional solicitations outweighs the incremental revenue.
163
What does the area under the ROC curve (AUC) indicate?
A single number summarizing a classifier's ranking performance across all thresholds: 1.0 indicates perfect performance, while 0.5 indicates random guessing.
164
What is precision in the context of customer prediction models?
The proportion of predicted buyers that are actually buyers.
165
What is recall in customer prediction models?
The proportion of actual buyers that are predicted as such by the model.
166
Fill in the blank: A lift chart allows us to diagnose the effectiveness of a model at ranking customers by the likelihood they belong to an important class (e.g., _______ or switchers).
buyers
167
What is the tradeoff between precision and recall when increasing the threshold for targeting customers?
Increasing the threshold generally increases precision but decreases recall.
168
What is the classification accuracy rate?
The rate at which the model correctly predicts the class of customers.
169
What is the importance of estimating performance on an out-of-sample set?
It provides an unbiased assessment of the model's performance.
170
What is the significance of the Precision/Recall Curve (PRC)?
It shows the tradeoff between precision and recall for different thresholds.
171
What are the two possibilities for performance estimation?
Partitioning data into train/test sets or using cross-validation.
172
What should be considered when measuring performance in relation to business objectives?
The alignment between business objectives and the metrics being measured.
173
What is the recommendation for targeting customers with costly incentives?
Strategies with high precision are more desirable.
174
What is the role of confusion matrix in performance measurement?
It helps calculate costs of errors when error costs are asymmetric and known.
175
What is Machine Learning primarily used for?
Predictive techniques for business decisions.
176
What impact has Machine Learning had on business over the last two decades?
It has significantly improved predictions of future behaviors, values, and trends.
177
What types of data are commonly used in Machine Learning?
* Consumer behavior data
* Financial data
* Employee data
* Health care data
* Oil & gas, energy data
178
What are some examples of consumer behavior data?
* GPS
* Internet use (weblogs)
* Social media postings
* Online purchases
179
What kinds of predictions can companies make using Machine Learning?
* Likelihood of customer response to products
* Loan default probabilities
* Fraudulent credit transactions detection
* Employee satisfaction and retention predictions
* Health predictions (e.g., diabetes risk)
180
What characterizes Machine Learning as a general-purpose technology?
It finds patterns in data and informs a wide variety of problems.
181
How does Machine Learning differ from traditional statistical models?
Machine Learning can handle various data types and patterns beyond just numerical data.
182
What is the goal of this course on Machine Learning?
* Develop understanding of ML fundamentals
* Identify opportunities for business value
* Evaluate ML solutions rigorously
183
What is WEKA?
An award-winning Java-based machine learning tool with a graphical user interface.
184
What are the course requirements for this Machine Learning course?
* Textbook and readings
* Class notes
* Individual/group assignments
* In-class quizzes
* Final Exam
185
What is the purpose of predictive models in Machine Learning?
To find relationships in data and predict unknown or future values.
186
Fill in the blank: A _______ predictive model uses conditions to predict customer behavior.
rule-based
187
What are major application areas for predictive modeling?
* Marketing
* Finance and Risk Management
* Healthcare
* Fraud Detection
* Cyber Security
188
What role does predictive analytics play in data-driven healthcare?
It produces a list of possible causes based on patient information.
189
True or False: Machine Learning is only applicable in the finance sector.
False
190
What is a common use of predictive analytics in finance?
Credit risk scoring.
191
What is the significance of the FICO Score?
It is a measure of credit risk.
192
What has led to the explosion of machine learning applications in recent years?
Advances in technology and data availability; machine learning's impact on practice has increased significantly over the past 5 years.
193
Why didn't the significant impact of machine learning occur 20 years ago?
The specific reasons are not detailed, but advancements in technology and data availability are implied.
194
What is fact-based decision-making?
Decisions made by analysis, often considered the best kind of decisions.
195
Who emphasized the importance of fact-based decision-making?
Jeff Bezos
196
What challenge did the telecom firm Telco face?
700K customers switched to competitors once their contracts expired.
197
What can machine learning predictions inform in marketing campaigns?
They can inform and benefit the campaign strategies.
198
What is the foundation of any machine learning project?
Careful and thoughtful problem formulation.
199
Who should be included in problem formulation for machine learning projects?
Problem owners and domain experts.
200
What are the two key functions of data preparation?
* Identifying informative data
* Data cleaning, correction, and representation
201
What is an example of a predictor that may be useful for predicting churn?
Customer demographics, experience with the firm, recent life changes.
202
What percentage of an overall machine learning project can data preparation consume?
Can be up to 80% of the overall project's time.
203
What is a critical question to evaluate a machine learning model?
How good is the model?
204
What should be estimated before deploying a model?
The expected impact of the modeling solution on relevant business objectives.
205
What is essential to consider when evaluating a model?
The context and relevant measures.
206
Is machine learning a magic wand?
No, it offers a set of methodologies that must be used correctly.
207
What can lead to poor predictions despite high accuracy in a model?
The implicit assumption that past patterns will be valid in the future.
208
What is a potential issue with predictive models based on historical data?
They may not perform well if the economic conditions change.
209
What must training data represent?
The data to which the model will be applied.
210
What are some challenges associated with machine learning?
* Ethical challenges
* Privacy challenges
211
Fill in the blank: Machine learning offers a set of _______.
[methodologies]
212
True or False: Data preparation is not a resource-intensive process.
False
213
What challenges do managers face regarding algorithms?
Managers ought to be diligent about the risks posed by algorithms. Note: algorithms can exhibit bias, which is a significant concern in predictive modeling.
214
What types of data are relevant for predictive modeling?
Data from our social media interactions, emails, homes (like Nest), and GPS information. Note: these data sources are crucial for creating effective predictive models.
215
What should be assessed alongside the benefits of modeling?
How the modeling will be perceived and any potential resistance. Note: understanding perception and resistance is vital for successful implementation.
216
What is integral to a business proposition involving predictive analytics?
Monetization of data. Note: this strategy can help acquire significant data that is hard for competitors to replicate.
217
What is the 'data race'?
The competition among entities to acquire and utilize data effectively. Note: this race is critical for businesses looking to leverage predictive analytics.
218
What should managers consider about the risks of algorithms?
Managers should consider the potential for bias in algorithms. Note: it is essential to mitigate these biases to ensure fair outcomes.
219
Fill in the blank: The __________ of data is crucial for predictive analytics.
monetization. Note: monetization strategies can drive the acquisition of valuable data.
220
What might go wrong with predictive modeling?
Algorithms can exhibit bias. Note: bias can lead to inaccurate predictions and reinforce existing inequalities.
221
What devices/apps collect potentially valuable data?
Examples include smart home devices, social media platforms, and GPS applications. Note: these tools can provide insights that enhance predictive modeling.
222
What is the sampling method used in bagging?
Each bootstrap sample is formed by drawing N instances, with replacement, from the original training data set.
223
What does 'with replacement' mean in sampling?
Once an instance is drawn, it is placed back into the pool.
224
What does 'without replacement' mean in sampling?
Once an instance is drawn, it is removed from the pool.
225
Can an instance be drawn to the same sample more than once in bagging?
Yes, because sampling is done with replacement.
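The two sampling modes side by side (a numpy sketch; the data is a stand-in for training instances):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)  # stand-in for N training instances

with_repl = rng.choice(data, size=len(data), replace=True)  # bagging's draw
print(with_repl)      # duplicates possible; some instances never appear

without_repl = rng.choice(data, size=5, replace=False)      # drawn items removed
print(without_repl)   # all distinct
```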
226
What is bagged trees?
An ensemble method that builds multiple trees from different samples of the data.
227
What is the process for making predictions with an ensemble of models?
Each tree generates a prediction, and the predictions are combined to produce a single prediction.
228
How are predictions combined in an ensemble?
By majority vote.
229
Why might bagging improve predictive accuracy?
It reduces the risk of overfitting by averaging multiple models.
230
What is the effect of outliers on an ensemble's prediction?
A single tree's prediction can be adversely affected by outliers; the ensemble's combined prediction is less susceptible, since an outlier appears in only some of the bootstrap samples.
231
What is the probability of not selecting an outlier in a single draw?
999/1000.
232
What is the probability of not drawing the outlier at all in 1000 draws?
(999/1000)^1000 ≈ 0.368.
233
What is the probability that a sample includes at least one copy of the outlier?
0.632.
234
What is the likelihood of the first 60 samples including the outlier?
The probability is (0.632)^60 * (0.368)^40.
235
How many combinations exist for 60 samples including the outlier and 40 samples not including it?
100!/(60! * 40!) ≈ 1.37E+28.
236
What is the overall probability that the outlier is in 60 samples?
Approximately 0.065: 1.37E+28 * (0.632)^60 * (0.368)^40.
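The arithmetic above reproduced in Python (1000 training instances, one outlier, 100 bootstrap samples of 1000 draws each):

```python
from math import comb

p_absent = (999 / 1000) ** 1000  # one sample avoids the outlier entirely
p_present = 1 - p_absent         # one sample contains at least one copy

# Exactly 60 of the 100 bootstrap samples containing the outlier:
p_60 = comb(100, 60) * p_present**60 * p_absent**40
print(p_absent, p_present, p_60)  # roughly 0.368, 0.632, 0.065
```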
237
What is a key benefit of bagging?
It reduces the risk of overfitting by filtering outliers.
238
Is the Bagging Model more effective at improving accuracy with large or small data sets?
Bagging is more effective with larger data sets.
239
What is the diminishing effect of outliers in bagging?
Bagging diminishes the adverse effects of outliers on the final model's prediction.
240
What is a key advantage of bagged classification trees?
They can capture complex patterns, and their predictions are less likely to be undermined by overfitting. Note: bagged trees improve stability and accuracy by combining multiple trees.
241
What is a disadvantage of bagged classification trees?
Less simple model: a 'black box' that is not as comprehensible as a single tree model. Note: this complexity can hinder interpretability.
242
What are the implications of having a small number of examples and many attributes in labor data?
It increases the risk of overfitting. Note: the relationship between the number of attributes and the risk of overfitting is critical in model training.
243
What are the two necessary conditions for any modeling technique to overfit?
* The presence of outliers
* The availability of attributes that allow capturing these patterns
Note: outliers can distort the learning process, while too many attributes can lead to complex models that do not generalize well.
244
How does Random Forest reduce the risk of overfitting?
It combines alleviating the effect of outliers and reducing the risk that certain features contribute to overfitting. Note: Random Forest addresses both issues by using a subset of attributes for each tree.
245
What is the key difference between Random Forest and bagging?
In Random Forest, only a subset of randomly selected attributes is considered at each split. Note: this approach helps prevent the same attribute from being used to fit accidental patterns across multiple trees.
246
What is the rationale behind randomly removing attributes in Random Forest?
It is less likely that the same attribute will be used by most trees in the ensemble to fit an accidental pattern in the data. Note: this randomness can enhance the model's robustness.
247
In a Random Forest model, how are attributes selected?
4-6 attributes are randomly selected to be considered at each split. Note: this selection process is crucial for the model's performance.
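Expressed in scikit-learn terms (an assumption; the '4-6 attributes' above corresponds to the max_features parameter):

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=100,  # number of trees in the ensemble
    max_features=5,    # attributes randomly considered at each split
)
```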
248
What should be considered when determining a good number of trees to use in an ensemble?
The trade-off between computational efficiency and model accuracy. Note: more trees can lead to better performance but also increase computation time.
249
True or False: Bagging or Random Forest can improve a classification technique that does not tend to fit the data too well.
True. Note: these techniques are designed to enhance model performance and reduce overfitting.
250
What is typically higher, a model’s training accuracy or its test accuracy?
A model's training accuracy is typically higher than its test accuracy. Note: this reflects that a model fits the training data better than unseen data.
251
True or False: A model’s training accuracy is always the same as the model’s test accuracy.
False. Note: training and test accuracies usually differ.
252
When comparing the performances of two classification models, what does higher training accuracy imply?
It does not necessarily imply better predictive performance. Note: higher training accuracy can indicate overfitting.
253
What should be ensured about a model’s test accuracy rate when evaluating its predictive accuracy?
It should be higher than the rate of the majority class. Note: this provides a useful benchmark for model performance.
254
In a predictive model for customer classification, which is a recommended practice?
Select the model with the highest test accuracy. Note: out-of-sample accuracy, not training accuracy, indicates how well the model will generalize.
255
True or False: A model can be evaluated strictly by its performance on a training set.
False. Note: evaluation should focus on out-of-sample, representative data.
256
What does classification tree pruning aim to improve?
A classification tree's out-of-sample predictive performance. Note: this is achieved by removing sub-trees that overfit the training data.
257
When comparing classification models for credit risk, what is a relevant measure?
Classification accuracy rate. Note: this is pertinent if the costs of misclassifying good and bad risks are equivalent.
258
What does it indicate if a model’s training accuracy is higher than its test accuracy?
Some overfitting has likely occurred. Note: overfitting captures patterns in the training data that do not generalize.
259
Fill in the blank: Overfitting occurs when a model captures patterns that are _______.
idiosyncratic to the training data. Note: this leads to improved training performance at the cost of test performance.
260
Which statement about training and test accuracies is generally true?
Training accuracy is often higher than test accuracy. Note: this is a common phenomenon in machine learning models.