Midterm Flashcards
What is the classification accuracy rate?
The proportion of correctly predicted instances out of all instances in your data.
Formula: S/N, where S is the number of accurately classified examples and N is the total number of examples.
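A minimal sketch of this computation in Python (the labels are illustrative):

```python
# Classification accuracy: S / N, where S = correctly classified, N = total examples.
def accuracy(y_true, y_pred):
    s = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return s / len(y_true)

print(accuracy(["stay", "switch", "stay"], ["stay", "stay", "stay"]))  # 2/3 ≈ 0.667
```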
Why can classification accuracy be misleading?
It may show high accuracy on training data, which does not reflect the model’s performance on unseen data.
High training accuracy may indicate overfitting.
What do we call the examples that were not used to induce the model?
Testing data.
Testing data is crucial for evaluating model performance on unseen data.
What are the two main data partitions used in model training?
- Training data
- Testing data
What is generalization accuracy?
An estimation of how well your model predicts the class of examples from a different data set.
Also known as test accuracy.
What is the learning curve?
A graphical representation showing how model accuracy improves as the training set size increases.
X-axis: sample size of training data; Y-axis: accuracy of the model on testing data.
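One way to trace such a curve, sketched with scikit-learn (an assumption; the dataset and training sizes are illustrative):

```python
# Learning curve sketch: test accuracy as the training set grows.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

for n in (50, 100, 200, len(X_train)):  # increasing training sizes
    model = DecisionTreeClassifier(random_state=0).fit(X_train[:n], y_train[:n])
    print(n, model.score(X_test, y_test))  # accuracy typically rises, then plateaus
```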
True or False: More data generally improves model performance.
True.
More data allows the model to learn better and reduces the risk of overfitting.
What happens to model accuracy as training data increases?
Model accuracy generally increases until it plateaus.
This indicates diminishing returns on accuracy with additional data.
What is one drawback of splitting data into training and testing sets?
It limits the amount of data available for training and testing, which can affect model performance.
Insufficient data can lead to non-representative samples.
What is a common solution to avoid over-optimistic evaluation in model testing?
Use a sufficiently large dataset to ensure representativeness after splitting.
This helps maintain data integrity for both training and testing phases.
What is the relationship between the size of the training data and the expected model performance?
Larger training data generally leads to better model performance.
More data helps the model generalize better to unseen data.
What is the drawback of partitioning data for training and testing?
Losing some data for the induction and testing process.
This can lead to a less reliable model if the dataset is small.
Why is more data desirable in model training?
To maintain reliability and avoid issues from limited data when making training and testing cuts.
A larger dataset helps in achieving better generalization.
What is cross validation?
A model evaluation technique used to approximate generalization accuracy; it does not itself produce the final predictive model.
It involves partitioning data into subsets for training and testing.
How does cross validation improve model evaluation?
By conducting multiple experiments, it reduces the chance of bias from a single training/testing split.
This is especially useful when working with limited data.
What are the steps in performing 10-fold cross validation?
- Partition the data into 10 folds.
- Hold one fold out for testing.
- Use the remaining nine folds for training.
- Repeat, so each fold serves once as the test set.
Each portion of data serves as both training and testing at different times.
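A minimal sketch of these steps, assuming scikit-learn's KFold and a decision tree (both illustrative choices):

```python
# 10-fold cross validation: each fold serves once as the test set.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on the held-out fold
print(np.mean(scores))  # averaging the 10 results approximates generalization accuracy
```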
What is the benefit of averaging the results in cross validation?
It mitigates the effects of outliers and provides a more reliable accuracy estimate.
Averaging across folds helps smooth out inaccuracies from any one fold.
What is the potential disadvantage of increasing the number of folds in cross validation?
It can lead to very small testing sets, which may not be representative of the entire dataset.
This diminishes the effectiveness of the cross validation process.
What happens in leave-one-out cross validation?
One record is held out as the test set while the rest are used for training.
This is the extreme case of a small test set: each test set contains just a single record.
What is a key consideration when using limited data in cross validation?
Each model induced will be similar, but care must be taken to ensure the test set is adequately sized.
Smaller datasets could lead to biased results if the test set is too small.
True or False: Cross validation is used for building predictive models.
False.
Cross validation is primarily an evaluation technique.
Fill in the blank: Cross validation aims to approximate _______.
generalization accuracy.
This is crucial for assessing model performance on unseen data.
What is the main purpose of cross validation in model evaluation?
To assess the performance of a model using different subsets of data
True or False: Cross validation is an inducing technique for models.
False
In the context of cross validation, what does partitioning a small set of data allow for?
It allows for evaluation experiments without producing a final predictive model.
What happens to the model’s performance when using cross validation?
It helps mitigate outliers by averaging results
Fill in the blank: In cross validation, you never use the same examples for both ______ and ______.
training, testing
What is the significance of having a satisfactory cross validation accuracy?
It indicates the model is likely to perform well
What is an example of model parameters mentioned in the text?
Max depth of 5, min sample leaves of 50
What are the two main phases of model building as discussed?
Inducing a model and evaluating the model
How does cross validation improve the evaluation process when data is limited?
It lets every example serve in both training and testing roles, so no data is permanently sacrificed to a single fixed split.
What happens to the model’s structure when evaluated on testing data versus cross validation?
The model remains largely the same in both evaluations
What does each fold in cross validation consist of?
A separate training set and testing set
True or False: All data is used for both training and testing in each fold of cross validation.
False
What can be concluded about the evaluation techniques discussed?
They serve to assess model performance effectively
When is cross validation typically applied in the model building process?
During the evaluation phase of the model
What is the formula for calculating classification accuracy?
S/N: S is the number accurately classified by the model, and N is the total number of examples.
What is the difference between training accuracy and test accuracy?
Training accuracy is the model’s performance on training examples; test accuracy is the model’s performance on out-of-sample data.
What is the common practice for partitioning data for model training and testing?
It is common to use ⅔ of the data for training and ⅓ for testing.
What is a learning curve in predictive analytics?
It characterizes how test accuracy improves as the training set size increases.
What is Cross Validation (CV)?
CV is an experiment that provides a good approximation of generalization performance for a model.
What are the steps involved in N-Fold Cross Validation?
- Randomly partition data into N equally sized sets (folds)
- Perform N experiments of model building and evaluation
- Hold out one fold as the test set in each experiment
- Induce a model from the remaining folds
- Evaluate performance on the test set
- Average the performance of the N experiments
What is overfitting in predictive modeling?
Overfitting occurs when a model captures not only regularities in the data but also peculiarities, undermining its predictive performance.
What is the purpose of a validation set?
A validation set is used to decide which subtrees to prune in a model.
What happens when training error decreases while validation error increases?
It indicates that the model is likely overfitting the training data.
Define precision in the context of classification models.
Precision is the ratio of true positives to the total predicted positives: True Positives/(True Positives + False Positives).
Define recall in the context of classification models.
Recall is the ratio of true positives to the total actual positives: True Positives/(True Positives + False Negatives).
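A small sketch of both ratios from raw confusion-matrix counts (the counts are made up):

```python
# Precision and recall from confusion-matrix counts (illustrative values).
tp, fp, fn = 30, 10, 20

precision = tp / (tp + fp)  # 30/40 = 0.75: of the predicted positives, how many are real
recall = tp / (tp + fn)     # 30/50 = 0.60: of the actual positives, how many are found
print(precision, recall)
```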
What is the trade-off between precision and recall?
As precision increases, recall tends to decrease.
What is a Lift Chart used for?
A Lift Chart is used to determine if a model is better at ranking customers than random ranking.
What does the Receiver Operating Characteristic (ROC) curve illustrate?
The ROC curve illustrates the performance of a binary classifier as its discrimination threshold varies.
What does the area under the ROC curve (AUC) indicate?
AUC summarizes the overall performance of a model; a value of 1.0 indicates perfect performance, while 0.5 indicates random guessing.
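A quick sketch using scikit-learn's roc_auc_score (one of several routines that compute this; the labels and scores are made up):

```python
# AUC from true labels and class-probability scores (illustrative data).
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]  # model's estimated P(class = 1)
print(roc_auc_score(y_true, y_score))     # 1.0 = perfect ranking, 0.5 = random
```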
What is the role of Class Probability Estimation (CPE)?
CPE shows the probability that a given example will belong to a certain class.
Fill in the blank: The training set is used to grow a tree to its _______.
[max size]
True or False: The validation set is the same as the test set.
False
What is the benchmark for classification accuracy?
The base rate: the accuracy achieved by classifying all examples into the majority class.
What is the importance of evaluating model performance on test samples?
To detect overfitting and ensure the model generalizes well to unseen data.
What is an example or instance in the context of data mining?
A fact that typically includes a set of attributes and an output variable.
What is a data set?
A set of examples.
What is training data?
Data used to induce (train) a model.
What are attributes in data mining?
Independent variables.
What is the target variable in data mining?
The dependent variable.
What is the purpose of analyzing customer data in predictive analytics?
To induce patterns common among customers who have terminated or extended their contracts.
Define ‘pattern’ in the context of data mining.
A conclusion drawn from data that predicts an outcome based on certain conditions.
What does induction or inductive learning refer to?
A method or algorithm used to induce a pattern from a set of examples.
What is linear regression in data mining?
An induction algorithm that predicts a dependent variable based on independent variables.
What is a model in data mining?
A general pattern induced from data that describes the data in concise form.
What is the objective of a predictive model?
To estimate or predict an unknown value.
What is supervised learning?
Model induction followed by inference using the model to predict.
Define unsupervised learning in data mining.
Clustering/segmentation that organizes instances into cohesive groups without predicting an unknown value.
What type of questions can data mining answer regarding customer behavior?
- What products are commonly bought together?
- What is a customer likely to buy next?
- How likely is a customer to respond to a marketing campaign?
What does classification refer to in data mining?
A predictive model where the target variable is discrete (categorical).
What does a classification model provide as a by-product?
The probability that the case belongs to each category.
What is a classification tree?
A classification model that includes a set of IF/THEN rules.
What is regression in data mining?
A predictive model that predicts the value of a numerical variable.
What is clustering/segmentation analysis?
Unsupervised learning that identifies distinct groups of similar instances.
What is the purpose of association rules in data mining?
To find relations among attributes in the data that frequently co-occur.
What is sequence analysis in data mining?
Finding patterns in time-stamped data.
Fill in the blank: A learner in data mining is also known as a _______.
[induction algorithm].
True or False: Supervised learning is used to predict unknown values.
True.
True or False: Unsupervised learning requires labeled data.
False.
What is a model?
A concise description of a pattern (relationship) that exists in the data.
What do classification models predict?
They predict (estimate) an unknown value of interest, which is a categorical variable.
Examples of classification tasks include:
- Customer retention (CRM)
- Marketing
- Risk management
- Financial trading
What is a classification tree?
A predictive model represented as a tree that is used for classification tasks.
Why are classification trees popular?
They are easy to understand, computationally fast to induce from data, and are the basis of high-performing modeling techniques.
What do non-terminal nodes in a classification tree represent?
Tests on an attribute.
What do terminal nodes (leaves) in a classification tree provide?
A prediction and a distribution over the classes.
In a classification tree, what is the outcome when a leaf node is reached?
A class prediction is made.
How are rules extracted from a classification tree?
Each path from the root to a leaf node constitutes a rule.
What is the classification tree model used for in tax compliance?
To predict whether an incoming tax report is noncompliant.
What is the purpose of partitioning in classification tree induction?
To create subgroups that are purer with respect to the class than the original group.
What are good predictors in classification tree induction?
Attributes that help partition the examples into purer sub-groups.
What is Information Gain (IG)?
A measure that captures how informative an attribute is for distinguishing between instances of different classes.
What does entropy measure in the context of classification trees?
The impurity in a dataset.
What is a classification tree induction algorithm?
An algorithm used to construct decision trees from datasets.
Fill in the blank: A classification model predicts a categorical variable, known as a _______.
[class]
True or False: Classification trees are computationally slow to induce from data.
False
What is a subtree in a classification tree?
A branching from a node that captures predictive patterns for a sub-population.
What is the goal of partitioning customers in classification tree induction?
To achieve increasingly purer class distribution in subgroups.
What are some examples of popular tree induction algorithms?
- ID3
- C4.5
- CART
What is the first step in applying a classification tree to predict a class?
Start from the root of the tree.
What is the significance of the average monthly pay and age in a classification tree?
They are attributes used to make decisions at each node.
What does a classification tree model predict regarding customer behavior?
Whether a customer will switch or stay.
What does Information Gain (IG) quantify?
The improvement in purity (reduction in impurity) gained by splitting the population into subgroups.
What does Entropy measure?
Impurity in a group of examples.
What is the relationship between Entropy and predictive accuracy?
Higher entropy indicates higher uncertainty about class membership.
How is Entropy calculated?
Entropy = -Σ(Pi * log2(Pi)) where Pi is the proportion of class i.
What is the formula for Information Gain?
Information Gain = Impurity(parent) – Weighted Avg. Impurity(children).
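A worked sketch of both formulas (the split counts are illustrative):

```python
# Entropy and Information Gain for a binary split (illustrative counts).
from math import log2

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * log2(p) for p in (pos / total, neg / total) if p > 0)

# Parent: 10 positive / 10 negative; children after the split: (8, 2) and (2, 8).
parent = entropy(10, 10)                                        # 1.0, maximum impurity
children = (10 / 20) * entropy(8, 2) + (10 / 20) * entropy(2, 8)
print(parent - children)                                        # Information Gain ≈ 0.278
```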
What is the goal of recursive partitioning in classification trees?
To improve predictive accuracy by creating purer subgroups.
What are some stopping rules for tree partitioning?
- Maximum purity reached
- All attributes used along the path
- No information gain from additional splits
What is the objective of recursive partitioning?
To predict with high certainty.
What is a potential risk of recursive partitioning?
Finding incidental patterns in small subgroups that do not generalize.
What are key strengths of classification trees?
- Flexible inductive technique that can capture complex patterns
- Computationally cheap
- Easy for stakeholders to understand
What are the attributes considered in the basketball prediction example?
- Game location (Home/Away)
- Starting time
- Player positions and roles
- Opponent’s center height
What is a regression tree?
A model built using recursive partitioning for predicting numerical variables.
Fill in the blank: Information Gain captures how much _______ the sub-groups are compared to the original group.
[purer]
True or False: The order of attributes split on in classification trees does not matter.
False
What does a classification tree model aim to achieve at prediction time?
Predict with high certainty.
What is the objective of model evaluation?
To determine how good the model is in predictive performance
This includes understanding the model’s accuracy and appropriateness for various objectives.
What does the classification accuracy rate measure?
Proportion of examples whose class is predicted accurately by the model
Calculated as S/N, where S is the number of examples accurately classified and N is the total number of examples.
What is the consequence of measuring classification accuracy on training data?
It tends to result in an over-optimistic estimation of the model’s future performance
This is because the model is evaluated on the same data it was trained on.
What should examples used to evaluate the model be?
Examples that were not used to induce the model and whose class is known
This ensures accurate assessment of the model’s predictive capabilities.
What is the common practice for splitting data into training and test sets?
2/3 of examples for training and 1/3 for testing
This ensures a balance between training the model and evaluating its performance.
What is test accuracy?
An estimation of how well a model induced from training data will predict the class of examples in the population
It is also known as generalization accuracy.
What is N-fold cross-validation?
An experiment to approximate the generalization performance of a model by partitioning data into N equally-sized sets
It helps in evaluating the model’s performance by averaging results across multiple training and test sets.
How does N-fold cross-validation work?
- Partition data into N folds
- Perform N experiments, each time holding out one fold as the test set
- Average the performance results of all experiments
This method provides a reliable estimate of the model’s performance.
What are the advantages of N-fold cross-validation for small samples?
It allows for a training set size very similar to the original sample, leading to a model that is likely very similar to the one induced from the complete sample
This minimizes discrepancies in model performance between small and full datasets.
What is a learning curve?
Characterizes how test accuracy improves as the training set size increases
Particularly relevant for methods like classification trees and neural networks.
What is the implication of using a smaller training set?
It may lead to an overly pessimistic evaluation of the model’s performance
If learning has not plateaued, the model may not perform as well as it could with a larger training set.
What happens when the test set is too small?
It may not be representative of the population
This can compromise the accuracy of the model’s evaluation.
True or False: Overfitting cannot be detected if we evaluate the model using the training data.
True
Evaluating on training data only shows that the model improves as it expands, without revealing overfitting.
What is the purpose of cross-validation?
To approximate how well a model will perform when applied to the population
This involves using multiple folds to ensure a robust evaluation.
What is cross-validation?
A technique used to evaluate the performance of a model by partitioning data into training and test sets.
How can overfitting be detected?
By evaluating model performance on a representative test sample.
What is overfitting?
When a model performs well on training data but poorly on unseen data due to excessive complexity.
Why is measuring prediction error on the training set insufficient?
It does not reveal whether the model has overfitted the training data.
What happens to generalization performance as a model expands?
Generalization performance may decrease even if training performance increases.
What is the purpose of a validation set?
To decide which sub-trees to prune after growing the tree using the training set.
How is pruning performed on classification trees?
Bottom up: a subtree is pruned if the pruned tree’s performance on the validation set is not worse than that of the unpruned tree.
What is underfitting?
When a model is too simple to capture the complex patterns in the data.
What is precision in the context of model evaluation?
The proportion of true positive predictions among all positive predictions made by the model.
What is recall (True Positive Rate)?
The proportion of actual positive cases that are correctly predicted by the model.
What does a confusion matrix show?
The different types of errors that the model makes and their frequency.
What is a benchmark for a model’s classification accuracy rate?
The majority base rate, which is the proportion of examples from the majority class.
What are asymmetric error costs?
Costs that differ based on the type of error made by a classifier.
How can cost-sensitive evaluation improve model assessment?
By considering the actual costs of different types of errors rather than treating all errors equally.
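A minimal sketch of cost-sensitive comparison (all counts and costs are made up):

```python
# Total cost of errors from a confusion matrix with asymmetric costs (illustrative numbers).
fp, fn = 40, 15                # error counts on the test set
cost_fp, cost_fn = 1.0, 10.0   # here a missed positive costs 10x a false alarm

print(fp * cost_fp + fn * cost_fn)  # 190.0: compare models by total cost, not raw accuracy
```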
What is Class Probability Estimation (CPE)?
The estimated probability that an example belongs to a certain class provided by classification models.
What is the significance of ranking customers by predicted probability of response?
It helps in targeting the most likely responders in marketing campaigns.
What is the main goal of using a classification model in direct marketing?
To decide which customers to target for a campaign based on historical data.
What is the relationship between training accuracy and test accuracy?
Training accuracy may be high while test accuracy may be low if the model overfits.
Fill in the blank: Overfitting is particularly common in _______.
classification tree models.
True or False: A high classification accuracy always indicates a useful model.
False.
What is the purpose of using a model to rank customers for targeting?
To predict probability of response and rank customers by their likelihood to respond.
What does the y-axis represent in a lift chart?
The number (or percent) of responses.
What does the x-axis represent in a lift chart?
The number of solicitations (or percent of solicitations out of the total number of customers).
True or False: Lift charts can help determine whether a predictive model is better at ranking customers than random ranking.
True.
What is represented by the straight line in a lift chart?
Random ranking of customers.
Why are most lift charts concave?
Because the model targets the most likely responders first, the incremental gain in responses decreases as more customers are targeted.
What does ‘lift’ refer to in the context of lift charts?
The improvement in response rates achieved by using the model compared to random selection.
How can lift charts be evaluated?
By comparing the lift of different classifiers to determine which is better for ranking customers.
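A sketch of the cumulative-response numbers behind a lift chart (scores and outcomes are made up):

```python
# Lift-chart data: rank customers by predicted probability, accumulate actual responses.
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # model scores (illustrative)
responded = [1, 1, 0, 1, 0, 0, 1, 0]              # actual outcomes

ranked = sorted(zip(probs, responded), reverse=True)
cumulative = 0
for i, (_, r) in enumerate(ranked, start=1):
    cumulative += r
    # y-axis: cumulative responses; x-axis: solicitations. Random ranking averages
    # i * (4/8) responses, which is the straight baseline line.
    print(f"top {i}: {cumulative} responses")
```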
What is a profit lift chart?
A chart that factors in targeting costs and revenue, plotting cumulative profit against the number of solicitations.
What is the shape of a profit lift chart curve typically, and why?
It typically rises to a peak and then declines: beyond some point, the cost of soliciting additional, less likely responders exceeds the revenue they generate (diminishing returns).
What does the area under the ROC curve (AUC) indicate?
It summarizes a model’s overall ranking performance in a single number, making it easy to compare classifiers and to assess the impact of changes made to the model.
What is precision in the context of customer prediction models?
The proportion of predicted buyers that are actually buyers.
What is recall in customer prediction models?
The proportion of actual buyers that are predicted as such by the model.
Fill in the blank: A lift chart allows us to diagnose the effectiveness of a model at ranking customers by the likelihood they belong to an important class (e.g., _______ or switchers).
buyers
What is the tradeoff between precision and recall when increasing the threshold for targeting customers?
Increasing the threshold generally increases precision but decreases recall.
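A short sketch of this tradeoff: sweeping the threshold over made-up scores, precision rises while recall falls:

```python
# Precision/recall tradeoff as the targeting threshold increases (illustrative data).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
actual = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(a and q for a, q in zip(actual, pred))
    print(threshold, tp / sum(pred), tp / sum(actual))  # precision rises, recall falls
```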
What is the classification accuracy rate?
The rate at which the model correctly predicts the class of customers.
What is the importance of estimating performance on an out-of-sample set?
It provides an unbiased assessment of the model’s performance.
What is the significance of the Precision/Recall Curve (PRC)?
It shows the tradeoff between precision and recall for different thresholds.
What are the two possibilities for performance estimation?
Partitioning data into train/test sets or using cross-validation.
What should be considered when measuring performance in relation to business objectives?
The alignment between business objectives and the metrics being measured.
What is the recommendation for targeting customers with costly incentives?
Strategies with high precision are more desirable.
What is the role of the confusion matrix in performance measurement?
It helps calculate the costs of errors when error costs are asymmetric and known.
What is Machine Learning primarily used for?
Predictive techniques for business decisions.
What impact has Machine Learning had on business over the last two decades?
It has significantly improved predictions of future behaviors, values, and trends.
What types of data are commonly used in Machine Learning?
- Consumer behavior data
- Financial data
- Employee data
- Health care data
- Oil & gas, energy data
What are some examples of consumer behavior data?
- GPS
- Internet use (weblogs)
- Social media postings
- Online purchases
What kinds of predictions can companies make using Machine Learning?
- Likelihood of customer response to products
- Loan default probabilities
- Fraudulent credit transactions detection
- Employee satisfaction and retention predictions
- Health predictions (e.g., diabetes risk)
What characterizes Machine Learning as a general-purpose technology?
It finds patterns in data and informs a wide variety of problems.
How does Machine Learning differ from traditional statistical models?
Machine Learning can handle various data types and patterns beyond just numerical data.
What is the goal of this course on Machine Learning?
- Develop understanding of ML fundamentals
- Identify opportunities for business value
- Evaluate ML solutions rigorously
What is WEKA?
An award-winning Java-based machine learning tool with a graphical user interface.
What are the course requirements for this Machine Learning course?
- Textbook and readings
- Class notes
- Individual/group assignments
- In-class quizzes
- Final Exam
What is the purpose of predictive models in Machine Learning?
To find relationships in data and predict unknown or future values.
Fill in the blank: A _______ predictive model uses conditions to predict customer behavior.
rule-based
What are major application areas for predictive modeling?
- Marketing
- Finance and Risk Management
- Healthcare
- Fraud Detection
- Cyber Security
What role does predictive analytics play in data-driven healthcare?
It produces a list of possible causes based on patient information.
True or False: Machine Learning is only applicable in the finance sector.
False
What is a common use of predictive analytics in finance?
Credit risk scoring.
What is the significance of the FICO Score?
It is a measure of credit risk.
What has led to the explosion of machine learning applications in recent years?
Advances in computing technology and in data availability; as a result, the impact of machine learning applications on practice has increased significantly over the past 5 years.
Why didn’t the significant impact of machine learning occur 20 years ago?
The specific reasons are not detailed, but advancements in technology and data availability are implied.
What is fact-based decision-making?
Decisions made by analysis, often considered the best kind of decisions.
Who emphasized the importance of fact-based decision-making?
Jeff Bezos
What challenge did the telecom firm Telco face?
700K customers switched to competitors once their contracts expired.
What can machine learning predictions inform in marketing campaigns?
They can inform and benefit the campaign strategies.
What is the foundation of any machine learning project?
Careful and thoughtful problem formulation.
Who should be included in problem formulation for machine learning projects?
Problem owners and domain experts.
What are the two key functions of data preparation?
- Identifying informative data
- Data cleaning, correction, and representation
What is an example of a predictor that may be useful for predicting churn?
Customer demographics, experience with the firm, recent life changes.
What percentage of an overall machine learning project can data preparation consume?
Can be up to 80% of the overall project’s time.
What is a critical question to evaluate a machine learning model?
How good is the model?
What should be estimated before deploying a model?
The expected impact of the modeling solution on relevant business objectives.
What is essential to consider when evaluating a model?
The context and relevant measures.
Is machine learning a magic wand?
No, it offers a set of methodologies that must be used correctly.
What can lead to poor predictions despite high accuracy in a model?
The implicit assumption that past patterns will be valid in the future.
What is a potential issue with predictive models based on historical data?
They may not perform well if the economic conditions change.
What must training data represent?
The data to which the model will be applied.
What are some challenges associated with machine learning?
- Ethical challenges
- Privacy challenges
Fill in the blank: Machine learning offers a set of _______.
[methodologies]
True or False: Data preparation is not a resource-intensive process.
False
What challenges do managers face regarding algorithms?
Managers ought to be diligent about the risks posed by algorithms
Algorithms can exhibit bias, which is a significant concern in predictive modeling.
What types of data are relevant for predictive modeling?
Data from our social media interactions, emails, homes (like Nest), and GPS information
These data sources are crucial for creating effective predictive models.
What should be assessed alongside the benefits of modeling?
How the modeling will be perceived and any potential resistance
Understanding perception and resistance is vital for successful implementation.
What is integral to a business proposition involving predictive analytics?
Monetization of data
This strategy can help acquire significant data that is hard for competitors to replicate.
What is the ‘data race’?
The competition among entities to acquire and utilize data effectively
This race is critical for businesses looking to leverage predictive analytics.
What should managers consider about the risks of algorithms?
Managers should consider the potential for bias in algorithms
It is essential to mitigate these biases to ensure fair outcomes.
Fill in the blank: The __________ of data is crucial for predictive analytics.
monetization
Monetization strategies can drive the acquisition of valuable data.
What might go wrong with predictive modeling?
Algorithms can exhibit bias
Bias can lead to inaccurate predictions and reinforce existing inequalities.
What devices/apps collect potentially valuable data?
Examples include smart home devices, social media platforms, and GPS applications
These tools can provide insights that enhance predictive modeling.
What is the sampling method used in bagging?
Each bootstrap sample is formed by drawing N instances with replacement from the original training data set, where N is the training set size.
What does ‘with replacement’ mean in sampling?
Once an instance is drawn, it is placed back into the pool.
What does ‘without replacement’ mean in sampling?
Once an instance is drawn, it is removed from the pool.
Can an instance be drawn to the same sample more than once in bagging?
Yes, because sampling is done with replacement.
What is bagged trees?
An ensemble method that builds multiple trees from different samples of the data.
What is the process for making predictions with an ensemble of models?
Each tree generates a prediction, and the predictions are combined to produce a single prediction.
How are predictions combined in an ensemble?
By majority vote.
Why might bagging improve predictive accuracy?
It reduces the risk of overfitting by averaging multiple models.
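A compact sketch of the whole procedure, assuming scikit-learn trees (names and sizes are illustrative):

```python
# Bagged trees: bootstrap samples (drawn with replacement) combined by majority vote.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):                             # an ensemble of 25 trees
    idx = rng.integers(0, len(X), size=len(X))  # N draws with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])     # each tree votes (here on training data)
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # combine by majority vote
print((majority == y).mean())
```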
What is the effect of outliers on an ensemble’s prediction?
In principle an outlier can skew an ensemble’s prediction, but bagging makes this unlikely, as the probabilities below show.
What is the probability of not selecting an outlier in a single draw?
999/1000.
What is the probability of not drawing the outlier at all in 1000 draws?
(999/1000)^1000 ≈ 0.368.
What is the probability that a sample includes at least one copy of the outlier?
0.632.
What is the likelihood that a specific 60 of the 100 bagged samples include the outlier and the other 40 do not?
The probability is (0.632)^60 × (0.368)^40.
How many combinations exist for 60 samples including the outlier and 40 samples not including it?
100!/(60! × 40!) ≈ 1.37 × 10^28.
What is the overall probability that the outlier is in exactly 60 of the 100 samples?
0.067.
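The arithmetic on these cards can be checked directly; a quick sketch:

```python
# Checking the outlier arithmetic: 1000 training records, one outlier, 100 bagged samples.
from math import comb

p_miss = (999 / 1000) ** 1000  # a sample excludes the outlier entirely: ~0.368
p_hit = 1 - p_miss             # a sample includes it at least once: ~0.632

# P(exactly 60 of the 100 samples include the outlier):
print(comb(100, 60) * p_hit**60 * p_miss**40)  # ~0.065; the card's 0.067 reflects rounding
```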
What is a key benefit of bagging?
It reduces the risk of overfitting by filtering outliers.
Is the Bagging Model more effective at improving accuracy with large or small data sets?
Bagging is more effective with larger data sets.
What is the diminishing effect of outliers in bagging?
Bagging diminishes the adverse effects of outliers on the final model’s prediction.
What is a key advantage of bagged classification trees?
Can capture complex patterns and predictions are less likely to be undermined by overfitting
Bagged trees improve stability and accuracy by combining multiple trees.
What is a disadvantage of bagged classification trees?
Less simple model: a ‘black box’ that is not as comprehensible as a single tree model
This complexity can hinder interpretability.
What are the implications of having a small number of examples and many attributes in labor data?
Increases the risk of overfitting
The relationship between the number of attributes and the risk of overfitting is critical in model training.
What are the two necessary conditions for any modeling technique to overfit?
- The presence of accidental patterns (such as outliers) in the training data
- The availability of attributes that allow the model to capture those patterns
Outliers can distort the learning process, while too many attributes can lead to complex models that do not generalize well.
How does Random Forest reduce the risk of overfitting?
It combines alleviating the effect of outliers and reducing the risk that certain features contribute to overfitting
Random Forest addresses both issues by using a subset of attributes for each tree.
What is the key difference between Random Forest and bagging?
In Random Forest, only a subset of randomly selected attributes is considered at each split
This approach helps prevent the same attribute from being used to fit accidental patterns across multiple trees.
What is the rationale behind randomly removing attributes in Random Forest?
It is less likely that the same attribute will be used to fit an accidental pattern in the data by most trees in the ensemble
This randomness can enhance the model’s robustness.
In a Random Forest model, how are attributes selected?
4-6 attributes are randomly selected to be considered at each split
This selection process is crucial for the model’s performance.
What should be considered when determining a good number of trees to use in an ensemble?
The trade-off between computational efficiency and model accuracy
More trees can lead to better performance but also increase computation time.
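A brief sketch assuming scikit-learn's RandomForestClassifier; max_features=5 mirrors the 4-6 attribute range above, and n_estimators sets the number of trees:

```python
# Random Forest: bagging plus a random attribute subset considered at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(
    n_estimators=100,  # more trees: better accuracy but more computation
    max_features=5,    # attributes considered at each split
    random_state=0,
)
print(cross_val_score(model, X, y, cv=10).mean())  # 10-fold CV accuracy
```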
True or False: Bagging or Random Forest can improve a classification technique that does not tend to fit the data too well.
False
Bagging and Random Forest work by reducing overfitting; a technique that does not tend to fit the data too well (i.e., does not overfit) gains little from them.
What is typically higher, a model’s training accuracy or its test accuracy?
A model’s training accuracy is typically higher than its test accuracy
This reflects that a model fits the training data better than unseen data.
True or False: A model’s training accuracy is always the same as the model’s test accuracy.
False
Training and test accuracies usually differ.
When comparing the performances of two classification models, what does higher training accuracy imply?
It does not necessarily imply better predictive performance
Higher training accuracy can indicate overfitting.
What should be ensured about a model’s test accuracy rate when evaluating its predictive accuracy?
It should be higher than the rate of the majority class
This provides a useful benchmark for model performance.
In a predictive model for customer classification, which is a recommended practice?
Select the model with the highest test accuracy on a representative out-of-sample set
Test accuracy reflects generalization; higher training accuracy does not imply better predictive performance.
True or False: A model can be evaluated strictly by its performance on a training set.
False
Evaluation should focus on out-of-sample representative data.
What does classification tree pruning aim to improve?
A classification tree’s out-of-sample predictive performance
This is achieved by removing sub-trees that overfit the training data.
When comparing classification models for credit risk, what is a relevant measure?
Classification accuracy rate
This is pertinent if the costs of misclassifying good and bad risks are equivalent.
What does it indicate if a model’s training accuracy is higher than its test accuracy?
Some overfitting has likely occurred
Overfitting captures patterns in the training data that do not generalize.
Fill in the blank: Overfitting occurs when a model captures patterns that are _______.
idiosyncratic to the training data
This leads to improved training performance at the cost of test performance.
Which statement about training and test accuracies is generally true?
Training accuracy is often higher than test accuracy
This is a common phenomenon in machine learning models.