Midterm Flashcards
(260 cards)
What is the classification accuracy rate?
The number of correctly predicted instances out of all instances in your data.
Formula: S/n, where S is the number of accurately classified examples and n is the total number of examples.
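The S/n formula above can be sketched in a few lines of Python (the label lists are made up for illustration):

```python
# Minimal sketch of the accuracy formula S/n:
# S = number of correctly classified examples, n = total examples.
def classification_accuracy(true_labels, predicted_labels):
    s = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return s / len(true_labels)

# Example: 4 of 5 predictions match the true labels, so S/n = 4/5.
print(classification_accuracy([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))  # 0.8
```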
Why can classification accuracy be misleading?
It may show high accuracy on training data, which does not reflect the model’s performance on unseen data.
High training accuracy may indicate overfitting.
What do we call the examples that were not used to induce the model?
Testing data.
Testing data is crucial for evaluating model performance on unseen data.
What are the two main data partitions used in model training?
- Training data
- Testing data
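A minimal sketch of producing these two partitions from one dataset (the 80/20 ratio and the seed are assumptions for illustration, not from the cards):

```python
# Randomly partition a dataset into training and testing data.
import random

def split(examples, test_fraction=0.2, seed=42):
    shuffled = examples[:]                    # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)     # shuffle before cutting
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]     # (training data, testing data)

data = list(range(100))
train, test = split(data)
print(len(train), len(test))  # 80 20
```

Shuffling before the cut matters: without it, any ordering in the data (e.g. by class) would make the partitions non-representative.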
What is generalization accuracy?
An estimation of how well your model predicts the class of examples from a different data set.
Also known as test accuracy.
What is the learning curve?
A graphical representation showing how model accuracy improves as the training set size increases.
X-axis: sample size of training data; Y-axis: accuracy of the model on testing data.
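The bookkeeping behind a learning curve can be sketched as below. The learner here is a hypothetical threshold rule (an assumption for illustration): predict class 1 when x exceeds the mean x seen in training. The exact shape of the curve depends on the learner and the data; this only shows the procedure of training on growing subsets and testing on a fixed set.

```python
# Sketch of computing learning-curve points:
# train on increasingly large subsets, evaluate on a fixed test set.
import random

def fit_threshold(train):
    # "Model" = mean x of the training sample (a toy stand-in learner).
    return sum(x for x, _ in train) / len(train)

def accuracy(theta, test):
    return sum((x > theta) == (y == 1) for x, y in test) / len(test)

def make_example(rng):
    x = rng.random()
    return (x, 1 if x > 0.5 else 0)   # true rule: class 1 when x > 0.5

rng = random.Random(0)
pool = [make_example(rng) for _ in range(200)]   # training pool
test = [make_example(rng) for _ in range(100)]   # fixed test set

# X-axis: training sample size; Y-axis: accuracy on the test set.
curve = [(size, accuracy(fit_threshold(pool[:size]), test))
         for size in (5, 20, 80, 200)]
print(curve)
```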
True or False: More data generally improves model performance.
True.
More data allows the model to learn better and reduces the risk of overfitting.
What happens to model accuracy as training data increases?
Model accuracy generally increases until it plateaus.
This indicates diminishing returns on accuracy with additional data.
What is one drawback of splitting data into training and testing sets?
It limits the amount of data available for training and testing, which can affect model performance.
Insufficient data can lead to non-representative samples.
What is a common solution to avoid over-optimistic evaluation in model testing?
Use a sufficiently large dataset to ensure representativeness after splitting.
This helps maintain data integrity for both training and testing phases.
What is the relationship between the size of the training data and the expected model performance?
Larger training data generally leads to better model performance.
More data helps the model generalize better to unseen data.
What is the drawback of partitioning data for training and testing?
Losing some data for the induction and testing process.
This can lead to a less reliable model if the dataset is small.
Why is more data desirable in model training?
To maintain reliability and avoid issues from limited data when making training and testing cuts.
A larger dataset helps in achieving better generalization.
What is cross validation?
A model evaluation technique used to estimate generalization accuracy; it does not itself produce the final predictive model.
It involves partitioning data into subsets for training and testing.
How does cross validation improve model evaluation?
By conducting multiple experiments, it reduces the chance of bias from a single training/testing split.
This is especially useful when working with limited data.
What are the steps in performing 10-fold cross validation?
1. Partition the data into 10 folds.
2. Hold out one fold for testing.
3. Use the remaining nine folds for training.
4. Repeat until each fold has served as the test set once.
Each portion of data serves as both training and testing at different times.
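The steps above can be sketched as follows; the majority-class learner is a toy stand-in for a real model, and the 70/30 label mix is made up:

```python
# Sketch of 10-fold cross-validation bookkeeping.
from collections import Counter

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

labels = [1] * 70 + [0] * 30   # toy dataset of 100 labels

accuracies = []
for train_idx, test_idx in k_fold_indices(len(labels), 10):
    # "Train": the majority class of the 9 training folds.
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    # "Test": score that prediction on the held-out fold.
    correct = sum(labels[i] == majority for i in test_idx)
    accuracies.append(correct / len(test_idx))

# Average across the 10 folds for the final accuracy estimate.
print(sum(accuracies) / len(accuracies))  # 0.7
```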
What is the benefit of averaging the results in cross validation?
It mitigates the effects of outliers and provides a more reliable accuracy estimate.
Averaging across folds helps smooth out inaccuracies from any one fold.
What is the potential disadvantage of increasing the number of folds in cross validation?
It can lead to very small testing sets, which may not be representative of the entire dataset.
This diminishes the effectiveness of the cross validation process.
What happens in leave-one-out cross validation?
One record is held out as the test set while the rest are used for training.
By definition each test set is a single record, so the estimate is averaged over n runs; this is computationally expensive for large datasets but uses nearly all the data for training.
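Leave-one-out can be sketched with a toy 1-nearest-neighbor rule on 1-D values (the learner and data are assumptions for illustration):

```python
# Leave-one-out cross validation: each record is the test set exactly once.
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1)]  # (value, label)

correct = 0
for i, (x, y) in enumerate(data):
    train = data[:i] + data[i + 1:]                 # all records but one
    nearest = min(train, key=lambda p: abs(p[0] - x))  # 1-NN prediction
    correct += (nearest[1] == y)

print(correct / len(data))  # LOOCV accuracy estimate: 1.0
```

Note this is just k-fold cross validation with k equal to the number of records.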
What is a key consideration when using limited data in cross validation?
Each model induced will be similar, but care must be taken to ensure the test set is adequately sized.
Smaller datasets could lead to biased results if the test set is too small.
True or False: Cross validation is used for building predictive models.
False.
Cross validation is primarily an evaluation technique.
Fill in the blank: Cross validation aims to approximate _______.
generalization accuracy.
This is crucial for assessing model performance on unseen data.
What is the main purpose of cross validation in model evaluation?
To assess the performance of a model using different subsets of data
True or False: Cross validation is an inducing technique for models.
False