Lecture 3 Flashcards
What are the basic steps in offline machine learning?
1. Abstract the problem to a standard task (classification, regression, etc.). 2. Choose instances and features. 3. Choose a model class. 4. Search for a good model.
What is binary classification?
A classification task with two classes: positive and negative.
What is classification error?
The proportion of misclassified examples.
What is classification accuracy?
The proportion of correctly classified examples.
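For example, both metrics come from the same element-wise comparison of predictions and labels; a minimal sketch, assuming NumPy and made-up labels:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # actual classes (illustrative)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions (illustrative)

accuracy = np.mean(y_true == y_pred)  # proportion correctly classified
error = np.mean(y_true != y_pred)     # proportion misclassified

print(f"accuracy = {accuracy:.3f}, error = {error:.3f}")  # the two sum to 1
```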
Why do we compare models?
To determine the best model for production use.
What is an example of hyperparameter tuning in kNN?
Choosing the number of neighbors (k).
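A hedged sketch of tuning k on a held-out validation set, assuming scikit-learn; the dataset and the candidate values of k are illustrative, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
# Withhold a test set first, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 15]:  # candidate hyperparameter values
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

print(f"best k = {best_k} (validation accuracy {best_acc:.3f})")
```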
What is the simplest way to compare two classifiers?
Train both, compute their errors, and pick the one with the lowest error.
Why is evaluating on training data misleading?
Because the model may overfit, performing well on training data but poorly on unseen data.
What is the purpose of a test set?
To evaluate model performance on unseen data.
What is the recommended minimum size for a test set?
At least 500 examples; ideally 10,000 or more.
What is the danger of testing many models on the same test set?
Overfitting to the test set due to multiple testing.
What is overfitting in model selection?
Choosing a model that performs well on a specific test set but generalizes poorly.
What is the modern approach to model evaluation?
1. Split data into train and test sets. 2. Choose the model and hyperparameters using the training data. 3. Test the model only once on the test data.
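A minimal sketch of this discipline, assuming scikit-learn with an illustrative dataset and model; the test set is scored exactly once, at the end:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# All model and hyperparameter choices happen on the training data only.
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# The test set is touched once, at the very end.
print("test accuracy:", model.score(X_test, y_test))
```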
Why shouldn’t test data be reused?
Reusing test data leads to selecting the wrong model and inflating performance estimates.
What is the purpose of a validation set?
To tune model hyperparameters without using the test set.
What is cross-validation?
A technique where training data is split into multiple subsets (folds) to validate model performance.
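A hedged sketch of 5-fold cross-validation with scikit-learn's cross_val_score; the dataset and model are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
# Each fold serves once as validation data while the model trains on the rest.
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```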
What is walk-forward validation used for?
For time-series data, ensuring training data precedes test data chronologically.
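A minimal sketch of walk-forward splits using scikit-learn's TimeSeriesSplit; the ten-point series is made up for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)  # samples ordered by time

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    # Training indices always precede test indices chronologically.
    print("train:", train_idx, "test:", test_idx)
```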
What is the difference between validation and evaluation?
Evaluation (on the test set) simulates production; validation simulates evaluation, so you can make modeling choices without touching the test set.
What are common hyperparameter tuning methods?
Trial-and-error (intuition), grid search, and random search.
Why is random search often better than grid search?
For the same budget of runs, random search samples a fresh value of every parameter each time, while grid search keeps repeating the same few values; in high-dimensional spaces where only a few parameters matter, random search covers those important parameters more efficiently.
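A hedged sketch contrasting the two with scikit-learn's GridSearchCV and RandomizedSearchCV; the model, dataset, and parameter ranges are illustrative, not recommendations:

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Grid search: 3 x 3 = 9 fits, but only 3 distinct values per parameter.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [1e-4, 1e-3, 1e-2]}, cv=3)
grid.fit(X, y)

# Random search: 9 fits again, but 9 distinct values per parameter.
rand = RandomizedSearchCV(
    SVC(),
    {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-5, 1e-1)},
    n_iter=9, cv=3, random_state=0,
)
rand.fit(X, y)

print("grid best:", grid.best_params_, "random best:", rand.best_params_)
```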
Why is statistical testing controversial in ML?
Large datasets often make statistical tests unnecessary, and replication is the best validation.
What is the difference between true accuracy and sample accuracy?
True accuracy is the actual probability of correct classification, while sample accuracy is the proportion of correctly classified test samples.
What does a confidence interval represent?
An interval constructed so that, over many repeated experiments, it contains the true value of the metric a specified proportion of the time (e.g., 95%).
What is the impact of test set size on confidence intervals?
Larger test sets produce narrower confidence intervals, increasing reliability.
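A rough numeric sketch of this effect, using the standard error of a proportion, sqrt(p(1-p)/n); the observed accuracy of 0.85 is made up for illustration:

```python
import math

p = 0.85  # observed sample accuracy (illustrative)
for n in [100, 500, 10_000]:
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # 95% interval half-width
    print(f"n = {n:6d}: 95% CI ~ {p:.3f} +/- {half_width:.3f}")
```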
What is Alpaydin’s 5x2 F test used for?
To test statistical significance when test sets are small.
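A hedged sketch of the test itself, assuming scikit-learn classifiers; the two models and the dataset are illustrative. Five replications of 2-fold cross-validation yield ten error differences, and the statistic follows an F(10, 5) distribution under the null hypothesis of equal error:

```python
from scipy.stats import f as f_dist
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = KNeighborsClassifier(), DecisionTreeClassifier(random_state=0)

num, den = 0.0, 0.0
for i in range(5):  # five replications of 2-fold cross-validation
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=i)
    diffs = []
    for Xtr, ytr, Xte, yte in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
        err_a = 1 - clf_a.fit(Xtr, ytr).score(Xte, yte)
        err_b = 1 - clf_b.fit(Xtr, ytr).score(Xte, yte)
        diffs.append(err_a - err_b)  # difference in error on this fold
    mean = sum(diffs) / 2
    num += diffs[0] ** 2 + diffs[1] ** 2          # sum of squared differences
    den += (diffs[0] - mean) ** 2 + (diffs[1] - mean) ** 2  # per-replication variance

f_stat = num / (2 * den)           # ~ F(10, 5) under H0: equal error rates
p_value = f_dist.sf(f_stat, 10, 5)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```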
What is the standard error of the mean (SEM)?
The standard deviation of the sample mean over repeated samples, estimated as the sample standard deviation divided by the square root of the sample size.
What is the 95% confidence interval formula for the mean?
Mean ± 1.96 × SEM.
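A minimal sketch of this formula in NumPy; the accuracy values, standing in for five repeated runs, are made up for illustration:

```python
import numpy as np

accs = np.array([0.87, 0.84, 0.86, 0.88, 0.85])  # e.g. five random restarts
mean = accs.mean()
sem = accs.std(ddof=1) / np.sqrt(len(accs))  # SEM = sample std / sqrt(n)
print(f"95% CI: {mean:.3f} +/- {1.96 * sem:.3f}")
```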
What are common meanings of error bars?
They can represent standard deviation, standard error, or confidence intervals.
What does overlap in error bars indicate?
If error bars overlap, the difference between models is likely not statistically significant.
What should you avoid when interpreting confidence intervals?
Saying the probability that the true mean is in the interval is 95%. Instead, say that in 95% of repeated experiments, the true mean would fall in the interval.