H2022 ML Flashcards
Explain the difference between classification and regression in machine learning. Include at least one example of each.
Classification: Predicts categories. Example: Predicting if an email is “spam” or “not spam.”
Regression: Predicts numerical values. Example: Predicting the price of a house based on its size.
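The contrast shows up in the type of the prediction: a label versus a number. The following is a minimal sketch under stated assumptions: the data is invented for illustration (not from any real dataset), and the helper names `nearest_label` and `fit_line` are hypothetical.

```python
# Toy illustration: all numbers below are made up, not real data.

def nearest_label(train, x):
    """Classification: return the label of the closest training example (1-nearest neighbour)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def fit_line(xs, ys):
    """Regression: ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Feature: number of links in an email; target: a category.
emails = [(0, "not spam"), (1, "not spam"), (7, "spam"), (9, "spam")]
email_prediction = nearest_label(emails, 8)

# Feature: house size in square metres; target: a number (price in thousands).
sizes = [50, 80, 100, 120]
prices = [150, 240, 300, 360]
slope, intercept = fit_line(sizes, prices)
price_prediction = slope * 90 + intercept
```

Here `email_prediction` is one of a fixed set of labels, while `price_prediction` can take any numerical value.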
In machine learning, one typically divides one’s dataset into three subsets: a training set, a validation set, and a test set. Why? What are the roles of each of these subsets?
Training Set: Used to train the model by fitting patterns.
Validation Set: Used to tune hyperparameters and compare candidate models, which helps detect overfitting to the training set.
Test Set: Used to evaluate the model’s final performance on unseen data.
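One possible way to produce the three subsets is to shuffle once and cut the data at two points. This is a sketch only: the function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration.

```python
import random

def train_val_test_split(items, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of the data, then cut it into three disjoint subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]                 # touched only once, for the final evaluation
    val = items[n_test:n_test + n_val]    # used to tune hyperparameters
    train = items[n_test + n_val:]        # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
```

Because every sample lands in exactly one subset, the test set stays unseen during both fitting and tuning.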
Which of the following tasks are examples of multilabel classification? Note that there may be more than one correct answer.
Classify a recording of speech as belonging to an authorized or a non-authorized user
Classify whether an article is about either sports, politics, science, culture, or finance
Classify rating and profit for a new product that you’re considering launching
Classify customers into those that are interested in a product or not
Classify images as being of either dogs or cats
Classify rating and profit for a new product that you’re considering launching
Multilabel Classification: A task where each instance can belong to multiple classes simultaneously.
Example: Tagging a movie as “Action,” “Comedy,” and “Drama.”
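A common way to represent multilabel targets is an indicator matrix with one column per class, where a row may contain several ones (or none). The sketch below assumes a hypothetical helper `to_indicator_matrix` and made-up movie tags.

```python
def to_indicator_matrix(tag_sets, classes):
    """Encode each instance's set of labels as a row of 0/1 flags, one per class."""
    return [[1 if c in tags else 0 for c in classes] for tags in tag_sets]

classes = ["Action", "Comedy", "Drama"]
movies = [
    {"Action", "Comedy", "Drama"},  # three labels at once
    {"Comedy"},                     # a single label
    set(),                          # no label applies
]
targets = to_indicator_matrix(movies, classes)
```

In ordinary multiclass classification every row would contain exactly one 1; in the multilabel case the row sums can differ.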
Which of the following statements are true for k-fold cross-validation? Note that there may be more than one correct answer.
Fits multiple models on different splits of the data
Provides a more robust estimate of generalization performance than hold-out validation
Is less computationally expensive than hold-out validation
It makes it unnecessary to have a test set
Fits multiple models on different splits of the data
Provides a more robust estimate of generalization performance than hold-out validation
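The two correct statements can be illustrated with a small pure-Python sketch. It assumes no shuffling, a deliberately trivial "model" that predicts the training mean, and mean absolute error as the score; the names `k_fold_indices` and `cross_val_scores` are hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is in exactly one validation fold."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

def cross_val_scores(ys, k):
    """Fit one mean-predictor per fold and return its mean absolute error on the held-out fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train_idx) / len(train_idx)  # "fit" on k-1 folds
        errors = [abs(ys[i] - prediction) for i in val_idx]          # evaluate on the rest
        scores.append(sum(errors) / len(errors))
    return scores

scores = cross_val_scores([1, 2, 3, 4, 5, 6], k=3)
mean_score = sum(scores) / len(scores)
```

One model is fitted per fold, so the cost is roughly k times that of a single hold-out split, and averaging the k scores is what makes the estimate more stable. A separate test set is still needed for the final evaluation.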
Which of the following statements are true about decision trees? Note that there may be more than one correct answer.
Their predictions are relatively simple to interpret
They and the models based on them require less preprocessing of data than most other models
They rarely overfit the training data
They are sensitive to minor changes in the training data
Their predictions are relatively simple to interpret
They and the models based on them require less preprocessing of data than most other models
They are sensitive to minor changes in the training data
Say you have trained a decision tree and achieved an accuracy of 85% on the training set and 47% on the validation set. Which of the following ideas would you pursue to improve the model’s performance? Note that there may be more than one correct answer.
Reduce the maximum allowed depth of the tree (max_depth)
Increase the maximum allowed depth of the tree (max_depth)
Reduce the minimum number of samples required to allow a node to be split (min_samples_split)
Reduce the maximum allowed depth of the tree (max_depth)
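The large gap between training and validation accuracy indicates overfitting, which is why limiting the depth helps. The sketch below is a toy one-feature tree written in pure Python, not scikit-learn's implementation; its `max_depth` and `min_samples_split` arguments only mirror the names used in the question, and the data (with two deliberately flipped training labels) is invented for illustration.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(xs, ys, max_depth=None, min_samples_split=2, depth=0):
    """Return a leaf label, or a (threshold, left_subtree, right_subtree) tuple."""
    too_deep = max_depth is not None and depth >= max_depth
    if too_deep or len(set(ys)) == 1 or len(xs) < min_samples_split:
        return majority(ys)
    values = sorted(set(xs))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold: midpoint between neighbouring values
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        errors = (sum(y != majority(left) for y in left)
                  + sum(y != majority(right) for y in right))
        if best is None or errors < best[0]:
            best = (errors, t)
    if best is None:  # all feature values identical, nothing to split on
        return majority(ys)
    t = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return (t,
            build_tree([x for x, _ in left], [y for _, y in left],
                       max_depth, min_samples_split, depth + 1),
            build_tree([x for x, _ in right], [y for _, y in right],
                       max_depth, min_samples_split, depth + 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x <= t else right
    return tree

def accuracy(tree, xs, ys):
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(xs)

# True rule: label is 1 when x >= 10. Two training labels are flipped (noise).
train_x = list(range(20))
train_y = [1 if x >= 10 else 0 for x in train_x]
train_y[3], train_y[15] = 1, 0
val_x = [x + 0.25 for x in range(20)]
val_y = [1 if x >= 10 else 0 for x in val_x]

deep = build_tree(train_x, train_y)                  # unlimited depth
shallow = build_tree(train_x, train_y, max_depth=1)  # a single split
deep_train, deep_val = accuracy(deep, train_x, train_y), accuracy(deep, val_x, val_y)
shallow_train, shallow_val = accuracy(shallow, train_x, train_y), accuracy(shallow, val_x, val_y)
```

On this toy data the unrestricted tree memorises the two noisy labels, so it scores higher on the training set than on the validation set, while the depth-limited tree gives up some training accuracy and generalises better.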
Explain how a random forest is constructed from decision trees.
Random Forest Construction: A random forest is an ensemble of decision trees. Each tree is trained on a bootstrap sample of the training data (bagging), and each split considers only a random subset of the features. Predictions are aggregated by averaging (regression) or majority vote (classification).
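The construction can be sketched in a few lines. This is a simplified illustration, not a production implementation: the "trees" are one-split decision stumps, so choosing a random feature subset per tree coincides with choosing it per split; the names `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical, and the data is made up.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda row: left_label if row[f] <= t else right_label

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bootstrap: sample rows with replacement
        feats = rng.sample(all_features, n_features)  # random subset of the features
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return trees

def predict_forest(trees, row):
    """Classification: majority vote over the individual trees."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]

rows = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.0), (6.0, 5.5), (6.5, 6.1), (7.0, 5.8)]
labels = ["a", "a", "a", "b", "b", "b"]
forest = fit_forest(rows, labels)
```

For regression, the vote in `predict_forest` would be replaced by the mean of the trees' numerical predictions.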
Explain the difference between machine learning and machine learning engineering.
Machine Learning: Focuses on developing models and algorithms to analyze data and make predictions.
Machine Learning Engineering: Focuses on deploying, scaling, and maintaining machine learning models in production systems.
Describe some challenges faced when putting machine learning-based systems into production. Include at least one concrete example.
Data Drift: Model accuracy drops as the distribution of the input data changes over time.
Example: A fraud detection model trained on past patterns fails with new fraud tactics.
Scalability: Ensuring the model can serve large volumes of user requests efficiently.
Monitoring: Tracking model performance in real-time to detect failures or bias.
Integration: Merging models into existing systems seamlessly.
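As one illustration of the drift and monitoring challenges, a very simple check compares the mean of a live feature against the data seen at training time. This is a rough sketch, not a recommended production monitor: the function names, the threshold of 3 standard deviations, and the transaction amounts are all assumptions made up for the example.

```python
import statistics

def drift_score(reference, live):
    """Shift of the live mean from the reference mean, in reference standard deviations."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(live) - ref_mean) / ref_std

def has_drifted(reference, live, threshold=3.0):
    # threshold is a hypothetical alerting cutoff, to be tuned per feature
    return drift_score(reference, live) > threshold

# Made-up transaction amounts: training-time data versus recent production data.
training_amounts = [10, 12, 11, 13, 12, 11]
recent_amounts = [18, 20, 19, 21, 20, 19]
alert = has_drifted(training_amounts, recent_amounts)
```

A check like this only catches shifts in a single feature's mean; changes in the relationship between inputs and labels, as in the fraud example, also require tracking model performance on freshly labelled data.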