ML Flashcards
What is supervised learning?
In supervised learning, algorithms learn from training data where the desired solutions, called labels, are included, and the goal is to make predictions on future data based on these examples.
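A minimal sketch of the idea, using a hypothetical 1-nearest-neighbour classifier on toy data (no library assumed): the algorithm learns from labelled examples and predicts labels for new points.

```python
# Supervised learning sketch: learn from (feature, label) pairs,
# then predict the label of the closest training example.
def predict_1nn(train, labels, x):
    """Return the label of the training point closest to x."""
    best = min(range(len(train)), key=lambda i: abs(train[i] - x))
    return labels[best]

# Toy training data: desired solutions (labels) are included.
features = [1.0, 1.2, 3.9, 4.1]
labels = ["small", "small", "large", "large"]

pred1 = predict_1nn(features, labels, 1.1)  # near the "small" examples
pred2 = predict_1nn(features, labels, 4.0)  # near the "large" examples
```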
Two different voting schemes are common among voting classifiers. What are they?
Briefly explain how they work.
- In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins.
- In soft voting, every individual classifier provides a probability value that a specific data point belongs to a particular target class. The predictions are weighted by each classifier's importance and summed up. Then the target label with the greatest sum of weighted probabilities wins the vote.
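The two schemes can be sketched as follows, assuming three base classifiers that report class probabilities for two classes (the numbers and class names are toy values, not from any real model):

```python
from collections import Counter

# Each dict holds one classifier's class probabilities (toy values).
probas = [
    {"spam": 0.9, "ham": 0.1},   # classifier 1
    {"spam": 0.4, "ham": 0.6},   # classifier 2
    {"spam": 0.4, "ham": 0.6},   # classifier 3
]
weights = [1.0, 1.0, 1.0]  # equal importance for simplicity

# Hard voting: each classifier casts one vote for its top class;
# the majority wins.
votes = [max(p, key=p.get) for p in probas]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: sum the weighted probabilities per class; the class
# with the greatest sum wins.
sums = {c: sum(w * p[c] for w, p in zip(weights, probas)) for c in probas[0]}
soft_winner = max(sums, key=sums.get)
```

Note that the two schemes can disagree: here hard voting picks "ham" (two votes to one), while soft voting picks "spam" (1.7 vs 1.3), because the first classifier is very confident.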
What issue arises when models are automatically trained on data collected during production?
Staleness
Data skews
Feedback loops
Feedback loops
Do different gradient descent methods always converge to similar points?
True
False
False
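One reason the answer is False: on a non-convex loss, gradient descent can end up in different local minima depending on the starting point. A sketch on the toy function f(x) = x⁴ − 3x² + x (chosen here only for illustration):

```python
# Gradient descent on a non-convex function: two different starting
# points converge to two different local minima.
def grad(x):
    # derivative of f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def descend(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

left = descend(-2.0)   # settles into the minimum on the negative side
right = descend(2.0)   # settles into the minimum on the positive side
```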
What kind of problems does regularization solve? Give an example of a regularization technique.
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
Ridge Regression, Lasso Regression, and Elastic Net implement three different ways to constrain the weights.
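As a sketch of how such a constraint works, Ridge Regression adds an L2 penalty to the training objective: cost = MSE + α·Σθⱼ². The toy residuals, weights, and α below are illustrative values, not from any real model:

```python
# Ridge (L2) regularization sketch: the penalty term makes the same
# fit more expensive when the weights are large, pushing the learner
# toward smaller, more constrained weights.
def ridge_cost(errors, theta, alpha):
    mse = sum(e**2 for e in errors) / len(errors)
    penalty = alpha * sum(t**2 for t in theta)
    return mse + penalty

errors = [0.5, -0.5]  # residuals on two toy training points
small_w = ridge_cost(errors, [0.1, 0.1], alpha=1.0)  # 0.25 + 0.02
large_w = ridge_cost(errors, [3.0, 3.0], alpha=1.0)  # 0.25 + 18.0
```

Lasso uses the sum of absolute weights (Σ|θⱼ|) instead, and Elastic Net mixes the two penalties.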
What is the difference between a validation set and a test set? What roles do they play?
- Validation data: Used to select the model and hyperparameters; in other words, it provides a performance estimate during model construction and model selection. The validation data can, for example, be 10-20% of the training set, although this depends on the size and other characteristics of the dataset. When you have fine-tuned your model, you train a final model on the entire training set (training data + validation data) before predicting on test data.
- Test data: Used to evaluate the generalization performance of the selected model on unseen data. A common rule of thumb is an 80/20 split, where 80% of the data is used for training a model and 20% for testing it. In practice, however, the split varies a lot depending on the size and heterogeneity of the dataset. Note that it's important not to touch this data until you have fine-tuned your model, so that you get an unbiased evaluation. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
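The three-way split described above can be sketched like this (toy data of 100 samples; the 80/20 and ~20% proportions are the rule-of-thumb values from the answer, not fixed requirements):

```python
import random

data = list(range(100))
random.seed(0)
random.shuffle(data)

test = data[:20]          # 20%: untouched until the final evaluation
train_full = data[20:]    # 80%: everything the model may learn from
val = train_full[:16]     # ~20% of the training set, for model selection
train = train_full[16:]   # used to fit candidate models

# Workflow: fit candidates on `train`, compare them on `val`,
# retrain the chosen model on all of `train_full`, then evaluate
# exactly once on `test`.
```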
What typically happens after a while in operation to an offline-trained model dealing with new real, live data?
The model becomes stale
The model adapts to new patterns
The model abruptly forgets all previously learned information
The model becomes stale
Which of the following is true of cross-validation?
Fits multiple models on different splits of the training data
Increases generalization ability and reduces computational complexity
Removes the need for training and test sets
Fits multiple models on different splits of the training data
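A sketch of what "different splits" means in k-fold cross-validation: each of the k folds serves once as the validation split while a model is fitted on the remaining folds (index bookkeeping only, no library assumed; n is taken divisible by k for simplicity):

```python
# k-fold cross-validation splitting: every sample appears in exactly
# one validation fold and in k-1 fitting folds.
def kfold_indices(n, k):
    folds = []
    size = n // k
    for i in range(k):
        val = list(range(i * size, (i + 1) * size))
        fit = [j for j in range(n) if j not in val]
        folds.append((fit, val))
    return folds

splits = kfold_indices(10, 5)  # 5 (fit, val) index pairs
```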
How can you handle missing or corrupted data in a dataset? (Select all that apply)
Drop missing rows or columns
Replace missing values with mean/median
Replace missing values with the smallest/largest value
Drop missing rows or columns
Replace missing values with mean/median
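The two accepted strategies can be sketched on a toy column, using None to mark missing values (stdlib only; median imputation is included too, since it is a common robust variant of the mean strategy):

```python
from statistics import mean, median

col = [1.0, 2.0, None, 100.0]
present = [v for v in col if v is not None]

# Strategy 1: drop rows with missing values.
dropped = present

# Strategy 2: replace missing values with the mean (or median).
mean_filled = [v if v is not None else mean(present) for v in col]
median_filled = [v if v is not None else median(present) for v in col]
# The mean is pulled up by the outlier 100.0; the median (2.0) is
# more robust to it.
```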
Suppose that you have a very accurate model for a social app that uses several features to predict whether a user is a spammer or not. You trained the model with a particular idea of what a spammer was, for example, a user who sends ten messages in one minute. Over time, the app grew and became more popular, but the outcome of the
predictions has drastically changed. As people are chatting and messaging more, now sending ten messages in a minute becomes normal and not something that only spammers do. What kind of drift causes this spam-detection model’s predictive ability to decay?
Data drift
Concept drift
Concept drift
Which of the following are advantages to using decision trees over other models? (Select all that apply)
Trees are naturally resistant to overfitting
Trees often require less preprocessing of data
Trees are easy to interpret and visualize
Trees are robust to small changes in the data
Trees often require less preprocessing of data
Trees are easy to interpret and visualize
Regarding bias and variance, which of the following statements are true? (Select all that apply)
Models which overfit have a high bias
Models which overfit have a low bias
Models which underfit have a high variance
Models which underfit have a low variance
Models which overfit have a low bias
Models which underfit have a low variance
What is stacking? Select the alternative that best characterizes stacking.
You use different versions of machine learning algorithms
You use several machine learning algorithms to boost your results
The predictions of one model become the inputs to another model
You stack your training set and test set together
The predictions of one model become the inputs to another model
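A sketch of that idea: the base models' outputs become the input features of a second-level (meta) model. The base "models" here are hypothetical hard-coded rules and the meta model is a simple weighted vote, standing in for trained learners:

```python
# Stacking sketch: base predictions feed a meta model.
def base_a(x):       # stand-in for, e.g., a decision tree
    return 1 if x > 0.5 else 0

def base_b(x):       # stand-in for, e.g., an SVM
    return 1 if x > 0.3 else 0

def meta(preds, weights=(0.7, 0.3)):
    # The meta model sees only the base predictions, not x itself.
    score = sum(w * p for w, p in zip(weights, preds))
    return 1 if score >= 0.5 else 0

x = 0.4
stacked = meta([base_a(x), base_b(x)])  # base outputs [0, 1]
```

In practice the meta model is itself trained, typically on base-model predictions made for held-out data so it learns how much to trust each base model.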
What is the main advantage of using feature selection?
Speeding up the training of an algorithm
Fine-tuning the model’s performance
Remove noisy features
Speeding up the training of an algorithm
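A sketch of one simple filter-style feature selection method: score each feature by its absolute Pearson correlation with the target and keep only the informative ones (toy data; the 0.5 threshold and feature names are illustrative assumptions):

```python
from statistics import mean, stdev

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return cov / ((len(xs) - 1) * stdev(xs) * stdev(ys))

target = [1.0, 2.0, 3.0, 4.0]
features = {
    "useful": [1.1, 2.0, 2.9, 4.2],  # tracks the target closely
    "noisy": [5.0, 1.0, 4.0, 2.0],   # unrelated to the target
}

scores = {name: abs(corr(vals, target)) for name, vals in features.items()}
selected = [name for name, s in scores.items() if s > 0.5]
# Training on `selected` alone uses fewer inputs, which speeds up
# training and drops the noisy feature.
```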