Basics Flashcards
What is the difference between supervised and unsupervised learning? Provide examples of each.
Supervised learning involves learning a function that maps inputs to known outputs using labeled data, e.g., predicting house prices (regression). Unsupervised learning involves finding patterns in data without predefined labels, e.g., clustering customers into segments.
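A minimal sketch of the two paradigms, assuming scikit-learn and synthetic data (the coefficients and cluster count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Supervised: labels y are known, and we learn a mapping X -> y.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)

# Unsupervised: no labels; the algorithm discovers structure on its own.
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X)
```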
Explain the bias-variance tradeoff and how it relates to overfitting and underfitting.
The bias-variance tradeoff refers to the balance between two sources of error: bias, the error from overly simple assumptions that systematically miss the true relationship, and variance, the error from sensitivity to fluctuations in the training data. High-bias models are too simple and underfit, while high-variance models are too complex and overfit. The goal is to balance the two so that total expected error is minimized.
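For squared-error loss this balance can be stated exactly: the expected prediction error at a point $x$ decomposes into bias squared, variance, and irreducible noise $\sigma^2$:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big] + \sigma^2
$$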
What are the main differences between regression and classification problems? Provide examples.
Regression involves predicting a continuous output (e.g., predicting house prices), while classification involves predicting categorical outputs (e.g., classifying emails as spam or not). The key difference is in the nature of the target variable.
Define model parameters and hyperparameters, and explain their roles in machine learning models.
Model parameters are learned during training (e.g., weights in a neural network). Hyperparameters control the learning process (e.g., learning rate or k in k-NN). Hyperparameters are set before training and optimized using validation data.
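A small sketch of the distinction, assuming scikit-learn (the data and the value of n_neighbors are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Parameters: coef_ and intercept_ are *learned* by fitting to data.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# Hyperparameter: n_neighbors is *chosen by us* before any training happens.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, np.array([0, 0, 1, 1]))
```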
How do training, validation, and test data sets differ, and why are they important?
Training data is used to fit the model, validation data helps tune hyperparameters, and test data is used to evaluate final performance. Separating these sets prevents overfitting and ensures generalization to new data.
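One common recipe, assuming scikit-learn and a 60/20/20 split (the ratios are a judgment call, not a rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)   # synthetic data for illustration
y = np.random.rand(100)

# First carve off 60% for training, then split the remainder 50/50
# into validation and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
```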
What is cross-validation, and why is it important in machine learning model evaluation?
Cross-validation is a technique where the data is split into k subsets (folds); the model is trained on k-1 folds and validated on the remaining one, rotating until every fold has served as the validation set once. Averaging the k scores provides a more reliable estimate of a model's performance than a single split.
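A minimal sketch with scikit-learn's cross_val_score on synthetic data (5 folds; for a regressor the default score is R²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5)  # one score per fold
print(scores.mean(), scores.std())
```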
Describe how the k-nearest neighbors (k-NN) algorithm works. What are its key strengths and weaknesses?
k-NN works by finding the k training points closest to a query point and predicting the majority class (classification) or the average value (regression) among them. Strengths: simple, interpretable, no explicit training phase. Weaknesses: sensitive to noise and feature scaling, and prediction is computationally expensive on large datasets because every training point must be scanned.
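A from-scratch sketch of the classification case (brute-force Euclidean distance; in practice one would use an optimized library implementation):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Distance from the query to every training point -- this full scan
    # is exactly what makes naive k-NN slow on large datasets.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]            # majority class among neighbors

X_tr = np.array([[0, 0], [0, 1], [1, 0], [5, 5]], dtype=float)
y_tr = np.array([0, 0, 0, 1])
print(knn_predict(X_tr, y_tr, np.array([0.2, 0.2]), k=3))  # -> 0
```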
How can you identify overfitting in a machine learning model? What steps can be taken to mitigate it?
Overfitting can be identified by a model performing well on training data but poorly on test data. Techniques like cross-validation, regularization, and reducing model complexity can mitigate overfitting.
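One mitigation sketch, assuming scikit-learn: L2 regularization with Ridge, where a larger alpha shrinks the weights more aggressively, trading a little bias for less variance (alpha=10.0 here is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))                  # many features, few samples
y = X[:, 0] + rng.normal(scale=0.5, size=50)   # only the first feature matters

ridge = Ridge(alpha=10.0).fit(X, y)            # penalizes large weights
```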
Explain the purpose of the validation set in the context of hyperparameter tuning.
The validation set is used to compare hyperparameter settings and select the one that generalizes best. Because the model never trains on it, it gives an honest signal for tuning; and because tuning never touches the test set, the final test score remains an unbiased estimate of performance.
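A sketch of validation-based tuning, assuming scikit-learn and synthetic data (the candidate k values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 7, 9]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
# Only after committing to best_k do we touch the test set, exactly once.
```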
Why is the mean squared error (MSE) commonly used as a loss function in regression problems?
MSE measures the average squared difference between predicted and actual values. Squaring penalizes large errors more heavily than small ones, which suits regression problems where large deviations are particularly undesirable, and the function is smooth and differentiable, which makes it convenient for gradient-based optimization.
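Formally, for $n$ observations with predictions $\hat{y}_i$:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$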
How does increasing the complexity of a model typically affect bias and variance? Explain with examples.
Increasing complexity typically reduces bias but increases variance. For example, a simple linear model may underfit the data (high bias, low variance), while a very flexible neural network may overfit (low bias, high variance).
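An illustrative sketch: fitting the same noisy sine data with polynomials of increasing degree. Degree 1 underfits (high bias), while a high degree chases the noise (high variance); the specific degrees are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=20)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_mse)   # training error falls as complexity grows
```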
What is the difference between in-sample and out-of-sample performance? Why is the latter more important?
In-sample performance refers to how well the model fits the training data, while out-of-sample performance measures its generalization to new data. Out-of-sample performance is more important for ensuring the model works in real-world scenarios.
Describe the process of k-fold cross-validation and its advantages over simple holdout validation.
In k-fold cross-validation, the data is split into k parts, and the model is trained and validated k times, each time holding out a different fold as validation data. Unlike a single holdout split, every observation is used for validation exactly once, so the averaged score makes fuller use of the data and gives a lower-variance, more robust performance estimate.
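The same mechanics written as an explicit loop (k=5 on synthetic data; cross_val_score wraps exactly this pattern):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

fold_scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))  # R^2 on the held-out fold
print(np.mean(fold_scores))
```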
How does the choice of the hyperparameter ‘k’ in k-NN affect the bias-variance tradeoff?
A small k (e.g., k=1) yields low bias and high variance: the model can effectively memorize the training set and is sensitive to noise. A larger k increases bias but reduces variance, producing smoother decision boundaries that may miss finer patterns.
What is the difference between model training error and generalization error?
Training error refers to the error on the data used to fit the model, while generalization error refers to the error on new, unseen data. A model may have low training error but high generalization error if it overfits.