Basics Flashcards

1
Q

What is the difference between supervised and unsupervised learning? Provide examples of each.

A

Supervised learning involves learning a function that maps inputs to known outputs using labeled data. An example is predicting house prices (regression). Unsupervised learning involves finding patterns in data without predefined labels, like clustering customers into segments (clustering).
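
A minimal sketch of both settings, assuming scikit-learn and a tiny synthetic dataset (the data and model choices below are illustrative, not part of the card):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # features

# Supervised: labeled targets y are available, learn the mapping X -> y
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y)       # regression on known outputs

# Unsupervised: no labels, look for structure in X alone
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(reg.coef_, km.labels_[:10])
```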

2
Q

Explain the bias-variance tradeoff and how it relates to overfitting and underfitting.

A

The bias-variance tradeoff is the balance between two sources of error: bias, the error from overly simple assumptions that make the model miss relevant patterns, and variance, the error from sensitivity to the particular training sample. High-bias models are too simple and underfit; high-variance models are too complex and overfit. The goal is to choose a complexity that balances the two and minimizes total expected error.

3
Q

What are the main differences between regression and classification problems? Provide examples.

A

Regression involves predicting a continuous output (e.g., predicting house prices), while classification involves predicting categorical outputs (e.g., classifying emails as spam or not). The key difference is in the nature of the target variable.
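
A hedged sketch of how the target type differs, using scikit-learn models as stand-ins (the dataset is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))

# Regression: continuous target (e.g., a price)
y_cont = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.2, size=200)
LinearRegression().fit(X, y_cont)

# Classification: categorical target (e.g., spam vs. not spam)
y_cat = (X[:, 0] + X[:, 1] > 0).astype(int)
LogisticRegression().fit(X, y_cat)
```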

4
Q

Define model parameters and hyperparameters, and explain their roles in machine learning models.

A

Model parameters are learned during training (e.g., weights in a neural network). Hyperparameters control the learning process (e.g., learning rate or k in k-NN). Hyperparameters are set before training and optimized using validation data.
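
A small sketch of the distinction, assuming scikit-learn: n_neighbors is a hyperparameter fixed before training, while coef_ and intercept_ are parameters learned from the data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y_reg = X[:, 0] + rng.normal(scale=0.1, size=50)
y_clf = (X[:, 1] > 0).astype(int)

# Hyperparameter: chosen before fitting (often tuned on validation data)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y_clf)

# Parameters: estimated from the training data during fitting
lin = LinearRegression().fit(X, y_reg)
print(lin.coef_, lin.intercept_)
```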

5
Q

How do training, validation, and test data sets differ, and why are they important?

A

Training data is used to fit the model, validation data helps tune hyperparameters, and test data is used to evaluate final performance. Separating these sets prevents overfitting and ensures generalization to new data.
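
One common way to produce the three splits, sketched with scikit-learn's train_test_split (the split proportions are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# First carve out a test set, then split the rest into train and validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# Result: 60% train, 20% validation, 20% test
```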

6
Q

What is cross-validation, and why is it important in machine learning model evaluation?

A

Cross-validation is a technique where the data is split into k subsets, and the model is trained and validated on different combinations of these subsets. It helps provide a more reliable estimate of a model’s performance.
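
A quick sketch with scikit-learn's cross_val_score (the model and data are placeholders):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(150, 5))
y = (X[:, 0] > 0).astype(int)

# 5-fold cross-validation: train/validate on 5 different splits and average
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print(scores.mean(), scores.std())
```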

7
Q

Describe how the k-nearest neighbors (k-NN) algorithm works. What are its key strengths and weaknesses?

A

k-NN predicts for a query point by finding the k closest training points under a distance metric (typically Euclidean) and taking the majority class (classification) or the average value (regression). Strengths: simple, non-parametric, no explicit training phase. Weaknesses: sensitive to noise, feature scaling, and irrelevant features, and prediction is expensive on large datasets because it must search the whole training set.
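
A from-scratch sketch of k-NN classification in NumPy, assuming Euclidean distance (a simplified illustration, not a production implementation):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k nearest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among their labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3))  # -> 1
```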

8
Q

How can you identify overfitting in a machine learning model? What steps can be taken to mitigate it?

A

Overfitting can be identified by a model performing well on training data but poorly on test data. Techniques like cross-validation, regularization, and reducing model complexity can mitigate overfitting.
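
A hedged sketch of the diagnostic: compare training and held-out scores, here with a deliberately flexible model on pure noise (everything below is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 10))
y = rng.integers(0, 2, size=300)   # pure noise: nothing real to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unconstrained depth

# A large gap between the two scores is the classic sign of overfitting
print("train:", tree.score(X_tr, y_tr))   # ~1.0
print("test: ", tree.score(X_te, y_te))   # ~0.5 (chance level)
```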

9
Q

Explain the purpose of the validation set in the context of hyperparameter tuning.

A

The validation set is used to tune hyperparameters to optimize the model’s generalization performance. It helps prevent overfitting by providing a way to evaluate models on data not seen during training.
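
A minimal sketch of using the validation set directly: fit each candidate on the training data, score it on the validation data, and keep the best (the candidate values and model are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

# Score each candidate hyperparameter on data the model was not trained on
val_scores = {k: KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
              for k in (1, 3, 5, 11, 21)}
best_k = max(val_scores, key=val_scores.get)
```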

10
Q

Why is the mean squared error (MSE) commonly used as a loss function in regression problems?

A

MSE measures the average squared difference between predicted and actual values. Squaring penalizes large errors much more heavily than small ones, which is useful when large deviations are particularly undesirable, and the resulting loss is smooth and differentiable, which makes it convenient to minimize with standard optimization methods.
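
In symbols, MSE = (1/n) * Σ (y_i − ŷ_i)²; a one-line NumPy sketch:

```python
import numpy as np

def mse(y_true, y_pred):
    # mean of squared residuals; squaring weights large errors more heavily
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(mse([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # (0.25 + 0 + 2.25) / 3 ≈ 0.833
```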

11
Q

How does increasing the complexity of a model typically affect bias and variance? Explain with examples.

A

Increasing complexity typically reduces bias but increases variance. For example, a simple linear model may underfit the data (high bias, low variance), while a very flexible neural network may overfit (low bias, high variance).
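
A sketch of the effect using polynomial degree as the complexity knob (the degrees and data are illustrative assumptions):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # low -> high complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    # Degree 1 tends to underfit (high bias); degree 15 tends to overfit (high variance)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```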

12
Q

What is the difference between in-sample and out-of-sample performance? Why is the latter more important?

A

In-sample performance refers to how well the model fits the training data, while out-of-sample performance measures its generalization to new data. Out-of-sample performance is more important for ensuring the model works in real-world scenarios.

13
Q

Describe the process of k-fold cross-validation and its advantages over simple holdout validation.

A

In k-fold cross-validation, the data is split into k folds; the model is trained and validated k times, each time holding out a different fold for validation and training on the rest, and the k scores are averaged. Compared with a single holdout split, every observation is used for both training and validation, so the performance estimate is less dependent on one particular split and makes better use of limited data.
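
The procedure made explicit with scikit-learn's KFold (the model choice and k=5 are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))  # each fold serves as validation once
print(np.mean(scores))
```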

14
Q

How does the choice of the hyperparameter ‘k’ in k-NN affect the bias-variance tradeoff?

A

A small value of k in k-NN leads to low bias and high variance, making the model sensitive to noise. A larger k increases bias but reduces variance, leading to smoother decision boundaries but possibly missing finer patterns.
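
A sketch that sweeps k and compares training vs. held-out accuracy (the data and the range of k are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(9)
X = rng.normal(size=(500, 2))
y = ((X[:, 0] ** 2 + X[:, 1]) > 0.5).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # k=1: perfect training fit but noisier test scores (high variance);
    # large k: smoother boundary, training and test scores converge (higher bias)
    print(k, knn.score(X_tr, y_tr), knn.score(X_te, y_te))
```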

15
Q

What is the difference between model training error and generalization error?

A

Training error refers to the error on the data used to fit the model, while generalization error refers to the error on new, unseen data. A model may have low training error but high generalization error if it overfits.

16
Q

Why is the concept of irreducible error important when evaluating the performance of machine learning models?

A

Irreducible error represents the noise or randomness in the data that no model can capture. It’s the lower bound on the error, regardless of how well the model is trained.

17
Q

In what scenarios would a classification model be preferred over a regression model? Provide examples.

A

Classification models are preferred when the target variable is categorical. For example, classifying whether a transaction is fraudulent (yes/no) or identifying the species of a plant (species type) are classification tasks.

18
Q

How do you select the optimal value of hyperparameters in a machine learning model?

A

Hyperparameters are typically selected by searching over candidate values (e.g., a grid search or random search) and evaluating each candidate on a validation set or with cross-validation; the setting with the best validation performance is chosen.
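
One common implementation of this search, sketched with scikit-learn's GridSearchCV (the grid itself is an illustrative assumption):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 4))
y = (X[:, 0] - X[:, 2] > 0).astype(int)

# Try each candidate with 5-fold cross-validation and keep the best scorer
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 11, 21],
                                "weights": ["uniform", "distance"]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```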

19
Q

What are the benefits and drawbacks of using k-NN as a classifier in high-dimensional spaces?

A

k-NN keeps its usual benefits (simplicity, no training phase, flexible decision boundaries), but in high-dimensional spaces it suffers from the 'curse of dimensionality': distances between points concentrate and become less informative, so the nearest neighbors are barely closer than any other points. Accuracy degrades and far more data is needed to cover the space, so dimensionality reduction or feature selection is usually advisable before applying k-NN.
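
A small numerical sketch of the distance-concentration effect behind the curse of dimensionality (the dimensions and sample size are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(11)
for d in (2, 100, 10_000):
    X = rng.uniform(size=(500, d))
    dists = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from one point to the rest
    # As d grows, the nearest and farthest neighbors end up almost the same distance away,
    # so "nearest" carries little information for k-NN
    print(d, dists.min() / dists.max())
```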