H2023 ML Flashcards
Which of the following statements is true about linear regression?
In linear regression, the coefficients of the model are determined by minimizing the sum of the squared differences between the observed and predicted values.
Linear regression can only be used for classification problems, not for predicting continuous outcomes.
The primary assumption of linear regression is that the relationship between the input variables and the target variable is non-linear.
Linear regression is highly recommended for datasets with a large number of categorical variables, as it naturally handles categorical interactions.
In linear regression, the coefficients of the model are determined by minimizing the sum of the squared differences between the observed and predicted values.
In linear regression, the goal is to find the line (or hyperplane in higher dimensions) that best fits the data. This is achieved by minimizing the sum of the squared differences between the observed values (actual responses) and the predicted values (estimated from the linear model). This method is known as the least squares approach.
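As a minimal sketch of the least squares idea for a single input variable, the closed-form slope and intercept can be computed directly. The function names `fit_line` and `sum_squared_errors` and the sample data are illustrative assumptions, not part of the flashcard.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares solution for one input variable.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_squared_errors(xs, ys, slope, intercept):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

Any other line through these points gives a larger value of `sum_squared_errors`, which is what "best fit" means under this criterion.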
Which of the following statements accurately describes cross-validation techniques used in machine learning? Note that there may be more than one correct answer.
A: Cross-validation is primarily used when you have a large data set.
B: In k-fold cross-validation, the dataset is divided into k subsets, and the model evaluation process is repeated k times. Each iteration uses a different subset as the test set and the remaining subsets as the training set.
C: K-fold cross-validation typically requires more computational time compared to hold-out validation, as the model is trained and evaluated k times, once for each fold.
B: In k-fold cross-validation, the dataset is divided into k subsets, and the model evaluation process is repeated k times. Each iteration uses a different subset as the test set and the remaining subsets as the training set.
C: K-fold cross-validation typically requires more computational time compared to hold-out validation, as the model is trained and evaluated k times, once for each fold.
Response B: This is an accurate description of k-fold cross-validation. The dataset is partitioned into ‘k’ roughly equal-sized subsets, or folds. In each of the ‘k’ iterations, a different fold is used as the test set, and the remaining ‘k-1’ folds are used for training. This ensures that every data point is used for testing exactly once and for training in the other ‘k-1’ iterations, providing a comprehensive evaluation of the model’s performance.
Response C: This response correctly highlights the increased computational time associated with k-fold cross-validation compared to hold-out validation. In hold-out validation, the model is trained and tested only once. In contrast, in k-fold cross-validation, this process is repeated k times, leading to increased computational demands.
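The fold rotation described above can be sketched in a few lines. This is an illustration under simplifying assumptions: the helper `k_fold_indices` is hypothetical, folds are contiguous, and no shuffling is applied (practical implementations usually shuffle first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs, one per fold."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# 10 samples, 5 folds: the model would be trained and evaluated 5 times.
folds = list(k_fold_indices(10, 5))
```

Each index lands in exactly one test fold, and a model would have to be fitted once per element of `folds`, which is where the extra computational cost relative to a single hold-out split comes from.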
Which of the following statements is true about model ensembling techniques in machine learning?
A: Ensemble methods always reduce the risk of overfitting, as they combine the predictions from multiple models.
B: Boosting is an ensemble technique where models are trained sequentially with each model trying to correct the errors of the previous ones.
C: Bagging and boosting are identical in their approach, as both involve training multiple models on the same dataset and averaging their predictions.
D: Ensemble methods are less computationally intensive than training a single model, as they divide the training process across multiple, simpler models.
B: Boosting is an ensemble technique where models are trained sequentially with each model trying to correct the errors of the previous ones.
Response B: Boosting is indeed an ensemble technique where models are trained sequentially. Each subsequent model focuses on correcting the errors made by the previous models. This iterative process aims to improve the overall performance of the ensemble. Therefore, this statement is accurate.
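To make the sequential error-correcting idea concrete, here is a small sketch in the style of gradient boosting with squared loss: each round fits a one-split "stump" to the residuals left by the rounds before it. The helper names, the learning rate, and the toy data are assumptions chosen for illustration; other boosting variants such as AdaBoost reweight samples instead.

```python
def fit_stump(xs, residuals):
    """Find the threshold on a 1-D feature that best fits the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    t, lmean, rmean = stump
    return lmean if x <= t else rmean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Train stumps sequentially; each one targets the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors of the ensemble so far
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)))
    return base, stumps, history


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 2 for x in xs]
base, stumps, history = boost(xs, ys)
```

Here `history` records the training squared error after each round, which shrinks as later stumps correct what earlier ones missed. Training error alone says nothing about overfitting, so a held-out set is still needed.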
Which of the following statements are true about decision trees and random forests? Note that there may be more than one correct answer.
A
In decision trees, the Gini impurity helps to estimate the likelihood of incorrect classification of a randomly selected instance. It does this based on how the labels are distributed in that part of the tree.
B
Entropy, used in decision trees, is a measure of the disorder or uncertainty and is used to determine the homogeneity of a dataset.
C
Random forests reduce overfitting by averaging the predictions from multiple decision trees, each trained on a different subset of the training data.
D
In random forests, each decision tree is trained on a completely different set of features from the training data, which significantly reduces the correlation among the trees and improves the overall accuracy.
A
In decision trees, the Gini impurity helps to estimate the likelihood of incorrect classification of a randomly selected instance. It does this based on how the labels are distributed in that part of the tree.
B
Entropy, used in decision trees, is a measure of the disorder or uncertainty and is used to determine the homogeneity of a dataset.
C
Random forests reduce overfitting by averaging the predictions from multiple decision trees, each trained on a different subset of the training data.
Response A: Correct
In decision trees, the Gini impurity is indeed a measure used for assessing the likelihood of a random error in
classification. This measure is calculated for a subset of data and indicates the probability of misclassifying a
randomly chosen instance if it were randomly labeled according to the distribution of labels in the subset. A lower Gini
impurity implies a better separation of classes within the subset.
Response B: Correct
Entropy is another measure used in decision trees to quantify the disorder or uncertainty in a dataset. It’s a measure of the homogeneity of the dataset, with higher entropy indicating less homogeneity. Entropy helps in deciding which feature to split on at each step in building the tree, aiming to maximize the information gain, which is the reduction in uncertainty or disorder after the split.
Response C: Correct
Random forests address the overfitting problem common with decision trees. They do this by creating multiple decision trees, each trained on a different subset of the training data, and then averaging their predictions. This ensemble approach not only reduces overfitting but also generally improves model accuracy due to the averaging of individual trees’ predictions, which can vary significantly.
Response D: Incorrect
This statement contains a common misconception about random forests. In random forests, each decision tree is indeed trained on a different subset of the training data, but the feature set is not completely different for each tree. Rather, a random subset of features is used at each split within each tree. This process, known as feature bagging, reduces the correlation among the trees but does not involve training each tree on an entirely separate set of features.
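Both impurity measures from responses A and B can be computed from the label counts in a node. The following is a small sketch; the function names `gini` and `entropy` and the example labels are illustrative assumptions.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: chance of mislabeling a random instance when labels
    are assigned at random according to the node's label distribution."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy (in bits) of the node's label distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


mixed = ["spam", "spam", "ham", "ham"]   # evenly mixed node
pure = ["spam", "spam", "spam", "spam"]  # perfectly homogeneous node
```

For two classes, both measures peak on the evenly mixed node (Gini 0.5, entropy 1 bit) and are zero on the pure node, which is why a split that lowers them is preferred.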
In the context of Machine Learning Engineering (MLE), which of the following statements accurately reflects the fundamentals of model deployment, monitoring, lifecycle management, and challenges faced in ML deployment? Select all that apply.
A
In MLE, the primary focus is on developing complex models, as simpler models are generally less effective in real-world applications.
B
MLE involves regular re-evaluation of the model’s performance metrics against business objectives.
C
Model drift refers to the changes in model predictions due to updated software dependencies.
D
Automating model training and deployment processes is a fundamental aspect of MLE.
E
Model lifecycle management includes phases like development, deployment, monitoring, and retirement of models.
F
Continuous monitoring of deployed models is a critical component of MLE to identify performance degradation over time.
G
Data privacy and ethical considerations are peripheral concerns in the MLE lifecycle.
B
MLE involves regular re-evaluation of the model’s performance metrics against business objectives.
D
Automating model training and deployment processes is a fundamental aspect of MLE.
E
Model lifecycle management includes phases like development, deployment, monitoring, and retirement of models.
F
Continuous monitoring of deployed models is a critical component of MLE to identify performance degradation over time.
A: Incorrect. This statement is misleading in the context of MLE. While developing complex models is a part of MLE, the focus is not exclusively on complexity. Often, simpler models are preferred due to their efficiency, interpretability, and easier maintenance and deployment. This option fails to capture the balance required in model selection and development in real-world applications.
B: Correct. Regular re-evaluation of model performance against business objectives is integral to MLE. This process ensures that the model continues to meet the intended goals and provides value in a practical setting.
C: Incorrect. Model drift actually refers to the change in model performance over time due to changes in the underlying data, not software dependencies. The concept of model drift is crucial for understanding how models behave in dynamic real-world environments.
D: Correct. Automating model training and deployment processes is indeed a fundamental aspect of MLE. Automation improves efficiency, consistency, and scalability in deploying machine learning models.
E: Correct. This option correctly identifies the phases of model lifecycle management in Machine Learning Engineering, which includes development, deployment, monitoring, and eventual retirement of models. It’s a fundamental concept in understanding how models are managed over their entire lifespan.
F: Correct. Continuous monitoring of deployed models is essential in MLE for identifying any performance degradation or changes in data patterns over time. This helps maintain the model’s accuracy and relevance in a real-world setting.
G: Incorrect. Data privacy and ethical considerations are central, not peripheral, to the MLE lifecycle. This statement is misleading as it undervalues the importance of ethical practices and adherence to privacy laws in developing and deploying machine learning models.
When preparing data for a machine learning model, various preprocessing steps are often required to enhance the performance and accuracy of the model. Which of the following statements correctly describes these preprocessing steps? Select all that apply.
A
Missing values in a dataset should always be filled with the mean value of the respective feature to ensure model accuracy.
B
One-hot encoding of categorical variables increases the dimensionality of the dataset and may lead to problems.
C
All missing values in a dataset signify errors in data collection and should be removed before model training.
B
One-hot encoding of categorical variables increases the dimensionality of the dataset and may lead to problems.
One-hot encoding transforms categorical variables into a numerical form that machine learning algorithms can consume. However, it creates a separate binary column for each category, which can significantly increase the dimensionality of the dataset and, with many categories, contribute to the “curse of dimensionality”. This can lead to problems such as increased computational cost and the risk of overfitting, especially if there are many categories or if some categories have very few instances.
A: Incorrect. While filling missing values with the mean (or median) of a feature can be a valid approach in some cases, it’s not a one-size-fits-all solution. The appropriateness of using the mean value depends on the nature of the data and the specific circumstances. In some cases, using the mean might introduce bias or distort the distribution of the data, especially if the missing data is not randomly distributed. Additionally, other approaches, such as model-based imputation, filling with a constant value, or adding an indicator that flags the missingness itself, can sometimes be more suitable.
B: Correct. One-hot encoding transforms categorical variables into a numerical form that machine learning algorithms can consume. However, it creates a separate binary column for each category, which can significantly increase the dimensionality of the dataset and, with many categories, contribute to the “curse of dimensionality”. This can lead to problems such as increased computational cost and the risk of overfitting, especially if there are many categories or if some categories have very few instances.
C: Incorrect. While missing values can sometimes be due to errors in data collection, this is not always the case. Missing data can arise for various reasons, including the nature of the data collection process, refusal to respond in surveys, or the absence of an applicable response. Automatically removing all missing values can lead to a significant loss of information, especially if the missingness is informative or if a substantial proportion of data is missing. Instead, a more nuanced approach is required, where the cause of the missing data is considered, and appropriate techniques (such as imputation or model-based approaches) are used to handle it.
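The dimensionality growth mentioned under option B is easy to see in a hand-rolled sketch. The helper `one_hot` and the example column are assumptions for illustration; in practice a library encoder would normally be used.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# One categorical column with three distinct values...
colors = ["red", "green", "blue", "green"]
categories, rows = one_hot(colors)
# ...becomes three binary columns; a column with 1,000 distinct values
# would become 1,000 columns.
```

Each row contains exactly one `1`, so the number of columns grows with the number of categories while most entries stay zero.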
While training a machine learning model using scikit-learn in Python, you encounter the following error message:
ValueError: could not convert string to float.
Based on this error, which of the following steps should you take to resolve it? Select all that apply.
A
Ensure that all categorical variables in your dataset are properly encoded into numerical formats before training.
B
Convert all data in your dataset to string format, as scikit-learn models can handle string inputs.
C
Check for and handle any non-numeric values in your dataset, especially in columns that are expected to be numeric.
D
Use a data imputation technique to fill any missing values that might be causing this error.
E
Implement feature scaling to standardize or normalize the numerical values in your dataset.
F
Verify the data types of each column in your DataFrame to ensure they match the expected types for your model.
G
Increase the computational resources allocated to your Python environment, as this error may be due to insufficient processing power.
A
Ensure that all categorical variables in your dataset are properly encoded into numerical formats before training.
C
Check for and handle any non-numeric values in your dataset, especially in columns that are expected to be numeric.
F
Verify the data types of each column in your DataFrame to ensure they match the expected types for your model.
A: Correct. This error often occurs when a machine learning algorithm encounters categorical data in string format. Most machine learning models in scikit-learn require numerical input. Therefore, categorical variables should be encoded into numerical formats, such as using one-hot encoding or label encoding, before training the model.
B: Incorrect. Converting all data to string format is not a solution; in fact, it exacerbates the problem. Scikit-learn models typically cannot handle string inputs for numerical computations. This approach would lead to further errors rather than resolving the existing one.
C: Correct. The error message suggests that the model is attempting to convert a string to a float, indicating the presence of non-numeric values in a column expected to be numeric. Identifying and handling these non-numeric values is a crucial step in resolving the error.
D: Incorrect. While data imputation is an important step in handling missing values, this error message does not specifically indicate a problem with missing values. Imputation might be helpful in other contexts but is unlikely to resolve this specific error.
E: Incorrect. Feature scaling, such as standardization or normalization, is essential for many machine learning algorithms, but it is not relevant to the error of converting a string to a float. This step is more about adjusting the range or distribution of numeric values, not converting data types.
F: Correct. Verifying the data types of each column is an important diagnostic step. This error often arises when a column that is supposed to be numeric is instead interpreted as a string. Ensuring that each column has the correct data type is crucial in preprocessing for machine learning models.
G: Incorrect. This error message is related to data types and formats, not computational resources. Increasing computational resources will not resolve a data type mismatch and is not a relevant solution in this context.
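The checks in options C and F can be sketched without any library: Python's built-in `float()` raises the same kind of `ValueError` when given a non-numeric string. The helper `find_non_numeric` and the sample rows are hypothetical. With pandas, inspecting `DataFrame.dtypes` and using `pandas.to_numeric(..., errors="coerce")` is a common way to do the equivalent.

```python
def find_non_numeric(rows, column):
    """Return (row_index, value) pairs whose value cannot be parsed as a float."""
    bad = []
    for i, row in enumerate(rows):
        try:
            float(row[column])
        except (TypeError, ValueError):
            bad.append((i, row[column]))
    return bad


# Hypothetical records: "area" should be numeric, "city" is categorical.
rows = [
    {"area": "120.5", "city": "Oslo"},
    {"area": "n/a", "city": "Bergen"},
    {"area": "98", "city": "Oslo"},
]

bad_area = find_non_numeric(rows, "area")  # stray placeholder in a numeric column
bad_city = find_non_numeric(rows, "city")  # a categorical column that needs encoding
```

A column where only a few values fail usually contains stray placeholders to clean (option C), while a column where every value fails is categorical and needs encoding (option A).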
Understanding overfitting and underfitting is crucial in developing effective machine learning models. Which of the following statements accurately describes these concepts, provides examples of both, and suggests ways to address these issues? Select all that apply.
A
Overfitting occurs when a model learns the noise and random fluctuations in the training data to the extent that it negatively impacts the performance on new data. An example is a high-degree polynomial regression model fitting random noise in data.
B
Underfitting is typically addressed by increasing the complexity of the model, such as adding more features or using a more sophisticated model.
C
A model that performs equally poorly on both training and test data is an example of overfitting.
D
Regularization techniques, like L1 and L2 regularization, are used to prevent overfitting by constraining the model’s complexity.
E
Simplifying the model or reducing the number of features can be effective strategies to address overfitting.
F
If a model has very high accuracy on training data but poor accuracy on validation data, it is likely underfitting.
G
Adding more training data can help in reducing overfitting, as it provides the model with a broader representation of the underlying problem.
H
Boosting model complexity indiscriminately, without proper validation, is a recommended way to combat underfitting.
A
Overfitting occurs when a model learns the noise and random fluctuations in the training data to the extent that it negatively impacts the performance on new data. An example is a high-degree polynomial regression model fitting random noise in data.
B
Underfitting is typically addressed by increasing the complexity of the model, such as adding more features or using a more sophisticated model.
D
Regularization techniques, like L1 and L2 regularization, are used to prevent overfitting by constraining the model’s complexity.
E
Simplifying the model or reducing the number of features can be effective strategies to address overfitting.
G
Adding more training data can help in reducing overfitting, as it provides the model with a broader representation of the underlying problem.
A: Correct. Overfitting indeed occurs when a model learns not only the underlying patterns but also the noise and random fluctuations in the training data. This leads to poor generalization to new, unseen data. The example of a high-degree polynomial regression model fitting noise illustrates this concept well. Such models can capture complex patterns but also tend to learn noise, especially when the degree of the polynomial is unnecessarily high.
B: Correct. Underfitting happens when a model is too simple to capture the complexities of the data. This issue is often addressed by increasing the model’s complexity, such as adding more features or using a more sophisticated algorithm. This can help the model to better capture the underlying patterns in the data.
C: Incorrect. A model performing poorly on both training and test data is an indication of underfitting, not overfitting. Overfitting is characterized by high performance on training data but poor performance on unseen data.
D: Correct. Regularization techniques like L1 and L2 are indeed used to prevent overfitting. They work by adding a penalty to the model’s complexity, effectively constraining it and discouraging it from learning noise and unnecessary details from the training data.
E: Correct. Simplifying the model or reducing the number of features are common strategies to combat overfitting. By reducing complexity, the model is less likely to learn noise and random fluctuations in the training data.
F: Incorrect. High accuracy on training data but poor accuracy on validation data is a sign of overfitting, not underfitting. Underfitting would manifest as poor performance on both training and validation data.
G: Correct. Adding more training data can help reduce overfitting. More data provides a broader representation of the underlying problem, making it harder for the model to learn and memorize noise specific to a smaller dataset.
H: Incorrect. Boosting model complexity indiscriminately is not a recommended way to combat underfitting. While increasing complexity can help address underfitting, it needs to be done judiciously and in tandem with proper validation to ensure that the model does not swing to the other extreme of overfitting.
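As a small illustration of how the L2 penalty in option D constrains a model, consider a one-feature linear model without an intercept, where the penalized least squares solution has a simple closed form. The function `ridge_slope`, the parameter name `alpha`, and the data are assumptions for this sketch.

```python
def ridge_slope(xs, ys, alpha):
    """Slope w minimizing sum((y - w*x)**2) + alpha * w**2 (no intercept).

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + alpha).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unpenalized = ridge_slope(xs, ys, alpha=0.0)   # ordinary least squares
penalized = ridge_slope(xs, ys, alpha=14.0)    # same data, strong L2 penalty
```

A larger `alpha` pulls the coefficient toward zero, trading a slightly worse fit on the training data for a simpler, less noise-sensitive model. The right strength is normally chosen by validation, not fixed in advance.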
Consider a confusion matrix for a binary classifier:
                | Predicted positive | Predicted negative |
Actual positive |         TP         |         FN         |
Actual negative |         FP         |         TN         |
Which of the following is true?
TN is the number of positives correctly identified by the classifier.
FP is the number of negatives incorrectly labeled as positive.
FN is the number of positives incorrectly labeled as negative.
TP is the number of negatives correctly identified by the classifier.
FP is the number of negatives incorrectly labeled as positive.
FN is the number of positives incorrectly labeled as negative.
In a confusion matrix for a binary classifier:
True Negatives (TN) are the negative cases correctly identified as negative.
False Positives (FP) are the negative cases incorrectly labeled as positive.
False Negatives (FN) are the positive cases incorrectly labeled as negative.
True Positives (TP) are the positive cases correctly identified as positive.
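The four cells can be counted directly from paired true and predicted labels. This is a minimal sketch; the function `confusion_counts` and the example labels are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP and TN for a binary classifier."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # positive correctly identified
        elif truth == positive:
            fn += 1  # positive incorrectly labeled as negative
        elif pred == positive:
            fp += 1  # negative incorrectly labeled as positive
        else:
            tn += 1  # negative correctly identified
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
```

The four counts always sum to the number of instances, since every prediction falls into exactly one cell.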
In a precision-recall trade-off, which of the following is generally true?
Increasing precision always increases recall.
Increasing recall always increases precision.
Increasing precision often decreases recall and vice versa.
There is no relationship between precision and recall.
Increasing precision often decreases recall and vice versa.
Increasing precision often decreases recall and vice versa. This trade-off occurs because increasing precision usually requires being more conservative about labeling positive cases, which can result in missing some actual positives (decreasing recall). Conversely, increasing recall (capturing more actual positives) can lead to more false positives, thus lowering precision.
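The trade-off shows up when the decision threshold of a scoring classifier is moved. Below is a small sketch with made-up scores. The function `precision_recall`, the data, and the convention of reporting precision as 1.0 when nothing is predicted positive are all assumptions.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
    fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
    fn = sum(1 for t, s in zip(y_true, scores) if s < threshold and t == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

lenient = precision_recall(y_true, scores, threshold=0.5)   # labels more cases positive
strict = precision_recall(y_true, scores, threshold=0.75)   # more conservative
```

On this data the lenient threshold catches every positive but admits one false positive, while the strict threshold removes the false positive and misses one actual positive.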
Imagine you are working on a machine learning project involving a regression model for predicting housing prices.
After deployment, you observe that the model, which performed strongly on your training and validation datasets, suffers a significant drop in accuracy when applied to real-world data. Describe a systematic approach to diagnosing and addressing this issue.
In your answer, you may want to address the following points:
Initial Assessment: Begin with your initial steps to investigate the discrepancy in performance. How would you verify that the drop in performance is genuine and not a result of reporting errors or implementation issues?
Data Analysis: Discuss potential data-related issues that could lead to this decrease in performance. How would you examine whether there’s been a change in the data distribution between your training/validation sets and the real-world data?
Model Analysis: Consider model-specific factors that might contribute to this issue. What tools or techniques would you use to determine if the model is overfitting or if certain features fail to generalize?
Improvement Strategies: Suggest strategies to enhance the model’s performance in real-world scenarios. Would retraining the model, altering features, or applying methods like regularization be appropriate?
Validation of Improvements: Describe how you would validate the effectiveness of your improvements. Which metrics would you choose, and how would you confirm that the refined model performs better with real-world data?
Documentation and Communication: Stress the importance of documenting your process and communicating the findings to stakeholders. What essential points would you include in your communication for clarity and actionable insights?
A systematic approach is required to address the observed drop in accuracy of the regression model for predicting housing prices. This would involve a series of steps, each aimed at identifying and rectifying potential issues. Note that your answer may highlight issues other than those exemplified below.
**Initial Assessment**
Verification of the Problem: The first step is to ensure that the drop in performance is not due to reporting errors or implementation issues. This involves:
* Re-examining the deployment pipeline for any discrepancies.
* Verifying that the model version in production matches the one tested.
* Validating data logging and performance metrics for accuracy.
* Comparing performance metrics (like RMSE or MAE) in the real-world application against those obtained during testing.
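For the metric comparison in the last step, RMSE and MAE can be recomputed independently of the reporting pipeline as a sanity check. The functions below are a minimal sketch, and the price values are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error; average size of the error in the target's units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical house prices (in thousands) and model predictions.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

production_rmse = rmse(actual, predicted)
production_mae = mae(actual, predicted)
```

Computing the same two numbers on the validation set and on logged production data, with the same code, helps separate a genuine performance drop from a difference in how the metrics were calculated.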
**Data Analysis**
Investigating Data-Related Issues: Data is often the culprit for such discrepancies. Steps include:
* Data Drift Analysis: Checking if the real-world data has drifted from the training/validation data. This involves comparing distributions of features and target variables.
* Quality Check: Ensuring there are no issues in data quality like missing values, outliers, or incorrect data types in the real-world data.
* Feature Relevance: Verifying if all features used in the model are still relevant and accurately captured in the real-world scenario.
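One way to compare a feature's distribution between training and production data is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The sketch below is a plain-Python illustration with made-up values. The function name is an assumption, and libraries such as SciPy provide a tested version (`scipy.stats.ks_2samp`) that also returns a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs (0 to 1)."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a + b)):
        cdf_a = bisect_right(a, value) / len(a)  # share of a that is <= value
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical living-area values (square metres).
train_area = [50, 60, 70, 80, 90, 100]
prod_similar = [55, 65, 75, 85, 95, 105]
prod_shifted = [150, 160, 170, 180, 190, 200]

small_drift = ks_statistic(train_area, prod_similar)
large_drift = ks_statistic(train_area, prod_shifted)
```

Values near 0 suggest similar distributions and values near 1 suggest a strong shift. What counts as "too much" drift depends on sample size and the application, so any alert threshold is a judgment call.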
**Model Analysis**
Model-Specific Factors Investigation: To determine if the model is overfitting or if some features are not generalizing well:
* Overfitting Check: Analyze the model’s performance on training versus validation sets. A significant discrepancy might indicate overfitting.
* Feature Importance Analysis: Utilize tools to assess the impact of each feature on the model’s predictions. This can reveal if any feature is dominating the prediction erroneously.
**Improvement Strategies**
Enhancing Model Performance: Depending on the findings, several strategies could be employed:
* Retraining the Model: If new patterns are identified in real-world data, retraining with a more representative dataset might be necessary.
* Feature Engineering: Modifying existing features or creating new ones that better capture the complexities of real-world data.
* Regularization Techniques: Applying methods like Lasso or Ridge regression to prevent overfitting.
* Data Augmentation: If the real-world data is significantly different, supplementing the training data to better reflect these differences could help.
**Validation of Improvements**
Effectiveness of Improvements: To validate the enhanced model:
* Re-evaluation with Updated Metrics: Use the same metrics (like RMSE) to measure performance on a new or updated validation set that closely mimics real-world data.
* A/B Testing: Comparing the new model’s performance against the old model in a controlled real-world environment.
**Documentation and Communication**
Clear Documentation and Communication: Ensuring transparency and clarity in findings is critical.
* Documentation: Detailing each step taken, from initial problem identification to the implementation of solutions.
* Stakeholder Communication: Clearly communicate the findings, the rationale behind chosen strategies, and the impact of changes. Include visualizations to aid understanding.
This approach requires a balance between technical analysis and practical application, ensuring that any changes made directly address the identified issues while also being mindful of the model’s complexity and the data’s integrity.