Model Evaluation Flashcards
Technical interview study
What are the three types of error in an ML model? Briefly explain each.
- Bias: Error caused by choosing an algorithm that cannot accurately model the signal in the data, i.e., the model is too general or was incorrectly selected. E.g., selecting simple linear regression to model highly nonlinear data would result in error due to bias.
- Variance: Error from an estimator being too specific (flexible) and learning relationships that are specific to the training set but do not generalize well to new samples. Variance can come from fitting too closely to noise in the data, and models with high variance are highly sensitive to changing inputs. e.g., creating a decision tree that splits the training set until every leaf node only contains 1 sample.
- Irreducible Error: Error caused by noise in the data that cannot be removed through modeling. e.g.: inaccuracy in data collection causes irreducible error.
What is the bias-variance trade-off?
Bias refers to error from an estimator that is too general (inflexible) and does not learn relationships from a data set that would allow it to make better predictions.
Variance refers to error from an estimator being too specific (overly flexible), which learns relationships that are specific to the training set but will not generalize well to new data.
In short, the bias-variance trade-off is the trade-off between underfitting and overfitting. As we decrease variance, we tend to increase bias; as we increase variance, we tend to decrease bias.
Our goal is to create models that minimize the overall error by careful model selection and tuning to ensure there is a balance between bias and variance: general enough to make good predictions on new data but specific (flexible) enough to pick up as much signal as possible.
What are some naive approaches to classification that can be used as a baseline for results?
- Predict Only the Most Common Class: if the majority of samples have a target of 1, predict 1 for the entire validation set. This is extremely useful as a baseline for imbalanced data sets.
- Predict a Random Class: if we have two classes 0 and 1, randomly select either 1 or 0 for each sample in the validation set.
- Randomly Draw from the Training Set's Target Distribution: if we have two classes, with 70% of the training samples in class A and 30% in class B, then we randomly sample from this distribution to create predictions for our validation set.
These baselines are good to calculate at the start, and we should include at least one (see the sketch below) when making any assertions about the efficacy of the model, e.g., claiming “our model was 50% more accurate than the naive approach of suggesting all customers buy the most popular car.”
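A minimal sketch of these baselines using scikit-learn's DummyClassifier; Xtrain, ytrain, Xval, and yval are assumed to be an existing train/validation split.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# "most_frequent" = predict only the most common class,
# "uniform" = predict a random class,
# "stratified" = draw from the training target distribution
for strategy in ["most_frequent", "uniform", "stratified"]:
    baseline = DummyClassifier(strategy=strategy).fit(Xtrain, ytrain)
    print(strategy, accuracy_score(yval, baseline.predict(Xval)))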
Explain the classification metrics Area Under the Curve (AUC) and Gini.
AUC is the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one. It can have values between 0 and 1, with values closer to 1 indicating a more predictive model.
An uninformative model (guessing 0 or 1 at random on a balanced data set) will yield a score of 0.5, and a model that always predicts wrong will have a score of 0.
Gini is a similar metric that scales AUC between -1 and 1 so that 0 represents a model that makes random predictions. Gini = 2*AUC-1.
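A short sketch of computing AUC and Gini with scikit-learn; the fitted classifier clf and the validation split Xval, yval are assumed.
from sklearn.metrics import roc_auc_score

# AUC is computed from predicted probabilities for the positive class
auc = roc_auc_score(yval, clf.predict_proba(Xval)[:, 1])
gini = 2 * auc - 1  # rescale so 0 corresponds to random predictions
print(auc, gini)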
What is the difference between boosting and bagging?
Bagging and boosting are both ensemble methods that combine weak predictors to create a strong predictor. One key difference is that bagging builds independent models in parallel, whereas boosting builds them sequentially, at each step emphasizing the observations that were misclassified in previous steps.
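A minimal sketch contrasting the two in scikit-learn (the train/validation variables are assumed): a random forest bags independently grown trees, while gradient boosting adds trees sequentially, each one correcting the errors of its predecessors.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# bagging: trees grown independently on bootstrap samples
bagged = RandomForestClassifier(n_estimators=100).fit(Xtrain, ytrain)
# boosting: shallow trees added sequentially to correct earlier mistakes
boosted = GradientBoostingClassifier(n_estimators=100).fit(Xtrain, ytrain)
print(bagged.score(Xval, yval), boosted.score(Xval, yval))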
How can we tell if our model is underfitting the data?
If our training and validation errors are relatively equal and very high, then our model is most likely underfitting our training data.
How can we tell if our model is overfitting the training data?
If our training error is low and our validation error is high, then our model is most likely overfitting our training data.
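A quick sketch of both checks, assuming a fitted model and the usual train/validation split.
# both scores low -> likely underfitting; train high but validation low -> likely overfitting
print("train:", model.score(Xtrain, ytrain))
print("validation:", model.score(Xval, yval))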
Name and briefly explain several evaluation metrics that are useful for classification problems.
- Accuracy: measures the percentage of the time we correctly classify samples.
Accuracy = (true positive+true negative)/all samples
- Precision: measures the percentage of the predicted members that were correctly classified.
Precision = true positives / (true positives + false positives)
- Recall: measures the percentage of the true members that were correctly classified by the algorithm.
Recall = true positives / (true positives + false negatives)
- F1: the harmonic mean of precision and recall, i.e., a measurement that balances the two (or you can think of it as balancing Type I and Type II errors). F1 = 2 * (precision * recall) / (precision + recall).
- AUC: describes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one.
- Gini: a scaled and centered version of AUC.
- Log-loss: similar to accuracy but increases the penalty for incorrect classifications that are “further” away from their true class. For log-loss, lower values are better.
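A minimal sketch computing these metrics with scikit-learn; yval, the hard predictions y_pred, and the positive-class probabilities y_prob are assumed.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss)

print("accuracy:", accuracy_score(yval, y_pred))
print("precision:", precision_score(yval, y_pred))
print("recall:", recall_score(yval, y_pred))
print("f1:", f1_score(yval, y_pred))
print("auc:", roc_auc_score(yval, y_prob))       # uses probabilities, not hard labels
print("gini:", 2 * roc_auc_score(yval, y_prob) - 1)
print("log-loss:", log_loss(yval, y_prob))       # lower is better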
Name and explain several metrics that are useful for regression problems.
- Mean Squared Error (MSE): the average of the squared error of each prediction. 1/n * sum(yhat - yi)^2.
- Root Mean Squared Error (RMSE): square root of MSE.
- Mean Absolute Error (MAE): the average of the absolute error of each prediction. MAE = 1/n * sum|yi - yhat|.
- Coefficient of Determination (R^2): proportion of variance in the target variable that is predictable from the features.
where:
SS_reg = sum(yhat - ybar)^2
SS_res = sum(yi - yhat)^2
SS_total = sum(yi - ybar)^2 = SS_res + SS_reg
R^2 = 1 - (SS_res / SS_total) ≈ SS_reg / SS_total
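A short sketch of these regression metrics with scikit-learn; y_true and y_pred from a fitted regressor are assumed.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)  # 1 - SS_res / SS_total
print(mse, rmse, mae, r2)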
Why use ROC?
The ROC curve summarizes the confusion matrices that each DECISION THRESHOLD produces, so it indicates the threshold value that returns the best predictions for a given classifier.
This is especially useful when we need to control false negatives because a false negative would be catastrophic, e.g., when classifying Ebola infection, a false negative would risk an outbreak; in that case we choose a threshold that catches more true positives at the cost of more misclassified false positives.
ROC x-axis is the False Positive Rate = 1 - Specificity = FP / (FP + TN), the proportion of negative-class samples that were misclassified as positive.
ROC y-axis is the True Positive Rate = Sensitivity = TP / (TP + FN), the proportion of positive-class samples that were correctly classified.
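A minimal sketch of plotting the ROC curve with scikit-learn and matplotlib; the fitted classifier clf and validation split are assumed.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# roc_curve returns the FPR and TPR at every decision threshold
fpr, tpr, thresholds = roc_curve(yval, clf.predict_proba(Xval)[:, 1])
plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()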
Why use AUC?
The AUC allows us to COMPARE the ROC curve of one classifier to the ROC curve of a different classifier.
So if AUC_logistic > AUC_SVM, then we would select logistic regression over the SVM classifier!
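A sketch of that comparison with a logistic regression and an SVM; the train/validation split is assumed.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

logit = LogisticRegression().fit(Xtrain, ytrain)
svm = SVC(probability=True).fit(Xtrain, ytrain)  # probability=True enables predict_proba
auc_logit = roc_auc_score(yval, logit.predict_proba(Xval)[:, 1])
auc_svm = roc_auc_score(yval, svm.predict_proba(Xval)[:, 1])
print(auc_logit, auc_svm)  # select the classifier with the higher AUC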
Explain how to visualize how the performance of a model changes as the value of a hyperparameter changes.
Many training algorithms have hyperparameters that must be chosen before training begins. For example, one hyperparameter in a random forest is the number of weak trees in the forest (ensemble), and it is useful to visualize how the random forest's performance changes as that hyperparameter's value changes.
In sklearn, we can calculate a validation curve, which takes three important parameters:
- param_name: the name of the hyperparameter to vary
- param_range: the values of the hyperparameter to evaluate
- scoring: the evaluation metric
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

rf = RandomForestClassifier()
param_range = np.arange(10, 100, 5)
# compute accuracy scores over the hyperparameter range
train_scores, test_scores = validation_curve(
    rf, Xtrain, ytrain, param_name="n_estimators",
    param_range=param_range, cv=3, scoring="accuracy")
# average the train and test scores across the cross-validation folds
train_mean = np.mean(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
# plot the mean train and test scores against the hyperparameter values
plt.plot(param_range, train_mean, label="train_score")
plt.plot(param_range, test_mean, label="test_score")
plt.legend()
plt.show()