Concepts Flashcards

1
Q

Machine Learning

A

The process of creating an algorithm that predicts an outcome from data and can improve its performance through experience.

2
Q

Supervised learning algorithms are…

A
  • Trained on labeled data
3
Q

Unsupervised learning algorithms are…

A
  • Trained on unlabeled data
4
Q

Regularization is…

A

Any process that reduces generalization error (i.e. testing error) but not training error. It controls a model’s capacity (its ability to fit a wide variety of functions) and therefore prevents overfitting.

Examples include:
- L1/L2 in linear/logistic regression

5
Q

Hyperparameters are…

A

Parameters that can be used to control an algorithm’s behavior but are not learned. These should be tuned on a validation set.

Examples include:
- alpha in linear regression, or “learning rate”, which controls the step size in the gradient descent algorithm

6
Q

Difference between regression and classification

A

Regression is a process to predict continuous output values.

Classification is a process to predict categorical output values.

7
Q

What is a cost function?

A

A cost function measures the error of our predictions: it quantifies how far our predicted outcomes are from the actual outcomes. Training a model means finding parameters that minimize it.

8
Q

Linear regression

  • When is it used?
  • What is the hypothesis?
  • What is the cost function?
  • Are there any assumptions?
A

Linear regression is used to predict a continuous outcome (e.g. house prices) from one or more input variables. These input variables can be continuous or categorical, but they must be represented numerically.

The hypothesis is a linear model:
y = theta_0 + theta_1 * x; in matrix form, y = theta^T * x (where the first column of X is all 1’s)

A common cost function is Mean Squared Error (MSE). This function is parabolic in the univariate case.

Assumptions (check?):
- linearity
- normality
- independence
- no multicollinearity

9
Q

What is the Mean Squared Error (MSE) cost function for linear regression?

A

J = (1/(2n)) * sum from i=1 to n (y_i,predicted - y_i,actual)^2
= (1/(2n)) * sum from i=1 to n (theta_0 + theta_1 * x_i - y_i)^2
= (1/(2n)) * (X * theta - y)^T * (X * theta - y)

10
Q

There are two ways to determine the coefficients for a linear regression model. What are they?

A

The Mean Squared Error (MSE) cost function can be minimized using gradient descent.

In the special case of linear regression, the cost function can also be minimized analytically via the normal equation: theta = (X^T * X)^(-1) * X^T * y.

11
Q

Explain gradient descent.

A

Gradient descent is an algorithm that updates the coefficients to minimize the cost function.

In the case of linear regression:

  • initial values of theta are chosen
  • these are updated iteratively based on the slope of the cost function; we take steps along the cost function in the direction of greatest descent
  • the size of the steps is controlled by hyperparameter alpha (“learning rate”)
  • this occurs until a minimum is found (stopping conditions?)

The update equations look something like:
theta_updated = theta_current - alpha * partial derivative of the cost function with respect to theta_current
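A minimal numpy sketch of these updates for univariate linear regression (not from any particular library; the variable names and toy data are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Fit y ~ theta0 + theta1*x by minimizing the (half) MSE cost."""
    n = len(y)
    theta0, theta1 = 0.0, 0.0                 # initial values of theta
    for _ in range(n_iters):
        y_pred = theta0 + theta1 * x
        error = y_pred - y
        # Partial derivatives of J = (1/(2n)) * sum(error^2)
        grad0 = error.sum() / n
        grad1 = (error * x).sum() / n
        # Step in the direction of steepest descent, scaled by alpha
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])            # roughly y = 1 + 2x
print(gradient_descent(x, y))                  # intercept and slope close to (1, 2)
```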

12
Q

Discuss the effect of learning rate (hyperparameter alpha) in linear regression.

A

Alpha controls the rate of gradient descent when minimizing a cost function for linear regression. A larger value of alpha produces a larger step size, while a smaller value of alpha produces a smaller step size. If alpha is too small, you can find the minimum very precisely, but the algorithm may take a long time to converge. If alpha is too large, you may overshoot the minimum, and the algorithm may fail to converge or even diverge.

Note that the steps naturally get smaller as the number of iterations increases (because the slope of the cost function approaches zero near the minimum), even if alpha itself is held fixed.

13
Q

Things to keep in mind when preparing data for Machine Learning.

A

Gradient descent will work best if all of the input values x are roughly between -1 and 1, or even -0.5 and 0.5. To achieve this (a short sklearn sketch follows the list):

  • Feature scaling: divide all input values by the range of input values to achieve a range of 1
  • Mean normalization: subtract the average value of each input variable from the values of that input variable to achieve an average of 0
  • Standardization: subtract the average value of each input variable from the values of that input variable and divide by the standard deviation of that input variable to achieve an average of 0 and a standard deviation of 1
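For example, a minimal sketch using sklearn's scalers (fit on the training data only to avoid leakage; the toy arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

# Standardization: mean 0, standard deviation 1 per feature
scaler = StandardScaler().fit(X_train)    # learn means/stds from training data only
print(scaler.transform(X_train))
print(scaler.transform(X_test))

# Feature scaling to a fixed range instead (here [0, 1])
print(MinMaxScaler().fit_transform(X_train))
```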
14
Q

How might you assess how well gradient descent is working?

A

Plot the value of the cost function against the iteration number. The cost should decrease at every iteration; if it increases, the learning rate is probably too large.

15
Q

Define MAE

A

Mean Absolute Error - a metric for assessing the accuracy of a regression model.

It is the average of the absolute values of the residuals.

MAE = (1/n) * sum from i=1 to n ( abs( y_i,actual - y_i,predicted ) )

Smaller values of MAE indicate better model performance.

MAE places bounds on root mean squared error (RMSE):
MAE <= RMSE <= MAE*sqrt(n)

16
Q

Explain the need for validation and test sets.

A

You need a validation set to tune hyperparameters without letting your model see your “test” set.

You need a test set because, once you’ve trained a model, you need to be able to assess its performance in the real world (i.e. on data it’s never seen before).

17
Q

Explain regularization in the context of linear regression.

A

Regularization increases generalizability by penalizing large (or non-zero) coefficients. (I.e., you want to discourage the model from relying on more parameters than it needs.)

L1 (or lasso) penalizes by the sum of the absolute values of the coefficients (the Manhattan / L1 norm)

L2 (or ridge) penalizes by half the sum of the squared coefficients (the squared Euclidean / L2 norm)

18
Q

Explain L1 regularization

A

L1 regularization, or lasso regularization, adds the following term to a cost function: J = J + lambda*(sum from 1 to n of the absolute values of the coefficients). This sum is also known as the Manhattan distance (L1 norm).

It has the effect of setting small coefficients to 0, thereby doing feature reduction. This improves interpretability.

Hyperparameter lambda controls how strong this penalty is.

The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Lasso) is 1.

19
Q

Explain L2 regularization

A

L2 regularization, or ridge regularization, adds the following term to a cost function: J = J + (1/2)*lambda*(sum from 1 to n of the coefficients^2). This sum is the squared Euclidean distance (squared L2 norm).

It has the effect of shrinking small coefficients toward 0 without setting them exactly to 0, so it does not eliminate any features. As a result, models using L2 regularization retain all of their features, which can make them less interpretable and, if lambda is too small, still prone to overfitting.

Hyperparameter lambda controls how strong this penalty is.

The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Ridge) is 1.

Both L1 and L2 regularization are sensitive to feature scale (the penalty treats all coefficients alike), so inputs should be standardized before fitting either one.
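A small sketch contrasting the two penalties in sklearn (the pipeline standardizes features first, per the note above; the data set and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# L1 drives uninformative coefficients to (essentially) exactly 0; L2 only shrinks them
print(np.round(lasso.named_steps["lasso"].coef_, 2))
print(np.round(ridge.named_steps["ridge"].coef_, 2))
```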

20
Q

Logistic regression

  • When is it used?
  • What is the hypothesis?
  • What is the cost function?
  • Are there any assumptions?
A

Logistic regression is one of the simplest classification algorithms. It predicts the probability of a positive outcome based on a set of input features. These features can be continuous or categorical, but they must be represented numerically. In mathematical terms, it predicts p(y = 1 | x). The predicted class is 1 or 0, depending on whether the predicted probability is above or below some threshold (usually 0.5).

The hypothesis is that the log-odds of a positive outcome is a linear combination of input features:
p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x))

The cost function is: Binary cross entropy (also called log loss)

Assumptions:
- binary predictions
- independence
- log odds of the output can be modeled as a linear combination of the inputs

21
Q

Define odds and log odds

A
odds = p(x) / (1 - p(x))
log(odds) = log( p(x) / (1 - p(x)) )
if p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x))
then log(odds) = b + theta*x
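A quick numeric illustration (the values of b, theta, and x are hypothetical):

```python
import numpy as np

b, theta, x = -1.0, 2.0, 1.5            # hypothetical intercept, coefficient, input
log_odds = b + theta * x                # 2.0
p = np.exp(log_odds) / (1 + np.exp(log_odds))   # sigmoid -> ~0.88
odds = p / (1 - p)                      # ~7.39, i.e. e^2
print(log_odds, p, odds, np.log(odds))  # log(odds) recovers b + theta*x
```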
22
Q

How are the coefficients of logistic regression typically estimated?

A

The coefficients are estimated by Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the binary cross-entropy (log loss) cost function.

MLE is usually implemented using quasi-Newton methods.

If you’re implementing it by hand, it’s easier to use gradient descent.

23
Q

Explain Maximum Likelihood Estimation (MLE)

A

MLE is a method used to estimate the parameters of a model. It picks the parameter values such that they maximize the likelihood that the process described by the model produced the data that were actually observed.

I.e., it estimates which curve (e.g. a normal curve) was most likely responsible for generating the data points observed. In the case that we believe a normal distribution was the process that generated the data, MLE will find the values of mu and sigma that describe the curve that best fits the observed data.

24
Q

Explain the decision tree learning algorithm.

A

Decision trees can be used on either categorical or continuous data and can predict either a categorical or a continuous output variable.

In a decision tree, a set of rules is chosen relating to the input features, and each rule splits the rows of data and passes them down to the next level of the tree. The rules and which features they operate on are usually chosen automatically by the model; a parameter that is commonly set by the modeler, however, is the depth of the tree. It is usual to try a deeper tree to begin with, and then to scale back if the model overfits.

It is common to use a depth of 10, which allows for up to 2^10 = 1024 leaf nodes at the bottom of the tree. But if each node only contains a few examples, the model will be prone to overfitting (not enough data to make generalizable conclusions). A sensible parameter to tune on a validation set in sklearn to handle this problem is max_leaf_nodes (e.g. try between 5 and 500).

25
Q

Explain the random forest learning algorithm.

A

Random forests can be used on both continuous and categorical data, and they can predict either a continuous or a categorical outcome.

In this algorithm, a number of decision trees are generated for the same data.

There are two techniques to prevent all trees from reaching the same conclusions:

  • Bootstrap the data (sample with replacement) and assign a sample to each tree
  • Assign each tree only a subset of input features (usual to pick the sqrt of the number of input features)

To reach a prediction, the results of all trees are averaged (in the continuous case). In the categorical case, the trees vote.

Random forests often work well with default parameters in sklearn. Params to tune (defaults in parentheses): n_estimators (100), criterion (‘squared_error’ for regression), min_samples_split (2), max_depth (None)
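A minimal sklearn sketch (the data set is illustrative; the hyperparameters shown are just the defaults listed above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_depth=None lets each tree grow until its leaves are pure
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # mean accuracy on held-out data
```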

26
Q

When should you choose a decision tree versus a random forest?

A
  • Decision trees are interpretable and easy to visualize.
  • Decision trees are highly reproducible and perform well on large data sets because they are quick to run
  • Decision trees are very prone to overfitting, especially if the tree is deep. We can limit tree depth, but this increases the risk of a biased model.
  • Random forests are able to reduce overfitting while not dramatically increasing error due to bias.
  • Random forests are also more robust to outliers and general variation in the data, because they are an ensemble method where multiple trees must reach consensus.

My sense is that decision trees are almost never used in practice.

27
Q

When should you choose a random forest versus linear/logistic regression?

A
  • Decision trees and random forests can outperform linear/logistic regression if the output is not well-represented by linear combinations of the input variables (tree-based methods are non-parametric and learn interactions without them having to be explicitly modeled).
  • Random forests perform well in the case when the number of variables is close to or exceeds the number of observations, a regime in which linear/logistic regression breaks down.
  • Random forests are more robust to outliers because they are an ensemble method.
  • Random forests are generally less interpretable (easy to explain) than regression models, and they take more time and memory to run.
28
Q

Explain support vector machines (SVMs).

A

SVMs can be used for regression or classification, but they’re usually used for the latter, and usually only for binary classification problems (true?). The SVM itself is a complex, multidimensional surface, also called a hyperplane, that separates classes. The goal of this algorithm is to determine the hyperplane such that it separates the classes as successfully as possible.

In the SVM algorithm, a hyperplane is first identified that completely separates class A from class B. Then the distance from the support vectors (also called the “margin”) is maximized. (Which algorithm?)

Kernels can be used to create non-linear boundaries. In sklearn, there are options including ‘linear’, ‘rbf’, ‘poly’, and ‘sigmoid.’ Linear is usually best when you have a large number of features (> 1000) because it helps you avoid overfitting.
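A minimal sketch of trying different kernels in sklearn (the data set and parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf", "poly"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    clf.fit(X_train, y_train)
    # The non-linear kernels should separate the interleaved moons better
    print(kernel, clf.score(X_test, y_test))
```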

29
Q

What are “support vectors” in Support Vector Machines (SVMs)?

A

This is easiest to think about in the context of binary classification.

The support vectors are the points that are closest to the boundary between classes; they determine the position of the separating hyperplane and its margin. The hyperplane itself is the complex multidimensional surface that the SVM learns in order to separate the classes.

30
Q

Explain the hyperparameters in the SVM algorithm.

A

1) Gamma is the kernel coefficient if a non-linear kernel is used (e.g. rbf, sigmoid). A high value of gamma (e.g. 100) will likely result in overfitting.
2) C is a penalty parameter that controls the trade-off between correctly classifying training points and keeping the decision boundary smooth.

C effectively acts as (inverse) regularization: a smaller C means stronger regularization and a smoother boundary, while a larger C means less regularization and a closer fit to the training data.

31
Q

What are the pros and cons of using the SVM algorithm?

A

Pros:

  • It works well when there is a clear margin of separation between classes
  • It is effective in high dimensional spaces, even when the number of dimensions is greater than the number of samples
  • It is memory-efficient because only the support vectors are used to tune the location of the hyperplane

Cons:

  • Training time can be large on large data sets
  • It performs poorly on noisy data (i.e. when there is no clear separation between classes)
  • SVM does not directly provide probability estimates; in sklearn these are obtained via an expensive internal cross-validation (Platt scaling).
32
Q

Explain the k-means algorithm.

A

K-means is an unsupervised learning algorithm that groups similar points together to reveal underlying patterns.

The algorithm looks for a fixed number (k) of clusters in the data, where k defines the number of centroids you want to find.

  • Starts with a randomly located group of centroids
  • Calculates the distance between each data point and all k centroids
  • Assigns the data point to the closest centroid
  • After all the data points have been assigned, updates the location of each centroid to the average location of all data points assigned to that centroid.
  • The algorithm stops when centroid locations are not changing much between iterations, or when a certain number of iterations is reached.

The model output is the cluster each data record belongs to.

K is a critically-important hyperparameter.

Recommendation systems often use k-means.
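A minimal sklearn sketch (the blob data and k = 3 are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)          # k-means is distance-based, so scale first

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignment for each data record
print(km.cluster_centers_)   # final centroid locations
```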

33
Q

Define “centroid”

A

A centroid is the center of a cluster of data. Formally, it’s the average location of all the data points assigned to a cluster.

34
Q

What are some considerations for the k-means algorithm?

A
  • The choice of initial positions for the centroids is important, and a poor choice can result in the algorithm failing to stabilize. Rather than assigning entirely random centroid locations, one option is to initialize the centroids at the locations of actual data points.
  • The selection of hyperparameter k matters a lot
  • Data must be normalized/scaled in order for the k-means distances to make sense. This can be done in sklearn (e.g. StandardScaler).
35
Q

How might you choose hyperparameter k in the k-means algorithm?

A
  • Sometimes k can be estimated by eye
  • You can also use an elbow plot

36
Q

What is an elbow plot in the context of k-means?

A

It’s a plot that shows k on the x-axis and the within-cluster sum of squares (WCSS) on the y-axis.

To calculate the WCSS, for each cluster, calculate the squared Euclidean distance between each point in the cluster and the cluster’s centroid, then sum these over all points and all clusters. The “elbow”, the value of k at which the curve stops decreasing sharply, is a reasonable choice for k.
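A short sketch of building an elbow plot with sklearn, whose inertia_ attribute is the within-cluster sum of squares (the data and range of k are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()    # look for the 'elbow' where the curve flattens (here around k = 4)
```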

37
Q

Explain hierarchical clustering.

A

There are two kinds of hierarchical clustering: agglomerative and divisive. The latter is rarely used, so we’ll focus on the former.

Algorithm

  • Each data point starts out in its own cluster
  • A proximity matrix is calculated which describes how far each point is from all the others
  • The two clusters that are closest together are merged, and then the proximity matrix is re-calculated (how the distance to a multi-point cluster is measured depends on the linkage criterion; see the next card)
  • This process repeats until k clusters are achieved
38
Q

What are some metrics for measuring the distance between clusters and determining which clusters are closest together?

A
  • Minimum (linkage=’single’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the minimum distance is taken to be the measure of proximity.
    o Performs well on non-globular clusters
    o Performs poorly on noisy clusters
  • Maximum (linkage=’complete’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the maximum distance is taken to be the measure of proximity.
    o Performs well on noisy clusters
    o Performs best on globular clusters
    o Tends to break large clusters
  • Group average (linkage=’average’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the average of these is taken to be the measure of proximity.
    o Performs well on noisy clusters
    o Performs best on globular clusters
    o Variations: distance between centroids, Ward’s method (linkage=’ward’)

In addition, in calculating the above, “distance” can be measured as: Euclidean distance (affinity=’euclidean’ in sklearn, required if linkage=’ward’), squared Euclidean distance, Manhattan distance (affinity=’manhattan’ in sklearn), and others.
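A small sklearn sketch comparing linkage options (distance is left at the Euclidean default; the data are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)   # normalize first, as with k-means

for linkage in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)       # cluster label for each data point
    print(linkage, labels[:10])
```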

39
Q

When would you use hierarchical clustering versus k-means?

A
  • As in k-means, it is important to normalize data prior to modeling in hierarchical clustering.
  • Hierarchical clustering allows k to be well-approximated before running the algorithm by visually inspecting a dendrogram. Dendrograms can be plotted using sklearn.
  • Agglomerative hierarchical clustering cannot be used on large data sets; both space and time complexity are large, and importantly, larger than k-means.
40
Q

What are some validation metrics for regression models?

A
MSPE?
MSAE?
MAE
RMSE
R^2
Adjusted R^2
41
Q

What are some validation metrics for classification models?

A
Precision-Recall
ROC/AUC
Accuracy
Log-loss
F1 score
42
Q

What are some validation metrics for unsupervised models?

A

Rand index

Mutual information

43
Q

What are some other validation metrics?

A

CV error
Heuristic methods to find k
BLEU score (NLP)

44
Q

What is a confusion matrix?

A

A confusion matrix is a tool to evaluate the performance of a classification model. It’s an n x n matrix where n is the number of classes you are predicting.

In the simplest case (binary classification), the matrix has four squares that capture the number of examples that were true positives, false positives, true negatives, and false negatives.
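A minimal sklearn sketch (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels ordered (0, 1); here [[3, 1], [1, 3]]
print(confusion_matrix(y_actual, y_predicted))
```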

45
Q

What is a true positive?

A

In a binary classification model, a true positive is an example where the model predicted the positive class (1) and the actual data was from the positive class (1).

46
Q

What is a false positive?

A

In a binary classification model, a false positive is an example where the model predicted the positive class (1) and the actual data was from the negative class (0).

This is also known as Type 1 error.

47
Q

What is a true negative?

A

In a binary classification model, a true negative is an example where the model predicted the negative class (0) and the actual data was from the negative class (0).

48
Q

What is a false negative?

A

In a binary classification model, a false negative is an example where the model predicted the negative class (0) and the actual data was from the positive class (1).

This is also known as Type 2 error.

49
Q

Define recall.

A

Recall = number of actual positives correctly identified / total number of actual positives
= TP / (TP + FN)

Also called sensitivity.

Maximize this if you want the fewest false negatives.

Examples:

  • You want to grant as many loans as possible, so you minimize the number of borrowers wrongly identified as risky
  • Amazon Customer Service: You want to find as many coaching opportunities as possible, so you minimize the number of interactions wrongly identified as lacking coaching opportunities
50
Q

Define specificity.

A

Specificity = number of negatives identified / total number of actual negatives
= TN / (TN + FP)

51
Q

Define precision.

A

Precision = number of true positives / total number of predicted positives
= TP / (TP + FP)

Also called positive predictive value (PPV)

Maximize this when you want the fewest possible false positives.

Examples:

  • You want to correctly identify when individuals have cancer, but you want to minimize the number of individuals you incorrectly diagnose with cancer
  • Amazon Customer Service: You want to correctly identify coaching opportunities, but you also want to minimize the number of interactions when you say there’s a coaching opportunity but there isn’t
52
Q

Define accuracy.

A

Accuracy = the proportion of the total predictions that were correct
= (TP + TN) / (TP + TN + FP + FN)

Most intuitive metric.

Is misleading when classes are imbalanced.

53
Q

Define F1 score

A

F1 score is the harmonic mean of recall and precision.

It tries to balance recall and precision - minimizing all false conclusions (FP, FN)

It’s a good alternative to accuracy when classes are imbalanced.

Examples:

  • When FP and FN are equally harmful or benign. Maybe choosing which YouTube video to automatically play after the current one - labeling a good video as bad or a bad video as good might have approximately the same effect (the user doesn’t watch the next video).
  • Amazon Customer Service: If labeling an interaction as a coaching opportunity when it wasn’t (wastes manager’s time) and missing a coaching opportunity (failing to help customer service associates improve) were equal outcomes.
54
Q

What is a ROC curve?

A
  • A Receiver Operating Characteristic curve is one way of visualizing the performance of a classification model.
  • It plots the false positive rate (1 - specificity) on the x-axis against the true positive rate (recall/sensitivity) on the y-axis.
  • False positive rate = 1 - specificity = FP / (FP + TN)
  • To create the plot, FPR and TPR are calculated at different values of some model parameter, and then plotted. For example, in logistic regression, the model parameter is the probability threshold above which the model gives a positive outcome.
  • The Area Under the ROC Curve (AUC) is a measure of model accuracy. An AUC of 0.5 means the model is no better than random guessing (the ROC curve lies along the 45-degree line). The highest possible AUC is 1, when the curve is pushed as far as possible into the top left corner of the plot.
  • ROC curves can be misleading diagnostics in the case of very imbalanced data sets. In this case, a precision-recall curve is preferred.
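A minimal sklearn sketch (the classifier and data set are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]         # P(y = 1 | x)

# FPR/TPR pairs are computed across many probability thresholds
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(roc_auc_score(y_test, probs))             # area under the ROC curve
```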
55
Q

What is a precision-recall curve?

A
  • Precision-recall curves are like ROC curves, except recall is plotted on the x-axis and precision is plotted on the y-axis.
  • Pairs of (recall, precision) values are calculated at different values of some model parameter. For example, in logistic regression, the model parameter is the probability threshold above which the model gives a positive outcome.
  • The AUC is a measure of model accuracy. For a precision-recall curve, the baseline for a skill-less model is a horizontal line at a precision equal to the proportion of positive examples in the data (not a 45-degree line).
  • The highest possible AUC is 1, when the curve is pushed as far as possible into the upper right hand corner of the plot.
  • This option is better than an ROC curve for imbalanced classes.
56
Q

Define root mean squared error (RMSE).

A

RMSE is a common validation metric for regression models.

It is the standard deviation of the residuals, where residual r_i = y_i, predicted - y_i, actual

RMSE = sqrt( 1/n * sum from 1 to n ((y_i, actual - y_i, predicted)^2))

Smaller values of RMSE are better.

RMSE cannot be smaller than Mean Absolute Error (MAE).
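A quick sketch of the common regression metrics in sklearn (the values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual    = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.5, 5.5, 6.0, 10.0]

mae  = mean_absolute_error(y_actual, y_predicted)          # 0.75
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))  # ~0.79
r2   = r2_score(y_actual, y_predicted)                     # 0.875
print(mae, rmse, r2)    # note MAE <= RMSE, as expected
```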

57
Q

When should you choose RMSE versus MAE?

A
  • RMSE penalizes large residuals (i.e. it is higher when the data contain large residuals). In other words, being off by 10 is more than twice as bad as being off by 5.
  • RMSE (or MSE) is very popular as a loss (or cost) function because it is easy to differentiate, which matters for algorithms such as gradient descent.
  • However, MAE is more robust to outliers, and so can perform better on noisy data.
  • In addition, MAE can be more interpretable.
58
Q

Define R^2.

A

R^2 is a common validation metric for regression models.

It also can be used for explanatory purposes: it gives an estimate of the amount of variation in the dependent variable (y) that can be explained by the independent variable(s) (x).

It is basically 1 - MSE/variance. So, the higher the mean squared error (MSE), the lower R^2 and the poorer the model.

How to calculate:

  • If the residuals e_i = y_i, actual - y_i, predicted
  • Total sum of squares = SS_tot = the sum of (y_i, actual - mean of y)^2
  • Residual sum of squares = SS_res = sum of (e_i)^2
  • R^2 = 1 - (SS_res / SS_tot)

If R^2 = 0, none of the variance in y is explained by x.
If R^2 = 1, all of the variance in y is explained by x.

59
Q

Define adjusted R^2.

A

Adjusted R^2 is a common validation metric for regression models.

It also can be used for explanatory purposes: it gives an estimate of the amount of variation in the dependent variable (y) that can be explained by the independent variable(s) (x).

It differs from R^2 in that it penalizes including more parameters in your model.

R_adj^2 = 1 - [ (1-R^2)(n-1) / ( n - k - 1) ] where n = the number of observations and k = the number of parameters.

Adjusted R^2 is always <= R^2.

If R_adj^2 = 0, none of the variance in y is explained by x.
If R_adj^2 = 1, all of the variance in y is explained by x.

Larger values of R_adj^2 are better.

60
Q

When should you use R^2 versus adjusted R^2?

A

R^2 will increase as you add parameters to a model whether or not those parameters are informative.

In contrast, adjusted R^2 will increase when you add useful terms, and decrease if you add less useful terms.

So, generally, you should use adjusted R^2 unless your model includes only one term.

61
Q

When should you use adjusted R^2 versus RMSE?

A

It depends on what you need to find out.

RMSE on its own doesn’t actually tell you how good a model is – it only tells you if one model is better than another.

In contrast, adjusted R^2 has meaning even when it isn’t being compared with another option.

The best R^2 value is always 1. On the low end, arbitrarily large negative R^2 values are possible, but this doesn’t usually occur.

62
Q

Explain cross-validation

A

Cross-validation is useful in particular on small data sets. (What constitutes a small data set?)

A small test set means more variation in estimated test error, and therefore it is more difficult to claim that one algorithm works better than another.

Cross-validation allows you to use all the data to estimate an average test error.

63
Q

Explain k-fold cross-validation.

A

K-fold cross-validation is a procedure that allows you to determine testing error. It is especially useful when a dataset is small. It provides a more conservative estimate of testing error when sample sizes are small?

To perform k-fold cross-validation:

  1. Randomly shuffle your rows
  2. Divide your rows into k equal groups
  3. Designate one group as the test set, and use the remaining groups as the training set
  4. Perform any data preparation (e.g. normalization) and parameter tuning using only the training set - if you fit these on the full data set (including the test group), you risk data leakage
  5. Train the model
  6. Calculate the testing error
  7. Repeat from step 3 until you’ve used each of the k groups once as the test set (i.e., each example has been in the test set once and the training set k - 1 times)
  8. Average your testing error across the k iterations. It’s best practice to also calculate a standard deviation.

K-fold cross-validation can be computationally expensive.
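A minimal sklearn sketch; wrapping the scaler and model in a Pipeline ensures the preprocessing is re-fit on each training fold only, avoiding the leakage mentioned in step 4 (the data set is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1-2: shuffle, k groups

scores = cross_val_score(model, X, y, cv=cv)           # steps 3-7, one score per fold
print(scores.mean(), scores.std())                     # step 8
```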

64
Q

Explain Principal Component Analysis (PCA).

A

Principal component analysis (PCA) is an analytical (non-iterative) method to reduce the dimensionality (or number of features) in a data set while maintaining as much of the variation present in the data set as possible.

The principal components are the new, reduced features produced by the analysis. They are the eigenvectors of the covariance matrix of the original features, and hence are orthogonal (independent).

The first principal component captures the most variation from the original data set, the second captures the second most, and so on.

PCA only works well on scaled data. Relationships between features are assumed to be linear.

PCA is commonly used to compress big data while losing as little important information as possible, and to visualize high-dimensional data (especially for unsupervised learning applications). It should NOT be used as a fix for overfitting; use regularization instead, since PCA discards information without ever looking at the target labels.
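A minimal sklearn sketch (scaling first, as noted above; the data set and number of components are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA only works well on scaled data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # 4 features -> 2 principal components
print(pca.explained_variance_ratio_)           # share of variance captured by each PC
```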

65
Q

What might you do with data where the classes are very imbalanced?

A

There are several techniques for handling classification problems when there are imbalanced classes. Some include:

  • Selecting appropriate metrics (precision, recall or F1 rather than accuracy)
  • Oversampling instances of the minority class or undersampling instances of the majority class. Note that oversampling can result in overfitting because it can produce duplicate instances. SMOTE avoids this by creating new minority class instances by combining existing ones (SMOTE resource: https://beckernick.github.io/oversampling-modeling/). On the other hand, undersampling can leave out important instances (reduces the overall amount of data)
  • In extreme cases, it can be good to consider classification in the context of anomaly detection (anomaly detection algorithms include clustering methods, one-class SVMs and isolation forests)
66
Q

Explain covariance, esp in the context of a covariance matrix.

A

Covariance is the degree to which corresponding elements from two features tend to move in the same direction. For example, if two of your features to predict the temperature of the asphalt in a neighborhood parking lot were air temperature and amount of light, you would expect those to covary.

Hence, a covariance matrix captures which features are relatively redundant (i.e. they have high covariance) and which are information-rich (i.e. they have low covariance).

67
Q

What are eigenvectors (in the context of PCA)?

A

Eigenvectors are vectors that do not change direction when transformed (multiplied) by the covariance matrix.

They may, however, change size, which is indicated by the eigenvalue.

They represent the principal axes of maximum variance.

The eigenvalues provide the order of importance of these axes (first principal component, second principal component, etc.)

68
Q

How might you choose k for k-fold cross-validation?

A

It is common practice to choose k = 5 or k = 10; a value of 10 is a reasonable default if you’re unsure.

It’s preferable to choose k such that your groups have equal (or nearly equal) numbers of examples.

As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

69
Q

Explain bias.

A

Biases are the simplifying assumptions made by a model to make the target function easier to learn.

Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm’s bias.

Models with high bias are prone to underfitting.

Models with high bias: Linear regression, linear discriminant analysis, logistic regression
Models with low bias: Decision Trees, k-Nearest Neighbors and Support Vector Machines

70
Q

Explain variance (in the context of bias-variance trade-off).

A

Variance is the amount by which the model would change if different training data were used, i.e., how responsive the model is to new training examples.

All models should have some variance, but a good model should not change too much from one training data set to the next, because it’s good at identifying underlying patterns and correctly mapping between inputs and outputs.

Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. They are prone to overfitting.

Low variance models: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
High variance models: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

71
Q

Explain the bias-variance trade-off.

A

The aim of all ML models is to achieve both low bias and low variance. But as bias decreases, variance tends to increase, and vice versa. So the goal is to find a model that is responsive to the training data, but not too responsive.

Examples:

  • The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and, in turn, increases the bias of the model.
  • The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by tuning how many violations of the margin are allowed in the training data (the C parameter): allowing more violations increases the bias but decreases the variance.
72
Q

What is overfitting, and how do you diagnose it?

A

Overfitting is when a model is too specific to the training data. You can identify this when the training error is low but the testing error is much higher.

73
Q

Data cleaning check-list

A

Missing values - sklearn will throw an error if you try to train a model on data with missing values. (Options: drop column (possibly better than imputation if > half values are missing), imputation/fillna, imputation + new column indicating which values were missing)

Encode non-numeric variables - sklearn expects numeric values in columns (Options: drop column, ordinal encoding (each value gets its own integer; for variables with inherent order), one-hot-encoding (each value gets its own column; does not assume order; does not work well if the variable takes > 15 values)).

Things to consider: missing values in new cols in validation/test data; new options for categorical variables in validation/test data
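A minimal sklearn sketch of these options (the column names and toy DataFrame are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "rooms": [3, None, 4],                    # numeric column with a missing value
    "city":  ["Austin", "Boise", "Austin"],   # non-numeric column to encode
})

prep = ColumnTransformer([
    ("impute", SimpleImputer(strategy="median"), ["rooms"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # tolerates new categories in val/test
])
print(prep.fit_transform(df))
```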

74
Q

What is ensemble learning?

A

Ensemble learning is a general approach to machine learning that combines the predictions from multiple models to improve performance. The idea is that a set of weak learners can come together to produce one strong learner.

The three main ensemble learning strategies are
1. Bagging
2. Stacking
3. Boosting

75
Q

Explain the bagging learning strategy.

A

Bagging is short for “bootstrap aggregation.” It creates a diverse ensemble of models by varying the training data. A single algorithm is typically used, usually a decision tree. Each decision tree is trained on a subset of the training data, produced by sampling the rows with replacement (i.e. bootstrapping). The results from each decision tree are either averaged or counted (voted) to form the final model output.

Random forest is the obvious example of an ensemble method that uses bagging. (RF expands on basic bagging by selecting a subset of features to split on for each split in each tree). Extra trees is another example.

76
Q

Explain the stacking learning strategy.

A

In stacking, a diverse ensemble is created by varying the types of models used. A stacking model typically has two levels.

Level 0 contains all of the models that make predictions on the training data. It is desirable to use a wide variety of models with different assumptions on this level.

Level 1 contains the model that aggregates the predictions into a final answer (i.e., it is trained on the predictions from the level 0 models). The level 1 model is often simple, such as linear or logistic regression. This encourages the complexity of the model to reside in level 0.

Examples of stacking algorithms include Stacked models, Blending, and Super Ensemble.

77
Q

Explain the boosting learning strategy.

A

Boosting creates a diverse ensemble by sequentially adding models that focus on examples that were not well-classified by the other ensemble members.

Typically, this involves the use of very simple decision trees that are added to the model sequentially. Training examples for each model are weighted to indicate whether they were accurately classified by the preceding models. Ensemble output is aggregated through weighted averaging or voting.

Examples of boosting algorithms include AdaBoost, Gradient Boosting Machines, and Stochastic Gradient Boosting Machines (e.g. XGBoost, LightGBM).

These algorithms are currently among the most successful on tabular data.

78
Q

Explain AdaBoost.

A

AdaBoost was the first widely successful boosting algorithm. It uses a set of single-split decision trees (“decision stumps”) added sequentially. Difficult-to-classify examples receive larger and larger weights until the algorithm identifies a stump that can classify them. The final outcome is an average of the outcome for each decision stump, weighted by that stump’s accuracy.

AdaBoost is most successful in binary classification applications.

Three things to remember about AdaBoost:
1. AdaBoost combines a lot of weak learners (stumps) to make decisions.
2. Some stumps get more say in the classification than others.
3. Each stump is trained taking the previous stump’s mistakes into account.

Reference: https://www.youtube.com/watch?v=LsK-xG1cLYA

79
Q

Explain Gradient Boosting.

A

Gradient Boosting Machines (GBMs) re-cast boosting as a numerical optimization problem where the goal is to minimize a loss function and new trees (or new “weak learners”) are added via a gradient descent-like procedure. New learners are added one at a time, and the existing weak learners are not updated.

Unlike prior boosting algorithms, GBMs can use any differentiable loss function. This expanded the types of problems that could be solved via boosting beyond binary classification.

Decision trees are used as the weak learners in GBMs. These trees are typically constrained (i.e. in their depth), and they learn in a greedy way. Each subsequent decision tree is designed to “correct” large residuals from the previous set of decision trees. A tree is added and then the parameters are tuned such that it minimizes the overall loss of the ensemble.

Training stops when
1. a fixed number of trees have been added OR
2. loss reaches an acceptable level OR
3. performance no longer improves on an external validation set.
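A minimal sketch using sklearn's GradientBoostingClassifier (one implementation among several; the parameter values are illustrative and map onto the ideas above: number of trees, constrained depth, learning rate, early stopping):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,      # number of sequentially added trees (weak learners)
    max_depth=3,           # constrain each tree so it stays "weak"
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    n_iter_no_change=10,   # stop early if a held-out fraction stops improving
    random_state=0,
)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```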

80
Q

Describe the 4 main enhancements to basic gradient boosting.

A
  1. Adding tree constraints - It’s important that each decision tree have some skill, but remain weak overall. To do this, we can tune the number of trees (keep adding trees until improvement is no longer observed), tree depth (4-8 levels), the number of leaves, the minimum number of training observations per split, and the minimum improvement to loss per split.
  2. Weighted updates - The predictions of each tree are added together sequentially, and the contribution of each tree to this sum can be weighted by a learning rate. Smaller learning rates require more trees. “Shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.” Typical learning rates are 0.1-0.3, or even smaller than 0.1.
  3. Stochastic gradient boosting - Instead of each learner being fit on the full training data, in SGB, a subset of the data is randomly selected (without replacement). This can include a subset of rows before creating each tree, a subset of columns before creating each tree, or a subset of columns before creating each split. Aggressive sub-sampling - such as 50% of the data - has been shown to be the most effective.
  4. Penalized gradient boosting - Classical decision trees like CART are not usually used as weak learners. Instead, regression trees are typically used, which have numerical values for each leaf. These leaf values can be regularized using usual L1 and L2 functions. This helps avoid over-fitting.
81
Q

What is XGBoost?

A

Extreme Gradient Boosting or XGBoost is an efficient (i.e. fast) and effective (i.e. accurate) open-source implementation of the gradient boosting algorithm. It is the implementation that really caught on with the ML community, and it is a go-to method and often part of the winning solution in ML competitions.

Because randomness is involved in model training, a slightly different model will be created each time the model is trained. Because of this, it is best to evaluate model performance over multiple runs or across multiple rounds of cross-validation (e.g. repeated stratified k-fold cross-validation).

Consider tuning:
- Number of trees
- Tree depth
- Learning rate
- Number of samples
- Number of features

Code example: https://machinelearningmastery.com/extreme-gradient-boosting-ensemble-in-python/

82
Q

What is LightGBM?

A

LightGBM is an efficient (i.e. fast) and effective (i.e. accurate) open-source implementation of the gradient boosting algorithm. It extends traditional gradient boosting by adding a type of automatic feature selection (EFB) and by focusing on boosting examples with larger gradients (GOSS). This can result in a dramatic speed-up of training. Like XGBoost, it is a state-of-the-art model for problems on tabular data, and is a staple of winning solutions in ML competitions.

Consider tuning:
- Number of trees
- Tree depth
- Learning rate
- Boosting type

Code example: https://machinelearningmastery.com/light-gradient-boosted-machine-lightgbm-ensemble/

83
Q

Explain EFB in LightGBM.

A

EFB, or Exclusive Feature Bundling, is an addition to the Gradient Boosting Machine algorithm as implemented in LightGBM. It is a method to automatically reduce features and it can dramatically speed up training. It does this by combining (bundling) features that are sparse (mostly zero) and exclusive (they are never non-zero in the same place).

Example:
F1 -> [0, 0, 1, 0, 0, 2]
F2 -> [3, 3, 0, 0, 0, 0]
F1 and F2 bundled -> [3, 3, 1, 0, 0, 2]

84
Q

Explain GOSS in LightGBM.

A

GOSS, or Gradient-based One-Side Sampling is an addition to the Gradient Boosting Machine algorithm as implemented in LightGBM. It focuses attention on training examples that result in a larger gradient, and it can dramatically speed up training.

85
Q

Explain OOB in Random Forest

A

The Out-Of-Bag score, or OOB, is calculated for free as part of the random forest algorithm. Each decision tree in the ensemble is trained on a bootstrapped subset of the data, meaning that not all examples are given to all decision trees. If an example is not given to a particular decision tree, it’s considered out-of-bag for that tree.

At the end of training, for each example, a prediction is made by all the trees for which the example was out-of-bag. The OOB score is the fraction of these out-of-bag predictions that are correct.

Reference: https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710
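A minimal sklearn sketch (the data set is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each tree on its out-of-bag examples during fit
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # accuracy estimated from out-of-bag predictions only
```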

86
Q

How is OOB different from test/validation scores in Random Forest, and when should it be used?

A

The Out-Of-Bag score is computed on only a subset of examples and trees (the out-of-bag examples and the trees for which those examples were out-of-bag). Therefore, it is better to use metrics evaluated on a test/validation set, where all examples are evaluated by all decision trees in the ensemble.

However, OOB score may be used in cases where very little data is available for training, and you don’t want to split the data into training/test/val sets.

Reference: https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710

87
Q

What are Shapley values and how are they calculated?

A

Shapley values are a way to measure how much each feature contributes to a model’s prediction.

To calculate the Shapley value for feature f:
1. Create all possible combinations of features (excluding feature f). These sets of features are called “coalitions”.
2. Calculate the average prediction of the model across all examples.
3. For each coalition, calculate how different the prediction is from the average prediction WITH feature f.
4. For each coalition, calculate how different the prediction is from the average prediction WITHOUT feature f.
  5. Calculate the “marginal contribution” of feature f for that coalition: the prediction WITH feature f minus the prediction WITHOUT feature f (step 3 - step 4)
6. The Shapley value for feature f is the average marginal contribution of f across all coalitions.

In practice, this algorithm’s run time increases exponentially with the number of features. Instead, SHapley Additive exPlanations (SHAP values) are used.

Reference: https://www.aidancooper.co.uk/how-shapley-values-work/
Guide to interpreting SHAP analyses: https://www.aidancooper.co.uk/a-non-technical-guide-to-interpreting-shap-analyses/?xgtab&