Machine Learning Flashcards
Precision/Recall Tradeoff
Firstly, both precision and recall are terms used within a classification context in which a model predicts a particular response (in this case, a binary one, such as has disease / does not have disease). Thus, the possible outcomes are true positive, false positive, true negative, and false negative, with the first part of the term referring to whether the model prediction was correct (true) or incorrect (false) and the second part referring to the prediction output by the model (positive or negative). In a false positive, for example, the model predicts a positive but was incorrect in doing so. (For example, it predicted the presence of a disease when in reality there was no disease.)
Precision is the number of true positives divided by the sum of true positives and false positives, i.e. TP / (TP + FP). It is the proportion of the positive classifier predictions that were, in fact, positive. Maximizing this is important since we want our model to be correct when predicting the positive class.
Recall is the number of true positives divided by the sum of true positives and false negatives, i.e. TP / (TP + FN). Therefore, it is the percentage of total positive cases captured. Recall is important to maximize since we want our model to capture all of the positive cases in the relevant universe.
Both metrics are relevant in evaluating a classification algorithm, but we need to do so carefully: improving one typically comes at the expense of the other, so maximizing both at once is not an easy task. How the tradeoff is resolved depends on which kind of error matters more in the classification problem at hand.
For example, in the disease example, failing to identify a life-threatening disease in a patient (a false negative) puts the patient's life at risk. Conversely, with a false positive (deeming someone as having the disease when they do not), the patient may undergo treatment with unpleasant, harmful side effects. Since these outcomes carry different costs, the tradeoff between precision and recall needs to be weighed accordingly.
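To make the tradeoff concrete, here is a minimal sketch (assuming scikit-learn, with made-up labels and probabilities) that sweeps the classification threshold: raising it makes precision rise while recall falls.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical disease labels (1 = has disease) and model probabilities.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.2, 0.35, 0.4, 0.5, 0.55, 0.7, 0.75, 0.85, 0.9])

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(threshold,
          precision_score(y_true, y_pred),  # rises as the threshold rises
          recall_score(y_true, y_pred))     # falls as the threshold rises
```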
At Uber, one important attribute for every ride booked is the MSA (Metropolitan Statistical Area) in which the ride originated. An MSA encompasses one or more economically tied cities and their surrounding areas, and can even extend beyond state lines.
For example, the Washington-Baltimore-Arlington MSA covers DC and parts of Maryland and Virginia.
As opposed to using city and state names, MSAs are a good feature for a variety of models at Uber because rider and driver market conditions tend to be similar within an MSA.
However, in the US there are hundreds of MSAs, and internationally there are thousands of MSAs (and their country-specific equivalents).
How would you encode this MSA variable when building a Machine Learning model?
If you were building a regression model, target encoding is one option. You’d replace each MSA with the mean of the target variable for that category.
For example, if you were building a model to predict the average number of minutes a rider had to wait to be picked up, for the NYC-Jersey MSA you'd just encode 6.2 minutes, but for the DC-Baltimore-Arlington MSA you'd encode 9.8 minutes.
If this were a binary classification problem – like trying to predict whether a ride request would be cancelled before the rider was picked up – you could encode the column using the conditional probability that the rider cancelled.
For example, the NYC-Jersey MSA might be encoded as 0.3, but the DC-Baltimore-Arlington MSA would be encoded as 0.15.
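A minimal sketch of target encoding, assuming pandas and a hypothetical ride-level DataFrame with columns msa and wait_minutes (names invented for illustration):

```python
import pandas as pd

# Hypothetical training data: one row per ride.
rides = pd.DataFrame({
    "msa": ["NYC-Jersey", "NYC-Jersey",
            "DC-Baltimore-Arlington", "DC-Baltimore-Arlington"],
    "wait_minutes": [5.9, 6.5, 9.6, 10.0],
})

# Target encoding: replace each MSA with the mean target value for that MSA.
# Compute the means on the training split only, to avoid leaking the target.
msa_means = rides.groupby("msa")["wait_minutes"].mean()
rides["msa_encoded"] = rides["msa"].map(msa_means)
```

For the binary-classification variant, the same pattern applies with the cancellation indicator (0/1) as the target, so the group mean is the conditional cancellation probability.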
One-hot encoding could be another option, because it converts a categorical value into numerical values.
You would take the MSA column, convert it into thousands of new columns (one per MSA), and assign a binary 1 or 0 to each of these columns.
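A sketch of the one-hot option with pandas; note that with thousands of MSAs this yields a very wide, sparse matrix:

```python
import pandas as pd

rides = pd.DataFrame({"msa": ["NYC-Jersey", "DC-Baltimore-Arlington", "NYC-Jersey"]})

# One binary indicator column per distinct MSA value.
one_hot = pd.get_dummies(rides["msa"], prefix="msa")
print(one_hot)
```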
Assumptions of Linear Regression
The main assumptions underlying linear regression are the following:
a) Linearity: The relationship between the feature set X and the target variable Y is linear.
b) Homoscedasticity: The variance of the residuals is constant.
c) Independence: All observations are independent of one another.
d) Normality: The residuals are normally distributed (equivalently, the distribution of Y conditional on X is Normal).
With respect to independence and normality, use of the term “i.i.d.” (independent and identically distributed) is common. If any of these assumptions are violated, any forecasts or confidence intervals based on the results of using the model will most likely be misleading or biased. The linear regression is likely to perform poorly out of sample as a result.
Bias-Variance Tradeoff
The bias-variance tradeoff is expressed as the following: Total model error = Bias² + Variance + Irreducible error.
Flexible models tend to have low bias and high variance, whereas more rigid models have high bias and low variance. The bias term comes from the error that occurs when a model underfits data. Having a high bias means that the machine learning model is too simple and may not adequately capture the relationship between the features and the target. An example would be using linear regression when the underlying relationship is nonlinear.
The variance term represents the error that occurs when a model overfits data. Having a high variance means that a model is susceptible to changes in training data, meaning that it is capturing and so reacting to too much noise. An example would be using a very complex neural network when the true underlying relationship between the features and the target is simply a linear one.
The irreducible term is the error that cannot be addressed directly by the model, such as noise in data measurement.
When creating a machine learning model, we want both bias and variance to be low, because we want to be able to have a model that predicts well but that also doesn’t change much when it is fed new data. The key point here is to prevent overfitting and, at the same time, to attempt to retain sufficient accuracy.
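A minimal sketch (assuming scikit-learn and NumPy, with synthetic data) illustrating the tradeoff: as polynomial degree grows, training error keeps falling while test error follows a U-shape.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)  # nonlinear truth + irreducible noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
for degree in (1, 3, 15):  # rigid (high bias) -> flexible (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),  # training error keeps falling
          mean_squared_error(y_te, model.predict(X_te)))  # test error is U-shaped
```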
Say you need to employ a binary classifier for the purpose of fraud detection. What metrics would you look at, how is each defined, and what is the interpretation of each?
Some main metrics to compute for a binary classifier are precision, recall, and the AUC (Area Under the Curve) of the ROC curve. Let TP = true positive, FP = false positive, TN = true negative, and FN = false negative.
Then, precision is defined as TP/(TP + FP). It answers the question, "What percent of the transactions predicted as fraudulent actually were fraudulent?", and it is important to maximize since you want your classifier to be as correct as possible in identifying a transaction as fraudulent.
Recall, on the other hand, is defined by TP/(TP + FN). It answers the question, “What percent of fraudulent cases were caught?”, and it is important to maximize since you want your classifier to have caught as many fraudulent cases as possible.
The ROC curve is constructed by plotting the true positive rate TP/(TP + FN) against the false positive rate FP/(FP + TN) for various thresholds, where the threshold determines whether a data point is labeled fraudulent or not fraudulent. The area under this curve (AUC) is a measure of the separability of the model; the closer it is to 1, the better the separation. It answers the question, "Is my classifier able to discriminate between fraud and not-fraud effectively?" and is important to maximize since you want your classifier to separate the two classes properly.
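A minimal sketch (assuming scikit-learn, with made-up labels and scores) computing all three metrics; note that AUC is computed from the raw scores rather than the thresholded labels:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Hypothetical labels (1 = fraud) and predicted fraud probabilities.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.05, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]  # threshold at 0.5

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(roc_auc_score(y_true, y_score))   # area under the ROC curve
```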
Describe why overfitting is a major concern for neural networks. What are some ways overfitting can be addressed?
Neural networks are prone to overfitting because their many interacting components (various activation functions, possibly many hidden layers, large numbers of weights, etc.) allow them to approximate essentially any function; as a result they are, in general, high-variance, low-bias models.
Overfitting can be a concern for any ML model because it will most likely lead to poor out-of-sample predictions. With respect to neural networks, this issue can be even more serious because they tend to adapt very well to the training data, since they are capable of approximating any arbitrarily complex function. To overcome this problem, we can do the following:
a) Reduce model complexity by changing the network structure and/or the parameters
b) Train the network using an increased amount of data, thereby making the model better able to generalize
c) Use regularization methods, such as adding a penalty to the loss function in order to "force" a decrease in the feature weight values, using dropout (i.e., stochastically removing inputs during training), or adding noise to the inputs during training
In addition, it helps to normalize the input values and to apply batch normalization so that the weights do not vary drastically during training.
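As an illustration of point c) and the normalization advice above, here is a minimal sketch (assuming PyTorch) combining dropout, batch normalization, and an L2 weight penalty; the architecture and hyperparameters are invented for illustration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # stochastically zero activations during training
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),  # stabilizes activations across training batches
    nn.ReLU(),
    nn.Linear(64, 1),
)

# weight_decay adds an L2 penalty on the weights to the loss.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```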
Compare and contrast gradient boosting and random forests.
In both gradient boosting and random forests, an ensemble of decision trees is used. Additionally, both are flexible models and don't need much data pre-processing.
However, there are two main differences. The first main difference is that, in gradient boosting, trees are built one at a time, such that successive weak learners learn from the mistakes of preceding weak learners. In random forests, the trees are built independently at the same time.
The second difference is in the output: Gradient boosting combines the results of the weak learners with each successive iteration, whereas, in random forests, the trees are combined at the end (through either averaging or majority).
Because of these structural differences, gradient boosting is often more prone to overfitting than random forests, due to its focus on mistakes over training iterations and the lack of independence in tree building. Additionally, gradient boosting hyperparameters are harder to tune than those of random forests.
Finally, gradient boosting may take longer to train than random forests because its trees are built sequentially. In real-life applications, gradient boosting generally excels when used on unbalanced datasets (fraud detection, for example), whereas random forests generally excel at multi-class object detection with noisy data (computer vision, for example).
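A minimal side-by-side sketch (assuming scikit-learn, synthetic data); the hyperparameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Trees built sequentially, each correcting the previous ones' mistakes.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
# Trees built independently on bootstrap samples, then combined by vote.
rf = RandomForestClassifier(n_estimators=100)

print(cross_val_score(gbm, X, y, cv=5).mean())
print(cross_val_score(rf, X, y, cv=5).mean())
```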
Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?
Random forests are used because individual decision trees are usually prone to overfitting. A random forest builds many decision trees and averages their decisions, and it can be used for either classification or regression. There are a few main ways in which random forests allow for stronger out-of-sample prediction than individual decision trees do.
a) As in other ensemble models, using a large set of trees created in a resample of the data (bootstrap aggregation) will lead to a model yielding more consistent results. More specifically, and in contrast to decision trees, it leads to diversity in training data for each tree and so contributes to better results in terms of bias-variance tradeoff (particularly with respect to variance).
b) Using only m < p features at each split helps to de-correlate the decision trees, thereby preventing very important features from always appearing at the first splits of the trees (which would happen in standalone trees due to the nature of information gain).
c) They're fairly easy to implement and fast to run.
d) They can produce very interpretable feature-importance values, thereby improving model understandability and feature selection.
Points a) and b) are the main keys to improving upon single decision trees.
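A brief sketch (assuming scikit-learn, synthetic data) showing where these ideas surface as hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# bootstrap=True trains each tree on a resample of the data (point a);
# max_features="sqrt" considers only m < p features at each split (point b).
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, max_features="sqrt")
rf.fit(X, y)
print(rf.feature_importances_)  # interpretable importance values (point d)
```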
Describe the idea behind boosting. Give an example of one boosting method, and describe one advantage and one disadvantage of this method.
Boosting is a type of ensemble model that trains a sequence of "weak" models (such as small decision trees), where each one sequentially compensates for the weaknesses of the preceding models. Such weaknesses can be measured by the current model's error rate and used to decide which observations the next model should focus on. Each training point within the dataset is assigned a weight and is continually re-weighted in an iterative fashion; in other words, points that are predicted incorrectly take on higher weights in each iteration. In this way, more emphasis is placed on, and more of the training involves, points that are harder to predict.
One popular example is AdaBoost (adaptive boosting), which is used for classification. This method sequentially combines decision trees with a single split and works as follows:
Weights are uniformly set for all data points.
For M classifiers, perform the following steps:
a) Construct a classifier for which the data are weighted by the previous step's weights.
b) Compute the weighted error for that classifier.
c) Re-weight the data points according to whether each was classified correctly or incorrectly by the classifier.
After M iterations, we can sum the weighted predictions of each classifier to obtain a final prediction.
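A minimal sketch (assuming scikit-learn, synthetic data); AdaBoostClassifier's default base learner is a single-split decision tree (a stump), matching the description above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each round re-weights the data toward previously misclassified points;
# the final prediction is a weighted vote of the M = 50 stumps.
clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0)
clf.fit(X, y)
print(clf.score(X, y))
```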
One advantage of boosting is that it is powerful and flexible, since it can accommodate a variety of loss functions and handle data in various forms. On the flip side, its disadvantages are that the training process can over-emphasize outliers and noise and hence cause overfitting, and that the whole process can be computationally intensive.
Say you are running a simple logistic regression to solve a problem but find the results to be unsatisfactory.
What are some ways you might improve your model and which other models might you consider using instead?
There are several ways to improve the performance of a logistic regression:
a) Normalizing features: The features should be normalized such that particular weights do not dominate within the model.
b) Adding more features: Depending on the problem, it may simply be the case that there aren’t enough useful features. In general, logistic regression is high bias, so adding more features should be helpful.
c) Addressing outliers: Identify and decide whether to retain or remove them.
d) Selecting variables: See if any features have introduced too much noise into the process.
e) Cross-validation and hyperparameter tuning: Using k-fold cross-validation along with hyperparameter tuning (for example, introducing a penalty term for regularization purposes) should help improve the model.
f) The classes may not be linearly separable (logistic regression produces linear decision boundaries), and, therefore, it would be worth looking into SVMs, tree-based approaches, or neural networks instead.
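As a sketch of points a) and e) (assuming scikit-learn, synthetic data), normalization and regularization tuning can be combined in one pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Normalize features (point a), then tune the regularization strength C
# with 5-fold cross-validation (point e).
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```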
Say you just ran a linear regression model. Describe the diagnostics you would use to assess the validity of your results.
For linear regression, simply looking at R^2 can be misleading. It is important to check the following diagnostics, which can easily be assessed visually.
a) Residual values vs. fitted values: We can look at a scatterplot of residual values versus fitted values. If there are nonlinear patterns here, then they indicate nonlinear relationships within the data. In addition, if the variance of the residuals is not constant, this indicates the presence of heteroscedasticity.
b) QQ-plot: Plotting quantiles of standardized residuals versus theoretical quantiles. If the residuals are normally distributed, this plot should be close to a straight, dashed line. If the QQ-plot indicates that the residuals are not normally distributed, then a transformation (for example, a log transformation of the response) may be necessary.
c) Cook’s distance: Cook’s distance and leverage can identify highly influential data points (outliers). Conceptually, it calculates the regression model with and without any given data point and observes how the regression model changes. If highly influential data points are present, then it is worth considering whether to remove or cap those particular values.
d) Scale-location plot: Plots the square root of the standardized residuals against the fitted values. If the points form a horizontal band with equal spread, this indicates homoscedasticity; otherwise, there is heteroscedasticity.
When performing these diagnostics, identifying what may have gone wrong is important. For example, look out for nonlinearity of the relationship being modeled, bias due to omitted variables, poor sampling of observations in the distribution's tails, and so on.
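A minimal sketch (assuming statsmodels, with simulated data) covering diagnostics a) through c):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=100)

results = sm.OLS(y, X).fit()
residuals = results.resid
fitted = results.fittedvalues            # for a residuals-vs-fitted scatter (point a)
sm.qqplot(residuals, line="45")          # QQ-plot against Normal quantiles (point b)
cooks_d = results.get_influence().cooks_distance[0]  # influential points (point c)
```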
Motivation for cross-validation
Cross-validation is a technique used to assess the performance of an algorithm on several resamples/subsamples of the training data. It consists of repeatedly running the algorithm on a subsample of the training data (the original data minus some held-out observations) and evaluating model performance on the portion of the data that was not present in the subsample.
This process is repeated many times for the different subsamples, and the results are combined at the end. This step is very important in ML because it reveals the quality and consistency of the model’s true performance.
The process is as follows:
Randomly shuffle the data into k equally sized blocks (folds).
For each i in 1…k, train the model on all the data except fold i, and evaluate the validation error using fold i.
Average the k validation errors from step 2 to get an estimate of the true error.
This process aids in accomplishing the following: 1) Avoiding training and testing on the same subsets of data points, which would lead to overfitting, and 2) Avoiding using a dedicated validation set, with which no training can be done. The second of these points is particularly important in cases where very little training data is available or the data collection process is expensive. One drawback of this process, however, is that roughly k times more computation is needed than using a dedicated hold-out validation set. In practice, cross-validation works very well for smaller datasets.
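A minimal sketch of the k-fold procedure (assuming scikit-learn, with the built-in iris dataset):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold CV: each fold serves once as the validation set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
print(scores.mean())  # average of the k validation scores
```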
Generative and Discriminative Models
Both generative and discriminative models are used in the context of classification in assigning a given data point to one of K possible classes. Generative models look at the joint probability distribution between X and Y. For any given input, we want to classify an arbitrary data point x with the following class label:
\hat{y} = \argmax_{k} p(x, Y=k)
The joint distribution between X and Y can be factored as
p(X, Y) = p(X|Y)p(Y)
and for each class k we have
p_k(X) = p(X|k)p(k)
such that
\hat{y} = \argmax_{k} p(Y=k|x)
This type of model is concerned with how the feature set is generated in order to produce classifications, explaining why it models the joint distribution of the features and the target. We call it “generative” because it can also be useful to generate new data points. A common type of generative model is the naive Bayes algorithm.
Discriminative models directly learn a decision boundary by choosing a class that maximizes the following:
\hat{y} = \argmax_{k} p(Y=k|x)
Classification models are frequently discriminative models, since they’re concerned with finding the probability of each class given a known feature set. Examples of such models are logistic regression and neural networks.
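A minimal sketch contrasting the two families (assuming scikit-learn, synthetic data): naive Bayes models p(X|Y=k) and p(Y=k), while logistic regression models p(Y=k|x) directly.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Generative: fits per-class feature distributions plus class priors.
gen = GaussianNB().fit(X, y)
# Discriminative: fits the decision boundary / p(Y=k|x) directly.
disc = LogisticRegression(max_iter=1000).fit(X, y)

print(gen.predict_proba(X[:1]), disc.predict_proba(X[:1]))
```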
Information Gain and Entropy
Because information gain is based on entropy, we’ll discuss entropy first.
The formula for entropy is
H(Y) = -\sum_{k} P(Y=k) \log_2 P(Y=k)
The equation above yields the amount of entropy present and shows exactly how homogeneous a sample is (based on the attribute being split). Consider a case where k = 2. Let a and b be two outputs/labels that we are trying to classify. Given these values, the formula considers the proportion of values in the sample that are a and the proportion that are b, with the sample being split on a different attribute.
A completely homogeneous sample will have an entropy of 0. For instance, if a sample consists entirely of a's (and no b's), then its entropy would be (using log base 2, and taking 0 · log(0) = 0):
Entropy = -1 * Log_2(1) - 0 * Log_2(0) = 0
whereas a completely even split (50%-50%) would result in an entropy of 1. A lower entropy means a more homogeneous sample.
Information gain is based on the decrease in entropy after splitting on an attribute.
IG(X_j, Y) = H(Y) - H(Y|X_j)
This concept is better explained with a simple numerical example. Consider the above case again with k = 2. Let's say there are 5 instances of value a and 5 instances of value b. Then, we decide to split on some attribute X. When X = 1, there are 5 a's and 1 b, whereas when X = 0 there are 4 b's and 0 a's.
Entropy(After) = (4/10) * Entropy(X=0) + (6/10) * Entropy(X=1)
The entropy value for X = 0 is 0, since the sample is homogeneous (all b's, no a's).
Entropy(X=0) = 0
\text{Entropy}(X = 1) = -(1/6)Log_2(1/6) - (5/6)Log_2(5/6) = .65
Plug these into the Entropy(After) formula to obtain the following:
\text{Entropy}(\text{After}) = (4/10) \cdot 0 + (6/10) \cdot .65 = .39
Finally, we can go back to our original formula. Since the original sample is an even 50%-50% split, H(Y) = 1, so the information gain is:
IG = 1 - .39 = .61
This result makes intuitive sense since, ideally, we want to split on an attribute that separates the output perfectly. In other words, we ideally want to split on an attribute whose resulting subsets are homogeneous with regard to the output and thus have an entropy equal to 0.
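The worked example can be checked with a few lines of Python (a minimal sketch using only the standard library):

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The example above: 5 a's and 5 b's overall, so H(Y) = 1.
h_before = entropy([5, 5])

# After splitting on X: X=1 has 5 a's and 1 b; X=0 has 0 a's and 4 b's.
h_after = (6/10) * entropy([5, 1]) + (4/10) * entropy([0, 4])  # ~0.39

print(h_before - h_after)  # information gain ~0.61
```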
Identifying synonyms
To find synonyms, we can first learn word embeddings from a corpus of text. Word2vec is a popular algorithm for doing so. It produces vectors for words based on the words' contexts; vectors that are closer together (e.g., in Euclidean or cosine distance) are meant to represent words that are also closer in context and meaning.
The generated word embeddings are the learned weights of the underlying model. The distance between these vectors can be used to measure similarity, for example via cosine similarity or some other similar measure.
Once we have these word embeddings, we can run an algorithm such as K-means clustering to identify clusters within the word embeddings, or run a K-nearest neighbors search around a particular word for which we want to find synonyms.
However, there are some edge cases since Word2Vec can produce similar vectors even in the case of antonyms; consider the words “hot” and “cold”, for example, which have opposite meanings but appear in many similar contexts (related to temperature or in a Katy Perry song).
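A minimal sketch (assuming the gensim library; the toy corpus is invented, and in practice you would train on a large text collection):

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus of pre-tokenized sentences.
sentences = [
    ["the", "weather", "is", "hot", "today"],
    ["the", "weather", "is", "cold", "today"],
    ["a", "warm", "and", "sunny", "afternoon"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)

# Nearest neighbors in embedding space (by cosine similarity) serve as
# synonym candidates -- but note they may include antonyms like "cold",
# which occur in very similar contexts.
print(model.wv.most_similar("hot", topn=3))
```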