ML Flashcards
ML: What are loss functions?
Functions that numerically compare your predictions Y* to the true values Y to provide feedback on the accuracy of your trained model.
ML: What is the typical loss function for classification? For regression?
Classification: 0/1 loss
Regression: Mean-Squared Error (MSE)
ML: What is a generative model?
It models P(Y|X) using P(X|Y)P(Y)
(derived from Bayes rule)
ML: What is a discriminative model?
Models P(Y|X) directly, as P(Y|X)
ML: What are some advantages of discriminative model over generative? What about generative over discriminative?
Discriminative models (Y|X) don’t need to make assumptions about the distribution of X|Y, and can be simpler as a result.
Generative models can be used to generate samples. They are also sometimes more intuitive.
ML: What is a Bayes classifier?
ML: What are 2 important characteristics of a Bayes classifier?
It is the best possible classifier
Its error is thus the irreducible error of a classification problem.
ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?
We would weight each possible error with its own amount of loss Li,j, pertaining to what you said the answer is and what the answer should’ve been.
ML: What is a simplistic view of logistic regression when Y is binary?
We are just doing a linear regression on our X’s, then squishing the outcome into [0,1]
ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?
For x of any dimension, it is as follows:
ML: What is the formula for the inverse logit function, aka sigmoid? And what does it accomplish?
It is shown below. It squishes all real numbers x into a range of [0,1]
ML: In binary logistic regression, given the formula for P(Y=1|X=x), how do we choose our predicted outcome?
We of course simply choose the outcome with the higher probability.
ML: What decision boundary does binary logistic regression yield?
It yields the linear separator:
ML: For binary logistic regression, how do we learn the coefficients of the model B0, B1, B2… in order to make predictions?
We estimate them using simple maximum likelihood. So we find the coefficients B0, B1, B2… that have the highest likelihood of producing the observed training data with the observed labels.
However, this optimization function using 0/1 loss is computationally hard, so we solve it iteratively using something like stochastic gradient descent.
ML: How to we extend binary logistic regression to multinomial logistic regression? So our X’s are still n-dimensional vectors, but now our Y’s are scalars in [1,K]?
ML: What issue arises when logistic regression is performed on data that is perfectly linearly separable? And how can we combat this?
The learned weights will go to infinity, and we will overfit.
We can combat this by regularizing!
ML: What does having a high number of parameters in your model do to the bias/variance tradeoff?
Having a lot of parameters, or a complex model, decreases bias (as we have more flexibility to fit whatever truly is the underlying model) but variance increases
ML: What does having a high number of features in your model do to the bias/variance tradeoff?
Having a lot of featuers decreases bias (as we are able to fit more of the possible underlying models), but variance increases due to the curse of dimensionality.
ML: In general, what kinds of changes to your model will decrease bias, but increase variance?
When you allow your algorithm the potential ability to fit a higher amount of potential underlying models, or more types of underlying models.
This of course decreases bias as you are more likely to hit the real model, but it increases variance because (I think) when you have more possible options and the same amount of data, it becomes more likely that one of them appears as the best-fitting model due simply to noise.
ML: In general, what sorts of changes to your algorithm will decrease variance, but increase bias?
Changes that make your algorithm less influenced by outliers or noise.
ML: What is a definition of model bias that is useful to think about when considering the bias/variance tradeoff?
Bias in a model is a lack of ability to fit the underlying model.
In other words, a lack of robustness with which to fit potential underlying models
ML: What is a definition of model variance that is useful to think about when considering the bias/variance tradeoff?
Model variance is its amount of susceptibility to changes due to noise.
ML: What is the naive bayes assumption, both in words and mathematically?
In words, the naive bayes assumption is that there are no interactions between the features of n-dimensional feature vector X, or that the context of one of the features Xi, and how it interacts with other features Xj, is not relevant to prediction.
(An example is assuming seeing “Donald Trump” is has the same predictive value as seeing those two words 10 words apart and in reverse order.)
Mathematically, we have for D-dimensional X:
ML: Is Naive Bayes a discriminative or generative model?
Generative
ML: What is the form of P(Y=y|X=x) under the Naive Bayes classification algorithm?
P
ML: How do we learn a Naive Bayes classifier? What parameters need to be estimated, and what options do we have to estimate them?
For each Y=y, we need to estimate P(Y=y), as well as P(Xi=xi|Y=y) for every possible value of Xi.
P(Y=y) is generally estimated through MLE. So, we just estimate it as the proportion of the training examples for which Y=y.
For P(Xi=xi|Y=y) it can vary. You could also do basic population MLE (The proportion of Xi=xi among training examples where Y=y), or you could learn a gaussian N(µ,σ2) as with gaussian naive bayes, or something else!
ML: What are the advantages and disadvantages of Naive Bayes?
The disadvantage is that we cannot capture context, which is often very important. Hence this algorithm being “naive”.
The advantage is that it’s simple, and is much easier to train: because we ignore interactions, we only need to learn a bunch of univariate distributions rather than one giant joint distribution, which is much harder (I believe exponentially harder).
ML: What do the decision boundaries look like in multinomial logistic regression?
We’re finding n linear (or hyperplane) separators, one for each of the K possible Y=k which models P(Y=k) vs P(Y not k). As such, it looks like a bunch of k linear (or hyperplane) separators working together on the cartesian grid.
ML: What is the most common way to train a decision tree, without yet worrying about pruning?
At each step, simply greedily add the split that most decreases classification error on the training set (or MSE in the case of regression).
ML: What is the common way to validate a decision tree and combat overfitting?
Pruning. Once you have your full tree trained on training data, at each step, greedily remove the branch that causes the largest improvement on validation error. Continue until you can’t find any more improvements to validation error.
ML: How do we predict on a new point X using a trained decision tree, for both classification and regression?
Simply work your way a branch until you get to the bottom. For classification, return the most common label among the training examples at that branch. For regression, return the average training label.
ML: What is the main advantage of decision trees? (Not in ensembles or anything, just by themselves.)
They’re very interpretable, especially to non-technical people
ML: What are some disadvantages of decision trees? (Not in ensembles or anything, just by themselves.)
They don’t predict very well.
Variance is very high, as small changes in training data can cause big differences in how the splits occur.
They also have a fair amount of bias, as they can only capture certain types of decision bounds (no “diagonal” decision bounds).
They also handle categorical predictor variables poorly if there are too many predictors.
ML: What issue do decision trees have with categorical predictor variables?
If the variable has N possible valuest each split, it must consider each of the 2N-1 subsets of the values as a splitting rule, which is of course very computationally hard.
ML: As the depth of a decision tree grows, what happens to bias and variance?
Bias decreases (as we increase complexity), and variance increases.
ML: How do ensemble classifiers generally work?
Several models are learned, and they vote on what the prediction is.
ML: What is boosting? How does it generally work?
In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.
ML: How does boosting improve the performance of the classifiers its using? In otherwords, how does bias and/or variance impacted before and after the use of boosting?
Because it uses very simple predictors (for example, stumps of decision trees), each one individually has very high bias. But, by fitting a lot of them and correcting for previous mistakes, bias is decreased.
ML: How does prediction work for a KNN classifier or regressor?
For each new point, you find the K points in the training set closest to it (by some measure such as euclidean distance); then for classification you’d return the most common label among the nearest neighbors, and for regression you’d return the average label of the nearest neighbors.
ML: What is the (somewhat high-level) algorithm for Adaboost?
We start with each training example Xi having equal weights wi. Then, repeat:
- Fit a new classifier that minimizes weighted training error, based on the weights wi. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more)
- Based on the weighted error of that classifier, find a weight alphaj for the classifier for when it votes later on. The lower its weighted error, the higher its alphaj, and thus the higher its voting power.
- Update each of the point weights wi based on this round’s results.
ML: How (at a high level) does gradient boosting work?
At each step, we find the current gradient of the loss function. We then find our next classifier to “approximate this gradient vector”.
So we in some way optimally take into account the gradient of the loss function at each step as we choose our classifiers.
ML: What issue do we run into when trying to minimize 0/1 loss for our classification problems? How can we solve this issue?
Because the 0/1 loss function is not smooth, as well as nonconvex, optimizing the solution is very hard, and you also can’t approximate it with iterative techniques like gradient descent (which is generally the solution when finding the true global minimum is impossible).
To combat this, you can instead iteratively minimize some similar loss function that puts an upper bound on 0/1 loss, such as hinge loss (pink), or exponential loss (yellow).
ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n Xi’s?
Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.
Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.
ML: How do bias and variance work with bagging? So what is the level bias and variance from one of the individual classifiers in the ensemble, and then what happens to bias and variance when we have several classifiers vote?
The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.
However, when we use several classifiers and have them vote, we can drive down this variance because we’re averaging votes from classifiers trained on different bootstrapped samples. (As n goes infinity, these samples become independent.)
ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?
Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.
Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.
ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?
Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.
(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).
This improvement makes performance more consistent, decreasing variance a bit.
ML: How do Random Forests work?
Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.
The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))
ML: Why is the Random Forest alteration to bagging often advantageous?
By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).
ML: What are two cons of bagging/random forests?
The models lack interpretibility.
Training and prediction both take a long time.
ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?
You use out-of-bag error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was not used in training some of the classifiers (about 36.8%). So, get a prediction on this point from these classifiers, and find the error on all the points from that method.
This is preferable to cross-validation because you don’t need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid it. But with enough trees, out-of-bag error tends to be very similar to CV error.
ML: What do variable-importance measures look at? On which algorithm are they most often used and why?
Variable-importance measures look at how important a variable is to the predictions of the model.
They’re commonly used for random forests, because random forests lack interpretability without them.
And more recently in my experience, they’re big for xgboost. Ensemble tree methods just lend themselves to variable importance metrics, because you can look at how often a tree chooses to split on a given variable, how the model performs with and without trees that use a certain variable, etc.
ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?
- Find all of the splits which use that variable in all of the decision trees, and then find how much on average those splits improve the predictions, using some measure of prediction performance like Gini index.
- Randomly permute each variable one at a time, and see how much model performance decreases for each. (I’m guessing this means to calculate out-of-bag error, and for each point, before getting the prediction, choose the value of the variable in question randomly.)
ML: What is clustering, and how does it differ from classification?
Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way “similar”. So basically, we try to label each point such that similar points have the same label.
Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren’t given a labeling scheme, and want to find a sensible one.
ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?
ML: How do you run the k-means clustering algorithm?
After choosing your number of classes k, randomly initialize the locations of your k cluster centers. Then, repeat:
- Assign all training points to the cluster center that is closest by euclidean distance
- Set cluster center as the average of all points in the cluster
ML: What are some drawbacks to k-means clustering?
- It’s nondeterministic due to the random initializations, and is not guarenteed to reach the global optimum.
- Cluster center isn’t interpretable, as it’s an average of a bunch of points
ML: What is an advantage to k-means clustering?
It will always converge, as each step will decrease the in-cluster variance (until it’s at a local minimum)
ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?
Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points.
So we initialize k points randomly to be the initial cluster centers, then repeat:
- Assign all training points to the cluster center (a real point) that is closest to it
- For each cluster, choose the point in the cluster which minimizes mean squared distance from other points as the new cluster center. (In other words, minimize in-cluster variation.)
ML: What is a pro of k-medoids over k-means? What is a con?
K-medoids provides more interpretable cluster centers, as they are actual points in the dataset rather than an average.
However, this is at the cost of having slightly higher in-cluster variation for k-medoids, as you simply have “fewer choices” of where to put your cluster center.
ML: At a high level (so for an arbitrary linkage), how does the hierarchical clustering algorithm work?
Each training point starts as its own cluster. Then at each step, we merge the 2 clusters with the highest similarity into one. We use linkages to determine similarity.
ML: What are a few advantages to hierarchical clustering?
- It is deterministic, and thus has lower variance
- You get clusterings for all values of k
- You have hierarchical clusters! If your clusters naturally have a sub-cluster architecture, this is useful.
ML: At a very high level, how do mixture models work?
When looking at our training data, we assume they come from a probability distribution that is a weighted sum of several simple underlying distributions, such as several gaussians.
ML: What advantage do mixture models have for clustering?
Rather than having each point discretely in one cluster, we compute the probability that each point is in each cluster. This is good for overlapping clusters, or for clusterings where some points have an unclear affiliation.