ML Flashcards

Question 1

Q

ML: What are loss functions?

Answer

A

Functions that numerically compare your predictions Y* to the true values Y to provide feedback on the accuracy of your trained model.

Question 2

Q

ML: What is the typical loss function for classification? For regression?

Answer

A

Classification: 0/1 loss

Regression: Mean-Squared Error (MSE)
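
A minimal sketch of the two losses, assuming labels and predictions arrive as plain Python lists (the function names are made up for illustration):

```python
def zero_one_loss(y_true, y_pred):
    """Fraction of predictions that differ from the true labels."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

clf_loss = zero_one_loss([1, 0, 1, 1], [1, 1, 1, 0])             # 2 of 4 labels wrong
reg_loss = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])  # one prediction off by 2
```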

Question 3

Q

ML: What is a generative model?

Answer

A

It models P(Y|X) using P(X|Y)P(Y)
(derived from Bayes rule)

Question 4

Q

ML: What is a discriminative model?

Answer

A

Models P(Y|X) directly, as P(Y|X)

Question 5

Q

ML: What are some advantages of discriminative model over generative? What about generative over discriminative?

Answer

A

Discriminative models (Y|X) don’t need to make assumptions about the distribution of X|Y, and can be simpler as a result.

Generative models can be used to generate samples. They are also sometimes more intuitive.

Question 6

Q

ML: What is a Bayes classifier?

Question 7

Q

ML: What are 2 important characteristics of a Bayes classifier?

Answer

A

It is the best possible classifier

Its error is thus the irreducible error of a classification problem.

Question 8

Q

ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?

Answer

A

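We would assign each possible error its own amount of loss L_i,j, the cost of predicting class i when the true class is j. The classifier then predicts the class with the lowest expected loss under P(Y|X=x), rather than simply the most probable class.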
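
A small sketch of that decision rule, assuming the posterior P(Y=j|X=x) is already known and that loss[i][j] is the cost of predicting i when the truth is j (both the numbers and the function name are illustrative):

```python
def min_expected_loss_class(posterior, loss):
    """posterior[j] = P(Y=j | X=x); loss[i][j] = cost of predicting i when truth is j."""
    n = len(posterior)
    expected = [sum(loss[i][j] * posterior[j] for j in range(n)) for i in range(n)]
    return min(range(n), key=lambda i: expected[i]), expected

posterior = [0.7, 0.3]   # class 0 is more probable
loss = [[0, 10],         # predicting 0 when the truth is 1 costs 10
        [1, 0]]          # predicting 1 when the truth is 0 costs 1
choice, expected = min_expected_loss_class(posterior, loss)
```

Here the costly error pushes the prediction to class 1 even though class 0 is more likely.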

Question 9

Q

ML: What is a simplistic view of logistic regression when Y is binary?

Answer

A

We are just doing a linear regression on our X’s, then squishing the outcome into (0,1) so it can be read as a probability.

Question 10

Q

ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?

Answer

A

For x of any dimension d, it is as follows:

P(Y=1|X=x) = 1 / (1 + exp(-(B₀ + B₁x₁ + … + B_d x_d)))

Question 11

Q

ML: What is the formula for the inverse logit function, aka sigmoid? And what does it accomplish?

Answer

A

It is sigmoid(x) = 1 / (1 + e^(-x)). It squishes all real numbers x into the range (0,1).
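
A minimal sketch of the sigmoid 1 / (1 + e^(-z)) and of how it turns a linear score into P(Y=1|X=x). The branch on the sign of z is just one way to avoid overflow, and the names `beta0` and `beta` are illustrative stand-ins for B₀ and the remaining coefficients:

```python
import math

def sigmoid(z):
    """Inverse logit, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, beta0, beta):
    """P(Y=1 | X=x) for a feature vector x of any dimension."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))

p = predict_proba([1.0, 2.0], -1.0, [1.0, 0.0])  # linear score is exactly 0
```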

Question 12

Q

ML: In binary logistic regression, given the formula for P(Y=1|X=x), how do we choose our predicted outcome?

Answer

A

We of course simply choose the outcome with the higher probability.

Question 13

Q

ML: What decision boundary does binary logistic regression yield?

Answer

A

It yields the linear separator (the set of points where P(Y=1|X=x) = 0.5):

B₀ + B₁x₁ + … + B_d x_d = 0

Question 14

Q

ML: For binary logistic regression, how do we learn the coefficients of the model B₀, B₁, B₂… in order to make predictions?

Answer

A

We estimate them using simple maximum likelihood. So we find the coefficients B₀, B₁, B₂… that have the highest likelihood of producing the observed training data with the observed labels.

However, this maximization has no closed-form solution (note the objective here is the log-likelihood, not 0/1 loss), so we solve it iteratively using something like stochastic gradient descent.
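
A rough sketch of the iterative fit, using full-batch gradient ascent on the average log-likelihood rather than the stochastic variant (to keep it deterministic). The learning rate, epoch count, and toy data are all assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z)) if z >= 0 else math.exp(z) / (1.0 + math.exp(z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Gradient ascent on the average log-likelihood; returns (beta0, beta)."""
    d = len(xs[0])
    beta0, beta = 0.0, [0.0] * d
    for _ in range(epochs):
        g0, g = 0.0, [0.0] * d
        for x, y in zip(xs, ys):
            # Gradient of the log-likelihood for one point is (y - p) * x.
            err = y - sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))
            g0 += err
            for j in range(d):
                g[j] += err * x[j]
        beta0 += lr * g0 / len(xs)
        beta = [b + lr * gj / len(xs) for b, gj in zip(beta, g)]
    return beta0, beta

# Toy 1-D data with overlapping classes (not linearly separable).
xs = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
ys = [0, 0, 1, 0, 1, 1]
beta0, beta = fit_logistic(xs, ys)
```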

Question 15

Q

ML: How do we extend binary logistic regression to multinomial logistic regression? So our X’s are still n-dimensional vectors, but now our Y’s are integers in {1,…,K}?

Question 16

Q

ML: What issue arises when logistic regression is performed on data that is perfectly linearly separable? And how can we combat this?

Answer

A

The learned weights will go to infinity, and we will overfit.

We can combat this by regularizing!

Question 17

Q

ML: What does having a high number of parameters in your model do to the bias/variance tradeoff?

Answer

A

Having a lot of parameters, or a complex model, decreases bias (as we have more flexibility to fit whatever truly is the underlying model) but variance increases

Question 18

Q

ML: What does having a high number of features in your model do to the bias/variance tradeoff?

Answer

A

Having a lot of features decreases bias (as we are able to fit more of the possible underlying models), but variance increases due to the curse of dimensionality.

Question 19

Q

ML: In general, what kinds of changes to your model will decrease bias, but increase variance?

Answer

A

Changes that let your algorithm fit a larger number of potential underlying models, or more types of underlying models.

This of course decreases bias as you are more likely to hit the real model, but it increases variance because (I think) when you have more possible options and the same amount of data, it becomes more likely that one of them appears as the best-fitting model simply due to noise.

Question 20

Q

ML: In general, what sorts of changes to your algorithm will decrease variance, but increase bias?

Answer

A

Changes that make your algorithm less influenced by outliers or noise.

Question 21

Q

ML: What is a definition of model bias that is useful to think about when considering the bias/variance tradeoff?

Answer

A

Bias in a model is a lack of ability to fit the underlying model.

In other words, a lack of flexibility with which to fit potential underlying models.

Question 22

Q

ML: What is a definition of model variance that is useful to think about when considering the bias/variance tradeoff?

Answer

A

Model variance is its amount of susceptibility to changes due to noise.

Question 23

Q

ML: What is the naive bayes assumption, both in words and mathematically?

Answer

A

In words, the naive bayes assumption is that the features of D-dimensional feature vector X are conditionally independent given the class Y. So once we know Y, there are no interactions between the features: the context of one of the features X_i, and how it interacts with other features X_j, is not relevant to prediction.

(An example is assuming that seeing “Donald Trump” has the same predictive value as seeing those two words 10 words apart and in reverse order.)

Mathematically, we have for D-dimensional X:

P(X=x|Y=y) = P(X_1=x_1|Y=y) · P(X_2=x_2|Y=y) · … · P(X_D=x_D|Y=y)

Question 24

Q

ML: Is Naive Bayes a discriminative or generative model?

Answer

A

Generative

Question 25

Q

ML: What is the form of P(Y=y|X=x) under the Naive Bayes classification algorithm?

Question 26

Q

ML: How do we learn a Naive Bayes classifier? What parameters need to be estimated, and what options do we have to estimate them?

Answer

A

For each Y=y, we need to estimate P(Y=y), as well as P(X_i=x_i|Y=y) for every possible value of X_i.

P(Y=y) is generally estimated through MLE. So, we just estimate it as the proportion of the training examples for which Y=y.

For P(X_i=x_i|Y=y) it can vary. You could also do basic population MLE (the proportion of X_i=x_i among training examples where Y=y), or you could learn a Gaussian N(µ,σ²) as with Gaussian naive bayes, or something else!
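
A minimal sketch of the counting (MLE) version for categorical features. There is no smoothing here, so a feature value never seen with a class gets probability 0; the toy weather data and function names are invented for illustration:

```python
from collections import Counter, defaultdict

def fit_naive_bayes(X, y):
    """Estimate P(Y=c) and P(X_i=v | Y=c) as training-set proportions."""
    n = len(y)
    class_counts = Counter(y)
    prior = {c: k / n for c, k in class_counts.items()}
    counts = defaultdict(Counter)          # (class, feature index) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            counts[(c, i)][v] += 1

    def likelihood(c, i, v):
        return counts[(c, i)][v] / class_counts[c]

    return prior, likelihood

def predict(prior, likelihood, row):
    """Score each class by P(Y=c) * product of P(X_i=x_i | Y=c); return the best."""
    scores = {}
    for c, p in prior.items():
        for i, v in enumerate(row):
            p *= likelihood(c, i, v)
        scores[c] = p
    return max(scores, key=scores.get), scores

X = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "hot")]
y = ["no", "no", "yes", "no"]
prior, likelihood = fit_naive_bayes(X, y)
label, scores = predict(prior, likelihood, ("rainy", "mild"))
```

The scores are unnormalized; dividing by their sum would give P(Y=y|X=x).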

Question 27

Q

ML: What are the advantages and disadvantages of Naive Bayes?

Answer

A

The disadvantage is that we cannot capture context, which is often very important. Hence this algorithm being “naive”.

The advantage is that it’s simple, and is much easier to train: because we ignore interactions, we only need to learn a bunch of univariate distributions rather than one giant joint distribution, which is much harder (I believe exponentially harder).

Question 28

Q

ML: What do the decision boundaries look like in multinomial logistic regression?

Answer

A

We’re finding K linear (or hyperplane) separators, one for each of the K possible classes Y=k, each of which models P(Y=k) vs P(Y not k). As such, it looks like a bunch of K linear (or hyperplane) separators working together on the cartesian grid.

Question 29

Q

ML: What is the most common way to train a decision tree, without yet worrying about pruning?

Answer

A

At each step, simply greedily add the split that most decreases classification error on the training set (or MSE in the case of regression).

Question 30

Q

ML: What is the common way to validate a decision tree and combat overfitting?

Answer

A

Pruning. Once you have your full tree trained on training data, at each step, greedily remove the branch that causes the largest improvement on validation error. Continue until you can’t find any more improvements to validation error.

Question 31

Q

ML: How do we predict on a new point X using a trained decision tree, for both classification and regression?

Answer

A

Simply work your way down the tree, following the branch that matches X at each split, until you get to a leaf. For classification, return the most common label among the training examples at that leaf. For regression, return the average training label at that leaf.

Question 32

Q

ML: What is the main advantage of decision trees? (Not in ensembles or anything, just by themselves.)

Answer

A

They’re very interpretable, especially to non-technical people

Question 33

Q

ML: What are some disadvantages of decision trees? (Not in ensembles or anything, just by themselves.)

Answer

A

They don’t predict very well.

Variance is very high, as small changes in training data can cause big differences in how the splits occur.

They also have a fair amount of bias, as they can only capture certain types of decision bounds (no “diagonal” decision bounds).

They also handle categorical predictor variables poorly if a variable has too many possible values.

Question 34

Q

ML: What issue do decision trees have with categorical predictor variables?

Answer

A

If the variable has N possible values, then at each split, it must consider each of the 2^(N-1) - 1 ways to divide those values into two groups as a splitting rule, which is of course very computationally hard.

Question 35

Q

ML: As the depth of a decision tree grows, what happens to bias and variance?

Answer

A

Bias decreases (as we increase complexity), and variance increases.

Question 36

Q

ML: How do ensemble classifiers generally work?

Answer

A

Several models are learned, and they vote on what the prediction is.

Question 37

Q

ML: What is boosting? How does it generally work?

Answer

A

In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.

Question 38

Q

ML: How does boosting improve the performance of the classifiers it’s using? In other words, how are bias and/or variance impacted before and after the use of boosting?

Answer

A

Because it uses very simple predictors (for example, stumps of decision trees), each one individually has very high bias. But, by fitting a lot of them and correcting for previous mistakes, bias is decreased.

Question 39

Q

ML: How does prediction work for a KNN classifier or regressor?

Answer

A

For each new point, you find the K points in the training set closest to it (by some measure such as euclidean distance); then for classification you’d return the most common label among the nearest neighbors, and for regression you’d return the average label of the nearest neighbors.
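
A minimal sketch of KNN prediction with Euclidean distance; the brute-force sort and the toy points are just for illustration:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=3, regression=False):
    """Predict for x from its k nearest training points (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    labels = [label for _, label in nearest]
    if regression:
        return sum(labels) / k                     # average label of the neighbors
    return Counter(labels).most_common(1)[0][0]    # most common label

train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]
pred_class = knn_predict(train_X, ["a", "a", "a", "b", "b"], (0.2, 0.2))
pred_value = knn_predict(train_X, [1.0, 2.0, 3.0, 10.0, 11.0], (0.2, 0.2), regression=True)
```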

Question 40

Q

ML: What is the (somewhat high-level) algorithm for Adaboost?

Answer

A

We start with each training example X_i having equal weights w_i. Then, repeat:

Fit a new classifier that minimizes weighted training error, based on the weights w_i. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more)
Based on the weighted error of that classifier, find a weight alpha_j for the classifier for when it votes later on. The lower its weighted error, the higher its alpha_j, and thus the higher its voting power.
Update each of the point weights w_i based on this round’s results.
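
The loop above can be sketched as follows, assuming labels in {-1, +1}, 1-D inputs, and threshold stumps as the weak classifiers. The alpha formula 0.5 * ln((1 - err) / err) and the exponential weight update are the standard AdaBoost choices, which the card itself doesn’t spell out:

```python
import math

def fit_stump(xs, ys, w):
    """Best threshold/sign rule under weights w; returns (weighted error, threshold, sign)."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= t else -sign for x in xs]
            err = sum(wi for wi, p, y in zip(w, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    w = [1.0 / n] * n                                # equal starting weights
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)        # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)      # lower error -> larger vote
        ensemble.append((alpha, t, sign))
        # Up-weight misclassified points, down-weight correct ones, renormalize.
        w = [wi * math.exp(-alpha * y * (sign if x >= t else -sign))
             for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]      # no single threshold separates these
model = adaboost(xs, ys, rounds=3)
train_preds = [predict(model, x) for x in xs]
```

On this toy set no single stump gets every point right, but the weighted vote of three does.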

Question 41

Q

ML: How (at a high level) does gradient boosting work?

Answer

A

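At each step, we find the current gradient of the loss function with respect to the ensemble’s predictions on the training points. We then fit our next classifier to approximate the negative of this gradient vector, and add it to the ensemble with a small step size.

So each new classifier acts like one step of gradient descent on the loss, taken in the space of predictions rather than in the space of parameters.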
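
A rough sketch of the idea for regression with squared loss, where the negative gradient with respect to each prediction is just the residual. The stump learner, learning rate, and data are assumptions chosen to keep the example small:

```python
def fit_stump(xs, targets):
    """Least-squares regression stump on 1-D inputs: (threshold, left_value, right_value)."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, targets) if x < t]
        right = [r for x, r in zip(xs, targets) if x >= t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def gradient_boost(xs, ys, rounds=50, lr=0.5):
    base = sum(ys) / len(ys)                 # start from the mean prediction
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared loss, the negative gradient at each point is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        preds = [p + lr * (lv if x < t else rv) for p, x in zip(preds, xs)]
    return base, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
base, stumps, preds = gradient_boost(xs, ys)
initial_mse = mse(ys, [base] * len(ys))
final_mse = mse(ys, preds)
```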

Question 42

Q

ML: What issue do we run into when trying to minimize 0/1 loss for our classification problems? How can we solve this issue?

Answer

A

Because the 0/1 loss function is not smooth, as well as nonconvex, optimizing it exactly is very hard, and you also can’t fall back on iterative techniques like gradient descent (which is generally the solution when finding the true global minimum is impossible), because its gradient is zero almost everywhere.

To combat this, you can instead iteratively minimize some similar loss function that puts an upper bound on 0/1 loss, such as hinge loss or exponential loss.
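
A small sketch comparing the three losses as functions of the margin y·f(x), assuming labels y in {-1, +1} and counting a margin of exactly 0 as an error:

```python
import math

def zero_one(margin):
    return 1.0 if margin <= 0 else 0.0

def hinge(margin):
    return max(0.0, 1.0 - margin)

def exponential(margin):
    return math.exp(-margin)

margins = [-2.0, -0.5, 0.0, 0.5, 2.0]
# Each row: (margin, 0/1 loss, hinge loss, exponential loss)
table = [(m, zero_one(m), hinge(m), exponential(m)) for m in margins]
```

At every margin, both surrogates sit at or above the 0/1 loss, which is what makes them upper bounds.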

Question 43

Q

ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n X_i’s?

Answer

A

Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.

Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.
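
A minimal sketch of those two steps. The base learner here is a hypothetical 1-nearest-neighbor rule on scalar inputs, standing in for whatever complex model you would actually bag:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagging_fit(data, fit, m, seed=0):
    """Train one model per bootstrapped sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(m)]

def bagging_predict(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

def fit_1nn(sample):
    """Hypothetical base learner: 1-nearest-neighbor on (x, label) pairs."""
    return lambda x: min(sample, key=lambda pair: abs(pair[0] - x))[1]

data = [(0.0, "a"), (0.5, "a"), (1.0, "a"), (4.0, "b"), (4.5, "b"), (5.0, "b")]
models = bagging_fit(data, fit_1nn, m=25)
low, high = bagging_predict(models, 0.2), bagging_predict(models, 4.8)
```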

Question 44

Q

ML: How do bias and variance work with bagging? So what is the level bias and variance from one of the individual classifiers in the ensemble, and then what happens to bias and variance when we have several classifiers vote?

Answer

A

The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.

However, when we use several classifiers and have them vote, we can drive down this variance because we’re averaging votes from classifiers trained on different bootstrapped samples. (As n goes to infinity, these samples become independent.)

Question 45

Q

ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?

Answer

A

Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.

Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.

Question 46

Q

ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?

Answer

A

Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.

(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).

This improvement makes performance more consistent, decreasing variance a bit.

Question 47

Q

ML: How do Random Forests work?

Answer

A

Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.

The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))

Question 48

Q

ML: Why is the Random Forest alteration to bagging often advantageous?

Answer

A

By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).

Question 49

Q

ML: What are two cons of bagging/random forests?

Answer

A

The models lack interpretability.

Training and prediction both take a long time.

Question 50

Q

ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?

Answer

A

You use out-of-bag error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was not used in training some of the classifiers (about 36.8%). So, get a prediction on this point from these classifiers, and find the error on all the points from that method.
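
The 36.8% figure can be checked directly: a bootstrap sample of size n misses any given point with probability (1 - 1/n)^n, which approaches 1/e. A quick sketch:

```python
import math

def prob_out_of_bag(n):
    """Chance a given point is missing from one bootstrap sample of size n."""
    return (1 - 1 / n) ** n

approx = prob_out_of_bag(1000)
limit = math.exp(-1)
```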

This is preferable to cross-validation because you don’t need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid it. But with enough trees, out-of-bag error tends to be very similar to CV error.

Question 51

Q

ML: What do variable-importance measures look at? On which algorithm are they most often used and why?

Answer

A

Variable-importance measures look at how important a variable is to the predictions of the model.

They’re commonly used for random forests, because random forests lack interpretability without them.

And more recently in my experience, they’re big for xgboost. Ensemble tree methods just lend themselves to variable importance metrics, because you can look at how often a tree chooses to split on a given variable, how the model performs with and without trees that use a certain variable, etc.

Question 52

Q

ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?

Answer

A

Find all of the splits which use that variable in all of the decision trees, and then find how much on average those splits improve the predictions, using some measure of prediction performance like Gini index.
Randomly permute each variable one at a time, and see how much model performance decreases for each. (I’m guessing this means to calculate out-of-bag error, and for each point, before getting the prediction, choose the value of the variable in question randomly.)

Question 53

Q

ML: What is clustering, and how does it differ from classification?

Answer

A

Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way “similar”. So basically, we try to label each point such that similar points have the same label.

Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren’t given a labeling scheme, and want to find a sensible one.

Question 54

Q

ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?

Question 55

Q

ML: How do you run the k-means clustering algorithm?

Answer

A

After choosing your number of clusters k, randomly initialize the locations of your k cluster centers. Then, repeat:

Assign all training points to the cluster center that is closest by euclidean distance
Set cluster center as the average of all points in the cluster
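
The two alternating steps can be sketched like this. Initializing the centers at k randomly chosen data points and stopping when the centers no longer move are assumptions; the card only says to initialize randomly and repeat:

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # start at k random data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                            # assignment step
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [                             # update step: mean of each cluster
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                  # nothing moved: converged
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, k=2)
```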

Question 56

Q

ML: What are some drawbacks to k-means clustering?

Answer

A

It’s nondeterministic due to the random initializations, and is not guaranteed to reach the global optimum.
Cluster center isn’t interpretable, as it’s an average of a bunch of points

Question 57

Q

ML: What is an advantage to k-means clustering?

Answer

A

It will always converge, as no step can increase the in-cluster variance (so it stops once it’s at a local minimum)

Question 58

Q

ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?

Answer

A

Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points.

So we initialize k points randomly to be the initial cluster centers, then repeat:

Assign all training points to the cluster center (a real point) that is closest to it
For each cluster, choose the point in the cluster which minimizes mean squared distance from other points as the new cluster center. (In other words, minimize in-cluster variation.)

Question 59

Q

ML: What is a pro of k-medoids over k-means? What is a con?

Answer

A

K-medoids provides more interpretable cluster centers, as they are actual points in the dataset rather than an average.

However, this is at the cost of having slightly higher in-cluster variation for k-medoids, as you simply have “fewer choices” of where to put your cluster center.

Question 60

Q

ML: At a high level (so for an arbitrary linkage), how does the hierarchical clustering algorithm work?

Answer

A

Each training point starts as its own cluster. Then at each step, we merge the 2 clusters with the highest similarity into one. We use linkages to determine similarity.
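
A minimal sketch of the merge loop, using single linkage (distance between the closest pair of points) as one example linkage. Stopping at a requested number of clusters is an assumption; you could also run it all the way down to one cluster and keep the merge history:

```python
import math

def single_linkage(a, b):
    """Distance between clusters = distance between their closest pair of points."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, n_clusters, linkage=single_linkage):
    clusters = [[p] for p in points]          # every point starts as its own cluster
    merges = []
    while len(clusters) > n_clusters:
        # Pick the pair with the smallest linkage distance (highest similarity).
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = agglomerative(points, n_clusters=2)
```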

Question 61

Q

ML: What are a few advantages to hierarchical clustering?

Answer

A

It is deterministic, and thus has lower variance
You get clusterings for all values of k
You have hierarchical clusters! If your clusters naturally have a sub-cluster architecture, this is useful.

Question 62

Q

ML: At a very high level, how do mixture models work?

Answer

A

When looking at our training data, we assume they come from a probability distribution that is a weighted sum of several simple underlying distributions, such as several gaussians.

Question 63

Q

ML: What advantage do mixture models have for clustering?

Answer

A

Rather than having each point discretely in one cluster, we compute the probability that each point is in each cluster. This is good for overlapping clusters, or for clusterings where some points have an unclear affiliation.

Question 64

Q

ML: For the purpose of clustering using graphs, what are 3 ways to compute a graph from training data?

Answer

A

Connect 2 points iff their distance is below a threshold epsilon (epsilon-nearest-neighbors)
Connect 2 points iff one is a KNN of the other, for some k
Form a weighted graph where points with close euclidean distances have high weights

Answer 61

A

A graph cut is simply putting the vertices into 2 (or more) subsets, but it can be visualized as “cutting” across the edges which go between your two subsets.

We want our cuts to:

Have a low total weight of cut edges. So if we’re using an unweighted graph, we want to cut as few edges as possible; if the graph is weighted, we want to cut as little weight as possible. This increases dissimilarity across clusters and similarity within clusters, as edges are drawn between similar points.
Be balanced (or approximately balanced), meaning that a similar number of points is in each subset. This avoids putting one point in a subset by itself, for example.

Answer 62

A

It is very good at capturing non-blob-like clusters, or clusters with atypical shapes. This can be very useful

Answer 63

A

Form a graph from your training points, connecting points that are similar.

Then, make graph cuts which separate dissimilar points to form your clusters.

Answer 64

A

Spectral embedding (it uses some fancy linear algebra that is related to graph-based dimensionality reduction, but not going to get into weeds here.)

Answer 65

A

The optimal regression function is µ(x) = E[Y|X=x]; it is the expected value of Y at every point x (which is the true function f(x) when Y = f(X) plus zero-mean noise).

This is the “optimal” regression function because it minimizes the expected squared error:

E[(Y-µ(X))²]

Answer 66

A

Bootstrapping allows us to get an approximate sampling distribution for a statistic of our data.

So typically we have a dataset of n values X_i, and from it we will calculate a variety of statistics: mean, standard deviation, regression coefficient for a model you’re fitting to the data, difference in MSE between 2 models you’re fitting, etc. But for each statistic, you get one value of the statistic per dataset, and to get an (approximate) sampling distribution for a statistic, you need many datasets. Bootstrapping simulates them by resampling n values with replacement from the original dataset, over and over; you then calculate the statistic on every simulated dataset and use those values as its sampling distribution.
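
A minimal sketch of the procedure for the sample mean, where each simulated dataset is a resample of the original data with replacement. The data values, the 2000 resamples, and the percentile-style interval are all illustrative choices:

```python
import random
import statistics

def bootstrap_distribution(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples of `data` drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
means = sorted(bootstrap_distribution(data, statistics.mean))
std_error = statistics.stdev(means)                  # bootstrap standard error
ci_95 = (means[int(0.025 * len(means))], means[int(0.975 * len(means))])
```

Swapping `statistics.mean` for any other function of the dataset gives that statistic’s approximate sampling distribution instead.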

Answer 67

A

We use statistics to make estimations about the underlying data: maybe want the mean, or a coefficient in a model, or the difference in MSE between 2 models. By finding an approximate sampling distribution of such a statistic, we can approximately quantify the uncertainty around our estimated value. So if we have a sample mean estimating the expected value of the data µ, we can use bootstrapping to find a confidence interval, or a standard error, or a variance, etc. And quantifying our uncertainty regarding these estimates is always a responsible thing to do!

Answer 68

A

Rather than writing the hyperplane as w^Tx + b = 0, you write it as w^Tx = 0, and implicitly assume that the first (zeroth) entry in w is your intercept b, and the first (zeroth) entry in your x vector is a 1.

Answer 69

A

Just draw the hyperplane as you would normally draw a line, by converting it into y = mx + b form.

To figure out which side has which label and draw that arrow, plug some easy-to-calculate sample point back into the original form w^Tx, and see whether w^Tx is above or below 0.

Answer 70

A

MAP allows you to include some prior distribution of the parameter before factoring in the data. So, if you have suspicions about the value of a parameter, perhaps because of domain knowledge, then MAP could be better as it will allow you to incorporate that knowledge.

Answer 71

A

We want to take our D-dimensional dataset, and represent it using fewer dimensions (or fewer features). And we want to preserve as much of the structure and information in the data as we can: we want similar points in our original dataset to also be similar in the dimension-reduced dataset, and want dissimilar points to also remain dissimilar (by whatever measure of similarity makes sense in context).

Answer 72

A

With D-dimensional data, PCA finds up to D new linear dimensions, or vectors, or “number lines” as I think of them, which are linear combinations of the original dimensions (e₁, e₂, …). Each of these dimensions is called a Principal Component, and the first is the unit vector v₁ maximizing the variance of Xv₁, the projections of the points in (centered) design matrix X onto the number line of v₁. The remaining principal components maximize the variance of the projection Xv_i while also making v_i orthogonal to all previous principal components. (All D principal components thus form a basis for R^D, as they’re all orthogonal.)

Using these principal components, you can choose k < D of them to represent your data in the k dimensional subspace preserving as much of the variance in the data as possible. (It feels like a greedy algorithm, but the result is exact, not an approximation: the first k principal components span the k-dimensional subspace that preserves the most variance.)

Answer 73

A

For some subset of the D principal components, say k of them, the proportion of variance explained is how much of the variance from the original dataset is preserved by this k-dimensional representation. (Or what is explained by each individual principal component; both definitions work.)

Specifically, it is (I assume) the variance in the k-dimensional representation divided by the variance in the original data.

We are often interested in proportion of variance explained as a function of k, so we can see when we have a k high enough to explain a reasonable proportion of the variance, such as maybe 90%.

Answer 74

A

It plots proportion of variance explained vs number of principal components used, as a means of visualizing how much information will be preserved by each additional principal component.

This is the general idea, but there are many versions. Some plot number of PCs used vs the eigenvalue associated with each PC’s eigenvector, because these are related (as described in another flashcard). Another type plots the proportion of variance explained by each individual PC.

Answer 75

A

It’s just the component of a that’s in the direction of b. Think 2-d velocity calculations from physics: when we divide the velocity of an object into its x component and y component, that’s the velocity vector’s projection onto the vectors of the axes, (1,0) and (0,1), respectively.

Answer 76

A

A basis is a set of linearly independent vectors such that any vector in the vector space can be represented as a linear combination of the vectors in the set.

Answer 77

A

Mv = λv

(Mv = lambda*v)

Answer 78

A

X_new = XV_k, where the columns of V_k are the k principal components v_i.

In other words, you find all k PC scores, Xv_i, which is just the projection of each point x onto the number line of direction v_i, and then you combine all of those projections into your new representation X_new.

Answer 79

A

Given data matrix X, you can:

Find the eigendecomposition VDV^T of its covariance matrix (Sigma-hat)
Find the singular value decomposition X = UD*V^T.
Take the inner product matrix XX^T and find XX^T = U(D*)²U^T, where U and D* are the same as in option 2.
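
A short worked derivation of why the three options agree, assuming X is n × D with centered columns, a thin SVD X = U D^* V^T, and a covariance normalized by 1/(n-1) (the normalization is an assumption; 1/n only rescales D):

```latex
% Uses U^T U = I and V^T V = I from the thin SVD.
\begin{aligned}
X^T X &= V D^* U^T U D^* V^T = V (D^*)^2 V^T, \\
\hat{\Sigma} &= \tfrac{1}{n-1} X^T X = V \left[ \tfrac{1}{n-1} (D^*)^2 \right] V^T
  \quad\Rightarrow\quad D = \tfrac{1}{n-1} (D^*)^2, \\
X X^T &= U D^* V^T V D^* U^T = U (D^*)^2 U^T, \\
X V &= U D^* V^T V = U D^* .
\end{aligned}
```

So the eigenvectors V of the covariance matrix are the right singular vectors of X, and the PC scores XV can be read off as UD^* from either decomposition.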

Answer 80

A

Kernel PCA allows you to find nonlinear embeddings using PCA, which typically only finds optimal linear embeddings.

You’ll recall that you can do normal PCA by taking the inner product matrix as XX^T = U(D*)²U^T, where XX^T is the inner product matrix with (XX^T)_i,j = x_i^Tx_j.

Well instead, use the kernel trick and make a different similarity matrix, a kernel matrix K, where the entries K_i,j = k(x_i,x_j) for some nonlinear kernel. Then do the same decomposition into K = U(D*)²U^T for some other matrices U and D*, and find the PC scores in the new nonlinear space as UD* just as you would when using XX^T.

This way, you can find the PCA dimension reduction of X, but after transforming the points of X into any new feature space you want, likely a non-linear one, using kernels.

Answer 81

A

If you have n points, and you have a matrix of distances between points (each entry is the distance between points i and j), then you can convert it into a similarity matrix M, similar to XX^T (by double-centering the squared distances), on which you can perform PCA to get a dimension-reduced version of the data. But, you can use any distance measure, including non-euclidean distance measures, which will help you find nonlinear dimension reductions which can preserve nonlinear structure in the data.

Answer 82

A

You’ll recall that you can use PCA to find a linear dimension reduction of data using only similarity matrix XX^T. Additionally, in MDS, you feed PCA a similarity matrix that instead uses some non-euclidean distance measure.

Well, ISOMAP is simply an instance of MDS where the distance matrix is calculated using graphs. You make a graph where the vertices are your data points, each point is connected only to its nearest neighbors, and every edge from point i to j has a weight of their euclidean distance apart. Then the graph distance is the shortest path in the weighted graph from i to j; feed these non-euclidean distances into MDS, and you’ll get a nonlinear embedding which preserves structures like swirls or curves.

Answer 83

A

Overfitting is when you allow your model to learn the patterns in the training data too well, so that it follows the noise in the training data. This results in validation error that is much higher than training error. This typically happens when your model has too much complexity (so that it can learn all the idiosyncrasies of the training data), and/or when you don’t regularize.

Answer 84

A

Regularization is when you limit the values your model’s parameters can take, or (more commonly) penalize your model for taking certain values.

Typically, this looks like penalizing your model for having parameter values/coefficients that are large in magnitude. So your objective function now needs to balance model fit and model complexity.

Regularization is generally used to combat overfitting.

Answer 85

A

Regularization increases bias: by penalizing a model for taking certain forms or parameter values, it becomes less capable of taking the form of the true model in certain situations; its ability to fit the data decreases.

Conversely, variance decreases, meaning that the model becomes less impacted by noise and fluctuations in the data. This is really the whole point of regularization: to combat overfitting to noise.

Answer 86

A

Almost always. This is a good thing to remember; it was driven home in 462, but often falls by the wayside.

Answer 87

A

If you have p parameters, it takes the value below for some balancing value lambda. (Note you don’t regularize the intercept B₀)

Answer 88

A

If you have p parameters, it takes the value below for some balancing value lambda. (Note you don’t regularize the intercept B₀)

Answer 89

A

A coefficient basically says that there is a substantive relationship between one variable and your output variable. By shrinking the coefficients, or penalizing large coefficients, you limit the number and magnitude of substantive predictive relationships that your model says exist.

This in theory means, if tuned right, it can only capture the important or “real” trends, but can’t capture the minutia in the noise of the training data. Hence, it combats overfitting.

Answer 90

A

Lasso regression particularly wants coefficients to be zero (rather than them being very small, but not zero). This means it will result in more zero coefficients, so it makes models that are more interpretable and that show which variables “actually matter”.

However, lasso thus zeros out small relationships, which means it has more bias. Ridge doesn’t do this: the difference between 0 and .001 isn’t as big for ridge, which results in basically no zero coefficients. Thus, ridge would make more sense if you’re more concerned with accuracy than with interpretability.

Answer 91

A

As lambda goes to zero, we move towards no regularization, so the solution moves towards being the same as if we hadn’t regularized.

As lambda goes to infinity, our model will just make all coefficients go to zero, as this minimizes the objective function.

Answer 92

A

A high lambda means that the shrinking of coefficients has a lot of weight in the objective function, and so combatting overfitting will be very important to this model.

Conversely, low lambda means more liberty to have high coefficients, and we are more able to fit training data.

Answer 93

A

Increased lambda means more regularization.

As such, training error will monotonically increase as lambda increases.

Test error will typically decrease, then increase. So there will be an optimal lambda value where test error is minimized.
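
A small experiment can make this concrete. The sketch below assumes ridge regression solved in closed form with no intercept, on made-up synthetic data; the names and numbers are illustrative only. The training error is guaranteed not to go down as lambda grows, while the test error curve depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 5 useful features, 25 pure-noise features.
n_train, n_test, p = 40, 200, 30
true_w = np.zeros(p)
true_w[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ true_w + rng.normal(scale=2.0, size=n_train)
y_test = X_test @ true_w + rng.normal(scale=2.0, size=n_test)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam * I)^(-1) X'y, with no intercept."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
train_errors = []
test_errors = []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)   # always fit on the training data only
    train_errors.append(mse(X_train, y_train, w))
    test_errors.append(mse(X_test, y_test, w))

best_lambda = lambdas[int(np.argmin(test_errors))]
```

Printing `train_errors` and `test_errors` side by side shows the pattern; `best_lambda` is just the grid value with the lowest test error in this particular run.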

Answer 94

A

The one where things go quickly to zero, or the right one, is lasso, because lasso zeros out coefficients.

The one where coefficients asymptote towards zero is ridge, as it decreases coefficients but doesn’t actually zero them out.

Answer 95

A

If the scales of your variables are very different, then a small parameter might mean a large change, or a large parameter might mean a small change. Because regularization tries to decrease parameter values without considering which variables they correspond to or what those variables’ scales are, the incentives for decreasing the influence of certain variables will be accidentally too high or too low based on scale.
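
The usual remedy is to put every feature on the same scale before regularizing. A minimal sketch, assuming standardization to mean 0 and standard deviation 1 (the function name and the example numbers are made up):

```python
import numpy as np

def standardize(X_train, X_other):
    """Scale features to mean 0 and std 1, using statistics from the training data only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns
    return (X_train - mean) / std, (X_other - mean) / std

# Hypothetical features on very different scales: income in dollars, age in years.
X_train = np.array([[30_000.0, 25.0], [60_000.0, 40.0], [90_000.0, 55.0]])
X_new = np.array([[45_000.0, 30.0]])
X_train_scaled, X_new_scaled = standardize(X_train, X_new)
```

After this step, a coefficient of a given size means roughly the same amount of influence for every feature, so the penalty treats them evenly.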

Answer 96

A

The goal of regularization is to shrink coefficients of predictor variables, making it less likely our model will say they have predictive value.

Take Y = aX + b for example. By shrinking our learned value for a, we limit the amount of influence our model can perceive that X has on Y.

But b doesn’t describe how much impact X has on Y. It merely offsets the relationship and says what Y is likely to be when X is zero, but it doesn’t impact the model’s perception of whether X is related to Y.

So if we were to regularize b, and penalize our model for having large values of b, we would be limiting our model’s ability to find the correct relationship (increasing bias) without helping its ability to combat overfitting, or finding substantive relationships between variables which don’t actually exist. Hence, don’t regularize b.

Answer 97

A

Linear regression and logistic regression (because they fit the formulas and ideas we’ve learned so well).

Answer 98

A

Our classifier becomes biased towards predicting the label that is more common in our training set, hurting its predictive ability.

Answer 99

A

Downsampling: Only train using a subset of the points with the common label

Upsampling: Duplicate the training points with the less common label and use them multiple times in the dataset.

Weights: Change the weights in your problem in a variety of ways. For example, penalize the model more for errors when the correct answer was the uncommon label.
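
The three options above might look something like this in practice. This is a sketch on made-up data; the inverse-frequency weighting is just one common choice, not necessarily the one meant on the card.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)       # hypothetical imbalanced labels
X = rng.normal(size=(len(y), 3))

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Downsampling: keep only a random subset of the common label.
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
down_idx = np.concatenate([kept, minority_idx])
X_down, y_down = X[down_idx], y[down_idx]

# Upsampling: reuse points with the less common label until the classes match.
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
up_idx = np.concatenate([majority_idx, minority_idx, extra])
X_up, y_up = X[up_idx], y[up_idx]

# Weights: give each point a weight inversely proportional to its label's frequency,
# so errors on the uncommon label cost more in a weighted loss.
class_counts = np.bincount(y)
sample_weights = len(y) / (len(class_counts) * class_counts[y])
```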

Answer 100

A

Model flexibility, or ability to fit a variety of true underlying models
Ability to capture nonlinearity
Interpretability
Ability to handle irrelevant features well

(Don’t necessarily need to get all of these or get it exactly right, but these are good things to keep in mind)

Answer 101

A

They both handle irrelevant features very well, which means you can throw a lot of features at them in hopes of finding a couple good ones, even if you suspect lots of them have little or no predictive power.

Answer 102

A

Split off some of your data, say 10 or 20%, to act as test data. Do this first.
Split off more data to act as your validation data; leave the rest as training data.
Try a bunch of models by training them with the training data, then finding their approximate out-of-sample error by predicting on the validation data.
Choose a final model.
Approximate your final model’s true out-of-sample error by training it on both the training and validation data and finding its error on the test set.
Train the model on all of your data, and deploy it to do whatever it needs to do in the real world.
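
The first two steps above can be sketched as an index split. The function name and the 20%/20% fractions are assumptions for illustration.

```python
import numpy as np

def train_val_test_split(n, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle indices 0..n-1, then carve off test first, validation second, training last."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test = int(round(test_frac * n))
    n_val = int(round(val_frac * n))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
```

Candidate models would then be fit on `X[train_idx]`, compared on `X[val_idx]`, and only the final choice scored on `X[test_idx]`.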

Answer 103

A

With normal validation, once you’ve set aside your test data, you split your remaining data into train and validation data. You train all your models on the train data, then validate them with the validation data. You pick a model and test it on the test data.

With cross-validation (say with k folds, such as 5 or 10), after you set aside test data, you instead divide the remaining data into k groups, or k “folds”. For every model you try, you train it on k-1 of the folds and find its validation error on the remaining fold, and you do this for each of the k folds. Finally, you get your validation error by averaging all k of your error measures on each of the folds. (So you train your model k times rather than once when you are validating.)
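
A minimal sketch of the k-fold procedure, assuming squared error as the validation metric and a model supplied as a pair of hypothetical `fit` / `predict` functions (here ordinary least squares on synthetic data):

```python
import numpy as np

def k_fold_cv_error(X, y, fit, predict, k=5, seed=0):
    """Average validation MSE over k folds; the model is trained k times."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])          # train on k-1 folds
        preds = predict(model, X[val_idx])               # validate on the held-out fold
        errors.append(np.mean((preds - y[val_idx]) ** 2))
    return float(np.mean(errors))

def fit_ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict_ols(w, X):
    return X @ w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
cv_mse = k_fold_cv_error(X, y, fit_ols, predict_ols, k=5)
```

Here `X` and `y` stand for the data left over after the test set has already been set aside.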

Answer 104

A

Cross-validation wastes less training data during the train/validation part of the model fitting process. It also decreases the variance that appears when you split train and validation: what if an outlier shows up in the validation set, or what if it doesn’t? For these reasons, cross-validation is much more common in practice than normal validation.

However, cross-validation can take a long time: the more folds you have, the longer it’s going to take to do all that training, whereas with normal validation you only train once.

Answer 105

A

Generally more folds is better, and you want as many folds as possible. The “ideal” cross validation is leave-one-out cross validation or LOOCV, where each datapoint is its own fold, with the intuition being that the more data you use to train for each fold, the better you simulate how the model will perform on actual held-out data.

But of course, more folds means more time, and thus LOOCV is often not realistic. 5-fold and 10-fold CV are common and tend to do the trick.

Answer 106

A

Any choice that you make in model selection when working with training and validation data has the risk of overfitting to the training and validation data. The characteristics of your training and validation data that cause you to make certain decisions might be noise.

Additionally, if you try enough models, one of them is likely to get a low validation error simply due to random chance.

For these reasons, you need data to test your model on after you choose your model to accurately estimate your out-of-sample error. You need the data that you use to assess out-of-sample error to not be involved in selecting a model in any way, otherwise it is possible that you are underestimating your error for these reasons.

Answer 107

A

All decisions and all actions. Or as many as possible.

You shouldn’t do feature engineering, or consider candidate algorithms, or decide how to define an outlier, or even curiously look at a histogram of your data before you set aside your test data.

This is because all of these decisions are a part of your model selection process, and you need a test set which wasn’t used for model selection in order to get an accurate, unbiased estimate of your final model’s out-of-sample error.

Answer 108

A

If you have enough data (or as your sample size n tends to infinity), your odds of seeing unexpected new things and/or things that cause bugs should be low, as your training and validation data should be representative of all available data.

Answer 109

A

It’s very important: finding the right features to feed into your model can easily be as important as, or more important than, what algorithm you choose to use, and deciding/brainstorming these features will often take up a large majority of your time when trying and selecting models.

Answer 110

A

A model can really be thought of as any aspect of the algorithm that outputs your predictions. This of course includes the algorithm you choose (random forests, linear regression, etc) and the values of the model’s parameters, but it also includes the features that your model takes as inputs, or how your model defines and handles outliers. That features idea is very important: the presence or absence of features is a part of your model; feature engineering is a subset of model selection.

This is a good way to think because it emphasizes the importance of not making any decisions or looking at anything before you set aside your test set. It also encourages you to think deeply about (and remember to cross-validate) aspects of model selection like feature engineering that you might think less important than, say, choosing your algorithm.

Answer 111

A

You want to cross-validate as many decisions as possible. This includes the normal stuff like the value of your hyperparameters, but also which algorithm you use, which features you choose to include, etc.

To cross-validate these things, you just find a cross-validation error for each combination of those decisions. So you try your algorithms with and without each feature you’re considering, for each of your candidate algorithms, and each of their possible hyperparameter values.

The issue with this strategy is it takes a lot of time, as you have a lot of combos. You address it by taking shortcuts that seem reasonable to you, and that your intuition says would still lead to a good final model. Maybe you don’t try every combination of features, or every value of a hyperparameter. Maybe you choose an algorithm with only a couple possible sets of features and a couple sets of hyperparameters, then home in on that algorithm and try more combos of features, more candidate features in general, and spend more time on more possible hyperparameter values.

Answer 112

A

Training data is used to learn the values of your model’s parameters. For example, when learning a neural network, your training data would teach the model its weights and bias terms.
Validation data is used to learn the values of the hyperparameters. For example, in KNN, you might use validation data to decide k, the number of neighbors, which is a hyperparameter. In the neural network example, you might use validation data for defining number of layers, nodes per layer, which activation function, etc.

Answer 113

A

We use gradient descent to minimize (or maximize) differentiable objective functions, or to find the value of x such that f(x) is minimized.

f(x) needs to be differentiable because we need to take its derivative in order to get the gradient.

Answer 114

A

Solving it with calculus is often extremely time-consuming or impossibly time-consuming, especially when you need to minimize an objective function with respect to, say, 1000 parameters as with NNs.

Answer 115

A

Depending on where your starting point is, you may descend to a local minimum rather than a global minimum, but your algorithm will still stop: it can’t take a small step in a direction and improve its situation.

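This won’t happen with convex functions, where any local minimum is also a global minimum. (In one dimension, think of a bowl shape: the slope never decreases as you move to the right, i.e. the second derivative is never negative; the analogous condition in higher dimensions is stated in terms of the matrix of second derivatives.)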

To combat this issue for non-convex functions, you can try several randomly chosen starts and see which returns the minimum; it still doesn’t guarantee a global min though.

Answer 116

A

The gradient of the function with respect to theta is just a vector of the partial derivatives of the function with respect to each of the p parameters in theta.

Intuitively (especially when visualizing say 3 dimensions), it is the direction of the n-dimensional slope, or the direction in which a small step would most productively move you towards optimizing the objective function. (Technically, you might move in the opposite direction, but same idea.)
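
Written out, assuming θ has p entries and using η for a step size (η is a symbol introduced here, not on the card), the gradient and the usual gradient descent update look like:

```latex
\nabla_\theta f(\theta) =
\left[ \frac{\partial f}{\partial \theta_1}, \; \dots, \; \frac{\partial f}{\partial \theta_p} \right]^\top,
\qquad
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta f\big(\theta^{(t)}\big)
```

The minus sign is the "opposite direction" mentioned above: the gradient points uphill, so to minimize you step against it.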

Answer 117

A

If you choose the point randomly, the expected value of the gradient with respect to a single point equals the gradient with respect to all n points.
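
As a one-line derivation, assuming the objective is the average of per-point losses f_i and the index i is drawn uniformly from the n points (if the objective is a sum instead of an average, the two sides agree up to a factor of n):

```latex
f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)
\quad \Longrightarrow \quad
\mathbb{E}_{i}\big[\nabla_\theta f_i(\theta)\big]
= \sum_{i=1}^{n} \frac{1}{n} \, \nabla_\theta f_i(\theta)
= \nabla_\theta f(\theta)
```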

Answer 118

A

A single pass through the entire training set.

Answer 119

A

Finding the gradient with respect to the entire dataset takes a long time. (Specifically, the full gradient is basically just a sum of the gradients with respect to each of the individual points, so computing it takes about as long as computing all n of those.) So getting one update, albeit one that is definitely in the direction of the gradient, takes a long time.

Conversely, SGD may make some updates in the “wrong” direction, but you get to make n updates in the time normal gradient descent needs to make only one. In practice, this typically leads to faster convergence to the same result.

Answer 120

A

In normal GD, you make updates with respect to the entire dataset; so, your update is always in the right direction, but updates take a long time.

In SGD, you make updates with respect to a single datapoint; so, your update may be in the wrong direction, but the expected value of the direction is correct, and you can make many updates quickly.

In mini-batch gradient descent, you combine these ideas by updating with respect to small subsets of the training data each time. This can sometimes lead to a happy medium between making updates quickly, and often being in an approximately correct direction.
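
All three variants can be written as one loop that only differs in batch size. The sketch below assumes least-squares linear regression on synthetic data, with a made-up learning rate and epoch count; it is an illustration, not a tuned implementation.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.05, epochs=100, seed=0):
    """Least-squares linear regression trained by gradient descent.

    batch_size = len(y) gives normal (full-batch) GD,
    batch_size = 1 gives SGD, anything in between is mini-batch GD.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):                      # one epoch = one pass through the data
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            residual = X[batch] @ w - y[batch]
            grad = 2.0 * X[batch].T @ residual / len(batch)  # gradient of the batch MSE
            w -= lr * grad
    return w

rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + rng.normal(scale=0.1, size=200)

w_gd = minibatch_gd(X, y, batch_size=len(y))   # 1 update per epoch
w_sgd = minibatch_gd(X, y, batch_size=1)       # 200 updates per epoch
w_mini = minibatch_gd(X, y, batch_size=20)     # 10 updates per epoch
```

On this easy problem all three end up near `true_w`; the difference is how many updates each one gets per pass through the data and how noisy those updates are.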

Answer 121

A

Support vector machines (SVMs) are an algorithm for finding the optimal linear/hyperplane classifier for a dataset.

In hard-margin, no points can be misclassified, and the classifier tries to maximize the “margin”, which is the distance between the hyperplane and the closest point(s).

In soft-margin, points are allowed to cross the hyperplane. Each point is assigned a “slack variable”, describing how far it’s within the margin (a concept still kept from hard-margin) or within the decision boundary (i.e. misclassified).
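
For reference, the standard soft-margin problem can be written as below, assuming labels y_i in {-1, +1}, slack variables ξ_i, and a tradeoff constant C (these symbols are the usual textbook ones, not taken from the card):

```latex
\min_{w,\, b,\, \xi} \;\; \frac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0 \quad \text{for all } i
```

A point with ξ_i = 0 is on the correct side and outside the margin, 0 < ξ_i ≤ 1 means it is inside the margin but still correctly classified, and ξ_i > 1 means it is misclassified. Hard-margin is the special case where every ξ_i is forced to 0.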

Answer 122

A

Similar to multiclass logistic regression, we just fit k binary-classification SVMs: one for each of the k labels, predicting whether it’s that specific label vs any of the others (we treat all other labels as a single label for each binary classifier). We then predict via:
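
A standard one-vs-rest prediction rule, assumed here since the card's own formula is not shown, picks the label whose binary classifier gives the largest score, where w_k and b_k are the weights and offset of the k-th classifier:

```latex
\hat{y}(x) = \arg\max_{k \in \{1, \dots, K\}} \; \big( w_k^\top x + b_k \big)
```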

Answer 123

A

It maps a vector to a non-negative scalar in a way that measures its “length”

Answer 124

A

The l₂ norm, or the euclidean norm, is simply the distance formula: it’s the length of the vector in euclidean space.

If x = [1,5,2], ||x||₂ = sqrt(1² + 5² + 2²).

Answer 125

A

For ||x||₁, you just sum the absolute values of the entries in x.

||x||₁ = |x₁| + |x₂| + |x₃|
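
A quick check of both norms on the example vector x = [1, 5, 2], computed by hand and with NumPy's `np.linalg.norm`:

```python
import numpy as np

x = np.array([1.0, 5.0, 2.0])

l2 = np.sqrt(np.sum(x ** 2))   # sqrt(1 + 25 + 4) = sqrt(30)
l1 = np.sum(np.abs(x))         # 1 + 5 + 2 = 8

# The same values via NumPy's built-in norm (default is l2; ord=1 gives l1).
l2_np = np.linalg.norm(x)
l1_np = np.linalg.norm(x, ord=1)
```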

Answer 126

A

Suppose we are trying to find a vector x that minimizes a norm, either the l₁ or l₂ norm. So in both cases, we’re trying to find a vector x whose entries are near zero.

The l₂ norm doesn’t care much whether an entry in x is exactly zero, or very nearly zero; as such, when we try to minimize the l₂ norm, we get a bunch of values that are very nearly zero.

Conversely, the l₁ norm does care a lot whether an entry in x is exactly zero, or very nearly zero; as such, when we try to minimize the l₁ norm, we get a bunch of values that are exactly zero, whereas with the l₂ norm they weren’t quite zeroed out.

This pops up in regularization: when we do ridge regression, we’re penalizing the l₂ norm of the parameter vector, and so we get a bunch of values that are almost zero, but none that are really zeroed out. When we do lasso regression, we use the l₁ norm and thus get a bunch of parameters that are exactly 0.

There are tradeoffs here: lasso regression yields more interpretable results, but ridge regression is better at capturing very small relationships that lasso tends to just zero out.

Answer 127

A

Most classifiers find the probability that a point has each possible label, and then predict the label with the highest probability.

So, we can evaluate either the predicted labels, or the predicted probabilities.

(Note: The classifiers don’t exactly make probabilities, they make values between 0 and 1. But these values can often be interpreted as probabilities, especially if we choose to transform them into more of a probability distribution.)

Answer 128

A

Oftentimes it’s important to know how likely the labels actually are, not just which label is the most likely: if a potential client has a 99% chance of signing a deal, we’ll prioritize them differently than if they have a 51% chance.

Answer 129

A

Recall that a classifier generally outputs the probability that a point has each label. A classifier is well-calibrated if the labels happen about as often as they’re predicted to happen.

So if, in every case where the classifier says a point has a 70% chance of being label A, about 70% of them are actually label A while the other 30% are something else, we have a well-calibrated classifier (at least for that predicted percentage for that label).

For binary classifiers, we can evaluate calibration using a bin plot. We group together all of the points where we predicted the odds of y=1 were between 0% and 10%, and then we see how many of those points were actually y=1. Similarly for 10% to 20%, and all the other bins.
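
The binning behind such a plot can be sketched like this. The function name, the ten equal-width bins, and the simulated data are all assumptions for illustration; the labels are generated so that the predictions are well-calibrated by construction.

```python
import numpy as np

def calibration_bins(p_pred, y_true, n_bins=10):
    """Per bin: (mean predicted probability, observed fraction of y=1, number of points)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so that p = 1.0 lands in the last bin.
    bin_idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            rows.append((float(p_pred[mask].mean()),
                         float(y_true[mask].mean()),
                         int(mask.sum())))
    return rows

rng = np.random.default_rng(0)
p_pred = rng.uniform(size=5000)
# Each label is 1 with exactly the predicted probability, i.e. a calibrated classifier.
y_true = (rng.uniform(size=5000) < p_pred).astype(int)
bins = calibration_bins(p_pred, y_true)
```

For a well-calibrated classifier the first two numbers in each row are close; plotting one against the other gives points near the diagonal.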

Answer 130

A

For a classifier with k labels, it’s a k-by-k grid that plots the frequency of every predicted label vs every actual label, so we can see what types of errors the classifier tends to make.

For a binary classifier, it simply shows the number of true positives, false positives, true negatives and false negatives the classifier had, in the form of a 2x2 grid.

Answer 131

A

It is the proportion of actual positives, or of y=1’s, that we get right.

So it’s:

(true positives/[true positives + false negatives])

Answer 132

A

It is the proportion of actual negatives, or of y=0’s, that we get right.

So it’s:

(true negatives/[true negatives + false positives])
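
Both formulas (sensitivity and specificity) can be checked on a tiny made-up example by counting the four cells of the 2x2 confusion matrix:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

# Hypothetical labels and predictions: 4 actual positives, 6 actual negatives.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # 3 / 4
specificity = tn / (tn + fp)   # 4 / 6
```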

Answer 133

A

Both of these cases occur when one error is much more costly than the other.

Say we’re trying to decide whether to administer a treatment that has few side effects. If we get a false negative, and incorrectly classify someone as “healthy”, then we don’t give them treatment, which is very costly. But, if we incorrectly say they’re “sick” and treat them when they don’t need it, it’s not as bad. Here we heavily prioritize avoiding false negatives, or we want sensitivity to be high.

Conversely, consider a spam filter. We’re okay with accidentally sending spam to the inbox (false negative), but don’t want to send important emails to spam (false positive). Now, we care a lot about the specificity of our classifier being high, as we want to avoid false positives.

Answer 134

A

Classifiers output a probability that y=1 on a given point. Typically we label a point with a 1 if the probability is above 50%, but we can change the threshold that the probability must reach to label a point as y=1.

If we only label it y=1 if the probability of y=1 is above 75%, then we are okay with false negatives, but really don’t want false positives; we’re prioritizing specificity.

Conversely, if we label it y=1 whenever the probability of y=1 is above 25%, then we are okay with false positives, but really don’t want false negatives; we’re prioritizing sensitivity.

Another option would be to initially train the classifier so a false-negative causes higher loss in the loss function than a false positive, or vice versa. So we would have different loss amounts L₀ and L₁ for the two different errors.

Answer 135

A

Sensitivity describes the probability that we label a point y=1 when it is in fact y=1; it’s the probability of avoiding a false-negative.

Specificity describes the probability that we label a point y=0 when it is in fact y=0; it’s the probability of avoiding a false-positive.

So, we want high sensitivity when false-negatives are bad, and high specificity when false-positives are bad.

Answer 136

A

Recall that our trained binary classifier outputs a probability p that a point x has label y=1, and we can trade off the sensitivity and specificity of our classifier by moving the threshold of this probability where we predict y=1. So for some value t, we predict y=1 if p > t. If t is high, we are avoiding false positives and thus have high specificity; if t is low, we are avoiding false negatives and thus have high sensitivity.

A curve on an ROC plot shows a classifier’s sensitivity and its specificity at every possible value of threshold t, which can be anywhere in [0,1]. It plots (1 - specificity) on the x axis, and sensitivity on the y axis.

What we want is classifiers whose curves are bowed out toward the top-left, because then for each value of t they do a good job with both sensitivity and specificity. In this case, the classifier has a high AUC, or area-under-the-curve. We want high AUC, as it basically means our classifier is accurate.

The diagonal curve on the ROC corresponds to a random classifier, where our probability value p is randomly chosen from the interval [0,1], and then we threshold that probability for some threshold value t. The AUC of this curve is 0.5.

So a curve on the ROC plot is a classifier, and a point on the ROC plot is a classifier-threshold pair. The top-left corner is a classifier-threshold pair that always predicts the label correctly, and the bottom-right always predicts incorrectly. The bottom-left always predicts 0, and the top-right always predicts 1.
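
To make the construction concrete, here is a rough sketch that sweeps the threshold and computes the curve and its AUC by hand. It assumes we predict y=1 when the score is at least t (rather than strictly greater), uses the trapezoid rule for the area, and runs on a tiny made-up example; the function names are illustrative.

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """(1 - specificity, sensitivity) at each threshold, from strictest to loosest."""
    # An infinite threshold predicts all 0 (bottom-left corner);
    # the lowest observed score predicts all 1 (top-right corner).
    thresholds = np.concatenate([[np.inf], np.unique(scores)[::-1]])
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)   # sensitivity
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)   # 1 - specificity
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the curve using the trapezoid rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])   # one negative is ranked above one positive
fpr, tpr = roc_curve_points(y_true, scores)
auc_value = auc(fpr, tpr)

# A classifier that ranks every positive above every negative.
perfect_fpr, perfect_tpr = roc_curve_points(y_true, np.array([0.1, 0.2, 0.8, 0.9]))
perfect_auc = auc(perfect_fpr, perfect_tpr)
```

In this example `auc_value` is 0.75 and `perfect_auc` is 1.0, which matches the intuition that a curve hugging the top-left corner has the largest area.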

Answer 137

A

It’s very common

It’s very easy to understand

The math is simple and well-known (I’m not going to memorize it, but it follows from the statement of what it’s trying to minimize.)

Answer 138

A

A residual for a prediction Y*=µ(x) is Y - Y*, the difference between the real and predicted values.

Answer 139

A

One option is to plot your points vs some axis like Time, and ensure that there is no pattern among points close in time. Basically look at ways the points might be related to one another, and see that they’re not, and that there is no pattern.

(This one was less well-defined in class)

Answer 140

A

An additive model

Answer 141

A

As n goes to infinity, the learned model must converge to the actual underlying model.

Answer 142

A

Correct: B_i is the expected difference in the outcome variable Y between two points whose difference in X_i is 1, assuming the values of all other predictors are held equal.

Incorrect: B_i is the expected difference in Y caused by increasing X_i by 1.

The incorrect one is wrong because it assumes some sort of causal relationship. Our linear model by itself says nothing about what will happen if we change the value of X_i, it just says that on average, points with higher X_i have higher (or lower, depending on sign of B_i) values of Y based proportionally on B_i, which is what the correct interpretation is saying.

Answer 143

A

Cross-validation, to approximate the MSEs of each model on out-of-sample data.