ML Flashcards

1
Q

ML: What are loss functions?

A

Functions that numerically compare your predictions Y* to the true values Y to provide feedback on the accuracy of your trained model.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

ML: What is the typical loss function for classification? For regression?

A

Classification: 0/1 loss

Regression: Mean-Squared Error (MSE)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

ML: What is a generative model?

A

It models P(Y|X) using P(X|Y)P(Y)
(derived from Bayes rule)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

ML: What is a discriminative model?

A

Models P(Y|X) directly, as P(Y|X)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

ML: What are some advantages of discriminative model over generative? What about generative over discriminative?

A

Discriminative models (Y|X) don’t need to make assumptions about the distribution of X|Y, and can be simpler as a result.

Generative models can be used to generate samples. They are also sometimes more intuitive.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

ML: What is a Bayes classifier?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

ML: What are 2 important characteristics of a Bayes classifier?

A

It is the best possible classifier

Its error is thus the irreducible error of a classification problem.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

ML: At a high level, how would a Bayes Classifier be constructed in a case where errors are asymmetric, meaning some errors are worse than others?

A

We would weight each possible error with its own amount of loss Li,j, pertaining to what you said the answer is and what the answer should’ve been.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

ML: What is a simplistic view of logistic regression when Y is binary?

A

We are just doing a linear regression on our X’s, then squishing the outcome into [0,1]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

ML: In binary logistic regression, what is the formula for P(Y=1|X=x)?

A

For x of any dimension, it is as follows:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

ML: What is the formula for the inverse logit function, aka sigmoid? And what does it accomplish?

A

It is shown below. It squishes all real numbers x into a range of [0,1]

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

ML: In binary logistic regression, given the formula for P(Y=1|X=x), how do we choose our predicted outcome?

A

We of course simply choose the outcome with the higher probability.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

ML: What decision boundary does binary logistic regression yield?

A

It yields the linear separator:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

ML: For binary logistic regression, how do we learn the coefficients of the model B0, B1, B2… in order to make predictions?

A

We estimate them using simple maximum likelihood. So we find the coefficients B0, B1, B2… that have the highest likelihood of producing the observed training data with the observed labels.

However, this optimization function using 0/1 loss is computationally hard, so we solve it iteratively using something like stochastic gradient descent.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

ML: How to we extend binary logistic regression to multinomial logistic regression? So our X’s are still n-dimensional vectors, but now our Y’s are scalars in [1,K]?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

ML: What issue arises when logistic regression is performed on data that is perfectly linearly separable? And how can we combat this?

A

The learned weights will go to infinity, and we will overfit.

We can combat this by regularizing!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

ML: What does having a high number of parameters in your model do to the bias/variance tradeoff?

A

Having a lot of parameters, or a complex model, decreases bias (as we have more flexibility to fit whatever truly is the underlying model) but variance increases

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
18
Q

ML: What does having a high number of features in your model do to the bias/variance tradeoff?

A

Having a lot of featuers decreases bias (as we are able to fit more of the possible underlying models), but variance increases due to the curse of dimensionality.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
19
Q

ML: In general, what kinds of changes to your model will decrease bias, but increase variance?

A

When you allow your algorithm the potential ability to fit a higher amount of potential underlying models, or more types of underlying models.

This of course decreases bias as you are more likely to hit the real model, but it increases variance because (I think) when you have more possible options and the same amount of data, it becomes more likely that one of them appears as the best-fitting model due simply to noise.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
20
Q

ML: In general, what sorts of changes to your algorithm will decrease variance, but increase bias?

A

Changes that make your algorithm less influenced by outliers or noise.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
21
Q

ML: What is a definition of model bias that is useful to think about when considering the bias/variance tradeoff?

A

Bias in a model is a lack of ability to fit the underlying model.

In other words, a lack of robustness with which to fit potential underlying models

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
22
Q

ML: What is a definition of model variance that is useful to think about when considering the bias/variance tradeoff?

A

Model variance is its amount of susceptibility to changes due to noise.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
23
Q

ML: What is the naive bayes assumption, both in words and mathematically?

A

In words, the naive bayes assumption is that there are no interactions between the features of n-dimensional feature vector X, or that the context of one of the features Xi, and how it interacts with other features Xj, is not relevant to prediction.

(An example is assuming seeing “Donald Trump” is has the same predictive value as seeing those two words 10 words apart and in reverse order.)

Mathematically, we have for D-dimensional X:

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
24
Q

ML: Is Naive Bayes a discriminative or generative model?

A

Generative

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
25
Q

ML: What is the form of P(Y=y|X=x) under the Naive Bayes classification algorithm?

A

P

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
26
Q

ML: How do we learn a Naive Bayes classifier? What parameters need to be estimated, and what options do we have to estimate them?

A

For each Y=y, we need to estimate P(Y=y), as well as P(Xi=xi|Y=y) for every possible value of Xi.

P(Y=y) is generally estimated through MLE. So, we just estimate it as the proportion of the training examples for which Y=y.

For P(Xi=xi|Y=y) it can vary. You could also do basic population MLE (The proportion of Xi=xi among training examples where Y=y), or you could learn a gaussian N(µ,σ2) as with gaussian naive bayes, or something else!

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
27
Q

ML: What are the advantages and disadvantages of Naive Bayes?

A

The disadvantage is that we cannot capture context, which is often very important. Hence this algorithm being “naive”.

The advantage is that it’s simple, and is much easier to train: because we ignore interactions, we only need to learn a bunch of univariate distributions rather than one giant joint distribution, which is much harder (I believe exponentially harder).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
28
Q

ML: What do the decision boundaries look like in multinomial logistic regression?

A

We’re finding n linear (or hyperplane) separators, one for each of the K possible Y=k which models P(Y=k) vs P(Y not k). As such, it looks like a bunch of k linear (or hyperplane) separators working together on the cartesian grid.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
29
Q

ML: What is the most common way to train a decision tree, without yet worrying about pruning?

A

At each step, simply greedily add the split that most decreases classification error on the training set (or MSE in the case of regression).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
30
Q

ML: What is the common way to validate a decision tree and combat overfitting?

A

Pruning. Once you have your full tree trained on training data, at each step, greedily remove the branch that causes the largest improvement on validation error. Continue until you can’t find any more improvements to validation error.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
31
Q

ML: How do we predict on a new point X using a trained decision tree, for both classification and regression?

A

Simply work your way a branch until you get to the bottom. For classification, return the most common label among the training examples at that branch. For regression, return the average training label.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
32
Q

ML: What is the main advantage of decision trees? (Not in ensembles or anything, just by themselves.)

A

They’re very interpretable, especially to non-technical people

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
33
Q

ML: What are some disadvantages of decision trees? (Not in ensembles or anything, just by themselves.)

A

They don’t predict very well.

Variance is very high, as small changes in training data can cause big differences in how the splits occur.

They also have a fair amount of bias, as they can only capture certain types of decision bounds (no “diagonal” decision bounds).

They also handle categorical predictor variables poorly if there are too many predictors.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
34
Q

ML: What issue do decision trees have with categorical predictor variables?

A

If the variable has N possible valuest each split, it must consider each of the 2N-1 subsets of the values as a splitting rule, which is of course very computationally hard.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
35
Q

ML: As the depth of a decision tree grows, what happens to bias and variance?

A

Bias decreases (as we increase complexity), and variance increases.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
36
Q

ML: How do ensemble classifiers generally work?

A

Several models are learned, and they vote on what the prediction is.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
37
Q

ML: What is boosting? How does it generally work?

A

In boosting, you fit several (typically very basic) classifiers iteratively, with each new classifier trying to do well on the training points that the previous classifiers did poorly on.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
38
Q

ML: How does boosting improve the performance of the classifiers its using? In otherwords, how does bias and/or variance impacted before and after the use of boosting?

A

Because it uses very simple predictors (for example, stumps of decision trees), each one individually has very high bias. But, by fitting a lot of them and correcting for previous mistakes, bias is decreased.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
39
Q

ML: How does prediction work for a KNN classifier or regressor?

A

For each new point, you find the K points in the training set closest to it (by some measure such as euclidean distance); then for classification you’d return the most common label among the nearest neighbors, and for regression you’d return the average label of the nearest neighbors.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
40
Q

ML: What is the (somewhat high-level) algorithm for Adaboost?

A

We start with each training example Xi having equal weights wi. Then, repeat:

  1. Fit a new classifier that minimizes weighted training error, based on the weights wi. (Points on which earlier classifiers have performed poorly have higher weights, and errors on those points cost more)
  2. Based on the weighted error of that classifier, find a weight alphaj for the classifier for when it votes later on. The lower its weighted error, the higher its alphaj, and thus the higher its voting power.
  3. Update each of the point weights wi based on this round’s results.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
41
Q

ML: How (at a high level) does gradient boosting work?

A

At each step, we find the current gradient of the loss function. We then find our next classifier to “approximate this gradient vector”.

So we in some way optimally take into account the gradient of the loss function at each step as we choose our classifiers.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
42
Q

ML: What issue do we run into when trying to minimize 0/1 loss for our classification problems? How can we solve this issue?

A

Because the 0/1 loss function is not smooth, as well as nonconvex, optimizing the solution is very hard, and you also can’t approximate it with iterative techniques like gradient descent (which is generally the solution when finding the true global minimum is impossible).

To combat this, you can instead iteratively minimize some similar loss function that puts an upper bound on 0/1 loss, such as hinge loss (pink), or exponential loss (yellow).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
43
Q

ML: How does bagging work, at a fairly high level? How do we train a bagging ensemble given a training set of n Xi’s?

A

Resample several samples from your training data; say m bootstrapped samples of size n, with replacement.

Train a (typically complex) model on each of the m bootstrapped samples. Have these models vote on the prediction.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
44
Q

ML: How do bias and variance work with bagging? So what is the level bias and variance from one of the individual classifiers in the ensemble, and then what happens to bias and variance when we have several classifiers vote?

A

The classifiers in bagging are complex; if they’re trees, they’re deep trees. As a result, the individual classifiers have low bias, but high variance.

However, when we use several classifiers and have them vote, we can drive down this variance because we’re averaging votes from classifiers trained on different bootstrapped samples. (As n goes infinity, these samples become independent.)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
45
Q

ML: What are the high-level differences between boosting and bagging, and how they achieve the goal of making several classifiers work effectively together?

A

Boosting fits several models that are not complex, and thus have high bias. But by fitting several and correcting errors from previous iterations, bias is reduced.

Bagging fits several complex models, so bias is lower but variance is high. But by averaging many predictions on slightly different datasets, we can lower variance.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
46
Q

ML: What is an improvement we can make to bagging (specifically classification bagging) when we are calculating our predicted class?

A

Rather than just picking the class with the most votes from each of the classifiers, average the probabilities of each class from each of the classifiers, then predict the class with the highest probability.

(If we get to the end of a branch in one of the classifiers, the “probability” of class j predicted by that classifier is the proportion of training examples that had class j).

This improvement makes performance more consistent, decreasing variance a bit.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
47
Q

ML: How do Random Forests work?

A

Random Forests work basically just by using bagging with deep decision trees: bootstrapping several resampled training sets, training a deep tree on each of them, and having them vote for a classification or regressed value.

The key change from traditional bagging is this: if there are p features of a training example X, at each split in a decision tree, only m < p features are considered for that split. (Typically m = sqrt(p))

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
48
Q

ML: Why is the Random Forest alteration to bagging often advantageous?

A

By only considering m < p features at each split of a decision tree, the trees become more independent, which helps the random forest algorithm decrease the variance in the overall prediction (which is the general goal of bagging).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
49
Q

ML: What are two cons of bagging/random forests?

A

The models lack interpretibility.

Training and prediction both take a long time.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
50
Q

ML: What method is typically used to evaluate random forests instead of cross-validated error? Why is this other approach typically better for random forests?

A

You use out-of-bag error. Because each classifier is trained on a bootstrapped training set, each point in the original training set was not used in training some of the classifiers (about 36.8%). So, get a prediction on this point from these classifiers, and find the error on all the points from that method.

This is preferable to cross-validation because you don’t need to retrain the model for each of the folds; training takes a long time for random forests, so we want to avoid it. But with enough trees, out-of-bag error tends to be very similar to CV error.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
51
Q

ML: What do variable-importance measures look at? On which algorithm are they most often used and why?

A

Variable-importance measures look at how important a variable is to the predictions of the model.

They’re commonly used for random forests, because random forests lack interpretability without them.

And more recently in my experience, they’re big for xgboost. Ensemble tree methods just lend themselves to variable importance metrics, because you can look at how often a tree chooses to split on a given variable, how the model performs with and without trees that use a certain variable, etc.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
52
Q

ML: For random forests, what are the 2 most common ways to calculate variable importance metrics?

A
  1. Find all of the splits which use that variable in all of the decision trees, and then find how much on average those splits improve the predictions, using some measure of prediction performance like Gini index.
  2. Randomly permute each variable one at a time, and see how much model performance decreases for each. (I’m guessing this means to calculate out-of-bag error, and for each point, before getting the prediction, choose the value of the variable in question randomly.)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
53
Q

ML: What is clustering, and how does it differ from classification?

A

Clustering is an unsupervised learning technique, so our training data are not labeled, and we try to divide the data into groups that are in some way “similar”. So basically, we try to label each point such that similar points have the same label.

Classification is similar because each point has a label, but classification is supervised: we are given a labeling scheme and want to learn how to predict it as best as possible. With clustering, we aren’t given a labeling scheme, and want to find a sensible one.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
54
Q

ML: What does k-means clustering approximately minimize? And how does its approximation of this minimization work on a theory level?

A
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
55
Q

ML: How do you run the k-means clustering algorithm?

A

After choosing your number of classes k, randomly initialize the locations of your k cluster centers. Then, repeat:

  1. Assign all training points to the cluster center that is closest by euclidean distance
  2. Set cluster center as the average of all points in the cluster
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
56
Q

ML: What are some drawbacks to k-means clustering?

A
  1. It’s nondeterministic due to the random initializations, and is not guarenteed to reach the global optimum.
  2. Cluster center isn’t interpretable, as it’s an average of a bunch of points
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
57
Q

ML: What is an advantage to k-means clustering?

A

It will always converge, as each step will decrease the in-cluster variance (until it’s at a local minimum)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
58
Q

ML: How does the k-medoids algorithm work? Specifically, what alterations are made to the k-means algorithm?

A

Here the cluster centers are actual points in the dataset, rather than an average of a bunch of points.

So we initialize k points randomly to be the initial cluster centers, then repeat:

  1. Assign all training points to the cluster center (a real point) that is closest to it
  2. For each cluster, choose the point in the cluster which minimizes mean squared distance from other points as the new cluster center. (In other words, minimize in-cluster variation.)
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
59
Q

ML: What is a pro of k-medoids over k-means? What is a con?

A

K-medoids provides more interpretable cluster centers, as they are actual points in the dataset rather than an average.

However, this is at the cost of having slightly higher in-cluster variation for k-medoids, as you simply have “fewer choices” of where to put your cluster center.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
60
Q

ML: At a high level (so for an arbitrary linkage), how does the hierarchical clustering algorithm work?

A

Each training point starts as its own cluster. Then at each step, we merge the 2 clusters with the highest similarity into one. We use linkages to determine similarity.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
61
Q

ML: What are a few advantages to hierarchical clustering?

A
  1. It is deterministic, and thus has lower variance
  2. You get clusterings for all values of k
  3. You have hierarchical clusters! If your clusters naturally have a sub-cluster architecture, this is useful.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
62
Q

ML: At a very high level, how do mixture models work?

A

When looking at our training data, we assume they come from a probability distribution that is a weighted sum of several simple underlying distributions, such as several gaussians.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
63
Q

ML: What advantage do mixture models have for clustering?

A

Rather than having each point discretely in one cluster, we compute the probability that each point is in each cluster. This is good for overlapping clusters, or for clusterings where some points have an unclear affiliation.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
64
Q

ML: For the purpose of clustering using graphs, what are 3 ways to compute a graph from training data?

A
  1. Connect 2 points iff their distance is below a threshold epsilon (epsilon-nearest-neighbors)
  2. Connect 2 points iff one is a KNN of the other, for some k
  3. Form a weighted graph where points with close euclidean distances have high weights
65
Q

ML: What is a graph cut? For the purposes of graph clustering, what 2 propert do we want a cut to have?

A

A graph cut is simply putting the vertices into 2 (or more) subsets, but it can be visualized as “cutting” across the edges which go between your two subsets.

We want our cuts to :

  1. Have a low total weight of cut edges. So if we’re using an unweighted graph, we want to cut as few edges as possible; if the graph is weighted, we want to cut as little weight as possible. This increases dissimilarity across clusters and similarity within clusters, as edges are drawn between similar points
  2. We want cuts to be balanced (or approximately balanced), meaning that a similar number of points is in each subset. This avoids putting one point in a subset, for example.
66
Q

ML: What is a key advantage of graph clustering?

A

It is very good at capturing non-blob-like clusters, or clusters with atypical shapes. This can be very useful

67
Q

ML: At an extremely high level, how does graph clustering work?

A

Form a graph from your training points, connecting points that are similar.

Then, make graph cuts which separate dissimilar points to form your clusters.

68
Q

ML: What method is used to do graph clustering with more than 2 clusters?

A

Spectral embedding (it uses some fancy linear algebra that is related to graph-based dimensionality reduction, but not going to get into weeds here.)

69
Q

402: What is the “optimal” regression function µ(x), and why is it optimal?

A

The optimal regression function µ(x) = E[Y|X=x]; it is the expected value of the true function y=f(x) at every point x.

This is the “optimal” regression function because it minimizes the expected mean squared error:

E[(Y-µ(x))2]

70
Q

402: What purpose does bootstrapping serve, and how does it do so?

A

Bootstrapping allows us to get an approximate sampling distribution for a statistic of our data.

So typically we have a dataset of n values Xi, and from it we will calculate a variety of statistics: mean, standard deviation, regression coefficient for a model you’re fitting to the data, difference in MSE between 2 models you’re fitting, etc. But for each statistic, you get one value of the statistic per dataset, and to get an (approximate) sampling distribution for a statistic, you need to simulate many datasets, then use them to find the sampling distribution of the test statistic, which you calculate for every simulated dataset.

71
Q

402: What is the Bootstrap Principle? In other words, what is the 3-step high-level/theoretical process for bootstrapping?

A
72
Q

402: Why is it so often useful to use bootstrapping to get an approximate sampling distribution of a statistic?

A

We use statistics to make estimations about the underlying data: maybe want the mean, or a coefficient in a model, or the difference in MSE between 2 models. By finding an approximate sampling distribution of such a statistic, we can approximately quantify the uncertainty around our estimated value. So if we have a sample mean estimating the expected value of the data µ, we can use bootstrapping to find a confidence interval, or a standard error, or a variance, etc. And quantifying our uncertainty regarding these estimates is always a responsible thing to do!

73
Q

ML: How do you specify a hyperplane classifier for D-dimensional x, and Y either 0 or 1?

A
73
Q

ML: What does it mean to “fold the intercept in” for a hyperplane classifier?

A

Rather than writing the hyperplane as wTx + b = 0, you write it as wTx = 0, and implicitly assume that the first (zeroth) entry in w is your intercept b, and the first (zeroth) entry in your x vector is a 1.

74
Q

ML: What is a simple way to draw a hyperplane classifier wTx = 0, as well as the direction of the arrow pointing towards y=1?

A

Just draw the hyperplane by normally drawing a line, by converting into y = mx + b form.

To figure out which side has which label and draw that arrow, do some easy-to-calculate sample point back in the original form of wTx = 0, and see whether wTx is above or below 0.

75
Q

ML: When would you use MAP estimation of a parameter instead of MLE estimation?

A

MAP allows you to include some prior distribution of the parameter before factoring in the data. So, if you have suspicions about the value of a parameter, perhaps because of domain knowledge, then MAP could be better as it will allow you to encorporate that knowledge.

76
Q
A
77
Q

ML: What is the goal of dimensionality reduction?

A

We want to take our D-dimensional dataset, and represent it using fewer dimensions (or fewer features). And we want to preserve as much of the structure and information in the data as we can: we want similar points in our original dataset to also be similar in the dimension-reduced dataset, and want dissimilar points to also remain dissimilar (by whatever measure of similarity makes sense in context).

78
Q

ML: What does PCA give you?

A

With D-dimensional data, PCA finds up to D new linear dimensions, or vectors, or “number lines” as I think of them, which are linear combinations of the original dimensions (e1, e2,…). Each of these dimensions is called a Principal Component, and the first is the vector v1 maximizing the variance of Xv1, the projections of the points in design matrix X onto the number line of v1. The remaining principal components maximize the variance of the projection Xvi while also making vi orthogonal to all previous principal components. (All D principal components thus form a basis for RD, as they’re all orthogonal.)

Using these principal components, you can choose k < D of them to represent your data in the k dimensional subspace preserving as much of the variance in the data as possible (I bet it’s actually an approximation of that, cuz it feels like it’s a greedy algorithm, but that’s the idea).

79
Q

ML: In PCA, what is proportion of variance explained?

A

For some subset of the D principal components, say k of them, the proportion of variance explained is how much of the variance from the original dataset is preserved by this k-dimensional representation. (Or what is explained by each individual principal component; both definitions work.)

Specifically, it is (I assume) the variance in the k-dimensional representation divided by the variance in the original data.

We are often interested in proportion of variance explained as a function of k, so we can see when we have a k high enough to explain a reasonable proportion of the variance, such as maybe 90%.

80
Q

ML: In PCA, what is a scree plot?

A

It plots proportion of variance explained vs number of principal components used, as a means of visualizing how much information will be preserved by each additional principal component.

This is the general idea, but there are many versions. Some plot number of PCs used vs eigenvalue associatied with its eigenvector, because these are related (as described in another flashcard). Another type, shown below, has the proportion of variance at each individual PC.

81
Q

ML: What is an intuitive explanation of proja(b), the projection of vector a onto vector b?

A

It’s just the component of a that’s in the direction of b. Think 2-d velocity calculations from physics: when we divide the velocity of an object into its x component and y component, that’s the velocity vector’s projection onto the vectors of the axes, (1,0) and (0,1), respectively.

82
Q

ML: What is a basis, with respect to a vector space?

A

A basis is a set of vectors such that any vector in the vector space can be represented as a linear combination of the vectors in the set.

83
Q

ML: What is the formula or equation satisfied by matrix M and its eigenvector v and associated eigenvalue ÿ (lambda)?

A

Mv = ÿv

(Mv = lambda*v)

84
Q

ML: In PCA, once you have your top k principal components vi, how do you get your k-dimensional representation of data matrix X?

A

Xnew = XVk, where the columns of Vk are the k principal components vi.

In other words, you find all k PC scores, Xvi, which is just the projection of each point x onto the number line of direction vi, and then you combine all of those projections into your new representation Xnew.

85
Q

ML: There are three common high-level ways to compute PCA. What are they? In other words, what 3 matrices can you do what three things to to get relevant information for PCA?

A

Given data matrix X, you can:

  1. Find the eigendecomposition VDVT of its covariance matrix (Sigma-hat)
  2. Find the singular value decomposition of X = UD*VT.
  3. Take the inner product matrix XXT and find XXT = U(D*)2UT, where U and D* the same as in option 2.
86
Q

ML: How do you Kernel PCA for some dataset, and what does Kernel PCA allow you to do?

A

Kernel PCA allows you to find nonlinear embeddings using PCA, which typically only finds optimal linear embeddings.

You’ll recall that you can do normal PCA by taking the inner product matrix as XXT​ = U(D*)2UT, where XXT is the inner product matrix with (XXT)i,j = xiTxj.

Well instead, use the kernel trick and make a different similarity matrix, a kernel matrix K, where the entries Ki,j = k(xi,xj) for some nonlinear kernel. Then do the same decomposition into K = U(D*)2UT for some other matrices U and D, and find the PC scores in the new nonlinear space as UD just as you would when using XXT.

This way, you can find the PCA dimension reduction of X, but after transforming the points of X into any new subspace you want, likely a non-linear one, using kernels.

87
Q

ML: How does MDS work for dimensionality reduction, and what does it accomplish?

A

If you have n points, and you have a matrix of distances between points, then you can find a similarity matrix M, similar to XXT, on which you can perform PCA to get a dimension-reduced version of the data. (Each entry Mi,j is the distance between points i and j.) But, you can use any distance measure, including non-euclidean distance measures, which will help you find nonlinear dimension reductions which can preserve nonlinear structure in the data.

88
Q

ML: How does ISOMAP work for dimensionality reduction, and what does it accomplish?

A

You’ll recall that you can use PCA to find a linear dimension reduction of data using only similarity matrix XXT. Additionally, in MDS, you feed PCA a similarity matrix that instead uses some non-euclidean distance measure.

Well, ISOMAP is simply an instance of MDS where the distance matrix is calculated using graphs. You make a graph where the vertices are your data points, and every edge from point i to j has a weight of their euclidian distance apart. Then the graph distance is the shortest path in the weighted graph from i to j; feed these non-euclidean distances into MDS, and you’ll get a nonlinear embedding which preserves structures like swirls or curves.

89
Q

ML: What is overfitting?

A

Overfitting is when you allow your model to learn the patterns in the training data too well, so that it follows the noise in the training data. This results in validation error that is much higher than training error. This typically happens when your model has too much complexity (so that it can learn all the idiosyncracies of the training data), and/or when you don’t regularize.

90
Q

ML: What does the graph of train error and test error vs model complexity look like?

A
91
Q

ML: In general, what is Regularization? And what is its purpose?

A

Regularization is when you limit the values your models parameters can take, or (more commonly) penalize your model for taking certain values.

Typically, this looks like penalizing your model for having parameter values/coefficients that are large nonzero. So your objective function now needs to balance model fit and model complexity.

Regularization is generally used to combat overfitting.

92
Q

ML: What does Regularization do to bias and variance? And what causes these changes?

A

Regularization increases bias: by penalizing a model for taking certain forms or parameter values, it becomes less capable of taking the form of the true model in certain situations; its ability to fit the data decreases.

Conversely, variance decreases, meaning that the model becomes less impacted by noise and fluctuations in the data. This is really the whole point of regularization: to combat overfitting to noise.

93
Q

ML: Generally speaking, when should you regularize your model?

A

Almost always. This is a good thing to remember; it was driven home in 462, but often falls by the wayside.

94
Q

ML: What is the general form of an objective function for finding your parameters beta (B1, B2,…) using ridge regression?

A

If you have p parameters, it takes the value below for some balancing value lambda. (Note you don’t regularize the intercept B0)

95
Q

ML: What is the general form of an objective function for finding your parameters beta (B1, B2,…) using lasso regression?

A

If you have p parameters, it takes the value below for some balancing value lambda. (Note you don’t regularize the intercept B0)

96
Q

ML: When doing regularization, why does shrinking the coefficients combat overfitting?

A

A coefficient basically says that there is a substantive relationship between one variable and your output variable. By shrinking the coefficients, or penalizing large coefficients, you limit the number and magnitude of substantive predictive relationships that your model says exist.

This in theory means, if tuned right, it can only capture the important or “real” trends, but can’t capture the minutia in the noise of the training data. Hence, it combats overfitting.

97
Q

ML: What are the pros and cons of lasso regression vs ridge regression?

A

Lasso regression particularly wants coefficients to be zero (rather than them being very small, but not zero). This means it will result in more zero coefficients, so it makes models that are more interpretable and that show which variables “actually matter”.

However, lasso thus zero’s out small relationships, which means it has more bias. Ridge doesn’t do this: the difference between 0 and .001 isn’t as big for ridge, which results in basically no zero coefficients. Thus, ridge would make more sense if you’re more concerned with accuracy than with interpretability.

98
Q

ML: When using regularization, what will the regularized parameters look like as lambda goes to 0? To infinity?

A

As lambda goes to zero, we move towards no regularization, so the solution moves towards being the same as if we hadn’t regularized.

As lambda goes to infinity, our model will just make all coefficients go to zero, as this minimizes the objective function.

99
Q

When regularizing, what does a high lambda mean? A low lambda?

A

A high lambda means that the shrinking of coefficients has a lot of weight in the objective function, and so combatting overfitting will be very important to this model.

Conversely, low lambda means more liberty to have high coefficients, and we are more able to fit training data.

100
Q

ML: Generally, when regularizing a model, what will plots of training and test error vs lambda look like if, say, no regularization would lead to an overfit model?

A

Increased lambda means more regularization.

As such, training error will monotonically increase as lambda increases.

Test error will decrease, then increase. So there will be an optimal lambda value where test error is minimized.

101
Q

ML: Which of the following plots of parameter values vs lambda is for ridge regularization, and which is for lasso regularization? How can you tell?

A

The one where things go quickly to zero, or the right one, is lasso, because lasso zero’s out coefficients.

The one where coefficients asymptote towards zero is ridge, as it decreases coefficients but doesn’t actually zero them out.

102
Q

ML: Why is the equitable scaling of your different variables important when regularizing?

A

If the scales of your variables are very different, then a small parameter might mean a large change, or a large parameter might be a smart change. Because regularization tries to decrease parameter values without considering which variables they correspond to or what those variables’ scales are, the incentives for decreasing the influence of certain variables will be accidentally too high or too low based on scale.

103
Q

ML: When regularizing, why do we not regularize the bias term b?

A

The goal of regularization is to shrink coefficients of predictor variables, making it less likely our model will say it has predictive value.

Take Y = aX + b for example. By shrinking our learned value for a, we limit the amount of influence our model can percieve that X has on Y.

But b doesn’t describe how much impact X has on Y. It merely offsets the relationship and says what Y is likely to be when X is zero, but it doesn’t impact the model’s perception of whether X is related to Y.

So if we were to regularize b, and penalize our model for having large values of b, we would be limiting our model’s ability to find the correct relationship (increasing bias) without helping its ability to combat overfitting, or finding substantive relationships between variables which don’t actually exist. Hence, don’t regularize b.

104
Q

ML: What is a basic example of a regression algorithm and a classification algorithm where ridge or lasso regression are often used?

A

Linear regression and logistic regression (because they fit the formulas and ideas we’ve learned so well).

105
Q

ML: In classification, what might happen if your training set is imbalanced, and one label is very common or very uncommon?

A

Our classifier becomes biased towards predicting the label that is more common in our training set, hurting its predictive ability.

106
Q

ML: In classification, what are 3 ways to combat an imbalanced training set, where one label is way more common than the others?

A

Downsampling: Only train using a subset of the points with the common label

Upsampling: Duplicate the training points with the less common label and use them multiple times in the dataset.

Weights: Changes weights of your problem in a variety of ways. For example, penalize the model more for errors when the correct answer was the uncommon label.

107
Q

ML: If your training error is much lower than your test error, does this mean your model’s variance is too high, or its bias is too high?

A

Variance

108
Q

ML: What are a few high-level model traits that are good to keep in mind when choosing a model?

A
  • Model flexibility, or ability to fit a variety of true underlying models
  • Ability to capture nonlinearity
  • Interpretability
  • Ability to handle irrelevant features well

(Don’t necessarily need to get all of these or get it exactly right, but these are good things to keep in mind)

109
Q

ML: What is a common advantage that both lasso-regularized models and random forests have in common? And why is this useful?

A

They both handle irrelevant features very well, which means you can throw a lot of features at them in hopes of finding a couple good ones, even if you suspect lots of them have little or no predictive power.

110
Q

ML: When training a model using a simple train/validation/test split (not cross validation), what are the general high-level steps?

A
  1. Split off some of your data, say 10 or 20%, to act as test data. Do this first.
  2. Split off more data to act as your validation data; leave the rest as training data.
  3. Try a bunch of models by training them with the training data, then finding their approximate out-of-sample error by predicting on the validation data.
  4. Choose a final model.
  5. Approximate your final model’s true out-of-sample error by training it on both the training and validation data and finding its error on the test set.
  6. Train the model on all of your data, and deploy it to do whatever it needs to do in the real world.
111
Q

ML: How do you train a model using cross-validation? How does it differ from normal validation?

A

With normal validation, once you’ve set aside your test data, you split your remaining data into train and validation data. You train all your models on the train data, then validate them with the validation data. You pick a model and test it on the test data.

With cross-validation (say with k folds, such as 5 or 10), after you set aside test data, you instead divide it into k groups, or k “folds”. For every model you try, you train it on k-1 of the folds and find its validation error on the remaining fold, and you do this for each of the k folds. Finally, you get your validation error by averaging all k of your error measures on each of the folds. (So you train your model k times rather than once when you are validating.)

112
Q

ML: What are the pros and cons of cross-validation vs normal validation? Which of these makes cross-validation more common and popular than normal validation?

A

Cross-validation wastes less training data during the train/validation part of the model fitting process. It also decreases the variance that appears when you split train and test: what if an outlier shows up in the validation set, or what if it doesn’t? For these reasons, cross-validation is much more common in practice than normal validation.

However, cross-validation can take a long time: the more folds you have, the longer it’s going to take to do all that training, whereas with normal validation you only train once.

113
Q

ML: When cross-validating, is more or less folds better? What is the “ideal” number of folds? What is the intuition behind this?

A

Generally more folds is better, and you want as many folds as possible. The “ideal” cross validation is leave-one-out cross validation or LOOCV, where each datapoint is its own fold, with the intuition being that the more data you use to train for each fold, the better you simulate how the model will perform on actual held-out data.

But of course, more folds means more time, and thus LOOCV is often not realistic. 5-fold and 10-fold CV are common and tend to do the trick.

114
Q

ML: In the train-validate-test split, why do we need a test set to get an accurate idea of our model’s out-of-sample error? Why can’t we use the model’s validation error for this? (Equivalently, why do we need a test set when doing cross-validation rather than just using cross-validation error?)

A

Any choice that you make in model selection when working with training and validation data has the risk of overfitting to the training and validation data. The characteristics of your training and validation data that cause you to make certain decisions might be noise.

Additionally, if you try enough models, one of them is likely to get a low validation error simply due to random chance.

For these reasons, you need data to test your model on after you choose your model to accurately estimate your out-of-sample error. You need the data that you use to assess out-of-sample error to not be involved in selecting a model in any way, otherwise it is possible that you are underestimating your error for these reasons.

115
Q

ML: When choosing a model for a given task, what choices need to be made or actions need to be done after you set aside your test set, rather than before? And why?

A

All decisions and all actions. Or as many as possible.

You shouldn’t do feature engineering, or consider candidate algorithms, or decide how to define an outlier, or even curiously look at a histogram of your data before you set aside your test data.

This is because all of these decisions are a part of your model selection process, and you need a test set which wasn’t used for model selection in order to get an accurate, unbiased estimate of your final model’s out-of-sample error.

116
Q

ML: One concern about not looking at your test data before splitting it off is that it might contain weird or unexpected data points that cause your final model to crash or fail, such as a new category in a categorical variable. What is a response to this concern?

A

If you have enough data (or as your sample size n tends to infinity), your odds of seeing unexpected new things and/or things that cause bugs should be low, as your training and validation data should be representitive of all available data.

117
Q

ML: Discuss the potential importance of feature engineering when finding an effective model.

A

It’s very important: finding the right features to feed into your model can easily be as important or more important as what algorithm you choose to use, and deciding/brainstorming these features will often take up a large majority of your time when trying and selecting models.

118
Q

ML: With the train/val/test split and the importance of a held-out test set in mind, what is a good way to think about the definition of a “model”? And why?

A

A model can really be thought of as any aspect of the algorithm that outputs your predictions. This of course includes the algorithm you choose (random forests, linear regression, etc) and the values of the model’s parameters, but it also includes the features that your model takes as inputs, or how your model defines and handles outliers. That features idea is very important: the presence or absence of features is a part of your model; feature engineering is a subset of model selection.

This is a good way to think because it emphasizes the importance of not making any decisions or looking at anything before you set aside your test set. It also encourages you to think deeply about (and remember to cross-validate) aspects of model selection like feature engineering that you might think less important than, say, choosing your algorithm.

119
Q

ML: During model selection, what sorts of decisions should you cross-validate? How do you cross-validate these decisions? And why do you choose to cross-validate them? What practical concerns arise and how do you address them?

A

You want to cross-validate as many decisions as possible. This includes the normal stuff like the value of your hyperparameters, but also which algorithm you use, which features you choose to include, etc.

To cross-validate these things, you just find a cross-validation error for each combination of those decisions. So you try your algorithms with and without each feature you’re considering, for each of your candidate algorithms, and each of their possible hyperparameter values.

The issue with this strategy is it takes a lot of time, as you have a lot of combos. You address it by taking shortcuts that seem reasonable to you, and that your intuition says would still lead to a good final model. Maybe you don’t try every combination of features, or every value of a hyperparameter. Maybe you choose an algorithm with only a couple possible sets of features and a couple sets of hyperparameters, then hone in on that algorithm and try more combos of features, more candidate features in general, and spend more time on more possible hyperparameter values.

120
Q

ML: At an introductory level, what is training data used for and what is validation data used for when training a model? What is an example of each of these?

A
  • Training data* is used to learn the values of your model’s parameters. For example, when learning a neural network, your training data would teach the model its weights and bias terms
  • Validation data* is used to learn the values of the hyperparameters. For example, in KNN, you might use validation data to decide k, the number of neighbors, which is a hyperparameter. In the neural network example, you might use validation data for defining number of layers, nodes per layer, which activation function, etc.
121
Q

ML: What do we use methods like gradient descent for?

A

We use gradient descent to minimize (or maximize) differentiable objective functions, or to find the value of x sucht that f(x) is minimized.

f(x) needs to be differentiable because we need to take its derivative in order to get the gradient.

122
Q

ML: Why would we use an iterative method like gradient descent to minimize a function f(x) rather than just solving for its true minimum with calculus?

A

Solving it with calculus is often extremely time-consuming or impossibly time-consuming, especially when you need to minimize an objective function with respect to, say, 1000 parameters as with NNs.

123
Q

ML: What is a common problem you will run into when using an iterative descent method like gradient descent to minimize a function f(x)?

For what type of functions are you guarenteed not to have this issue?

If you don’t have this type of function, what is one way to combat this issue?

A

Depending on where your starting point is, you may descent to a local minimum rather than a global minimum, but your algorithm will still stop: it can’t take a small step in a direction and improve its situation.

This won’t happen with convex functions, or where the gradient is always “negative” or zero (this is how to think of convexity in say 2 or 3 dimensions; the terminology might differ at higher dimensions)

To combat this issue for non-convex functions, you can try several randomly chosen starts and see which returns the minimum; it still doesn’t guarentee a global min though.

124
Q
A

The gradient of the function with respect to theta is just a vector of the partial derivatives of the function with respect to each of the p parameters in theta.

Intuitively (especially when visualizing say 3 dimensions), it is the direction of the n-dimensional slope, or the direction in which a small step would most productively move you towards optimizing the objective function. (Technically, you might move in the opposite direction, but same idea.)

125
Q
A
126
Q
A
127
Q

ML: In stochastic gradient descent, why do we feel comfortable taking small steps in the direction of the gradient for only one of the training points, when what we’re really trying to optimize the gradient with respect to the entire training set?

A

If you choose the point randomly, the expected value of the gradient with respect to a single point equals the gradient with respect to all n points.

128
Q

ML: What is an epoch?

A

A single pass throught the entire training set.

129
Q

ML: Why is SGD often faster and more effective than normal gradient descent?

A

Finding the gradient with respect to the entire dataset takes a long time. (Specifically, I believe it takes about equal time to find the gradient with respect to each of the individual points, as I believe it’s basically just a sum of those). So getting one update, which is albeit definitely in the direction of the gradient, takes a long time.

Conversely, SGD may make some updates in the “wrong” direction, but you get to make n updates in the time normal gradient descent needs to make only one. In practice, this typically leads to faster convergence to the same result.

130
Q

ML: What is mini-batch gradient descent, and why is it sometimes better than both normal GD and SGD?

A

In normal GD, you make updates with respect to the entire datapoint; so, your update is always in the right direction, but updates take a long time.

In SGD, you make updates with respect to a single datapoint; so, your update may be in the wrong direction, but the expected value of the direction is correct, and you can make many updates quickly.

In mini-batch gradient descent, you combine these ideas by updating with respect to small subsets of the training data each time. This can sometimes lead to a happy medium between making updates quickly, and often being in an approximately correct direction.

131
Q

What is an SVM at a high level? Hard and soft-margin SVM?

A

Support vector machines (SVMs) are an algorithm for finding the optimal linear/hyperplane classifier for a dataset.

In hard-margin, no points can be misclassified, and the classifier tries to maximize the “margin”, which is the distance between the hyperplane and the closest point(s).

In soft-margin, points are allowed to cross the hyperplane. Each point is assigned a “slack variable”, describing how far it’s within the margin (a concept still kept from hard-margin) or within the decision boundary (i.e. misclassified).

132
Q

ML: How does multi-class SVM work, i.e. where there’s more than two classes?

A

Similar to multiclass logistic regression, we just fit k binary-classification SVMs: one for each of the k labels, predicting whether it’s that specific label vs any of the others (we treat all other labels as a single label for each binary classifier). We then predict via:

133
Q

ML: What is the idea of the kernel trick? What two valuable advantages does the kernel trick give us?

A
134
Q

ML: How does using the kernel trick for SVM work? Where do we sub in the kernel?

A
135
Q

ML: What does a norm accomplish at a very high level?

A

It maps a vector to a non-negative scalar in a way that measures its “length”

136
Q

ML: What is the lp norm (as in l0, l1, etc) for integer values from 0 all the way to infinity?

A
137
Q

ML: What is the l2 norm? What does it find, and how do we calculate it for a vector?

A

The l2 norm, or the euclidean norm, is simply the distance formula: it’s the length of the vector in euclidean space.

If x = [1,5,2], ||x||2 = sqrt(12 + 52 + 22).

138
Q

ML: What is the l1 norm? How do we calculate it?

A

For ||x||1, you just sum the absolute values of the entries in x.

||x||1 = |x1| + |x<span>2</span>| + |x3|

139
Q

ML: In the context of machine learning, what is the key difference between the l1 and l2 norms, and in the outcomes they produce?

How does this concept impact the regularization techniques we’re familiar with?

A

Suppose we are trying to find a vector x that minimizes a norm, either the l1 or l2 norm. So in both cases, we’re trying to find a vector x whose entries are near zero.

The l2 norm doesn’t care much whether an entry in x is exactly zero, or very nearly zero; as such, when we try to minimize the l2 norm, we get a bunch of values that are very nearly zero.

Conversely, the l1 norm does care a lot whether an entry in x is exactly zero, or very nearly zero; as such, when we try to minimize the l<span>1</span> norm, we get a bunch of values that are exactly zero, whereas with the l2 norm they weren’t quite zeroed out.

This pops up in regularization: when we do ridge regression, we’re minimizing the parameter vector over the l2 norm, and so we get a bunch of values that are almost zero, but none that are really zeroed out. When we do lasso regression, we use the l1 norm and thus get a bunch of parameters that are exactly 0.

There are tradeoffs here: lasso regression yields more interpretable results, but ridge regression is better at capturing very small relationships that lasso tends to just zero out.

140
Q

ML: What are the two different outputs we can use to evaluate a classifier

A

Most classifiers find the probability that a point has each possible label, and then predicts the label with the highest probability.

So, we can evaluate either the predicted labels, or the predicted probabilities.

(Note: The classifiers don’t exactly make probabilities, they make values between 0 and 1. But these values can often be interpreted as probabilities, especially if we choose to transform them into more of a probability distribution.)

141
Q

ML: Why might we want to evaluate the probabilities that a classifier outputs, rather than just the labels?

A

Oftentimes it’s important to know how likely the labels actually are, not just what the maximum-likelihood label: if a potential client has a 99% chance of signing a deal, we’ll prioritize them differently than if they have a 51% chance.

142
Q

ML: In the context of classifiers, what is calibration, or what does it mean for a classifier to be well-calibrated? How might we evaluate it for binary classifiers?

A

Recall that a classifier generally outputs the probability that a point has each label. A classifier is well-calibrated if the labels happen about as often as they’re predicted to happen.

So if, in every case where the classifier says a point has a 70% chance of being label A, about 70% of them are actually label A while the other 30% are something else, we have a well-calibrated classifier (at least for that predicted percentage for that label).

For binary classifiers, we can evaluate calibration using a bin plot. We group together all of the points where we predicted the odds of y=1 were between 0% and 10%, and then we see how many of those points were actually y=1. Similarly for 10% to 20%, and all the other bins.

143
Q

ML: What is the confusion matrix for a classifier?

A

For a classifier with k labels, it’s a k-by-k grid that plots the frequency of every predicted label vs every actual label, so we can see what types of errors the classifier tends to make.

For a binary classifier, it simply shows the number of true positives, false positives, true negatives and false negatives the classifier had, in the form of a 2x2 grid.

144
Q

ML: What is the sensitivity of a binary classifier?

A

It is the proportion of true positive’s, or of y=1’s, that we get right.

So it’s :

(true positives/[true positives + false negatives])

145
Q

ML: What is the specificity of a binary classifier?

A

It is the proportion of true negativess, or of y=0’s, that we get right.

So it’s :

(true negatives/[true negatives + false positives])

146
Q

ML: Why might a classifier really need to avoid false positives, but be alright with false negatives?

What about one that really needs to avoid false negatives, but be alright with false positives?

How does this relate to specificity and sensitivity?

A

Both of these cases occur when one error is much more costly than the other.

Say we’re trying to decide whether to administer a treatment that has few side effects. If we get a false negative, and incorrectly classify someone as “healthy”, then we don’t give them treatment, which is very costly. But, if we incorrectly say they’re “sick” and treat them when they don’t need it, it’s not as bad. Here we heavily prioritize avoiding false negatives, or we want sensitivity to be high.

Conversely, consider a spam filter. We’re okay with accidentally sending spam to the inbox (false negative), but don’t want to send important emails to spam (false positive). Now, we care a lot about the specificity of our classifier being high, as we want to avoid false positives.

147
Q

ML: How to we alter a binary classifier we’ve already trained to make a tradeoff between specificity and sensitivity?

At a high level, how would we instead accomplish this prioritization/tradeoff during training rather than after?

A

Classifiers output a probability that y=1 on a given point. Typically we label a point with a 1 if the probability is above 50%, but we can change the threshold that the probability must reach to label a point as y=1.

If we only label it y=1 if the probability of y=1 is above 75%, then we are okay with false negatives, but really don’t want false positives; we’re prioritizing specificity.

Conversely, if we label it y=1 whenever the probability of y=1 is above 25%, then we are okay with false positives, but really don’t want false negatives; we’re prioritizing sensitivity.

Another option would be to initially train the classifier so a false-negative causes higher loss in the loss function than a false positive, or vice versa. So we would have different loss amounts L0 and L1 for the two different errors.

148
Q

ML: When false-negatives are much worse than false-positives, do we want a classifier with high sensitivity, or high specificity? What about when false-positives are much worse?

A

Sensitivity describes the probability that we label a point y=1 when it is in fact y=1; it’s the probability of avoiding a false-negative.

Specificity describes the probability that we label a point y=0 when it is in fact y=0; it’s the probability of avoiding a false-positive.

So, we want high sensitivity when false-negatives are bad, and high specificity when false-positives are bad.

149
Q

ML: For a binary classifier, what is an ROC curve? What is the AUC, and do we want our classifiers to have a high or low AUC?

Also, what is the ROC curve of a “random” classifier, and what is its AUC? (You’ll have to define what “random” means in this context.)

A

Recall that our trained binary classifier outputs a probability p that a point x has label y=1, and we can tradeoff the sensitivity and specificity of our classifier by moving the threshold of this probability where we predict y=1. So for some value t, we predict y=1 if p > t. It is high, we are avoiding false positives and thus have high specificity; if t is low, we are avoiding false negatives and thus have high sensitivity.

A curve on an ROC plot shows a classifier’s sensitivity and its specificity at every possible value of threshold t, which can be anywhere in [0,1]. It plots (1 - specificity) on the x axis, and sensitivity on the y axis.

What we want is classifiers whose curves are bowed out to the left, because then for each value of t they do a good job with both sensitivity and specificity. In this case, the classificer has a high AUC, or area-under-the-curve. We want high AUC, as it basically means our classifier is accurate.

The diagonal curve on the ROC corresponds to a random classifier, where our probability value p is randomly chosen from the interval [0,1], and then we threshold that probability for some threshold value t. The AUC of this curve is 0.5.

So a curve on the ROC plot is a classifier, and point on the ROC plot is a classifier-threshold pair. The top-left corner is a classifier-threshold pair always predicts the label correctly, and the bottom-right always predicts incorrectly. The bottom-left always predicts 0, and the top-right always predicts 1.

150
Q

ML: What does linear regression try to find for a dataset of (x,y) pairs?

A
151
Q

ML: Why is linear regression so popular?

A

It’s very common

It’s very easy to understand

The math is simple and well-known (I’m not going to memorize it, but it follows from the statement of what it’s trying to minimize.)

152
Q

ML: In regression, what are the residuals for your dataset X and model µ()?

A

A residual for a prediction Y*=µ(x) is Y - Y*, the difference between the real and predicted values.

153
Q
A
154
Q

ML: In linear regression, how do we check the i.i.d. assumption?

A

One option is to plot your points vs some axis like Time, and assure that there is no pattern among points close in time. Basically look at ways the points might be related to one another, and see that they’re not, and that there is no pattern.

(This one was less well-defined in class)

155
Q

402: What did we learn that we should almost always use in place of linear regression?

A

An additive model

156
Q

402: What does it mean for a model to be universally consistent?

A

As n goes to infinity, the learned model must converge to the actual underlying model.

157
Q

402/401: What is the correct way of interpreting a coefficient Bi in a linear regression? What is an incorrect way of interpreting it? Why is the latter correct while the former is incorrect?

A

Correct: Bi is the expected difference in the outcome variable Y between two points whose difference in Xi is 1, assuming the values of all other predictors are held equal.

Incorrect: Bi is the expected difference in Y caused by increasing Xi by 1.

The incorrect one is wrong because it assumes some sort of causal relationship. Our linear model by itself says nothing about what will happen if we change the value of Xi, it just says that on average, points with higher Xi have higher (or lower, depending on sign of Bi) values of Y based proportionally on Bi, which is what the correct interpretation is saying.

158
Q

402: What should be our go-to tool for model selection, or for comparing the difference in prediction error of two models, rather than R2?

A

Cross-validation, to approximate the MSEs of each model on out-of-sample data.