Machine Learning Flashcards

Technical interview study

1
Q

What is Machine Learning?

A

ML is the field of science that studies algorithms that approximate functions increasingly well as they are given more observations.

2
Q

What are some common applications of Machine Learning?

A

ML algorithms are used to learn and automate human processes, optimize outcomes, predict outcomes, model complex relationships, and to learn patterns in data (among many other uses).

3
Q

What is labeled data and what is it used for?

A

Labeled data is data that includes the value of the target variable for each instance.

Labeled data allows us to train supervised ML algorithms.

4
Q

What are the most common types of algorithms that use supervised learning?

A

Most common types of supervised learning algorithms:

regression
classification

5
Q

What are the most common types of algorithms that use unsupervised learning?

A

Most common unsupervised learning algos:

clustering, dimensionality reduction (PCA), and association-rule mining.

6
Q

What is the difference between online and offline learning?

A

Online learning refers to updating a model incrementally as it receives new observations.

Offline learning refers to learning by batch processing data. If new data comes in, an entire new batch (including all the old and new data) must be fed into the algorithm to learn from the new data.
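
A minimal sketch of the contrast, assuming scikit-learn and numpy are available (SGDClassifier, partial_fit, and the toy arrays here are just illustrative choices):

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(100, 3)), rng.integers(0, 2, size=100)
X_new, y_new = rng.normal(size=(10, 3)), rng.integers(0, 2, size=10)

# Online: update the existing model using only the new batch.
online = SGDClassifier()
online.partial_fit(X_old, y_old, classes=[0, 1])
online.partial_fit(X_new, y_new)  # old data does not need to be revisited

# Offline: retrain from scratch on the full batch (old + new data together).
offline = SGDClassifier()
offline.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))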

7
Q

What is reinforcement learning?

A

Reinforcement learning describes a set of algorithms that learn from the outcome of each decision.

e.g., a robot could use reinforcement learning to learn that walking forward into a wall is bad, but turning away from a wall and walking is good.

8
Q

What is the difference between a model parameter and a learning hyperparameter?

A

A model parameter describes the final model itself; e.g. slope of a linear regression fit.

A learning hyperparameter describes a way in which a model parameter is learned; e.g. learning rate, penalty terms, number of features to include in a weak predictor.
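
A minimal scikit-learn sketch of the distinction (Ridge is an illustrative choice; the toy weight/height points are the ones used in the gradient descent card further down): alpha is a hyperparameter chosen before training, while coef_ and intercept_ are model parameters learned from the data.

from sklearn.linear_model import Ridge

X = [[0.5], [2.3], [2.9]]   # weights (feature)
y = [1.4, 1.9, 3.2]         # heights (target)

model = Ridge(alpha=1.0)    # alpha (penalty strength) is a learning hyperparameter
model.fit(X, y)

print(model.coef_, model.intercept_)  # slope and intercept: learned model parameters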

9
Q

What is overfitting?

A

Overfitting is when a model makes much better predictions on known training data than on unseen (validation, test) data.

10
Q

How can we combat overfitting?

A

Ways to combat overfitting:

a. reduce the flexibility of the model (by changing the hyperparameters; see the sketch after this list)
b. select a different model
c. use more training data
d. gather better quality data
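
A minimal sketch of (a), assuming scikit-learn; the synthetic dataset and the depth limit are just an example of dialing down model flexibility:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

flexible = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)           # prone to overfitting
simpler = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

print(flexible.score(X_train, y_train), flexible.score(X_val, y_val))  # large train/validation gap
print(simpler.score(X_train, y_train), simpler.score(X_val, y_val))    # usually a smaller gap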

11
Q

What is training data and what is it used for?

A

Training data is data which will be used to train the ML model.

For supervised learning, this training data must have a labeled target, i.e. what we are trying to predict must be defined.

For unsupervised learning, the training data will contain only features and will use no labeled targets; i.e. what we are trying to predict is not defined.

12
Q

What is a validation set and why do we use one?

A

A validation set is a set of data that is used to evaluate a model’s performance during training/model selection. After models are trained, they are evaluated on the validation set to select the best possible model.

Information from the validation set must never be used to train the model.

It must also not be used as the test data set because we’ve biased our model selection toward working well with this data, even though the model was not directly trained on it.

13
Q

What is a test set and why use one?

A

A test set is a data set not used during ML training or validation.

The model’s performance is evaluated on the test set to predict how well it will generalize to new data.
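
A minimal sketch of carving out train, validation, and test sets with scikit-learn (the 60/20/20 proportions and synthetic data are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First hold out a test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

# Fit on the training set, select/tune on the validation set,
# and evaluate on the test set only once, at the very end.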

14
Q

What is cross validation and why is it useful?

A

Cross validation is a technique for more accurately training and validating models. It rotates what data is held out from model training to be used as the validation data.

Several models are trained and evaluated, with every piece of data held out from exactly one model. The average performance of all models is then calculated.

It is a more reliable way to validate models but is more computationally expensive, e.g. 5-fold CV requires training and validating 5 models instead of 1.
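
A minimal 5-fold cross-validation sketch with scikit-learn (the iris data and logistic regression model are placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Five models are fit; each observation sits in the validation fold of exactly one of them.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())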

15
Q

What does a confusion matrix look like?

A

                  Predicted values
                  yhat=1      yhat=0
True     y=1      TP          FN        recall (sensitivity) = TP/(TP+FN)
values   y=0      FP          TN        specificity = TN/(TN+FP)
         precision = TP/(TP+FP)         accuracy = (TP+TN)/total

precision: measures the accuracy of predicted-positive outcomes
recall (sensitivity): measures the model's ability to identify real positive-class outcomes, i.e. the proportion of true 1s identified
specificity: measures the model's ability to identify negative outcomes, i.e. the proportion of true 0s identified
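
A minimal plain-Python sketch of these metrics computed from the four cells of the matrix (the counts are made up):

TP, FN, FP, TN = 40, 10, 5, 45   # hypothetical confusion-matrix counts

recall = TP / (TP + FN)          # sensitivity: proportion of true 1s identified
specificity = TN / (TN + FP)     # proportion of true 0s identified
precision = TP / (TP + FP)       # accuracy of predicted-positive outcomes
accuracy = (TP + TN) / (TP + FN + FP + TN)

print(recall, specificity, precision, accuracy)  # 0.8, 0.9, ~0.889, 0.85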

16
Q

What is a ROC curve?

A

Notice there is a trade-off between recall (the model's ability to identify true 1s) and specificity (its ability to identify true 0s);

i.e. capturing more 1s usually comes at the cost of misclassifying more true 0s as 1s.

The ideal classifier would do an excellent job classifying the 1s without misclassifying more 0s as 1s.

The curve that captures this recall/specificity trade-off is the receiver operating characteristic (ROC) curve.

y-axis: recall (model's ability to predict true 1s)

x-axis: specificity (model's ability to predict true 0s); in the more common convention the x-axis is instead the false positive rate, 1 - specificity.

17
Q

What is AUC?

A

The classifier diagnostic metric Area Under the Curve (AUC) is the total area under the ROC curve, where the x-axis is specificity (the model's ability to capture true 0s) and the y-axis is recall (its ability to capture true 1s).

The larger the AUC value, the more effective the classifier.

An AUC of 1 indicates a perfect classifier: it classifies all the 1s correctly and doesn't misclassify any 0s as 1s.

A completely ineffective (random) classifier falls on the diagonal of the ROC plot and has AUC = 0.5.
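
A minimal scikit-learn sketch (the labels and scores are made up); note that roc_curve reports the false positive rate, i.e. 1 - specificity, which is the more common choice for the x-axis:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]                       # actual classes
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.55]    # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # tpr is recall/sensitivity
print(roc_auc_score(y_true, y_score))              # 1.0 = perfect classifier, 0.5 = random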

18
Q

Explain what gradient descent is to a 5-year-old

A

In ML, we optimize a lot of stuff.

e.g.
linear reg: optimize intercept and slope
logistic reg: optimize the location of the sigmoid

If we learn how to optimize linear regression parameters with GD, then we can learn how to optimize everything else with GD.

Say we have data: height (y) and weight (X).

Toy example: say we fix the weight coefficient (slope) at 0.64 and let GD optimize/find the intercept that minimizes the loss, min(SSR):

predicted height = intercept + 0.64*weight

  1. Pick a random value (say 0) for the intercept and compute the loss. For the three data points (weight, height) = (0.5, 1.4), (2.3, 1.9), (2.9, 3.2):

SSR_1 = 1.1^2 + 0.4^2 + 1.3^2 ≈ 3.1

We could repeat this for many candidate intercepts and plot the results (x-axis: intercept value, y-axis: SSR), but this brute-force strategy is SLOW.

Instead we can optimize the intercept with GD, which does only a few calculations (takes large steps) far away from the optimal solution, and increases the number of calculations (takes baby steps) when closer to the optimal value.

  2. Apply the GD algorithm by taking the derivative of the loss function (chain rule on SSR), where the derivative of SSR is the slope of the parabola-shaped SSR curve:

do until the step size approaches 0:

d/d(intercept) SSR
= -2(1.4 - (intercept + 0.64*0.5))
+ -2(1.9 - (intercept + 0.64*2.3))
+ -2(3.2 - (intercept + 0.64*2.9))

step size = slope of SSR * learning rate
new intercept = old intercept - step size

Now let's separately optimize BOTH the intercept and the feature coefficient (using partial derivatives). The loss SSR is now a 3D bowl-shaped surface (y-axis is SSR, x-axis represents different values of the intercept, z-axis represents different values of the slope). We want to find the optimal intercept AND slope which minimize SSR.

d/d(intercept) SSR = d/d(intercept) [ (1.4 - (intercept + slope*0.5))^2
                                    + (1.9 - (intercept + slope*2.3))^2
                                    + (3.2 - (intercept + slope*2.9))^2 ]

= (by the chain rule)

d/d(intercept) SSR
= -2(1.4 - (intercept + slope*0.5))
+ -2(1.9 - (intercept + slope*2.3))
+ -2(3.2 - (intercept + slope*2.9))

Now, do the same partial derivative for the slope:

d/d(slope) SSR = d/d(slope) [ (1.4 - (intercept + slope*0.5))^2
                            + (1.9 - (intercept + slope*2.3))^2
                            + (3.2 - (intercept + slope*2.9))^2 ]

= (by the chain rule)

d/d(slope) SSR
= -2*0.5*(1.4 - (intercept + slope*0.5))
+ -2*2.3*(1.9 - (intercept + slope*2.3))
+ -2*2.9*(3.2 - (intercept + slope*2.9))

Note: when we have two or more derivatives of the same function, they are called a GRADIENT.

We will use the gradient to descend into the lowest point of the loss function (RSS), which is why this algo is called GRADIENT DESCENT.

Begin the GD algorithm by initializing the intercept to 0 and the slope to 1.

Plug in intercept = 0 and slope = 1:

d/d(intercept) SSR
= -2(1.4 - (0 + 1*0.5))
+ -2(1.9 - (0 + 1*2.3))
+ -2(3.2 - (0 + 1*2.9))
= -1.6

d/d(slope) SSR
= -2*0.5*(1.4 - (0 + 1*0.5))
+ -2*2.3*(1.9 - (0 + 1*2.3))
+ -2*2.9*(3.2 - (0 + 1*2.9))
= -0.8

Now, plug the slopes into the step size formulas (learning rate eta = 0.01):

step_size_intercept = -1.6 * 0.01 = -0.016
step_size_slope = -0.8 * 0.01 = -0.008

new intercept = 0 - (-0.016) = 0.016
new slope = 1 - (-0.008) = 1.008

Repeat these steps until the step sizes approach 0. We don't need to worry much about choosing the learning rate: a reasonable learning rate can be found by starting large and shrinking it at each step.
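
A minimal plain-Python sketch of the whole procedure above, using the same three (weight, height) points and eta = 0.01; its first iteration reproduces the gradients -1.6 and -0.8:

data = [(0.5, 1.4), (2.3, 1.9), (2.9, 3.2)]   # (weight, height) pairs from the toy example

intercept, slope = 0.0, 1.0   # initial guesses
eta = 0.01                    # learning rate

for step in range(10000):
    # Gradient: partial derivatives of SSR with respect to intercept and slope.
    d_intercept = sum(-2 * (y - (intercept + slope * x)) for x, y in data)
    d_slope = sum(-2 * x * (y - (intercept + slope * x)) for x, y in data)

    step_intercept = eta * d_intercept
    step_slope = eta * d_slope
    intercept -= step_intercept
    slope -= step_slope

    if abs(step_intercept) < 1e-7 and abs(step_slope) < 1e-7:   # step sizes approach 0
        break

print(intercept, slope)   # converges near the least-squares fit (about 0.95 and 0.64)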

19
Q

Explain what overfitting is.

A

Overfitting is finding spurious results that were due to random chance and cannot be reproduced by subsequent studies.

examples:

We frequently hear reports about studies that overturn previous findings (e.g., eggs/wine are good for your heart). The problem is that many researchers (especially in the social sciences) too frequently commit the cardinal sin of data mining: overfitting the data.

The researchers test too many hypotheses without proper statistical controls until they happen to find something interesting to report. Unsurprisingly, the next time someone looks into the effect (which was largely due to chance), the effect will be much smaller or absent, i.e. irreproducible results.

Several methods can be used to avoid overfitting data:

  • try to find the simplest possible hypothesis
  • regularization (introduce bias into the model, which reduces its flexibility)
  • randomization testing (randomize the class variable, then run your model on this data set; if you get the same results, something is wrong; see the sketch after this list)
  • nested CV: do feature selection in the inner loop, then run the entire method in cross-validation on the outer loop
  • adjusting the false discovery rate
  • using the reusable holdout method, a breakthrough approach proposed in 2015
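
A minimal sketch of the randomization-testing idea from the list above, assuming scikit-learn (the data and model are placeholders): shuffle the target, rerun the full procedure, and check that the score collapses toward chance. (scikit-learn also provides permutation_test_score for this.)

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

real_score = cross_val_score(model, X, y, cv=5).mean()

rng = np.random.default_rng(0)
shuffled_score = cross_val_score(model, X, rng.permutation(y), cv=5).mean()

# If shuffled_score is close to real_score, the original "signal" is suspect.
print(real_score, shuffled_score)
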
20
Q

What is the curse of dimensionality?

A

The curse of dimensionality refers to problems that occur when we try to use statistical methods in high-dimension spaces.

As the number of features (dimensions) increases, the data becomes increasingly sparse, and often exponentially more samples are needed to make statistically significant predictions.

Imagine going from a 10x10 grid to a 10x10x10 grid: if we want one sample in each "1x1" cell, then adding the third parameter requires 10 times as many samples (1,000) as we needed with 2 parameters (100).

In short, some models become much less accurate in high-dimensional space and may behave erratically. Examples include: linear models with no feature selection or regularization, kNN, and Bayesian models.

Models that are less affected by the curse of dimensionality: regularized models, random forests, some neural networks, and stochastic models (e.g., Monte Carlo simulations).
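
A minimal plain-Python sketch of the grid argument above: with 10 bins per feature, the number of cells grows as 10^d, so a fixed sample covers a vanishing fraction of the space as dimensionality grows:

n_samples = 1000   # fixed budget of observations

for d in [1, 2, 3, 4, 6]:
    cells = 10 ** d                          # 10 bins per feature, d features
    coverage = min(1.0, n_samples / cells)   # best-case fraction of cells containing a sample
    print(d, "features:", cells, "cells, coverage <=", round(coverage, 4))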