Regression and Classification Flashcards
Accuracy
Correct / Total. Good when false positives and false negatives have similar consequences and classes are roughly balanced (e.g., manufacturing QC, weather, manufacturing stock prediction, school grades)
AUC
Area under the ROC curve - represents the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example. A coin flip scores 0.5; a perfect model scores 1.0. Good general way to choose between models for a balanced dataset.
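A minimal sketch of the pairwise-ranking interpretation of AUC (the fraction of positive/negative pairs ranked correctly); the scores and labels are illustrative assumptions.

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs where the positive scores higher."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos = scores[labels == 1]   # scores of positive examples
    neg = scores[labels == 0]   # scores of negative examples
    correct = (pos[:, None] > neg[None, :]).sum()   # correctly ranked pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count as half
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

# A perfect ranker scores 1.0; a random one hovers around 0.5.
print(auc_by_ranking([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```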
Batch Gradient Descent
Gradient descent algorithm that goes through the entire dataset to calculate the partial derivatives before each update of the weights and bias
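A minimal sketch of batch gradient descent for one-feature linear regression with a squared-error cost; the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Synthetic data: y ≈ 3x + 2 plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 3 * x + 2 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(2000):
    y_hat = w * x + b              # predictions over the ENTIRE training set
    error = y_hat - y
    # Partial derivatives of the cost J(w, b) = 0.5 * mean(error**2)
    dw = (error * x).mean()
    db = error.mean()
    w -= learning_rate * dw        # step opposite the gradient
    b -= learning_rate * db

print(w, b)                        # should land near 3 and 2
```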
Batch Size
The number of examples used in a single learning iteration (before updating weight and bias)
Binary classification
Type of classification that predicts one of two mutually exclusive classes - positive and negative
Classification
Learning models that attempt to predict one of a defined number of categories, which can be numeric or non-numeric (e.g., dog, cat, bird)
Class-Imbalanced
A dataset for a classification problem where the number of labels of each class differs significantly
Common causes of data quality and reliability issues
Omitted values, duplicate examples, bad feature values, bad labels, and bad sections of data
Confusion matrix
Used to summarize the performance of a classification algorithm. Typically a 2x2 table of actual vs predicted, with actual classes as rows and predicted classes as columns
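A minimal sketch, assuming labels encoded as 0 (negative) and 1 (positive), that builds the 2x2 table with actual classes as rows and predicted classes as columns; the label lists are illustrative.

```python
import numpy as np

def confusion_matrix_2x2(actual, predicted):
    """Rows = actual class, columns = predicted class (0 = negative, 1 = positive)."""
    cm = np.zeros((2, 2), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix_2x2(actual, predicted))
# [[TN FP]
#  [FN TP]]
```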
Cost function
The aggregation of the loss function across all training examples (typically the average). May include a penalty for model complexity.
Decision boundary
A surface or line that separates the different classes predicted by a classification algorithm. It marks the boundary of one class vs another.
Early stopping
Regularization method that ends training before the training loss has finished decreasing: training stops when validation loss starts to increase (i.e., when generalization performance worsens)
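A minimal sketch of an early-stopping loop with a "patience" counter; the train_one_epoch and validation_loss callables are hypothetical placeholders standing in for a real training step and validation-set evaluation.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0      # validation still improving
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # generalization stopped improving
    return best_loss

# Tiny demo with a scripted validation-loss curve (illustrative only).
losses = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.8, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(losses),
                                max_epochs=8, patience=3))  # 0.7
```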
epoch
A full training pass over the entire training set (every example processed once). One epoch corresponds to N / batch size iterations, where N is the number of training examples.
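A minimal sketch of the relationship between epochs, batch size, and iterations; the array stands in for a real training set, and the sizes are illustrative assumptions.

```python
import math
import numpy as np

X = np.arange(1000)                  # stand-in for 1,000 training examples
batch_size = 32
iterations_per_epoch = math.ceil(len(X) / batch_size)   # 32 updates per epoch

for epoch in range(3):                        # 3 full passes over the data
    order = np.random.permutation(len(X))     # reshuffle each epoch
    for i in range(0, len(X), batch_size):
        batch = X[order[i:i + batch_size]]    # one iteration = one batch
        # ... compute loss on `batch` and update weights and bias here ...

print(iterations_per_epoch)  # 32
```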
false positive rate
FP / (FP + TN) - the rate at which actual negatives are incorrectly predicted as positive. This is the x-axis of the ROC curve.
Feature
Individual model input that is an individual property or characteristic typically represented by columns (or “x” in a model formula)
Feature engineering
Using intuition to design new features by transforming or combining original features (e.g., combining depth and width into an “area” feature)
Feature scaling
Scaling the range of features to improve gradient descent performance, typically via mean normalization or z-score normalization. The goal is generally to bring each feature roughly into the range -1 to 1.
Gradient Descent
An optimization algorithm that minimizes a loss function by iteratively adjusting the parameters of a model using the partial derivatives of the loss, stepping in the direction of steepest descent.
Hyperparameters
The variables that you, or a hyperparameter tuning service, adjust between successive runs of training a model (e.g., learning rate). This is in contrast to parameters, such as weights and bias, which the model learns during training.
L1 loss
A loss function used for regression that calculates the absolute difference between the actual labels and the model's predictions. L1 loss is less sensitive to outliers than L2 loss.
L1 Regularization
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, effectively removing those features from the model.
L2 Loss
A loss function used for regression that calculates the squared difference between the actual labels and the model's predictions. L2 loss is more sensitive to outliers than L1 loss.
L2 Regularization
A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights closer to zero (but not exactly zero), so features with small weights remain in the model. L2 regularization always improves generalization in linear models.
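A minimal sketch tying together the four preceding cards: L1/L2 loss measure prediction error, while L1/L2 regularization penalize the weights themselves. The example arrays and lambda are illustrative assumptions.

```python
import numpy as np

y_true  = np.array([3.0, 5.0, 7.0])
y_pred  = np.array([2.5, 5.0, 9.0])    # the last prediction has a large (outlier) error
weights = np.array([0.0, 2.5, -0.01])  # illustrative model weights
lam = 0.1                              # regularization strength (lambda)

l1_loss = np.abs(y_true - y_pred).sum()    # sum of absolute errors
l2_loss = ((y_true - y_pred) ** 2).sum()   # sum of squared errors (MSE = the mean of these)

l1_penalty = lam * np.abs(weights).sum()   # pushes small weights to exactly 0
l2_penalty = lam * (weights ** 2).sum()    # shrinks large weights toward 0

print(l1_loss, l2_loss)          # 2.5 4.25 -- the 2.0 error dominates the L2 loss
print(l1_penalty, l2_penalty)
```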
Label
A label is the target output or value that a model is trained to predict. It represents the answer or ground truth for a single data point in supervised learning. Also known as target.
Learning Curve
Graph of cost (J(w,b)) vs number of iterations. Can be used to check that gradient descent is working correctly: J(w,b) should decrease on every iteration and eventually level off as it converges (it will not necessarily reach 0); if it increases, the learning rate is likely too high.
Learning Rate
A key hyperparameter that tells gradient descent how strongly to adjust the weights and bias on each iteration. If too low, training will take too long; if too high, training may fail to converge. Typically a number between 0 and 1.
Log Loss
The loss/cost function used in logistic regression. Penalizes based on the predicted probability (e.g., a confident prediction of the wrong class is penalized heavily).
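A minimal sketch of log loss for binary labels; the example labels and predicted probabilities are illustrative assumptions.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood; confident wrong predictions are punished hardest."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
print(log_loss(y, np.array([0.9, 0.1, 0.8])))   # small loss: confident and correct
print(log_loss(y, np.array([0.1, 0.9, 0.2])))   # large loss: confident and wrong
```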
Logistic Regression
Logistic regression is an algorithm used for binary classification tasks. It takes a linear combination of features and maps it to a probability between 0 and 1 using the sigmoid function.
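A minimal sketch of a logistic regression prediction: a linear combination of features passed through the sigmoid, then thresholded into a class. The weights, bias, and threshold are illustrative assumptions (in practice the parameters are learned and the threshold is tuned).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative "learned" parameters for two features.
w = np.array([1.5, -0.8])
b = -0.2

x = np.array([2.0, 1.0])                    # a single example
probability = sigmoid(w @ x + b)            # linear combination -> probability in (0, 1)
predicted_class = int(probability >= 0.5)   # apply a classification threshold

print(probability, predicted_class)         # ~0.88, class 1
```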
Loss function
During training, a function that calculates the loss for a single example. A loss function returns a lower loss for a model that makes better predictions vs a model that makes worse predictions. The purpose of training is typically to minimize the loss that a loss function returns across all examples.
Mean Squared Error
Average L2 loss per example (used for regression). Squared loss is another term for L2 loss.
Methods of imputation
The process of replacing missing values in a dataset so that the examples can still be used for modeling and analysis. Common methods include simple statistics (e.g., mean, median, mode) and fixed values (e.g., -1 or 0)
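A minimal sketch of simple and fixed-value imputation with pandas; the tiny DataFrame and its columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, np.nan]})

mean_imputed  = df.fillna(df.mean())   # simple imputation: replace with each column's mean
fixed_imputed = df.fillna(-1)          # fixed-value imputation: replace with a sentinel

print(mean_imputed)
print(fixed_imputed)
```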
Mini-batch
A small, randomly selected subset of the full batch processed in one iteration. It's more efficient to calculate loss on a mini-batch than on all the examples in the full batch.
Model formula
y-hat = f(x), where x is the feature (input), f is the model, and y-hat is the prediction
Noise
Random or irrelevant data that obscures the underlying pattern
Overfitting
A model that is too complex to generalize well (e.g., fitting a 4th-order polynomial to data from a 2nd-order relationship). Can be addressed with regularization. Also known as high variance (i.e., the model changes a lot with small changes in the training data). Solutions are 1) more data, 2) fewer features, 3) regularization.
Polynomial regression
Feature engineering that includes feature values raised to powers (e.g., including x, x^2, and x^3). Can be used to find non-linear relationships.
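A minimal sketch of polynomial regression as feature engineering: expand x into [x, x^2, x^3] and fit an ordinary linear model on the expanded features. The synthetic data and least-squares solver choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 0.5 * x**3 - x + 2 + rng.normal(0, 0.5, 200)   # non-linear ground truth

# Engineer polynomial features, then solve ordinary least squares on them.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # bias, x, x^2, x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coeffs)   # roughly [2, -1, 0, 0.5] -> bias, x, x^2, x^3
```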
Precision
TP / (TP + FP). Of the examples predicted positive, the fraction that are actually positive. Favor precision when false positives are especially costly (e.g., spam detection, court verdicts)
Quantity of data
Models should train on an order of magnitude (or two) more examples than the number of trainable parameters. Large datasets with a small number of features generally perform better.
Recall
True positive rate. TP / (TP + FN). Favor recall when missing a positive is especially costly (e.g., identifying cancer, fraud, security threats, churn)
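A minimal sketch tying together the precision, recall, and false positive rate cards, computed from raw confusion-matrix counts; the counts are illustrative assumptions.

```python
# Illustrative counts from a confusion matrix.
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)   # of everything flagged positive, how much really was
recall    = tp / (tp + fn)   # of all real positives, how many we caught (TPR)
fpr       = fp / (fp + tn)   # of all real negatives, how many we flagged anyway

print(precision, recall, fpr)   # 0.8, ~0.889, ~0.182
```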
Regression
Learning models that attempt to predict numeric values with infinitely many possible outcomes (continuous values)
Regularization
Any mechanism to reduce overfitting without removing features. Typically done with a modified cost function that includes a “regularization term” scaled by a penalty strength “lambda.” Can also be thought of as a penalty on model complexity. By convention the penalty applies only to the “w” parameters, not the “b” parameter. Popular forms are L1, L2, dropout, and early stopping.
relu
“Rectified linear unit” - An activation function that outputs 0 when the input is negative or 0, and outputs the input itself otherwise (e.g., -3 → 0, 3 → 3). Enables a neural network to learn non-linear relationships between features and the label.
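A minimal sketch of ReLU applied element-wise; the input array is illustrative.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: 0 for non-positive inputs, the input itself otherwise."""
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 3.0])))   # [0. 0. 3.]
```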
ROC
Stands for receiver operating characteristic. A visual representation of a model's performance across all classification thresholds, graphing the TPR (y-axis) vs the FPR (x-axis).
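A minimal sketch of how an ROC curve is built: sweep the classification threshold and record an (FPR, TPR) point at each step. The scores, labels, and threshold grid are illustrative assumptions.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])

points = []
for threshold in np.linspace(1.0, 0.0, 11):
    predicted = (scores >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (labels == 1))
    fp = np.sum((predicted == 1) & (labels == 0))
    fn = np.sum((predicted == 0) & (labels == 1))
    tn = np.sum((predicted == 0) & (labels == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)

print(points)   # one (x, y) point on the ROC curve per threshold
```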
Sigmoid
A mathematical function that “squishes” its output between two values, typically 0 and 1 or -1 and +1. Has several uses, including converting the raw output of logistic regression to a probability and acting as an activation function in some networks.
Supervised Learning
Learning from training data that includes both the inputs and the correct output labels.
Test set
A portion of the data reserved specifically for evaluating the performance of the trained model. Helpful for identifying underfitting or overfitting.
Threshold
The specific numeric cutoff applied to predicted probabilities to convert them into a class label (e.g., probability above 0.5 → positive class)
training set
The subset of the dataset used to train the model. Typically datasets are broken into three distinct subsets - training, testing, and validation. Ideally, any example should only belong to one of these three subsets.
True negative
An example where the model correctly predicts the negative class (e.g., an email is not spam)
True Positive
An example where the model correctly predicts the positive class (e.g., a spam email flagged as spam). The true positive rate, TP / (TP + FN), is the same as recall and is the y-axis of the ROC curve.
Unsupervised Learning
Finding something interesting in unlabeled data (e.g., clustering, dimensionality reduction, anomaly detection)
validation loss
The loss computed on the validation set; used to gauge how well the trained model generalizes. A large gap between training loss and validation loss can indicate overfitting.
validation set
A slice of the data used for hyperparameter tuning and model selection. Often used multiple times before the test set is used.
Ways to handle incomplete data (missing features)
Delete examples, impute missing values
Z-score normalization
Normalization such that all features have a mean of 0 and a standard deviation of 1: subtract each feature's mean from its values and divide by that feature's standard deviation.
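A minimal sketch of z-score normalization applied column-wise; the array (and its interpretation as height and weight) is an illustrative assumption.

```python
import numpy as np

X = np.array([[180.0, 70.0],
              [160.0, 60.0],
              [170.0, 80.0]])          # e.g., height (cm) and weight (kg)

mean = X.mean(axis=0)                  # per-feature mean
std = X.std(axis=0)                    # per-feature standard deviation
X_scaled = (X - mean) / std            # each feature now has mean 0, std 1

print(X_scaled.mean(axis=0))           # ~[0, 0]
print(X_scaled.std(axis=0))            # ~[1, 1]
```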
Underfit
A model that is not complex enough to fit the data well (e.g., a linear model for a quadratic relationship). Also known as high bias.