Regression and Classification Flashcards

1
Q

Accuracy

A

Correct / Total. Good when a false positive and a false negative carry roughly equal consequences (e.g., manufacturing QC, weather, manufacturing stock prediction, school grades)

2
Q

AUC

A

Area under the ROC curve - represents the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. A coin flip scores 0.5; a perfect model scores 1.0. Good general way to choose between models on a balanced dataset.
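
A minimal sketch of computing AUC with scikit-learn; the labels and scores below are made-up illustration data:

```python
from sklearn.metrics import roc_auc_score

# Made-up ground-truth labels and model scores (illustration only)
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# AUC: probability a random positive outranks a random negative (~0.89 here)
print(roc_auc_score(y_true, y_score))
```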

3
Q

Batch Gradient Descent

A

Gradient descent variant that goes through the entire dataset to calculate the partial derivatives before each update of the weights and bias

4
Q

Batch Size

A

The number of examples used in a single learning iteration (before updating weight and bias)

5
Q

Binary classification

A

Type of classification that predicts one of two mutually exclusive classes - positive and negative

6
Q

Classification

A

Learning models that attempt to predict one of a defined number of categories, which can be numeric or non-numeric (e.g., dog, cat, bird)

7
Q

Class-Imbalanced

A

A dataset for a classification problem where the number of labels of each class differs significantly

8
Q

Common causes of data quality and reliability issues

A

Omitted values, duplicate examples, bad feature values, bad labels, and bad sections of data

9
Q

Confusion matrix

A

Used to summarize the performance of a classification algorithm. Typically a 2x2 grid of actual vs. predicted classes, with actual as rows and predicted as columns
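
A quick sketch with scikit-learn's confusion_matrix; the labels are made up, and note that scikit-learn orders the 2x2 as [[TN, FP], [FN, TP]] for 0/1 labels:

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels (illustration only)
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_actual, y_pred))
```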

10
Q

Cost function

A

An aggregation of the loss function across all training examples. May include some sort of penalty for model complexity.

11
Q

Decision boundary

A

A surface or line that separates the classes predicted by a classification algorithm. It marks the boundary of one class vs. another.

12
Q

Early stopping

A

Regularization method that ends training before the training loss stops decreasing - typically when validation loss starts to increase (i.e., generalization performance worsens)
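
A minimal sketch of the early-stopping logic, driven by a made-up sequence of per-epoch validation losses rather than a real training loop:

```python
# Made-up validation losses per epoch (illustration only)
val_losses = [0.90, 0.75, 0.64, 0.60, 0.61, 0.63, 0.66, 0.70]

best_val, patience, bad_epochs = float("inf"), 2, 0
for epoch, val in enumerate(val_losses):
    if val < best_val:
        best_val, bad_epochs = val, 0  # validation loss improved
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"stopping early at epoch {epoch}, best loss {best_val}")
            break
```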

13
Q

epoch

A

A full training pass over the entire training set (every example processed once). One epoch = N / batch size iterations, where N is the number of training examples (e.g., 1,000 examples with a batch size of 100 means 10 iterations per epoch)

14
Q

false positive rate

A

FP / (FP + TN) - the rate at which actual negatives are incorrectly classified as positive. This is the x-axis of the ROC curve.

15
Q

Feature

A

Individual model input that is an individual property or characteristic typically represented by columns (or “x” in a model formula)

16
Q

Feature engineering

A

Using intuition to design new features by transforming or combining original features (e.g., using depth and width to define “area”)

17
Q

Feature scaling

A

Scaling the range of features to improve gradient descent performance - typically done via mean normalization or z-score normalization. The goal is generally to get each feature roughly within -1 < x < 1

18
Q

Gradient Descent

A

An optimization algorithm that minimizes a loss function by iteratively adjusting the parameters of a model in the opposite direction of the partial derivatives, which point in the direction of the loss function's steepest increase.
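
A minimal numpy sketch of gradient descent for a one-feature linear model with a squared-error cost; the data and learning rate are illustrative assumptions:

```python
import numpy as np

# Made-up data: roughly y = 2x + 1 plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

w, b, lr = 0.0, 0.0, 0.05  # parameters and learning rate

for _ in range(1000):
    y_hat = w * x + b
    # Partial derivatives of mean squared error wrt w and b
    dw = np.mean(2 * (y_hat - y) * x)
    db = np.mean(2 * (y_hat - y))
    # Step opposite the gradient (direction of steepest descent)
    w -= lr * dw
    b -= lr * db

print(w, b)  # should approach roughly 2 and 1
```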

19
Q

Hyperparameters

A

The variables that you or a hyperparameter tuning service adjust between successive runs of training a model (e.g., learning rate). This is in contrast to parameters, like weights and bias, that the model learns during training.

20
Q

L1 loss

A

A loss function used for regression that calculates the absolute difference between the actual labels and what a model predicts. L1 loss is less sensitive to outliers than L2 loss.

21
Q

L1 Regularization

A

A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or almost irrelevant features to exactly 0 - removing them from the model.

22
Q

L2 Loss

A

A loss function used for regression that calculates the square of the difference between the actual labels and what a model predicts. L2 loss is more sensitive to outliers than L1 loss.
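
A small numpy sketch contrasting the L1 loss from the previous card with the L2 loss above, plus the average L2 loss per example (mean squared error); all numbers are made up:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 6.0])

l1 = np.abs(y_true - y_pred).sum()     # sum of absolute differences
l2 = ((y_true - y_pred) ** 2).sum()    # sum of squared differences
mse = ((y_true - y_pred) ** 2).mean()  # average L2 loss per example

print(l1, l2, mse)
```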

23
Q

L2 Regularization

A

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights closer to zero (but not exactly zero), so features with low weights stay in the model. L2 regularization always improves generalization in linear models.

24
Q

Label

A

A label is the target output or value that a model is trained to predict. It represents the answer or ground truth for a single data point in supervised learning. Also known as target.

25
Q

Learning Curve

A

Graph of cost J(w,b) vs. the number of iterations. Can be used to verify that gradient descent is working correctly: J(w,b) should decrease on every iteration and eventually level off (converge) if everything is working right.

26
Q

Learning Rate

A

A key hyperparameter that tells gradient descent how strongly to adjust the weights and bias on each iteration. If too low, training will take too long; if too high, it will have trouble reaching convergence. Typically a number between 0 and 1.

27
Q

Log Loss

A

Loss/cost function used in logistic regression. Penalizes based on the predicted probability (e.g., heavily penalizes a confident prediction of the wrong class).
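
A minimal numpy sketch of binary log loss, assuming the standard binary cross-entropy form; the labels and probabilities are made up:

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    # Clip to avoid log(0); standard binary cross-entropy
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
print(log_loss(y, np.array([0.9, 0.1, 0.8])))  # low loss: confident and right
print(log_loss(y, np.array([0.1, 0.9, 0.2])))  # high loss: confident and wrong
```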

28
Q

Logistic Regression

A

Logistic regression is an algorithm commonly used for binary classification tasks. It takes a linear combination of the features and maps it to a probability between 0 and 1 using the sigmoid function.
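
A minimal scikit-learn sketch; the single-feature training data is made up for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Made-up single-feature training data (illustration only)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[3.5]]))  # sigmoid output: P(class 0), P(class 1)
print(model.predict([[3.5]]))        # class after applying the 0.5 threshold
```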

29
Q

Loss function

A

During training, a function that calculates the loss for a single example. A loss function returns a lower loss for a model that makes better predictions vs a model that makes worse predictions. The purpose of training is typically to minimize the loss that a loss function returns across all examples.

30
Q

Mean Squared Error

A

Average L2 loss per example (used for regression). Squared loss is another term for L2 loss.

31
Q

Methods of imputation

A

The process of replacing missing values in a dataset so that the examples can still be used for modeling and analysis. Common methods include simple statistics (e.g., mean, mode, median) and fixed values (e.g., -1, 0)
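
A small pandas sketch of these imputation methods; the column name and values are made up:

```python
import pandas as pd

# Made-up dataset with a missing feature value (illustration only)
df = pd.DataFrame({"age": [25, None, 40, 33]})

df["age_mean"]   = df["age"].fillna(df["age"].mean())    # simple: mean
df["age_median"] = df["age"].fillna(df["age"].median())  # simple: median
df["age_fixed"]  = df["age"].fillna(-1)                  # fixed value
print(df)
```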

32
Q

Mini-batch

A

A small, randomly selected subset of a batch processed in one iteration. It's more efficient to calculate loss on a mini-batch than on all the examples in the full batch.

33
Q

Model formula

A

f(x) = y-hat, where x is the feature, f is the model, and y-hat is the prediction

34
Q

Noise

A

Random or irrelevant data that obscures the underlying pattern

35
Q

Overfitting

A

A model that is too complex to generalize well (e.g., a 4th-order polynomial fit to data from a 2nd-order process). Also known as high variance (i.e., the model changes a lot with a small change in the training data). Solutions: 1) more data, 2) fewer features, 3) regularization.

36
Q

Polynomial regression

A

Feature engineering that includes feature values raised to powers (e.g., including x, x^2, and x^3). Can be used to find non-linear relationships.
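
A minimal scikit-learn sketch, assuming made-up data with a roughly quadratic relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data where y is roughly x^2 (illustration only)
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 3.9, 9.1, 15.8, 25.3])

# Expand x into [x, x^2] so a linear model can capture the curve
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print(model.predict(poly.transform([[6.0]])))  # roughly 36
```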

37
Q

Precision

A

TP / (TP + FP) - if the model predicts positive, the likelihood that it's actually positive. Prioritize when false positives are especially painful (e.g., spam detection, court verdicts)

38
Q

Quantity of data

A

A model should train on at least an order of magnitude (or two) more examples than it has trainable parameters. Large datasets with small numbers of features generally perform better.

39
Q

Recall

A

True positive rate: TP / (TP + FN). Prioritize when detecting positives matters most and misses are costly (e.g., identifying cancer, fraud, security threats, churn)
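
A small scikit-learn sketch computing the recall formula above alongside precision from the Precision card; the labels are made up:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up actual and predicted labels (illustration only)
y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_actual, y_pred))  # TP / (TP + FP) = 3/4
print(recall_score(y_actual, y_pred))     # TP / (TP + FN) = 3/4
```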

40
Q

Regression

A

Learning models that attempt to predict numbers from an infinite range of possible outcomes

41
Q

Regularization

A

Any mechanism that reduces overfitting without removing features. Typically done with a modified cost function that adds a “regularization term” scaled by a penalty strength “lambda.” Can also be thought of as a penalty on model complexity. By convention applied only to the “w” parameters, not the “b” parameter. Popular methods are L1, L2, dropout, and early stopping.

42
Q

relu

A

“Rectified linear unit” - an activation function with the following behavior: if the input is negative or 0, the output is 0; otherwise the output equals the input (e.g., relu(-3) = 0, relu(3) = 3). Enables a neural network to learn non-linear relationships between the features and the label.
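
A one-function numpy sketch of relu on made-up inputs:

```python
import numpy as np

def relu(x):
    # Negative or zero inputs -> 0; positive inputs pass through unchanged
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 3.0])))  # [0. 0. 3.]
```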

43
Q

ROC

A

Visual representation of a model's performance at all classification thresholds. Graphs the TPR vs. FPR at every threshold. Stands for receiver operating characteristic.

44
Q

Sigmoid

A

A mathematical function that “squishes” the output between two values - typically 0 and 1, or -1 and +1. Has several uses, including converting the raw output of logistic regression to a probability and acting as an activation function in some networks.
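
A minimal numpy sketch of the sigmoid that squishes values into (0, 1), plus a 0.5 threshold converting the probabilities to class labels; the inputs are made up:

```python
import numpy as np

def sigmoid(z):
    # Squishes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

z = np.array([-4.0, 0.0, 4.0])
probs = sigmoid(z)
print(probs)         # ~[0.018, 0.5, 0.982]
print(probs >= 0.5)  # a 0.5 threshold converts probabilities to classes
```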

45
Q

Supervised Learning

A

Learning where the training data includes the correct answers (inputs paired with output labels).

46
Q

Test set

A

A portion of the data reserved specifically for evaluating the performance of the trained model. Helpful for identifying under- or overfitting.

47
Q

Threshold

A

The specific numeric cutoff applied to predicted probabilities to convert them to a class label (e.g., over 50% -> class A)

48
Q

training set

A

The subset of the dataset used to train the model. Typically datasets are broken into three distinct subsets - training, testing, and validation. Ideally, any example should only belong to one of these three subsets.

49
Q

True negative

A

An example where the model correctly predicts the negative class (e.g., an email is not spam)

50
Q

True Positive

A

An example where the model correctly predicts the positive class (e.g., an email is correctly flagged as spam). The true positive rate, TP / (TP + FN), is the same as recall.

51
Q

Unsupervised Learning

A

Find something interesting in unlabeled data (e.g., clustering, dimensionality reduction, anomaly detection)

52
Q

validation loss

A

A metric that measures how well the model performs on the validation set. A gap between training loss and validation loss can indicate overfitting.

53
Q

validation set

A

A slice of the data used for hyperparameter tuning and model selection. Often used multiple times before the test set is used.

54
Q

Ways to handle incomplete data (missing features)

A

Delete examples, impute missing values

55
Q

Z-score normalization

A

Normalization such that all features have a mean of 0 and a standard deviation of 1. The mean of all values for a feature is subtracted from each feature value, and the result is divided by the standard deviation of that feature.
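
A small numpy sketch of z-score normalization on made-up feature values:

```python
import numpy as np

# Made-up raw feature values (illustration only)
feature = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

z = (feature - feature.mean()) / feature.std()
print(z)                  # scaled values
print(z.mean(), z.std())  # ~0.0 and 1.0
```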

56
Q

Underfit

A

A model that is not complex enough to capture the underlying pattern and generalize well (e.g., a linear model for a quadratic function). Also known as high bias.