Regression and Classification Flashcards
Accuracy
Correct / Total. Good when false positives and false negatives have similar consequences and classes are roughly balanced (e.g., manufacturing QC, weather, manufacturing stock prediction, school grades)
AUC
Area under the ROC curve - represents the probability that a randomly chosen positive example will be ranked higher than a randomly chosen negative example. A coin flip scores 0.5; a perfect model scores 1.0. Good general way to choose between models for a balanced dataset.
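A minimal sketch of the pairwise-ranking interpretation of AUC (the fraction of positive/negative pairs ranked correctly); the scores and labels are illustrative assumptions.

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs where the positive scores higher."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos = scores[labels == 1]   # scores of positive examples
    neg = scores[labels == 0]   # scores of negative examples
    correct = (pos[:, None] > neg[None, :]).sum()   # correctly ranked pairs
    ties = (pos[:, None] == neg[None, :]).sum()     # ties count as half
    return (correct + 0.5 * ties) / (len(pos) * len(neg))

# A perfect ranker scores 1.0; a random one hovers around 0.5.
print(auc_by_ranking([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
```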
Batch Gradient Descent
Gradient descent algorithm that goes through the entire dataset to calculate the partial derivatives before each update of the weights and bias
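A minimal sketch of batch gradient descent for one-feature linear regression with a squared-error cost; the synthetic data, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Synthetic data: y ≈ 3x + 2 plus noise (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 3 * x + 2 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(2000):
    y_hat = w * x + b              # predictions over the ENTIRE training set
    error = y_hat - y
    # Partial derivatives of the cost J(w, b) = 0.5 * mean(error**2)
    dw = (error * x).mean()
    db = error.mean()
    w -= learning_rate * dw        # step opposite the gradient
    b -= learning_rate * db

print(w, b)                        # should land near 3 and 2
```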
Batch Size
The number of examples used in a single learning iteration (before updating weight and bias)
Binary classification
Type of classification that predicts one of two mutually exclusive classes - positive and negative
Classification
Learning models that attempt to predict one of a defined number of categories, which can be numeric or non-numeric (e.g., dog, cat, bird)
Class-Imbalanced
A dataset for a classification problem where the number of labels of each class differs significantly
Common causes of data quality and reliability issues
Omitted values, duplicate examples, bad feature values, bad labels, and bad sections of data
Confusion matrix
Used to summarize the performance of a classification algorithm. Typically a 2x2 table of actual vs predicted, with actual classes as rows and predicted classes as columns
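A minimal sketch, assuming labels encoded as 0 (negative) and 1 (positive), that builds the 2x2 table with actual classes as rows and predicted classes as columns; the label lists are illustrative.

```python
import numpy as np

def confusion_matrix_2x2(actual, predicted):
    """Rows = actual class, columns = predicted class (0 = negative, 1 = positive)."""
    cm = np.zeros((2, 2), dtype=int)
    for a, p in zip(actual, predicted):
        cm[a, p] += 1
    return cm

actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
print(confusion_matrix_2x2(actual, predicted))
# [[TN FP]
#  [FN TP]]
```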
Cost function
The aggregation of the loss function across all training examples (typically the average). May include a penalty for model complexity.
Decision boundary
A surface or line that separates the different classes predicted by a classification algorithm. It marks the boundary of one class vs another.
Early stopping
Regularization method that ends training before the training loss has finished decreasing: training stops when validation loss starts to increase (i.e., when generalization performance worsens)
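A minimal sketch of an early-stopping loop with a "patience" counter; the train_one_epoch and validation_loss callables are hypothetical placeholders standing in for a real training step and validation-set evaluation.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0      # validation still improving
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                           # generalization stopped improving
    return best_loss

# Tiny demo with a scripted validation-loss curve (illustrative only).
losses = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.8, 0.9])
print(train_with_early_stopping(lambda: None, lambda: next(losses),
                                max_epochs=8, patience=3))  # 0.7
```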
epoch
A full training pass over the entire training set (every example processed once). One epoch corresponds to N / batch size iterations, where N is the number of training examples.
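A minimal sketch of the relationship between epochs, batch size, and iterations; the array stands in for a real training set, and the sizes are illustrative assumptions.

```python
import math
import numpy as np

X = np.arange(1000)                  # stand-in for 1,000 training examples
batch_size = 32
iterations_per_epoch = math.ceil(len(X) / batch_size)   # 32 updates per epoch

for epoch in range(3):                        # 3 full passes over the data
    order = np.random.permutation(len(X))     # reshuffle each epoch
    for i in range(0, len(X), batch_size):
        batch = X[order[i:i + batch_size]]    # one iteration = one batch
        # ... compute loss on `batch` and update weights and bias here ...

print(iterations_per_epoch)  # 32
```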
false positive rate
FP / (FP + TN) - the rate at which actual negatives are incorrectly predicted as positive. This is the x-axis of the ROC curve.
Feature
Individual model input that is an individual property or characteristic typically represented by columns (or “x” in a model formula)
Feature engineering
Using intuition to design new features by transforming or combining original features (e.g., combining depth and width into an “area” feature)
Feature scaling
Scaling the range of features to improve gradient descent performance, typically via mean normalization or z-score normalization. The goal is generally to bring each feature roughly into the range -1 to 1.
Gradient Descent
An optimization algorithm that minimizes a loss function by iteratively adjusting the parameters of a model using the partial derivatives of the loss, stepping in the direction of steepest descent.
Hyperparameters
The variables that you, or a hyperparameter tuning service, adjust between successive runs of training a model (e.g., learning rate). This is in contrast to parameters, such as weights and bias, which the model learns during training.
L1 loss
A loss function used for regression that calculates the absolute difference between the actual labels and the model's predictions. L1 loss is less sensitive to outliers than L2 loss.
L1 Regularization
A type of regularization that penalizes weights in proportion to the sum of the absolute values of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0, effectively removing those features from the model.
L2 Loss
A loss function used for regression that calculates the squared difference between the actual labels and the model's predictions. L2 loss is more sensitive to outliers than L1 loss.
L2 Regularization
A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights closer to zero (but not exactly zero), so features with small weights remain in the model. L2 regularization always improves generalization in linear models.
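A minimal sketch tying together the four preceding cards: L1/L2 loss measure prediction error, while L1/L2 regularization penalize the weights themselves. The example arrays and lambda are illustrative assumptions.

```python
import numpy as np

y_true  = np.array([3.0, 5.0, 7.0])
y_pred  = np.array([2.5, 5.0, 9.0])    # the last prediction has a large (outlier) error
weights = np.array([0.0, 2.5, -0.01])  # illustrative model weights
lam = 0.1                              # regularization strength (lambda)

l1_loss = np.abs(y_true - y_pred).sum()    # sum of absolute errors
l2_loss = ((y_true - y_pred) ** 2).sum()   # sum of squared errors (MSE = the mean of these)

l1_penalty = lam * np.abs(weights).sum()   # pushes small weights to exactly 0
l2_penalty = lam * (weights ** 2).sum()    # shrinks large weights toward 0

print(l1_loss, l2_loss)          # 2.5 4.25 -- the 2.0 error dominates the L2 loss
print(l1_penalty, l2_penalty)
```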
Label
A label is the target output or value that a model is trained to predict. It represents the answer or ground truth for a single data point in supervised learning. Also known as target.
Learning Curve
Graph of cost (J(w,b)) vs number of iterations. Can be used to check that gradient descent is working correctly: J(w,b) should decrease on every iteration and eventually level off as it converges (it will not necessarily reach 0); if it increases, the learning rate is likely too high.
Learning Rate
A key hyperparameter that tells gradient descent how strongly to adjust the weights and bias on each iteration. If too low, training will take too long; if too high, training may fail to converge. Typically a number between 0 and 1.
Log Loss
The loss/cost function used in logistic regression. Penalizes based on the predicted probability (e.g., a confident prediction of the wrong class is penalized heavily).
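A minimal sketch of log loss for binary labels; the example labels and predicted probabilities are illustrative assumptions.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood; confident wrong predictions are punished hardest."""
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
print(log_loss(y, np.array([0.9, 0.1, 0.8])))   # small loss: confident and correct
print(log_loss(y, np.array([0.1, 0.9, 0.2])))   # large loss: confident and wrong
```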
Logistic Regression
Logistic regression is an algorithm used for binary classification tasks. It takes a linear combination of features and maps it to a probability between 0 and 1 using the sigmoid function.
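A minimal sketch of a logistic regression prediction: a linear combination of features passed through the sigmoid, then thresholded into a class. The weights, bias, and threshold are illustrative assumptions (in practice the parameters are learned and the threshold is tuned).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative "learned" parameters for two features.
w = np.array([1.5, -0.8])
b = -0.2

x = np.array([2.0, 1.0])                    # a single example
probability = sigmoid(w @ x + b)            # linear combination -> probability in (0, 1)
predicted_class = int(probability >= 0.5)   # apply a classification threshold

print(probability, predicted_class)         # ~0.88, class 1
```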
Loss function
During training, a function that calculates the loss for a single example. A loss function returns a lower loss for a model that makes better predictions vs a model that makes worse predictions. The purpose of training is typically to minimize the loss that a loss function returns across all examples.
Mean Squared Error
Average L2 loss per example (used for regression). Squared loss is another term for L2 loss.
Methods of imputation
The process of replacing missing values in a dataset so that the examples can still be used for modeling and analysis. Common methods include simple statistics (e.g., mean, median, mode) and fixed values (e.g., -1 or 0)
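A minimal sketch of simple and fixed-value imputation with pandas; the tiny DataFrame and its columns are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50_000, 62_000, np.nan, np.nan]})

mean_imputed  = df.fillna(df.mean())   # simple imputation: replace with each column's mean
fixed_imputed = df.fillna(-1)          # fixed-value imputation: replace with a sentinel

print(mean_imputed)
print(fixed_imputed)
```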
Mini-batch
A small, randomly selected subset of the full batch processed in one iteration. It's more efficient to calculate loss on a mini-batch than on all the examples in the full batch.
Model formula
y-hat = f(x), where x is the feature (input), f is the model, and y-hat is the prediction
Noise
Random or irrelevant data that obscures the underlying pattern
Overfitting
A model that is too complex to generalize well (e.g., fitting a 4th-order polynomial to data from a 2nd-order relationship). Can be addressed with regularization. Also known as high variance (i.e., the model changes a lot with small changes in the training data). Solutions are 1) more data, 2) fewer features, 3) regularization.
Polynomial regression
Feature engineering that includes feature values raised to powers (e.g., including x, x^2, and x^3). Can be used to find non-linear relationships.
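A minimal sketch of polynomial regression as feature engineering: expand x into [x, x^2, x^3] and fit an ordinary linear model on the expanded features. The synthetic data and least-squares solver choice are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 0.5 * x**3 - x + 2 + rng.normal(0, 0.5, 200)   # non-linear ground truth

# Engineer polynomial features, then solve ordinary least squares on them.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # bias, x, x^2, x^3
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

print(coeffs)   # roughly [2, -1, 0, 0.5] -> bias, x, x^2, x^3
```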
Precision
TP / (TP + FP). Of the examples predicted positive, the fraction that are actually positive. Favor precision when false positives are especially costly (e.g., spam detection, court verdicts)
Quantity of data
Models should train on an order of magnitude (or two) more examples than the number of trainable parameters. Large datasets with a small number of features generally perform better.
Recall
True positive rate. TP / (TP + FN). Favor recall when missing a positive is especially costly (e.g., identifying cancer, fraud, security threats, churn)
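A minimal sketch tying together the precision, recall, and false positive rate cards, computed from raw confusion-matrix counts; the counts are illustrative assumptions.

```python
# Illustrative counts from a confusion matrix.
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)   # of everything flagged positive, how much really was
recall    = tp / (tp + fn)   # of all real positives, how many we caught (TPR)
fpr       = fp / (fp + tn)   # of all real negatives, how many we flagged anyway

print(precision, recall, fpr)   # 0.8, ~0.889, ~0.182
```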
Regression
Learning models that attempt to predict numeric values with infinitely many possible outcomes (continuous values)
Regularization
Any mechanism to reduce overfitting without removing features. Typically done with a modified cost function that includes a “regularization term” scaled by a penalty strength “lambda.” Can also be thought of as a penalty on model complexity. By convention the penalty applies only to the “w” parameters, not the “b” parameter. Popular forms are L1, L2, dropout, and early stopping.
relu
“Rectified linear unit” - An activation function that outputs 0 when the input is negative or 0, and outputs the input itself otherwise (e.g., -3 → 0, 3 → 3). Enables a neural network to learn non-linear relationships between features and the label.
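A minimal sketch of ReLU applied element-wise; the input array is illustrative.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: 0 for non-positive inputs, the input itself otherwise."""
    return np.maximum(0, x)

print(relu(np.array([-3.0, 0.0, 3.0])))   # [0. 0. 3.]
```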
ROC
Stands for receiver operating characteristic. A visual representation of a model's performance across all classification thresholds, graphing the TPR (y-axis) vs the FPR (x-axis).
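A minimal sketch of how an ROC curve is built: sweep the classification threshold and record an (FPR, TPR) point at each step. The scores, labels, and threshold grid are illustrative assumptions.

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   0,   1,   0,   0])

points = []
for threshold in np.linspace(1.0, 0.0, 11):
    predicted = (scores >= threshold).astype(int)
    tp = np.sum((predicted == 1) & (labels == 1))
    fp = np.sum((predicted == 1) & (labels == 0))
    fn = np.sum((predicted == 0) & (labels == 1))
    tn = np.sum((predicted == 0) & (labels == 0))
    points.append((fp / (fp + tn), tp / (tp + fn)))   # (FPR, TPR)

print(points)   # one (x, y) point on the ROC curve per threshold
```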
Sigmoid
A mathematical function that “squishes” its output between two values, typically 0 and 1 or -1 and +1. Has several uses, including converting the raw output of logistic regression to a probability and acting as an activation function in some networks.
Supervised Learning
Learning from training data that includes both the inputs and the correct output labels.
Test set
A portion of the data reserved specifically for evaluating the performance of the trained model. Helpful for identifying underfitting or overfitting.
Threshold
The specific numeric cutoff applied to predicted probabilities to convert them into a class label (e.g., probability above 0.5 → positive class)
training set
The subset of the dataset used to train the model. Typically datasets are broken into three distinct subsets - training, testing, and validation. Ideally, any example should only belong to one of these three subsets.
True negative
An example where the model correctly predicts the negative class (e.g., an email is not spam)
True Positive
An example where the model correctly predicts the positive class (e.g., a spam email flagged as spam). The true positive rate, TP / (TP + FN), is the same as recall and is the y-axis of the ROC curve.
Unsupervised Learning
Finding something interesting in unlabeled data (e.g., clustering, dimensionality reduction, anomaly detection)
validation loss
The loss computed on the validation set; used to gauge how well the trained model generalizes. A large gap between training loss and validation loss can indicate overfitting.
validation set
A slice of the data used for hyperparameter tuning and model selection. Often used multiple times before the test set is used.
Ways to handle incomplete data (missing features)
Delete examples, impute missing values
Z-score normalization
Normalization such that all features have a mean of 0 and a standard deviation of 1: subtract each feature's mean from its values and divide by that feature's standard deviation.
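A minimal sketch of z-score normalization applied column-wise; the array (and its interpretation as height and weight) is an illustrative assumption.

```python
import numpy as np

X = np.array([[180.0, 70.0],
              [160.0, 60.0],
              [170.0, 80.0]])          # e.g., height (cm) and weight (kg)

mean = X.mean(axis=0)                  # per-feature mean
std = X.std(axis=0)                    # per-feature standard deviation
X_scaled = (X - mean) / std            # each feature now has mean 0, std 1

print(X_scaled.mean(axis=0))           # ~[0, 0]
print(X_scaled.std(axis=0))            # ~[1, 1]
```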
Underfit
A model that is not complex enough to fit the data well (e.g., a linear model for a quadratic relationship). Also known as high bias.