General Flashcards
Stochastic Gradient Descent
Gradient descent algorithm that updates the parameters using a single observation at a time.
More efficient than batch gradient descent, especially with large datasets.
Batch Gradient Descent
Gradient descent algorithm that must scan the entire training set before taking a single step.
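The difference between the two updates can be sketched as follows; the toy data (true weight 3.0), learning rate, and epoch count are illustrative assumptions:

```python
import random

random.seed(0)  # reproducibility of the shuffles

# Toy data for y = 3.0 * x (noiseless, for illustration)
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]
lr = 0.05  # learning rate

def batch_step(w):
    # Batch GD: average the gradient over the ENTIRE training set per step
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def sgd_epoch(w):
    # SGD: update the parameter after EACH single observation
    random.shuffle(data)
    for x, y in data:
        w -= lr * 2 * (w * x - y) * x
    return w

w_batch = w_sgd = 0.0
for _ in range(200):
    w_batch = batch_step(w_batch)
    w_sgd = sgd_epoch(w_sgd)
# Both converge to w = 3.0; SGD takes 5x more (cheaper) steps per pass
```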
Localized Linear Regression
A variant of traditional linear regression (also called locally weighted linear regression) that fits each prediction using only the data points local to x_i, weighting nearby points most heavily, to predict y_i.
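A minimal 1-D sketch of the idea, using a Gaussian kernel to weight nearby points; the bandwidth tau and the toy data (y = x^2, which is locally near-linear) are illustrative assumptions:

```python
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]  # y = x^2

def predict(x0, tau=0.5):
    # Gaussian kernel: points near x0 get weight ~1, far points ~0
    w = [math.exp(-(x - x0) ** 2 / (2 * tau ** 2)) for x in xs]
    sw = sum(w)
    # Weighted least-squares fit of a line around x0 (closed form in 1-D)
    xbar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    num = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, xs, ys))
    den = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))
    slope = num / den
    return ybar + slope * (x0 - xbar)
```

A new weighted fit is computed for every query point, which is why this is more expensive at prediction time than ordinary linear regression.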
Type I Error (False Positive)
Incorrectly rejecting the null hypothesis in favor of the alternative hypothesis when the null is true.
Same as alpha, set at the beginning of the experiment.
Type II Error (False Negative)
Failing to reject the null hypothesis when it is false
Also known as beta. Note that power is (1 - beta)
A/B Testing
A/B testing, also known as split or bucket testing, is a user experience research method that compares two or more versions of content to determine which one performs best
A/B testing involves randomly assigning visitors to see either a control (A) version or a variant (B) version of a page or content. The performance of each version is then measured based on key metrics, such as the number of conversions or visitors who took the desired action.
SVMs : General description
Simple: Machine learning model that uses a hyperplane to differentiate and classify different groups of data
Detailed: SVM identifies an appropriate hyperplane by maximizing the margin between the boundary and the closest points of each class (the support vectors).
If the data cannot be separated linearly, use transformations (e.g. the kernel trick) to map the data into a higher-dimensional space.
SVMs: Soft Margin Classification
A mechanism that serves to reduce the overfitting of maximum margin classification by penalizing misclassifications
Bias / Variance Tradeoff
A tradeoff in machine learning models where you have the choice of reducing bias (systematic error from the model being too simple to fit the data) vs. reducing variance (how much the model's predictions vary across different training sets).
Precision
TP / (TP + FP)
Measures the accuracy of positive predictions (but not necessarily identifying all of them).
Recall
TP / (TP + FN)
Measures completeness of positive predictions
F1 Score
2 / ((1/Precision) + (1/Recall))
Harmonic mean of precision and recall, ranging between 0 and 100%
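The three metrics above can be computed from raw counts; the counts below are illustrative assumptions:

```python
# Counts from a hypothetical classifier's predictions
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)             # 8 / 10 = 0.8
recall = tp / (tp + fn)                # 8 / 12 ≈ 0.667
f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of the two
```

Note the harmonic mean is dragged down by whichever of precision or recall is lower, so a high F1 requires both to be high.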
ROC Curve
Plots true positive rate (recall) against false positive rate. A good ROC curve goes toward the top left of the chart.
X = FPR
Y = TPR
False Positive Rate
(1 - Specificity)
Proportion of negative instances that are incorrectly classified as positive (i.e. false positive)
FP / (FP + TN)
Lasso Regression
Linear regression with an L1 penalty on the coefficient magnitudes; can shrink some coefficients exactly to zero, effectively performing feature selection.
Elastic Net
Linear regression that combines the L1 (lasso) and L2 (ridge) penalties, with a mix ratio controlling the balance between them.
Early Stopping
A way of regularizing a model by stopping training once validation error reaches a minimum
Softmax Regression
Also known as multinomial logistic regression.
Classification with multiple classes. For each instance x, assigns a score s_k(x) for each class k, then estimates probabilities by applying the softmax function.
The softmax function is as follows:
p_k = exp(s_k(x)) / Σ_j exp(s_j(x))
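The softmax function can be sketched directly from the formula; subtracting the maximum score before exponentiating is a standard numerical-stability trick:

```python
import math

def softmax(scores):
    # Shift by max(scores) to avoid overflow; result is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]  # probabilities summing to 1
```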
Cross-entropy
Loss function used to measure difference between predicted and true probability distributions. Penalizes low probability on true labels significantly.
-1/m ∑i ∑k y_k^(i) log(p_k^(i))
Essentially the mean of -log(estimated probability of the true class).
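A minimal sketch of mean cross-entropy over m instances; the one-hot labels y and predicted probabilities p are illustrative assumptions:

```python
import math

def cross_entropy(y_true, y_pred):
    # y_true: one-hot rows, y_pred: predicted probability rows
    m = len(y_true)
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        # Only the true class (y = 1) contributes: -log(p_true)
        total += sum(-y * math.log(p) for y, p in zip(y_row, p_row) if y)
    return total / m
```

Because -log(p) blows up as p approaches 0, assigning low probability to a true label is penalized heavily.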
Accuracy
Number of correctly classified instances / number of all classified instances
True Positive Rate
(Sensitivity)
Proportion of positive instances that are correctly classified as positive (i.e. true positive)
TP / (TP + FN)
Specificity
True Negative Rate
Proportion of negative instances that are correctly classified as negative (i.e. true negative)
TN / (TN + FP)
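The rates above all come from the same confusion matrix; the counts below are illustrative assumptions:

```python
# Hypothetical confusion-matrix counts
tp, fn, fp, tn = 40, 10, 5, 45

tpr = tp / (tp + fn)                         # sensitivity / recall = 0.8
specificity = tn / (tn + fp)                 # true negative rate = 0.9
fpr = fp / (fp + tn)                         # = 1 - specificity = 0.1
accuracy = (tp + tn) / (tp + tn + fp + fn)   # 85 / 100 = 0.85
```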
Gradient Descent
An algorithm that minimizes a particular function (in ML the loss function) by taking small steps in the direction of the steepest descent for that function.
Step 1: Take the derivative of the loss function for each parameter (i.e. take the gradient of the loss function).
Step 2: Initialize parameters with random values
Step 3: Plug parameters into the partial derivatives (gradient)
Step 4: Calculate step sizes (Calculated slope from step 3 * learning rate)
Step 5: Calculate the new parameters (New = Old - Step Size)
Step 6: Repeat 4-5 until convergence
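The six steps above can be sketched for fitting y = b + w*x with squared loss; the toy data, learning rate, and fixed iteration count (in place of a convergence test) are illustrative assumptions:

```python
# Toy data with true intercept b = 3, slope w = 2
data = [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]
b, w = 0.0, 0.0   # Step 2: initialize the parameters
lr = 0.05         # learning rate

for _ in range(5000):  # Step 6: repeat until (effectively) converged
    # Steps 1 + 3: partial derivatives of MSE, evaluated at current params
    db = sum(2 * (b + w * x - y) for x, y in data) / len(data)
    dw = sum(2 * (b + w * x - y) * x for x, y in data) / len(data)
    # Step 4: step size = slope * learning rate
    step_b, step_w = lr * db, lr * dw
    # Step 5: new parameter = old parameter - step size
    b, w = b - step_b, w - step_w
```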
Steps for K-fold Cross Validation
Step 1: Shuffle data into equally sized blocks (folds)
Step 2: For each fold i, train the model on all data except fold i, then evaluate the validation error on the held-out fold i.
Step 3: Average the validation errors from step 2 to get estimate of the true error.
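The three steps can be sketched as follows; the "model" here is just predicting the training mean, an illustrative stand-in for a real learner:

```python
import random

random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(20)]
k = 5

random.shuffle(data)                    # Step 1: shuffle...
folds = [data[i::k] for i in range(k)]  # ...and split into k equal folds

errors = []
for i in range(k):                      # Step 2: hold out fold i
    train = [x for j, f in enumerate(folds) for x in f if j != i]
    test = folds[i]
    pred = sum(train) / len(train)      # "train" the mean-predictor model
    errors.append(sum((x - pred) ** 2 for x in test) / len(test))

cv_error = sum(errors) / k              # Step 3: average the fold errors
```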
Bootstrapping
Drawing observations from a large data sample repeatedly (sampling with replacement) and then estimating some quantity of a population by averaging estimates from multiple smaller samples.
Useful for small data sets and helping to deal with class imbalance.
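A minimal sketch of bootstrapping the mean; the sample size and number of resamples are illustrative assumptions:

```python
import random

random.seed(0)
sample = [random.gauss(50.0, 5.0) for _ in range(30)]

boot_means = []
for _ in range(1000):
    # Resample WITH replacement, same size as the original sample
    resample = [random.choice(sample) for _ in sample]
    boot_means.append(sum(resample) / len(resample))

# Bootstrap estimate of the population mean
estimate = sum(boot_means) / len(boot_means)
```

The spread of boot_means also gives a cheap estimate of the statistic's standard error without distributional assumptions.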
Hyperparameter tuning:
Grid search
Forming a grid from the Cartesian product of all candidate parameter values, then sequentially trying every combination and keeping the one that yields the best results.
Hyperparameter tuning:
Random Search
Randomly sample from the joint distribution of all parameters.
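Both strategies can be sketched side by side; the score function (peaking at lr=0.1, depth=5) and candidate ranges are illustrative assumptions:

```python
import itertools
import random

# Hypothetical validation score to maximize (best at lr=0.1, depth=5)
def score(lr, depth):
    return -((lr - 0.1) ** 2) - 0.01 * (depth - 5) ** 2

lrs = [0.01, 0.1, 1.0]
depths = [3, 5, 7]

# Grid search: try the full Cartesian product of candidate values
grid_best = max(itertools.product(lrs, depths), key=lambda p: score(*p))

# Random search: sample each parameter independently
random.seed(0)
trials = [(random.uniform(0.01, 1.0), random.randint(3, 7))
          for _ in range(20)]
rand_best = max(trials, key=lambda p: score(*p))
```

Random search often wins when only a few parameters matter, since it does not waste trials repeating the same value of an unimportant parameter.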
ROC Curve
AUC
Plots the true positive rate (y) against the false positive rate (x) for various thresholds.
Area under the curve (AUC) measures how well the classifier separates classes.
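Tracing the curve and computing AUC can be sketched by sweeping the decision threshold; the scores and labels are illustrative assumptions:

```python
# Hypothetical classifier scores and true binary labels
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
pos = sum(labels)
neg = len(labels) - pos

# Sweep the threshold over every score to trace (FPR, TPR) points
points = [(0.0, 0.0)]
for t in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
    points.append((fp / neg, tp / pos))

# AUC: area under the piecewise-linear curve (trapezoid rule)
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

An AUC of 1.0 means perfect separation; 0.5 is no better than random guessing.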
Conditional Probability
P(A | B)
P(A ∩ B) / P(B)
Bayes Theorem
P(A|B) = P(B|A) P(A) / P(B)
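A worked example of Bayes' theorem: the probability of disease given a positive test. The sensitivity, specificity, and prevalence numbers are illustrative assumptions:

```python
p_disease = 0.01                    # prevalence, P(disease)
p_pos_given_disease = 0.99          # sensitivity, P(+ | disease)
p_pos_given_healthy = 0.05          # 1 - specificity, P(+ | healthy)

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes: P(disease | +) = P(+ | disease) * P(disease) / P(+)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

Despite the accurate test, the posterior is only about 1 in 6, because the disease is rare and false positives from the large healthy group dominate.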
Matrix
Dimensions are written as (rows x columns), e.g. a matrix with 2 rows and 3 columns is a (2 x 3) matrix.