DS - Concepts Flashcards
[STATS] Bias
wrong assumptions when training → can’t capture underlying patterns → underfit
Bias is error due to overly simplistic assumptions in the learning algorithm being used. This can lead to the model underfitting your data, making it hard for it to have high predictive accuracy and to generalize from the training set to the test set.
[STATS] Variance
sensitive to fluctuations when training → can’t generalize on unseen data → overfit
Variance is error due to too much complexity in the learning algorithm you’re using. This leads to the algorithm being highly sensitive to small fluctuations in your training data, which can cause your model to overfit the data. You’ll be carrying too much noise from your training data for your model to be very useful on your test data.
[STATS] Bias Variance Tradeoff
The bias-variance tradeoff attempts to minimize these two sources of error, through methods such as:
– Cross validation to generalize to unseen data
– Dimension reduction and feature selection
In general, as variance decreases, bias increases (and vice versa).
The bias-variance decomposition breaks down the expected error of any algorithm into the sum of the bias (squared), the variance, and an irreducible error due to noise in the underlying dataset. If you make a model more complex and add more variables, you’ll lose bias but gain variance; to get the optimally reduced amount of error, you’ll have to trade off bias and variance.
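For squared-error loss, this can be written as:
Expected test error = Bias² + Variance + Irreducible error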
[STATS] Precision
TP / (TP + FP) → percent correct when predicting positive
Positive predictive value: the number of correct positives the model claims compared to the total number of positives it claims
[STATS] Recall
Sensitivity
TP / (TP + FN) → percent of actual positives identified correctly (True Positive Rate)
Recall is the true positive rate: the number of positives your model correctly identifies compared to the actual number of positives in the data
[STATS] Specificity
TN / (TN + FP) → percent of actual negatives identified correctly
[STATS] F1 Score
2 * (Precision * Recall) / (Precision + Recall)
Useful when classes are imbalanced
The F1 Score is a measure of a model’s performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best and those tending to 0 being the worst. You would use it in classification tests where true negatives don’t matter much.
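A minimal sketch of computing the metrics above with scikit-learn; the labels here are made up purely for illustration:

from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model's predictions (illustrative)

precision = precision_score(y_true, y_pred)   # TP / (TP + FP)
recall = recall_score(y_true, y_pred)         # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                 # 2 * P * R / (P + R)

# specificity has no dedicated helper; read it off the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)                  # TN / (TN + FP)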
[STATS] ROC Curve
plots TPR vs. FPR at every classification threshold. The Area Under the Curve (AUC) measures how well the model separates positives from negatives (perfect AUC = 1, random baseline = 0.5).
The ROC curve is a graphical representation of the contrast between true positive rates and false positive rates at various thresholds. It’s often used as a proxy for the trade-off between the sensitivity of the model (true positives) vs the fallout or probability it will trigger a false alarm (false positives).
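A minimal sketch with scikit-learn; the synthetic dataset and logistic model are only there to produce scores for the curve:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=500, random_state=0)   # toy data, illustrative only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_test, scores)               # 1.0 = perfect, 0.5 = random baseline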
[STATS] Precision-Recall Curve
Focuses on the correct prediction of the minority class, useful when data is imbalanced
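A companion sketch, reusing y_test and scores from the ROC example above:

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_test, scores)
avg_precision = average_precision_score(y_test, scores)   # summarizes the curve; baseline = positive-class prevalence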
[STATS] P-Value
probability of observing an effect at least as extreme as the one seen, assuming the null hypothesis is true (i.e., that the effect occurred by chance). If the p-value is less than the significance level α, or equivalently if the test statistic is greater than the critical value, then reject the null.
A p-value is a measure of the probability that an observed difference could have occurred just by random chance. The lower the p-value, the greater the statistical significance of the observed difference.
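A minimal sketch using a two-sample t-test from SciPy; the two samples are made up purely for illustration:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=50)   # control group (illustrative)
group_b = rng.normal(loc=0.5, scale=1.0, size=50)   # treatment group (illustrative)

t_stat, p_value = stats.ttest_ind(group_a, group_b)
alpha = 0.05
reject_null = p_value < alpha   # reject the null of equal means if p < α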
[STATS] Type I Error
(False Positive α) - rejecting a true null
Type I error is a false positive, meaning you claim something has happened when it hasn’t
Confidence Level (1 - α) - probability of not making a Type I error, i.e., not claiming an effect that is really just chance
[STATS] Type II Error
(False Negative β) - not rejecting a false null. Decreasing the Type I error rate increases the Type II error rate (and vice versa)
Type II error is a false negative, meaning you claim nothing is happening when in fact something is
Power (1 - β) - probability of picking up on an effect that is present and avoiding a Type II Error
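A small sketch of how α, β, and power relate, using statsmodels’ power calculator; the effect size and rates are arbitrary example values:

from statsmodels.stats.power import TTestIndPower

# sample size per group needed to detect a medium effect (d = 0.5)
# at α = 0.05 (Type I rate) with power = 0.8 (so β = 0.2, the Type II rate)
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)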
[ML] Logistic Regression
Predicts the probability that y belongs to a binary class. Estimates β through maximum likelihood estimation (MLE) by fitting a logistic (sigmoid) function to the data. This is equivalent to minimizing the cross-entropy loss. A decision threshold then classifies predictions as either 1 or 0.
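A minimal sketch with scikit-learn; the toy dataset and the 0.5 threshold are illustrative choices, not part of the definition:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)   # toy binary data

clf = LogisticRegression().fit(X, y)   # β estimated by maximizing the likelihood
proba = clf.predict_proba(X)[:, 1]     # P(y = 1 | x) from the sigmoid
pred = (proba >= 0.5).astype(int)      # threshold turns probabilities into 0/1 labels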
[ML] Logistic Regression Assumptions
– Linear relationship between X and log-odds of Y
– Independent observations
– Low multicollinearity
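One common way to check the multicollinearity assumption is the variance inflation factor (VIF); a sketch with statsmodels, using a made-up feature matrix:

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy feature matrix, one column per feature

# VIF per feature; values well above ~5-10 suggest problematic multicollinearity
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]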
[ML] Multiclass Classification
- Multiclass classification can distinguish between more than two classes
- SGD classifiers, Random Forest and Naïve Bayes handle multiclass by default
- One-versus-the-rest (OVR) strategy for classifiers that don’t support it natively (see the sketch after this list)
- Multilabel Classification: can assign multiple labels to one instance in a single model. Use the F1 score (averaged across labels) to evaluate multilabel classifiers
- Multioutput Classification: Generalization of multilabel classification where each label can be multiclass (have more than two possible values)
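A minimal OVR sketch with scikit-learn; the iris data and LinearSVC base estimator are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)   # three classes

# wraps a binary classifier so that one model is trained per class
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
pred = ovr.predict(X)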
[ML] Linear Regression
Models linear relationships between a continuous response and explanatory variables
Makes predictions by computing a weighted sum of the input features plus a constant called the bias term (intercept term)
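A minimal sketch with scikit-learn; the coefficients and noise level in the toy data are arbitrary example values:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                                  # two explanatory variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

reg = LinearRegression().fit(X, y)
pred = reg.predict(X)   # equivalent to X @ reg.coef_ + reg.intercept_ (weighted sum + bias term)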