Methods Flashcards
LOCF
Last Observation Carried Forward. An imputation method for missing data in longitudinal studies: each missing value is filled with the subject's most recent observed value.
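A minimal pandas sketch (the frame and column names are illustrative):

```python
import pandas as pd

# Repeated measurements with gaps; each row is one visit.
visits = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B"],
    "score":   [10.0, None, None, 7.0, None],
})

# LOCF: within each subject, carry the last observed value forward.
visits["score"] = visits.groupby("subject")["score"].ffill()
```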
MICE
Multiple Imputation by Chained Equations. Imputes each incomplete variable by regressing it on the other variables, cycling through the variables for several rounds; typically run several times to produce multiple completed datasets whose results are pooled.
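scikit-learn's IterativeImputer is modeled on MICE; a minimal single-imputation sketch:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Each incomplete column is regressed on the others, cycling through the
# columns until the filled-in values stabilize.
X_imputed = IterativeImputer(max_iter=10, random_state=0).fit_transform(X)
```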
SMOTE
Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The module works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
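Outside the Studio module, the same technique is available in the imbalanced-learn library; a minimal sketch:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# A deliberately imbalanced binary problem.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

# New minority cases are interpolated between existing minority neighbors;
# the majority class is left untouched.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))
```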
Probabilistic PCA
A latent-variable model that generalizes classical PCA: the data are modeled as a linear transformation of Gaussian latent variables plus isotropic Gaussian noise, with classical PCA recovered in the zero-noise limit. Because it defines a full probability model, it can be used for imputation of missing values.
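A simplified sketch of the imputation idea, iterating a classical PCA reconstruction as a stand-in for the full probabilistic EM procedure:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_impute(X, n_components=2, n_iter=20):
    """Fill missing entries by iterating a low-rank PCA reconstruction."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        recon = pca.inverse_transform(pca.fit_transform(X_filled))
        X_filled[mask] = recon[mask]  # refine only the missing cells
    return X_filled
```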
Entropy MDL
A binning method. This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column.
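A toy sketch of the idea: scan equal-width bin counts for the lowest conditional entropy of the target, with a crude complexity penalty standing in for the module's actual MDL criterion.

```python
import numpy as np

def conditional_entropy(bin_ids, target):
    """H(target | bin): entropy of the target inside each bin, weighted by
    bin size. `target` holds nonnegative integer class labels."""
    h, n = 0.0, len(target)
    for b in np.unique(bin_ids):
        in_bin = target[bin_ids == b]
        p = np.bincount(in_bin) / len(in_bin)
        p = p[p > 0]
        h += (len(in_bin) / n) * -(p * np.log2(p)).sum()
    return h

def best_bin_count(x, target, max_bins=10):
    """Pick the bin count under which the bins best predict the target."""
    def score(k):
        edges = np.histogram_bin_edges(x, bins=k)
        h = conditional_entropy(np.digitize(x, edges[1:-1]), target)
        return h + k * np.log2(len(x)) / len(x)  # crude MDL-style penalty per bin
    return min(range(2, max_bins + 1), key=score)
```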
PQuantile
Normalization applied after quantile binning: values are normalized into the range [0,1]. Note that normalizing the values transforms them but does not affect the final number of bins.
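A small illustration of the [0,1] normalization, mapping each value to its normalized rank (one convention; the module's exact formula may differ):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([5.0, 1.0, 9.0, 3.0])
# Smallest value maps to 0, largest to 1; the bin count is unaffected.
x_norm = (rankdata(x) - 1) / (len(x) - 1)
```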
SHAP explainer
Uses Shapley values to explain any machine learning model or Python function.
- Global interpretability: SHAP values show how much each predictor contributes, positively or negatively, to the target variable.
- Local interpretability: each observation gets its own set of SHAP values, which greatly increases transparency. We can explain why a case receives its prediction and the contributions of the predictors.
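A minimal sketch with the shap package (the dataset and model are illustrative):

```python
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.Explainer(model, X)   # accepts models or plain functions
shap_values = explainer(X)             # one set of SHAP values per observation

shap.plots.bar(shap_values)            # global: mean |SHAP| per feature
shap.plots.waterfall(shap_values[0])   # local: why this case got its prediction
```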
Mimic explainer
Mimic explainer is based on the idea of training global surrogate models to mimic black-box models. A global surrogate model is an intrinsically interpretable model that is trained to approximate the predictions of any black-box model as accurately as possible. Data scientists can interpret the surrogate model to draw conclusions about the black-box model. You can use one of the following interpretable models as your surrogate model: LightGBM (LGBMExplainableModel), Linear Regression (LinearExplainableModel), Stochastic Gradient Descent explainable model (SGDExplainableModel), and Decision Tree (DecisionTreeExplainableModel).
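A generic sketch of the surrogate idea, not the azureml-interpret API itself: train a shallow decision tree on the black-box model's predictions and read off its rules.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=1000, random_state=0)

# The "black box" whose behavior we want to understand.
black_box = GradientBoostingClassifier(random_state=0).fit(X, y)

# Global surrogate: fit an interpretable model to the black box's predictions.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box.predict(X))

print(export_text(surrogate))  # human-readable rules approximating the black box
```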
PFI
Permutation Feature Importance is a technique used to explain classification and regression models that is inspired by Breiman’s Random Forests paper (see section 10). At a high level, the way it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the performance metric of interest changes. The larger the change, the more important that feature is. PFI can explain the overall behavior of any underlying model but does not explain individual predictions.
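scikit-learn implements the same idea as permutation_importance; a minimal sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much the score drops;
# the larger the drop, the more important the feature.
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
print(result.importances_mean)
```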
Fast Forest Quantile Regression
A module in Machine Learning Studio (classic) that creates a regression model able to predict values for a specified number of quantiles.
Quantile regression is useful if you want to understand more about the distribution of the predicted value, rather than get a single mean prediction. This method has many applications.
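The Studio module itself isn't shown here; as a stand-in, scikit-learn's gradient boosting supports a quantile loss, with one model fit per requested quantile:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)

# One model per quantile: together they sketch the predictive distribution
# rather than a single mean estimate.
models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                 random_state=0).fit(X, y)
    for q in (0.1, 0.5, 0.9)
}
low, median, high = (models[q].predict(X[:5]) for q in (0.1, 0.5, 0.9))
```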
Boosted Decision Tree Regression
Boosting means that each tree is dependent on prior trees. The algorithm learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.
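A toy sketch of the residual-fitting loop for squared loss, where each new tree is trained on what the current ensemble still gets wrong:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=50, lr=0.1):
    """Toy boosting for regression: each tree fits the prior trees' residuals."""
    base = y.mean()
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                           # current ensemble errors
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += lr * tree.predict(X)                  # shrunken correction
        trees.append(tree)
    return base, trees

def boost_predict(base, trees, X, lr=0.1):
    return base + lr * sum(t.predict(X) for t in trees)
```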
Demographic parity constraint
Mitigate allocation harms in binary classification and regression
Equalized odds constraint
Diagnose allocation and quality-of-service harms in binary classification
Equal opportunity constraint
Diagnose allocation and quality-of-service harms in binary classification
Bounded group loss constraint
Mitigate quality-of-service harms in regression
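These four constraints correspond to mitigators in the Fairlearn library (DemographicParity, EqualizedOdds, TruePositiveRateParity for equal opportunity, and BoundedGroupLoss); a minimal sketch with demographic parity on synthetic data:

```python
import numpy as np
from fairlearn.reductions import DemographicParity, ExponentiatedGradient
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
sensitive = rng.integers(0, 2, size=500)  # e.g. a binary group membership
y = (X[:, 0] + 0.5 * sensitive + rng.normal(size=500) > 0).astype(int)

# Reduce constrained classification to a sequence of reweighted fits of the
# base estimator, subject to demographic parity across the sensitive groups.
mitigator = ExponentiatedGradient(LogisticRegression(), DemographicParity())
mitigator.fit(X, y, sensitive_features=sensitive)
y_pred = mitigator.predict(X)
```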