Methods Flashcards

1
Q

LOCF

A

Last Observation Carried Forward. An imputation method for missing data that replaces each missing value with the most recent observed value for the same subject or series (common in longitudinal data).
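
A minimal LOCF sketch with pandas (ffill carries the last observed value forward within each subject); the data here is made up for illustration:

import numpy as np
import pandas as pd

# Hypothetical longitudinal data with missing follow-up values
df = pd.DataFrame({"subject": [1, 1, 1, 2, 2],
                   "visit":   [1, 2, 3, 1, 2],
                   "score":   [10.0, np.nan, np.nan, 7.0, np.nan]})

# LOCF: carry the last observed value forward within each subject
df["score_locf"] = df.groupby("subject")["score"].ffill()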

2
Q

MICE

A

Multiple Imputation by Chained Equations
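
scikit-learn's IterativeImputer is inspired by MICE (by default it returns a single imputation rather than multiple); a minimal sketch with made-up data:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Each feature with missing values is modelled as a function of the other features,
# cycling through the features ("chained equations") for several rounds
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)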

3
Q

SMOTE

A

Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The module works by generating new instances from existing minority cases that you supply as input. This implementation of SMOTE does not change the number of majority cases.
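
A minimal sketch using the open-source imbalanced-learn package rather than the Studio module; note the majority class is left unchanged:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Imbalanced toy data: roughly 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))

# SMOTE synthesises new minority samples by interpolating between a minority
# case and its nearest minority-class neighbours
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))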

4
Q

Probabilistic PCA

A

Can be used for imputation. Probabilistic PCA generalizes classical PCA by modeling the data with a low-dimensional latent-variable Gaussian model, so missing values can be inferred from the fitted model.

5
Q

Entropy MDL

A

A binning method. This method requires that you select the column you want to predict and the column or columns that you want to group into bins. It then makes a pass over the data and attempts to determine the number of bins that minimizes the entropy. In other words, it chooses a number of bins that allows the data column to best predict the target column.
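
The exact MDL criterion is more involved, but a simplified sketch of the core idea (pick the bin count whose bins best predict the target, i.e. minimize conditional entropy) could look like this; the data and helper function are illustrative only:

import numpy as np
import pandas as pd

def conditional_entropy(binned, target):
    # H(target | bin): average entropy of the target within each bin
    total = len(target)
    h = 0.0
    for _, grp in pd.Series(target).groupby(binned):
        p = grp.value_counts(normalize=True)
        h += (len(grp) / total) * -(p * np.log2(p)).sum()
    return h

rng = np.random.default_rng(0)
x = rng.normal(size=500)           # column to group into bins
y = (x > 0.5).astype(int)          # hypothetical target column to predict

# Try several bin counts and keep the one whose bins best predict the target;
# the real Entropy MDL method adds an MDL penalty so more bins are not always preferred
scores = {k: conditional_entropy(pd.cut(x, bins=k, labels=False), y) for k in range(2, 11)}
best_k = min(scores, key=scores.get)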

6
Q

PQuantile

A

A normalization option applied after quantile binning: the binned values are rescaled into the range [0, 1]. Normalizing transforms the values but does not affect the final number of bins.
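
A rough pandas illustration of the idea (quantile binning, then rescaling the bin values into [0, 1]); this approximates the behaviour and is not the module's exact formula:

import numpy as np
import pandas as pd

x = pd.Series(np.random.default_rng(0).lognormal(size=1000))

# Quantile binning into 10 bins (bin index 0..9)
bins = pd.qcut(x, q=10, labels=False)

# Normalize the binned values into [0, 1]; the number of bins is unchanged
normalized = bins / (10 - 1)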

7
Q

SHAP explainer

A

Uses Shapley values to explain any machine learning model or Python function.

  • Global interpretability: SHAP values show how much each predictor contributes, positively or negatively, to the target variable.
  • Local interpretability: each observation gets its own set of SHAP values, which greatly increases transparency; we can explain why a case receives its prediction and the contributions of the predictors.
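
A minimal sketch with the shap package, using a scikit-learn model and toy data for illustration:

import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

explainer = shap.Explainer(model, X.iloc[:100])   # background data for the explainer
shap_values = explainer(X.iloc[:200])

shap.plots.beeswarm(shap_values)       # global: contribution of each predictor
shap.plots.waterfall(shap_values[0])   # local: why this one case got its prediction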
8
Q

Mimic explainer

A

Mimic explainer is based on the idea of training global surrogate models to mimic blackbox models. A global surrogate model is an intrinsically interpretable model that is trained to approximate the predictions of any black box model as accurately as possible. Data scientists can interpret the surrogate model to draw conclusions about the black box model. You can use one of the following interpretable models as your surrogate model: LightGBM (LGBMExplainableModel), Linear Regression (LinearExplainableModel), Stochastic Gradient Descent (SGDExplainableModel), or Decision Tree (DecisionTreeExplainableModel).
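
The code below is not the MimicExplainer API itself, just a generic scikit-learn sketch of the global-surrogate idea (a shallow decision tree trained to mimic a gradient-boosting "black box"):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Black box" model
blackbox = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Global surrogate: an interpretable model trained on the black box's predictions
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, blackbox.predict(X_train))

# Fidelity: how closely the surrogate reproduces the black box (not the true labels)
fidelity = accuracy_score(blackbox.predict(X_test), surrogate.predict(X_test))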

9
Q

PFI

A

Permutation Feature Importance is a technique used to explain classification and regression models that is inspired by Breiman’s Random Forests paper (see section 10). At a high level, the way it works is by randomly shuffling data one feature at a time for the entire dataset and calculating how much the performance metric of interest changes. The larger the change, the more important that feature is. PFI can explain the overall behavior of any underlying model but does not explain individual predictions.
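
scikit-learn's permutation_importance implements the same idea; a minimal sketch on toy data:

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Shuffle one feature at a time and measure how much the score drops;
# larger drops mean more important features
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)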

10
Q

Fast Forest Quantile Regression

A

The Fast Forest Quantile Regression module in Machine Learning Studio (classic) creates a regression model that can predict values for a specified number of quantiles.

Quantile regression is useful if you want to understand more about the distribution of the predicted value, rather than get a single mean prediction value.
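
The Studio module itself is not available as a library call here; as a rough stand-in, the quantile loss in scikit-learn's GradientBoostingRegressor shows the "one model per requested quantile" pattern on illustrative data:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

# One model per requested quantile, e.g. the 10th, 50th and 90th percentiles
quantile_models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
    for q in (0.1, 0.5, 0.9)
}
preds = {q: m.predict(X[:5]) for q, m in quantile_models.items()}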

11
Q

Boosted Decision Tree Regression

A

Boosting means that each tree is dependent on prior trees. The algorithm learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.
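
A rough scikit-learn analogue (not the Studio module) that makes the residual-fitting idea concrete:

from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the residual errors of the trees that preceded it
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, max_depth=3,
                                  random_state=0).fit(X_train, y_train)
print(model.score(X_test, y_test))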

12
Q

Demographic parity constraint

A

Mitigate allocation harms in binary classification and regression

13
Q

Equalized odds constraint

A

Diagnose allocation and quality-of-service harms in binary classification

14
Q

Equal opportunity constraint

A

Diagnose allocation and quality-of-service harms in binary classification

15
Q

Bounded group loss constraint

A

Mitigate quality-of-service harms in regression

16
Q

Fairness algorithms for reduction

A

Reduction: These algorithms take a standard black-box machine learning estimator (e.g., a LightGBM model) and generate a set of retrained models using a sequence of re-weighted training datasets. For example, applicants of a certain gender might be up-weighted or down-weighted to retrain models and reduce disparities across different gender groups. Users can then pick a model that provides the best trade-off between accuracy (or other performance metric) and disparity, which generally would need to be based on business rules and cost calculations.
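
A minimal sketch assuming Fairlearn's GridSearch reduction with a DemographicParity constraint; the data is synthetic and only for illustration:

import numpy as np
from fairlearn.reductions import DemographicParity, GridSearch
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
sensitive = rng.integers(0, 2, size=500)   # e.g. a binary gender attribute
y = (X[:, 0] + 0.5 * sensitive + rng.normal(size=500) > 0).astype(int)

# The reduction retrains the base estimator on a sequence of re-weighted datasets
sweep = GridSearch(LogisticRegression(), constraints=DemographicParity(), grid_size=20)
sweep.fit(X, y, sensitive_features=sensitive)

# One model per re-weighting; pick the best accuracy/disparity trade-off for your use case
candidate_models = sweep.predictors_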

17
Q

Fairness algorithms for post-processing

A

Post-processing: These algorithms take an existing classifier and the sensitive feature as input. Then, they derive a transformation of the classifier’s prediction to enforce the specified fairness constraints. The biggest advantage of threshold optimization is its simplicity and flexibility as it does not need to retrain the model.
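
A minimal sketch assuming Fairlearn's ThresholdOptimizer; the data and classifier are illustrative:

import numpy as np
from fairlearn.postprocessing import ThresholdOptimizer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
sensitive = rng.integers(0, 2, size=500)
y = (X[:, 0] + 0.5 * sensitive + rng.normal(size=500) > 0).astype(int)

# The existing (already trained) classifier is left untouched
clf = LogisticRegression().fit(X, y)

# Learn group-specific decision thresholds that enforce the fairness constraint
postprocessor = ThresholdOptimizer(estimator=clf, constraints="equalized_odds", prefit=True)
postprocessor.fit(X, y, sensitive_features=sensitive)
y_fair = postprocessor.predict(X, sensitive_features=sensitive)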

18
Q

Differential privacy

A

Differential privacy is a set of systems and practices that help keep the data of individuals safe and private.

19
Q

Epsilon in the context of privacy

A

A value known as epsilon measures how noisy or private a report is. Epsilon has an inverse relationship to noise or privacy. The lower the epsilon, the more noisy (and private) the data is.
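
A minimal sketch of the classic Laplace mechanism, which makes the inverse relationship concrete (noise scale = sensitivity / epsilon):

import numpy as np

def noisy_count(true_count, epsilon, sensitivity=1.0):
    # Smaller epsilon -> larger noise scale -> more private but less accurate
    rng = np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 120                              # e.g. number of people matching a query
print(noisy_count(true_count, epsilon=0.1))   # very noisy / very private
print(noisy_count(true_count, epsilon=5.0))   # close to the true count / less private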

20
Q

Matchbox Recommender

A

How this works: When a user is relatively new to the system, predictions are improved by making use of the feature information about the user, thus addressing the well-known “cold-start” problem. However, once you have collected a sufficient number of ratings from a particular user, it is possible to make fully personalized predictions for them based on their specific ratings rather than on their features alone. Hence, there is a smooth transition from content-based recommendations to recommendations based on collaborative filtering. Even if user or item features are not available, Matchbox will still work in its collaborative filtering mode.

21
Q

Parameter sweeping mode: Entire grid

A

Entire grid: When you select this option, the module loops over a grid predefined by the system, to try different combinations and identify the best learner. This option is useful for cases where you don’t know what the best parameter settings might be and want to try all possible combination of values.
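
The analogous concept outside the Studio module is an exhaustive grid search; a scikit-learn sketch with an illustrative parameter grid:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Every combination in the grid is tried; the best learner is kept
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)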

22
Q

Parameter sweeping mode: Random sweep

A

Random sweep: When you select this option, the module will randomly select parameter values over a system-defined range. You must specify the maximum number of runs that you want the module to execute. This option is useful for cases where you want to increase model performance using the metrics of your choice but still conserve computing resources.
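
The analogous concept outside the Studio module is a randomized search with a capped number of runs; a scikit-learn sketch with illustrative parameter distributions:

from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# n_iter caps the maximum number of runs; parameter values are drawn at random
param_distributions = {"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e-1)}
search = RandomizedSearchCV(SVC(), param_distributions, n_iter=20, cv=5,
                            random_state=0).fit(X, y)
print(search.best_params_, search.best_score_)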