Concepts Flashcards

1
Q

Machine Learning

A

The process of creating an algorithm that predicts an outcome from data and can improve its performance through experience.

2
Q

Supervised learning algorithms are…

A
  • Trained on labeled data
3
Q

Unsupervised learning algorithms are…

A
  • Trained on unlabeled data
4
Q

Regularization is…

A

Any process that reduces generalization error (i.e. testing error) but not training error. It controls a model’s capacity (its ability to fit a wide variety of functions) and therefore prevents overfitting.

Examples include:
- L1/L2 in linear/logistic regression

5
Q

Hyperparameters are…

A

Parameters that can be used to control an algorithm’s behavior but are not learned. These should be tuned on a validation set.

Examples include:
- alpha in linear regression, or “learning rate”, which controls the step size in the gradient descent algorithm

6
Q

Difference between regression and classification

A

Regression is a process to predict continuous output values.

Classification is a process to predict categorical output values.

7
Q

What is a cost function?

A

A cost function measures the error of our predictions: it quantifies how far our predicted outcomes are from the actual outcomes. Training a model means finding parameters that minimize it.

8
Q

Linear regression

  • When is it used?
  • What is the hypothesis?
  • What is the cost function?
  • Are there any assumptions?
A

Linear regression is used to predict a continuous outcome (e.g. house prices) from one or more input variables. These input variables can be continuous or categorical, but they must be represented numerically.

The hypothesis is a linear model:
y = theta_0 + theta_1 * x; in matrix form, y = theta^T * x (where the first column of X is all 1’s)

A common cost function is Mean Squared Error (MSE). This function is parabolic in the univariate case.

Assumptions (check?):
- linearity
- normality
- independence
- no multicollinearity

9
Q

What is the Mean Squared Error (MSE) cost function for linear regression?

A

J = (1/(2n)) * sum from i=1 to n (y_i,predicted - y_i,actual)^2
= (1/(2n)) * sum from i=1 to n (theta_0 + theta_1 * x_i - y_i)^2
= (1/(2n)) * (X * theta - y)^T * (X * theta - y)

10
Q

There are two ways to determine the coefficients for a linear regression model. What are they?

A

The Mean Squared Error (MSE) cost function can be minimized using gradient descent.

In the special case of linear regression, the cost function can also be minimized analytically via the normal equation: theta = (X^T * X)^(-1) * X^T * y.

11
Q

Explain gradient descent.

A

Gradient descent is an algorithm that updates the coefficients to minimize the cost function.

In the case of linear regression:

  • initial values of theta are chosen
  • these are updated iteratively based on the slope of the cost function; we take steps along the cost function in the direction of greatest descent
  • the size of the steps is controlled by hyperparameter alpha (“learning rate”)
  • this occurs until a minimum is found (stopping conditions?)

The update equations look something like:
theta_updated = theta_current - alpha * partial derivative of the cost function with respect to theta_current
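A minimal numpy sketch of these updates for univariate linear regression (not from any particular library; the variable names and toy data are illustrative):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Fit y ~ theta0 + theta1*x by minimizing the (half) MSE cost."""
    n = len(y)
    theta0, theta1 = 0.0, 0.0                 # initial values of theta
    for _ in range(n_iters):
        y_pred = theta0 + theta1 * x
        error = y_pred - y
        # Partial derivatives of J = (1/(2n)) * sum(error^2)
        grad0 = error.sum() / n
        grad1 = (error * x).sum() / n
        # Step in the direction of steepest descent, scaled by alpha
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])            # roughly y = 1 + 2x
print(gradient_descent(x, y))                  # intercept and slope close to (1, 2)
```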

12
Q

Discuss the effect of learning rate (hyperparameter alpha) in linear regression.

A

Alpha controls the rate of gradient descent when minimizing a cost function for linear regression. A larger value of alpha produces a larger step size, while a smaller value of alpha produces a smaller step size. If alpha is too small, you can find the minimum very precisely, but the algorithm may take a long time to converge. If alpha is too large, you may overshoot the minimum, and the algorithm may fail to converge or even diverge.

Note that the steps naturally get smaller as the number of iterations increases (because the slope of the cost function approaches zero near the minimum), even if alpha itself is held fixed.

13
Q

Things to keep in mind when preparing data for Machine Learning.

A

Gradient descent will work best if all of the input values x are roughly between -1 and 1, or even -0.5 and 0.5. To achieve this (a short sklearn sketch follows the list):

  • Feature scaling: divide all input values by the range of input values to achieve a range of 1
  • Mean normalization: subtract the average value of each input variable from the values of that input variable to achieve an average of 0
  • Standardization: subtract the average value of each input variable from the values of that input variable and divide by the standard deviation of that input variable to achieve an average of 0 and a standard deviation of 1
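For example, a minimal sketch using sklearn's scalers (fit on the training data only to avoid leakage; the toy arrays are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

# Standardization: mean 0, standard deviation 1 per feature
scaler = StandardScaler().fit(X_train)    # learn means/stds from training data only
print(scaler.transform(X_train))
print(scaler.transform(X_test))

# Feature scaling to a fixed range instead (here [0, 1])
print(MinMaxScaler().fit_transform(X_train))
```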
14
Q

How might you assess how well gradient descent is working?

A

Plot the value of the cost function against the iteration number. The cost should decrease at every iteration; if it increases, the learning rate is probably too large.

15
Q

Define MAE

A

Mean Absolute Error - a metric for assessing the accuracy of a regression model.

It is the average of the absolute values of the residuals.

MAE = (1/n) * sum from i=1 to n ( abs( y_i,actual - y_i,predicted ) )

Smaller values of MAE indicate better model performance.

MAE places bounds on root mean squared error (RMSE):
MAE <= RMSE <= MAE*sqrt(n)

16
Q

Explain the need for validation and test sets.

A

You need a validation set to tune hyperparameters without letting your model see your “test” set.

You need a test set because, once you’ve trained a model, you need to be able to assess its performance in the real world (i.e. on data it’s never seen before).

17
Q

Explain regularization in the context of linear regression.

A

Regularization increases generalizability by penalizing large (or non-zero) coefficients. (I.e., you want to discourage the model from relying on more parameters than it needs.)

L1 (or lasso) penalizes by the sum of the absolute values of the coefficients (the Manhattan / L1 norm)

L2 (or ridge) penalizes by half the sum of the squared coefficients (the squared Euclidean / L2 norm)

18
Q

Explain L1 regularization

A

L1 regularization, or lasso regularization, adds the following term to a cost function: J = J + lambda*(sum from 1 to n of the absolute values of the coefficients). This sum is also known as the Manhattan distance (L1 norm).

It has the effect of setting small coefficients to 0, thereby doing feature reduction. This improves interpretability.

Hyperparameter lambda controls how strong this penalty is.

The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Lasso) is 1.

19
Q

Explain L2 regularization

A

L2 regularization, or ridge regularization, adds the following term to a cost function: J = J + (1/2)*lambda*(sum from 1 to n of the coefficients^2). This sum is the squared Euclidean distance (squared L2 norm).

It has the effect of shrinking small coefficients toward 0 without setting them exactly to 0, so it does not eliminate any features. As a result, models using L2 regularization retain all of their features, which can make them less interpretable and, if lambda is too small, still prone to overfitting.

Hyperparameter lambda controls how strong this penalty is.

The default lambda value in sklearn (lambda is termed “alpha” in sklearn.linear_model.Ridge) is 1.

Both L1 and L2 regularization are sensitive to feature scale (the penalty treats all coefficients alike), so inputs should be standardized before fitting either one.
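A small sketch contrasting the two penalties in sklearn (the pipeline standardizes features first, per the note above; the data set and alpha values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=1.0)).fit(X, y)
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)

# L1 drives uninformative coefficients to (essentially) exactly 0; L2 only shrinks them
print(np.round(lasso.named_steps["lasso"].coef_, 2))
print(np.round(ridge.named_steps["ridge"].coef_, 2))
```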

20
Q

Logistic regression

  • When is it used?
  • What is the hypothesis?
  • What is the cost function?
  • Are there any assumptions?
A

Logistic regression is one of the simplest classification algorithms. It predicts the probability of a positive outcome based on a set of input features. These features can be continuous or categorical, but they must be represented numerically. In mathematical terms, it predicts p(y = 1 | x). The predicted class is 1 or 0, depending on whether the predicted probability is above or below some threshold (usually 0.5).

The hypothesis is that the log-odds of a positive outcome is a linear combination of input features:
p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x))

The cost function is: Binary cross entropy (also called log loss)

Assumptions:
- binary predictions
- independence
- log odds of the output can be modeled as a linear combination of the inputs

21
Q

Define odds and log odds

A
odds = p(x) / (1 - p(x))
log(odds) = log( p(x) / (1 - p(x)) )
if p(x) = (e^(b + theta*x))/(1 + e^(b + theta*x))
then log(odds) = b + theta*x
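A quick numeric illustration (the values of b, theta, and x are hypothetical):

```python
import numpy as np

b, theta, x = -1.0, 2.0, 1.5            # hypothetical intercept, coefficient, input
log_odds = b + theta * x                # 2.0
p = np.exp(log_odds) / (1 + np.exp(log_odds))   # sigmoid -> ~0.88
odds = p / (1 - p)                      # ~7.39, i.e. e^2
print(log_odds, p, odds, np.log(odds))  # log(odds) recovers b + theta*x
```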
22
Q

How are the coefficients of logistic regression typically estimated?

A

The coefficients are estimated by Maximum Likelihood Estimation (MLE), which is equivalent to minimizing the binary cross-entropy (log loss) cost function.

MLE is usually implemented using quasi-Newton methods.

If you’re implementing it by hand, it’s easier to use gradient descent.

23
Q

Explain Maximum Likelihood Estimation (MLE)

A

MLE is a method used to estimate the parameters of a model. It picks the parameter values such that they maximize the likelihood that the process described by the model produced the data that were actually observed.

I.e., it estimates which curve (e.g. a normal curve) was most likely responsible for generating the data points observed. In the case that we believe a normal distribution was the process that generated the data, MLE will find the values of mu and sigma that describe the curve that best fits the observed data.

24
Q

Explain the decision tree learning algorithm.

A

Decision trees can be used on either categorical or continuous data and can predict either a categorical or a continuous output variable.

In a decision tree, a set of rules is chosen relating to the input features, and each rule splits the rows of data and passes them down to the next level of the tree. The rules and which features they operate on are usually chosen automatically by the model; a parameter that is commonly set by the modeler, however, is the depth of the tree. It is usual to try a deeper tree to begin with, and then to scale back if the model overfits.

It is common to use a depth of 10, which allows for up to 2^10 = 1024 leaf nodes at the bottom of the tree. But if each node only contains a few examples, the model will be prone to overfitting (not enough data to make generalizable conclusions). A sensible parameter to tune on a validation set in sklearn to handle this problem is max_leaf_nodes (e.g. try between 5 and 500).

25
Q

Explain the random forest learning algorithm.

A

Random forests can be used on both continuous and categorical data, and they can predict either a continuous or a categorical outcome.

In this algorithm, a number of decision trees are generated for the same data.

There are two techniques to prevent all trees from reaching the same conclusions:

  • Bootstrap the data (sample with replacement) and assign a sample to each tree
  • Assign each tree only a subset of input features (usual to pick the sqrt of the number of input features)

To reach a prediction, the results of all trees are averaged (in the continuous case). In the categorical case, the trees vote.

Random forests often work well with default parameters in sklearn. Params to tune (defaults in parentheses): n_estimators (100), criterion (‘squared_error’ for regression), min_samples_split (2), max_depth (None)
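A minimal sklearn sketch (the data set is illustrative; the hyperparameters shown are just the defaults listed above):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_depth=None lets each tree grow until its leaves are pure
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))   # mean accuracy on held-out data
```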

26
Q

When should you choose a decision tree versus a random forest?

A
  • Decision trees are interpretable and easy to visualize.
  • Decision trees are highly reproducible and perform well on large data sets because they are quick to run
  • Decision trees are very prone to overfitting, especially if the tree is deep. We can limit tree depth, but this increases the risk of a biased model.
  • Random forests are able to reduce overfitting while not dramatically increasing error due to bias.
  • Random forests are also more robust to outliers and general variation in the data, because they are an ensemble method where multiple trees must reach consensus.

My sense is that decision trees are almost never used in practice.

27
Q

When should you choose a random forest versus linear/logistic regression?

A
  • Decision trees and random forests can outperform linear/logistic regression if the output is not well-represented by linear combinations of the input variables (tree-based methods are non-parametric and learn interactions without them having to be explicitly modeled).
  • Random forests perform well in the case when the number of variables is close to or exceeds the number of observations, a regime in which linear/logistic regression breaks down.
  • Random forests are more robust to outliers because they are an ensemble method.
  • Random forests are generally less interpretable (easy to explain) than regression models, and they take more time and memory to run.
28
Q

Explain support vector machines (SVMs).

A

SVMs can be used for regression or classification, but they’re usually used for the latter, and usually only for binary classification problems (true?). The SVM itself is a complex, multidimensional surface, also called a hyperplane, that separates classes. The goal of this algorithm is to determine the hyperplane such that it separates the classes as successfully as possible.

In the SVM algorithm, a hyperplane is first identified that completely separates class A from class B. Then the distance from the support vectors (also called the “margin”) is maximized. (Which algorithm?)

Kernels can be used to create non-linear boundaries. In sklearn, there are options including ‘linear’, ‘rbf’, ‘poly’, and ‘sigmoid.’ Linear is usually best when you have a large number of features (> 1000) because it helps you avoid overfitting.
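A minimal sketch of trying different kernels in sklearn (the data set and parameter values are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "rbf", "poly"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    clf.fit(X_train, y_train)
    # The non-linear kernels should separate the interleaved moons better
    print(kernel, clf.score(X_test, y_test))
```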

29
Q

What are “support vectors” in Support Vector Machines (SVMs)?

A

This is easiest to think about in the context of binary classification.

The support vectors are the points that are closest to the boundary between classes; they determine the position of the separating hyperplane and its margin. The hyperplane itself is the complex multidimensional surface that the SVM learns in order to separate the classes.

30
Q

Explain the hyperparameters in the SVM algorithm.

A

1) Gamma is the kernel coefficient if a non-linear kernel is used (e.g. rbf, sigmoid). A high value of gamma (e.g. 100) will likely result in overfitting.
2) C is a penalty parameter that controls the trade-off between correctly classifying training points and keeping the decision boundary smooth.

C effectively acts as (inverse) regularization: a smaller C means stronger regularization and a smoother boundary, while a larger C means less regularization and a closer fit to the training data.

31
Q

What are the pros and cons of using the SVM algorithm?

A

Pros:

  • It works well when there is a clear margin of separation between classes
  • It is effective in high dimensional spaces, even when the number of dimensions is greater than the number of samples
  • It is memory-efficient because only the support vectors are used to tune the location of the hyperplane

Cons:

  • Training time can be large on large data sets
  • It performs poorly on noisy data (i.e. when there is no clear separation between classes)
  • SVM does not directly provide probability estimates; in sklearn these are obtained via an expensive internal cross-validation (Platt scaling).
32
Q

Explain the k-means algorithm.

A

K-means is an unsupervised learning algorithm that groups similar points together to reveal underlying patterns.

The algorithm looks for a fixed number (k) of clusters in the data, where k defines the number of centroids you want to find.

  • Starts with a randomly located group of centroids
  • Calculates the distance between each data point and all k centroids
  • Assigns the data point to the closest centroid
  • After all the data points have been assigned, updates the location of each centroid to the average location of all data points assigned to that centroid.
  • The algorithm stops when centroid locations are not changing much between iterations, or when a certain number of iterations is reached.

The model output is the cluster each data record belongs to.

K is a critically-important hyperparameter.

Recommendation systems often use k-means.
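A minimal sklearn sketch (the blob data and k = 3 are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)          # k-means is distance-based, so scale first

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])       # cluster assignment for each data record
print(km.cluster_centers_)   # final centroid locations
```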

33
Q

Define “centroid”

A

A centroid is the center of a cluster of data. Formally, it’s the average location of all the data points assigned to a cluster.

34
Q

What are some considerations for the k-means algorithm?

A
  • The choice of initial positions for the centroids is important, and a poor choice can result in the algorithm failing to stabilize. Rather than assigning entirely random centroid locations, one option is to initialize the centroids at the locations of actual data points.
  • The selection of hyperparameter k matters a lot
  • Data must be normalized/scaled in order for the k-means distances to make sense. This can be done in sklearn (e.g. StandardScaler).
35
Q

How might you choose hyperparameter k in the k-means algorithm?

A
  • Sometimes k can be estimated by eye
  • You can also use an elbow plot

36
Q

What is an elbow plot in the context of k-means?

A

It’s a plot that shows k on the x-axis and the within-cluster sum of squares (WCSS) on the y-axis.

To calculate the WCSS, for each cluster, calculate the squared Euclidean distance between each point in the cluster and the cluster’s centroid, then sum these over all points and all clusters. The “elbow”, the value of k at which the curve stops decreasing sharply, is a reasonable choice for k.
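A short sketch of building an elbow plot with sklearn, whose inertia_ attribute is the within-cluster sum of squares (the data and range of k are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("within-cluster sum of squares")
plt.show()    # look for the 'elbow' where the curve flattens (here around k = 4)
```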

37
Q

Explain hierarchical clustering.

A

There are two kinds of hierarchical clustering: agglomerative and divisive. The latter is rarely used, so we’ll focus on the former.

Algorithm

  • Each data point starts out in its own cluster
  • A proximity matrix is calculated which describes how far each point is from all the others
  • The two clusters that are closest together are merged, and then the proximity matrix is re-calculated (how the distance to a multi-point cluster is measured depends on the linkage criterion; see the next card)
  • This process repeats until k clusters are achieved
38
Q

What are some metrics for measuring the distance between clusters and determining which clusters are closest together?

A
  • Minimum (linkage=’single’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the minimum distance is taken to be the measure of proximity.
    o Performs well on non-globular clusters
    o Performs poorly on noisy clusters
  • Maximum (linkage=’complete’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the maximum distance is taken to be the measure of proximity.
    o Performs well on noisy clusters
    o Performs best on globular clusters
    o Tends to break large clusters
  • Group average (linkage=’average’ in sklearn): All the distances between points in cluster1 and points in cluster2 are calculated; the average of these is taken to be the measure of proximity.
    o Performs well on noisy clusters
    o Performs best on globular clusters
    o Variations: distance between centroids, Ward’s method (linkage=’ward’)

In addition, in calculating the above, “distance” can be measured as: Euclidean distance (affinity=’euclidean’ in sklearn, required if linkage=’ward’), squared Euclidean distance, Manhattan distance (affinity=’manhattan’ in sklearn), and others.
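A small sklearn sketch comparing linkage options (distance is left at the Euclidean default; the data are illustrative):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)   # normalize first, as with k-means

for linkage in ["single", "complete", "average", "ward"]:
    model = AgglomerativeClustering(n_clusters=3, linkage=linkage)
    labels = model.fit_predict(X)       # cluster label for each data point
    print(linkage, labels[:10])
```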

39
Q

When would you use hierarchical clustering versus k-means?

A
  • As in k-means, it is important to normalize data prior to modeling in hierarchical clustering.
  • Hierarchical clustering allows k to be well-approximated before running the algorithm by visually inspecting a dendrogram. Dendrograms can be plotted using sklearn.
  • Agglomerative hierarchical clustering cannot be used on large data sets; both space and time complexity are large, and importantly, larger than k-means.
40
Q

What are some validation metrics for regression models?

A
MSPE?
MSAE?
MAE
RMSE
R^2
Adjusted R^2
41
Q

What are some validation metrics for classification models?

A
Precision-Recall
ROC/AUC
Accuracy
Log-loss
F1 score
42
Q

What are some validation metrics for unsupervised models?

A

Rand index

Mutual information

43
Q

What are some other validation metrics?

A

CV error
Heuristic methods to find k
BLEU score (NLP)

44
Q

What is a confusion matrix?

A

A confusion matrix is a tool to evaluate the performance of a classification model. It’s an n x n matrix where n is the number of classes you are predicting.

In the simplest case (binary classification), the matrix has four squares that capture the number of examples that were true positives, false positives, true negatives, and false negatives.
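A minimal sklearn sketch (the labels are hypothetical):

```python
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]] for binary labels ordered (0, 1); here [[3, 1], [1, 3]]
print(confusion_matrix(y_actual, y_predicted))
```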

45
Q

What is a true positive?

A

In a binary classification model, a true positive is an example where the model predicted the positive class (1) and the actual data was from the positive class (1).

46
Q

What is a false positive?

A

In a binary classification model, a false positive is an example where the model predicted the positive class (1) and the actual data was from the negative class (0).

This is also known as Type 1 error.

47
Q

What is a true negative?

A

In a binary classification model, a true negative is an example where the model predicted the negative class (0) and the actual data was from the negative class (0).

48
Q

What is a false negative?

A

In a binary classification model, a false negative is an example where the model predicted the negative class (0) and the actual data was from the positive class (1).

This is also known as Type 2 error.

49
Q

Define recall.

A

Recall = number of actual positives correctly identified / total number of actual positives
= TP / (TP + FN)

Also called sensitivity.

Maximize this if you want the fewest false negatives.

Examples:

  • You want to grant as many loans as possible, so you minimize the number of borrowers wrongly identified as risky
  • Amazon Customer Service: You want to find as many coaching opportunities as possible, so you minimize the number of interactions wrongly identified as lacking coaching opportunities
50
Q

Define specificity.

A

Specificity = number of negatives identified / total number of actual negatives
= TN / (TN + FP)

51
Q

Define precision.

A

Precision = number of true positives / total number of predicted positives
= TP / (TP + FP)

Also called positive predictive value (PPV)

Maximize this when you want the fewest possible false positives.

Examples:

  • You want to correctly identify when individuals have cancer, but you want to minimize the number of individuals you incorrectly diagnose with cancer
  • Amazon Customer Service: You want to correctly identify coaching opportunities, but you also want to minimize the number of interactions when you say there’s a coaching opportunity but there isn’t
52
Q

Define accuracy.

A

Accuracy = the proportion of the total predictions that were correct
= (TP + TN) / (TP + TN + FP + FN)

Most intuitive metric.

Is misleading when classes are imbalanced.

53
Q

Define F1 score

A

F1 score is the harmonic mean of recall and precision.

It tries to balance recall and precision - minimizing all false conclusions (FP, FN)

It’s a good alternative to accuracy when classes are imbalanced.

Examples:

  • When FP and FN are equally harmful or benign. Maybe choosing which YouTube video to automatically play after the current one - labeling a good video as bad or a bad video as good might have approximately the same effect (the user doesn’t watch the next video).
  • Amazon Customer Service: If labeling an interaction as a coaching opportunity when it wasn’t (wastes manager’s time) and missing a coaching opportunity (failing to help customer service associates improve) were equal outcomes.
54
Q

What is a ROC curve?

A
  • A Receiver Operating Characteristic curve is one way of visualizing the performance of a classification model.
  • It plots the false positive rate (1 - specificity) on the x-axis against the true positive rate (recall/sensitivity) on the y-axis.
  • False positive rate = 1 - specificity = FP / (FP + TN)
  • To create the plot, FPR and TPR are calculated at different values of some model parameter, and then plotted. For example, in logistic regression, the model parameter is the probability threshold above which the model gives a positive outcome.
  • The Area Under the ROC Curve (AUC) is a measure of model accuracy. An AUC of 0.5 means the model is no better than random guessing (the ROC curve lies along the 45-degree line). The highest possible AUC is 1, when the curve is pushed as far as possible into the top left corner of the plot.
  • ROC curves can be misleading diagnostics in the case of very imbalanced data sets. In this case, a precision-recall curve is preferred.
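A minimal sklearn sketch (the classifier and data set are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)[:, 1]         # P(y = 1 | x)

# FPR/TPR pairs are computed across many probability thresholds
fpr, tpr, thresholds = roc_curve(y_test, probs)
print(roc_auc_score(y_test, probs))             # area under the ROC curve
```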
55
Q

What is a precision-recall curve?

A
  • Precision-recall curves are like ROC curves, except recall is plotted on the x-axis and precision is plotted on the y-axis.
  • Pairs of (recall, precision) values are calculated at different values of some model parameter. For example, in logistic regression, the model parameter is the probability threshold above which the model gives a positive outcome.
  • The AUC is a measure of model accuracy. For a precision-recall curve, the baseline for a skill-less model is a horizontal line at a precision equal to the proportion of positive examples in the data (not a 45-degree line).
  • The highest possible AUC is 1, when the curve is pushed as far as possible into the upper right hand corner of the plot.
  • This option is better than an ROC curve for imbalanced classes.
56
Q

Define root mean squared error (RMSE).

A

RMSE is a common validation metric for regression models.

It is the standard deviation of the residuals, where residual r_i = y_i, predicted - y_i, actual

RMSE = sqrt( 1/n * sum from 1 to n ((y_i, actual - y_i, predicted)^2))

Smaller values of RMSE are better.

RMSE cannot be smaller than Mean Absolute Error (MAE).
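A quick sketch of the common regression metrics in sklearn (the values are hypothetical):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_actual    = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.5, 5.5, 6.0, 10.0]

mae  = mean_absolute_error(y_actual, y_predicted)          # 0.75
rmse = np.sqrt(mean_squared_error(y_actual, y_predicted))  # ~0.79
r2   = r2_score(y_actual, y_predicted)                     # 0.875
print(mae, rmse, r2)    # note MAE <= RMSE, as expected
```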

57
Q

When should you choose RMSE versus MAE?

A
  • RMSE penalizes large residuals (i.e. it is higher when the data contain large residuals). In other words, being off by 10 is more than twice as bad as being off by 5.
  • RMSE (or MSE) is very popular as a loss (or cost) function because it is easy to differentiate, which matters for algorithms such as gradient descent.
  • However, MAE is more robust to outliers, and so can perform better on noisy data.
  • In addition, MAE can be more interpretable.
58
Q

Define R^2.

A

R^2 is a common validation metric for regression models.

It also can be used for explanatory purposes: it gives an estimate of the amount of variation in the dependent variable (y) that can be explained by the independent variable(s) (x).

It is basically 1 - MSE/variance. So, the higher the mean squared error (MSE), the lower R^2 and the poorer the model.

How to calculate:

  • If the residuals e_i = y_i, actual - y_i, predicted
  • Total sum of squares = SS_tot = the sum of (y_i, actual - mean of y)^2
  • Residual sum of squares = SS_res = sum of (e_i)^2
  • R^2 = 1 - (SS_res / SS_tot)

If R^2 = 0, none of the variance in y is explained by x.
If R^2 = 1, all of the variance in y is explained by x.

59
Q

Define adjusted R^2.

A

Adjusted R^2 is a common validation metric for regression models.

It also can be used for explanatory purposes: it gives an estimate of the amount of variation in the dependent variable (y) that can be explained by the independent variable(s) (x).

It differs from R^2 in that it penalizes including more parameters in your model.

R_adj^2 = 1 - [ (1-R^2)(n-1) / ( n - k - 1) ] where n = the number of observations and k = the number of parameters.

Adjusted R^2 is always <= R^2.

If R_adj^2 = 0, none of the variance in y is explained by x.
If R_adj^2 = 1, all of the variance in y is explained by x.

Larger values of R_adj^2 are better.

60
Q

When should you use R^2 versus adjusted R^2?

A

R^2 will increase as you add parameters to a model whether or not those parameters are informative.

In contrast, adjusted R^2 will increase when you add useful terms, and decrease if you add less useful terms.

So, generally, you should use adjusted R^2 unless your model includes only one term.

61
Q

When should you use adjusted R^2 versus RMSE?

A

It depends on what you need to find out.

RMSE on its own doesn’t actually tell you how good a model is – it only tells you if one model is better than another.

In contrast, adjusted R^2 has meaning even when it isn’t being compared with another option.

The best R^2 value is always 1. On the low end, arbitrarily large negative R^2 values are possible, but this doesn’t usually occur.

62
Q

Explain cross-validation

A

Cross-validation is useful in particular on small data sets. (What constitutes a small data set?)

A small test set means more variation in estimated test error, and therefore it is more difficult to claim that one algorithm works better than another.

Cross-validation allows you to use all the data to estimate an average test error.

63
Q

Explain k-fold cross-validation.

A

K-fold cross-validation is a procedure that allows you to determine testing error. It is especially useful when a dataset is small. It provides a more conservative estimate of testing error when sample sizes are small?

To perform k-fold cross-validation:

  1. Randomly shuffle your rows
  2. Divide your rows into k equal groups
  3. Designate one group as the test set, and use the remaining groups as the training set
  4. Perform any data preparation (e.g. normalization) and parameter tuning using only the training set - if you fit these on the full data set (including the test group), you risk data leakage
  5. Train the model
  6. Calculate the testing error
  7. Repeat from step 3 until you’ve used each of the k groups once as the test set (i.e., each example has been in the test set once and the training set k - 1 times)
  8. Average your testing error across the k iterations. It’s best practice to also calculate a standard deviation.

K-fold cross-validation can be computationally expensive.
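A minimal sklearn sketch; wrapping the scaler and model in a Pipeline ensures the preprocessing is re-fit on each training fold only, avoiding the leakage mentioned in step 4 (the data set is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # steps 1-2: shuffle, k groups

scores = cross_val_score(model, X, y, cv=cv)           # steps 3-7, one score per fold
print(scores.mean(), scores.std())                     # step 8
```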

64
Q

Explain Principal Component Analysis (PCA).

A

Principal component analysis (PCA) is an analytical (non-iterative) method to reduce the dimensionality (or number of features) in a data set while maintaining as much of the variation present in the data set as possible.

The principal components are the new, reduced features produced by the analysis. They are the eigenvectors of the covariance matrix of the original features, and hence are orthogonal (independent).

The first principal component captures the most variation from the original data set, the second captures the second most, and so on.

PCA only works well on scaled data. Relationships between features are assumed to be linear.

PCA is commonly used to compress big data while losing as little important information as possible, and to visualize high-dimensional data (especially for unsupervised learning applications). It should NOT be used as a fix for overfitting; use regularization instead, since PCA discards information without ever looking at the target labels.
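A minimal sklearn sketch (scaling first, as noted above; the data set and number of components are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA only works well on scaled data

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # 4 features -> 2 principal components
print(pca.explained_variance_ratio_)           # share of variance captured by each PC
```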

65
Q

What might you do with data where the classes are very imbalanced?

A

There are several techniques for handling classification problems when there are imbalanced classes. Some include:

  • Selecting appropriate metrics (precision, recall or F1 rather than accuracy)
  • Oversampling instances of the minority class or undersampling instances of the majority class. Note that oversampling can result in overfitting because it can produce duplicate instances. SMOTE avoids this by creating new minority class instances by combining existing ones (SMOTE resource: https://beckernick.github.io/oversampling-modeling/). On the other hand, undersampling can leave out important instances (reduces the overall amount of data)
  • In extreme cases, it can be good to consider classification in the context of anomaly detection (anomaly detection algorithms include clustering methods, one-class SVMs and isolation forests)
66
Q

Explain covariance, esp in the context of a covariance matrix.

A

Covariance is the degree to which corresponding elements from two features tend to move in the same direction. For example, if two of your features to predict the temperature of the asphalt in a neighborhood parking lot were air temperature and amount of light, you would expect those to covary.

Hence, a covariance matrix captures which features are relatively redundant (i.e. they have high covariance) and which are information-rich (i.e. they have low covariance).

67
Q

What are eigenvectors (in the context of PCA)?

A

Eigenvectors are vectors that do not change direction when transformed (multiplied) by the covariance matrix.

They may, however, change size, which is indicated by the eigenvalue.

They represent the principal axes of maximum variance.

The eigenvalues provide the order of importance of these axes (first principal component, second principal component, etc.)

68
Q

How might you choose k for k-fold cross-validation?

A

It is common practice to choose k = 5 or k = 10; a value of 10 is a reasonable default if you’re unsure.

It’s preferable to choose k such that your groups have equal (or nearly equal) numbers of examples.

As k gets larger, the difference in size between the training set and the resampling subsets gets smaller. As this difference decreases, the bias of the technique becomes smaller.

To summarize, there is a bias-variance trade-off associated with the choice of k in k-fold cross-validation. Typically, given these considerations, one performs k-fold cross-validation using k = 5 or k = 10, as these values have been shown empirically to yield test error rate estimates that suffer neither from excessively high bias nor from very high variance.

69
Q

Explain bias.

A

Biases are the simplifying assumptions made by a model to make the target function easier to learn.

Generally, linear algorithms have a high bias, making them fast to learn and easier to understand but generally less flexible. In turn, they have lower predictive performance on complex problems that fail to meet the simplifying assumptions of the algorithm’s bias.

Models with high bias are prone to underfitting.

Models with high bias: Linear regression, linear discriminant analysis, logistic regression
Models with low bias: Decision Trees, k-Nearest Neighbors and Support Vector Machines

70
Q

Explain variance (in the context of bias-variance trade-off).

A

Variance is the amount by which the model would change if different training data were used, i.e., how responsive the model is to new training examples.

All models should have some variance, but a good model should not change too much from one training data set to the next, because it’s good at identifying underlying patterns and correctly mapping between inputs and outputs.

Machine learning algorithms that have a high variance are strongly influenced by the specifics of the training data. They are prone to overfitting.

Low variance models: Linear Regression, Linear Discriminant Analysis and Logistic Regression.
High variance models: Decision Trees, k-Nearest Neighbors and Support Vector Machines.

71
Q

Explain the bias-variance trade-off.

A

The aim of all ML models is to achieve both low bias and low variance. But as bias decreases, variance tends to increase, and vice versa. So the goal is to find a model that is responsive to the training data, but not too responsive.

Examples:

  • The k-nearest neighbors algorithm has low bias and high variance, but the trade-off can be changed by increasing the value of k, which increases the number of neighbors that contribute to the prediction and, in turn, increases the bias of the model.
  • The support vector machine algorithm has low bias and high variance, but the trade-off can be changed by tuning how many violations of the margin are allowed in the training data (the C parameter): allowing more violations increases the bias but decreases the variance.
72
Q

What is overfitting, and how do you diagnose it?

A

Overfitting is when a model is too specific to the training data. You can identify this when the training error is low but the testing error is much higher.

73
Q

Data cleaning check-list

A

Missing values - sklearn will throw an error if you try to train a model on data with missing values. (Options: drop column (possibly better than imputation if > half values are missing), imputation/fillna, imputation + new column indicating which values were missing)

Encode non-numeric variables - sklearn expects numeric values in columns (Options: drop column, ordinal encoding (each value gets its own integer; for variables with inherent order), one-hot-encoding (each value gets its own column; does not assume order; does not work well if the variable takes > 15 values)).

Things to consider: missing values in new cols in validation/test data; new options for categorical variables in validation/test data
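A minimal sklearn sketch of these options (the column names and toy DataFrame are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "rooms": [3, None, 4],                    # numeric column with a missing value
    "city":  ["Austin", "Boise", "Austin"],   # non-numeric column to encode
})

prep = ColumnTransformer([
    ("impute", SimpleImputer(strategy="median"), ["rooms"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # tolerates new categories in val/test
])
print(prep.fit_transform(df))
```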

74
Q

What is ensemble learning?

A

Ensemble learning is a general approach to machine learning that combines the predictions from multiple models to improve performance. The idea is that a set of weak learners can come together to produce one strong learner.

The three main ensemble learning strategies are
1. Bagging
2. Stacking
3. Boosting

75
Q

Explain the bagging learning strategy.

A

Bagging is short for “bootstrap aggregation.” It creates a diverse ensemble of models by varying the training data. A single algorithm is typically used, usually a decision tree. Each decision tree is trained on a subset of the training data, produced by sampling the rows with replacement (i.e. bootstrapping). The results from each decision tree are either averaged or counted (voted) to form the final model output.

Random forest is the obvious example of an ensemble method that uses bagging. (RF expands on basic bagging by selecting a subset of features to split on for each split in each tree). Extra trees is another example.

76
Q

Explain the stacking learning strategy.

A

In stacking, a diverse ensemble is created by varying the types of models used. A stacking model typically has two levels.

Level 0 contains all of the models that make predictions on the training data. It is desirable to use a wide variety of models with different assumptions on this level.

Level 1 contains the model that aggregates the predictions into a final answer (i.e., it is trained on the predictions from the level 0 models). The level 1 model is often simple, such as linear or logistic regression. This encourages the complexity of the model to reside in level 0.

Examples of stacking algorithms include Stacked models, Blending, and Super Ensemble.

77
Q

Explain the boosting learning strategy.

A

Boosting creates a diverse ensemble by sequentially adding models that focus on examples that were not well-classified by the other ensemble members.

Typically, this involves the use of very simple decision trees that are added to the model sequentially. Training examples for each model are weighted to indicate whether they were accurately classified by the preceding models. Ensemble output is aggregated through weighted averaging or voting.

Examples of boosting algorithms include AdaBoost, Gradient Boosting Machines, and Stochastic Gradient Boosting Machines (e.g. XGBoost, LightGBM).

These algorithms are currently among the most successful on tabular data.

78
Q

Explain AdaBoost.

A

AdaBoost was the first widely successful boosting algorithm. It uses a set of single-split decision trees (“decision stumps”) added sequentially. Difficult-to-classify examples receive larger and larger weights until the algorithm identifies a stump that can classify them. The final outcome is an average of the outcome for each decision stump, weighted by that stump’s accuracy.

AdaBoost is most successful in binary classification applications.

Three things to remember about AdaBoost:
1. AdaBoost combines a lot of weak learners (stumps) to make decisions.
2. Some stumps get more say in the classification than others.
3. Each stump is trained taking the previous stump’s mistakes into account.

Reference: https://www.youtube.com/watch?v=LsK-xG1cLYA

79
Q

Explain Gradient Boosting.

A

Gradient Boosting Machines (GBMs) re-cast boosting as a numerical optimization problem where the goal is to minimize a loss function and new trees (or new “weak learners”) are added via a gradient descent-like procedure. New learners are added one at a time, and the existing weak learners are not updated.

Unlike prior boosting algorithms, GBMs can use any differentiable loss function. This expanded the types of problems that could be solved via boosting beyond binary classification.

Decision trees are used as the weak learners in GBMs. These trees are typically constrained (i.e. in their depth), and they learn in a greedy way. Each subsequent decision tree is designed to “correct” large residuals from the previous set of decision trees. A tree is added and then the parameters are tuned such that it minimizes the overall loss of the ensemble.

Training stops when
1. a fixed number of trees have been added OR
2. loss reaches an acceptable level OR
3. performance no longer improves on an external validation set.
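A minimal sketch using sklearn's GradientBoostingClassifier (one implementation among several; the parameter values are illustrative and map onto the ideas above: number of trees, constrained depth, learning rate, early stopping):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=200,      # number of sequentially added trees (weak learners)
    max_depth=3,           # constrain each tree so it stays "weak"
    learning_rate=0.1,     # shrinkage applied to each tree's contribution
    n_iter_no_change=10,   # stop early if a held-out fraction stops improving
    random_state=0,
)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))
```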

80
Q

Describe the 4 main enhancements to basic gradient boosting.

A
  1. Adding tree constraints - It’s important that each decision tree have some skill, but remain weak overall. To do this, we can tune the number of trees (keep adding trees until improvement is no longer observed), tree depth (4-8 levels), the number of leaves, the minimum number of training observations per split, and the minimum improvement to loss per split.
  2. Weighted updates - The predictions of each tree are added together sequentially, and the contribution of each tree to this sum can be weighted by a learning rate. Smaller learning rates require more trees. “Shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.” Typical learning rates are 0.1-0.3, or even smaller than 0.1.
  3. Stochastic gradient boosting - Instead of each learner being fit on the full training data, in SGB, a subset of the data is randomly selected (without replacement). This can include a subset of rows before creating each tree, a subset of columns before creating each tree, or a subset of columns before creating each split. Aggressive sub-sampling - such as 50% of the data - has been shown to be the most effective.
  4. Penalized gradient boosting - Classical decision trees like CART are not usually used as weak learners. Instead, regression trees are typically used, which have numerical values for each leaf. These leaf values can be regularized using usual L1 and L2 functions. This helps avoid over-fitting.
81
Q

What is XGBoost?

A

Extreme Gradient Boosting or XGBoost is an efficient (i.e. fast) and effective (i.e. accurate) open-source implementation of the gradient boosting algorithm. It is the implementation that really caught on with the ML community, and it is a go-to method and often part of the winning solution in ML competitions.

Because randomness is involved in model training, a slightly different model will be created each time the model is trained. Because of this, it is best to evaluate model performance over multiple runs or across multiple rounds of cross-validation (e.g. repeated stratified k-fold cross-validation).

Consider tuning:
- Number of trees
- Tree depth
- Learning rate
- Number of samples
- Number of features

Code example: https://machinelearningmastery.com/extreme-gradient-boosting-ensemble-in-python/

82
Q

What is LightGBM?

A

LightGBM is an efficient (i.e. fast) and effective (i.e. accurate) open-source implementation of the gradient boosting algorithm. It extends traditional gradient boosting by adding a type of automatic feature selection (EFB) and by focusing on boosting examples with larger gradients (GOSS). This can result in a dramatic speed-up of training. Like XGBoost, it is a state-of-the-art model for problems on tabular data, and is a staple of winning solutions in ML competitions.

Consider tuning:
- Number of trees
- Tree depth
- Learning rate
- Boosting type

Code example: https://machinelearningmastery.com/light-gradient-boosted-machine-lightgbm-ensemble/

83
Q

Explain EFB in LightGBM.

A

EFB, or Exclusive Feature Bundling, is an addition to the Gradient Boosting Machine algorithm as implemented in LightGBM. It is a method to automatically reduce features and it can dramatically speed up training. It does this by combining (bundling) features that are sparse (mostly zero) and exclusive (they are never non-zero in the same place).

Example:
F1 -> [0, 0, 1, 0, 0, 2]
F2 -> [3, 3, 0, 0, 0, 0]
F1 and F2 bundled -> [3, 3, 1, 0, 0, 2]

84
Q

Explain GOSS in LightGBM.

A

GOSS, or Gradient-based One-Side Sampling is an addition to the Gradient Boosting Machine algorithm as implemented in LightGBM. It focuses attention on training examples that result in a larger gradient, and it can dramatically speed up training.

85
Q

Explain OOB in Random Forest

A

The Out-Of-Bag score, or OOB, is calculated for free as part of the random forest algorithm. Each decision tree in the ensemble is trained on a bootstrapped subset of the data, meaning that not all examples are given to all decision trees. If an example is not given to a particular decision tree, it’s considered out-of-bag for that tree.

At the end of training, for each example, a prediction is made by all the trees for which the example was out-of-bag. The OOB score is the fraction of these out-of-bag predictions that are correct.

Reference: https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710
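A minimal sklearn sketch (the data set is illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# oob_score=True scores each tree on its out-of-bag examples during fit
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # accuracy estimated from out-of-bag predictions only
```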

86
Q

How is OOB different from test/validation scores in Random Forest, and when should it be used?

A

The Out-Of-Bag score is computed on only a subset of examples and trees (the out-of-bag examples and the trees for which those examples were out-of-bag). Therefore, it is better to use metrics evaluated on a test/validation set, where all examples are evaluated by all decision trees in the ensemble.

However, OOB score may be used in cases where very little data is available for training, and you don’t want to split the data into training/test/val sets.

Reference: https://towardsdatascience.com/what-is-out-of-bag-oob-score-in-random-forest-a7fa23d710

87
Q

What are Shapley values and how are they calculated?

A

Shapley values are a way to measure how much each feature contributes to a model’s prediction.

To calculate the Shapley value for feature f:
1. Create all possible combinations of features (excluding feature f). These sets of features are called “coalitions”.
2. Calculate the average prediction of the model across all examples.
3. For each coalition, calculate how different the prediction is from the average prediction WITH feature f.
4. For each coalition, calculate how different the prediction is from the average prediction WITHOUT feature f.
  5. Calculate the “marginal contribution” of feature f for that coalition: the prediction WITH feature f minus the prediction WITHOUT feature f (step 3 - step 4)
6. The Shapley value for feature f is the average marginal contribution of f across all coalitions.

In practice, this algorithm’s run time increases exponentially with the number of features. Instead, SHapley Additive exPlanations (SHAP values) are used.

Reference: https://www.aidancooper.co.uk/how-shapley-values-work/
Guide to interpreting SHAP analyses: https://www.aidancooper.co.uk/a-non-technical-guide-to-interpreting-shap-analyses/?xgtab&