Definitions and FAQ Flashcards
In regard to why a GLM can’t handle collinearity
[Collinearity] can lead to large standard errors for the coefficient estimates and makes the coefficient estimates difficult to interpret, because interpreting an estimate as the change in the target mean with the other variables held constant is not meaningful when the other variables cannot be kept fixed.
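A minimal R sketch with simulated data (all names made up) showing the inflated standard errors:

```r
# Two nearly collinear predictors; x2 is almost a copy of x1
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)
y  <- 2 * x1 + rnorm(n)

summary(lm(y ~ x1))        # x1 alone: small standard error
summary(lm(y ~ x1 + x2))   # x1 and x2 together: both standard errors blow up
```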
In regard to how GLMs/trees handle numeric vs factor variables
A GLM will assign a single coefficient to each of these variables and assume that they have a monotonic effect on [the target variable]. Moreover, consecutive changes in these variables have the same marginal impact on [the target] (e.g., increasing [numeric variable] from 0 to 1 has the same effect as 1 to 2). These restrictions are lifted in decision trees, but their splits will have to respect the ordered nature of the variable values, e.g., a single split with [numeric variable] = 0, 10, 30 in one branch and [numeric variable] = 20, 40, 50 in the other is impermissible.
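A rough sketch of the contrast with simulated data (variable names are illustrative):

```r
library(rpart)

# x takes the values 0, 10, ..., 50; the signal is deliberately non-monotonic
set.seed(1)
dat <- data.frame(x = sample(c(0, 10, 20, 30, 40, 50), 300, replace = TRUE))
dat$y <- ifelse(dat$x %in% c(0, 10, 30), 5, 10) + rnorm(300)

coef(glm(y ~ x, data = dat))   # the GLM assigns a single slope to x
rpart(y ~ x, data = dat)       # every printed split has the form x < c (ordered cuts only)
```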
And why converting to a factor is a good idea
Although their values have a numerical order, converting these numeric variables into factor variables gives our predictive model much more flexibility to capture the effects of these variables on [target variable] across different parts of their ranges. This has the potential to increase prediction accuracy, at the cost of a higher risk of overfitting and a heavier computational burden (these variables have to be represented by a large number of dummy variables in a GLM, and the number of possible splits for a decision tree also increases substantially).
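Continuing the simulated example above (names still illustrative), converting to a factor trades one slope for a dummy variable per level:

```r
set.seed(1)
dat <- data.frame(x = sample(c(0, 10, 20, 30, 40, 50), 300, replace = TRUE))
dat$y <- ifelse(dat$x %in% c(0, 10, 30), 5, 10) + rnorm(300)

length(coef(glm(y ~ x, data = dat)))           # 2: intercept plus one slope
length(coef(glm(y ~ factor(x), data = dat)))   # 6: intercept plus 5 dummy coefficients
```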
Why decision trees are prone to overfitting
Even with pruning, a base decision tree is susceptible to overfitting and tends to produce predictions with large variance due to the sequential (or recursive) fashion in which the tree splits are made, with each split operating on the results of the previous splits. The effect of a poorly chosen early split caused by noise in the training data will cascade through the rest of the fitted tree. This explains why base trees are generally sensitive to small changes in the training data.
Why we don’t need to binarize for decision trees
Decision trees don’t handle factor variables by converting them to dummy variables. Instead, they split the factor levels into two groups directly. Binarizing the factor variables in advance would impose the restriction that each tree split has to be based on one and only one factor level.
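A small sketch with made-up data showing a tree grouping factor levels directly:

```r
library(rpart)

set.seed(1)
dat <- data.frame(region = factor(sample(LETTERS[1:5], 400, replace = TRUE)))
dat$y <- ifelse(dat$region %in% c("A", "C"), 3, 8) + rnorm(400)

# The printed split lists the levels sent to each branch, e.g. region = A,C vs B,D,E,
# rather than a cut on a single dummy variable
rpart(y ~ region, data = dat)
```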
Why Decision Trees do not need an interaction
Interactions are automatically captured when fitting decision trees because later splits operate recursively on the results of earlier splits and therefore apply only to part of the feature space.
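A quick simulated illustration (names are made up): the tree recovers an x1:x2 interaction without it ever being specified.

```r
library(rpart)

set.seed(7)
dat <- data.frame(x1 = runif(500), x2 = runif(500))
dat$y <- with(dat, ifelse(x1 > 0.5 & x2 > 0.5, 10, 0) + rnorm(500))

# The second variable is split only inside one branch of the first split,
# i.e. its effect is allowed to differ across regions of the feature space
rpart(y ~ x1 + x2, data = dat)
```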
Explaining Cross Validation
An alternative is to estimate the cp parameter using cross-validation. This approach involves dividing the training data into folds, training the model on all but one of the folds, and measuring performance on the remaining fold. This process is repeated with each fold held out in turn, producing a distribution of performance values for a given cp value. The cp value that yields the best cross-validated performance is then selected.
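A sketch of how this looks with rpart, which runs the cross-validation internally (xval = 10 by default); train_data and target are placeholders:

```r
library(rpart)

fit <- rpart(target ~ ., data = train_data, method = "anova",
             control = rpart.control(cp = 0.001, xval = 10))

printcp(fit)   # table of candidate cp values with cross-validated error (xerror)
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```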
Explaining AUC and ROC
A classification tree does not return an exact prediction of which group a new individual belongs to, but rather a probability. Specifically, in our trees the predictions are the probability of being in the high value category. For example, in the single tree, if you end up at the third node from the right in the bottom row, the result is a probability of 47% of being high value. If the cutoff for being high value is 50%, then these observations would be predicted as low value. But suppose we set the cutoff for predicting high value at 40%; then these observations would be predicted to be high value. This leads to tradeoffs: predicting more high value customers will lead to more errors on those predictions, but fewer errors when predicting low value. The ROC curve shows the sensitivity and specificity at various probability cutoffs. Curves that bend toward the upper left of the square represent greater accuracy, and hence the area under the curve (AUC) is an overall measure of accuracy.
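A sketch using the pROC package (the data frames, the high_value column, and its coding are assumptions for illustration):

```r
library(rpart)
library(pROC)

fit   <- rpart(high_value ~ ., data = train_data, method = "class")
probs <- predict(fit, newdata = test_data, type = "prob")[, 2]   # column 2 = probability of the second level, assumed to be "high value"

roc_obj <- roc(response = test_data$high_value, predictor = probs)
plot(roc_obj)   # sensitivity vs. specificity across all probability cutoffs
auc(roc_obj)    # area under the ROC curve
```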
Why removing multiple features at once is not a good idea
It is rarely a good idea to remove multiple features at one time, as the removal of one feature can lead to others becoming important. One of the easiest ways to remove features is to apply a procedure such as stepAIC. It is automated and keeps removing features, one at a time, until no further removal improves the selected measure.
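A sketch with MASS::stepAIC (model and data names are placeholders):

```r
library(MASS)

full_model    <- glm(target ~ ., data = train_data)
reduced_model <- stepAIC(full_model, direction = "backward")   # drops one feature per step, refitting each time
summary(reduced_model)
```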
How a transformation (of the target variable) impacts a decision tree
With a right-skewed distribution, higher values will have greater influence over our decision tree model because splits are chosen to minimize the sum of squared errors. For each group created by a split, the sum of squared errors adds up the squared differences between the observed pedestrians and the mean pedestrians for the group. In a right-skewed distribution, the differences tend to be larger in magnitude for higher values than for lower values, so they contribute more to the sum. Transforming the target variable changes the sum of squared errors calculation, so the following items that rely on that calculation are also affected (see the sketch after this list):
- the location of splits
- the number of observations in each leaf
- the predicted values for each leaf – with a right-skewed distribution the predictions tend to be higher than they would be with a less-skewed distribution
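A sketch of the comparison (train_data is a placeholder; this assumes pedestrians is strictly positive, otherwise log1p() could be used):

```r
library(rpart)

tree_raw <- rpart(pedestrians ~ ., data = train_data, method = "anova")
tree_log <- rpart(log(pedestrians) ~ ., data = train_data, method = "anova")

tree_raw   # splits and leaf means are pulled toward the large observations
tree_log   # split locations, leaf sizes, and leaf predictions all change
# predictions from tree_log are on the log scale; apply exp() to return to the original scale
```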
Hierarchical clustering
Hierarchical clustering is a bottom-up clustering method. It starts with the individual observations in the data each treated as a separate cluster and successively fuses the closest pair of clusters, one pair at a time. The process continues iteratively until all clusters are eventually fused into a single cluster containing all of the observations. The result is a dendrogram, a tree-based representation of the hierarchy of clusters formed. From a dendrogram, we can determine the clusters by making a horizontal cut across it; the resulting clusters are the distinct branches immediately below the cut.
Measuring the distance (or dissimilarity) between two clusters, at least one of which contains two or more observations, requires a linkage. With complete linkage, which is the default of the hclust() function in R, the distance between two clusters is the maximal pairwise distance (usually measured in the Euclidean manner) between observations in one cluster and observations in the other cluster with respect to a given set of predictors.
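A sketch with base R (numeric_data is a placeholder for a data frame of numeric features):

```r
X  <- scale(numeric_data)                    # standardize so no single feature dominates the distances
hc <- hclust(dist(X), method = "complete")   # Euclidean distances, complete linkage (the hclust default)

plot(hc)                        # dendrogram
clusters <- cutree(hc, k = 3)   # horizontal cut producing, say, 3 clusters
table(clusters)
```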
Regularization
Regularized regression (which includes ridge regression, the lasso, and the elastic net) is an alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the deviance of the GLM (or equivalently, the negative of the training loglikelihood) and minimizing this penalized objective function to obtain the coefficient estimates. The objective function balances the goodness of fit of the model on the training data against the complexity of the model. The regularization penalty has the effect of shrinking the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases (under the lasso and elastic net), the effect of regularization is so strong that the coefficient estimates of these features are forced to be exactly zero, leading to a simpler model and potentially improved prediction accuracy.
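A sketch with the glmnet package (train_data and target are placeholders):

```r
library(glmnet)

X <- model.matrix(target ~ ., data = train_data)[, -1]   # glmnet needs a numeric design matrix (drop the intercept column)
y <- train_data$target

fit <- glmnet(X, y, family = "gaussian", alpha = 0.5)   # alpha = 0 ridge, alpha = 1 lasso, in between elastic net
coef(fit, s = 0.1)   # coefficients at lambda = 0.1; weaker features may be exactly zero
```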
Lambda (regularization)
lambda is the shrinkage parameter that can be used to index the complexity of an elastic net. As the value of lambda increases, the shrinkage penalty becomes heavier and the coefficient estimates become smaller (in magnitude) in general, leading to a decreased variance but an increased bias. Overall, the elastic net becomes less complex and less prone to overfitting.
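In practice lambda is usually chosen by cross-validation, e.g. with cv.glmnet (X and y as in the sketch above):

```r
library(glmnet)

cv_fit <- cv.glmnet(X, y, alpha = 0.5, nfolds = 10)
plot(cv_fit)                     # cross-validated error across the lambda grid
cv_fit$lambda.min                # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.1se")   # a more heavily regularized (simpler) alternative
```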
Random Forest (bagging)
A random forest is an ensemble method that overcomes this problem by constructing many decision trees in parallel, each fitted to a bootstrapped version of the training data, and averaging the results of these trees to form the overall prediction. The averaging goes a long way towards reducing the variance of the model predictions and preventing overfitting, especially when a large number of bootstrapped trees is used.
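A sketch with the randomForest package (data names are placeholders):

```r
library(randomForest)

rf_fit <- randomForest(target ~ ., data = train_data, ntree = 500, importance = TRUE)
preds  <- predict(rf_fit, newdata = test_data)   # average of the 500 trees' predictions
varImpPlot(rf_fit)                               # which features drive the averaged predictions
```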
mtry (random forest)
mtry is the number of features randomly sampled as candidates at each split. This randomization of features has the effect of reducing the correlation between the predictions of the different trees, further reducing the variance of the overall predictions. However, making mtry too small may impose severe restrictions on the tree-growing process, so this hyperparameter needs to be tuned carefully, typically by cross-validation.
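A sketch of tuning mtry by cross-validation with caret (the grid and data names are placeholders):

```r
library(caret)

rf_cv <- train(target ~ ., data = train_data,
               method    = "rf",
               trControl = trainControl(method = "cv", number = 5),
               tuneGrid  = expand.grid(mtry = c(2, 4, 6, 8)),
               ntree     = 200)
rf_cv$bestTune   # the mtry value with the best cross-validated performance
```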