Definitions and FAQ Flashcards

1
Q

Regarding why a GLM can’t handle collinearity

A

[Collinearity] can lead to large standard errors of the coefficient estimates and make the interpretation of the coefficient estimates difficult, because interpreting the estimates as changes in the target mean with other variables held constant is not meaningful: the other variables cannot be kept fixed.

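A minimal sketch of how one might check for collinearity before fitting a GLM; the data frame train and the variable names below are hypothetical:

  # Pairwise correlations among numeric predictors (values near +/-1 signal collinearity)
  num_vars <- c("age", "income", "tenure")
  round(cor(train[, num_vars]), 2)

  # Variance inflation factors (car package) give a model-based check;
  # VIFs well above 5-10 are a common rule-of-thumb warning sign
  library(car)
  fit <- glm(target ~ age + income + tenure, data = train, family = gaussian())
  vif(fit)
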
2
Q

Regarding how GLMs/trees handle numeric vs. factor variables

A

A GLM will assign a single coefficient to each of these variables and assume that they have a monotonic effect on [the target variable]. Moreover, consecutive changes in these variables have the same marginal impact on [the target] (e.g., increasing [numeric variable] from 0 to 1 has the same effect as increasing it from 1 to 2). These restrictions are lifted in decision trees, but their splits will have to respect the ordered nature of the variable values, e.g., a single split with [numeric variable] = 0, 10, 30 in one branch and [numeric variable] = 20, 40, 50 in another is impermissible.

3
Q

Why converting these numeric variables to factors can be a good idea

A

Although their values have a numerical order, converting these numeric variables into factor variables gives our predictive model much more flexibility to capture the effects of these variables on [target variable] across different parts of their ranges. This has the potential to increase prediction accuracy, at the cost of a higher risk of overfitting and a heavier computational burden (these variables have to be represented by a large number of dummy variables in a GLM, and the number of possible splits when growing a decision tree also increases substantially).

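A minimal sketch of the conversion, assuming a data frame train with a numeric variable var recorded in coarse steps (both names are placeholders):

  # As a numeric input, var enters a GLM with a single slope coefficient
  glm_num <- glm(target ~ var, data = train, family = gaussian())

  # As a factor, each level gets its own dummy-variable coefficient,
  # adding flexibility at the cost of more parameters
  train$var_fac <- as.factor(train$var)
  glm_fac <- glm(target ~ var_fac, data = train, family = gaussian())
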
4
Q

Why decision trees are prone to overfitting

A

Even with pruning, a base decision tree is susceptible to overfitting and tends to produce predictions with large variance due to the sequential (or recursive) fashion in which the tree splits are made, with each split operating on the results of the previous splits. The effect of a poorly chosen split early on due to noise in the training data will cascade over the rest of the fitted tree. This explains why base trees are generally sensitive to small changes in the training data.

5
Q

Why we don’t need to binarize for decision trees

A

Decision trees don’t handle factor variables by converting them to dummy variables. Instead, they split factor levels into two groups directly. Binarizing the factor variables in advance would impose the restriction that each tree split can be based on one and only one factor level.

6
Q

Why Decision Trees do not need an interaction

A

Interactions are automatically captured when fitting decision trees because subsequent splits work recursively on the results of previous splits and therefore apply to only part of the feature space.

7
Q

Explaining Cross Validation

A

An alternative is to estimate the cp parameter using cross-validation. This approach involves dividing the training data into folds, training the model on all but one of the folds, and measuring the performance on the remaining fold. The process is repeated so that each fold serves once as the holdout set, producing a distribution of performance values for a given cp value. The cp value that yields the best cross-validated performance is then selected.

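In R's rpart package this cross-validation is built in; a sketch, with the formula and data as placeholders:

  library(rpart)

  # Grow a large tree; rpart cross-validates internally (10 folds here)
  fit <- rpart(target ~ ., data = train,
               control = rpart.control(cp = 0.001, xval = 10))

  printcp(fit)   # cp table with cross-validated error (xerror) for each cp value
  plotcp(fit)    # visual aid for choosing cp
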
8
Q

Explaining AUC and ROC

A

A classification tree does not return an exact prediction of which group a new individual belongs to, but rather a probability. Specifically, in our trees the predictions are the probability of being in the high value category. For example, in the single tree, if you end up at the third node from the right in the bottom row, the result is a probability of 47% of being high value. If the cutoff for being high value is 50%, then these observations would be predicted as low value. But suppose we set the criterion for predicting high value at 40%. Then these observations would be predicted to be high value. This leads to tradeoffs: predicting more high value customers will lead to more errors on those predictions, but fewer errors when predicting low value. The ROC curve shows the sensitivity and specificity at various probability cutoffs. Curves that bend toward the upper left of the square represent greater accuracy, and hence the area under the curve (AUC) is an overall measure of accuracy.

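A sketch of computing the ROC curve and AUC with the pROC package, assuming pred_prob holds predicted probabilities of the high value class and test$high_value holds the observed labels (both names are placeholders):

  library(pROC)

  roc_obj <- roc(response = test$high_value, predictor = pred_prob)
  plot(roc_obj)   # curves bending toward the upper left indicate better discrimination
  auc(roc_obj)    # area under the ROC curve
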
9
Q

Why removing multiple features at once is not a good idea

A

It is rarely a good idea to remove multiple features at one time, as the removal of one feature can lead to others becoming important. One of the easiest ways to remove features is to apply a procedure such as stepAIC. It is automated and keeps removing features until all that remain are significant by the selected measure.

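A minimal sketch of backward selection with stepAIC from the MASS package, assuming a fitted GLM named full_model:

  library(MASS)

  # Remove one feature at a time, refitting after each removal,
  # until no further removal improves the AIC
  reduced_model <- stepAIC(full_model, direction = "backward")
  summary(reduced_model)
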
10
Q

How a transformation (of the target variable) impacts a decision tree

A

With a right-skewed distribution, higher values will have greater influence over our decision tree model because splits are chosen to minimize the sum of squared errors. For each group created by a split, the sum of squared errors adds up the squared differences between the observed pedestrian counts and the mean pedestrian count for the group. In a right-skewed distribution, these differences tend to be larger in magnitude for higher values than for lower values, so they contribute more to the sum. Transforming the target variable changes the sum of squared errors calculation, so the following items that rely on that calculation are also affected:

  1. the location of splits
  2. the number of observations in each leaf
  3. the predicted values for each leaf – with a right-skewed distribution the predictions tend to be higher than they would be with a less-skewed distribution
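
One common response is to fit the tree to a log-transformed target, which dampens the influence of the large values; a sketch, with train, test, and the pedestrians target as placeholders:

  library(rpart)

  # Fit to the log of the (strictly positive) target; use log1p()/expm1() if zeros occur
  fit_log <- rpart(log(pedestrians) ~ ., data = train)

  # Back-transform leaf predictions to the original scale
  pred <- exp(predict(fit_log, newdata = test))
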
11
Q

Hierarchical clustering

A

Hierarchical clustering is a bottom-up clustering method. It starts with the individual observations in the data each treated as a separate cluster and successively fuses the closest pair of clusters, one pair at a time. The process goes on iteratively until all clusters are eventually fused into a single cluster containing all of the observations. The result is a dendrogram, a tree-based representation of the hierarchy of clusters formed. From a dendrogram, we can determine the clusters by making a horizontal cut across the dendrogram. The resulting clusters formed are the distinct branches immediately below the cut.

To measure the distance (or dissimilarity) between two clusters, at least one of which has two or more observations, requires a linkage. With complete linkage, which is the default of the hclust() function in R, the distance between two clusters is the maximal pairwise distance (usually measured in the Euclidean manner) between observations in one cluster and observations in the other cluster with respect to a given set of predictors.

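A sketch in R, assuming a data frame X of numeric features (standardized so that no single feature dominates the Euclidean distances):

  d  <- dist(scale(X))                   # pairwise Euclidean distances
  hc <- hclust(d, method = "complete")   # complete linkage (the hclust default)

  plot(hc)                        # dendrogram
  clusters <- cutree(hc, k = 3)   # horizontal cut yielding 3 clusters
  table(clusters)
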
12
Q

Regularization

A

Regularized regression (which includes ridge regression, lasso, and elastic net) is an alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the deviance of the GLM (or equivalently, the negative of the training log-likelihood) and minimizing this penalized objective function to get the coefficient estimates. The objective function balances the goodness of fit of the model on the training data against the complexity of the model. The regularization penalty has the effect of shrinking the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the effect of regularization is so strong that the coefficient estimates of these features are forced to be exactly zero, leading to a simpler model and potentially improved prediction accuracy.

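A sketch with the glmnet package, which expects a numeric design matrix; the formula, data, and lambda value are placeholders:

  library(glmnet)

  # Dummy-coded design matrix (drop the intercept column) and target vector
  X <- model.matrix(target ~ ., data = train)[, -1]
  y <- train$target

  # Elastic net: alpha = 0.5 mixes the ridge (alpha = 0) and lasso (alpha = 1) penalties
  fit <- glmnet(X, y, family = "gaussian", alpha = 0.5)
  coef(fit, s = 0.1)   # coefficients at lambda = 0.1; some may be exactly zero
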
13
Q

Lambda (regularization)

A

lambda is the shrinkage parameter that can be used to index the complexity of an elastic net. As the value of lambda increases, the shrinkage penalty becomes heavier and the coefficient estimates become smaller (in magnitude) in general, leading to a decreased variance but an increased bias. Overall, the elastic net becomes less complex and less prone to overfitting.

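lambda is typically chosen by cross-validation; a sketch with cv.glmnet, reusing the X and y from the previous card:

  library(glmnet)

  set.seed(42)
  cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 0.5, nfolds = 10)

  cv_fit$lambda.min   # lambda with the smallest cross-validated error
  cv_fit$lambda.1se   # largest lambda within one standard error (a simpler model)
  plot(cv_fit)        # cross-validated error as a function of log(lambda)
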
14
Q

Random Forest (bagging)

A

A random forest is an ensemble method that overcomes the high variance of a single decision tree by constructing a large number of decision trees in parallel, each fitted to a bootstrapped version of the training data, and averaging the results of these trees to form the overall prediction. The averaging goes a long way towards reducing the variance of the model predictions and preventing overfitting, especially when there is a large number of bootstrapped trees.

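A sketch with the randomForest package; the formula and data are placeholders:

  library(randomForest)

  set.seed(42)
  rf_fit <- randomForest(target ~ ., data = train,
                         ntree = 500,        # number of bootstrapped trees to average
                         importance = TRUE)

  rf_fit               # out-of-bag error summary
  varImpPlot(rf_fit)   # variable importance
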
15
Q

mtry (random forest)

A

mtry is the number of features randomly sampled as candidates at each split. This randomization of features reduces the correlation between the predictions of different trees, further reducing the variance of the overall predictions. Making mtry too small, however, may impose severe restrictions on the tree-growing process, so this hyperparameter needs to be tuned carefully, typically by cross-validation.

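A sketch of tuning mtry by cross-validation with caret; the grid values are illustrative:

  library(caret)

  set.seed(42)
  rf_tuned <- train(target ~ ., data = train,
                    method = "rf",
                    trControl = trainControl(method = "cv", number = 5),
                    tuneGrid = expand.grid(mtry = c(2, 4, 6, 8)))

  rf_tuned$bestTune   # mtry with the best cross-validated performance
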
16
Q

eta (boosted tree)

A

eta is known as the learning rate or shrinkage parameter, a multiplier between 0 and 1. In boosting, trees are built sequentially, with each tree fitted to the residuals of the preceding tree; the predictions of the current tree, scaled by eta, are then added to the overall predictions. In general, it is best to choose a small value of eta to slow down the learning process, as predictive models that learn slowly often perform better.

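A sketch with the xgboost package showing a small eta; the feature matrix, target, and parameter values are illustrative:

  library(xgboost)

  # xgboost expects a numeric feature matrix and a numeric label vector
  X <- model.matrix(target ~ ., data = train)[, -1]
  y <- train$target

  fit <- xgboost(data = X, label = y,
                 objective = "reg:squarederror",
                 eta = 0.01,       # small learning rate: each tree contributes only a little
                 max_depth = 4,
                 nrounds = 1000,   # a small eta needs many boosting rounds
                 verbose = 0)
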
17
Q

nrounds (boosted tree)

A

nrounds is the maximum number of boosting iterations. It is often a large number to ensure that sufficient trees are grown to produce a good fit. At the same time, it should not be excessively large, or the boosted model may overfit. In general, a lower value for eta requires a larger value for nrounds.

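nrounds is often chosen with cross-validation and early stopping; a sketch with xgb.cv, reusing X and y from the previous card:

  library(xgboost)

  set.seed(42)
  cv <- xgb.cv(data = X, label = y,
               params = list(objective = "reg:squarederror", eta = 0.01, max_depth = 4),
               nrounds = 2000,               # upper bound on boosting iterations
               nfold = 5,
               early_stopping_rounds = 50,   # stop if no improvement for 50 rounds
               verbose = 0)

  cv$best_iteration   # a reasonable choice for nrounds at this eta
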
18
Q

Principal Component Analysis

A

Principal components analysis is a method to summarize high-dimensional numeric data with fewer dimensions while preserving the spread of the data. It can be particularly helpful when variables are highly correlated. PCA finds orthogonal linear combinations of the input variables (which are typically centered and scaled), called principal components (PCs), that maximize variance so as to retain as much information as possible. The principal components are ordered according to their variance, and the sum of their variances equals the total variance in the data. It is then common to look at the proportion of variance explained by each PC to decide how many PCs to use.

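A sketch with prcomp, assuming a data frame X of numeric variables; centering and scaling puts the variables on a comparable footing:

  pca <- prcomp(X, center = TRUE, scale. = TRUE)

  summary(pca)          # standard deviation and proportion of variance for each PC
  pca$rotation[, 1:2]   # loadings of the first two principal components
  head(pca$x[, 1:2])    # scores: the data expressed in the first two PCs
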
19
Q

PCA Advantage

A

PCA can allow us to build a simpler model with fewer features. When exploring data, PCA can help visualize high-dimensional data and explore relationships between variables. PCA can also help identify latent variables.

20
Q

PCA Disadvantage

A

Using a subset of the principal components results in some information loss, and the PCs will be less interpretable than the original input variables.

21
Q

K-means clustering

A

With K-means cluster analysis, an unsupervised learning technique, the goal is to assign records into one of k groups or clusters such that members of each group are overall more similar to one another than they are to members of other groups. The number of groups, k, is specified at the beginning and the group members are determined through an iterative process. Initially, k random centers are chosen and the group assignment for each record is determined by which of these centers is closest. New centers for each group are calculated based on its members, and then group membership is redetermined based on these new centers. This process continues until the centers and group membership are stable or stopped by an iteration limit.

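A sketch with kmeans on standardized numeric features (the data frame X is a placeholder); nstart reruns the algorithm from several random starting centers and keeps the best result:

  set.seed(42)
  km <- kmeans(scale(X), centers = 3, nstart = 25)

  km$cluster                # cluster assignment for each record
  km$centers                # final cluster centers
  km$betweenss / km$totss   # proportion of variance explained by the clustering
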
22
Q

Elbow Plot (k-means clustering)

A

In an elbow plot, the proportion of variance explained by the variation between the k cluster centers is calculated and plotted for successive values of k. Increases in k generally lead to increases in the proportion of variance explained, but the size of each increase typically shrinks with each additional cluster. Where the incremental proportion of variance explained drops off sharply with the addition of another cluster, the plot shows an "elbow" at the change of direction. The number of clusters just to the left of the elbow, before the less helpful cluster is added, is considered a good, parsimonious choice for k.

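A sketch of building an elbow plot by computing the between-cluster proportion of variance for successive values of k (reusing the feature data X from the previous card):

  set.seed(42)
  k_values <- 1:10
  prop_var <- sapply(k_values, function(k) {
    km <- kmeans(scale(X), centers = k, nstart = 25)
    km$betweenss / km$totss   # proportion of variance explained with k clusters
  })

  plot(k_values, prop_var, type = "b",
       xlab = "Number of clusters k",
       ylab = "Proportion of variance explained")   # look for the elbow
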
23
Q

Pruning (decision tree)

A

Pruning reduces the size of the tree, hopefully removing less valuable splits from the tree. This process reduces overfitting the tree on the training data, can lead to better predictions, and results in a simpler, more interpretable tree.

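A sketch of pruning an rpart tree back to the cp with the smallest cross-validated error, assuming a fitted tree named fit (as in the cross-validation card):

  library(rpart)

  # Choose the cp that minimizes the cross-validated error (xerror) in the cp table
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]

  pruned <- prune(fit, cp = best_cp)   # a smaller, more interpretable tree
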
24
Q

Boosted Tree

A

Boosted trees are built sequentially. A first tree is built in the usual way. Then another tree is fit to the residuals (errors) and is added to the first tree. This allows the second tree to focus on observations on which the first tree performed poorly. Additional trees are then added with cross-validation used to decide when to stop (and prevent overfitting).