Predictive Analytics Flashcards
Advantages of Converting Numeric Variables to Factor Variables
- Provides added flexibility to the model
- When treated as numeric variables, there is an implicit assumption of a monotonic relationship between these variables and the target variable in GLMs
- When treated as numeric variables in decision trees, splits have to respect the order of the variable's values; conversion removes this restriction
- Potential improvement in predictive accuracy due to the flexibility of capturing the effects of the variable across different parts of its range
Disadvantages of Converting Numeric Variables to Factor Variables
- When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
- In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
Check to Support Conversion of Numeric Variable to Factor Variable
Examine the mean of the target variable at each distinct value of the numeric variable. If the mean does not vary in a monotonic fashion, this supports the conversion.
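A minimal sketch of this check in R, assuming a data frame dat with a hypothetical numeric variable num_var and target variable target:

    # Mean of the target at each distinct value of the numeric variable
    tapply(dat$target, dat$num_var, mean)

    # If the means do not move monotonically across the values,
    # conversion to a factor may be warranted
    dat$num_var <- as.factor(dat$num_var)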
Define offset
In a predictive model, an offset is a variable that serves to account for the different exposure periods of different observations, thereby yielding more accurate predictions.
In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori.
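A minimal sketch in R of a Poisson count GLM with an offset, assuming a hypothetical data frame dat with a claim_count target and an exposure column recording each observation's exposure period:

    # The log of exposure enters as an offset: its coefficient is fixed at 1,
    # so predicted counts scale proportionally with exposure
    freq_glm <- glm(claim_count ~ age + region,
                    family = poisson(link = "log"),
                    offset = log(exposure),
                    data = dat)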
Introducing Training / Test Sets
Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be fitted on the training data, while their predictive performance will be evaluated on the test data.
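One way to perform such a stratified split in R is with the caret package (data frame and target names are hypothetical):

    library(caret)
    set.seed(123)
    # createDataPartition samples within strata of the target,
    # so the training and test sets have similar target distributions
    idx <- createDataPartition(dat$target, p = 0.7, list = FALSE)
    train <- dat[idx, ]
    test  <- dat[-idx, ]

    # Check that the two sets are representative
    mean(train$target)
    mean(test$target)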
Define minbucket
The minimum number of observations in any terminal node of the tree. The higher the value, the smaller the number of splits and the less complex the tree.
Define cp
The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.
Define maxdepth
The maximum number of branches from the tree's root node to the furthest terminal node. The higher the value, the more complex the tree.
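The three parameters above are control parameters of the rpart package; a minimal sketch (formula, data, and parameter values are hypothetical):

    library(rpart)
    tree <- rpart(target ~ .,
                  data = train,
                  method = "anova",
                  control = rpart.control(minbucket = 5,
                                          cp = 0.001,
                                          maxdepth = 6))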
Describe Cost-Complexity Pruning
Technique that performs cross-validation to evaluate the predictive performance of a tree. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and then computes the prediction error on the held-out fold, repeating this for each fold. This is done for the sequence of cp values at or above the cp value specified when the tree was fitted, to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the corresponding impurity reduction threshold are removed.
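A minimal sketch of cost-complexity pruning with rpart, reusing the hypothetical tree fitted above; the cptable stores the cross-validation error (xerror) for each candidate cp value:

    # Display the cp table produced by cross-validation
    printcp(tree)

    # cp value with the lowest cross-validation error
    cp_min <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]

    # Prune the tree at that cp value
    pruned_tree <- prune(tree, cp = cp_min)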
One-Standard-Error Rule
Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.
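A sketch of the rule using the same hypothetical cptable as above; because the table is ordered from fewest to most splits, the first row meeting the threshold gives the smallest qualifying tree:

    cpt <- tree$cptable
    best <- which.min(cpt[, "xerror"])

    # Threshold: minimum cross-validation error plus one standard error
    threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]

    # Smallest tree whose cross-validation error is within the threshold
    cp_1se <- cpt[which(cpt[, "xerror"] <= threshold)[1], "CP"]
    pruned_1se <- prune(tree, cp = cp_1se)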
Define ntree
The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.
Define mtry
Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates; the number of candidates is defined by the mtry parameter. Making mtry too large makes the trees more similar and their predictions more highly correlated, which reduces the variance-reduction benefit of the ensemble. Making mtry too small can impose severe restrictions on the tree-growing process.
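A minimal sketch with the randomForest package (data, formula, and parameter values are hypothetical; for regression forests, mtry is often set near one third of the number of features):

    library(randomForest)
    set.seed(123)
    rf <- randomForest(target ~ .,
                       data = train,
                       ntree = 500,   # number of trees grown
                       mtry = 3,      # features sampled as split candidates at each split
                       importance = TRUE)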
What is a random forest?
Ensemble method that relies on bagging to produce a large number of bootstrapped training samples, over which trees are constructed in parallel with only a random subset of features considered as split candidates at each split. The results of the different trees are combined, which reduces the variance of the predictions and helps prevent overfitting.
What is overdispersion?
Overdispersion refers to the situation in which the variance of the target variable is greater than its mean (whereas the Poisson distribution assumes the variance equals the mean).
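A quick check in R, assuming a hypothetical count target claim_count; under a Poisson assumption the variance should be close to the mean:

    # Overdispersion is suggested if the variance clearly exceeds the mean
    mean(train$claim_count)
    var(train$claim_count)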
Benefits of Using Log Link
- Ensures the model predictions are non-negative
- Makes the model easy to interpret: with a log link, the effect of each feature on the target mean is multiplicative (see the sketch after this list)
- It is also the canonical link for the Poisson distribution, which facilitates the convergence of the model-fitting algorithm
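A sketch of the multiplicative interpretation, reusing the hypothetical Poisson GLM from the offset example:

    # With a log link, exponentiated coefficients act as multiplicative factors:
    # a one-unit increase in a predictor multiplies the predicted mean
    # of the target by exp(coefficient)
    exp(coef(freq_glm))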
Pros of Using Binarization with Stepwise Selection
Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping each factor variable in its entirety).
Cons to Using Binarization with Stepwise Selection
May cause the procedure to take significantly more time to complete, since there are many more features to search over.
The resulting model may also be harder to interpret.
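A minimal sketch of binarization in R using model.matrix, assuming a hypothetical factor vehicle_type in the training data:

    # One dummy (binary) column per non-baseline level of the factor
    dummies <- model.matrix(~ vehicle_type, data = train)[, -1]

    # Replace the original factor with its dummy columns so that stepwise
    # selection can retain or drop each level separately
    train_bin <- cbind(train[, setdiff(names(train), "vehicle_type")], dummies)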
Components of Stepwise Selection
Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)
AIC
Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the AIC, the better the model.
BIC
Performance metric used to rank competing models (similar to AIC). Defined as -2l + ln(n)p, where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the BIC, the better the model.
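Both criteria can be computed in R for a fitted model (freq_glm is the hypothetical GLM from the earlier offset sketch):

    # -2 * loglikelihood + 2 * p
    AIC(freq_glm)

    # -2 * loglikelihood + ln(n) * p
    BIC(freq_glm)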
AIC vs BIC
Both metrics demand that, for the inclusion of an additional feature to improve the performance of the model, the feature must increase the loglikelihood by at least a certain amount (the penalty per parameter).
The penalty is greater for BIC whenever ln(n) > 2, i.e., for n ≥ 8, so the BIC is more stringent for complex models and represents a more conservative approach.
Forward Selection
Opposite of backward selection. Starts with the simplest model (the model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model, according to a certain criterion, until no feature can be added to improve the model.
Backward Selection
Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion.
Forward vs Backward Selection
Forward selection is more likely to result in a simpler model relative to backward selection given forward selection starts with a model with no features.
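A sketch of both directions with stepAIC from the MASS package (model family, formulas, and data are hypothetical; k = 2 gives AIC, k = log(n) gives BIC):

    library(MASS)
    null_model <- glm(target ~ 1, family = poisson(link = "log"), data = train)
    full_model <- glm(target ~ ., family = poisson(link = "log"), data = train)

    # Forward selection with BIC: start from the intercept-only model and add features
    forward_bic <- stepAIC(null_model,
                           scope = list(lower = null_model, upper = full_model),
                           direction = "forward",
                           k = log(nrow(train)))

    # Backward selection with AIC: start from the full model and remove features
    backward_aic <- stepAIC(full_model, direction = "backward", k = 2)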