Predictive Analytics Flashcards
Advantages of Converting Numeric Variables to Factor Variables
- Provides added flexibility to model
- When treated as numeric variables in a GLM, there is an implicit assumption of a monotonic relationship between these variables and the target variable
- When treated as numeric variables in a decision tree, splits have to respect the order of the variable's values; converting to a factor removes this restriction
- Potential improvement in predictive accuracy due to the flexibility of capturing the effects of the variable across different parts of its range
Disadvantages of Converting Numeric Variables to Factor Variables
- When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
- In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
Check to Support Conversion of Numeric Variable to Factor Variable
Examine the mean of the target variable split by different values of the integer variable. If the mean does not vary in a monotonic fashion, this supports the conversion.
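A minimal sketch of this check in R, using a small simulated data frame (the names `num_var` and `target` are illustrative):

```r
# Simulated example: does mean(target) vary monotonically with the integer variable?
set.seed(1)
dat <- data.frame(num_var = sample(1:5, 200, replace = TRUE))
dat$target <- ifelse(dat$num_var == 3, 10, 5) + rnorm(200)

# Mean of the target variable for each value of the integer variable
aggregate(target ~ num_var, data = dat, FUN = mean)
# A non-monotonic pattern (here, a spike at num_var == 3) supports the conversion
```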
Define offset
In a predictive model, an offset is a variable that serves to account for the different exposure periods of different observations and therefore yields more accurate predictions.
In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori.
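A minimal sketch in R (simulated data; the names `claims`, `exposure`, and `age` are illustrative):

```r
# Poisson GLM with log(exposure) as an offset; its coefficient is fixed at 1, not estimated
set.seed(1)
dat <- data.frame(exposure = runif(100, 0.5, 2), age = rnorm(100, 40, 10))
dat$claims <- rpois(100, lambda = dat$exposure * exp(-2 + 0.01 * dat$age))

mod <- glm(claims ~ age + offset(log(exposure)),
           family = poisson(link = "log"),
           data = dat)
summary(mod)
```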
Introducing Training / Test Sets
Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be fit on the training data, while their predictive performance will be evaluated on the test data.
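One common way to implement this in R is caret's createDataPartition, which samples in a stratified fashion on the target; a sketch assuming a data frame `dat` with target `target`:

```r
library(caret)

set.seed(123)
idx <- createDataPartition(dat$target, p = 0.7, list = FALSE)  # stratified on the target
train <- dat[idx, ]
test  <- dat[-idx, ]

# Representativeness check: the target means should be comparable
mean(train$target)
mean(test$target)
```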
Define minbucket
The minimum number of observations in any terminal node of the tree. The higher the value, the smaller the number of splits and the less complex the tree.
Define cp
The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.
Define maxdepth
The maximum number of branches from the tree's root node to the furthest terminal node. The higher the value, the more complex the tree.
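These three parameters are control parameters of the rpart package; a sketch assuming a training set `train` with a numeric target `target`:

```r
library(rpart)

tree <- rpart(target ~ .,
              data = train,
              method = "anova",                        # regression tree
              control = rpart.control(minbucket = 5,   # min obs in a terminal node
                                      cp = 0.001,      # min improvement required to split
                                      maxdepth = 6,    # max depth from root to terminal node
                                      xval = 10))      # 10-fold cross-validation
```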
Describe Cost-Complexity Pruning
Technique that uses cross-validation to evaluate the predictive performance of trees of different sizes. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and computes the error on the held-out fold, repeating so that each fold is held out once. This is performed for a range of cp values at or above the specified value to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the corresponding impurity-reduction threshold are removed.
One-Standard-Error Rule
Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.
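A sketch of both ideas, continuing the hypothetical rpart tree fit above; rpart stores the cross-validation results in `tree$cptable` (columns include CP, xerror, and xstd):

```r
cp_table <- tree$cptable

# cp that minimizes the cross-validation error
cp_min <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# One-standard-error rule: smallest tree (largest cp) whose xerror is within
# one standard error of the minimum cross-validation error
thresh <- min(cp_table[, "xerror"]) +
          cp_table[which.min(cp_table[, "xerror"]), "xstd"]
cp_1se <- cp_table[cp_table[, "xerror"] <= thresh, "CP"][1]

pruned_tree <- prune(tree, cp = cp_1se)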
Define ntree
The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.
Define mtry
Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates. The number of candidates is defined by the mtry parameter. Making mtry too large increases the correlation between the predictions of the different trees, undermining the variance reduction. Making mtry too small can impose severe restrictions on the tree-growing process.
What is a random forest?
Ensemble method that relies on bagging to produce a large number of bootstrapped training samples over which trees are constructed in parallel. The results of the different trees are combined, which reduces the variance of predictions and prevents overfitting.
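A sketch using the randomForest package (hypothetical training set `train` with target `target`):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(target ~ .,
                   data = train,
                   ntree = 500,          # number of bootstrapped trees
                   mtry = 3,             # features sampled as split candidates at each node
                   importance = TRUE)    # track variable importance
rf                                       # printed summary includes the out-of-bag error
```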
What is overdispersion?
Overdispersion refers to the situation where the variance of the target variable is greater than its mean, violating the Poisson assumption that the variance equals the mean.
Benefits of Using Log Link
- Ensures the model predictions are non-negative
- Makes the model easy to interpret: exponentiated coefficients act multiplicatively on the predictions.
- Also, it is the canonical link for the Poisson distribution and thus facilitates the convergence of the model fitting algorithm.
Pros of Using Binarization with Stepwise Selection
Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping the factor variables in their entirety).
Cons to Using Binarization with Stepwise Selection
May cause procedure to take significantly more time to complete.
Resulting model may also be hard to interpret.
Components of Stepwise Selection
Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)
AIC
Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the AIC, the better the model.
BIC
Performance metric used to rank competing models (similar to AIC). Defined as -2l + ln(n)p, where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the BIC, the better the model.
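A sketch verifying these formulas for a hypothetical fitted Poisson GLM `mod` on a data set `dat`:

```r
l <- as.numeric(logLik(mod))   # loglikelihood on the training set
p <- length(coef(mod))         # number of estimated parameters
n <- nrow(dat)                 # number of observations

-2 * l + 2 * p         # AIC; matches AIC(mod)
-2 * l + log(n) * p    # BIC; matches BIC(mod)
```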
AIC vs BIC
Both metrics demand that, for an additional feature to improve the model, it must increase the loglikelihood by at least a certain threshold determined by the penalty.
The per-parameter penalty is greater for BIC whenever ln(n) > 2 (i.e., for all but the smallest datasets), so BIC is more stringent toward complex models and is a more conservative approach.
Forward Selection
Opposite of backward selection. Starts with the simplest model (model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model until no features can be added to improve the model.
Backward Selection
Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion.
Forward vs Backward Selection
Forward selection is more likely to result in a simpler model relative to backward selection given forward selection starts with a model with no features.
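Stepwise selection is often implemented with stepAIC from the MASS package (or base R's step); a sketch assuming a hypothetical full model `full_mod` fit on a training set `train` with target `target`:

```r
library(MASS)

# Backward selection using AIC (per-parameter penalty k = 2 is the default)
back_aic <- stepAIC(full_mod, direction = "backward", k = 2)

# Forward selection using BIC: start from the intercept-only model and
# allow features up to the scope of the full model
null_mod <- glm(target ~ 1, data = train)   # use the same family as full_mod
forw_bic <- stepAIC(null_mod,
                    direction = "forward",
                    scope = list(lower = null_mod, upper = full_mod),
                    k = log(nrow(train)))
```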
How to Interpret Coefficients of Log Link
Exponentiate the coefficients and subtract 1. This gives the percentage change in the expected value of the target variable associated with a one-unit increase in the corresponding predictor (or with moving from the baseline level to the level in question, for a dummy variable).
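For a hypothetical log-link GLM `mod`:

```r
exp(coef(mod)) - 1   # proportional change in the expected target per unit increase in each predictor
```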
What is Regularized Regression?
Alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the deviance being minimized (equivalently, subtracting it from the loglikelihood being maximized). This shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the coefficients are forced exactly to zero, thus removing those features from the model.
Regularization vs Stepwise Selection
- Binarization of factor variables is done automatically in regularization, and each factor level is treated as a separate feature to remove
- Cross-validation can be used to optimize the regularization parameter (lambda) such that the RMSE is minimized (see the sketch after this list)
- Coefficient estimates are more difficult to interpret in regularization because the variables are standardized
- The glmnet package only allows a restricted set of model forms
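A sketch of these points using glmnet (hypothetical training set `train` with numeric target `target`; alpha = 1 gives a lasso penalty, which can force coefficients exactly to zero):

```r
library(glmnet)

# model.matrix binarizes factor variables automatically; drop the intercept column
X <- model.matrix(target ~ ., data = train)[, -1]
y <- train$target

set.seed(1)
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 1)   # predictors standardized by default

cv_fit$lambda.min                # lambda minimizing the cross-validation error
coef(cv_fit, s = "lambda.min")   # some coefficients may be exactly zero
```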
Supervised Learning Methods
Target variable guides the analysis.
Main Methods: GLMs, Decision Trees
Unsupervised Learning Methods
Target variable is absent; interested in extracting relationships between variables in the data (lend themselves to high-dimensional datasets)
Main Methods: Principal Components Analysis, Cluster Analysis
Regression vs Classification
Regression has numeric target variable, classification has categorical target variable
Training vs Test Set
Training Set is the data used to develop the predictive model (typically the largest portion of the data)
Test Set is the data used to evaluate the predictive performance of the model when applied to data it has not seen before
What is Cross-Validation?
Alternative to training/test split when data set is small
Splits data into k equal folds; each fold is used once as the validation set, while the remaining folds are used as the training set
Predictive model is fit k times and the k validation results are combined (typically averaged)
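A sketch of k-fold cross-validation via the caret package (hypothetical data frame `dat` with numeric target `target`):

```r
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)   # 5 folds
cv_model <- train(target ~ .,
                  data = dat,
                  method = "glm",
                  trControl = ctrl)
cv_model$results   # cross-validated RMSE (and other metrics) averaged over the folds
```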
Common Performance Metrics Used in Regression Problems
Root Mean Squared Error (RMSE): the square root of the average squared prediction error over the test set; aggregates all prediction errors into a single measure
Common Performance Metric Used in Classification Problems
Classification Error Rate: proportion of observations in the test set that are incorrectly classified
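Both metrics are simple to compute from test-set predictions; a sketch assuming hypothetical vectors of actual and predicted values:

```r
# Regression: RMSE on the test set
rmse <- sqrt(mean((actual - predicted)^2))

# Classification: proportion of test observations misclassified
error_rate <- mean(predicted_class != actual_class)
```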
Define Bias
Difference between expected value and true value
More complex, lower bias
Define Variance
Amount by which expected value would change using a different training set
More complex, higher variance
What is Irreducible Error?
The variance of the noise. Cannot be reduced no matter how good the predictive model is
Variables vs Features
Variables are raw recorded measurements from the original dataset without any transformations
Features are quantities derived from the raw variables (e.g., via transformations or combinations)
Feature Generation
Process of developing new features based on existing variables in the data
Seeks to enhance the flexibility of the model and lower the bias of the predictions at the expense of an increase in variance
Plays more prominent role in GLMs compared to Decision Trees
Feature Selection (Removal)
Opposite of feature generation; process of dropping features with limited predictive power
Important concept in GLMs and Decision Trees
Feature Selection Methods
Forward/Backward Selection
Other Commonly Used Performance Metrics
RMSE (most common)
Loglikelihood
R-Squared (goodness of fit measure)
Chi-square
What is Binning?
Creating a categorical predictor whose levels are defined as non-overlapping intervals of the original variable
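A sketch using base R's cut() (hypothetical data frame `dat` with a numeric variable `age`):

```r
# Bin age into non-overlapping intervals; the result is a factor
dat$age_band <- cut(dat$age,
                    breaks = c(0, 25, 40, 60, Inf),
                    right = FALSE)   # intervals of the form [a, b)
table(dat$age_band)
```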
What is Binarization?
Feature generation method used for categorical variables, which turns a categorical variable into a collection of binary variables
The baseline level defaults to the first level in alphanumeric order
It is a good idea to make the baseline level the level with the most observations
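A sketch (hypothetical data frame `dat` with a factor `region`): relevel so the most common level is the baseline, then let model.matrix create the dummy variables.

```r
# Make the most frequent level the baseline (reference) level
most_common <- names(which.max(table(dat$region)))
dat$region <- relevel(dat$region, ref = most_common)

# One binary (dummy) variable per non-baseline level
head(model.matrix(~ region, data = dat))
```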
Interactions Between Continuous and Categorical Predictors
Need to multiply continuous variable by each of the binary variables created from the categorical variable
To assess the extent of the interaction graphically, use a scatterplot of the target against the continuous predictor, with points colored (or the plot faceted) by the levels of the categorical predictor
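A sketch (hypothetical data frame `dat` with numeric target `target`, continuous predictor `x`, and factor `grp`):

```r
# In a GLM formula, x * grp expands to x + grp + x:grp, where x:grp multiplies x
# by each dummy variable created from grp
mod_int <- glm(target ~ x * grp, data = dat)

# Graphical check: scatterplot of target vs. x, colored by grp, with separate fitted lines
library(ggplot2)
ggplot(dat, aes(x = x, y = target, color = grp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```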