Predictive Analytics Flashcards
Advantages of Converting Numeric Variables to Factor Variables
- Provides added flexibility to model
- When treated as numeric variables in a GLM, there is an implicit assumption of a monotonic relationship between these variables and the target variable
- When treated as numeric variables in a decision tree, splits have to respect the order of the variable's values; conversion to a factor removes this restriction
- Potential improvement in predictive accuracy due to the flexibility of capturing the effects of the variable across different values of its range
Disadvantages of Converting Numeric Variables to Factor Variables
- When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
- In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
Check to Support Conversion of Numeric Variable to Factor Variable
Examine the mean of the target variable split by different values of the integer variable. If the mean does not vary in a monotonic fashion, this supports the conversion.
Define offset
In a predictive model, an offset is a variable that serves to account for the different exposure periods of different observations and therefore yields more accurate predictions.
In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori
Introducing Training / Test Sets
Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be fit on the training data, while their predictive performance will be evaluated on the test data.
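A minimal sketch of such a split in R, assuming a data frame named data with a target column named target (both names are illustrative); createDataPartition from the caret package performs stratified sampling on the target:

```r
library(caret)

set.seed(42)
# Stratified 70/30 split on the (illustrative) target variable
idx   <- createDataPartition(data$target, p = 0.7, list = FALSE)
train <- data[idx, ]
test  <- data[-idx, ]

# Sanity check: the target means in the two sets should be comparable
mean(train$target)
mean(test$target)
```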
Define minbucket
The minimum number of observations in any terminal node of the tree. The higher the value, the fewer the splits and the less complex the tree.
Define cp
The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.
Define maxdepth
The maximum number of branches from the tree's root node to the furthest terminal node. The higher the value, the more complex the tree.
Describe Cost-Complexity Pruning
Technique that uses cross-validation to evaluate the predictive performance of a tree and select its complexity. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and then computes the error on the held-out fold. This is performed for all values of the cp parameter greater than the specified value to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the corresponding impurity reduction threshold are removed.
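A hedged sketch of how these controls and cost-complexity pruning might look with the rpart package; the data, variable names, and parameter values are illustrative:

```r
library(rpart)

tree <- rpart(target ~ ., data = train, method = "anova",
              control = rpart.control(minbucket = 5, cp = 0.001,
                                      maxdepth = 10, xval = 10))

# cptable stores the cross-validation error (xerror) for each candidate cp
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
```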
One-Standard-Error Rule
Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.
Define ntree
The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.
Define mtry
Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates. The number of candidates is defined by the mtry parameter. Making mtry too large increases the correlation between the predictions of the different trees, which undermines the variance reduction. Making mtry too small can impose severe restrictions on the tree-growing process.
What is a random forest?
Ensemble method that relies on bagging to produce a large number of bootstrapped training samples, over which trees are constructed in parallel. The results of the different trees are combined, which reduces the variance of predictions and prevents overfitting.
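A minimal sketch with the randomForest package (data and parameter values are illustrative):

```r
library(randomForest)

set.seed(42)
# ntree: number of trees grown; mtry: features sampled as split candidates
rf <- randomForest(target ~ ., data = train,
                   ntree = 500, mtry = 3, importance = TRUE)

# Variable importance and test-set predictions
importance(rf)
pred <- predict(rf, newdata = test)
```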
What is overdispersion?
Overdispersion refers to the situation when the variance of the target variable is greater than the mean.
Benefits of Using Log Link
- Ensures the model predictions are non-negative
- Makes the model easy to interpret (the effects of the predictors on the target mean are multiplicative).
- Also, it is the canonical link for the Poisson distribution and thus facilitates the convergence of the model fitting algorithm.
Pros of Using Binarization with Stepwise Selection
Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping the factor variables in their entirety)
Cons to Using Binarization with Stepwise Selection
May cause procedure to take significantly more time to complete.
Resulting model may also be hard to interpret.
Components of Stepwise Selection
Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)
AIC
Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the AIC, the better the model.
BIC
Performance metric used to rank competing models (similar to AIC). Defined as -2l + ln(n)p, where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the BIC, the better the model.
AIC vs BIC
Both metrics demand that for the inclusion of an additional feature to improve the performance of the model, the feature must increase the loglikelihood by at least a certain amount (the penalty amount).
In general, the penalty is typically greater for BIC, so the BIC is more stringent for complex models and is a more conservative approach.
Forward Selection
Opposite of backward selection. Starts with the simplest model (model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model until no features can be added to improve the model.
Backward Selection
Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion, until no removal improves the model.
Forward vs Backward Selection
Forward selection is more likely to result in a simpler model relative to backward selection given forward selection starts with a model with no features.
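A sketch of both directions using base R's step function, with illustrative model formulas; k = 2 corresponds to the AIC penalty and k = log(n) to the BIC penalty:

```r
# Illustrative intercept-only and full candidate models
null_model <- glm(target ~ 1, family = poisson(), data = train)
full_model <- glm(target ~ x1 + x2 + x3, family = poisson(), data = train)

# Forward selection with AIC (k = 2)
forward_fit <- step(null_model, direction = "forward",
                    scope = list(lower = formula(null_model),
                                 upper = formula(full_model)),
                    k = 2)

# Backward selection with BIC (k = log(n))
backward_fit <- step(full_model, direction = "backward", k = log(nrow(train)))
```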
How to Interpret Coefficients of Log Link
Exponentiate the coefficient and subtract 1. The result is the proportional (percentage) change in the target mean associated with a one-unit increase in the predictor, holding all other predictors fixed.
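For instance, with a log link, an illustrative fitted coefficient of 0.05 corresponds to roughly a 5.1% increase in the target mean per one-unit increase in the predictor:

```r
coef_hat <- 0.05      # illustrative coefficient estimate under a log link
exp(coef_hat) - 1     # ~ 0.0513, i.e., about a 5.1% increase in the target mean
```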
What is Regularized Regression?
Alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the function to optimize (loglikelihood). This shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the coefficients are forced exactly to zero, thus removing those features from the model.
Regularization vs Stepwise Selection
- Binarization of factor variables is done automatically in regularization, and each factor level is treated as a separate feature to remove
- Cross-validation can be used to optimize the regularization parameter (lambda) such that the RMSE is minimized
- Coefficient estimates are more difficult to interpret in regularization because the variables are standardized
- The glmnet package only allows a restricted set of model forms
Supervised Learning Methods
Target variable guides the analysis.
Main Methods: GLMs, Decision Trees
Unsupervised Learning Methods
Target variable is absent; interested in extracting relationships between variables in the data (lend themselves to high-dimensional datasets)
Main Methods: Principal Components Analysis, Cluster Analysis
Regression vs Classification
Regression has numeric target variable, classification has categorical target variable
Training vs Test Set
Training Set is the data used to develop the predictive model (typically the largest portion of the data)
Test Set is the data used to evaluate the predictive performance of the model when applied to data it has not seen before
What is Cross-Validation?
Alternative to training/test split when data set is small
Splits data into k equal folds; each fold is used once as the validation set, while the remaining folds are used as the training set
Predictive model is fit k times and the predictions are combined
Common Performance Metrics Used in Regression Problems
Root Mean Squared Error (RMSE): the square root of the average squared prediction error over the test set
Common Performance Metric Used in Classification Problems
Classification Error Rate: proportion of observations in the test set that are incorrectly classified
Define Bias
The difference between the expected value of the prediction and the true value of the target
The more complex the model, the lower the bias
Define Variance
The amount by which the prediction would change if the model were fit to a different training set
The more complex the model, the higher the variance
What is Irreducible Error?
The variance of the noise. Cannot be reduced no matter how good the predictive model is
Variables vs Features
Variables are raw recorded measurements from the original dataset without any transformations
Features are derived from the raw variables
Feature Generation
Process of developing new features based on existing variables in the data
Seeks to enhance the flexibility of the model and lower the bias of the predictions at the expense of an increase in variance
Plays more prominent role in GLMs compared to Decision Trees
Feature Selection (Removal)
Opposite of feature generation; process of dropping features with limited predictive power
Important concept in GLMs and Decision Trees
Feature Selection Methods
Stepwise selection (forward/backward) and regularization
Other Commonly Used Performance Metrics
RMSE (most common)
Loglikelihood
R-Squared (goodness of fit measure)
Chi-square
What is Binning?
Creating a categorical predictor whose levels are defined as non-overlapping intervals of the original variable
What is Binarization?
Feature generation method used for categorical variables, which turns a categorical variable into a collection of binary variables
By default, the baseline (reference) level is the first level in alphabetical order
It is a good idea to relevel so that the baseline is the level with the most observations
Interactions Between Continuous and Categorical Predictors
Need to multiply continuous variable by each of the binary variables created from the categorical variable
To assess the extent of the interaction graphically, use a scatterplot
Interactions Between Two Categorical Predictors
Need to multiply each pair of binary variables, one created from each of the two categorical variables
To assess the extent of the interaction graphically, use a box plot
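A hedged sketch of both kinds of interactions in R's formula syntax, with illustrative variable names (age continuous; region and vehicle_type categorical); the * operator includes the main effects plus the products of the dummy variables:

```r
# Continuous-categorical interaction
fit1 <- glm(target ~ age * region, family = Gamma(link = "log"), data = train)

# Categorical-categorical interaction
fit2 <- glm(target ~ region * vehicle_type, family = Gamma(link = "log"),
            data = train)

# Illustrative graphical checks with ggplot2
library(ggplot2)
ggplot(train, aes(x = age, y = target, color = region)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(train, aes(x = region, y = target, fill = vehicle_type)) +
  geom_boxplot()
```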
Regularization
Alternative to forward/backward selection that shrinks the magnitude of coefficients of features with limited predictive importance towards zero
Goal is to simplify model and avoid overfitting
Ridge Method
Type of regularization in which the penalty is the sum of squares of the slope coefficients
Lasso Method
Type of regularization in which the penalty is the sum of absolute values of the slope coefficients
Elastic Net Regression Method
Type of regularization in which the penalty captures both lasso and ridge
Regularization Parameter (Lambda)
When lambda = 0, the coefficient estimates are equal to the ordinary least squares estimates
As lambda increases, the effect of regularization increases
Ridge vs Lasso
The lasso method can force coefficients to exactly zero, whereas the ridge method shrinks them but never exactly to zero (all features are retained)
Lasso tends to produce simpler models with fewer features
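A minimal sketch with the glmnet package (variable names and the alpha value are illustrative); alpha = 0 gives ridge, alpha = 1 gives lasso, and intermediate values give elastic net, while cv.glmnet chooses lambda by cross-validation:

```r
library(glmnet)

# glmnet requires a numeric design matrix; factor levels are binarized automatically
X <- model.matrix(target ~ ., data = train)[, -1]
y <- train$target

set.seed(42)
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 1)

# Coefficients at the lambda with the lowest cross-validation error
coef(cv_fit, s = "lambda.min")
```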
Advantages of Regularization
Computationally more efficient than stepwise selection algorithms
Disadvantages of Regularization
May not produce the most interpretable model (especially for Ridge)
During model fitting, all numeric predictors are standardized, which makes the interpretation of the coefficient estimates less intuitive
Cannot accommodate all distributions for GLMs
GLMs vs Linear Models
GLMs offer considerably more flexibility:
- The target variable can be any member of the exponential family of distributions
- GLMs have the ability to analyze situations in which the effects of the predictors on the target mean are more complex than merely additive in nature
Target Distribution for Continuous, Positive Data
Gamma and Inverse Gaussian most appropriately capture the skewness of the target variable
Ex. claim amounts, income, amount of insurance coverage
Target Distribution for Binary Data
Binomial (occurrence or non-occurrence of event)
Target Distribution for Count Data
Poisson most natural candidate
Drawback: requires the variance to equal the mean; count data often exhibit overdispersion (variance greater than the mean), which violates this assumption
Link Function for Positive Target Mean
Log link is most natural candidate, as it ensures predictions are always positive
Also easily interpretable
Link Function for Target Mean Between 0 and 1
Logit link good candidate, as it ensures predictions are always between 0 and 1
Also easily interpretable due to its connection to the log link (the logit is the log of the odds)
What is a Canonical Link Function?
Link function that simplifies the estimation procedure
Should not always be used; need to consider interpretability
Canonical Link Function for Normal Distribution
Identity
Canonical Link Function for Binomial Distribution
Logit
Canonical Link Function for Poisson Distribution
Natural Log
Canonical Link Function for Gamma Distribution
Inverse
Canonical Link Function for Inverse Gaussian Distribution
Squared Inverse
Weights
Used in GLMs to assign a higher emphasis to observations that are averaged across more subjects of similar characteristics
Variance of each observation is inversely related to the group size
Does not affect the mean of the target variable
Offsets
Additional predictor used to account for different means of different observations
Important for offset to be on same scale as the linear predictor
Group size is positively related to the mean of the target variable
Do not affect the variance of the target variable
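A hedged sketch of both ideas in R's glm function, with illustrative variable names: the log of exposure enters a Poisson frequency model as an offset (so it is on the scale of the linear predictor), while group size enters a Gamma severity model on grouped data as a weight:

```r
# Poisson claim-count model with a log(exposure) offset
freq_fit <- glm(claim_count ~ age + region,
                family = poisson(link = "log"),
                offset = log(exposure),
                data = train)

# Gamma severity model for average claims, weighted by group size
sev_fit <- glm(avg_claim ~ age + region,
               family = Gamma(link = "log"),
               weights = group_size,
               data = train)
```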
Method for Estimating Parameter Coefficients in Linear Models
Least Squares Regression
Method for Estimating Parameter Coefficients in GLMs
Maximum Likelihood Estimation
What is Deviance?
Goodness-of-fit measure for GLMs (compared to R-squared in linear models)
Measures the extent to which the GLM departs from the most elaborate (or saturated) model
Lower the deviance, closer the GLM is to a perfect fit
What are Deviance Residuals?
The signed square root of each observation's contribution to the deviance
Used over raw residuals because, for an adequately fitting GLM, deviance residuals are approximately normally distributed
Q-Q Plot
Displays the standardized deviance residuals against the standard normal quantiles
Confusion Matrix
Tabular display of how predictions of a binary classifier line up with the observed classes
Classification Error Rate
(False Negatives + False Positives) / n
Sensitivity
(True Positives) / (True Positives + False Negatives)
Higher sensitivity and specificity, better classifier
Specificity
(True Negatives) / (True Negatives + False Positives)
Higher sensitivity and specificity, better classifier
Receiver Operator Characteristic Curve (ROC)
Graphical tool that displays the trade-off between sensitivity and specificity across all possible classification cutoffs
Classifier with perfect predictive performance rises quickly to top left corner
AUC
Area under the curve of a ROC plot
Measure of the predictive performance of the model (the closer the AUC is to 1, the better)
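A minimal sketch of these metrics for a hypothetical fitted binary classifier (logit_fit) and a 0/1-coded target, using a cutoff of 0.5 and the pROC package for the AUC:

```r
# Predicted probabilities and classes on the test set
pred_prob  <- predict(logit_fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix: predictions vs observed classes
conf <- table(Predicted = pred_class, Observed = test$target)

# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)
sensitivity <- conf["1", "1"] / sum(conf[, "1"])
specificity <- conf["0", "0"] / sum(conf[, "0"])

# Area under the ROC curve
library(pROC)
auc(roc(test$target, pred_prob))
```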
Regression Trees
Have a quantitative target variable and use the average of the target variable in that group as the predicted value
Classification Trees
Have qualitative target variables and use the most common class (mode) of the target variable in that group as the predicted class
Node
Point on the tree that corresponds to a subset of the data
Root Node
Node at the top of the tree representing the full data set
Terminal Node
Also referred to as leaf. The nodes at the bottom of the tree that cannot be split any further
Binary Tree
Each node only has two children
Depth
The number of branches from the root node to the furthest terminal node
Measure of Impurity in Regression Trees
Residual sum of squares
Measure of Impurity in Classification Trees
Most common is classification error. Other measures of node impurity are entropy and gini.
Choice of impurity measure does not have a significant impact on the performance of the tree
Pruning
Process to control the complexity of the tree structure
Similar to stepwise selection in GLMs
Start at the bottom of the tree and remove splits that do not meet a specified impurity reduction
Advantages of Decision Trees
- Generally easier to interpret compared to GLMs
- Excel in handling nonlinear relationships and do not require transformations
- Good at automatically recognizing interactions between variables
- Do not require binarization for categorical variables
- Variables are automatically selected with the most important variables appearing at the top of the tree
- Much less susceptible to model mis-specification than GLMs
- Can easily be modified to deal with missing data
Disadvantages of Decision Trees
- More prone to overfitting relative to GLMs and produce unstable predictions with more variance (small change in training data can lead to big changes in the fitted tree)
- Favor categorical features with many levels over those with few levels
- Lack of model diagnostics
Random Forests
Ensemble method that generates multiple bootstrapped samples of the training set and fits base tree models to each bootstrapped sample
Results from all base trees are combined to form an overall prediction
Randomization is performed at each step
Reduces the variance of predictions, which improves predictive performance
Advantages of Random Forests
- Much more robust than single trees
- More precise predictions with lower variance
Disadvantages of Random Forests
- Not easily interpretable
- Takes considerably more computational power
Boosting
Ensemble method that builds a sequence of interdependent trees using information from previously grown trees
Each iteration builds on the residuals of the prior tree
Reduces model bias
Advantages of Boosting
- Improves predictive accuracy
Disadvantages of Boosting
- More vulnerable to overfitting
- Significant computational cost
- Not easily interpretable
Principal Components Analysis
Advanced technique that transforms a large number of (possibly correlated) variables to a smaller, more manageable, set of representative variables that capture much of the information in the full data set
Resulting variables are referred to as principal components and are a linear combination of the existing variables
Particularly useful for feature generation
To perform PCA on categorical variables, they must first be binarized
Scree Plot
Provides simple visual inspection method for determining number of principal components to use
Depicts the PVE (proportion of variance explained) of each PC
Elbow of Scree Plot
Point at which the PVE drops off significantly
PCs beyond the elbow have a very small PVE and can be dropped
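A minimal sketch with base R's prcomp, using illustrative numeric variable names; the PVE of each component is derived from the component standard deviations:

```r
# PCA on centered and scaled numeric variables
pca <- prcomp(train[, c("x1", "x2", "x3", "x4")], center = TRUE, scale. = TRUE)

# Proportion of variance explained (PVE) by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Scree plot: look for the elbow where the PVE drops off
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
```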
Cluster Analysis
Partitions heterogeneous observations into a set of distinct homogeneous groups (clusters)
Observations within each cluster have similar characteristics
Two main methods: k-means clustering, hierarchical clustering
K-Means Clustering
Assigns each observation into one of k clusters, with k being defined beforehand
Algorithm automatically searches for the best configuration of the k clusters
Clusters are chosen such that the variance within each cluster is small while the variance among different clusters is large
Advisable to run the algorithm many times with different initial cluster assignments and then choose the run with the lowest within-cluster variance
Features must be standardized before clustering
Elbow Method
Plot of the ratio of between-cluster variation to total variation against k
Used to determine k: choose the value of k at which the proportion of variance explained begins to plateau
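A hedged sketch of the elbow method with base R's kmeans, using an illustrative standardized feature matrix; the proportion of variance explained is the between-cluster sum of squares divided by the total sum of squares:

```r
# Standardize the (illustrative) features before clustering
X_scaled <- scale(train[, c("x1", "x2", "x3")])

set.seed(42)
pve <- sapply(1:10, function(k) {
  km <- kmeans(X_scaled, centers = k, nstart = 25)
  km$betweenss / km$totss
})

# Elbow plot: choose k where the curve starts to plateau
plot(1:10, pve, type = "b", xlab = "k",
     ylab = "Between-cluster SS / Total SS")
```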
Hierarchical Clustering
Clustering method that does not require the choice of k at the start
Uses a dendrogram, which is a tree-based visualization of the hierarchy of clusters
Consists of a series of fusions (mergers) of clusters
Complete Linkage
Measure of intercluster dissimilarity based on the maximal pairwise distance between observations in two clusters
Single Linkage
Measure of intercluster dissimilarity based on the minimal pairwise distance between observations in two clusters
Average Linkage
Measure of intercluster dissimilarity based on the average pairwise distance between observations in two clusters
Dendrogram
Tree-based visualization of the hierarchy of clusters
Clusters towards bottom of the dendrogram are similar to one another, while clusters towards the top are far apart
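A minimal sketch with base R's hclust (feature names and the chosen number of clusters are illustrative):

```r
# Hierarchical clustering with complete linkage on standardized features
X_scaled <- scale(train[, c("x1", "x2", "x3")])
hc <- hclust(dist(X_scaled), method = "complete")

# Dendrogram, then cut the tree to obtain, say, 4 clusters
plot(hc)
clusters <- cutree(hc, k = 4)
```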
K-Means vs Hierarchical Clustering
- K-Means clustering requires standardization
- Number of clusters are pre-defined in k-means clustering
- Hierarchical clustering uses nested clusters
nstart parameter
Used in cluster analysis, the nstart parameter controls the number of random selections of initial cluster centers used by the kmeans algorithm. A larger value improves the chances of finding a better local optimum. A value of 20 to 50 is typically recommended.
Define interaction
An interaction exists if the effect of one variable on the target variable changes with the value or level of another variable.
eta
Learning rate parameter in boosting methods. Scalar multiple between 0 and 1.
Predictions of the current tree are scaled by the eta parameter and added to the overall model. Lower values of eta "slow" the learning and generally result in more accurate models.
nrounds
Maximum number of boosting rounds. Often set to a fairly large number (e.g., 100 to 200) to ensure that a sufficient number of trees are grown.
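A hedged sketch of a boosted regression model with the xgboost package; the preprocessing, parameter values, and objective are illustrative:

```r
library(xgboost)

# xgboost requires a numeric feature matrix
X <- model.matrix(target ~ ., data = train)[, -1]

set.seed(42)
boost_fit <- xgboost(data = X, label = train$target,
                     objective = "reg:squarederror",
                     eta = 0.1,       # learning rate: lower values "slow" the learning
                     max_depth = 4,   # depth of each base tree
                     nrounds = 200,   # maximum number of boosting rounds
                     verbose = 0)
```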