Predictive Analytics Flashcards

1
Q

Advantages of Converting Numeric Variables to Factor Variables

A
  1. Provides added flexibility to the model
  2. When treated as numeric variables, there is an implicit assumption of a monotonic relationship between these variables and the target variable in GLMs
  3. In decision trees, splits on a numeric variable must respect the order of its values; converting to a factor removes this restriction
  4. Potential improvement in predictive accuracy due to the flexibility to capture the effects of the variable across different parts of its range
2
Q

Disadvantages of Converting Numeric Variables to Factor Variables

A
  1. When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
  2. In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
3
Q

Check to Support Conversion of Numeric Variable to Factor Variable

A

Examine the mean of the target variable split by different values of the integer variable. If the mean does not vary in a monotonic fashion, this supports the conversion.
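This check can be sketched in a few lines of Python (the exam itself uses R; the data below are hypothetical and the logic is what matters):

```python
from collections import defaultdict

def mean_by_level(x, y):
    """Mean of the target y for each distinct value of the integer variable x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, yi in zip(x, y):
        sums[xi] += yi
        counts[xi] += 1
    return {k: sums[k] / counts[k] for k in sorted(sums)}

def is_monotonic(means):
    """True if the level means are entirely non-decreasing or non-increasing."""
    vals = list(means.values())
    inc = all(a <= b for a, b in zip(vals, vals[1:]))
    dec = all(a >= b for a, b in zip(vals, vals[1:]))
    return inc or dec

# Illustrative data: the level means dip and then rise, so they are
# not monotonic in x, which supports converting x to a factor.
x = [1, 1, 2, 2, 3, 3]
y = [5.0, 7.0, 2.0, 4.0, 8.0, 10.0]
```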

4
Q

Define offset

A

In a predictive model, an offset is a variable that accounts for the different exposure periods of different observations and thereby yields more accurate predictions.

In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori.
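For a log-link count model, the offset enters the linear predictor as log(exposure) with its coefficient fixed at 1. A minimal Python sketch of the resulting prediction (illustrative values only):

```python
import math

def predicted_count(linear_predictor, exposure):
    """Log-link prediction with log(exposure) as an offset (coefficient fixed at 1)."""
    return math.exp(linear_predictor + math.log(exposure))

# Because the offset's coefficient is 1, doubling the exposure
# doubles the predicted count, all else being equal.
```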

5
Q

Introducing Training / Test Sets

A

Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be fitted on the training data, while their predictive performance will be evaluated on the test data.

6
Q

Define minbucket

A

The minimum number of observations in any terminal node of the tree. The higher the value, the smaller the number of splits and less complex the tree.

7
Q

Define cp

A

The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.

8
Q

Define maxdepth

A

The maximum number of branches from the tree’s root node to the furthest terminal node. The higher the value, the more complex the tree.

9
Q

Describe Cost-Complexity Pruning

A

Technique that performs cross-validation to evaluate the predictive performance of a tree. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and computes the error on the held-out fold. This is repeated for all values of the cp parameter above the specified minimum to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the impurity reduction threshold are removed.

10
Q

One-Standard-Error Rule

A

Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.

11
Q

Define ntree

A

The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and to ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.

12
Q

Define mtry

A

Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates. The number of candidates is defined by the mtry parameter. Making mtry too large increases the correlation between the predictions of the different trees (undermining the benefit of randomization), while making mtry too small imposes severe restrictions on the tree-growing process.

13
Q

What is a random forest?

A

Ensemble method that relies on bagging to produce a large number of bootstrapped training samples over which trees are constructed in parallel. The results of the different trees are combined, which reduces the variance of predictions and prevents overfitting.

14
Q

What is overdispersion?

A

Overdispersion refers to the situation when the variance of the target variable is greater than the mean.

15
Q

Benefits of Using Log Link

A
  1. Ensures the model predictions are non-negative
  2. Makes the model easy to interpret.
  3. It is the canonical link for the Poisson distribution and thus facilitates the convergence of the model-fitting algorithm.
16
Q

Pros of Using Binarization with Stepwise Selection

A

Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping the factor variables in their entirety).

17
Q

Cons to Using Binarization with Stepwise Selection

A

May cause procedure to take significantly more time to complete.

Resulting model may also be hard to interpret.

18
Q

Components of Stepwise Selection

A

Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)

19
Q

AIC

A

Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit to the training data against model complexity. The lower the AIC, the better the model.

20
Q

BIC

A

Performance metric used to rank competing models (similar to AIC). Defined as -2l + p·ln(n), where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit to the training data against model complexity. The lower the BIC, the better the model.
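The two formulas can be written as one-line Python functions (a sketch for intuition; in R these values come directly from AIC() and BIC()):

```python
import math

def aic(loglik, p):
    """AIC = -2l + 2p."""
    return -2 * loglik + 2 * p

def bic(loglik, p, n):
    """BIC = -2l + p*ln(n)."""
    return -2 * loglik + p * math.log(n)
```

For any n ≥ 8, ln(n) > 2, so BIC applies the heavier per-parameter penalty.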

21
Q

AIC vs BIC

A

Both metrics demand that, for the inclusion of an additional feature to improve the performance of the model, the feature must increase the loglikelihood by at least a certain amount (the penalty).

The penalty is typically greater for BIC (ln(n) > 2 whenever n ≥ 8), so BIC is more stringent toward complex models and represents the more conservative approach.

22
Q

Forward Selection

A

Opposite of backward selection. Starts with the simplest model (model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model until no features can be added to improve the model.

23
Q

Backward Selection

A

Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion.

24
Q

Forward vs Backward Selection

A

Forward selection is more likely to result in a simpler model relative to backward selection given forward selection starts with a model with no features.

25
How to Interpret Coefficients of Log Link
Exponentiate the coefficient and subtract 1. The result is the proportional (percentage) change in the target mean associated with a unit increase in the predictor (or, for a factor level, relative to the baseline level).
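This interpretation rule is a one-liner (a Python sketch; in R one would exponentiate coef() output the same way):

```python
import math

def multiplicative_effect(beta):
    """Proportional change in the target mean per unit increase in the
    predictor, for a log-link GLM: exp(beta) - 1."""
    return math.exp(beta) - 1

# A coefficient of ln(1.25) corresponds to a 25% increase in the target mean.
```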
26
What is Regularized Regression?
Alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the function to optimize (loglikelihood). This shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the coefficients are forced exactly to zero, thus removing those features from the model.
27
Regularization vs Stepwise Selection
  1. Binarization of factor variables is done automatically in regularization, and each factor level is treated as a separate feature to remove
  2. Cross-validation can be used to optimize the regularization parameter (lambda) such that the RMSE is minimized
  3. Coefficient estimates are more difficult to interpret in regularization because the variables are standardized
  4. The glmnet package only allows a restricted set of model forms
28
Supervised Learning Methods
Target variable guides the analysis. Main Methods: GLMs, Decision Trees
29
Unsupervised Learning Methods
Target variable is absent; interested in extracting relationships between variables in the data (these methods lend themselves to high-dimensional datasets). Main Methods: Principal Components Analysis, Cluster Analysis
30
Regression vs Classification
Regression has numeric target variable, classification has categorical target variable
31
Training vs Test Set
Training Set is the data used to develop the predictive model (typically the largest portion of the data). Test Set is the data used to evaluate the model's predictive performance on data it has not seen before.
32
What is Cross-Validation?
Alternative to a training/test split when the data set is small. Splits the data into k equal folds; each fold is used once as the validation set, while the remaining folds are used as the training set. The predictive model is fit k times and the predictions are combined.
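The fold construction can be sketched in plain Python (illustrative only; in R this is typically handled by caret or rpart's built-in xval):

```python
def kfold_split(n, k):
    """Return k (train_idx, val_idx) pairs; each observation is used for
    validation exactly once across the k splits."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for j in range(k):
        val = folds[j]
        val_set = set(val)
        train = [i for i in range(n) if i not in val_set]
        splits.append((train, val))
    return splits
```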
33
Common Performance Metrics Used in Regression Problems
Root Mean Squared Error (RMSE): aggregate of all prediction errors in the test set
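RMSE can be computed directly from its definition (a Python sketch of the standard formula):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: sqrt of the average squared prediction error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```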
34
Common Performance Metric Used in Classification Problems
Classification Error Rate: proportion of observations in the test set that are incorrectly classified
35
Define Bias
Difference between the expected value of the model's predictions and the true value. The more complex the model, the lower the bias.
36
Define Variance
Amount by which the predictions would change if a different training set were used. The more complex the model, the higher the variance.
37
What is Irreducible Error?
The variance of the noise. Cannot be reduced no matter how good the predictive model is
38
Variables vs Features
Variables are raw recorded measurements from the original dataset, without any transformations. Features are derived from the raw variables.
39
Feature Generation
Process of developing new features based on existing variables in the data. Seeks to enhance the flexibility of the model and lower the bias of the predictions, at the expense of an increase in variance. Plays a more prominent role in GLMs than in decision trees.
40
Feature Selection (Removal)
Opposite of feature generation; the process of dropping features with limited predictive power. An important concept in both GLMs and decision trees.
41
Feature Selection Methods
Forward/Backward Selection
42
Other Commonly Used Performance Metrics
RMSE (most common), loglikelihood, R-squared (goodness-of-fit measure), chi-square
43
What is Binning?
Creating a categorical predictor whose levels are defined as non-overlapping intervals of the original variable
44
What is Binarization?
Feature generation method for categorical variables that turns a categorical variable into a collection of binary (dummy) variables. The baseline level defaults to the first level in alphabetical order; it is a good idea to make the baseline the level with the most observations.
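Binarization with an alphabetical default baseline can be sketched as follows (a simplified Python illustration of what R's model.matrix does automatically):

```python
def binarize(values, baseline):
    """Dummy-encode a categorical variable, omitting the baseline level.
    Levels are ordered alphabetically, mirroring R's default behavior."""
    levels = [lvl for lvl in sorted(set(values)) if lvl != baseline]
    dummies = [[1 if v == lvl else 0 for lvl in levels] for v in values]
    return levels, dummies
```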
45
Interactions Between Continuous and Categorical Predictors
Multiply the continuous variable by each of the binary variables created from the categorical variable. To assess the extent of the interaction graphically, use a scatterplot.
46
Interactions Between Two Categorical Predictors
Multiply each of the binary variables created from one categorical variable by each of those created from the other. To assess the extent of the interaction graphically, use a box plot.
47
Regularization
Alternative to forward/backward selection that shrinks the magnitude of coefficients of features with limited predictive importance towards zero. The goal is to simplify the model and avoid overfitting.
48
Ridge Method
Type of regularization in which the penalty is the sum of squares of the slope coefficients
49
Lasso Method
Type of regularization in which the penalty is the sum of absolute values of the slope coefficients
50
Elastic Net Regression Method
Type of regularization in which the penalty captures both lasso and ridge
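The three penalties can be written out explicitly. Note this is a simplified illustration of the mixing idea: the glmnet package itself scales the ridge term by (1 - alpha)/2 rather than (1 - alpha).

```python
def lasso_penalty(betas):
    """Sum of absolute values of the slope coefficients."""
    return sum(abs(b) for b in betas)

def ridge_penalty(betas):
    """Sum of squares of the slope coefficients."""
    return sum(b * b for b in betas)

def elastic_net_penalty(betas, alpha):
    """Mixture of the two: alpha = 1 recovers lasso, alpha = 0 recovers ridge
    (simplified mixing for intuition; see the glmnet docs for the exact form)."""
    return alpha * lasso_penalty(betas) + (1 - alpha) * ridge_penalty(betas)
```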
51
Regularization Parameter (Lambda)
When lambda = 0, the coefficient estimates equal the ordinary least squares estimates. As lambda increases, the effect of regularization increases.
52
Ridge vs Lasso
The lasso method can force coefficients to exactly zero, whereas the ridge method shrinks them but never exactly to zero (it retains all features). Lasso therefore tends to produce simpler models with fewer features.
53
Advantages of Regularization
Computationally more efficient than stepwise selection algorithms
54
Disadvantages of Regularization
  1. May not produce the most interpretable model (especially for ridge)
  2. During model fitting, the numeric predictors are standardized, which makes the interpretation of the coefficient estimates less intuitive
  3. Cannot accommodate all GLM distributions
55
GLMs vs Linear Models
GLMs offer considerably more flexibility:
  1. The target variable can be any member of the exponential family of distributions
  2. GLMs can analyze situations in which the effects of the predictors on the target mean are more complex than merely additive
56
Target Distribution for Continuous, Positive Data
Gamma and inverse Gaussian most appropriately capture the skewness of the target variable. Examples: claim amounts, income, amount of insurance coverage.
57
Target Distribution for Binary Data
Binomial (occurrence or non-occurrence of event)
58
Target Distribution for Count Data
Poisson is the most natural candidate. Drawback: it requires the mean to equal the variance; otherwise overdispersion may exist.
59
Link Function for Positive Target Mean
Log link is the most natural candidate, as it ensures predictions are always positive. Also easily interpretable.
60
Link Function for Target Mean Between 0 and 1
Logit link is a good candidate, as it ensures predictions are always between 0 and 1. Also easily interpretable due to its connection to the log link function.
61
What is a Canonical Link Function?
Link function that simplifies the estimation procedure. Should not always be used; interpretability also needs to be considered.
62
Canonical Link Function for Normal Distribution
Identity
63
Canonical Link Function for Binomial Distribution
Logit
64
Canonical Link Function for Poisson Distribution
Natural Log
65
Canonical Link Function for Gamma Distribution
Inverse
66
Canonical Link Function for Inverse Gaussian Distribution
Squared Inverse
67
Weights
Used in GLMs to place greater emphasis on observations that are averaged across more subjects with similar characteristics. The variance of each observation is inversely related to the group size. Weights do not affect the mean of the target variable.
68
Offsets
Additional predictor used to account for the different means of different observations. Important for the offset to be on the same scale as the linear predictor. Group size is positively related to the mean of the target variable. Offsets do not affect the variance of the target variable.
69
Method for Estimating Parameter Coefficients in Linear Models
Least Squares Regression
70
Method for Estimating Parameter Coefficients in GLMs
Maximum Likelihood Estimation
71
What is Deviance?
Goodness-of-fit measure for GLMs (analogous to R-squared in linear models). Measures the extent to which the GLM departs from the most elaborate (saturated) model. The lower the deviance, the closer the GLM is to a perfect fit.
72
What are Deviance Residuals?
Each observation's contribution to the deviance of a GLM. Used instead of raw residuals because, for a well-specified model, deviance residuals are approximately normally distributed.
73
Q-Q Plot
Displays the standardized deviance residuals against the standard normal quantiles
74
Confusion Matrix
Tabular display of how predictions of a binary classifier line up with the observed classes
75
Classification Error Rate
(False Negatives + False Positives) / n
76
Sensitivity
(True Positives) / (True Positives + False Negatives). The higher the sensitivity and specificity, the better the classifier.
77
Specificity
(True Negatives) / (True Negatives + False Positives). The higher the sensitivity and specificity, the better the classifier.
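The three classification metrics above follow directly from the confusion-matrix counts (a Python sketch with made-up counts):

```python
def classification_metrics(tp, fp, tn, fn):
    """Error rate, sensitivity, and specificity from confusion-matrix counts."""
    n = tp + fp + tn + fn
    return {
        "error_rate": (fp + fn) / n,        # misclassified / total
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
    }
```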
78
Receiver Operator Characteristic Curve (ROC)
Graphical tool that displays sensitivity against specificity across cutoff values. A classifier with perfect predictive performance rises quickly toward the top-left corner.
79
AUC
Area under the ROC curve. A measure of the predictive performance of the model (the closer the AUC is to 1, the better).
80
Regression Trees
Have a quantitative target variable and use the average of the target variable in that group as the predicted value
81
Classification Trees
Have qualitative target variables and use the most common class (mode) of the target variable in that group as the predicted class
82
Node
Point on the tree that corresponds to a subset of the data
83
Root Node
Node at the top of the tree representing the full data set
84
Terminal Node
Also referred to as leaf. The nodes at the bottom of the tree that cannot be split any further
85
Binary Tree
Each node only has two children
86
Depth
The number of branches from the root node to the furthest terminal node
87
Measure of Impurity in Regression Trees
Residual sum of squares
88
Measure of Impurity in Classification Trees
Most common is classification error. Other measures of node impurity are entropy and the Gini index. The choice of impurity measure does not have a significant impact on the performance of the tree.
89
Pruning
Process to control the complexity of the tree structure (similar to stepwise selection in GLMs). Start at the bottom of the tree and remove splits that do not meet a specified impurity reduction.
90
Advantages of Decision Trees
  1. Generally easier to interpret compared to GLMs
  2. Excel at handling nonlinear relationships and do not require transformations
  3. Good at automatically recognizing interactions between variables
  4. Do not require binarization of categorical variables
  5. Variables are automatically selected, with the most important variables appearing at the top of the tree
  6. Much less susceptible to model mis-specification than GLMs
  7. Can easily be modified to deal with missing data
91
Disadvantages of Decision Trees
  1. More prone to overfitting relative to GLMs and produce unstable predictions with higher variance (a small change in the training data can lead to big changes in the fitted tree)
  2. Favor categorical features with many levels over those with few levels
  3. Lack of model diagnostics
92
Random Forests
Ensemble method that generates multiple bootstrapped samples of the training set and fits a base tree to each bootstrapped sample. The results from all base trees are combined to form an overall prediction. Randomization is also performed at each split. Reduces model variance and improves predictive performance.
93
Advantages of Random Forests
  1. Much more robust than single trees
  2. More precise predictions with lower variance
94
Disadvantages of Random Forests
  1. Not easily interpretable
  2. Requires considerably more computational power
95
Boosting
Ensemble method that builds a sequence of interdependent trees using information from previously grown trees. Each iteration fits a tree to the residuals of the preceding fit. Reduces model bias.
96
Advantages of Boosting
1. Improves predictive accuracy
97
Disadvantages of Boosting
1. More vulnerable to overfitting 2. Significant computational cost 3. Not easily interpretable
98
Principal Components Analysis
Advanced technique that transforms a large number of (possibly correlated) variables into a smaller, more manageable set of representative variables that capture much of the information in the full data set. The resulting variables, referred to as principal components, are linear combinations of the existing variables. Particularly useful for feature generation. To perform PCA on categorical variables, they must first be binarized.
99
Scree Plot
Provides a simple visual method for determining the number of principal components to use. Depicts the PVE (proportion of variance explained) of each PC.
100
Elbow of Scree Plot
Point at which the PVE drops off significantly. PCs beyond the elbow have a very small PVE and can be dropped.
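PVE is simply each component's variance divided by the total variance (a Python sketch with illustrative variances; in R these come from summary(prcomp(...))):

```python
def pve(variances):
    """Proportion of variance explained by each principal component."""
    total = sum(variances)
    return [v / total for v in variances]

def cumulative_pve(variances):
    """Running total of PVE, useful for deciding how many PCs to keep."""
    out, running = [], 0.0
    for p in pve(variances):
        running += p
        out.append(running)
    return out
```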
101
Cluster Analysis
Partitions heterogeneous observations into a set of distinct, homogeneous groups (clusters), such that observations within each cluster have similar characteristics. Two main methods: k-means clustering and hierarchical clustering.
102
K-Means Clustering
Assigns each observation to one of k clusters, with k chosen beforehand. The algorithm automatically searches for the best configuration of the k clusters: clusters are chosen so that the variance within each cluster is small while the variance between different clusters is large. It is advisable to run the algorithm many times with different initial cluster assignments and then choose the run with the lowest within-cluster variance. Features should be standardized before clustering.
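The standardization step mentioned above is easy to state explicitly (a Python sketch of what R's scale() does, using the sample standard deviation):

```python
def standardize(xs):
    """Center a feature to mean 0 and scale to unit sample standard deviation,
    so no feature dominates the distance calculations in k-means."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]
```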
103
Elbow Method
Plot of the ratio of between-cluster variation to total variation against k. Used to choose k: look for the point where the proportion of variance explained plateaus.
104
Hierarchical Clustering
Clustering method that does not require choosing k at the start. Produces a dendrogram, a tree-based visualization of the hierarchy of clusters. Consists of a series of successive fusions (mergers) of clusters.
105
Complete Linkage
Measure of intercluster dissimilarity based on the maximal pairwise distance between observations in two clusters
106
Single Linkage
Measure of intercluster dissimilarity based on the minimal pairwise distance between observations in two clusters
107
Average Linkage
Measure of intercluster dissimilarity based on the average pairwise distance between observations in two clusters
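The three linkage measures differ only in how they summarize the set of pairwise distances. A one-dimensional Python sketch (real implementations use Euclidean distance in higher dimensions):

```python
def _pairwise(c1, c2):
    """All pairwise distances between two 1-D clusters (for illustration)."""
    return [abs(a - b) for a in c1 for b in c2]

def complete_linkage(c1, c2):
    """Maximal pairwise distance between the clusters."""
    return max(_pairwise(c1, c2))

def single_linkage(c1, c2):
    """Minimal pairwise distance between the clusters."""
    return min(_pairwise(c1, c2))

def average_linkage(c1, c2):
    """Average pairwise distance between the clusters."""
    d = _pairwise(c1, c2)
    return sum(d) / len(d)
```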
108
Dendrogram
Tree-based visualization of the hierarchy of clusters. Clusters that fuse toward the bottom of the dendrogram are similar to one another, while clusters that fuse toward the top are far apart.
109
K-Means vs Hierarchical Clustering
  1. K-means clustering requires standardization of features
  2. The number of clusters is pre-defined in k-means clustering
  3. Hierarchical clustering produces nested clusters
110
nstart parameter
Used in cluster analysis, the nstart parameter controls the number of random selections of initial cluster centers used by the kmeans algorithm. A larger value improves the chances of finding a better local optimum. Values of 20-50 are commonly recommended.
111
Define interaction
An interaction exists if the effect of one variable on the target variable changes with the value or level of another variable.
112
eta
Learning rate parameter in boosting methods; a scalar between 0 and 1. The predictions of the current tree are scaled by eta and added to the overall model. Lower values of eta "slow" the learning and generally result in more accurate models.
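The role of eta can be seen in a toy boosting loop. This is a deliberately degenerate sketch, not a real gradient-boosting implementation: each "tree" just predicts the mean residual, so the model converges toward the mean of y, more slowly for smaller eta.

```python
def boost_constant_learner(y, eta, rounds):
    """Toy boosting loop: each round the weak learner predicts the mean
    residual, which is scaled by eta and added to the running prediction."""
    pred = [0.0] * len(y)
    for _ in range(rounds):
        mean_resid = sum(yi - pi for yi, pi in zip(y, pred)) / len(y)
        pred = [p + eta * mean_resid for p in pred]
    return pred
```

With eta = 1, a single round jumps straight to the mean of y; with eta = 0.1, many more rounds are needed to get close, which is why a small eta is usually paired with a large nrounds.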
113
nrounds
Maximum number of boosting rounds. Often set to a large number (e.g., 100-200) to ensure a sufficient number of trees are grown.