Predictive Analytics Flashcards

1
Q

Advantages of Converting Numeric Variables to Factor Variables

A
  1. Provides added flexibility to the model
  2. When treated as numeric in a GLM, there is an implicit assumption of a monotonic relationship between the variable and the target variable; conversion removes this assumption
  3. When treated as numeric in a decision tree, splits have to respect the order of the variable's values; conversion removes this restriction
  4. Potential improvement in predictive accuracy due to the flexibility of capturing the effects of the variable across different parts of its range
2
Q

Disadvantages of Converting Numeric Variables to Factor Variables

A
  1. When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
  2. In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
3
Q

Check to Support Conversion of Numeric Variable to Factor Variable

A

Examine the mean of the target variable split by different values of the integer variable. If the mean does not vary in a monotonic fashion, this supports the conversion.

4
Q

Define offset

A

In a predictive model, an offset is a variable that serves to account for the different exposure periods of different observations and therefore yields more accurate predictions.

In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori.

5
Q

Introducing Training / Test Sets

A

Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be trained on the training data, while their predictive performance will be evaluated on the test data.

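A minimal sketch of how such a stratified split might be done in R with the caret package; the data frame dat and target variable target are hypothetical names.

library(caret)
set.seed(1234)                                    # for reproducibility
idx <- createDataPartition(dat$target, p = 0.7, list = FALSE)  # stratified sampling on the target
train <- dat[idx, ]                               # 70% training set
test  <- dat[-idx, ]                              # 30% test set
mean(train$target); mean(test$target)             # check that the two sets are comparable
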
6
Q

Define minbucket

A

The minimum number of observations in any terminal node of the tree. The higher the value, the fewer the splits and the less complex the tree.

7
Q

Define cp

A

The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.

8
Q

Define maxdepth

A

The maximum number of branches from the tree's root node to the furthest terminal node. The higher the value, the more complex the tree.

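To illustrate the three control parameters from the last three cards, here is a sketch (not the author's code) of fitting a regression tree with the rpart package; train and target are hypothetical names.

library(rpart)
tree <- rpart(target ~ ., data = train, method = "anova",
              control = rpart.control(minbucket = 5,    # minimum observations in a terminal node
                                      cp = 0.001,       # minimum improvement required to attempt a split
                                      maxdepth = 10))   # maximum depth of the tree
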
9
Q

Describe Cost-Complexity Pruning

A

Technique that uses cross-validation to evaluate the predictive performance of a tree. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and computes the prediction error on the held-out fold. This is repeated for each candidate value of the cp parameter at or above the specified value to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the corresponding impurity reduction threshold are removed.

10
Q

One-Standard-Error Rule

A

Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.

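A sketch of how cost-complexity pruning and the one-standard-error rule from the last two cards might look with rpart, assuming the fitted tree object from the earlier sketch.

printcp(tree)                                                # cross-validation error (xerror) for each cp value
cp.tbl <- tree$cptable
cp.min <- cp.tbl[which.min(cp.tbl[, "xerror"]), "CP"]        # cp with the lowest CV error
pruned <- prune(tree, cp = cp.min)
# One-standard-error rule: largest cp (i.e., smallest tree) whose xerror is within
# one standard error (xstd) of the minimum xerror
cutoff <- min(cp.tbl[, "xerror"]) + cp.tbl[which.min(cp.tbl[, "xerror"]), "xstd"]
cp.1se <- max(cp.tbl[cp.tbl[, "xerror"] <= cutoff, "CP"])
pruned.1se <- prune(tree, cp = cp.1se)
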
11
Q

Define ntree

A

The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.

12
Q

Define mtry

A

Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates. The number of candidates is defined by the mtry parameter. Making mtry too large causes the base trees to be highly correlated, undermining the variance reduction from averaging, while making mtry too small can impose severe restrictions on the tree growing process.

13
Q

What is a random forest?

A

Ensemble method that relies on bagging to produce a large number of bootstrapped training samples over which trees are constructed in parallel. The results of the different trees are combined, which reduces the variance of predictions and prevents overfitting.

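A sketch of fitting a random forest in R with the randomForest package, showing the ntree and mtry parameters from the previous cards; train and target are hypothetical names.

library(randomForest)
set.seed(42)
rf <- randomForest(target ~ ., data = train,
                   ntree = 500,       # number of trees grown
                   mtry = 3,          # number of features sampled as split candidates at each split
                   importance = TRUE)
varImpPlot(rf)                        # variable importance plot
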
14
Q

What is overdispersion?

A

Overdispersion refers to the situation when the variance of the target variable is greater than the mean.

15
Q

Benefits of Using Log Link

A
  1. Ensures the model predictions are non-negative
  2. Makes the model easy to interpret (the effects of the predictors on the target mean are multiplicative)
  3. It is the canonical link for the Poisson distribution and thus facilitates the convergence of the model fitting algorithm
16
Q

Pros of Using Binarization with Stepwise Selection

A

Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping the factor variables in their entirety).

17
Q

Cons to Using Binarization with Stepwise Selection

A

May cause procedure to take significantly more time to complete.

Resulting model may also be hard to interpret.

18
Q

Components of Stepwise Selection

A

Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)

19
Q

AIC

A

Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit of the model to the training data against the complexity of the model. The lower the AIC, the better the model.

20
Q

BIC

A

Performance metric used to rank competing models (similar to AIC). Defined as -2l + ln(n)p, where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit of the model to the training data against the complexity of the model. The lower the BIC, the better the model.

21
Q

AIC vs BIC

A

Both metrics demand that, for the inclusion of an additional feature to improve the model, the feature must increase the loglikelihood by at least a certain amount (the penalty).

In general, the penalty is greater for BIC (ln(n) > 2 whenever n >= 8), so BIC is more stringent toward complex models and represents the more conservative approach.

22
Q

Forward Selection

A

Opposite of backward selection. Starts with the simplest model (model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model until no features can be added to improve the model.

23
Q

Backward Selection

A

Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion.

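A sketch of forward and backward selection using stats::step(); the fitted GLMs full and null are hypothetical. The k argument sets the penalty per parameter: k = 2 gives AIC, k = log(n) gives BIC.

full <- glm(target ~ ., data = train, family = poisson(link = "log"))
null <- glm(target ~ 1, data = train, family = poisson(link = "log"))
back <- step(full, direction = "backward", k = 2)      # backward selection with AIC
fwd  <- step(null, direction = "forward",
             scope = list(lower = null, upper = full),
             k = log(nrow(train)))                      # forward selection with BIC
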
24
Q

Forward vs Backward Selection

A

Forward selection is more likely to result in a simpler model relative to backward selection because forward selection starts with a model with no features.

25
Q

How to Interpret Coefficients of Log Link

A

Exponentiate the coefficient and subtract 1. This gives the proportional (percentage) change in the target mean associated with a one-unit increase in the predictor, holding all other predictors fixed. For example, a coefficient of 0.05 corresponds to exp(0.05) - 1 ≈ 5.1% increase in the target mean.

26
Q

What is Regularized Regression?

A

Alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the function to optimize (loglikelihood). This shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the coefficients are forced exactly to zero, thus removing those features from the model.

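A sketch of regularized regression with the glmnet package; dat and target are hypothetical names. Binarization of factors is handled by model.matrix, and cross-validation is used to choose the regularization parameter lambda.

library(glmnet)
X <- model.matrix(target ~ ., data = dat)[, -1]   # binarize factors, drop the intercept column
y <- dat$target
set.seed(1)
cvfit <- cv.glmnet(X, y, family = "poisson",
                   alpha = 1)                     # alpha = 1 lasso, alpha = 0 ridge, in between elastic net
plot(cvfit)                                       # cross-validation error across the lambda grid
coef(cvfit, s = "lambda.min")                     # coefficients at the lambda with lowest CV error
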
27
Q

Regularization vs Stepwise Selection

A
  1. Binarization of factor variables is done automatically in regularization, and each factor level is treated as a separate feature to remove
  2. Cross-validation can be used to optimize the regularization parameter (lambda) such that the RMSE is minimized
  3. Coefficient estimates are more difficult to interpret in regularization because the variables are standardized
  4. The glmnet package only allows a restricted set of model forms
28
Q

Supervised Learning Methods

A

Target variable guides the analysis.

Main Methods: GLMs, Decision Trees

29
Q

Unsupervised Learning Methods

A

Target variable is absent; interested in extracting relationships between variables in the data (lend themselves to high-dimensional datasets)

Main Methods: Principal Components Analysis, Cluster Analysis

30
Q

Regression vs Classification

A

Regression has numeric target variable, classification has categorical target variable

31
Q

Training vs Test Set

A

Training Set is the data used to develop the predictive model (typically the largest portion of the data)

Test Set is the data used to evaluate the model's predictive performance when it is applied to data the model has not seen before

32
Q

What is Cross-Validation?

A

Alternative to training/test split when data set is small

Splits data into k equal folds; each fold is used once as the validation set, while the remaining folds are used as the training set

Predictive model is fit k times and the predictions are combined

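A minimal sketch of k-fold cross-validation using the caret package, assuming a data frame dat with target variable target.

library(caret)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
cvmod <- train(target ~ ., data = dat, method = "glm", trControl = ctrl)
cvmod$results                                     # cross-validated performance (e.g., RMSE)
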
33
Q

Common Performance Metrics Used in Regression Problems

A

Root Mean Squared Error (RMSE): aggregates all prediction errors in the test set; RMSE = sqrt( (1/n) * sum( (actual - predicted)^2 ) )

34
Q

Common Performance Metric Used in Classification Problems

A

Classification Error Rate: proportion of observations in the test set that are incorrectly classified

35
Q

Define Bias

A

Difference between the expected value of the prediction and the true value

The more complex the model, the lower the bias

36
Q

Define Variance

A

Amount by which the prediction would change if the model were fit to a different training set

The more complex the model, the higher the variance

37
Q

What is Irreducible Error?

A

The variance of the noise. Cannot be reduced no matter how good the predictive model is

38
Q

Variables vs Features

A

Variables are raw recorded measurements from the original dataset without any transformations

Features are derivatives of raw variables

39
Q

Feature Generation

A

Process of developing new features based on existing variables in the data

Seeks to enhance the flexibility of the model and lower the bias of the predictions at the expense of an increase in variance

Plays more prominent role in GLMs compared to Decision Trees

40
Q

Feature Selection (Removal)

A

Opposite of feature generation; process of dropping features with limited predictive power
Important concept in GLMs and Decision Trees

41
Q

Feature Selection Methods

A

Forward/Backward Selection

42
Q

Other Commonly Used Performance Metrics

A

RMSE (most common)
Loglikelihood
R-Squared (goodness of fit measure)
Chi-square

43
Q

What is Binning?

A

Creating a categorical predictor whose levels are defined as non-overlapping intervals of the original variable

44
Q

What is Binarization?

A

Feature generation method used for categorical variables, which turns a categorical variable into a collection of binary variables

By default, the baseline level is the first level in alphanumeric order

It is a good idea to make the baseline level the level with the most observations

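A sketch of setting the baseline level and inspecting the binarized (dummy) variables in R; dat and the factor region are hypothetical names.

dat$region <- relevel(factor(dat$region),
                      ref = names(which.max(table(dat$region))))   # baseline = most common level
head(model.matrix(~ region, data = dat))                           # dummy variables created from the factor
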
45
Q

Interactions Between Continuous and Categorical Predictors

A

Need to multiply continuous variable by each of the binary variables created from the categorical variable

To assess the extent of the interaction graphically, use a scatterplot of the target against the continuous predictor, colored by the levels of the categorical predictor (with separate fitted lines)

46
Q

Interactions Between Two Categorical Predictors

A

Need to multiply each of the binary variables created from both categorical variables

To assess the extent of the interaction graphically, use box plots of the target split by combinations of the two categorical predictors

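A sketch of specifying both types of interaction in a GLM and checking them graphically; dat, target, age, gender, and region are hypothetical names.

mod1 <- glm(target ~ age * gender, data = dat, family = Gamma(link = "log"))     # continuous x categorical
mod2 <- glm(target ~ gender * region, data = dat, family = Gamma(link = "log"))  # categorical x categorical
library(ggplot2)
ggplot(dat, aes(x = age, y = target, color = gender)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)   # scatterplot with separate trend lines
ggplot(dat, aes(x = gender, y = target, fill = region)) +
  geom_boxplot()                                          # box plots split by both factors
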
47
Q

Regularization

A

Alternative to forward/backward selection that shrinks the magnitude of coefficients of features with limited predictive importance towards zero

Goal is to simplify model and avoid overfitting

48
Q

Ridge Method

A

Type of regularization in which the penalty is the sum of squares of the slope coefficients

49
Q

Lasso Method

A

Type of regularization in which the penalty is the sum of absolute values of the slope coefficients

50
Q

Elastic Net Regression Method

A

Type of regularization in which the penalty captures both lasso and ridge

51
Q

Regularization Parameter (Lambda)

A

When lambda = 0, the coefficient estimates are equal to the ordinary least squares estimates

As lambda increases, the effect of regularization increases

52
Q

Ridge vs Lasso

A

The lasso method can force coefficients to exactly zero, whereas the ridge method shrinks them but never exactly to zero (all features are retained)

Lasso tends to produce simpler models with fewer features

53
Q

Advantages of Regularization

A

Computationally more efficient than stepwise selection algorithms

54
Q

Disadvantages of Regularization

A

May not produce the most interpretable model (especially for Ridge)

During model fitting, all numeric coefficients are standardized, which makes the interpretation of their estimates less intuitive

Cannot accommodate all distributions for GLMs

55
Q

GLMs vs Linear Models

A

GLMs offer considerably more flexibility:

  1. The target variable can be any member of the exponential family of distributions
  2. GLMs have the ability to analyze situations in which the effects of the predictors on the target mean are more complex than merely additive in nature
56
Q

Target Distribution for Continuous, Positive Data

A

Gamma and Inverse Gaussian most appropriately capture the skewness of the target variable

Ex. claim amounts, income, amount of insurance coverage

57
Q

Target Distribution for Binary Data

A

Binomial (occurrence or non-occurrence of event)

58
Q

Target Distribution for Count Data

A

Poisson most natural candidate

Drawback: requires the mean to equal the variance; count data often exhibit overdispersion (variance greater than the mean), violating this assumption

59
Q

Link Function for Positive Target Mean

A

Log link is most natural candidate, as it ensures predictions are always positive

Also easily interpretable

60
Q

Link Function for Target Mean Between 0 and 1

A

Logit link good candidate, as it ensures predictions are always between 0 and 1

Also interpretable (exponentiated coefficients give multiplicative changes in the odds) due to its connection to the log function

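A sketch of specifying these distribution/link combinations in R's glm(); the variable names are hypothetical.

freq.mod <- glm(claim_count ~ age + region, data = dat, family = poisson(link = "log"))   # count target, log link
prob.mod <- glm(lapse ~ age + region, data = dat, family = binomial(link = "logit"))      # binary target, logit link
sev.mod  <- glm(claim_size ~ age + region, data = dat, family = Gamma(link = "log"))      # positive continuous target
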
61
Q

What is a Canonical Link Function?

A

Link function that simplifies the estimation procedure

Should not always be used; need to consider interpretability

62
Q

Canonical Link Function for Normal Distribution

A

Identity

63
Q

Canonical Link Function for Binomial Distribution

A

Logit

64
Q

Canonical Link Function for Poisson Distribution

A

Natural Log

65
Q

Canonical Link Function for Gamma Distribution

A

Inverse

66
Q

Canonical Link Function for Inverse Gaussian Distribution

A

Squared Inverse

67
Q

Weights

A

Used in GLMs to assign a higher emphasis to observations that are averaged across more subjects of similar characteristics

Variance of each observation is inversely related to the group size

Does not affect the mean of the target variable

68
Q

Offsets

A

Additional predictor used to account for different means of different observations

Important for offset to be on same scale as the linear predictor

Group size is positively related to the mean of the target variable

Do not affect the variance of the target variable

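A sketch showing how weights and offsets are passed to glm(); the variable names (n_lives, exposure, etc.) are hypothetical.

sev.mod  <- glm(avg_claim ~ age + region, data = dat,
                family = Gamma(link = "log"), weights = n_lives)        # target is an average over n_lives policies
freq.mod <- glm(claim_count ~ age + region, data = dat,
                family = poisson(link = "log"), offset = log(exposure)) # offset on the log (linear predictor) scale
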
69
Q

Method for Estimating Parameter Coefficients in Linear Models

A

Least Squares Regression

70
Q

Method for Estimating Parameter Coefficients in GLMs

A

Maximum Likelihood Estimation

71
Q

What is Deviance?

A

Goodness-of-fit measure for GLMs (compared to R-squared in linear models)

Measures the extent to which the GLM departs from the most elaborate (or saturated) model

Lower the deviance, closer the GLM is to a perfect fit

72
Q

What are Deviance Residuals?

A

Measure of Deviance used in GLMs

Used over raw residuals because deviance residuals are approximately normally distributed when the model is adequate

73
Q

Q-Q Plot

A

Displays the standardized deviance residuals against the standard normal quantiles

74
Q

Confusion Matrix

A

Tabular display of how predictions of a binary classifier line up with the observed classes

75
Q

Classification Error Rate

A

(False Negatives + False Positives) / n

76
Q

Sensitivity

A

(True Positives) / (True Positives + False Negatives)

Higher sensitivity and specificity, better classifier

77
Q

Specificity

A

(True Negatives) / (True Negatives + False Positives)

Higher sensitivity and specificity, better classifier

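A sketch of computing these metrics from a confusion matrix in R, assuming pred holds predicted probabilities and actual holds observed 0/1 classes (hypothetical names, 0.5 cutoff).

pred.class <- ifelse(pred > 0.5, 1, 0)
cm <- table(Actual = actual, Predicted = pred.class)
TP <- cm["1", "1"]; TN <- cm["0", "0"]; FP <- cm["0", "1"]; FN <- cm["1", "0"]
(FP + FN) / sum(cm)     # classification error rate
TP / (TP + FN)          # sensitivity
TN / (TN + FP)          # specificity
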
78
Q

Receiver Operator Characteristic Curve (ROC)

A

Graphical tool that displays the trade-off between sensitivity and specificity as the classification cutoff varies

Classifier with perfect predictive performance rises quickly to top left corner

79
Q

AUC

A

Area under the curve of a ROC plot

Measure of the predictive performance of the model (want AUC to be closest to 1)

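A sketch of producing the ROC curve and AUC with the pROC package, using the same hypothetical actual and pred objects as above.

library(pROC)
roc.obj <- roc(actual, pred)   # observed classes and predicted probabilities
plot(roc.obj)                  # ROC curve
auc(roc.obj)                   # area under the curve
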
80
Q

Regression Trees

A

Have a quantitative target variable and use the average of the target variable in each terminal node as the predicted value

81
Q

Classification Trees

A

Have qualitative target variables and use the most common class (mode) of the target variable in that group as the predicted class

82
Q

Node

A

Point on the tree that corresponds to a subset of the data

83
Q

Root Node

A

Node at the top of the tree representing the full data set

84
Q

Terminal Node

A

Also referred to as leaf. The nodes at the bottom of the tree that cannot be split any further

85
Q

Binary Tree

A

Each node only has two children

86
Q

Depth

A

The number of branches from the root node to the furthest terminal node

87
Q

Measure of Impurity in Regression Trees

A

Residual sum of squares

88
Q

Measure of Impurity in Classification Trees

A

Most common is the classification error rate. Other measures of node impurity are entropy and the Gini index.

Choice of impurity measure does not have a significant impact on the performance of the tree

89
Q

Pruning

A

Process to control the complexity of the tree structure
Similar to stepwise selection in GLMs

Start at the bottom of the tree and remove splits that do not meet a specified impurity reduction threshold

90
Q

Advantages of Decision Trees

A
  1. Generally easier to interpret compared to GLMs
  2. Excel in handling nonlinear relationships and do not require transformations
  3. Good at automatically recognizing interactions between variables
  4. Do not require binarization for categorical variables
  5. Variables are automatically selected with the most important variables appearing at the top of the tree
  6. Much less susceptible to model mis-specification than GLMs
  7. Can easily be modified to deal with missing data
91
Q

Disadvantages of Decision Trees

A
  1. More prone to overfitting relative to GLMs and produce unstable predictions with more variance (small change in training data can lead to big changes in the fitted tree)
  2. Favor categorical features with many levels over those with few levels
  3. Lack of model diagnostics
92
Q

Random Forests

A

Ensemble method that generates multiple bootstrapped samples of the training set and fits base tree models to each bootstrapped sample

Results from all base trees are combined to form an overall prediction

Randomization is performed at each step

Improves model variance and predictive performance

93
Q

Advantages of Random Forests

A
  1. Much more robust than single trees
  2. More precise predictions with lower variance

94
Q

Disadvantages of Random Forests

A
  1. Not easily interpretable
  2. Takes considerably more computational power

95
Q

Boosting

A

Ensemble method that builds a sequence of interdependent trees using information from previously grown trees

Each iteration builds on the residuals of the prior tree

Improves model bias

96
Q

Advantages of Boosting

A
  1. Improves predictive accuracy
97
Q

Disadvantages of Boosting

A
  1. More vulnerable to overfitting
  2. Significant computational cost
  3. Not easily interpretable
98
Q

Principal Components Analysis

A

Advanced technique that transforms a large number of (possibly correlated) variables to a smaller, more manageable, set of representative variables that capture much of the information in the full data set

Resulting variables are referred to as principal components and are a linear combination of the existing variables

Particularly useful for feature generation

To perform PCA on categorical variables, they must first be binarized

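A sketch of PCA in R with prcomp, assuming a data frame numeric.dat containing only numeric (or already binarized) variables.

pca <- prcomp(numeric.dat, center = TRUE, scale. = TRUE)   # center and scale before PCA
summary(pca)                                               # proportion of variance explained by each PC
screeplot(pca, type = "lines")                             # scree plot; look for the elbow
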
99
Q

Scree Plot

A

Provides simple visual inspection method for determining number of principal components to use

Depicts the PVE (proportion of variance explained) of each PC

100
Q

Elbow of Scree Plot

A

Point at which the PVE drops off significantly

PCs beyond the elbow have a very small PVE and can be dropped

101
Q

Cluster Analysis

A

Partitions heterogeneous observations into a set of distinct homogeneous groups (clusters)

Observations within each cluster have similar characteristics

Two main methods: k-means clustering, hierarchical clustering

102
Q

K-Means Clustering

A

Assigns each observation into one of k clusters, with k being defined beforehand

Algorithm automatically searches for the best configuration of the k clusters

Clusters are chosen such that the variance within each cluster is small while the variance among different clusters is large

Advisable to run the algorithm many times with different initial cluster assignments and then choose the run with the lowest within-cluster variance

Features must be standardized before clustering

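A sketch of k-means clustering in R, again assuming a hypothetical numeric data frame numeric.dat.

set.seed(7)
km <- kmeans(scale(numeric.dat), centers = 3, nstart = 25)  # standardize, then run 25 random starts
km$betweenss / km$totss                                     # proportion of variance explained (for the elbow method)
km$cluster                                                  # cluster assignments
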
103
Q

Elbow Method

A

Plot of the ratio of between-cluster variation to total variation against k

Used to determine k - see when proportion of variance explained plateaus

104
Q

Hierarchical Clustering

A

Clustering method that does not require the choice of k at the start

Uses a dendrogram, which is a tree-based visualization of the hierarchy of clusters

Consists of a series of fusions (mergers) of clusters

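A sketch of hierarchical clustering in R with hclust, assuming the same hypothetical numeric.dat.

hc <- hclust(dist(scale(numeric.dat)), method = "complete")  # or "single" / "average" linkage
plot(hc)                                                     # dendrogram
cutree(hc, k = 3)                                            # cut the dendrogram into 3 clusters
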
105
Q

Complete Linkage

A

Measure of intercluster dissimilarity based on the maximal pairwise distance between observations in two clusters

106
Q

Single Linkage

A

Measure of intercluster dissimilarity based on the minimal pairwise distance between observations in two clusters

107
Q

Average Linkage

A

Measure of intercluster dissimilarity based on the average pairwise distance between observations in two clusters

108
Q

Dendrogram

A

Tree-based visualization of the hierarchy of clusters

Clusters towards bottom of the dendrogram are similar to one another, while clusters towards the top are far apart

109
Q

K-Means vs Hierarchical Clustering

A
  1. K-Means clustering requires standardization
  2. Number of clusters are pre-defined in k-means clustering
  3. Hierarchical clustering uses nested clusters
110
Q

nstart parameter

A

Used in cluster analysis, the nstart parameter controls the number of random selections of initial cluster centers used by the kmeans algorithm. A larger value improves the chances of finding a better local optimum. Values of 20-50 are recommended.

111
Q

Define interaction

A

An interaction exists if the effect of one variable on the target variable changes with the value or level of another variable.

112
Q

eta

A

Learning rate parameter in boosting methods. Scalar multiple between 0 and 1.
Predictions of the current tree are scaled by the eta parameter and added to the overall model. Lower values of eta "slow" the learning and generally result in more accurate models.

113
Q

nrounds

A

Maximum number of boosting rounds (trees). Often set to a large number (e.g., 100-200) to ensure a sufficient number of trees are grown.
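
A sketch of how eta and nrounds are supplied to a boosted model, assuming a recent version of the xgboost package and hypothetical X.train / y.train objects.

library(xgboost)
set.seed(99)
bst <- xgboost(data = as.matrix(X.train), label = y.train,
               nrounds = 200,                  # maximum number of boosting rounds
               eta = 0.05,                     # learning rate
               max_depth = 3,
               objective = "reg:squarederror",
               verbose = 0)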