Predictive Analytics Flashcards
Advantages of Converting Numeric Variables to Factor Variables
- Provides added flexibility to model
- When treated as numeric variables in a GLM, there is an implicit assumption of a monotonic relationship between these variables and the target variable
- When treated as numeric variables in a decision tree, splits have to respect the order of the variable's values; conversion to a factor removes this restriction
- Potential improvement in predictive accuracy due to the flexibility of capturing the effects of the variable across different values of its range
Disadvantages of Converting Numeric Variables to Factor Variables
- When converted, these variables have to be represented by a greater number of dummy variables in a GLM, which inflates the dimension of the data and possibly dilutes predictive power
- In a decision tree, the number of splits increases, which increases the computing burden and diminishes interpretability
Check to Support Conversion of Numeric Variable to Factor Variable
Examine the mean of the target variable split by different values of the integer variable. If the mean does not vary in a monotonic fashion, this supports the conversion.
Define offset
In a predictive model, an offset is a variable that serves to account for the different exposure periods of different observations and therefore yields more accurate predictions.
In a GLM, an offset is a predictor whose regression coefficient is known to be 1 a priori
Introducing Training / Test Sets
Prior to fitting any models, I split the data into a training set (70% of the observations) and a test set (30% of the observations) using stratified sampling. To check that the two sets are representative, I note that the means of the target variable in the two data sets are comparable. The models will be fit on the training data, while their predictive performance will be evaluated on the test data.
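A minimal sketch of such a split in R, assuming a data frame named data with a target column named target (both names are illustrative); createDataPartition from the caret package performs stratified sampling on the target:

```r
library(caret)

set.seed(42)
# Stratified 70/30 split on the (illustrative) target variable
idx   <- createDataPartition(data$target, p = 0.7, list = FALSE)
train <- data[idx, ]
test  <- data[-idx, ]

# Sanity check: the target means in the two sets should be comparable
mean(train$target)
mean(test$target)
```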
Define minbucket
The minimum number of observations in any terminal node of the tree. The higher the value, the fewer the splits and the less complex the tree.
Define cp
The minimum improvement (with respect to R-squared) needed in order to make a split. The higher the value, the less complex the tree.
Define maxdepth
The maximum number of branches from the tree's root node to the furthest terminal node. The higher the value, the more complex the tree.
Describe Cost-Complexity Pruning
Technique that uses cross-validation to evaluate the predictive performance of a tree and select its complexity. The algorithm divides the training data into 10 folds (the default for the xval parameter), trains the tree on all but one fold, and then computes the error on the held-out fold. This is performed for all values of the cp parameter greater than the specified value to determine which cp value yields the lowest cross-validation error. The tree is then pruned using this cp value: splits that do not meet the corresponding impurity reduction threshold are removed.
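A hedged sketch of how these controls and cost-complexity pruning might look with the rpart package; the data, variable names, and parameter values are illustrative:

```r
library(rpart)

tree <- rpart(target ~ ., data = train, method = "anova",
              control = rpart.control(minbucket = 5, cp = 0.001,
                                      maxdepth = 10, xval = 10))

# cptable stores the cross-validation error (xerror) for each candidate cp
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)
```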
One-Standard-Error Rule
Chooses the smallest tree whose cross-validation error is within one standard error of the minimum cross-validation error. Generally results in simpler trees.
Define ntree
The number of trees to be grown in a random forest. It is generally a good idea to set ntree to a large number to increase the variance reduction and ensure that each observation and each feature is represented at least once in the random forest. However, too large a value of ntree may lead to excessive run time.
Define mtry
Every time a split is made in a random forest, a random sample of features is taken and considered as split candidates. The number of candidates is defined by the mtry parameter. Making mtry too large increases the correlation between the predictions of the different trees, which undermines the variance reduction. Making mtry too small can impose severe restrictions on the tree-growing process.
What is a random forest?
Ensemble method that relies on bagging to produce a large number of bootstrapped training samples, over which trees are constructed in parallel. The results of the different trees are combined, which reduces the variance of predictions and prevents overfitting.
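A minimal sketch with the randomForest package (data and parameter values are illustrative):

```r
library(randomForest)

set.seed(42)
# ntree: number of trees grown; mtry: features sampled as split candidates
rf <- randomForest(target ~ ., data = train,
                   ntree = 500, mtry = 3, importance = TRUE)

# Variable importance and test-set predictions
importance(rf)
pred <- predict(rf, newdata = test)
```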
What is overdispersion?
Overdispersion refers to the situation when the variance of the target variable is greater than the mean.
Benefits of Using Log Link
- Ensures the model predictions are non-negative
- Makes the model easy to interpret (the effects of the predictors on the target mean are multiplicative).
- Also, it is the canonical link for the Poisson distribution and thus facilitates the convergence of the model fitting algorithm.
Pros of Using Binarization with Stepwise Selection
Allows the selection procedure to drop individual factor levels if they are statistically insignificant with respect to the base level (instead of retaining or dropping the factor variables in their entirety)
Cons to Using Binarization with Stepwise Selection
May cause procedure to take significantly more time to complete.
Resulting model may also be hard to interpret.
Components of Stepwise Selection
Selection Criterion (AIC, BIC) and Selection Process (Forward, Backward)
AIC
Performance metric used to rank competing models (similar to BIC). Defined as -2l + 2p, where l is the loglikelihood of the model on the training set and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the AIC, the better the model.
BIC
Performance metric used to rank competing models (similar to AIC). Defined as -2l + ln(n)p, where l is the loglikelihood of the model on the training set, n is the number of observations, and p is the number of parameters. Balances the goodness of fit of a model to the training data against the complexity of the model. The lower the BIC, the better the model.
AIC vs BIC
Both metrics demand that for the inclusion of an additional feature to improve the performance of the model, the feature must increase the loglikelihood by at least a certain amount (the penalty amount).
In general, the penalty is typically greater for BIC, so the BIC is more stringent for complex models and is a more conservative approach.
Forward Selection
Opposite of backward selection. Starts with the simplest model (model with only the intercept and no features) and progressively adds the feature that results in the greatest improvement of the model until no features can be added to improve the model.
Backward Selection
Opposite of forward selection. Starts with the most complex model (model with all features) and progressively removes the feature that causes, in its absence, the greatest improvement in the model according to a certain criterion, until no removal improves the model.
Forward vs Backward Selection
Forward selection is more likely to result in a simpler model relative to backward selection given forward selection starts with a model with no features.
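A sketch of both directions using base R's step function, with illustrative model formulas; k = 2 corresponds to the AIC penalty and k = log(n) to the BIC penalty:

```r
# Illustrative intercept-only and full candidate models
null_model <- glm(target ~ 1, family = poisson(), data = train)
full_model <- glm(target ~ x1 + x2 + x3, family = poisson(), data = train)

# Forward selection with AIC (k = 2)
forward_fit <- step(null_model, direction = "forward",
                    scope = list(lower = formula(null_model),
                                 upper = formula(full_model)),
                    k = 2)

# Backward selection with BIC (k = log(n))
backward_fit <- step(full_model, direction = "backward", k = log(nrow(train)))
```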
How to Interpret Coefficients of Log Link
Exponentiate the coefficient and subtract 1. The result is the proportional (percentage) change in the target mean associated with a one-unit increase in the predictor, holding all other predictors fixed.
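For instance, with a log link, an illustrative fitted coefficient of 0.05 corresponds to roughly a 5.1% increase in the target mean per one-unit increase in the predictor:

```r
coef_hat <- 0.05      # illustrative coefficient estimate under a log link
exp(coef_hat) - 1     # ~ 0.0513, i.e., about a 5.1% increase in the target mean
```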
What is Regularized Regression?
Alternative approach to reducing the complexity of a GLM and preventing overfitting. It works by adding a penalty term that reflects the size of the coefficients to the function to optimize (loglikelihood). This shrinks the magnitude of the estimated coefficients of features with limited predictive power towards zero. In some cases, the coefficients are forced exactly to zero, thus removing those features from the model.
Regularization vs Stepwise Selection
- Binarization of factor variables is done automatically in regularization, and each factor level is treated as a separate feature to remove
- Cross-validation can be used to optimize the regularization parameter (lambda) such that the RMSE is minimized
- Coefficient estimates are more difficult to interpret in regularization because the variables are standardized
- The glmnet package only allows a restricted set of model forms
Supervised Learning Methods
Target variable guides the analysis.
Main Methods: GLMs, Decision Trees
Unsupervised Learning Methods
Target variable is absent; interested in extracting relationships between variables in the data (lend themselves to high-dimensional datasets)
Main Methods: Principal Components Analysis, Cluster Analysis
Regression vs Classification
Regression has numeric target variable, classification has categorical target variable
Training vs Test Set
Training Set is the data used to develop the predictive model (typically the largest portion of the data)
Test Set is the data used to evaluate the predictive performance of the model when applied to data it has not seen before
What is Cross-Validation?
Alternative to training/test split when data set is small
Splits data into k equal folds; each fold is used once as the validation set, while the remaining folds are used as the training set
Predictive model is fit k times and the predictions are combined
Common Performance Metrics Used in Regression Problems
Root Mean Squared Error (RMSE): the square root of the average squared prediction error over the test set
Common Performance Metric Used in Classification Problems
Classification Error Rate: proportion of observations in the test set that are incorrectly classified
Define Bias
The difference between the expected value of the prediction and the true value of the target
The more complex the model, the lower the bias
Define Variance
The amount by which the prediction would change if the model were fit to a different training set
The more complex the model, the higher the variance
What is Irreducible Error?
The variance of the noise. Cannot be reduced no matter how good the predictive model is
Variables vs Features
Variables are raw recorded measurements from the original dataset without any transformations
Features are derived from the raw variables
Feature Generation
Process of developing new features based on existing variables in the data
Seeks to enhance the flexibility of the model and lower the bias of the predictions at the expense of an increase in variance
Plays more prominent role in GLMs compared to Decision Trees
Feature Selection (Removal)
Opposite of feature generation; process of dropping features with limited predictive power
Important concept in GLMs and Decision Trees
Feature Selection Methods
Stepwise selection (forward/backward) and regularization
Other Commonly Used Performance Metrics
RMSE (most common)
Loglikelihood
R-Squared (goodness of fit measure)
Chi-square
What is Binning?
Creating a categorical predictor whose levels are defined as non-overlapping intervals of the original variable
What is Binarization?
Feature generation method used for categorical variables, which turns a categorical variable into a collection of binary variables
By default, the baseline (reference) level is the first level in alphabetical order
It is a good idea to relevel so that the baseline is the level with the most observations
Interactions Between Continuous and Categorical Predictors
Need to multiply continuous variable by each of the binary variables created from the categorical variable
To assess the extent of the interaction graphically, use a scatterplot
Interactions Between Two Categorical Predictors
Need to multiply each pair of binary variables, one created from each of the two categorical variables
To assess the extent of the interaction graphically, use a box plot
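A hedged sketch of both kinds of interactions in R's formula syntax, with illustrative variable names (age continuous; region and vehicle_type categorical); the * operator includes the main effects plus the products of the dummy variables:

```r
# Continuous-categorical interaction
fit1 <- glm(target ~ age * region, family = Gamma(link = "log"), data = train)

# Categorical-categorical interaction
fit2 <- glm(target ~ region * vehicle_type, family = Gamma(link = "log"),
            data = train)

# Illustrative graphical checks with ggplot2
library(ggplot2)
ggplot(train, aes(x = age, y = target, color = region)) +
  geom_point() + geom_smooth(method = "lm", se = FALSE)
ggplot(train, aes(x = region, y = target, fill = vehicle_type)) +
  geom_boxplot()
```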
Regularization
Alternative to forward/backward selection that shrinks the magnitude of coefficients of features with limited predictive importance towards zero
Goal is to simplify model and avoid overfitting
Ridge Method
Type of regularization in which the penalty is the sum of squares of the slope coefficients
Lasso Method
Type of regularization in which the penalty is the sum of absolute values of the slope coefficients
Elastic Net Regression Method
Type of regularization in which the penalty captures both lasso and ridge
Regularization Parameter (Lambda)
When lambda = 0, the coefficient estimates are equal to the ordinary least squares estimates
As lambda increases, the effect of regularization increases
Ridge vs Lasso
The lasso method can force coefficients to exactly zero, whereas the ridge method shrinks them but never exactly to zero (all features are retained)
Lasso tends to produce simpler models with fewer features
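A minimal sketch with the glmnet package (variable names and the alpha value are illustrative); alpha = 0 gives ridge, alpha = 1 gives lasso, and intermediate values give elastic net, while cv.glmnet chooses lambda by cross-validation:

```r
library(glmnet)

# glmnet requires a numeric design matrix; factor levels are binarized automatically
X <- model.matrix(target ~ ., data = train)[, -1]
y <- train$target

set.seed(42)
cv_fit <- cv.glmnet(X, y, family = "gaussian", alpha = 1)

# Coefficients at the lambda with the lowest cross-validation error
coef(cv_fit, s = "lambda.min")
```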
Advantages of Regularization
Computationally more efficient than stepwise selection algorithms
Disadvantages of Regularization
May not produce the most interpretable model (especially for Ridge)
During model fitting, all numeric predictors are standardized, which makes the interpretation of the coefficient estimates less intuitive
Cannot accommodate all distributions for GLMs
GLMs vs Linear Models
GLMs offer considerably more flexibility:
- The target variable can be any member of the exponential family of distributions
- GLMs have the ability to analyze situations in which the effects of the predictors on the target mean are more complex than merely additive in nature
Target Distribution for Continuous, Positive Data
Gamma and Inverse Gaussian most appropriately capture the skewness of the target variable
Ex. claim amounts, income, amount of insurance coverage
Target Distribution for Binary Data
Binomial (occurrence or non-occurrence of event)
Target Distribution for Count Data
Poisson most natural candidate
Drawback: requires the variance to equal the mean; count data often exhibit overdispersion (variance greater than the mean), which violates this assumption
Link Function for Positive Target Mean
Log link is most natural candidate, as it ensures predictions are always positive
Also easily interpretable
Link Function for Target Mean Between 0 and 1
Logit link good candidate, as it ensures predictions are always between 0 and 1
Also easily interpretable due to its connection to the log link (the logit is the log of the odds)
What is a Canonical Link Function?
Link function that simplifies the estimation procedure
Should not always be used; need to consider interpretability
Canonical Link Function for Normal Distribution
Identity
Canonical Link Function for Binomial Distribution
Logit
Canonical Link Function for Poisson Distribution
Natural Log
Canonical Link Function for Gamma Distribution
Inverse
Canonical Link Function for Inverse Gaussian Distribution
Squared Inverse
Weights
Used in GLMs to assign a higher emphasis to observations that are averaged across more subjects of similar characteristics
Variance of each observation is inversely related to the group size
Does not affect the mean of the target variable
Offsets
Additional predictor used to account for different means of different observations
Important for offset to be on same scale as the linear predictor
Group size is positively related to the mean of the target variable
Do not affect the variance of the target variable
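A hedged sketch of both ideas in R's glm function, with illustrative variable names: the log of exposure enters a Poisson frequency model as an offset (so it is on the scale of the linear predictor), while group size enters a Gamma severity model on grouped data as a weight:

```r
# Poisson claim-count model with a log(exposure) offset
freq_fit <- glm(claim_count ~ age + region,
                family = poisson(link = "log"),
                offset = log(exposure),
                data = train)

# Gamma severity model for average claims, weighted by group size
sev_fit <- glm(avg_claim ~ age + region,
               family = Gamma(link = "log"),
               weights = group_size,
               data = train)
```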
Method for Estimating Parameter Coefficients in Linear Models
Least Squares Regression
Method for Estimating Parameter Coefficients in GLMs
Maximum Likelihood Estimation
What is Deviance?
Goodness-of-fit measure for GLMs (compared to R-squared in linear models)
Measures the extent to which the GLM departs from the most elaborate (or saturated) model
Lower the deviance, closer the GLM is to a perfect fit
What are Deviance Residuals?
The signed square root of each observation's contribution to the deviance
Used over raw residuals because, for an adequately fitting GLM, deviance residuals are approximately normally distributed
Q-Q Plot
Displays the standardized deviance residuals against the standard normal quantiles
Confusion Matrix
Tabular display of how predictions of a binary classifier line up with the observed classes
Classification Error Rate
(False Negatives + False Positives) / n
Sensitivity
(True Positives) / (True Positives + False Negatives)
Higher sensitivity and specificity, better classifier
Specificity
(True Negatives) / (True Negatives + False Positives)
Higher sensitivity and specificity, better classifier
Receiver Operator Characteristic Curve (ROC)
Graphical tool that displays the trade-off between sensitivity and specificity across all possible classification cutoffs
Classifier with perfect predictive performance rises quickly to top left corner
AUC
Area under the curve of a ROC plot
Measure of the predictive performance of the model (the closer the AUC is to 1, the better)
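A minimal sketch of these metrics for a hypothetical fitted binary classifier (logit_fit) and a 0/1-coded target, using a cutoff of 0.5 and the pROC package for the AUC:

```r
# Predicted probabilities and classes on the test set
pred_prob  <- predict(logit_fit, newdata = test, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

# Confusion matrix: predictions vs observed classes
conf <- table(Predicted = pred_class, Observed = test$target)

# Sensitivity = TP / (TP + FN); Specificity = TN / (TN + FP)
sensitivity <- conf["1", "1"] / sum(conf[, "1"])
specificity <- conf["0", "0"] / sum(conf[, "0"])

# Area under the ROC curve
library(pROC)
auc(roc(test$target, pred_prob))
```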
Regression Trees
Have a quantitative target variable and use the average of the target variable in that group as the predicted value
Classification Trees
Have qualitative target variables and use the most common class (mode) of the target variable in that group as the predicted class
Node
Point on the tree that corresponds to a subset of the data
Root Node
Node at the top of the tree representing the full data set
Terminal Node
Also referred to as leaf. The nodes at the bottom of the tree that cannot be split any further
Binary Tree
Each node only has two children
Depth
The number of branches from the root node to the furthest terminal node
Measure of Impurity in Regression Trees
Residual sum of squares
Measure of Impurity in Classification Trees
Most common is classification error. Other measures of node impurity are entropy and gini.
Choice of impurity measure does not have a significant impact on the performance of the tree
Pruning
Process to control the complexity of the tree structure
Similar to stepwise selection in GLMs
Start at the bottom of the tree and remove splits that do not meet a specified impurity reduction
Advantages of Decision Trees
- Generally easier to interpret compared to GLMs
- Excel in handling nonlinear relationships and do not require transformations
- Good at automatically recognizing interactions between variables
- Do not require binarization for categorical variables
- Variables are automatically selected with the most important variables appearing at the top of the tree
- Much less susceptible to model mis-specification than GLMs
- Can easily be modified to deal with missing data
Disadvantages of Decision Trees
- More prone to overfitting relative to GLMs and produce unstable predictions with more variance (small change in training data can lead to big changes in the fitted tree)
- Favor categorical features with many levels over those with few levels
- Lack of model diagnostics
Random Forests
Ensemble method that generates multiple bootstrapped samples of the training set and fits base tree models to each bootstrapped sample
Results from all base trees are combined to form an overall prediction
Randomization is performed at each step
Reduces the variance of predictions, which improves predictive performance
Advantages of Random Forests
- Much more robust than single trees
- More precise predictions with lower variance
Disadvantages of Random Forests
- Not easily interpretable
- Takes considerably more computational power
Boosting
Ensemble method that builds a sequence of interdependent trees using information from previously grown trees
Each iteration builds on the residuals of the prior tree
Reduces model bias
Advantages of Boosting
- Improves predictive accuracy
Disadvantages of Boosting
- More vulnerable to overfitting
- Significant computational cost
- Not easily interpretable
Principal Components Analysis
Advanced technique that transforms a large number of (possibly correlated) variables to a smaller, more manageable, set of representative variables that capture much of the information in the full data set
Resulting variables are referred to as principal components and are a linear combination of the existing variables
Particularly useful for feature generation
To perform PCA on categorical variables, they must first be binarized
Scree Plot
Provides simple visual inspection method for determining number of principal components to use
Depicts the PVE (proportion of variance explained) of each PC
Elbow of Scree Plot
Point at which the PVE drops off significantly
PCs beyond the elbow have a very small PVE and can be dropped
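A minimal sketch with base R's prcomp, using illustrative numeric variable names; the PVE of each component is derived from the component standard deviations:

```r
# PCA on centered and scaled numeric variables
pca <- prcomp(train[, c("x1", "x2", "x3", "x4")], center = TRUE, scale. = TRUE)

# Proportion of variance explained (PVE) by each principal component
pve <- pca$sdev^2 / sum(pca$sdev^2)

# Scree plot: look for the elbow where the PVE drops off
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
```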
Cluster Analysis
Partitions heterogeneous observations into a set of distinct homogeneous groups (clusters)
Observations within each cluster have similar characteristics
Two main methods: k-means clustering, hierarchical clustering
K-Means Clustering
Assigns each observation into one of k clusters, with k being defined beforehand
Algorithm automatically searches for the best configuration of the k clusters
Clusters are chosen such that the variance within each cluster is small while the variance among different clusters is large
Advisable to run the algorithm many times with different initial cluster assignments and then choose the run with the lowest within-cluster variance
Features must be standardized before clustering
Elbow Method
Plot of the ratio of between-cluster variation to total variation against k
Used to determine k: choose the value of k at which the proportion of variance explained begins to plateau
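A hedged sketch of the elbow method with base R's kmeans, using an illustrative standardized feature matrix; the proportion of variance explained is the between-cluster sum of squares divided by the total sum of squares:

```r
# Standardize the (illustrative) features before clustering
X_scaled <- scale(train[, c("x1", "x2", "x3")])

set.seed(42)
pve <- sapply(1:10, function(k) {
  km <- kmeans(X_scaled, centers = k, nstart = 25)
  km$betweenss / km$totss
})

# Elbow plot: choose k where the curve starts to plateau
plot(1:10, pve, type = "b", xlab = "k",
     ylab = "Between-cluster SS / Total SS")
```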
Hierarchical Clustering
Clustering method that does not require the choice of k at the start
Uses a dendrogram, which is a tree-based visualization of the hierarchy of clusters
Consists of a series of fusions (mergers) of clusters
Complete Linkage
Measure of intercluster dissimilarity based on the maximal pairwise distance between observations in two clusters
Single Linkage
Measure of intercluster dissimilarity based on the minimal pairwise distance between observations in two clusters
Average Linkage
Measure of intercluster dissimilarity based on the average pairwise distance between observations in two clusters
Dendrogram
Tree-based visualization of the hierarchy of clusters
Clusters towards bottom of the dendrogram are similar to one another, while clusters towards the top are far apart
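A minimal sketch with base R's hclust (feature names and the chosen number of clusters are illustrative):

```r
# Hierarchical clustering with complete linkage on standardized features
X_scaled <- scale(train[, c("x1", "x2", "x3")])
hc <- hclust(dist(X_scaled), method = "complete")

# Dendrogram, then cut the tree to obtain, say, 4 clusters
plot(hc)
clusters <- cutree(hc, k = 4)
```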
K-Means vs Hierarchical Clustering
- K-Means clustering requires standardization
- Number of clusters are pre-defined in k-means clustering
- Hierarchical clustering uses nested clusters
nstart parameter
Used in cluster analysis, the nstart parameter controls the number of random selections of initial cluster centers used by the kmeans algorithm. A larger value improves the chances of finding a better local optimum. A value of 20 to 50 is typically recommended.
Define interaction
An interaction exists if the effect of one variable on the target variable changes with the value or level of another variable.
eta
Learning rate parameter in boosting methods. Scalar multiple between 0 and 1.
Predictions of the current tree are scaled by the eta parameter and added to the overall model. Lower values of eta "slow" the learning and generally result in more accurate models.
nrounds
Maximum number of boosting rounds. Often set to a fairly large number (e.g., 100 to 200) to ensure that a sufficient number of trees are grown.
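A hedged sketch of a boosted regression model with the xgboost package; the preprocessing, parameter values, and objective are illustrative:

```r
library(xgboost)

# xgboost requires a numeric feature matrix
X <- model.matrix(target ~ ., data = train)[, -1]

set.seed(42)
boost_fit <- xgboost(data = X, label = train$target,
                     objective = "reg:squarederror",
                     eta = 0.1,       # learning rate: lower values "slow" the learning
                     max_depth = 4,   # depth of each base tree
                     nrounds = 200,   # maximum number of boosting rounds
                     verbose = 0)
```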