Chapter 3: Linear Models Flashcards

1
Q

what is the difference between supervised and unsupervised learning problems?

A

supervised: our goal is to understand the relationship between the target variable and the predictors/ make accurate predictions for the target variable based on the predictors (there is a target variable)
unsupervised: target variable is absent. we are more interested in extracting relationships and structures between different variables in the data.

2
Q

what two types of business problems are there in exam PA?

A
  1. prediction focused: the primary objective is to make an accurate prediction of the target variable on the basis of other predictors
  2. interpretation focused: we are interested in using the model to understand the true relationship between the target variable and the predictors (ex. how is the number of exams passed associated with the salary of an actuary)
3
Q

How does the business problem affect the model that we create? (objective is prediction or interpretation?)

A

if the objective is prediction, we favour a model that produces accurate predictions, even if it is complex, hard to interpret, or costly to implement
if the objective is interpretation, we can select a relatively simple, interpretable model that clearly shows the relationship between the target and the predictors

4
Q

why is it important that data are consistent?
ex. numeric variables - keeping them all in the same units
categorical variables - consistent naming for the levels

A

so that they can be directly compared to one another

5
Q

When datasets contain PII (personally identifiable information), such as social security number, address, etc, what may need to be done with the data?

A
  1. anonymize the data to remove the PII
  2. data security: ensure that personal data receives sufficient protection, such as access restrictions
  3. terms of use: be aware of the terms of use on the data.
  4. unethical data: differential treatment based on these variables in a predictive model may lead to unfair discrimination
6
Q

what proportion of the data should we use to train our models?

A

around 70-75%

7
Q

how do you create the test/training split? from what package?

A
library(caret)
set.seed(42)   # any fixed seed, for reproducibility
partition <- createDataPartition(dataset$targetvariable, p = 0.75, list = FALSE)
data.train <- dataset[partition, ]
data.test <- dataset[-partition, ]
8
Q

what is the purpose of a training/test split?

A

train: used to train/develop your model to estimate the signal function. typically done by optimizing a certain objective function
test: where you assess the prediction performance of your trained model according to certain performance metrics (imagining that the test set is a set of future data)

9
Q

what is a good performance metric to use on regression problems?

A

test RMSE: write out the formula
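written out, with $n_{\text{test}}$ test observations, observed values $y_i$ and predictions $\hat{y}_i$:

$$\text{RMSE}_{\text{test}} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_i\right)^2}$$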

we interpret this as the size of a typical prediction error in absolute value

10
Q

why do we use RMSE instead of MSE?

A

the RMSE has the same unit as the target variable

11
Q

what performance metric do we use for classification problems?

A

test classification error rate: write it out
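written out, with the indicator function $1\{\cdot\}$:

$$\text{classification error rate}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} 1\{y_i \neq \hat{y}_i\}$$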
the sum counts the number of test observations that are incorrectly classified; dividing by n_test gives the proportion of misclassified observations on the test set

12
Q

when is c-v most typically used?

A

to select the values of hyperparameters, which are parameters that control some aspect of the fitting itself

13
Q

why is c-v powerful?

A

because it can assess the prediction performance of a model without using additional test data

14
Q

how to perform k-fold cross validation?

A
  1. for a given positive integer k, randomly split the training data into k folds of approx. equal size. common choice = 10
  2. one fold is left out and the model is fitted to the remaining k-1 folds. then the fitted model is used to make a prediction for each observation in the left-out fold and a performance metric is computed on that fold.
  3. repeat this process with each fold left out in turn to get k performance values (e.g. RMSE for numeric and classification error rate for categorical)
  4. the overall prediction performance of the model can be estimated as the average of the k performance values (this is the CV error)
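the procedure above can be run with caret; a minimal sketch, assuming a numeric target named target in data.train (the formula and method are placeholders):

library(caret)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
cv.model <- train(target ~ ., data = data.train, method = "lm", trControl = ctrl, metric = "RMSE")
cv.model$results   # CV RMSE averaged over the 10 folds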
15
Q

is it possible to validate our model graphically?

A

yes, the plot of observed against predicted values: the points should fall close to the 45-degree line y = x, with no systematic deviations from it
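a minimal sketch, assuming a fitted model called model and a test set data.test whose target column is target:

preds <- predict(model, newdata = data.test)
plot(preds, data.test$target, xlab = "predicted", ylab = "observed")
abline(a = 0, b = 1)   # points should hug this 45-degree line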

16
Q

T/F: prediction accuracy is the same as goodness of fit

A

false. it is not.
goodness of fit measures how well a model describes the past data, but doesn’t necessarily measure how good a model will perform on future data

17
Q

what are the components of the expected test error? describe each of them.

A
  1. bias: (accuracy)
    - this is the difference between the expected value of the predictive model and the true value of the signal function.
    - bias measures the accuracy of f_hat
    - bias is the part of the test error caused by the model not being flexible enough to capture the signal
  2. variance: (precision)
    - quantifies the amount by which f_hat would change if we estimated it using a different training set.
    - ideally, f_hat should be stable across different training sets
    - a more flexible model has a higher variance because it is more sensitive to training data.
  3. irreducible error:
    - this is the variance of the noise, independent of the choice of predictive model but inherent in the target variable
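putting the three pieces together, for a prediction $\hat{f}(x_0)$ at a test point $x_0$ with noise variance $\sigma^2$:

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \left[\text{Bias}\,\hat{f}(x_0)\right]^2 + \text{Var}\left(\hat{f}(x_0)\right) + \sigma^2$$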
18
Q

as the flexibility of the model increases, what tends to drop quicker, bias or variance?

A

bias

19
Q

what are two commonly used strategies for reducing the dimensionality of a categorical predictor?

A
  1. combining sparse categories with others:
    - categories with very few observations should be the first candidates to be combined with other categories
    - it is difficult to estimate the effects of these categories on the target variable reliably if they have few observations
  2. combining similar categories:
    - if the target variable behaves similarly (with respect to mean, median, etc.) in two categories of a categorical predictor then we can reduce the dimension of the predictor by consolidating these two categories without losing much information
20
Q

in terms of a categorical predictor: what is the difference between granularity and dimensionality?

A

dimensionality:

  1. applicability - concept specific to categorical variables
  2. comparability - we can always order categorical variables by dimension (which has more levels?)

granularity:

  1. applicability - applies to both categorical and numeric variables
  2. comparability - not always possible to order categorical variables by granularity

  • as we make a categorical predictor more granular, the information it stores becomes finer: its dimension increases and there are fewer observations at each level
  • reducing the granularity of a categorical predictor makes the information it contains less detailed, but makes the number of factor levels more manageable.
21
Q

what is the optimal level of granularity for a categorical predictor

A

the level that optimizes the bias-variance trade off

22
Q

What does the penalty term do in the AIC/BIC?

A

the AIC and BIC both demand that, for a new feature to improve the model, it must increase the loglikelihood by more than the increase in the penalty term, i.e. the improvement in fit has to outweigh the added complexity

23
Q

which has a higher penalty term? AIC or BIC?

A

BIC. its per-parameter penalty is ln(n_train) versus 2 for the AIC, so it penalizes complexity more heavily (more conservative) whenever n_train ≥ 8

24
Q

when checking model diagnostics, what two plots are commonly used and what is evaluated on both?

A
  1. residuals vs. fitted values plot
    - equal variance of the residuals (the "finger test" for homoscedasticity)
    - no discernible patterns in the plot
    - residuals centred around mean 0
  2. normal Q-Q plot
    - check that the standardized residuals lie close to the y = x reference line (approximate normality)
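for a model fitted with lm(), both plots are available from base R (a sketch, assuming the fitted object is called model):

plot(model, which = 1)   # residuals vs. fitted values
plot(model, which = 2)   # normal Q-Q plot of the standardized residuals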
25
Q

how could we interpret a regression coefficient in a linear model?

A

“a unit increase in X is associated with an increase of beta (the coefficient of X) in Y on average, holding all other predictors fixed”

26
Q

in situations where the relationship between Y and X does not appear to be linear, it may be desirable to expand the model to higher powers of X using polynomial regression.

what are the pros (1) and cons (2) to adding higher order terms to our model?

A

pros:
- we are able to take care of more complex relationships between the target variable and the predictors. The more polynomial terms included, the more flexible the model.

cons:

  • the regression coefficients in a polynomial regression model are much more difficult to interpret: we cannot speak of the effect of a one-unit increase in X holding the other variables fixed, because the other polynomial terms in X cannot be held fixed
  • there is no simple rule for choosing the highest power m of X. for large values of m the model becomes overly flexible, so deciding on the optimal value of m is an iterative process
27
Q

How could we deal with complex, nonlinear relationships (3)

A
  1. polynomial regression
  2. binning: using piecewise constant functions
  3. using piecewise linear functions
28
Q

how could we use binning to incorporate nonlinearity into a model? what are the pros (1) and cons (2) to this method?

A

we don’t treat the numeric variable as numeric. We band the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals over the range of the variable.
- each level is represented by a dummy variable and receives a separate regression coefficient

pros:
- it liberates the regression function from assuming any particular shape: there is no definite order among the coefficients of the dummy variables corresponding to different bins, allowing the target mean to vary irregularly over the bins

cons:

  • no simple rule as to how many bins to use. having many bins leads to sparse categories and unstable coefficient estimates
  • binning results in a loss of information.
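a minimal sketch of the banding step, with a hypothetical numeric variable age and made-up break points:

dataset$age.band <- cut(dataset$age, breaks = c(0, 25, 40, 60, Inf), include.lowest = TRUE)
model <- lm(target ~ age.band, data = dataset)   # one dummy coefficient per band, first band as baseline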
29
Q

what is binarization?

A

the feature generation method used to properly incorporate a categorical predictor into a linear model.
- binarization turns a given categorical predictor into a collection of artificial binary (dummy) variables

30
Q

how could we interpret the p-value of a dummy variable?

A

quantifies our confidence that the corresponding regression coefficient is non-zero AKA our confidence that the target mean in that level is significantly different from the target mean in the baseline level.
lower p-value = more confidence

31
Q

How do we choose the baseline level for a categorical predictor?

A

a common practice is to set the baseline level to the most common level of the predictor
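a one-line sketch with relevel(), assuming pred1 is already a factor and "LevelA" is its most common level (both names are hypothetical):

dataset$pred1 <- relevel(dataset$pred1, ref = "LevelA")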

32
Q

what is the definition of an interaction?

A

an interaction arises if the expected effect of one predictor on the target variable depends on the value (or level) of another predictor

33
Q

how do we interpret the regression coefficient of an interaction term?

A

for every unit increase in x1, the expected effect of x2 on the expected value of y increases by beta
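as a worked example: in the model $E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, the effect of a unit increase in $x_2$ is $\beta_2 + \beta_3 x_1$, so every unit increase in $x_1$ changes that effect by $\beta_3$.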

34
Q

what is the definition of collinearity?

A

we say that collinearity exists in a linear model when two or more features are closely, if not exactly, linearly related.

35
Q

T/F: collinearity induces low variance.

A

false: it induces high variance

it inflates the standard error of the coefficients, which results in unstable coefficient estimates

36
Q

what is an implication of exact collinearity, in terms of the rank of our OLS beta formula?

A

the result of an exact linear relationship among some of the features is a rank-deficient linear model. this means that the coefficient estimates of the model cannot be determined uniquely.

37
Q

what is an implication of collinearity, in terms of interpretation of the coefficient estimates?

A

the phrase “holding all other features constant” loses its meaning: features that are strongly related tend to move together, so it is hard to separate their individual effects on the target variable and the interpretation of the coefficient estimates becomes difficult.

38
Q

how can you detect collinearity?

A

look at the correlation matrix of the numeric predictors.
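a quick sketch, keeping only the numeric columns of the training set:

vars.numeric <- colnames(data.train)[sapply(data.train, is.numeric)]
cor(data.train[, vars.numeric])   # pairwise correlations; values near +/-1 flag collinearity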

39
Q

what are two solutions to collinearity?

A
  1. delete one of the problematic predictors
  2. combine the correlated predictors into a single feature (e.g. via a dimension-reduction technique such as PCA) or use a regularized model such as ridge, which stabilizes the coefficient estimates
40
Q

what is best subset selection?

A

fit a separate linear model for each possible combination of the available features, examine all the models and choose the “best subset” of features to form the best model (best model = according to criterion)

pros: robust, every possibility is looked at
cons: computationally intensive. 2^p models to compare

41
Q

explain backwards and forwards stepwise selection

A

backward: start with the model containing all features and work backward. in each step of the algorithm, we drop the feature whose removal leads to the greatest improvement in the model (according to the chosen criterion)
forward: start with the simplest model (the null model, intercept only) and, in each step, add the feature whose inclusion leads to the greatest improvement in the model.

42
Q

which of forward or backward stepwise selection is more likely to get a simpler model?

A

forward selection because the starting model is simpler

43
Q

what is regularization? how does it work?

A

it is an alternative to doing stepwise selection for feature selection.

how it works: it generates coefficient estimates by minimizing a slightly different objective function. it uses the RSS as the starting point, and then incorporates a penalty term that reflects the complexity of the model.
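in symbols, the coefficient estimates are chosen to minimize

$$\text{RSS} + \lambda \times \text{penalty}(\beta_1, \dots, \beta_p),$$

where $\lambda \geq 0$ is the regularization parameter controlling how heavily model complexity is penalized.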

44
Q

what are two common choices for the penalty function in regularization?

A

lasso (absolute) and ridge (squares)
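written out, the ridge penalty is $\lambda \sum_{j=1}^{p} \beta_j^2$ and the lasso penalty is $\lambda \sum_{j=1}^{p} |\beta_j|$.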

45
Q

what is a more general regularization method that incorporates both lasso and ridge into one objective function?

A

elastic net. it requires an additional hyperparameter, alpha, to distribute the weight to each type of regression.

alpha = mixing coefficient

46
Q

T/F: in regularization, we need to standardize all predictors

A

True. scale matters: the size of a coefficient depends on the scale of its predictor, so predictors must be standardized for the penalty to treat them fairly (glmnet() standardizes by default)

47
Q

what happens in regularization when the value of lambda increases from 0? when it increases to infinity?

A

as lambda increases, the effect of regularization becomes more severe. the flexibility drops, resulting in a decreased variance but an increased bias.

in most applications, the initial increase in lambda causes a substantial reduction in variance at the cost of only a slight rise in the bias.

when lambda goes to infinity, the regularization penalty dominates and the slope coefficient estimates are forced to zero, leaving essentially the intercept-only model

48
Q

which regularization method allows feature selection?

A

lasso.

lasso has the effect of forcing the coefficient estimates to be exactly zero, when the regularization parameter is sufficiently large.

ridge can never exactly go to 0.

49
Q

in the context of regularization, what are the two hyperparameters?

A

lambda - regularization parameter and alpha - elastic net parameter

50
Q

what are the pros (3) and cons (2) for regularization techniques for feature selection?

A

pros:

  1. the model.matrix()/glmnet() workflow automatically binarizes the categorical predictors, and each factor level is treated as a separate feature that can be retained or removed on its own.
  2. a regularized model can be tuned by c-v using the same criterion (RMSE for numeric target) that will be used to judge the model against unseen test data
  3. for lasso and elastic nets, variable selection can be done.

cons:
1. the glmnet() function is restricted in terms of model forms. not all of the GLM distributions are supported
2. regularization may not produce the most interpretable model

51
Q

what is the equation for elastic nets?

A

minimize the RSS plus the elastic net penalty, which blends the lasso and ridge penalties:

$$\text{RSS} + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]$$

alpha = 1 gives the lasso and alpha = 0 gives ridge

52
Q

how do we get that graphical representation of correlation between numeric variables?

A

pairs(dataset[, vars.numeric])   # scatterplot matrix; vars.numeric = names of the numeric columns

53
Q

how to use the lm() function to fit a linear model in r?

A

model <- lm(formula, data = dataset)

a formula looks like y ~ . + x1:x2 - x3 (all predictors, plus the interaction between x1 and x2, minus x3)

54
Q

how to use the predict() function?

A

predict(model, newdata = data.test, type = "response")   # the first argument is the fitted model object, not a dataset

55
Q

what should you do after creating a test/training set split?

A

print("TRAIN")
summary(data.train)
print("TEST")
summary(data.test)

  • describe the results of the set split (ex. the portion of obs that went into each)
  • point out that we are using stratified sampling to ensure that both the training and test sets contain similar and representative values of the response variable
  • check the summary statistics of both
56
Q

write a for loop to create ggplots for numeric variables vs. target

A

vars.numeric <- colnames(dataset)[sapply(dataset, is.numeric)]
for (i in vars.numeric) {
  p <- ggplot(dataset, aes(x = .data[[i]], y = target)) + geom_point() + geom_smooth() + labs(x = i)
  print(p)   # print() is needed to display plots created inside a loop
}

57
Q

can you perform transformations on a predictor?

A

yes, if the univariate histogram shows the data to be skewed

58
Q

the predict function takes the argument “newdata” as what type of variable?

A

a dataframe

59
Q

how do you add a higher order term to a linear regression?

A

lm(y ~ . + interaction + I(predictor^2), data = dataset)

60
Q

why is it desirable to explicitly binarize the factor variables before implementing feature selection? what are the pros (1) and cons (2)?

A

for categorical variables with 3 or more levels.

the lm() function is capable of converting categorical predictors into dummy variables behind the scenes (and binarizing the variables in advance has no impact on a fitted linear model) but it can help make the feature selection process more meaningful.

pros:
- many feature selection techniques treat a factor variable as a single feature and either retain the variable with all of its levels or remove the variable completely, ignoring the possibility that some levels within the factor are significant while others are not. binarizing in advance lets each level be retained or dropped on its own

cons:
- with each level treated as a separate feature, the feature selection process takes much longer to run
- the selected model may retain some levels of a factor while dropping others, which can be harder to interpret

61
Q

How do you binarize variables? what function and from what package? (this code is given usually)

A

library(caret)

binarizer <- dummyVars(~ pred1 + pred2 + pred3 + … , data = dataset, fullRank = TRUE)
binarized.vars <- predict(binarizer, newdata = dataset)   # returns the dummy-variable columns

62
Q

What is the “fullRank” argument used for in the dummyVars() function?

A

you can set it to TRUE or FALSE (the default value)

TRUE = appropriate for regression; produces a full-rank set of dummy variables, with the baseline level of each factor left out

FALSE = the baseline levels are kept and a rank-deficient set of dummies is produced. not appropriate for regression, but can be used for PCA and cluster analysis

63
Q

if your goal is to determine key features, should you use AIC or BIC? Forward stepwise or backward?

A

BIC tends to favour a model with fewer features and represents a more conservative approach to feature selection.

forward selection is more likely to produce a model with fewer features (because the model you start with has no features)

64
Q

how to use the stepAIC() function to perform backward stepwise selection? what package?

A

library(MASS)

  1. fit the full model
  2. stepAIC(full.model, direction = "backward", k = (2 for AIC, log(nrow(data.train)) for BIC))
65
Q

how to use the stepAIC() function to perform forward stepwise selection?

A
  1. fit the null model
  2. fit the full model
  3. stepAIC(null.model, direction = "forward", scope = list(upper = full.model, lower = null.model), k = (2 for AIC, log(nrow(data.train)) for BIC))
66
Q

When using the stepAIC function, what is the the value of k, depending on AIC or BIC?

A

AIC: no need to specify k; the default is k = 2

BIC: k = log(nrow(data.train)), i.e. the log of the number of observations in the dataset used to train the model

67
Q

how to perform regularization using glmnet() function. what package?

A
  1. first need to create a design matrix using the model.matrix() function

library(glmnet)
x.values <- model.matrix(target ~ predictors, data = data.train)

  • categorical variables will be automatically transformed into dummy variables (as with dummyVars()) and the design matrix is of full rank
  2. call the function

glmnet(x = x.values, y = data.train$target, family = "gaussian", lambda = c(vector of candidate values - one model is fitted for each value of lambda), alpha = (a value between 0 and 1))

68
Q

what are the two components of a glmnet object?

A

it is a list whose two key components are:

  • a0 = a vector of length length(lambda) carrying the intercept estimate for each fitted model
  • beta = a matrix of slope coefficient estimates, with one column for each value of lambda
69
Q

what is the cv.glmnet() function?

A

used for hyperparameter tuning.
this function performs k-fold c-v, calculates the c-v error for a range of values of lambda, and produces the optimal value of lambda as a by-product.

70
Q

how do you use the cv.glmnet() function? do you need to set.seed()?

A

yes, we need to set.seed() because c-v involves randomly dividing the training data into k folds

set.seed(42)
m <- cv.glmnet(x = design matrix, y = data.train$target, family = "gaussian", alpha = 0.5)
plot(m)

this produces a graph with two vertical lines: the first marks the value of lambda that minimizes the c-v error; the second marks the one-standard-error rule (the largest lambda whose c-v error is within one standard error of the minimum)

extract these values with m$lambda.min or m$lambda.1se

71
Q

in regularization (elastic net) how do we choose an optimal value of alpha? can we use hyperparameter tuning (glmnet() )?

A

no, we have to use trial and error

  1. set up a grid of values of alpha, usually 0, 0.5, 1 (if more values are desired, use a for loop)
  2. for each value of alpha, select the corresponding optimal value of lambda and evaluate the corresponding c-v error
  3. adopt the value of alpha that produces the smallest c-v error
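a minimal sketch of this trial-and-error loop, assuming the design matrix x.train and the training target data.train$target already exist:

library(glmnet)
alphas <- c(0, 0.5, 1)
cv.errors <- numeric(length(alphas))
for (j in seq_along(alphas)) {
  set.seed(42)   # same seed so each alpha is tuned on comparable folds
  m <- cv.glmnet(x = x.train, y = data.train$target, family = "gaussian", alpha = alphas[j])
  cv.errors[j] <- min(m$cvm)   # c-v error at the optimal lambda for this alpha
}
alphas[which.min(cv.errors)]   # adopt the alpha with the smallest c-v error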