Chapter 3: Linear Models Flashcards

1
Q

what is the difference between supervised and unsupervised learning problems?

A

supervised: our goal is to understand the relationship between the target variable and the predictors/ make accurate predictions for the target variable based on the predictors (there is a target variable)
unsupervised: target variable is absent. we are more interested in extracting relationships and structures between different variables in the data.

2
Q

what two types of business problems are there in exam PA?

A
  1. prediction focused: the primary objective is to make an accurate prediction of the target variable on the basis of other predictors
  2. interpretation focused: we are interested in using the model to understand the true relationship between the target variable and the predictors (ex. how is the number of exams passed associated with the salary of an actuary)
3
Q

How does the business problem affect the model that we create? (objective is prediction or interpretation?)

A

if the objective is prediction, we favour a model that produces accurate predictions, even if it is complex, hard to interpret, or costly to implement
if the objective is interpretation, we can select a relatively simple, interpretable model that clearly shows the relationship between the target and the predictors

4
Q

why is it important that data are consistent?
ex. numeric variables - keeping them all in the same units
categorical variables - consistent naming for the levels

A

so that they can be directly compared to one another

5
Q

When datasets contain PII (personally identifiable information), such as social security number, address, etc, what may need to be done with the data?

A
  1. anonymize the data to remove the PII
  2. data security: ensure that personal data receives sufficient protection, such as access restrictions
  3. terms of use: be aware of the terms of use on the data.
  4. unethical data: differential treatment based on these variables in a predictive model may lead to unfair discrimination
6
Q

what proportion of the data should we use to train our models?

A

around 70-75%

7
Q

how do you create the test/training split? from what package?

A
library(caret)
set.seed(42)   # any fixed seed, for reproducibility
partition <- createDataPartition(dataset$targetvariable, p = 0.75, list = FALSE)
data.train <- dataset[partition, ]
data.test <- dataset[-partition, ]
8
Q

what is the purpose of a training/test split?

A

train: used to train/develop your model to estimate the signal function. typically done by optimizing a certain objective function
test: where you assess the prediction performance of your trained model according to certain performance metrics (imagining that the test set is a set of future data)

9
Q

what is a good performance metric to use on regression problems?

A

test RMSE: write out the formula
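written out, with $n_{\text{test}}$ test observations, observed values $y_i$ and predictions $\hat{y}_i$:

$$\text{RMSE}_{\text{test}} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left(y_i - \hat{y}_i\right)^2}$$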

we interpret this as the size of a typical prediction error in absolute value

10
Q

why do we use RMSE instead of MSE?

A

the RMSE has the same unit as the target variable

11
Q

what performance metric do we use for classification problems?

A

test classification error rate: write it out
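written out, with the indicator function $1\{\cdot\}$:

$$\text{classification error rate}_{\text{test}} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} 1\{y_i \neq \hat{y}_i\}$$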
the sum counts the number of test observations that are incorrectly classified; dividing by n_test gives the proportion of misclassified observations on the test set

12
Q

when is c-v most typically used?

A

to select the values of hyperparameters, which are parameters that control some aspect of the fitting itself

13
Q

why is c-v powerful?

A

because it can assess the prediction performance of a model without using additional test data

14
Q

how to perform k-fold cross validation?

A
  1. for a given positive integer k, randomly split the training data into k folds of approx. equal size. common choice = 10
  2. one fold is left out and the model is fitted to the remaining k-1 folds. then the fitted model is used to make a prediction for each observation in the left-out fold and a performance metric is computed on that fold.
  3. repeat this process with each fold left out in turn to get k performance values (e.g. RMSE for numeric and classification error rate for categorical)
  4. the overall prediction performance of the model can be estimated as the average of the k performance values (this is the CV error)
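the procedure above can be run with caret; a minimal sketch, assuming a numeric target named target in data.train (the formula and method are placeholders):

library(caret)
set.seed(42)
ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
cv.model <- train(target ~ ., data = data.train, method = "lm", trControl = ctrl, metric = "RMSE")
cv.model$results   # CV RMSE averaged over the 10 folds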
15
Q

is it possible to validate our model graphically?

A

yes, the plot of observed against predicted values: the points should fall close to the 45-degree line y = x, with no systematic deviations from it
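a minimal sketch, assuming a fitted model called model and a test set data.test whose target column is target:

preds <- predict(model, newdata = data.test)
plot(preds, data.test$target, xlab = "predicted", ylab = "observed")
abline(a = 0, b = 1)   # points should hug this 45-degree line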

16
Q

T/F: prediction accuracy is the same as goodness of fit

A

false. it is not.
goodness of fit measures how well a model describes the past data, but doesn’t necessarily measure how good a model will perform on future data

17
Q

what are the components of the expected test error? describe each of them.

A
  1. bias: (accuracy)
    - this is the difference between the expected value of the predictive model and the true value of the signal function.
    - bias measures the accuracy of f_hat
    - bias is the part of the test error caused by the model not being flexible enough to capture the signal
  2. variance: (precision)
    - quantifies the amount by which f_hat would change if we estimated it using a different training set.
    - ideally, f_hat should be stable across different training sets
    - a more flexible model has a higher variance because it is more sensitive to training data.
  3. irreducible error:
    - this is the variance of the noise, independent of the choice of predictive model but inherent in the target variable
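putting the three pieces together, for a prediction $\hat{f}(x_0)$ at a test point $x_0$ with noise variance $\sigma^2$:

$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \left[\text{Bias}\,\hat{f}(x_0)\right]^2 + \text{Var}\left(\hat{f}(x_0)\right) + \sigma^2$$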
18
Q

as the flexibility of the model increases, what tends to drop quicker, bias or variance?

A

bias

19
Q

what are two commonly used strategies for reducing the dimensionality of a categorical predictor?

A
  1. combining sparse categories with others:
    - categories with very few observations should be the first candidates to be combined with other categories
    - it is difficult to estimate the effects of these categories on the target variable reliably if they have few observations
  2. combining similar categories:
    - if the target variable behaves similarly (with respect to mean, median, etc.) in two categories of a categorical predictor then we can reduce the dimension of the predictor by consolidating these two categories without losing much information
20
Q

in terms of a categorical predictor: what is the difference between granularity and dimensionality?

A

dimensionality:

  1. applicability - concept specific to categorical variables
  2. comparability - we can always order categorical variables by dimension (which has more levels?)

granularity:

  1. applicability - applies to both categorical and numeric variables
  2. comparability - not always possible to order categorical variables by granularity

  • as we make a categorical predictor more granular, the information it stores becomes finer: its dimension increases and there are fewer observations at each level
  • reducing the granularity of a categorical predictor makes the information it contains less detailed, but makes the number of factor levels more manageable.
21
Q

what is the optimal level of granularity for a categorical predictor

A

the level that optimizes the bias-variance trade off

22
Q

What does the penalty term do in the AIC/BIC?

A

the AIC and BIC both demand that, for a new feature to improve the model, it must increase the loglikelihood by more than the increase in the penalty term, i.e. the improvement in fit has to outweigh the added complexity

23
Q

which has a higher penalty term? AIC or BIC?

A

BIC. its per-parameter penalty is ln(n_train) versus 2 for the AIC, so it penalizes complexity more heavily (more conservative) whenever n_train ≥ 8

24
Q

when checking model diagnostics, what two plots are commonly used and what is evaluated on both?

A
  1. residuals vs. fitted values plot
    - equal variance of the residuals (the "finger test" for homoscedasticity)
    - no discernible patterns in the plot
    - residuals centred around mean 0
  2. normal Q-Q plot
    - check that the standardized residuals lie close to the y = x reference line (approximate normality)
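for a model fitted with lm(), both plots are available from base R (a sketch, assuming the fitted object is called model):

plot(model, which = 1)   # residuals vs. fitted values
plot(model, which = 2)   # normal Q-Q plot of the standardized residuals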
25
Q

how could we interpret a regression coefficient in a linear model?

A

“a unit increase in X is associated with an increase of beta (the coefficient of X) in Y on average, holding all other predictors fixed”

26
Q

in situations where the relationship between Y and X does not appear to be linear, it may be desirable to expand the model to higher powers of X using polynomial regression.

what are the pros (1) and cons (2) to adding higher order terms to our model?

A

pros:
- we are able to take care of more complex relationships between the target variable and the predictors. The more polynomial terms included, the more flexible the model.

cons:

  • the regression coefficients in a polynomial regression model are much more difficult to interpret: we cannot speak of the effect of a one-unit increase in X holding the other variables fixed, because the other polynomial terms in X cannot be held fixed
  • there is no simple rule for choosing the highest power m of X. for large values of m the model becomes overly flexible, so deciding on the optimal value of m is an iterative process
27
Q

How could we deal with complex, nonlinear relationships (3)

A
  1. polynomial regression
  2. binning: using piecewise constant functions
  3. using piecewise linear functions
28
Q

how could we use binning to incorporate nonlinearity into a model? what are the pros (1) and cons (2) to this method?

A

we don’t treat the numeric variable as numeric. We band the numeric variable and convert it into an ordered categorical variable whose levels are defined as non-overlapping intervals over the range of the variable.
- each level is represented by a dummy variable and receives a separate regression coefficient

pros:
- it liberates the regression function from assuming any particular shape: there is no definite order among the coefficients of the dummy variables corresponding to different bins, allowing the target mean to vary irregularly over the bins

cons:

  • no simple rule as to how many bins to use. having many bins leads to sparse categories and unstable coefficient estimates
  • binning results in a loss of information.
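a minimal sketch of the banding step, with a hypothetical numeric variable age and made-up break points:

dataset$age.band <- cut(dataset$age, breaks = c(0, 25, 40, 60, Inf), include.lowest = TRUE)
model <- lm(target ~ age.band, data = dataset)   # one dummy coefficient per band, first band as baseline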
29
Q

what is binarization?

A

the feature generation method used to properly incorporate a categorical predictor into a linear model.
- binarization turns a given categorical predictor into a collection of artificial binary (dummy) variables

30
Q

how could we interpret the p-value of a dummy variable?

A

quantifies our confidence that the corresponding regression coefficient is non-zero AKA our confidence that the target mean in that level is significantly different from the target mean in the baseline level.
lower p-value = more confidence

31
Q

How do we choose the baseline level for a categorical predictor?

A

a common practice is to set the baseline level to the most common level of the predictor
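a one-line sketch with relevel(), assuming pred1 is already a factor and "LevelA" is its most common level (both names are hypothetical):

dataset$pred1 <- relevel(dataset$pred1, ref = "LevelA")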

32
Q

what is the definition of an interaction?

A

an interaction arises if the expected effect of one predictor on the target variable depends on the value (or level) of another predictor

33
Q

how do we interpret the regression coefficient of an interaction term?

A

for every unit increase in x1, the expected effect of x2 on the expected value of y increases by beta
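as a worked example: in the model $E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$, the effect of a unit increase in $x_2$ is $\beta_2 + \beta_3 x_1$, so every unit increase in $x_1$ changes that effect by $\beta_3$.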

34
Q

what is the definition of collinearity?

A

we say that collinearity exists in a linear model when two or more features are closely, if not exactly, linearly related.

35
Q

T/F: collinearity induces low variance.

A

false: it induces high variance

it inflates the standard error of the coefficients, which results in unstable coefficient estimates

36
Q

what is an implication of exact collinearity, in terms of the rank of our OLS beta formula?

A

the result of an exact linear relationship among some of the features is a rank-deficient linear model. this means that the coefficient estimates of the model cannot be determined uniquely.

37
Q

what is an implication of collinearity, in terms of interpretation of the coefficient estimates?

A

the phrase “holding all other features constant” loses its meaning: features that are strongly related tend to move together, so it is hard to separate their individual effects on the target variable and the interpretation of the coefficient estimates becomes difficult.

38
Q

how can you detect collinearity?

A

look at the correlation matrix of the numeric predictors.
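a quick sketch, keeping only the numeric columns of the training set:

vars.numeric <- colnames(data.train)[sapply(data.train, is.numeric)]
cor(data.train[, vars.numeric])   # pairwise correlations; values near +/-1 flag collinearity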

39
Q

what are two solutions to collinearity?

A
  1. delete one of the problematic predictors
  2. combine the correlated predictors into a single feature (e.g. via a dimension-reduction technique such as PCA) or use a regularized model such as ridge, which stabilizes the coefficient estimates
40
Q

what is best subset selection?

A

fit a separate linear model for each possible combination of the available features, examine all the models and choose the “best subset” of features to form the best model (best model = according to criterion)

pros: robust, every possibility is looked at
cons: computationally intensive. 2^p models to compare

41
Q

explain backwards and forwards stepwise selection

A

backward: start with the model containing all features and work backward. in each step of the algorithm, we drop the feature whose removal leads to the greatest improvement in the model (according to the chosen criterion)
forward: start with the simplest model (the null model, intercept only) and, in each step, add the feature whose inclusion leads to the greatest improvement in the model.

42
Q

which of forward or backward stepwise selection is more likely to get a simpler model?

A

forward selection because the starting model is simpler

43
Q

what is regularization? how does it work?

A

it is an alternative to doing stepwise selection for feature selection.

how it works: it generates coefficient estimates by minimizing a slightly different objective function. it uses the RSS as the starting point, and then incorporates a penalty term that reflects the complexity of the model.
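in symbols, the coefficient estimates are chosen to minimize

$$\text{RSS} + \lambda \times \text{penalty}(\beta_1, \dots, \beta_p),$$

where $\lambda \geq 0$ is the regularization parameter controlling how heavily model complexity is penalized.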

44
Q

what are two common choices for the penalty function in regularization?

A

lasso (absolute) and ridge (squares)
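written out, the ridge penalty is $\lambda \sum_{j=1}^{p} \beta_j^2$ and the lasso penalty is $\lambda \sum_{j=1}^{p} |\beta_j|$.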

45
Q

what is a more general regularization method that incorporates both lasso and ridge into one objective function?

A

elastic net. it requires an additional hyperparameter, alpha, to distribute the weight to each type of regression.

alpha = mixing coefficient

46
Q

T/F: in regularization, we need to standardize all predictors

A

True. scale matters: the size of a coefficient depends on the scale of its predictor, so predictors must be standardized for the penalty to treat them fairly (glmnet() standardizes by default)

47
Q

what happens in regularization when the value of lambda increases from 0? when it increases to infinity?

A

as lambda increases, the effect of regularization becomes more severe. the flexibility drops, resulting in a decreased variance but an increased bias.

in most applications, the initial increase in lambda causes a substantial reduction in variance at the cost of only a slight rise in the bias.

when lambda goes to infinity, the regularization penalty dominates and the slope coefficient estimates are forced to zero, leaving essentially the intercept-only model

48
Q

which regularization method allows feature selection?

A

lasso.

lasso has the effect of forcing the coefficient estimates to be exactly zero, when the regularization parameter is sufficiently large.

ridge can never exactly go to 0.

49
Q

in the context of regularization, what are the two hyperparameters?

A

lambda - regularization parameter and alpha - elastic net parameter

50
Q

what are the pros (3) and cons (2) for regularization techniques for feature selection?

A

pros:

  1. the model.matrix()/glmnet() workflow automatically binarizes the categorical predictors, and each factor level is treated as a separate feature that can be retained or removed on its own.
  2. a regularized model can be tuned by c-v using the same criterion (RMSE for numeric target) that will be used to judge the model against unseen test data
  3. for lasso and elastic nets, variable selection can be done.

cons:
1. the glmnet() function is restricted in terms of model forms. not all of the GLM distributions are supported
2. regularization may not produce the most interpretable model

51
Q

what is the equation for elastic nets?

A

minimize the RSS plus the elastic net penalty, which blends the lasso and ridge penalties:

$$\text{RSS} + \lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right]$$

alpha = 1 gives the lasso and alpha = 0 gives ridge

52
Q

how do we get that graphical representation of correlation between numeric variables?

A

pairs(dataset[, vars.numeric])   # scatterplot matrix; vars.numeric = names of the numeric columns

53
Q

how to use the lm() function to fit a linear model in r?

A

model <- lm(formula, data = dataset)

a formula looks like y ~ . + x1:x2 - x3 (all predictors, plus the interaction between x1 and x2, minus x3)

54
Q

how to use the predict() function?

A

predict(model, newdata = data.test, type = "response")   # the first argument is the fitted model object, not a dataset

55
Q

what should you do after creating a test/training set split?

A

print("TRAIN")
summary(data.train)
print("TEST")
summary(data.test)

  • describe the results of the set split (ex. the portion of obs that went into each)
  • point out that we are using stratified sampling to ensure that both the training and test sets contain similar and representative values of the response variable
  • check the summary statistics of both
56
Q

write a for loop to create ggplots for numeric variables vs. target

A

vars.numeric <- colnames(dataset)[sapply(dataset, is.numeric)]
for (i in vars.numeric) {
  p <- ggplot(dataset, aes(x = .data[[i]], y = target)) + geom_point() + geom_smooth() + labs(x = i)
  print(p)   # print() is needed to display plots created inside a loop
}

57
Q

can you perform transformations on a predictor?

A

yes, if the univariate histogram shows the data to be skewed

58
Q

the predict function takes the argument “newdata” as what type of variable?

A

a dataframe

59
Q

how do you add a higher order term to a linear regression?

A

lm(y ~ . + interaction + I(predictor^2), data = dataset)

60
Q

why is it desirable to explicitly binarize the factor variables before implementing feature selection? what are the pros (1) and cons (2)?

A

for categorical variables with 3 or more levels.

the lm() function is capable of converting categorical predictors into dummy variables behind the scenes (and binarizing the variables in advance has no impact on a fitted linear model) but it can help make the feature selection process more meaningful.

pros:
- many feature selection techniques treat a factor variable as a single feature and either retain the variable with all of its levels or remove the variable completely, ignoring the possibility that some levels within the factor are significant while others are not. binarizing in advance lets each level be retained or dropped on its own

cons:
- with each level treated as a separate feature, the feature selection process takes much longer to run
- the selected model may retain some levels of a factor while dropping others, which can be harder to interpret

61
Q

How do you binarize variables? what function and from what package? (this code is given usually)

A

library(caret)

binarizer <- dummyVars(~ pred1 + pred2 + pred3 + … , data = dataset, fullRank = TRUE)
binarized.vars <- predict(binarizer, newdata = dataset)   # returns the dummy-variable columns

62
Q

What is the “fullRank” argument used for in the dummyVars() function?

A

you can set it to TRUE or FALSE (the default value)

TRUE = appropriate for regression; produces a full-rank set of dummy variables, with the baseline level of each factor left out

FALSE = the baseline levels are kept and a rank-deficient set of dummies is produced. not appropriate for regression, but can be used for PCA and cluster analysis

63
Q

if your goal is to determine key features, should you use AIC or BIC? Forward stepwise or backward?

A

BIC tends to favour a model with fewer features and represents a more conservative approach to feature selection.

forward selection is more likely to produce a model with fewer features (because the model you start with has no features)

64
Q

how to use the stepAIC() function to perform backward stepwise selection? what package?

A

library(MASS)

  1. fit the full model
  2. stepAIC(full.model, direction = "backward", k = (2 for AIC, log(nrow(data.train)) for BIC))
65
Q

how to use the stepAIC() function to perform forward stepwise selection?

A
  1. fit the null model
  2. fit the full model
  3. stepAIC(null.model, direction = "forward", scope = list(upper = full.model, lower = null.model), k = (2 for AIC, log(nrow(data.train)) for BIC))
66
Q

When using the stepAIC function, what is the the value of k, depending on AIC or BIC?

A

AIC: no need to specify k; the default is k = 2

BIC: k = log(nrow(data.train)), i.e. the log of the number of observations in the dataset used to train the model

67
Q

how to perform regularization using glmnet() function. what package?

A
  1. first need to create a design matrix using the model.matrix() function

library(glmnet)
x.values <- model.matrix(target ~ predictors, data = data.train)

  • categorical variables will be automatically transformed into dummy variables (as with dummyVars()) and the design matrix is of full rank
  2. call the function

glmnet(x = x.values, y = data.train$target, family = "gaussian", lambda = c(vector of candidate values - one model is fitted for each value of lambda), alpha = (a value between 0 and 1))

68
Q

what are the two components of a glmnet object?

A

it is a list whose two key components are:

  • a0 = a vector of length length(lambda) carrying the intercept estimate for each fitted model
  • beta = a matrix of slope coefficient estimates, with one column for each value of lambda
69
Q

what is the cv.glmnet() function?

A

used for hyperparameter tuning.
this function performs k-fold c-v, calculates the c-v error for a range of values of lambda, and produces the optimal value of lambda as a by-product.

70
Q

how do you use the cv.glmnet() function? do you need to set.seed()?

A

yes, we need to set.seed() because c-v involves randomly dividing the training data into k folds

set.seed(42)
m <- cv.glmnet(x = design matrix, y = data.train$target, family = "gaussian", alpha = 0.5)
plot(m)

this produces a graph with two vertical lines: the first marks the value of lambda that minimizes the c-v error; the second marks the one-standard-error rule (the largest lambda whose c-v error is within one standard error of the minimum)

extract these values with m$lambda.min or m$lambda.1se

71
Q

in regularization (elastic net) how do we choose an optimal value of alpha? can we use hyperparameter tuning (glmnet() )?

A

no, we have to use trial and error

  1. set up a grid of values of alpha, usually 0, 0.5, 1 (if more values are desired, use a for loop)
  2. for each value of alpha, select the corresponding optimal value of lambda and evaluate the corresponding c-v error
  3. adopt the value of alpha that produces the smallest c-v error
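a minimal sketch of this trial-and-error loop, assuming the design matrix x.train and the training target data.train$target already exist:

library(glmnet)
alphas <- c(0, 0.5, 1)
cv.errors <- numeric(length(alphas))
for (j in seq_along(alphas)) {
  set.seed(42)   # same seed so each alpha is tuned on comparable folds
  m <- cv.glmnet(x = x.train, y = data.train$target, family = "gaussian", alpha = alphas[j])
  cv.errors[j] <- min(m$cvm)   # c-v error at the optimal lambda for this alpha
}
alphas[which.min(cv.errors)]   # adopt the alpha with the smallest c-v error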