Module 6 - GLM Flashcards
Overdispersion: definition? How to fix?
1) The variance of the response is greater than its mean, violating the Poisson GLM assumption that variance = mean
2) Use quasi-likelihood (quasipoisson)
- Coefficient estimates will be the same, but standard errors are scaled by the estimated dispersion parameter
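A minimal R sketch of the fix (the data frame dat and the variables are hypothetical):

# Same coefficients as family = poisson, but standard errors are
# scaled by the estimated dispersion parameter
fit.qp <- glm(claims ~ age + region, family = quasipoisson(link = "log"), data = dat)
summary(fit.qp)  # a dispersion parameter well above 1 signals overdispersion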
Interaction terms: when do they occur?
1) Occur when the effect of one feature on the response depends on the level of another feature
Interaction terms: why use underlying variables?
Hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if their p-values are insignificant
1) Interactions are hard to interpret in a model without main effects
2) An interaction term implicitly contains the main effects, so dropping them from the model does not truly remove them
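In R's formula syntax, the * operator respects the hierarchy principle automatically (variable names below are hypothetical):

# y ~ x1 * x2 expands to y ~ x1 + x2 + x1:x2,
# so the main effects are always included alongside the interaction
fit <- glm(y ~ x1 * x2, family = gaussian(), data = dat)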
R^2, what does it mean? Problem with the measure and solution?
1) Fraction of variance explained by the model = fraction of variance reduced
2) Adding a predictor never decreases its value (and almost always increases it), even if the predictor is useless
- Fix: adjusted R^2 adds a penalty for more parameters
Collinearity, definition? How to deal with it? (2)
- Two or more predictor variables are strongly related to each other
Solutions:
1) Drop one of the problematic variables
2) Combine collinear variables together into a single predictor
Offsets, definition? Why are they used?
- A variable whose effect on the response is known, so its coefficient does not need to be estimated (beta = 1)
- But the GLM still needs to be made aware of the offset variable so that the estimated coefficients for the OTHER variables are optimal in its presence
- Used to adjust for exposure
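A hedged R sketch, assuming a claim-count model where exposure is measured in years; log(exposure) is offset because the model has a log link:

# The coefficient on log(exposure) is fixed at 1 rather than estimated
fit <- glm(claims ~ age + region + offset(log(exposure)),
           family = poisson(link = "log"), data = dat)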
Prior weights, definition? why used?
- Give information about the credibility of each observation in the model.
- Assign greater credibility to rows that represent a greater number of risks in the estimation of the model coefficients.
- Weight variable specifies the weight given to each record in the estimation process.
ex: 1 year of exposure vs 1 month of exposure
-Observations with HIGHER EXPOSURE deemed to have LOWER VARIANCE
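A minimal sketch, assuming each record's target is an average severity across claim_count claims (hypothetical names):

# Records backed by more claims get more weight (lower assumed variance)
fit <- glm(avg_severity ~ age + region, family = Gamma(link = "log"),
           weights = claim_count, data = dat)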
Deviance, definition? how does it work?
- Measure of goodness of fit of a GLM
- Compares the loglikelihood of the fitted model with that of the saturated model
- Smaller deviance = Better model
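For reference, the deviance compares the fitted model to the saturated model (one parameter per observation):

D = 2 * (loglik of saturated model - loglik of fitted model)

In R, deviance(fit) returns the residual deviance of a fitted GLM.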
Homoscedasticity, definition?
Error terms have constant variance
e ~ N(0, sigma^2)
Graph to use for:
1) Observations that have too large an impact on coefficients?
2) Normality of the distribution of residuals?
3) Homogeneity of the variance and linearity of relationship? (2)
1) Residuals vs Leverage
2) Normal Q-Q
3) Residuals vs Fitted, Scale-Location
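In base R, calling plot() on a fitted model produces these diagnostics (the which codes below are the standard plot.lm ones; fit is a hypothetical fitted model):

plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q
plot(fit, which = 3)  # Scale-Location
plot(fit, which = 5)  # Residuals vs Leverage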
Alpha parameters for:
1) Lasso?
2) Elastic Net?
3) Ridge regression?
1) Lasso: Alpha = 1
2) Elastic Net: 0 < Alpha < 1
3) Ridge: Alpha = 0
Difference between lasso and ridge? What is one better than the other at?
1) With Lasso, optimal solution can reduce a coefficient to exactly = 0
- Which cannot happen with ridge
- Thus, lasso can completely remove a feature
2) Lasso: better at feature selection
- Ridge: better at predictive fit when most features carry some signal
With lasso, as Lambda increases…?
- More of the features will be eliminated
- Larger coefficients shrink at a much faster rate than smaller coefficients
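A hedged glmnet sketch tying alpha and lambda together (the predictor matrix X and target y are hypothetical; glmnet requires a numeric matrix, e.g. from model.matrix()):

library(glmnet)
# alpha = 1 -> lasso; alpha = 0 -> ridge; in between -> elastic net
cv.fit <- cv.glmnet(X, y, alpha = 1)  # CV chooses lambda
coef(cv.fit, s = "lambda.min")        # zero coefficients = eliminated features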
Limitation of feature selection using regularization techniques?
Automatic method -> so not always most interpretable
Cross-Validation, explained, steps?
Repeating the validation step with different training/test samples
1) Split the data into k folds; train the model on k-1 folds, then predict and record the error on the held-out fold
2) Repeat k times, once with each fold held out
3) Calculate the error for each fold -> the CV error is the average of the fold errors, as in the sketch below
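A minimal sketch of k-fold CV in R, assuming a data frame dat with numeric target y and RMSE as the error measure:

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment
errors <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(y ~ ., family = gaussian(), data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- sqrt(mean((test$y - pred)^2))  # fold RMSE
}
cv.error <- mean(errors)  # CV error = average of the fold errors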
GLM to use for: (Family and link function)
Probability or binary?
Family = binomial
Link function = Logit
GLM to use for: (Family and link function)
Count
Family = poisson or quasipoisson
Link function = Log
GLM to use for: (Family and link function)
Continuous positive?
Family = Gamma, Inverse Gaussian
Link function = Log (for both)
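The corresponding R calls (formula and data are placeholders):

glm(y ~ x, family = binomial(link = "logit"), data = dat)        # binary
glm(y ~ x, family = poisson(link = "log"), data = dat)           # count
glm(y ~ x, family = Gamma(link = "log"), data = dat)             # continuous positive
glm(y ~ x, family = inverse.gaussian(link = "log"), data = dat)  # continuous positive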
Advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares? (3)
1) Simpler model
2) Improved prediction accuracy
3) Results easier to interpret
Lasso Regularized regression
1) Advantages? (2)
1) Binarization always done (through the use of the model matrix), and each factor level treated as a separate feature
2) Variable selection is automatic, using CV to minimize prediction error rather than a proxy such as AIC or hypothesis tests
Lasso Regularized regression
1) Disadvantages? (1)
1) Since variables are scaled, the estimated coefficients are difficult to interpret
Classification tree using Cost-Complexity Pruning
1) Advantages? (4)
1) Easy to explain and present due to if/else nature
2) Automatically removes variables (by not showing up in the tree), allowing interpretation to focus on the most significant factors
3) More easily adapts to non-linear relationships
4) Automatically captures interaction effects
Classification tree using Cost-Complexity Pruning
1) Disadvantages? (2)
1) Danger of overfitting
2) Resulting Tree can be highly dependent on the training set
Random Forest
1) Advantages? (2)
1) Reduces overfitting and variance by allowing results from multiple trees to be combined
2) Uses CV to set the tuning parameters
Random Forest
1) Disadvantages? (3)
1) Difficult to interpret
2) Longer runtime
3) Difficult to implement
Disadvantage of stepAIC algorithm for factor variables?
How to solve the issue?
1) stepAIC treats a factor variable as a single feature
- As such, it either retains the factor with all of its levels or removes the variable entirely
- Does not allow for the possibility that individual factor levels may be insignificant relative to the base level, or insignificantly different from other levels
2) Solution: binarize the factor variables into 0/1 dummy columns, as sketched below
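A sketch of that binarization step using model.matrix() (hypothetical data frame dat with factor region):

# Expands the factor into 0/1 dummy columns, one per non-base level,
# so stepAIC can add or drop individual levels
X <- model.matrix(~ region, data = dat)[, -1]  # drop the intercept column
dat.bin <- cbind(dat, X)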
Interpret AIC between models?
Model with the LOWEST AIC is the best
True or false
-When using regularization methods, numeric features should be standardized
TRUE
-Standardization gives each feature an equal chance of having its coefficient altered
Why do you create an interaction term?
- We create an interaction term when we suspect the effect of one variable on the target is influenced by the level of another variable
- Multiplying variables together/creating flags allows algorithms to pick out the patterns and interactions MUCH MORE EASILY than hoping that the algorithm finds them itself
Advantages of GLMs? (4)
1) Intuitive to understand and easily communicated
2) Allow for non-normal distributions
3) Allow a functional (link) relationship between the target and a linear combination of the predictors. Can show the effect of a predictor variable on the target in terms of magnitude and direction (+/-)
4) good for modelling continuous response variables
Disadvantages of GLMs (4)
1) CANNOT capture NON-LINEAR relationships
2) Sensitive to the choice of features included
3) Risk of collinearity producing suboptimal models
4) Underlying assumptions may not always be met
Explain regularization in 1 sentence
-Regularization is a set of techniques used to REDUCE VARIANCE by PENALIZING THE SIZE OF THE MODEL'S COEFFICIENTS, reducing the risk of overfitting.
Why do we use stepwise variable selection with AIC? (2)
- To remove unimportant variables
- To reduce the risk of overfitting
What do p-values represent/express?
They express the significance of the variables
- Smaller the p-value, the more significant the variable is
- Less than 0.05 is considered statistically significant
What does this do:
drop1(glm.freq, test = "LRT")
- Conducts a likelihood ratio test for each variable, dropping one variable at a time
- Small p-value = variable is highly significant
Stepwise Regression, advantages? (2)
- Automatic method and fast
- Can manage large amounts of predictor variables to choose the best ones from the available options
Stepwise Regression, disadvantages? (2)
- Problems with correlated variables: if two predictor variables in the model are highly correlated, only 1 may make it into the model.
- Greedy nature of the algorithm: it assumes each step moves you closer to the best model, which is often a bad assumption. Since it is automatic, variables may be removed even when they are important to include.
- Therefore, there is no guarantee it will yield the best subset of features among all possible combinations
Bias variance tradeoff explained:
- High bias = ?
- High variance = ?
- High bias = inaccurate model
  - Does not have the capacity to capture the signal in the data
- High variance = overfits to the data it was trained on
  - Won't generalize well to unseen data
Describe a GLM in a phrase
Models that
- Take all significant variables into account
- Assess the relative importance of each predictor
- While also creating an easy to implement formula to calculate a prediction for a given observation
Explain what a glm family is?
Family refers to the distribution that the target variable is assumed to follow
-Will impact how the algorithm fits the model
What is the purpose of the link function?
The link function connects the linear predictor to the mean of the target, forcing the predicted mean for each observation into the appropriate range (e.g., a log link keeps it positive)
Why is overfitting bad (e.g., adding more and more predictor variables)?
Adding additional variables can improve fit to the training data, but may actually decrease fit on unseen (testing) data
Information criterion, use? Explain
Reduces overfitting by demanding that an additional variable increase the loglikelihood by a specific amount in order to be added
AIC vs BIC, comparison?
AIC
-Penalty of 2 per parameter added; the loglikelihood must increase by more than 1 per parameter for AIC to improve
BIC
-Penalty of ln(n) per parameter added; the loglikelihood must increase by more than ln(n)/2 per parameter
Therefore, BIC is a more conservative approach, since the penalty per parameter is greater (whenever n >= 8)
AIC vs BIC, when would you use which?
BIC penalizes more severely than AIC
-Therefore, if you're trying to identify the key variables related to the target variable (i.e., the smallest set possible), it is better to use BIC since it is more conservative
Forward vs backward selection -> why use one or the other?
Forward selection is more likely to end up with fewer variables -> resulting in a SIMPLER MODEL
In R, how do you treat a numeric variable that takes only a few distinct values?
Convert it into a factor variable
Why is RMSE used as a regression performance indicator over MSE?
RMSE has the same unit as the target variable, making its value easier to interpret
Explain strategies for reducing the dimension of a categorical variable?
- Combining similar categories
  - Categories with similar values of the target variable (mean, median, etc.)
- Combining sparse categories together
  - Categories with few observations
- Grouping leftover categories into an "other" level
Why do you use k-1 dummy variables for a factor with k levels?
Otherwise the dummy columns sum to one, a perfect linear relationship (collinearity) with the intercept, which will destabilize the model fitting process
How do you choose the baseline level for a categorical predictor?
Choose the one with the most observations (default), or choose the one that ‘makes the most sense logically’
Pros of using regularization for feature selection in R? (2)
- glmnet() allows binarization of categorical predictors in advance
  - This allows us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor
- Regularization is computationally more efficient than stepwise selection algorithms
Cons of using regularization for feature selection in R? (2)
- Regularization may not produce the most interpretable model
  - Especially for ridge, since all features are retained
- glmnet() is restricted in terms of model forms -> it can't accommodate all of the distributions for GLMs
  - Ex: it does not cover the gamma family
For AIC and BIC, how are features added?
For a feature to be worth adding, it must INCREASE the LOGLIKELIHOOD by more than the following amount per parameter:
AIC: 1
BIC: ln(n)/2
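For reference, with loglikelihood l, p parameters, and n observations:

AIC = -2*l + 2*p
BIC = -2*l + p*ln(n)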
Why is having collinear (highly correlated variables) in a model bad?
- This means you’re entering the same information in the model twice
- This makes it difficult for the GLM to separate the individual effects of the collinear variables on the target, causing instability in the model
How does drop1() choose variables using p-values?
For each variable, it tries to answer: "Does the feature in question provide additional predictive value IN THE PRESENCE of the OTHER FEATURES?"
- If not, the variable is a candidate for removal
When to use weights vs offsets in terms of observed data?
- Weights: use when observations of target variable are averages across the members of the same group (ex: 1 record = avg of 0.5 claim counts over 100 policyholders)
- Offsets: use when observations are values aggregated over members of the same group
(ex: 1 record = 50 claim counts over 100 policyholders)
Weights and offsets, how do they affect the mean and variance
Weights: records with more weight have less variance and are therefore more reliable, but this does not affect the mean of the target
Offsets: group size is positively related to the mean of the target variable, but leaves its variance unaffected
Classification: how does the cutoff work?
Binary classifiers only predict probabilities, which do not by themselves say whether the event is predicted to occur
Cutoff: if the predicted probability is above the threshold, the event is predicted; otherwise it is not
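A minimal sketch, assuming a fitted binomial GLM fit, test data test, and a chosen cutoff:

prob <- predict(fit, newdata = test, type = "response")  # predicted probabilities
cutoff <- 0.5                                            # hypothetical threshold
pred.class <- ifelse(prob > cutoff, 1, 0)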
True or False
-when fitting models by maximum likelihood, additional variables never decrease the loglikelihood value
TRUE
Explain forward selection succinctly
You start with no variables and then add variables one at a time until there is no further improvement by the selected criterion
How does regularization work?
It adds a penalty to the loglikelihood that relates to the size of the coefficients
-This diminishes the effect of each feature, particularly features that have limited predictive power
Compare the ridge vs LASSO penalty
- Ridge: penalty is proportional to the sum of squares of the estimated coefficients
- Lasso: penalty is proportional to the sum of the absolute value of the estimated coefficients
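In symbols, regularization maximizes a penalized loglikelihood, with lambda controlling the penalty strength:

Ridge: loglik - lambda * sum(beta_j^2)
Lasso: loglik - lambda * sum(|beta_j|)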
True or false
-Regularization methods require binarization of categorical variables
TRUE
What does the qq plot show? how does it relate to GLM fitting?
- The Q-Q plot displays the standardized deviance residuals
- It is used to assess the adequacy of the fitted GLM: if the model is correctly specified, then the standardized deviance residuals should be approximately normally distributed
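In R, rstandard() on a glm object returns standardized deviance residuals by default, so a quick check is (fit is a hypothetical fitted GLM):

qqnorm(rstandard(fit))  # standardized deviance residuals vs normal quantiles
qqline(rstandard(fit))  # reference line; points should lie close to it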