Module 6 - GLM Flashcards

1
Q

Overdispersion: definition? How to fix?

A

1) Variance of the response is greater than the mean in a Poisson GLM

2) Use quasi-likelihood (quasi-Poisson)
- Coefficient estimates will be the same, but their standard errors will differ (they are scaled by the estimated dispersion parameter)
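
A minimal R sketch (hypothetical data frame dat with count target claims; variable names are illustrative): fitting the same model under both families shows identical coefficients but different standard errors.

  # Same count model under Poisson and quasi-Poisson
  fit_pois  <- glm(claims ~ age + region, family = poisson(link = "log"), data = dat)
  fit_quasi <- glm(claims ~ age + region, family = quasipoisson(link = "log"), data = dat)
  coef(fit_pois) - coef(fit_quasi)   # all zeros: identical point estimates
  summary(fit_quasi)$dispersion      # estimated dispersion; > 1 signals overdispersion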

2
Q

Interaction terms: when does an interaction occur?

A

1) Occurs when the effect of one feature on the response depends on the value of another feature

3
Q

Interaction terms: why also include the underlying main-effect variables?

A

Hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if they have insignificant p-values

1) Interactions are hard to interpret in a model without main effects
2) Interaction terms implicitly contain the main effects, even if the main effects do not appear in the model

4
Q

R^2, what does it mean? Problem with the measure and solution?

A

1) Fraction of the variance in the response explained by the model (equivalently, the fraction by which the model reduces the variance)

2) Adding a predictor can never decrease its value, so more variables always look at least as good
- Fix: adjusted R^2 adds a penalty for more parameters

5
Q

Collinearity, definition? How to deal with it? (2)

A

- Two or more predictor variables are highly correlated with each other

Solutions:

1) Drop one of the problematic variables
2) Combine the collinear variables into a single predictor

6
Q

Offsets, definition? Why are they used?

A
  • A variable whose effect on the response is known, so its coefficient does not need to be estimated (it is fixed at Beta = 1)
  • The GLM still needs to be made aware of the offset variable so that the estimated coefficients for the OTHER variables are optimal in its presence
  • Used to adjust for exposure
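
A minimal R sketch (hypothetical dat with an exposure column; names are illustrative): with a log link, exposure enters as log(exposure) with its coefficient fixed at 1.

  fit <- glm(claims ~ age + region + offset(log(exposure)),
             family = poisson(link = "log"), data = dat)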
7
Q

Prior weights, definition? Why used?

A
  • Give information about the credibility of each observation in the model.
  • Assign greater credibility to rows that represent a greater number of risks when estimating the model coefficients.
  • The weight variable specifies the weight given to each record in the estimation process.
    ex: 1 year of exposure vs 1 month of exposure

-Observations with HIGHER EXPOSURE are deemed to have LOWER VARIANCE
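
A minimal R sketch (hypothetical dat where each record's avg_claims is an average over n_policies policyholders): the weights argument tells the GLM how credible each row is.

  # Rows representing more policyholders get more weight (lower variance)
  fit <- glm(avg_claims ~ age + region, family = poisson(link = "log"),
             weights = n_policies, data = dat)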

8
Q

Deviance, definition? How does it work?

A
  • Measure of goodness of fit of a GLM
  • Compares the loglikelihood of the fitted model with that of the saturated model (a perfect fit)
  • Smaller deviance = better model
9
Q

Homoscedasticity, definition?

A

Error terms have constant variance

e ~ N(0, sigma^2)

10
Q

Graph to use for:

1) Observations that have too large an impact on coefficients?
2) Normality of the distribution of residuals?
3) Homogeneity of the variance and linearity of relationship? (2)

A

1) Residuals vs Leverage
2) Normal Q-Q
3) Residuals vs Fitted, Scale-Location
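
In R, all four are produced by calling plot() on a fitted model (a sketch with a hypothetical lm fit):

  fit <- lm(y ~ x1 + x2, data = dat)
  par(mfrow = c(2, 2))
  plot(fit)            # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
  plot(fit, which = 5) # a single plot, here Residuals vs Leverage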

11
Q

Alpha parameters for:

1) Lasso?
2) Elastic Net?
3) Ridge regression?

A

1) Lasso: Alpha = 1
2) Elastic Net: 0 < Alpha < 1
3) Ridge: Alpha = 0
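
A minimal sketch with the glmnet package (assuming a numeric predictor matrix X and response vector y):

  library(glmnet)
  fit_lasso <- glmnet(X, y, alpha = 1)     # lasso
  fit_enet  <- glmnet(X, y, alpha = 0.5)   # elastic net: any 0 < alpha < 1
  fit_ridge <- glmnet(X, y, alpha = 0)     # ridge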

12
Q

Difference between lasso and ridge? What is one better than the other at?

A

1) With lasso, the optimal solution can reduce a coefficient to exactly 0
- This cannot happen with ridge
- Thus, lasso can completely remove a feature

2) Lasso: better at feature selection
- Ridge: often better at predictive fit, since it keeps some contribution from every feature

13
Q

With lasso, as Lambda increases…?

A
  • More of the features will be eliminated (their coefficients are set exactly to zero)

- All coefficients are shrunk toward zero; the smallest coefficients reach zero (and are eliminated) first

14
Q

Limitation of feature selection using regularization techniques?

A

It is an automatic method, so the selected model is not always the most interpretable

15
Q

Cross-Validation, explained, steps?

A

Repeating the validation step with different training/validation splits:

1) Split the data into k folds; train the model on k-1 folds, then predict and record the error on the held-out fold
2) Repeat k times, so each fold serves as the validation set once
3) The CV error is the average (or weighted average) of the k fold errors
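
A minimal k-fold CV sketch in base R (hypothetical dat and formula, RMSE as the error measure):

  k     <- 5
  folds <- sample(rep(1:k, length.out = nrow(dat)))    # random fold labels
  cv_err <- sapply(1:k, function(i) {
    fit  <- glm(y ~ x1 + x2, data = dat[folds != i, ])
    pred <- predict(fit, newdata = dat[folds == i, ], type = "response")
    sqrt(mean((dat$y[folds == i] - pred)^2))           # fold RMSE
  })
  mean(cv_err)                                         # CV error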

16
Q

GLM to use for: (Family and link function)

Probability or binary?

A

Family = binomial

Link function = Logit

17
Q

GLM to use for: (Family and link function)

Count

A

Family = poisson or quasipoisson

Link function = Log

18
Q

GLM to use for: (Family and link function)

Continuous positive?

A

Family = Gamma, Inverse Gaussian

Link function = Log (for both)
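
A combined R sketch for the three target types above (hypothetical formulas and data):

  glm(bought ~ age + income,   family = binomial(link = "logit"), data = dat)  # binary / probability
  glm(claims ~ age + region,   family = poisson(link = "log"),    data = dat)  # count
  glm(severity ~ age + region, family = Gamma(link = "log"),      data = dat)  # continuous positive
  glm(severity ~ age + region, family = inverse.gaussian(link = "log"), data = dat)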

19
Q

Advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares? (3)

A

1) Simpler model
2) Improved prediction accuracy
3) Results easier to interpret

20
Q

Lasso Regularized regression

1) Advantages? (2)

A

1) Binarization is always done (through the use of the model matrix), and each factor level is treated as a separate feature
2) Variable selection is automatic, using CV to minimize prediction error rather than a proxy such as AIC or hypothesis tests
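
A minimal sketch of this workflow (hypothetical dat with target y): model.matrix() does the binarization, and cv.glmnet() chooses lambda by cross-validated prediction error.

  library(glmnet)
  X <- model.matrix(y ~ ., data = dat)[, -1]   # binarize factors; drop intercept column
  cv_fit <- cv.glmnet(X, dat$y, alpha = 1)     # lasso with lambda selected by CV
  coef(cv_fit, s = "lambda.min")               # coefficients at the CV-optimal lambda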

21
Q

Lasso Regularized regression

1) Disadvantages? (1)

A

1) Since variables are scaled, the estimated coefficients are difficult to interpret

22
Q

Classification tree using Cost-Complexity Pruning

1) Advantages? (4)

A

1) Easy to explain and present due to if/else nature
2) Automatically removes variables (those not used in any split simply do not appear in the tree), allowing interpretation to focus on the most significant factors
3) More easily adapts to non-linear relationships
4) Automatically captures interaction effects

23
Q

Classification tree using Cost-Complexity Pruning

1) Disadvantages? (2)

A

1) Danger of overfitting

2) Resulting tree can be highly dependent on the training set

24
Q

Random Forest

1) Advantages? (2)

A

1) Reduces overfitting and variance by allowing results from multiple trees to be combined
2) Uses CV to set the tuning parameters

25
Q

Random Forest

1) Disadvantages? (3)

A

1) Difficult to interpret
2) Longer runtime
3) Difficult to implement

26
Q

Disadvantage of stepAIC algorithm for factor variables?

How to solve the issue?

A

1) stepAIC treats a factor variable as a single feature
- As such, it either retains the variable with all of its levels or removes the variable entirely.
- It does not allow for the possibility that individual factor levels may be insignificant relative to the base level, or insignificantly different from other levels
2) Solution: binarize the factor variables into separate 0/1 indicator columns
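
A minimal sketch of the fix (hypothetical dat with target y): build the 0/1 columns explicitly so stepAIC() can drop individual levels.

  library(MASS)
  X_bin   <- as.data.frame(model.matrix(y ~ ., data = dat)[, -1])  # one 0/1 column per level
  dat_bin <- cbind(y = dat$y, X_bin)
  best <- stepAIC(glm(y ~ ., data = dat_bin), direction = "backward")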

27
Q

Interpret AIC between models?

A

Model with the LOWEST AIC is the best

28
Q

True or false

-When using regularization methods, numeric features should be standardized

A

TRUE

-Standardization puts features on a common scale, giving each an equal chance of having its coefficient altered by the penalty

29
Q

Why do you create an interaction term?

A
  • We create an interaction term because we suspect the effect of one variable on the target is influenced by the level of another variable
  • Multiplying variables together or creating flags allows algorithms to pick out the patterns and interactions MUCH MORE EASILY than hoping the algorithm finds them itself
30
Q

Advantages of GLMs? (4)

A

1) Intuitive to understand and easily communicated
2) Allow for non-normal distributions
3) Model a functional relationship between the target and a linear function of the variables. Can show the effect of a predictor on the target in terms of magnitude and direction (+/-)
4) Good for modelling continuous response variables

31
Q

Disadvantages of GLMs (4)

A

1) CANNOT capture NON-LINEAR relationships (unless the features are manually transformed)
2) Sensitive to the choice of features included
3) Risk of collinearity producing suboptimal models
4) Underlying assumptions may not always be met

32
Q

Explain regularization in 1 sentence

A

-Regularization refers to techniques used to REDUCE THE VARIANCE by PENALIZING THE MODEL FOR ADDING MORE PREDICTORS, to avoid the risk of overfitting.

33
Q

Why do we use stepwise variable selection with AIC? (2)

A
  • To remove unimportant variables

- To reduce the risk of overfitting

34
Q

What do p-values represent/express?

A

They express the significance of the variables

  • Smaller the p-value, the more significant the variable is
  • Less than 0.05 is considered statistically significant
35
Q

What does this do:

drop1(glm.freq, test = "LRT")

A
  • Conducts a likelihood ratio test for the model

- Small p value = variable is highly significant
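
Sketch of how to read the output (glm.freq as in the card):

  drop1(glm.freq, test = "LRT")
  # One row per feature: each compares the full model to the model with
  # that feature dropped; a small Pr(>Chi) means the feature adds
  # significant predictive value given the other features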

36
Q

Stepwise Regression, advantages? (2)

A
  • Automatic method and fast

- Can handle a large number of predictor variables and choose the best ones from the available options

37
Q

Stepwise Regression, disadvantages? (2)

A
  • Problems with correlated variables: if two predictor variables in the model are highly correlated, only one may make it into the model.
  • Greedy nature of the algorithm: it assumes each step moves you closer to the best solution, which is often a bad assumption. Since the process is automatic, variables may be removed from the model even when they are important.
    - Therefore, there is no guarantee it will yield the best subset of features among all possible combinations
38
Q

Bias variance tradeoff explained:

  • High bias = ?
  • High variance = ?
A
  1. High bias = inaccurate model
    - Does not have the capacity to capture the signal in the data
  2. High variance = overfits to the data it was trained on
    - Won’t generalize well to unseen data
39
Q

Describe a GLM in a phrase

A

Models that

  • Take all significant variables into account
  • Assess the relative importance of each predictor
  • While also creating an easy-to-implement formula to calculate a prediction for a given observation
40
Q

Explain what a glm family is?

A

Family refers to the distribution that the target variable is assumed to follow

-Will impact how the algorithm fits the model

41
Q

What is the purpose of the link function?

A

The link function relates the linear predictor to the mean of the response, forcing the predicted mean for each observation into an appropriate range (e.g., positive for counts, between 0 and 1 for probabilities)

42
Q

Why is overfitting bad (i.e., adding more and more predictor variables)?

A

Adding additional variables can improve the fit on the training data, but may actually decrease the fit on unseen (test) data

43
Q

Information criterion, use? Explain

A

Reduces overfitting by demanding that an additional variable increase the loglikelihood by a specific amount in order to be added

44
Q

AIC vs BIC, comparison?

A

AIC
-Adding a variable must increase the loglikelihood by 1 per parameter added (a penalty of 2 per parameter on the -2*loglikelihood scale)

BIC
-Adding a variable must increase the loglikelihood by ln(n)/2 per parameter added (a penalty of ln(n) on the -2*loglikelihood scale)

Therefore, BIC is a more conservative approach, since there is a greater penalty for each parameter (ln(n) > 2 once n >= 8)
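
In R (a minimal sketch, assuming a fitted model fit):

  AIC(fit)                       # penalty of 2 per parameter
  BIC(fit)                       # penalty of ln(n) per parameter
  step(fit, k = log(nobs(fit)))  # stepwise with the BIC penalty; k = 2 is the AIC default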

45
Q

AIC vs BIC, when would you use which?

A

BIC penalizes more severely than AIC
-Therefore, if you're trying to identify the key variables related to the target (i.e., the smallest set possible), it is better to use BIC since it is more conservative

46
Q

Forward vs backward selection -> why use one or the other?

A

Forward selection is more likely to end up with fewer variables -> resulting in a SIMPLER MODEL

47
Q

In R, how should you treat a numeric variable that takes only a small number of distinct values?

A

Convert it into a factor variable

48
Q

Why is RMSE used as a regression performance indicator over MSE?

A

RMSE has the same unit as the target variable, making its value easier to interpret
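
A one-line R sketch (hypothetical vectors actual and predicted):

  rmse <- sqrt(mean((actual - predicted)^2))   # same units as the target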

49
Q

Explain strategies for reducing the dimension of a categorical variable?

A
  1. Combining similar categories
    - Categories with similar values of the target variable (mean, median, etc.)
  2. Combining sparse categories together
    - Categories with few observations
    - Grouping them into an "Other" category
50
Q

Why do you use k-1 dummy variables for a factor with k levels?

A

Otherwise, there is a perfect linear relationship among the dummy variables (collinearity), which will destabilize the model fitting process

51
Q

How do you choose the baseline level for a categorical predictor?

A

Choose the one with the most observations (default), or choose the one that ‘makes the most sense logically’
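
A minimal R sketch (hypothetical factor dat$region):

  # Make the most frequent level the baseline for the dummy coding
  dat$region <- relevel(factor(dat$region), ref = names(which.max(table(dat$region))))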

52
Q

Pros of using regularization for feature selection in R? (2)

A
  1. glmnet() binarizes categorical predictors in advance (via the model matrix).
    - This allows us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor
  2. Regularization is computationally more efficient than stepwise selection algorithms
53
Q

Cons of using regularization for feature selection in R? (2)

A
  1. Regularization may not produce the most interpretable model
    - especially for ridge, since all features are retained
  2. glmnet() is restricted in terms of model forms -> it can't accommodate all of the distributions used for GLMs
    - Ex: it does not cover the gamma distribution
54
Q

For AIC and BIC, how are features added?

A

To be included, a feature must INCREASE the LOGLIKELIHOOD by more than the following per parameter (stated on the -2*loglikelihood scale):

AIC: 2 (i.e., a loglikelihood increase of 1)
BIC: ln(n) (i.e., a loglikelihood increase of ln(n)/2)

55
Q

Why is having collinear (highly correlated) variables in a model bad?

A
  • This means you’re entering the same information in the model twice
  • This makes it difficult for the GLM to separate the individual effects of the collinear variables on the target , causing instability in the model
56
Q

How does drop1() choose variables using p-values?

A

For each variable, it tries to answer: "Does the feature in question provide additional predictive value IN THE PRESENCE of the OTHER FEATURES?"

  • If not, the feature is a candidate for removal
57
Q

When to use weights vs offsets in terms of observed data?

A
  • Weights: use when observations of target variable are averages across the members of the same group (ex: 1 record = avg of 0.5 claim counts over 100 policyholders)
  • Offsets: use when observations are values aggregated over members of the same group
    (ex: 1 record = 50 claim counts over 100 policyholders)
58
Q

Weights and offsets, how do they affect the mean and variance

A

Weights: records with more weight have lower variance and are therefore more reliable, but the weight does not affect the mean of the target

Offsets: group size is positively related to the mean of the target variable, but leaves its variance unaffected

59
Q

Classification: how does the cutoff work?

A

Binary classifiers predict probabilities, which do not by themselves say whether the event is predicted to occur

Cutoff: if the predicted probability is above the threshold, the event is predicted to occur; otherwise it is not
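
A minimal R sketch (hypothetical fitted binary GLM fit and test set; the 0.5 cutoff is an illustrative choice):

  p_hat <- predict(fit, newdata = test, type = "response")  # predicted probabilities
  y_hat <- ifelse(p_hat > 0.5, 1, 0)                        # apply the cutoff
  table(predicted = y_hat, actual = test$y)                 # confusion matrix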

60
Q

True or False

-when fitting models by maximum likelihood, additional variables never decrease the loglikelihood value

A

TRUE

61
Q

Explain forward selection succinctly

A

You start with no variables and add them one at a time until the selected criterion shows no further improvement

62
Q

How does regularization work?

A

It adds a penalty to the loglikelihood that relates to the size of the coefficients
-This diminishes the effect of each feature, particularly those with limited predictive power

63
Q

Compare the ridge vs LASSO penalty

A
  • Ridge: penalty is proportional to the sum of squares of the estimated coefficients
  • Lasso: penalty is proportional to the sum of the absolute value of the estimated coefficients
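In standard notation (not from the card; lambda is the regularization strength and beta_1, ..., beta_p are the estimated coefficients):

  Ridge penalty: lambda * (beta_1^2 + ... + beta_p^2)
  Lasso penalty: lambda * (|beta_1| + ... + |beta_p|)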
64
Q

True or false

-Regularization methods require binarization of categorical variables

A

TRUE

65
Q

What does the Q-Q plot show? How does it relate to GLM fitting?

A
  • The Q-Q plot displays the standardized deviance residuals
  • It is used to assess the adequacy of the fitted GLM: if the model is correctly specified, then the standardized deviance residuals should be approximately normally distributed