Module 6 - GLM Flashcards
Overdispersion: definition? How to fix?
1) The variance of the response is greater than its mean, violating the Poisson GLM assumption that variance = mean
2) Use quasi-likelihood (quasipoisson)
- Coefficient estimates will be the same, but standard errors are scaled by the estimated dispersion parameter
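A minimal R sketch of the fix (the data frame dat and the variables are hypothetical):

# Same coefficients as family = poisson, but standard errors are
# scaled by the estimated dispersion parameter
fit.qp <- glm(claims ~ age + region, family = quasipoisson(link = "log"), data = dat)
summary(fit.qp)  # a dispersion parameter well above 1 signals overdispersion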
Interaction terms: when do they occur?
1) Occur when the effect of one feature on the response depends on the level of another feature
Interaction terms: why use underlying variables?
Hierarchy principle: if we include an interaction in a model, we should also include the main effects, even if their p-values are insignificant
1) Interactions are hard to interpret in a model without main effects
2) An interaction term implicitly contains the main effects, so dropping them from the model does not truly remove them
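In R's formula syntax, the * operator respects the hierarchy principle automatically (variable names below are hypothetical):

# y ~ x1 * x2 expands to y ~ x1 + x2 + x1:x2,
# so the main effects are always included alongside the interaction
fit <- glm(y ~ x1 * x2, family = gaussian(), data = dat)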
R^2, what does it mean? Problem with the measure and solution?
1) Fraction of variance explained by the model = fraction of variance reduced
2) Adding a predictor never decreases its value (and almost always increases it), even if the predictor is useless
- Fix: adjusted R^2 adds a penalty for more parameters
Collinearity, definition? How to deal with it? (2)
- Two or more predictor variables are strongly related to each other
Solutions:
1) Drop one of the problematic variables
2) Combine collinear variables together into a single predictor
Offsets, definition? Why are they used?
- A variable whose effect on the response is known, so its coefficient does not need to be estimated (beta = 1)
- But the GLM still needs to be made aware of the offset variable so that the estimated coefficients for the OTHER variables are optimal in its presence
- Used to adjust for exposure
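A hedged R sketch, assuming a claim-count model where exposure is measured in years; log(exposure) is offset because the model has a log link:

# The coefficient on log(exposure) is fixed at 1 rather than estimated
fit <- glm(claims ~ age + region + offset(log(exposure)),
           family = poisson(link = "log"), data = dat)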
Prior weights, definition? why used?
- Give information about the credibility of each observation in the model.
- Assign greater credibility to rows that represent a greater number of risks in the estimation of the model coefficients.
- Weight variable specifies the weight given to each record in the estimation process.
ex: 1 year of exposure vs 1 month of exposure
-Observations with HIGHER EXPOSURE deemed to have LOWER VARIANCE
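A minimal sketch, assuming each record's target is an average severity across claim_count claims (hypothetical names):

# Records backed by more claims get more weight (lower assumed variance)
fit <- glm(avg_severity ~ age + region, family = Gamma(link = "log"),
           weights = claim_count, data = dat)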
Deviance, definition? how does it work?
- Measure of goodness of fit of a GLM
- Compares the loglikelihood of the fitted model with that of the saturated model
- Smaller deviance = Better model
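For reference, the deviance compares the fitted model to the saturated model (one parameter per observation):

D = 2 * (loglik of saturated model - loglik of fitted model)

In R, deviance(fit) returns the residual deviance of a fitted GLM.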
Homoscedasticity, definition?
Error terms have constant variance
e ~ N(0, sigma^2)
Graph to use for:
1) Observations that have too large an impact on coefficients?
2) Normality of the distribution of residuals?
3) Homogeneity of the variance and linearity of relationship? (2)
1) Residuals vs Leverage
2) Normal Q-Q
3) Residuals vs Fitted, Scale-Location
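In base R, calling plot() on a fitted model produces these diagnostics (the which codes below are the standard plot.lm ones; fit is a hypothetical fitted model):

plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q
plot(fit, which = 3)  # Scale-Location
plot(fit, which = 5)  # Residuals vs Leverage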
Alpha parameters for:
1) Lasso?
2) Elastic Net?
3) Ridge regression?
1) Lasso: Alpha = 1
2) Elastic Net: 0 < Alpha < 1
3) Ridge: Alpha = 0
Difference between lasso and ridge? What is one better than the other at?
1) With Lasso, optimal solution can reduce a coefficient to exactly = 0
- Which cannot happen with ridge
- Thus, lasso can completely remove a feature
2) Lasso: better at feature selection
- Ridge: better at predictive fit when most features carry some signal
With lasso, as Lambda increases…?
- More of the features will be eliminated
- Larger coefficients shrink at a much faster rate than smaller coefficients
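A hedged glmnet sketch tying alpha and lambda together (the predictor matrix X and target y are hypothetical; glmnet requires a numeric matrix, e.g. from model.matrix()):

library(glmnet)
# alpha = 1 -> lasso; alpha = 0 -> ridge; in between -> elastic net
cv.fit <- cv.glmnet(X, y, alpha = 1)  # CV chooses lambda
coef(cv.fit, s = "lambda.min")        # zero coefficients = eliminated features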
Limitation of feature selection using regularization techniques?
Automatic method -> so not always most interpretable
Cross-Validation, explained, steps?
Repeating the validation step with different training/test samples
1) Split the data into k folds; train the model on k-1 folds, then predict and record the error on the held-out fold
2) Repeat k times, once with each fold held out
3) Calculate the error for each fold -> the CV error is the average of the fold errors, as in the sketch below
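A minimal sketch of k-fold CV in R, assuming a data frame dat with numeric target y and RMSE as the error measure:

k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment
errors <- numeric(k)
for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(y ~ ., family = gaussian(), data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- sqrt(mean((test$y - pred)^2))  # fold RMSE
}
cv.error <- mean(errors)  # CV error = average of the fold errors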
GLM to use for: (Family and link function)
Probability or binary?
Family = binomial
Link function = Logit
GLM to use for: (Family and link function)
Count
Family = poisson or quasipoisson
Link function = Log
GLM to use for: (Family and link function)
Continuous positive?
Family = Gamma, Inverse Gaussian
Link function = Log (for both)
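The corresponding R calls (formula and data are placeholders):

glm(y ~ x, family = binomial(link = "logit"), data = dat)        # binary
glm(y ~ x, family = poisson(link = "log"), data = dat)           # count
glm(y ~ x, family = Gamma(link = "log"), data = dat)             # continuous positive
glm(y ~ x, family = inverse.gaussian(link = "log"), data = dat)  # continuous positive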
Advantages of using an alternative fitting procedure, such as subset selection and shrinkage, instead of least squares? (3)
1) Simpler model
2) Improved prediction accuracy
3) Results easier to interpret
Lasso Regularized regression
1) Advantages? (2)
1) Binarization always done (through the use of the model matrix), and each factor level treated as a separate feature
2) Variable selection is automatic, using CV to minimize prediction error rather than a proxy such as AIC or hypothesis tests
Lasso Regularized regression
1) Disadvantages? (1)
1) Since variables are scaled, the estimated coefficients are difficult to interpret
Classification tree using Cost-Complexity Pruning
1) Advantages? (4)
1) Easy to explain and present due to if/else nature
2) Automatically removes variables (by not showing up in the tree), allowing interpretation to focus on the most significant factors
3) More easily adapts to non-linear relationships
4) Automatically captures interaction effects
Classification tree using Cost-Complexity Pruning
1) Disadvantages? (2)
1) Danger of overfitting
2) Resulting Tree can be highly dependent on the training set
Random Forest
1) Advantages? (2)
1) Reduces overfitting and variance by allowing results from multiple trees to be combined
2) Uses CV to set the tuning parameters
Random Forest
1) Disadvantages? (3)
1) Difficult to interpret
2) Longer runtime
3) Difficult to implement
Disadvantage of stepAIC algorithm for factor variables?
How to solve the issue?
1) stepAIC treats a factor variable as a single feature
- As such, it either retains the factor with all of its levels or removes the variable entirely
- Does not allow for the possibility that individual factor levels may be insignificant relative to the base level, or insignificantly different from other levels
2) Solution: binarize the factor variables into 0/1 dummy columns, as sketched below
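A sketch of that binarization step using model.matrix() (hypothetical data frame dat with factor region):

# Expands the factor into 0/1 dummy columns, one per non-base level,
# so stepAIC can add or drop individual levels
X <- model.matrix(~ region, data = dat)[, -1]  # drop the intercept column
dat.bin <- cbind(dat, X)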
Interpret AIC between models?
Model with the LOWEST AIC is the best
True or false
-When using regularization methods, numeric features should be standardized
TRUE
-Standardization gives each feature an equal chance of having its coefficient altered
Why do you create an interaction term?
- We create an interaction term when we suspect the effect of one variable on the target is influenced by the level of another variable
- Multiplying variables together/creating flags allows algorithms to pick out the patterns and interactions MUCH MORE EASILY than hoping that the algorithm finds them itself
Advantages of GLMs? (4)
1) Intuitive to understand and easily communicated
2) Allow for non-normal distributions
3) Allow a functional (link) relationship between the target and a linear combination of the predictors. Can show the effect of a predictor variable on the target in terms of magnitude and direction (+/-)
4) good for modelling continuous response variables
Disadvantages of GLMs (4)
1) CANNOT capture NON-LINEAR relationships
2) Sensitive to the choice of features included
3) Risk of collinearity producing suboptimal models
4) Underlying assumptions may not always be met
Explain regularization in 1 sentence
-Regularization is a set of techniques used to REDUCE VARIANCE by PENALIZING THE SIZE OF THE MODEL'S COEFFICIENTS, reducing the risk of overfitting.
Why do we use stepwise variable selection with AIC? (2)
- To remove unimportant variables
- To reduce the risk of overfitting
What do p-values represent/express?
They express the significance of the variables
- Smaller the p-value, the more significant the variable is
- Less than 0.05 is considered statistically significant
What does this do:
drop1(glm.freq, test = "LRT")
- Conducts a likelihood ratio test for each variable, dropping one variable at a time
- Small p-value = variable is highly significant
Stepwise Regression, advantages? (2)
- Automatic method and fast
- Can manage large amounts of predictor variables to choose the best ones from the available options
Stepwise Regression, disadvantages? (2)
- Problems with correlated variables: if two predictor variables in the model are highly correlated, only 1 may make it into the model.
- Greedy nature of the algorithm: it assumes each step moves you closer to the best model, which is often a bad assumption. Since it is automatic, variables may be removed even when they are important to include.
- Therefore, there is no guarantee it will yield the best subset of features among all possible combinations
Bias variance tradeoff explained:
- High bias = ?
- High variance = ?
- High bias = inaccurate model
  - Does not have the capacity to capture the signal in the data
- High variance = overfits to the data it was trained on
  - Won't generalize well to unseen data
Describe a GLM in a phrase
Models that
- Take all significant variables into account
- Assess the relative importance of each predictor
- While also creating an easy to implement formula to calculate a prediction for a given observation
Explain what a glm family is?
Family refers to the distribution that the target variable is assumed to follow
-Will impact how the algorithm fits the model
What is the purpose of the link function?
The link function connects the linear predictor to the mean of the target, forcing the predicted mean for each observation into the appropriate range (e.g., a log link keeps it positive)
Why is overfitting bad (e.g., adding more and more predictor variables)?
Adding additional variables can improve fit to the training data, but may actually decrease fit on unseen (testing) data
Information criterion, use? Explain
Reduces overfitting by demanding that an additional variable increase the loglikelihood by a specific amount in order to be added
AIC vs BIC, comparison?
AIC
-Penalty of 2 per parameter added; the loglikelihood must increase by more than 1 per parameter for AIC to improve
BIC
-Penalty of ln(n) per parameter added; the loglikelihood must increase by more than ln(n)/2 per parameter
Therefore, BIC is a more conservative approach, since the penalty per parameter is greater (whenever n >= 8)
AIC vs BIC, when would you use which?
BIC penalizes more severely than AIC
-Therefore, if you're trying to identify the key variables related to the target variable (i.e., the smallest set possible), it is better to use BIC since it is more conservative
Forward vs backward selection -> why use one or the other?
Forward selection is more likely to end up with fewer variables -> resulting in a SIMPLER MODEL
In R, how do you treat a numeric variable that takes only a few distinct values?
Convert it into a factor variable
Why is RMSE used as a regression performance indicator over MSE?
RMSE has the same unit as the target variable, making its value easier to interpret
Explain strategies for reducing the dimension of a categorical variable?
- Combining similar categories
  - Categories with similar values of the target variable (mean, median, etc.)
- Combining sparse categories together
  - Categories with few observations
- Grouping leftover categories into an "other" level
Why do you use k-1 dummy variables for a factor with k levels?
Otherwise the dummy columns sum to one, a perfect linear relationship (collinearity) with the intercept, which will destabilize the model fitting process
How do you choose the baseline level for a categorical predictor?
Choose the one with the most observations (default), or choose the one that ‘makes the most sense logically’
Pros of using regularization for feature selection in R? (2)
- glmnet() allows binarization of categorical predictors in advance
  - This allows us to assess the significance of individual factor levels, not just the significance of the entire categorical predictor
- Regularization is computationally more efficient than stepwise selection algorithms
Cons of using regularization for feature selection in R? (2)
- Regularization may not produce the most interpretable model
  - Especially for ridge, since all features are retained
- glmnet() is restricted in terms of model forms -> it can't accommodate all of the distributions for GLMs
  - Ex: it does not cover the gamma family
For AIC and BIC, how are features added?
For a feature to be worth adding, it must INCREASE the LOGLIKELIHOOD by more than the following amount per parameter:
AIC: 1
BIC: ln(n)/2
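For reference, with loglikelihood l, p parameters, and n observations:

AIC = -2*l + 2*p
BIC = -2*l + p*ln(n)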
Why is having collinear (highly correlated variables) in a model bad?
- This means you’re entering the same information in the model twice
- This makes it difficult for the GLM to separate the individual effects of the collinear variables on the target, causing instability in the model
How does drop1() choose variables using p-values?
For each variable, it tries to answer: "Does the feature in question provide additional predictive value IN THE PRESENCE of the OTHER FEATURES?"
- If not, the variable is a candidate for removal
When to use weights vs offsets in terms of observed data?
- Weights: use when observations of target variable are averages across the members of the same group (ex: 1 record = avg of 0.5 claim counts over 100 policyholders)
- Offsets: use when observations are values aggregated over members of the same group
(ex: 1 record = 50 claim counts over 100 policyholders)
Weights and offsets, how do they affect the mean and variance
Weights: records with more weight have less variance and are therefore more reliable, but this does not affect the mean of the target
Offsets: group size is positively related to the mean of the target variable, but leaves its variance unaffected
Classification: how does the cutoff work?
Binary classifiers only predict probabilities, which do not by themselves say whether the event is predicted to occur
Cutoff: if the predicted probability is above the threshold, the event is predicted; otherwise it is not
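A minimal sketch, assuming a fitted binomial GLM fit, test data test, and a chosen cutoff:

prob <- predict(fit, newdata = test, type = "response")  # predicted probabilities
cutoff <- 0.5                                            # hypothetical threshold
pred.class <- ifelse(prob > cutoff, 1, 0)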
True or False
-when fitting models by maximum likelihood, additional variables never decrease the loglikelihood value
TRUE
Explain forward selection succinctly
You start with no variables and then add variables one at a time until there is no further improvement by the selected criterion
How does regularization work?
It adds a penalty to the loglikelihood that relates to the size of the coefficients
-This diminishes the effect of each feature, particularly features that have limited predictive power
Compare the ridge vs LASSO penalty
- Ridge: penalty is proportional to the sum of squares of the estimated coefficients
- Lasso: penalty is proportional to the sum of the absolute value of the estimated coefficients
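In symbols, regularization maximizes a penalized loglikelihood, with lambda controlling the penalty strength:

Ridge: loglik - lambda * sum(beta_j^2)
Lasso: loglik - lambda * sum(|beta_j|)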
True or false
-Regularization methods require binarization of categorical variables
TRUE
What does the qq plot show? how does it relate to GLM fitting?
- The Q-Q plot displays the standardized deviance residuals
- It is used to assess the adequacy of the fitted GLM: if the model is correctly specified, then the standardized deviance residuals should be approximately normally distributed
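In R, rstandard() on a glm object returns standardized deviance residuals by default, so a quick check is (fit is a hypothetical fitted GLM):

qqnorm(rstandard(fit))  # standardized deviance residuals vs normal quantiles
qqline(rstandard(fit))  # reference line; points should lie close to it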