Generalized Linear Models (25-35%) Flashcards

1
Q

What are the Ordinary Linear Model (OLS) assumptions?

A
  1. The mean of the target variable is a linear function of the predictor variables
  2. The variance of the target variable is constant, regardless of the values of the predictor variables
  3. Given the predictor variables, the target variable has a normal distribution
  4. Given the predictor variables, the observations are independent
2
Q

What are some common situations where the linear model assumptions do not hold?

A
  1. The range of the target variable is positive, as is generally the case with insurance claim severity and claim counts. The normal distribution allows for negative values, and hence the model may predict negative outcomes.
  2. The variance of the target variable depends on the mean. This violates the constant variance assumption. For example, those for whom larger claims are predicted may also have a larger variance of those claims.
  3. The target variable is binary. The restriction to 0 and 1 responses does not fit the normal distribution, as predicted values can easily go out of this range.
  4. The relationship of the predictor variables to the target may not be linear. A common example of non-linearity is a multiplicative relationship.
3
Q

What are the Generalized Linear Model (GLM) assumptions?

A
  1. Given the predictor variable values, the target variables are independent (this is unchanged).
  2. Given the predictor variable values, the target variable’s distribution is a member of the exponential family.
  3. Given the predictor variable values, the expected value of the target variable is μ = g⁻¹(η), where η = Xβ is the linear predictor, g is called the link function, and g⁻¹ is its inverse

Note: if the conditional distribution of the target variable is normal (which is a member of the exponential family) and the link function is simply the identity, g(μ) = μ, we have the ordinary regression model.

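As a quick illustration (a minimal sketch in R using the built-in mtcars data; the variable choices are arbitrary), a Gaussian GLM with the identity link reproduces ordinary least squares:

    # OLS and the equivalent GLM give identical coefficient estimates
    fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
    fit_glm <- glm(mpg ~ wt + hp, family = gaussian(link = "identity"), data = mtcars)
    coef(fit_lm)
    coef(fit_glm)
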
4
Q

What are the commonly used link functions?

A

Identity: g(μ) = μ, g⁻¹(η) = η
Log: g(μ) = log(μ), g⁻¹(η) = exp(η)
Reciprocal: g(μ) = 1/μ, g⁻¹(η) = 1/η
Logit: g(μ) = log[μ / (1 − μ)], g⁻¹(η) = e^η / (1 + e^η)

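In R, each of these links is specified through the family object passed to glm; for reference, some standard pairings (illustrative, not exhaustive):

    gaussian(link = "identity")
    poisson(link = "log")
    Gamma(link = "inverse")    # the reciprocal link
    binomial(link = "logit")
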
5
Q

What is a key assumption of ordinary regression?

A

The conditional (given the values of the predictors) distribution of the response variable is normal.

6
Q

What are the properties of normally distributed observations?

A
  1. Symmetric about the mean
  2. Continuous
  3. Can assume any value, positive or negative
7
Q

Exponential distributions for GLMs

A

Binary - when the response variable is binary (zero or one), the solution is to use logistic regression.

Count data - if the data are in the form of a whole number of occurrences, such as claim counts, it may be best to use Poisson regression because it places positive probability on 0, 1, 2, … Here, the mean is a positive number, so a link function that forces positive predictions is best. The log link is commonly used.
NOTE: Poisson assumes the mean and variance are equal.

Continuous positive-valued data - regression models include the gamma, lognormal, and inverse Gaussian distributions. These would generally be used with a log link function to ensure positive predictions (with a log link, the linear predictor is exponentiated, and exponentiation always produces a positive value).

Positive and negative values - use the normal distribution with a log link function. The predicted mean for new observations will still always be positive.

Tweedie distribution - a distribution in between the Poisson and gamma, with variance power between 1 and 2. An important feature of this distribution is its discrete probability mass at zero (no claims) combined with continuous probability on positive claim values.

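A sketch of how these choices look in R (the data frame dat and its columns are hypothetical placeholders; the Tweedie family comes from an add-on package such as statmod):

    glm(has_claim ~ age + region, family = binomial(link = "logit"), data = dat)   # binary
    glm(claim_count ~ age + region, family = poisson(link = "log"), data = dat)    # counts
    glm(severity ~ age + region, family = Gamma(link = "log"), data = dat)         # positive continuous
    # library(statmod)
    # glm(loss ~ age + region,
    #     family = tweedie(var.power = 1.5, link.power = 0), data = dat)           # Tweedie, log link
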
8
Q

Define Overdispersion.

A

When the variance of the response is greater than the mean.

One simple fix to account for this is to use the quasi-Poisson family GLM instead of the Poisson. Note: the coefficient estimates are the same. The standard errors of the estimates are different, though, which affects any hypothesis tests to be done. Also, note that the p-values for the hypothesis tests regarding the coefficients are all larger.

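A minimal sketch of this fix in R (dat and the formula are hypothetical):

    fit_pois  <- glm(claim_count ~ age + region, family = poisson(link = "log"), data = dat)
    fit_quasi <- glm(claim_count ~ age + region, family = quasipoisson(link = "log"), data = dat)
    summary(fit_quasi)  # same coefficients as fit_pois, but larger standard errors
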
9
Q

Explain the hypothesis test for a regression coefficient.

A

The null hypothesis (H0) is that the coefficient of the corresponding predictor is equal to 0. The alternative hypothesis (H1) is that the coefficient is not equal to 0.

The test statistic for this test follows a t distribution and is provided for each predictor in the column labeled "t value." The corresponding p-value is in the column labeled "Pr(>|t|)." This can be interpreted as the probability of observing a test statistic at least as extreme as the one observed, given the null hypothesis is true. As with most hypothesis tests, when the p-value is low, the null hypothesis is rejected, and when it is high, we say that we fail to reject the null hypothesis.

10
Q

Define R^2 and adjusted R^2 values.

A

They are measures of goodness of fit. A higher value typically suggests a model that follows the data points better. One problem with R^2 is that adding a predictor always increases its value. This violates the idea that a simpler model that performs almost as well is better. Adjusted R^2 adds a penalty for additional parameters.

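For reference, the standard adjustment (not spelled out in the source) is:

    Adjusted R^2 = 1 − (1 − R^2)(n − 1) / (n − p − 1)

where n is the number of observations and p is the number of predictors, so adding a predictor only raises adjusted R^2 if it improves the fit enough to offset the penalty.
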
11
Q

Define Akaike Information Criterion (AIC).

A

It is helpful in comparing models.

A lower AIC suggests the model is a better fit for the data than a higher AIC.

12
Q

Describe deviance.

A

Deviance is a measure of goodness of fit of a generalized linear model. It is similar to the sum of squared errors in ordinary regression. The baseline value is called the null deviance and is the deviance measure when the target is predicted using only its sample mean (so it is similar to the total sum of squares).

Deviance Summary table on pdf page 31

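Both measures are stored on a fitted glm object in R (fit is a hypothetical fitted model):

    fit$null.deviance  # deviance when predicting with the sample mean alone
    fit$deviance       # residual deviance of the fitted model
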
13
Q

Describe the likelihood ratio test (LRT).

A

A likelihood ratio test compares the model with and without a given variable. For the sample model, the very small p-value indicates that this variable is highly significant. There is a simple command that conducts this test on all of the current variables, i.e. drop1(glm.freq, test = "LRT")

14
Q

Define Fisher Scoring.

A

Fisher’s Scoring algorithm is related to Newton’s method for solving maximum likelihood problems numerically.

15
Q

Define Akaike Information Criterion (AIC).

A

AIC, as well as other information criteria, provides a way to assess the quality of your model through comparison to related models. It is based on the deviance, but penalizes it for making the model more complicated. However, the value of the AIC on its own is not meaningful; it needs to be used in comparison with the AIC of another model, where you would select the model with the smallest AIC.

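For reference, the standard definition (not spelled out in the source) is AIC = −2 × (maximized log-likelihood) + 2p, where p is the number of estimated parameters; the 2p term is the penalty for model complexity.
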
Example of AIC summary on pdf page 31.

Drop variables that do not add predictive value, but it is advised not to drop more than one at a time. Two variables may appear to lack predictive power, but when one is dropped, the other's value may increase. Keep in mind that the test does not answer the question "is this variable valuable?" but rather "in the presence of the other variables, does this variable provide additional value?".

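A sketch of this workflow in R (glm.freq reuses the fitted model name from the notes; fit_reduced is a hypothetical alternative model):

    drop1(glm.freq)             # AIC impact of dropping each term, one at a time
    AIC(glm.freq, fit_reduced)  # direct AIC comparison of two candidate models
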
16
Q

Define residuals.

A

Residuals are the differences between the observed and predicted values.

17
Q

Define Prior Weights.

A

Prior weights give information about the credibility of each observation in the model.

Note: observations with higher exposure (weights) are deemed to have lower variance, and the model will consequently be more influenced by these observations.

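In R, prior weights are passed through glm's weights argument; a sketch (dat and its columns are hypothetical):

    # Average severity per policy, weighted by the number of claims behind each average
    glm(avg_severity ~ age + region, family = Gamma(link = "log"),
        weights = claim_count, data = dat)
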
18
Q

Define Offsets.

A

At times, a coefficient's value is known in advance, so it does not need to be estimated. Such a term is entered as an offset: a predictor whose coefficient is fixed (typically at 1), such as log(exposure) in a log-link model.

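A sketch in R (hypothetical data): with a log link, log(exposure) is entered as an offset, which fixes its coefficient at 1:

    glm(claim_count ~ age + region + offset(log(exposure)),
        family = poisson(link = "log"), data = dat)
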
19
Q

Describe an interaction.

A

An interaction occurs when the response depends on the relationship between a combination of features, rather than just the features in isolation.

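In an R formula, interactions are specified with * or : (variables are hypothetical):

    glm(y ~ age * region, data = dat)               # main effects plus interaction
    glm(y ~ age + region + age:region, data = dat)  # equivalent expansion
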
20
Q

Describe feature selection using regularization.

A

One of the goals of feature selection is to reduce the complexity of the model. A model becomes complex when the number of parameters is large, which loosely means that the model is excessively dependent on the features to which the parameters relate, leading to overfitting to the training data and poor generalization performance on unseen data. Feature removal is equivalent to setting some of those parameters to 0 (so the feature has no effect on the model). An intermediate position is to reduce the absolute value of the parameters, making them more nearly, but not exactly, 0. A natural thing to do is to modify the objective function to reflect this aim. We can do this by adding a regularization term to the objective function that restricts the magnitude of our parameters.

This concept of adding a penalty term to the objective we are trying to minimize in a given model fit is known as regularization (sometimes equivalently “penalization”).

21
Q

Describe ridge regression.

A

Ridge regression refers to regression using the sum of squared parameters as the regularization term.

Note: as λ increases, the coefficients shrink, with the larger coefficients shrinking at a much faster rate than the smaller coefficients.

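In symbols, the ridge objective (standard form, stated here for reference) is to minimize

    (sum of squared errors) + λ × Σ βj^2

so the regularization term is the sum of squared coefficients scaled by the hyperparameter λ.
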
22
Q

Describe lasso regression.

A

Lasso regression refers to regression using the absolute value of the parameters as the regularization term. The optimal solution (with penalty) can reduce a coefficient to exactly zero, which cannot happen with ridge.

With lasso, as lambda increases, more of the features will be eliminated. In the limit, as with ridge, all the coefficients will be 0.

The penalty with ridge is larger than that for lasso when the coefficients are greater than 1 (chart on pdf pg 39). Conversely, for parameter values less than 1, the lasso penalty is greater than the ridge penalty. This means that with lasso, parameter values less than 1 will decrease faster relative to ridge. When the parameters are large, ridge will provide more shrinkage.

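In symbols, the lasso objective (standard form, stated here for reference) is to minimize

    (sum of squared errors) + λ × Σ |βj|
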
23
Q

Describe elastic net.

A

Lasso tends to be better at feature selection, but ridge generally results in an overall better fit. Sometimes, one will be better than the other for fitting a model, depending on the data and your preferences/implementation constraints. But why should we have to settle for just one? Why don’t we combine the benefits of ridge and lasso into one objective function? We can, and that is called elastic net regression.

f(β) = (1 − α)β^2 + α|β|, where 0 ≤ α ≤ 1 (α = 0 recovers the ridge penalty; α = 1 the lasso penalty)

24
Q

Limitations/disadvantages of regularization techniques?

A
  1. Performing feature selection as part of the optimization of a model is an automatic method and may not always result in the most interpretable model
  2. It is a model-dependent technique, so the features we select as part of a linear regression using the lasso method have been optimized for linear regression, and may not necessarily translate to other model forms (although they can still be useful in other models). Ultimately, we can perform feature selection using regularization for different model forms, but due to the complexity of some objective functions, this can sometimes be suboptimal or intractable.
25
Q

Using AIC for feature selection.

A

AIC can be used to reduce the number of variables in a regression setting. One way of viewing this process is that employing AIC performs an adjustment to the training error that accounts for potential overfitting. Thus, variable selection using AIC is an indirect method for reducing variance.

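One common implementation is base R's step(), which performs stepwise selection by AIC (fit_full is a hypothetical fitted model):

    fit_reduced <- step(fit_full, direction = "backward")  # drop terms while AIC improves
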
26
Q

Define hyperparameters.

A

Alpha and lambda are constants that are specified (i.e. they aren't optimized like the parameters are). They are part of the model, and while we can change them, they aren't optimized as part of the core optimization algorithm. Changing these values changes the output of the model.

27
Q

Describe Cross Validation (CV).

A

CV involves repeating the validation step with different training and test samples. When the validation is performed many times, we are left with a population of test results, which represents the performance of the various model instances. This allows us to compare both the stability and generality of our models. Typically, modelers will plot the mean and standard error of their population of tests for each model instance, and select the one with the least complexity or that is the most useful in other ways as the “best” model.

28
Q

Define training data, validation data, and testing data.

A
  1. Training data - is what we use to develop the models. We run it through different algorithms (e.g. GLMs, decision trees) and adjust the hyperparameters (e.g. splitting criteria in a decision tree) within an algorithm
  2. Validation data - is used to measure the fitness of algorithms and select parameters. Using validation data will help you settle on which parameters and algorithms work best. Using this process, you will test many hypotheses on which features to use, which algorithms to develop, and which set of parameters achieves the highest accuracy. Validation data helps you to gain confidence in the decisions along the way, knowing that your model will work on data that was not used for training. Because this validation set has contributed to your model development, it can't be considered "unseen" data (you, the modeler, have "seen" it and used it to influence your decisions). Testing data, in this case, serves as the final validation.
  3. Testing data - it should only be used once, at the end of the model development process. The score that is computed using your final model, applied to the test data, is the FINAL measure of your model performance. If the score is low and you go back to adjust your model accordingly, you are just treating it as another validation set, which defeats the purpose of reserving unseen data. This could lead to unsatisfactory performance when the model is deployed. If you find that your model has performed poorly on the testing data, then it is likely that it was overfit to the training data, which needs to be rectified. In this case, you need to resample the training, validation, and test data and repeat the modeling exercise to avoid overfitting to the test data.

Note: in the case where you can't afford to use 20-40% of the data for validation and testing, cross validation is used to address the credibility problem. Note that cross validation replaces the need to split data for training and validation, but not for testing; we still need an unseen testing set for the final evaluation. But this can be a much smaller portion of your data.

29
Q

What are the steps in conducting k-fold cross validation? (e.g. k=5)

A
  1. set aside the testing data and forget about it until the end
  2. split the remaining data into k folds (5), each containing 20% of the remaining data
  3. during the model development process, use k-1 folds (4) for training and 1 for validation
  4. repeat step 3 k times (5), using a different fold for validation each time (a minimal sketch follows below)
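
A minimal base-R sketch of the loop (train_dat with target column y is hypothetical; the test set is assumed already held out in step 1):

    k <- 5
    folds <- sample(rep(1:k, length.out = nrow(train_dat)))    # random fold labels
    cv_error <- numeric(k)
    for (i in 1:k) {
      fit  <- glm(y ~ ., data = train_dat[folds != i, ])       # train on k-1 folds
      pred <- predict(fit, newdata = train_dat[folds == i, ])  # predict the held-out fold
      cv_error[i] <- mean((train_dat$y[folds == i] - pred)^2)  # fold mean squared error
    }
    mean(cv_error)  # the cross validation error estimate
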
30
Q

Explain how to choose alpha and lambda using CV.

A
  1. split the data into training and test sets (e.g. 80%/20% or otherwise)
  2. split the training data into k folds (for k fold cross validation)
  3. select a list of values you want to test, e.g. lambda = (0.01, 0.1, 0.5, 1) and alpha = (0, 0.25, 0.5, 0.75, 1)
  4. for each pair of values (or single value for lasso or ridge):
    • train the model on (k-1) folds of the training data using those values
    • calculate the prediction error on the remaining fold of training data
    • repeat k times for each fold in the training data (CV)
    • calculate the average error rate across all iterations in the cross validation (CV error)
  5. take the pair of values that gives the lowest CV error
  6. train a model on all of the training data using the optimal hyperparameters
  7. test the model on the test data (a sketch using glmnet follows below)
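
A sketch using the glmnet package, which automates the lambda search internally (x and x_test are hypothetical model matrices, y a hypothetical response; alpha must still be looped over by hand):

    library(glmnet)
    set.seed(42)
    cv_fit <- cv.glmnet(x, y, alpha = 0.5, nfolds = 5)  # elastic net, 5-fold CV over a lambda grid
    cv_fit$lambda.min                                   # lambda with the lowest CV error
    predict(cv_fit, newx = x_test, s = "lambda.min")    # score the held-out test set
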
31
Q

Distributions and link functions

A
  1. a) gamma distribution - takes on only positive values
    b) log link - ensures a positive prediction
  2. a) inverse gaussian distribution - takes on only positive values and has a fatter tail than gamma
    b) log link - ensures a positive prediction
  3. a) normal distribution - can take on negative values (e.g. the logarithm of a positive variable can be negative)
    b) identity link - acceptable to predict negative values