GLM1 Flashcards

1
Q

2 main components of a Generalized Linear Model.

A

random component

systematic component

2
Q

random component

A

Each yi is assumed to be independent and to come from the exponential family of distributions, with mean µi and variance Var(yi) = φV(µi)/ωi.

φ is called the dispersion parameter and is a constant used to scale the variance.

V(µ) is called the variance function; it describes the relationship between the variance and the mean for the selected distribution type.

The ωi are weights assigned to each observation i.
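
As an illustration beyond the card itself, here is a minimal Python sketch (assuming statsmodels is available) that evaluates the variance function V(µ) for several common exponential-family distributions; the family objects and the V(µ) forms shown are statsmodels', not part of the card.

```python
# Sketch: variance functions V(mu) for common exponential-family distributions,
# checked against statsmodels' family objects. Assumes statsmodels is installed.
import statsmodels.api as sm

mu = 10.0
families = {
    "Normal (V(mu) = 1)": sm.families.Gaussian(),
    "Poisson (V(mu) = mu)": sm.families.Poisson(),
    "Gamma (V(mu) = mu^2)": sm.families.Gamma(),
    "Inverse Gaussian (V(mu) = mu^3)": sm.families.InverseGaussian(),
}
for label, fam in families.items():
    # family.variance is the variance function V(.), evaluated here at mu
    print(label, "->", fam.variance(mu))
```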

3
Q

systematic component

A

The systematic component is of the form g(µi) = β0 + β1xi1 + β2xi2 + ··· + βpxip + offset.

The right-hand side is known as the linear predictor.

The offset term is optional and allows you to manually specify the estimates for certain variables rather than having the model fit them.

The x's are the predictor variables.

g(µ) is called the link function.

β0 is called the intercept term, and the other β’s are called the coefficients of the model. These are what we want to estimate. Once we know the β’s, we can plug in known values for the xi variables and calculate the predicted values of the yi variables (i.e., µi).
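
A minimal sketch of how a linear predictor with an offset might be specified in a GLM fit, assuming statsmodels and pandas; the data frame, column names (claim_count, age, exposure), and the Poisson family are made up for illustration.

```python
# Sketch: specifying the linear predictor and an offset in a GLM fit.
# Column names and data are hypothetical; assumes statsmodels and pandas.
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "claim_count": [0, 1, 0, 2, 1],
    "age": [25, 40, 35, 55, 30],
    "exposure": [0.5, 1.0, 1.0, 0.8, 1.0],
})

X = sm.add_constant(df[["age"]])    # intercept beta_0 plus the predictor columns
offset = np.log(df["exposure"])     # offset enters the linear predictor with a fixed coefficient of 1

model = sm.GLM(df["claim_count"], X, family=sm.families.Poisson(), offset=offset)
result = model.fit()
print(result.params)                # the fitted beta's
```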

4
Q

link function

A

allows for transformations of the linear predictor

For rating plans, the log link function g(µ) = ln(µ) is typically used since it transforms the linear predictor into a multiplicative structure.
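
A tiny numeric check (my own illustration, with made-up coefficients) of why exponentiating the linear predictor turns the sum of terms into a product of rating factors:

```python
# Sketch: with a log link, ln(mu) = b0 + b1*x1 + b2*x2 implies
# mu = exp(b0) * exp(b1*x1) * exp(b2*x2), i.e. a product of rating factors.
# Coefficients and predictor values here are made up.
import math

b0, b1, b2 = 5.0, 0.3, -0.1   # hypothetical intercept and coefficients
x1, x2 = 1.0, 2.0             # hypothetical predictor values

mu_from_sum = math.exp(b0 + b1 * x1 + b2 * x2)
mu_from_product = math.exp(b0) * math.exp(b1 * x1) * math.exp(b2 * x2)

print(mu_from_sum, mu_from_product)  # identical: base rate times multiplicative relativities
```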

5
Q

3 advantages of using a log link function when building a pure premium model for use in creating a rating plan

A

A log link allows for a multiplicative rating plan, which has the following advantages:

  • Simple and practical to implement.
  • It guarantees positive premiums.
  • Impact of risk characteristics is more intuitive.
6
Q

choice of the base level for the Gender variable

A

Since Female has more observations than Male, Female should be the base level for Gender.

Choosing a level with fewer observations as the base level will still result in the same predicted relativities for that variable (re-based to the chosen base level), but there will be wider confidence intervals around the estimated coefficients.

Both the standard errors and p-values will increase, since a level with fewer observations is being made the base level and the model has less confidence in the base level estimate (in addition to the uncertainty in the difference between the base level and each of the other levels).

7
Q

g(µ) = β0 + β1 ln(InsuredAge) + β2 Male + β3 TerrA + β4 TerrC + β5 Male∗TerrA + β6 Male∗TerrC

explain the meaning of each of the β parameters in the model.

A

β0 is the intercept (base level: female in Territory B with InsuredAge = 1, since ln(1) = 0)

β1 is the change in g(µ) for a unit change in the natural log of insured age

β2 is the change in g(µ) for being male instead of female

β3 is the change in g(µ) for being in territory A instead of B

β4 is the change in g(µ) for being in territory C instead of B

β5 is the additional interaction effect on g(µ) for being male and in territory A

β6 is the additional interaction effect on g(µ) for being male and in territory C
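
A sketch of how a model with this structure could be fit via statsmodels' formula interface; the data, column names (loss, InsuredAge, Gender, Territory), and the choice of a Gamma family with a log link are assumptions for illustration, not part of the card.

```python
# Sketch: fitting g(mu) = b0 + b1*ln(InsuredAge) + Gender + Territory + Gender:Territory
# with a log link. Data, column names, and the Gamma family are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: two observations per Gender x Territory combination.
df = pd.DataFrame({
    "loss":       [500, 620, 450, 480, 700, 660, 900, 950, 600, 640, 1100, 1000],
    "InsuredAge": [25, 30, 40, 35, 55, 45, 28, 33, 50, 42, 60, 38],
    "Gender":     ["F"] * 6 + ["M"] * 6,
    "Territory":  ["A", "A", "B", "B", "C", "C"] * 2,
})

# Treatment coding with 'F' and 'B' as the reference levels mirrors the card's
# base level (female in Territory B); '*' expands to main effects plus interactions.
formula = ("loss ~ np.log(InsuredAge)"
           " + C(Gender, Treatment('F')) * C(Territory, Treatment('B'))")

model = smf.glm(formula, data=df,
                family=sm.families.Gamma(link=sm.families.links.Log()))
print(model.fit().params)
```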

8
Q

2 common choices for modeling claim severity

and why

A

Gamma and Inverse Gaussian distributions

Claim severity distributions tend to be right-skewed and have a lower bound at 0. Both Gamma and Inverse Gaussian distributions exhibit these properties.

Inverse Gaussian has a sharper peak and wider tail than Gamma (for the same mean and variance), so the Inverse Gaussian is more appropriate for severity distributions that are more skewed.
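
A sketch (my own, using scipy's parameterizations) that matches a Gamma and an Inverse Gaussian to the same mean and variance and compares their skewness; the mean and variance values are made up.

```python
# Sketch: Gamma vs Inverse Gaussian matched to the same mean and variance.
# Parameter conversions follow scipy.stats conventions; the numbers are made up.
from scipy import stats

m, v = 1000.0, 500_000.0          # hypothetical severity mean and variance

# Gamma: mean = k*theta, var = k*theta^2  ->  k = m^2/v, theta = v/m
gamma_dist = stats.gamma(a=m**2 / v, scale=v / m)

# Inverse Gaussian (scipy): mean = mu*scale, var = mu^3*scale^2
# -> mu = v/m^2, scale = m^3/v
ig_dist = stats.invgauss(mu=v / m**2, scale=m**3 / v)

print("means:    ", gamma_dist.mean(), ig_dist.mean())
print("variances:", gamma_dist.var(), ig_dist.var())
print("skewness: ", gamma_dist.stats(moments="s"), ig_dist.stats(moments="s"))
# The Inverse Gaussian shows the larger skewness, consistent with a sharper
# peak and a heavier right tail for the same mean and variance.
```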

9
Q

When creating a GLM with a log link function, it is generally recommended that continuous predictor variables

& 2 exceptions

A

be logged

Logging continuous variables allows for more flexibility in fitting different curve shapes to the data, since with a log link an unlogged variable implies that the only possible relationship is exponential growth (see the sketch below).

2 exceptions: using a time variable (e.g., year) to pick up time effects, and when the variable contains values of 0, since ln(0) is undefined.
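
A small numeric sketch (with a made-up coefficient) of the difference under a log link: an unlogged predictor implies exponential growth in µ, while a logged predictor implies a power curve µ = x^β.

```python
# Sketch: under a log link, ln(mu) = b*x gives mu = exp(b*x) (exponential growth),
# while ln(mu) = b*ln(x) gives mu = x**b (a power curve).
# The coefficient value is made up for illustration.
import numpy as np

b = 0.5
x = np.array([1.0, 2.0, 4.0, 8.0])

mu_unlogged = np.exp(b * x)          # exponential in x
mu_logged = np.exp(b * np.log(x))    # equals x**b, a power curve

print(mu_unlogged)
print(mu_logged, x**b)               # the last two outputs match
```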

10
Q

main benefit of GLMs over univariate analysis

A

GLMs are able to handle exposure correlation between predictor variables, which univariate (one-way) analyses cannot.

However, GLMs still run into problems when predictor variables are very highly correlated.

11
Q

what can happen in GLMs with highly correlated predictor variables

A

Very high correlation between predictor variables can result in an unstable model with erratic coefficients that have high standard errors.

12
Q

two options for dealing with highly correlated predictor variables

A

i. Remove all highly correlated variables except one. This eliminates the high correlation in the model, but it also potentially loses some unique information contained in the eliminated variables.
ii. Use dimensionality-reduction techniques such as principal components analysis or factor analysis to create a new subset of variables from the correlated variables, and use this subset of variables in the GLM (see the sketch below). The downside is the additional time required to do this extra analysis.
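
A sketch of option ii using scikit-learn's PCA; the simulated data, column names, and the choice to keep a single component are assumptions for illustration.

```python
# Sketch: replacing a group of highly correlated predictors with a principal
# component before fitting the GLM. Data and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "var1": base + rng.normal(scale=0.05, size=200),
    "var2": 2 * base + rng.normal(scale=0.05, size=200),
    "var3": -base + rng.normal(scale=0.05, size=200),
})

pca = PCA(n_components=1)             # keep one component from the correlated group
df["pc1"] = pca.fit_transform(df[["var1", "var2", "var3"]])[:, 0]
print(pca.explained_variance_ratio_)  # nearly all of the variance is captured by pc1

# df["pc1"] would then replace var1/var2/var3 as a predictor in the GLM.
```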

13
Q

multicollinearity

A

Multicollinearity occurs when there is a near-perfect linear dependency among 3 or more predictor variables.

For example, suppose x1 + x2 ≈ x3.

When multicollinearity is present in a model, the model may become unstable with erratic coefficients, and it may not converge to a solution.

14
Q

one way to detect multicollinearity in a model.

A

Use the variance inflation factor (VIF) statistic, which is given for each predictor variable.

The VIF measures the impact of collinearity with the other predictor variables on the squared standard error for that variable's coefficient; it is determined by seeing how well the other predictor variables can predict the variable in question (see the sketch below).
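
A sketch of computing VIFs with statsmodels; the design matrix here is simulated for illustration.

```python
# Sketch: variance inflation factors for each predictor in a design matrix.
# Data are hypothetical; a VIF far above ~10 is a common rule-of-thumb red flag.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=200),   # highly correlated with x1
    "x3": rng.normal(size=200),                   # independent of x1 and x2
})
X = sm.add_constant(X)

for i, name in enumerate(X.columns):
    if name == "const":
        continue
    print(name, variance_inflation_factor(X.values, i))
```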

15
Q

aliasing

A

When there is a perfect linear dependency among predictor variables, those variables are aliased.

16
Q

how GLM software can be used to correct for aliasing in a GLM.

A

Most GLM software will detect aliasing and automatically remove one of the problematic variables from the model.

17
Q

2 limitations of GLMs

A

i. GLMs give full credibility: the estimated coefficients are not credibility-weighted to recognize low volumes of data or high volatility. This concern can be partially addressed by looking at p-values or standard errors.
ii. GLMs assume that the randomness of outcomes is uncorrelated across observations.

18
Q

Two examples of violations of the GLM assumption that the randomness of outcomes is uncorrelated

A
  • Using a dataset with several renewals of the same policy, since the same insured over different renewals is likely to have correlated outcomes.
  • When the data can be affected by weather, since the same weather events are likely to cause similar outcomes for risks in the same areas.
19
Q

considerations in merging policy and claim data for use in a GLM.

A
  • Matching claims to specific vehicles/drivers (for auto) or to specific coverages.
  • Checking for timing differences between datasets, such as when each dataset is updated. Timing differences can cause record-matching problems.
  • Is there a unique key to merge the data (e.g., policy number)? There is the potential for orphaned claims if there is no matching policy record, or for duplicated claims if there are multiple matching policy records (see the sketch after this list).
  • At what level should the data be aggregated before merging? This needs to be considered along the time dimension (e.g., calendar year) and also at the policy level versus claimant/coverage level. For commercial lines, should it be at the location level or the policy level?
  • Are there fields in the data not needed for the analysis that can be discarded? Are there desired fields that are not present that we want to try to obtain from a different data source?
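
A sketch (with made-up table and column names) of merging claims to policies on a unique key and flagging orphaned claims with pandas:

```python
# Sketch: merging claims to policies on a unique key and flagging orphaned claims.
# Table and column names are hypothetical.
import pandas as pd

policies = pd.DataFrame({"policy_number": ["P1", "P2", "P3"],
                         "territory": ["A", "B", "C"]})
claims = pd.DataFrame({"policy_number": ["P1", "P1", "P4"],
                       "loss": [500, 1200, 300]})

merged = claims.merge(policies, on="policy_number", how="left", indicator=True)

# Claims with no matching policy record ("orphaned" claims) show up as left_only.
orphans = merged[merged["_merge"] == "left_only"]
print(orphans)
```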