Mahler Flashcards
Describe 3 advantages of Baseball data over insurance data
- Constant set of risks (teams)
In insurance, insureds leave and enter the database
- Baseball loss data is readily available, accurate and not subject to development
Insurance loss data is sometimes hard to compile/obtain and is subject to possible reporting errors and loss development
- Each team plays roughly the same number of games (equal size loss experience)
Describe 2 methods to identify if risk parameters are shifting over time
- Chi-Square Test
H0: each year's results were drawn from a distribution with the same mean frequency
X^2 = sum of (actual - expected)^2 / expected
If test statistic > table value with df = n-1, we reject H0 and conclude that risk parameters are shifting over time
- Compare correlations between years
a. Compute correlations between results for pairs of years of all risks
b. Take straight average of correlations for each separation in time
c. Examine how avg correlation depends on the separation in time
If the correlation between years closer together is higher than between years further apart, we can conclude risk parameters are shifting over time
Using the 2 tests, Mahler concluded risk parameters are shifting.
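The correlation test (steps a to c) can be sketched in a few lines. This is only an illustration: the 4-year by 4-team table `toy` is made-up data, and `pearson` and `avg_correlation_by_lag` are hypothetical helper names, not anything taken from the paper.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def avg_correlation_by_lag(results):
    """results[t][k] = losing percentage of risk k in year t.

    Step a: correlate every pair of years across all risks.
    Step b: take the straight average for each separation (lag).
    """
    n_years = len(results)
    by_lag = {}
    for lag in range(1, n_years):
        corrs = [pearson(results[t], results[t + lag])
                 for t in range(n_years - lag)]
        by_lag[lag] = sum(corrs) / len(corrs)
    return by_lag

# Made-up data: 4 years (rows) x 4 teams (columns)
toy = [
    [0.40, 0.45, 0.55, 0.60],
    [0.42, 0.44, 0.56, 0.58],
    [0.47, 0.43, 0.52, 0.58],
    [0.55, 0.45, 0.48, 0.52],
]
by_lag = avg_correlation_by_lag(toy)

# Step c: inspect how the average correlation changes with the lag
for lag, corr in sorted(by_lag.items()):
    print(lag, round(corr, 3))
```

In this toy table the average correlation falls as the separation grows, which is the pattern that would point to shifting risk parameters.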
What is the first question Mahler wanted to answer?
Is there a difference between teams?
Elementary analysis shows there is a non-random difference between teams (different avg experience, only a small number have losing % between 49% and 51%)
A team that has been worse than average over one period of time is more likely to be worse than average over another period of time
Conclusion: if we wish to predict future experience of a team, there is useful information in past experience of that team
If we conclude risk parameters are shifting over time, what does it imply?
We want to use credibility-weighting formulas that place more credibility on recent years and less on older years.
Describe the 3 ways to estimate losing percentage for next year
u = grand mean = 50%
Yi = actual value of X in year i (Y1 = most recent year)
- Xest = u
Every risk is average, ignores past data (Z=0%)
- Xest = Y1
Assume most recent year repeats (Z=100%)
- Xest = ZY1 + (1-Z)u
Credibility-weighting of last year and u
- Xest = (Z/n) * (sum of Yi) + (1-Z)u
Give equal weight Z/n to n most recent years of data
- Xest,i+1 = ZYi + (1-Z)Xest,i
Exponential smoothing: give latest year of data weight Z and (1-Z) to prior estimate
- Xest = Sum of ZiYi + (1 - Sum of Zi)u
Most general formula
Would be calculated by computer
Increasing n will never produce inferior estimate since you can always give oldest years of data 0 weight
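The estimation formulas can be written out as small functions to see how they nest. This is a minimal sketch: the function names are hypothetical, and starting the exponential smoothing recursion at the grand mean is an assumption, since the cards do not say how the first estimate is seeded.

```python
GRAND_MEAN = 0.50  # u on the cards

def equal_weight_estimate(y_recent_first, z, n=1, u=GRAND_MEAN):
    """Xest = (Z/n) * sum(Yi) + (1-Z)*u over the n most recent years.

    z=0 gives Xest = u, z=1 with n=1 gives Xest = Y1,
    and n=1 with 0 < z < 1 is the credibility weighting of last year and u.
    """
    recent = y_recent_first[:n]
    return z / n * sum(recent) + (1 - z) * u

def exponential_smoothing(y_oldest_first, z, start=GRAND_MEAN):
    """Xest,i+1 = Z*Yi + (1-Z)*Xest,i, seeded at `start` (an assumption)."""
    est = start
    for y in y_oldest_first:
        est = z * y + (1 - z) * est
    return est

def general_estimate(y, weights, u=GRAND_MEAN):
    """Xest = sum(Zi*Yi) + (1 - sum(Zi))*u, one credibility per year."""
    return sum(w * yi for w, yi in zip(weights, y)) + (1 - sum(weights)) * u

# Made-up losing percentages, most recent year first
y = [0.60, 0.50, 0.45]
print(equal_weight_estimate(y, z=0.0))        # every risk is average
print(equal_weight_estimate(y, z=1.0))        # most recent year repeats
print(equal_weight_estimate(y, z=0.6, n=2))   # equal weight on 2 years
print(general_estimate(y, [0.5, 0.0, 0.0]))   # oldest years given 0 weight
```

The last call shows why adding years cannot hurt the general formula: setting the oldest weights to 0 reproduces the shorter estimate.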
Explain how to determine which Z to use in the methods
Using either Bühlmann/Bayesian or classical/limited fluctuation credibility methods, one determines which Z is expected to optimize the selected criterion in the future
One can also empirically investigate which credibility would have optimized the selected criterion if it had been used in the past (retrospective tests)
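A retrospective test can be pictured as a grid search over candidate Z values. The sketch below assumes the one-year credibility formula Xest = ZY1 + (1-Z)u and squared error as the selected criterion; `retrospective_best_z` and the two short series are hypothetical.

```python
def retrospective_best_z(history, candidates, u=0.50):
    """Return the candidate Z with the smallest past squared error.

    history[k] = yearly values for risk k, oldest first.
    Each year is predicted from the year before it.
    """
    def sse(z):
        total = 0.0
        for series in history:
            for prev, actual in zip(series, series[1:]):
                est = z * prev + (1 - z) * u
                total += (est - actual) ** 2
        return total

    return min(candidates, key=sse)

# Made-up data: each team moves halfway back toward 50% the next year
history = [
    [0.60, 0.55],
    [0.40, 0.45],
]
candidates = [0.0, 0.25, 0.5, 0.75, 1.0]
best_z = retrospective_best_z(history, candidates)
print(best_z)
```

Swapping the `sse` helper for another criterion changes what "optimal" means without changing the search itself.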
Describe 3 criteria used to evaluate quality of estimate
- Least Squared Error
SSE = sum of (Xest - Xactual)^2
MSE = SSE / n
n is number of teams * number of years
The smaller the MSE, the better the solution
Method 2 is preferred under this criterion
Used by B&S
- Small Chance of Large Errors (Limited Fluctuations)
Measures the probability that the observed result differs by more than a certain % from the predicted result
The smaller this probability, the better the solution
Method 2 is preferred under this criterion
- Meyers/Dorweiler
Calculate correlation between predictions and prediction errors
The smaller the corr, the better the solution
Vector 1 = Actual%/Pred%
Vector 2 = Pred%/50%
Method 2 is preferred under this criterion
Not interested in size of errors, only in correlation
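The three criteria can be computed side by side on the same predictions. This is a sketch under assumptions: the cards do not say which correlation statistic is used for Meyers/Dorweiler, so Pearson correlation is swapped in here, the 20% error threshold `k` is arbitrary, and the predictions and actuals are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def mean_squared_error(pred, actual):
    """Least squared error criterion: SSE / n."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)

def prob_large_error(pred, actual, k=0.20):
    """Share of cases where actual is more than k (as a fraction) off the prediction."""
    misses = sum(1 for p, a in zip(pred, actual) if abs(a - p) > k * p)
    return misses / len(pred)

def meyers_dorweiler_corr(pred, actual, grand_mean=0.50):
    """Correlation between Actual%/Pred% and Pred%/50% (Pearson is an assumption)."""
    v1 = [a / p for p, a in zip(pred, actual)]
    v2 = [p / grand_mean for p in pred]
    return pearson(v1, v2)

# Made-up predictions and outcomes for three team-years
pred = [0.40, 0.50, 0.60]
actual = [0.44, 0.50, 0.45]
mse = mean_squared_error(pred, actual)
p_large = prob_large_error(pred, actual)
md = meyers_dorweiler_corr(pred, actual)
print(mse, p_large, md)
```

A negative Meyers/Dorweiler correlation, as in this toy case, would mean high predictions tend to overshoot and low ones undershoot, regardless of how large the errors are.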
Explain why use of more years of data does not result in higher Z
Since parameters are shifting substantially over time, use of older data (with equal weight) leads to a worse estimate
Explain why M/D cannot help choose optimal number of years
For each choice of number of historical years used, there can be a choice of credibility that results in 0 correlation
Describe the impact of delay in data on Z and prediction accuracy
Not having the most recent year of historical data significantly increases squared error of estimate
Optimal credibility typically decreases when there is a delay in getting data
Less current info is less valuable for estimating future
Explain the results of tests on Baseball data
Optimal credibility range from 50% to 70% will perform relatively well under all 3 criteria.
If Z is close to, but not exactly at, the optimal level, there is only a relatively small impact on the result.
In which case would the 3 tests not agree on optimal Z?
LSE & Limited Fluctuation are focused on limiting large errors
M/D is focused on the pattern (correlation) between errors and the mod
A situation where errors are small but correlated with the ratio of prediction to overall average would be preferable under the first 2 criteria but not under M/D
Contrast hierarchical clustering versus non-hierarchical
Non-hierarchical clustering means new HG represents the best partition for the given number of clusters
Hierarchical clustering requires that each new group be a subset of an existing group
How do you determine the df for the Chi-Square table search?
df = n-1
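As a worked illustration of the test statistic and the df rule, the sketch below reads n as the number of years being compared, which is an assumption, and uses made-up loss counts for one team. The 5% table value for df = 4 is 9.488.

```python
def chi_square_statistic(actual, expected):
    """X^2 = sum of (actual - expected)^2 / expected."""
    return sum((a - e) ** 2 / e for a, e in zip(actual, expected))

# Made-up losses for one team over 5 years
losses = [70, 80, 75, 90, 60]

# Under H0 every year has the same mean frequency
expected = [sum(losses) / len(losses)] * len(losses)

stat = chi_square_statistic(losses, expected)
df = len(losses) - 1                 # df = n-1
CRITICAL_5PCT_DF4 = 9.488            # chi-square table value, 5% level, df = 4
reject_h0 = stat > CRITICAL_5PCT_DF4

print(round(stat, 3), df, reject_h0)
```

Here the statistic is below the table value, so H0 is not rejected for this toy team; a statistic above it would point to shifting risk parameters.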