Predictive Modeling Flashcards
1
Q
Factors for choosing the right predictive model (151)
A
- Correlation structure - more complicated models may be needed for data containing correlated variables
- Purpose of the analysis
- The nature of the available data
- Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
- Distribution of the outcome variable (eg, normal vs. skewed)
- Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods (see the sketch after this list)
- Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)
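As one illustration of the non-linear case above, here is a minimal Python sketch (the model form y = a + b*exp(c*x) and the data are illustrative assumptions) of fitting an equation that cannot be transformed into a linear form, using scipy's iterative least-squares routine rather than ordinary regression:

```python
# Minimal sketch: fitting an intrinsically non-linear model with an
# iterative least-squares routine (scipy.optimize.curve_fit).
# The model form and data here are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    # y = a + b * exp(c * x) cannot be linearized
    # (the additive intercept a prevents a log transform)
    return a + b * np.exp(c * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 50)
y = model(x, 1.0, 2.0, 0.8) + rng.normal(0, 0.1, size=x.size)

# curve_fit iterates from an initial guess toward the least-squares estimates
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
print("parameter estimates:", params)
```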
2
Q
Steps of the data warehousing process (152)
A
- Identify which patients to include in the dataset
- Identify which data elements to merge with the patient list
- Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
- Attach the derived variables and flags to the patient identifiers to create a picture of the patient history (illustrated in the sketch below)
3
Q
Characteristics for assessing the quality of a model (152)
A
- Parsimony - should introduce as few variables as are necessary to produce the desired results
- Identifiability - if there are more dependent variables than independent equations, then issues such as bias will result
- Goodness of fit - variation in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
- Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
- Predictive power - should predict well when applied to data that was not used in building the model (see the sketch after this list)
4
Q
Statistics for determining whether a model is good (153)
A
- R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of independent variables.
- Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
- F-Test - ratio of variance explained by the model divided by unexplained or error variance
- Statistics used for logistic models:
a. Hosmer-Lemeshow statistic
b. Somers’ D statistic
c. C-statistic
- Multicollinearity - occurs when a linear relationship exists between the independent variables. May be addressed by removing one of the collinear variables.
- Heteroscedasticity - occurs when the error terms do not have a constant variance
- Autocorrelation - occurs when the error terms in the regression function are correlated with one another (eg, across successive observations)
Several of these statistics are illustrated in the sketch below.
5
Q
Re-sampling methods for validating a model (154)
A
These approaches help test the model's predictive power; the first three are illustrated in the sketch after this list.
- Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
- Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
- Cross-validation - subsets of data are held out for use as validating sets
- Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
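A minimal sketch of the first three methods, applied to the sample mean and a simple regression (all data here is synthetic; the estimator is an illustrative choice):

```python
# Minimal sketch: bootstrap and jackknife estimates of an estimator's
# variability, plus k-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=50)

# Bootstrap: resample with replacement, re-compute the estimator each time
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]
print("bootstrap SE of mean:", np.std(boot_means))

# Jackknife: leave out one observation at a time and re-compute
n = sample.size
jack_means = np.array([np.delete(sample, i).mean() for i in range(n)])
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))
print("jackknife SE of mean:", jack_se)

# Cross-validation: hold out subsets of the data as validating sets
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(0, 0.5, size=100)
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5))
```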