Predictive Modeling Flashcards
1
Q
Factors for choosing the right predictive model (151)
A
- Correlation structure - more complicated models may be needed for data containing correlated variables
- Purpose of the analysis
- The nature of the available data
- Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
- Distribution of the outcome variable (eg, normal vs. skewed)
- Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods (see the sketch after this list)
- Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)
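As one illustration of the non-linear case above, here is a minimal Python sketch (the model form y = a + b*exp(c*x) and the data are illustrative assumptions) of fitting an equation that cannot be transformed into a linear form, using scipy's iterative least-squares routine rather than ordinary regression:

```python
# Minimal sketch: fitting an intrinsically non-linear model with an
# iterative least-squares routine (scipy.optimize.curve_fit).
# The model form and data here are illustrative assumptions.
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    # y = a + b * exp(c * x) cannot be linearized
    # (the additive intercept a prevents a log transform)
    return a + b * np.exp(c * x)

rng = np.random.default_rng(0)
x = np.linspace(0, 2, 50)
y = model(x, 1.0, 2.0, 0.8) + rng.normal(0, 0.1, size=x.size)

# curve_fit iterates from an initial guess toward the least-squares estimates
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0, 1.0])
print("parameter estimates:", params)
```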
2
Q
Steps of the data warehousing process (152)
A
- Identify which patients to include in the dataset
- Identify which data elements to merge with the patient list
- Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
- Attach the derived variables and flags to the patient identifiers to create a picture of the patient history (illustrated in the sketch below)
3
Q
Characteristics for assessing the quality of a model (152)
A
- Parsimony - should introduce as few variables as are necessary to produce the desired results
- Identifiability - if there are more dependent variables than independent equations, then issues such as bias will result
- Goodness of fit - variation in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
- Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
- Predictive power - should predict well when applied to data that was not used in building the model (see the sketch after this list)
4
Q
Statistics for determining whether a model is good (153)
A
- R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of independent variables.
- Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
- F-Test - ratio of variance explained by the model divided by unexplained or error variance
- Statistics used for logistic models:
a. Hosmer-Lemeshow statistic
b. Somers’ D statistic
c. C-statistic
- Multicollinearity - occurs when a linear relationship exists between the independent variables. May be addressed by removing one of the collinear variables.
- Heteroscedasticity - occurs when the error terms do not have a constant variance
- Autocorrelation - occurs when the error terms in the regression function are correlated with one another (eg, across successive observations)
Several of these statistics are illustrated in the sketch below.
5
Q
Re-sampling methods for validating a model (154)
A
These approaches help test the model's predictive power; the first three are illustrated in the sketch after this list.
- Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
- Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
- Cross-validation - subsets of data are held out for use as validating sets
- Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
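A minimal sketch of the first three methods, applied to the sample mean and a simple regression (all data here is synthetic; the estimator is an illustrative choice):

```python
# Minimal sketch: bootstrap and jackknife estimates of an estimator's
# variability, plus k-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
sample = rng.exponential(scale=2.0, size=50)

# Bootstrap: resample with replacement, re-compute the estimator each time
boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(1000)]
print("bootstrap SE of mean:", np.std(boot_means))

# Jackknife: leave out one observation at a time and re-compute
n = sample.size
jack_means = np.array([np.delete(sample, i).mean() for i in range(n)])
jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))
print("jackknife SE of mean:", jack_se)

# Cross-validation: hold out subsets of the data as validating sets
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(0, 0.5, size=100)
print("5-fold CV R^2:", cross_val_score(LinearRegression(), X, y, cv=5))
```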