Predictive Modeling Flashcards

1
Q

Factors for choosing the right predictive model (151)

A
  1. Correlation structure - more complicated models may be needed for data containing correlated variables
  2. Purpose of the analysis
  3. The nature of the available data
  4. Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
  5. Distribution of the outcome variable (eg, normal vs. skewed)
  6. Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods
  7. Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)
2
Q

Steps of the data warehousing process (152)

A
  1. Identify which patients to include in the dataset
  2. Identify which data elements to merge with the patient list
  3. Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
  4. Attach the derived variables and flags to the patient identifiers to create a picture of the patient history
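The four steps above can be sketched in Python with pandas. This is a minimal illustration only; the table names, columns, and flag definitions are hypothetical, not from the source.

```python
import pandas as pd

# Step 1: identify which patients to include in the dataset
patients = pd.DataFrame({"patient_id": [1, 2, 3]})

# Step 2: identify which data elements (eg, claims) to merge with the patient list
claims = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],
    "diagnosis": ["diabetes", "hypertension", "asthma", "diabetes"],
    "cost": [500.0, 300.0, 200.0, 450.0],
})
merged = patients.merge(claims, on="patient_id", how="left")

# Step 3: derive flags describing the patient's health and risk status
flags = merged.groupby("patient_id").agg(
    has_diabetes=("diagnosis", lambda d: "diabetes" in set(d)),
    total_cost=("cost", "sum"),
).reset_index()

# Step 4: attach the derived variables and flags to the patient identifiers
history = patients.merge(flags, on="patient_id")
```

Each row of `history` is now one patient's summarized picture, ready for modeling.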
3
Q

Characteristics for assessing the quality of a model (152)

A
  1. Parsimony - should introduce as few variables as are necessary to produce the desired results
  2. Identifiability - if the model (eg, a simultaneous equation system) contains more unknown parameters than independent equations, the parameters cannot be uniquely estimated, and issues such as bias will result
  3. Goodness of fit - variation in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
  4. Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
  5. Predictive power - should predict well when applied to data that was not used in building the model
4
Q

Statistics for determining whether a model is good (153)

A
  1. R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of parameters.
  2. Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
  3. F-Test - ratio of variance explained by the model divided by unexplained or error variance
  4. Statistics used for logistic models:
    a. Hosmer-Lemeshow statistic
    b. Somers’ D statistic
    c. C-statistic
  5. Multicollinearity - occurs when a linear relationship exists among the independent variables. May be addressed by removing one of the collinear variables.
  6. Heteroscedasticity - occurs when the error terms do not have a constant variance
  7. Autocorrelation - occurs when the error terms in the regression are correlated with one another (eg, across successive observations in time-series data)
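The Adjusted R^2 formula from item 1 can be checked numerically. Below is a minimal sketch fitting ordinary least squares to synthetic data with NumPy; the data-generating coefficients are illustrative, not from the source.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 2                      # N = # of observations, k = # of parameters (predictors)

# Synthetic data: y = 3.0 + 1.5*x1 - 2.0*x2 + noise (illustrative values)
X = rng.normal(size=(N, k))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=N)

# Fit OLS with an intercept via least squares
A = np.column_stack([np.ones(N), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

# R^2 and Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)
r2 = 1 - resid.var() / y.var()
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - k - 1)
```

Adjusted R^2 is always at most R^2, penalizing the model for each added parameter; checking that the signs of `beta` match the data-generating coefficients mirrors the regression-coefficient check in item 2.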
5
Q

Re-sampling methods for validating a model (154)

A

These approaches help test the model’s predictive power

  1. Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
  2. Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
  3. Cross-validation - subsets of data are held out for use as validating sets
  4. Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
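The bootstrap and jackknife above can be sketched in a few lines of NumPy. This minimal example estimates the standard error of a sample mean; the sample itself is synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=100)
n = sample.size

# Bootstrap: sample with replacement from the original sample, re-compute the statistic
boot_means = np.array([
    rng.choice(sample, size=n, replace=True).mean()
    for _ in range(2000)
])
se_boot = boot_means.std()        # spread approximates the standard error of the mean

# Jackknife: re-compute the statistic leaving out one observation at a time
jack_means = np.array([np.delete(sample, i).mean() for i in range(n)])
se_jack = np.sqrt((n - 1) / n * ((jack_means - jack_means.mean()) ** 2).sum())
```

Both estimates should land near the theoretical standard error of the mean, `sample.std() / sqrt(n)`, which is what makes these re-sampling checks useful when no closed-form standard error exists.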