Introduction to Modeling Flashcards
Steps of the data warehousing process (4)
- Identify which patients to include in the dataset
- Identify which data elements to merge with the patient list
- Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
- Attach the derived variables and flags to the patient identifiers to create a picture of the patient history
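A minimal sketch of these four steps in Python, assuming pandas and made-up patient and claims tables; the column names and the high-cost flag rule are illustrative assumptions, not from the source:

    # Step 1: identify which patients to include
    import pandas as pd

    patients = pd.DataFrame({"patient_id": [1, 2, 3]})

    # Step 2: identify data elements to merge with the patient list
    claims = pd.DataFrame({"patient_id": [1, 1, 2, 3],
                           "allowed_amount": [500.0, 1200.0, 300.0, 4500.0]})

    # Step 3: derive flags describing health/risk status (illustrative rule)
    totals = claims.groupby("patient_id", as_index=False)["allowed_amount"].sum()
    totals["high_cost_flag"] = totals["allowed_amount"] > 1000

    # Step 4: attach the derived variables to the patient identifiers
    history = patients.merge(totals, on="patient_id", how="left")
    print(history)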
Re-sampling methods for validating a model (4)
These approaches help test the model’s predictive power
- Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
- Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
- Cross-validation - subsets of data are held out for use as validating sets
- Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
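A minimal sketch of the four re-sampling ideas, each applied to a simple statistic (the sample mean); the simulated data, fold count, and replication counts are illustrative assumptions:

    # Bootstrap, jackknife, cross-validation, and a permutation test on simulated data
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10.0, scale=2.0, size=50)

    # Bootstrap: sample with replacement from the original sample
    boot_means = [rng.choice(x, size=x.size, replace=True).mean() for _ in range(1000)]

    # Jackknife: recompute the statistic, leaving out one observation at a time
    jack_means = [np.delete(x, i).mean() for i in range(x.size)]

    # Cross-validation: hold out subsets of the data as validating sets (5 folds here)
    folds = np.array_split(np.arange(x.size), 5)
    cv_means = [x[np.setdiff1d(np.arange(x.size), fold)].mean() for fold in folds]

    # Permutation test: rearrange labels on two groups to build a reference distribution
    y = rng.normal(loc=10.5, scale=2.0, size=50)
    pooled = np.concatenate([x, y])
    perm_diffs = []
    for _ in range(1000):
        shuffled = rng.permutation(pooled)
        perm_diffs.append(shuffled[:50].mean() - shuffled[50:].mean())
    p_value = np.mean(np.abs(perm_diffs) >= abs(x.mean() - y.mean()))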
Characteristics for assessing the quality of a model (5)
- Parsimony - should introduce as few variables as are necessary to produce the desired results
- Identifiability - if there are more dependent variables than independent equations, then issues such as bias will result
- Goodness of fit - variations in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
- Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
- Predictive power - should predict well when applied to data that was not used in building the model
Adjusted R^2 Formula
This measure makes an adjustment for the number of independent variables in the model. It indicates whether an additional independent variable improves the model more than would be expected by chance.
Adjusted R^2 = 1 - (1 - R^2) x (N - 1) / (N - k - 1)
N = sample size; k = number of independent variables (excluding the constant term)
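A worked example of the formula, using assumed values for R^2, N, and k:

    # Worked example with assumed values: R^2 = 0.75, N = 100, k = 5
    R2, N, k = 0.75, 100, 5
    adjusted_R2 = 1 - (1 - R2) * (N - 1) / (N - k - 1)
    print(round(adjusted_R2, 4))   # 0.7367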
Statistics for determining whether a model is good (7)
- R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of independent variables.
- Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
- F-Test - ratio of the variance explained by the model to the unexplained (error) variance
- Statistics used for logistic models:
a. Hosmer-Lemeshow statistic
b. Somers’ D statistic
c. C-statistic
- Multicollinearity - occurs when a linear relationship exists between the independent variables. May be addressed by removing one of the collinear variables.
- Heteroscedasticity - occurs when the error terms do not have a constant variance
- Autocorrelation - occurs when there is correlation among the error terms in the regression function
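A minimal sketch, assuming simulated data and the statsmodels package, of pulling several of these diagnostics (R^2, adjusted R^2, the F-test, and coefficient signs and t-statistics) from an ordinary least squares fit:

    # Simulated data, then R^2, adjusted R^2, the F-test, and coefficient t-statistics
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.rsquared, fit.rsquared_adj)   # R^2 and adjusted R^2
    print(fit.fvalue, fit.f_pvalue)         # overall F-test and its p-value
    print(fit.params, fit.tvalues)          # coefficient signs and t-statistics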
Methods to detect multicollinearity (3)
Multicollinearity results when two or more independent variables are highly correlated.
- Pair-wise correlation between the independent variables to identify any relationships (collinearity is a problem if the absolute value of a correlation coefficient exceeds 0.8)
- A regression with few statistically significant t-statistics (at the 0.05 level or lower) despite a high R^2 and a significant overall F-statistic
- Auxiliary regressions in which one independent variable is regressed on the remaining independent variables (a high R^2 in an auxiliary regression signals collinearity)
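A minimal sketch of the pair-wise correlation and auxiliary regression checks, assuming simulated data with one deliberately collinear variable and the statsmodels package:

    # Pair-wise correlations and an auxiliary regression on simulated data
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=300)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # deliberately collinear with x1
    x3 = rng.normal(size=300)

    # Pair-wise correlations: flag any pair with |correlation| > 0.8
    print(np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False))

    # Auxiliary regression: regress x1 on the remaining independent variables;
    # a high R^2 here signals multicollinearity
    aux = sm.OLS(x1, sm.add_constant(np.column_stack([x2, x3]))).fit()
    print(aux.rsquared)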
Methods to solve multicollinearity (4)
- Remove one of the collinear variables from the model
- Calculate the difference between the lagged and current values of the variable (for time series data)
- Pool cross-sectional and time series data
- Include additional data and increase the sample size
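A minimal sketch of the differencing remedy for time-series data, assuming pandas and an illustrative series:

    # First differences: replace each value with its change from the lagged value
    import pandas as pd

    series = pd.Series([100.0, 103.0, 107.0, 112.0, 118.0])
    first_difference = series - series.shift(1)   # equivalent to series.diff()
    print(first_difference)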
Methods to detect heteroscedasticity (2)
Heteroscedasticity occurs when the variance of the errors in a regression is not constant.
- Chart residuals - present if residual errors increase as the independent variable increases
- Goldfeld-Quandt test
a. Sample is divided into 2 equal parts and a regression is run on each
b. Test statistic = S1/S2, where S1 and S2 are the sums of the squared residuals from each regression
c. Heteroscedasticity is present if the test statistic is greater than a critical value.
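A minimal hand-rolled sketch of the Goldfeld-Quandt recipe above, assuming simulated data whose error variance grows with the independent variable and the statsmodels package:

    # Split the sample, run a regression on each half, and compare residual variances
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(1.0, 10.0, size=200))
    y = 3.0 + 1.5 * x + rng.normal(scale=0.5 * x)   # error variance grows with x

    half = x.size // 2
    ssr = []
    for part in (slice(None, half), slice(half, None)):
        fit = sm.OLS(y[part], sm.add_constant(x[part])).fit()
        ssr.append(fit.ssr)                          # sum of squared residuals

    gq_statistic = max(ssr) / min(ssr)               # ratio of the two residual sums
    print(gq_statistic)   # compare against an F critical value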
Methods to solve heteroscedasticity (1)
- Perform a transformation of the data and then run the regression
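A minimal sketch of the transformation remedy, assuming a log transformation (a common choice, not specified in the source) and simulated data:

    # Fit the regression on log-transformed responses rather than raw values
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(1.0, 10.0, size=200)
    y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.2, size=200))   # multiplicative errors

    fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
    print(fit.params)   # intercept and slope estimated on the log scale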
Methods to detect autocorrelation (2)
Autocorrelation is correlation among the error terms. It is a potential issue for models that include prior-year variables.
- Plot error term v. independent variable or time to detect presence of a consistent pattern
- Durbin-Watson D test
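A minimal sketch of computing the Durbin-Watson statistic on regression residuals, assuming simulated data with autocorrelated (AR(1)) errors and the statsmodels package:

    # Fit a regression with autocorrelated errors, then compute the Durbin-Watson statistic
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(4)
    n = 200
    t = np.arange(n, dtype=float)
    e = np.zeros(n)
    for i in range(1, n):
        e[i] = 0.7 * e[i - 1] + rng.normal()   # AR(1) errors
    y = 2.0 + 0.5 * t + e

    residuals = sm.OLS(y, sm.add_constant(t)).fit().resid
    print(durbin_watson(residuals))   # values near 2 suggest no autocorrelation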
Factors for choosing the right predictive model (7)
- Correlation structure - more complicated models may be needed for data containing correlated variables
- Purpose of the analysis
- The nature of the available data
- Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
- Distribution of the outcome variable (eg, normal vs. skewed)
- Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods
- Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)