Introduction to Modeling Flashcards

1
Q

Steps of the data warehousing process (4)

A
  1. Identify which patients to include in the dataset
  2. Identify which data elements to merge with the patient list
  3. Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
  4. Attach the derived variables and flags to the patient identifiers to create a picture of the patient history (see the sketch below)
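
A minimal pandas sketch of the four steps; the table and column names (patients, claims, allowed_amt, high_cost_flag) are illustrative assumptions, not part of the card.

  import pandas as pd

  # Step 1: the list of patients to include
  patients = pd.DataFrame({"patient_id": [1, 2, 3]})

  # Step 2: merge data elements (here, claim amounts) with the patient list
  claims = pd.DataFrame({"patient_id": [1, 1, 2], "allowed_amt": [500.0, 1200.0, 300.0]})
  merged = patients.merge(claims, on="patient_id", how="left")

  # Step 3: derive a flag describing the patient's cost/risk status
  totals = merged.groupby("patient_id", as_index=False)["allowed_amt"].sum()
  totals["high_cost_flag"] = totals["allowed_amt"] > 1000

  # Step 4: attach the derived flag back to the patient identifiers
  history = patients.merge(totals, on="patient_id", how="left")
  print(history)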
2
Q

Re-sampling methods for validating a model (4)

A

These approaches help test the model's predictive power; a short sketch of the first two follows the list.

  1. Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
  2. Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
  3. Cross-validation - subsets of data are held out for use as validating sets
  4. Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
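
A minimal numpy sketch of the bootstrap and jackknife applied to a sample mean; the simulated data are an illustrative assumption.

  import numpy as np

  rng = np.random.default_rng(0)
  x = rng.normal(loc=10.0, scale=2.0, size=50)   # illustrative sample

  # Bootstrap: sample with replacement and re-estimate the statistic each time
  boot_means = [rng.choice(x, size=x.size, replace=True).mean() for _ in range(1000)]
  boot_se = np.std(boot_means, ddof=1)

  # Jackknife: recompute the statistic leaving out one observation at a time
  n = x.size
  jack_means = np.array([np.delete(x, i).mean() for i in range(n)])
  jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))

  print(boot_se, jack_se)   # both approximate the standard error of the mean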
3
Q

Characteristics for assessing the quality of a model (5)

A
  1. Parsimony - should introduce as few variables as are necessary to produce the desired results
  2. Identifiability - if there are more dependent variables than independent equations, then issues such as bias will result
  3. Goodness of fit - variations in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
  4. Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
  5. Predictive power - should predict well when applied to data that was not used in building the model
4
Q

Adjusted R^2 Formula

A

This measure makes an adjustment for the number of independent variables in the model. It indicates whether an additional independent variable improves the model more than would be expected by chance.

Adjusted R^2 = 1-(1-R^2) x (N-1)/(N-k-1)

N = sample size
k = number of independent variables (excluding the constant term)
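
A minimal sketch of the formula as a Python function; the inputs in the example call are illustrative.

  def adjusted_r_squared(r2: float, n: int, k: int) -> float:
      # Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)
      return 1 - (1 - r2) * (n - 1) / (n - k - 1)

  print(adjusted_r_squared(r2=0.75, n=100, k=5))   # ~0.737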
5
Q

Statistics for determining whether a model is good (7)

A
  1. R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of independent variables.
  2. Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
  3. F-Test - the ratio of the variance explained by the model to the unexplained (error) variance
  4. Statistics used for logistic models:
    a. Hosmer-Lemeshow statistic
    b. Somers’ D statistic
    c. C-statistic
  5. Multicollinearity - occurs when a linear relationship exists between the independent variables. May be addressed by removing one of the collinear variables.
  6. Heteroscedasticity - occurs when the error terms do not have a constant variance
  7. Autocorrelation - occurs when the error terms in the regression function are correlated with one another (a short diagnostics sketch follows this list)
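
A minimal sketch of the first three diagnostics using statsmodels OLS output; the simulated data are an illustrative assumption.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  X = rng.normal(size=(100, 2))                                   # two illustrative predictors
  y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=100)

  model = sm.OLS(y, sm.add_constant(X)).fit()
  print(model.rsquared, model.rsquared_adj)   # 1. R^2 and Adjusted R^2
  print(model.tvalues, model.pvalues)         # 2. coefficient signs and t-test significance
  print(model.fvalue, model.f_pvalue)         # 3. overall F-test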
6
Q

Methods to detect multicollinearity (3)

A

Multicollinearity results when two or more independent variables are highly correlated.

  1. Pair-wise correlations between the independent variables to identify any relationships (collinearity is a problem if the absolute value of a correlation coefficient exceeds 0.8)
  2. A regression with few significant t-statistics (at the 0.05 level or lower) despite a high R^2 and a significant F-statistic
  3. Auxiliary regressions, where each independent variable is regressed on the remaining independent variables (a high R^2 in an auxiliary regression signals collinearity); a sketch of methods 1 and 3 follows
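
A minimal sketch of methods 1 and 3; the nearly collinear variables are simulated as an illustrative assumption.

  import numpy as np
  import pandas as pd
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  x1 = rng.normal(size=200)
  x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly collinear with x1
  x3 = rng.normal(size=200)
  X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

  # Method 1: pair-wise correlations; flag any |r| > 0.8
  print(X.corr().abs() > 0.8)

  # Method 3: auxiliary regression of x1 on the remaining independent variables;
  # a high R^2 here (equivalently, a high VIF = 1 / (1 - R^2)) signals collinearity
  aux = sm.OLS(X["x1"], sm.add_constant(X[["x2", "x3"]])).fit()
  print(aux.rsquared, 1 / (1 - aux.rsquared))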
7
Q

Methods to solve multicollinearity (4)

A
  1. Remove one of the collinear variables from the model
  2. Calculate the difference between the lagged and current values of the variable (for time series data; see the first-differencing sketch after this list)
  3. Pool cross-sectional and time series data
  4. Include additional data and increase the sample size
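
A minimal pandas sketch of method 2 (first-differencing a time series); the series values are an illustrative assumption.

  import pandas as pd

  s = pd.Series([100.0, 104.0, 110.0, 118.0])
  diffed = s - s.shift(1)   # current value minus lagged value (same as s.diff())
  print(diffed)             # NaN, 4.0, 6.0, 8.0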
8
Q

Methods to detect heteroscedasticity (2)

A

Heteroscedasticity means the variance of the errors in a regression is not constant.

  1. Chart residuals - heteroscedasticity is present if the residual errors increase as the independent variable increases
  2. Goldfeld-Quandt test (sketched below)
    a. The sample is divided into 2 equal parts and a regression is run on each
    b. Test statistic = S1/S2, where S1 and S2 are the sums of the squared residuals from each regression
    c. Heteroscedasticity is present if the test statistic is greater than a critical value.
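
A minimal sketch of the Goldfeld-Quandt steps above; the data with growing error variance are simulated as an illustrative assumption.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  x = np.sort(rng.uniform(1.0, 10.0, size=100))
  y = 2.0 + 1.5 * x + rng.normal(scale=0.5 * x)   # error variance grows with x

  def ssr(x_part, y_part):
      # sum of squared residuals from a regression on one part of the sample
      return sm.OLS(y_part, sm.add_constant(x_part)).fit().ssr

  half = len(x) // 2
  s1, s2 = ssr(x[:half], y[:half]), ssr(x[half:], y[half:])
  # place the larger sum of squared residuals in the numerator so the
  # ratio can be compared with an F critical value
  print(max(s1, s2) / min(s1, s2))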
9
Q

Methods to solve heteroscedasticity (1)

A
  1. Perform a transformation on the data (eg, take logarithms) and then run the regression, as in the sketch below
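
A minimal sketch, assuming a log transformation (one common choice) applied to simulated data with multiplicative errors.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(0)
  x = rng.uniform(1.0, 10.0, size=100)
  y = 2.0 * x * np.exp(rng.normal(scale=0.3, size=100))   # multiplicative errors

  # after taking logs the errors are additive with roughly constant variance
  fit = sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
  print(fit.params)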
10
Q

Methods to detect autocorrelation (2)

A

Autocorrelation is correlation among the error terms. It is a potential issue for models that include prior-year variables.

  1. Plot the error term vs. an independent variable or time to detect the presence of a consistent pattern
  2. Durbin-Watson D test (values near 2 suggest no autocorrelation; see the sketch below)
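
A minimal sketch of both methods using statsmodels; the AR(1)-style errors are simulated as an illustrative assumption.

  import numpy as np
  import statsmodels.api as sm
  from statsmodels.stats.stattools import durbin_watson

  rng = np.random.default_rng(0)
  x = np.arange(100, dtype=float)
  e = np.zeros(100)
  for t in range(1, 100):          # each error term depends on the prior one
      e[t] = 0.8 * e[t - 1] + rng.normal()
  y = 1.0 + 0.5 * x + e

  fit = sm.OLS(y, sm.add_constant(x)).fit()
  # Method 1: plot fit.resid against x or time and look for a consistent pattern
  # Method 2: Durbin-Watson D; values well below 2 signal positive autocorrelation
  print(durbin_watson(fit.resid))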
11
Q

Factors for choosing the right predictive model (7)

A
  1. Correlation structure - more complicated models may be needed for data containing correlated variables
  2. Purpose of the analysis
  3. The nature of the available data
  4. Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
  5. Distribution of the outcome variable (eg, normal vs. skewed)
  6. Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods
  7. Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)