Introduction to Modeling Flashcards
Steps of the data warehousing process (4)
- Identify which patients to include in the dataset
- Identify which data elements to merge with the patient list
- Identify what the data says about the patient (eg, create flags that describe the patient’s health and risk status)
- Attach the derived variables and flags to the patient identifiers to create a picture of the patient history
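A minimal sketch of these four steps in Python, assuming pandas and made-up patient and claims tables; the column names and the high-cost flag rule are illustrative assumptions, not from the source:

    # Step 1: identify which patients to include
    import pandas as pd

    patients = pd.DataFrame({"patient_id": [1, 2, 3]})

    # Step 2: identify data elements to merge with the patient list
    claims = pd.DataFrame({"patient_id": [1, 1, 2, 3],
                           "allowed_amount": [500.0, 1200.0, 300.0, 4500.0]})

    # Step 3: derive flags describing health/risk status (illustrative rule)
    totals = claims.groupby("patient_id", as_index=False)["allowed_amount"].sum()
    totals["high_cost_flag"] = totals["allowed_amount"] > 1000

    # Step 4: attach the derived variables to the patient identifiers
    history = patients.merge(totals, on="patient_id", how="left")
    print(history)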
Re-sampling methods for validating a model (4)
These approaches help test the model’s predictive power
- Bootstrap - the sampling distribution of an estimator is estimated by sampling with replacement from an original sample
- Jackknife - the estimate of a statistic is systematically re-computed, leaving out one observation at a time from the sample set
- Cross-validation - subsets of data are held out for use as validating sets
- Permutation test - a reference distribution is obtained by calculating all possible values of the test statistic under rearrangements of the labels on the observed data points
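A minimal sketch of the four re-sampling ideas, each applied to a simple statistic (the sample mean); the simulated data, fold count, and replication counts are illustrative assumptions:

    # Bootstrap, jackknife, cross-validation, and a permutation test on simulated data
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(loc=10.0, scale=2.0, size=50)

    # Bootstrap: sample with replacement from the original sample
    boot_means = [rng.choice(x, size=x.size, replace=True).mean() for _ in range(1000)]

    # Jackknife: recompute the statistic, leaving out one observation at a time
    jack_means = [np.delete(x, i).mean() for i in range(x.size)]

    # Cross-validation: hold out subsets of the data as validating sets (5 folds here)
    folds = np.array_split(np.arange(x.size), 5)
    cv_means = [x[np.setdiff1d(np.arange(x.size), fold)].mean() for fold in folds]

    # Permutation test: rearrange labels on two groups to build a reference distribution
    y = rng.normal(loc=10.5, scale=2.0, size=50)
    pooled = np.concatenate([x, y])
    perm_diffs = []
    for _ in range(1000):
        shuffled = rng.permutation(pooled)
        perm_diffs.append(shuffled[:50].mean() - shuffled[50:].mean())
    p_value = np.mean(np.abs(perm_diffs) >= abs(x.mean() - y.mean()))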
Characteristics for assessing the quality of a model (5)
- Parsimony - should introduce as few variables as are necessary to produce the desired results
- Identifiability - if there are more dependent variables than independent equations, then issues such as bias will result
- Goodness of fit - variations in the outcome variable should be explained to a high degree by the explanatory variables (measured by R^2 and other statistics)
- Theoretical consistency - results should be consistent with the analyst’s prior knowledge of the relationships between variables
- Predictive power - should predict well when applied to data that was not used in building the model
Adjusted R^2 Formula
This measure makes an adjustment for the number of independent variables in the model. It indicates whether an additional independent variable improves the model more than would be expected by chance.
Adjusted R^2 = 1 - (1 - R^2) x (N - 1) / (N - k - 1)
N = sample size; k = number of independent variables (excluding the constant term)
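A worked example of the formula, using assumed values for R^2, N, and k:

    # Worked example with assumed values: R^2 = 0.75, N = 100, k = 5
    R2, N, k = 0.75, 100, 5
    adjusted_R2 = 1 - (1 - R2) * (N - 1) / (N - k - 1)
    print(round(adjusted_R2, 4))   # 0.7367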
Statistics for determining whether a model is good (7)
- R^2 - measures how much of the variation in the dependent variable is explained by the variation in the independent variables. A more valid measure may be Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1), where N = # of observations and k = # of independent variables.
- Regression coefficients - examine the signs of the parameter estimates to ensure they make sense, then determine whether the value of the parameter estimate is statistically significant (using t distribution)
- F-Test - ratio of the variance explained by the model to the unexplained (error) variance
- Statistics used for logistic models:
a. Hosmer-Lemeshow statistic
b. Somers’ D statistic
c. C-statistic
- Multicollinearity - occurs when a linear relationship exists between the independent variables. May be addressed by removing one of the collinear variables.
- Heteroscedasticity - occurs when the error terms do not have a constant variance
- Autocorrelation - occurs when there is correlation among the error terms in the regression function
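A minimal sketch, assuming simulated data and the statsmodels package, of pulling several of these diagnostics (R^2, adjusted R^2, the F-test, and coefficient signs and t-statistics) from an ordinary least squares fit:

    # Simulated data, then R^2, adjusted R^2, the F-test, and coefficient t-statistics
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=200)

    fit = sm.OLS(y, sm.add_constant(X)).fit()
    print(fit.rsquared, fit.rsquared_adj)   # R^2 and adjusted R^2
    print(fit.fvalue, fit.f_pvalue)         # overall F-test and its p-value
    print(fit.params, fit.tvalues)          # coefficient signs and t-statistics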
Methods to detect multicollinearity (3)
Multicollinearity results when two or more independent variables are highly correlated.
- Pair-wise correlation between the independent variables to identify any relationships (collinearity is a problem if the absolute value of a correlation coefficient exceeds 0.8)
- A regression with few statistically significant t-statistics (at the 0.05 level or lower) despite a high R^2 and a significant overall F-statistic
- Auxiliary regressions in which one independent variable is regressed on the remaining independent variables (a high R^2 in an auxiliary regression signals collinearity)
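A minimal sketch of the pair-wise correlation and auxiliary regression checks, assuming simulated data with one deliberately collinear variable and the statsmodels package:

    # Pair-wise correlations and an auxiliary regression on simulated data
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=300)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # deliberately collinear with x1
    x3 = rng.normal(size=300)

    # Pair-wise correlations: flag any pair with |correlation| > 0.8
    print(np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False))

    # Auxiliary regression: regress x1 on the remaining independent variables;
    # a high R^2 here signals multicollinearity
    aux = sm.OLS(x1, sm.add_constant(np.column_stack([x2, x3]))).fit()
    print(aux.rsquared)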
Methods to solve multicollinearity (4)
- Remove one of the collinear variables from the model
- Calculate the difference between the lagged and current values of the variable (for time series data)
- Pool cross-sectional and time series data
- Include additional data and increase the sample size
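A minimal sketch of the differencing remedy for time-series data, assuming pandas and an illustrative series:

    # First differences: replace each value with its change from the lagged value
    import pandas as pd

    series = pd.Series([100.0, 103.0, 107.0, 112.0, 118.0])
    first_difference = series - series.shift(1)   # equivalent to series.diff()
    print(first_difference)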
Methods to detect heteroscedasticity (2)
Heteroscedasticity occurs when the variance of the errors in a regression is not constant.
- Chart residuals - present if residual errors increase as the independent variable increases
- Goldfeld-Quandt test
a. Sample is divided into 2 equal parts and a regression is run on each
b. Test statistic = S1/S2, where S1 and S2 are the sums of the squared residuals from each regression
c. Heteroscedasticity is present if the test statistic is greater than a critical value.
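A minimal hand-rolled sketch of the Goldfeld-Quandt recipe above, assuming simulated data whose error variance grows with the independent variable and the statsmodels package:

    # Split the sample, run a regression on each half, and compare residual variances
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    x = np.sort(rng.uniform(1.0, 10.0, size=200))
    y = 3.0 + 1.5 * x + rng.normal(scale=0.5 * x)   # error variance grows with x

    half = x.size // 2
    ssr = []
    for part in (slice(None, half), slice(half, None)):
        fit = sm.OLS(y[part], sm.add_constant(x[part])).fit()
        ssr.append(fit.ssr)                          # sum of squared residuals

    gq_statistic = max(ssr) / min(ssr)               # ratio of the two residual sums
    print(gq_statistic)   # compare against an F critical value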
Methods to solve heteroscedasticity (1)
- Perform a transformation of the data and then run the regression
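A minimal sketch of the transformation remedy, assuming a log transformation (a common choice, not specified in the source) and simulated data:

    # Fit the regression on log-transformed responses rather than raw values
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    x = rng.uniform(1.0, 10.0, size=200)
    y = np.exp(1.0 + 0.3 * x + rng.normal(scale=0.2, size=200))   # multiplicative errors

    fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()
    print(fit.params)   # intercept and slope estimated on the log scale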
Methods to detect autocorrelation (2)
Autocorrelation is correlation among the error terms. It is a potential issue for models that include prior-year variables.
- Plot error term v. independent variable or time to detect presence of a consistent pattern
- Durbin-Watson D test
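A minimal sketch of computing the Durbin-Watson statistic on regression residuals, assuming simulated data with autocorrelated (AR(1)) errors and the statsmodels package:

    # Fit a regression with autocorrelated errors, then compute the Durbin-Watson statistic
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(4)
    n = 200
    t = np.arange(n, dtype=float)
    e = np.zeros(n)
    for i in range(1, n):
        e[i] = 0.7 * e[i - 1] + rng.normal()   # AR(1) errors
    y = 2.0 + 0.5 * t + e

    residuals = sm.OLS(y, sm.add_constant(t)).fit().resid
    print(durbin_watson(residuals))   # values near 2 suggest no autocorrelation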
Factors for choosing the right predictive model (7)
- Correlation structure - more complicated models may be needed for data containing correlated variables
- Purpose of the analysis
- The nature of the available data
- Characteristics of the outcome variable (eg, quantitative vs. qualitative, unrestricted vs. truncated, binary choice vs. unrestricted choice)
- Distribution of the outcome variable (eg, normal vs. skewed)
- Functional relationship (eg, linear vs. non-linear) - when the equation cannot be transformed into a linear form, iterative processes or a maximum likelihood procedure may be used instead of ordinary regression methods
- Complex decision model - whether a single equation model is sufficient or a simultaneous equation model is needed (if there is more than one dependent variable)