statistical modeling Flashcards
statistical model
a mathematical model that embodies a set of statistical assumptions concerning the generation of sample data (and similar data from a larger population)
a statistical model represents, often in considerably idealized form, the data-generating process
a statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables
standard error dependencies (models, eg parameters)
mnemonic: DMCS (data, model, collinearities, sample size)
- quality of data (eg measurement errors)
- quality of the model (ie fit / low bias)
- collinearities (these can increase standard error)
- sample size (standard error is asymptotically proportional to 1/sqrt(n))
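The sample-size dependence can be checked with a small simulation. This is only a sketch under assumed conditions: the true slope of 2, unit-variance normal noise, and the helper name `slope_standard_error` are illustrative choices, not part of the card. Quadrupling n should roughly halve the empirical standard error of the fitted slope.

```python
import numpy as np

def slope_standard_error(n, n_sims=2000, seed=0):
    """Empirical standard error of a least-squares slope at sample size n."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.normal(size=n)
        # Assumed data-generating process: true slope 2, unit-variance noise.
        y = 2.0 * x + rng.normal(size=n)
        slopes[i] = np.polyfit(x, y, 1)[0]  # index 0 is the slope for deg=1
    # Spread of the slope estimates across repeated samples.
    return slopes.std(ddof=1)

se_100 = slope_standard_error(100)
se_400 = slope_standard_error(400)
# If SE scales like 1/sqrt(n), this ratio should be near sqrt(400/100) = 2.
ratio = se_100 / se_400
```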
permutation test
- a Monte Carlo method to create a sampling distribution of a test statistic, such as a model parameter, by permuting the outcome variable values relative to the predictor tuples
- eg the model is fit on each permutation, and the test statistic is recomputed
- has the advantage of retaining the exact predictor distributions and whatever collinearities exist between the predictors
(normally, the same sampling distribution would be estimated via analytic methods, such as t-distributions for linear regression parameters)
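A minimal sketch of the permutation procedure described above, using ordinary least squares as the model. The function name `permutation_test_coef`, the synthetic data, and the two-sided p-value with a +1 correction are assumptions made for illustration. Note that shuffling the outcome breaks its link to all predictors at once, so for a single coefficient in a multi-predictor model this is an approximate test.

```python
import numpy as np

def permutation_test_coef(X, y, coef_index, n_perm=1000, seed=0):
    """Permute y relative to the rows of X and refit to build a null distribution."""
    rng = np.random.default_rng(seed)
    design = np.column_stack([np.ones(len(y)), X])  # intercept + predictors

    def fit_coef(outcome):
        beta, *_ = np.linalg.lstsq(design, outcome, rcond=None)
        return beta[coef_index + 1]  # +1 skips the intercept

    observed = fit_coef(y)
    # Predictor rows stay intact, so their distributions and collinearities
    # are retained in every permuted fit.
    null = np.array([fit_coef(rng.permutation(y)) for _ in range(n_perm)])
    # Two-sided p-value; the +1 counts the observed statistic itself.
    p_value = (1 + np.sum(np.abs(null) >= abs(observed))) / (n_perm + 1)
    return observed, null, p_value

# Hypothetical data: two collinear predictors, only the first affects y.
rng = np.random.default_rng(1)
a = rng.normal(size=80)
b = 0.6 * a + rng.normal(size=80)
y = 1.5 * a + rng.normal(size=80)
observed, null, p_value = permutation_test_coef(np.column_stack([a, b]), y, coef_index=0)
```

The array `null` plays the role of the sampling distribution that would otherwise come from an analytic t-distribution.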
rank transformation
some parametric statistical models are amenable to the rank transform, rendering them non-parametric;
eg,
* linear regression model, Y ~ A + B + C is parametric
* to make it non-parametric, fit rank(Y) ~ A + B + C instead, where rank assigns each value of Y its ordinal position in sorted order
rank transformations may be useful for eg reducing the influence of outliers, but the fitted coefficients may be difficult to interpret
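As an illustration of the rank transform, the sketch below ranks a small outcome vector containing one outlier and then fits rank(Y) ~ A by least squares. The data values and the helper `rank_values` are hypothetical, and the helper assumes there are no ties in Y.

```python
import numpy as np

def rank_values(values):
    """Assign ranks 1..n in ascending order (assumes no tied values)."""
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # hypothetical predictor A
y = np.array([3.1, 2.4, 250.0, 2.9, 3.6])  # 250.0 is an outlier
ranked = rank_values(y)                    # the outlier simply becomes rank 5

# Fit rank(Y) ~ A with an intercept; the coefficient is now in rank units,
# which is part of why such fits can be harder to interpret.
design = np.column_stack([np.ones(len(a)), a])
beta_rank, *_ = np.linalg.lstsq(design, ranked, rcond=None)
```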
variance partitioning property
- for certain models, the variance “partitions” between that explained or accounted for by a model, and the remaining (residual) variation
- total variance of the sample's outcome variable = variance of the model output + variance of the residual(s)
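The partition can be verified numerically for one of the models where it holds: a least-squares linear fit with an intercept. The synthetic data and variable names below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # assumed data-generating process

# Least-squares fit with an intercept column.
design = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
residuals = y - fitted

total_var = y.var()
model_var = fitted.var()
resid_var = residuals.var()
# For this kind of fit, total_var equals model_var + resid_var
# up to floating-point error, and their ratio gives R^2.
r_squared = model_var / total_var
```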
how statistical models work (Kaplan)
- statistical models partition variation
- individual case = model value + deviation = amount model can explain + what model cannot explain
three main types of statistical models
- description: describe a range or typical values of a quantity
- classification or prediction
- anticipating consequences of intervention: eg will a gas tax cause reduced consumption? (related to causal modeling)
ANOVA (for models)
the same methods used for eg ANOVA of population means can be applied to a model's fitted values and residuals
- general:
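- SST = SSM + SSE, where SST is the total sum of squares of the outcome about its mean, SSM is the sum of squares of the fitted model output, and SSE is the sum of squared model residuals
- after dividing each by its degrees of freedom (giving MSM and MSE) and making some normality assumptions, the ratio MSM/MSE produces an F value, from which a p-value is obtained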
- this is broadly applicable (just like R^2), regardless of the model type
- per-variable effects
- each variable (model term) gets its own SS, MS, F value, and p-value
- the type of ANOVA (Type I, II, III) affects how ANOVA apportions effects among model variables, by determining how SS is computed for each term
- eg Type I (sequential sum of squares) goes in order of predictors fed to the model: SS(p_k | p_1,…,p_{k-1}) = SS(p_1,…,p_k) - SS(p_1,…,p_{k-1}), for k=1 to number of predictors
- if the predictors are correlated then Type I will give different per-predictor results, depending on ordering
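The order dependence of Type I sums of squares can be seen in a short sketch. The helpers `model_ss` and `sequential_ss` and the synthetic correlated predictors are hypothetical names and data chosen for illustration; the calculation follows the sequential formula above with least-squares fits.

```python
import numpy as np

def model_ss(columns, y):
    """Model sum of squares for a least-squares fit of y on intercept + columns."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ beta
    return np.sum((fitted - y.mean()) ** 2)

def sequential_ss(predictors, y):
    """Type I SS per predictor, in dict insertion order (name -> 1-D array)."""
    result, included, previous = {}, [], 0.0  # intercept-only model has SS 0
    for name, column in predictors.items():
        included.append(column)
        current = model_ss(included, y)
        result[name] = current - previous  # SS(p_k | p_1, ..., p_{k-1})
        previous = current
    return result

# Hypothetical correlated predictors a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + 0.6 * rng.normal(size=200)
y = a + b + rng.normal(size=200)

ss_a_first = sequential_ss({"a": a, "b": b}, y)
ss_b_first = sequential_ss({"b": b, "a": a}, y)
```

Both orderings sum to the same total model SS, but the share credited to `a` is much larger when it enters first, because it then absorbs the variation it shares with `b`.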
some properties of covariates (in models)
- sometimes called confounding variables or nuisance variables, depending on their role
- adding covariates to a least-squares model can never reduce R^2 on the data used for fitting, only increase it or leave it unchanged
- if covariates are correlated with explanatory variables, their inclusion will have an effect on model coefficients (of linear models)
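Both properties can be illustrated with a least-squares fit on synthetic data. The coefficients, the correlation between the covariate and the explanatory variable, and the helper `fit_ols` are assumptions made for this sketch.

```python
import numpy as np

def fit_ols(columns, y):
    """Least-squares fit with intercept; returns (coefficients, R^2)."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
    return beta, r2

rng = np.random.default_rng(0)
n = 300
covariate = rng.normal(size=n)
explanatory = 0.7 * covariate + rng.normal(size=n)  # correlated with covariate
# Assumed truth: explanatory has coefficient 1, covariate has coefficient 2.
y = 1.0 * explanatory + 2.0 * covariate + rng.normal(size=n)

beta_without, r2_without = fit_ols([explanatory], y)
beta_with, r2_with = fit_ols([explanatory, covariate], y)
# beta_without[1] absorbs part of the covariate's effect; beta_with[1] is
# closer to the assumed true value of 1, and r2_with is not below r2_without.
```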