Chapter 12: Assumptions Starblind Flashcards
What are two tests that evaluate normality in a distribution?
The Kolmogorov–Smirnov (K-S) and Shapiro–Wilk tests.
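For a quick illustration, here is a minimal sketch of running both tests in Python with SciPy (the library choice and the simulated data are assumptions, not part of the original card):

```python
# A minimal sketch with SciPy (library choice and simulated data are
# assumptions for illustration, not part of the original card).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=0, scale=1, size=100)  # sample whose normality we test

# Shapiro-Wilk: null hypothesis is that the data come from a normal distribution.
w, p_sw = stats.shapiro(x)
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p_sw:.3f}")

# Kolmogorov-Smirnov against the standard normal. Standardizing with the
# sample mean and SD makes the standard K-S p-value conservative
# (the Lilliefors correction addresses this).
z = (x - x.mean()) / x.std(ddof=1)
d, p_ks = stats.kstest(z, "norm")
print(f"K-S: D = {d:.3f}, p = {p_ks:.3f}")
```

In both tests the null hypothesis is that the data are normal, so a significant result (p < .05) indicates a deviation from normality.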
During what analysis is it OK to not have homogeneity of variance?
During parameter estimation: you can bootstrap the estimates (and their confidence intervals), which does not rely on homoscedasticity.
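A minimal sketch of what this looks like in practice, bootstrapping a regression slope on deliberately heteroscedastic data (NumPy, the data, and all settings are illustrative assumptions):

```python
# Bootstrapping a regression slope on deliberately heteroscedastic data
# (NumPy; all data and settings are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 0.2 * x + 0.1, n)  # noise grows with x

def slope(xs, ys):
    # Least-squares slope (degree-1 polynomial fit).
    return np.polyfit(xs, ys, 1)[0]

boot = np.empty(2000)
for i in range(boot.size):
    idx = rng.integers(0, n, n)          # resample cases with replacement
    boot[i] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])  # percentile bootstrap 95% CI
print(f"slope = {slope(x, y):.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

Because the percentile interval is built from the resampled slopes themselves, no standard-error formula (and hence no homoscedasticity assumption) is needed.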
What is multicollinearity?
“…when there is a strong relationship between two or more predictors.”
Pertaining to normal distribution, when should you use the method of least squares?
When the distribution is normal; when it is not, the least squares estimates may no longer be the best available.
b0 represents what in the general linear model?
The intercept, or the value of the outcome variable when the predictor variable is zero.
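As a point of reference, the one-predictor form of the general linear model is usually written as:

$$Y_i = b_0 + b_1 X_i + \varepsilon_i$$

Setting $X_i = 0$ leaves $Y_i = b_0 + \varepsilon_i$, which is why $b_0$ is read as the expected value of the outcome when the predictor is zero.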
Why are measures of homoscedasticity and normality fundamentally flawed?
Their goal is to detect whether variances are equal (homoscedasticity) or whether the data are normally distributed.
Because they are significance tests (NHST), their ability to detect a violation depends on statistical power, and power comes largely from sample size.
When sample size is small, i.e. exactly when heteroscedasticity or non-normality is most damaging, these tests lack the power to detect it.
When sample size is large, i.e. when the central limit theorem means the sampling distribution will be approximately normal regardless, these tests have so much power that they flag even trivial deviations.
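A small simulation makes the point concrete: the same mildly skewed population passes the Shapiro–Wilk test in a small sample but is flagged in a large one (the gamma distribution and the sample sizes are assumptions chosen for illustration):

```python
# The same mildly skewed population, tested at two sample sizes
# (the gamma shape and the ns are assumptions chosen for illustration).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

for n in (20, 2000):
    sample = rng.gamma(shape=20, scale=1, size=n)  # nearly normal, slight skew
    _, p = stats.shapiro(sample)
    print(f"n = {n:5d}: Shapiro-Wilk p = {p:.4f}")
# Typically: p > .05 at n = 20 (the violation is missed), p < .05 at
# n = 2000 (a trivial deviation is flagged), despite the identical population.
```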
What type of variables should our predictor variable contain? How do outcome variables differ?
“All predictor variables must be quantitative or categorical (with two categories), and the outcome variable must be quantitative, continuous and unbounded.”
The assumption of homogeneity of variance has been violated.
1) What two methods could you have used to detect this? Briefly explain how each is interpreted.
2) What does this mean for your analysis?
1) Levene’s test: tests whether the variances across groups are equal. If it is significant (p < .05), the variances are significantly different and the assumption is violated.
Hartley’s Fmax: the ratio of the largest group variance to the smallest group variance. If it exceeds the critical value, the variances are not equal; as a rough rule of thumb, a ratio below 2 is usually taken as acceptable.
2) Without homoscedasticity, any formula that uses the standard error is invalid (confidence intervals and test statistics).
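A minimal sketch of both checks with SciPy and NumPy (the three groups are simulated with deliberately unequal spreads; all data here are illustrative assumptions):

```python
# Minimal sketch with SciPy/NumPy; the three groups are simulated with
# deliberately unequal spreads (all data here are illustrative).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
g1 = rng.normal(0, 1.0, 30)
g2 = rng.normal(0, 1.5, 30)
g3 = rng.normal(0, 3.0, 30)

# Levene's test: a significant result (p < .05) suggests unequal variances.
w, p = stats.levene(g1, g2, g3)
print(f"Levene: W = {w:.2f}, p = {p:.4f}")

# Hartley's Fmax: largest group variance over smallest group variance.
variances = [np.var(g, ddof=1) for g in (g1, g2, g3)]
fmax = max(variances) / min(variances)
print(f"Hartley's Fmax = {fmax:.2f}")  # rule of thumb: below 2 is acceptable
```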
As collinearity increases there are three issues that we should be aware of. What are these, and why should we care?
1) As collinearity increases, so do the standard errors of the b parameters.
Larger standard errors mean the bs vary more from sample to sample, so the estimates in your sample are less likely to represent the population.
2) Multicollinearity limits the fit of the overall model.
Because collinear predictors account for the same variance in the outcome, adding one contributes little unique explanatory power, so the overall fit can improve only marginally.
3) Multicollinearity between predictors makes it difficult to assess the individual importance of a predictor.
If the predictors are highly collinear, and each accounts for similar variance in the outcome, then how can we know which of the two variables is important?
What are the best ways to assess additivity and linearity along with homoscedasticity?
A scatterplot of the standardized residuals against the standardized predicted values (ZRESID vs ZPRED).
Linearity: the points should roughly follow a straight band, with no curvature.
Homoscedasticity: the spread should be roughly equal across the plot, with no funnel (cone) shapes.
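As a sketch, this is what producing that plot might look like; the fit, data, and plotting choices below are assumptions for illustration:

```python
# ZRESID vs ZPRED sketch with NumPy/Matplotlib (fit, data, and plotting
# choices are assumptions for illustration).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.8 * x + rng.normal(0, 1, 200)

b1, b0 = np.polyfit(x, y, 1)             # simple least-squares fit
fitted = b0 + b1 * x
resid = y - fitted

# Standardize both axes (z-scores).
zpred = (fitted - fitted.mean()) / fitted.std(ddof=1)
zresid = (resid - resid.mean()) / resid.std(ddof=1)

plt.scatter(zpred, zresid, s=10)
plt.axhline(0, linestyle="--")
plt.xlabel("Standardized predicted values (ZPRED)")
plt.ylabel("Standardized residuals (ZRESID)")
plt.title("Check for curvature (non-linearity) and funnels (heteroscedasticity)")
plt.show()
```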
If a parameter estimate is biased, what else would you expect to have error?
“Standard errors, confidence intervals, test statistics and p-values.”
What are two ways that we measure multicollinearity in a model?
1) Variance Inflation Factor (VIF): Measures whether a predictor has a strong linear relationship with the other predictors.
VIF > 10 = serious multicollinearity
VIF between 1 and 10 = potential multicollinearity
2) Tolerance Statistic (1/VIF): Reciprocal of VIF.
Tolerance < 0.1 = serious multicollinearity
Tolerance < 0.2 = potential multicollinearity
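Both statistics can be computed with statsmodels' variance_inflation_factor helper; in the sketch below, x2 is built to be nearly collinear with x1 (variable names and data are illustrative assumptions):

```python
# VIF and tolerance via statsmodels; x2 is built to be nearly collinear
# with x1 (variable names and data are illustrative assumptions).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)  # almost a copy of x1
x3 = rng.normal(size=100)

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # intercept + predictors
for i, name in zip(range(1, 4), ["x1", "x2", "x3"]):
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:7.2f}, tolerance = {1 / vif:.3f}")
```

On this data x1 and x2 should show large VIFs (tolerances below 0.1), while x3 stays near 1.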
What is kurtosis?
“Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers. Data sets with low kurtosis tend to have light tails, or lack of outliers. A uniform distribution would be the extreme case.”
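As a quick numerical illustration (SciPy's kurtosis uses the Fisher definition, under which a normal distribution scores 0; the three distributions below are assumptions chosen to show the contrast):

```python
# scipy.stats.kurtosis uses the Fisher definition (normal = 0);
# the three distributions are assumptions chosen to show the contrast.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
print(stats.kurtosis(rng.normal(size=100_000)))        # ~0: normal baseline
print(stats.kurtosis(rng.standard_t(5, size=100_000))) # > 0: heavy tails
print(stats.kurtosis(rng.uniform(size=100_000)))       # -1.2: light tails (uniform)
```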
b1 represents what in the general linear model?
The slope: the parameter attached to the predictor variable, representing the change in the outcome associated with a one-unit change in the predictor.
Briefly explain the assumption of no external variables.
External variables (variables not included in the model) should have no relationship with the variables that are in the model, because their influence has been controlled for.