Statistics Theory L10 = Assumptions Of T-Distributions Flashcards
Assumptions of t-distributions? (5)
- Random sampling.
- Independent observations.
- Normality of the population distribution.
- Unknown population variance.
- Degrees of freedom (df) are based on sample size.
Aspects of the performance of t-distributions when assumptions are not met (in real-world situations)? (2)
- Robustness of the 2-sample t-tools.
- Resistance of the 2-sample t-tools.
Robustness of the 2-sample t-tools?
= a statistic is robust to departures from a particular assumption if it remains valid even when that assumption is not met.
Valid?
= uncertainty measures achieve approximately their stated rates (eg, 95% CIs capture the true value 95% of the time).
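A minimal simulation sketch of this definition of validity (all numbers are made up for illustration): build a 95% CI for the mean from many repeated samples and count how often it captures the true mean. A valid procedure hits close to the nominal 95%.

```python
# Coverage simulation: is a "95% CI" actually valid, i.e. does it
# capture the true mean ~95% of the time? Illustrative values only.
import random
import statistics

random.seed(42)

TRUE_MEAN, TRUE_SD = 10.0, 3.0
N, REPS = 50, 2000
Z = 1.96  # normal approximation to the t critical value, fine for n = 50

hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    if mean - Z * se <= TRUE_MEAN <= mean + Z * se:
        hits += 1

coverage = hits / REPS
print(f"Empirical coverage: {coverage:.3f}")  # close to the nominal 0.95
```

When assumptions fail, this empirical coverage drifts away from 95% — that drift is exactly what "invalid" means in the cards below.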
2 departures that t-distributions have to be robust to?
- Departures from normality.
- Departures from independence.
Departures from normality?
Handled by the Central Limit Theorem: for large n (sample size), the sampling distribution of the mean is approximately normal even if the population distribution is not.
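A quick sketch of the CLT at work (assumed illustration, not from the cards): draw samples of n = 100 from a strongly right-skewed exponential population and check that the distribution of sample means is nearly symmetric.

```python
# CLT demo: sample means from a skewed (exponential) population
# are roughly symmetric for moderately large n.
import random
import statistics

random.seed(1)

def skewness(xs):
    # Standardized third moment (population version).
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return statistics.fmean([((x - m) / s) ** 3 for x in xs])

population = [random.expovariate(1.0) for _ in range(100_000)]
means = [statistics.fmean(random.choices(population, k=100)) for _ in range(2000)]

print(f"population skewness:  {skewness(population):.2f}")  # near 2 (exponential)
print(f"sample-mean skewness: {skewness(means):.2f}")       # much closer to 0
```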
The Central Limit Theorem is based on statistical theory and says 4 things?
If you have:
- 2 populations/groups with similar SDs (σ1 = σ2), similar distribution shapes, & n1 = n2: validity is affected by long tails but little affected by skewness.
- 2 populations/groups with similar SDs (σ1 = σ2) & similar distribution shapes but n1 ≠ n2: validity is affected moderately by long tails & substantially by skewness.
- If skewness differs considerably between the groups, results from the t-tools will be severely misleading.
- Equal sample sizes: little effect of differing σ's. Unequal sample sizes: larger effect of differing σ's, so intervals capture more or fewer true values than the nominal rate.
Departures from independence?
= arise from the cluster effect, the serial effect, or both.
Cluster effect?
= when samples that naturally occur in groupings tend to be more similar to each other.
Eg of Cluster effect?
Siblings within the same family.
Serial effect?
= when observations made closer together in time or space tend to be more similar.
- Similar to autocorrelation.
Consequences of both effects? (3)
- We overestimate the effective n: evidence for a difference appears stronger than it really is.
- SEs & 95% CIs are too narrow.
- t-ratios are too large & p is too small.
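The three consequences above can be seen in a small simulation (assumed illustration): generate serially correlated (AR(1)) data, then compare the naive SE formula s/√n — which pretends the observations are independent — against the true variability of the sample mean.

```python
# Serial effect demo: with positive autocorrelation, the naive SE
# understates the true SD of the sample mean, so CIs are too narrow
# and t-ratios too large. phi and the sizes are arbitrary choices.
import random
import statistics

random.seed(7)

def ar1_series(n, phi=0.6):
    # Each observation depends on the previous one (serial dependence).
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + random.gauss(0, 1)
        out.append(x)
    return out

N, REPS = 100, 2000
means, naive_ses = [], []
for _ in range(REPS):
    s = ar1_series(N)
    means.append(statistics.fmean(s))
    naive_ses.append(statistics.stdev(s) / N ** 0.5)  # assumes independence

true_sd_of_mean = statistics.stdev(means)
avg_naive_se = statistics.fmean(naive_ses)
print(f"true SD of the mean: {true_sd_of_mean:.3f}")
print(f"average naive SE:    {avg_naive_se:.3f}")  # noticeably smaller
```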
Resistance of 2-sample t-tools?
= deals with outliers & resistant statistics.
Outlier?
= observation that is far from the average of the group.
Outlier attributes? (2)
- Produces long tails (t-test is unreliable).
- Could be caused by contamination.
Resistant statistic?
= value that doesn’t change much when a small part of the data changes, perhaps drastically.
Eg of a resistant statistic?
The median.
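A two-line demonstration of resistance (made-up numbers): replace one ordinary value with a wild outlier and watch the median stay put while the mean — which the t-tools rely on — jumps.

```python
# Resistance demo: the median barely moves under contamination;
# the mean is dragged toward the outlier.
import statistics

data = [4, 5, 5, 6, 6, 7, 8]
contaminated = data[:-1] + [800]  # one value replaced by a wild outlier

print(statistics.mean(data), statistics.median(data))                # ~5.86, 6
print(statistics.mean(contaminated), statistics.median(contaminated))  # 119.0, 6
```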
NB for these values? (3)
- t-tools based on averages are not resistant to these sorts of extreme values.
- We want the averages to be good representations of the groups.
- We don’t want 1-2 values to drive the outcome.
Strategies for the 2-sample problems (i.e., robustness & resistance)? (4)
- Consider serial or cluster effects. Groups? Repeated measures? Spatial/temporal dependence?
- Use plots, compare samples, i.e., evaluate the suitability of the t-tools.
- Consider transformations.
- Consider alternative methods that don’t depend on normality.
How do we deal with outliers? (2)
(1) Data recording/entry error?
(2) Is the outlier genuinely weird?
Data recording/entry error? (2)
- If on paper: cross out & correct (don't erase or use correction fluid).
- If electronic: make a new data file & correct mistakes.
Is the outlier genuinely weird? (2)
- Use resistant analyses.
OR
- Provide a defensible reason to keep or drop the outlier.
Eg of how to deal with outliers?
Agent Orange example.
Agent Orange example? (6)
(1) Identify outliers visually. Remove them one at a time & replot after each removal.
(2) Compute the relevant statistical measures (p-values in this case) with & without the outliers: full dataset, case 646 removed, & cases 645 and 646 removed:
- Full dataset: p = 0.40.
- 1 case removed: p = 0.48.
- 2 cases removed: p = 0.54.
Therefore, no change in conclusion with or without the outliers.
(3) If conclusion is not affected by the outliers, report analysis with full data set.
(4) But examine outliers carefully to see what else can be learned.
(5) If the analyses give different answers, examine the suspect cases & consider what to do with them.
(6) Do they belong to the statistical population?
- Yes: report without suspects, report reason for dropping suspects.
- No: use resistant analyses, report both analyses (don’t use t-tests).
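Steps (1)–(3) of this procedure can be sketched in code. The data below are fabricated (not the Agent Orange measurements), and the p-value uses a normal approximation to the t-ratio; in practice one would use a t reference distribution (e.g., `scipy.stats.ttest_ind`).

```python
# Outlier-sensitivity sketch: recompute a two-sample (Welch) t-ratio
# with the full data, then with 1 and 2 of the largest values removed,
# and check whether the conclusion changes. Data are invented.
import statistics
from statistics import NormalDist

def welch_t(a, b):
    # Welch two-sample t-ratio (unequal variances allowed).
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def approx_p(t):
    # Two-sided p via a normal approximation to the t-ratio.
    return 2 * (1 - NormalDist().cdf(abs(t)))

group_a = [1.2, 1.5, 1.8, 2.0, 2.1, 2.3, 2.4, 9.9, 12.0]  # two suspect values
group_b = [1.1, 1.4, 1.6, 1.9, 2.0, 2.2, 2.5, 2.6, 2.8]

for removed in range(3):  # full data, 1 removed, 2 removed
    kept = sorted(group_a)[: len(group_a) - removed]
    t = welch_t(kept, group_b)
    print(f"{removed} outlier(s) removed: t = {t:.2f}, approx p = {approx_p(t):.2f}")
```

If the p-values on all three lines lead to the same conclusion, report the full-data analysis (step 3); if not, proceed to steps (5)–(6).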