Chapter 24-25 Flashcards
The Gaussian Distribution Is an Unreachable Ideal
It is a symmetrical distribution • It extends infinitely in both directions • You may know that one or both of these traits is impossible for your data.
However, tests based on the Gaussian distribution are fairly robust to violations of that assumption if the sample size is large.
- They perform well with data from a variety of distributions.
When sample size is small, it is hard to tell what kind of distribution it came from.
Larger samples more closely approximate the source population but still don’t look perfectly Gaussian.
What a Gaussian Distribution Really Looks Like
Small samples shown with scatter plots. • Some may be more likely than others, but all came from a Gaussian population.
When sample size is small, it is hard to tell what kind of distribution it came from.
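A minimal sketch of why small samples are hard to judge, using only the Python standard library. It draws many samples from the same Gaussian population and measures how much the sample skewness varies from sample to sample. The function names, the seed, and the choice of skewness as the summary are assumptions made for illustration, not part of the course material.

```python
import random
import statistics


def sample_skewness(values):
    """Return the (biased) sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


def skewness_spread(sample_size, n_samples=200, seed=1):
    """Standard deviation of sample skewness across Gaussian samples."""
    rng = random.Random(seed)
    skews = []
    for _ in range(n_samples):
        # Every sample comes from the same Gaussian population (mean 0, SD 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        skews.append(sample_skewness(sample))
    return statistics.stdev(skews)


spread_small = skewness_spread(sample_size=10)
spread_large = skewness_spread(sample_size=1000)
print(f"n=10:   spread of sample skewness = {spread_small:.2f}")
print(f"n=1000: spread of sample skewness = {spread_large:.2f}")
```

The population skewness is zero in both cases, so any skewness seen in a sample is sampling noise; the noise is much larger for the small samples.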
There are several types of normality tests that ask:
How much skewness? • Negative skew? • Positive skew? • How much kurtosis? • How peaked is it?
Test: if you randomly sample from a Gaussian population, what is the probability of obtaining a sample that deviates from Gaussian as much as or more than this one?
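The question on this card can be answered directly by simulation. The sketch below is an assumption-laden illustration rather than any standard named test: it measures deviation by absolute sample skewness only, simulates Gaussian samples of the same size, and reports the fraction that deviate at least as much as the observed sample. The function names and the two example data sets are made up.

```python
import random
import statistics


def abs_skewness(values):
    """Absolute (biased) sample skewness, used here as the deviation measure."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return abs(m3 / m2 ** 1.5)


def monte_carlo_normality_p(sample, n_sim=2000, seed=0):
    """Fraction of simulated Gaussian samples at least as skewed as `sample`."""
    rng = random.Random(seed)
    observed = abs_skewness(sample)
    n = len(sample)
    at_least_as_extreme = 0
    for _ in range(n_sim):
        # Skewness ignores location and scale, so N(0, 1) is sufficient.
        simulated = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs_skewness(simulated) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sim


# Made-up example data: one strongly right-skewed set, one symmetric set.
skewed = [1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 9, 15, 30, 60]
symmetric = [-3, -2, -1, 0, 1, 2, 3]

p_skewed = monte_carlo_normality_p(skewed)
p_symmetric = monte_carlo_normality_p(symmetric)
print(f"skewed sample:    p = {p_skewed:.4f}")
print(f"symmetric sample: p = {p_symmetric:.4f}")
```

Real normality tests combine skewness, kurtosis, or other measures of shape, but the logic of the P value is the same.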
With a large sample size, you are very likely to reject the hypothesis that the sample came from a normal distribution, because most data don’t come from one and a large sample can detect even small deviations.
With a small sample size, you are unlikely to reject the hypothesis even if the population is very different from normal.
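A rough simulation of the sample-size effect, under several assumptions: the population is a mildly skewed gamma distribution (shape 20), the normality test is the Jarque-Bera statistic with its large-sample chi-square approximation (which is known to be inaccurate for small samples), and the sample sizes 10 and 2000 are arbitrary. None of these choices come from the course material.

```python
import math
import random
import statistics


def jarque_bera_p(values):
    """Approximate P value of the Jarque-Bera normality test."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skew ** 2 + excess_kurtosis ** 2 / 4.0)
    # Survival function of a chi-square distribution with 2 degrees of freedom.
    return math.exp(-jb / 2.0)


def rejection_rate(sample_size, n_trials=200, alpha=0.05, seed=2):
    """Fraction of samples from a non-Gaussian population that are rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Gamma with shape 20 is close to Gaussian but slightly right-skewed.
        sample = [rng.gammavariate(20.0, 1.0) for _ in range(sample_size)]
        if jarque_bera_p(sample) < alpha:
            rejections += 1
    return rejections / n_trials


rate_small = rejection_rate(sample_size=10)
rate_large = rejection_rate(sample_size=2000)
print(f"n=10:   normality rejected in {rate_small:.0%} of samples")
print(f"n=2000: normality rejected in {rate_large:.0%} of samples")
```

The population is identical in both runs; only the sample size changes how often the test rejects.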
Comparing ranks
- Converts observed values to ranks only
- Has the effect of downweighting outliers
Note that the median is the same middle sample in both. But the mean depends on the distribution (it could be higher or lower than the median).
Problem with comparing ranks:
You are forced to ignore one aspect of your data (how much the values differ from each other). You are only looking at what order they fall into.
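A small sketch of what converting to ranks does, with made-up numbers. The `ranks` helper is a hypothetical function written for this illustration (ties receive the average of the positions they occupy). The two data sets differ only in how extreme the largest value is, so their means differ greatly while their ranks are identical.

```python
import statistics


def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


# Made-up data: identical except for the size of the largest value.
with_outlier = [3.1, 2.7, 3.4, 2.9, 250.0]
without_outlier = [3.1, 2.7, 3.4, 2.9, 3.5]

print(ranks(with_outlier), statistics.fmean(with_outlier))
print(ranks(without_outlier), statistics.fmean(without_outlier))
```

The outlier is downweighted because it is just "rank 5" in both sets, but for the same reason the ranks cannot tell the two data sets apart.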
Without removing values:
• Resampling: bootstrap
• Randomizing: permutations • e.g. when comparing a control group to a treatment group, randomly reassign the values to each category. Look to see whether the real set of observations is very unusual when compared to many different randomized versions.
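Both ideas can be sketched in a few lines of standard-library Python. This is an illustrative sketch under assumptions: the data are made up, the permutation test uses the absolute difference in group means as its statistic, and the bootstrap uses the simple percentile method for the confidence interval of a mean. The function names and defaults are hypothetical.

```python
import random
import statistics


def permutation_p_value(control, treatment, n_perm=5000, seed=3):
    """Two-sided permutation P value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(treatment) - statistics.fmean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    count = 0
    for _ in range(n_perm):
        # Randomly reassign the pooled values to the two categories.
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[n_control:])
            - statistics.fmean(pooled[:n_control])
        )
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / n_perm


def bootstrap_mean_ci(sample, n_boot=5000, level=0.95, seed=4):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, keeping the original sample size.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    )
    lower_index = int((1 - level) / 2 * n_boot)
    upper_index = int((1 + level) / 2 * n_boot) - 1
    return means[lower_index], means[upper_index]


# Made-up measurements for illustration only.
control = [10, 11, 9, 10, 12, 11, 10, 9]
treatment = [15, 16, 14, 15, 17, 16, 15, 14]

p_value = permutation_p_value(control, treatment)
ci_low, ci_high = bootstrap_mean_ci(control)
print(f"permutation P value: {p_value:.4f}")
print(f"bootstrap 95% CI for control mean: {ci_low:.2f} to {ci_high:.2f}")
```

Neither method converts the values to ranks or assumes a Gaussian population; the cost is the repeated computation.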
Problems With Nonparametric Options
• Rank-based methods discard data
• Like throwing out outliers
• Randomization and resampling are computer intensive (not much of a problem anymore).
• Nonparametric methods have less power than parametric (based on a Gaussian distribution).
• i.e. you’ll need higher sample size
• On the other hand, maybe assuming Gaussian is akin to “making up data.”
• The problems with parametric methods decrease as sample size increases anyway