Tests and Distributions (Basic Stats) Flashcards
Null Hypothesis
“no effect”
Alternative Hypothesis
Some real effect
Test Statistic
Standardized difference
Obtain p-value from the…
sampling distribution (if we took lots of samples and H_0 were true, what would the [theoretical] distribution of the test statistic be?)
General framework of a test:
- State null and alternative hypotheses
- Calculate test statistic
- Obtain p-value from sampling distribution
- Make conclusion
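The four steps above can be sketched for a two-sample t-test. This is a minimal sketch, not a definitive implementation: it assumes equal variances (pooled t) and a 5% significance level, and the data and the helper names `pooled_t`, `t_density`, and `two_sided_p` are made up for illustration. The p-value is found by numerically integrating the t density, so only the standard library is needed.

```python
import math
import statistics

def pooled_t(a, b):
    """Return (t, df) for a two-sample t-test with pooled variance."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

def t_density(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    scale = math.exp(log_c) / math.sqrt(df * math.pi)
    return scale * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) under H_0, integrating the density with Simpson's rule."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps  # steps must be even for Simpson's rule
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * total * h / 3  # P(-t < T < t), by symmetry
    return 1 - central_area

# Hypothetical data, made up for illustration
treatment = [5.1, 4.9, 6.2, 5.8, 6.0]
control = [4.2, 4.8, 4.5, 5.0, 4.4]

# 1. H_0: the population means are equal; H_A: they differ
t_stat, df = pooled_t(treatment, control)  # 2. test statistic
p_value = two_sided_p(t_stat, df)          # 3. p-value from the t sampling distribution
reject_h0 = p_value < 0.05                 # 4. conclusion at the (assumed) 5% level
print(f"t = {t_stat:.2f}, df = {df}, p = {p_value:.4f}, reject H_0: {reject_h0}")
```

The p-value here is only trustworthy if the normality assumption behind the t-distribution holds.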
“Under H_0” means…
“if H_0 is true”
p-value is:
probability of observing data as extreme as ours (or more extreme) just by chance, given that H_0 is true.
When is the t-distribution appropriate?
When the model assumptions are met:
- need to check the distribution for normality
(If the underlying distribution isn’t normal, then the distribution the p-value comes from will be incorrect, and the p-value will be erroneous!)
What graphical checks can we do to check normality?
- Boxplot
- Histogram
- Normal Probability Plot (Q-Q plot) (compare observed values to the values expected under a normal distribution)
Boxplots don’t really test for…
outliers
What is a Kernel?
Basically a smoothed histogram (a kernel density estimate)
What shape is a Q-Q plot if the distribution is short-tailed?
S-shaped!
What shape is a Q-Q plot if the distribution is long-tailed?
Inverted S (like x^3 graph)
What shape is a Q-Q plot if the distribution is right-skewed?
Like the e^x function (curves upward toward infinity)
What shape is a Q-Q plot if the distribution is left-skewed?
Like the sqrt(x) function (curves toward a steady value)
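The shapes on these cards can be checked by computing the Q-Q points directly. The sketch below assumes expected normal quantiles on the x-axis and observed values on the y-axis (the convention under which the shapes above hold), and uses the plotting positions (i + 0.5)/n, which is one common choice among several. The function name `qq_points` and the sample are hypothetical.

```python
import math
from statistics import NormalDist

def qq_points(data):
    """Pair each sorted observation with its expected standard-normal quantile."""
    n = len(data)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(data)))

# Idealised right-skewed sample (hypothetical): exact lognormal quantiles,
# so the Q-Q points lie exactly on the curve y = e^x.
n = 15
z = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
right_skewed = [math.exp(v) for v in z]

# Print (expected, observed) pairs; plotted, they bend upward like e^x.
for expected, observed in qq_points(right_skewed):
    print(f"{expected:6.2f} {observed:6.2f}")
```

For normal data the same points would fall close to a straight line.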
What do we do if normality assumptions are violated?
- Transform the response variable
- or try a non-parametric test (i.e., no distribution is assumed); these are usually less statistically powerful
- Generate the sampling distribution by the Permutation Approach
What transformations can we try for right-skewed data?
- log
- sqrt
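A quick way to see what these transformations do is to compare the sample skewness before and after. This is a sketch on made-up data: the values are chosen to be exactly symmetric on the log scale, so the log removes the skew completely here, which real data will rarely do. The helper `skewness` is hypothetical and uses the simple moment-based formula.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2^1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: e^-2, e^-1.5, ..., e^2
raw = [math.exp(k / 2) for k in range(-4, 5)]
logged = [math.log(v) for v in raw]
rooted = [math.sqrt(v) for v in raw]

print(f"raw:  {skewness(raw):.2f}")
print(f"sqrt: {skewness(rooted):.2f}")
print(f"log:  {skewness(logged):.2f}")
```

On this data the square root reduces the skew and the log reduces it further, which matches the usual advice that log is the stronger of the two transformations.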
What is the permutation approach? (for t-tests)
Since H_0 says the labels don’t matter, we can randomly rearrange the group labels a LOT of times (or in every possible way) and recompute the t-value each time. This gives the distribution of t-values under H_0 without assuming the underlying distribution is normal.
Then the area in the tails of that distribution, from our observed t-value outward, gives us the p-value!
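The permutation approach can be sketched as follows. This version tries every possible relabelling, which is only feasible for small samples; with larger samples one would draw random relabellings instead. It assumes a pooled (equal-variance) t statistic and a two-sided test, and the data and function names are made up for illustration.

```python
import math
import statistics
from itertools import combinations

def pooled_t(a, b):
    """Two-sample t statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * statistics.variance(a)
                  + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(a) - statistics.mean(b)) / se

def permutation_p_value(a, b):
    """Two-sided p-value from the exact permutation distribution of |t|."""
    pooled = list(a) + list(b)
    n1 = len(a)
    t_obs = abs(pooled_t(a, b))
    t_values = []
    # Each way of choosing which n1 observations get label "a" is one relabelling.
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [v for i, v in enumerate(pooled) if i not in chosen]
        t_values.append(abs(pooled_t(group_a, group_b)))
    # Small tolerance so relabellings that tie with the observed t are counted.
    extreme = sum(t >= t_obs - 1e-12 for t in t_values)
    return extreme / len(t_values)

# Hypothetical data, made up for illustration
treatment = [5.1, 4.9, 6.2, 5.8, 6.0]
control = [4.2, 4.8, 4.5, 5.0, 4.4]
p_value = permutation_p_value(treatment, control)
print(f"permutation p-value = {p_value:.4f}")
```

The observed labelling is itself one of the relabellings, so the p-value can never be exactly zero.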
What is resampling?
When we generate the sampling distribution by repeatedly resampling the data
(e.g., bootstrap sampling (with replacement) or permutation (without replacement))
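The difference between the two kinds of resampling shows up directly in code. This is a minimal sketch using Python's `random` module. The function names, the data, the seed, and the choice of 1000 resamples are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_sample(data, rng):
    """Resample WITH replacement: values can repeat or be left out."""
    return rng.choices(data, k=len(data))

def permuted(data, rng):
    """Resample WITHOUT replacement: same values, shuffled order."""
    return rng.sample(data, k=len(data))

def bootstrap_means(data, n_resamples, rng):
    """Approximate the sampling distribution of the mean by bootstrapping."""
    return [statistics.mean(bootstrap_sample(data, rng)) for _ in range(n_resamples)]

# Hypothetical data, made up for illustration
data = [4.2, 4.8, 4.5, 5.0, 4.4, 5.1, 4.9, 6.2, 5.8, 6.0]
rng = random.Random(42)  # fixed seed so the run is repeatable

print(bootstrap_sample(data, rng))
print(permuted(data, rng))

means = bootstrap_means(data, 1000, rng)
print(f"bootstrap standard error of the mean: {statistics.stdev(means):.3f}")
```

The spread of the bootstrap means estimates the standard error of the sample mean without assuming a particular distribution.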
What are non-parametric tests?
Wilcoxon rank-sum test
Mann-Whitney test (equivalent to the Wilcoxon rank-sum test)
Hypotheses for numerical normality tests?
H_0: the data are normal
H_A: the data are not normal
When to use the Shapiro-Wilk test for normality?
When 10 < n < 2000
When to use the Kolmogorov-Smirnov Test for normality?
When n > 2000