Hypothesis Testing Flashcards
Checking if a random variable is a normal random variable
“normal probability plot”
The ordered data values are plotted against the z-scores expected under a theoretical normal distribution; if the data are normal, the points form an approximately straight line.
Axes: ordered value vs. theoretical z-score.
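A minimal sketch of how the points of such a plot could be computed, using only the Python standard library. The plotting position (i - 0.5) / n is one common convention and is an assumption here, as are the function names and the made-up `heights` data.

```python
from statistics import NormalDist, mean

def normal_plot_points(data):
    """Return (theoretical z-score, ordered value) pairs for a normal probability plot."""
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(ordered, start=1)]

def straightness(points):
    """Pearson correlation of the points; a value near 1 means nearly a straight line."""
    zs = [z for z, _ in points]
    xs = [x for _, x in points]
    mz, mx = mean(zs), mean(xs)
    cov = sum((z - mz) * (x - mx) for z, x in zip(zs, xs))
    var_z = sum((z - mz) ** 2 for z in zs)
    var_x = sum((x - mx) ** 2 for x in xs)
    return cov / (var_z * var_x) ** 0.5

heights = [160, 172, 168, 181, 175, 166, 170, 178, 163, 174]  # made-up data
points = normal_plot_points(heights)
r = straightness(points)
```

Plotting `points` with any charting tool gives the normal probability plot; `r` is only a rough numeric summary of how straight the line is.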
Whose performance was more impressive (assuming a normal distribution)?
We need the value, the mean and the standard deviation; from these we can compute the z-score, which shows how many standard deviations away from the mean the value is. The further from the mean (in the good direction), the more impressive :D
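A small worked example of the comparison, with hypothetical exam scores (all numbers and names are invented for illustration):

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / std_dev

# Hypothetical: Alice scored 85 on an exam with mean 70 and standard deviation 10,
# Bob scored 80 on a different exam with mean 65 and standard deviation 5.
alice_z = z_score(85, mean=70, std_dev=10)
bob_z = z_score(80, mean=65, std_dev=5)
more_impressive = "Bob" if bob_z > alice_z else "Alice"
```

Bob has the lower raw score but the higher z-score, so relative to his group his result is the more impressive one.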
PMF and CDF
Can be used for hypothesis testing: they tell us how likely an observed outcome is if the data really were drawn from that distribution.
Binomial distribution and normal distribution
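Sample the random variable n times, assuming it follows that distribution (the null hypothesis).
Use the PMF/CDF to compute how likely an outcome at least as extreme as the observed one is (the p-value).
If the p-value is below 0.05, the result is significant: we reject the hypothesis that the data came from that distribution.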
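As an illustration, here is a sketch of an exact one-sided binomial test built directly from the PMF. The coin-flip scenario (9 heads in 10 flips, null hypothesis of a fair coin) and the function names are assumptions made for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail_p_value(k, n, p):
    """P(X >= k): probability of an outcome at least as extreme as k."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

# Hypothetical observation: 9 heads in 10 flips, null hypothesis p = 0.5.
p_value = upper_tail_p_value(k=9, n=10, p=0.5)
significant = p_value < 0.05
```

Here the p-value is 11/1024 (about 0.011), so at the 0.05 level the fair-coin hypothesis would be rejected.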
Bernoulli & binomial
Bernoulli: a single trial with outcome 0 or 1. Binomial: the number of 1s in n independent Bernoulli trials. Both are discrete.
CLT tells us that if 𝑛 is large, binomial random variables will be distributed approximately normally.
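A quick numerical check of this approximation, as a sketch: it compares the exact binomial CDF with a normal CDF of matching mean and variance. The continuity correction (+0.5) and the chosen values n = 100, p = 0.5, k = 55 are assumptions of the example.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Normal approximation to P(X <= k) with a continuity correction."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(k + 0.5)

exact = binomial_cdf(55, 100, 0.5)
approx = normal_approx_cdf(55, 100, 0.5)
difference = abs(exact - approx)
```

For these values the two probabilities agree to well within one percentage point.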
Central Limit Theorem
CLT
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal. A test based on it is therefore only appropriate for large sample sizes; for small ones use Student’s t-test.
To test a mean: assume the null hypothesis (the population mean equals a value of interest), compute the z-score of the sample mean, and use the normal CDF to get the p-value.
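The procedure can be sketched as a one-sample z-test. This is a minimal version that assumes a large sample, uses the sample standard deviation in place of σ, and reports a two-sided p-value; the function name and the simulated data are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sample_z_test(sample, null_mean):
    """Return (z, two-sided p-value) for H0: population mean == null_mean."""
    standard_error = stdev(sample) / sqrt(len(sample))
    z = (mean(sample) - null_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Simulated data (an assumption): 200 draws from a population with mean 5.3.
rng = random.Random(42)
sample = [rng.gauss(5.3, 1.0) for _ in range(200)]
z, p_value = one_sample_z_test(sample, null_mean=5.0)
```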
Student’s t-test
When n is small, the Central Limit Theorem can no longer be used. In this case, if the samples are drawn from an approximately normal distribution, then the correct distribution to use is called the Student’s t distribution.
It looks like a normal distribution with heavier tails, and it approaches the normal distribution as n grows.
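To make "heavier tails" concrete, the sketch below evaluates the t density from its closed-form formula and compares it with the standard normal density three units from the centre. The degrees of freedom (4 and 200) are arbitrary choices for the illustration.

```python
from math import gamma, pi, sqrt
from statistics import NormalDist

def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    coeff = gamma((dof + 1) / 2) / (sqrt(dof * pi) * gamma(dof / 2))
    return coeff * (1 + x * x / dof) ** (-(dof + 1) / 2)

normal_tail = NormalDist().pdf(3.0)
t_tail_small_n = t_pdf(3.0, dof=4)    # small sample: much more mass in the tail
t_tail_large_n = t_pdf(3.0, dof=200)  # large sample: close to the normal density
```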
error type
+ A type I error is the incorrect rejection of a true null hypothesis (a “false positive”).
+ A type II error is the failure to reject a false null hypothesis (a “false negative”).
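A small simulation can make the type I error rate tangible. In this sketch the null hypothesis is true by construction (data drawn from a normal distribution with mean 0 and known standard deviation 1), so every rejection is a false positive. The trial count, sample size and seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def false_positive_rate(trials=2000, n=30, alpha=0.05, seed=0):
    """Fraction of experiments that reject a null hypothesis that is actually true."""
    rng = random.Random(seed)
    critical_z = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) / (1.0 / sqrt(n))  # known sigma = 1, null mean = 0
        if abs(z) > critical_z:
            rejections += 1  # a type I error
    return rejections / trials

rate = false_positive_rate()
```

The observed rate should land close to the chosen significance level alpha.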
data dredging / p-hacking
The misuse of data analysis to find patterns in data that can be presented as statistically significant, which dramatically increases the risk of false positives (incorrect rejection of the null hypothesis: a falsely significant difference) while understating it. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.
To guard against this: cross-validation, i.e. check that the pattern also holds on data that was not used to find it.
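A short calculation shows why running many tests inflates false positives. It assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical.

```python
def chance_of_false_positive(num_tests, alpha=0.05):
    """P(at least one significant result) when all nulls are true and tests are independent."""
    return 1 - (1 - alpha) ** num_tests

single = chance_of_false_positive(1)
twenty = chance_of_false_positive(20)
```

With one test the chance is 5%; with twenty tests it is roughly 64%, so reporting only the significant one is very likely to report noise.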
inference
Drawing a conclusion about a population from a sample; the sample won’t be a perfect representation of the population, so the conclusion carries uncertainty.
confidence interval
An interval around a parameter estimated from a sample of the full population.
For a mean: take the sample mean and add/subtract the critical z-score times the standard error, x̄ ± z · s/√n (z ≈ 1.96 for 95%).
The 95% confidence interval does not mean that with probability 95%, the true value of 𝜇 lies within the interval.
A 95% confidence interval means that if we were to repeat the same experiment many times, and compute the confidence interval using the same formula, 95% of the time it would contain the true value of 𝜇.
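The repeated-experiment interpretation can be checked with a simulation. This sketch builds a z-based interval from the sample mean and sample standard deviation, then counts how often it contains a known true mean. The population parameters, sample size, number of experiments and seed are all assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """z-based confidence interval for the mean: mean +/- z * s / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(sample) / sqrt(len(sample))
    centre = mean(sample)
    return centre - margin, centre + margin

def coverage(true_mean=10.0, n=50, experiments=1000, seed=1):
    """Fraction of repeated experiments whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        sample = [rng.gauss(true_mean, 2.0) for _ in range(n)]
        low, high = mean_confidence_interval(sample)
        hits += low <= true_mean <= high
    return hits / experiments

observed_coverage = coverage()
```

The observed coverage should come out near 0.95, which is the statement about the procedure, not about any single interval.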
two-sample 𝑡-test
Two independent samples; we compare their means.
For categorical (yes/no) outcomes, compare proportions instead, with a two-proportion z-test.
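A sketch of the statistic behind a two-sample comparison of means. This uses Welch's variant, which does not assume equal variances (a choice made for the example, not something the card specifies). Turning the statistic into a p-value requires the t distribution's CDF, which the standard library lacks; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one commonly used alternative.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two independent samples."""
    na, nb = len(sample_a), len(sample_b)
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, dof

t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])  # toy data
```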
A/B testing
Create 2 versions and show each to a randomly assigned group of people. Test whether there is a statistically significant difference between the 2 groups.
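For a conversion-style A/B test, one common analysis is a two-proportion z-test with a pooled standard error. The sketch below assumes large groups and a two-sided alternative; the visitor and conversion counts are invented.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for H0: both versions have the same success rate."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical: version A converts 100 of 1000 visitors, version B 150 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 150, 1000)
```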
Chi square
Tests whether two categorical variables are independent.
Null hypothesis: the variables are independent.
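A sketch of the test statistic for a contingency table: expected counts under independence are (row total × column total) / grand total. The p-value helper only covers the 2×2 case (1 degree of freedom), where the chi-square tail equals erfc(√(x/2)); larger tables need a chi-square CDF from a statistics library. The table values are made up.

```python
from math import erfc, sqrt

def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def p_value_one_dof(stat):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return erfc(sqrt(stat / 2))

stat, dof = chi_square_statistic([[10, 20], [20, 10]])  # made-up counts
p_value = p_value_one_dof(stat)
```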
ANOVA
Compares the means of 3 or more groups. Null hypothesis: all group means are equal.
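A sketch of the one-way ANOVA F statistic: variance between the group means divided by variance within the groups. Only the statistic is computed here; the p-value comes from the F distribution, which is not in the standard library. The toy groups are an assumption.

```python
from statistics import mean

def anova_f_statistic(groups):
    """One-way ANOVA F statistic: between-group mean square / within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_stat = anova_f_statistic([[1, 2, 3], [2, 3, 4], [3, 4, 5]])  # toy data
```

A large F means the group means differ more than the spread inside the groups would explain.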