2 - Inference & Hypothesis Testing Flashcards
What happens when you change N (number of observations) and w (bin width) of a histogram?
- As N increases and w decreases, the underlying distribution becomes clearer
- As N -> infinity and w -> 0 we get a smooth curve of an underlying probability distribution
What is used to define the shape of a normal curve?
- sigma squared = population variance (the square of the standard deviation, sigma)
- mu = population mean (average of the data set)
What happens for two N (mu, sigma squared) distributions where sigma squared 1 = sigma squared 2, but mu 1 doesn’t = mu 2?
- The density functions will have the same shape but different locations on the x-axis
- *As long as SD stays the same, the shape will stay the same, but mean is always in the middle of the curve for a normal distribution, so if mean changes the curve will move left or right
What happens for two N (mu, sigma squared) distributions where sigma squared 1 doesn’t = sigma squared 2, but mu 1 = mu 2?
- Density functions will have different shapes but the same position on the x-axis
- *Mean is the same, so middle of the distribution will be the same but shape will change
What is the role of each term in a normal distribution (ex: X, mu, sigma squared, and AUC)?
- X is distributed as N (mu, sigma squared)
- Mu/ mean determines location
- Sigma squared determines distribution’s shape (peakedness and spread)
- AUC = 1, so a distribution w/ a larger sigma squared will be more spread and lower at the mean
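As a rough illustration of how mu and sigma squared act on the curve, the sketch below compares peak heights using Python's standard-library `statistics.NormalDist`. The specific values of mu and sigma are made up for the example and do not come from the cards.

```python
from statistics import NormalDist

# Illustrative parameters only (assumed, not from the flashcards).
narrow = NormalDist(mu=0.0, sigma=1.0)
wide = NormalDist(mu=0.0, sigma=2.0)     # same mean, larger spread
shifted = NormalDist(mu=3.0, sigma=1.0)  # same spread, different mean

# Height of each density at its own mean: 1 / (sigma * sqrt(2 * pi)).
peak_narrow = narrow.pdf(0.0)
peak_wide = wide.pdf(0.0)
peak_shifted = shifted.pdf(3.0)

print(f"peak, mu = 0, sigma = 1: {peak_narrow:.4f}")
print(f"peak, mu = 0, sigma = 2: {peak_wide:.4f}")
print(f"peak, mu = 3, sigma = 1: {peak_shifted:.4f}")
```

Doubling sigma halves the peak because the area under the curve has to stay at 1. Changing only mu leaves the peak height untouched and just moves the curve along the x-axis.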
Describe the normal distribution empirical rule
For normal distributions
- mu +/- sigma contains 68.26% of the observations
- mu +/- 2 sigma contains 95.44% of the observations
- mu +/- 3 sigma contains 99.74% of the observations
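The empirical rule can be checked numerically. This is a minimal sketch using the standard normal CDF from Python's `statistics.NormalDist`; the helper name `coverage` is made up for the example. Exact values may differ from the table-based figures above by about 0.01 percentage points because printed z tables are rounded.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mu = 0, sigma = 1

def coverage(k):
    """Proportion of a normal population lying within mu +/- k * sigma."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"mu +/- {k} sigma: {coverage(k) * 100:.2f}%")
```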
What are considered unrepresentative or atypical values for distribution?
Values outside mu +/- 3 sigma (the interval itself contains 99.74% of the study population, so only about 0.26% fall outside it)
What are considered somewhat representative values for distribution?
- Between representative (typical) and unrepresentative (atypical) values
- Fall within mu +/- 2 sigma (which includes 95.44% of the study population)
Describe the “line in the sand” for distribution
- Unrepresentative is generally accepted as 5% of a set of data
- The middle 95% (representative and somewhat representative values) is defined by mu +/- 1.96 sigma
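One way to see where 1.96 comes from is to ask which z value leaves 2.5% in each tail. The sketch below uses `NormalDist.inv_cdf` from the Python standard library; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05  # proportion treated as unrepresentative, split across both tails

# z value with (1 - alpha / 2) of the area to its left.
z_cut = NormalDist().inv_cdf(1 - alpha / 2)

print(f"cut-off: mu +/- {z_cut:.2f} sigma")
```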
What is the z-score?
z = [x - mu] / sigma
- x = value from the data set
- mu = mean of the data set
- sigma = SD of the data set
- z = z-score (# of standard deviations above or below the mean for a given x value)
- x = mu + sigma * z
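A minimal sketch of both directions of the z-score formula. The function names and the example numbers (mean 100, SD 15) are assumptions made for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

def value_from_z(z, mu, sigma):
    """Inverse of z_score: recover the raw value from a z-score."""
    return mu + sigma * z

# Hypothetical data set with mean 100 and SD 15.
z = z_score(130, mu=100, sigma=15)
x = value_from_z(z, mu=100, sigma=15)

print(z)
print(x)
```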
What is the purpose of a z-score?
- To compare data from different data sets (different studies) – gives a common ground for us to make comparisons
- If value is negative, that means it is left of the mean; positive is to the right
What do you do if you want to know the probability that a normal deviate z might lie between - infinity and a z of 1.96?
- Pr (- infinity < z < 1.96) = Pr (0 < z < 1.96) + Pr (- infinity < z < 0) by symmetry
- Use 0 to Z table to find Pr (0 < z < 1.96)
- Pr (- infinity < z < 0) is the whole left half of the curve, so it equals 50%
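The two pieces can be added up numerically. In this sketch the standard normal CDF from `statistics.NormalDist` stands in for the printed 0 to Z table.

```python
from statistics import NormalDist

std_normal = NormalDist()

left_half = std_normal.cdf(0.0)                          # Pr(-infinity < z < 0)
zero_to_z = std_normal.cdf(1.96) - std_normal.cdf(0.0)   # Pr(0 < z < 1.96), the table entry
total = left_half + zero_to_z                            # Pr(-infinity < z < 1.96)

print(f"{left_half:.4f} + {zero_to_z:.4f} = {total:.4f}")
```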
What does p < 0.05 mean?
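If there were truly no effect, the probability of getting a result at least this extreme is less than 5%; by convention the result is then called statistically significant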
What is the difference between a false positive and a false negative in statistics?
- False positive = stats say something is going on when in reality there isn’t
- False negative = something is actually happening but stats say there isn’t
What is the difference between a type 1 and type 2 error? What could be a cause of each type?
- Type 1 = false positive (ex: stats say there is a difference when there really isn’t); could be caused by non-normal data distribution analyzed w/ parametric statistics
- Type 2 = false negative (ex: stats say there is no difference when there really is); usually caused by small sample sizes
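Both error types can be illustrated with a small simulation. This is only a sketch under stated assumptions: normally distributed data, a known sigma of 1, a two-sided z-test at alpha = 0.05, and made-up effect and sample sizes. It shows the sample-size cause of type 2 errors; it does not model the non-normal-data cause of type 1 errors.

```python
import random
from statistics import NormalDist, fmean

random.seed(0)  # fixed seed so the sketch is repeatable

def rejects_null(sample, mu0, sigma, alpha=0.05):
    """Two-sided z-test of H0: population mean == mu0, with sigma assumed known."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return p < alpha

def rejection_rate(true_mu, n, trials=2000):
    """Fraction of simulated studies in which H0: mean == 0 is rejected."""
    hits = 0
    for _ in range(trials):
        sample = [random.gauss(true_mu, 1.0) for _ in range(n)]
        if rejects_null(sample, mu0=0.0, sigma=1.0):
            hits += 1
    return hits / trials

# No real effect: every rejection is a false positive (type 1 error).
type1_rate = rejection_rate(true_mu=0.0, n=20)

# Real effect of 0.5 SD: every non-rejection is a false negative (type 2 error).
type2_small_n = 1 - rejection_rate(true_mu=0.5, n=10)
type2_large_n = 1 - rejection_rate(true_mu=0.5, n=100)

print(f"type 1 rate: {type1_rate:.3f}")
print(f"type 2 rate, n = 10: {type2_small_n:.3f}")
print(f"type 2 rate, n = 100: {type2_large_n:.3f}")
```

With the test's assumptions met, the type 1 rate should sit near alpha, while the type 2 rate should drop sharply as the sample grows.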
The possibility of statistical significance increases as _____
- Sample size increases (w/ a large enough sample, over 2000 people, the smallest difference or correlation is likely to be statistically significant)
- Differences between means or strengths of correlations increase
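A short sketch of both effects, assuming a one-sample two-sided z-test with known sigma. The function name `two_sided_p` and all numbers are illustrative, not taken from the cards.

```python
from statistics import NormalDist

def two_sided_p(diff, sigma, n):
    """p-value for an observed mean difference `diff` from the null value."""
    z = diff / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same small difference (0.1 SD), growing sample size.
for n in (20, 200, 2000):
    print(f"n = {n}: p = {two_sided_p(diff=0.1, sigma=1.0, n=n):.4f}")

# Same sample size, larger difference.
print(f"diff = 0.5, n = 20: p = {two_sided_p(diff=0.5, sigma=1.0, n=20):.4f}")
```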
Describe properties of sampling distributions
- Variability of the random sampling distribution depends on the sample size n, and the variability of the population sigma
- Larger sample size = smaller standard error (sigma / square root of n)
- Sample means on larger samples are more trustworthy
- Smaller variability in population (sigma) = smaller variability we would expect in the sample
How can you increase confidence in the estimator?
- Use larger sample sizes
- Reduce variability in population by improving the sensitivity of the measurement
What is standard error of the mean?
- Describes the variability of a sampling distribution
- Aka standard deviation of the sampling distribution of means
- Standard deviation = variability of individual observations
- sigma x-bar (standard error of the mean) = sigma / square root of n (same units as the data)
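A minimal sketch of the standard error formula, showing how it shrinks as n grows. The population SD of 10 and the function name are hypothetical.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)

# Hypothetical population SD of 10, increasing sample sizes.
for n in (4, 25, 100):
    print(f"n = {n}: SE = {standard_error(10.0, n)}")
```

Because n sits under a square root, quadrupling the sample size only halves the standard error.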
Define confidence interval
- Range of values used to estimate the true value of the population parameter
- Confidence level = 1 - alpha (usually expressed as a %): the proportion of times that a CI built this way actually contains the population parameter
- Establishes the precision of (our confidence in) our estimate of mu
What happens when alpha decreases?
Confidence increases but precision decreases (the CI widens)
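The trade-off can be sketched with a z-based confidence interval for a mean, assuming sigma is known. The sample mean, sigma, and n below are made-up values.

```python
from statistics import NormalDist

def confidence_interval(x_bar, sigma, n, alpha):
    """(1 - alpha) CI for mu: x_bar +/- z * sigma / sqrt(n), sigma assumed known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / n ** 0.5
    return x_bar - margin, x_bar + margin

# Hypothetical sample: mean 100, sigma 15, n = 36.
ci_95 = confidence_interval(100.0, 15.0, 36, alpha=0.05)
ci_99 = confidence_interval(100.0, 15.0, 36, alpha=0.01)

print(f"95% CI: {ci_95[0]:.2f} to {ci_95[1]:.2f}")
print(f"99% CI: {ci_99[0]:.2f} to {ci_99[1]:.2f}")
```

Lowering alpha from 0.05 to 0.01 raises the z multiplier, so the interval gets wider for the same data.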