Ch8 - Confidence Intervals, Effect Size, and Statistical Power Flashcards
What are the new statistics?
- Effect sizes
- Confidence intervals
- Meta-analysis
Point estimate
Confidence Intervals
- A summary statistic from a sample that is just one number used as an estimate of the population parameter - “best guess”
- The true population mean is unknown - and we take a sample from the population to estimate the population mean
- EX: In studies on gender differences in math performance - the mean for boys, the mean for girls, and the difference between them, are point estimates
Interval estimate
Confidence Intervals
- Based on a sample statistic and provides a range of plausible values for the population parameter
- Frequently used by the media, often when reporting political polls, and are usually constructed by adding and subtracting a margin of error from a point estimate
What is the interval estimate composed of (EQUATION)?
Confidence Intervals
interval estimate = point estimate ± margin of error
Confidence intervals: we’re not saying that we’re confident that the population mean falls in the interval, but rather…
Confidence Intervals
we are merely saying that we expect to find the population mean within a certain interval a certain percentage of the time - usually 95% - when we conduct this same study with the same sample size
Confidence level vs. interval:
Confidence Intervals
- Level - the %
- Interval - the range between the two values that surround the sample mean
Calculating confidence intervals with distributions
Confidence Intervals
- Draw a normal curve that has the sample mean at its center (NOTE: different from curve drawn for z test, where we had population mean at the center)
- Indicate the bounds of the confidence interval on the drawing
- Determine the z statistics that fall at each line marking the middle of 95%
- Turn the z statistics back into raw means
- Check that the confidence interval makes sense
Step 1 to calculating CI
Confidence Intervals
Draw a normal curve that has the sample mean at its center (NOTE: different from curve drawn for z test, where we had population mean at the center)
Step 2 to calculating CI
Confidence Intervals
- **2: Indicate the bounds of the confidence interval on the drawing**
- Draw a vertical line from the mean to the top of the curve
- For a 95% confidence interval we also draw two small vertical lines to indicate the middle 95% of the normal curve (2.5% in each tail, for a total of 5%)
- The curve is symmetric, so half of the 95% falls above and half falls below the mean
- Half of 95% = 47.5%, represented in the segments on either side of the mean
Step 3 to calculating CI
Confidence Intervals
3. Determine the z statistics that fall at each line marking the middle of 95%
- To do so: turn back to the z table
- The % between the mean and each of the scores is 47.5% - when we look up this % in the z table, we find a statistic of 1.96
- Can now add the z statistics of -1.96 and 1.96 to the curve
Step 4 to calculating CI
Confidence Intervals
4. Turn the z statistics back into raw means
- Need to identify appropriate mean and SD to use formula
- Two important points to remember:
- Center the interval around the sample mean (not the population mean), so use the sample mean in the calculation
- Because we have a sample mean (rather than an individual score), we use a distribution of means - so we calculate standard error as the measure of spread:
Step 5 to calculating CI
Confidence Intervals
5. Check that the confidence interval makes sense
* The sample mean should fall exactly in the middle of the two ends of the interval
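The five steps above can be turned into a short calculation. A minimal sketch using only Python's standard library; the sample mean, population SD, and sample size below are hypothetical:

```python
import math

# Hypothetical values: sample mean M, known population SD sigma, sample size N
M, sigma, N = 105.0, 15.0, 30

se = sigma / math.sqrt(N)  # Step 4: standard error for a distribution of means
z_crit = 1.96              # Step 3: z statistics marking the middle 95%

lower = M - z_crit * se    # Step 4: turn the z statistics back into raw means
upper = M + z_crit * se

# Step 5: the sample mean should fall exactly midway between the two bounds
assert abs((lower + upper) / 2 - M) < 1e-9
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```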
Statistically significant doesn’t/does mean…
- Does NOT mean that the findings from a study represent a meaningful difference
- ONLY means that those findings are unlikely to occur if, in fact, the null hypothesis is true
How does an increase in sample size affect standard error and the test statistic? What does this cause?
The effect of sample size on statistical significance
- Each time we increase the sample size, the standard error decreases and the test statistic increases
- Because of this, a small difference might not be statistically significant with a small sample but might be statistically significant with a large sample
Why would a large sample allow us to reject the null hypothesis when a small sample would not? (EXAMPLE)
If we randomly selected 5 women and they had a mean score well above the OkCupid average, we might say “it could be chance”; but if we randomly selected 1000 women with a mean rating well above the OkCupid average, it’s very unlikely that we just happened to choose 1000 people with high scores
Effect size
- Indicates the size of a difference and is unaffected by sample size
- Can tell us whether a statistically significant difference might also be an important difference
- Tells us how much two populations DO NOT overlap - the less overlap, the bigger the effect size
- DECREASING OVERLAP IS IDEAL!
How can the amount of overlap between two distributions be decreased? TWO WAYS:
1: overlap decreases and effect size increases when means are farther apart (distance wise)
2: overlap decreases and effect size increases when variability within each distribution of scores is smaller (height of peak)
How does effect size differ from statistical hypothesis testing?
Unlike statistical hypothesis testing, effect size is a standardized measure based on distributions of scores rather than distributions of means
* Rather than σ_M = σ/√N, effect sizes are based only on the variability in the distribution of scores and do not depend on sample size
Since effect sizes are not dependent on sample size, what does this allow us to do?
This means we can compare the effect sizes of different studies with each other, even when the studies have different sample sizes
When we conduct a z-test, the effect size is typically
Cohen’s D: a measure of effect size that expresses the difference between two means in terms of SD
* AKA, Cohen’s d is the standardized difference between two means
Formula for Cohen’s d for a z statistic:
d = (M − μ)/σ
- Similar to the z statistic, but with σ in place of σ_M (and μ in place of μ_M)
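As a sketch, the formula translates directly into code; the sample mean, population mean, and SD below are hypothetical (an IQ-style scale):

```python
# Sketch of Cohen's d for a z test: d = (M - mu) / sigma
def cohens_d(M: float, mu: float, sigma: float) -> float:
    """Standardized difference between a sample mean and the population mean."""
    return (M - mu) / sigma

d = cohens_d(M=105.0, mu=100.0, sigma=15.0)
print(round(d, 3))  # about 0.33 - between Cohen's small and medium guidelines
```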
With the results, we can determine (from Cohen’s 3 guidelines)…
Small, Medium, Large Effects
- Small effects: 0.2 | 85% overlap
- Medium effects: 0.5 | 67% overlap
- Large effects: 0.8 | 53% overlap
Does an effect need to be large to be meaningful?
Just because a statistically significant difference is small, that does not necessarily suggest no meaning; interpreting the meaningfulness of the effect sizes depends on the context
Meta-analysis:
Meta-analysis
- a study that involves the calculation of a mean effect size from the individual effect sizes of more than one study
How does a meta-analysis improve statistical power?
By considering multiple studies simultaneously; this also helps to resolve debates fueled by contradictory research findings
4 steps to calculating meta-analyses:
1: select the topic of interest and decide exactly how to proceed before beginning to track down studies
2: locate every study that has been conducted and meets criteria
3: calculate an effect size, often Cohen’s d, for every study
4: calculate statistics - ideally, summary statistics, a hypothesis test, a confidence interval, and a visual display of the effect sizes
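Step 4 can be sketched with made-up numbers. A full meta-analysis often weights each study's d by its sample size; the unweighted mean below is the simplest version of "a mean effect size from the individual effect sizes":

```python
from statistics import mean, stdev

# Hypothetical Cohen's d values from five studies on the same topic
study_ds = [0.31, 0.45, 0.28, 0.52, 0.39]

mean_d = mean(study_ds)  # the summary effect size
sd_d = stdev(study_ds)   # spread of effect sizes across studies
print(f"mean effect size = {mean_d:.2f} (SD = {sd_d:.2f})")
```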
Considerations to keep in mind:
1: select the topic of interest and decide exactly how to proceed before beginning to track down studies
- Make sure the necessary statistical information is available, either effect sizes or the summary stats necessary to calculate effect sizes
- Consider selecting only studies in which participants meet certain criteria, such as age, gender, or geographic location
- Consider eliminating studies based on the research design (EX: as they were not experimental in nature)
Key part involves finding…
2: locate every study that has been conducted and meets criteria
…any studies that have been conducted but not published
* Much of this “fugitive literature” or “gray literature” is unpublished simply because the studies did not find a significant difference; the overall effect size seems larger without accounting for these studies - AKA the “file drawer problem”
* Can find by using other sources - like contacting researchers to find unpublished work
“File drawer problem” - 2 solutions
2: locate every study that has been conducted and meets criteria
1: File drawer analysis: a statistical calculation, following a meta-analysis, of the number of studies with null results that would have to exist so that a mean effect size would no longer be statistically significant
* If just a few studies could render a mean effect size nonsignificant (no longer statistically significantly different from zero) then the mean effect size should be viewed as likely to be an inflated estimate
* If it would take several hundred studies in researchers’ “file drawers” to render the effect non-significant, then it’s safe to conclude that there really is a significant effect
2: Can work with replication to help draw more reliable conclusions
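Solution 1 can be sketched numerically. One classic version of a file drawer analysis is Rosenthal's fail-safe N; the per-study z values below are invented, and 2.706 in the denominator is 1.645 squared (the one-tailed critical z at alpha = .05):

```python
def fail_safe_n(study_zs):
    """Estimate how many filed-away null-result studies would be needed
    to drag the combined result below statistical significance."""
    k = len(study_zs)
    total = sum(study_zs)
    return (total ** 2) / (1.645 ** 2) - k

# Four hypothetical studies, each with a significant one-tailed z
print(round(fail_safe_n([2.1, 1.8, 2.5, 1.9]), 1))
```

A small result here (only a handful of unpublished studies needed) would suggest the mean effect size is likely inflated; a result in the hundreds would suggest the effect is robust.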
What visual display can researchers include?
4: calculate statistics - ideally, summary statistics, a hypothesis test, a confidence interval, and a visual display of the effect sizes
Forest plot: a type of graph that shows the confidence interval for the effect size of every study
Statistical power is…
…the likelihood of rejecting the null hypothesis WHEN WE SHOULD reject the null hypothesis
What is the probability that researchers consider the MINIMUM for conducting a study
Statistical power
0.80 - an 80% chance of rejecting the null if we should reject it
* Thus, they perform power analysis prior to conducting a study: if they have an 80% chance of correctly rejecting the null, then it’s appropriate to conduct the study
When we conduct a statistical null hypothesis test, we make a decision to either reject or fail to reject the null hypothesis. One issue being that we don’t have direct access to the truth about what we’re studying - instead…
- We make inferences based on the data we collected; which could be a right or wrong decision
- Overall a researcher’s goal is to be correct as often as possible - 2 ways to be right, and 2 ways to be wrong
What are 2 ways to be WRONG in rejecting/failing to reject the null hypothesis?
2 ways to be wrong - recap: Type I and Type II errors
What are 2 ways to be RIGHT in rejecting/failing to reject the null hypothesis?
1 - Correct decision: if the null is true and we fail to reject the null, we have made the correct decision (essentially leaving the null alone)
- In this case, we’re saying that there’s no effect, when in fact there is none
2 - Correct decision (Power): if the null hypothesis is false, and we reject the null hypothesis, that’s also a correct decision
- A goal of research is to maximize statistical power
Power is used by statisticians in a specific way - HOW?
- Statistical power: a measure of the likelihood that we will reject the null hypothesis, given that the null hypothesis is false
- In other words - statistical power is the probability that we will reject the null hypothesis when we should reject the null hypothesis; THE PROBABILITY THAT WE WILL NOT MAKE A TYPE II ERROR
The calculation of statistical power ranges from:
Probability of 0.00 to 1.00 (AKA 0% to 100%)
Conceptual calculation for power
- Power = effect size x sample size
- This means that we could achieve high power because the effect is large - or because the sample is large, even if the effect is small
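The "power = effect size × sample size" idea can be made concrete. A hedged sketch for a one-tailed z test at alpha = .05 (under H1 the test statistic is centered at d·√N, so power is the area of the H1 curve beyond the critical z); all numbers are illustrative:

```python
import math

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_one_tailed_z(d: float, n: int) -> float:
    """Power for a one-tailed z test at alpha = .05 (critical z = 1.645)."""
    return 1 - norm_cdf(1.645 - d * math.sqrt(n))

# Same medium effect (d = 0.5): more participants, more power
print(round(power_one_tailed_z(0.5, 10), 2))
print(round(power_one_tailed_z(0.5, 50), 2))
```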
The most practical way to increase statistical power for many behavioural studies is…
…to add more participants
How can researchers quantify the statistical power of their studies? 2 WAYS
1: By referring to a published table
2: By using computing tools like G*Power
G*Power
Used in 2 ways:
1: Can calculate power AFTER conducting a study from several pieces of information
- Because we are calculating power after conducting the study, G*Power refers to these calculations as post hoc, meaning after the fact
2: Can use in reverse, BEFORE conducting a study, so as to identify the sample size necessary to achieve a given level of power
- In this case, G*Power refers to calculations as a priori, which means prior to
Of the two, which of post hoc vs priori power is more meaningful?
post hoc power is NOT as meaningful as an a priori power calculation for sample size planning
On a practical level, statistical power calculations tell researchers…
…how many participants are needed to conduct a study whose findings we can trust
Five factors that affect statistical power:
1: Increase alpha
2: Turn a two-tailed hypothesis into a one-tailed hypothesis
3: Increase N/sample size
4: Exaggerate the mean difference between levels of the IV
5: Decrease SD
1: Increase alpha
Five factors that affect statistical power:
- Like changing the rules by widening the goal posts in football, statistical power increases when we raise the alpha level above 0.05 (e.g., to 0.10)
- This has the side effect of increasing the probability of a Type I error from 5% to 10%
2: Turn a two-tailed hypothesis into a one-tailed hypothesis
Five factors that affect statistical power:
- One tailed tests provide more statistical power, while two-tailed tests are more conservative
- However, best to use two-tailed
3: Increase N/sample size
Five factors that affect statistical power:
- Increasing sample size leads to an increase in the test statistic, making it easier to reject the null hypothesis
- Increase => distribution of means become more narrow and there is less overlap (larger sample size means smaller standard error)
4: Exaggerate the mean difference between levels of the IV
Five factors that affect statistical power:
The mean of population 2 is farther from the mean of population 1 in part (b) than in part (a); a difference in means is not easily changed, but it can be done
5: Decrease SD
Five factors that affect statistical power:
When SD is smaller, standard error is smaller and the curves are narrower
We can reduce SD in two ways:
1: by using reliable measures from the beginning of the study
2: by sampling from a more homogenous group in which participants’ responses are more likely to be similar to begin with
LECTURES
CHP8 concepts push beyond the limits of NHST
- Effect size
- Confidence intervals
- Power
Effect size:
CHP8 concepts push beyond the limits of NHST
- If the null is really false, how big is that effect?
- Standardized numerical estimate of the population effect size using our sample data
Confidence intervals:
CHP8 concepts push beyond the limits of NHST
- Starting with the sample mean, compute a range of plausible values for the true population mean
- Helps us prepare for replications
Power
CHP8 concepts push beyond the limits of NHST
- If the null is really false, how likely is it that we’re going to find a “significant effect” in our sample
- If the null is really false, how likely is it that we’re going to avoid a type II error
Effect size: after rejecting the null, we can conclude that…
we think we drew this sample from a different population with a different sampling distribution
Effect size - How can we guess the mean of the population we drew from?
Locate the tallest point in the distribution - the most common score
What does effect size look like, visually?
Distance from the highest peak of one distribution to the other (distance between group means)
How/what does effect size help us estimate?
If we DID draw from a different population, how different is that new population’s mean from the null mean?
What are some different ways to estimate an effect size for different kinds of data?
- How far away is the true mean from the null hypothesis mean?
- How far apart are the experimental and control conditions
- Strength of correlation
- How far from equal (50% each) is the distribution of proportions
Which size of effect is easiest to detect?
smaller effects are HARDER to detect from drawing a single sample; larger effects (thus, larger Cohen’s d) are EASIER to detect
What is one of the many “standardized” indicators of effect size?
- COHEN’S D
- Estimates the population parameter - δ (delta)
How many SD away from the comparison value is our sample group mean?
- d = |M − μ|/σ
- NOTE: the difference in the numerator is taken as an absolute value
What can we calculate to answer: “How likely is it that our class is a random sample from this general population? Or do we likely come from a different population?”
We can use a z test or we can use a confidence interval
All CI’s follow the same pattern…
Subtract the margin of error (critical value × standard error) from the point estimate to find the lower bound; add the margin of error to find the upper bound
Type I error
- H0 is rejected
- H0 is actually true
Correct Decision POWER (1- β)
- H0 is rejected
- H0 is false
Type II error
- Fail to reject H0
- H0 is actually false
Correct Decision (1 - α)
- Retain H0
- H0 is true
What does β mean?
If the null is really false (effect exists), β% of the time we’re going to make a mistake and say there’s no effect
To identify power:
find the % of the curve of the H1 distribution that would lead us to correctly reject the null
If effect size increases, what happens to type I error rate?
- NO CHANGE
If effect size increases, what happens to type II error rate?
- DECREASES
If effect size increases, what happens to power?
- INCREASES
For a priori power analysis, we can ask two questions:
- “If I’m making the assumption that there is an effect to be found, HOW MANY PEOPLE DO I NEED IN MY STUDY?”
- “If I’m limited to N participants, will I have enough power to reject the null hypothesis if I should do so?”
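The first question can be sketched with the standard one-tailed z-test formula N = ((z_alpha + z_power) / d)², here fixed at alpha = .05 and 80% power; the d values are Cohen's guidelines, and everything else is illustrative:

```python
import math

Z_ALPHA = 1.645  # one-tailed critical z for alpha = .05
Z_POWER = 0.84   # z cutting off the lower 80% of the normal curve

def n_needed(d: float) -> int:
    """Smallest N for a one-tailed z test to reach 80% power at effect size d."""
    return math.ceil(((Z_ALPHA + Z_POWER) / d) ** 2)

# Smaller effects demand far larger samples
for label, d in [("small", 0.2), ("medium", 0.5), ("large", 0.8)]:
    print(f"{label} effect (d = {d}): need N = {n_needed(d)}")
```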