Week 2: Hypothesis Testing and Its Implications Flashcards
Which question in the framework are we focusing on in this lecture?
Meets assumption of parametric tests?
Answering the question of whether the data meet the assumptions of parametric tests will determine whether our continuous data can be tested
with parametric or non-parametric tests
A normal distribution is a distribution with the same general shape which is a
bell shape
A normal distribution curve is symmetric around
the mean μ
A normal distribution is defined by two parameters - (2)
the mean (μ) and the standard deviation (σ).
Many statistical tests (parametric) cannot be used if the data is not
normally distributed
What does this diagram show? - (2)
μ = 0 is peak of distribution
Blocked areas under the curve give us insight into how the data are distributed and how likely certain scores are if they belong to a normal distribution, e.g., 34.1% of values lie between the mean and one SD below it
A z score in a standard normal distribution reflects the number of
SDs a particular score lies above or below the mean
How to calculate a z score?
Take a participant’s value (e.g., 56 years old), subtract the mean of the distribution (e.g., the mean age of the class is 23), and divide by the SD (e.g., 2 for the class)
If a person scored a 70 on a test with a mean of 50 and a standard deviation of 10
Converting the test scores to z scores, an X of 70 would be…
What the result means…. - (2)
a z score of 2 means the original score was 2 standard deviations above the mean
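The z score calculation above can be sketched in a few lines of Python. The function name `z_score` is just an illustrative choice; the numbers are the ones from the card (score 70, mean 50, SD 10).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Test score of 70 with a mean of 50 and an SD of 10
print(z_score(70, 50, 10))  # 2.0
```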
We can convert our z scores to
percentiles
Example: What is the percentile rank of a person receiving a score of 90 on the test? - (3)
Mean = 80
SD = 5
First, calculate the z score: the graph shows that most people scored below 90. Since 90 is 2 standard deviations above the mean, z = (90 - 80)/5 = 2
The z score can be converted to a percentile using a table: a z score of 2 is equivalent to the 97.7th percentile
The proportion of people scoring below 90 is thus .977 and the proportion of people scoring above 90 is .023, or 2.3% (1 - 0.977)
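Instead of a printed z-table, the same lookup can be done with Python's standard library. This is only a sketch of the example above (score 90, mean 80, SD 5); `NormalDist().cdf` gives the proportion of a standard normal distribution lying below a given z.

```python
from statistics import NormalDist

z = (90 - 80) / 5              # 2.0
below = NormalDist().cdf(z)    # proportion scoring below 90
above = 1 - below              # proportion scoring above 90

print(round(below, 3), round(above, 3))  # 0.977 0.023
```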
We can not always measure the whole… for a study
population
What is the sample mean?
an unbiased estimate of the population mean.
Example of sample vs population - (3)
You want to study political attitudes in young people.
Your population is the 300,000 undergraduate students in the Netherlands.
Because it’s not practical to collect data from all of them, you use a sample of 300 undergraduate volunteers from three Dutch universities – this is the group who will complete your online survey.
How can we know that our sample mean estimate is representative of the population mean?
By computing the standard error of the mean (SEM): the smaller the SEM, the better
What does this diagram show you? - (2)
If you take several samples from the same population,
each sample has its own mean, and some sample means will differ from the population mean while others match it; this difference is sampling error, and the variability of the sample means is captured by the SEM
What is sample variation and example - (2)
samples will vary because they contain different members of the population;
a sample that by chance includes some very
good lecturers will have a higher average (higher rating of all lectures) than a sample that, by chance, includes some awful lecturers.
Standard deviation is used as a measure of how
representative the mean was of the observed data.
Small standard deviations represented a scenario in which
most data points were close to the mean
Large standard deviation represented a situation in which data points were
widely spread
from the mean.
How to calculate the standard error of mean?
computed by dividing the standard deviation of the sample by the square root of the number in the sample
The larger the sample the smaller the - (2)
standard error of the mean
more confident we can be that the sample mean is representative of the population.
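A minimal sketch of the SEM formula, and of how the SEM shrinks as the sample grows. The function names are illustrative, and the SD of 10 is an arbitrary value chosen only to make the arithmetic easy to follow.

```python
import math
import statistics

def sem(sample):
    """Sample SD (n - 1 denominator) divided by the square root of n."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def sem_from_sd(sd, n):
    """Same formula when the SD and sample size are already known."""
    return sd / math.sqrt(n)

# Same SD, larger samples: the SEM halves each time n is quadrupled
for n in (25, 100, 400):
    print(n, sem_from_sd(10, n))  # 2.0, then 1.0, then 0.5
```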
The central limit theorem proposes that
as samples get large (usually defined as greater than 30), the sampling distribution of the mean approaches a normal distribution with a mean equal to the population mean and an SD equal to the SEM
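The central limit theorem can be checked with a small simulation. This is a sketch under assumed settings, not part of the lecture: the population is uniform on [0, 1] (deliberately not bell-shaped), each sample has 36 members, and the seed is arbitrary.

```python
import random
import statistics

random.seed(1)              # arbitrary seed, only to make the run repeatable
n, n_samples = 36, 2000

# Population: uniform on [0, 1]; its mean is 0.5 and its SD is sqrt(1/12)
sample_means = [
    statistics.fmean(random.random() for _ in range(n))
    for _ in range(n_samples)
]

mean_of_means = statistics.fmean(sample_means)   # should be close to 0.5
sd_of_means = statistics.stdev(sample_means)     # should be close to the SEM
predicted_sem = (1 / 12) ** 0.5 / n ** 0.5       # population SD / sqrt(n)

print(round(mean_of_means, 3), round(sd_of_means, 3), round(predicted_sem, 3))
```

The SD of the simulated sample means should land near the predicted SEM, which is what "SD = SEM" on the card refers to.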
The standard deviation of sample means is known as the
SEM (standard error of the mean)
A different approach to assessing the accuracy of the sample mean as an estimate of the population mean, aside from SE, is to - (2)
calculate boundaries and range of values within which we believe the true value of the population mean value will fall.
Such boundaries are called confidence intervals.
Confidence intervals are created by
samples
A 95% confidence interval is constructed such that
in 95% of samples, these intervals (created from the samples) will contain the population mean
95% confidence intervals for 100 samples (a CI constructed for each) would mean
for 95 of these samples, the confidence intervals we constructed would contain the true value of the mean in the population.
In fact, for a specific confidence interval, the probability that it contains the population value is either - (2)
0 (it does not contain it) or 1 (it does contain it).
You have no way of knowing which it is.
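The "95% of intervals contain the population mean" idea can be illustrated by simulation. Everything here is a hypothetical setup (a normal population with mean 100 and SD 15, samples of 25, an arbitrary seed); in real research the population mean is unknown, which is why a single interval either contains it or does not.

```python
import random
from statistics import NormalDist, fmean

random.seed(2)                     # arbitrary seed for repeatability
mu, sigma, n = 100, 15, 25         # hypothetical population and sample size
z = NormalDist().inv_cdf(0.975)    # about 1.96
sem = sigma / n ** 0.5

n_samples = 1000
hits = 0
for _ in range(n_samples):
    m = fmean(random.gauss(mu, sigma) for _ in range(n))
    # Each individual interval either contains mu or it does not
    if m - z * sem <= mu <= m + z * sem:
        hits += 1

coverage = hits / n_samples        # proportion of intervals containing mu
print(coverage)
```

Across many samples the coverage should come out close to 0.95.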
Diagram shows- (4)
- Dots show the means for each sample
- Lines sticking out represent the CIs for the sample means
- If there was a vertical line down it represents population mean
- If confidence intervals don’t overlap then it shows significant difference between the sample means
if our sample means were normally distributed with a mean of 0 and a
standard error of 1, then the limits of our confidence interval
would be –1.96 and +1.96 -
95% of z scores fall between
-1.96 and 1.96
Confidence intervals can be constructed for any estimated parameter, not just
μ - mean
If the mean represents the true mean well, then the confidence interval of that mean should be
small
if the confidence interval is very
wide then the sample mean could be
very different from the true mean, indicating that it
is a bad representation of the population
Remember that the standard error of the mean gets smaller with the number of observations and thus our confidence interval also gets
smaller, which makes sense: the more we measure, the more certain we are that the sample mean is close to the population mean
Calculating confidence intervals (for observations) - rearranging the z formula
We know 95% of scores lie between z = -1.96 (lower bound) and z = 1.96 (upper bound)
LB = (-1.96* SD of sample) + mean sample
UB = (+1.96* SD of sample) + mean sample
Calculating Confidence Intervals for sample means - rearranging in z formula
LB = Mean - (1.96 * SEM)
UB = Mean + (1.96 * SEM)
The standard deviation of SAT verbal scores in a school system is known to be 100. A researcher wishes to estimate the mean SAT score and compute a 95% confidence interval from a random sample of 10 scores.
The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780.
Calculate CI
* M = 530
* N = 10
* SEM = 100/ square root of 10 = 31.62
* Value of z for 95% CI is number of SD one must go from mean (in both directions) to contain 0.95 of the scores
* Value of 1.96 was found in z-table
* Since each tail is to contain 0.025 of the scores, you find the value of z for which 1 - 0.025 = 0.975 of the scores lie below
* 95% of z scores lie between -1.96 and +1.96
* Lower limit = 530 - (1.96) (31.62) = 468.02
* Upper limit = 530 + (1.96)(31.62) = 591.98
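The SAT example can be reproduced directly. This sketch uses the ten scores and the known population SD of 100 from the card, with the table value z = 1.96.

```python
import math
import statistics

scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]
sigma = 100   # population SD, given in the example
z = 1.96      # z value for a 95% confidence interval

mean = statistics.fmean(scores)
sem = sigma / math.sqrt(len(scores))
lower = mean - z * sem
upper = mean + z * sem

print(mean, round(sem, 2), round(lower, 2), round(upper, 2))
# 530.0 31.62 468.02 591.98
```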
Null hypothesis is that there is
no effect of the predictor variable on the outcome variable
The alternate hypothesis is that there is an effect of
the predictor variable on the outcome variable
Null hypothesis significance testing computes the probability of obtaining data at least as extreme as ours if the null hypothesis were true, which is referred to as the
p-value
To test the fit of statistical models to test our hypotheses, we calculate
the probability of getting that model (data) if the null hypothesis H0 were true (statistical significance)
What if the probability (p-value) was small?
we conclude the model fits the data well (explains a lot of the variance) and we gain confidence in the alternative hypothesis H1
Steps in Hypothesis testing (6)
- specify the null hypothesis H0 and the alternative hypothesis H1
- select a significance level. Typically the 0.05 or the 0.01 level.
- calculate a statistic analogous to the parameter specified by the null hypothesis. (e.g. if null defined by parameter μ1- μ2 (diff between two means) then the statistic is M1-M2 (difference between sample means))
- calculate the probability value of obtaining a statistic (computed from the data) as different or more different from the parameter specified in the null hypothesis (often 0, or based on past evidence that the mean stays the same)
- probability value computed in Step 4 is compared with the significance level chosen in Step 2.
- If the outcome is statistically significant, then the null hypothesis is rejected in favor of the alternative hypothesis.
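The six steps can be sketched as a one-sample z-test. The numbers are hypothetical and chosen only for illustration (H0: μ = 500, known σ = 100, a sample of 25 with mean 540), and `z_test` is an illustrative helper name, not a library function.

```python
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed one-sample z-test; returns (z, p, reject_h0)."""
    sem = sigma / n ** 0.5
    z = (sample_mean - mu0) / sem              # step 3: the test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))     # step 4: probability value
    return z, p, p <= alpha                    # steps 5 and 6: compare, decide

# Steps 1 and 2: H0 is mu = 500, H1 is mu != 500, significance level 0.05
z, p, reject = z_test(sample_mean=540, mu0=500, sigma=100, n=25)
print(round(z, 2), round(p, 4), reject)  # 2.0 0.0455 True
```

With a stricter significance level of 0.01 the same result would not be statistically significant, since 0.0455 is greater than 0.01.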
Think of test statistic capturing
signal/noise
A test statistic is a statistic for which the frequency of particular values is known (t, F, chi-square) and thus we can calculate the
probability of obtaining a certain value or p value.
To test whether the model fits the data or whether our hypothesis is a good explanation of the data, we compare
systematic variation against unsystematic
If the probability (p-value) is less than or equal to the significance level, then
the null hypothesis is rejected; when the null hypothesis is rejected, the outcome is said to be “statistically significant”
If the probability (p-value) is greater than the significance level, the
null hypothesis is not rejected.
We accept the results as true (accept our alternative hypothesis) if there is
a 5% or 1% (p < 0.05 or p < 0.01) or smaller probability of the test statistic occurring by chance.
P-value less than 0.05 means there is a low probability of obtaining at least as
extreme results given that H0 is true
What is a type 1 error in terms of variance? - (2)
think the variance accounted for by the model is larger than the one unaccounted for by the model (i.e. there is a statistically significant effect but in reality there isn’t)
Type 1 is a false
positive
What is a Type II error in terms of variance?
think there was too much variance unaccounted for by the model (i.e. there is no statistically significant effect but in reality there is)
Type II error is false
negative
Example of Type I and Type II error
Type I and Type II errors are mistakes we can make when testing the
fit of the model
A Type I error occurs when we believe there is a genuine effect in the
population, when in fact there isn’t.
Acceptable level of type I error is usually
the α-level, usually 0.05
Type II error occurs when we believe there is no effect in the
population when, in reality, there is.
The acceptable level of Type II error is the
β-level (often 0.2)
An effect size is a standardised measure of
the size of an effect
Properties of effect size (3)
Standardized = comparable across studies
Not (as) reliant on the sample size
Allows people to objectively evaluate the size of observed effect.
Effect Size Measures
r = 0.1, d = 0.2 (small effect):
the effect explains 1% of the total variance.
Effect size measures
r = 0.3, d = 0.5 (medium effect) means
the effect accounts for 9% of the total variance.
Effect size measures
r = 0.5, d = 0.8 (large effect)
effect accounts for 25% of the variance
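The variance figures on these cards come from squaring r. A small sketch (the function name is illustrative; the conversion applies to r, not to d):

```python
def variance_explained(r):
    """Proportion of total variance accounted for by a correlation of size r."""
    return r ** 2

for r in (0.1, 0.3, 0.5):
    print(r, f"{variance_explained(r):.0%}")  # 1%, then 9%, then 25%
```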
Beware of the ‘canned’ effect sizes (e.g., r = 0.5, d = 0.8 and rest) since the size of
effect should be placed within the research context.
We should aim to achieve a power of
.8, or an 80% chance of detecting
an effect if one genuinely exists.
When we fail to reject the null hypothesis, it is either that there truly is no difference to be found,
OR
it may be because we do not have enough statistical power
Power is the probability of
correctly rejecting a false H0 OR the ability of the test to find an effect assuming there is one in the population,
Power is calculated by
1 - β, where β is the probability of making a Type II error
To increase statistical power of study you can increase
your sample size
Factors affecting the power of the test: (4):
- Probability of a Type I error, or α-level (the level at which we decide an effect is significant) –> a bigger (more lenient) alpha gives more power
- The true alternative hypothesis H1 (effect size): the less the H0 and H1 distributions overlap, the more power; if the literature reports a large effect, you have a better chance of detecting something
- The sample size (N) –> the bigger the sample, the less the noise and the more power
- The particular test to be employed: parametric tests have greater power to detect a significant effect since they are more sensitive
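How the first three factors move power can be sketched with a normal approximation. This is an assumed textbook-style approximation for a two-sided comparison of two group means, not a formula taken from the lecture; `approx_power` and its parameters (`d` for Cohen's d, `n_per_group`) are illustrative names.

```python
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power for comparing two group means (normal approximation).

    Ignores the negligible chance of rejecting H0 in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * (n_per_group / 2) ** 0.5 - z_crit)

print(round(approx_power(0.5, 64), 2))              # about 0.81
print(round(approx_power(0.5, 64, alpha=0.10), 2))  # more lenient alpha, more power
print(round(approx_power(0.8, 64), 2))              # larger effect, more power
print(round(approx_power(0.5, 128), 2))             # larger sample, more power
```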
How to calculate the number of pps they need for reasonable chance of correctly rejecting null hypothesis?
Sample size calculation at a desired level of power (usually power set to 0.8 in formula)
Tests of normality (2)
- Kolmogorov-Smirnov test
- Shapiro-Wilk test
If the distribution of the data looks normal but the test says it is not normally distributed
Just do the parametric tests
Plot your data because this informs you on what decisions you want to make
with respect to normality –> normality tests have limitations
With power, we can do 2 things - (2)
- Calculate power of test
- Calculate the sample size necessary to detect a decent effect size and achieve a certain level of power, based on past research
Diagram of Type I error, Type II error, power - (4) and making correct decisions
Type 1 error p = alpha
Type II error p = beta
Accepting null hypothesis which is correct - p = 1- alpha
Accepting alternate hypo which is correct - p = 1 - beta
If there is a smaller degree of overlap between H0 and H1 then
a bigger difference means higher power, and we are more likely to correctly reject the null hypothesis than with distributions that overlap more
If distribution between h0 and h1 are narrower then
This means that the overlap in distributions is smaller and the power is therefore greater, but this time because of a smaller standard error of our estimate of the means.
Most people want to assess how many participants they need to test to have a reasonable chance of correctly rejecting the null hypothesis (the Power). This formula shows - (2)
us how.
We usually set the power to 0.8.
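The formula from the slide is not reproduced in these notes, so the following is only a sketch using a common normal-approximation formula for comparing two group means, n per group = 2 × ((z for 1 - α/2 + z for power) / d)². The function name and the choice of formula are assumptions, and exact t-based calculations give slightly larger numbers.

```python
import math
from statistics import NormalDist

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate participants needed per group to detect effect size d."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for power = 0.8
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))  # 393, then 63, then 25
```

Smaller expected effects need far more participants to reach the same power of 0.8.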
What are z scores? - (2)
A measure of variability:
The number of standard deviations a particular data point is from the population mean
Z-scores are a standardised measure, hence they ignore measurement units
Why should we care about z scores? - (2)
Z-scores allow researchers to calculate the probability of a score occurring within a standard normal distribution
Enables us to compare two scores that are from different samples (which may have different means and standard deviations)
Diagram of finding percentile of Trish
Trish takes a test and gets 25
Mean of the class is 20
SD = 4
(25 - 20)/4 = 1.25
Z-score = 1.25
Let’s say Trish takes a test and scores 25, and the mean is 20. You calculate the z-score to be 1.25 and then use a z-score table to see what percentile she is in (marked in red). To read the table, go down to the row for 1.2 and across to the 0.05 column, which together give 1.25; you can see that about 89.4% of other students performed worse.
Diagram of z score and percentile
Josh takes a different test and gets 1150
Mean of the class is 1000
SD = 150
(1150 - 1000)/150 = 1.0
Z score = 1.0
Who performed better Trish or Josh?
Trish had z score of 1.25
We would use our table and look down the column to a z-score of 1 and across to the 0.00 column (in purple), and we can see 84.1% of students performed worse than Josh, so Trish performed better than Josh.
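The Trish and Josh comparison can be reproduced without a table. This sketch uses the scores, means and SDs from the cards; `percentile_rank` is an illustrative helper name.

```python
from statistics import NormalDist

def percentile_rank(score, mean, sd):
    """Return (z score, proportion of scores expected to fall below it)."""
    z = (score - mean) / sd
    return z, NormalDist().cdf(z)

trish = percentile_rank(25, 20, 4)
josh = percentile_rank(1150, 1000, 150)

print(trish[0], round(trish[1], 3))  # 1.25 0.894
print(josh[0], round(josh[1], 3))    # 1.0 0.841
print("Trish" if trish[0] > josh[0] else "Josh")  # Trish
```

Because z scores ignore the measurement units, the two tests can be compared even though their raw scales differ.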
Diagram of z scores and normal distribution - (3)
68% of scores are within 1 SD of the mean,
95% are within 2 SDs and
99.7% are within 3 SDs.
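The 68/95/99.7 figures follow from the standard normal curve and can be checked as a sketch (`within_k_sd` is an illustrative name):

```python
from statistics import NormalDist

def within_k_sd(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    std = NormalDist()
    return std.cdf(k) - std.cdf(-k)

for k in (1, 2, 3):
    print(k, round(within_k_sd(k), 4))  # roughly 0.68, 0.95 and 0.997
```

The exact value for 2 SDs is a little above 95%, which is why 1.96 rather than 2 is used for 95% confidence intervals.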
What’s standard error?
By taking into account the variability and size of our sample, we can estimate how far our sample mean is likely to be from the real population mean!
If we took infinite samples from the population, 95% of the time the population mean will lie within the
95% confidence interval range
Due to the frequency of normal distribution we can get a number for the lower and upper limits of the
95% confidence interval.
How to calculate standard error?
What does narrow CI represent?
high statistical power
Wide CIs represent?
low statistical power
Power is the probability of catching a real effect (as opposed to
missing a real effect – Type II error)
The null hypothesis, in terms of same and different populations, is that
No actual difference exists in the real world, all data comes from the same population
Alternative (experimental) hypothesis in terms of same or different population
There is an actual difference and we found it!
We collect evidence to
reject null hypothesis
We can never say the null hypothesis is
FALSE (or TRUE).
The p-value, or calculated probability, is the estimated probability of
us finding an effect when the null hypothesis (H0) is true.
p = probability of observing a test statistic at least as big as the one we have if the
H0 is true
Hence, a significant p value (p <.05) tells us that there is a less than 5% chance of getting a test statistic that is
as large as or larger than the one we have found if there were no effect in the population (i.e., the null hypothesis were true)
The p-value gets smaller as the test statistic calculated from your data
gets further away from the range of test statistics predicted by the null hypothesis.
Statistical significance does not equal importance - (2)
p = .049 and p = .050 are essentially the same thing, yet only the former is ‘statistically significant’.
Importance is dependent upon the experimental design/aims: e.g., a statistically significant weight increase of 0.1 kg between two experimental groups of adults may be less important than the same increase between two groups of babies.
‘Children can learn a second language faster before the age of 7’. Is this statement:
A. One-tailed
B. Non-scientific
C. Two-tailed
D. Null hypothesis
A, as one-tailed is directional and two-tailed is non-directional
Which of the following is true about a 95% confidence interval of the mean:
A. 95 out of 100 CIs will contain the population mean
B. 95 out of 100 sample means will fall within the limits of the confidence interval.
C. 95% of population means will fall within the limits of the confidence interval.
D. There is a 0.05 probability that the population mean falls within the limits of the confidence interval.
A, as if we’d collected 100 samples, calculated the mean of each, and then calculated a confidence interval for each mean, then for 95 of these samples the confidence intervals we constructed would contain the true value of the mean in the population
What does a significant test statistic tell us?
A. That the test statistic is larger than we would expect if there were no effect in the population.
B. There is an important effect.
C. The null hypothesis is false.
D. All of the above.
A, and just because a test statistic is significant does not mean it reflects an important effect
Of what is p the probability?
(Hint: NHST relies on fitting a ‘model’ to the data and then evaluating the probability of this ‘model’ given the assumption that no effect exists.)
A. p is the probability of observing a test statistic at least as big as the one we have if there were no effect in the population (i.e., the null hypothesis were true).
B. p is the probability that the results are due to chance, the probability that the null hypothesis (H0) is true.
C. p is the probability that the results are not due to chance, the probability that the null hypothesis (H0) is false
D. p is the probability that the results would be replicated if the experiment was conducted a second time.
A
A Type I error occurs when:
(Hint: When we use test statistics to tell us about the true state of the world, we’re trying to see whether there is an effect in our population.)
A. We conclude that there is an effect in the population when in fact there is not.
B. We conclude that there is not an effect in the population when in fact there is.
C. We conclude that the test statistic is significant when in fact it is not.
D. The data we have typed into SPSS is different from the data collected.
A, as if we use the conventional criterion then the probability of this error is .05 (or 5%) when there is no effect in the population
True or false?
a. Power is the ability of a test to detect an effect given that an effect of a certain size exists in a population.
TRUE
True or False?
We can use power to determine how large a sample is required to detect an effect of a certain size.
TRUE
True or False?
c. Power is linked to the probability of making a Type II error.
TRUE
True or False?
d. The power of a test is the probability that a given test is reliable and valid.
FALSE
What is the relationship between sample size and the standard error of the mean?
(Hint: The law of large numbers applies here: the larger the sample is, the better it will reflect that particular population.)
A. The standard error decreases as the sample size increases.
B. The standard error decreases as the sample size decreases.
C. The standard error is unaffected by the sample size.
D. The standard error increases as the sample size increases.
A The standard error (which is the standard deviation of the distribution of sample means), defined as σ_X̄ = σ/√N, decreases as the sample size (N) increases and vice versa
What is the null hypothesis for the following question: Is there a relationship between heart rate and the number of cups of coffee drunk within the last 4 hours?
A. There will be no relationship between heart rate and the number of cups of coffee drunk within the last 4 hours.
B. People who drink more coffee will have significantly higher heart rates.
C. People who drink more cups of coffee will have significantly lower heart rates.
D. There will be a significant relationship between the number of cups of coffee drunk within the last 4 hours and heart rate
A The null hypothesis is the opposite of the alternative hypothesis and so usually states that an effect is absent
A Type II error occurs when :
(Hint: This would occur when we obtain a small test statistic, perhaps because there is a lot of natural variation between our samples.)
A. We conclude that there is not an effect in the population when in fact there is.
B. We conclude that there is an effect in the population when in fact there is not.
C. We conclude that the test statistic is significant when in fact it is not.
D. The data we have typed into SPSS is different from the data collected.
A A Type II error would occur when we obtain a small test statistic (perhaps because there is a lot of natural variation between our samples)
In general, as the sample size (N) increases:
A. The confidence interval gets narrower.
B. The confidence interval gets wider.
C. The confidence interval is unaffected.
D. The confidence interval becomes less accurate
A
Which of the following best describes the relationship between sample size and significance testing?
(Hint: Remember that test statistics are basically a signal-to-noise ratio, so given that large samples have less ‘noise’ they make it easier to find the ‘signal’.)
A. In large samples even small effects can be deemed ‘significant’.
B. In small samples only small effects will be deemed ‘significant’.
C. Large effects tend to be significant only in small samples.
D. Large effects tend to be significant only in large samples.
A
The assumption of homogeneity of variance is met when:
A. The variances in different groups are approximately equal.
B. The variances in different groups are significantly different.
C. The variance across groups is proportional to the means of those groups.
D. The variance is the same as the interquartile range.
A - To make sure our estimates of the parameters that define our model and significance tests are accurate we have to assume homoscedasticity (also known as homogeneity of variance)
Next, the lecturer was interested in seeing whether males and females reacted differently to the different teaching methods.
Produce a clustered bar graph showing the mean scores of teaching method for males and females.
(HINT: place TeachingMethod on the X axis, Exam Score on the Y axis, and Gender in the ‘Cluster on X’ box. Include 95% confidence intervals in the graph).
Which of the following is the most accurate interpretation of the data?
A.Females performed better than males in both the reward and indifferent conditions. Regarding the confidence intervals, there was a large degree of overlap between males and females in all conditions of the teaching method.
B.Males performed better than females in the reward condition, and females performed better than males in the indifferent condition. Regarding the confidence intervals, there was no overlap between males and females across any of the conditions of teaching method.
C.Males performed better than females in all conditions. Regarding the confidence intervals, there was a small degree of overlap between males and females for the reward and indifferent conditions, and a large degree of overlap between males and females for the punish condition.
D.Males performed better than females in the reward condition, and females performed better than males in the indifferent condition. Regarding the confidence intervals, there was a small degree of overlap between males and females for the reward and indifferent conditions, and a large degree of overlap between males and females for the punish condition.
D
Produce a line graph showing the change in mean anxiety scores over the three time points.
NOTE: this is a repeated measures (or within subjects) design, ALL participants took part in the same condition.
Which of the following is the correct interpretation of the data?
A. Mean anxiety increased across the three time points.
B. Mean anxiety scores were reduced across the three time points, and there was a slight acceleration in this reduction between the middle and end of the course.
C. Mean anxiety scores were reduced across the three time points, though this reduction slowed down between the middle and end of the course.
D. Mean anxiety scores did not change across the three time points.
B