L5 - Critical thinking about statistical inference Flashcards
List of things from the last block/lectures we need to know, so if you don’t remember them, revise them
For some I included an answer in brackets as well, where the book described it nicely in a few words
- Null hypothesis and alternative hypothesis, the difference between the two (the null is the one most costly to reject falsely); formulating these refers to population properties, not sample properties
- t-statistic, probabilities (to calculate any probability we need a collective, which can be constructed by assuming H0, imagining an infinite number of experiments and calculating t each time; each t is a single event of the collective)
↪ t-distribution (the distribution of the infinite number of ts in the collective)
- p-value and α (they are objective probabilities, i.e. relative long-run frequencies)
↪ Neither α nor p tells us how probable the null hypothesis is (they are not P(H|D))
- β, power (1-β; P(reject H0|H0 false))
- sensitivity, specificity
- Type I error (we will make this error in α proportion of our decisions; P(rejecting H0|H0 true)), Type II error (P(accepting H0|H0 false))
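A minimal simulation sketch (my own illustration, not from the lecture) of the collective idea: assume H0, run many imaginary experiments, compute t each time, and check that the long-run proportion of |t| beyond the critical value is about α. The sample size, seed and number of simulated experiments are arbitrary assumptions.

```python
# Construct the "collective" under H0: many experiments, one t per experiment.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20                      # participants per simulated experiment
n_experiments = 100_000     # stand-in for the "infinite" collective

# Under H0 the population mean is 0; draw data and compute a one-sample t each time
data = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n))
t_values = data.mean(axis=1) / (data.std(axis=1, ddof=1) / np.sqrt(n))

# The long-run proportion of |t| beyond the critical value approximates alpha
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print("Proportion of |t| > t_crit under H0:", np.mean(np.abs(t_values) > t_crit))
# ~0.05: alpha is a property of the collective of decisions, not of any single test
```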
Look at picture 7
Consider the statements about the study and say whether they are true or not
All of them are incorrect. Throughout the flashcards it should become clear why that is the case. We’ll also revisit them at the end and explain why they are incorrect
What is important to remember about the alpha level and the p-value when interpreting statistical results?
A common misunderstanding is that the p-value and alpha level say something about the probability of the null hypothesis; they do not (and they say nothing about the alternative hypothesis at all).
This misunderstanding of the definition of the p-value leads people to draw fallacious conclusions about their results
What does the misinterpretation of the p-value help explain?
It helps explain the motivation of people to obtain significant results
- Explains why peer-review process is often focused on checking whether the results were significant instead of focusing on the content of the paper
- Explains why issues with reproducibility occur (e.g. rounding a p-value of 0.051 down to 0.05)
What are the different statements people use to report non-significant results (p>0.05) as almost significant?
Not important to remember them; they are just examples of the lengths people go to in order to mislead the reader into thinking the results are significant
- a certain trend toward significance (p=0.08)
- approached the borderline of significance (p=0.07)
- just very slightly missed the significance level (p=0.086)
- near-marginal significance (p=0.18)
- only slightly non-significant (p=0.0738)
- provisionally significant (p=0.073)
- quasi-significant (p=0.09)
What is the analogy of the conflict that goes on in researchers’ heads when they find non-significant results?
It’s a silly example, no need to remember, he included it in the lecture more for fun than for actual learning
Picture 1
What is the point of using p-value when it forces people to seek significant results at all costs?
Playing the devil’s advocate: how likely is (at least) this statistic if there were no difference in the population?
- What if I’m not measuring a systematic difference in the population, but just random variation? → Is the difference to be expected if there is nothing else going on but, for example, random sampling?
If there were actually nothing going on, the probability of finding this result (or a more extreme one) would not be that high
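A small hedged example (my own) of that devil’s advocate question: given an observed t-statistic (the value and degrees of freedom below are made up for illustration), how probable is a result at least this extreme if nothing is going on?

```python
# Probability of "this statistic or more extreme" under H0 for a hypothetical t and df
from scipy import stats

t_observed, df = 2.064, 24                      # hypothetical values for illustration
p_value = 2 * stats.t.sf(abs(t_observed), df)   # two-sided: this result or more extreme
print(round(p_value, 3))
# ≈ 0.05: a result this extreme would occur only about 5% of the time if H0 were true
```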
How did p-value come about? What did Fisher propose?
Significance testing!
- Formulate H0: the hypothesis to be ‘nullified’
- Report the exact level of significance (p-value), without further discussion about accepting or rejecting hypotheses (for the reader to decide how they want to interpret this value)
- Only do this if you know almost nothing about the subject
↪ ‘A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance’
What did Neyman & Pearson suggest as an alternative to Fisher’s approach?
They thought Fisher’s approach was less useful, as no clear alternative hypothesis is specified
Hypothesis testing!
- Formulate two statistical hypotheses, determine alpha, beta & sample size for the experiment, in a deliberate way (expected x value) before collecting the data
- If the data falls in the rejection region of H1, accept H2. This does not mean that you believe H2 is true, only that you behave as if it were
- Only use this procedure if there is a clear disjunction & if a cost-benefit assessment is possible
So basically, we’re setting behavioural rules: even though we don’t know whether the H0 is true or not, we won’t be wrong very often if it is true and we won’t be wrong very often if it is false.
We can also put this in a frequency tree (picture 2)
Since the two approaches didn’t agree with each other, what did we end up with? What is the issue with this?
We ended up with the null ritual
- Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses
- Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis
- Always perform this procedure
This approach also introduces many fallacies - we will discuss these in the next block
What are 4 fallacies in statistical inference?
- P-values equal the probability that the (null) hypothesis is true
- Alpha equals the probability of making an error
- Failing to reject H0 is evidence for H0
- Power is irrelevant when results are significant
P-values equal the probability that the (null) hypothesis is true
Which kind of probability do statements involving alpha, power and p-values relate to?
Statements in which alpha, power and p-values occur relate to:
1. Frequentist or objective probabilities
2. Conditional probabilities
1. Frequentist probability
What is subjective probability?
Probability is the degree of belief that something is the case in the world
- This expresses a degree of uncertainty: e.g. how sure are you that you have chosen the right answer to an MC question?
1. Frequentist probability
What is objective probability?
Probability is the extent to which something IS the case in the world
- These probabilities exist independently of our states of knowledge
- This, for example, expresses the relative frequency in the long run: e.g. an infinite number of coin tosses (a reference class or collective)
- Probabilities need to be discovered by examining the world, not by reflecting on what we know or how much we believe
What is a reference class or collective?
The hypothetical infinite set of events; the long-run relative frequency is a property of the collective as a whole, not of any single event
- It might be the set of all potential tosses of a coin using a certain tossing mechanism → a single toss of the coin (a singular event) doesn’t have a probability; only the collective of tosses has one
1. Frequentist probability
What is the reason, according to frequentist probability, why we cannot infer the probability of the null hypothesis from the p-value?
The null hypothesis is either true or it isn’t, just as a single event either occurs or doesn’t
- A hypothesis is not a collective, hence it does not have an objective probability
- With p-values (Fisher) and the Neyman-Pearson paradigm we talk about objective probability
1. Frequentist probability
What does the probability in the Neyman-Pearson paradigm apply to?
The long-term error rates apply to our behaviour, not to the hypothesis itself: they concern whether we decide to reject it or not, and how often we will be wrong over a long run of such decisions
➡ Objective probability
2. Conditional probabilities
What is the difference between P(D|H) and P(H|D)?
P(D|H) = probability of obtaining some data given a hypothesis
E.g. P(‘getting 5 threes in 25 rolls of a die’|’I have a fair die’)
- For this probability we can set up a relevant collective consisting of an infinite number of events: throwing a fair die 25 times and counting the number of threes
- We can determine the proportion of such events in which the number of threes is 5 = a probability we can calculate
But we cannot calculate P(H|D) (e.g. the probability that the hypothesis that I have a fair die is true, given I obtained 5 threes in 25 rolls) because there is no collective; the hypothesis is simply true or not
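A brief sketch (my own illustration) of how such a P(D|H) can actually be computed for the die example, using the binomial distribution:

```python
# P(D|H) for the die example: probability of the data, given the die is fair
from scipy import stats

p_exact = stats.binom.pmf(5, n=25, p=1/6)     # exactly 5 threes in 25 rolls
p_at_least = stats.binom.sf(4, n=25, p=1/6)   # 5 or more threes ("this or more extreme")
print(f"P(exactly 5 threes | fair die)  = {p_exact:.3f}")
print(f"P(5 or more threes | fair die) = {p_at_least:.3f}")
# There is no analogous calculation for P(fair die | 5 threes): the hypothesis is not a collective
```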
2. Conditional probabilities
Why, if we know P(D|H), do we not know P(H|D)?
- The inverse conditional probabilities can have very different values (example in the next flashcard)
- It is meaningless to assign objective probability to a hypothesis
2. Conditional probabilities
An example that shows that we can’t just reverse conditional probabilities
P(‘dying within two years’|’head bitten off by a shark’) = 1
P(‘head was bitten off by a shark’|’died in the last two years’) ≈ 0
2. Conditional probabilities
What is the reason, according to conditional probabilities, why we cannot infer the probability of the null hypothesis from the p-value?
The p-value is a conditional probability, P(this data or more extreme|H0); the null hypothesis is not a collective, so its probability cannot be obtained by simply inverting that conditional
What is a counterexample?
Using the same structure of the argument but with different terms to make it clear that the argument doesn’t hold
2. Conditional probabilities
How can we use counterexamples to show that we can’t invert conditional probabilities, i.e. that we can’t conclude from a low p-value that H0 is false and the alternative hypothesis is true?
Long flashcard but bear with me, it makes sense
P1) If H0 is true, probably not this data
P2) This data
C) H0 is not true
Counterexample:
P1) If someone is a Dutch national, (s)he probably doesn’t live in Amsterdam
P2) Sjinkus lives in Amsterdam
C) Sjinkus is not a Dutch national
This argument is not valid (it deals only in probabilities, not certainties) and not forceful (the premises do not provide strong enough support for the conclusion)
- P(AMS|Dutch national) = 0.05
- P(AMS|non-Dutch national, i.e. all the other people in the world) = 0.0001
↪ The latter is an extremely small probability, so we can’t conclude that Sjinkus is not a Dutch national just because he lives in Amsterdam
- We have to compare the two probabilities to draw valid conclusions, but with p-values we only look at the null hypothesis and say nothing about the likelihood of the alternative hypothesis
- So living in AMS is much more likely assuming that someone is a Dutch national than assuming that someone is not a Dutch national
↪ A hypothesis only looks likely or unlikely when compared to a meaningful alternative hypothesis
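A tiny sketch (my own, reusing the flashcard’s illustrative numbers) of what comparing the two conditional probabilities looks like:

```python
# Compare P(D|H) under both hypotheses instead of looking at one in isolation
p_ams_given_dutch = 0.05        # P(lives in AMS | Dutch national), from the flashcard
p_ams_given_not_dutch = 0.0001  # P(lives in AMS | not a Dutch national), from the flashcard

likelihood_ratio = p_ams_given_dutch / p_ams_given_not_dutch
print(f"Living in Amsterdam is {likelihood_ratio:.0f}x more likely if Sjinkus IS a Dutch national")
# Looking only at P(AMS | Dutch) = 0.05 (the "p-value" of the argument) and rejecting
# "Dutch national" would get the inference exactly backwards
```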
2. Conditional probabilities
What is a third way of understanding why we cannot infer that the null is false just from obtaining a p-value lower than alpha?
In an ideal (unrealistic) world, we know the base rate of our null hypothesis, the sensitivity (power) and the specificity
But in the real world we don’t know the base rate (picture 3); we can, however, go through a thought experiment in which we assign a base rate based on knowledge from already conducted studies
2. Conditional probabilities
Demonstrating the thought process with an example
Picture 4
We have 1000 hypotheses; for 100 of them the null turns out to be true and for 900 the null turns out to be false. That’s our base rate: the probability of drawing one hypothesis at random and it being true or false
- That is an objective interpretation: the base rate of the null hypothesis being true is the proportion of true null hypotheses versus false null hypotheses
We give a value to sensitivity (0.8 is common in psychology) and specificity (0.95 is common in psych.) and calculate the probability of a real effect given that we rejected the null hypothesis
Ex1: Base rate (of a real effect) = 0.9
P(real effect|reject H0) = 0.99
The sensitivity and specificity remain constant
Ex2: Base rate (of a real effect) = 0.1
P(real effect|reject H0) = 0.64
The probability changed depending on the base rate.
P(real effect|reject H0) > 0.5, so should the argument be considered forceful? No, because the value also depends on sensitivity and specificity, so if these change, the probability changes as well (picture 5)
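A minimal sketch (my own) that reproduces the flashcard’s numbers for the thought experiment, computing P(real effect|reject H0) from the base rate, sensitivity and specificity:

```python
# P(real effect | reject H0) from base rate, sensitivity (power) and specificity
def p_real_effect_given_rejection(base_rate, sensitivity=0.8, specificity=0.95):
    """base_rate = proportion of tested hypotheses for which a real effect exists."""
    true_positives = sensitivity * base_rate                # real effect, H0 rejected
    false_positives = (1 - specificity) * (1 - base_rate)   # no effect, H0 still rejected
    return true_positives / (true_positives + false_positives)

print(round(p_real_effect_given_rejection(0.9), 2))  # 0.99 (Ex1)
print(round(p_real_effect_given_rejection(0.1), 2))  # 0.64 (Ex2)
```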
2. Conditional probabilities
So why is the p-value ≠ the probability that the null hypothesis is true, according to this thought experiment?
Because rejection of the null is based only on the specificity, while inverting the conditional probability requires sensitivity, specificity and the base rate
So in this argument we only have the specificity, not the sensitivity and the base rate:
P1) If H0 is true, probably not this data
P2) This data
C) H0 is not true
Alpha equals the probability of making an error
Saying that the probability of making an error is alpha is not correct, because we don’t know whether the null hypothesis is true or not, and alpha assumes the null is true
- It doesn’t say anything about Type II error (which looks at the probability of making an error when H0 is false)
If you’ve rejected H0 at alpha = 0.05, the probability that you’ve made an error is 5% - why is this also a fallacy?
Look at picture 6
When we talk about rejecting H0, we look at the circled part of the frequency trees, i.e. all rejections, not just rejections made when H0 is true
Failing to reject H0 is evidence for H0
Same as saying a non-significant result means that the H0 is true
Why did Neyman and Pearson introduce power in their analysis?
So that they could say something about the sensitivity (power) of their analysis
- Because with a large enough amount of data even a tiny effect will eventually reach significance; that’s why you want to look at the power of the test, so the conclusion reflects the effect itself and not just the number of observations
A strict application of their logic means setting the risks of both Type I and II errors (α and β) in advance before collecting the data
How do we control β, which we determine before collecting the data?
- Estimate the effect size we’re interested in
- Estimate the data variance
↪ Do this based on knowledge from past studies about the same concept or do a pilot study
Once these two are determined, a power table can tell us how many participants we need to keep β at our predetermined level
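A rough simulation sketch (my own, not a recipe from the lecture) of how an assumed effect size and variance translate into power for a given sample size; the effect size, SD and sample sizes below are arbitrary assumptions:

```python
# Simulation-based power: proportion of experiments (with a real effect) that reject H0
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def simulated_power(n_per_group, effect=0.5, sd=1.0, alpha=0.05, n_sims=10_000):
    """Proportion of simulated two-group experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, sd, n_per_group)
        b = rng.normal(effect, sd, n_per_group)   # a real difference exists
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

for n in (20, 64, 100):
    print(n, round(simulated_power(n), 2))  # power reaches ~0.80 around n = 64 per group for d = 0.5
```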
What is the difference between absence of evidence and evidence of absence? And how does it explain the fallacy of failing to reject H0 is evidence for H0
Absence of evidence - the experiment did not yield a conclusive result, perhaps because too few observations were taken
Evidence of absence - the experiment did yield a conclusive result, but it favours the null hypothesis
The p-value cannot discriminate between the two, even though evidence of absence offers much stronger support for H0
What is the invalid argumentation logic behind replication and power?
P1) Study 1 finds an effect of size X with Z participants
P2) Study 2 is a direct replication of 1 with Z participants
C) Study 2 is sufficiently powered
What should we do to make the replication study sufficiently powered to detect the effect as well?
We should increase the number of participants in the second experiment to account for inflated significant results, sampling variability, subtle contextual differences, publication bias and regression to the mean
Difference between power and sensitivity?
Power - the probability of correctly rejecting the null hypothesis when it is false (i.e. avoiding a Type II error); the ability of the test to detect an effect if there is one
Sensitivity - the ability of a test to correctly identify true positives
- Power applies more broadly to hypothesis testing whereas sensitivity relates to the performance of a test
- In hypothesis testing, power is analogous to sensitivity in that they both refer to correctly identifying true positives, but they are used in slightly different contexts
Why is checking for significance also an issue for replications?
- Because of a lack of sensitivity: underpowered studies make for inconclusive replication attempts (49% of replications were inconclusive but are often reported as conclusive failures to replicate)
- Because of lack of differentiation: is the found effect in the replication meaningfully different from the original?
What is the invalid argumentation logic behind the fallacy of failing to reject H0 is evidence for H0
P1) Manipulation X has an effect
P2) There’s no significant difference between conditions in the degree to which participants noticed manipulation X
C )The effect of manipulation X was not noticed
If we dare to claim that there is no effect, we have to report the sensitivity of our test (the higher the sensitivity, the higher the power)
Power is irrelevant when results are significant
What is the (invalid) argumentation logic behind the fourth fallacy?
P1) P<.05
C) I have found an effect
(P2) When I have found an effect, it is no longer relevant what the probability is of finding an effect if H0 is not true
C2) Power is not relevant
Why should we report the effect size as well as the significance?
The informativeness of rejecting the null is affected by sample size and by power
↪ When power or sample size increase, so does the P (Real effect|reject H0) and vice versa
- That’s the difference between making inferences about a hypothesis being true/false and deciding on a course of action (rejecting the null with an error rate that is controlled in the long run) → we didn’t show an effect, we rejected the null
- Very small or unimportant effects will be statistically significant if sufficiently large amounts of data are collected, and very large and important effects will be missed if the sample size is too small
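A quick illustrative simulation (my own; effect sizes and sample sizes are arbitrary assumptions) of that last point: with enough data a tiny effect becomes "significant", while a large effect can be missed with a small sample.

```python
# Tiny effect + huge sample vs large effect + small sample
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

tiny_effect, huge_n = 0.02, 200_000
large_effect, small_n = 0.8, 8

a = rng.normal(0.0, 1.0, huge_n)
b = rng.normal(tiny_effect, 1.0, huge_n)
print("tiny effect, huge n:   p =", round(stats.ttest_ind(a, b).pvalue, 4))   # almost certainly < .05

c = rng.normal(0.0, 1.0, small_n)
d = rng.normal(large_effect, 1.0, small_n)
print("large effect, small n: p =", round(stats.ttest_ind(c, d).pvalue, 4))   # often > .05
```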
Remember the statements from flashcard 3 (for the statements look at picture 6)
Why is each incorrect?
(1) and (3) - stats never allow for absolute proof or disproof
(2) and (4) - refer to the probability of hypotheses which cannot be correct since objective probability refers to a collective of events, not the truth of a hypothesis
(5) - refers to the probability of a single event (this particular decision) being correct, which cannot be an objective probability since objective probability does not apply to single events
(6) - description of power, not significance
What are stopping rules?
Rules that define the conditions under which you will stop collecting data for a study
- They should be defined beforehand in your sampling plan: how many participants you will run
What are the different stopping rules and what are the issues with them?
- Run a first batch of participants and, if significance hasn’t been reached, run additional participants → we are now doing two different significance tests, inflating the α level
- Keep running until the test is significant → even if H0 is true, you will eventually obtain a ‘significant’ result if you collect data for long enough
↪ Although this rule has a power of 1, it also has an α of 1!
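A small simulation sketch (my own, not from the lecture) of the first stopping rule: peek after a first batch and add a second batch only if the result is not yet significant. Batch sizes and the number of simulations are arbitrary assumptions; the realised Type I error rate comes out clearly above the nominal 0.05.

```python
# Optional stopping: test after n1, add n2 more if not significant, test again
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
alpha, n1, n2, n_sims = 0.05, 20, 20, 20_000
false_positives = 0

for _ in range(n_sims):
    # H0 is true: both groups come from the same population
    a, b = rng.normal(0, 1, n1), rng.normal(0, 1, n1)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1
        continue
    # Not significant yet, so "run additional participants" and test again
    a = np.concatenate([a, rng.normal(0, 1, n2)])
    b = np.concatenate([b, rng.normal(0, 1, n2)])
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print("Realised Type I error rate:", false_positives / n_sims)  # noticeably above 0.05
```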
How does Neyman-Pearson approach solve the issues with the stopping rules?
They came up with a standard stopping rule: use power calculations in advance to determine how many participants are needed, i.e. determine the sampling plan before data collection
- Both α and β can then be controlled at known, acceptable levels
Why is inflation of alpha level a problem if we conduct two t-tests?
If we conduct one t-test, the probability that it is significant by chance alone is 0.05 if we test at the 0.05 level
If we conduct two t-tests, the probability that at least 1 is significant by chance alone is slightly less than 0.10
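A one-line check (my own) of the "slightly less than 0.10" claim, using the probability that at least one of k independent tests is significant by chance alone:

```python
# Familywise error rate for k independent tests, each at alpha = 0.05
alpha = 0.05
for k in (1, 2, 5, 10):
    print(k, round(1 - (1 - alpha) ** k, 4))  # k = 2 gives 0.0975, just under 0.10
```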
What can we do to control Type I error in two t-tests?
In order to control Type I error, if we perform a number of tests, we need to test each one at a stricter level of significance in order to keep the overall alpha level at 0.05
- This is done by applying a correction, e.g. Bonferroni: conduct each individual test at the 0.05/k level of significance (where k is the number of hypotheses/comparisons); the overall alpha will then be no higher than 0.05
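A minimal illustration (my own) of the Bonferroni idea: with k comparisons, testing each at 0.05/k keeps the overall (familywise) alpha at or below 0.05 (the familywise calculation below assumes independent tests).

```python
# Bonferroni correction: per-test threshold of alpha / k
alpha, k = 0.05, 4
per_test_alpha = alpha / k
familywise = 1 - (1 - per_test_alpha) ** k   # for k independent tests
print(f"Test each hypothesis at {per_test_alpha:.4f}; familywise alpha ≈ {familywise:.4f}")
# 0.0125 per test; familywise ≈ 0.0491, i.e. no higher than 0.05
```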