L5 - Critical thinking about statistical inference Flashcards

1
Q

List of things from last block/lectures we need to know so if you don’t remember, revise them

For some I included an answer in brackets as well, where the book described it nicely in a few words

A
  • Null hypothesis and alternative hypothesis, the difference between the two (the null is the one most costly to reject falsely); formulating these refers to population properties, not sample properties
  • t-statistic, probabilities (to calculate any probability we need a collective, which can be constructed by assuming H0, imagining an infinite number of experiments and calculating t each time; each t is a single event of the collective)
    t-distribution (the distribution of the infinite number of ts in the collective)
  • p-value and α (they are objective probabilities, i.e. relative long-run frequencies)
    ↪ Neither α nor p tells us how probable the null hypothesis is (they are not P(H|D))
  • β, power (1−β; P(reject H0|H0 false))
  • sensitivity, specificity
  • Type I error (we will make this error in α proportion of our decisions; P(reject H0|H0 true)), Type II error (P(accept H0|H0 false))
2
Q

Look at picture 7
Consider the statements about the study and say whether they are true or not

A

All of them are incorrect. Throughout the flashcards it should become clear why that is the case. We’ll also revisit them at the end and explain why they are incorrect

3
Q

What is important to remember about the alpha level and the p-value when interpreting statistical results?

A

The p-value and alpha level are probabilities calculated assuming the null hypothesis is true; they don’t refer to the alternative hypothesis at all, and neither tells us how probable the null hypothesis itself is.
This common misunderstanding of the definition of the p-value in statistical inference leads people to draw fallacious conclusions from their results

4
Q

What does the misinterpretation of the p-value help explain?

A

It helps explain why people are so motivated to obtain significant results

  1. It explains why the peer-review process is often focused on checking whether the results were significant instead of on the content of the paper
  2. It explains why reproducibility issues occur (e.g. rounding a p-value of 0.051 to 0.05)
5
Q

What are the different statements people use to report non-significant results (p>0.05) as if they were almost significant?

Not important to remember them; they are just examples of the lengths people will go to in order to mislead the reader into thinking the results are significant

A
  • a certain trend toward significance (p=0.08)
  • approached the borderline of significance (p=0.07)
  • just very slightly missed the significance level (p=0.086)
  • near-marginal significance (p=0.18)
  • only slightly non-significant (p=0.0738)
  • provisionally significant (p=0.073)
  • quasi-significant (p=0.09)
6
Q

What is the analogy of the conflict that goes on in researchers’ heads when they find non-significant results?

It’s a silly example, no need to remember it; he included it in the lecture more for fun than for actual learning

A

Picture 1

7
Q

What is the point of using p-value when it forces people to seek significant results at all costs?

A

Playing the devil’s advocate: how likely is (at least) this statistic if there were no difference in the population?

  • What if I’m not measuring a systematic difference in the population, but just random variation? → Is the difference to be expected if there is nothing else going on but, for example, random sampling?

If there were actually nothing going on, the probability of finding this result (or a more extreme one) would not be that high

8
Q

How did p-value come about? What did Fisher propose?

A

Significance testing!

  1. Formulate H0: the hypothesis to be ‘nullified’
  2. Report the exact level of significance (p-value), without further discussion about accepting or rejecting hypotheses (for the reader to decide how they want to interpret this value)
  3. Only do this if you know almost nothing about the subject
    ↪ ‘‘A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance’’
9
Q

What did Neyman & Pearson suggest as an alternative to Fisher’s approach?

A

They thought Fisher’s steps were less useful, as no clear alternative hypothesis is specified
Hypothesis testing!

  1. Formulate two statistical hypotheses and determine alpha, beta & sample size for the experiment in a deliberate way (based on the expected effect size) before collecting the data
  2. If the data fall into the rejection region of H1, accept H2. This does not mean that you believe H2, only that you behave as if H2 were true
  3. Only use this procedure if there is a clear disjunction & if a cost-benefit assessment is possible

So basically, we’re setting behavioural rules: even though we don’t know whether the H0 is true or not, we won’t be wrong very often if it is true and we won’t be wrong very often if it is false.

We can also put this in a frequency tree (picture 2)

10
Q

Since the two approaches didn’t agree with each other, what did we end up with? What is the issue with this?

A

We ended up with the null ritual

  1. Set up a statistical null hypothesis of “no mean difference” or “zero correlation.” Don’t specify the predictions of your research hypothesis or of any alternative substantive hypotheses
  2. Use 5% as a convention for rejecting the null. If significant, accept your research hypothesis
  3. Always perform this procedure

This approach also introduces many fallacies - we will discuss these in the next block

11
Q

What are 4 fallacies in statistical inference?

A
  1. P-values equal the probability that the (null) hypothesis is true
  2. Alpha equals the probability of making an error
  3. Failing to reject H0 is evidence for H0
  4. Power is irrelevant when results are significant
12
Q

P-values equal the probability that the (null) hypothesis is true

A
13
Q

Which kinds of probability do statements involving alpha, power and p-values relate to?

A

Statements in which alpha, power and p-values occur relate to:
1. Frequentist or objective probabilities
2. Conditional probabilities

14
Q

1. Frequentist probability

What is subjective probability?

A

Probability is the degree of belief that something is the case in the world

  • This expresses a degree of uncertainty: e.g. how sure are you that you have chosen the right answer to an MC question?
15
Q

1. Frequentist probability

What is objective probability?

A

Probability is the extent to which something IS the case in the world

  • These probabilities exist independently of our states of knowledge
  • This, for example, expresses a relative frequency in the long run: e.g., an infinite number of coin tosses (reference class or collective)
  • Probabilities need to be discovered by examining the world, not by reflecting on what we know or how much we believe
16
Q

What is a reference class or collective?

A

The hypothetical infinite set of events; the long-run relative frequency is a property of all the events in the collective taken together, not of any single event

  • It might be the set of all potential tosses of a coin using a certain tossing mechanism → a single toss of the coin (a singular event) doesn’t have a probability; only the collective of tosses has a probability
17
Q

1. Frequentist probability

What is the reason, according to the frequentist view of probability, why we cannot infer the probability of the null hypothesis from the p-value?

A

The null hypothesis is either true or it’s not, just as a single event either occurs or doesn’t

  • A hypothesis is not a collective, hence it does not have an objective probability
  • With p-values (Fisher) and the Neyman-Pearson paradigm we talk about objective probabilities
18
Q

1. Frequentist probability

What does the probability in the Neyman-Pearson paradigm apply to?

A

The probabilities (the long-term error rates) apply to our behaviour, not to the hypothesis itself: to whether we decide to reject it or not, and to how often we make an error over a long run of such decisions
➡ Objective probability

19
Q

2. Conditional probabilities

What is the difference between P(D|H) and P(H|D)?

A

P(D|H) = the probability of obtaining some data given a hypothesis
E.g. P(‘getting 5 threes in 25 rolls of a die’|‘I have a fair die’)

  • For this probability we can set up a relevant collective consisting of an infinite number of events: throwing a fair die 25 times and observing the number of threes
  • We can determine the proportion of such events in which the number of threes is 5 = a probability we can calculate

But we cannot calculate P(H|D) (e.g. the probability that the hypothesis that I have a fair die is true, given that I obtained 5 threes in 25 rolls), because there is no collective; the hypothesis is simply true or not
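A minimal Python sketch of this contrast (assuming SciPy is available; the numbers are the die example above):

```python
# P(D|H): the collective is "roll a fair die 25 times and count the threes",
# imagined repeated indefinitely -- a well-defined long-run frequency.
from scipy.stats import binom

p_data_given_h = binom.pmf(k=5, n=25, p=1/6)  # P('5 threes in 25 rolls' | 'fair die')
print(round(p_data_given_h, 3))  # ~0.178

# P(H|D) has no such collective: the die either is fair or it isn't,
# so no long-run frequency attaches to the hypothesis itself.
```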

20
Q

2. Conditional probabilities

Why, if we know P(D|H), do we not know P(H|D)?

A
  1. The inverse conditional probabilities can have very different values (example in the next flashcard)
  2. It is meaningless to assign an objective probability to a hypothesis
21
Q

2. Conditional probabilities

An example that shows that we can’t just reverse conditional probabilities

A

P(‘dying within two years’|‘head bitten off by a shark’) = 1
P(‘head was bitten off by a shark’|‘died in the last two years’) ≈ 0

22
Q

2. Conditional probabilities

What is the reason, according to conditional probabilities, why we cannot infer the probability of the null hypothesis from the p-value?

A

The p-value is a conditional probability, P(D|H0); the probability of the null hypothesis would be its inverse, P(H0|D), which we cannot obtain by simply reversing the conditional (and a hypothesis is not a collective)

23
Q

What is a counterexample?

A

Using the same argument structure but with different terms, to make it clear that the argument doesn’t hold

24
Q

2. Conditional probabilities

How can we use counterexamples to show that we can’t invert conditional probabilities and assume that the alternative hypothesis is true if H0 is false from a p-value?

Long flashcard but bear with me, it makes sense

A

P1) If H0 is true, probably not this data
P2) This data
C) H0 is not true

Counterexample:
P1) If someone is a Dutch national, (s)he probably doesn’t live in Amsterdam
P2) Sjinkus lives in Amsterdam
C) Sjinkus is not a Dutch national

This argument is not deductively valid (it only speaks about what is probable) and not forceful (the premises don’t provide strong enough evidence for the conclusion)

  • P(AMS|Dutch national) = 0.05
  • P(AMS|non-Dutch national, i.e. all the other people in the world) = 0.0001
    ↪ an extremely small probability, so we can’t conclude that Sjinkus is not a Dutch national just because he lives in Amsterdam
  • We have to compare the two probabilities to be able to draw valid conclusions, but with p-values we only look at the null hypothesis and say nothing about the likelihood of the alternative hypothesis (see the sketch below)
  • So living in AMS is much more likely assuming that someone is a Dutch national than assuming that someone is not a Dutch national
    ↪ the conclusion is more likely to be true if you compare it to a meaningful alternative hypothesis
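A minimal Python sketch of the comparison, using the card’s own numbers:

```python
# A single small P(D|H) licenses no conclusion; only comparing the
# likelihood of the data under both hypotheses does.
p_ams_given_dutch = 0.05        # P(lives in AMS | Dutch national)
p_ams_given_not_dutch = 0.0001  # P(lives in AMS | non-Dutch national)

likelihood_ratio = p_ams_given_dutch / p_ams_given_not_dutch
print(round(likelihood_ratio))  # 500: the data strongly favour 'Dutch national'
```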
25
Q

2. Conditional probabilities

What is a third way of understanding why we cannot infer that the null is false just from obtaining a p-value lower than alpha?

A

In an ideal (unrealistic) world we would know the base rate of our null hypothesis, the sensitivity (power) and the specificity.
In the real world we don’t know the base rate (picture 3), but we can go through a thought process in which we assign a base rate based on past knowledge from already conducted studies

26
Q

2. Conditional probabilities

Demonstrating the thought process with an example

Picture 4

A

We have 1000 hypotheses; for 100 the null turns out to be true and for 900 it turns out to be false. That’s our base rate - the probability of drawing one hypothesis at random and it being true or false

  • That is an objective interpretation = the base rate of the null hypothesis being true is the proportion of true null hypotheses vs false null hypotheses

We give a value to the sensitivity (0.8 is common in psychology) and the specificity (0.95 is common in psychology) and calculate the probability of observing a real effect given that we rejected the null hypothesis

Ex1: Base rate = 0.9
P(real effect|reject H0) = 0.99

The sensitivity and specificity remain constant
Ex2: Base rate = 0.1
P(real effect|reject H0) = 0.64

The probability changed depending on the base rate.
P(real effect|reject H0) > 0.5, so should the argument be forceful? No, because it also depends on the sensitivity and specificity, so if these change, the probability changes as well (picture 5; see also the sketch below)
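A minimal Python sketch of this frequency-tree calculation, with the numbers from the two examples above:

```python
def p_real_given_reject(base_rate, sensitivity=0.8, specificity=0.95):
    """P(real effect | reject H0) over the imagined collective of hypotheses."""
    true_pos = sensitivity * base_rate               # real effect, H0 rejected
    false_pos = (1 - specificity) * (1 - base_rate)  # no real effect, H0 rejected anyway
    return true_pos / (true_pos + false_pos)

print(round(p_real_given_reject(0.9), 2))  # 0.99 (Ex1: base rate = 0.9)
print(round(p_real_given_reject(0.1), 2))  # 0.64 (Ex2: base rate = 0.1)
```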

27
Q

2. Conditional probabilities

So why does the p-value ≠ the probability that the null hypothesis is true, according to the thought process?

A

Because rejection of the null is based only on the specificity, while inverting the conditional probability requires the sensitivity, the specificity and the base rate.

So in this argument we only have the specificity, not the sensitivity and the base rate:
P1) If H0 is true, probably not this data
P2) This data
C) H0 is not true

28
Q

Alpha equals the probability of making an error

A

Saying that the probability of making an error is alpha is not correct, because we don’t know whether the null hypothesis is true or not, and alpha assumes the null is true

  • It says nothing about the Type II error (the probability of making an error when H0 is false)
29
Q

If you’ve rejected H0 at alpha = 0.05, the probability that you’ve made an error is 5% - why is this also a fallacy?

A

Look at picture 6.
When we talk about rejecting H0, we look at the circled part of the frequency tree: all rejections, not just the rejections made when H0 is true
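A minimal sketch in numbers (the base rate is assumed for illustration): the error rate among all rejections is generally not alpha:

```python
alpha, power = 0.05, 0.80
p_h0_true = 0.9                      # assumed base rate of true nulls

false_pos = alpha * p_h0_true        # H0 true, rejected anyway
true_pos = power * (1 - p_h0_true)   # H0 false, correctly rejected

p_error_given_reject = false_pos / (false_pos + true_pos)
print(round(p_error_given_reject, 2))  # 0.36, not 0.05
```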

30
Q

Failing to reject H0 is evidence for H0

A

This is the same as saying that a non-significant result means that H0 is true

31
Q

Why did Neyman and Pearson introduce power in their analysis?

A

So that they could say something about the sensitivity (power) of their analysis

  • If we collect an infinite amount of data we will eventually reach significance; that’s why you want to look at the power of the test, to see whether the effect is actually there regardless of the number of observations

A strict application of their logic means setting the risks of both Type I and Type II errors (α and β) in advance, before collecting the data

32
Q

How do we control β, which we determine before collecting the data?

A
  1. Estimate the effect size we’re interested in
  2. Estimate the variance in the data

↪ Do this based on knowledge from past studies of the same concept, or do a pilot study

Once these two are determined, a power table can tell us how many participants we need in order to keep β at our predetermined level (see the sketch below)
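A minimal sketch of such a calculation (assuming statsmodels is installed; the effect size is an illustrative estimate):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size that keeps beta at 0.20 (power = 0.80)
n_per_group = TTestIndPower().solve_power(
    effect_size=0.5,  # estimated from past studies or a pilot
    alpha=0.05,
    power=0.80,       # so beta = 0.20
)
print(round(n_per_group))  # ~64 participants per group
```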

33
Q

What is the difference between absence of evidence and evidence of absence? And how does it explain the fallacy that failing to reject H0 is evidence for H0?

A

Absence of evidence - the experiment did not yield a conclusive result, perhaps because too few observations were taken
Evidence of absence - the experiment did yield a conclusive result, but it favours the null hypothesis
The p-value cannot discriminate between the two, even though evidence of absence offers much more evidence for H0

34
Q

What is the invalid argumentation logic behind replication and power?

A

P1) Study 1 finds an effect of size X with Z participants
P2) Study 2 is a direct replication of Study 1 with Z participants
C) Study 2 is sufficiently powered

35
Q

What should we do to make the replication study show the effect as well?

A

We should increase the number of participants in the second experiment to account for inflated significant results, sampling variability, subtle contextual differences, publication bias and regression to the mean

36
Q

Difference between power and sensitivity?

A

Power - the probability of correctly rejecting the null hypothesis when it is false (i.e. avoiding a Type II error); the ability to detect an effect if there is one
Sensitivity - the ability of a test to correctly identify true positives

  • Power applies to hypothesis testing broadly, whereas sensitivity relates to the performance of a (diagnostic) test
  • In hypothesis testing, power is analogous to sensitivity in that both refer to correctly identifying true positives, but they are used in slightly different contexts
37
Q

Why is checking for significance also an issue for replications?

A
  1. Because of a lack of sensitivity: underpowered studies make for inconclusive replication attempts (49% of replications were inconclusive but are often reported as conclusive failures to replicate)
  2. Because of a lack of differentiation: is the effect found in the replication meaningfully different from the original?
38
Q

What is the invalid argumentation logic behind the fallacy that failing to reject H0 is evidence for H0?

A

P1) Manipulation X has an effect
P2) There’s no significant difference between conditions in the degree to which participants noticed manipulation X
C) The effect of manipulation X was not noticed

If we dare to say that there is no effect, we have to state the sensitivity of our test (the higher the sensitivity, the higher the power)

39
Q

Power is irrelevant when results are significant

A
40
Q

What is the (invalid) argumentation logic behind the fourth fallacy?

A

P1) p < .05
C1) I have found an effect
P2) When I have found an effect, it is no longer relevant what the probability is that I find an effect if H0 is not true
C2) Power is not relevant

41
Q

Why should we report the effect size as well as the significance?

A

The informativeness of rejecting the null is affected by sample size and by power
↪ When power or sample size increases, so does P(real effect|reject H0), and vice versa

  • That’s the difference between making inferences about a hypothesis being true/false and deciding on a course of action (rejecting the null with an error rate established for the long run) → we didn’t show an effect, we rejected the null
  • Very small or unimportant effects will be statistically significant if sufficiently large amounts of data are collected, and very large and important effects will be missed if the sample size is too small
42
Q

Remember the statements from flashcard 3 (for the statements look at picture 6)

Why is each incorrect?

A

(1) and (3) - statistics never allow for absolute proof or disproof
(2) and (4) - refer to the probability of hypotheses, which cannot be correct since objective probability refers to a collective of events, not to the truth of a hypothesis
(5) - refers to the probability of a single event, which cannot be correct since objective probability doesn’t refer to single events
(6) - this is a description of power, not of significance

43
Q

What are stopping rules?

A

These rules define the conditions under which you will stop collecting data for a study

  • They should be defined beforehand in your sampling plan: how many participants you will run
44
Q

What are the different stopping rules and what are the issues with them?

A
  1. Run a first batch of participants and, if significance wasn’t reached, run additional participants - we are then doing two different significance tests, thus inflating the α level (see the simulation sketch below)
  2. Continue running until the test is significant = even assuming H0 is true, you will eventually obtain a ‘significant’ result if you collect data for long enough
    ↪ Although this rule has a power of 1, it also has an α of 1!
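A minimal simulation sketch of stopping rule 1 (all numbers illustrative): testing at n=20 per group and, if non-significant, again at n=40 inflates α above the nominal 0.05 even though H0 is true throughout:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_sims, false_positives = 5000, 0

for _ in range(n_sims):
    a, b = rng.normal(size=20), rng.normal(size=20)  # H0 true: same mean
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1
        continue
    # not significant: run 20 more participants per group and test again
    a = np.concatenate([a, rng.normal(size=20)])
    b = np.concatenate([b, rng.normal(size=20)])
    if ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(false_positives / n_sims)  # ~0.08 > 0.05: the alpha level is inflated
```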
45
Q

How does the Neyman-Pearson approach solve the issues with the stopping rules?

A

They came up with a standard stopping rule: use power calculations in advance to determine how many participants are needed to control power - that is, to fix our sampling plan

  • Both α and β can then be controlled at known, acceptable levels
46
Q

Why is inflation of alpha level a problem if we conduct two t-tests?

A

If we conduct one t-test at the 0.05 level, the probability that it is significant by chance alone is 0.05
If we conduct two t-tests, the probability that at least one is significant by chance alone is slightly less than 0.10 (1 − 0.95² = 0.0975)
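A one-line check of that arithmetic, using the familywise error rate 1 − (1 − α)^k for k independent tests:

```python
alpha = 0.05
for k in (1, 2, 5):
    print(k, round(1 - (1 - alpha) ** k, 4))
# 1 0.05 | 2 0.0975 (slightly less than 0.10) | 5 0.2262
```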

47
Q

What can we do to control Type I error in two t-tests?

A

In order to control the Type I error when we perform a number of tests, we need to test each one at a stricter level of significance in order to keep the overall alpha level at 0.05

  • This is done by applying a correction, e.g. Bonferroni, which conducts each individual test at the 0.05/k level of significance (k = the number of hypotheses/comparisons); the overall alpha will then be no higher than 0.05 (see the sketch below)
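A minimal sketch of the Bonferroni correction (assuming statsmodels is installed; the p-values are illustrative):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.030, 0.041, 0.012]  # k = 3 illustrative p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="bonferroni")
print(reject)  # only p < 0.05/3 ≈ 0.0167 survives: [False False  True]
print(p_adj)   # each p multiplied by k (capped at 1): [0.09  0.123 0.036]
```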