Topic 6: Hypothesis testing and false-discovery rates Flashcards

1
Q

What is hypothesis testing?

A

Hypothesis testing addresses decision problems (usually about comparisons)
in statistical inference.
- We want to draw inferences about underlying data-generating processes.
- But we have access to the processes only through random samples.
- Are the observations we make due to different processes or to sampling noise?

The procedure is something like this:

  1. Vague idea
    - Men are taller than women
  2. Precise hypotheses
    • H0 - null hypothesis:
      The mean heights of men and women are equal.
    • H1 - alternative hypothesis:
      The mean height of men is larger than that of women.
  3. Gather data
    Collecting unbiased samples is essential.
  4. Perform a statistical test to reject or fail to reject the null hypothesis

Type I and Type II errors:
- Type I error (false positive): rejecting H0 when it is actually true.
- Type II error (false negative): failing to reject H0 when it is actually false.
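Step 4 of the procedure above can be sketched in code. This is a minimal large-sample one-sided z-test on made-up height data; the sample sizes, means, and spread are purely illustrative, and a real analysis would use a proper t-test.

```python
import math
import random

random.seed(42)
# Hypothetical height samples in cm (means and spread are made up):
men = [random.gauss(178, 7) for _ in range(200)]
women = [random.gauss(165, 7) for _ in range(200)]

def one_sided_z_test(a, b):
    """Large-sample z-test of H0: equal means vs H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    p = 0.5 * math.erfc(z / math.sqrt(2))           # one-sided upper-tail p-value
    return z, p

z, p = one_sided_z_test(men, women)
# p is far below alpha = 0.05 here, so we reject H0
```

With such a large simulated difference in means, the p-value is essentially zero and H0 is rejected.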

2
Q

What is large-scale testing?

A

Classical testing theory deals with a single test (N = 1). In large-scale testing, N can be in the hundreds, thousands, or even millions of test statistics.

Some properties of large-scale testing:
- Running thousands of hypothesis tests at a fixed level will produce hundreds of false positives (Type I errors).
- But the distribution of test statistics may suggest there are some interesting outliers.
- Some results stand out from the “null hypothesis distribution”, indicating effects worth investigating, so we should try to separate the false positives from the true positives.
- If we ignore these outliers, we’ll make many Type II errors (false negatives) and lose power.
- We strike a balance between Type I and Type II errors using methods such as the Holm-Bonferroni procedure or FDR control.
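The first property above is easy to see by simulation. Under H0 a p-value is uniform on [0, 1], so we can fake a purely-null large-scale experiment without any real data (N and alpha below are arbitrary choices):

```python
import random

random.seed(0)
N = 10_000    # number of tests; here EVERY null hypothesis is true
alpha = 0.05

# Under H0, each p-value is uniform on [0, 1].
p_values = [random.random() for _ in range(N)]
false_positives = sum(p < alpha for p in p_values)
# false_positives is roughly alpha * N = 500: hundreds of Type I errors,
# even though there is no real effect anywhere
```

Every one of those rejections is a false discovery, which is exactly why uncorrected thresholds break down at scale.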

3
Q

What is the Bonferroni bound?

A

A very common workaround for multiple testing:
- Divide the significance level α by the number of tests N,
- test each hypothesis at level α/N.

This guarantees that the family-wise error rate is bounded at level α, but Bonferroni is VERY conservative for large N.
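The rule fits in a few lines. A minimal sketch (the helper name and the example p-values are made up for illustration):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Test each hypothesis at level alpha / N; bounds the FWER at alpha."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

# With N = 10 tests the per-test threshold drops to 0.05 / 10 = 0.005,
# so only the two smallest p-values below survive:
rejections = bonferroni_reject([0.001, 0.004, 0.006, 0.03, 0.2,
                                0.5, 0.5, 0.5, 0.5, 0.5])
```

Note that p = 0.006 would be significant at level 0.05 on its own, but is lost to the correction: this is the conservatism in action.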

4
Q

Describe Holm’s procedure

A

This is a procedure to control the Family-wise error rate (FWER) at level alpha. It’s uniformly more powerful than the Bonferroni bound.

How it works:

  1. Order the observed p-values from smallest to largest:

     p(1) ≤ p(2) ≤ … ≤ p(N),

     where H_0(i) denotes the null hypothesis corresponding to the ith smallest p-value.
  2. Let i_0 be the smallest index i such that the ith p-value exceeds its adjusted threshold:

     p(i) > α/(N - i + 1)

     This means we can stop at the first “failure”, i.e. the first p-value larger than its adjusted threshold.
  3. Reject all null hypotheses H_0(i) with i < i_0 and accept all with i ≥ i_0. So, reject all null hypotheses ranked before i_0, and accept all ranked at or after i_0.

It can be shown that Holm’s procedure controls the FWER at level α while being more generous than Bonferroni about when to declare a rejection: instead of one fixed threshold α/N, the threshold α/(N - i + 1) is recomputed at every rank.
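The step-down logic can be sketched directly from the three steps above (function name and example p-values are illustrative):

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: controls FWER at level alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # ranks, smallest p first
    reject = [False] * n
    for rank, idx in enumerate(order):
        threshold = alpha / (n - rank)     # alpha / (N - i + 1) with i = rank + 1
        if p_values[idx] > threshold:      # first "failure": stop here (this is i_0)
            break
        reject[idx] = True                 # reject everything ranked before i_0
    return reject

# All four survive the sequential thresholds 0.0125, 0.0167, 0.025, 0.05,
# whereas plain Bonferroni (fixed threshold 0.0125) would reject only two:
print(holm_reject([0.001, 0.011, 0.02, 0.04]))  # → [True, True, True, True]
```

The example shows the power gain: p = 0.02 and p = 0.04 fail the Bonferroni cutoff α/N = 0.0125 but pass their Holm thresholds.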

5
Q

Describe Benjamini-Hochberg procedure

A

Benjamini-Hochberg controls the FDR instead of the FWER. Quick definitions:
  • i is the rank of a p-value
  • N is the total number of tests
  • q is the desired FDR level (like 0.05 or 0.10)

How it works:

  1. Order the observed p-values from smallest to largest:

     p(1) ≤ p(2) ≤ … ≤ p(N)

  2. For each rank i, compute the threshold (i/N)q and check whether p(i) ≤ (i/N)q.
  3. Let i_max be the largest i for which p(i) ≤ (i/N)q.
  4. Reject all null hypotheses with p-values ≤ p(i_max); accept all others.
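A direct translation of the four steps (function name and example p-values are made up):

```python
def bh_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: controls the FDR at level q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # smallest p first
    i_max = -1
    for rank, idx in enumerate(order):
        if p_values[idx] <= (rank + 1) / n * q:   # p(i) <= (i/N) q, with i = rank + 1
            i_max = rank                          # remember the LARGEST passing rank
    reject = [False] * n
    for rank in range(i_max + 1):                 # reject ranks 1 .. i_max
        reject[order[rank]] = True
    return reject

# Step-up behaviour: p = 0.039 fails its own threshold (3/5)*0.05 = 0.03,
# but is still rejected because p = 0.040 passes at rank 4:
print(bh_reject([0.001, 0.008, 0.039, 0.040, 0.6]))
# → [True, True, True, True, False]
```

This "step-up" rescue of intermediate p-values is what distinguishes BH from the step-down Holm procedure.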
6
Q

Describe family-wise error rate

A

Probability of making at least one false rejection. The goal is to avoid any false positives, and it is highly strict: a single false positive already counts as a family-wise error. You use this e.g. in a clinical trial, where making even one false rejection is unacceptable.

A level-α test (significance level/threshold) of a single null hypothesis H_0 satisfies, by definition:

P(reject H_0 | H_0 is true) ≤ α

For a collection of N null hypotheses H_0i, the FWER (family-wise error rate) is the probability of making even one false rejection, i.e. the probability of making at least one Type I error:

FWER = P(reject at least one true H_0i)

We want to control the FWER, i.e. the probability of making even one false positive rejection among the N comparable hypothesis tests.

We can do this by using Bonferroni’s procedure or Holm’s procedure. We would prefer Holm’s procedure, as it’s more flexible.

FWER control was originally developed for small-scale testing, such as N ≤ 20, as it proves too conservative when N is in the thousands.
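To see why control matters, here is a sketch of how the uncorrected FWER grows with N for independent tests, using the standard identity 1 - (1 - α)^N (the chosen values of N are arbitrary):

```python
def fwer_uncorrected(n, alpha=0.05):
    """FWER of n independent level-alpha tests with no correction: 1 - (1 - alpha)^n."""
    return 1 - (1 - alpha) ** n

# The family-wise error rate explodes as N grows if we do not correct;
# Bonferroni's alpha/N thresholds keep it bounded at alpha instead.
for n in (1, 5, 20, 100):
    print(n, round(fwer_uncorrected(n), 3))
# With N = 20 uncorrected tests, the chance of at least one false
# rejection is already about 0.64, far above alpha = 0.05.
```

This growth is exactly why FWER methods feel conservative at large N: holding the whole family at α forces tiny per-test thresholds.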

7
Q

Describe false discovery rate

A

Proportion of false rejections among all rejections. The goal is to control the ratio of false positives; it is less strict than FWER control, as it allows a small proportion of errors.

Suppose a decision rule D has rejected R out of N null hypotheses; a of these decisions were incorrect, i.e. they were “false discoveries”, while b = R - a of them were “true discoveries”.

The false-discovery proportion Fdp equals a/R (taken to be 0 when R = 0).

The false discovery rate (FDR) is the expected proportion of false positives among all positive results:

FDR(D) = E[Fdp(D)]

This means that the FDR of our decision procedure is the expected value of the proportion of false discoveries.

  • D is the decision procedure (like a hypothesis test or classification rule)
  • Fdp stands for false-discovery proportion

Think of it like this:

  • FWER: “I want to be 95% sure I make NO false discoveries”
  • FDR: “I’m okay if 5% of my discoveries are false”
  • Therefore: FDR control is less conservative than FWER control.
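In a simulation where the ground truth is known, the Fdp of a single run is just a ratio of counts. A minimal sketch (function name and the toy inputs are made up):

```python
def false_discovery_proportion(rejected, is_null):
    """Fdp = a / R: the share of rejections that hit a true null hypothesis."""
    R = sum(rejected)
    if R == 0:
        return 0.0    # usual convention: Fdp = 0 when nothing is rejected
    a = sum(r and null for r, null in zip(rejected, is_null))
    return a / R

# 3 rejections, of which 1 hit a true null hypothesis: Fdp = 1/3.
fdp = false_discovery_proportion(rejected=[True, True, True, False],
                                 is_null=[False, False, True, True])
```

The FDR is then the average of this quantity over repeated experiments, which is what Benjamini-Hochberg keeps below q.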
8
Q

Describe two-group model

A

Goal:

  • To separate the hypotheses into nulls (prior probability π_0) and alternatives/non-nulls (prior probability π_1) in large-scale hypothesis testing.
  • The two-groups model helps identify significant results in large-scale testing by modeling the test statistics as a mixture of a null and a non-null distribution, and setting a decision threshold to control the false discovery rate (FDR): that is, it determines which results are truly significant while limiting the proportion of false positives we accept.

A simple Bayesian framework for simultaneous testing is provided by the two-groups model:

Each of the N cases is either null with prior probability π_0 (the null hypothesis being true) or non-null with prior probability π_1 = 1 - π_0 (the alternative hypothesis being true).

The resulting observation z has density either f_0(z) or f_1(z).

The test statistics where H_0 is true are drawn iid. (independent and identically distributed) from density f_0(z) with cdf (cumulative distribution function) F_0(z).

The test statistics where H_1 is true are drawn iid. from density f_1(z) with cdf F_1(z).

The mixture density is:

f(z) = π_0 f_0(z) + π_1 f_1(z)
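Sampling from the model makes the mixture concrete. In this sketch the choices π_0 = 0.9, f_0 = N(0, 1), and f_1 = N(3, 1) are purely illustrative assumptions, not values from the course:

```python
import random

random.seed(1)
pi0 = 0.9      # assumed prior probability that a case is null
N = 50_000

def draw_z():
    # Null cases: z ~ f0 = N(0, 1); non-null cases: z ~ f1 = N(3, 1)
    # (the N(3, 1) alternative is an arbitrary illustrative effect).
    if random.random() < pi0:
        return random.gauss(0.0, 1.0)
    return random.gauss(3.0, 1.0)

zs = [draw_z() for _ in range(N)]
# Sanity check against the mixture f(z) = pi0*f0(z) + pi1*f1(z):
# the sample mean should sit near pi0*0 + 0.1*3 = 0.3.
mean_z = sum(zs) / N
```

Most draws look like plain N(0, 1) noise; the small non-null fraction shifts the right tail, which is exactly where FDR-controlling thresholds look for discoveries.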

9
Q

Explain how each of the covered hypothesis testing methods work and discuss in which cases they would be applicable, and which one would be most appropriate in specific cases

A
  1. Classical Single Hypothesis Testing
    • How it works: Tests one hypothesis at a time with a predefined significance level α
    • When to use: Best for single comparisons (N=1) like comparing means between two groups
    • Example case: Testing if men are taller than women (single comparison)
  2. Bonferroni Procedure
    • How it works: Divides significance level α by number of tests N, testing each at α/N
    • When to use:
      • Small number of tests (N ≤ 20)
      • When absolutely avoiding false positives is crucial
    • Limitation: Very conservative for large N, leading to loss of power
  3. Holm’s Procedure
    • How it works:
      • Orders p-values from smallest to largest
      • Uses sequential thresholds α/(N-i+1)
      • Stops at the first failure and rejects all hypotheses before that point
    • When to use:
      • Multiple comparisons where FWER control is needed
      • Better choice than Bonferroni as it’s uniformly more powerful
      • Still suitable for relatively small N
  4. Benjamini-Hochberg Procedure (BH)
    • How it works:
      • Orders p-values smallest to largest
      • Compares each p-value to the (i/N)q threshold
      • Rejects all hypotheses up to the largest i where p(i) ≤ (i/N)q
    • When to use:
      • Large-scale testing (hundreds or thousands of tests)
      • When some false positives are acceptable
      • When a balance between Type I and Type II errors is needed

Most Appropriate Method for Specific Cases:

  1. Single Important Comparison (N=1)
    - Use: Classical hypothesis testing
    - Why: No need for multiple testing correction
  2. Small Number of Critical Tests (N ≤ 20)
    - Use: Holm’s procedure
    - Why: Controls FWER while being more powerful than Bonferroni
  3. Large-Scale Testing (N in thousands)
    - Use: Benjamini-Hochberg procedure
    - Why:
    • Better balance between false positives and false negatives
    • More appropriate than FWER methods for large N
    • Accepts some false discoveries for greater power
  4. Zero Tolerance for False Positives
    - Use: Bonferroni correction
    - Why: Most conservative approach, minimizes false positives
    - Note: Only if power loss is acceptable