Topic 6: Hypothesis testing and false-discovery rates Flashcards

1
Q

What is a hypothesis test?

A

Hypothesis testing addresses decision problems (usually about comparisons)
in statistical inference.
- We want to draw inferences about underlying data-generating processes.
- But we have access to the processes only through random samples.
- Are the observations we make due to different processes or to sampling noise?

Type I and Type II error: a Type I error is rejecting a true null hypothesis (false positive); a Type II error is failing to reject a false null hypothesis (false negative).

2
Q

What is large-scale testing?

A

In large-scale testing, we have observed a large number of test statistics, N. How should we decide which (if any) of the null hypotheses to reject?

3
Q

What is the Bonferroni bound?

A

A very common workaround for multiple testing:
- Divide the significance level α by the number of tests N,
- test each hypothesis at level α/N.

This guarantees that the family-wise error rate is bounded at level α, but Bonferroni is VERY conservative for large N.
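The rule above can be sketched in a few lines of Python (the function name and the example p-values are illustrative, not from the course material):

```python
# Bonferroni correction: test each of N hypotheses at the adjusted
# significance level alpha / N instead of alpha.

def bonferroni_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True = reject the null."""
    n = len(p_values)
    threshold = alpha / n  # adjusted per-test level alpha / N
    return [p <= threshold for p in p_values]

# With N = 4 tests the per-test threshold is 0.05 / 4 = 0.0125,
# so only the first p-value leads to a rejection.
print(bonferroni_reject([0.001, 0.02, 0.04, 0.3]))
```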

4
Q

Describe Holm’s procedure

A

This is a procedure that controls the family-wise error rate (FWER) at level α. It is uniformly more powerful than the Bonferroni bound.

Procedure:

  1. Order the observed p-values from smallest to largest:

    $p_{(1)} \le p_{(2)} \le \cdots \le p_{(N)}$,

where $H_{0(i)}$ denotes the null hypothesis corresponding to the $i$th ordered p-value.

  2. Let $i_0$ be the smallest index $i$ such that the $i$th ordered p-value exceeds the adjusted threshold:

    $p_{(i)} > \dfrac{\alpha}{N - i + 1}$

This means we can stop at the first “failure”, i.e., the first rank at which the p-value is larger than its adjusted threshold.

  3. Reject all null hypotheses $H_{0(i)}$ with $i < i_0$ and accept all with $i \ge i_0$. So: reject every null hypothesis ranked before $i_0$, and accept every null hypothesis ranked at or after $i_0$.

It can be shown that Holm’s procedure controls the FWER at level α while being more generous than Bonferroni about when to declare a rejection: the threshold $\alpha/(N-i+1)$ grows with the rank $i$, whereas Bonferroni applies the fixed threshold $\alpha/N$ at every rank.
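The three steps above can be sketched in Python (function name and example inputs are illustrative):

```python
# Holm's step-down procedure: sort the p-values, compare the i-th smallest
# to alpha / (N - i + 1), and stop at the first p-value that exceeds its
# threshold; everything ranked before that point is rejected.

def holm_reject(p_values, alpha=0.05):
    """Return one boolean per hypothesis: True = reject the null."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * n
    for rank, idx in enumerate(order):  # rank 0 corresponds to i = 1
        if p_values[idx] > alpha / (n - rank):  # threshold alpha / (N - i + 1)
            break  # first "failure": accept this and every larger p-value
        reject[idx] = True
    return reject

# 0.001 <= 0.05/4 and 0.01 <= 0.05/3, but 0.04 > 0.05/2, so Holm stops there.
print(holm_reject([0.001, 0.04, 0.01, 0.3]))
```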

5
Q

Describe the Benjamini-Hochberg procedure

A
  1. Order the observed p-values from smallest to largest:

    $p_{(1)} \le p_{(2)} \le \cdots \le p_{(N)}$

  2. Find the critical threshold
    - For each ordered p-value $p_{(i)}$, check whether it satisfies $p_{(i)} \le \frac{i}{N} q$, and let

    $i_{\max} = \max\{\, i : p_{(i)} \le \tfrac{i}{N} q \,\}$

  • i is the rank of the p-value
  • N is the total number of tests
  • q is our desired FDR level (like 0.05 or 0.10)
  • $i_{\max}$ is the largest i where p(i) ≤ (i/N)q
  3. Decision rule
    - Reject all null hypotheses for tests with p-values ≤ $p_{(i_{\max})}$
    - Accept all others
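The procedure can be sketched in Python (function name and example p-values are illustrative):

```python
# Benjamini-Hochberg step-up procedure at FDR level q: find the largest
# rank i_max with p_(i) <= (i / N) * q, then reject the i_max smallest
# p-values.

def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per hypothesis: True = reject the null."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices, smallest p first
    i_max = 0  # largest 1-based rank satisfying p_(i) <= (i / N) * q
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            i_max = rank
    reject = [False] * n
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= i_max
    return reject

# p_(3) = 0.04 fails its threshold (3/4) * 0.05 = 0.0375, but the two
# smaller p-values both clear theirs and are rejected.
print(benjamini_hochberg([0.001, 0.012, 0.04, 0.3]))
```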
6
Q

Describe family-wise error rate

A

The level-α test (significance level/threshold) for a single null hypothesis H_0 satisfies, by definition:

$\Pr\{\text{reject } H_0 \mid H_0 \text{ true}\} \le \alpha$

For a collection of N null hypotheses, H_{0i}, the FWER (family-wise error rate) is the probability of making even one false rejection, i.e., the probability of making at least one Type I error:

$\text{FWER} = \Pr\{\text{reject at least one true } H_{0i}\} \le \alpha$

We want to control the FWER, i.e., the probability of making even one false positive rejection among the N hypothesis tests.

We can do this by using Bonferroni’s procedure or Holm’s procedure. We would prefer Holm’s procedure, as it’s more flexible.

FWER control was originally developed for small-scale testing, such as N ≤ 20, as it proved too conservative for N in the thousands.
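To see why uncorrected testing is a problem, a quick sanity check: for N independent level-α tests with all nulls true, the FWER is 1 − (1 − α)^N (the standard formula for the independent case, stated here as an illustration):

```python
# FWER for N independent level-alpha tests when every null is true:
# P(at least one false rejection) = 1 - (1 - alpha)^N.

def fwer_independent(alpha, n):
    return 1 - (1 - alpha) ** n

print(fwer_independent(0.05, 20))        # ~0.64: uncorrected testing blows up
print(fwer_independent(0.05 / 20, 20))   # Bonferroni keeps it under 0.05
```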

7
Q

Describe false discovery rate

A

A decision rule D has rejected R out of N null hypotheses; a of these decisions were incorrect, i.e., they were “false discoveries,” while b of them were “true discoveries.”

The false-discovery proportion Fdp equals a/R (defined as 0 when R = 0).

The False Discovery Rate (FDR) is the expected proportion of false positives among all positive results.

$\text{FDR}(D) = E\{\text{Fdp}(D)\}$

This means that the FDR of our decision procedure (e.g., Holm or Benjamini-Hochberg) is the expected value of the proportion of false discoveries.

  • D is your decision procedure (like a hypothesis test or classification rule)
  • Fdp stands for False discovery proportion

Think of it like this:

  • FWER: “I want to be 95% sure I make NO false discoveries”
  • FDR: “I’m okay if 5% of my discoveries are false”
  • Therefore: FDR control is less conservative than FWER control.
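The false-discovery proportion from the definition above can be computed directly (the function name and example labels are illustrative):

```python
# False-discovery proportion Fdp = a / R: the share of rejections that
# were false discoveries (defined as 0 when there are no rejections).

def false_discovery_proportion(rejected, truly_null):
    r = sum(rejected)  # R: total number of rejections
    if r == 0:
        return 0.0
    # a: rejections where the null was actually true (false discoveries)
    a = sum(1 for rej, null in zip(rejected, truly_null) if rej and null)
    return a / r

# Three rejections, one of which hit a true null: Fdp = 1/3.
print(false_discovery_proportion([True, True, True, False],
                                 [True, False, False, True]))
```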
8
Q

Describe the two-groups model

A

A simple Bayesian framework for simultaneous testing is provided by the two-groups model:

Each of the N cases is either null with prior probability π_0 (null hypothesis true) or non-null with probability π_1 = 1 − π_0 (alternative hypothesis true).

The resulting observation z has density either f_0(z) or f_1(z):

$z \sim \begin{cases} f_0(z) & \text{with probability } \pi_0 \\ f_1(z) & \text{with probability } \pi_1 \end{cases}$

The test statistics for which H_0 is true are drawn i.i.d. (independent and identically distributed) from density f_0(z) with cdf (cumulative distribution function) F_0(z).

The test statistics where H_1 is true are drawn iid. from density f_1(z) with cdf F_1(z).

The mixture density is:

$f(z) = \pi_0 f_0(z) + \pi_1 f_1(z)$

Goal:

  • To separate the hypotheses into nulls (prior probability π_0) and non-nulls (π_1) in large-scale hypothesis testing.
  • The Two-Groups Model helps identify significant results in large-scale testing by modeling test statistics as a mixture of null and non-null distributions and setting a decision threshold
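The model is easy to simulate. A sketch with illustrative choices (π_0 = 0.9, f_0 = N(0, 1), f_1 = N(3, 1) are my picks, not from the course material):

```python
import random

# Two-groups model: each case is null with probability pi0 (z drawn from
# f0, here N(0, 1)) or non-null with probability 1 - pi0 (z drawn from
# f1, here N(3, 1)).
def sample_two_groups(n, pi0=0.9, seed=0):
    rng = random.Random(seed)
    zs, is_null = [], []
    for _ in range(n):
        null = rng.random() < pi0
        zs.append(rng.gauss(0.0 if null else 3.0, 1.0))
        is_null.append(null)
    return zs, is_null

zs, is_null = sample_two_groups(10_000)
print(sum(is_null) / len(is_null))  # close to pi0 = 0.9
```

The observed z-values follow the mixture density f(z) = π_0 f_0(z) + π_1 f_1(z); the labels are only available because we simulated the data.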
9
Q

Explain how each of the covered hypothesis testing methods work and discuss in which cases they would be applicable, and which one would be most appropriate in specific cases

A
  1. Classical Single Hypothesis Testing
    • How it works: Tests one hypothesis at a time with a predefined significance level α
    • When to use: Best for single comparisons (N=1) like comparing means between two groups
    • Example case: Testing if men are taller than women (single comparison)
  2. Bonferroni Procedure
    • How it works: Divides significance level α by number of tests N, testing each at α/N
    • When to use:
      • Small number of tests (N ≤ 20)
      • When absolutely avoiding false positives is crucial
    • Limitation: Very conservative for large N, leading to loss of power
  3. Holm’s Procedure
    • How it works:
      • Orders p-values from smallest to largest
      • Uses sequential thresholds α/(N-i+1)
      • Stops at first failure and rejects all hypotheses before that point
    • When to use:
      • Multiple comparisons where FWER control is needed
      • Better choice than Bonferroni as it’s uniformly more powerful
      • Still suitable for relatively small N
  4. Benjamini-Hochberg Procedure (BH)
    • How it works:
      • Orders p-values smallest to largest
      • Compares each p-value to (i/N)q threshold
      • Rejects all hypotheses up to largest i where p(i) ≤ (i/N)q
    • When to use:
      • Large-scale testing (hundreds or thousands of tests)
      • When some false positives are acceptable
      • When a balance between Type I and II errors is needed

Most Appropriate Method for Specific Cases:

  1. Single Important Comparison (N=1)
    - Use: Classical hypothesis testing
    - Why: No need for multiple testing correction
  2. Small Number of Critical Tests (N ≤ 20)
    - Use: Holm’s procedure
    - Why: Controls FWER while being more powerful than Bonferroni
  3. Large-Scale Testing (N in thousands)
    - Use: Benjamini-Hochberg procedure
    - Why:
    • Better balance between false positives and false negatives
    • More appropriate than FWER methods for large N
    • Accepts some false discoveries for greater power
  4. Zero Tolerance for False Positives
    - Use: Bonferroni correction
    - Why: Most conservative approach, minimizes false positives
    - Note: Only if power loss is acceptable