Hypothesis testing Flashcards
Background for multiple hypothesis testing
“For many statisticians, microarrays provided an introduction to large-scale data analysis. These were revolutionary biomedical devices that enabled the assessment of individual activity for thousands of genes at once—and, in doing so, raised the need to carry out thousands of simultaneous hypothesis tests, done with the prospect of finding only a few interesting genes among a haystack of null cases.”
Large-scale testing
Type 1 error: running 100 separate hypothesis tests at significance level 0.05 will produce about five “significant” results even if each case is actually null.
The prostate cancer data, Figure 3.4, came from a microarray study of n = 102 men: 52 prostate cancer patients and 50 normal controls. Each man’s gene expression levels were measured on a panel of N = 6033 genes, yielding a 6033 × 102 matrix of measurements.
Even though most of the genes appear null, the discrepancies from the curve suggest that there are some non-null cases—the kind the investigators hoped to find.
Large-scale testing refers exactly to this situation: having observed a large number N of test statistics, how should we decide which, if any, of the null hypotheses to reject?
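The Type 1 error arithmetic above is easy to check with a small simulation (my sketch, not from the text): when all 100 null hypotheses are true, p-values are uniform on [0, 1], so about 100 × 0.05 = 5 of them fall below 0.05 in each batch.

```python
import random

random.seed(1)

# Simulate batches of 100 tests in which every null hypothesis is true:
# under the null, each p-value is uniform on [0, 1].
n_tests = 100
alpha = 0.05
trials = 2000

false_positives = []
for _ in range(trials):
    p_values = [random.random() for _ in range(n_tests)]
    false_positives.append(sum(p < alpha for p in p_values))

# On average about alpha * n_tests = 5 true nulls are declared "significant".
mean_fp = sum(false_positives) / trials
print(round(mean_fp, 2))
```

The per-batch count is Binomial(100, 0.05), so its long-run average sits very close to 5 even though no gene has any real effect.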
classical Bonferroni bound
“The classical Bonferroni bound avoids this fallacy by strengthening
the threshold of evidence required to declare an individual case significant
(i.e., non-null).”
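A minimal sketch of the Bonferroni rule (the helper name is mine): each of the N individual tests is held to the stricter threshold α/N, rather than α.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0_i only if p_i <= alpha / N; this controls FWER at alpha."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

# With N = 100 tests, an individual p-value must fall below 0.05/100 = 0.0005.
flags = bonferroni_reject([0.0001, 0.002, 0.04] + [0.5] * 97)
print(flags[:3])  # [True, False, False]: only the first clears 0.0005
```

Note how 0.002 and 0.04, both comfortably “significant” one at a time, fail the family-wide standard.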
family-wise error rate FWER
“Classical hypothesis testing is usually phrased in terms of significance levels and p-values.”
A level-α test for a single null hypothesis satisfies: α is the probability of rejecting the null hypothesis when it is in fact true.
the family-wise error rate is the probability of making even one false rejection
“The FWER criterion aims to control the probability of making even one false rejection among N simultaneous hypothesis tests. Originally developed for small-scale testing, say N = 20, FWER usually proved too conservative for scientists working with N in the thousands.”
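That Bonferroni’s α/N threshold controls FWER follows from the union bound; a one-line sketch, writing N₀ for the set of true null hypotheses:

```latex
\mathrm{FWER}
= \Pr\Big(\bigcup_{i \in \mathcal{N}_0} \{ p_i \le \alpha/N \}\Big)
\le \sum_{i \in \mathcal{N}_0} \Pr\big(p_i \le \alpha/N\big)
\le N \cdot \frac{\alpha}{N} = \alpha
```

No independence is needed here—the union bound holds for any dependence among the p-values, which is part of why Bonferroni is so conservative.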
Holm’s procedure
“It can be shown that Holm’s procedure controls FWER at level alpha, while
being slightly more generous than Bonferroni in declaring rejections.”
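A sketch of Holm’s step-down procedure (my implementation, assuming the usual formulation): sort the p-values and compare the i-th smallest to α/(N − i + 1), stopping at the first failure.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: compare the i-th smallest p-value to
    alpha / (N - i + 1), stopping at the first failure. Controls FWER at alpha."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    reject = [False] * n
    for step, i in enumerate(order):          # step = 0, 1, ..., n-1
        if p_values[i] <= alpha / (n - step):  # thresholds a/N, a/(N-1), ...
            reject[i] = True
        else:
            break
    return reject

p = [0.001, 0.010, 0.020, 0.800]
print(holm_reject(p))  # [True, True, True, False]
```

With these four p-values, Bonferroni at α = 0.05 would reject only the two below 0.05/4 = 0.0125; Holm’s relaxing thresholds also pick up 0.020—the “slightly more generous” behavior the passage describes.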
False discovery rates
“The false discovery rate (FDR) is the expected proportion of false positives (incorrectly rejected null hypotheses) among all rejected hypotheses. It is a less stringent error control method than the family-wise error rate (FWER), making it well-suited for large-scale hypothesis testing.
Suppose R null hypotheses have been rejected, a of them being cases of false discovery, i.e., valid null hypotheses, for a “false-discovery proportion” of Fdp(D) = a / R, where D denotes the decision procedure.
Benjamini–Hochberg FDR Control
“That is, we expect that most cases are null, putting π₀ very near 1. The popularity of FDR control hinges on the fact that it is more generous than FWER in declaring significance.”
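A sketch of the Benjamini–Hochberg step-up procedure (my implementation of the standard rule): find the largest rank i with p₍ᵢ₎ ≤ (i/N)·q and reject the i hypotheses with the smallest p-values; under independence this controls FDR at level q.

```python
def bh_reject(p_values, q=0.10):
    """Benjamini-Hochberg: find the largest rank i with p_(i) <= (i/N) * q,
    then reject the i smallest p-values. Controls FDR at q under independence."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    i_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * q:
            i_max = rank                      # step-up: keep the largest such rank
    reject = [False] * n
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= i_max
    return reject

p = [0.001, 0.012, 0.035, 0.040, 0.90]
print(bh_reject(p))  # [True, True, True, True, False]
```

On these five p-values BH at q = 0.10 rejects four hypotheses, while Holm at α = 0.05 stops after two—illustrating why FDR control is the more generous criterion.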
Comparison of Holm and BH procedures
Criticism of FDR:
“A critic, noting FDR’s relaxed rejection standards in Figure 15.3, might raise some pointed questions.
The control theorem depends on independence among the p-values. Isn’t this unlikely in situations such as the prostate study? “
Empirical Bayes large-scale testing (two-groups model)
“Bayesian methods, at least in their empirical Bayes manifestation, no longer demand heroic modeling efforts, and can help untangle the interpretation of simultaneous test results.”
Definition: a statistical approach for hypothesis testing in large-scale studies (e.g., genomics, proteomics), where each hypothesis belongs to one of two groups: the null group (no effect) or the alternative group (nonzero effect). The method uses empirical Bayes to estimate the proportion of null hypotheses and the distributions of the test statistics under both groups.
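A numerical sketch of the two-groups model (the densities and π₀ here are illustrative assumptions, not the book’s fitted values): test scores z follow the mixture f(z) = π₀·f₀(z) + (1 − π₀)·f₁(z), and the posterior probability that a given case is null is the local false-discovery rate π₀·f₀(z)/f(z).

```python
import math

# Two-groups model: f(z) = pi0 * f0(z) + (1 - pi0) * f1(z).
# Illustrative choices: 95% nulls, N(0, 1) null density, N(3, 1) alternative.
pi0 = 0.95
def f0(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)        # null: N(0, 1)
def f1(z):
    return math.exp(-(z - 3) ** 2 / 2) / math.sqrt(2 * math.pi)  # alt: N(3, 1)

def local_fdr(z):
    """Posterior probability that a case with score z is null:
    fdr(z) = pi0 * f0(z) / f(z)."""
    f = pi0 * f0(z) + (1 - pi0) * f1(z)
    return pi0 * f0(z) / f

print(round(local_fdr(0.0), 3))  # near 1: a central z-score looks null
print(round(local_fdr(4.0), 3))  # small: a large z-score looks non-null
```

In practice π₀ and the mixture density f are estimated from the N observed scores themselves—the “empirical” part of empirical Bayes—rather than assumed, as done here for illustration.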