Type 1 and 2 error Flashcards
Statistical error
We can never be completely certain that we are right when we reject or fail to reject the null hypothesis
Type 1 error = rejecting the null hypothesis when it is TRUE
Saying that the means are different when they are the same (FALSE POSITIVE)
Type 2 error = failing to reject the null hypothesis when it is FALSE
Saying the means are the same when they are different (FALSE NEGATIVE)
Type 1 error = FALSE POSITIVE
problems
- Overcall positive results
- Identify a treatment effect when one doesn't exist
- Waste of time and effort on further development of an ineffective drug
- Consequences for patients
Type 2 error = FALSE NEGATIVE
problems
- Overcall negative results
- Fail to identify a treatment effect when one does exist
- Reject/lose a potentially effective treatment
- Waste resources used so far on drug development
Type 1 error rate
how more likely and how to reduce
Rejecting the null hypothesis when it is true
- Designated by α (alpha), usually set to 0.05
- Implying that it is acceptable to have a 5% probability of incorrectly rejecting a true null hypothesis
Type 1 errors more likely with:
• Multiple tests – if we do 20 tests at α = 0.05, on average one will falsely reject a true null hypothesis
• Higher alpha values
Reduced by:
• Pre-study analysis design – avoid multiple testing
• Setting a lower alpha, e.g. 0.01
• Reporting p values to 3 decimal places to give accurate probability estimate
Type 2 error rate
how more likely and how to reduce
FAILING to reject the null hypothesis when it is false
Denoted by the Greek letter β (beta)
The 'POWER' of a test is 1 − β
• Power = likelihood of a statistical test detecting an effect when there is one
• Greater power = less likely to be a false negative result
Type 2 errors more likely with:
• Small samples
• Small effect size - hard to detect
Reduced by:
• Large sample size
• Larger effect size (choosing an outcome where you can measure better effect size)
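The effect of sample size and effect size on power can be illustrated with a quick Monte Carlo simulation. This is a minimal sketch, assuming a two-sample z-test with known unit variance in both groups; the function name is illustrative, not from any library.

```python
import math
import random
from statistics import NormalDist

def simulate_power(n, effect_size, alpha=0.05, trials=2000, seed=1):
    """Monte Carlo estimate of power: the proportion of simulated
    trials in which a two-sample z-test (unit-variance normal data,
    true mean difference = effect_size) rejects the null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(effect_size, 1.0) for _ in range(n)]
        diff = sum(b) / n - sum(a) / n
        se = math.sqrt(2.0 / n)  # standard error with known unit variance
        if abs(diff) / se > z_crit:
            rejections += 1
    return rejections / trials
```

Running this with a fixed effect size shows power rising as n grows (fewer type 2 errors), and with fixed n shows power rising as the effect size grows.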
Multiplicity
Performing many statistical tests on one clinical trial
Increases the risk of type 1 error (alpha)
• False positive result
• Rejecting null hypothesis when it is actually true
• Set for a single comparison at p<0.05
Risk of type 1 error calculation
Calculated by:
[1 − (1 − α)^n] where n is the number of tests
Type 1 error rate of <0.05 is accepted for a single test
Inappropriate for multiple tests
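The formula above can be computed directly. A minimal sketch (the function name is illustrative):

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive (type 1 error)
    across n independent tests, each run at level alpha:
    FWER = 1 - (1 - alpha)^n."""
    return 1.0 - (1.0 - alpha) ** n_tests
```

For example, 20 tests at α = 0.05 give a roughly 64% chance of at least one false positive, and 30 tests give roughly 79%, which is why a per-test threshold of 0.05 is inappropriate for multiple tests.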
Multiplicity in clinical trials
5 examples
Multiple treatments
• More than 2 groups (drugs, doses, combinations)
Multiple endpoints
• Several outcomes of interest
Repeated measurements
• Measurements at multiple time points
Subgroup analyses
• Tests whether individuals with certain characteristics benefit more than those without (e.g. demographics, lifestyle)
Interim analyses
• Analysis of data conducted before data collection is complete, i.e. during the trial, e.g. for ethical and economic reasons
Dealing with multiplicity
- Make fewer comparisons
- Pre-define/prioritize the comparisons
- Adjust the p value
Make fewer comparisons - dealing with multiplicity
- MULTIPLE TREATMENTS: use analysis of variance (a single omnibus test compares all treatments at once rather than making multiple comparisons)
- MULTIPLE ENDPOINTS: use a single summary statistic, e.g. a questionnaire with many questions but one overall score
- MULTIPLE ENDPOINTS: use a composite endpoint, e.g. MACE – the occurrence of any major heart problem such as stroke or heart attack all counts as one endpoint
- REPEATED MEASUREMENTS: do the analysis at a predefined timepoint, use a summary measure (e.g. area under the curve), or use a statistical mixed model
Pre-define/prioritize - dealing with multiplicity
Multiple treatments
– Pre-define the most important comparison
Multiple endpoints
– Specify primary and secondary endpoints in advance
• Study is powered to detect primary endpoint and outcome judged on the significance of the primary endpoint
Subgroup analyses
– Predefine a limited number of subgroups to be analyzed
adjust the p value - dealing with multiplicity
Type 1 error rate is inflated by multiple tests
– Reduce the p value threshold for individual tests
– Overall level of significance can be kept at 0.05 for entire series of tests
e.g. Bonferroni correction
– divide 0.05 by number of tests done to set significance level for each subtest
– e.g. for 5 related tests set a (risk of false positive result) at 0.05/5 = 0.01
– Very conservative, tends to overcorrect and increase the risk of a false negative result
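The Bonferroni correction can be sketched in a few lines (function names are illustrative; real analyses would typically use a statistics package):

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction:
    divide the overall alpha by the number of tests."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p values remain significant after Bonferroni
    correction for the whole family of tests."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p < threshold for p in p_values]
```

For 5 related tests this sets each per-test threshold at 0.05/5 = 0.01, so a p value of 0.02 that would be "significant" on its own is no longer counted, keeping the overall error rate near 0.05 (at the cost of being conservative).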
No significance testing for baseline data
can be avoided to reduce multiplicity
Multiple tests will generate false positive results
• e.g. 30 comparisons; 79% chance of false positive result
Differences may be clinically important but not statistically significant
• negative tests may be falsely reassuring
Comparisons not testing a useful scientific hypothesis
repeated measurements
Outcome variable measured two or more times for each participant over a period of time
e.g. before, during and after
How to compare repeated measures between groups?
Compare final measurement?
• Wastes a lot of valuable information
Compare every timepoint?
• Multiple comparisons – risk type 1 error
Some kind of regression?
• Correlation structure leads to bias
Summary measure approach?
• Each measure has limitations
WE CAN USE A REPEATED MEASURES MODEL (ANOVA) or summary measures
Tracking
- Baseline characteristics influence PK/PD, so measurement values vary from low to high
- Values tend to track for an individual, e.g. start high, stay high; start low, stay low
- There is strong correlation between repeated measures – a 'correlation structure' – hence we can't do ordinary regression
• ANOVA (analysis of variance test) is an OMNIBUS TEST
- OMNIBUS test – tests everything at once (the variance of all variables) – avoids the risk of multiplicity
- However, the output just tells us there is a difference – it doesn't tell us what is different (which time points differ?)
• A post hoc test
can tell us what is different – you can do this with estimated marginal means
- Multiplicity is acceptable for post hoc tests because you have already shown there is a difference between the time points as a whole
- Post hoc testing is exploratory analysis, not your primary outcome
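The omnibus idea can be sketched by hand for a one-way layout: a single F statistic compares all groups at once instead of running many pairwise tests. This is an illustrative hand-rolled sketch; a real repeated-measures analysis would use dedicated software (e.g. a mixed model) that handles the correlation structure.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA omnibus test:
    the ratio of between-group to within-group mean squares.
    A large F suggests at least one group mean differs."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
                    for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within
```

One F test replaces many pairwise comparisons, which is exactly how the omnibus approach avoids inflating the type 1 error rate; post hoc comparisons then follow only if F is significant.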
• Estimated marginal means
an estimate of the means rather than the actual calculation of them
• We can compare them and compare the main effects
• The means are estimated from the regression model rather than calculated from data
• These are inferential stats not descriptive
• Means for groups adjusted for means of other factors in the model
• Also referred to as least square means
descriptive stats vs Least square means
- Least squares mean just means the means have been estimated from the model
- Primary outcome is NOT significant
- Why are there no p values for secondary outcomes? Because the primary outcome is not significant, so you don't explore statistics on the secondary outcomes and don't give p values
Summary measure approach pros
- Summarises all the information as a single statistic
- Reduced multiplicity
- Avoids the problem of correlation structure
- Makes interpretation easier
summary measure for repeated time measurements approach examples and limitations
- Mean (central level of efficacy of the outcome variable); limitation: sensitive to missing data
- Maximum (describes maximum drug concentration)
- Time to maximum (describes speed of drug action); limitation: sensitive to missing data
- Area under curve (assesses overall concentration of drug); limitation: ignores within-subject variation
- Percentage of time above/below a certain value (assesses time that the drug is effective)
- Number of occasions above or below a certain value (assesses frequency of fluctuations); limitation: many time points needed for a stable estimate
- Rate of change (rate of change in the outcome variable); limitation: coefficients are measured with varying levels of precision
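As a worked example of one summary measure, the area under the curve for a single participant's repeated measurements can be computed with the trapezoidal rule (the function name is illustrative):

```python
def auc_trapezoid(times, values):
    """Area under the curve by the trapezoidal rule: sum the area of
    the trapezoid between each pair of adjacent time points. Collapses
    a participant's repeated measurements into one summary statistic."""
    return sum((t1 - t0) * (v0 + v1) / 2.0
               for (t0, v0), (t1, v1) in zip(zip(times, values),
                                             zip(times[1:], values[1:])))
```

Each participant's measurement series is reduced to one number, so groups can then be compared with a single test, avoiding both multiplicity and the correlation-structure problem.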