Medical Statistics Flashcards
Research starts with asking a clinically relevant question, determining an appropriate outcome measure (data), and deciding on the most rigorous and feasible design.
- Research designs are selected to test a specific theoretical hypothesis in a way that provides the highest level of evidence.
- Research designs should be rigorous and controlled to provide the highest-quality outcomes.
Evidence-based medicine aims to apply evidence from the highest-quality research studies to the practice of medicine.
- Findings from the best-designed and most rigorous studies have the greatest influence on clinical decision making.
- The levels of evidence in medical research: a hierarchy for various research applications and questions based on several factors affecting the quality of a research design
- Diagnostic, prognostic, or therapeutic research designs with higher levels of evidence have a greater influence on clinical recommendations.
- Level I:………………………………………………………………………………………………………….
- Level II: …………………………………………………………………………………………………………
- Level III: ……………………………………………
- Level IV: …………………………………………..
- Level V: ……………………………………………
-
Level I: high-quality clinical trials (randomized, controlled, blinded, etc.)
- Trials with less than 80% follow-up, no blinding, or improper randomization are lesser-quality studies and qualify as Level II evidence.
- Level II: cohort studies or lesser-quality clinical trials
- Level III: case-control studies
- Level IV: case series studies
- Level V: expert opinions
Clinical study design is an essential element of research that the study team must determine in advance of initiating the study.
- Prospective studies………………………………………………………………………………………………………..
- Retrospective studies……………………………………………………………………………………………………..
- Longitudinal studies………………………………………………………………………………………………………………………….
- Observational research……………………………………………………………………………………………………………… (Case Reports, Case Series, Case-control study, Cohort study, Cross sectional study)
-
Experimental research designs
-
………………………………………………………………………………………………………… Clinical trials are costly, take a great deal of time, money, and resources and a comprehensive research team (often at multiple patient enrollment sites) to accomplish their aims.
- ……………………………………………………………………………………………………………………………………………………………………….
- Clinical trials with …………………………………………………………………………………………………
- Clinical trials with ………………………………………………………………………………………………………
- Clinical studies ……………………………………………………………………………………………………………
-
Clinical study design is an essential element of research that the study team must determine in advance of initiating the study.
- Prospective studies are designed to start in the present and collect data forward in time.
- Retrospective studies are designed to assess outcomes that have already occurred or data that have been collected in the past (medical records)
- Longitudinal studies involve repeated assessments over a long period. A longitudinal study can also be performed on historical (retrospective) data.
- Observational research designs can be prospective, retrospective, or longitudinal. Common observational designs include case reports, case series, case-control studies, cohort studies, and cross-sectional studies.
-
Experimental research designs
-
A clinical trial is designed to allocate treatments and track outcomes prospectively to test a specific hypothesis. Clinical trials are costly: they require a great deal of time, money, and resources, as well as a comprehensive research team (often at multiple patient enrollment sites), to accomplish their aims.
- The gold standard, and the type of clinical trial that produces the highest level of evidence, is the randomized controlled trial (RCT).
- Clinical trials with parallel design: treatments are allocated to different subjects/ patients in random or nonrandom manner.
- Clinical trials with crossover designs: each subject receives two or more interventions in a predetermined or random order.
- Clinical studies can be designed to determine superiority of one treatment over another or to determine whether one treatment is no worse than another (noninferiority) or just as effective (equivalency).
-
-
Case reports:
- Descriptions of …………………………………………………………………………………………………..
- No attempts at advanced data analysis are made.
- Cause-and-effect relationships and generalizability are not determined.
-
Case series:
- Outcomes are measured …………………………………………………………………………………………………….
- No attempts are made to estimate frequencies or distributions.
-
Case-control studies:
- Outcomes measured in patients ……………………………………………………………………………………..
- Odds ratios (not relative risks) are appropriate measures of association from data collected in these study designs.
-
Cohort study:
- Groups of patients with ………………………………………………………………………………………………………….
- Cohort studies are appropriate for estimating …………………………………………………………………….
-
Cross-sectional study:
- A specific ……………………………………………………………………………………………………..
- All measurements are made at ………………………………………………………………………
- Considered ……………………… that is useful for describing the ………………………………………….at a particular point in time.
-
Case reports:
- Descriptions of unique injuries, disease occurrences, or outcomes in a single patient
- No attempts at advanced data analysis are made.
- Cause-and-effect relationships and generalizability are not determined.
-
Case series:
- Outcomes are measured in patients with a similar disease/ injury to determine outcomes retrospectively.
- No attempts are made to estimate frequencies or distributions.
-
Case-control studies:
- Outcomes measured in patients with similar disease/ injury are compared with a control group (see later discussion of flaws in research designs for more information about control groups).
- Odds ratios (not relative risks) are appropriate measures of association from data collected in these study designs (see later Concepts in Epidemiologic Research Studies).
-
Cohort study:
- Groups of patients with a similar characteristic or exposure/ risk factor are studied forward in time (prospective) or from existing data (retrospective).
- Cohort studies are appropriate for estimating incidence of disease/ injury and relative risks.
-
Cross-sectional study:
- A specific patient population is studied at a given point in time.
- All measurements are made at once with no follow-up period.
- Considered a “snapshot” that is useful for describing the prevalence of a particular injury/disease of interest at a particular point in time.
-
Confounding variables are factors …………………………………………………. that potentially influence the outcome.
- Conclusions regarding cause-and-effect relationships may be explained by confounding variables
- Must therefore be controlled for in the research design
-
Bias is ……………………………………………………………………………………………. the internal validity of a study.
- selection (sampling) bias,
- nonresponder (loss to follow-up) bias,
- observer/ interviewer bias, and recall bias
-
Control groups can help account for potential placebo effect of interventions.
- Control subjects are often matched on the basis of specific characteristics (e.g., gender, age), a process that helps account for potential confounding sources that may influence the impact of research findings.
- The strongest clinical trial design uses randomly allocated, blinded and concurrent, matched controls.
- Design flaws may challenge the internal or external validity of a research study.
- Internal validity describes the quality of a research design and how well the study is controlled and can be reproduced.
- External validity is the ability of a study’s results to be generalized or applied to a whole population of interest.
-
Study populations in clinical research studies are delimited by inclusion and exclusion criteria. During a screening process, clinical researchers carefully review all inclusion and exclusion criteria to determine eligibility for participation in a clinical research study or clinical trial.
- The narrower a patient population becomes, …………………………………………………………………………………….., study findings will be.
- Inclusion criteria are specific characteristics that are identified to best describe a target population. Sex, age, race, primary diagnosis, and procedure are all examples of inclusion criteria. In clinical research, to be included in a study, the response to all inclusion criteria must be affirmative (i.e., “yes”).
- Exclusion criteria are specific characteristics that, when present, would disqualify a potential participant from the study. For the participant to be included in a clinical research study, all exclusion criteria must be negative or ruled out.
-
Confounding variables are factors extraneous to a research design that potentially influence the outcome.
- Conclusions regarding cause-and-effect relationships may be explained by confounding variables
- Must therefore be controlled for in the research design
-
Bias is unintentional systematic error that will threaten the internal validity of a study.
- selection (sampling) bias,
- nonresponder (loss to follow-up) bias,
- observer/ interviewer bias, and recall bias
-
Control groups can help account for potential placebo effect of interventions.
- Control subjects are often matched on the basis of specific characteristics (e.g., gender, age), a process that helps account for potential confounding sources that may influence the impact of research findings.
- The strongest clinical trial design uses randomly allocated, blinded and concurrent, matched controls.
- Design flaws may challenge the internal or external validity of a research study.
- Internal validity describes the quality of a research design and how well the study is controlled and can be reproduced.
- External validity is the ability of a study’s results to be generalized or applied to a whole population of interest.
-
Study populations in clinical research studies are delimited by inclusion and exclusion criteria. During a screening process, clinical researchers carefully review all inclusion and exclusion criteria to determine eligibility for participation in a clinical research study or clinical trial.
- The narrower a patient population becomes, the less confounded or biased, but also the less generalizable, study findings will be.
- Inclusion criteria are specific characteristics that are identified to best describe a target population. Sex, age, race, primary diagnosis, and procedure are all examples of inclusion criteria. In clinical research, to be included in a study, the response to all inclusion criteria must be affirmative (i.e., “yes”).
- Exclusion criteria are specific characteristics that, when present, would disqualify a potential participant from the study. For the participant to be included in a clinical research study, all exclusion criteria must be negative or ruled out.
Research studies should have enough subjects/ samples to get valid results that can be generalized to a population while minimizing unnecessary work or risk to subjects.
- Sample size estimates are based on the desired statistical power (often termed power analyses).
- There are four elements involved in a power analysis: alpha, beta, effect size, and sample size
- Typically, power is set at 80%, alpha is set at 0.05, the effect size and variance are estimated from pilot data or prior literature, and the equation is solved for the necessary sample size.
- When a study finds no significant effect, the power of the study should be reported.
- The effect size is defined as the magnitude of a difference considered to be clinically meaningful. It is used in the power analysis to determine the required sample size
- Statistical ………………………………………………………………………………………………………..
- We want to be able to find these differences with our statistical tests 80% of the time or more.
- Sample sizes are justified ………………………………………………………………………………………….
- Higher sample sizes and/ or highly precise measurements (lower variability) are necessary to find small differences between study groups.
- Power analyses can be done before the study starts (a priori) or after the study has been completed (post hoc).
- Studies with low power have higher likelihood of missing statistical differences (or relationships) when they actually exist (i.e., type II error).
- Sample sizes are calculated to determine the number of subjects needed to study a specific outcome measure. It is important to identify a primary outcome measure in order to determine sample size for a research study.
- Studies that have multiple outcome measures may need multiple sample size estimates to ensure all outcomes are appropriately “powered.”
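The "solve for the necessary sample size" step can be sketched for the simplest case: comparing the means of two equal-sized groups with a two-sided test. This is a minimal sketch under a normal approximation, not a substitute for dedicated power-analysis software; the function name and the numbers (a difference of 5 points with an SD of 10) are hypothetical illustrations.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two means.

    Assumes a two-sided test, equal group sizes, a common SD, and a
    normal approximation (a t-based calculation gives slightly larger n).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    n = 2 * ((z_alpha + z_power) * sd / effect) ** 2
    return math.ceil(n)                            # round up to whole subjects

# Hypothetical inputs: detect a 5-point difference when the SD is 10.
n_per_group = sample_size_per_group(effect=5.0, sd=10.0)
# Halving the effect size roughly quadruples the required sample size.
n_small_effect = sample_size_per_group(effect=2.5, sd=10.0)
print(n_per_group, n_small_effect)
```

The comparison between the two calls illustrates why larger samples or more precise measurements are needed to find small differences.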
Research studies should have enough subjects/ samples to get valid results that can be generalized to a population while minimizing unnecessary work or risk to subjects.
- Sample size estimates are based on the desired statistical power (often termed power analyses).
- Statistical power is the probability of finding differences among groups when differences actually exist (i.e., avoiding type II error). It is defined as 1 − β, where β is the probability of a type II error.
- We want to be able to find these differences with our statistical tests 80% of the time or more.
- Sample sizes are justified as the number of subjects needed to find a statistically significant difference or association (i.e., P < 0.05) while maintaining statistical power greater than 80%.
- Higher sample sizes and/ or highly precise measurements (lower variability) are necessary to find small differences between study groups.
- Power analyses can be done before the study starts (a priori) or after the study has been completed (post hoc).
- Studies with low power have higher likelihood of missing statistical differences (or relationships) when they actually exist (i.e., type II error).
- Sample sizes are calculated to determine the number of subjects needed to study a specific outcome measure. It is important to identify a primary outcome measure in order to determine sample size for a research study.
- Studies that have multiple outcome measures may need multiple sample size estimates to ensure all outcomes are appropriately “powered.”
- Selecting the most appropriate outcome for a study is an important decision made in advance by the research team.
- Primary outcome measures match the primary purpose of the study.
- Secondary and tertiary outcomes may be included as additional (sometimes exploratory) measures that are important to achieve the goals of the study.
- Typically, sample size estimates for a study are based on the primary outcome measure.
- Subjective data are opinions, judgments, or feelings (e.g., in clinical research, patient-reported outcomes are subjective)
- Objective data are measured by a valid or reliable instrument (see discussion Validity and Reliability).
Primer on sampling and data distributions
- Population: all individuals who share a specific characteristic of clinical or scientific interest.
- Parameters describe the characteristics of a population.
- Random sampling affords all members of a specific population equal chance of being studied/ enrolled in a clinical study.
- Sample populations are representative subsets of the whole population.
- Statistics describe the characteristics of a sample and are intended to be generalized to the whole population.
- Populations are delimited on the basis of inclusion and exclusion criteria that are set before a study starts.
- Types of data collected from samples:
- Continuous data have an infinite number of possible values (e.g., age, height, distance, percentages, time, etc.).
- Categorical data have a limited/ finite number of possible values or categories (e.g., excellent/ good/ fair/ poor, male/ female, satisfied/ unsatisfied, etc.).
- Binary categorical data only have two options (i.e., yes/ no).
- Categorical data can be ordered (e.g., severity: mild, moderate, severe) or unordered (e.g., gender, race).
- Data can be plotted in frequency distributions (histograms) to summarize basic characteristics of the study sample.
- Continuous data are often converted into categorical or binary data through the use of cutoff points. Cutoff points can be arbitrary or evidence based.
- Evidence-based establishment of cutoff points uses receiver operating characteristic (ROC) curves and identifies a point that maximizes sensitivity and/ or specificity of a particular test.
- Example: a numerical value can be established as a cutoff point for white blood cell count to identify whether or not an infection exists.
- Arrays of continuous data can be separated into percentiles to identify upper/ lower halves, thirds, quartiles, and so forth.
Data distribution is a histogram describing the frequency of occurrence of each data value. Distributions can be described using descriptive statistics such as the following:
- Mean is calculated as the sum of all scores divided by the number of samples (n).
- Median is the value that separates a dataset into equal halves, so that half of the values are higher and half are lower than the median.
- Mode is the most frequently occurring data point.
- Range is the difference between the highest value and the lowest value in a dataset.
-
Standard deviation (SD) is a value that describes the dispersion or variability of the data.
- SD is higher when data are more “spread out.”
-
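The descriptive statistics above can be computed with Python's standard `statistics` module. This is a small illustration on a hypothetical dataset; `stdev` here is the sample SD (dividing by n − 1).

```python
import statistics

# Hypothetical dataset (e.g., pain scores from seven patients).
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)          # sum of all scores divided by n
median = statistics.median(values)      # middle value of the sorted data
mode = statistics.mode(values)          # most frequently occurring value
data_range = max(values) - min(values)  # highest value minus lowest value
sd = statistics.stdev(values)           # sample standard deviation

print(mean, median, mode, data_range, round(sd, 2))
```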
The confidence interval (CI) quantifies the precision of the mean or other statistic, such as an odds ratio (OR) or relative risk (RR).
- Datasets that are highly variable (large SDs) have wider CIs and hence give less precise estimates of the characteristics of a population.
- A 95% CI consists of a range of values within which we are 95% certain that the actual population parameter [mean/ OR/ RR] lies.
- How to determine whether a data distribution is normal:
- Normally distributed datasets resemble a bell-shaped curve. The mean, median, and mode are the same value in a Gaussian (normally distributed) distribution.
- Skewed data distributions are asymmetric and may be due to outliers (see later). Data distributions can be skewed to the left (negative skew) or skewed to the right (positive skew). The degree of asymmetry can be calculated as a numeric value (skewness).
-
Kurtosis is a measure of the relative concentration of data points within a distribution.
- If data values cluster closely, the dataset is more kurtotic. This concentration can be calculated as a numeric value to determine the extent of kurtosis in a data distribution.
- Outliers are data points that are considerably different from the rest of the dataset. Outliers can cause data distributions to be skewed.
- When an unknown value is sought, the confidence interval gives the statistician a range of values within which the “true” value is likely to be located.
- The confidence interval is used to indicate the reliability of an estimate.
- The standard deviation is a quantity calculated to indicate the extent of deviation for a group as a whole. The mode is the value which occurs most frequently in a given set of data.
- The variance is a quantity equal to the square of the standard deviation.
- The incidence is the frequency of an occurrence (or disease).
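The link between variability and the width of the confidence interval can be sketched as follows. This is a minimal illustration assuming a normal approximation (multiplier 1.96); for small samples a t-distribution multiplier would normally be used instead. The function name and both datasets are hypothetical.

```python
import math
import statistics

def mean_ci95(values):
    """Approximate 95% CI for the mean using the normal multiplier 1.96."""
    n = len(values)
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    margin = 1.96 * sem
    return mean - margin, mean + margin

# Two hypothetical samples with the same mean (10) but different spread.
tight = [9, 10, 11, 10, 10, 9, 11, 10]
wide = [2, 18, 5, 15, 10, 1, 19, 10]

low_t, high_t = mean_ci95(tight)
low_w, high_w = mean_ci95(wide)
print((low_t, high_t), (low_w, high_w))
```

The more variable sample yields the wider interval, which is why its mean is a less precise estimate of the population mean.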
Epidemiology is the study of the distribution and determinants of disease. The following measures are commonly used in this type of research:
- Prevalence is the proportion of existing injuries/disease cases within a particular population.
- Incidence (absolute risk) is the proportion of new injuries/ disease cases within a specified time interval (requires a follow-up period).
-
RR is a ratio between the incidences of an outcome in two cohorts. Typically a treated/ exposed cohort (in the numerator of the ratio) is compared with an untreated (control) group/ unexposed group (in the denominator of the ratio). Values can range from 0 to infinity and are interpreted as follows:
- RR = 1.0: indicates the incidences of an outcome are equal in the two groups.
- RR > 1.0: indicates the incidence of an outcome is greater in the treated/ exposed group (higher incidence value in the numerator).
- RR < 1.0: indicates the incidence of an outcome is greater in the untreated/ unexposed group (higher incidence value in the denominator).
-
OR is calculated as a ratio between the odds of an outcome in two groups.
- ORs are well suited for binary data or studies in which only prevalence can be calculated.
-
Interpreting RR and OR
- OR and RR values are interpreted similarly.
- In the comparison of outcomes between two groups, an RR or OR value of 0.5 would indicate that treated/exposed patients have half the likelihood of experiencing a particular outcome compared with the untreated/control group.
- A value of 2.5 would indicate that a treated/ exposed group would have a 2.5 times greater likelihood of experiencing the outcome than the untreated/ control group.
- An RR or OR whose CI crosses 1 is not considered to be “significant.”
-
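As a worked illustration of RR and OR, the sketch below computes both from the same 2 × 2 counts. The counts are hypothetical (20 of 100 exposed patients and 10 of 100 unexposed patients experience the outcome), and the function names are illustrative only.

```python
def relative_risk(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of incidences: exposed group in the numerator."""
    incidence_exposed = events_exposed / n_exposed
    incidence_unexposed = events_unexposed / n_unexposed
    return incidence_exposed / incidence_unexposed

def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Ratio of odds (events divided by non-events) in the two groups."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed

# Hypothetical counts: 20/100 exposed and 10/100 unexposed have the outcome.
rr = relative_risk(20, 100, 10, 100)
or_value = odds_ratio(20, 100, 10, 100)
print(rr, or_value)
```

Both values are above 1.0, pointing in the same direction, but the OR is somewhat larger than the RR because the outcome is not rare in this example.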
Sensitivity:
- The likelihood of positive test results in patients who actually DO have the disease/ condition of interest (i.e., ability to detect true positives among those with a disease)
- Sensitive tests are used for screening because they have few false-negative results.
- When the result of a highly sensitive (Sn) test is negative, the condition can be ruled OUT (mnemonic: SnOUT).
-
Specificity:
- The likelihood of negative test results in patients who actually DO NOT have the disease/ condition of interest (i.e., ability to detect true negatives among those without a disease)
- Specific tests are used for confirmation because they are tests that have few false-positive results and are therefore unlikely to result in false treatment of a healthy individual.
- When the result of a highly specific (Sp) test is positive, the condition can be ruled IN (mnemonic: SpIN).
-
Positive predictive value
- the likelihood that patients with positive test results actually DO have the disease/ condition of interest
-
Negative predictive value
- the likelihood that patients with a negative test result actually DO NOT have the disease/ condition of interest
-
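The four measures above all come from the same 2 × 2 table of test results against true disease status. The sketch below uses hypothetical counts, and the function name is illustrative only.

```python
def diagnostic_measures(tp, fn, fp, tn):
    """Compute basic diagnostic accuracy measures from 2 x 2 counts.

    tp: diseased, test positive    fn: diseased, test negative
    fp: healthy, test positive     tn: healthy, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # positives among those WITH disease
        "specificity": tn / (tn + fp),  # negatives among those WITHOUT disease
        "ppv": tp / (tp + fp),          # diseased among positive results
        "npv": tn / (tn + fn),          # healthy among negative results
    }

# Hypothetical study: 100 diseased and 200 healthy patients.
measures = diagnostic_measures(tp=90, fn=10, fp=30, tn=170)
print(measures)
```

Note that sensitivity and specificity are calculated within the diseased and healthy columns, whereas the predictive values are calculated within the positive and negative test rows.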
Likelihood ratio
- Expresses how much a given test result changes the probability that a disease exists; likelihood ratios consider both specificity and sensitivity of a given test.
- Likelihood ratios close to 1.0 provide little confidence regarding presence/ absence of a disease.
- Positive likelihood ratios greater than 1.0 indicate higher probability of disease when diagnostic test result is positive.
- Negative likelihood ratios less than 1.0 indicate higher probability that the disease is absent given a negative test result.
- Receiver operating characteristic curves are graphical representations of the overall clinical utility of a particular diagnostic test that can be used to compare accuracy of different tests in diagnosing a particular condition.
- Tradeoffs between sensitivity and specificity must be considered in the identification of the best diagnostic tests
- ROC curves plot the true-positive rate (sensitivity) and the false-positive rate (1-specificity) on a graph.
- Statistical tests are prescribed to match the purpose and design of a particular research study. Statistical tests are used to answer research questions. Statistics are merely tools to describe data and make inferences
- Statistical analyses differ according to whether a researcher wants to compare groups to identify differences, establish relationships between groups, and so on.
- Inferential statistics are used to test specific hypotheses about associations and/ or differences among groups of subject/ sample data.
- The dependent variable is what is being measured as the outcome. There can be multiple dependent variables depending on how many outcome measures are desired.
-
The independent variables include the conditions or groupings of the experiment that are systematically manipulated by the investigator.
- For example, a researcher is measuring pain and prescription medication use in patients with shoulder pain who receive treatment A, B, or C. The dependent variables are “pain” and “prescription medication use.” The independent variable is “treatment condition” with three levels, “A,” “B,” and “C.”
- Inferential statistics can be generally divided into parametric tests and nonparametric tests. The goal of inferential statistics is to estimate parameters; therefore the default should be parametric tests. Nonparametric alternatives are justified if the basic underlying assumptions for using parametric statistics are violated or if the sample sizes are very small.
-
Parametric statistics are appropriate for continuous data and rely on the assumption that data are normally distributed.
- They use the mean and SD when comparing groups or identifying associations.
- The mean of a dataset is greatly influenced by outliers, so these tests may not be as robust for skewed datasets
-
Nonparametric statistics are appropriate for categorical and non-normally distributed data.
- They use the median and ranks as more robust alternatives when data are non-normally distributed.
-
- The decision on which statistical test to use is based on several factors inherent to research designs.
- Some important considerations are:
- How many groups are being studied?
- Are the measures being recorded in the same or different subjects (or samples)?
- Are the data continuous or categorical?
- Are the data normally distributed?
-
When two groups of data are compared, the t-test is used; there are two variations:
-
Dependent (paired) samples t-test:
- Appropriate for comparing continuous, normally distributed data collected two times on the same subjects
- Example: two time points measured in the same patient (e.g., before/ after intervention)
- Also appropriate for side-by-side comparison within the same subject or in matched pairs of subjects
- Nonparametric equivalent: Wilcoxon signed rank test. When the observations are not normally distributed, the Wilcoxon signed rank test is more powerful than the t-test in detecting an actual difference between paired samples. It is appropriate for small samples that are not normally distributed.
-
Independent samples t-test
- Appropriate for comparing continuous, normally distributed data from two separate groups
- Example: two groups of patients who received different treatments
- Nonparametric equivalent: Mann-Whitney U test
-
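The difference between the two t-test variations can be seen in how the test statistic is formed. The sketch below computes only the t statistic and degrees of freedom with the standard library; obtaining a P value requires the t distribution, typically from a statistics package. The independent-samples version assumes equal variances (pooled SD), and all data and function names are hypothetical.

```python
import math
import statistics

def paired_t(before, after):
    """t statistic for paired data: based on within-subject differences."""
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1  # degrees of freedom = number of pairs - 1

def independent_t(group_a, group_b):
    """t statistic for two separate groups, assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * statistics.variance(group_a)
                  + (n_b - 1) * statistics.variance(group_b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t_stat, n_a + n_b - 2

# Hypothetical pain scores before/after an intervention in the same patients.
t_paired, df_paired = paired_t([10, 12, 9, 11, 13], [8, 9, 7, 10, 10])
# Hypothetical outcome scores in two separate treatment groups.
t_ind, df_ind = independent_t([5, 7, 6, 8, 9], [3, 4, 5, 4, 4])
print(round(t_paired, 2), df_paired, round(t_ind, 2), df_ind)
```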
ANOVA is appropriate to compare three or more groups of continuous, normally distributed data.
- Nonparametric equivalent: Kruskal-Wallis test
-
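A one-way ANOVA compares the variability between group means with the variability within groups. The sketch below computes the F statistic by hand on hypothetical data for three groups; converting F to a P value requires the F distribution, which is not in the Python standard library, so that step is left out.

```python
import statistics

def one_way_anova_f(groups):
    """Return the F statistic and its degrees of freedom for a one-way ANOVA."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: group means versus the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: each value versus its own group mean.
    ss_within = sum((v - statistics.mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical outcome scores for treatments A, B, and C.
f_stat, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 2), df_between, df_within)
```

A large F value indicates that group means differ more than would be expected from within-group variability alone, but it does not say which groups differ.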
Repeated measures ANOVA is a variation of the ANOVA test that is appropriate for sequential measurements recorded on the same subjects.
- For example, this test would be used to compare a dependent variable (outcome measure) recorded at three or more time points (baseline, 1 month post intervention, 2 months post intervention).
- Nonparametric alternative: Friedman test
- Multivariate ANOVA (MANOVA): variation of the ANOVA test that is used when multiple dependent variables are compared among three or more groups
- Analysis of covariance (ANCOVA) is an appropriate test when confounding factors must be accounted for in the statistical test.
- Post hoc testing is necessary after a statistically significant ANOVA test to determine the exact location of differences among groups.
- ANOVA tests describe whether or not a statistically significant difference exists somewhere among the study groups.
- For example, in a comparison of three levels of the independent variable treatment condition (A, B, or C), post hoc testing will specifically compare A vs. B, B vs. C, and A vs. C to determine the exact locations of group differences.
- Post hoc testing is appropriate only if the ANOVA test is statistically significant (see later section).
- Common post hoc tests: Tukey HSD, Šidák, Dunnett, Scheffé
- Factorial designs for multiple independent variables
- Hypotheses regarding an interaction among three different treatment groups from pre/ post intervention will have a 2 × 3 factorial design.
- “2 × 3” indicates two independent variables; for example, the first (time) has two levels, pretest and post test, and the second (treatment condition) has three levels, treatments A, B, and C.
-
- Correlation and regression
- Correlation coefficients
- Describe the strength of a relationship between two variables
- Pearson product-moment correlation coefficient (r) is used for continuous, normally distributed data.
- Spearman rho correlation coefficient (ρ) is the nonparametric equivalent.
- Values range from −1.0 to 1.0. Absolute values below 0.33 are “weak,” those between 0.33 and 0.66 are “moderate,” and those above 0.66 are “strong.” Positive values are direct relationships; negative values are inverse (indirect) relationships.
- Positive correlation coefficients indicate direct relationships suggesting that patients who scored high on one scale also score high on the other.
- Negative correlation coefficients indicate inverse/ indirect relationships suggesting that patients who score high on one scale score low on the other.
-
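The difference between the Pearson and Spearman coefficients can be sketched on a small hypothetical dataset in which one variable rises steadily with the other but not in a straight line. The helper functions below are illustrative only, and the ranking helper assumes there are no tied values.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y always increases with x, but along a curve.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(round(r, 3), rho)
```

Because Spearman works on ranks, it reaches 1.0 for any consistently increasing relationship, whereas Pearson falls slightly short of 1.0 when the relationship is curved.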
Simple linear regression
- Describes the ability of one independent (predictor) variable to predict a dependent (outcome) variable
- The coefficient of determination (R²) is the square of r (the Pearson product-moment correlation coefficient) and indicates the proportion of variance explained in one variable by another.
- R² ranges from 0 to 1.0, in which higher values indicate more variance explained.
- Multiple linear regression (often loosely called multivariate linear regression) describes the ability of several independent variables to predict a dependent variable.
- Logistic regression is used when the outcome is categorical and the predictor variables can be either categorical or non-normally distributed continuous data.
-
Statistical tests for categorical data
-
Chi-square (χ²) test
-
Used for two or more groups of categorical data
- Example: to compare treatment A versus B when the outcome is either “satisfied or unsatisfied,” the chi-square test can be used to identify relationships between “treatment condition” and “outcome category.”
- If the result of the test is statistically significant, frequencies of each outcome in the two treatment groups can be visually compared to describe which treatment is superior.
-
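The treatment A versus B example can be sketched as a 2 × 2 chi-square calculation. The counts below are hypothetical, no continuity correction is applied, and the P value shortcut via `math.erfc` is valid only for 1 degree of freedom (a 2 × 2 table); the function name is illustrative.

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and P value for a 2 x 2 table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if treatment and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, chi2 is a squared standard normal deviate.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows are treatments A and B,
# columns are "satisfied" and "unsatisfied".
chi2, p_value = chi_square_2x2([[30, 10], [20, 20]])
print(round(chi2, 2), round(p_value, 3))
```

With these counts every expected cell is at least 15, so the chi-square approximation is reasonable; with very small expected counts an exact test would be the safer choice.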
-
Fisher exact test
- Similar to the chi-square test but better for small sample sizes or when the number of occurrences in one of the categories is low (e.g., if only one patient in treatment group A had an unsatisfactory outcome, this test is preferred). The Fisher exact test is preferred when there are fewer than 5 data points in any group being compared.
-
- Validity and reliability can be assessed using statistical techniques similar to correlation coefficients.
- The intraclass correlation coefficient evaluates agreement between two measures on the same scale.
-
Accuracy/ validity
- An instrument or test with the ability to accurately describe truth/ reality is said to be valid.
- A validation study is designed to compare measures recorded from a gold-standard method with a new or experimental method. The data should be on the same measurement scale to determine agreement between the two instruments or techniques.
-
Precision/reliability
- The ability to precisely describe a characteristic with repeated measurements can be tested statistically.
- The precision of an instrument or technique can be tested for interobserver (measures taken by different examiners on the same patient) or intraobserver (reliability of measures recorded by the same examiner at consecutive times) reliability. Measures should be on the same scale to determine agreement.
- The intraclass correlation coefficient (ICC) is a common statistical method for testing the agreement between two sets of data. Values typically range from 0 to 1.0 (1.0 = perfect agreement).
- For binary or categorical data, a κ (kappa) statistic can be used to determine agreement. The κ statistic is interpreted on a scale similar to that of the ICC (1.0 = perfect agreement, 0 = agreement no better than chance); negative values are possible when agreement is worse than chance.
- A common mistake in the biomedical literature is to interpret the correlation coefficient as indicating the accuracy or validity of one measurement compared with another. In fact, correlation analysis is not the most appropriate type of analysis for such a comparison. A correlation coefficient of r = 0.97 between a bench-top test and a computer model indicates that about 94% (r² ≈ 0.94) of the variation in the bench-top data corresponds to variation in the computer model. However, two sets of data can be nearly perfectly correlated while both are highly inaccurate. That is, the correlation coefficient indicates only whether two quantities tend to increase and decrease together; it does not establish that either is accurate.
- Suppose, for example, that an experimental quantity is measured using two methods, giving 2, 3, 4, 5 for the first method and 4, 6, 8, 10 for the second. The two sets of data are perfectly correlated (R² = 1 and p = 0.00), yet it is clear that at least one and possibly both methods are inaccurate.
- A particular measurement method is said to be highly repeatable and precise if it provides the same value each time it is performed, with little variation. However, it still might be highly inaccurate. For example, a high-quality bathroom scale that has not been “zeroed” will give precisely the same inaccurate weight for many measurements.
- In general, statistical analyses alone cannot determine the accuracy of a particular experimental method. This must be done by comparing the data obtained using this method with data obtained using an independent method for which the accuracy is known.
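The two-method example (2, 3, 4, 5 versus 4, 6, 8, 10) can be checked with a short sketch. The helper `pearson_r` is written out here for illustration, and the mean paired difference is used as a simple stand-in for disagreement between methods; it is not a full agreement analysis.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


method_1 = [2, 3, 4, 5]
method_2 = [4, 6, 8, 10]

r = pearson_r(method_1, method_2)
# Average paired difference: how far apart the two methods are on the same scale.
mean_bias = mean(b - a for a, b in zip(method_1, method_2))

print(f"r = {r:.2f}, mean difference between methods = {mean_bias}")
```

The correlation is perfect even though the second method reads twice as high as the first, so at least one of them must be wrong.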
- Number needed to treat (NNT) is the number of patients who must be treated to achieve one additional favorable outcome. It is calculated as 1 / absolute risk reduction.
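As a worked sketch of the NNT formula, assume hypothetical event counts of 20 adverse outcomes per 100 control patients and 15 per 100 treated patients. Rounding up to a whole patient is a common convention and is an assumption here, not something stated in these notes.

```python
from fractions import Fraction
from math import ceil


def number_needed_to_treat(control_events, control_n, treated_events, treated_n):
    """NNT = 1 / absolute risk reduction, rounded up to a whole patient."""
    control_risk = Fraction(control_events, control_n)
    treated_risk = Fraction(treated_events, treated_n)
    absolute_risk_reduction = control_risk - treated_risk
    if absolute_risk_reduction <= 0:
        raise ValueError("treatment does not reduce risk; NNT is undefined")
    return ceil(1 / absolute_risk_reduction)


# Hypothetical trial: 20% event rate in controls vs 15% with treatment (ARR = 5%).
nnt = number_needed_to_treat(20, 100, 15, 100)
print(f"NNT = {nnt}")
```

Exact fractions are used so that the result is not distorted by floating-point rounding before the final round-up.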
- In the interpretation of a statistical test result, it is important to establish whether or not your findings (e.g., a difference or relationship) were due to chance.
-
Probability values (P values)
- Inferential test statistics (t-statistic, F-statistic, r coefficient, etc.) are accompanied by a probability (P) value. P values range from 0 to 1 (0% to 100%) and indicate how likely it would be to observe differences/relationships at least as large as those in the study data if chance alone were at work (i.e., if the null hypothesis were true).
-
A P value less than 0.05 means that, if there were truly no difference/relationship, a result at least as extreme as the one observed would be expected less than 5% of the time by chance alone.
- A test is identified as statistically significant if the P value is 0.05 or less (i.e., the investigators are willing to commit a type I error 5 times out of 100).
- Bonferroni correction to the P value:
- Adjusted threshold for statistical significance when performing multiple t-tests for each of several dependent (outcome) variables (used to protect against type I error that may occur)
- Calculated as 0.05/k, where k is the number of comparisons being made
- For example, when two groups are compared using a t-test for each of three outcome variables, each t-test is statistically significant only if its P value is less than or equal to 0.05/3 = 0.017.
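The Bonferroni threshold can be sketched in a few lines. The outcome names and P values below are hypothetical and serve only to show the comparison against 0.05/k.

```python
def bonferroni_threshold(alpha, k):
    """Adjusted significance threshold for k comparisons."""
    return alpha / k


# Hypothetical P values from three t-tests, one per outcome variable.
p_values = {"pain": 0.030, "function": 0.010, "range_of_motion": 0.200}

threshold = bonferroni_threshold(0.05, len(p_values))
significant = {name: p <= threshold for name, p in p_values.items()}

print(f"adjusted threshold = {threshold:.3f}")
for name, is_significant in significant.items():
    print(f"{name}: P = {p_values[name]:.3f}, significant = {is_significant}")
```

Note that P = 0.030 would pass an uncorrected 0.05 threshold but does not pass the adjusted one.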
The p-value should be interpreted only as an indication of the level of uncertainty of the results observed in this study. That is, the p-value answers the specific question, “If, in general, there actually is no difference between the average activity levels of women and men, how often would one expect to obtain by chance a difference as large (or larger) than was observed in the present study?” If the p-value is very small, it is relatively unlikely that the observed difference occurred by chance. However, it is critical to realize that, because of its definition, a large p-value is not an indication that there probably is no difference in general. Therefore, it is not true that the study has shown that there is “statistically no difference” between the activity levels of women and men. Rather, a large p-value indicates a relative lack of certainty of whether the difference between the activity levels of women and men in general is much smaller or much larger than was observed in the present study.
Furthermore, no matter how large the p-value, in the absence of other data (other studies), the difference observed between two randomly selected groups of subjects is the most reliable estimate of the magnitude of the actual difference between the full populations. In a study such as this, if the p-value is sufficiently small, the investigators may be relatively confident in concluding that the observed difference holds in general. In contrast, if the p-value is very large (say, 0.8), then the investigators are relatively uncertain about any conclusion; in particular, they are not highly certain that there is no difference in general. Put simply, contrary to the common misconception, observed differences are not shown to be real or false depending on whether the p-value is less than or greater than 0.05, or any other arbitrary value.
- Statistical significance does not imply clinical importance. Therefore, if a study result includes a statistically significant difference, it remains essential to determine whether that difference is clinically important.
-
The minimal clinically important difference (MCID) is a way to describe the importance of a difference observed in a statistical test.
- The MCID describes the smallest change in a patient-oriented outcome measure that would be perceived as beneficial to the patient or would necessitate treatment.
- Many of the more commonly used patient-oriented outcome instruments have research-established MCID values, that is, a change in outcome that would change the course of a disease or its treatment.
- Expert and experienced clinicians should also consider whether observed differences are important enough to change practice.
-
Effect size (e.g., Cohen’s d) is a standardized method of expressing the magnitude of differences between study groups, or in subjects before and after treatment, in units of the SD. (An effect size of 1 means that the mean difference equals one SD.) The larger the value, the greater the effect (e.g., of treatment).
- Calculated as the mean difference (e.g., between two treatment groups or between pre- and posttreatment) divided by the SD (typically the SD pooled between groups or the SD of the reference/control group): Effect size = (Mean of group 1 − Mean of group 2) / Pooled standard deviation
- Interpretation of effect size: effect sizes greater than 0.8 are “large”; those less than 0.2 are “small” (between these values can be interpreted as “medium”).
- Effect sizes are similar to percentage differences, except the denominator is the SD. Therefore datasets that are highly variable may have lower effect sizes even if the mean difference is high.
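A minimal sketch of Cohen's d using the pooled SD follows. The two pairs of groups are hypothetical scores chosen so that the mean difference is the same (2 points) but the variability differs.

```python
from statistics import mean, variance


def cohens_d(group_1, group_2):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(group_1), len(group_2)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_variance = (
        (n1 - 1) * variance(group_1) + (n2 - 1) * variance(group_2)
    ) / (n1 + n2 - 2)
    return (mean(group_1) - mean(group_2)) / pooled_variance ** 0.5


# Hypothetical outcome scores: same 2-point mean difference in both comparisons.
d_narrow = cohens_d([12, 14, 16, 18, 20], [10, 12, 14, 16, 18])
d_wide = cohens_d([6, 11, 16, 21, 26], [4, 9, 14, 19, 24])

print(f"low-variability data:  d = {d_narrow:.2f}")
print(f"high-variability data: d = {d_wide:.2f}")
```

The more variable dataset yields the smaller effect size even though the mean difference is unchanged.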
-
Statistical error primer
- Type I error (alpha [α] error)
- Probability that a statistical test is wrong when the null hypothesis is rejected (i.e., claiming that groups are different when they actually are not)
- It is accepted that this may occur 5 times out of 100, so the probability value threshold for statistical significance is 0.05 or 5%.
- Type II error (beta [β] error)
- Probability that a statistical test is wrong when failing to reject the null hypothesis (i.e., claiming that two groups are NOT different when they actually are)
- It is accepted that this may occur up to 20% of the time (β = 0.20, which corresponds to a power of 1 − β = 80%).
- The type II (beta) error rate can be estimated if the type I error rate, the sample size, and the expected effect size are known.
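How β depends on α, sample size, and effect size can be sketched with a normal approximation for a two-sided comparison of two group means. This is a simplified approximation, not an exact t-test power calculation; the function name `approx_power` and the inputs (standardized effect size 0.5, 64 or 20 patients per group) are assumptions chosen for illustration.

```python
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power (1 - beta) for a two-sided, two-group comparison of means."""
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic when the true standardized difference is effect_size.
    expected_z = effect_size * (n_per_group / 2) ** 0.5
    # Probability of exceeding the critical value (the far tail is ignored).
    return standard_normal.cdf(expected_z - z_critical)


power_64 = approx_power(0.5, 64)
power_20 = approx_power(0.5, 20)

print(f"n = 64 per group: power = {power_64:.2f}, beta = {1 - power_64:.2f}")
print(f"n = 20 per group: power = {power_20:.2f}, beta = {1 - power_20:.2f}")
```

Under these assumptions the larger sample reaches roughly the conventional 80% power, while the smaller sample leaves a much higher chance of a type II error.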
The null hypothesis in this randomized controlled trial is that there is no difference in cement penetration during TKA with or without tourniquet use. As there was significant crossover (tourniquet use in the “no tourniquet” cohort), accepting the null hypothesis when it is false would result in beta (type 2) error.
In hypothesis testing, the assertion that the observed findings did not occur by chance alone but rather occurred because of a true association between variables is confirmed or rejected. By convention, the null hypothesis suggests that there is no significant association between variables while the alternative hypothesis suggests that there is a significant association. Alpha (type 1) error occurs when the null hypothesis is rejected when it is, in fact, true (false positive effect). Beta (type 2) error occurs when the null hypothesis is accepted when it is, in fact, false (false negative effect).
Kocher et al. reviewed power analyses, statistical errors, and the concept of statistical power. They discuss that beta represents the chance of a type II error, while alpha represents the chance of a type I error, and that conventionally beta is set at 0.2 and alpha at 0.05. The authors recommended that when a study observes no difference, the power of the study, or (1 - beta), should be reported.
Lochner et al. investigated the rates of beta error in randomized controlled trials in orthopedic trauma. They reported a 90% beta error rate in these trials, which exceeds accepted standards. The authors recommended that future authors perform pre-study power and sample-size calculations to reduce these rates.
Illustration A shows a Bayesian analysis table demonstrating the relationship between alpha, beta, and the null hypothesis.
Incorrect Answers:
Answer 1: The alpha (type 1) error occurs when the null hypothesis is rejected when it is, in fact, true (false positive effect). In other words, a difference is found by chance when there is not one. In this scenario, the null hypothesis is accepted.
Answer 3: (1 - alpha) corresponds to the probability of finding a true negative, i.e. the null hypothesis is correctly accepted.
Answer 4: (1 - beta), or power, is the probability of finding a significant association if one truly exists.
Answer 5: The kappa statistic is used to measure inter-observer and intra-observer reliability.