Feldmand. Module 2 – Data Types & Applied Statistics Flashcards
Independent & Dependent Variables
• Independent variables
– The characteristic being observed or measured that is hypothesized to influence an event or manifestation. E.g., risk factors
• Dependent variables
– A variable whose value depends on the effect of other variable(s): a manifestation or outcome whose variation we seek to explain or account for by the influence of independent variables. E.g., disease outcome
Continuous vs. Discrete Data
• Continuous: Quantitative with potentially infinite number of values along continuum. Can be measured to as many decimal places as measuring instrument allows. E.g., Weight, height
• Discrete:
– Count – quantitative data that can be arranged into discrete, naturally occurring or arbitrarily selected groups or sets of values, e.g., pulse rate
– Categorical
-> Nominal–qualitative, named category; the order of the categories is irrelevant to statistical analyses e.g., gender, reproductive status
-> Ordinal–ordered categories, qualitative e.g., disease staging in cancer, education level
Descriptive vs. Inferential statistics
• Descriptive statistics
– Communicate results without attempting to generalize
– Important first step in epidemiologic studies
• Inferential statistics
– Used to infer the likelihood that the observed results can be generalized to other samples of individuals
Measures of central tendency
• Mean – The average, determined by adding all values and dividing by the total number of subjects
• Mode – The most common value in the data
• Median – The value in the dataset for which half of the values are smaller and half are larger. List the data in ascending order and find the median location as (n + 1) / 2. When n is even, the location falls between two values and the median is their average
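The three measures above can be sketched in a few lines of Python using the standard `statistics` module. The pulse-rate values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical pulse rates (beats per minute) for 7 subjects
pulse = [72, 68, 80, 72, 64, 90, 76]

mean = statistics.mean(pulse)      # sum of all values / number of subjects
mode = statistics.mode(pulse)      # most common value
median = statistics.median(pulse)  # middle value of the sorted data

# Median location in the sorted list, (n + 1) / 2, counted from 1
ordered = sorted(pulse)
location = (len(ordered) + 1) / 2

print(f"mean={mean:.2f}, mode={mode}, median={median}, location={location}")
```

With an odd number of values, the value at the median location in the sorted list is the median itself; here that is the 4th value, 72.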
Measures of Dispersion (Variation)
• Need to be able to measure the extent to which individual values differ from mean:
• Range: The difference between the highest and lowest values
• Variance: Average squared deviation of each value from the mean
Σ(Individual value – mean value)^2 / (n - 1)
Because variance is reported in squared units, take square root of the variance and report standard deviation
• Standard deviation (SD): Average measure of how individual values differ from the mean
– The smaller the SD, the less each score varies from the mean
– The larger the spread of scores, the larger the SD
SD = √[ Σ(Individual value – mean value)^2 / (n - 1) ]
• When reporting estimates of central tendency, report measure of dispersion, e.g., mean ± SD
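As a rough worked example of the range, variance and SD formulas in this section, the sketch below computes each by hand and checks the result against Python's `statistics` module, which also uses the n - 1 denominator. The data values are hypothetical.

```python
import math
import statistics

# Hypothetical measurements for 8 subjects
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Range: difference between the highest and lowest values
data_range = max(values) - min(values)

# Variance: sum of squared deviations from the mean, divided by (n - 1)
squared_devs = [(x - mean) ** 2 for x in values]
variance = sum(squared_devs) / (n - 1)

# Standard deviation: square root of the variance (back in original units)
sd = math.sqrt(variance)

# The standard library gives the same sample variance and SD
assert math.isclose(variance, statistics.variance(values))
assert math.isclose(sd, statistics.stdev(values))

# Report central tendency together with dispersion, as mean ± SD
print(f"range={data_range}, variance={variance:.2f}, mean ± SD = {mean:.1f} ± {sd:.2f}")
```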
Inference & Assessing the Role of Chance
- A principal assumption underlying use of measures of disease frequency is that we can make inferences to the population based on a sample
- Because of random variation from sample to sample, the observed results may partly reflect the play of chance
How can we quantify the degree to which chance variability may account for the results observed in any individual study?
– By performing an appropriate test of statistical significance and determining the p-value
How to determine likelihood that sampling variability (chance) explains the observed results?
Hypothesis Testing
• Performing a test of statistical significance to determine likelihood that sampling variability (chance) explains the observed results
• Make explicit statement of hypothesis to be tested:
– Null hypothesis (H0): Always the hypothesis of no difference. The assertion that there is no association between exposure and disease, e.g., RR = 1, OR = 1
– Alternative hypothesis (H1 or HA): The assertion that there is some association between exposure and disease, e.g., RR ≠ 1, OR ≠ 1
The Appropriate Test of Statistical Significance
• Will vary by study design, data type and situation
• Generates a test statistic that is a function of:
– The difference between observed values in the study and expected values if null hypothesis were true, and
– The variability in the sample
• Will lead to a probability statement (p-value)
p-value
- Probability that an effect at least as extreme as that observed in a particular study could have occurred by chance alone, given H0 is true
- The larger the test statistic (in absolute value), the lower the p-value
- Convention in medical research is that when p ≤ 0.05, the association between exposure and disease is considered statistically significant; i.e., if H0 is true, there is no more than a 5% (1 in 20) probability of observing results as extreme as those observed due solely to chance
- If p > 0.05, then chance cannot be excluded as a likely explanation
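To illustrate how a larger test statistic leads to a lower p-value, here is a minimal sketch. It assumes the test statistic follows a standard normal distribution under H0 (a z test) and that the test is two-sided. Neither assumption is stated in these notes, and the helper name `two_sided_p` is made up for this example.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a z statistic, assuming a standard normal under H0."""
    # P(|Z| >= |z|) for a standard normal Z equals erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2))

# Larger test statistics give smaller p-values
for z in (0.5, 1.0, 1.96, 2.58):
    p = two_sided_p(z)
    verdict = "statistically significant" if p <= 0.05 else "chance not excluded"
    print(f"z = {z:4.2f}  p = {p:.4f}  {verdict}")
```

Under these assumptions, a z statistic of about 1.96 corresponds to the conventional p = 0.05 cut-off.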
t Test
• Parametric test for differences between means of independent samples
– Continuous data
- H0: mean1 = mean2
- HA: mean1 ≠ mean2
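A minimal sketch of the test statistic for two independent samples follows. It assumes the equal-variance (pooled) form of the t test, which these notes do not specify. The function name and the data are hypothetical.

```python
import math
import statistics

def pooled_t_statistic(sample1, sample2):
    """Return (t, degrees of freedom) for a pooled-variance two-sample t test."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = statistics.mean(sample1), statistics.mean(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)

    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

    # Standard error of the difference between the two means
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    # Difference between means relative to the variability in the samples
    return (mean1 - mean2) / se, n1 + n2 - 2

# Hypothetical continuous measurements in two independent groups
exposed = [5.1, 5.8, 6.2, 5.5, 6.0]
unexposed = [4.8, 5.0, 5.3, 4.6, 5.1]

t_stat, df = pooled_t_statistic(exposed, unexposed)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The p-value is then read from the t distribution with the returned degrees of freedom. In practice a statistics package handles this step; SciPy's `scipy.stats.ttest_ind`, for example, returns both the statistic and the p-value.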
Chi-square test
• Test whether observed differences in proportions between study groups are statistically significant
– I.e., whether there is an association between exposure and outcome
– Categorical data
H0: proportions are equal; no association
HA: proportions are different; there is an association
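The chi-square idea can be sketched for a 2x2 exposure-by-outcome table as below. The counts are hypothetical, the function name is invented for this example, and no continuity correction is applied. The sketch also assumes every expected count is greater than zero. For a 2x2 table the statistic has 1 degree of freedom, so the p-value can be computed with the complementary error function.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Chi-square statistic and p-value (1 df) for a 2x2 table.

    a, b = exposed with / without the outcome
    c, d = unexposed with / without the outcome
    """
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if H0 (no association) were true
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical study: 30 of 100 exposed and 15 of 100 unexposed develop disease
chi2, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

When the proportions in the two groups are identical, every observed count equals its expected count, the statistic is 0 and the p-value is 1.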