Biostatistics Flashcards
Descriptive statistics
the collection, organization, summarization, and analysis of data
Inferential statistics
drawing inferences about a body of data when only a part of the data is observed
population
defined by a sphere of interest
sample
subgroup or subset of the population
parameter
characteristic or measure obtained from a population
statistic
characteristic or measure obtained from a sample
We compute _____ and use them to estimate _____.
We compute statistics and use them to estimate parameters.
nominal scale
The lowest measurement scale.
Used for naming or labeling, not ordering.
Though numbers can be used, the relationships between the numbers are not meaningful.
Ex: Categorical and Dichotomous variables (Marital status, DL #, SSN)
ordinal scale
observations are ranked; level of differences between ranks is unknown
Ex: Low, Medium, High; Likert-type scale
interval scale
observations are ranked; level of differences between ranks is equal; scale is relative
No true zero point, so ratios are meaningless.
Ex: Temperature (F/C) or pH scales (0 does not equal absence of heat/acidity)
ratio scale
observations are ranked; level of differences between ranks is equal;
true zero point exists
Ex: height, length, Kelvin Temperature scale (defines 0K as absolute zero)
Measures of disease frequency
count, ratio, proportion, rate
count
number of cases of a disease or other health condition;
Ex: dorm students with COVID-19
proportion
measure that states a count relative to the size of the group;
numerator/denominator
Ex: dorm students with COVID-19/all students
ratio
divide one number into another number
numerator does not have to be a subset of denominator
Ex: dorm students with COVID-19/dorm students with flu
rate
similar to ratios and proportions, but includes a time component
Ex: % of dorm students with COVID-19 in 2020
Descriptive Study Examples
- case studies/reports
- cross-sectional studies
- ecological studies
Analytical Study Examples
- Case-control Studies
- cohort studies
- randomized controlled trials
Cohort Study
begin with a group of people who are disease free at baseline
Follow over time and classify on exposure; identify incident cases
MOA: Relative risk
Good for prevalent diseases
Case-Control Study
Compare Diseased (cases) to Disease free (controls)
Classify on disease status; collect exposure data retrospectively
MOA: Odds ratio
Good for rare diseases
RR or OR = 1
no association between exposure and outcome
RR or OR > 1
exposure increases risk of the outcome
Positive (direct) association
RR or OR < 1
exposure decreases risk of the outcome
Negative (inverse) association
RR range
0 to infinity (a ratio of two risks cannot be negative)
When interpreting OR, begin with the _____
outcome
When interpreting RR, begin with the _____
exposure
Attributable risk
tells us how much of the disease that occurs can be attributed to a certain exposure
calculate among exposed individuals or an entire population
background risk
the risk of non-exposed people is not zero
Ex: some people who get lung cancer do not smoke
Attributable risk formula
(incidence in exposed) - (incidence in unexposed)
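As a minimal sketch of the formula above, the attributable risk among the exposed is a simple risk difference. The function name and the cohort counts are hypothetical and are not taken from the source.

```python
def attributable_risk(cases_exposed, total_exposed, cases_unexposed, total_unexposed):
    """Risk difference: (incidence in exposed) - (incidence in unexposed)."""
    incidence_exposed = cases_exposed / total_exposed
    incidence_unexposed = cases_unexposed / total_unexposed  # background risk
    return incidence_exposed - incidence_unexposed

# Hypothetical cohort: 30 of 100 exposed and 10 of 100 unexposed develop the disease.
ar = attributable_risk(30, 100, 10, 100)
print(round(ar, 2))  # 0.2, i.e. 20 cases per 100 exposed are attributable to the exposure
```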
simple random sample
enumerate all members of the population N
select n individuals at random (each has the same probability of being selected)
systematic sampling
- start with sampling frame
- determine sampling interval (N/n)
- select first person at random from first (N/n) and every (N/n) thereafter.
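The three steps above can be sketched as follows. This is an illustration only: it assumes the sampling frame is a Python list, rounds the interval N/n down to a whole number, and uses a hypothetical frame of 100 ID numbers.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Pick a random start in the first interval, then every k-th unit after it."""
    k = len(frame) // n        # sampling interval (N/n), rounded down
    start = rng.randrange(k)   # random position within the first interval
    return [frame[start + i * k] for i in range(n)]

frame = list(range(1, 101))    # hypothetical sampling frame, N = 100
sample = systematic_sample(frame, 10, random.Random(0))  # seeded for repeatability
print(sample)
```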
Stratified sampling
organize population into mutually exclusive strata, select individuals at random within each stratum
binomial distribution
- models # of events out of n observations
- 2 possible outcomes: success or failure
- replications of process are independent
- P(success) is constant for each replication
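Under these assumptions, the probability of exactly k successes in n observations is C(n, k) * p^k * (1 - p)^(n - k). A short sketch, with a hypothetical function name:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with constant success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Example: exactly 2 successes in 5 trials with p = 0.5 is 10/32.
prob = binomial_pmf(2, 5, 0.5)
print(prob)  # 0.3125
```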
normal distribution
μ = mean; σ = standard deviation
mean = median = mode and are located at the center of the distribution (not skewed)
area under curve = probability of observation
2 statistical inference methods:
1. Estimation
2. Hypothesis Testing
Estimation
sample statistics are used to generate estimates of the population parameter
Hypothesis Testing
Sample statistics are analyzed to either support or reject the hypothesis about the parameter.
Are statistics from different samples in the same population the same?
No, the sample mean of the second sample is likely to be different from the first sample mean.
sampling distribution
the distribution of a sample statistic (e.g., the sample mean) across many samples drawn from the same population
point estimate
the “best” single estimate of that parameter
confidence interval
range of plausible values for the population parameter; carries a level of confidence
confidence level
reflects the likelihood that the confidence interval contains the true, unknown parameter;
common choices: 90%, 95%, and 99%
If we repeatedly generate similar Confidence Intervals for the same population, 95% of those intervals will cover the true parameter.
As Confidence Level _____, Confidence Interval _____.
As Confidence Level increases, Confidence Interval widens.
standard error
reflects the variability of the sampling distribution of the sample statistic
estimated standard error formula
s/ square root of n
s = sample std. dev. n = sample size
As sample size _____ , standard error _____ .
As sample size increases, standard error decreases.
Small samples have large standard errors
population standard deviation can be _____ by sample standard deviation.
replaced
The midpoint of the Confidence Interval is _____.
the sample mean (the point estimate)
margin of error formula
Z * s/square root of n
s = sample std. dev. n = sample size
Z reflects the critical value for _____.
confidence level
Confidence interval formula
Sample mean +/- Z * s/square root of n
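The formula above can be sketched with the standard library. The blood pressure readings are hypothetical, and with a sample this small a t critical value would usually replace Z; the Z version is shown only because it is the formula on the card.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_confidence_interval(data, confidence=0.95):
    """Return (lower, upper) for: sample mean +/- Z * s / sqrt(n)."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)                                      # sample standard deviation
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    margin = z * s / sqrt(n)                             # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical systolic blood pressure readings (mmHg).
sbp = [118, 122, 125, 130, 121, 127, 119, 124, 128, 126]
ci_95 = z_confidence_interval(sbp, 0.95)
ci_99 = z_confidence_interval(sbp, 0.99)
print(ci_95, ci_99)
```

The 99% interval comes out wider than the 95% interval, and both are centered on the sample mean.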
null hypothesis (H0)
assumes nothing is going on, usually carries equality
alternative hypothesis (HA)
the “research hypothesis”
reflects the researcher’s belief
Hypothesis Testing: 2 Possible Conclusions
1. Reject the null hypothesis
2. Fail to reject the null hypothesis
Hypothesis Testing: 2 Possible Hypotheses
1. Null hypothesis
2. Alternative hypothesis
Hypothesis Testing Procedures
- Set up a null and research hypothesis
- Determine significance level - acceptable rate at which a Type I error can occur.
- Select test
- Compute test statistic
- Compute p-value
- Compare p-value to alpha
- Draw conclusion + summarize significance
3 Choices for Hypothesis Statements
- Non-Directional (key word = difference); not equal
- Directional (key word = greater, more, positive direction); greater than
- Directional (key word = less, smaller, negative direction); less than
Non-Directional hypothesis testing (Two-Tailed)
H0 : μ = x
HA : μ ≠ x
Directional hypothesis testing (Right-Tailed)
H0 : μ = x
HA : μ > x
Directional hypothesis testing (Left-Tailed)
H0 : μ = x
HA : μ < x
Hypothesis Testing - Decision Making
If the test statistic is more extreme than the critical value (e.g., |test statistic| > critical value for a two-tailed test), reject the null
p-value
the probability of observing the obtained data (or more extreme values) given that the null hypothesis is true
use to measure the significance of the test (is there enough evidence to reject H0?)
_____ the null hypothesis if the p-value is _____ than the alpha level
Reject; lower
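A minimal sketch of the decision rule, assuming a one-sample Z test on a mean. The function name, its `tail` argument, and the numbers are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, s, n, tail="two"):
    """Return (z, p) for H0: mu = mu0; tail is "two", "right", or "left"."""
    z = (sample_mean - mu0) / (s / sqrt(n))
    cdf = NormalDist().cdf(z)
    if tail == "right":
        p = 1 - cdf
    elif tail == "left":
        p = cdf
    else:
        p = 2 * min(cdf, 1 - cdf)
    return z, p

# Hypothetical data: sample mean 103, hypothesized mean 100, s = 15, n = 100.
z, p = z_test_p_value(sample_mean=103, mu0=100, s=15, n=100)
alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(round(z, 2), round(p, 4), decision)
```

Here z = 2.0 and the two-tailed p-value is about 0.0455, which is lower than alpha = 0.05, so the null hypothesis is rejected.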
Type I Error
(Alpha)
Reject a true null hypothesis
Most dangerous type of error
Type II Error
(Beta)
Fail to reject a false null hypothesis
alpha
probability of making a Type I error
error rate
beta
probability of making a Type II error
error rate
power
1-beta
rate at which a test correctly rejects a false null hypothesis
power is dependent on _____
effect size;
a larger effect size can be detected more readily than a smaller one
Small effect sizes may require _____ sample sizes
larger
Chi Square test of independence
determines whether 2 categorical variables are independent or share an association
Chi Square Test Statistic formula
X^2 = the sum of (observed - expected)^2/expected
Expected value formula (Chi Square test for independence)
(column total * row total) / total
Chi Square test of independence - Degrees of freedom formula
Df = (# of rows - 1) * (# of columns - 1)
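The three chi-square formulas above (test statistic, expected value, degrees of freedom) fit together as in this sketch. The function name and the 2x2 table of counts are hypothetical.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # (row total * column total) / total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Hypothetical counts: rows = exposure groups, columns = outcome yes/no.
observed = [[20, 30],
            [30, 20]]
chi2, df = chi_square_independence(observed)
print(chi2, df)  # 4.0 1
```

Every expected count in this table is 25, so each cell contributes 25/25 = 1 to the statistic. With 1 degree of freedom, 4.0 exceeds the 0.05 critical value of 3.841.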
2 Independent Sample T Test
tests for a difference between 2 unrelated population means of continuous outcomes
population variance is unknown
ANOVA F-Test
determines whether or not the means of more than 2 populations are statistically different
Hypothesis Testing is only for _____.
population parameters
correlation
measures the strength of the linear relationship between 2 continuous variables;
equivalent to simple linear regression
regression
estimates the value of one continuous variable corresponding to a given value of another variable
Correlation Coefficient
r;
measures the strength of the linear relationship between x & y
correlation coefficient range
-1 to +1
Correlation coefficient sign
indicates nature of relationships
positive=direct; negative=inverse
r^2
proportion of variation in Y explained by the predictor variable(s)
range from 0 (low variation explanation) to 1 (explains a lot of variation)
Want to be high ;)
Simple linear regression formula
Y = β0 +β1x + error
Y = dependent/outcome variable X = independent/predictor variable β0 = intercept β1 = slope
linear regression example
What is the expected Systolic BP for a male with BMI=20?
Y = SBP; X = BMI
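To make the example concrete, here is a least-squares sketch. The BMI and SBP values are invented and deliberately lie on an exact line (SBP = 83 + 1.5 * BMI) so the arithmetic is easy to check; real data would include the error term.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares estimates (b0, b1) for Y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx             # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1

# Hypothetical data, not from the source.
bmi = [18, 22, 26, 30, 34]
sbp = [110, 116, 122, 128, 134]
b0, b1 = fit_line(bmi, sbp)
expected_sbp = b0 + b1 * 20    # expected SBP at BMI = 20
print(b0, b1, expected_sbp)    # 83.0 1.5 113.0
```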
scatterplot
helps to visualize relationships in bivariate data
r = 0.4. What is the percent variation?
r^2 = 0.4^2 = 0.16 x 100 = 16%
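The same calculation can be run from raw data. This sketch computes Pearson's r by hand and then squares it; the function name and the five (x, y) pairs are hypothetical.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)
print(round(r, 2), round(r**2 * 100))  # 0.8 64  (r = 0.8, so 64% of variation explained)
```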
bar plot
for categorical data
histograms
for continuous and ordinal data
box (and whisker) plots
for continuous data possibly with outliers or skewed data
categorical variable
fixed # of outcomes (nominal scale)
2 possible outcomes = Dichotomous variable
ordinal variable
fixed number of outcomes with an inherent order
ordinal scale
continuous variable
outcome (interval or ratio) may be any numerical value between a defined minimum and maximum
E.g. GPA is any # between 0.0 and 4.0
Summarizing categorical or ordinal variables
- use frequencies (counts of categories)
- Use relative frequencies (percentages of categories)
- present in table format
- graph in a bar chart
Summarizing continuous variables
- central tendency: sample mean (X bar), median (2nd Quartile), mode
- Variability: sample std dev, variance, range, or Interquartile range (3rd - 1st quartile)
sample standard deviation
(s) spread from mean in original units
variance
(s^2) spread from mean in squared units
Interquartile Range
3rd - 1st Quartiles
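These summaries can be computed with Python's `statistics` module, as sketched below on a hypothetical sample. Note that `statistics.quantiles` uses its default "exclusive" method here; other software may use a different quartile convention and give slightly different values.

```python
from statistics import mean, median, mode, quantiles, stdev, variance

data = [2, 4, 4, 5, 7, 9, 11]        # hypothetical sample

q1, q2, q3 = quantiles(data, n=4)    # 1st, 2nd (median), and 3rd quartiles
summary = {
    "mean": mean(data),               # 6
    "median": median(data),           # 5
    "mode": mode(data),               # 4
    "std_dev": stdev(data),           # spread in original units
    "variance": variance(data),       # spread in squared units (10)
    "range": max(data) - min(data),   # 9
    "iqr": q3 - q1,                   # 9 - 4 = 5
}
print(summary)
```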
Variability
how spread out are values in the population?
Histograms
graphical representation of the distribution of (continuous or ordinal) data
shape reflects distribution type, which determines which numerical summary to use
Normal distribution shape
more observations in the middle
mean = median = mode
symmetric about the mean; area to the left/right = 0.5
Positive skew
more observations on the left, tail to the right
mean > median
Negative skew
more observations to the right, tail to the left
mean < median
Graphing skewed data
use box (and whisker) plot
shows sample minimum (left whisker) + maximum (right whisker);
1st Quartile (left edge of box);
2nd Quartile (middle of box = median);
3rd Quartile (right edge of box)
Percentile
the kth percentile is the value below which k% of all values fall:
Scoring in the 90th percentile = scoring better than 90% of people who took the exam
Normal Distribution 68/95/99.7 Rule
- 68% of population within 1 standard deviation of mean
- 95% of population within 2 standard deviations of mean
- 99.7% of population within 3 standard deviations of mean
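The areas behind this rule can be checked against the standard normal distribution. A short sketch, with a hypothetical helper name:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The printed areas are about 0.6827, 0.9545, and 0.9973, so the usual percentages are rounded values.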
Z score formula
Z = (X - mean)/Std dev
transform any normal value into a standard value
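A small worked example of the transformation, using hypothetical values (X = 130 from a distribution with mean 100 and standard deviation 15):

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mean) / std_dev

z = z_score(130, 100, 15)          # (130 - 100) / 15 = 2.0
area_below = NormalDist().cdf(z)   # proportion of the distribution below this value
print(z, round(area_below, 4))
```

A Z score of 2.0 sits at roughly the 97.7th percentile of a normal distribution.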
Two Sample Z Test
- tests whether there is a difference in population means between two groups
- population variance is known
Chi Square Goodness of Fit
Does the sample come from a hypothesized distribution?
for continuous data: divide data into intervals, then apply test
For continuous independent and dependent variables use _____ (measure of association).
correlation
For dichotomous independent and dependent variables use _____ (measure of association).
relative risk -or- odds ratio
relative risk (RR)
risk of getting the disease with the risk factor compared to the risk of getting the disease without the risk factor
(a/(a+b))/(c/(c+d))
odds ratio (OR)
ratio of the odds of having the disease with the risk factor compared to the odds of having the disease without the risk factor
(a/b)/(c/d) -or- ad/bc
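Both measures can be computed from a 2x2 table. This sketch assumes the usual cell layout, which the cards do not spell out: a = exposed with disease, b = exposed without disease, c = unexposed with disease, d = unexposed without disease. The counts are hypothetical.

```python
def relative_risk(a, b, c, d):
    """Risk in the exposed divided by risk in the unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc."""
    return (a * d) / (b * c)

# Hypothetical table: 30 of 100 exposed and 10 of 100 unexposed have the disease.
a, b, c, d = 30, 70, 10, 90
rr = relative_risk(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(round(rr, 2), round(odds, 2))  # 3.0 3.86
```

When the disease is common, as in this made-up table, the OR (3.86) overstates the RR (3.0); the two are close only when the disease is rare.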
If the value 1 is included within confidence interval, then the OR or RR is _____. Otherwise it is _____.
not significant; significant
Simple linear regression
Models the relationship between independent (X) and dependent (Y) variables;
Dependent (Y) variable must be continuous
When X increases by _____ unit, Y changes by _____.
1 unit; β1 (slope)
If β1 > 0 then X and Y are _____ proportional and variables have _____ association
directly; positive
If β1 < 0 then X and Y are _____ proportional and variables have _____ association
inversely; negative
If β1 = 0 then X and Y are _____ and variables are _____.
not linearly related; not associated
logistic regression
used when dependent (Y) variable is dichotomous
Ex: Someone has the disease or not
e^β1 = ____
odds ratio when X increases by 1 unit
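A quick numerical check of this relationship, assuming a logistic model in which the odds of the outcome equal exp(b0 + b1 * x). The coefficient values are hypothetical.

```python
from math import exp, log

def odds(b0, b1, x):
    """Odds of the outcome under a logistic model with intercept b0 and slope b1."""
    return exp(b0 + b1 * x)

b0, b1 = -1.0, log(2)                       # hypothetical coefficients
ratio = odds(b0, b1, 6) / odds(b0, b1, 5)   # odds at x + 1 divided by odds at x
print(round(ratio, 4), round(exp(b1), 4))   # 2.0 2.0
```

The ratio is the same whichever starting value of x is used, which is why a single number e^β1 summarizes the effect of a 1-unit increase.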
multiple regression
models the relationship between dependent (Y) and independent (X) variables while also considering other variables that may affect the relationship (e.g. confounders)
more than 1 independent (X) variable
survival analysis
collection of statistical procedures used when the outcome is time until an event
From the time we start to observe, when does the event occur?
goal: analyze survival experience of a population of interest
Survival analysis - time
measure of time from the beginning of follow-up until the event for an individual
e.g. days, weeks, months, years
Survival analysis - event
occurrence of interest
e.g. death, disease incidence, relapse, recovery
survival analysis - censoring + 3 reasons
exact survival time is unknown
three reasons
- study ends before an individual experiences event
- individual is lost to follow-up during the study
- individual is withdrawn from the study (e.g. death before event of interest occurs).
3 types of censoring
- right censored
- left censored
- interval censored
right censored data
we know when survival time starts, but not when or if event occurs
left censored data
start of survival period is unknown
E.g. survival time of HIV patient begins at infection, but may not enter study until tested positive
interval censored data
the exact time of the event is unknown within the interval
occurs in studies where subjects are not monitored continuously
survival function/curve
in theory, survival curves are continuous and smooth
Common application is to compare survival functions of two groups
Kaplan Meier estimator
method used to practically visualize survival curves for a study
estimated as a step function
1 step down = 1 event occurred
does not usually decrease to 0, since not everyone will experience the event during the study
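The step-function behavior can be seen in a small sketch of the estimator. It assumes right-censored data only and the common convention that a subject censored at an event time still counts as at risk at that time. The function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time where an event occurred.

    events[i] is 1 if the event was observed at times[i], 0 if right-censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        event_count = 0
        leaving = 0
        while i < len(data) and data[i][0] == t:   # group subjects sharing this time
            event_count += data[i][1]
            leaving += 1
            i += 1
        if event_count:
            survival *= 1 - event_count / at_risk   # one step down per event time
            curve.append((t, survival))
        at_risk -= leaving                          # events and censored both leave the risk set
    return curve

# Hypothetical follow-up times (months); 0 marks a censored subject.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print([(t, round(s, 2)) for t, s in curve])  # [(2, 0.8), (3, 0.6), (5, 0.3)]
```

The curve ends at 0.3 rather than 0 because the last subject was censored without experiencing the event.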
log rank test
if test rejects, the survival curves are significantly different;
works for 2+ groups
does not tell you which is better (visually compare or compare means)
reliability
- Consistency of measures
- Are similar results produced under similar conditions
- Uses Cronbach’s alpha
- high reliability does not mean high validity (accuracy)
Cronbach’s alpha
an indicator of internal consistency
ranges from 0 to 1
higher values = higher internal consistency
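For illustration, Cronbach's alpha is commonly computed as k/(k-1) * (1 - sum of item variances / variance of total scores), where k is the number of items. That formula is not stated on the card and is assumed here; the function name and the questionnaire scores are hypothetical.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, each with one score per respondent."""
    k = len(items)
    item_variance_sum = sum(variance(scores) for scores in items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per respondent
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

# Hypothetical 3-item questionnaire answered by 5 respondents.
items = [
    [2, 3, 4, 4, 5],
    [1, 3, 3, 4, 5],
    [2, 2, 4, 5, 5],
]
alpha = cronbach_alpha(items)
print(round(alpha, 2))  # about 0.95
```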
Validity
- Accuracy of a measure
- Does the result actually reflect the true measure
- Often difficult to know if a measure is valid
confounding
extraneous variable that distorts the true effect of the independent variable (exposure) on the dependent variable (outcome)
Ways to control confounding
1. Stratification (single confounder)
2. Regression (multiple confounders)
Stratification
conduct separate analysis for each level of a confounding variable
Effect Modification
the effect of an independent variable (X) on the dependent variable (Y) differs depending on the level of the third variable
Poisson distribution
models # of events out of infinite (in theory) observations
not practical
use when the event is rare or when modeling # of events over space or time
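For reference, the Poisson probability of observing exactly k events when the average number of events is lam is lam^k * e^(-lam) / k!. A minimal sketch with a hypothetical rate:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events) when the mean number of events per interval is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Hypothetical rate: an average of 2 cases per month.
p_zero = poisson_pmf(0, 2)   # probability of a month with no cases
print(round(p_zero, 4))      # 0.1353
```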
Increasing sample size _____ variability of the estimate.
decreases