Stats and epi definitions Flashcards
statistical heterogeneity
Statistical heterogeneity manifests itself in the observed intervention effects being more different from each other than one would expect due to random error (chance) alone.
clinical heterogeneity
differences in study population characteristics and in the type of intervention
methodological heterogeneity
differences in study design - blinding, sources of bias, and the way the outcomes are defined and measured
conditional probability
the probability of A occurring given that B has already occurred
P(A|B) = P(A and B) / P(B)
Bayes' theorem is based on conditional probability
Bayes' theorem
It answers the question: "Given some new information, how should I update what I already believe?" e.g. in diagnostic tests, it combines other information about a patient, or the prevalence of disease, with the probability of a diagnostic test being correct.
P(A|B) = P(B|A) * P(A) / P(B), where:
P(A|B) = posterior probability
P(B|A) = likelihood - probability of seeing B if A is true
P(A) = prior probability
P(B) = total probability of B
e.g. A = having the disease, B = positive test result
posterior probability = probability of having the disease given a positive test result (B)
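A minimal worked sketch of the diagnostic-test example, assuming illustrative values for prevalence, sensitivity and specificity (none are from a real test):

```python
# Bayes' theorem for a diagnostic test: P(disease | positive)
prevalence = 0.01   # P(A): prior probability of disease (assumed)
sensitivity = 0.90  # P(B|A): P(positive | disease) (assumed)
specificity = 0.95  # P(negative | no disease) (assumed)

# Total probability of a positive test, P(B):
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Posterior, P(A|B) = P(B|A) * P(A) / P(B)
posterior = sensitivity * prevalence / p_positive
print(f"P(disease | positive) = {posterior:.3f}")  # ~0.154
```

Note how a low prior (1% prevalence) keeps the posterior low even for a fairly accurate test.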
Frequency polygon =
line added to a histogram joining the centre of each bar to show the shape of the distribution
probability multiplication rule
P(A and B) = P(A) * P(B) (for independent events)
Used when you want to know the probability of both events occurring simultaneously.
Dependent events:
When events are not independent, the multiplication rule becomes more complex, requiring conditional probability calculations: P(A and B) = P(A) * P(B|A)
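A small sketch of both forms of the rule, using standard dice and playing-card probabilities:

```python
from fractions import Fraction

# Independent events: P(six on die 1 and six on die 2) = P(6) * P(6)
p_two_sixes = Fraction(1, 6) * Fraction(1, 6)    # 1/36

# Dependent events: P(drawing two aces without replacement)
# P(A and B) = P(A) * P(B|A) = 4/52 * 3/51
p_two_aces = Fraction(4, 52) * Fraction(3, 51)   # 1/221
print(p_two_sixes, p_two_aces)
```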
probability addition rule
P(A or B) = P(A) + P(B) - P(A and B)
for mutually exclusive events = P(A) + P(B)
Used when you want to know the probability of either one of two events happening.
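A small sketch of the rule with playing cards, covering both the overlapping and the mutually exclusive case:

```python
from fractions import Fraction

# Overlapping events: P(king or heart) = P(king) + P(heart) - P(king of hearts)
p_king_or_heart = Fraction(4, 52) + Fraction(13, 52) - Fraction(1, 52)  # 4/13

# Mutually exclusive events: P(king or queen) = P(king) + P(queen)
p_king_or_queen = Fraction(4, 52) + Fraction(4, 52)                     # 2/13
print(p_king_or_heart, p_king_or_queen)
```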
Poisson distribution
= a probability distribution that describes how many times an event is likely to occur over a specified period. It is a count distribution,
the parameter of which is lambda (λ): the mean number of events in the specified interval. (discrete quantitative data - incidence rates) e.g. number of radioactive emissions detected by a Geiger counter in 5 minutes.
mean = variance (both equal λ)
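A minimal sketch of the Geiger-counter example using scipy, assuming an illustrative mean of 3 emissions per 5-minute interval:

```python
from scipy.stats import poisson

lam = 3.0  # assumed mean number of emissions per 5-minute interval
print(poisson.pmf(5, lam))                   # P(exactly 5 emissions) ~ 0.101
print(poisson.mean(lam), poisson.var(lam))   # mean = variance = lambda
```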
Binomial distribution
probability distribution for data with two outcomes - success or failure
summarizes the probability that an observation will take one of two values under a given set of parameters or assumptions.
The underlying assumptions of the binomial distribution are that there is a fixed number of trials, each trial has only two possible outcomes, each trial has the same probability of success, and the trials are independent of one another.
Defined by n (number of trials / sample size) and π (true probability of success or proportion)
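A minimal sketch using scipy, with assumed illustrative values n = 10 trials and π = 0.3:

```python
from scipy.stats import binom

n, p = 10, 0.3                 # assumed number of trials and probability of success
print(binom.pmf(4, n, p))      # P(exactly 4 successes) ~ 0.200
print(binom.cdf(4, n, p))      # P(4 or fewer successes) ~ 0.850
```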
normal distribution
probability distribution for continuous data
The normal distribution describes a symmetrical plot of data around its mean value, where the width of the curve is defined by the standard deviation
95% of values are within 1.96 SDs of the mean.
Many other distributions (e.g. binomial, Poisson) approximate towards the normal distribution as the sample size increases.
Standard normal distribution
has a mean of 0 and SD of 1
used to convert another data set to the standard normal via the z score, z = (x - mean) / SD, which shows how many SDs a result is from the mean
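A minimal sketch of the conversion, assuming illustrative values (mean 120, SD 10, observation 140):

```python
from scipy.stats import norm

mu, sd, x = 120, 10, 140
z = (x - mu) / sd                          # z score: SDs from the mean
print(z)                                   # 2.0
print(norm.cdf(1.96) - norm.cdf(-1.96))    # proportion within 1.96 SDs ~ 0.95
```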
central limit theorem
the sampling distribution of the sample mean approximates towards the normal distribution as sample size increases, whatever the underlying distribution of the data.
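A small simulation sketch: sample means from a skewed (exponential) distribution become approximately normal, and less variable, as n grows; all values below come from simulated data only:

```python
import random
import statistics

random.seed(1)
for n in (2, 30, 200):
    # 5000 sample means, each from n draws of a skewed distribution
    means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
             for _ in range(5000)]
    # the spread of the means (the standard error) shrinks as 1/sqrt(n)
    print(n, round(statistics.mean(means), 2), round(statistics.stdev(means), 3))
```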
p value
the probability of obtaining a result at least as extreme as the one observed if the null hypothesis were true
Sample size calculations - what are they for and what do you need
Sample size calculations = ensure the study has sufficient number of participants to answer the study question i.e. detect an association if one truly exists. Depends on:
- the null and alternative hypotheses.
- The type of outcome variable (e.g. difference in mean, risk ratio)
- Effect size for a clinically significant result (a smaller effect needs a larger sample)
- The variability in the outcome data – mean, SD, prevalence (from local data)
- Significance level
- Power
- Population proportion / prevalence of the outcome (cohort studies) or exposure (case-control) - a smaller prevalence needs a larger sample size
- Also consider dropout rates, design (clustered, multiple arms), ethics, budget
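A minimal sketch of one common case (comparing two proportions with the standard normal-approximation formula); the prevalences, significance level and power below are assumed for illustration:

```python
from scipy.stats import norm

p1, p2 = 0.20, 0.30             # assumed outcome prevalence in each group
alpha, power = 0.05, 0.80       # significance level and power
z_a = norm.ppf(1 - alpha / 2)   # 1.96
z_b = norm.ppf(power)           # 0.84
n = (z_a + z_b) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2) ** 2
print(round(n))                 # ~290 per group; round up and allow for dropout
```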
regression
a method for modelling the relationship between a dependent variable (target) and one or more independent variables (predictors).
It helps in predicting outcomes, identifying trends, and understanding the strength and nature of relationships between variables. It can be used to assess whether there is an association between variables, and to predict the value of one variable from the value of another within the dataset
Linear regression and assumptions
models relationship between a continuous dependent variable and one or more (multiple linear regression) independent variables using a linear equation
additive scale
assumptions:
Linear relationship between dependent and independent variables
the residuals (the differences between observed and predicted values) are normally distributed
no multicollinearity - the independent variables should not be too highly correlated with each other
constant variance of the residuals across values of the predictors (homoscedasticity)
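A minimal sketch using statsmodels on simulated data (the true intercept and slope are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, 100)  # true intercept 2, slope 0.5

X = sm.add_constant(x)                     # adds the intercept term
model = sm.OLS(y, X).fit()
print(model.params)                        # estimated intercept and slope
resid = model.resid                        # residuals: check ~normal, constant variance
```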
logistic regression and assumptions
models the probability of a binary outcome (dependent variable - binary data) based on predictors (independent variables - can be any type)
log scale - output is the log of the odds; exponentiated coefficients give odds ratios
assumptions:
Independent observations (the observations should not come from repeated measurements or matched data).
no multicollinearity among the independent variables, meaning the independent variables should not be too highly correlated with each other.
linearity of independent variables and log odds of the dependent variable.
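A minimal sketch using statsmodels on simulated data; the coefficients used to generate the outcome are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
log_odds = -1.0 + 0.8 * x                          # linear on the log-odds scale
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))   # binary outcome

X = sm.add_constant(x)
model = sm.Logit(y, X).fit(disp=0)
print(np.exp(model.params))        # exponentiated coefficients = odds ratios
```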
Poisson regression and assumptions
models an outcome which is count data (rates)
output on a log scale - rate ratio
assumes:
- Poisson distribution of the outcome - the variance equals the mean, so it cannot be greater than the mean (no overdispersion)
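A minimal sketch using statsmodels on simulated counts; the generating coefficients are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 500)
y = rng.poisson(np.exp(0.5 + 0.3 * x))   # counts generated with a log link

X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(np.exp(model.params))              # exponentiated coefficients = rate ratios
```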
cox regression
models the relationship between an outcome which is time-to-event data and one or more independent variables (predictors)
output is on a log scale - hazard ratio
assumptions:
proportional hazards
censored data do not differ systematically from uncensored data (non-informative censoring)
independent observations
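A minimal sketch using the lifelines package on a tiny made-up dataset (all values and column names are illustrative):

```python
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "time": [5, 8, 12, 3, 9, 15, 7, 11],      # time to event or censoring
    "event": [1, 1, 0, 1, 0, 1, 1, 0],        # 1 = event, 0 = censored
    "age": [60, 65, 50, 72, 55, 48, 68, 59],  # a single predictor
})
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
print(cph.hazard_ratios_)        # exp(coef) = hazard ratio per unit of age
cph.check_assumptions(df)        # checks the proportional hazards assumption
```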
Cluster randomised trials - why, approaches, pros and cons
why: feasibility - some interventions are implemented at the group level (media campaigns, policy, group counselling or education)
Some interventions require structural change in the delivery of care such that it is not possible to randomise individuals to receive different types of care.
reduces risk of contamination between groups
cons:
harder to interpret results - needs additional skill to design, implement and analyse
requires larger sample size - more expensive
may be more complex to generalise
analysis of clustered data
account for clustering in regression models
still analyse all individuals but need to account for the intra-cluster correlation (ICC)
or do an aggregate analysis using the cluster as the experimental unit
sample size for cluster trials
calculate the intra-cluster correlation coefficient (ICC) - quantifies the homogeneity within clusters and informs how much you need to inflate the sample size
the factor you need to inflate by is the design effect: DE = 1 + (m - 1) * ICC, where m is the average cluster size
generally around a 30% larger sample size is needed
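A minimal sketch of the inflation, with assumed illustrative values for the ICC, cluster size and unadjusted sample size:

```python
import math

n_individual = 300   # sample size from a standard (individual) calculation
m = 20               # average cluster size
icc = 0.02           # intra-cluster correlation coefficient

design_effect = 1 + (m - 1) * icc               # 1.38 here
n_cluster = math.ceil(n_individual * design_effect)
print(design_effect, n_cluster)                 # 1.38, 414
```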
explanations for study findings
- True association
- Chance finding (eg small numbers, sampling error)
- Confounding, e.g. social deprivation associated with both the exposure and higher crime rates
- Bias, e.g. information bias due to inconsistent recording of results, or selection bias