Analysis of epidemiological data Flashcards

Question 1

Q

Data types

Answer

A

Qualitative (categorical)
a. Binary (equivalent, 2 categories)
b. Nominal (equivalent, more than 2 categories)
c. Ordinal (ranked)
Quantitative
a. Continuous (measurements)
i. Interval e.g. date (zero is arbitrary, so ratios are not meaningful)
ii. Ratio e.g. age
b. Discrete (counts)

Question 2

Q

Statistical inference

Answer

A

Process of drawing conclusions about an entire population based on the information in a sample.

Precision based methods (confidence intervals)
Hypothesis testing methods

Question 3

Q

Confidence interval - definition

Answer

A

In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely (within a certain degree of confidence) to include the unknown population parameter.

The endpoints of the interval take values that depend on the random sample selected
If we select another random sample its mean and standard deviation would be different so we would obtain a different confidence interval
If one calculates 100 separate 95% confidence intervals from 100 random samples, on average 95 of them would contain the true value of the mean and 5 of them wouldn’t
The wider the confidence interval, the less precise the estimate
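The repeated-sampling interpretation above can be checked with a small simulation. This is only a sketch: the population mean, SD, sample size, number of repeats and random seed are arbitrary values chosen for illustration.

```python
import random
import statistics

random.seed(1)  # arbitrary seed so the run is repeatable

# Hypothetical population and sampling settings
TRUE_MEAN, TRUE_SD, N, REPEATS = 50.0, 10.0, 100, 1000

covered = 0
for _ in range(REPEATS):
    # Draw a new random sample; its mean and SD differ each time
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    # Count the interval if it contains the true population mean
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The printed share should be close to, but not exactly, 95%, because the coverage is itself estimated from a finite number of samples.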

Question 4

Q

Confidence interval - calculation

Answer

A

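value +/- 1.96*SE (for a 95% confidence interval)

where value is a mean, proportion or ratio and SE is calculated as:

Mean: SD/sqrt(n)
Proportion: sqrt [p(1-p)/n]
Odds ratio: sqrt [1/a+1/b+1/c+1/d] (this is the SE of ln(OR), so the interval is calculated on the log scale and then exponentiated)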

Assumptions:

Sample is drawn randomly
Observations within a sample are independent
Sampled population is normally distributed (or the sample is large enough for the estimate to be approximately normal)
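As a rough illustration of the formulas on this card, the sketch below computes a 95% confidence interval for a proportion and for an odds ratio. The function names and the counts are made up for illustration. The odds ratio interval is built on the log scale and exponentiated, since sqrt [1/a+1/b+1/c+1/d] is the standard error of ln(OR).

```python
import math

Z_95 = 1.96  # multiplier for a 95% confidence interval

def proportion_ci(events, n, z=Z_95):
    """Interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

def odds_ratio_ci(a, b, c, d, z=Z_95):
    """Odds ratio and interval for a 2x2 table with cells a, b, c, d.

    The limits are computed for ln(OR) and then exponentiated.
    """
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Hypothetical data: 30 of 100 animals test positive
low, high = proportion_ci(30, 100)
print(f"p = 0.30, 95% CI {low:.3f} to {high:.3f}")

# Hypothetical table: exposed cases, exposed non-cases,
# unexposed cases, unexposed non-cases
or_, or_low, or_high = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_:.2f}, 95% CI {or_low:.2f} to {or_high:.2f}")
```

In the odds ratio example the interval includes 1, so these hypothetical data would not show an association at the 5% level.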

Question 5

Q

Hypothesis testing - definition, method (4)

Answer

A

Hypothesis testing is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample.

State the null and alternative hypotheses (e.g. population mean is equal to some value, population mean is not equal to some value)
Decide on the significance level (alpha)
Draw random sample from population and calculate test statistic
Make a decision as to whether to reject or not reject the null hypothesis based on the test statistic
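The four steps can be sketched for one simple case: a two-sided test of a single proportion using the normal approximation. The null value, the sample counts and the choice of a z-test are assumptions made for this example only.

```python
import math
from statistics import NormalDist

# Step 1: state the hypotheses (hypothetical null value)
# H0: population proportion = 0.20, H1: population proportion != 0.20
p0 = 0.20

# Step 2: decide on the significance level
alpha = 0.05

# Step 3: draw a sample and calculate the test statistic
events, n = 30, 100  # hypothetical sample: 30 positives out of 100
p_hat = events / n
se_null = math.sqrt(p0 * (1 - p0) / n)  # SE under the null hypothesis
z = (p_hat - p0) / se_null

# Step 4: decide whether to reject the null hypothesis
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
reject_null = p_value < alpha

print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```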

Question 6

Q

Null hypothesis

Answer

A

The null hypothesis (H0) is a statement about a population parameter, such as the population mean, that is assumed to be true, i.e. the burden is on the researcher to show that there is ample evidence to reject the null hypothesis. If the null hypothesis is true, then the sample statistic will equal the population parameter on average. Based on the outcome of hypothesis testing we reject or don’t reject the null (never accept the null since it can’t be proven).

Question 7

Q

Alternative hypothesis

Answer

A

An alternative hypothesis (H1) is a statement that directly contradicts the null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis.

Question 8

Q

p-value

Answer

A

Probability of obtaining a sample outcome as extreme as or more extreme than the observed sample statistic, given that the null hypothesis is true.

Compared to (pre-defined) significance level, alpha, to decide whether the null hypothesis should be rejected or not rejected.
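The "as extreme as or more extreme" part of the definition can be made concrete with an exact binomial calculation. The scenario (9 successes in 10 trials, null probability 0.5) is invented for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Hypothetical result: 9 successes in 10 trials, H0: p = 0.5
n, observed, p_null = 10, 9, 0.5

# One-sided p-value: probability of 9 or 10 successes if H0 is true
p_one_sided = sum(binomial_pmf(k, n, p_null) for k in range(observed, n + 1))

alpha = 0.05
print(f"p = {p_one_sided:.4f}, reject H0: {p_one_sided < alpha}")
```

Here the p-value is 11/1024, about 0.011, which is below the pre-defined alpha of 0.05.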

Question 9

Q

Type I error

Answer

A

Erroneously rejecting the null hypothesis when it is true (false positive); the probability of this error is alpha. Example: RCT testing new treatment vs standard treatment. We declare the new treatment is more effective, when in fact it is not (challenges status quo with potential for harm if product is introduced - patients wouldn’t receive proven treatment)

Question 10

Q

Type II error

Answer

A

Failing to reject the null hypothesis when it is false (false negative); the probability of this error is beta. Example: RCT testing new treatment vs standard treatment. We declare the new treatment is not more effective, when in fact it is (missed opportunity, but generally more acceptable since it maintains the status quo and patients still receive proven treatment)

Question 11

Q

Statistical power - definition, how to increase

Answer

A

Probability of rejecting the null hypothesis when it is false (1 - beta), e.g. clinical trial - probability of detecting a difference in outcome between animals receiving and not receiving a new treatment, given one exists.

Can increase power by:

decreasing beta (probability of a type II error)
increasing alpha (e.g. 0.1 instead of typical 0.05) - trade-off with type I error
increasing sample size (decreases standard error)
increasing the effect size one is willing to detect
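The effect of each factor on power can be seen in a rough calculation. The sketch assumes a two-sided comparison of two group means with a known common SD and equal group sizes, using the normal approximation. The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group z-test.

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * math.sqrt(2 / n_per_group)  # SE of the difference in means
    return NormalDist().cdf(abs(effect) / se - z_crit)

base = approx_power(effect=5, sd=10, n_per_group=64)
larger_n = approx_power(effect=5, sd=10, n_per_group=128)
larger_alpha = approx_power(effect=5, sd=10, n_per_group=64, alpha=0.10)
larger_effect = approx_power(effect=8, sd=10, n_per_group=64)

print(f"baseline power:      {base:.2f}")
print(f"double sample size:  {larger_n:.2f}")
print(f"alpha = 0.10:        {larger_alpha:.2f}")
print(f"larger effect size:  {larger_effect:.2f}")
```

Each of the three changes gives a higher power than the baseline, which is roughly 0.81 for these inputs.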

Question 12

Q

Descriptive statistics - summarizing data by data type (3)

Answer

A

Data can be described according to measures of position (central tendency) and spread

Binary/nominal: proportion
Ordinal: median, IQR (box and whisker plot)
Continuous: mean, SD (barchart with error bars)
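A minimal sketch of these summaries using Python's standard `statistics` module. The variable names and data values are invented for illustration.

```python
import statistics

# Hypothetical data, one list per data type
vaccinated = [1, 0, 1, 1, 0, 1, 0, 1]               # binary (1 = yes)
body_condition = [2, 3, 3, 4, 1, 5, 3, 2, 4]        # ordinal score, 1 to 5
weights_kg = [21.5, 23.0, 19.8, 25.1, 22.4, 20.9]   # continuous

# Binary/nominal: proportion
proportion = sum(vaccinated) / len(vaccinated)

# Ordinal: median and interquartile range (Q1 to Q3)
median_score = statistics.median(body_condition)
q1, _, q3 = statistics.quantiles(body_condition, n=4)

# Continuous: mean and sample standard deviation
mean_weight = statistics.mean(weights_kg)
sd_weight = statistics.stdev(weights_kg)

print(f"proportion vaccinated: {proportion:.3f}")
print(f"median score: {median_score}, IQR: {q1} to {q3}")
print(f"mean weight: {mean_weight:.2f} kg, SD: {sd_weight:.2f} kg")
```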

Question 13

Q

Statistical tests - compare 2 groups (unpaired data)

Answer

A

Continuous data: unpaired T-test

Ordinal data/non-normal data: Mann-Whitney U (non-parametric equivalent)

Nominal (binomial) data: Fisher’s exact test (chi-square for large samples)
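To show what Fisher’s exact test computes for a 2x2 table, here is a small pure-Python sketch. In practice a statistics package would normally be used; the function name and the table are hypothetical. The two-sided p-value here sums the probabilities of all tables, with the same margins, that are no more likely than the observed one, which is one common convention.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        # Hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = table_probability(a)
    smallest = max(0, col1 - row2)
    largest = min(row1, col1)
    # Small tolerance guards against floating-point ties
    return sum(
        table_probability(x)
        for x in range(smallest, largest + 1)
        if table_probability(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical small table: 3 vs 1 in group one, 1 vs 3 in group two
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.4f}")
```

For this table the p-value is 34/70, about 0.49, so there is no evidence of a difference between the groups.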

Question 14

Q

Statistical tests - compare 2 groups (paired)

Answer

A

Continuous data: paired T-test

Ordinal data/non-normal data: Wilcoxon signed-rank test

Nominal (binomial) data: McNemar’s test
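As an illustration of how pairing changes the analysis of binary data, the sketch below implements the exact (binomial) form of McNemar’s test, which uses only the discordant pairs. The function name and counts are hypothetical.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b and c are the counts of discordant pairs (positive on one
    measurement only). Under H0 each discordant pair is equally likely
    to fall in either cell, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical paired study: 2 pairs changed one way, 8 the other way
p_value = mcnemar_exact(2, 8)
print(f"p = {p_value:.4f}")
```

Concordant pairs do not enter the calculation at all, which is the key difference from an unpaired test on the same counts.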

Question 15

Q

Statistical tests - compare 3 or more groups (independent/unpaired)

Answer

A

Continuous data: One-way ANOVA

Ordinal data/non-normal data: Kruskal-Wallis test

Nominal (binomial) data: Chi-square test
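For the continuous case, the one-way ANOVA test statistic is the ratio of between-group to within-group variance. The sketch below computes only the F statistic; obtaining a p-value requires the F distribution, which is not in Python's standard library. The function name and data are made up.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F statistic: between-group mean square / within-group mean square."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical measurements from three independent groups
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(f"F = {f_statistic:.2f} on 2 and 6 degrees of freedom")
```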

Question 16

Q

Linear regression - uses, assumptions (4), output

Answer

A

Prediction of Y (continuous variable) based on one or more variables (X1, X2 etc). Can also assess/adjust for confounding by incorporating potential confounding variables. Can also assess for effect modification by incorporating an interaction term.

Assumptions:

Observations (Y) are independent
Linear relationship between outcome and predictor
Residuals are normally distributed
Homoscedasticity (variance of Y is same across all levels of predictor variables)

Output:

Overall significance of model assessed using F-test (ANOVA - null hypothesis is that all regression coefficients are zero; if significant, X variables explain some of the variation in Y)
Interpretation of regression coefficient: increase/decrease in Y for each unit increase in X; significance tested with t-test
Coefficient of determination, R²: proportion of variance in Y explained by the predictor variables
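The regression coefficient and R² can be illustrated with a single-predictor least-squares fit written out by hand. This is a sketch with invented data; it does not compute the F-test or t-test, which need the corresponding distributions.

```python
from statistics import mean

def simple_linear_regression(x, y):
    """Least-squares intercept, slope and R-squared for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx              # change in Y per unit increase in X
    intercept = y_bar - slope * x_bar
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_residual = sum(
        (yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)
    )
    r_squared = 1 - ss_residual / ss_total
    return intercept, slope, r_squared

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope, r_squared = simple_linear_regression(x, y)
print(f"Y = {intercept:.2f} + {slope:.2f} * X, R-squared = {r_squared:.3f}")
```

For these data the slope is 1.99, meaning Y increases by about 2 units for each unit increase in X.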

Question 17

Q

Logistic regression - uses, assumptions (2), output

Answer

A

Prediction of Y (dichotomous variable) based on one or more variables (X1, X2 etc). Can also assess/adjust for confounding by incorporating potential confounding variables. Can also assess for effect modification by incorporating an interaction term.

Assumptions:

Observations (Y) are independent
Linear relationship between outcome and predictor (logit of probability of outcome is modeled as a linear function of X)

Output:

Overall significance of model assessed using likelihood ratio test (compares full model with intercept-only model; if significant, then predictors contribute significantly to prediction of outcome)
Interpretation of regression coefficient (significance tested with Wald test):
- Dichotomous predictor: change in log odds of Y when X is present; converted to an odds ratio by exponentiating the coefficient (i.e. raising e to the power of the coefficient)
- Continuous predictor: change in log odds of Y for each unit increase in X; calculate OR using a specific example, e.g. change in X from 25 to 75 = (75-25)*B1 = some value, OR then calculated as e raised to the power of that value
Pseudo R²: analogous to R² in linear regression; an approximate measure of how much the predictors improve model fit, not literally the proportion of variance in Y explained
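The conversion from a logistic regression coefficient to an odds ratio is a one-line calculation. The coefficients below are hypothetical values, not output from a fitted model.

```python
import math

def odds_ratio(beta, delta_x=1.0):
    """Odds ratio for a change of delta_x in a predictor: e^(beta * delta_x)."""
    return math.exp(beta * delta_x)

def predicted_probability(intercept, beta, x):
    """Invert the logit: p = 1 / (1 + e^-(intercept + beta * x))."""
    return 1 / (1 + math.exp(-(intercept + beta * x)))

beta_binary = math.log(2)   # hypothetical coefficient, yes/no predictor
beta_continuous = 0.02      # hypothetical coefficient per unit of X

or_binary = odds_ratio(beta_binary)                  # approximately 2.0
or_25_to_75 = odds_ratio(beta_continuous, 75 - 25)   # e^(0.02 * 50) = e^1

print(f"OR when predictor is present: {or_binary:.2f}")
print(f"OR for X = 75 vs X = 25: {or_25_to_75:.2f}")
```

The second example follows the card's recipe: multiply the coefficient by the difference in X, then exponentiate.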

Question 18

Q

Survival analysis - uses, data characteristics (3), summarizing data (2), approaches to analysis(3)

Answer

A

Time-to-event data: subjects followed until they experience event (“failure”) or are lost to follow up (right censored). Dependent variable is duration.

Data characteristics:

Strictly positive values (durations cannot be zero or negative)
Often highly right skewed
Often observations are censored (lost to follow-up)

Summarizing time-to-event data:

Survival function: cumulative proportion of individuals that have not experienced the event over a given time at risk (usually summarized as the median survival time, i.e. the time at which 50% of individuals at risk have experienced the event)
Hazard function: instantaneous rate at which the event occurs at time t, given that the individual is still at risk at that time.

Analysis approaches:

Non-parametric: Compare survival times between groups (e.g. Kaplan-Meier estimate of survival function [graph] with log-rank test to test if survival functions are equal across groups)
Semi-parametric: Modeling the hazard as a function of predictors X₁ to Xₙ (Cox proportional hazards regression)
Parametric: e.g. exponential model
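The non-parametric approach can be illustrated with a hand-written Kaplan-Meier estimate. This is a minimal sketch with invented follow-up data; it omits confidence intervals and the log-rank test.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up time for each subject
    events: 1 if the event was observed, 0 if right censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = leaving = 0
        # Process all subjects whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored subjects leave the risk set too
    return curve

# Hypothetical follow-up times (e.g. months) and event indicators
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)

# Median survival: first time the estimate drops to 0.5 or below
median_survival = next(t for t, s in curve if s <= 0.5)
print(curve)
print(f"median survival time: {median_survival}")
```

Censored subjects contribute to the number at risk until they leave, but they never cause a drop in the curve.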

Question 19

Q

Paired data

Answer

A

Two measurements are paired when they come from the same observational unit: before and after, twins, husbands and wives, brothers and sisters, matched cases and controls. Pairing is determined by a study’s design.

Question 20

Q

Clustered/hierarchical data

Answer

A

Clustering arises when observations (outcome and predictor variables) share common features as a result of the data structure, e.g. common environment, spatial proximity, repeated measurements on same individual. Such data violates assumption of independence inherent in common statistical methods.

Examples: puppies within a litter, animals in same herd, surveillance data from districts

Management:

Include group identifier as a dummy variable in a traditional regression model (i.e. as a fixed effect); inferences are then made about the actual herds, not the more general population, and many parameters must be fitted if there are many herds
Estimate intra-class correlation coefficient or estimate of overdispersion and use this to adjust the SE of the regression coefficients
Use mixed models
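One simple way to use an intra-class correlation coefficient (ICC) to adjust a standard error is the design effect, 1 + (m - 1) * ICC, where m is the cluster size. The sketch assumes equal cluster sizes, and the ICC, cluster size and naive SE are hypothetical values.

```python
import math

def design_effect(cluster_size, icc):
    """Variance inflation due to clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc

def adjusted_se(naive_se, cluster_size, icc):
    """Inflate an SE that was calculated as if observations were independent."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))

# Hypothetical values: 10 animals per herd, ICC = 0.1, naive SE = 0.05
deff = design_effect(10, 0.1)
se = adjusted_se(0.05, 10, 0.1)
print(f"design effect = {deff:.2f}, adjusted SE = {se:.4f}")
```

Even a modest ICC nearly doubles the variance here, which is why ignoring clustering tends to give confidence intervals that are too narrow.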

Question 21

Q

Data quality considerations - invalid values, missing data

Answer

A

Invalid values:

Verify against original records if possible
Data entry - double entry to avoid transcription errors

Missing values:

Exclude (i.e. only analyze complete records) - only if missing at random
Predict missing values based on patterns in complete records (imputation)
Assign weights to complete records to compensate for missing data

(21 cards)