Analysis of epidemiological data Flashcards

1
Q

Data types

A
  1. Qualitative (categorical)
    a. Binary (unordered, 2 categories)
    b. Nominal (unordered, >2 categories)
    c. Ordinal (ranked)
  2. Quantitative
    a. Continuous (measurements)
      i. Interval e.g. date (zero point is arbitrary)
      ii. Ratio e.g. age (zero is meaningful)
    b. Discrete (counts)
2
Q

Statistical inference

A

Process of drawing conclusions about an entire population based on the information in a sample.

  1. Precision based methods (confidence intervals)
  2. Hypothesis testing methods
3
Q

Confidence interval - definition

A

In statistical inference, one wishes to estimate population parameters using observed sample data. A confidence interval gives an estimated range of values which is likely (within a certain degree of confidence) to include the unknown population parameter.

  • The endpoints of the interval take values that depend on the random sample selected
  • If we select another random sample, its mean and standard deviation would differ, so we would obtain a different confidence interval
  • If one calculates 100 95% confidence intervals from 100 random samples, on average 95 of them would contain the true value of the mean and 5 would not
  • The wider the confidence interval, the less precise the estimate
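The repeated-sampling interpretation above can be illustrated with a short simulation (a sketch using only Python's standard library; the population mean, SD and sample size are arbitrary made-up values):

```python
import random
import statistics

random.seed(42)

TRUE_MEAN, TRUE_SD = 50.0, 10.0  # assumed population parameters
N_SAMPLES, N = 100, 30           # 100 repeated samples of size 30

covered = 0
for _ in range(N_SAMPLES):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    se = statistics.stdev(sample) / N ** 0.5
    lower, upper = mean - 1.96 * se, mean + 1.96 * se
    if lower <= TRUE_MEAN <= upper:
        covered += 1

print(f"{covered} of {N_SAMPLES} intervals contain the true mean")
```

Over many repetitions roughly 95 of the 100 intervals should contain the true mean; the exact count varies from run to run (here fixed by the seed).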
4
Q

Confidence interval - calculation

A

value +/- 1.96*SE

where value is a mean, proportion or log ratio and SE is calculated as:

  • Mean: SD/sqrt(n)
  • Proportion: sqrt [p(1-p)/n]
  • Log odds ratio: sqrt [1/a + 1/b + 1/c + 1/d] (the interval is calculated on the log scale, then the limits are exponentiated)

Assumptions:

  • Sample is drawn randomly
  • Observations within a sample are independent
  • Sampled population is normally distributed (or the sample is large enough for the normal approximation to hold)
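As a sketch, the formulas above can be written out in Python (standard library only; the counts are made-up illustration data, and the odds-ratio interval is computed on the log scale, as is conventional):

```python
import math

# CI for a proportion: p +/- 1.96 * sqrt(p(1-p)/n)
def proportion_ci(successes, n, z=1.96):
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# CI for an odds ratio from a 2x2 table [[a, b], [c, d]]:
# computed on the log scale, then exponentiated
def odds_ratio_ci(a, b, c, d, z=1.96):
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

print(proportion_ci(30, 100))        # prevalence of 30% in 100 animals
print(odds_ratio_ci(20, 80, 10, 90)) # made-up 2x2 table, OR = 2.25
```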
5
Q

Hypothesis testing - definition, method (4)

A

Hypothesis testing is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample.

  1. State the null and alternative hypotheses (e.g. population mean is equal to some value, population mean is not equal to some value)
  2. Decide on the significance level (alpha)
  3. Draw random sample from population and calculate test statistic
  4. Make a decision as to whether to reject or not reject the null hypothesis based on the test statistic
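The four steps can be sketched with a one-sample z-test (assumptions for illustration: the population SD is taken as known to be 10, and the data are simulated):

```python
import math
import random

random.seed(1)

# 1. State hypotheses: H0: population mean = 50; H1: mean != 50 (two-sided)
MU0 = 50.0
# 2. Decide on the significance level
ALPHA = 0.05
# 3. Draw a random sample and calculate the test statistic
#    (one-sample z-test; known population SD of 10 is an assumption)
sample = [random.gauss(60, 10) for _ in range(40)]  # simulated data
mean = sum(sample) / len(sample)
z = (mean - MU0) / (10 / math.sqrt(len(sample)))
# Two-sided p-value from the standard normal CDF (via math.erf)
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
# 4. Decide whether to reject H0
reject = p_value < ALPHA
print(f"z = {z:.2f}, p = {p_value:.3g}, reject H0: {reject}")
```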
6
Q

Null hypothesis

A

The null hypothesis (H0) is a statement about a population parameter, such as the population mean, that is assumed to be true, i.e. the burden is on the researcher to show that there is ample evidence to reject the null hypothesis. If the null hypothesis is true, then the sample parameter will equal the population parameter on average. Based on the outcome of hypothesis testing we reject or fail to reject the null (we never accept the null, since it cannot be proven).

7
Q

Alternative hypothesis

A

An alternative hypothesis (H1) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis.

8
Q

p-value

A

Probability of obtaining a sample outcome (value as extreme as or more extreme than the observed sample parameter), given the null hypothesis is true.

Compared to (pre-defined) significance level, alpha, to decide whether the null hypothesis should be rejected or not rejected.

9
Q

Type I error

A

Probability of erroneously rejecting the null hypothesis when it is true. Example: RCT testing new treatment vs standard treatment. We declare the new treatment is more effective, when in fact it is not (challenges status quo with potential for harm if product is introduced - patients wouldn’t receive proven treatment)

10
Q

Type II error

A

Probability of not rejecting the null hypothesis when we should. Example: RCT testing new treatment vs standard treatment. We declare the new treatment is not more effective, when in fact it is (missed opportunity, but generally more acceptable since it maintains the status quo and patients still receive proven treatment)

11
Q

Statistical power - definition, how to increase

A

Probability of correctly rejecting the null hypothesis when it is false (power = 1 - beta). e.g. clinical trial - probability of detecting a difference in outcome between animals receiving and not receiving a new treatment, given one exists.

Can increase power by:

  1. decreasing beta (type II error)
  2. increasing alpha (e.g. 0.1 instead of typical 0.05) - trade-off with type I error
  3. increasing sample size (decreases standard error)
  4. increasing the effect size one is willing to detect
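As a rough sketch of the sample-size effect, the approximate power of a two-sided two-sample z-test can be computed with the standard library (the difference, SD and group sizes are made-up values; the negligible opposite-tail contribution is ignored):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(delta, sd, n_per_group, alpha_z=1.96):
    """Approximate power of a two-sided two-sample z-test
    (ignores the negligible opposite tail)."""
    se = sd * math.sqrt(2 / n_per_group)
    return phi(delta / se - alpha_z)

# Illustration (made-up numbers): detect a 5-unit difference, SD 10
for n in (20, 50, 100):
    print(n, round(power_two_sample(5, 10, n), 2))
```

Larger groups shrink the standard error, so power rises steadily with n for a fixed effect size.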
12
Q

Descriptive statistics - summarizing data by data type (3)

A

Data can be described according to measures of position (central tendency) and spread

  • Binary/nominal: proportion
  • Ordinal: median, IQR (box and whisker plot)
  • Continuous: mean, SD (bar chart with error bars)
13
Q

Statistical tests - compare 2 groups (unpaired data)

A

Continuous data: unpaired T-test

Ordinal data/non-normal data: Mann-Whitney U (non-parametric equivalent)

Nominal (binomial) data: Fisher’s exact test (chi-square for large samples)
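For the large-sample nominal case, a chi-square test on a 2x2 table can be sketched in plain Python (made-up exposure/disease counts; with 1 degree of freedom the p-value reduces to erfc(sqrt(x2/2))):

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]];
    the 1-df p-value uses p = erfc(sqrt(x2/2))."""
    n = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p

# Made-up example: exposed vs unexposed by diseased vs healthy
x2, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi2 = {x2:.2f}, p = {p:.4f}")
```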

14
Q

Statistical tests - compare 2 groups (paired)

A

Continuous data: paired T-test

Ordinal data/non-normal data: Wilcoxon signed-rank test

Nominal (binomial) data: McNemar’s test
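McNemar’s test uses only the discordant pairs, so it can be sketched in a few lines (the counts are made up; the statistic has 1 degree of freedom, as in the unpaired chi-square):

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square for paired binary data, where b and c are
    the counts of discordant pairs; 1-df p-value via erfc(sqrt(x2/2))."""
    x2 = (b - c) ** 2 / (b + c)
    p = math.erfc(math.sqrt(x2 / 2))
    return x2, p

# Made-up example: 25 pairs positive before / negative after, 10 the reverse
x2, p = mcnemar(25, 10)
print(f"chi2 = {x2:.2f}, p = {p:.4f}")
```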

15
Q

Statistical tests - compare 3 or more groups (independent/unpaired)

A

Continuous data: One-way ANOVA

Ordinal data/non-normal data: Kruskal-Wallis test

Nominal (binomial) data: Chi-square test
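The one-way ANOVA F statistic (between-group over within-group mean square) can be sketched directly (made-up weight-gain data under three diets):

```python
import statistics

def one_way_anova_f(groups):
    """F statistic for one-way ANOVA: between-group mean square
    divided by within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = statistics.mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((v - statistics.mean(g)) ** 2
                    for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Made-up weight gains (kg) under three diets
f = one_way_anova_f([[4, 5, 6], [7, 8, 9], [10, 11, 12]])
print(f"F = {f:.1f}")
```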

16
Q

Linear regression - uses, assumptions (4), output

A

Prediction of Y (continuous variable) based on one or more variables (X1, X2 etc). Can also assess/adjust for confounding by incorporating potential confounding variables. Can also assess for effect modification by incorporating an interaction term.

Assumptions:

  1. Observations (Y) are independent
  2. Linear relationship between outcome and predictor
  3. Residuals are normally distributed
  4. Homoscedasticity (variance of Y is the same across all levels of the predictor variables)

Output:

  • Overall significance of model assessed using F-test (ANOVA - null hypothesis is that all regression coefficients are zero; if significant, X variables explain some of the variation in Y)
  • Interpretation of regression coefficient: increase/decrease in Y for each unit increase in X; significance tested with t-test
  • Coefficient of determination, R2: amount of variance in Y explained by the predictor variables
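A minimal sketch of simple (one-predictor) least-squares regression, returning the intercept, slope and R2 described above (the age/weight data are made up):

```python
def simple_ols(x, y):
    """Least-squares fit of y = b0 + b1*x, plus R^2
    (the coefficient of determination)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope: change in Y per unit X
    b0 = my - b1 * mx                 # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot          # variance in Y explained by X
    return b0, b1, r2

# Made-up data: age (years) vs weight (kg)
b0, b1, r2 = simple_ols([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(f"intercept={b0:.2f}, slope={b1:.2f}, R2={r2:.3f}")
```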
17
Q

Logistic regression - uses, assumptions (2), output

A

Prediction of Y (dichotomous variable) based on one or more variables (X1, X2 etc). Can also assess/adjust for confounding by incorporating potential confounding variables. Can also assess for effect modification by incorporating an interaction term.

Assumptions:

  1. Observations (Y) are independent
  2. Linear relationship between outcome and predictor (logit of probability of outcome is modeled as a linear function of X)

Output:

  • Overall significance of model assessed using likelihood ratio test (compares full model with intercept-only model; if significant, then predictors contribute significantly to prediction of outcome)
  • Interpretation of regression coefficient (significance tested with Wald test):
    • Dichotomous predictor: log odds increase/decrease in Y when X is present; converted to an odds ratio by exponentiating the coefficient (i.e. raising e to the power of the coefficient)
    • Continuous predictor: log odds increase/decrease in Y for each unit increase in X; calculate OR for a specific change, e.g. a change in X from 25 to 75 = (75-25)*B1 = some value; OR then calculated as e raised to the power of that value
  • Pseudo R2: amount of variance in Y explained by the predictor variables
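The coefficient-to-odds-ratio conversions above can be sketched as follows (the coefficients are hypothetical, not from a fitted model):

```python
import math

# Suppose a fitted logistic model gave these (hypothetical) coefficients:
b_vaccinated = -1.2   # dichotomous predictor (vaccinated vs not)
b_age = 0.04          # continuous predictor (per year of age)

# Dichotomous predictor: OR = e^coefficient
or_vaccinated = math.exp(b_vaccinated)

# Continuous predictor: OR for a specific change, e.g. age 25 -> 75
or_age_50yr = math.exp((75 - 25) * b_age)

print(f"OR vaccinated vs not: {or_vaccinated:.2f}")
print(f"OR for a 50-year age increase: {or_age_50yr:.2f}")
```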
18
Q

Survival analysis - uses, data characteristics (3), summarizing data (2), approaches to analysis(3)

A

Time-to-event data: subjects are followed until they experience the event (“failure”) or are lost to follow-up (right censored). The dependent variable is the duration.

Data characteristics:

  1. Strictly positive values (left truncated at zero)
  2. Often highly right skewed
  3. Observations are often censored (lost to follow-up)

Summarizing time-to-event data:

  1. Survival function: cumulative proportion of individuals that have not experienced the event over a given time at risk (usually summarized as the median survival time at which 50% of individuals at risk have experienced the event)
  2. Hazard function: instantaneous probability that event will happen at time, t, given that the individual is still at risk.

Analysis approaches:

  1. Non-parametric: Compare survival times between groups (e.g. Kaplan-Meier estimate of survival function [graph] with log-rank test to test if survival functions are equal across groups)
  2. Semi-parametric: Predicting survival time as a function of X1-Xn (Cox proportional hazard regression)
  3. Parametric: e.g. exponential model
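The Kaplan-Meier estimate of the survival function can be sketched in plain Python (made-up survival times; censored observations leave the risk set without contributing an event):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate. `times` are follow-up durations;
    `events` flags whether the event occurred (True) or the observation
    was right-censored (False). Events are processed before censorings
    at tied times, per the usual convention."""
    data = sorted(zip(times, events), key=lambda te: (te[0], not te[1]))
    at_risk = len(data)
    s = 1.0
    curve = []
    for t, event in data:
        if event:
            s *= (at_risk - 1) / at_risk  # step down at each event time
            curve.append((t, s))
        at_risk -= 1                      # censored subjects leave risk set
    return curve

# Made-up survival times (months); False = lost to follow-up
curve = kaplan_meier([2, 3, 3, 5, 8, 10],
                     [True, True, False, True, False, True])
for t, s in curve:
    print(t, round(s, 3))
```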
19
Q

Paired data

A

Two measurements are paired when they come from the same observational unit: before and after, twins, husbands and wives, brothers and sisters, matched cases and controls. Pairing is determined by a study’s design.

20
Q

Clustered/hierarchical data

A

Clustering arises when observations (outcome and predictor variables) share common features as a result of the data structure, e.g. common environment, spatial proximity, repeated measurements on the same individual. Such data violate the assumption of independence inherent in common statistical methods.

Examples: puppies within a litter, animals in same herd, surveillance data from districts

Management:

  1. Include group identifier as dummy variable in traditional regression model (~fixed effect) > inferences are then made about the actual herds, not the more general population; requires fitting many parameters if there are many herds
  2. Estimate intra-class correlation coefficient or estimate of overdispersion and use this to adjust the SE of the regression coefficients
  3. Use mixed models
21
Q

Data quality considerations - invalid values, missing data

A

Invalid values:

  1. Verify against original records if possible
  2. Data entry - double entry to avoid transcription errors

Missing values:

  1. Exclude (i.e. only analyze complete records) - unbiased only if data are missing completely at random
  2. Predict missing values based on patterns in complete records
  3. Assign weights to missing data