Lectures 8 and 9 (Correlation, Regression, CIs) Flashcards
The dependent variable is on what axis?
Y-axis
The independent variable is on what axis?
X-axis
When should you use a correlation analysis?
- examine relationship between variables
- estimate strength of association between variables
- when independent and dependent variables are not clearly different
- when regression requirements not met
A correlation coefficient of 0 means:
- there is no linear association between the two variables (a non-linear relationship may still exist)
A regression is:
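- a line fitted to the data that describes how Y changes with X, plus a measure of how well the data fit that line
- r-value close to 0 = no linear correlation
- r-value close to 1 or -1 = strong correlation
- r-squared tells you the proportion of the variation in Y that is explained by variation in X.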
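A minimal sketch of these quantities, computed by hand for a simple linear regression. The data and the helper name `fit_line` are made up for illustration and are not from the lecture.

```python
import math

def fit_line(xs, ys):
    """Return (slope, intercept, r, r_squared) for a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r, r * r

# Hypothetical data: dose (X, independent) and response (Y, dependent)
dose = [1, 2, 3, 4, 5]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept, r, r_squared = fit_line(dose, response)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f} r^2={r_squared:.3f}")
```

With a single X variable, r-squared is just the square of r, which is why both cards describe the same fit from two angles.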
When should you use regression analysis?
- look for a trend in data between variables
- more than one X (independent) variable = multiple regression
- predict a dependent variable
- adjust for confounding variables
- curve fitting (pharmacokinetics)
- calibration and laboratory assays
- detect patterns in microarray data
Regression r-value close to 0:
no linear association
Regression r-value close to 1:
strong positive association (an r-value close to -1 is a strong negative association)
Regression r-squared value tells you:
- the proportion of the variation in Y that is explained by variation in X.
Parametric test characteristics:
- assume variables are normally distributed with equal variances
- dependent on mean and variance
- susceptible to outliers
- requires continuous variables
Non-parametric test characteristics:
- based on ranks
- do not depend on the distribution, variance, or mean of the data
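To illustrate "based on ranks", here is a small sketch of Spearman's rank correlation, which is Pearson's r computed on the ranks instead of the raw values. The data and the helper names `ranks`, `pearson`, and `spearman` are illustrative assumptions.

```python
import math

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(xs, ys):
    """Pearson correlation coefficient (parametric, uses means and variances)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def spearman(xs, ys):
    """Spearman rank correlation: Pearson's r applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 100]  # hypothetical data; the last value is an extreme outlier

print(f"Pearson r  = {pearson(x, y):.3f}")   # pulled well below 1 by the outlier
print(f"Spearman r = {spearman(x, y):.3f}")  # only the ordering matters
```

Because the outlier keeps its rank (largest), the rank-based result is unaffected by how extreme it is.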
You can transform non-linear data to linear data by:
- taking logs
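A short sketch of why taking logs helps, assuming the data follow an exponential curve (the curve y = 2·e^(0.5x) is a made-up example). On the raw scale the steps between neighbouring points keep growing, while on the log scale they are constant, which is what a straight line looks like.

```python
import math

# Hypothetical exponential data: y = 2 * e^(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]

log_ys = [math.log(y) for y in ys]

# Step between neighbouring points on the raw scale (keeps growing: non-linear)
raw_steps = [b - a for a, b in zip(ys, ys[1:])]
# Step between neighbouring points on the log scale (constant: linear)
log_steps = [b - a for a, b in zip(log_ys, log_ys[1:])]

print("raw steps:", [round(s, 2) for s in raw_steps])
print("log steps:", [round(s, 2) for s in log_steps])
```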
Three ways you can control for outliers:
- using non-parametric test
- dropping the outlier(s)
- log transformation
Multivariate regression:
- more than one X (independent) variable (also called multiple regression)
- allows adjustment for confounders
- accounts for variable interactions by including product terms (multiplying variables together)
Stepwise regression:
- finds the top contributing variable, then the second, then the third, etc., until a point of diminishing returns is reached.
- in other words, it looks for the group of variables that has the largest collective r-squared value.
Multiple logistic regression:
- a multivariate analysis
- adjusts for confounding
- useful when outcome is dichotomous
- provides a direct estimate of the ODDS RATIO for each independent variable
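As a minimal illustration of the odds ratio, the sketch below uses a made-up 2x2 table with one dichotomous exposure. For a single binary predictor, the logistic regression coefficient equals the natural log of this odds ratio, so exponentiating the coefficient gives the odds ratio back. The counts and the function name `odds_ratio` are assumptions for illustration.

```python
import math

def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of the outcome in the exposed divided by odds in the unexposed."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed

# Hypothetical counts: 30/100 exposed and 10/100 unexposed have the outcome
or_value = odds_ratio(30, 70, 10, 90)

# For one binary predictor, the logistic regression coefficient is ln(OR)
beta = math.log(or_value)

print(f"odds ratio = {or_value:.2f}")
print(f"log odds ratio (coefficient) = {beta:.2f}")
print(f"exp(coefficient) = {math.exp(beta):.2f}")
```

In a multiple logistic regression, each coefficient is exponentiated the same way, giving an odds ratio adjusted for the other variables in the model.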
When the distribution of your data is not normal, what type of test should you use?
non-parametric
If you are analyzing more than one type of independent (X) variable, what type of analysis should you use?
multivariate regression
Principal Component Analysis (PCA):
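- takes many correlated variables and reduces them to a smaller set of uncorrelated components (linear combinations of the original variables)
- gives you the combinations of variables that best explain the variation in the data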
Risk factors are modifiable through:
primary prevention
Prognostic factors are modifiable through:
secondary prevention
Common prognosis endpoints:
- case fatality (proportion of patients with the disease who die of it)
- disease-specific mortality (people per 10,000 in the population who die of a specific disease)
- response
- remission
- recurrence
Equipoise:
- a genuine lack of consensus in the medical community about a treatment or prognosis, and therefore about how to treat.
- allows for RCTs (randomized controlled trials)
Kaplan-Meier Analysis:
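- most widely used survival analysis
- a graph of time to event
- every horizontal segment is a time period between events
- every vertical drop is an event (e.g. death); a dropout is censored, which does not cause a drop but reduces the number of patients at risk
- the larger the sample size, the smoother the curve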
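A rough sketch of how the Kaplan-Meier curve is built, assuming each patient has a follow-up time and a flag saying whether the event was observed. The function name `kaplan_meier` and the six patients are made up for illustration. At each event time, the survival probability is multiplied by (1 - events / number at risk).

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time an event occurs.

    times  : follow-up time for each patient
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0  # events at time t
        leaving = 0   # everyone who leaves the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            leaving += 1
            i += 1
        if observed > 0:
            # The curve steps down only when an event is observed
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
        # Censored patients leave the risk set without a step down
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (months) and event flags for six patients
times = [2, 3, 3, 5, 8, 9]
events = [1, 1, 0, 1, 0, 1]

curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"t={t}: S(t)={s:.3f}")
```

With only six patients each step is large; with more patients the steps become smaller and the curve looks smoother.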
Kaplan-Meier analysis truncation:
when a patient enters the study after it has already started
Kaplan-Meier analysis censoring:
when a patient drops out of a study after it has started, or the study ends before the patient has had the event
Can a Kaplan-Meier analysis handle covariates?
No.
- use a Cox regression for this
Cox regression:
- multivariate survival analysis
- can control for other factors
- calculates the hazard ratio (interpreted similarly to relative risk)
Equipoise allows for:
- randomized controlled trials to occur
- equipoise = uncertainty in the medical community
Variance =
measure of the spread/dispersion of values around the mean.
Standard deviation =
√v; (v = variance)
- does not systematically decrease as sample size increases
Standard error of the mean (SEM) =
SD/√n
- decreases as sample size increases
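A small worked sketch of variance, SD, and SEM on made-up data. It assumes the sample variance (dividing by n - 1); the helper name `describe` is hypothetical. Repeating the same values four times shows that SD stays about the same while SEM shrinks roughly by half.

```python
import math

def describe(sample):
    """Return (mean, variance, sd, sem) using the sample variance (n - 1)."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    sd = math.sqrt(variance)   # SD = square root of the variance
    sem = sd / math.sqrt(n)    # SEM = SD / sqrt(n)
    return mean, variance, sd, sem

# Hypothetical measurements
small_sample = [4, 8, 6, 5, 3, 7, 9, 6]
mean, variance, sd, sem = describe(small_sample)
print(f"n=8:  mean={mean:.2f} variance={variance:.2f} SD={sd:.2f} SEM={sem:.3f}")

# Same spread of values, four times as many observations
large_sample = small_sample * 4
mean_l, variance_l, sd_l, sem_l = describe(large_sample)
print(f"n=32: mean={mean_l:.2f} variance={variance_l:.2f} SD={sd_l:.2f} SEM={sem_l:.3f}")
```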
Central limit theorem posits:
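- as the sample size grows, the distribution of the sample mean approaches a normal distribution centered on the population mean
- the larger the sample size, the closer the study mean tends to be to the population mean
- i.e. narrower confidence interval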
Interquartile range:
- IQR contains the middle 50% of the observations
- (from the 25th to the 75th percentile)
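A quick sketch using Python's standard `statistics.quantiles` on made-up values. Several conventions exist for computing quartiles; the "inclusive" method is chosen here only as one reasonable assumption, and other software may give slightly different cut points.

```python
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical observations

# n=4 splits the data into quartiles and returns the three cut points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Observations lying between the 25th and 75th percentiles
inside = [v for v in values if q1 <= v <= q3]

print(f"Q1={q1} median={median} Q3={q3} IQR={iqr}")
print(f"{len(inside)} of {len(values)} observations fall inside the IQR")
```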
Confidence intervals describe:
- the uncertainty that surrounds an estimate (e.g. a sample mean)
- the larger the sample size, the narrower the CI = MORE PRECISE STUDY
Equation for 95% CI:
95% CI = mean ± 1.96 × (SD/√n)
- SD = standard deviation
- n = sample size
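The formula can be sketched as a small function. The name `ci95` and the numbers (mean 120, SD 15) are made up for illustration. Comparing n = 100 with n = 400 shows the interval narrowing as the sample grows.

```python
import math

def ci95(mean, sd, n):
    """95% confidence interval for a mean: mean +/- 1.96 * (SD / sqrt(n))."""
    sem = sd / math.sqrt(n)
    margin = 1.96 * sem
    return mean - margin, mean + margin

# Hypothetical study: mean 120, SD 15, with two different sample sizes
small = ci95(120, 15, 100)
large = ci95(120, 15, 400)

print(f"n=100: {small[0]:.2f} to {small[1]:.2f}")
print(f"n=400: {large[0]:.2f} to {large[1]:.2f}")
```

Quadrupling the sample size halves the SEM, so the interval is half as wide.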
For a correlation to be statistically significant, the confidence interval cannot contain:
0
0 = no correlation
For a relative risk, hazard ratio, or odds ratio to be statistically significant, the confidence interval cannot contain:
1
1 = no association