L11 to L13: Correlation and Regression Flashcards
Define Correlation
ASSUMING that relationship is linear, it QUANTIFIES degree to which 2 random variables are related
What is correlation coefficient (R)?
QUANTITATIVE measure of the STRENGTH and DIRECTION of a linear relationship between two variables
Two types of correlation analysis, and when to use them?
- Pearson product-moment correlation (PPMC): Parametric test used when variables are continuous
- Spearman rank Correlation (SRC): NPT, when ≥1 are non-normal distribution, OR ordinal data
Assumptions of correlation analyses
- x and y independent
- Pairs of observations (x,y) are randomly selected
- PPMC: underlying ppn of both variables are normally variables
Hyp of correlation analysis
H0: r = 0 (no correlation)
H1: r ≠0 (have correlation) or r>0 or r<0
Advantage of SRC over PPMC
Decreased sensitivty to outliers since ranks are used (similar to other NPT)
When you receive data and want to check for correlation, the VERY FIRST STEP you should do
Construct scatter plot and roughly scan for linear relationship
This is to check whether assumption that variables have linear relationship, before quantifying their linearity
Distinguish between correlation and simple linear regression (SLiR)
- Correlation: Find out how linear x and y are, provided their relation is already linear from scatter plot. No defined independent/dependent variable is defined yet
- SLiR: Provided that correlation is SIGNIFICANT, give BEST-FIT LINE for DEFINED x and y (defined independent/dependent variable)
Purpose of SLiR
Estimate y for defined x using equation obtained from best fit line
One disadvantage of SLiR
Not suitable for extrapolation. Equation only applies WITHIN data range
The equation for SLiR and what do each symbol mean
y = a + Bx
y: Dependent variable
x: Independent variable
a: y-intercept
B: Slope. i.e. change in MEAN of y that correspond to one unit change in x
Assumptions of SLiR
- Assume variables have linear relationship
- Observations are independent
- For any values of x, y is NORMALLY distributed
- Fo any x, variances are equal (similar to other tests)
How does SLiR get its line of best fit?
Method of least squares
Hypotheses of SLiR and tail?
H0: No effect by x on y (B = 0)
H1: B ≠ 0
- ALWAYS two-tailed
Given B = 1.657, alpha = 23.811, x is Weight, and y is systolic blood pressure (SBP), p = 0.001, construct the regression equation and formulate a conclusion.
y = 23.811 + 1.657(BW)
Conclusion:
- For every 1kg increase in BW, the MEAN SBP increases by 1.657 mmHg.
- At a sig level of 0.05, there is a statsig effect of BW on SBP (p = 0.001)
(rmb both word explanation of equation and sig level)
(rmb units)
What is R2? What does it mean if:
R2 = 1
R2 = 0
Proportion of variability among observed y values that is explained by linear regression of x & y
R2 = 1: All pts lie on line R2 = 0: No pts lie on line
When data is obtained (for one dependent and one independent variable), what is the proper step to get linear equation if you suspect linear relationship?
- Construct scatter plot and scan for linear relationship
- If linear: Use Corr. analysis to check whether linearity is statsig
- If statsig, proceed to use SLiR to obtain equation and R2
Distinguish between SLiR and multiple linear regression (MLiR)
MLiR: extension of SLiR describing relationship between dep var. and MORE THAN ONE INDEP VAR.
Assumptions of MLiR
- Observations are independent
- For any x, distribution of y is normal
- For any SET of values x, variance is constant
- There is LITTLE OR NO MULTICOLLINEARITY among all indep var.
What does Bi represent in MLiR
Change in mean value of y that corresponds to one-unit change in xi, AFTER controlling for all other indep var. (i.e. keeping all other values constant)
Distinguish the purpose between adjusted R2 and R2
- Adjusted R2: Used to compare between models that has different number of indep variables as it compensates for complexity
E.g. MLiR vs SLiR regression - Normal R2: the definition
Purpose of dummy variables
Using NUMBERS to identify categories of nominal variables (coz MLiR can only take numbers)
Given:
- Data collected: BMI at f/u, Baseline BMI
- Interventions: two different dosage of drugs (1 and 2 dummy-coded)
- B1 = -2.064, p = 0.06
- B2 = -1.941, p = 0.005
- B3 = 0.984, p = 0.0442
- a = 0.428
State the MLiR equation and also explain what do each variable mean
y = 0.428 - 2.064 (Dose1) - 1.941 (Dose 2) + 0.984 (Baseline BMI)
- B1: The Mean BMI@f/u btwn ctrl and dose 1 grp is 2.064 kg/m2 smaller than that of ctrl AFTER controlling for BASELINE BMI
- B2: The Mean BMI@f/u btwn ctrl and dose 1 grp is 1.941 kg/m2 smaller than that of ctrl AFTER controlling for BASELINE BMI
- B3: For every 1 kg/m2 increase in Baseline BMI, mean BMI@f/u increase by 0.984 kg/m2, after controlling for tx grps (no make sense, hence not impt)
At sig. level of 0.05, there is statsig assoc btwn tx and BMI@f/u AFTER ctrlling for basline BMI (as long as one p <0.05 of all the beta)
Recommended max number of indep var. to analyse for MLiR
n/10, where n is sample size
General meaning of B in MLiR, with a control group involved
The mean change/difference in y btwn control and x, AFTER controlling for baseline characteristics
Three types of model in MLiR
- Enter: All indep var. entered into equation, good for small set of predictors
- Fwd selection: Begin with empty equation and add one at a time beginning with highest corr. first. Once in, variable remains
- Backward elimination: reverse of fwd, PREFERRED over fwd selection. (var. removed if they don’t contribute to regression equation)
What kind of data do Logistic regression (LoR) analyse?
- Dependent: dichotomous nominal variable
- Independent: ≥1 continuous/ordinal/normal variable
General Equation for logistic regression
loge(O) = a + Bx
O = Outcome x = exposure
What is odds ratio? Mathematically, what is it represented by?
Measure the STRENGTH of association between E and O
Represented by e^B (obtained from LoR equation
General expression of OR in words, given that OR = 1.1
Those who are E (exposed) have 1.1 times the odds (or 10% more likely) of developing O compared with those who are uE (unexposed)
What does OR equal to when there is no assoc between E and O
OR = 1
General Equation to calculate OR. How to calculate from 2x2 table?
OR = Odds that case was exposed/ Odds that ctrl was exposed
From 2x2 table, take quotient of cross products
i.e. ad/bc
Assumptions for LoR
both SLoR and MLoR
- Dependent variable should be dichotomous
- Observations are independent
- There is linear relationship between independent variable and loge(O)
- MLoR: Litte or no multicollinearity among independent variables
State the Hypotheses for SLoR and MLoR
SLoR:
- H0: OR = 1
- H1: OR ≠ 1
MLoR
- H0: ORi = 1, after controlling for all other variables
- H1: ORi ≠ 1, after controlling for all other variables
Given MLoR was carried out:
- Exposed to drug: p = 0.031, Exp(B) = 2.192, 95%CI = 1.053-4.567
- Gender: p = 0.240, Exp(B) = 0.665, 95%CI = 0.338-1.312
- Outcome of interest: Side effect
State whether the OR for E is adjusted or crude. Formulate a conclusion
- OR is ADJUSTED odds ratio (since there is another variable)
- Conclusion: Subjects who were E to drug had 2.19 times (95% CI: 1.05 - 4.57) the odds of developing the SE compared to those who were not exposed, AFTER controlling for gender (p = 0.031)
In what kind of studies is OR most likely used?
Case-control studies (CCS), and maybe cross-sectional studies (XSS)