L11 to L13: Correlation and Regression Flashcards
Define Correlation
ASSUMING that relationship is linear, it QUANTIFIES degree to which 2 random variables are related
What is the correlation coefficient (r)?
QUANTITATIVE measure of the STRENGTH and DIRECTION of a linear relationship between two variables
Two types of correlation analysis, and when to use them?
- Pearson product-moment correlation (PPMC): Parametric test used when variables are continuous
- Spearman rank correlation (SRC): Non-parametric test (NPT), used when ≥1 variable is not normally distributed, OR when data are ordinal
Assumptions of correlation analyses
- Observations (the x,y pairs) are independent of one another
- Pairs of observations (x,y) are randomly selected
- PPMC: underlying populations of both variables are normally distributed
Hyp of correlation analysis
H0: r = 0 (no correlation)
H1: r ≠ 0 (correlation exists), or r > 0 or r < 0 if one-tailed
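The card does not say how H0 is tested. For PPMC, the usual test statistic (a standard textbook result, not taken from these notes) is a t value with n - 2 degrees of freedom, where n is the number of (x, y) pairs:

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad \mathrm{df} = n - 2
```

A larger |t| (i.e. larger |r| or larger n) gives a smaller p-value against H0: r = 0.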
Advantage of SRC over PPMC
Decreased sensitivity to outliers since ranks are used (similar to other NPT)
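A minimal sketch of the outlier point, assuming made-up data (the x and y values below are illustrative only). SRC is computed here as PPMC applied to the ranks, and the function names are hypothetical helpers, not from any library:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Illustrative data: y rises steadily except for one extreme last value.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 6, 8, 10, 100]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print("Pearson r:", round(r_pearson, 3))
print("Spearman r:", round(r_spearman, 3))
```

The outlier pulls the Pearson value well below 1, while the ranks are still perfectly ordered, so the Spearman value stays at 1.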
When you receive data and want to check for correlation, the VERY FIRST STEP you should do
Construct scatter plot and roughly scan for linear relationship
This is to check that the assumption of a linear relationship holds, before quantifying how strong that linear relationship is
Distinguish between correlation and simple linear regression (SLiR)
- Correlation: Find out how strongly linear x and y are, provided their relation already looks linear on the scatter plot. No independent/dependent variable is defined yet
- SLiR: Provided that correlation is SIGNIFICANT, give BEST-FIT LINE for DEFINED x and y (defined independent/dependent variable)
Purpose of SLiR
Estimate y for defined x using equation obtained from best fit line
One disadvantage of SLiR
Not suitable for extrapolation. Equation only applies WITHIN data range
The equation for SLiR and what does each symbol mean?
y = a + Bx
y: Dependent variable
x: Independent variable
a: y-intercept
B: Slope, i.e. change in MEAN of y that corresponds to a one-unit change in x
Assumptions of SLiR
- Assume variables have linear relationship
- Observations are independent
- For any values of x, y is NORMALLY distributed
- For any x, variances of y are equal (similar to other tests)
How does SLiR get its line of best fit?
Method of least squares
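A minimal sketch of the least-squares calculation for y = a + Bx, using the standard closed-form formulas (B = Sxy / Sxx, a = mean of y minus B times mean of x). The function name and the data are illustrative assumptions, not from the notes:

```python
def least_squares_fit(x, y):
    """Return (a, B) for the line y = a + B*x that minimises the
    sum of squared vertical distances from the points to the line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sxy: how x and y vary together; Sxx: how x varies on its own.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Illustrative data lying exactly on y = 1 + 2x.
a, B = least_squares_fit([1, 2, 3, 4], [3, 5, 7, 9])
print("a =", a, "B =", B)
```

With real data the points will not sit exactly on the line; the same formulas then give the line with the smallest total squared error.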
Hypotheses of SLiR and tail?
H0: No effect by x on y (B = 0)
H1: B ≠ 0
- ALWAYS two-tailed
Given B = 1.657, a (y-intercept) = 23.811, x is body weight (BW), and y is systolic blood pressure (SBP), p = 0.001, construct the regression equation and formulate a conclusion.
y = 23.811 + 1.657(BW)
Conclusion:
- For every 1kg increase in BW, the MEAN SBP increases by 1.657 mmHg.
- At a sig level of 0.05, there is a statsig effect of BW on SBP (p = 0.001)
(rmb both word explanation of equation and sig level)
(rmb units)
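As a quick check of how the equation y = 23.811 + 1.657(BW) is used, here is a small sketch. The 70 kg input is a hypothetical value chosen for illustration; the notes do not give the data range, so whether 70 kg lies within it is an assumption:

```python
A_INTERCEPT = 23.811  # a, from the card (mmHg)
B_SLOPE = 1.657       # B, from the card (mmHg per kg)

def predict_mean_sbp(bw_kg):
    """Estimated MEAN SBP (mmHg) for a given body weight (kg)."""
    return A_INTERCEPT + B_SLOPE * bw_kg

# Hypothetical body weight, assumed to be within the observed data range.
sbp_at_70 = predict_mean_sbp(70)
print(round(sbp_at_70, 3))

# The slope is the change in mean SBP per 1 kg increase in BW.
difference_per_kg = predict_mean_sbp(71) - predict_mean_sbp(70)
print(round(difference_per_kg, 3))
```

The first value is the estimated mean SBP at 70 kg (about 139.8 mmHg); the second reproduces the slope, which is what the worded conclusion describes.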
What is R2? What does it mean if:
R2 = 1
R2 = 0
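Proportion of variability among observed y values that is explained by the linear regression of y on x
R2 = 1: All pts lie exactly on the line (x explains all variability in y)
R2 = 0: The line explains none of the variability in y (best-fit line is flat at the mean of y)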
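A minimal sketch of how R2 can be computed, assuming the usual definition R2 = 1 - SS_residual / SS_total for a least-squares line. The function name and both data sets are illustrative assumptions:

```python
def r_squared(x, y):
    """R2 of the least-squares line y = a + B*x fitted to (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variability left over after fitting the line.
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variability of y around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# All points on a line: the regression explains everything.
r2_perfect = r_squared([1, 2, 3, 4], [3, 5, 7, 9])

# Symmetric up-then-down pattern: the best-fit line is flat at the mean of y.
r2_none = r_squared([1, 2, 3], [1, 2, 1])

print(r2_perfect, r2_none)
```

In the second data set the fitted slope is 0, so the line predicts the mean of y everywhere and explains none of the variability.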
When data is obtained (for one dependent and one independent variable), what is the proper step to get linear equation if you suspect linear relationship?
- Construct scatter plot and scan for linear relationship
- If linear: Use corr. analysis to check whether the correlation is statsig
- If statsig, proceed to use SLiR to obtain equation and R2
Distinguish between SLiR and multiple linear regression (MLiR)
MLiR: extension of SLiR describing relationship between dep var. and MORE THAN ONE INDEP VAR.
Assumptions of MLiR
- Observations are independent
- For any x, distribution of y is normal
- For any SET of values x, variance is constant
- There is LITTLE OR NO MULTICOLLINEARITY among all indep var.
What does Bi represent in MLiR
Change in mean value of y that corresponds to one-unit change in xi, AFTER controlling for all other indep var. (i.e. keeping all other values constant)
Distinguish the purpose between adjusted R2 and R2
- Adjusted R2: Used to compare models that have different numbers of indep variables, as it compensates for model complexity. E.g. MLiR vs SLiR
- Normal R2: as defined (proportion of variability in y explained by the model)
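The notes do not give a formula for adjusted R2. A sketch using the commonly taught one, adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the sample size and k the number of indep var.; the input values below are hypothetical:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for a model with k independent variables
    fitted to n observations (requires n > k + 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: same R2, same n, different model sizes.
adj_one_var = adjusted_r2(0.50, 21, 1)   # SLiR-sized model
adj_two_var = adjusted_r2(0.50, 21, 2)   # MLiR with one extra variable

print(round(adj_one_var, 4), round(adj_two_var, 4))
```

With the same R2, the model with more indep var. gets the lower adjusted R2, which is the penalty for complexity described on the card.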
Purpose of dummy variables
Using NUMBERS to identify categories of nominal variables (coz MLiR can only take numbers)
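A minimal sketch of dummy coding, assuming a nominal variable with three hypothetical categories and one of them chosen as the reference group. Each non-reference category gets its own 0/1 column; the function and category names are illustrative:

```python
def dummy_code(category, levels, reference):
    """Return 0/1 indicators for every level except the reference.
    The reference group is the one where all indicators are 0."""
    return {level: int(category == level) for level in levels if level != reference}

# Hypothetical treatment groups, with "control" as the reference.
levels = ["control", "dose1", "dose2"]

coded_control = dummy_code("control", levels, "control")
coded_dose1 = dummy_code("dose1", levels, "control")
coded_dose2 = dummy_code("dose2", levels, "control")

print(coded_control)
print(coded_dose1)
print(coded_dose2)
```

Three categories need only two dummy variables, because the reference group is identified by both being 0.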
Given:
- Data collected: BMI at f/u, Baseline BMI
- Interventions: two different dosages of drug (dose 1 and dose 2, dummy-coded against ctrl)
- B1 = -2.064, p = 0.06
- B2 = -1.941, p = 0.005
- B3 = 0.984, p = 0.0442
- a = 0.428
State the MLiR equation and explain what each coefficient means
y = 0.428 - 2.064 (Dose1) - 1.941 (Dose 2) + 0.984 (Baseline BMI)
- B1: The mean BMI@f/u of dose 1 grp is 2.064 kg/m2 smaller than that of ctrl AFTER controlling for BASELINE BMI
- B2: The mean BMI@f/u of dose 2 grp is 1.941 kg/m2 smaller than that of ctrl AFTER controlling for BASELINE BMI
- B3: For every 1 kg/m2 increase in baseline BMI, mean BMI@f/u increases by 0.984 kg/m2, after controlling for tx grps (baseline BMI is only a covariate here, hence not the impt finding)
At sig. level of 0.05, there is statsig assoc btwn tx and BMI@f/u AFTER ctrlling for baseline BMI (as long as at least one tx beta has p < 0.05; here dose 2, p = 0.005, while dose 1, p = 0.06, is not statsig on its own)
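A small sketch of how the MLiR equation y = 0.428 - 2.064(Dose1) - 1.941(Dose 2) + 0.984(Baseline BMI) is evaluated. The baseline BMI of 30 kg/m2 is a hypothetical input, and the sketch assumes the dummy variables are 1 for that dose group and 0 otherwise (ctrl has both set to 0):

```python
def predict_mean_bmi_followup(dose1, dose2, baseline_bmi):
    """Estimated MEAN BMI at follow-up (kg/m2).
    dose1, dose2: dummy variables (0 or 1); ctrl is dose1 = dose2 = 0."""
    return 0.428 - 2.064 * dose1 - 1.941 * dose2 + 0.984 * baseline_bmi

# Hypothetical patients who all share the same baseline BMI.
baseline = 30
bmi_ctrl = predict_mean_bmi_followup(0, 0, baseline)
bmi_dose1 = predict_mean_bmi_followup(1, 0, baseline)
bmi_dose2 = predict_mean_bmi_followup(0, 1, baseline)

print(round(bmi_ctrl, 3), round(bmi_dose1, 3), round(bmi_dose2, 3))
```

Holding baseline BMI fixed, the gap between each dose grp and ctrl equals that group's coefficient, which is what "after controlling for baseline BMI" means.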
Recommended max number of indep var. to analyse for MLiR
n/10, where n is sample size