Bio-statistics linear regression Flashcards
The relationship between outcome (Y) and a covariate (X) can be
either linear or non‐linear
the outcome is ……
and the exposure is ……
The outcome is continuous.
The exposure can be continuous or categorical.
A scatter‐plot can help determine:
Is the relationship between outcome & covariate linear?
How strong is the relationship?
correlation coefficient
The relationship can be negative, zero or positive in direction;
its strength is assessed by the correlation coefficient.
“Outcome” (Y) other names
Response variable
Dependent variable
“Exposure” (X) other names
Covariate
Independent variable
Predictor
Explanatory variable
Risk factor
Similarities between CORRELATION AND REGRESSION
Create a scatter plot of outcome vs exposure. Observe the pattern.
Outcome is continuous
Exposure: continuous or categorical
Hypothesis test used for both.
Correlation: r = 0 vs r ≠ 0.
Regression: B1 = 0 vs B1 ≠ 0.
AIM: To find if there is an association between the chosen exposure and outcome.
Differences between CORRELATION AND REGRESSION
Correlation: r ranges from ‐1 to +1. Measures the strength of the relationship.
Regression: the B‐coefficient can be any value. Gives an equation relating outcome & exposure, which can predict the value of the outcome from a given exposure value.
Two types of regression: simple and multiple.
LINEAR REGRESSION – STEPS
- Graph the data. Check linear relationship.
- Calculate correlation coefficient
- Do linear regression analysis
- Evaluate the model
Coefficient of determination (R2)
Residual plot
Normal probability plot
CORRELATION COEFFICIENT
The correlation coefficient, r, quantifies the linear relationship between a pair of variables.
It can be between ‐1 and +1.
A stats package (GraphPad, SPSS, Stata) is used to obtain “r”.
Degrees of freedom: n ‐ 2
What is the Hypothesis test for correlation:
Null: Correlation = 0
Alternative: Correlation ≠ 0
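As a minimal sketch of this test, the Python below computes r by hand and converts it to a t statistic using the standard formula t = r × √(n ‐ 2) / √(1 ‐ r²), with df = n ‐ 2. The formula is not stated on the card. The data values and the names `pearson_r` and `t_stat` are hypothetical, chosen only for illustration. A stats package would report the same quantities together with a p-value.

```python
import math

# Hypothetical illustrative data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(x, y)
df = len(x) - 2  # degrees of freedom: n - 2

# Test statistic for H0: correlation = 0 (undefined if r is exactly +1 or -1)
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, df = {df}, t = {t_stat:.2f}")
```

The t statistic is then compared with a t-distribution on n ‐ 2 degrees of freedom to decide whether to reject the null hypothesis.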
HOW TO INTERPRET A CORRELATION COEFFICIENT?
r < 0.00 (Negative numbers)
Negative relationship. As X increases, Y decreases.
r > 0.00 (Positive numbers)
Positive relationship. As X increases, Y also increases.
Ranges of r (magnitude)
0 to 0.3 = fairly weak
0.3 to 0.7 = fairly strong
0.7 to 0.9 = strong
Above 0.9 = very strong
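The interpretation rules above can be sketched as a small Python helper. The function name `describe_r` is hypothetical. The bands overlap at 0.3, 0.7 and 0.9, so assigning each boundary value to the lower band is an assumption made here, not something the card specifies.

```python
def describe_r(r):
    """Return (direction, strength) for a correlation coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")

    # Direction comes from the sign of r
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # Strength comes from the magnitude of r.
    # Assumption: a boundary value (0.3, 0.7, 0.9) falls in the lower band.
    size = abs(r)
    if size <= 0.3:
        strength = "fairly weak"
    elif size <= 0.7:
        strength = "fairly strong"
    elif size <= 0.9:
        strength = "strong"
    else:
        strength = "very strong"
    return direction, strength

for value in (-0.85, 0.2, 0.95):
    print(value, describe_r(value))
```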
THREE ASSUMPTIONS OF LINEAR REGRESSION
The outcome (Y) variable follows a normal distribution.
Check by histogram or boxplot.
The relationship between outcome (Y) and covariate (X) is linear.
Check with a Scatterplot.
There is constant variance of the outcome across different values of the covariate.
Check with a residual plot
Two types of linear regression models:
Simple – one risk factor.
Multiple – at least two risk factors
Equation of a simple regression line (one x variable):
Y = B0 + B1 × X1
B1 = slope of the line.
B0 = Y‐intercept.
X1 = the value of variable “x”.
WHAT DOES B1 REPRESENT?
The beta coefficient represents the amount of change in outcome variable for every unit change in the covariate, that is, the effect of the covariate on the outcome.
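To make the meaning of B0 and B1 concrete, here is a minimal Python sketch that fits a simple regression line by least squares, using the standard estimates B1 = Sxy / Sxx and B0 = mean(Y) ‐ B1 × mean(X). The data and the names `fit_simple_regression` and `predict` are hypothetical, for illustration only.

```python
# Hypothetical illustrative data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_simple_regression(x, y):
    """Least-squares estimates (B0, B1) for the line Y = B0 + B1 * X."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # Y-intercept
    return b0, b1

b0, b1 = fit_simple_regression(x, y)

def predict(x_value):
    """Predicted outcome for a given exposure value."""
    return b0 + b1 * x_value

print(f"Y = {b0:.2f} + {b1:.2f} * X")

# A one-unit increase in X changes the predicted Y by B1
# (up to floating-point rounding).
print(predict(4) - predict(3), b1)
```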
t‐score follows a t‐distribution with df = n – 2
t = B1 / SE (where SE is the standard error of B1)
The 95 % Confidence Interval
Statistic ± Multiplier × Standard Error
= B1 ± t × SE
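A worked sketch of the t‐score and the 95% confidence interval for the slope, in Python. It assumes the usual standard error formula SE = s / √Sxx, where s is the residual standard deviation on n ‐ 2 degrees of freedom. This formula is not given on the cards. The data are hypothetical, and the multiplier 3.182 is the two-sided 95% t-table value for df = 3, hard-coded here because the example has n = 5.

```python
import math

# Hypothetical illustrative data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares slope and intercept
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual standard deviation, using n - 2 degrees of freedom
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t-score for H0: B1 = 0
se_b1 = s / math.sqrt(sxx)
t_score = b1 / se_b1

# Assumption: two-sided 95% critical value of t for df = 3, from a t-table
T_CRIT_DF3 = 3.182
ci_low = b1 - T_CRIT_DF3 * se_b1
ci_high = b1 + T_CRIT_DF3 * se_b1

print(f"B1 = {b1:.3f}, SE = {se_b1:.4f}, t = {t_score:.2f}")
print(f"95% CI for B1: ({ci_low:.3f}, {ci_high:.3f})")
```

If the interval excludes 0, the slope differs significantly from 0 at the 5% level. This matches the conclusion of the t-test.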
What are the 3 ways to evaluate linear regression model?
There are 3 ways to evaluate the linear regression model:
1. Coefficient of determination (R2)
2. Residual plot
3. Normal Probability Plot
These evaluate whether there are any outlier data points.
Outliers can have a large influence on the regression equation.
COEFFICIENT OF DETERMINATION (R‐SQUARED)
The coefficient of determination tells us about the proportion
of variation in the outcome variable that is explained by the
covariate(s).
In simple linear regression it is the square of the correlation coefficient, i.e. R2 = r2
R2 can range from 0 to +1. (r ranges from ‐1 to +1.)
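As an illustrative check, the Python sketch below computes R2 as 1 ‐ (residual sum of squares / total sum of squares) and compares it with the square of r for a simple regression. The data and variable names are hypothetical.

```python
import math

# Hypothetical illustrative data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares fit
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
predicted = [b0 + b1 * xi for xi in x]

# Total variation in Y, and the part left unexplained by the line
ss_total = sum((yi - mean_y) ** 2 for yi in y)
ss_resid = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))

r_squared = 1 - ss_resid / ss_total      # proportion of variation explained
r = sxy / math.sqrt(sxx * ss_total)      # correlation coefficient

print(f"R2 = {r_squared:.4f}, r squared = {r ** 2:.4f}")
```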
What is the “residual”?
And how to calculate it?
The “error” between the “observed” and “predicted value”.
i.e. How far away from the “line of best fit” is the point?
Residual = Observed value – Predicted value (from equation)
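A short Python sketch of the residual calculation, using hypothetical data and a line fitted by least squares. Each residual is the observed value minus the value predicted by the equation.

```python
# Hypothetical illustrative data (not from the flashcards)
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Line of best fit: Y = b0 + b1 * X
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

# Residual = observed value - predicted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

for xi, yi, res in zip(x, y, residuals):
    print(f"x = {xi}, observed = {yi}, residual = {res:+.2f}")
```

Plotting these residuals against x (or against the predicted values) gives the residual plot. For a least-squares line with an intercept, the residuals sum to zero.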
Residual plot of a linear regression
For linear regression:
The residuals are random.
They follow a normal distribution
NORMAL PROBABILITY PLOT
Why is it done?
To check if the outcome (Y) variable is normally distributed.
If the dots follow a straight line, the data are approximately normally distributed.
If the dots curve away from the line at either tail, the data are skewed.
Not possible in GraphPad.
Can be done in Excel (Data Analysis – Regression)
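Where no package is at hand, the coordinates of a normal probability plot can be sketched with the Python standard library. The example pairs each sorted value with a theoretical standard-normal quantile. The data, the function name `normal_plot_points`, and the plotting position (i ‐ 0.5) / n are assumptions made for illustration. Other plotting-position conventions exist.

```python
from statistics import NormalDist

# Hypothetical outcome values (not from the flashcards)
values = [4.8, 5.1, 5.5, 5.9, 6.0, 6.2, 6.6, 7.0, 7.3, 7.9]

def normal_plot_points(values):
    """Pair each sorted value with its theoretical standard-normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, value in enumerate(ordered, start=1):
        # Assumption: plotting position (i - 0.5) / n, one common convention
        quantile = std_normal.inv_cdf((i - 0.5) / n)
        points.append((quantile, value))
    return points

points = normal_plot_points(values)
for quantile, value in points:
    print(f"{quantile:6.2f}  {value}")
```

Plotting the second column against the first in any charting tool gives the normal probability plot. Points lying close to a straight line suggest approximate normality.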
Stepwise regression
– a method of selecting the significant factors to include in a multiple regression model.