L12 Linear Regression Flashcards
Differentiate between correlation & simple linear regression.
Correlation:
Quantification of the degree to which two random variables (continuous/ordinal) are related, provided their relationship is linear.
- Thus, correlation makes NO distinction between two variables (i.e. variables are treated symmetrically)!
Simple linear regression:
Determines the best-fitting straight line for a dataset to investigate the change in one variable (dependent variable, Y) (continuous) that corresponds to a given change in the other variable (independent variable, X) (continuous, ordinal or nominal), provided that there is a significant correlation.
- Thus, the two variables, X & Y, are treated asymmetrically!
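The symmetry/asymmetry distinction above can be checked numerically: the Pearson correlation is unchanged when x and y are swapped, but the regression slope is not. A minimal pure-Python sketch with made-up data (not from the cards):

```python
# Correlation is symmetric in x and y; the regression slope is not.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def slope(x, y):
    # Least-squares slope of y regressed on x
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]

print(round(pearson_r(x, y), 4))  # same value whichever variable comes first
print(round(pearson_r(y, x), 4))
print(round(slope(x, y), 4))      # 1.95, differs from slope(y, x)
print(round(slope(y, x), 4))
```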
What are some applications of simple linear regression?
1) To describe the linear relationship between two variables.
2) To predict or estimate the value of the dependent variable (Y) associated with a fixed value of the independent variable (X).
- e.g. construction of a calibration curve
What is one consideration to be aware of when predicting the value of Y from X, given a calibration curve (i.e. simple linear regression)?
Be cautious about extrapolating the regression line beyond the observed range, as the relationship between X & Y may not be the same outside the observed values of X.
Given y = (alpha) + (beta)x, what is the significance of alpha and beta values respectively?
Alpha = y-intercept of the best-fitting straight line
- i.e. the mean value of y when x = 0, since observed values of y are scattered about best-fit line
Beta = slope of the best-fitting straight line
- i.e. the change in mean value of y that corresponds to one-unit change in x
e.g. Absorbance = -0.0025 + 0.0777 (Concentration)
- When concentration = 0, absorbance = -0.0025
- For every 1 mg/L increase in concentration, the mean absorbance will increase by 0.0777 units.
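The calibration-curve interpretation above can be written directly as a prediction function; the coefficients are the ones from the card:

```python
# Calibration curve from the card: Absorbance = -0.0025 + 0.0777 * Concentration
alpha, beta = -0.0025, 0.0777

def predicted_absorbance(conc_mg_per_l):
    return alpha + beta * conc_mg_per_l

# A one-unit (1 mg/L) increase in concentration raises mean absorbance by beta:
print(round(predicted_absorbance(5.0) - predicted_absorbance(4.0), 4))  # 0.0777
```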
Define ‘simple linear regression’.
Determines the best-fitting straight line for a dataset to investigate the change in one variable (dependent variable, Y) (continuous) that corresponds to a given change in the other variable (independent variable, X) (continuous, ordinal or nominal), provided that there is a significant correlation.
- i.e. Simple linear regression does NOT test whether the relationship between dependent & independent variables are linear
- Instead, it ASSUMES linear relationship between the variables, and finds the y-intercept & slope of best-fitting straight line.
The 1st step before performing regression analysis is to construct a scatter plot of y against x.
- To first visually examine whether a relationship exists between two numerical variables, before performing correlation analysis -> regression analysis.
- Determine if the relationship between two variables is linear or nonlinear -> determine if nonlinear or linear regression is applied
State the assumptions when using simple linear regression analysis.
1) There is a linear relationship between the variables.
- Thus, it is important to first construct a scatter plot of the data to determine if the relationship between the two variables is linear.
2) The observations are independent of one another.
3) For any specified values of x, the distribution of the y values is normal (i.e. the conditional distributions are normally distributed).
4) For any set of values of x, the variance is constant (i.e. all the conditional distributions have equal variance) (i.e. homoscedasticity).
Provided that the assumptions of a simple linear regression model are met, how do we determine the best-fitting straight line?
Method of least squares
- i.e. the line with the smallest residual sum of squares
- If the assumptions are met, the residuals (ei) will be randomly scattered above & below the line e = 0 in a plot of ei against the fitted values (ŷi).
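A least-squares fit can be computed from scratch; one useful check on the fitted line is that its residuals always sum to (essentially) zero. A sketch with made-up data:

```python
# Least-squares fit of y = alpha + beta*x, and a residual check.

def least_squares(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx
    return alpha, beta

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
alpha, beta = least_squares(x, y)

# Residuals ei = yi - y-hat_i; for the least-squares line they sum to ~0,
# i.e. they scatter above and below the line e = 0.
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]
print(round(sum(residuals), 10))  # 0.0
```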
State the purpose behind the hypothesis testing of simple linear regression analysis.
To test the H0 that there is no effect of the independent variable X on the dependent variable Y.
H0: There is NO effect of the independent variable X on the dependent variable Y.
H1: There is an effect (i.e. two-tailed test) of the independent variable X on the dependent variable Y.
Between alpha & beta values, which variable undergoes hypothesis testing more often under simple linear regression analysis?
Beta
- Testing alpha only concerns the mean value of y when x = 0 (the intercept), whereas testing beta addresses whether X has an effect on Y, which is usually the question of interest.
How does one assess the goodness-of-fit of the simple linear regression model with the observed data?
Inspection of the coefficient of determination of the regression model (R^2).
- In simple linear regression, R^2 = r^2, where r = Pearson product-moment correlation coefficient
- R^2 can be interpreted as the proportion of variability among the observed values of y that is explained by the linear regression of y on x.
- Loosely speaking, it means that (R^2 × 100)% of the variation in the values of y can be accounted for by the variation in the values of x.
Range of values = 0 to 1
- R^2 = 1 means all the data points lie on the best-fit line
- R^2 = 0 means there is NO linear relationship between x and y.
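The identity R^2 = r^2 stated above can be verified numerically by computing both quantities from scratch on made-up data:

```python
# Verifying R^2 = r^2 for simple linear regression.

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / (sxx * syy) ** 0.5      # Pearson product-moment correlation
beta = sxy / sxx
alpha = my - beta * mx

ss_res = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - ss_res / syy      # proportion of variability in y explained by x

print(round(r ** 2, 6) == round(r_squared, 6))  # True
```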
E.g. of how to write conclusion of simple linear regression analysis.
Coefficient of determination:
81.8% of the variability among the observed values of SBP (y) is explained by its linear relationship with body weight (x).
Regression equation:
SBP = 23.811 + 1.657 (weight)
Interpretation of beta-coefficient:
For every 1kg increase in body weight (x), the mean SBP (y) will increase by 1.657 mmHg.
Conclusion:
At a significance level of 0.05, there is a statistically significant effect of body weight on SBP (p < 0.0005).
Differentiate between simple & multiple linear regression.
Simple linear regression:
- Describes the relationship between the dependent variable (continuous) and a single independent variable (continuous, ordinal or nominal)
- Regression model: y = (alpha) + (beta)x
Multiple linear regression:
- An extension of simple linear regression
- Describes the relationship between the dependent variable (continuous) and more than one independent variable (continuous, ordinal or nominal)
- Regression model: y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk
State the assumptions when using multiple linear regression analysis.
1) The relationship among the variables is represented by the equation: y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk.
2) The observations are independent of one another.
3) For any specified values of x1, x2, … and xk, the distribution of the y values is normal (i.e. the conditional distributions are normally distributed).
4) For any set of values of x1, x2, … and xk, the variance is constant (i.e. all the conditional distributions have equal variance) (i.e. homoscedasticity).
5) There is little or no multicollinearity among the independent variables (x1, x2, … and xk) i.e. independent variables should NOT be too highly correlated with each other.
Given y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk, what is the significance of alpha and beta values respectively?
x1, x2, … and xk are the values of k distinct, independent (or explanatory) variables.
Alpha = y-intercept of the best-fitting regression plane (hyperplane)
- i.e. the mean value of y when all independent variables = 0
Betai = partial regression coefficient (slope) for xi
- i.e. the change in mean value of y that corresponds to a one-unit change in xi, after controlling for all other independent variables (i.e. keeping the values of all other independent variables constant).
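The "controlling for all other independent variables" interpretation can be illustrated with a linear predictor whose coefficients are made up for this sketch:

```python
# Multiple-regression predictor y-hat = alpha + b1*x1 + b2*x2 + b3*x3.
# Coefficients are illustrative, not from any fitted model.
alpha = 1.0
betas = [2.0, -0.5, 0.3]  # beta1, beta2, beta3

def predict(xs):
    return alpha + sum(b * x for b, x in zip(betas, xs))

# Holding x2 and x3 constant, a one-unit increase in x1 changes the
# predicted mean of y by exactly beta1:
print(round(predict([4, 10, 7]) - predict([3, 10, 7]), 10))  # 2.0
```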
How are nominal variables incorporated into regression models for analysis?
Introduce dummy / indicator variables to identify these categories of nominal variables
- Since the independent or explanatory variables in a regression analysis MUST assume numerical values, numbers are used to identify these categories
- As these numerical values do NOT have any quantitative meaning, they are called indicator or dummy variables.
Interpretation of betai = the average difference in y between the two groups, given identical values of the other x variables.
For a nominal variable with k categories, (k - 1) dummy variables are needed, with the categories of each dummy variable coded as 0 or 1.
- e.g. Chinese (00), Malay (10), Indian (01)