L12 Linear Regression Flashcards

1
Q

Differentiate between correlation & simple linear

regression.

A

Correlation:
Quantification of the degree to which two random variables (continuous/ordinal) are related, provided their relationship is linear.
- Thus, correlation makes NO distinction between two variables (i.e. variables are treated symmetrically)!

Simple linear regression:
Determines the best-fitting straight line for a dataset to investigate the change in one variable (dependent variable, Y) (continuous) that corresponds to a given change in the other variable (independent variable, X) (continuous, ordinal or nominal), provided that there is a significant correlation.
- Thus, the two variables, X & Y, are treated asymmetrically!

2
Q

What are some applications of simple linear regression?

A

1) To describe the linear relationship between two variables.
2) To predict or estimate the value of the dependent variable (Y) associated with a fixed value of the independent variable (X).
- e.g. construction of a calibration curve

3
Q

What is one consideration to be aware of when predicting the value of Y from X, given a calibration curve (i.e. simple linear regression)?

A

Be cautious about extrapolating the regression line beyond the observed range, as the relationship between X & Y may not hold outside the observed values of X.

4
Q

Given y = (alpha) + (beta)x, what is the significance of alpha and beta values respectively?

A

Alpha = y-intercept of the best-fitting straight line
- i.e. the mean value of y when x = 0, since the observed values of y are scattered about the best-fit line

Beta = slope of the best-fitting straight line
- i.e. the change in mean value of y that corresponds to one-unit change in x

e.g. Absorbance = -0.0025 + 0.0777 (Concentration)
- When concentration = 0, absorbance = -0.0025
- For every 1 mg/L increase in concentration, the mean absorbance will increase by 0.0777 units.

5
Q

Define ‘simple linear regression’.

A

Determines the best-fitting straight line for a dataset to investigate the change in one variable (dependent variable, Y) (continuous) that corresponds to a given change in the other variable (independent variable, X) (continuous, ordinal or nominal), provided that there is a significant correlation.

  • i.e. Simple linear regression does NOT test whether the relationship between the dependent & independent variables is linear.
  • Instead, it ASSUMES a linear relationship between the variables, and finds the y-intercept & slope of the best-fitting straight line.

The 1st step before performing regression analysis is to construct a scatter plot of y against x:

  • To first visually examine whether a relationship exists between the two numerical variables, before performing correlation analysis -> regression analysis.
  • To determine whether the relationship between the two variables is linear or nonlinear -> determines whether linear or nonlinear regression is applied.
6
Q

State the assumptions when using simple linear regression analysis.

A

1) There is a linear relationship between the variables.
- Thus, it is important to first construct a scatter plot of the data to determine if the relationship between the two variables is linear.

2) The observations are independent of one another.

3) For any specified values of x, the distribution of the y values is normal (i.e. the conditional distributions are
normally distributed).

4) For any set of values x, the variance is constant (i.e. all the conditional distributions have equal variance) (i.e. homoscedasticity).

7
Q

Provided that the assumptions of a simple linear regression model are met, how do we determine the best-fitting straight line?

A

Method of least squares

  • i.e. the line with the smallest residual sum of squares
  • If the assumptions are met, the residuals will be randomly scattered above & below the line e_i = 0 in a plot of e_i against ŷ_i (the fitted values).
8
Q

State the purpose behind the hypothesis testing of simple linear regression analysis.

A

To test the H0 that there is no effect of the independent variable X on the dependent variable Y.

H0: There is NO effect of the independent variable X on the dependent variable Y.
H1: There is an effect (i.e. two-tailed test) of the independent variable X on the dependent variable Y.

9
Q

Between alpha & beta values, which variable undergoes hypothesis testing more often under simple linear regression analysis?

A

Beta

- Hypothesis testing on alpha concerns only the mean value of y when x = 0, which is rarely of interest; testing beta assesses whether X has an effect on Y.

10
Q

How does one assess the goodness-of-fit of the simple linear regression model with the observed data?

A

Inspection of the coefficient of determination of the regression model (R^2).

  • In simple linear regression, R^2 = r^2, where r = Pearson product-moment correlation coefficient
  • R^2 can be interpreted as the proportion of variability among the observed values of y that is explained by the linear regression of y on x.
  • Loosely speaking, it means changes in the values of y can be predicted from changes in the values of x with (R^2 × 100)% accuracy.

Range of values = 0 to 1

  • R^2 = 1 means all the data points lie on the best-fit line
  • R^2 = 0 means there is NO linear relationship between x and y.
11
Q

E.g. of how to write conclusion of simple linear regression analysis.

A

Coefficient of determination:
81.8% of the variability among the observed values of SBP (y) is explained by its linear relationship with body weight (x).

Regression equation:
SBP = 23.811 + 1.657 (weight)

Interpretation of beta-coefficient:
For every 1kg increase in body weight (x), the mean SBP (y) will increase by 1.657 mmHg.

Conclusion:
At a significance level of 0.05, there is a statistically significant effect of body weight on SBP (p < 0.0005).

12
Q

Differentiate between simple & multiple linear regression.

A

Simple linear regression:

  • Describes the relationship between the dependent variable (continuous) and a single independent variable (continuous, ordinal or nominal)
  • Regression model: y = (alpha) + (beta)x

Multiple linear regression:

  • An extension of simple linear regression
  • Describes the relationship between the dependent variable (continuous) and more than one independent variable (continuous, ordinal or nominal)
  • Regression model: y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk
13
Q

State the assumptions when using multiple linear regression analysis.

A

1) The relationship among the variables is represented by the equation: y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk.
2) The observations are independent of one another.

3) For any specified values of x1, x2, … and xk, the distribution of the y values is normal (i.e. the conditional distributions are
normally distributed).

4) For any set of values of x1, x2, … and xk, the variance is constant (i.e. all the conditional distributions have equal variance) (i.e. homoscedasticity).
5) There is little or no multicollinearity among the independent variables (x1, x2, … and xk) i.e. independent variables should NOT be too highly correlated with each other.

14
Q

Given y = (alpha) + (beta1)x1 + (beta2)x2 + … + (betak)xk, what is the significance of alpha and beta values respectively?

A

x1, x2, … and xk are the values of k distinct, independent (or explanatory) variables.

Alpha = y-intercept of the best-fitting regression plane (hyperplane)
- i.e. the mean value of y when all independent variables = 0

Beta_i = slope of the best-fitting regression plane along x_i
- i.e. the change in mean value of y that corresponds to a one-unit change in x_i, after controlling for all other independent variables (i.e. keeping the values of all other independent variables constant).

15
Q

How are nominal variables incorporated into regression models for analysis?

A

Introduce dummy / indicator variables to identify these categories of nominal variables

  • Since the independent (or explanatory) variables in a regression analysis MUST assume numerical values, numbers are used to identify these categories
  • As these numerical values do NOT have any quantitative meaning, they are called indicator or dummy variables.

Interpretation of beta_i = the average difference in y between two groups, given identical values of the other x variables.

For a nominal variable with k categories, (k - 1) dummy variables are needed, with the categories of each dummy variable coded as 0 or 1.
- e.g. Chinese (00), Malay (10), Indian (01)
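The (k − 1)-dummy coding above can be sketched as a small helper (the ethnicity categories come from the example; the function name is my own):

```python
# Sketch: code a nominal variable with k categories as (k - 1) dummy variables.
# Here k = 3 (Chinese, Malay, Indian), so 2 dummies, with Chinese as the
# reference category (all dummies = 0).
def dummy_code(category, levels=("Chinese", "Malay", "Indian")):
    """Return the (k - 1) dummy variables; reference level = levels[0]."""
    return tuple(1 if category == lvl else 0 for lvl in levels[1:])

print(dummy_code("Chinese"))  # (0, 0)
print(dummy_code("Malay"))    # (1, 0)
print(dummy_code("Indian"))   # (0, 1)
```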

16
Q

State the purpose behind the hypothesis testing of multiple linear regression analysis.

A

To test the H0 that there is no effect of the independent variable Xi on the dependent variable Y, assuming that the values of all other independent variables remain constant.

H0: There is NO effect of the independent variable Xi on the dependent variable Y, assuming that the values of all other independent variables remain constant.
H1: There is an effect (i.e. two-tailed test) of the independent variable Xi on the dependent variable Y, assuming that the values of all other independent variables remain constant.

17
Q

How does one assess the goodness-of-fit of the multiple linear regression model with the observed data?

A

Inspection of the coefficient of determination of the regression model (R^2).
- R^2 can be interpreted as the proportion of variability among the observed values of y that is explained by the linear regression containing the set of independent variables.

Range of values = 0 to 1

18
Q

How does one compare between regression models that contain different numbers of independent variables?

A

Compare with adjusted R^2, NOT R^2!

  • Adjusted R^2 compensates for the added complexity of a model: it
    (a) increases when the inclusion of an independent variable improves the ability to predict y, and
    (b) decreases when it does not.
  • In contrast, the inclusion of an additional independent variable can NEVER cause R^2 itself to decrease!
  • However, adjusted R^2 CANNOT be directly interpreted as the proportion of variability among the observed values of y that is explained by the linear regression containing the set of independent variables.
  • Use adjusted R^2 ONLY to compare between models with different numbers of independent variables!
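One common form of the adjustment (the usual formula, not stated in the cards) is adjusted R^2 = 1 − (1 − R^2)(n − 1)/(n − k − 1), which penalises each added independent variable. A sketch:

```python
# Sketch: adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
# where n = number of observations and k = number of independent variables.
def adjusted_r_squared(r_squared, n, k):
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Adding a variable that barely raises R^2 can LOWER adjusted R^2
# (the values below are hypothetical):
print(adjusted_r_squared(0.945, 20, 2))  # model with 2 predictors
print(adjusted_r_squared(0.946, 20, 3))  # +1 predictor, tiny gain in R^2
```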
19
Q

E.g. of how to write conclusion of multiple linear regression analysis using continuous independent variables.

A

Coefficient of determination (use R^2, NOT adjusted R^2):
94.5% of the variability among the observed values of SBP (y) is explained by its linear regression model containing both body weight (x1) and serum cholesterol (x2).

Regression equation:
SBP = 3.109 + 1.386 (weight) + 0.219 (serum cholesterol)

Interpretation of beta-coefficient:
For every 1kg increase in body weight (x1), the mean SBP (y) will increase by 1.386 mmHg after controlling for serum cholesterol (x2).

For every 1 mg/100mL increase in serum cholesterol (x2), the mean SBP (y) will increase by 0.219 mmHg after controlling for body weight (x1).

Conclusion:
At a significance level of 0.05, body weight and serum cholesterol are independently associated with SBP.
- If one variable is not significant, write 'not independently associated' instead.

20
Q

E.g. of how to write conclusion of multiple linear regression analysis using nominal independent variables.

A

To examine the effect of treatment (x1 and x2) on BMI at follow-up (y) after controlling for baseline BMI (x3).

Tx Groups: Control (00), Dosage 1 (10), Dosage 2 (01)

Regression equation: BMI at follow-up = 0.428 – 2.064 (Dosage 1) – 1.941 (Dosage 2) + 0.984 (Baseline BMI)

  • Control: BMI at follow-up = 0.428 + 0 + 0 + 0.984 (Baseline BMI)
  • Dosage 1: BMI at follow-up = 0.428 – 2.064 + 0 + 0.984 (Baseline BMI)
  • Dosage 2: BMI at follow-up = 0.428 + 0 – 1.941 + 0.984 (Baseline BMI)
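The per-group equations above can be evaluated directly (a sketch using the coefficients from the regression equation; the function name and baseline value are my own):

```python
# Sketch: predicted BMI at follow-up from the fitted equation above.
# dosage1/dosage2 are the 0/1 dummy variables for treatment group.
def predicted_bmi(dosage1, dosage2, baseline_bmi):
    return 0.428 - 2.064 * dosage1 - 1.941 * dosage2 + 0.984 * baseline_bmi

baseline = 25.0  # hypothetical baseline BMI, kg/m^2
control = predicted_bmi(0, 0, baseline)
dosage1 = predicted_bmi(1, 0, baseline)

# At equal baseline BMI, the gap between groups is exactly beta1:
print(round(control - dosage1, 3))  # 2.064
```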

Interpretation of beta-coefficient:

  • The mean BMI at follow-up for Dosage 1 group is 2.064 kg/m^2 smaller than that for the Control group, after controlling for baseline BMI.
  • The mean BMI at follow-up for Dosage 2 group is 1.941 kg/m^2 smaller than that for the Control group, after controlling for baseline BMI.
  • For every 1 kg/m^2 increase in baseline BMI, the mean BMI at follow-up will increase by 0.984 kg/m^2, after controlling for the treatment Group.

Conclusion:
At a significance level of 0.05, there is a statistically significant association between treatment and BMI at follow-up, after controlling for baseline BMI.
- Even if either one of the dosage groups ends up with p > 0.05, we can STILL conclude that there is a statistically significant association, because treatment (Tx) is a single variable with three categories.
- Otherwise, if a variable is not significant, write 'not independently associated' instead.