Regression Flashcards

1
Q

A scatter plot

A

uses Cartesian coordinates to display the values for two variables so we can visualise the
relationship between them

2
Q

Correlation

A

measures the relationship between two continuous variables.

3
Q

The Pearson correlation coefficient

A

describes the strength of the (linear) association between two variables.

4
Q

Hypothesis test for the correlation between two samples

A

can be carried out using Pearson’s correlation coefficient.
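A minimal sketch of this test in R (Var_1 and Var_2 are made-up example vectors):

Var_1 <- c(2, 4, 6, 8)
Var_2 <- c(1, 5, 7, 8)
cor.test(Var_1, Var_2, method = "pearson")  # tests H0: true correlation = 0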

5
Q

The correlation coefficient always takes a value between -1 and 1, where:

A

-1: Perfect negative linear correlation between two variables.
0: No linear correlation between two variables.
1: Perfect positive linear correlation between two variables.

6
Q

how to determine if a correlation coefficient is statistically significant

A

you can calculate the corresponding t-score and p-value.
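For a sample of size n, the test statistic is t = r·√(n − 2) / √(1 − r²), which follows a t-distribution with n − 2 degrees of freedom under the null hypothesis of zero correlation.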

7
Q

## Pearson's product-moment correlation
##
## data: Var_1 and Var_2
## t = 7.6064, df = 2, p-value = 0.01685
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4004041 0.9996629
## sample estimates:
##       cor
## 0.9831516
How can we interpret the R output?

A

Since the correlation coefficient (0.9831516) is positive, there is a positive linear relationship between the two variables.

Since the p-value (0.01685) is less than 0.05, the correlation is statistically significant.

8
Q

difference between Pearson’s correlation coefficient and Spearman’s rank correlation

A

Pearson’s correlation coefficient assumes normality for the two samples, whereas Spearman’s rank correlation does not (it is non-parametric).

9
Q

what data is Spearman’s rank correlation appropriate for

A

both continuous and discrete variables.

10
Q

Spearman’s rank correlation, rather than assuming a linear relationship, measures

A

the strength of a monotonic relationship (i.e. the extent to which, as one variable increases, the other systematically increases or decreases).
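A minimal sketch in R (A and B are made-up example vectors); method = "spearman" switches cor.test from the default Pearson test:

A <- c(1, 2, 3, 4)
B <- c(2, 1, 3, 4)
cor.test(A, B, method = "spearman")  # tests H0: true rho = 0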

11
Q

when is a regression analysis used

A

to predict the value of a dependent variable from one or more independent variables.

12
Q

linear model

A

The simplest form of regression model, in which the response (or dependent) variable is y and the continuous explanatory (or independent) variable is x.
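A minimal sketch of fitting such a model in R (df is a hypothetical data frame with columns y and x):

model <- lm(y ~ x, data = df)  # fit y = b0 + b1*x by least squares
summary(model)                 # estimates, standard errors, R-squared, F-statistic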

13
Q

What can we observe from the graph?

A

By modelling the relationship between these two variables (which we will now call ‘fitting a model’), we can use this model to predict the value of the dependent variable.

14
Q

Linear regression makes a series of assumptions

A

Observations are independent.
The residuals should not be predictable from the fitted values in any way; if any feature of the residuals can be predicted once the fitted value is known, this assumption is violated.

15
Q

## lm(formula = tannin_data$growth ~ tannin_data$tannin)
##
## Coefficients:
##        (Intercept)  tannin_data$tannin
##             11.756              -1.217

how is this interpreted

A

β0 = 11.756 (intercept) and β1 = -1.217 (slope), so the fitted model is growth = 11.756 − 1.217 × tannin.

16
Q

Checking the assumptions of the model in R

A

Plot the model. In the first plot (residuals vs fitted values), what we do not want is for the points to become more scattered as the fitted values increase. In the second plot (the normal Q-Q plot), the points need to be as close to the line as possible to confirm that the residuals are normally distributed.
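A minimal sketch, assuming the fitted model object is called model:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model)           # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage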

17
Q

what test to compare model fit

A

An F-test can be used to compare a model with no predictors (i.e. an intercept-only model) against your fitted model.

18
Q

what the hypotheses for the F-test are

A

H0 : The fit of the intercept-only model and your model are equal.
H1 : The fit of the intercept-only model is significantly reduced compared to your model.

19
Q

when we say ‘the fit’ we mean

A

how well the model fits the data, i.e. whether an intercept-only model explains the data or whether our model that includes the slope explains the data better.

20
Q

how does the F-test determine model fit

A

Use the quantiles of the F-distribution (the qf function in R) to find the critical value.
If the F-ratio (i.e. SSR/s²) is larger than the critical value, we can reject the null hypothesis. To confirm this, we can obtain the p-value using the pf function (as in Workshop 3) and compare it with the significance level that we chose.
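A minimal sketch in R, reusing the F-statistic from the summary() output shown in a later card (F = 18.35 on 1 and 8 degrees of freedom):

f_ratio <- 18.35
qf(0.95, df1 = 1, df2 = 8)                        # critical value at the 5% significance level
pf(f_ratio, df1 = 1, df2 = 8, lower.tail = FALSE) # p-value (about 0.0027)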

21
Q

Model uncertainty

A

When we fit a regression model and estimate β0 and β1, there is uncertainty around the coefficients we have estimated. This uncertainty can be shown by calculating the standard errors of the intercept and slope. The bigger the standard error, the less certain we are about our estimate.

22
Q

Coefficient of determination

A

Denoted R², it is a measure of how well the model fits the data: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

23
Q

what does the value of R2 mean

A

R² can take values from 0 to 1 (both included), with 0 meaning that the model explains no variation at all (i.e. the model does not fit the data) and 1 meaning that it explains all the variation. Suppose R² = 0.82. This implies that 82% of the variability in the dependent variable has been accounted for by our model, and the remaining 18% is unaccounted for. In reality our models will never perfectly fit the data, and it is unrealistic to expect a value of 1.

24
Q

Finding the t-value and p-value for each coefficient (Hypothesis testing)

A

Every time we fit a regression model, we perform a hypothesis test for each of the regression parameters.

25
Q

finding the t-value and p-value for a linear model

A

Two hypothesis tests are performed: one for the intercept and one for the slope.
H0 : coefficient = 0
H1 : coefficient ≠ 0
The t-value for the intercept is calculated by dividing the estimate of the intercept by the standard error of the intercept; the t-value of the slope is calculated in the same way.
Since this is a two-tailed test, the p-value is twice the tail probability from the t-distribution (with n − 2 degrees of freedom for a simple linear model).
If the p-value is less than the significance level, we can reject the null hypothesis and say that the coefficient is not zero.
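A minimal sketch in R, using the slope estimate and standard error from the summary() output in the next card (estimate 1.2780, standard error 0.2984, 8 residual degrees of freedom):

t_value <- 1.2780 / 0.2984                                   # about 4.28
p_value <- 2 * pt(abs(t_value), df = 8, lower.tail = FALSE)  # about 0.0027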

26
Q

summary(model)

Call:
lm(formula = y ~ x, data = df)

Residuals:
Min 1Q Median 3Q Max
-4.4793 -0.9772 -0.4772 1.4388 4.6328

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  11.1432     1.9104   5.833  0.00039 ***
x             1.2780     0.2984   4.284  0.00267 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.929 on 8 degrees of freedom
Multiple R-squared: 0.6964, Adjusted R-squared: 0.6584
F-statistic: 18.35 on 1 and 8 DF, p-value: 0.002675

how can this be interpreted

A

F-statistic = 18.35, corresponding p-value = .002675. Since this p-value is less than .05, the model as a whole is statistically significant.

Multiple R-squared = .6964. This tells us that 69.64% of the variation in the response variable, y, can be explained by the predictor variable, x.

Coefficient estimate of x: 1.2780. This tells us that each additional one unit increase in x is associated with an average increase of 1.2780 in y.

We can then use the coefficient estimates from the output to write the estimated regression equation:

y = 11.1432 + 1.2780*(x)

27
Q

why use the plot() function to produce the diagnostic plots for the regression model

A

These plots allow us to analyze the residuals of the regression model to determine if the model is appropriate to use for the data.

28
Q

what is a multivariable linear regression

A

A regression model that includes more than one independent variable to predict the dependent variable.
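A minimal sketch in R (df, y, x1 and x2 are hypothetical):

model <- lm(y ~ x1 + x2, data = df)  # two independent variables
summary(model)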

29
Q

linear regression denotation

A

y = β0 + β1x,
where y is the outcome, x is our explanatory variable, β0 is the intercept and β1 is the slope.

30
Q

If x is not continuous, but a categorical variable with 3 levels, e.g. ‘A’, ‘B’ and ‘C’, then the linear regression is denoted as

A

y = β0 + β1·x_level1 + β2·x_level2,
where x_level1 = 1 if x is equal to the first level and zero otherwise, and
x_level2 = 1 if x is equal to the second level and zero otherwise.
If x is equal to the third level then y = β0 only.

31
Q

## lm(formula = pain ~ drug, data = migraine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.7778 -0.7778  0.1111  0.3333  2.2222
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.6667     0.3629  10.104 4.01e-10 ***
## drugB         2.1111     0.5132   4.114 0.000395 ***
## drugC         2.2222     0.5132   4.330 0.000228 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.089 on 24 degrees of freedom
## Multiple R-squared: 0.498, Adjusted R-squared: 0.4562
## F-statistic: 11.91 on 2 and 24 DF, p-value: 0.0002559
Write the equation and interpret the model fitted above.

A

The fitted equation is:

y = 3.6667 + 2.1111·drugB + 2.2222·drugC,

where drugB and drugC are dummy variables equal to 1 for patients on that drug and 0 otherwise. The intercept (3.6667) is the mean pain score for the reference drug A; drugs B and C increase the mean score by 2.1111 and 2.2222 respectively.

R² is 0.498, so the model explains only about 50% of the variation in pain.

32
Q

generalised linear model (GLM)

A

allows the linear model to be related to
the outcome via a link function and is much more flexible than a simple or multivariable linear model.

33
Q

logistic regression model

A

used when the outcome is binary; the linear predictor is related to the probability of the outcome via the logit link.
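A minimal sketch in R, mirroring the call shown in the next card's output (CHD is a data frame with a binary column chd and a continuous column obesity):

model <- glm(chd ~ obesity, family = binomial(link = "logit"), data = CHD)
summary(model)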

34
Q

## glm(formula = chd ~ obesity, family = binomial(link = "logit"),
## data = CHD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3396 -0.9257 -0.8558 1.4021 1.7116
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.92831 0.61692 -3.126 0.00177 **
## obesity 0.04942 0.02318 2.132 0.03302 *
## —
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 591.53 on 460 degrees of freedom
## AIC: 595.53
##
## Number of Fisher Scoring iterations: 4

interpret this r output?

A

Obesity has a statistically significant association with CHD: its p-value (0.03302) is less than 0.05, as indicated by the '*' significance code.

35
Q

## glm(formula = chd ~ obesity + sbp, family = binomial(link = "logit"),
## data = CHD)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5239 -0.9078 -0.7921 1.3084 1.7982
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.938978   0.839783  -4.690 2.73e-06 ***
## obesity      0.029783   0.024085   1.237 0.216257
## sbp          0.018118   0.004972   3.644 0.000268 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 577.80 on 459 degrees of freedom
## AIC: 583.8
##
## Number of Fisher Scoring iterations: 4

How can this output be interpreted?

A

When we include systolic blood pressure (sbp) as a predictor, obesity is no longer significant: its p-value (0.216257) carries no significance code (the ' ' code covers p-values between 0.1 and 1). sbp itself is strongly significant (p = 0.000268).

36
Q

Poisson regression

A

is another type of GLM, where the response variable has a Poisson distribution. In Poisson regression, the response variable Y is a count (0, 1, 2, ...).
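A minimal sketch in R (df is a hypothetical data frame with a count outcome y and a predictor x):

model <- glm(y ~ x, family = poisson(link = "log"), data = df)
summary(model)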

37
Q

Spearman’s rank correlation rho
data: A and B
S = 2, p-value = 0.3333
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.8

interpret this output

A

The Spearman rank correlation is 0.8 and the corresponding p-value is 0.3333.

This indicates that there is a positive correlation between the two vectors.

However, since the p-value of the correlation is not less than 0.05, the correlation is not statistically significant.

38
Q

general equation for y and x relationship

A

y = a + bx

b is the slope of the line

a is the constant or intercept

39
Q

F-Test of overall significance in regression is a test of

A

whether or not your linear regression model provides a better fit to a dataset than a model with no predictor variables.

The F-Test of overall significance has the following two hypotheses:

Null hypothesis (H0) : The model with no predictor variables (also known as an intercept-only model) fits the data as well as your regression model.

Alternative hypothesis (HA) : Your regression model fits the data better than the intercept-only model.
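A minimal sketch of this comparison in R (df, y and x are hypothetical):

intercept_only <- lm(y ~ 1, data = df)  # model with no predictors
full_model <- lm(y ~ x, data = df)      # your regression model
anova(intercept_only, full_model)       # F-test comparing the two fits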