Regression Flashcards
A scatter plot
uses Cartesian coordinates to display the values for two variables so we can visualise the
relationship between them
Correlation
measures the strength of the relationship between two continuous variables.
The Pearson correlation coefficient
describes the strength of (linear) association between them.
Hypothesis test for the correlation between two samples
using Pearson’s Correlation
Coefficient
The correlation coefficient always takes a value between -1 and 1, where:
-1: Perfect negative linear correlation between the two variables.
0: No linear correlation between the two variables.
1: Perfect positive linear correlation between the two variables.
how to determine if a correlation coefficient is statistically significant
You can calculate the corresponding t-score and p-value.
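Output of the kind shown below comes from R's cor.test function. A minimal sketch, assuming two hypothetical samples Var_1 and Var_2 of size n = 4 (so df = n - 2 = 2); the exact numbers will differ from the output shown:

```r
# Hypothetical data: two small samples of size n = 4 (df = n - 2 = 2)
Var_1 <- c(1, 2, 3, 4)
Var_2 <- c(2.1, 3.9, 6.2, 7.8)

# Pearson correlation test: H0 is that the true correlation equals 0
cor.test(Var_1, Var_2, method = "pearson")
```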
##
## data: Var_1 and Var_2
## t = 7.6064, df = 2, p-value = 0.01685
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4004041 0.9996629
## sample estimates:
## cor
## 0.9831516
How can we interpret the R output?
Since the correlation coefficient is positive, there is a positive linear relationship between the two variables.
Since the p-value of the correlation coefficient is less than 0.05, the correlation is statistically significant.
difference between Pearson’s correlation coefficient and Spearman’s rank correlation
Pearson’s correlation coefficient assumes normality for the two samples, whereas Spearman’s rank correlation does not (it is non-parametric).
what data is Spearman’s rank correlation appropriate for
Both continuous and discrete variables.
Spearman’s rank correlation, rather than assuming a linear relationship, measures the strength of a monotone relationship (i.e. the extent to which, as one variable increases, the other systematically increases or decreases).
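A minimal sketch of a Spearman test in R, using made-up data in which y increases monotonically but not linearly with x (the variable names and values are hypothetical):

```r
# Made-up data: y rises monotonically with x, but not linearly
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 7, 8, 20)

# Spearman's rho works on the ranks, so this nonlinear but perfectly
# monotone relationship gives rho = 1
cor.test(x, y, method = "spearman")
```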
when is a regression analysis used
To try to predict the value of a dependent variable from one or more independent variables.
linear model
simplest form of regression model
the response (or dependent) variable is y and the continuous explanatory (or independent) variable is x, i.e. y = β0 + β1x + ε, where ε is the random error.
What can we observe from the graph?
By modelling the relationship between these two variables (which we will now call ‘fitting a model’), we can use this model to predict the value of the dependent variable from the independent variable.
Linear regression makes a series of assumptions
observations are independent
The residuals should not be predictable from the fitted values in any way.
If any feature of the residuals can be predicted once the fitted value is known, this assumption is violated.
## Call:
## lm(formula = tannin_data$growth ~ tannin_data$tannin)
##
## Coefficients:
## (Intercept) tannin_data$tannin
## 11.756 -1.217
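Output of this form comes from fitting a linear model with lm. A sketch assuming the classic tannin dataset (larval growth at nine tannin concentrations, as in Crawley's R texts); with these assumed values the coefficients match the output above:

```r
# Assumed data: growth measured at tannin concentrations 0 to 8
tannin_data <- data.frame(
  tannin = 0:8,
  growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3)
)

# Fit the simple linear regression growth = b0 + b1 * tannin
model <- lm(growth ~ tannin, data = tannin_data)
coef(model)  # intercept ~ 11.756, slope ~ -1.217
```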
how is this interpreted
β0 = 11.756 (intercept: the predicted value of growth when tannin = 0) and β1 = -1.217 (slope: growth decreases by 1.217 units for each one-unit increase in tannin).
Checking the assumptions of the model in R
plot the model
For the first plot (residuals vs fitted values), what we do not want is for the points to become more scattered as the fitted values increase.
For the second plot (the normal Q-Q plot), the points need to be as close to the line as possible to confirm that the residuals are normally distributed.
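A sketch of producing these diagnostic plots, using R's built-in cars dataset as a stand-in model:

```r
# Stand-in model using the built-in cars dataset
model <- lm(dist ~ speed, data = cars)

# plot() on an lm object produces four diagnostic plots; the first is
# Residuals vs Fitted, the second is the Normal Q-Q plot
par(mfrow = c(2, 2))
plot(model)
```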
what test to compare model fit
An F-test can be used to compare a model with no predictors (i.e. an intercept-only model) with your fitted model.
what the hypotheses for the F-test are
H0 : The fit of the intercept-only model and your model are equal.
H1 : The fit of the intercept-only model is significantly reduced compared to your model.
when we say ‘the fit’ we mean:
how well the model fits the data, i.e. whether an intercept-only model explains the data or whether our model that includes the slope explains the data better.
how does the F-test determine model fit
Use the quantiles of the F-distribution (the qf function in R) to find the critical value. If the F-ratio (i.e. SSR/s2) is larger than the critical value, we can reject the null hypothesis. To confirm this, we can obtain the p-value using the pf function (as in Workshop 3) and compare it with the significance level that we chose.
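A sketch of this procedure, using the built-in cars dataset as a stand-in model and a 5% significance level (both assumptions for illustration):

```r
# Stand-in model: stopping distance vs speed
model <- lm(dist ~ speed, data = cars)
a <- anova(model)

F_ratio <- a$`F value`[1]  # SSR / s^2
df1 <- a$Df[1]             # numerator df (number of slopes)
df2 <- a$Df[2]             # denominator (residual) df

# Critical value at the 5% significance level
crit <- qf(0.95, df1, df2)

# p-value: probability of an F-ratio at least this large under H0
p_val <- pf(F_ratio, df1, df2, lower.tail = FALSE)

F_ratio > crit  # TRUE: reject H0
p_val < 0.05    # TRUE
```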
Model uncertainty
When we fit a regression model and estimate β0 and β1, there is uncertainty around the coefficients we have estimated. This uncertainty can be shown by calculating the standard error of the intercept and slope. The bigger the standard error, the less certain we are about our estimate.
Coefficient of determination
denoted R2, is a measure of how well the model fits the data: it is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
what does the value of R2 mean
R2 can take values from 0 to 1 (both included), with 0 meaning that the model explains no variation at all (i.e. the model does not fit the data) and 1 meaning that it explains all the variation. Suppose R2 = 0.82. This implies that 82% of the variability in the dependent variable has been accounted for by our model, and the remaining 18% is unaccounted for. In reality our models will never perfectly fit the data, and it is unrealistic to expect a value of 1.
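A sketch of extracting R2 in R, again using the built-in cars dataset as a stand-in model:

```r
model <- lm(dist ~ speed, data = cars)

# R-squared: proportion of variance in dist explained by speed
r2 <- summary(model)$r.squared
r2
```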
Finding the t-value and p-value for each coefficient (Hypothesis testing)
Every time we fit a regression model, we perform a hypothesis test for each of the regression parameters.
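These per-coefficient t-values and p-values appear in the coefficient table of summary; a sketch using the built-in cars dataset as a stand-in model:

```r
model <- lm(dist ~ speed, data = cars)

# Each row tests H0: the coefficient equals 0
# Columns: Estimate, Std. Error, t value, Pr(>|t|)
summary(model)$coefficients
```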