Regression Flashcards
A scatter plot
uses Cartesian coordinates to display the values for two variables so we can visualise the
relationship between them
Correlation
measures the strength of the relationship between two continuous variables.
The Pearson correlation coefficient
describes the strength of (linear) association between them.
Hypothesis test for the correlation between two samples
using Pearson’s Correlation
Coefficient
The correlation coefficient always takes a value between -1 and 1, where:
-1: Perfect negative linear correlation between the two variables.
0: No linear correlation between the two variables.
1: Perfect positive linear correlation between the two variables.
how to determine if a correlation coefficient is statistically significant
You can calculate the corresponding t-score and p-value.
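Output of the kind shown below comes from R's cor.test function. A minimal sketch, assuming two hypothetical samples Var_1 and Var_2 of size n = 4 (so df = n - 2 = 2); the exact numbers will differ from the output shown:

```r
# Hypothetical data: two small samples of size n = 4 (df = n - 2 = 2)
Var_1 <- c(1, 2, 3, 4)
Var_2 <- c(2.1, 3.9, 6.2, 7.8)

# Pearson correlation test: H0 is that the true correlation equals 0
cor.test(Var_1, Var_2, method = "pearson")
```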
##
## data: Var_1 and Var_2
## t = 7.6064, df = 2, p-value = 0.01685
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4004041 0.9996629
## sample estimates:
## cor
## 0.9831516
How can we interpret the R output?
Since the correlation coefficient is positive, there is a positive linear relationship between the two variables.
Since the p-value of the correlation coefficient is less than 0.05, the correlation is statistically significant.
difference between Pearson’s correlation coefficient and Spearman’s rank correlation
Pearson’s correlation coefficient assumes normality for the two samples, whereas Spearman’s rank correlation does not (it is non-parametric).
what data is Spearman’s rank correlation appropriate for
Both continuous and discrete variables.
Spearman’s rank correlation, rather than assuming a linear relationship, measures the strength of a monotone relationship (i.e. the extent to which, as one variable increases, the other systematically increases or decreases).
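A minimal sketch of a Spearman test in R, using made-up data in which y increases monotonically but not linearly with x (the variable names and values are hypothetical):

```r
# Made-up data: y rises monotonically with x, but not linearly
x <- c(1, 2, 3, 4, 5)
y <- c(2, 6, 7, 8, 20)

# Spearman's rho works on the ranks, so this nonlinear but perfectly
# monotone relationship gives rho = 1
cor.test(x, y, method = "spearman")
```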
when is a regression analysis used
To try to predict the value of a dependent variable from one or more independent variables.
linear model
simplest form of regression model
the response (or dependent) variable is y and the continuous explanatory (or independent) variable is x, i.e. y = β0 + β1x + ε, where ε is the random error.
What can we observe from the graph?
By modelling the relationship between these two variables (which we will now call ‘fitting a model’), we can use this model to predict the value of the dependent variable from the independent variable.
Linear regression makes a series of assumptions
observations are independent
The residuals should not be predictable from the fitted values in any way.
If any feature of the residuals can be predicted once the fitted value is known, this assumption is violated.
## Call:
## lm(formula = tannin_data$growth ~ tannin_data$tannin)
##
## Coefficients:
## (Intercept) tannin_data$tannin
## 11.756 -1.217
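Output of this form comes from fitting a linear model with lm. A sketch assuming the classic tannin dataset (larval growth at nine tannin concentrations, as in Crawley's R texts); with these assumed values the coefficients match the output above:

```r
# Assumed data: growth measured at tannin concentrations 0 to 8
tannin_data <- data.frame(
  tannin = 0:8,
  growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3)
)

# Fit the simple linear regression growth = b0 + b1 * tannin
model <- lm(growth ~ tannin, data = tannin_data)
coef(model)  # intercept ~ 11.756, slope ~ -1.217
```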
how is this interpreted
β0 = 11.756 (intercept: the predicted value of growth when tannin = 0) and β1 = -1.217 (slope: growth decreases by 1.217 units for each one-unit increase in tannin).
Checking the assumptions of the model in R
plot the model
For the first plot (residuals vs fitted values), what we do not want is for the points to become more scattered as the fitted values increase.
For the second plot (the normal Q-Q plot), the points need to be as close to the line as possible to confirm that the residuals are normally distributed.
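A sketch of producing these diagnostic plots, using R's built-in cars dataset as a stand-in model:

```r
# Stand-in model using the built-in cars dataset
model <- lm(dist ~ speed, data = cars)

# plot() on an lm object produces four diagnostic plots; the first is
# Residuals vs Fitted, the second is the Normal Q-Q plot
par(mfrow = c(2, 2))
plot(model)
```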
what test to compare model fit
An F-test can be used to compare a model with no predictors (i.e. an intercept-only model) with your fitted model.
what the hypotheses for the F-test are
H0 : The fit of the intercept-only model and your model are equal.
H1 : The fit of the intercept-only model is significantly reduced compared to your model.
when we say ‘the fit’ we mean:
how well the model fits the data, i.e. whether an intercept-only model explains the data or whether our model that includes the slope explains the data better.
how does the F-test determine model fit
Use the quantiles of the F-distribution (the qf function in R) to find the critical value. If the F-ratio (i.e. SSR/s2) is larger than the critical value, we can reject the null hypothesis. To confirm this, we can obtain the p-value using the pf function (as in Workshop 3) and compare it with the significance level that we chose.
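A sketch of this procedure, using the built-in cars dataset as a stand-in model and a 5% significance level (both assumptions for illustration):

```r
# Stand-in model: stopping distance vs speed
model <- lm(dist ~ speed, data = cars)
a <- anova(model)

F_ratio <- a$`F value`[1]  # SSR / s^2
df1 <- a$Df[1]             # numerator df (number of slopes)
df2 <- a$Df[2]             # denominator (residual) df

# Critical value at the 5% significance level
crit <- qf(0.95, df1, df2)

# p-value: probability of an F-ratio at least this large under H0
p_val <- pf(F_ratio, df1, df2, lower.tail = FALSE)

F_ratio > crit  # TRUE: reject H0
p_val < 0.05    # TRUE
```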
Model uncertainty
When we fit a regression model and estimate β0 and β1, there is uncertainty around the coefficients we have estimated. This uncertainty can be shown by calculating the standard error of the intercept and slope. The bigger the standard error, the less certain we are about our estimate.
Coefficient of determination
denoted R2, is a measure of how well the model fits the data: it is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
what does the value of R2 mean
R2 can take values from 0 to 1 (both included), with 0 meaning that the model explains no variation at all (i.e. the model does not fit the data) and 1 meaning that it explains all the variation. Suppose R2 = 0.82. This implies that 82% of the variability in the dependent variable has been accounted for by our model, and the remaining 18% is unaccounted for. In reality our models will never perfectly fit the data, and it is unrealistic to expect a value of 1.
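A sketch of extracting R2 in R, again using the built-in cars dataset as a stand-in model:

```r
model <- lm(dist ~ speed, data = cars)

# R-squared: proportion of variance in dist explained by speed
r2 <- summary(model)$r.squared
r2
```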
Finding the t-value and p-value for each coefficient (Hypothesis testing)
Every time we fit a regression model, we perform a hypothesis test for each of the regression parameters.
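These per-coefficient t-values and p-values appear in the coefficient table of summary; a sketch using the built-in cars dataset as a stand-in model:

```r
model <- lm(dist ~ speed, data = cars)

# Each row tests H0: the coefficient equals 0
# Columns: Estimate, Std. Error, t value, Pr(>|t|)
summary(model)$coefficients
```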