correlation and regression Flashcards
what is the point of tests of relationship
see if there is an association between variables and if there is a cause and effect relationship
define correlation
a measure of how strongly 2 variables change together, either in the same direction (positive) or in opposite directions (negative)
define regression
investigate the causal effect of one variable on another
what does regression assume
that one variable depends on the other (you can predict y from X)
relationship between X and Y for correlation
doesn’t matter which is y or x
regression or correlation?
you can predict y from x
regression
plot when you are looking at the relationship between 2 continuous variables
scatter plot
what does a scatter plot show you
the relationship between two continuous variables
you have two continuous variables. what plot do you use to see their relationship
scatter plot
what’s it called when you are looking at whether two sets of observations are associated
correlation
what’s it called when you are looking at how strong an association is
correlation
what does correlation tell you
whether observations are correlated and how strong or significant the association is
what test do you run to see if two observations are correlated
Pearson’s product-moment correlation, Spearman’s rank-order correlation or Kendall’s rank-order correlation
what are the assumptions for Pearson’s correlation
- both variables are continuous
- both are normally distributed (bivariate normal distribution)
when would you use Spearman’s rank-order correlation
if your variables are not normally distributed and you can’t run a Pearson’s correlation
nonparametric equivalent to Pearson’s product-moment correlation
Spearman’s rank-order correlation
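A minimal sketch of why Spearman's correlation is the rank-based counterpart of Pearson's: it is Pearson's r computed on the ranks of the data. The function names and the example data are illustrative assumptions, not part of the original cards.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Spearman's rank-order correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical data: y rises with x, but not in a straight line
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]
rho = spearman_rho(x, y)  # the ranks agree perfectly
r = pearson_r(x, y)       # the raw values are not perfectly linear
```

Here the rank correlation reaches 1 while Pearson's r stays below 1, because ranking only keeps the ordering of the values.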
what tests if there is a linear relationship between variables
Pearson correlation coefficient
what do you use Pearson correlation coefficient for
to see if there is a linear relationship
explain what a value of r>0 indicates for Pearson’s correlation coefficient
positive linear relationship
explain what a value of r<0 indicates for Pearson’s correlation coefficient
negative linear relationship
explain what a value of r=0 indicates for Pearson’s correlation coefficient
no linear relationship
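The sign of r can be checked with a small sketch that computes Pearson's coefficient by hand. The data sets below are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient for two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])  # y rises as x rises
r_neg = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_zero = pearson_r(x, [1, 2, 3, 2, 1])  # y rises then falls: no linear trend
```

The third data set still has a clear pattern (up, then down), which is a reminder that r near 0 only rules out a linear relationship.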
what does linear correlation tell you
indicates whether variables are related (p<0.05), and how strong that relationship is
what influences p-values for Pearson’s correlation
- sample size
- large N can give low P, even when effect (r) is weak
- high r can have non-significant p-values if N is low
how does r value impact results from Pearson’s correlation
high r can have non-significant p-values if N is low
how does sample size influence Pearson’s correlation results
high N can give low P, even when effect (r) is weak
high r can have non-significant P-values if N is low
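A rough sketch of why sample size matters: the p-value for Pearson's r comes from a t statistic with n - 2 degrees of freedom, and that statistic grows with n. The r and n values below are invented for illustration, and the critical values in the comments are taken from standard two-sided t tables at the 0.05 level.

```python
import math

def t_statistic(r, n):
    """t statistic for testing 'no linear correlation', with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Weak effect, large sample: critical t for 998 df is about 1.96
t_weak_large_n = t_statistic(0.1, 1000)

# Strong effect, small sample: critical t for 6 df is about 2.447
t_strong_small_n = t_statistic(0.6, 8)
```

Under these assumptions the weak correlation (r = 0.1) clears its critical value and the strong one (r = 0.6) does not, which matches the cards above.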
what test to use if you want to compare ranked variables
Spearman’s rank-order correlation
which test is a more conservative approach for correlation
Spearman’s correlation
what are the assumptions for regression
- there is a causal relationship
- you can predict Y (effect, response) from X (cause, predictor, covariate)
what are the assumptions for linear regression
- assumes you can express the relationship between Y and X as a linear equation
- y is distributed normally at each value of x
- the variance is equal (homogeneity of variance)
- errors are independent (no serial correlation)
in the regression equation (y=mx+b), which is dependent variable and which is independent
y is dependent
x is independent
difference between parameters vs variables
variables vary, parameters are constant
in the y=mx+b equation, which are variables and which are parameters
x and y are variables
m and b are parameters
goal for looking at regression equation
to find values of m and b (parameters) that provide the best fit to the data
how to calculate residuals
actual value minus predicted value
how is the best fit line chosen
the sum of the squared distances of the points from the line is minimized
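The least-squares idea can be sketched with the closed-form solution for a single predictor. The data are hypothetical and `fit_line` is an illustrative name; m is the slope and b the intercept, as in y=mx+b.

```python
def fit_line(x, y):
    """Return (m, b) minimizing the sum of squared residuals for y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx            # slope (regression coefficient)
    b = mean_y - m * mean_x  # intercept: the line passes through the means
    return m, b

# Hypothetical data scattered around y = 2x + 1
x = [1, 2, 3, 4, 5]
y = [2.9, 5.2, 6.8, 9.1, 11.0]

m, b = fit_line(x, y)
# Residual = actual value minus predicted value
residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
```

With an intercept in the model, the residuals of a least-squares line sum to zero (up to rounding), which is a quick sanity check on a fit.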
variance=?
mean squared residual
mean squared residual= ?
variance
what is R2
coefficient of determination
define coefficient of determination (r2)
proportion of the variance in the observed values of the dependent variable that is explained by the regression model
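As a small sketch, the coefficient of determination can be computed as 1 minus the ratio of residual to total sum of squares. The observed and predicted values below are hypothetical (the predictions assume a fitted line y = 2.01x + 0.97 for x = 1..5).

```python
def r_squared(y, y_pred):
    """Proportion of variance in y explained by the predictions."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # unexplained variation
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    return 1 - ss_res / ss_tot

y = [2.9, 5.2, 6.8, 9.1, 11.0]
y_pred = [2.01 * xi + 0.97 for xi in [1, 2, 3, 4, 5]]

r2 = r_squared(y, y_pred)
r2_perfect = r_squared(y, y)                            # predictions match exactly
r2_mean_only = r_squared(y, [sum(y) / len(y)] * len(y)) # model that only predicts the mean
```

Perfect predictions give 1, and a model that always predicts the mean of y gives 0.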
what null hypothesis is being tested with linear regression
there is no linear relationship between X and Y
regression coefficient=
slope
what is slope called in regression equation
regression coefficient
how do you test the assumptions of linear regression
- examine linearity assumption
- examine for constant variance for all levels (homoscedasticity)
- evaluate normal distribution
- evaluate independence assumption
how to do residual analysis
calculate the residual for each observation: the difference between its observed and predicted value
how to do graphical analysis of residuals
plot the residuals:
1. residuals vs independent
2. residuals vs predicted
3. residual vs order of the data
4. residual lag plot
5. histogram of the residuals
how to see independence of errors
Durbin-Watson statistic
what do you use the Durbin-Watson statistic for
to see independence of errors
how does the Durbin-Watson statistic tell you about independence of errors
if D=0, positive correlation
if D=2, no correlation
if D=4, negative correlation
Durbin-Watson: if D=0, what is correlation
positive correlation
Durbin-Watson: if D=2, what is correlation
no correlation
Durbin-Watson: if D=4, what is correlation
negative correlation
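A minimal sketch of the Durbin-Watson statistic, computed as the sum of squared differences between successive residuals divided by the sum of squared residuals. The residual sequences are invented to show the two extremes.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in the order the data were collected."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Long runs of the same sign: successive errors are positively correlated
d_positive = durbin_watson([1, 1, 1, 1, -1, -1, -1, -1])

# Sign flips at every step: successive errors are negatively correlated
d_negative = durbin_watson([1, -1, 1, -1, 1, -1, 1, -1])
```

The first sequence gives a value well below 2 and the second a value well above 2. With short sequences like these, the statistic does not quite reach the limits of 0 and 4.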
what is quantile-quantile (Q-Q) plot used for
it’s a technique for determining if two data sets come from populations with a common distribution
what does Q-Q plot do
plots the quantiles of the first data set against the quantiles of the second dataset
how to see normal distribution with quantiles
plot the theoretical quantiles on horizontal axis and the sample quantiles on vertical axis
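A sketch of how the points of a normal Q-Q plot could be built with the standard library. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the sample values and the function name.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)  # sample quantiles (vertical axis)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    # Theoretical quantiles (horizontal axis) at assumed positions (i - 0.5) / n
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical sample
sample = [4.1, 5.0, 5.3, 4.7, 6.2, 5.5, 4.9, 5.8, 5.1]
points = normal_qq_points(sample)
```

If the sample is roughly normal, these points fall close to a straight line when plotted.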
uniform distribution for Q-Q plot has what shape
S shape
can you compare models based on R2
not directly: r2 never decreases (and usually increases) when additional predictors are added to the model
- you can compare adjusted r2 over models with different numbers of parameters
what is adjusted r2
- increases when a new predictor is included only if the new predictor improves r2 more than would be expected by chance
- comparable over models with different number of parameters
what do you use to compare different models with different numbers of parameters
adjusted r2
adjusted r2 will always be ___ to R2
adjusted r2 will always be ≤ r2 (equal only when r2 = 1)
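The penalty for extra predictors can be seen in the usual adjusted R2 formula, sketched below. The numbers (R2 = 0.80, 20 observations, 3 predictors) are made up for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p predictors (intercept not counted)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_3_predictors = adjusted_r_squared(0.80, n=20, p=3)
adj_6_predictors = adjusted_r_squared(0.80, n=20, p=6)  # same R2, more predictors
```

For the same R2, the model with more predictors gets the lower adjusted value.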
which r2 value do you use to report
if you are comparing models, use adjusted r2
otherwise, report r2
what measures how much the fitted values in the model change when the nth data point is deleted
Cook’s distance
explain results from Cook’s D
- large D indicates that the data point strongly influences the fitted values
- if D>0.5 then that data point is worthy of further investigation as it may be influential
- if D>1, then that data point is quite likely to be influential
if Cook’s D is large, what does that mean
the data point strongly influences the fitted values
if Cook’s D is >0.5, what does that mean
the data point is worthy of further investigation
if Cook’s D > 1, what does that mean
the data point is likely to be influential
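A sketch of Cook's distance for simple linear regression, using the standard formula based on each point's residual and leverage. The data set is hypothetical: five points on the line y = x plus one point far out in x that does not follow the trend.

```python
def cooks_distances(x, y):
    """Cook's distance for each point in a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # parameters estimated: slope and intercept
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    b = mean_y - m * mean_x
    residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - p)  # mean squared residual
    # Leverage: how far each x value is from the mean of x
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e ** 2 / (p * mse) * h / (1 - h) ** 2
            for e, h in zip(residuals, leverages)]

x = [1, 2, 3, 4, 5, 10]
y = [1, 2, 3, 4, 5, 20]  # the last point sits far from the others
d = cooks_distances(x, y)
most_influential = d.index(max(d))
```

In this made-up example the last point has a Cook's D well above 1, while the other five stay small.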
what do you use if there are several factors affecting the dependent variable
multiple regression
what is multiple regression
estimates the relationship between variables, taking into account additional variables
function of multiple regression?
- controls for confounders
- tests for interactions between predictors
- improves predictions
how to know which predictor (coefficient) has the strongest effect in multiple regression
compare the coefficients in the regression equation. the larger the absolute value, the stronger its effect
what if predictors are in different units? how do we compare them to see which has strongest effect
standardize the variables
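Standardizing usually means converting each variable to z-scores (subtract the mean, divide by the standard deviation), so every predictor is measured in standard-deviation units. A minimal sketch with invented height data:

```python
from statistics import mean, stdev

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical predictor measured in centimetres
height_cm = [150, 160, 170, 180, 190]
z = standardize(height_cm)
```

After standardizing, the variable has mean 0 and standard deviation 1, whatever its original units were.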
does standardizing variables change p?
if you have an interaction term, standardizing variables can make the p values change for main effects but not for interaction effects
what is multicollinearity
an independent variable is highly correlated with another independent variable in a multiple regression equation
impacts of multicollinearity
- undermines the statistical significance of an independent variable
- reduces the precision of the estimated coefficients, which weakens the statistical power of the regression model
what is collinear
when two predictor variables express a linear relationship
if r>0.6 for correlations in multiple regression, what do you do
exclude them from the model
what is variance inflation factor
estimates how much the variance of a coefficient is inflated because of linear dependence with other predictors
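The variance inflation factor for a predictor is commonly computed as 1 / (1 - R2), where R2 comes from regressing that predictor on the other predictors. With only two predictors, that R2 is the squared correlation between them. The sketch below uses that two-predictor case with made-up correlations.

```python
def variance_inflation_factor(r_squared_j):
    """VIF for predictor j, given R2 from regressing predictor j on the other predictors."""
    return 1 / (1 - r_squared_j)

# Two-predictor case: R2 is the squared correlation between the predictors
vif_moderate = variance_inflation_factor(0.6 ** 2)  # predictors correlated at r = 0.6
vif_high = variance_inflation_factor(0.9 ** 2)      # predictors correlated at r = 0.9
```

A VIF of 1 means no inflation, and the value climbs quickly as the predictors become more strongly correlated.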
explain the difference between interactions and multicollinearity
interactions are about the joint effect of the X variables on Y
multicollinearity is about relationships between the X variables (ignoring Y)