Correlation and Regression (4) Flashcards
Relationships between variables
define covariance
• the importance of this lies in the concept of cause-and-effect, or that one variable
forces a different variable to behave in a potentially systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance – as one variable changes, a different variable changes as
well
however, ______ does not necessarily always imply a causal relationship
covariance
‘correlation does not imply
causation’
• although your data analysis may suggest a strong relationship, you must interpret the results using logic to confirm that the relationship is real
the primary tool for identifying covariance is using a _____, or if using more than
2 variables, a ______ ____
scatter plot
scatterplot matrix
scatterplots
scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• the x variable is always the independent variable, or the cause
• the y variable is always the dependent variable, or the effect
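the covariance value doesn't tell you much except the direction of the relationship
if positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable); the size of the value depends on the units of the variables, so it does not measure strength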
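A minimal Python sketch of why the covariance mostly tells you the sign. The data values and the helper name `sample_covariance` are made up for illustration and are not from the course material.

```python
# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]  # independent variable
y = [2, 4, 5, 4, 5]  # dependent variable


def sample_covariance(xs, ys):
    """Sum of products of deviations from the two means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)


cov_xy = sample_covariance(x, y)
# Same data with y expressed in units 100 times smaller.
cov_rescaled = sample_covariance(x, [100 * v for v in y])

print(cov_xy)        # positive, so the relationship is positive
print(cov_rescaled)  # 100 times larger, although the relationship is unchanged
```

The sign is the same in both cases, but the magnitude changes with the units, which is why the value is standardized before it is interpreted.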
Pearson's correlation coefficient =
r
provides a measure of the strength of the
relationship between the 2 variables, which can only be used for straight line
relationships
• to produce a more useful statistic, we can standardize the covariance to give us a value that falls between -1 and 1
• this is done by dividing the covariance by the product of the standard deviations of the two variables
r=0 means no linear relationship
r=+1 means a perfect positive relationship
r=-1 means a perfect negative relationship
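A small Python sketch of the standardization step: Pearson's r computed as the covariance divided by the product of the two standard deviations. The data and the function name `pearson_r` are hypothetical.

```python
import statistics

# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def pearson_r(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


r = pearson_r(x, y)
r_rescaled = pearson_r(x, [100 * v for v in y])  # change the units of y

print(round(r, 4))           # 0.7746
print(round(r_rescaled, 4))  # 0.7746, unaffected by the change of units
```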
correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant
for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship
in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental
• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant
sample size
smaller
large
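To see the effect of sample size, the sketch below uses the usual t statistic for testing Pearson's r, t = r·sqrt(n − 2)/sqrt(1 − r²). The sample sizes 10 and 100 are hypothetical (the cards do not give n for the r = -0.31 example), and the critical values in the comments are the standard two-tailed 5% values from a t table.

```python
import math


def t_statistic_for_r(r, n):
    """t statistic for testing whether Pearson's r differs from 0 (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = -0.31
t_small = t_statistic_for_r(r, 10)   # hypothetical small sample
t_large = t_statistic_for_r(r, 100)  # hypothetical large sample

# Two-tailed critical values at the 5% level: about 2.306 (df = 8) and 1.984 (df = 98).
print(round(t_small, 2))  # -0.92, |t| < 2.306, not significant
print(round(t_large, 2))  # -3.23, |t| > 1.984, significant
```

The same r is not significant with 10 observations but is significant with 100.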
Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?
- both rank coefficients (Spearman's and Kendall's) are limited to the range [-1, +1]
- both coefficients are positive when an increase in X corresponds to an increase in Y, and negative when an increase in X corresponds to a decrease in Y
- a value near zero indicates that the values of X are uncorrelated with the values of Y
Spearman's ρ (rho)
Spearman's ρ:
- rank the values of each variable from lowest to highest
- find the difference between the two ranks for each observation, so if an observation is ranked 7 on one variable and 8 on the other, the absolute difference is 1
- concordant pairs are pairs of observations whose ranks fall in the same order on both variables
- discordant pairs are pairs of observations whose ranks fall in opposite order on the two variables
then test for significance using Z test
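A hedged sketch of the rank-difference calculation in Python. It uses the common shortcut formula ρ = 1 − 6Σd²/(n(n² − 1)), which is an assumption here because the cards do not state a formula, and it only holds when there are no tied values. The data and function names are made up.

```python
def ranks(values):
    """Rank values from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(xs, ys):
    """Spearman's rho from squared rank differences (no-ties shortcut formula)."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical observations with no ties.
x = [10, 20, 30, 40, 50]
y = [1.2, 3.4, 2.8, 6.0, 5.1]

print(ranks(y))                      # [1, 3, 2, 5, 4]
print(round(spearman_rho(x, y), 2))  # 0.8
```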
Kendall's τ (tau)
• if both X and Y were perfectly correlated, their ranks should match perfectly
• when the pairs of ranks fall in order, they are concordant
• when the pairs of ranks are out of order, they are discordant
• since every rank below this pair is greater than 1, they are all concordant
• since there is 1 rank below less than 8, there are 2 concordant and 1 discordant
again test for significance using z test
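Counting concordant and discordant pairs can be sketched in Python as follows. The coefficient is taken as (concordant − discordant) divided by the total number of pairs, the simplest form of Kendall's coefficient, which assumes no ties. The data and the function name `kendall_tau` are hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Return (concordant, discordant, tau) for paired data with no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:    # both variables order this pair the same way
            concordant += 1
        elif direction < 0:  # the two variables order this pair oppositely
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs


# Hypothetical observations with no ties.
x = [10, 20, 30, 40, 50]
y = [1.2, 3.4, 2.8, 6.0, 5.1]

print(kendall_tau(x, y))  # (8, 2, 0.6)
```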
Correlation for nominal or categorical datasets
• use a contingency table
• first, construct a table of expected frequencies: what would we expect the contingency table to look like if everything was random?
• also construct a table of observed frequencies
- use the totals from each row and column to find the expected frequencies (row total × column total / grand total)
- this gives you expected vs. observed
then calculate the χ² value based on the difference between observed and expected
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$
where f_o is an observed frequency and f_e is the matching expected frequency
then use Tschuprow's T, Cramér's V, or Pearson's C to standardize to a value between 0 and 1
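A minimal Python sketch of the contingency-table steps, using Cramér's V as the standardized measure. The 2 × 2 table of counts is invented for illustration, and V is computed with the usual formula sqrt(χ² / (n · min(rows − 1, columns − 1))).

```python
import math

# Hypothetical observed counts (rows = categories of X, columns = categories of Y).
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count in each cell if the two variables were unrelated.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared: sum of (observed - expected)^2 / expected over all cells.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

k = min(len(row_totals), len(col_totals)) - 1
cramers_v = math.sqrt(chi_squared / (n * k))

print(expected)               # [[20.0, 20.0], [30.0, 30.0]]
print(round(chi_squared, 2))  # 16.67
print(round(cramers_v, 2))    # 0.41
```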
correlation can be strongly affected by _____ _______
• can be strongly affected by spatial autocorrelation
- have to be careful in how we organize data,
- particular to spatial data
simple linear Regression
• simple, linear regression, defining a straight-line relationship between two
variables, is a first step towards modeling, the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus
regression model is
a regression model is a simple mathematical equation that simulates how y will respond
to a change in x
• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship
Steps in Regression Analysis
1. correlation analysis
- determines the strength of the relationship
2. establish the nature of the relationship, e.g. how does day length affect air temperature?
• this is done by fitting a “line of best fit” to the relationship
• this line is a simplification of the data – a regression model
the regression model does 3 things to our dataset:
1. it provides a simplified view of the relationship
2. it provides a means to evaluate the importance of the variables
3. it provides an opportunity to make predictions beyond the data set
y=mx+b
or
y=a+bx
y=dependent variable
m=slope
x=independent variable
b=y intercept
y=dependent variable
a=y intercept
b=slope
x=independent variable
least squares
regression analysis seeks to minimize the average size of the residuals through a process known as “least squares”, or minimizing the sum of the squared distances between each data point and the line of best fit
m = r × (sy/sx)
sy = standard deviation of y
sx = standard deviation of x
r = Pearson's r
b = ȳ − m·x̄ (ȳ and x̄ are the means of y and x)
• you might notice that the numerator in the equation for the slope (b in y=a+bx) is the same as the numerator in the equation for the correlation coefficient, demonstrating the link between correlation and regression analysis
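The slope and intercept formulas can be sketched in Python, using the y = mx + b notation (slope m = r × sy/sx, intercept b = mean of y minus m times mean of x). The data and the helper name `pearson_r` are hypothetical.

```python
import statistics

# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def pearson_r(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / (n - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


r = pearson_r(x, y)
m = r * statistics.stdev(y) / statistics.stdev(x)  # slope
b = statistics.mean(y) - m * statistics.mean(x)    # y intercept

print(round(m, 3), round(b, 3))  # 0.6 2.2

# Use the fitted line to predict y for a new x value.
print(round(m * 6 + b, 3))  # 5.8
```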
Results of regression: the numerator of the regression slope equation is the same as the numerator of the correlation equation
Results of a regression
- there will always be some kind of residuals in chart
- so you end up with things that are explained by the equation and things that remain unexplained
• ideally, regression analysis will maximize the explained variability and minimize the
unexplained variability
• the proportion of the variability that is explained is called the coefficient of determination, r²
coefficient of determination
the proportion of the variability that is explained is called the coefficient of determination, r²
the value of r² ranges from 0, no variability is explained, to 1, all of the variability is explained
• we can also use r² as a percentage – if r² = 0.75, then we have explained 75% of the variability in the relationship, while also leaving 25% unexplained
- the bigger the sample, the lower the coefficient of determination will probably be, even though the model may still be good
- the smaller the sample, the higher the coefficient of determination might be, even though the model might not be good
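A short Python sketch of splitting the variability into explained and unexplained parts. The data and the helper `fit_line` are invented for illustration; the slope is computed from sums of deviations, which gives the same line as the r × sy/sx form.

```python
# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(x, y)
predicted = [intercept + slope * a for a in x]
mean_y = sum(y) / len(y)

total = sum((b - mean_y) ** 2 for b in y)                    # total variability in y
unexplained = sum((b - p) ** 2 for b, p in zip(y, predicted))  # squared residuals
explained = total - unexplained

r_squared = explained / total
print(round(r_squared, 2))  # 0.6, so 60% explained and 40% unexplained
```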
T OR F
IF LINE HAS NO SLOPE THERE IS NO RELATIONSHIP
T
Confidence intervals put on the chart basically show the reader how many _______ are outside of the lines
residuals
of course, we would like an objective method to determine whether the coefficient of
determination is statistically significant or not
$F = \frac{r^2 (n-2)}{1 - r^2}$
• notice that this test statistic is just the square of the test statistic in the correlation coefficient test – the r and r² tests will always share the same results
• so, if Pearson's r is significant, then so is r²
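A quick numerical check in Python that the F statistic equals the square of the t statistic for r, taking t = r·sqrt(n − 2)/sqrt(1 − r²). The values r² = 0.6 and n = 5 are hypothetical.

```python
import math


def f_statistic(r_squared, n):
    """F statistic for the coefficient of determination in simple regression."""
    return r_squared * (n - 2) / (1 - r_squared)


def t_statistic(r, n):
    """t statistic for Pearson's r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r_squared, n = 0.6, 5  # hypothetical values
f_value = f_statistic(r_squared, n)
t_value = t_statistic(math.sqrt(r_squared), n)

print(round(f_value, 3), round(t_value ** 2, 3))  # 4.5 4.5
```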
assumptions to consider when we apply simple regression analysis
to a data set (4)
• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of
best fit
• the assumption of linearity is important for both correlation and regression – if the
relationship is not obviously linear, it may still be intrinsically linear (a non-linear
relationship that can be transformed to linear); otherwise, it is an intrinsically nonlinear
relationship and must be represented by something other than a straight line
why do a residual plot?
examining the residuals is a useful approach for interpreting the results of regression
analysis – most software provides the option of plotting residuals for you
• a residual plot should be a very boring looking plot – there
should be no trends or patterns, just a cloud of data
points
• a line of best fit through the residuals should yield no useful regression model, and r² should not be significant
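As an illustration of a residual plot that is not boring, the Python sketch below fits a straight line to deliberately curved, made-up data (y = x²) and lists the residuals. The helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical curved relationship: y = x squared.
x = [1, 2, 3, 4, 5]
y = [a ** 2 for a in x]

slope, intercept = fit_line(x, y)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals average to zero, but they are positive at both ends and negative in the middle. That U-shaped pattern suggests the straight-line assumption does not hold for these data.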
Standard error
• the standard error can be thought of as the size of a typical residual, and since it is
measured in terms of y, it shares the same units as the dataset
• for example, in the day length vs air temperature data, the standard error is 7.6°C
• this means that, on average, there is an error of ±7.6°C associated with our
predictions made from the regression equation
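A hedged Python sketch of the standard error as the size of a typical residual. It uses the common definition sqrt(sum of squared residuals / (n − 2)); that denominator is an assumption, since the cards do not give the formula. The data are made up and are not the day length vs air temperature data.

```python
import math

# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line.
slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
intercept = mean_y - slope * mean_x

# Residual = observed y minus predicted y, so it is in the units of y.
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
standard_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(standard_error, 3))  # 0.894
```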
another way of assessing our regression model is to ask if the slope of the best fit line is
significantly different than 0
remember that the slope represents the rate of change of y as x changes, if the slope
is 0, or no different than 0, it tells us that y is not changing significantly with x
$s_b = \sqrt{\frac{s_e^2}{(n-1)\, s_x^2}}$
$t = \frac{b - \beta}{s_b}$
where b is the calculated slope, β is the hypothesized slope (= 0), s_b is the standard deviation of the slope, s_e is the standard error, and s_x is the standard deviation of the independent variable
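The slope test can be sketched in Python as below. It assumes the square-root form of the standard deviation of the slope, s_b = s_e / sqrt((n − 1) · s_x²), with s_e computed using n − 2 in the denominator. The data are made up, and the critical value in the comment is the standard two-tailed 5% value for 3 degrees of freedom from a t table.

```python
import math
import statistics

# Hypothetical paired observations (made-up values for illustration only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = statistics.mean(x)
mean_y = statistics.mean(y)

# Calculated slope and intercept of the least-squares line.
b = sum((a - mean_x) * (v - mean_y) for a, v in zip(x, y)) / sum(
    (a - mean_x) ** 2 for a in x
)
a0 = mean_y - b * mean_x

# Standard error of the regression (typical residual size).
s_e = math.sqrt(sum((v - (a0 + b * a)) ** 2 for a, v in zip(x, y)) / (n - 2))

# Standard deviation of the slope, then t for a hypothesized slope of 0.
s_b = s_e / math.sqrt((n - 1) * statistics.variance(x))
t_value = (b - 0) / s_b

print(round(s_b, 3), round(t_value, 3))  # 0.283 2.121
# Two-tailed 5% critical value for n - 2 = 3 degrees of freedom is about 3.182,
# so this slope would not be judged significantly different from 0.
```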
Assumptions for a simple regression analysis(4)
remember the assumptions for simple regression analysis
• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of
best fit
when plotted, residuals should be _____ _____
normally distributed
alternative residual plot
• an alternative residual plot has the x-axis = predicted value and y-axis = residual (or
standardized residual)