Correlation and Regression (4) Flashcards
Relationships between variables
define covariance
• the importance of this lies in the concept of cause-and-effect, or that one variable
forces a different variable to behave in a potentially systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance – as one variable changes, a different variable changes as
well
however, ______ does not necessarily imply a causal relationship
covariance
‘correlation does not imply causation’
• although your data analysis may suggest a strong
relationship, you must interpret the results using
logic to confirm that the relationship is real
the primary tool for identifying covariance is using a _____, or if using more than
2 variables, a ______ ____
scatter plot
scatterplot matrix
scatterplots
scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• the x variable is always the independent variable, or the cause
• the y variable is always the dependent variable, or the effect
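the covariance value doesn’t tell you much except..
the direction of the relationship: if it is positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable). Its size depends on the units of the two variables, so it does not measure strength.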
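A minimal sketch of the sample covariance calculation, using made-up lake and snowfall numbers. The function name `sample_covariance` and all data values are assumptions for illustration only. It also shows why the size of the value is hard to interpret: changing the units of one variable changes the covariance but not its sign.

```python
def sample_covariance(x, y):
    """Sum of the products of deviations from each mean, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: distance from a lake (km) and snowfall (cm)
distance = [1, 2, 3, 4, 5]
snowfall_cm = [50, 42, 37, 30, 21]

cov_cm = sample_covariance(distance, snowfall_cm)

# Same snowfall expressed in millimetres: the relationship is unchanged,
# but the covariance is ten times larger in magnitude.
snowfall_mm = [s * 10 for s in snowfall_cm]
cov_mm = sample_covariance(distance, snowfall_mm)

print(cov_cm, cov_mm)  # both negative: snowfall decreases with distance
```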
Pearson’s correlation coefficient =
r
provides a measure of the strength of the
relationship between the 2 variables, which can only be used for straight line
relationships
• to produce a more useful statistic, we can standardize the covariance to give us a value
that falls between -1 and 1
- this is done by dividing the covariance by the product of the two variables’ standard deviations
r = 0 means no linear relationship
r = +1 means a perfect positive relationship
r = -1 means a perfect negative relationship
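As a sketch of how standardizing works, Pearson's r can be computed as the covariance divided by the product of the two standard deviations. The function name `pearson_r` and the data are illustrative assumptions. Points that lie exactly on a straight line give r = +1 or r = -1.

```python
import math

def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (sd_x * sd_y)

x = [1, 2, 3, 4, 5]                       # hypothetical independent variable
r_pos = pearson_r(x, [2, 4, 6, 8, 10])    # points on a rising straight line
r_neg = pearson_r(x, [10, 8, 6, 4, 2])    # points on a falling straight line

print(r_pos, r_neg)
```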
correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant
for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship
in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental
• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant
sample size
smaller
large
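The effect of sample size can be sketched with the usual t test for a correlation coefficient, t = r·sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom. The cards do not name a test for Pearson's r, so this choice is an assumption, and the sample sizes 10 and 1000 are hypothetical (the cards do not give n for the r = -0.31 example).

```python
import math

def t_statistic(r, n):
    """t statistic for testing whether Pearson's r differs from zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = -0.31
t_small = t_statistic(r, 10)     # hypothetical small sample
t_large = t_statistic(r, 1000)   # hypothetical large sample

print(round(t_small, 2), round(t_large, 2))
```

With n = 10 the statistic falls well inside the two-tailed 5% critical value for 8 degrees of freedom (about ±2.306), so the same r would not be significant. With n = 1000 it lies far beyond the large-sample cutoff of roughly ±1.96.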
Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?
- Spearman’s ρ and Kendall’s τ are both limited to the range [-1, +1]
- both coefficients are positive when an increase in X corresponds to an increase in Y, and negative when an increase in X corresponds to a decrease in Y
- a value near zero indicates that the values of X are uncorrelated with the values of Y
Spearman’s ρ
Spearman’s ρ:
- rank all values of each variable from lowest to highest
- find the absolute difference between the two variables’ rankings, so if an observation is given a ranking of 7 on one variable and 8 on the other, the absolute difference is 1
- concordant pairs: the number of pairs whose ranks fall in the same order on both variables
- discordant pairs: the number of pairs whose ranks fall in opposite order across the variables
then test for significance using a Z test
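A minimal sketch of the ranking step and the rank differences. It uses the common no-ties formula ρ = 1 - 6·Σd² / (n(n² - 1)), which squares the rank differences. That formula is an assumption here, since the card does not state one, and the helper names and data are hypothetical.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho from squared rank differences (no-ties formula)."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical observations
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rho = spearman_rho(x, y)
print(ranks(y), rho)
```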
Kendall’s τ
• if both X and Y were perfectly correlated, their ranks should match perfectly
• when the pairs of ranks fall in order, they are concordant
• when the pairs of ranks are out of order, they are discordant
• since every rank below this pair is greater than 1, they are all concordant
• since there is 1 rank below that is less than 8, there are 2 concordant and 1 discordant
again test for significance using a Z test
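Counting concordant and discordant pairs can be sketched as below. The final ratio, (concordant - discordant) divided by the total number of pairs, is the simplest form of Kendall's coefficient (often called tau-a) and assumes no ties. The function name and data are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """(concordant - discordant) / total pairs; assumes no tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:      # both variables order this pair the same way
            concordant += 1
        elif direction < 0:    # the two variables disagree on the order
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ranks for five observations
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

tau = kendall_tau(x, y)
print(tau)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so the coefficient is (8 - 2) / 10.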
Correlation for nominal or categorical datasets
- use a contingency table
• first, construct a table of expected frequencies – what would we expect the contingency table to look like if everything was random?
- also construct a table of observed frequencies
- use totals from each row and column to find the expected %
- gives you expected vs. actual
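then calculate the χ² value based on the difference between observed and expected
$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$, where $f_o$ is an observed frequency and $f_e$ is the matching expected frequency
then use Tschuprow’s T, Cramér’s V, or Pearson’s C to standardize to a value between 0 and 1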
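The contingency-table steps can be sketched as follows: expected frequencies come from the row and column totals, χ² sums the squared differences over the expected values, and Cramér's V rescales the result. The 2×2 table and the function names are hypothetical, and V = sqrt(χ² / (n(k - 1))), with k the smaller of the row and column counts, is the standard definition rather than something the cards spell out.

```python
import math

def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected count if rows and columns were unrelated
            f_e = row_totals[i] * col_totals[j] / total
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2

def cramers_v(observed):
    """Rescale chi-square to a value between 0 and 1."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed), len(observed[0]))
    return math.sqrt(chi_square(observed) / (n * (k - 1)))

# Hypothetical observed counts for two categorical variables
observed = [[30, 10],
            [10, 30]]

chi2 = chi_square(observed)
v = cramers_v(observed)
print(chi2, v)
```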
correlation can be strongly affected by _____ _______
spatial autocorrelation
- have to be careful in how we organize data
- particular to spatial data
Simple linear regression
• simple linear regression, which defines a straight-line relationship between two
variables, is a first step towards modeling, the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus
regression model is
a regression model is a simple mathematical equation that simulates how y will respond
to a change in x
• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship
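A minimal sketch of a simple linear regression model, y = intercept + slope·x, fitted by ordinary least squares. The cards do not name a fitting method, so least squares is an assumption, and the function name and data are hypothetical. Once fitted, the equation can be used to predict y for a new x.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a straight-line model."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: distance from a lake (km) and snowfall (cm)
distance = [1, 2, 3, 4, 5]
snowfall = [50, 42, 37, 30, 21]

slope, intercept = fit_line(distance, snowfall)

# Use the model to predict snowfall at a distance not in the data
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```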