Correlation and Regression (4) Flashcards by James Hamel

Relationships between variables

define covariance

• the importance of this lies in the concept of cause-and-effect, or that one variable
forces a different variable to behave in a potentially systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance – as one variable changes, a different variable changes as
well

however, ______ does not necessarily always imply a causal relationship

covariance

‘correlation does not imply
causation’

• although your data analysis may suggest a strong
relationship, you must interpret the results using
logic to confirm that the relationship is real

the primary tool for identifying covariance is using a _____, or if using more than
2 variables, a ______ ____

scatter plot

scatterplot matrix

scatterplots

scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• by convention, the x variable is the independent variable (the presumed cause)
• the y variable is the dependent variable (the presumed effect)

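the covariance value doesn't tell you much except that...

its sign gives the direction of the relationship: if positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable). The size of the value depends on the units of the two variables, so on its own it says little about the strength of the relationship.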
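As a rough illustration of why the sign is more informative than the size, the sketch below computes a sample covariance by hand. The data values and the helper name `sample_covariance` are made up for this example.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: distance from a lake (km) and snow depth (cm)
distance_km = [1, 2, 3, 4, 5]
snow_cm = [50, 44, 41, 35, 30]

cov_km = sample_covariance(distance_km, snow_cm)

# The same distances expressed in metres: the sign stays negative,
# but the value becomes 1000 times larger.
distance_m = [d * 1000 for d in distance_km]
cov_m = sample_covariance(distance_m, snow_cm)

print(cov_km, cov_m)
```

Both values are negative (snow decreases with distance), but their sizes differ only because the units changed.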

Pearson's correlation coefficient =

provides a measure of the strength of the
relationship between the 2 variables, which can only be used for straight line
relationships

• to produce a more useful statistic, we can standardize the covariance to give us a value
that falls between -1 and 1
• this is done by dividing the covariance by the product of the two standard deviations

r = 0 means no linear relationship

r = +1 means a perfect positive relationship

r = -1 means a perfect negative relationship
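A minimal sketch of the standardization, assuming sample (n - 1) versions of the covariance and standard deviations. The function name `pearson_r` and all data values are made up for illustration.

```python
import math

def pearson_r(x, y):
    """Covariance divided by the product of the two standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # points on a rising straight line
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # points on a falling straight line

# y = x squared: a clear pattern, but not a straight line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)
```

The curved example gives r = 0 even though y depends entirely on x, which is why r only describes straight-line relationships.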

correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant

for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship

in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental

• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
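To see the effect of sample size, the sketch below uses the usual t statistic for a correlation coefficient, t = r·√((n − 2) / (1 − r²)), which is an assumption here since the cards do not write it out. The two sample sizes are hypothetical; only r = -0.31 comes from the card.

```python
import math

def t_for_r(r, n):
    """t statistic for testing whether r differs from 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r = -0.31

t_small = t_for_r(r, 10)    # hypothetical small sample
t_large = t_for_r(r, 100)   # hypothetical large sample

# Two-tailed critical values at the 0.05 level, from a standard t table:
# 2.306 for 8 degrees of freedom, roughly 1.98 for 98 degrees of freedom.
print(abs(t_small) > 2.306)  # is the small-sample result significant?
print(abs(t_large) > 1.99)   # is the large-sample result significant?
```

With the same r, only the larger sample passes its critical value.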

• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant

sample size

smaller

large

Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?

for ordinal (ranked) data, use a rank-based coefficient: Spearman’s ρ or Kendall’s τ
both are limited to the range [-1, +1]
both coefficients are positive when an increase in X corresponds to an increase in Y,
and negative when an increase in X corresponds to a decrease in Y
a value near zero indicates that the values of X are uncorrelated with the values of
Y

Spearman’s ρ (rho)

Spearman’s ρ:

• rank the values of each variable from lowest to highest
• find the difference between the two ranks of each observation, so if x is given a ranking of 7 in one variable and 8 in another, the absolute difference is 1
• a concordant pair is a pair of observations whose ranks fall in the same order on both variables
• a discordant pair is a pair of observations whose ranks fall in opposite order across the variables

then test for significance using a Z test
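The ranking and rank-difference steps can be sketched as below. The cards do not give the final formula, so this assumes the standard textbook form ρ = 1 − 6Σd² / (n(n² − 1)), which holds when there are no tied values. The data and helper names are made up.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rho(x, y):
    rank_x = ranks(x)
    rank_y = ranks(y)
    n = len(x)
    # d is the rank difference per observation; squaring removes its sign
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]

rho = spearman_rho(x, y)
print(rho)
```

Because the differences are squared, using the absolute difference or the signed difference gives the same result.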

Kendall’s τ (tau)

• if both X and Y were perfectly
correlated, their ranks should
match perfectly
• when the pairs of ranks fall in
order, they are concordant
• when the pairs of ranks are
out of order, they are
discordant
• since every rank below this
pair is greater than 1, they
are all concordant
• since there is 1 rank below
less than 8, there are 2
concordant and 1 discordant

again test for significance using z test
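Counting concordant and discordant pairs can be sketched as follows. The cards do not state the final formula, so this assumes the simplest version, τ = (concordant − discordant) / (number of pairs), with no tied values. The data and function name are made up.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Return (tau, concordant, discordant) for paired data without ties."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x2 - x1) * (y2 - y1)
        if direction > 0:      # both variables order this pair the same way
            concordant += 1
        elif direction < 0:    # the two variables order this pair oppositely
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs, concordant, discordant

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

tau, n_concordant, n_discordant = kendall_tau(x, y)
print(tau, n_concordant, n_discordant)
```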

Correlation for nominal or categorical datasets

• use a contingency table

• first, construct a table of expected
frequencies – what would we expect
the contingency table to look like if
everything was random?
• also construct a table of observed frequencies

use the totals from each row and column to find the expected frequencies
this gives you expected vs. observed

then calculate the χ² value based on the difference between the observed frequencies ($f_o$) and the expected frequencies ($f_e$)
$\chi^2 = \sum \dfrac{(f_o - f_e)^2}{f_e}$

then use Tschuprow’s T, Cramér’s
V, or Pearson’s C to standardize to a value between 0 and 1
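A minimal sketch of the contingency-table steps, assuming the usual rule that each expected frequency is (row total × column total) / grand total, and the standard Cramér's V formula √(χ² / (n·(k − 1))) with k the smaller of the row and column counts. The observed table is made up.

```python
import math

def chi_square(observed):
    """Return (chi-square, grand total) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected if random
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2, n

# Hypothetical 2 x 2 table of observed frequencies
observed = [[30, 10],
            [10, 30]]

chi2, n = chi_square(observed)
k = min(len(observed), len(observed[0]))
cramers_v = math.sqrt(chi2 / (n * (k - 1)))

print(chi2, cramers_v)
```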

correlation can be strongly affected by _____ _______

• spatial autocorrelation

we have to be careful in how we organize data,
particularly spatial data

simple linear Regression

• simple, linear regression, defining a straight-line relationship between two
variables, is a first step towards modeling, the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus

regression model is

a regression model is a simple mathematical equation that simulates how y will respond
to a change in x

• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship

Steps in Regression Analysis

1. correlation analysis
- determines the strength of the relationship

2. establish the nature of
the relationship – e.g. how does day length affect air temperature?
• this is done by fitting a “line of best fit” to the relationship
• this line is a simplification of the data – a regression model

the regression model does 3 things to our dataset:

1. it provides a simplified view of the
relationship
2. it provides a means to evaluate the
importance of the variables
3. it provides an opportunity to make
predictions beyond the data set

y=mx+b
or
y=a+bx

y=dependent variable
m=slope
x=independent variable
b=y intercept

y=dependent variable
a=y intercept
b=slope
x=independent variable

least squares

regression analysis seeks to minimize the
average size of the residuals through a
process known as “least squares”, or
minimizing the sum of the squared
distances between each data point and
the line of best fit

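$m = r \cdot \dfrac{s_y}{s_x}$

$s_y$ = standard deviation of y
$s_x$ = standard deviation of x
r = Pearson’s r

$b = \bar{y} - m\bar{x}$ (where $\bar{x}$ and $\bar{y}$ are the means of x and y)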

• you might notice that the numerator in the equation for the slope (b in y = a + bx) is the same as the numerator
in the equation for the correlation coefficient, demonstrating the link between correlation and
regression analysis

in short: the numerator of the regression slope equation is the same as the numerator of the correlation equation
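A minimal sketch of fitting the line from Pearson's r and the two standard deviations, using the y = mx + b naming (slope = r × s_y / s_x, intercept = mean of y minus slope × mean of x). The data and the function name `fit_line` are made up for illustration.

```python
import math

def fit_line(x, y):
    """Return (slope m, intercept b, Pearson's r) for the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    r = sxy / math.sqrt(sxx * syy)   # sxy is the shared numerator
    s_x = math.sqrt(sxx / (n - 1))
    s_y = math.sqrt(syy / (n - 1))
    m = r * s_y / s_x                # slope
    b = mean_y - m * mean_x          # intercept
    return m, b, r

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b, r = fit_line(x, y)
print(m, b, r)
```

The same `sxy` term appears in both r and the slope, which is the link between correlation and regression that the card points to.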

Results of a regression

there will always be some residuals in the chart,
so you end up with variability that is explained by the equation and variability that remains unexplained

• ideally, regression analysis will maximize the explained variability and minimize the
unexplained variability
• the proportion of the variability that is explained is called the coefficient of
determination, r²

coefficient of determination

the proportion of the variability that is explained is called the coefficient of
determination, r²
• the value of r² ranges from 0, no variability is explained, to 1, all of the variability
is explained
• we can also use r² as a percentage – if r² = 0.75, then we have explained 75% of the
variability in the relationship, while also leaving 25% unexplained

with a bigger sample, a lower r² can still be a good (significant) result
with a smaller sample, even a higher r² might not be a good (significant) result
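The split into explained and unexplained variability can be sketched as below, taking "variability" to mean sums of squared deviations (an assumption, since the cards do not define it). The data are made up.

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line through the data
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
predicted = [slope * xi + intercept for xi in x]

# Total variability of y around its mean
total = sum((yi - mean_y) ** 2 for yi in y)
# Unexplained variability: squared residuals around the line
unexplained = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
explained = total - unexplained

r_squared = explained / total
print(r_squared)
```

Here the line explains 60% of the variability in y and leaves 40% unexplained.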

T OR F

IF LINE HAS NO SLOPE THERE IS NO RELATIONSHIP

Confidence intervals put on the chart basically show the reader how many _______ are outside of the lines

residuals

of course, we would like an objective method to determine whether the coefficient of determination is statistically significant or not

F-test: $F = \dfrac{r^2 (n - 2)}{1 - r^2}$

• notice that this test statistic is just the square of the test statistic in the correlation coefficient test – the r and r² tests will always share the same results
• so, if Pearson’s r is significant, then so is r²
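A quick numerical check of that claim, assuming the correlation test statistic is t = r·√((n − 2) / (1 − r²)) and F = r²(n − 2) / (1 − r²). The values of r and n are arbitrary and chosen only for illustration.

```python
import math

def t_statistic(r, n):
    """Test statistic for Pearson's r."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

def f_statistic(r, n):
    """Test statistic for the coefficient of determination."""
    return r ** 2 * (n - 2) / (1 - r ** 2)

r, n = -0.31, 30  # hypothetical values

t = t_statistic(r, n)
f = f_statistic(r, n)

print(t ** 2, f)  # the two numbers agree up to rounding error
```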

assumptions to consider when we apply simple regression analysis to a data set (4)

• the relationship between x and y is linear and the equation for a straight line represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of best fit

• the assumption of linearity is important for both correlation and regression – if the relationship is not obviously linear, it may still be intrinsically linear (a non-linear relationship that can be transformed to linear); otherwise, it is an intrinsically nonlinear relationship and must be represented by something other than a straight line

why do a residual plot?

examining the residuals is a useful approach for interpreting the results of regression analysis – most software provides the option of plotting residuals for you
• a residual plot should be a very boring looking plot – there should be no trends or patterns, just a cloud of data points
• a line of best fit through the residuals should yield no useful regression model, r² should not be significant
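The "boring plot" idea can be checked numerically without drawing anything: for a least-squares line, the residuals average to zero and a second line fitted through them has zero slope. The sketch below uses made-up data and a hypothetical helper `slope_and_intercept`.

```python
def slope_and_intercept(x, y):
    """Least-squares slope and intercept for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b = slope_and_intercept(x, y)

# Residual = observed y minus the y predicted by the line
residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

# Fit a second line through the residuals themselves
residual_slope, _ = slope_and_intercept(x, residuals)

print(mean_residual, residual_slope)  # both are zero up to rounding error
```

A real residual plot is still worth looking at, since curved patterns or changing spread will not show up in these two numbers.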

Standard error

• the standard error can be thought of as the size of a typical residual, and since it is measured in terms of y, it shares the same units as the dataset
• for example, in the day length vs air temperature data, the standard error is 7.6°C
• this means that, on average, there is an error of ±7.6°C associated with our predictions made from the regression equation
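The cards do not give a formula for the standard error, so the sketch below assumes the usual textbook definition: the square root of the sum of squared residuals divided by n − 2. The data are made up and are not the day length data from the card.

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares line
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x

# Sum of squared residuals around the line
sum_sq_residuals = sum((yi - (slope * xi + intercept)) ** 2
                       for xi, yi in zip(x, y))

# Assumed definition: divide by n - 2 because two values (slope and
# intercept) were estimated from the data
standard_error = math.sqrt(sum_sq_residuals / (n - 2))
print(standard_error)  # in the same units as y
```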

another way of assessing our regression model is to ask if the slope of the best fit line is significantly different than 0

remember that the slope represents the rate of change of y as x changes; if the slope is 0, or no different than 0, it tells us that y is not changing significantly with x

$s_b = \sqrt{\dfrac{s_e^2}{(n-1)\,s_x^2}}$

$t = \dfrac{b - \beta}{s_b}$

where b is the calculated slope, β is the hypothesized slope (= 0), $s_b$ is the standard deviation of the slope, $s_e$ is the standard error, and $s_x$ is the standard deviation of the independent variable
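A minimal sketch of the slope test on made-up data. It assumes the standard error is √(sum of squared residuals / (n − 2)) and that the standard deviation of the slope is s_e / (s_x·√(n − 1)), which is the square root of s_e² / ((n − 1)·s_x²).

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)

# Calculated slope and intercept of the least-squares line
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
a = mean_y - b * mean_x

# Standard error of the regression (assumed n - 2 definition)
s_e = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
# Sample standard deviation of the independent variable
s_x = math.sqrt(sxx / (n - 1))

# Standard deviation of the slope, then t for a hypothesized slope of 0
s_b = s_e / (s_x * math.sqrt(n - 1))
hypothesized_slope = 0.0
t = (b - hypothesized_slope) / s_b

print(s_b, t)  # compare |t| with a t-table value for n - 2 degrees of freedom
```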

Assumptions for a simple regression analysis(4)

remember the assumptions for simple regression analysis
• the relationship between x and y is linear and the equation for a straight line represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of best fit

when plotted, residuals should be _____ _____

normally distributed

alternative residual plot

• an alternative residual plot has the x-axis = predicted value and y-axis = residual (or standardized residual)

Correlation and Regression (4) Flashcards

(32 cards)