Correlation and Regression (4) Flashcards

1
Q

Relationships between variables

define covariance

A

• relationships between variables matter because of cause and effect: one variable may force a different variable to behave in a systematic way
• as we move away from a lake, does the amount of snow decrease?
• do house prices rise as you move further from a major highway?
• this is known as covariance: as one variable changes, a different variable changes as well

2
Q

however, ______ does not necessarily imply a causal relationship

A

covariance

‘correlation does not imply causation’

• although your data analysis may suggest a strong relationship, you must interpret the results using logic to confirm that the relationship is real

3
Q

the primary tool for identifying covariance is a _____, or, with more than 2 variables, a ______ ____

A

scatter plot

scatterplot matrix

4
Q

scatterplots

A

scatterplots show the relationship between 2 variables, by plotting the x variable
against the y variable
• the x variable is always the independent variable, or the cause
• the y variable is always the dependent variable, or the effect

5
Q

the covariance value doesn't tell you much except that...

A

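if positive, like 32.3, the relationship is positive (as the independent variable increases, so does the dependent variable); if negative, the relationship is negative. The size of the value depends on the units of the variables, so on its own it does not show how strong the relationship is.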
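
A minimal Python sketch of the sample covariance may make this concrete. The distance and snowfall numbers and the function name `sample_covariance` are illustrative assumptions, not course data. Expressing the same distances in metres instead of kilometres makes the covariance 1000 times larger while the sign stays the same.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data: distance from a lake (km) and snowfall (cm)
distance_km = [1, 2, 3, 4, 5]
snow_cm = [50, 44, 41, 33, 30]

cov_km = sample_covariance(distance_km, snow_cm)
# Same distances in metres: only the units change, not the relationship
cov_m = sample_covariance([d * 1000 for d in distance_km], snow_cm)

print(cov_km, cov_m)  # both negative; cov_m is 1000 times cov_km
```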

6
Q

Pearson's correlation coefficient =

A

r

provides a measure of the strength of the relationship between the 2 variables; it can only be used for straight-line relationships

• to produce a more useful statistic, we can standardize the covariance to give us a value that falls between -1 and 1
• this is done by dividing the covariance by the product of the two standard deviations: r = cov(x, y) / (sx × sy)

r = 0 means no linear relationship

r = +1 means a perfect positive linear relationship

r = -1 means a perfect negative linear relationship
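
As a rough illustration, Pearson's r can be sketched in Python as the covariance divided by the product of the standard deviations. The function name `pearson_r` and all data below are made up for the example. The last case shows why r only describes straight-line relationships: y depends entirely on x, yet r is 0.

```python
import math

def pearson_r(x, y):
    """Pearson's r: sample covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])    # straight line, increasing
r_down = pearson_r(x, [10, 8, 6, 4, 2])  # straight line, decreasing
# y = x squared: a perfect relationship, but not a straight line
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(r_up, r_down, r_curve)
```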

7
Q

correlation analysis typically involves using a scatterplot and Pearson’s r to describe a
relationship, but normally we want to know if the relationship is statistically significant

for example, this relationship has r = -0.31, but we surely can’t look
at the variation in the data and say that it is a good relationship

A

in the example, we find that in fact this relationship is not
statistically significant, and therefore we might consider it
coincidental

• sample size plays a very important role here –
in general, smaller datasets need a higher r
value to be significant, and very large datasets
with a low r value could be significant
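
The effect of sample size can be sketched with the usual test statistic for a correlation coefficient, t = r × √(n − 2) / √(1 − r²), compared against Student's t with n − 2 degrees of freedom. The two sample sizes below are hypothetical (the card does not state n), and the function name `t_for_r` is made up for the example. The two-tailed 5% critical values are roughly 2.10 for 18 degrees of freedom and 1.97 for 198.

```python
import math

def t_for_r(r, n):
    """Test statistic for H0: no correlation; compare with Student's t, df = n - 2."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r = -0.31                  # the r value quoted on the card
t_small = t_for_r(r, 20)   # hypothetical small sample, n = 20
t_large = t_for_r(r, 200)  # hypothetical large sample, n = 200

# |t_small| falls below about 2.10, so not significant at the 5% level;
# |t_large| exceeds about 1.97, so the same r is significant with n = 200
print(t_small, t_large)
```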

8
Q

• ____ _____ plays a very important role here –
in general, ______ datasets need a higher r
value to be significant, and very ____ datasets
with a low r value could be significant

A

sample size

smaller

large

9
Q

Pearson’s r is a parametric measure of a relationship, but what if your data are nominal
or ordinal types and parametric measures don’t work?

A
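use a rank-based (non-parametric) measure for ranked data instead: Spearman's ρ or Kendall's τ, which share these properties:
  1. both are limited to the range [-1, +1]
  2. both coefficients are positive when an increase in X corresponds to an increase in Y, and negative when an increase in X corresponds to a decrease in Y
  3. a value near zero indicates that the values of X are uncorrelated with the values of Y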
10
Q

Spearman's ρ (rho)

A

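Spearman's ρ:

  • rank each variable separately from lowest to highest
  • for each observation, find the difference between its two ranks, so if an observation is ranked 7 in one variable and 8 in the other, the absolute difference is 1
  • square the rank differences (d) and sum them: ρ = 1 − 6Σd² / (n(n² − 1)), for n observations with no tied ranks
  • concordant pairs are pairs of observations whose ranks fall in the same order in both variables
  • discordant pairs are pairs whose ranks fall in opposite order across the variables (these pair counts are what Kendall's τ is built on)

then test for significance using a Z test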
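
The ranking steps can be sketched in Python using the standard rank-difference formula ρ = 1 − 6Σd² / (n(n² − 1)). This assumes there are no tied values; the data and the helper names `ranks` and `spearman_rho` are made up for the example.

```python
def ranks(values):
    """Rank from lowest (1) to highest (n); assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho from the squared differences between paired ranks."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical data: the y ranks are 1, 3, 2, 5, 4
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho = spearman_rho(x, y)
print(rho)  # sum of d squared is 4, so rho = 1 - 24/120
```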

11
Q

Kendall’s τ (tau)

A
• if both X and Y were perfectly correlated, their ranks should match perfectly
• when the pairs of ranks fall in order, they are concordant
• when the pairs of ranks are out of order, they are discordant
• in a table sorted by the ranks of X, compare each Y rank with the Y ranks below it: where the Y rank is 1, every rank below it is greater than 1, so those pairs are all concordant
• where the Y rank is 8 and 1 of the ranks below it is less than 8, there are 2 concordant pairs and 1 discordant pair

again, test for significance using a Z test
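
Counting concordant and discordant pairs can be sketched in Python as below. This is the simple tau-a form, (concordant − discordant) / total number of pairs, and it assumes no ties; the data and the function name `kendall_tau` are made up for the example.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Return (concordant, discordant, tau-a); assumes no tied values."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables order the pair the same way
            concordant += 1
        elif product < 0:    # the variables order the pair in opposite ways
            discordant += 1
    n = len(x)
    return concordant, discordant, (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ranks for five observations
x_ranks = [1, 2, 3, 4, 5]
y_ranks = [1, 3, 2, 5, 4]
n_conc, n_disc, tau = kendall_tau(x_ranks, y_ranks)
print(n_conc, n_disc, tau)
```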

12
Q

Correlation for nominal or categorical datasets

A

use a contingency table

• first, construct a table of expected frequencies: what would we expect the contingency table to look like if everything was random?
• also construct a table of observed frequencies
  • use the totals from each row and column to find the expected frequency of each cell
  • this gives you expected vs. observed

then calculate the χ² value based on the difference between the observed (f_o) and expected (f_e) frequencies:
χ² = Σ (f_o − f_e)² / f_e

then use Tschuprow’s T, Cramér’s V, or Pearson's contingency coefficient C to standardize to a value between 0 and 1
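
A small Python sketch of the χ² calculation follows, with Cramér's V as the standardized value. It uses the standard definition V = √(χ² / (n × (k − 1))), where k is the smaller of the number of rows and columns. The 2 × 2 table and the function names are made up for the example.

```python
import math

def chi_square(observed):
    """Chi-square from a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected count if rows and columns were unrelated
            f_e = row_totals[i] * col_totals[j] / total
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2

def cramers_v(observed):
    """Cramer's V: chi-square rescaled to the range 0 to 1."""
    n = sum(sum(row) for row in observed)
    k = min(len(observed), len(observed[0]))
    return math.sqrt(chi_square(observed) / (n * (k - 1)))

# Hypothetical 2 x 2 table of observed counts; every expected count is 20
observed = [[30, 10],
            [10, 30]]
chi2 = chi_square(observed)
v = cramers_v(observed)
print(chi2, v)
```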

13
Q

correlation can be strongly affected by _____ _______

A

spatial autocorrelation

  • we have to be careful in how we organize data
  • this is particular to spatial data
14
Q

simple linear Regression

A

• simple linear regression, which defines a straight-line relationship between two variables, is a first step towards modeling: the simulation of nature using equations
• a model is a simplification of reality, and can be used to understand a system, answer
questions about the system, or make predictions as to how the system may respond to a
stimulus

15
Q

regression model is

A

a regression model is a simple mathematical equation that simulates how y will respond
to a change in x

• regression can involve 2 variables (simple), more than 2 variables (multiple), and/or
non-linear relationships
• all regression models begin with a correlation analysis – this establishes the strength of
the relationship

16
Q

Steps in Regression Analysis

A
  1. correlation analysis
    - determines the strength of the relationship
  2. establish the nature of the relationship (e.g. how does day length affect air temperature?)
    - this is done by fitting a “line of best fit” to the relationship
    - this line is a simplification of the data: a regression model

17
Q

the regression model does 3 things to our dataset:

A
1. it provides a simplified view of the
relationship
2. it provides a means to evaluate the
importance of the variables
3. it provides an opportunity to make
predictions beyond the data set
18
Q

y=mx+b
or
y=a+bx

A

y=dependent variable
m=slope
x=independent variable
b=y intercept

y=dependent variable
a=y intercept
b=slope
x=independent variable

19
Q

least squares

A
regression analysis seeks to minimize the
average size of the residuals through a
process known as “least squares”, or
minimizing the sum of the squared
distances between each data point and
the line of best fit
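
Written out for the line y = a + bx, the quantity that least squares minimizes is the sum of squared residuals. The symbols SSE, e_i and ŷ_i are notation assumed here for illustration:

```latex
\mathrm{SSE} \;=\; \sum_{i=1}^{n} e_i^{2}
\;=\; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^{2}
\;=\; \sum_{i=1}^{n} \bigl(y_i - (a + b\,x_i)\bigr)^{2}
```

The least-squares line is the choice of a and b that makes this sum as small as possible.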
20
Q

m = r × (sy / sx)

sy = standard deviation of y
sx = standard deviation of x
r = Pearson's r

b = ȳ − m x̄ (the y intercept, where ȳ and x̄ are the means of y and x)

A

• you might notice that the numerator in the equation for the slope, Σ(x − x̄)(y − ȳ), is the same as the numerator in the equation for the correlation coefficient, demonstrating the link between correlation and regression analysis
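
As a sketch, the two formulas m = r × (sy / sx) and b = ȳ − m x̄ can be applied directly in Python. The data and the function name `fit_line` are made up for the example.

```python
import math

def fit_line(x, y):
    """Return (slope m, intercept b) using m = r * (s_y / s_x), b = mean_y - m * mean_x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((c - mean_y) ** 2 for c in y)
    r = sxy / math.sqrt(sxx * syy)    # Pearson's r
    s_x = math.sqrt(sxx / (n - 1))    # standard deviation of x
    s_y = math.sqrt(syy / (n - 1))    # standard deviation of y
    m = r * (s_y / s_x)
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
m, b = fit_line(x, y)
print(m, b)  # slope 0.6 and intercept 2.2, up to floating-point rounding
```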

21
Q

Results of a regression

A
  • there will always be some residuals in the chart
  • so you end up with variability that is explained by the equation and variability that remains unexplained

• ideally, regression analysis will maximize the explained variability and minimize the unexplained variability
• the proportion of the variability that is explained is called the coefficient of determination, r²

22
Q

coefficient of determination

A

the proportion of the variability that is explained is called the coefficient of determination, r²

the value of r² ranges from 0 (no variability is explained) to 1 (all of the variability is explained)
• we can also use r² as a percentage: if r² = 0.75, then we have explained 75% of the variability in the relationship, while leaving 25% unexplained

  • the bigger the sample, the lower r² will probably be, even though the model may still be good
  • the smaller the sample, the higher r² might be, even though the model might not be good
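
The explained and unexplained variability can be sketched in Python as below. The data are made up, and the slope 0.6 and intercept 2.2 are the least-squares values for these particular numbers.

```python
def r_squared(x, y, slope, intercept):
    """Coefficient of determination: 1 - unexplained / total variability."""
    mean_y = sum(y) / len(y)
    total = sum((yi - mean_y) ** 2 for yi in y)
    unexplained = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - unexplained / total

# Hypothetical data with its least-squares line y = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r2 = r_squared(x, y, slope=0.6, intercept=2.2)
print(r2)  # about 0.6: 60% of the variability explained, 40% unexplained
```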
23
Q

True or false: if the line of best fit has no slope (slope = 0), there is no linear relationship

A

T

24
Q

Confidence intervals drawn on the chart show the reader how many _______ fall outside of the lines

A

residuals

25
Q

of course, we would like an objective method to determine whether the coefficient of
determination is statistically significant or not

F = (r² × (n − 2)) / (1 − r²)

A

• notice that this test statistic is just the square of the test statistic in the correlation coefficient test, so the r and r² tests will always share the same results
• so, if Pearson’s r is significant, then so is r²
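
A quick numerical check of that link, assuming the correlation test statistic has the usual form t = r × √(n − 2) / √(1 − r²). The values of r and n and the function names are made up for the example.

```python
import math

def f_statistic(r, n):
    """F test for the coefficient of determination."""
    return r ** 2 * (n - 2) / (1 - r ** 2)

def t_statistic(r, n):
    """t test for the correlation coefficient (assumed standard form)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.6, 27  # hypothetical values
f_value = f_statistic(r, n)
t_value = t_statistic(r, n)
print(f_value, t_value ** 2)  # the two numbers agree
```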

26
Q

assumptions to consider when we apply simple regression analysis
to a data set (4)

A

• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of
best fit
• the assumption of linearity is important for both correlation and regression – if the
relationship is not obviously linear, it may still be intrinsically linear (a non-linear
relationship that can be transformed to linear); otherwise, it is an intrinsically nonlinear
relationship and must be represented by something other than a straight line

27
Q

why do a residual plot?

A

examining the residuals is a useful approach for interpreting the results of regression
analysis – most software provides the option of plotting residuals for you
• a residual plot should be a very boring looking plot – there
should be no trends or patterns, just a cloud of data
points
• a line of best fit through the residuals should yield no useful regression model; r² should not be significant
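
A minimal Python sketch of what "boring" means numerically, using made-up data and its least-squares line (slope 0.6, intercept 2.2, assumed values for this example): the residuals average to zero, and a line fitted through them has zero slope.

```python
# Hypothetical data with its least-squares line y = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept = 0.6, 2.2

# Residual = observed y minus the y predicted by the line
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

# Slope of a line of best fit drawn through the residuals against x
mean_x = sum(x) / len(x)
residual_slope = (
    sum((xi - mean_x) * (e - mean_residual) for xi, e in zip(x, residuals))
    / sum((xi - mean_x) ** 2 for xi in x)
)
print(mean_residual, residual_slope)  # both zero, up to floating-point rounding
```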

28
Q

Standard error

A

• the standard error can be thought of as the size of a typical residual, and since it is
measured in terms of y, it shares the same units as the dataset

• for example, in the day length vs air temperature data, the standard error is 7.6°C
• this means that, on average, there is an error of ±7.6°C associated with our
predictions made from the regression equation
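
A sketch of the calculation in Python, assuming the usual definition of the standard error as the square root of the sum of squared residuals divided by n − 2. The data, the line coefficients and the function name `standard_error` are made up for the example.

```python
import math

def standard_error(x, y, slope, intercept):
    """Typical residual size: sqrt(sum of squared residuals / (n - 2))."""
    squared = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(squared / (len(x) - 2))

# Hypothetical data with its least-squares line y = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
s_e = standard_error(x, y, slope=0.6, intercept=2.2)
print(s_e)  # in the same units as y
```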

29
Q

another way of assessing our regression model is to ask if the slope of the best fit line is
significantly different than 0

A

remember that the slope represents the rate of change of y as x changes, if the slope
is 0, or no different than 0, it tells us that y is not changing significantly with x

sb = se / (sx × √(n − 1)), equivalently sb² = se² / ((n − 1) × sx²)

t = (b − β) / sb

where b is the calculated slope, β is the hypothesized slope (= 0), sb is the standard deviation of the slope, se is the standard error, and sx is the standard deviation of the independent variable
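
The slope test can be sketched in Python as below. It assumes the standard error is computed with n − 2 in the denominator and takes sb as the square root of se² / ((n − 1) × sx²); the data and the function name `slope_t_test` are made up for the example.

```python
import math

def slope_t_test(x, y, hypothesized_slope=0.0):
    """Return (slope b, standard deviation of the slope s_b, t statistic)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                      # least-squares slope
    a = mean_y - b * mean_x            # intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))     # standard error (assumed n - 2 form)
    s_x = math.sqrt(sxx / (n - 1))     # standard deviation of x
    s_b = s_e / (s_x * math.sqrt(n - 1))
    t = (b - hypothesized_slope) / s_b
    return b, s_b, t

# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, s_b, t = slope_t_test(x, y)
print(b, s_b, t)  # compare t with Student's t, df = n - 2
```

With only five points the t value here (about 2.12) falls short of the two-tailed 5% critical value for 3 degrees of freedom (about 3.18), so this slope would not be judged significantly different from 0.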

30
Q

Assumptions for a simple regression analysis(4)

A

remember the assumptions for simple regression analysis
• the relationship between x and y is linear and the equation for a straight line
represents the model
• the residuals have a mean = 0 and their variance does not vary with x
• the residuals are all independent (they do not depend on one another)
• for each value of x, the residuals have a normal distribution centred on the line of
best fit

31
Q

when plotted, residuals should be _____ _____

A

normally distributed

32
Q

alternative residual plot

A

• an alternative residual plot has the x-axis = predicted value and y-axis = residual (or
standardized residual)