unit 3 - ch 13 - correlation Flashcards

Question 1

Q

It’s all about relationships:

Question 2

Q

Correlation coefficient: terms

Answer

A

X-variable: independent variable, predictor variable
Y-variable: dependent variable, criterion variable, variable of interest

Question 3

Q

ANOVA:

Answer

A

2 variables ~ 1 nominal (factor), 1 at least interval (criterion variable)

Question 4

Q

Correlation:

Answer

A

2 variables ~ both variables at least interval

Notation:
Sample = r
Population = ρ (rho)

Question 5

Q

r =

Correlation coefficient formula

Answer

A

sample correlation coefficient

X variable: x − x̄ (each x value minus the mean of x)
Y variable: y − ȳ (each y value minus the mean of y)
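
The card names the deviations but does not show the formula itself. A minimal sketch, assuming the standard Pearson form r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²); the function name `pearson_r` and the sample data are made up for illustration:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient built from deviations about the means."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    dx = [x - x_bar for x in xs]          # x - x bar
    dy = [y - y_bar for y in ys]          # y - y bar
    sum_dxdy = sum(a * b for a, b in zip(dx, dy))
    sum_dx2 = sum(a * a for a in dx)
    sum_dy2 = sum(b * b for b in dy)
    return sum_dxdy / math.sqrt(sum_dx2 * sum_dy2)

# Made-up data: y rises exactly 2 units for every 1 unit of x
print(pearson_r([1, 2, 3, 4], [3, 5, 7, 9]))
```

Because the made-up points fall exactly on a rising straight line, the result is a perfect positive correlation.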

Question 6

Q

As the shared variation between the x variable and y variable increases, r approaches its upper or lower limit, respectively +1 and -1

Answer

A

+1 = perfect positive relationship
-1 = perfect negative relationship
0 = no linear relationship

Question 7

Q

r is a measure of both

Answer

A

The strength of the relationship and
The direction of the X - Y relationship

Question 8

Q

r has no unit of measurement

Answer

A

= unitless
r is not affected by the scale of the data
r values can be compared to each other
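
A small sketch of what "unitless" means in practice: changing the units of either variable leaves r unchanged. The gas price and mileage figures are invented for illustration, and `pearson_r` is a hypothetical helper, not something from the course:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Invented data: gas price in dollars, distance driven in miles
price_dollars = [2.50, 3.00, 3.50, 4.00, 4.50]
miles = [120, 110, 105, 90, 80]

# Same data on different scales: cents instead of dollars, km instead of miles
price_cents = [p * 100 for p in price_dollars]
kilometers = [m * 1.609 for m in miles]

r_original = pearson_r(price_dollars, miles)
r_rescaled = pearson_r(price_cents, kilometers)
print(r_original, r_rescaled)
```

The two printed values agree up to floating-point rounding, which is why r values from differently scaled data sets can be compared directly.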

Question 9

Q

example of x and y

Answer

A

X variable: gas prices
Y variable: miles driven

Question 10

Q

correlation coefficient examples

Answer

A

EX 1
Income y axis
Height x axis
Positive correlation
Taller people make more money on average ):

EX 2
Customer satisfaction y axis
Difficulty in product setup x axis
Negative
Ikea

EX 3
Gpa y axis
Hat x axis
No correlation

EX 4
Control y axis
Speed x axis
Negative
Not all relationships are linear
Exponential, linear etc

EX 5
Performance y axis
Emotional involvement (stress) x axis
Curve (upside down U)
Two ends that are low
High end/peak

Question 11

Q

Step 6 of HT
EX: Are married men living longer or dying slower? Why?

Answer

A

EX 1
Data:
Alcohol content and calories for 10 beers
Calculating
X = alcohol content
Y = calories
r = 0.957
Testing

Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0.00
H1: ρ ≠ 0.00

Step 2
α = 0.01

Step 3
TS = (observed − expected) / chance
TS = (r − ρ) / standard error of the correlation coefficient
TS = 9.97
ρ = 0.00 (from H0)

Step 4
df = n -2
df = 9
CV = +/- 2.62

Step 5
9.79 > 2.62 = reject
TS > CV = reject

Step 6
As the alcohol content in beer increases, the calories also increase. That is not to say alcohol causes calories; both are the result of the beer-making process. The conversion of sugar into alcohol during fermentation results in both alcohol and calories. It is not a perfect correlation, as carbohydrates within beer also contain calories.
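
The Step 3 calculation can be sketched as below. This assumes n = 10 (from "10 beers") and the usual standard error of r, √((1 − r²) / (n − 2)), which the card does not spell out; `t_statistic` is a hypothetical helper name. The card's figures may have been computed from unrounded data, so the result need not match them exactly:

```python
import math

def t_statistic(r, n, rho0=0.0):
    """TS = (r - rho) / standard error of the correlation coefficient."""
    df = n - 2
    standard_error = math.sqrt((1 - r**2) / df)   # assumed textbook form
    return (r - rho0) / standard_error

t = t_statistic(0.957, 10)   # r from the card, rho = 0 from H0
print(round(t, 2))
```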

Question 12

Q

Correlation vs. causation

Answer

A

r = 0.957

r increases =/= causation
High r does not mean x is causing y

X variable: length of our left arm
Y variable: length of our right arm

Question 13

Q

cautionary tales

Answer

A

sample size
relationship change
correlation is not causation
not all relationships are linear

Question 14

Q

cautionary notes: sample size

Answer

A

At least 10 data points for each x-variable and 10 for the y-variable
There can be multiple x-variables but only one y-variable

EXAMPLE
X = age of car
X = odometer miles
Y = selling price
10 points per x and 10 per Y = 30 data points

Question 15

Q

cautionary notes: relationships change

Answer

A

Over time

Outside the range of data
Don’t apply a relationship found in a sample of younger people to a sample of older people

Across space
Geographical
Drop a model into a new location and it sometimes doesn’t hold up (American customers vs. Spanish customers)

Question 16

Q

cautionary notes: correlation is not causation

Answer

A

EX
Cigarettes and cancer = correlation and causation
Vitamins and better health = correlation
Suntan lotion and coral reef bleaching = correlation and causation
Gran Turismo sales and Subaru Impreza sales = correlation and causation

Correlation → causation: liability, opportunity, beneficiary

Question 17

Q

cautionary notes: not all relationships are linear

Curve relationship (linear, exponential etc.)

EX 1
Academic performance
Listening to classical music
No relationship

EX 2
X = Cars per 1000 people
Y = Overall average BMI
r = 0.63

Answer

A

Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0.00
H1: ρ ≠ 0.00

Step 2
α = 0.10

Step 3
TS = 2.298

Step 4
DF = n -2
DF = 8
CV = 1.86

Step 5
2.298 > 1.86 = Reject
TS > CV = Reject

Step 6
Rather than car use leading to overweight, perhaps people who are overweight are more likely to use cars (Y leads to X)
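
Steps 3 to 5 for this example can be sketched as follows. It assumes n = 10 (implied by DF = n − 2 = 8) and the usual form TS = r·√(n − 2) / √(1 − r²); the critical value is taken from the card. Because r is rounded to 0.63, the computed TS lands close to, but not exactly on, the card's 2.298:

```python
import math

r, n = 0.63, 10          # r from the card; n follows from DF = n - 2 = 8
critical_value = 1.86    # CV from the card (alpha = 0.10)

# Step 3: test statistic for H0: rho = 0
test_statistic = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)

# Step 5: reject when TS is beyond the critical value in either direction
reject_null = abs(test_statistic) > critical_value
print(f"TS = {test_statistic:.3f}, reject H0: {reject_null}")
```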

Question 18

Q

Getting to significance

Answer

A

r = √( t² / (t² + (n − 2)) )

Plug the CV in for t to get the smallest r that reaches significance

EX
From 2.298 > 1.86 = Reject (TS > CV = Reject)
Plug in CV = 1.86 and n = 10 (DF = 8)
r = 0.55

Sample size affects…
Significance, strength and practicality
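
A minimal sketch of the formula on this card, using the critical value 1.86 and DF = 8 (so n = 10) from the cars and BMI example; `critical_r` is a hypothetical helper name:

```python
import math

def critical_r(t_cv, n):
    """Smallest |r| that reaches significance: sqrt(t^2 / (t^2 + (n - 2)))."""
    return math.sqrt(t_cv**2 / (t_cv**2 + (n - 2)))

r_min = critical_r(1.86, 10)
print(round(r_min, 2))
```

Any sample r larger than this in absolute value would be significant at that critical value and sample size.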

Question 19

Q

getting to significance

r = √( t² / (t² + (n − 2)) )

Answer

A

The higher the sample size, the lower the r needed for significance

Significance → statistical question
Strength → labeling (talk/write)
Practicality → business judgment
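
The effect of sample size can be illustrated with the same formula. This sketch assumes a two-tailed test at α = 0.05 (a level chosen only for illustration, not taken from the cards), with critical t values copied from a standard t table:

```python
import math

# Two-tailed critical t values at alpha = 0.05 from a standard t table.
# Keys are sample sizes n, so df = n - 2 is 10, 20, 30 and 100.
T_CRITICAL_05 = {12: 2.228, 22: 2.086, 32: 2.042, 102: 1.984}

def critical_r(t_cv, n):
    """Smallest |r| that reaches significance for a given critical t and n."""
    return math.sqrt(t_cv**2 / (t_cv**2 + (n - 2)))

critical_rs = {n: critical_r(t_cv, n) for n, t_cv in T_CRITICAL_05.items()}
for n, r_min in critical_rs.items():
    print(f"n = {n:3d}: |r| must exceed about {r_min:.3f}")
```

The threshold falls steadily as n grows, so with a large enough sample even a weak r can be statistically significant, which is why strength and practicality are judged separately.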

Question 20

Q

figuring out significance based on r

high r - significance - strength - practicality
low r - significance - strength - practicality
low r - significance - strength - practicality
middling r - significance - strength - practicality

Answer

A

High r - Increase - YES - strong/high - useful
Low r - Decrease - NO - weak/low - not useful
Low r - Decrease - YES - weak/low - not useful
Middling r - middle - YES - moderate - maybe

Question 21

Q

bivariate data

Answer

A

it may be from two samples, but it is still a univariate variable. The type of data described in the examples above and for any model of cause and effect is bivariate data — “bi” for two variables. In reality, statisticians use multivariate data, meaning many variables

Question 22

Q

For our work we can classify data into three broad categories

Answer

A

time series data, cross-section data, and panel data

Time series data measures a single unit of observation; say a person, or a company or a country, as time passes.
A second type of data set is cross-section data. Here the variation is not across time for a single unit of observation, but across units of observation at one point in time.
A third type of data set is panel data. Here a panel of units of observation is followed across time. If we take our example from above we might follow 500 people, the unit of observation, through time, ten years, and observe their income, price paid and quantity of the good purchased.
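
One way to picture the three categories is by how many units and how many time periods each data set covers. The records below are invented purely for illustration, and the field names (`person`, `year`, `income`) are assumptions:

```python
# All values are made up for illustration only.

# Time series: one unit of observation followed as time passes
time_series = [
    {"person": "A", "year": 2021, "income": 50_000},
    {"person": "A", "year": 2022, "income": 52_000},
    {"person": "A", "year": 2023, "income": 55_000},
]

# Cross-section: many units observed at one point in time
cross_section = [
    {"person": "A", "year": 2023, "income": 55_000},
    {"person": "B", "year": 2023, "income": 61_000},
    {"person": "C", "year": 2023, "income": 47_000},
]

# Panel: many units, each followed across time
panel = time_series + [
    {"person": "B", "year": 2021, "income": 58_000},
    {"person": "B", "year": 2022, "income": 60_000},
    {"person": "B", "year": 2023, "income": 61_000},
]

def shape(records):
    """Return (number of distinct units, number of distinct time periods)."""
    units = {rec["person"] for rec in records}
    periods = {rec["year"] for rec in records}
    return len(units), len(periods)

for name, data in [("time series", time_series),
                   ("cross-section", cross_section),
                   ("panel", panel)]:
    n_units, n_periods = shape(data)
    print(f"{name}: {n_units} unit(s), {n_periods} period(s)")
```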

Question 23

Q

The correlation coefficient, ρ (pronounced rho)

Answer

A

is the mathematical statistic for a population that provides us with a measurement of the strength of a linear relationship between the two variables.

Question 24

Q

BUT ALWAYS REMEMBER THAT THE CORRELATION COEFFICIENT

Answer

A

DOES NOT TELL US THE SLOPE.

Question 25

Q

the sign of the correlation coefficient tells us if the relationship

Answer

A

is a positive or negative (inverse) one. If all the values of X1 and X2 are on a straight line, the correlation coefficient will be either 1 or -1, depending on whether the line has a positive or negative slope. The closer the coefficient is to 1 or -1, the stronger the relationship between the two variables.

In panel (d) the variables obviously have some type of very specific relationship to each other, but the correlation coefficient is zero, indicating no linear relationship exists.
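
A minimal sketch of a perfectly determined but nonlinear relationship that still gives r = 0. The data are made up (an upside-down U, y = −x²) and `pearson_r` is a hypothetical helper:

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

xs = [-2, -1, 0, 1, 2]
ys = [-(x ** 2) for x in xs]   # y is completely determined by x, but not linearly

r = pearson_r(xs, ys)
print(r)
```

The positive and negative products of deviations cancel exactly, so r measures no linear association even though y depends entirely on x.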

Question 26

Q

What the VALUE of r tells us:

Answer

A

The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
The size of the correlation r indicates the strength of the linear relationship between X1 and X2. Values of r close to –1 or to +1 indicate a stronger linear relationship between X1 and X2.
If r = 0 there is absolutely no linear relationship between X1 and X2 (no linear correlation).
If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line: ANY straight line no matter what the slope. Of course, in the real world, this will not generally happen.

Question 27

Q

What the SIGN of r tells us

Answer

A

A positive value of r means that when X1 increases, X2 tends to increase and when X1 decreases, X2 tends to decrease (positive correlation).
A negative value of r means that when X1 increases, X2 tends to decrease and when X1 decreases, X2 tends to increase (negative correlation).

Question 28

Q

The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.

Answer

A

ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)

Question 29

Q

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to zero” or “significantly different from zero”. We decide this based on the sample correlation coefficient r and the sample size n.

Answer

A

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is “significant.”

Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between X1 and X2 because the correlation coefficient is significantly different from zero.

What the conclusion means: There is a significant linear relationship between X1 and X2.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is “not significant”.

Question 30

Q

null and alternate hypothesis

Answer

A

Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between X1 and X2 in the population.

Alternate Hypothesis Ha: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between X1 and X2 in the population.
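
In symbols, the two hypotheses are usually written as below. This is the standard textbook notation for a two-tailed test of the population correlation coefficient ρ, offered as a compact restatement rather than as the wording of the original card:

```latex
H_0: \rho = 0 \qquad \text{vs.} \qquad H_a: \rho \neq 0
```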