unit 3 - ch 13 - correlation Flashcards

1
Q

It’s all about relationships:

A

x - y

2
Q

Correlation coefficient: terms

A

X-variable: independent variable, predictor variable
Y-variable: dependent variable, criterion variable, variable of interest

3
Q

ANOVA:

A

2 variables ~ 1 nominal (factor), 1 at least interval (criterion variable)

4
Q

Correlation:

A

2 variables ~ both variables at least interval

Notation:
Sample = r
Population = ρ (rho)

5
Q

r =

Correlation coefficient formula

A

Sample correlation coefficient

X variable: deviation of each x from its mean, x − x̄ (x bar)
Y variable: deviation of each y from its mean, y − ȳ (y bar)
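
A minimal Python sketch of how the two deviations combine into r, assuming the standard Pearson formula r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name `pearson_r` and the data are made up for illustration and are not from the card.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient built from deviations about the means."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]   # x - x bar
    dy = [y - y_bar for y in ys]   # y - y bar
    shared = sum(a * b for a, b in zip(dx, dy))
    total = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return shared / total

# Hypothetical data: hours studied (x) and exam score (y)
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 75]
r = pearson_r(hours, score)
print(round(r, 3))
```

The numerator is the shared variation of x and y; the denominator rescales it so that r stays between −1 and +1.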

6
Q

As the shared variation between the x variable and y variable increases, r approaches its upper or lower limit, respectively +1 and -1

A

+1 = perfect positive relationship
-1 = perfect negative relationship
0 = no linear relationship

7
Q

r is a measure of both

A

The strength of the relationship and
The direction of the X - Y relationship

8
Q

r has no unit of measurement

A

= unitless
r is not affected by the scale of the data
r values can be compared to each other
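
A small Python sketch of what "not affected by the scale of the data" means: changing the units of x or y leaves r the same. The helper `pearson_r` and all numbers are hypothetical, chosen only to illustrate the point.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    shared = sum(a * b for a, b in zip(dx, dy))
    return shared / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data: gas price (dollars per gallon) and miles driven per week
price_usd = [2.9, 3.1, 3.4, 3.8, 4.2]
miles = [210, 205, 190, 170, 150]

# The same observations on different scales: cents and kilometers
price_cents = [p * 100 for p in price_usd]
km = [m * 1.609344 for m in miles]

r_original = pearson_r(price_usd, miles)
r_rescaled = pearson_r(price_cents, km)
print(round(r_original, 4), round(r_rescaled, 4))  # the two values match
```

Because the units cancel, an r from one data set can be compared with an r from another, even when the variables are measured differently.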

9
Q

example of x and y

A

X variable: gas prices
Y variable: miles driven

10
Q

correlation coefficient examples

A

EX 1
Income y axis
Height x axis
Positive correlation
Taller people make more money on average ):

EX 2
Customer satisfaction y axis
Difficulty in product setup x axis
Negative
Ikea

EX 3
GPA y axis
Hat x axis
No correlation

EX 4
Control y axis
Speed x axis
Negative
Not all relationships are linear
Exponential, linear etc

EX 5
Performance y axis
Emotional involvement (stress) x axis
Curve (upside down U)
Two ends that are low
High end/peak

11
Q

Step 6 of HT
EX: Are married men living longer or dying slower? Why?

A

EX 1
Data:
Alcohol content and calories for 10 beers
Calculating
X = alcohol content
Y = calories
r = 0.957
Testing

Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0
H1: ρ ≠ 0

Step 2
α = 0.01

Step 3
TS = (observed − expected) / chance
TS = (r − ρ) / standard error of the correlation coefficient
TS = 9.97
ρ = 0 (from H0)

Step 4
df = n -2
df = 9
CV = +/- 2.62

Step 5
9.79 > 2.62 = reject
TS > CV = reject

Step 6
As the alcohol content of beer increases, the calories also increase. That is not to say alcohol causes calories; both are the result of the beer-making process. The conversion of sugar into alcohol during fermentation results in alcohol and calories. It is not a perfect correlation, as carbohydrates within beer also contain calories.

12
Q

Correlation vs. causation

A

r = 0.957

r increases =/= causation
High r does not mean x is causing y

X variable: length of our left arm
Y variable: length of our right arm

13
Q

cautionary tales

A
  1. sample size
  2. relationship change
  3. correlation is not causation
  4. not all relationships are linear
14
Q

cautionary notes: sample size

A

At least 10 data points for each x-variable and 10 for the y-variable
There can be multiple x-variables but only one y-variable

EXAMPLE
X = age of car
X = odometer miles
Y = selling price
10 points per x and 10 per Y = 30 data points

15
Q

cautionary notes: relationships change

A

Over time

Outside the range of data
Don’t apply a relationship found in a sample of younger people to a sample of older people

Across space
Geographical
Drop a model into a new place and sometimes it doesn’t hold up (American customers vs. Spanish customers)

16
Q

cautionary notes: correlation is not causation

A

EX
Cigarettes and cancer = correlation and causation
Vitamins and better health = correlation
Suntan lotion and coral reef bleaching = correlation and causation
Gran Turismo sales and Subaru Impreza sales = correlation and causation

Correlation → causation: liability, opportunity, beneficiary

17
Q

cautionary notes: not all relationships are linear

Curve relationship (linear, exponential etc.)

EX 1
Academic performance
Listening to classical music
No relationship

EX 2
X = Cars per 1000 people
Y = Overall average BMI
r = 0.63

A

Step 1
4 facets of the null = everything is unrelated
H0: ρ = 0
H1: ρ ≠ 0

Step 2
α = 0.10

Step 3
TS = 2.298

Step 4
DF = n -2
DF = 8
CV = 1.86

Step 5
2.298 > 1.86 = Reject
TS > CV = Reject

Step 6
Rather than car use leading to being overweight, perhaps people who are overweight are more likely to use cars (Y leads to X).
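
A hedged Python sketch of Steps 3 to 5 for this example. It assumes the standard test statistic for H0: ρ = 0, t = r·√(n − 2) / √(1 − r²), which the card does not spell out. The function name `t_statistic` is hypothetical; r, DF and CV are taken from the card, and n = 10 is inferred from DF = n − 2 = 8.

```python
import math

def t_statistic(r, n):
    """(observed - expected) / standard error, with expected rho = 0."""
    df = n - 2
    standard_error = math.sqrt((1 - r ** 2) / df)
    return (r - 0) / standard_error

r, n = 0.63, 10          # r from the card; n inferred from DF = 8
critical_value = 1.86    # CV from the card (alpha = 0.10, DF = 8)

ts = t_statistic(r, n)
reject_null = abs(ts) > critical_value
print(round(ts, 2), reject_null)
```

With r rounded to 0.63 the result lands close to the card's TS = 2.298; the small gap is consistent with rounding of r.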

18
Q

Getting to significance

A

r = √( t² / (t² + (n − 2)) )

Plug CV into t

EX
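From EX 2 on the previous card: 2.298 > 1.86 = Reject
TS > CV = Reject
Plugging CV = 1.86 and n − 2 = 8 into the formula gives r = 0.55, the smallest r that would be significant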

Sample size affects…
Significance, strength and practicality
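
The "plug CV into t" step can be sketched in Python as follows. The function name `r_from_t` is hypothetical; the inputs 1.86, 2.298 and n − 2 = 8 come from the cards.

```python
import math

def r_from_t(t, df):
    """r = sqrt(t^2 / (t^2 + (n - 2))), where df = n - 2."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

critical_r = r_from_t(1.86, 8)    # plug the CV into t
observed_r = r_from_t(2.298, 8)   # plug the TS into t

print(round(critical_r, 2))  # 0.55
print(round(observed_r, 2))  # 0.63
```

Plugging in the critical value gives the smallest r that reaches significance for this sample size; plugging in the test statistic recovers the sample r.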

19
Q

getting to significance

r = √( t² / (t² + (n − 2)) )

A

A higher sample size lowers the r needed for significance

Significance → statistical question
Strength → labeling (talk/write)
Practicality → business judgment
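
A rough Python illustration of how sample size changes the r needed for significance. The critical t values are the usual two-tailed table values for α = 0.05; that α is an assumption picked for this sketch, not a value from the card, and the names are hypothetical.

```python
import math

def r_from_t(t, df):
    """r = sqrt(t^2 / (t^2 + (n - 2))), where df = n - 2."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

# n -> two-tailed critical t at alpha = 0.05 with df = n - 2 (t-table values)
critical_t_by_n = {10: 2.306, 30: 2.048, 100: 1.984}

critical_r_by_n = {
    n: r_from_t(t, n - 2) for n, t in critical_t_by_n.items()
}
for n, r_needed in critical_r_by_n.items():
    print(n, round(r_needed, 3))
```

The r needed shrinks as n grows, so with a large sample even a weak r can be statistically significant. Whether such a weak relationship is practical is a separate judgment.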

20
Q

figuring out significance based on r

  1. high r - significance - strength - practicality
  2. low r - significance - strength - practicality
  3. low r - significance - strength - practicality
  4. middling r - significance - strength - practicality
A
  1. High r - Increase - YES - strong/high - useful
  2. Low r - Decrease - NO - weak/low - not useful
  3. Low r - Decrease - YES - weak/low - not useful
  4. Middling r - middle - YES - moderate - maybe
21
Q

bivariate data

A

It may be from two samples, but it is still a univariate variable. The type of data described in the examples above, and for any model of cause and effect, is bivariate data (“bi” for two variables). In reality, statisticians use multivariate data, meaning many variables.

21
Q

For our work we can classify data into three broad categories

A

time series data, cross-section data, and panel data

Time series data measure a single unit of observation (say a person, a company, or a country) as time passes.
A second type of data set is cross-section data. Here the variation is not across time for a single unit of observation, but across units of observation at one point in time.
A third type of data set is panel data. Here a panel of units of observation is followed across time. For example, we might follow 500 people, the units of observation, over ten years and observe their income, the price paid, and the quantity of the good purchased.

22
Q

The correlation coefficient, ρ (pronounced rho)

A

is the mathematical statistic for a population that provides us with a measurement of the strength of a linear relationship between the two variables.

23
Q

BUT ALWAYS REMEMBER THAT THE CORRELATION COEFFICIENT

A

DOES NOT TELL US THE SLOPE.

24
Q

the sign of the correlation coefficient tells us if the relationship

A

is a positive or negative (inverse) one. If all the values of X1 and X2 are on a straight line, the correlation coefficient will be either 1 or -1, depending on whether the line has a positive or negative slope. The closer the coefficient is to 1 or -1, the stronger the relationship between the two variables.

In panel (d) the variables obviously have some type of very specific relationship to each other, but the correlation coefficient is zero, indicating no linear relationship exists.
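
The figure is not reproduced here, so the Python sketch below uses an assumed stand-in: a symmetric U-shaped relationship, X2 = X1². The helper `pearson_r` and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    shared = sum(a * b for a, b in zip(dx, dy))
    return shared / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

x1 = [-3, -2, -1, 0, 1, 2, 3]
x2 = [x ** 2 for x in x1]     # X2 is completely determined by X1

r_curve = pearson_r(x1, x2)
print(r_curve)
```

X2 is fully predictable from X1, yet the positive and negative halves cancel, so r comes out at zero: no linear relationship, but clearly a relationship.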

25
Q

What the VALUE of r tells us:

A

The value of r is always between –1 and +1: –1 ≤ r ≤ 1.
The size of the correlation r indicates the strength of the linear relationship between X1 and X2. Values of r close to –1 or to +1 indicate a stronger linear relationship between X1 and X2.
If r = 0 there is absolutely no linear relationship between X1 and X2 (no linear correlation).
If r = 1, there is perfect positive correlation. If r = –1, there is perfect negative correlation. In both these cases, all of the original data points lie on a straight line: ANY straight line no matter what the slope. Of course, in the real world, this will not generally happen.
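
A short Python check of the "ANY straight line no matter what the slope" point, which is also why r does not tell us the slope. The helper `pearson_r` and the three lines are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient from deviations about the means."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    shared = sum(a * b for a, b in zip(dx, dy))
    return shared / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

x1 = [1, 2, 3, 4, 5]
shallow = [0.5 * x + 1 for x in x1]   # slope 0.5
steep = [50 * x - 3 for x in x1]      # slope 50
falling = [-2 * x for x in x1]        # slope -2

r_shallow = pearson_r(x1, shallow)
r_steep = pearson_r(x1, steep)
r_falling = pearson_r(x1, falling)
print(r_shallow, r_steep, r_falling)
```

The shallow and steep lines both give r = 1 and the falling line gives r = −1, so the size of r reflects how tightly the points follow a line, not how steep that line is.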

26
Q

What the SIGN of r tells us

A

A positive value of r means that when X1 increases, X2 tends to increase and when X1 decreases, X2 tends to decrease (positive correlation).
A negative value of r means that when X1 increases, X2 tends to decrease and when X1 decreases, X2 tends to increase (negative correlation).

27
Q

The sample correlation coefficient, r, is our estimate of the unknown population correlation coefficient.

A

ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)

28
Q

The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is “close to zero” or “significantly different from zero”. We decide this based on the sample correlation coefficient r and the sample size n.

A

If the test concludes that the correlation coefficient is significantly different from zero, we say that the correlation coefficient is “significant.”

Conclusion: There is sufficient evidence to conclude that there is a significant linear relationship between X1 and X2 because the correlation coefficient is significantly different from zero.

What the conclusion means: There is a significant linear relationship between X1 and X2.

If the test concludes that the correlation coefficient is not significantly different from zero (it is close to zero), we say that the correlation coefficient is “not significant”.

29
Q

null and alternate hypothesis

A

Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between X1 and X2 in the population.

Alternate Hypothesis Ha: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between X1 and X2 in the population.