Chapter 9 and 11.4 Flashcards
What is correlation?
The relationship between two quantitative variables. This relationship can also be tested for significance.
x = independent variable
y = dependent variable (predicted)
Pearson correlation
What is it?
What are the requirements?
Sample vs pop
The Pearson correlation coefficient, r, is a quantitative measure of the strength and
the direction of a linear relationship between two variables
Requirements:
- linear
- variables normally distributed
- quantitative
- goes from -1 to 1
±1 = perfect correlation
stronger the closer |r| is to 1
|r| > .7 = strong
r = sample
ρ (rho) = population
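A minimal sketch of how r can be computed from paired data, using the standard sum-of-products form. The function name `pearson_r` and the sample values are made up for illustration and do not come from the chapter.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y falls as x rises, so r should be close to -1.
hours = [1, 2, 3, 4, 5]
sick_days = [9, 8, 6, 5, 2]
r = pearson_r(hours, sick_days)
print(round(r, 4))
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength.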
Testing significance of Pearson correlation
- use a null hypothesis of no relationship (ρ = 0)
- from there make an alternate hypothesis that assumes there is one (ρ ≠ 0)
- two-tailed test
- Use appendix b table 11
- you are going to be using the absolute value of r
- If the absolute value of the obtained r is greater than the critical value,
then you have enough evidence to say that the correlation is significantly different from 0
Can you convert your r to a t score?
yes!
t = r / σ_r (σ_r = standard error of r)
t = r / sqrt((1 - r^2) / (n - 2))
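The conversion above can be sketched in a few lines. The helper name `r_to_t` is hypothetical, and the sample size n = 20 is an assumed value chosen only to show the arithmetic. The resulting t is compared against a t distribution with n - 2 degrees of freedom.

```python
import math

def r_to_t(r, n):
    """Convert a sample correlation r (from n pairs) to a t statistic."""
    if n <= 2:
        raise ValueError("need more than 2 pairs (degrees of freedom = n - 2)")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    standard_error = math.sqrt((1 - r ** 2) / (n - 2))
    return r / standard_error

# Assumed example: r = -0.612 from a hypothetical sample of 20 pairs.
t = r_to_t(-0.612, 20)
print(round(t, 2))
```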
Why does correlation not always= causation?
- There may be an x → y cause-and-effect relationship between the variables (e.g., x causes y).
- There may be a y → x cause-and-effect relationship between the variables (e.g., y causes x).
- The relationship between the variables may be caused by a third variable or by a combination
of several other variables.
* Variables that influence the variables being studied but are not included in the study are
called lurking variables.
- There may be a coincidence that produces the relationship between the two variables.
Spearman:
When to use
- can have ordinal variables
- A measure of the strength of the relationship between
two variables, measured using the ranks of paired
sample data entries
Must beat critical value to reject null and support claim
Formula:
rs = 1 - (6 * ∑d^2) / (n(n^2 - 1))
Spearman steps
- rank values from smallest (1) to largest
- If there’s a tie, give each tied value the average of the ranks they would occupy
ex. if the 6th and 7th values are the same, rank both as 6.5
- find the difference d between the ranks of each pair
square the differences
sum them
- |rs| greater than critical value = there is a relationship
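The steps above can be sketched as follows. The helper names `average_ranks` and `spearman_rs` are hypothetical. Note that the shortcut formula is exact only when there are no ties, so with tied ranks the result is an approximation.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # average of the occupied ranks
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    rank_x = average_ranks(xs)
    rank_y = average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))

print(average_ranks([10, 20, 20, 30]))
print(spearman_rs([1, 2, 3, 4], [40, 30, 20, 10]))
```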
Regression line
straight line of best fit with the slope defining how the change in one variable impacts a change in the other.
residual = difference between observed value and predicted value
you can square the residuals and sum them (this gives the unexplained variation)
total variation = Explained variation + Unexplained variation
Identify the false statement:
If r = 0.841 from one set of data and r = -0.841 from another set of data, we can say that the strengths of the linear relationships are equal.
If x and y were interchanged, the value of r would remain the same.
If r = 0, this indicates that there is no relationship between x and y.
If r = 0, this indicates that there is no linear relationship between x and y. It’s possible that there is some other nonlinear relationship, but r only measures the strength of linear relationships.
A sample has correlation coefficient r = 0.691.
What does this say about x and y?
A positive correlation means that the variables both increase together and decrease together.
When there is a significant linear correlation, this could be expressed three different ways:
ρ≠0
means there is a significant linear correlation.
ρ>0
means there is a significant positive linear correlation.
ρ<0
means there is a significant negative linear correlation.
When there is no significant linear correlation, that could be expressed three different ways. Notice that these are the counterparts of the three statements above:
ρ=0
means there is no significant linear correlation.
ρ≤0
means there is no significant positive linear correlation.
ρ≥0
means there is no significant negative linear correlation.
A sample is collected to determine if there is a relationship between the average weekly number of hours of exercise a person did in the past year and the number of times a person was sick in the past year. The sample correlation coefficient is r = -0.612.
What percent of the variation in y, the number of times a person is sick in a year, is explained by its relationship to x, the average number of hours of weekly exercise?
The percent variation in y that is explained by its relationship to x is called the coefficient of determination, whose value is r^2, where r is the correlation coefficient.
Since r^2 = (-0.612)^2 = 0.374544, we can say that 37.5% of the variation in the number of times a person is sick in one year is explained by the average number of hours of weekly exercise.
An economist performed a statistical study to see if there is a relationship between a person’s height and their income. In the article he published, he stated that each 1-inch increase meant about $1500 in extra income, on average.
Which statement below is true about the regression equation that links y = annual earnings (in thousands of dollars) to x = height (in inches)?
The slope of the line is the amount that y changes when x changes by one unit.
In this case, the income increases by $1500 for every inch that x increases.
Since y is measured in thousands of dollars, this means that y increases by $1.5 thousand for every 1-inch increase in x.
Thus, the slope is 1.5.
T or F
The greater the value of r, the stronger the linear relationship.
F
The strength of the linear relationship is determined by the magnitude |r|, not the signed value: r = -0.9 is a stronger relationship than r = 0.5.
If you are given the correlation coefficient how do you figure out what % of Y is explained by x?
r^2
The coefficient of determination is the ratio of:
The coefficient of determination is the proportion of variation that is explained by the relationship that y has to x.
The total variation is ∑(y_i − ȳ)^2
The explained variation is ∑(ŷ_i − ȳ)^2
The unexplained variation is ∑(y_i − ŷ_i)^2
It can be shown mathematically that total variation = explained variation + unexplained variation.
Dividing both sides by total variation, we have:
1 = Explained Variation/Total Variation + Unexplained Variation/Total Variation
Then, r^2 = Explained Variation/Total Variation.
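A small numeric check of the decomposition, sketched under the assumption that the predicted values come from an ordinary least-squares line. The function names and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def variation_parts(xs, ys):
    """Return (total, explained, unexplained) variation for the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    predicted = [intercept + slope * x for x in xs]
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((p - mean_y) ** 2 for p in predicted)
    unexplained = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    return total, explained, unexplained

# Hypothetical data, not from the chapter.
xs = [1, 2, 3, 4, 5]
ys = [9, 8, 6, 5, 2]
total, explained, unexplained = variation_parts(xs, ys)
r_squared = explained / total
print(total, explained, unexplained, r_squared)
```

For a least-squares line, the explained and unexplained parts should add up to the total, apart from floating-point rounding.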
Which of the following statements describes the range of possible values for r, the correlation coefficient?
−1≤r≤1
An observer visited Old Faithful, a geyser in Yellowstone National Park. The duration (x) of each eruption and the time (y) until the next eruption were recorded for 25 eruptions. After analyzing the data, the observer’s friend informed her that an eruption lasted 2.4 minutes.
The 95% prediction interval is (57.48, 69.79).
Which statement is false:
The estimated time between eruptions from the regression line is 63.635 minutes.
The margin of error is 6.155 minutes.
When the eruption lasts 2.4 minutes, there is a 95% chance that it takes between 57.48 and 69.79 minutes until the next eruption.
When the eruption lasts for 2.4 minutes, we are 95% confident that it takes between 57.48 and 69.79 minutes until the next eruption.
The interpretation of a 95% prediction interval is that 95% of the time when data are collected, the true value of y when x = 2.4 will be inside the interval.
To find the estimated time between eruptions when x = 2.4, find the average of the lower and upper limits of the prediction interval. This is (57.48+69.79)/2 = 63.635 minutes.
The margin of error is then the difference between 63.635 and one of the limits of the interval: 69.79 - 63.635 = 6.155 minutes.
Therefore, the statement “When the eruption lasts 2.4 minutes, there is a 95% chance that it takes between 57.48 and 69.79 minutes until the next eruption” is false.
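The arithmetic in this card can be checked with a short sketch. It assumes the interval is symmetric about the point estimate, and the function name is hypothetical.

```python
def interval_center_and_margin(lower, upper):
    """Point estimate (midpoint) and margin of error of a symmetric interval."""
    center = (lower + upper) / 2
    margin = (upper - lower) / 2
    return center, margin

# Limits of the 95% prediction interval given in the card.
center, margin = interval_center_and_margin(57.48, 69.79)
print(round(center, 3), round(margin, 3))
```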
The price of a car is modeled by the regression equation ŷ = 25,581 − 1315x1 − 72x2
where x1 is the age of the car (in years) and x2 is the number of miles that the car has been driven (in thousands). The equation is based on cars that were between 2 and 7 years old and were driven between 11 and 95 thousand miles.
Which statement is false?
If the number of miles is held constant and the car is one year older, the price decreases by $1315.
This model can be used to estimate the price of a car that is 11 years old with 50,000 miles.
If the age is held constant and another car has 1,000 more miles on it, the price decreases by $72.
This model can be used to estimate the price of a car that is 11 years old with 50,000 miles.
A regression model can only be used to estimate y when x1 and x2 are within the original data range.
Since the data used to produce this regression line were from cars between 2 and 7 years old, this regression equation cannot be used to estimate the value of a car that is 11 years old.
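One way to make the range restriction concrete is a sketch that refuses to extrapolate. The function name `predict_price` is hypothetical, while the coefficients and ranges are the ones stated in the card.

```python
def predict_price(age_years, miles_thousands):
    """Estimate price from the card's model, only inside the data range."""
    # The model was built from cars 2 to 7 years old ...
    if not (2 <= age_years <= 7):
        raise ValueError("age is outside the 2 to 7 year range of the data")
    # ... driven between 11 and 95 thousand miles.
    if not (11 <= miles_thousands <= 95):
        raise ValueError("mileage is outside the 11 to 95 thousand mile range")
    return 25581 - 1315 * age_years - 72 * miles_thousands

print(predict_price(5, 50))   # 5 years old, 50,000 miles: inside the range

try:
    predict_price(11, 50)     # 11 years old: outside the range
except ValueError as err:
    print("cannot estimate:", err)
```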
Coefficient of determination =
The ratio of the explained variation to the total variation.
Denoted by r^2
This shows how much of the variation is explained by one or more variables in a regression.
r^2 = Explained variation / Total variation
Two variables have a positive linear correlation. Is the slope of the regression line positive or negative?
positive
To predict y-values using a regression, what must be true about the correlation coefficient of the variables?
The correlation must be significant: |r| must be above the critical value.