Chapter 9 and 11.4 Flashcards

1
Q

What is correlation?

A

The relationship between two quantitative variables. The relationship can also be tested for significance.
x = independent (explanatory) variable
y = dependent (response) variable, the one being predicted

2
Q

Pearson correlation
What is it?
What are the requirements?

Sample vs pop

A

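The Pearson correlation coefficient, r, is a quantitative measure of the strength and
the direction of a linear relationship between two variables.
Requirements:
- the relationship is linear
- the variables are normally distributed
- the variables are quantitative
Range and strength:
- r goes from -1 to 1
- r = 1 or r = -1 is a perfect correlation
- the closer |r| is to 1, the stronger the relationship
- |r| > 0.7 = strong
Sample vs. population:
- r = sample correlation coefficient
- ρ = population correlation coefficient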
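
As an illustration of the definition above, here is a minimal Python sketch that computes the sample correlation r from paired data. The function name pearson_r and the sample data are hypothetical, chosen only for this example.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is exactly 2x, so the linear relationship is perfect.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]
r = pearson_r(hours, score)  # 1.0 for this data
```

The sketch does not guard against a variable that never changes; in that case the denominator is zero and r is undefined.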

3
Q

Testing significance of Pearson correlation

A
  • start with a null hypothesis of no relationship (ρ = 0)
  • then state an alternative hypothesis that there is a relationship (ρ ≠ 0)
  • this is a two-tailed test
  • use Appendix B, Table 11 to find the critical value
  • compare using the absolute value of r
  • if |r| is greater than the critical value, you have enough evidence
    to say that ρ is significantly different from 0
4
Q

Can you convert your r to a t score?

A

Yes!
t = r / σ_r = r / sqrt((1 - r^2) / (n - 2))
with n - 2 degrees of freedom
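
A short Python sketch of this conversion, assuming r comes from n data pairs. The function name r_to_t and the example values of r and n are hypothetical.

```python
import math

def r_to_t(r, n):
    """Convert a sample correlation r (from n pairs) to a t statistic with n - 2 d.f."""
    if n < 3:
        raise ValueError("need at least 3 pairs so that n - 2 is positive")
    if abs(r) >= 1:
        raise ValueError("t is undefined when |r| = 1")
    # Standard error of r is sqrt((1 - r^2) / (n - 2)).
    return r / math.sqrt((1 - r ** 2) / (n - 2))

# Hypothetical example: r = 0.6 from n = 18 pairs gives t of about 3.0.
t = r_to_t(0.6, 18)
```

The resulting t would then be compared with a critical value from a t table with n - 2 degrees of freedom.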

5
Q

Why does correlation not always equal causation?

A
  1. There may be an x → y cause-and-effect relationship between the variables (x causes y).
  2. There may be a y → x cause-and-effect relationship between the variables (y causes x).
  3. The relationship between the variables may be caused by a third variable or by a combination
    of several other variables.
    * Variables that influence the variables being studied but are not included in the study are
    called lurking variables.
  4. A coincidence may produce the relationship between the two variables.
6
Q

Spearman:
When to use

A
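  • can be used with ordinal variables
  • a measure of the strength of the relationship between
    two variables, calculated from the ranks of paired
    sample data entries
  • |rs| must exceed the critical value to reject the null hypothesis and support the claim

Formula:
rs = 1 - 6Σd^2 / (n(n^2 - 1))
where d is the difference between the ranks of a pair and n is the number of pairs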

7
Q

Spearman steps

A
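  1. Rank the values of each variable from smallest (1) to largest.
  2. If there is a tie, give each tied value the average of the ranks they would occupy
    (e.g., if the 6th and 7th values are equal, rank both as 6.5).
  3. Subtract to find the difference d between the ranks of each pair.
  4. Square the differences.
  5. Sum the squared differences and substitute into the formula for rs.
  6. If |rs| is greater than the critical value, there is a significant relationship.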
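
The ranking and summing steps can be sketched in Python as follows. The helper names average_ranks and spearman_rs and the data are hypothetical. The shortcut formula rs = 1 - 6Σd^2 / (n(n^2 - 1)) is used as written on the card, which is only approximate when there are many ties.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover every value tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rs(xs, ys):
    """Spearman rank correlation using the sum of squared rank differences."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical paired data: the sum of squared rank differences is 4, so rs is about 0.8.
xs = [10, 20, 30, 40, 50]
ys = [1, 3, 2, 5, 4]
rs = spearman_rs(xs, ys)
```

For example, average_ranks([5, 7, 7, 9]) gives [1.0, 2.5, 2.5, 4.0]: the two tied values share the average of ranks 2 and 3.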
8
Q

Regression line

A

The straight line of best fit; its slope describes how a change in one variable relates to a change in the other.
Residual = difference between the observed y-value and the predicted y-value.
The regression line is the line for which the sum of the squared residuals is a minimum.
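
A minimal least-squares sketch in Python, using only the definitions on this card. The function name fit_line and the four data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y-hat = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data points.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 9]
slope, intercept = fit_line(xs, ys)  # about 2.2 and -0.5

# Residual = observed y minus predicted y, one per data point.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sum_squared_residuals = sum(e ** 2 for e in residuals)  # about 1.8
```

For a least-squares line the residuals sum to zero (up to rounding), and no other straight line gives a smaller sum of squared residuals for the same data.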

9
Q

total variation=

A

total variation = Explained variation + Unexplained variation

10
Q

Identify the false statement:

If r = 0.8541 from one set of data and r = -0.8541 from another set of data, we can say that the strengths of the linear relationships are equal.

If x and y were interchanged, the value of r would remain the same.

If r = 0, this indicates that there is no relationship between x and y.

A

The third statement is false. r = 0 indicates only that there is no linear relationship between x and y. There may be some other, nonlinear relationship, but r measures only the strength of linear relationships.

11
Q

A sample has correlation coefficient r = 0.691.

What does this say about x and y?

A

Since r = 0.691 is positive, x and y have a positive correlation: the variables tend to increase together and decrease together.

12
Q

When there is a significant linear correlation, this could be expressed three different ways:

A

ρ≠0
means there is a significant linear correlation.

ρ>0
means there is a significant positive linear correlation.

ρ<0
means there is a significant negative linear correlation.

13
Q

When there is no significant linear correlation, that could be expressed three different ways. Notice that these are the counterparts of the three statements above:

A

ρ=0
means there is no significant linear correlation.

ρ≤0
means there is no significant positive linear correlation.

ρ≥0
means there is no significant negative linear correlation.

14
Q

A sample is collected to determine if there is a relationship between the average weekly number of hours of exercise a person did in the past year and the number of times a person was sick in the past year. The sample correlation coefficient is r = -0.612.

What percent of the variation in y, the number of times a person is sick in a year, is explained by its relationship to x, the average number of hours of weekly exercise?

A

The percent of variation in y that is explained by its relationship to x is given by the coefficient of determination, r^2, where r is the correlation coefficient.

Since r^2 = (-0.612)^2 = 0.374544, we can say that about 37.5% of the variation in the number of times a person is sick in one year is explained by the average number of hours of weekly exercise.

15
Q

An economist performed a statistical study to see if there is a relationship between a person’s height and their income. In the article he published, he stated that each 1-inch increase meant about $1500 in extra income, on average.
Which statement below is true about the regression equation that links y = annual earnings (in thousands of dollars) to x = height (in inches)?

A

The slope of the line is the amount that y changes when x changes by one unit.

In this case, income increases by $1500 for every inch that x increases.

Since y is measured in thousands of dollars, this means that y increases by 1.5 (thousand dollars) for every 1-inch increase in x.

Thus, the slope is 1.5.

16
Q

T or F
The greater the value of r, the stronger the linear relationship.

A

F
The strength of the linear relationship is determined by the magnitude (absolute value) of r, not its signed value. For example, r = -0.9 indicates a stronger linear relationship than r = 0.5.

17
Q

If you are given the correlation coefficient, how do you figure out what % of the variation in y is explained by x?

A

Square it: r^2, the coefficient of determination.

18
Q

The coefficient of determination is the ratio of:

A

The coefficient of determination is the proportion of the total variation in y that is explained by the relationship between x and y.

The total variation is Σ(y_i − ȳ)^2
The explained variation is Σ(ŷ_i − ȳ)^2

The unexplained variation is Σ(y_i − ŷ_i)^2

It can be shown mathematically that total variation = explained variation + unexplained variation.

Dividing both sides by total variation, we have:
1 = Explained Variation / Total Variation + Unexplained Variation / Total Variation

Then, r^2 = Explained Variation / Total Variation.
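
The decomposition can be checked numerically. The following Python sketch fits a least-squares line to a small hypothetical data set and computes the three sums. The function name variation_parts and the data are assumptions made for illustration.

```python
def variation_parts(xs, ys):
    """Return (total, explained, unexplained) variation about a least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    y_hat = [slope * x + intercept for x in xs]  # predicted y-values
    total = sum((y - mean_y) ** 2 for y in ys)
    explained = sum((yh - mean_y) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    return total, explained, unexplained

# Hypothetical data points.
total, explained, unexplained = variation_parts([1, 2, 3, 4], [2, 4, 5, 9])
r_squared = explained / total  # about 0.93 for this data
```

For this data the sums are about 26, 24.2, and 1.8, so explained plus unexplained equals total, as the card states. The identity holds for a least-squares line, not for an arbitrary line through the data.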

19
Q

Which of the following statements describes the range of possible values for r, the correlation coefficient?

A

−1≤r≤1

20
Q

An observer visited Old Faithful, a geyser in Yellowstone National Park. The duration (x) of each eruption and the time (y) until the next eruption were recorded for 25 eruptions. After analyzing the data, the observer’s friend informed her that an eruption lasted 2.4 minutes.

The 95% prediction interval is (57.48, 69.79).

Which statement is false:

The estimated time between eruptions from the regression line is 63.635 minutes.

The margin of error is 6.155 minutes.

When the eruption lasts 2.4 minutes, there is a 95% chance that it takes between 57.48 and 69.79 minutes until the next eruption.

When the eruption lasts for 2.4 minutes, we are 95% confident that it takes between 57.48 and 69.79 minutes until the next eruption

A

The interpretation of a 95% prediction interval is that 95% of the time when data are collected and an interval is built this way, the true value of y when x = 2.4 will be inside the interval. That is a statement of confidence in the method, not a 95% probability for this particular interval.

To find the estimated time between eruptions when x = 2.4, find the average of the lower and upper limits of the prediction interval. This is (57.48+69.79)/2 = 63.635 minutes.

The margin of error is then the difference between 63.635 and one of the limits of the interval: 69.79 - 63.635 = 6.155 minutes.

Therefore, the statement “When the eruption lasts 2.4 minutes, there is a 95% chance that it takes between 57.48 and 69.79 minutes until the next eruption” is false.
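
The arithmetic in this answer can be reproduced with a few lines of Python. The variable names are chosen only for this illustration; the interval limits are the ones given on the card.

```python
# Limits of the 95% prediction interval from the card, in minutes.
lower, upper = 57.48, 69.79

# The point estimate from the regression line sits at the center of the interval.
point_estimate = (lower + upper) / 2    # about 63.635

# The margin of error is half the width of the interval.
margin_of_error = (upper - lower) / 2   # about 6.155
```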

21
Q

The price of a car is modeled by the regression equation ŷ = 25,581 − 1315(x1) − 72(x2),
where x1 is the age of the car (in years) and x2 is the number of miles that the car has been driven (in thousands). The equation is based on cars that were between 2 and 7 years old and were driven between 11 and 95 thousand miles.

Which statement is false?
If the number of miles is held constant and the car is one year older, the price decreases by $1315.

This model can be used to estimate the price of a car that is 11 years old with 50,000 miles.

If the age is held constant and another car has 1,000 more miles on it, the price decreases by $72.

A

The false statement is: "This model can be used to estimate the price of a car that is 11 years old with 50,000 miles."

A regression model can only be used to estimate y when x1 and x2 are within the original data range.

Since the data used to produce this regression line were from cars between 2 and 7 years old, this regression equation cannot be used to estimate the value of a car that is 11 years old.
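
One way to make the range restriction concrete is to have the prediction function refuse inputs outside the data range. This Python sketch uses the coefficients and ranges from the card; the function name predict_price and the constant names are hypothetical.

```python
# Ranges of the data the model was built from (taken from the card).
AGE_RANGE = (2, 7)       # years
MILES_RANGE = (11, 95)   # thousands of miles

def predict_price(age_years, miles_thousands):
    """Predicted price in dollars; refuses to extrapolate outside the data range."""
    if not AGE_RANGE[0] <= age_years <= AGE_RANGE[1]:
        raise ValueError("age is outside the range of the original data")
    if not MILES_RANGE[0] <= miles_thousands <= MILES_RANGE[1]:
        raise ValueError("mileage is outside the range of the original data")
    return 25581 - 1315 * age_years - 72 * miles_thousands

# A 5-year-old car with 50,000 miles is inside both ranges.
price = predict_price(5, 50)  # 15406
```

Calling predict_price(11, 50) raises ValueError, which mirrors the answer above: an 11-year-old car is outside the 2 to 7 year range the model was built on.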

22
Q

Coefficient of determination =

A

The ratio of the explained variation to the total variation.

Denoted by r^2.
This shows how much of the variation is explained by one or more variables in a regression.
r^2 = Explained variation / Total variation

23
Q

Two variables have a positive linear correlation. Is the slope of the regression line positive or negative?

A

Positive. The slope of the regression line has the same sign as r.

24
Q

To predict y-values using a regression, what must be true about the correlation coefficient of the variables?

A

The correlation coefficient must be significant (|r| above the critical value).

25
Why is it not appropriate to use a regression line to predict y-values for x-values that are not in (or close to) the range of x-values found in the dataset?
The regression line is estimated only from x-values within the observed range, so predictions for x-values outside that range may be inaccurate. The line reflects how y varied with x in the sample only, and there is no evidence that the same pattern continues outside it.
26
The coefficient of determination r^2 is the ratio between which two types of variation?
Explained variation to total variation.
27
What does r^2 measure?
The percent of the variation in y explained by the x variable(s) in your regression model.
28
What does 1 - r^2 measure?
The percent of the variation in y not explained by the x variable(s) in your regression model. It may be due to other factors (error, bias, a third variable).
29
This equation was used to predict the stock price at the end of the year: ŷ = -86 + 7.46(x1) – 1.61(x2), where x1 = Total revenue (billions) and x2 = shareholders’ equity (in billions). Use the regression equation to predict the y-values when the independent variables are equal to: (a) x1 = 27.6, x2 = 15.3 (b) x1 = 24.1, x2 = 14.6
(a) Predicted y = -86 + 7.46(27.6) – 1.61(15.3) = $95.26. (b) Predicted y = -86 + 7.46(24.1) – 1.61(14.6) = $70.28
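The two substitutions can be checked with a short Python sketch. The function name predict_stock_price is hypothetical; the coefficients are the ones given in the question.

```python
def predict_stock_price(revenue_billions, equity_billions):
    """Predicted year-end stock price from the regression equation on the card."""
    return -86 + 7.46 * revenue_billions - 1.61 * equity_billions

# (a) x1 = 27.6, x2 = 15.3 and (b) x1 = 24.1, x2 = 14.6
price_a = round(predict_stock_price(27.6, 15.3), 2)  # 95.26
price_b = round(predict_stock_price(24.1, 14.6), 2)  # 70.28
```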
30
When should you use Spearman vs. Pearson correlation?
Use Pearson when:
- Both variables are numerical with a linear relationship.
- Data is normally distributed (no extreme outliers).
- Relationship is continuous and parametric (follows normality assumptions).
Use Spearman when:
- One or both variables are ordinal (ranked data).
- The relationship is nonlinear but monotonic (always increasing or always decreasing).
- Data has outliers or is not normally distributed.
Key difference:
- Pearson measures linear relationships.
- Spearman measures monotonic relationships (does not assume linearity).
31
What is the range of values for the correlation coefficient?
-1 to 1, inclusive
32
33
What is a residual? Explain when a residual is positive, negative, and zero.
A residual is the difference between the observed y-value of a data point and the predicted y-value on a regression line for the x-coordinate of the data point. A residual is positive when the point is above the line, negative when it is below the line, and zero when the observed y-value equals the predicted y-value.
34
Explain how to predict y-values using the equation of a regression line.
Substitute a value of x into the equation of the regression line and solve for ŷ.
35
Given a set of data and a corresponding regression line, describe all values of x that provide meaningful predictions for y.
Prediction values are meaningful only for x-values in (or close to) the range of the original data.
36
In order to predict y-values using the equation of a regression line, what must be true about the correlation coefficient of the variables?
The correlation coefficient must be significant to get an accurate prediction of y-values using the equation of a regression line.
37
How do you match the regression equation ŷ = -1.04x + 50.3 with the appropriate graph? What should you look for?
- A downward-sloping line.
- The line crosses the y-axis at approximately 50.3.
- y decreases by about 1.04 units for each 1-unit increase in x.
38
Give the symbol for each term: slope; y-intercept; the y-value of a data point corresponding to x_i; the y-value for a point on the regression line corresponding to x_i; the mean of y.
- Slope: m
- y-intercept: b
- The y-value of a data point corresponding to x_i: y_i
- The y-value for a point on the regression line corresponding to x_i: ŷ_i
- Mean of y: ȳ
39
What are the components of total variation in regression analysis?
- Total variation = sum of the squared differences between actual y-values and the mean of y.
- Explained variation = sum of the squared differences between predicted y-values and the mean of y. Represents what the regression model can explain (variation due to x).
- Unexplained variation = sum of the squared differences between actual y-values and predicted y-values. Represents variation due to other unknown factors (not explained by x).
Formula: Total Variation = Explained Variation + Unexplained Variation
40
Describe the explained variation about a regression line
The explained variation is the sum of the squares of the differences between the predicted y-values and the mean of the y-values of the ordered pairs.
41
Describe the unexplained variation about a regression line in words and in symbols.
The unexplained variation is the sum of the squares of the differences between the observed y-values and the corresponding predicted y-values. In symbols: Σ(y_i − ŷ_i)^2.
42
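The coefficient of determination r^2 is the ratio of which two types of variations? What does r^2 measure? What does 1 − r^2 measure?
The coefficient of determination is the ratio of the explained variation to the total variation. r^2 measures the proportion of the variation in y that is explained by the relationship between x and y; 1 − r^2 measures the proportion that is unexplained.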
43
Coefficient of determination
The coefficient of determination, r^2, is the percent of variation of y that is explained by the relationship between x and y.
44
1 − r^2
The proportion of the variation in y that is unexplained by the relationship between x and y.
45
What is the coefficient of determination for two variables that have perfect positive linear correlation or perfect negative linear correlation? Interpret your answer.
Two variables that have perfect positive or perfect negative linear correlation have a correlation coefficient of 1 or −1, respectively. In either case the coefficient of determination is 1, which means 100% of the variation in the response variable is explained by the variation in the explanatory variable.
46
What are some advantages of the Spearman rank correlation coefficient over the Pearson correlation coefficient?
The Spearman rank correlation coefficient can be used to describe the relationship between linear or nonlinear data. Also, it can be used for data at the ordinal level, and it is easier to calculate by hand than the Pearson correlation coefficient.