Stats Exam #2 Flashcards
When interpreting a graph, what is the correct order: variable () versus variable ()
Y versus X
There are four things to look for in scatterplots. What are they?
Direction ( + or -)
Form (linear/non)
Strength (strong/weak)
Unusual Features (outliers, groups, clusters)
What are the three Correlation Conditions?
Quantitative Variables (2 quants)
Straight Enough
No outliers
What is the Linear Model equation? Interpret its components.
y hat = b0 + b1x
y hat: predicted value
b0: y-intercept
b1: slope (the coefficient of x)
The Linear Model is the model that “______” the data.
best fits
True or False: The line of “best fit” has the LEAST error.
True
What is the Residual equation, and what does it do?
Residual = observed value - predicted value
e = y - y hat
This explains the errors in the model.
If the linear model fits well, residuals will all be close to what number?
0
Ex: A popular food item has 31 grams of protein, and the model predicts 36.6 grams of fat for it. It actually has 22 grams of fat. If X = grams of protein, Y = grams of fat, and the line of fit for the model is Fat hat = 8.4 + 0.91(protein), calculate the residual for this observation & interpret.
y hat = 8.4 + 0.91(31) = 36.6
e = 22 - 36.6 = (-14.6)
The actual fat content is 14.6 grams less than what the model predicts (the model overpredicted).
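The arithmetic on this card can be checked with a short sketch. The function names are made up for illustration; the coefficients and the observation come from the card.

```python
# Fitted line from the card: fat_hat = 8.4 + 0.91 * protein
def predict_fat(protein_g):
    """Predicted grams of fat for a given number of grams of protein."""
    return 8.4 + 0.91 * protein_g

def residual(observed, predicted):
    """Residual = observed value - predicted value."""
    return observed - predicted

fat_hat = predict_fat(31)   # predicted fat for 31 g of protein
e = residual(22, fat_hat)   # the item actually has 22 g of fat

print(round(fat_hat, 1), round(e, 1))  # 36.6 -14.6
```

A negative residual means the observed value sits below the line, so the model overpredicted.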
True or False: Line with the MOST residual value is the linear model.
False
The std. dev. of the model is the distance from y-bar. When finding std. dev. what is done to the residual values?
They are squared to make all values positive. The best-fitting line has the smallest sum of squared residuals.
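A minimal sketch of the "least squares" idea, using a tiny data set invented only for this illustration (it is not from the course). It compares the sum of squared residuals for the least-squares line of that data against an arbitrary competing line.

```python
# Hypothetical data, invented only for this illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sum_squared_residuals(b0, b1):
    """Add up (y - y_hat)^2 over every point for the line y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

sse_least_squares = sum_squared_residuals(2.2, 0.6)  # least-squares line for this data
sse_other = sum_squared_residuals(1.0, 1.0)          # an arbitrary competing line

print(round(sse_least_squares, 2), round(sse_other, 2))  # 2.4 4.0
```

Any other choice of b0 and b1 for this data gives a sum of squared residuals larger than 2.4.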
Interpret b0 and b1.
b0 = y-intercept and is where the line crosses the y-axis.
b1 = slope of the line; it tells how much y hat changes for each one-unit increase in x.
Interpret this linear model’s slope. Fat hat = 8.4 + 0.91(protein)
b1 = 0.91 grams of fat per gram of protein
For every additional gram of protein, expect there to be an additional 0.91 grams of fat, on average.
Interpret this linear model’s y-intercept. Fat hat = 8.4 + 0.91(protein)
b0 = 8.4 grams of fat
An item with 0 grams of protein would be expected to have 8.4 grams of fat.
True or False: A value of X = 0 (and so the y-intercept) can be practical or non-practical.
True
Ex:
y = avg. home game attendance per year.
x = number of wins per year
Is X = 0 practical?
No, because no professional team has ever lost every home game.
Ex:
y = total # of hours on the internet per month
x = # of Facebook friends
Is X = 0 practical?
Yes, because one can be on the internet and not use Facebook.
How do you find b1? Will slope direction/sign match correlation coefficient sign?
Correlation times the ratio of the standard deviations (std. dev. of y divided by std. dev. of x).
b1 = r(Sy/Sx)
Yes, the signs will match. A negative r makes for a negative slope and vice versa.
How do you find b0?
b0 = y bar (avg. of y) minus b1 (slope) times x bar (avg. of x)
b0 = y bar - (b1 times x bar)
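The two formulas, b1 = r(Sy/Sx) and b0 = y bar - (b1 times x bar), can be traced on a small data set. The data below is hypothetical and only there to make the steps concrete; r is computed by hand so the sketch does not depend on a particular Python version.

```python
from statistics import mean, stdev

# Hypothetical data, invented only for this illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)   # averages of x and y
s_x, s_y = stdev(xs), stdev(ys)     # sample standard deviations

# Correlation coefficient r, from its definition.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope: same sign as r
b0 = y_bar - b1 * x_bar   # intercept: line passes through (x bar, y bar)

print(round(r, 3), round(b1, 2), round(b0, 2))
```

Because Sy and Sx are always positive, the sign of b1 can only come from r, which is why the slope and the correlation always agree in direction.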
What are the four conditions for regression?
Quantitative Variables (2 quants)
Straight Enough
No Outliers
Does the Plot Thicken?
The regression equation is/is not affected by variable placements (which variable is x and which is y)?
It is! (The value of r itself stays the same either way.) The Explanatory/predictor variable = x
Response variable = y
Ex: Since 1980, yearly average mortgage interest rates have fluctuated from a low of under 6% to a high of over 14%.
r = -0.8400 Sig. of prob = 0.0001
Mortgages = 220.893 - 7.775(interest)
Is there a relationship between the amount of money people borrow and the interest rate that is offered? What would you expect the relationship to look like?
(assume they pass the straight enough and no outliers conditions)
These variables pass the quantitative condition. X = interest Y = mortgage
The correlation (r = -0.84) shows that the association is negative and strong, and with the straight enough condition met it is linear. The sig. prob shows that the association is statistically significant, as 0.0001 is less than 0.05.
Ex Continued: Since 1980, yearly average mortgage interest rates have fluctuated from a low of under 6% to a high of over 14%.
r = -0.8400 Sig. of prob = 0.0001
Mortgages = 220.893 - 7.775(interest)
Interpret the data.
b1 = -7.775
On avg., for every additional 1 percentage point increase in the interest rate, expect to see mortgages decrease by $7.775 billion.
b0 = $220.893 billion.
When the i-rate is 0, expect to have mortgages of $220.893 billion. There is no practical interpretation of b0 bc the I-rate has never hit 0.
Ex: y hat = 220.893 - 7.775x
Observation #21 is y = 168.2 ($billions) and x = 7.9 (%)
Calculate the predicted value associated with this observation and interpret. Calculate the residual for this observation and interpret.
y hat = 220.893 - 7.775(7.9) = 159.47
When the I-rate is 7.9%, the model predicts mortgages of $159.47 billion.
e = 168.2 - 159.47 = 8.73
The model underpredicted. The actual mortgage amount was $8.73 billion more than predicted.
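The prediction and residual for observation #21 can be reproduced with a short sketch. The function name is hypothetical; the coefficients and the observation are the ones on the card.

```python
# Fitted line from the card: mortgages_hat = 220.893 - 7.775 * interest
def predict_mortgages(interest_rate_pct):
    """Predicted mortgages ($billions) for an interest rate given in percent."""
    return 220.893 - 7.775 * interest_rate_pct

observed = 168.2                 # observation #21, $billions
y_hat = predict_mortgages(7.9)   # predicted value at x = 7.9%
e = observed - y_hat             # residual = observed - predicted

# Positive residual: the observation is above the line, so the model underpredicted.
direction = "underpredicted" if e > 0 else "overpredicted"

print(round(y_hat, 2), round(e, 2), direction)  # 159.47 8.73 underpredicted
```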
True or False: Residuals have not been modeled by the regression equation.
True.
e = y - y hat OR residual = data - model
When observing a scatter plot of the residuals, do we want to see a pattern?
No.
Regression has 4 conditions. What are they?
If the data pass the Quantitative Variables, Straight Enough, and No Outliers conditions, make the model.
The Does the Plot Thicken? condition concerns the residual plot and whether the spread of the residuals changes or shows a pattern.
If the regression line fit all the data perfectly, the standard deviation of residuals (Se) should be what number?
0
Squared correlation (r)^2 = R^2.
R^2 gives the proportion of the data’s variance that is _______ for by the model.
accounted
For the popular food ex:
Correlation chart shows
              Fat(g)   Protein(g)
Fat(g)        1.00     0.76
Protein(g)    0.76     1.00
Find the R^2 and interpret
R^2 = (0.76)^2 = 0.58
This can be found in the Summary of Fit Chart.
It means that 0.58 or 58% of the variation in fat is accounted for by the model.
R^2 = 0 : None of the variance in the data is captured by the model. It is all in the residuals.
R^2 = 1.0 : All of the variance in the data is captured by the model.
In the popular food ex, 58% of the variation in total fat (y) is associated with/explained by the variation in protein content (x).
no answer
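As a quick check of the popular food numbers, squaring the correlation gives R^2, and whatever is left over is the share of variation that stays in the residuals. This is only a sketch of the arithmetic on the cards above.

```python
r = 0.76                 # correlation between fat and protein, from the chart
r_squared = r ** 2       # proportion of variation accounted for by the model
unexplained = 1 - r_squared

print(round(r_squared, 2))    # 0.58
print(round(unexplained, 2))  # 0.42
```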
Ex: Verify that the square of the correlation coefficient (r) is RSquare. Interpret its value in the context of these data.
Correlations
                 Mortgages   Interest Rate
Mortgages        1.0000      -0.8400
Interest Rate    -0.8400     1.0000
Summary of Fit:
RSquare: 0.705573
RSquare Adj: 0.693305
Root Mean Square Error: 13.21113
Mean of Response: 151.8731
Observations (or Sum Wgts): 26
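r^2 = RSquare: (-0.8400)^2 = 0.7056, which matches RSquare = 0.705573
Working backwards, the square root of 0.705573 is 0.84; r is negative (-0.84) because the slope is negative.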
70.56% of the variation of mortgages is explained by the variation of interest rate.
Ex: A graph shows that Max wind speed = 1031.24 - 0.975(central pressure).
To work backwards with RSquare to find the correlation you take the square root of what? To find the direction of correlation (r) what do you look for?
RSquare
Ex:
RSquare = 0.81
size of r = square root of 0.81 = 0.9
Because the slope is negative and you use correlation to find slope, assume correlation is also negative. r = (-0.9)
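A small sketch of working backwards from RSquare: the square root gives the size of r, and the sign is copied from the slope. The helper name is hypothetical, and pairing RSquare = 0.81 with the wind speed slope of -0.975 is an assumption made here for illustration.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Size of r is sqrt(RSquare); its sign matches the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Assumed pairing: RSquare = 0.81 with the negative wind speed slope.
r = correlation_from_r_squared(0.81, -0.975)

print(round(r, 2))  # -0.9
```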
True or False: the value of RSquare should be withheld from the audience.
False. It should be reported.
What is it called when you use a regression equation outside of the context by plugging in a value of x that is outside of the range of values for the data?
An extrapolation. This can be very dangerous and give inaccurate predictions.
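One way to guard against this is to check x against the range of the data before predicting. The sketch below assumes a range of 6% to 14%, read loosely from the mortgage cards ("under 6%" to "over 14%"); the exact limits and the function name are assumptions.

```python
# Assumed observed range of x (interest rate, %); the true limits are not given exactly.
X_MIN, X_MAX = 6.0, 14.0

def is_extrapolation(x, x_min=X_MIN, x_max=X_MAX):
    """True when x falls outside the range of x values in the data."""
    return x < x_min or x > x_max

print(is_extrapolation(7.9))  # False: inside the observed range
print(is_extrapolation(0.0))  # True: the y-intercept is itself an extrapolation here
```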
True or False: While an r value must be between -1 and 1, the slope of a regression line can be any value.
True! When b1 is anything other than 0.0, it indicates some linear association between x & y.
Always initially assume regression is by random chance. What can you use to finalize this answer?
P-value & 0.05 threshold.
If an audience has two spinners that are different, should there be any statistically significant association?
No! It should be by random chance.
Threshold says that if the P-value is “greater than 0.05” the association _________. If the P-value is “less than 0.05,” the association _______.
is by random chance; statistically significant
Ex: Data for a linear fit of y = 6.2276 - 0.317x shows a probability of 0.1284. Is this by random chance or no?
Because 0.1284 is greater than 0.05, the association is by random chance (not statistically significant).
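The 0.05 threshold rule from these cards can be written as a tiny decision function. The function name is made up for illustration, and the wording of the two outcomes follows the cards.

```python
def interpret_p_value(p_value, threshold=0.05):
    """Apply the cards' rule: below the threshold is significant, otherwise random chance."""
    if p_value < threshold:
        return "statistically significant"
    return "by random chance"

print(interpret_p_value(0.1284))  # by random chance
print(interpret_p_value(0.0001))  # statistically significant
```

The second call uses the sig. prob from the mortgage example, which falls well under the threshold.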
True or False: Residual plots can sometimes expose more subtle curved relationships in data than the original scatterplot.
True
Ex: What is the relationship between Run Time (minutes) and Budget (millions)? Which variable should be x and which should be y? Assume all four conditions are met.
b1 = 0.7144001
b0 = -31.38695
RSquare = 0.154156
X = run time; y = budget
Regression Analysis: Budget hat = -31.38695 + 0.714001(run time)
In the case of this data, 15.4% of the budget variance is explained by run time. The remaining 84.6% is not explained by the model and may be associated with other variables.
Can doing multiple, smaller analyses on subsets of the same data set create a stronger line of fit?
Yes.
What are extrapolations?
Predicted values outside of the range of data available. They are questionable assumptions.
True or False: Though dangerous, extrapolation is sometimes necessary.
True. Regression, extrapolation, and “forecasting” are better options than guessing.
Because the correlation coefficient is used to create the slope, outliers can _______ a regression analysis.
strongly influence
Correlation does NOT mean causation. When there is a high RSquare, do changes in x cause variation in y?
Not necessarily.
Ex: In a given data set of doctors/person and avg. life expectancy, RSquare is 0.629. In the set TVs/person vs. avg. life expectancy, RSquare is 0.725. Does this mean TVs are better for health than doctors?
No! This just means there are lurking variables affecting the data. For example, in places with higher standards of living, people tend to have longer life expectancies and also more doctors AND TVs.
What do randomized samples attempt to do?
Reduce bias
How do we learn details about the population?
Through the average of the population or a sample statistic.
Population parameter is whatever we’re analyzing about the population. It can be a mean, std. deviation, percentile, percent, etc. How is this found?
Through taking a sample
Sample should be as ________ of the population as possible.
representative
Notation is very important. Name the statistic and parameter denotations.
                          Stat.     Parameter
Mean:
Std. Dev:
Correlation:
Regression Coefficient:
Proportion:
                          Stat.     Parameter
Mean:                     y bar     μ (mu)
Std. Dev:                 s         σ (sigma)
Correlation:              r         ρ (rho)
Regression Coefficient:   b         β (beta)
Proportion:               p hat     p (sometimes written π, pi)
What matters when taking a sample? (two answers)
Size & collection process
The sampling frame is a list or collection of things/individuals from which the sample was ________.
drawn/collected
Sampling frame can be biased if it misses part of the population. Provide an example.
Kroger takes sample from ppl w/ Kroger cards. This excludes customers w/o cards. The sampling frame is strictly customers w/ the loyalty card.
The population is the what of the sample?
entire group of individuals
Sampling frame is the what of the sample?
List of all individuals from which the sample was drawn (who is eligible to be sampled).
The sample design is the what of the sample?
Method used to draw the sample
The sample of data is what?
Those actually chosen