Reading Quiz 15 Flashcards
least-squares regression
fits a straight line to data in order to predict a response variable y from the explanatory variable x
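As a minimal sketch of what "fitting a straight line" means, the least-squares slope and intercept can be computed directly from the data. The function name `least_squares_line` and the five data points are made up for illustration; they do not come from the flashcards.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line yhat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data, chosen only to keep the arithmetic easy to follow
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = least_squares_line(xs, ys)
print(f"yhat = {a:.2f} + {b:.2f}x")
```

To predict y for a new x, plug that x into yhat = a + b*x.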
inference about regression conditions
- the observations must be independent
- the true relationship is linear
- the standard deviation of the response about the true line is the same everywhere
- the response varies normally about the true regression line
the observations must be independent condition
in particular, repeated observations on the same individual are not allowed
the true relationship is linear condition
we can’t observe the true regression line, so we will almost never see a perfect straight-line relationship in our data
look at the scatter plot to check that the overall pattern is roughly linear
a residual plot against x magnifies any unusual pattern
the standard deviation of the response about the true line is the same everywhere condition
look at the residual plot
the vertical scatter of the points about the residual = 0 line should be roughly the same over the entire range of the data
the response varies normally about the true regression line condition
the residuals estimate the deviations of the response from the true regression line, so they should follow a normal distribution
make a boxplot, histogram, or stemplot of the residuals and check for clear skewness or other major departures from normality
slight departures from normality do not greatly affect inference for regression, so they are allowed, particularly when we have many observations
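The condition checks above all start from the residuals. The sketch below only computes them, assuming the same made-up five-point data set; the helper name `residuals` is hypothetical, and the plots themselves (residual plot, histogram, stemplot) are left to whatever graphing tool is at hand.

```python
def residuals(xs, ys):
    """Residuals y - yhat from the least-squares line fitted to (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# Hypothetical data, not taken from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
res = residuals(xs, ys)

# Listing residuals in x order is a crude text version of a residual plot:
# look for curvature (linearity) and for spread that changes with x.
for x, r in zip(xs, res):
    print(f"x = {x}: residual = {r:+.2f}")
```

Least-squares residuals always sum to zero, so what matters is their pattern and spread, not their average.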
regression model
says that there is a true regression line μy = α + βx that describes how the mean response μy varies as x changes
true regression line
μy = α + βx
describes how the mean response μy varies as x changes
the observed response y for any fixed x has a normal distribution with
mean μy = α + βx and the same standard deviation σ for any value of x
the parameters of the regression model are
the intercept α estimated by a, the slope β estimated by b, and the standard deviation σ estimated by s
the true slope β says how much
the mean response μy changes when x increases by 1
the standard deviation σ describes
how much variation there is in responses y when x is fixed
to estimate σ
use the standard error about the line, s
s
regression standard error
s = sqrt(Σ(residuals^2)/(n-2)) = sqrt(Σ(y-yhat)^2/(n-2))
sample standard deviation of the residuals
spread of data (measure of variability) around the least squares regression line
“typical” amount of prediction error when using a linear regression model to make predictions
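The formula for s can be sketched in a few lines. The function name `regression_standard_error` and the data are illustrative assumptions; the only thing taken from the cards is the formula itself, including the n - 2 divisor.

```python
import math

def regression_standard_error(xs, ys):
    """s = sqrt(sum of squared residuals / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    # Sum of squared residuals (y - yhat)^2
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom

# Hypothetical data, not taken from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
s = regression_standard_error(xs, ys)
print(f"s = {s:.4f}")
```

Dividing by n - 2 rather than n - 1 reflects the two parameters (intercept and slope) estimated before the residuals can be computed.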
calculator compute s
enter data into L1 and L2
STAT, TESTS, LinRegTTest
regression standard error has how many degrees of freedom
n - 2
all t procedures in regression inference have n-2 degrees of freedom
inference for regression goal
predict behavior of y for given values of x
inference for regression cont
there is an “on average” straight line relationship between y and x
saying μy moves along a straight line as explanatory variable x changes
inference can be done in two ways:
confidence intervals for the slope and a significance test to test the hypothesis that the true slope is 0
confidence intervals for the slope of the true regression line have the general form
b ± t*SEb.
in practice, we use software to find
the slope b of the least-squares line and its standard error SEb
formula for confidence interval for slope of true regression line
b ± t*SEb where SEb = s/sqrt(Σ(x-xbar)^2)
some calculators have the program to calculate the confidence interval for the slope
STAT, TESTS, LinRegTInt
if you don’t have the program, you can run LinRegTTest to obtain b and t, then calculate SEb = b/t, because the test uses β = 0 in t = (b - β)/SEb
then substitute into b ± t*SEb, using the critical value t* from the t distribution table with df = n - 2
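A rough sketch of the interval calculation, assuming made-up data with n = 5 and a 95% critical value t* = 3.182 read from a t table for df = 3. The constant name and the data are illustrative only; a calculator or software would normally supply these values.

```python
import math

# Assumption: t* for 95% confidence with df = n - 2 = 3, read from a t table
T_STAR_95_DF3 = 3.182

# Hypothetical data, not taken from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar

# Regression standard error s, then the standard error of the slope
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
se_b = s / math.sqrt(sxx)

# The shortcut from the card: with beta = 0, t = b / SEb, so SEb = b / t
t_stat = b / se_b
se_b_from_t = b / t_stat

lower = b - T_STAR_95_DF3 * se_b
upper = b + T_STAR_95_DF3 * se_b
print(f"b = {b:.3f}, SEb = {se_b:.4f}")
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

With only five points the interval is wide and includes 0, which matches what a two-sided test of a zero slope at the 5% level would conclude for these data.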
t =
(b - β)/SEb
to test the hypothesis that the true slope is zero, use the
t statistic t = b/SEb, which is t = (b - β)/SEb with β = 0
the t statistic is the standardized slope of the least-squares regression line (LSRL)
software also reports it; on a calculator: STAT, TESTS, LinRegTTest (data in L1 and L2)
regression output from statistical software usually gives
t and its two sided p-value
for a one-sided test, divide the p-value by 2, provided the sample slope lies in the direction of the alternative hypothesis
the most common hypothesis is
Ho: β = 0
this says there is no true linear relationship between x and y
it also says that straight-line dependence on x has no value for predicting y
it also says that the population correlation between x and y is zero
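The test of Ho: β = 0 can be sketched by computing t = b/SEb and comparing it with a table critical value. The data, the 5% two-sided level, and t* = 3.182 (df = 3, from a t table) are assumptions for illustration; software would report an exact p-value instead.

```python
import math

# Assumption: two-sided 5% critical value for df = 3, read from a t table
T_STAR = 3.182

# Hypothetical data, not taken from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
a = y_bar - b * x_bar
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
se_b = s / math.sqrt(sxx)

# Under Ho: beta = 0, the standardized slope is t = (b - 0) / SEb
t_stat = b / se_b
reject_h0 = abs(t_stat) > T_STAR

print(f"t = {t_stat:.3f} with df = {n - 2}")
print("reject Ho" if reject_h0 else "fail to reject Ho")
```

Failing to reject here does not show that the slope is 0; with so few points the test simply has little power.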
To review: we use least-squares regression to study the relationship between two variables, both of which are (quantitative, categorical).
quantitative
Before doing regressions to study the relationship between two quantitative variables, we should explore the data by examining a _______ and a __________.
scatterplot, residual plot
The statistic that describes the strength of a linear relationship, that is the same whichever variable is thought of as the explanatory variable, and which has a familiar relationship to the percent of variance in one variable explained by the other, is the ______ ______.
A. correlation coefficient (or just, the correlation)
What is a residual?
A. A residual is the vertical distance between the data point and the regression line, or y - y-hat.
The r-squared value, which is part of the regression output, tells us how much of what is what?
A. How much of the variation in the y variable is accounted for by the linear relationship with x.
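A small sketch tying the last few cards together: the correlation r is the same whichever variable is treated as explanatory, and r squared equals the fraction of variation in y accounted for by the line. The function name `correlation` and the data are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, not taken from the flashcards
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
r_swapped = correlation(ys, xs)   # same value with the roles of x and y swapped
r_squared = r ** 2

# Cross-check: r^2 = 1 - SSE/SST for the least-squares line of y on x
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # residual variation
sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation in y
explained_fraction = 1 - sse / sst

print(f"r = {r:.4f}, r^2 = {r_squared:.2f}")
```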
Suppose we draw lots of samples and compute a regression line for each sample. The slope and intercept of each sample line estimates a true value. Thus the slope and intercept we obtain from our sample are _____ that estimate population ______.
statistics; parameters
One of the conditions for regression inference is that for any fixed value of x, the response variable y varies according to a _____ distribution.
normal
Another assumption for regression inference is that for any fixed value of x, the repeated responses y are ____ of each other.
independent
Another assumption for regression inference is that the means of the sets of y-values for each x value have what relationship to the x values?
A. That the means of the y’s for each x are a linear function of x: mean for y’s = alpha + beta * x
Another assumption for regression inference is that what measure of dispersion is equal for each value of x?
A. The standard deviation of the y’s for the various x values.
True or False: the slope and intercept we obtain from the least-squares regression for our sample are unbiased estimators of the slope and intercept, respectively, of the line connecting the population means of y for each of the x’s.
true
What statistic do we use to estimate the standard deviation of the y values around the true regression line (in other words, the standard deviation of the y values around their mean at each x)?
A. The statistic called s, the regression standard error, which is essentially the standard deviation of the residuals.
The statistic s represents the estimate of the standard deviation ____ in the regression model.
A. σ (sigma)
The parameter we are usually most interested in estimating from regression output is the (slope, y-intercept) of the line.
slope
What is the general form for a confidence interval for regression slope?
A. b plus or minus t*SEb
The most commonly tested hypothesis about regression is that β, the population slope, is 0. Can you put this hypothesis in some other phrasings?
A. That the straight line dependence on x is of no value in predicting y. Or that the population correlation between x and y is 0. Or that there is no true linear relationship between x and y in the population.
If the true slope is 0 and the regression conditions hold, and you form the ratio of the slope obtained in your sample to the standard error of that slope, what is the sampling distribution of that statistic?
A. It’s distributed according to the t distribution, with n-2 degrees of freedom.
Regression output usually gives a two-sided p value for the hypothesis test that the population slope is 0. How do you obtain a one-sided p-value for the same hypothesis?
A. Divide the two-sided p-value by two, provided the sample slope lies in the direction of the alternative hypothesis.
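The halving rule can be sketched as a small helper. The function name `one_sided_p` and its arguments are hypothetical; the extra branch covers the case the rule of thumb skips, where the sample slope points away from the alternative and the one-sided p-value is 1 minus half the two-sided value.

```python
def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided p-value for Ho: beta = 0 to a one-sided one.

    alternative is "greater" (Ha: beta > 0) or "less" (Ha: beta < 0).
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    in_direction = (t_stat > 0) if alternative == "greater" else (t_stat < 0)
    half = two_sided_p / 2
    # Halve when t agrees with the alternative; otherwise take the other tail
    return half if in_direction else 1 - half

# Hypothetical software output: t = 2.1 with two-sided p = 0.10
p_greater = one_sided_p(0.10, 2.1, "greater")
p_less = one_sided_p(0.10, 2.1, "less")
print(p_greater, p_less)
```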
Suppose that in a residual plot, the values are close to 0 when x is low, but the residuals get bigger and bigger in absolute value as the x values get greater. What condition of regression is violated in this circumstance?
A. The condition that the standard deviation of the response around the true line is the same everywhere.
Someone examines a residual plot and a scatterplot and observes a curvilinear pattern. What condition of regression is being violated, and what should the researcher consider doing in order to correct this?
A. The condition violated is that the true relationship is linear. The researcher should consider transforming one or more of the variables.
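One way a transformation can help, sketched with made-up data that double at each step: y against x is curved, but log(y) against x is a straight line. The data and the choice of a log transform are assumptions for illustration; other transformations may suit other patterns.

```python
import math

def correlation(xs, ys):
    """Sample correlation r between two quantitative variables."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical exponential-growth data: y = 3 * 2**x
xs = [0, 1, 2, 3, 4, 5]
ys = [3, 6, 12, 24, 48, 96]

log_ys = [math.log(y) for y in ys]   # transform the response

r_raw = correlation(xs, ys)          # curved pattern, so r is below 1
r_log = correlation(xs, log_ys)      # straight-line pattern after the log

print(f"r before transforming: {r_raw:.3f}")
print(f"r after taking log(y): {r_log:.3f}")
```

A high r alone does not confirm linearity, so the residual plot of the transformed data should still be checked.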