Chapter 14 Flashcards
What does simple linear regression use
- one independent variable and one dependent variable
- uses a straight line to approximate the relationship
What does multiple regression use
- 2 or more independent variables
What are the 2 objectives for simple linear regression
- establish whether there is a relationship b/w 2 variables (e.g., income and spending)
- forecast new observations (e.g., sales over the next quarter)
What is the dependent variable denoted by
y
what is the independent variable denoted by
x
what is the dependent variable
the variable being predicted
what is the independent variable
the variable(s) used to predict the values of the dependent variable
What is the formula for the simple linear regression
Y = B0 + B1X + E
What does Y represent in the simple linear regression model
the dependent variable
what does B0 represent in the simple linear regression model
intercept or constant
what does B1 represent in the simple linear regression model
coefficient of x, i.e., the slope of the regression line
what does E represent in the simple linear regression model
error term
What does the error term account for in the simple linear regression model
accounts for the variability in y that can’t be explained by the linear relationship b/w x and y
What is the simple linear regression equation
E(y) = B0 + B1x
What does E(y) represent
mean or expected value of y for a given value of x
What can we note about B0 and B1 in the simple linear regression equation
they are known
What is the estimated simple linear regression equation
y hat = b0 + b1x
When do we use the estimated simple linear regression equation
when B0 and B1 are NOT known
What does y hat represent in the estimated simple linear regression equation
point estimate of E(y)
- provides a prediction of an individual value of y for a given value of x
What are B0 and B1
population parameters
what are b0 and b1
sample statistics to estimate B0 and B1
If we are trying to predict sales for a given level of advertising, what are the dependent and independent variables
Dependent variable - sales (y)
Independent variable - advertising expenditures (x)
what does “simple” indicate in simple linear regression
one independent variable and one dependent variable
What does “linear” Indicate in Simple linear regression
the relationship is approximated using a straight line
What is B0 in the simple linear regression model
the y-intercept of the regression line or the value of y when x is 0
What is B1 in the simple linear regression model
the slope of the regression line
- the slope tells us two things
1. whether the line is increasing or decreasing
2. how steep it is
What is E in the simple linear regression model
the error term
- as good as our model might be, there is always a random error term that cannot be accounted for
if the line slopes upward, what is the relationship
as x increases, so does y - positive relationship
B1 - will be positive
if the line slopes downward, what is the relationship
as x increases, y decreases, negative relationship
B1 - would be negative
what if the line is straight across (the regression line is flat)
no relationship, as x increases, y remains the same
B1 is 0
What are the POPULATION parameters for the y intercept and the slope
B0 and B1
what are the sample statistics used to estimate B0 and B1
b0 and b1
what does y hat represent in the simple linear regression
the predicted value of y for a given x value
what is the estimated simple linear regression equation
y hat = b0 +b1x
What does the Coefficient of Determination tell us about the estimated regression equation
how well does the estimated regression equation fit the data
What does the Coefficient of Determination provides us with
a measure of the goodness of fit
in Coefficient of determination, what is the ith residual
the difference b/w the observed value yi and the predicted value y hat i
for the ith observation, the residual is indicated by what
yi- y hat i
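As a rough illustration of the residual yi - y hat i, the sketch below computes residuals for three observations. The coefficients b0 = 60 and b1 = 5 and the data values are assumptions chosen for the example, not values from these cards.

```python
# Illustrative values only: b0 and b1 are assumed to have been estimated already.
b0, b1 = 60.0, 5.0
x = [2, 6, 8]
y = [58, 105, 88]


def residuals(x, y, b0, b1):
    """Return the residuals y_i - y_hat_i, where y_hat_i = b0 + b1 * x_i."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


res = residuals(x, y, b0, b1)
print(res)
```

A positive residual means the observed point sits above the estimated line, and a negative residual means it sits below.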
What is the formula for the coefficient of determination
r squared = SSR/SST
what does r squared represent
the coefficient of determination
What does SSR stand for in coefficient of determination
sum of squares due to regression
what does SST stand for in Coefficient of determination
sum of squares for the total deviation
What is the formula for SSR in coefficient of determination
sum (y hat i - y bar) squared
what does the SSR in coefficient of determination measure
the difference b/w the predicted values and the average, or
how much the y hat values on the estimated regression line deviate from y bar
What does SSE in coefficient of determination stand for
sum of squares due to Error
What is the formula in Coefficient of determination for SST
sum (yi - ybar) squared
what is the formula in Coefficient of determination for SSE
sum (yi - y hat i) squared
In Coefficient of determination, how do you calculate SST
SST = SSR + SSE
What should we expect regarding SST, SSR and SSE in the coefficient of determination
we should expect SST, SSR and SSE to be related: SST = SSR + SSE
What would be a perfect fit in coefficient of determination
SSR = SST
SSR / SST = 1
What would a poor fit be in coefficient of determination
large values for SSE
- poorest fit when SSR = 0 and SSE = SST
What is r squared
the percentage of the variability in y that can be explained by x
if r squared = 95.5%, what can we say
95.5% of the variability in grades for instance, can be explained by the number of hours studied
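The relationship among SST, SSR and SSE can be checked numerically. The sketch below is a minimal example on illustrative sample values (the ten x, y pairs are assumptions for this example, not data from the cards); the function name `sums_of_squares` is hypothetical.

```python
# Illustrative sample data (assumed for this sketch).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def sums_of_squares(x, y):
    """Fit y hat = b0 + b1*x by least squares and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
    return sst, ssr, sse


sst, ssr, sse = sums_of_squares(x, y)
r_squared = ssr / sst
print(sst, ssr, sse, round(r_squared, 4))
```

For these values SSR + SSE adds back up to SST, and r squared is the share of SST that the line explains.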
What does the correlation Coefficient measure
it measures the strength of association b/w x and y
what is the correlation Coefficient denoted by
r
what are the values of r in correlation Coefficient
between -1 and +1
In Correlation Coefficient, if r = 1, what does this mean
means perfect positive linear relationship b/w x and y
- no deviation
- all the data points from the sample lie exactly on the line of regression with no deviation and the line slopes upward
In Correlation Coefficient, if r = -1, what does this mean
means perfect negative linear relationship b/w x and y
- no deviation
- all data points from the sample lie exactly on the line of regression with no deviation and the line slopes downward
In Correlation Coefficient, if r = 0, what does this mean
no linear relationship b/w x and y
what is the formula for correlation Coefficient
rxy = (sign of b1) x square root of coefficient of determination
or
rxy = (sign of b1) x square root of r squared
in correlation Coefficient, what is b1
slope of the estimated regression equation
In correlation coefficient, since the square root of anything doesn’t tell us if the number was negative or positive we have to look at what
the slope and then we use the sign for our slope
example b1 is positive 4.74 then we use positive sign
rxy = +.9505
if rxy is .9749 what does this indicate
a very strong positive linear relationship b/w x and y
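The sign rule for rxy can be sketched in a few lines. The function name `correlation_from_fit` and the input values are assumptions made for illustration only.

```python
import math


def correlation_from_fit(b1, r_squared):
    """Return rxy = (sign of b1) * sqrt(r squared)."""
    if b1 > 0:
        sign = 1.0
    elif b1 < 0:
        sign = -1.0
    else:
        sign = 0.0  # a flat line means no linear relationship
    return sign * math.sqrt(r_squared)


r_pos = correlation_from_fit(5.0, 0.9027)   # upward-sloping line
r_neg = correlation_from_fit(-5.0, 0.9027)  # same fit quality, downward slope
print(round(r_pos, 4), round(r_neg, 4))
```

The square root alone is never negative, so the slope supplies the direction of the relationship.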
Testing for significance: in the model y = B0 + B1x + E, if B1 = 0 then E(y) =
B0, no matter what value x is
- the mean value of y does not depend on x
(no linear relationship b/w x and y)
What is the null hypothesis and the alternative for testing significance in Simple Linear Regression
H0: B1 = 0    Ha: B1 ≠ 0
What test do we use when testing for significance in simple linear regression
t test
what is the formula for the test statistic when testing for significance in simple linear regression
t = b1 / sb1
what does sb1 stand for
standard error for slope
what is the formula for sb1 (the standard error for the slope)
sb1 = s / square root of sum (xi - x bar) squared, where s is the standard error of the estimate
what is the formula for s in sb1
s = square root of (SSE / (n - 2))
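The pieces of the t test fit together as in the sketch below. The data are illustrative sample values and `slope_t_statistic` is a hypothetical helper name; the resulting t would still need to be compared against a t table value with n - 2 degrees of freedom, which is not computed here.

```python
import math

# Illustrative sample data (assumed for this sketch).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def slope_t_statistic(x, y):
    """Return (t, s, s_b1) for testing H0: B1 = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))   # standard error of the estimate
    s_b1 = s / math.sqrt(sxx)      # standard error of the slope
    return b1 / s_b1, s, s_b1


t_stat, s, s_b1 = slope_t_statistic(x, y)
print(round(t_stat, 2), round(s, 3), round(s_b1, 4))
```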
Coefficient of determination - Definition
A measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation
Confidence interval - Definition
The interval estimate of the mean value of y for a given value of x.
Correlation coefficient - Definition
A measure of the strength of the linear relationship between two variables
Dependent variable - definition
The variable that is being predicted or explained. It is denoted by y.
Estimated regression equation - Definition
The estimate of the regression equation developed from sample data by using the least squares method. For simple linear regression, the estimated regression equation is ŷ = b0 + b1x.
High leverage points - Definition
Observations with extreme values for the independent variables
Independent variable - Definition
The variable that is doing the predicting or explaining. It is denoted by x.
Influential observation - Definition
An observation that has a strong influence or effect on the regression results.
ith residual - Definition
The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation; for the ith observation the ith residual is yi − ŷi.
Least squares method - Definition
A procedure used to develop the estimated regression equation. The objective is to minimize Σ(yi − ŷi)².
Mean square error - Definition
The unbiased estimate of the variance of the error term, σ². It is denoted by MSE or s².
Normal probability plot - Definition
A graph of the standardized residuals plotted against values of the normal scores. This plot helps determine whether the assumption that the error term has a normal probability distribution appears to be valid
Outlier - Definition
A data point or observation that does not fit the trend shown by the remaining data
Prediction interval - Definition
The interval estimate of an individual value of y for a given value of x.
Regression equation - Definition
The equation that describes how the mean or expected value of the dependent variable is related to the independent variable; in simple linear regression, E(y) = B0 + B1x.
Regression model - Definition
The equation that describes how y is related to x and an error term; in simple linear regression, the regression model is y = B0 + B1x + E.
Residual Analysis - Definition
The analysis of the residuals used to determine whether the assumptions made about the regression model appear to be valid. Residual analysis is also used to identify outliers and influential observations
Residual Plot - Definition
Graphical representation of the residuals that can be used to determine whether the assumptions made about the regression model appear to be valid
Scatter Diagram - Definition
A graph of bivariate data in which the independent variable is on the horizontal axis and the dependent variable is on the vertical axis
Simple linear regression - Definition
Regression analysis involving one independent variable and one dependent variable in which the relationship between the variables is approximated by a straight line.
Standard error of the estimate - Definition
The square root of the mean square error, denoted by s. It is the estimate of σ, the standard deviation of the error term E
Standardized residual - Definition
The value obtained by dividing a residual by its standard deviation
Regression and Correlation analysis are used to study what
relationships between two or more variables
The focus in correlation analysis is on assessment of
the size and the direction of the relationship
The relationship between variables is said to be positive if
the two variables increase together and decrease together
The relationship between variables is said to be negative if
they move in different directions
In regression analysis, on the other hand, the focus is on what
prediction
The value of one variable is predicted from the value of another variable based on what
a model relating the two variables. the model has to be estimated, using a sample from the bivariate distribution, before it can be used
What is the first thing to do in a regression analysis
plot the data as a scatter diagram, and see if the assumption of a linear relationship is plausible
If the data points are scattered in such a way that a straight line can be drawn through them, they are
clustered around the line and the assumption of a linear relationship is reasonable
if the data points are scattered in such a way that a straight line cannot be drawn through them (e.g., they follow a curve or show no pattern at all), then
the assumption of a linear relationship is violated
Geometrically, the actual value yi is the ________ of the data point measured from the _________ axis
height
horizontal
The distance between the two (yi and y hat i) is the
error at the given xi
if the actual value yi is on the regression line, then yi ________ and the error is ________
= y hat i, and the error is zero
If the actual value yi is above the regression line, this results in a _________
positive error
if the actual value yi is below the regression line, this results in a
negative error
Is there an error term for each data point?
yes
When does a perfect fit occur in the simple linear regression
when all these error terms are zero, which means all the data points are lined up along a straight line
is a perfect fit rare in simple linear regression
yes
what would be the best regression line
a line through the points that minimizes the errors in some sense
Who proposed the least squares method
Gauss
Explain the least squares method
deals with the squares of the errors, instead of the errors themselves, so it treats only the size of the errors and not their signs
What does the least squares method do
it minimizes the sum of squared errors
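A minimal sketch of the least squares calculation follows. The closed-form expressions for b1 and b0 are the standard ones for simple linear regression (they are not written out on these cards), and the data and function names are assumptions for illustration.

```python
def least_squares_fit(x, y):
    """Return (b0, b1) that minimize the sum of squared errors."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_errors(x, y, b0, b1):
    """Sum of squared differences between observed y and b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative sample data (assumed for this sketch).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

b0, b1 = least_squares_fit(x, y)
best = sum_squared_errors(x, y, b0, b1)
# Moving either coefficient away from the least squares values raises the sum.
worse_slope = sum_squared_errors(x, y, b0, b1 + 0.5)
worse_intercept = sum_squared_errors(x, y, b0 + 5, b1)
print(b0, b1, best, worse_slope, worse_intercept)
```

Squaring keeps positive and negative errors from cancelling, which is why the comparison above is made on squared errors rather than on raw errors.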
Why is the relation of SST = SSR + SSE fundamental
because it assesses the contribution of the regression as a source of variation compared to other sources of variation in the data
since we are studying the effect of regression alone, all other sources of variation are what
lumped under the general label “error” and are treated as one source. This is similar to the notion of “between” and “within” variations
What is the mean Square due to error for simple linear regression
one measure of the goodness of fit for a regression equation
MSE is also
s^2
MSE is useful
only in a relative sense
a value of say, 13.829 for MSE does not tell us whether the fit is good or bad. nor, if good, does it tell us how good the fit is.
It is only useful when we compare it with the MSE for another model or fit
When comparing MSE which one is better
the one with the smaller MSE is better
What is MSE very useful for
constructing tests of significance and confidence intervals
What is the square root of MSE used for
to estimate the standard error of an estimate
which serves as a benchmark for decisions regarding the size of a difference between an estimate and its hypothesized value
What test is used with MSE for Simple linear regression
t-test
Can MSE be used as a comparison by itself and what test is used
a comparison can be made directly with the MSE, since MSE is itself a measure of variation. This results in the F test
An observation can be both an outlier and what
an influential observation, it can be an outlier but not an influential observation, or it can be an influential observation but not an outlier
In identifying an outlier, we focus on what
the y value (or equivalently, on the residual or standardized residual) of a point
When identifying an influential observation, the focus is on what
the x values
Observations whose x values are very different from the x values of the rest of the data are most likely
influential observations
Those whose y values are way off the trend of the other points are most likely
outliers
The variable being predicted is called
the dependent variable
What are the independent variables
The variable or variables being used to predict the value of the dependent variable are called the independent variables
In simple linear regression, each observation consists of two values, what are they
- one for the independent variable
- one for the dependent variable
Can regression analysis be interpreted as a procedure for establishing a cause-and-effect relationship between variables?
no, it can only indicate how or to what extent variables are associated with each other
any conclusions about cause and effect must be based upon the judgement of those individuals most knowledgeable about the application
Using the estimated regression equation to make predictions outside the range of the values of the independent variable - what caution is there
should be done with caution because outside that range we cannot be sure that the same relationship is valid
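One way to make this caution concrete is to have the prediction helper report whether x falls inside the sampled range. The function name `predict_with_range_check`, the coefficients, and the range limits below are all assumptions for illustration.

```python
def predict_with_range_check(x_new, b0, b1, x_min, x_max):
    """Return (y_hat, in_range); in_range is False when x_new is an extrapolation."""
    y_hat = b0 + b1 * x_new
    in_range = x_min <= x_new <= x_max
    return y_hat, in_range


# Assumed fit: y hat = 60 + 5x, estimated from x values between 2 and 26.
inside = predict_with_range_check(10, 60.0, 5.0, 2, 26)
outside = predict_with_range_check(40, 60.0, 5.0, 2, 26)
print(inside, outside)
```

The arithmetic works the same either way; the flag only signals that the second prediction rests on a relationship that was never observed at that x value.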
What does the least squares method do for the estimated regression equation
minimizes the sum of squared deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable y hat i
The least squares criterion for estimated regression is used to do what
to choose the equation that provides the best fit. It is the most widely used method
Coefficient of determination provides what
a measure of goodness of fit for the estimated regression equation
What is SSE a measure of in the estimated regression
it is a measure of the error in using the estimated regression equation to predict values of the dependent variable in the sample
If you don’t know the value of xi, what would you use to estimate y
you would use the mean value
the estimated regression is a much better predictor than using the mean value
What can SSR be thought of as
the explained portion of SST
What can SSE be thought of as
the unexplained portion of SST
What would be a perfect fit for the estimated regression
yi - y hat i = 0 for every observation
this means that every value of the dependent variable yi lies on the estimated regression line
If the estimated regression is a perfect fit, what can we say about SSE
SSE = 0
and SSR/SST = 1
If SSE is large, what can we say about the estimated regression
poorer fits will have larger values of SSE
When would the fit be poorest
when SSE takes its largest value, which occurs when SSR = 0 and SSE = SST
When SSE = SST what kind of fit is this
poorest fit
What values will the ratio SSR/SST take
values between 0 and 1
what does r^2 stand for
coefficient of determination
r^2 formula
SSR/SST
if r^2 = SSR/SST is close to 1
a good fit
What would r^2 = .9027 mean
90.27% of the variability in yi can be explained by the estimated regression equation
What is correlation Coefficient a measure of
a descriptive measure of the STRENGTH of linear association between two variables (x and y)
What are the values that correlation coefficient take on
between -1 and +1
What does a value of +1 Correlation Coefficient mean
indicates that the two variables x and y are perfectly related in a positive linear sense. That all data points are on a straight line that has a positive slope
What do values close to zero represent for Correlation Coefficient
indicate that x and y are not linearly related
What is the formula for Correlation Coefficient
rxy = (sign of b1) x square root of coefficient of determination
rxy = (sign of b1) x square root of r^2
Coefficient of determination provides a measure between what numbers, and Correlation Coefficient provides a measure between what numbers
Coefficient of Determination: r^2 is between 0 and 1
Correlation Coefficient: rxy = (sign of b1) x square root of r^2 is between -1 and +1
The sample correlation coefficient is restricted to what
A linear relationship between two variables
The coefficient of determination can be used for what
nonlinear relationships and for relationships that have two or more independent variables
thus, the coefficient of determination r^2, provides a wider range of applicability
Which provides a wider range of applicability coefficient of determination or Correlation Coefficient
Coefficient of Determination
When using r^2 we can draw no conclusion about what
whether the relationship between x and y is statistically significant - such conclusion must be based on considerations that involve the sample size and the properties of the appropriate sampling distributions of the least squares estimators
SSE is what in Simple linear regression
sum of squared residuals
SSE is a measure of what
of the variability of the actual observations about the estimated regression line
In simple linear regression, do the F test and t test provide the same results?
yes, when there is just one independent variable (IV)
If there is more than one IV, do the F test and t test provide the same results
no, only the F test can be used to test for an overall significant relationship
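With one independent variable, the agreement between the two tests shows up as F = t squared. The sketch below checks this on illustrative sample values; `f_and_t` is a hypothetical helper name, and F is taken as MSR / MSE with MSR = SSR / 1 for a single independent variable.

```python
import math

# Illustrative sample data (assumed for this sketch).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def f_and_t(x, y):
    """Return (F, t) for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ssr = sum((b0 + b1 * xi - y_bar) ** 2 for xi in x)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    msr = ssr / 1            # one independent variable
    mse = sse / (n - 2)
    f_stat = msr / mse
    t_stat = b1 / (math.sqrt(mse) / math.sqrt(sxx))
    return f_stat, t_stat


f_stat, t_stat = f_and_t(x, y)
print(round(f_stat, 2), round(t_stat ** 2, 2))
```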
Confidence intervals and prediction intervals show the precision of the regression results. Narrower intervals provide what
a higher degree of precision
A confidence interval is an interval estimate of what
the mean value of y for a given value of x
a prediction interval is an interval estimate of what
used to predict an individual value of y for a new observation corresponding to a given value of x
the margin of error is larger for which interval, a confidence interval or prediction interval
prediction interval
What is the margin of error associated with a prediction interval
t α/2 × s pred, where s pred is the estimated standard deviation of an individual predicted value of y
In general, the lines of the confidence interval limits and the prediction interval limits both have what
curvature
confidence intervals and prediction intervals are both more precise when the value of the IV x* is closer to
x bar
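The sketch below compares the two margins of error at a given x*. It uses the standard simple linear regression formulas (not spelled out on these cards), illustrative sample data, and a t value supplied by the caller; 2.306 is the t table value for 95% confidence with 8 degrees of freedom. The function name `interval_margins` is hypothetical.

```python
import math

# Illustrative sample data (assumed for this sketch).
x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]


def interval_margins(x, y, x_star, t_crit):
    """Return (y_hat_star, ci_margin, pi_margin) at x_star."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    ci_margin = t_crit * s * math.sqrt(core)       # mean value of y
    pi_margin = t_crit * s * math.sqrt(1 + core)   # individual value of y
    return b0 + b1 * x_star, ci_margin, pi_margin


y_hat_10, ci_10, pi_10 = interval_margins(x, y, 10, 2.306)
y_hat_14, ci_14, pi_14 = interval_margins(x, y, 14, 2.306)  # x* equals x bar here
print(round(ci_10, 2), round(pi_10, 2), round(ci_14, 2), round(pi_14, 2))
```

Both margins shrink as x* moves toward x bar because the (x* - x bar) squared term goes to zero, which is also what gives the interval limits their curvature.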
What may an outlier represent
- erroneous data - recording error, should be corrected
- signal a violation of the model assumption - may need to consider another model
- unusual values that occurred by chance - should stay
What is an influential observation
- it could be an outlier
- can influence how the data is interpreted
- if this observation were removed, the regression results would change noticeably, for example the slope could change from negative to positive
if the Influential observation is valid
- can contribute to a better understanding of the appropriate model and lead to a better estimated regression equation
- try to obtain data on intermediate values of x to better understand the relationship b/w x and y
What is high leverage
the farther xi is from its mean (x bar) the higher the leverage of the observation
- need computer software to help with this
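Software typically reports leverage using the standard simple linear regression formula h_i = 1/n + (xi - x bar) squared / sum (xj - x bar) squared, which is not given on these cards. The sketch below applies it to illustrative x values containing one extreme point; the function name `leverages` is hypothetical.

```python
def leverages(x):
    """Return the leverage h_i of each observation based on its x value."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


# Illustrative x values (assumed); the last one is far from the rest.
x = [10, 10, 15, 20, 20, 25, 70]
h = leverages(x)
most_extreme = h.index(max(h))
print([round(hi, 2) for hi in h], most_extreme)
```

Only the x values enter the calculation, so leverage says how unusual an observation's x is, not whether its y value is off the trend.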
Explain lower leverage
outside of the other data points but won’t change the line
explain high leverage
outside of the other data points by a lot
explain lower leverage low influence
near to the line
explain high leverage low influence
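x is far from x bar, but the point falls close to the trend of the other points, so removing it barely changes the line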
How is an outlier determined
if its standardized residual is less than -2 or greater than +2
What is an outlier denoted as on a computer print out
R
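A rough sketch of the plus-or-minus 2 check, applied to standardized residuals: each residual is divided by its estimated standard deviation, s times the square root of (1 - h_i), where h_i is the observation's leverage. The data and the helper name `standardized_residuals` are assumptions for illustration, and statistical packages may apply their own flagging rules.

```python
import math


def standardized_residuals(x, y):
    """Return each residual divided by its estimated standard deviation."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx  # leverage of this observation
        result.append(e / (s * math.sqrt(1 - h)))
    return result


# Illustrative data (assumed): y = 2x except for one point that is far too high.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 6, 8, 30, 12, 14, 16, 18, 20]
std_res = standardized_residuals(x, y)
flagged = [i for i, z in enumerate(std_res) if abs(z) > 2]
print(flagged)
```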
if we only have one variable, how can we predict another amount
by using the mean
if we are only using one variable to predict, what is the best fit line
the mean
What do you use to measure of how well the estimated regression line FITS the data
r^2, the coefficient of determination
What test do we use to test whether B1 is significant
t test: t = b1 / sb1