HBX- BA - 4 Flashcards

Question

Answer 1

R² (R-squared) measures how closely a regression line fits a data set. It is a standardized measure of the regression line’s explanatory power, defined as the percentage of total variation in the dependent variable, y, that is explained by the regression line.

In single variable linear regression, i.e. a regression model that has only one independent variable, R² is the square of the correlation coefficient between the independent and dependent variables.

**R² = (Correlation Coefficient)²**

Equivalently, the correlation coefficient is the positive or negative square root of **R²**. The sign is determined by whether there is a positive or negative relationship between the two variables.

**Correlation Coefficient (R) = ±√R²**

**_EXCEL: Adding R-squared to a Scatter Plot_**

* We can quickly find the R² for a single variable linear regression by creating a scatter plot of the two variables.
* When we select **Trendline** and **check the Display Equation box** to display the equation of the best fit line, we can also check the **Display R-squared Value** box to display the R² value on the scatter plot.

**R-squared can only take on values between 0 and 1. Let's look at those two extremes.**

* When R-squared equals 0, the regression line explains none of the variation in the dependent variable.
* When R-squared equals 1, the line explains all of the variation in the dependent variable. The regression line fits the data perfectly.

As always, it's critical that we consider the problem we're trying to solve and its context before performing or evaluating a regression analysis.

* Although we may wish for an R-squared close to one, there are many contexts for which this is not realistic.
* **Low R-squared values are sometimes expected and accepted. Know your field and your question.**
* In fields such as human behavior, lower R-squared values are both expected and accepted because human behavior is difficult to predict. For example, suppose we wish to know how well a person's tenth grade reading speed predicts his or her lifetime salary. There may be a relationship between the two variables, but we would not expect reading speed to be a very good predictor of salary. An R-squared of 0.1 might be considered high in this case.
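
The identity R² = (Correlation Coefficient)² can be checked numerically. The sketch below is a minimal illustration in plain Python, not part of the course material: the data set and the helper names `fit_line`, `r_squared`, and `correlation` are assumptions made up for this example. It computes R² as the share of variation explained by the best fit line and compares it with the squared correlation coefficient.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares slope and intercept for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(x, y):
    """Share of the total variation in y explained by the fitted line."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    total = sum((yi - mean_y) ** 2 for yi in y)
    unexplained = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - unexplained / total


def correlation(x, y):
    """Correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r2 = r_squared(x, y)
print(f"correlation = {r:.4f}, R-squared = {r2:.4f}, correlation squared = {r * r:.4f}")
```

For this small data set both routes give an R² of 0.6, so the line explains 60% of the variation in y.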

Answer 2

**0.74** Remember that for a single variable linear regression, R² is the square of the correlation coefficient. Here, the correlation coefficient is 0.86, so R² = 0.86² ≈ 0.74.

Answer 3

**0.80** The independent variable explains a lot of the variation in the dependent variable, but not quite all of it. In total, the data points are close to the best fit line, but they do not lie on it. Thus, an R² of 0.80 seems like a good estimate.

Answer 4

the likelihood that we would select a sample at least as extreme as the one we observed if the null hypothesis were true

Answer 5

**We need to do this because:** if the 95% confidence interval for the slope does not include zero, we can be 95% confident that the true value of the slope is not zero and thus that a significant relationship exists between the variables.

1. **Check whether the confidence interval for the line's slope contains zero**

   Remember that the coefficients of the regression line are just estimates of the true linear relationship between the dependent and independent variables. A coefficient’s lower 95% and upper 95% values give us the lower and upper bounds of the 95% confidence interval for that coefficient.

   Recall that if the best fit regression line has a slope of zero, then the regression line is just a flat line equal to the mean of the dependent variable, indicating that there is no linear relationship between the two variables. A 95% confidence interval for the slope that excludes zero therefore indicates a significant relationship.

   In the example regression output, we can say we are 95% confident that the true slope of the regression line describing the relationship between selling price and house size is between 196.10 and 314.63. Because this range does not include the value zero, we can be 95% confident that there is a significant linear relationship between the variables.

2. **Check whether the p-value is less than 0.05**

   As we noted earlier, regression analysis builds on hypothesis testing. In fact, a single variable linear regression analysis is equivalent to a hypothesis test whose null hypothesis is that the slope is zero. Recall that the p-value for a hypothesis test is the likelihood that we would select a sample at least as extreme as the one we observed if the null hypothesis were true.

   **The p-value associated with a regression coefficient** is the likelihood of choosing a sample at least as extreme as the sample we used to derive the regression equation if the slope of the true regression line is actually zero, or equivalently, if there is no linear relationship between the two variables.

   In the example regression output, since the p-value for house size, 0.0000, is less than 0.05, we reject the null hypothesis that the slope is zero and can be confident that there is a significant linear relationship between selling price and house size. (We can ignore the p-value of the intercept coefficient because the y-intercept is just a constant. It does not represent an independent variable and thus provides no information about the significance of the relationship between two variables.)

*Recall that a significance level of 5% corresponds to a confidence level of 95%, so checking whether a regression coefficient’s p-value is less than 5% is equivalent to checking whether the coefficient’s 95% confidence interval contains zero. Both approaches test whether or not we can be 95% confident that there is a significant linear relationship between the variables.*
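
To make the confidence interval check concrete, here is a minimal Python sketch of how a 95% confidence interval for a slope can be built and tested against zero. Everything in it is an assumption for illustration: the five data points are made up, the helper name `slope_confidence_interval` is hypothetical, and the critical value 3.182 (the two-sided 95% t value for 3 degrees of freedom) is passed in by hand rather than computed. It does not reproduce the house size output discussed above.

```python
from math import sqrt


def slope_confidence_interval(x, y, t_critical):
    """Return (slope, lower, upper) for a single-variable regression slope.

    t_critical is the two-sided critical value of the t-distribution with
    n - 2 degrees of freedom; the caller must supply it (assumption).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Standard error of the slope estimate.
    std_err = sqrt(sse / (n - 2) / sxx)
    margin = t_critical * std_err
    return slope, slope - margin, slope + margin


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT_95_DF3 = 3.182  # 95% confidence, n - 2 = 3 degrees of freedom

slope, lower, upper = slope_confidence_interval(x, y, T_CRIT_95_DF3)
significant = not (lower <= 0 <= upper)
print(f"slope = {slope:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), significant: {significant}")
```

The interval here lies entirely above zero, so the relationship is significant at the 5% level. The p-value check needs the t-distribution itself and is left out of this sketch.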

Answer 6

* -11.89; -2.17
    * Remember that the 95% confidence interval of the slope must contain zero to indicate that the linear relationship is not significant at the 5% level. -11.89 and -2.17 are both negative, so this range does not contain zero.
* 25.11; 44.37
    * Remember that the 95% confidence interval of the slope must contain zero to indicate that the linear relationship is not significant at the 5% level. 25.11 and 44.37 are both positive, so this range does not contain zero.
* **-20.00; 5.00**
    * **The range between -20.00 and 5.00 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct.**
* **-0.36; 0.55**
    * **The range between -0.36 and 0.55 contains zero, which indicates that the linear relationship is not significant at the 5% level. Note that another option is also correct.**
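
The rule applied to each option can be sketched in a few lines of Python. The function name `is_significant` is an assumption for this example. The four intervals are the ones listed in this answer.

```python
def is_significant(lower, upper):
    """True when the 95% confidence interval for the slope excludes zero."""
    return not (lower <= 0 <= upper)


# The four candidate intervals from this answer.
intervals = [(-11.89, -2.17), (25.11, 44.37), (-20.00, 5.00), (-0.36, 0.55)]
results = {interval: is_significant(*interval) for interval in intervals}

for (lower, upper), significant in results.items():
    label = "significant" if significant else "not significant"
    print(f"({lower}; {upper}): {label} at the 5% level")
```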

Answer 7

**Yes** Since the p-value of the independent variable, 0.0000, is less than 0.05, we can be 95% confident that there is a significant linear relationship between gross box office and home video units. We could also note that (19.58; 22.95), the 95% confidence interval for the slope, does not contain zero.

Answer 8

**Close to 1** The data points are very close to the line, so our regression line must explain a very large proportion of the variation in y. This would indicate a very high R².

Answer 9

**Less than 0.05** A p-value less than 0.05 indicates that we can be 95% confident that the true slope is not zero, that is, that there is a significant linear relationship between the two variables. This graph provides strong evidence that there is a significant linear relationship between the two variables.

Answer 10

**Smaller** A smaller R² means that less variation in the dependent variable y is explained by the regression line. Compared with the previous graph, the data points here are more dispersed around the regression line, indicating that less of the variation is explained by the regression line.

Answer 11

**Less than 0.05** Even though the regression line has a smaller R² than the previous regression, there is still clearly a strong linear relationship between the variables. A p-value less than 0.05 indicates that we can be 95% confident that the true slope is not zero, i.e., that there is a significant linear relationship.

Answer 12

**_The 1st graph has:_**

* **High R² (0.99)**: A large portion of the variation in y is explained by the regression line.
* **Low p-value (0.0000)**: There is a significant linear relationship between the dependent and independent variables.

**_The 2nd graph has:_**

* **Lower R² (0.70)**: A smaller portion of the variation in y is explained by the regression line than in the previous graph.
* **Low p-value (0.0000)**: There is a significant linear relationship between the dependent and independent variables, even though the R² is lower than in the previous graph.

Answer 13

The residual plot is a scatter plot with residuals on the y-axis and the independent variable on the x-axis. The plot graphically represents the residual (the difference between the observed value and predicted value of the dependent variable) for each observation. Examining residual plots can provide significant insight into the relationships among variables and the validity of the assumptions underlying regression models.

**_How to Create One:_**

* We first take each observed data point and measure its residual, the vertical distance from that point to the regression line.
* Then we graph each residual against the independent variable to form the residual plot.
* **If there is a linear relationship** between the dependent and independent variables and the assumptions underlying regression analysis hold, **_we should not see any systematic pattern in the residual plot._** The residuals should be spread randomly above and below the horizontal axis. Specifically, based on the assumptions underlying linear regression, the residuals should follow a normal distribution with mean zero and a fixed variance.
* **If we do see a pattern in the residual plot, _then it's possible that other factors may be influencing the dependent variable or that the linear model may not be the best fit for the data_.** For example, if the residuals appear to have a curved shape, there may be a nonlinear relationship between the dependent and independent variables.
* **If the residuals become larger as we move along the x-axis, we may be encountering a phenomenon known as _heteroscedasticity._** In heteroscedastic relationships, the variance of the residuals changes systematically as the independent variable changes. This violates the assumption that the error terms follow a normal distribution with fixed variance.
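
The curved-pattern case can be illustrated with a short Python sketch. The data are hypothetical (y is simply x squared) and the helper name `residuals_from_line` is an assumption for this example. It fits a straight line and lists the residuals that a residual plot would show.

```python
def residuals_from_line(x, y):
    """Fit a least-squares line and return observed minus predicted y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical data with a nonlinear (quadratic) relationship.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

resid = residuals_from_line(x, y)
for xi, ri in zip(x, resid):
    print(f"x = {xi}: residual = {ri:+.1f}")
```

The residuals come out positive at both ends and negative in the middle, a U shape rather than a random scatter, which is the kind of systematic pattern that suggests a straight line is not the best fit.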

Answer 14

A characteristic of the distribution of the residuals (error terms) in a regression. The error terms are heteroskedastic if the size of the error terms depends systematically upon the value(s) of the independent variable(s). Examining residual plots for patterns is useful for identifying heteroskedasticity (for example, if the error terms grow larger as the value of the independent variable grows larger, a classic funnel shape may be visible in the residual plot). Inferences drawn from a regression analysis with heteroskedastic error terms are suspect.
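
As a rough numerical companion to eyeballing a funnel shape, the sketch below compares the average size of the residuals for the lower and upper halves of the independent variable. This is a crude heuristic made up for illustration, not a formal statistical test. The residual values are invented rather than taken from an actual fit, and the helper name `spread_by_half` is hypothetical.

```python
def spread_by_half(x, residuals):
    """Mean absolute residual for the lower and upper halves of x."""
    pairs = sorted(zip(x, residuals))  # order observations by x
    mid = len(pairs) // 2
    lower = [abs(r) for _, r in pairs[:mid]]
    upper = [abs(r) for _, r in pairs[mid:]]
    return sum(lower) / len(lower), sum(upper) / len(upper)


# Invented residuals whose size grows with x (funnel shape).
x = [1, 2, 3, 4, 5, 6, 7, 8]
resid = [0.1, -0.2, 0.3, -0.4, 1.0, -1.5, 2.0, -2.5]

low, high = spread_by_half(x, resid)
print(f"mean |residual| for small x: {low:.2f}, for large x: {high:.2f}")
```

A much larger spread in the upper half than in the lower half is consistent with the funnel shape described above and is a cue to inspect the residual plot more closely.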

Answer 15

* **R²** measures the percent of total variation in the dependent variable, y, that is explained by the regression line.
* **Analyze the p-value**: we must test whether the relationship between the dependent and independent variable is significant and whether the linear model is a good fit for the data.
    * Note that the p-value and R² provide different information. A linear relationship can be significant (have a low p-value) but not explain a large percentage of the variation (not have a high R²).
* **Check the confidence interval** associated with an independent variable’s coefficient, which indicates the likely range for that coefficient. If the 95% confidence interval does not contain zero, we can be 95% confident that there is a significant linear relationship between the variables.
* **Residual plots** can provide insights into whether a linear model is a good fit.

Answer 16

* **We can be 90% confident that there is a significant linear relationship between the two variables.**
    * Since the p-value, 0.0210, is less than 1 - 0.90 = 0.10, we can be 90% confident that there is a significant linear relationship between the two variables. Note another option is also correct.
* **We can be 95% confident that there is a significant linear relationship between the two variables.**
    * Since the p-value, 0.0210, is less than 1 - 0.95 = 0.05, we can be 95% confident that there is a significant linear relationship between the two variables. Note another option is also correct.
* We can be 98% confident that there is a significant linear relationship between the two variables.
    * Since the p-value, 0.0210, is greater than 1 - 0.98 = 0.02, we cannot be 98% confident that there is a significant linear relationship between the two variables.
* We can be 99% confident that there is a significant linear relationship between the two variables.
    * Since the p-value, 0.0210, is greater than 1 - 0.99 = 0.01, we cannot be 99% confident that there is a significant linear relationship between the two variables.
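
The comparison made for each option, p-value against 1 minus the confidence level, can be written as a short Python sketch. The function name `confident_at` is an assumption for this example. The p-value and confidence levels are the ones from this answer.

```python
def confident_at(p_value, confidence_level):
    """True when the p-value is below the matching significance level."""
    return p_value < 1 - confidence_level


p_value = 0.0210
levels = [0.90, 0.95, 0.98, 0.99]
results = {level: confident_at(p_value, level) for level in levels}

for level, ok in results.items():
    verdict = "significant" if ok else "not significant"
    print(f"{level:.0%} confidence: {verdict}")
```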

Answer 17

**Low R-squared, Low p-value** A low R-squared and a low p-value indicate that the independent variable explains little of the variation in the dependent variable, but that the linear relationship between the two variables is nonetheless significant.

Answer 18

* **From the Data menu, select Data Analysis -> Regression.**
* **Enter the appropriate Input Y Range and Input X Range:**
    * **The Input Y Range is the dependent variable**, in this case selling price. The data, including the label, are in column C: C1:C31.
    * **The Input X Range is the independent variable**, in this case house size. To ensure the independent variable is labeled correctly in the output table, enter the data with its label in column B, B1:B31.
* **Check the Labels box.**
* **Check the Residuals and Residual Plots boxes to ensure we see the relevant residual information.**

\*With dummy variables, follow the exact same steps; just make sure the formula that creates the dummy variable is entered correctly in the table before you include it in the analysis.

\*We interpret a dummy variable’s coefficient in the same way we interpret a coefficient for a quantitative independent variable.

\*The regression analysis gives us more information than the hypothesis test alone would. Rather than simply calculating the p-value, rejecting the null hypothesis and concluding that there is a significant linear relationship, the regression results provide the direction and magnitude of this relationship.

Answer 19

The expected selling price of homes in school districts where students have low SAT scores is B15+B16\*0=B15=$389,376. You must link directly to the values in order to obtain the correct answer.

Answer 20

The average selling price of homes, given they are located in school districts where students have low SAT scores, can be calculated as **AVERAGEIF(B2:B31,0,C2:C31) = $389,376.**

Answer 21

The expected selling price of homes in school districts where students have average SAT scores above 1700 is **B15+B16\*1=B15+B16=$809,100.** You must link directly to the values in order to obtain the correct answer.
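
The last three answers fit together: with a single dummy variable, the intercept equals the average of the group coded 0, and the intercept plus the dummy coefficient equals the average of the group coded 1. The Python sketch below illustrates this with a made-up data set of six homes (prices in thousands of dollars). The data and the helper names are assumptions for illustration and are not the course's housing data.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one independent variable."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def group_mean(dummy, values, group):
    """Average of values where the dummy equals group (like AVERAGEIF)."""
    selected = [v for d, v in zip(dummy, values) if d == group]
    return sum(selected) / len(selected)


# Hypothetical data: dummy = 1 for high-SAT districts, 0 otherwise.
dummy = [0, 0, 0, 1, 1, 1]
price = [380, 400, 390, 800, 820, 810]  # in thousands of dollars

coefficient, intercept = fit_line(dummy, price)
low_sat_mean = group_mean(dummy, price, 0)
high_sat_mean = group_mean(dummy, price, 1)

print(f"intercept = {intercept:.1f}, mean of group 0 = {low_sat_mean:.1f}")
print(f"intercept + coefficient = {intercept + coefficient:.1f}, mean of group 1 = {high_sat_mean:.1f}")
```

This is why the regression prediction with the dummy set to 0 and the AVERAGEIF formula return the same $389,376 in the answers above.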

Answer 22

**THREE MAIN PARTS TO THE TABLE**

* the Regression Statistics table,
* the ANOVA table, and
* the Regression Coefficients table.

\*Review more tables

Answer 23

“Stock of Corn at Start of Year” is the independent variable, and “Corn Acreage Planted” is the dependent variable. The beginning stock of corn at the start of the year will be used to predict the number of acres of corn that are planted.

Answer 24

**80.36%** R² is the proportion of the variation in home video units that is explained by this model: 80.36% of the variation in home video units can be explained by the relationship with gross box office sales.

Answer 25

The expected number of home video units that will be sold is B15+B16\*360=7,074 thousand. You must link directly to the values in order to obtain the correct answer.