HBX- BA - 5 Flashcards

Question

Creating the Multiple Regression Output Table

Answer 1

* Step 1: From the Data menu, select Data Analysis, then select Regression.
* Step 2: Enter the appropriate Input Y Range and Input X Range:
  * The Input Y Range is the dependent variable, in this case selling price. The data are in column D with its label, D1:D31.
  * **The Input X Range should include both independent variables,** in this case house size and distance from Boston. To ensure that the independent variables are labeled correctly in the output table, enter the data with their labels in columns B and C, B1:C31.
  * _Note that to run a regression in Excel, the independent variables must be in contiguous columns (columns that share a common border)._
  * Since we included the cells containing the variables’ labels when inputting the ranges, check the Labels box.
* Step 3: Scroll down and check the Residuals and Residual Plots boxes so that the output includes the relevant residual information. You will not be able to submit if you do not include the residual plots.
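For readers who want to see what the Regression tool computes, here is a minimal sketch of the same least-squares fit outside Excel. It is not the module's method: the helper name `fit_ols` and every data value are hypothetical, with size in thousands of square feet, distance in miles, and price in thousands of dollars. The prices were generated from known coefficients (50, 100, and -2), so the fit should recover them up to rounding.

```python
def fit_ols(x_rows, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 + ... via the normal equations."""
    # Prepend a 1 to every row so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in x_rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Hypothetical data: (house size, distance from Boston) for five houses.
x_rows = [(1.5, 10), (2.0, 5), (1.2, 20), (2.5, 15), (1.8, 8)]
# Hypothetical selling prices, built as 50 + 100*size - 2*distance.
price = [180.0, 240.0, 130.0, 270.0, 214.0]

intercept, b_size, b_distance = fit_ols(x_rows, price)
print(intercept, b_size, b_distance)
```

As in Excel, both independent variables are passed together (one tuple per row), which mirrors the requirement that the Input X Range be a single contiguous block.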

Answer 2

* House Price Index
* Unemployment Rate
* Disposable Income
* Home Owner Vacancy Rates

House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates are the independent variables used to create the regression model. Housing Starts (thousands) is the dependent variable. Year and Quarter are not included as either a dependent or an independent variable.

Answer 3

* Input Y Range: B1:B81
* Input X Range: C1:F81

Answer 4

3

There are three independent variables included in the model: cylinders, engine displacement, and passenger volume. The three independent variables are selected from the specified attributes we wish to include. Make and model variables are not included as independent variables for this multiple regression model.

Answer 5

* Input Y Range: F1:F59
* Input X Range: C1:E59

Answer 6

Multicollinearity occurs when two or more independent variables are so highly correlated that it is difficult for the regression model to separate the effect each variable has on the dependent variable. Multicollinearity can obscure the results of a regression analysis.

* If adding a new independent variable decreases the significance of another independent variable that was previously significant, multicollinearity may well be the culprit.
* Another symptom of multicollinearity is an R-squared that is high even though none of the independent variables are significant.

**If a variable that was significant becomes insignificant when we add a new variable to the regression model, we can usually attribute it to a relationship between two or more of the independent variables.** (You can check significance with the p-value.) **One way to detect multicollinearity is to check whether any variable's p-value increases when a new independent variable is added.**

* **If we're using the regression model to make predictions,** multicollinearity is usually not a problem, so we might keep the lot size variable in the model. It improves the adjusted R-squared, and more importantly, our managerial judgment tells us that lot size should have an impact on price that is separate from the effect of house size.
* **If we're trying to understand the net effects of the independent variables,** then multicollinearity is a problem that should be addressed.

**The best way to reduce multicollinearity is simply to increase the sample size.** More observations may help discern the net effects of the individual independent variables. We can also reduce or eliminate multicollinearity by removing one of the collinear independent variables. Doing this requires careful analysis of the relationships among the variables.
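The card detects multicollinearity by watching p-values. A complementary check, not taken from the module, is to measure directly how strongly two independent variables are correlated. The sketch below computes a Pearson correlation coefficient; the helper name `pearson_r` and the house size and lot size values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical values (square feet) for two independent variables.
house_size = [1500, 2000, 1200, 2500, 1800]
lot_size = [6000, 8100, 5000, 9900, 7300]

r = pearson_r(house_size, lot_size)
print(f"correlation between house size and lot size: {r:.3f}")
```

A correlation close to 1 or -1 between two independent variables suggests the regression may struggle to separate their effects; how close is "too close" remains a judgment call.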

Answer 7

=IF(logical\_test, [value\_if\_true], [value\_if\_false])

The steps below use an example from the module.

* Step 1
  * In cell C2, enter the function =IF(B2="Monday",1,0).
  * This function says that if cell B2 equals "Monday", then enter a 1 in cell C2, and if cell B2 does not equal "Monday", then enter a 0 in cell C2.
  * You can also enter =IF(B2=$C$1,1,0), since cell C1 contains "Monday". Note that you must lock cell C1 (with the $ signs) so that when you copy the function, it continues to reference C1.
* Step 2
  * Copy and paste the formula from cell C2 into cells C3:C32.
  * This assigns a dummy variable value in column C for each data point in column B.
  * To use auto-fill, enter the formula in cell C2. Highlight C2 and place your cursor at the bottom right-hand corner of the cell. The cursor will turn into a black cross. Drag the cross down the column until you reach cell C32. When you release the mouse, the values will auto-fill.
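The same dummy-variable logic can be sketched outside Excel. The list `days` below is a hypothetical stand-in for column B, and `monday_dummy` plays the role of column C.

```python
# Hypothetical day-of-week values standing in for column B.
days = ["Monday", "Tuesday", "Monday", "Friday"]

# Equivalent of =IF(B2="Monday",1,0) copied down the column.
monday_dummy = [1 if day == "Monday" else 0 for day in days]
print(monday_dummy)
```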

Answer 8

Base case: the category of a categorical variable for which a dummy variable is NOT included in a regression model. A regression model with a categorical variable that has n categories should have n-1 dummy variables. The coefficients of the dummy variables included in the regression model are interpreted in relation to the base case. The analyst can select any category to be excluded from the regression model; however, different base cases lead to different interpretations of the dummy variables’ coefficients.

For example, suppose we are trying to determine the average difference in height between men and women in a sample, and suppose that on average men are 5 inches taller than women in the sample. If we use Female as the base case, then the coefficient for the dummy variable for Male would be +5. If we use Male as the base case, the coefficient for the dummy variable for Female would be -5.
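The sign flip in the height example can be checked with a small sketch. It relies on the fact that, in a regression with a single dummy variable and no other independent variables, the dummy's coefficient equals the difference between the two group means. The helper name `dummy_coefficient` and the heights (in inches) are hypothetical, chosen so that men average 5 inches taller.

```python
def dummy_coefficient(values, categories, base_case):
    """Mean of the non-base category minus the mean of the base case."""
    base = [v for v, c in zip(values, categories) if c == base_case]
    other = [v for v, c in zip(values, categories) if c != base_case]
    return sum(other) / len(other) - sum(base) / len(base)


# Hypothetical sample: three women and three men.
heights = [63, 65, 64, 68, 70, 69]
sex = ["Female", "Female", "Female", "Male", "Male", "Male"]

male_coef = dummy_coefficient(heights, sex, base_case="Female")
female_coef = dummy_coefficient(heights, sex, base_case="Male")
print(male_coef, female_coef)
```

Switching the base case changes only the sign of the coefficient, not the size of the difference it describes.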

Answer 9

17

For each categorical variable, we must use one fewer dummy variable than the number of categories it has. Since month and day of week are separate categorical variables, we subtract one for each. Thus we would use 12−1=11 dummy variables for month and 7−1=6 for day of week, giving a total of 17 dummy variables.

Answer 10

**Sales=−631,085+533,024(Red)+50.5(Advertising)**

We always interpret the coefficient of a dummy variable as being the expected change in the dependent variable when that dummy variable equals one compared to the base case. In this case, controlling for advertising, we expect sales for red sneakers to be $533,024 more than blue sneakers.

It is helpful to view this model graphically. Consider two parallel regression lines: one for red sneakers and one for blue sneakers. The vertical distance between the lines is the average increase in sales the manager can expect when red sneakers are sold versus blue sneakers, controlling for advertising. The slope of the two lines, 50.5, is the same: it tells us the average increase in sales, controlling for sneaker color, as we increase advertising by $1.
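To make the "two parallel lines" picture concrete, here is a minimal sketch that evaluates the equation above for red and blue sneakers. The function name `predicted_sales` and the $40,000 advertising level are hypothetical; the coefficients are the ones in the equation.

```python
def predicted_sales(red, advertising):
    """Sales = -631,085 + 533,024(Red) + 50.5(Advertising); red is 1 or 0."""
    return -631_085 + 533_024 * red + 50.5 * advertising


advertising = 40_000  # hypothetical advertising spend in dollars

red_sales = predicted_sales(1, advertising)
blue_sales = predicted_sales(0, advertising)

# Vertical distance between the two lines at the same advertising level.
gap = red_sales - blue_sales
print(red_sales, blue_sales, gap)
```

Whatever advertising level is chosen, the gap between the red and blue predictions stays equal to the dummy variable's coefficient, which is what parallel lines mean here.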

Answer 11

A type of independent variable often used in a regression analysis. When data are collected as a time series, a regression analysis is often performed by analyzing values of the dependent variable with values of independent variables from the same time period. However, if researchers hypothesize that there is a relationship between the dependent variable and values of an independent variable from a previous time period, they may include a “lagged variable”, that is, an independent variable based on data from a previous time period.

*Advertising provides a good example because its effects often persist. For example, last year’s advertising may still influence this year’s sneaker sales. We can incorporate the delayed effect of an independent variable on the dependent variable using a lagged variable.*

**_Spreadsheet: Let’s walk through how to create a lagged variable._**

Sales=−631,085+533,024(Red)+50.5(Advertising)

* Step 1: Copy the advertising data in range C2:C11.
* Step 2: To create the lagged variable, paste the advertising data into the range D3:D12 in Column D, under the title "Previous Year's Advertising." That is, the value from C2 will be pasted into D3, from C3 into D4, and so on until the value in C11 is pasted into D12. For example, in D3, the value for 2005 Previous Year’s Advertising will be the advertising expenditure for 2004, $35,000.
  * When completed properly, Row 12 should contain only one observation (in D12). Since we do not have advertising data for 2003, we do not know Previous Year’s Advertising for 2004; thus, D2 should be blank.
  * Note: Rather than copying and pasting, you may also choose to link directly to cells (for example, cell D3 would contain the formula =C2).

**_Points:_**

* The first row has all the necessary data except a lagged value.
* The last row has only a lagged value.
* Since every observation we use needs a value for each variable, **we must remove both the first observation and the newly added row.**
* Thus, by introducing a lagged variable, **we lose one data point. We run and interpret a regression with lagged variables as we would any other multiple regression model.**

We also need to think carefully about what the appropriate lag time should be. How long do we think the effects of an advertising campaign would last? A month, six months, a year? Since we have only annual data, we can only analyze effects in yearly increments. For example, if we believe that the effects last two years, we can include an additional variable with a two-year lag. We have to be careful, though. Remember that each additional lagged variable reduces the number of observations we can use, and hence may reduce accuracy and explanatory power.

**_Adding a lagged variable is costly in two ways:_**

* Each lagged variable creates an incomplete line of data. If we have a single lagged variable, our first observation will be incomplete. If we have two lagged variables, our first two observations will be incomplete, and so on. The loss of each data point decreases our sample size by one, which reduces the precision of our estimates of the regression coefficients.
* In addition, if the lagged variable, or variables, do not increase the model’s explanatory power, the addition of the variable decreases Adjusted R-squared, just as the addition of any variable to a regression model can.

**We include lagged variables only if we believe the benefits of doing so outweigh the loss of one or more observations** and the “penalty” imposed by the adjustment to R-squared.

Despite those costs, lagged variables can be very useful. Because they pertain to previous time periods, they are usually available ahead of time. They are often good leading indicators, which help us predict future values of a dependent variable.
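The copy-down-one-row step can be sketched outside the spreadsheet as follows. Only the 2004 advertising figure ($35,000) comes from the card; the other years' values are hypothetical, and the series is shortened to four years for brevity.

```python
years = [2004, 2005, 2006, 2007]
# 2004 value is from the card; the remaining values are hypothetical.
advertising = [35_000, 40_000, 38_000, 45_000]

# Shift advertising down one row: each year gets the previous year's value.
# The first year has no previous value, so it is marked as missing (None).
previous_year_advertising = [None] + advertising[:-1]

# Keep only rows where every variable has a value.
complete_rows = [
    (year, adv, lag)
    for year, adv, lag in zip(years, advertising, previous_year_advertising)
    if lag is not None
]
print(complete_rows)
```

Unlike the spreadsheet paste, this version never creates the extra trailing row, but the cost is the same: one observation is lost for each year of lag.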

Answer 12

Reducing the number of usable rows means that the labels are no longer contiguous with the data of interest, so you should leave the Labels box unchecked. **Note that _NOT_ checking the Labels box is unique to this data setup** (when the lagged data has created blank cells between the number values and the column labels). Generally, you would want to use labels in a regression.

* Step 1: **Select Data, then Data Analysis, then Regression.**
* Step 2: **Enter your Input Y range** as B3:B11. (Notice that we cannot use the data for Sales in B2 since we do not have an entry for D2.)
* Step 3: **Enter your Input X range** as C3:D11. (Notice that we cannot use the data for Advertising for 2004 in C2 since we do not have an entry for D2. Moreover, we cannot use the data in D12 since we don’t have data for the other variables for 2014.)
* Step 4: **Check the Residuals and Residual Plots boxes, but DO NOT check the Labels box.** Click OK to start the regression analysis.

Answer 13

We may be able to reduce multicollinearity by either increasing the sample size or removing one (or more) of the collinear variables.

Answer 14

DO IT- THIS STAYS PURPLE UNTIL YOU DO SO