Module 5 – Multiple Regression Flashcards
Below is a residual plot from a model predicting the cost (in cents) of a standard postage stamp from the year the stamp was issued. Do you think the linear model is a good fit for the data?
- Yes, this looks like a good regression model.
- No, this looks like a non-linear relationship.
- No, there is heteroskedasticity.
Yes, this looks like a good regression model.
There is at least one aspect of the graph that should concern us.
No, this looks like a non-linear relationship.
The residuals create a curved shape; they are not linear.
No, there is heteroskedasticity.
If there were heteroskedasticity, there would be more variability at some points than others; in this plot, there is little variability at any point.
The owner of an electronics shop creates a regression model to help determine the price of a TV based on the size of its screen (in square inches), the quality of its picture (rated on a scale of 1 to 10 by a panel of judges) and the quality of its sound (also rated on a scale of 1 to 10 by a panel of judges). The regression equation is:
Price = 256 + 1.60 (Screen Size) + 0.55 (Picture Quality) + 0.32 (Sound Quality)
What is the coefficient for Picture Quality?
- 256
- 1.60
- 0.55
- 0.32
256
$256 is the y-intercept.
1.60
1.60 is the coefficient for Screen Size.
0.55
0.55 is the coefficient for Picture Quality
0.32
0.32 is the coefficient for Sound Quality.
The table below contains a partial view of data about a random sample of 58 car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).
We want to create a regression model to analyze the relationship between CO2 and the specific attributes of the vehicle: the number of cylinders, the engine displacement, and the passenger volume. Based on this data table, how many independent variables are included in the model?
- 5
- 4
- 3
- 2
- 1
3
There are three independent variables included in the model: cylinders, engine displacement, and passenger volume. The three independent variables are selected from the specified attributes we wish to include. Make and model variables are not included as independent variables for this multiple regression model.
The table below contains a partial view of data about a random sample of 58 car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).
In order to create a regression model to analyze the relationship between CO2 and the independent variables of interest (the number of cylinders, the engine displacement, and the passenger volume), which cell references should be entered?
- Input Y Range: F1:F59 / Input X Range: C1:E59
- Input Y Range: C1:E59 / Input X Range: F1:F59
- Input Y Range: F1:F59 / Input X Range: A1:E59
- Input Y Range: A1:E59 / Input X Range: F1:F59
Input Y Range: F1:F59 / Input X Range: C1:E59
The “Input Y Range” denotes the cell reference for the dependent variable, CO2 Emissions. The data for the dependent variable is in F1:F59. The “Input X Range” denotes the cell references for the independent variables: the number of cylinders, the engine displacement, and the passenger volume. The data for the independent variables is in C1:E59.
Data contained in columns A and B, make and model, are not included as a dependent or independent variable in the regression model.
The spreadsheet below contains data about a random sample of car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).
Create a regression model to analyze the relationship between CO2 and the independent variables of interest (the number of cylinders, the engine displacement, and the passenger volume). Be sure to include data labels, and include the residuals and residual plots in your analysis. Set the Output Range to H1.
From the Data menu, select Data Analysis, then select Regression. The Input Y Range is F1:F59 and the Input X Range is C1:E59. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.
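For readers working outside Excel, the least-squares fit that the Regression tool performs can be sketched in plain Python. This is only an illustration: the data is a made-up toy set (not the car data), `fit_ols` and `predict` are hypothetical helper names, and solving the normal equations directly is reasonable for small examples but is not necessarily what Excel does internally.

```python
def fit_ols(X, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X)b = X'y."""
    rows = [[1.0] + list(r) for r in X]  # prepend a column of 1s for the intercept
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, x_row):
    """Intercept plus the sum of coefficient * value for each independent variable."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], x_row))


# Toy data generated exactly from y = 2 + 3*x1 - 1*x2 (hypothetical values)
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]

coefs = fit_ols(X, y)  # [intercept, b1, b2]
residuals = [yi - predict(coefs, xi) for xi, yi in zip(X, y)]
```

Because the toy data lies exactly on a plane, the fitted coefficients recover 2, 3, and -1 and the residuals are essentially zero. With real data, the residuals are what the Residual Plots option charts.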
Below is a regression output table based on data from the 2014 Major League Baseball (MLB) season. The dependent variable is Win Percentage (the percentage of games won by MLB teams) in 2014. The independent variables are as follows: Runs (the average number of runs the team scored per game); ERA (the average number of runs the team allowed the opposing team to score per game); Completed Games (the total number of games with only one pitcher for the entire game); and Strikeouts (the total number of strikeouts for the season).
Which is the best estimate of the approximate amount of variability in Win Percentage that is explained by the model?
- 79.4%
- 63.0%
- 4.4%
- 10.6%
79.4%
0.7936 or 79.4% is the multiple R. Which number best represents the variability accounted for by the model?
63.0%
0.6298 or 63.0% is the R-square, which indicates how much variability is accounted for by the model.
4.4%
0.0443 or 4.4% is the standard error of the regression. Which number best represents the variability accounted for by the model?
10.6%
10.63 is the F-value. Which number best represents the variability accounted for by the model?
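The link between two of the numbers on this card can be checked with a line of arithmetic: R-square is the square of the multiple R. A minimal sketch using the values quoted in the feedback above:

```python
multiple_r = 0.7936          # Multiple R quoted on the card
r_square = multiple_r ** 2   # share of variability in Win Percentage explained

print(f"{r_square:.4f}")     # 0.6298
print(f"{r_square:.1%}")     # 63.0%
```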
If we have four sneaker colors (red, blue, green, and black) and we create dummy variables for red, blue, and green, which color is the base case?
- Red
- Blue
- Green
- Black
Red
The base case is the category that does not have a dummy variable. Red is one of the dummy variables.
Blue
The base case is the category that does not have a dummy variable. Blue is one of the dummy variables.
Green
The base case is the category that does not have a dummy variable. Green is one of the dummy variables.
Black
Black is the base case because it is the one category that is not a dummy variable.
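The base-case idea can be sketched in a few lines of Python. The helper name `dummy_encode` is hypothetical; the point is only that the category without its own dummy variable ends up with 0 in every dummy column.

```python
def dummy_encode(value, dummy_categories):
    """Return one 0/1 indicator per dummy category; the base case gets all 0s."""
    return [1 if value == category else 0 for category in dummy_categories]


dummies = ["red", "blue", "green"]  # black has no dummy, so it is the base case
encoded = {
    color: dummy_encode(color, dummies)
    for color in ["red", "blue", "green", "black"]
}
# encoded["blue"]  is [0, 1, 0]
# encoded["black"] is [0, 0, 0]
```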
Use the multiple regression model
Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)
where House Size is in square feet and Distance from Boston is in miles, to predict the selling price of a house that is 1,500 square feet and 10 miles from Boston.
The expected selling price of a 1,500 square foot home that is 10 miles from Boston is B15+B16*1,500+B17*10=$453,397.59. You must link directly to the values in order to obtain the correct answer.
Use the multiple regression model
Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)
where House Size is in square feet and Distance from Boston is in miles, to predict the selling price of a house that is 3,000 square feet and 20 miles from Boston.
The expected selling price of a 3,000 square foot home that is 20 miles from Boston is B15+B16*3,000+B17*20=$711,808.59. You must link directly to the values in order to obtain the correct answer.
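The two predictions above can be reproduced approximately from the equation as printed. This sketch uses the rounded coefficients shown in the equation, so its results differ from the card's answers by a few dollars. That gap is presumably why the card insists on linking to the unrounded spreadsheet cells. The function name `predict_price` is a hypothetical label.

```python
# Coefficients as printed in the regression equation (rounded to two decimals)
INTERCEPT = 194_986.59
B_SIZE = 244.54          # dollars per square foot
B_DISTANCE = -10_840.04  # dollars per mile from Boston


def predict_price(size_sqft, miles_from_boston):
    """Expected selling price from the multiple regression equation."""
    return INTERCEPT + B_SIZE * size_sqft + B_DISTANCE * miles_from_boston


price_1500_10 = predict_price(1_500, 10)  # about 453,396.19 with rounded coefficients
price_3000_20 = predict_price(3_000, 20)  # about 711,805.79 with rounded coefficients
```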
Assume we have created two simple linear regression models and a multiple regression model to predict selling price based on House Size alone, Distance from Boston alone, or both. The three models are as follows, where House Size is in square feet and Distance from Boston is in miles:
Selling Price = 13,490.45 + 255.36 (House Size)
Selling Price = 686,773.86 – 15,162.92 (Distance from Boston)
Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)
House A and House B are the same size, but located in different neighborhoods: House B is five miles closer to Boston than House A. If the selling price of House A is $450,000, what would we expect to be the selling price of House B?
- Approximately $396,000
- Approximately $504,000
- Approximately $526,000
- Approximately $699,000
- The answer cannot be determined without further information
Approximately $396,000
Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. Remember that as distance decreases, price increases so we would expect that House B would cost more than House A.
Approximately $504,000
Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. House B is five miles closer to Boston than House A so House B’s expected selling price is: House A’s selling price+net effect of distance on selling price ≈ $450,000+$10,840.04(5 miles) ≈ $450,000+$54,200.20 ≈ $504,200.20
Approximately $526,000
Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model.
Approximately $699,000
Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. However, we do not need to include the intercept in this calculation.
The answer cannot be determined without further information
Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model.
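The arithmetic behind this card can be sketched as follows. The sketch also computes what happens if the distance coefficient from the distance-only model is used by mistake; that figure appears to correspond to the $526,000 option, though the card does not say so explicitly.

```python
NET_EFFECT_PER_MILE = -10_840.04    # distance coefficient, multiple regression (controls for size)
GROSS_EFFECT_PER_MILE = -15_162.92  # distance coefficient, distance-only regression

price_a = 450_000
miles_change = -5  # House B is five miles closer to Boston than House A

# Same house size, so only the net effect of distance applies
price_b = price_a + NET_EFFECT_PER_MILE * miles_change            # about 504,200.20

# Using the gross effect would also pick up size differences correlated with distance
price_b_gross = price_a + GROSS_EFFECT_PER_MILE * miles_change    # about 525,814.60
```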
The census data below shows the number of building permits (in thousands) awarded monthly in each of the four major regions of the United States. The dataset includes a “Region” variable indicating the name of the region. Create a dummy variable called Midwest. Assign a 1 to observations from the Midwest and a 0 to all other observations.
There should be a 1 or 0 in cells D2:D89. In cell D2, enter the function =IF(B2="Midwest",1,0), then copy and paste this function in cells D3:D89.
Now using the same data, create a second dummy variable called Northeast. Assign a 1 to the observations from the Northeast and a 0 to all other observations.
There should be a 1 or 0 in cells E2:E89. In cell E2, enter the function =IF(B2="Northeast",1,0), then copy and paste this function in cells E3:E89.
Now using the same data, create a third dummy variable called South. Assign a 1 to the observations from the South and a 0 to all other observations.
There should be a 1 or 0 in cells F2:F89. In cell F2, enter the function =IF(B2="South",1,0), then copy and paste this function in cells F3:F89.
Note that we are not creating a variable called West. Why do you think this is?
- We are not interested in the data for the West.
- There is less variability in data for the West, so it is less important to study.
- We are leaving the West as the base case.
- The p-value for the West is more than .05 so it is not significant.
We are not interested in the data for the West.
We are interested in all data.
There is less variability in data for the West, so it is less important to study.
Even if there were less variability in the West, we would want to know about permitting patterns.
We are leaving the West as the base case.
We only need three dummy variables: If we know that a region is not Midwest, Northeast, or South, we know for sure that it is West.
The p-value for the West is more than .05 so it is not significant.
We cannot know the p-value for the West before we run the analysis.
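To see why a fourth dummy variable would add no information, here is a small sketch that builds the three dummies the same way the spreadsheet IF formulas do and then recovers the region from them. The function names are hypothetical.

```python
def region_dummies(region):
    """Mirror the spreadsheet IF formulas: one 0/1 column each for Midwest, Northeast, South."""
    return (
        1 if region == "Midwest" else 0,
        1 if region == "Northeast" else 0,
        1 if region == "South" else 0,
    )


def region_from_dummies(midwest, northeast, south):
    """Recover the region; all zeros can only mean the base case, West."""
    if midwest:
        return "Midwest"
    if northeast:
        return "Northeast"
    if south:
        return "South"
    return "West"


round_trip = {
    region: region_from_dummies(*region_dummies(region))
    for region in ["Midwest", "Northeast", "South", "West"]
}
```

Every region maps back to itself, so the three dummy columns already identify all four regions.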
Below is a residual plot from a model predicting the weight (in pounds) of Red Sox players from their height (in inches). Do you think the linear model is a good fit for the data?
- Yes, this looks like a linear relationship.
- No, this looks like a non-linear relationship.
- No, there is heteroskedasticity.
Yes, this looks like a linear relationship.
There is nothing in this residual plot to concern us.
No, this looks like a non-linear relationship.
The residuals are fairly evenly divided on both sides of 0; there does not appear to be a non-linear relationship.
No, there is heteroskedasticity.
If there were heteroskedasticity, there would be a systematic difference in variability at different points of the graph; this is not the case with this graph.
Below is a regression output table based on data from the 2014 Major League Baseball (MLB) season. The dependent variable is Win Percentage (the percentage of games won by MLB teams) in 2014. The independent variables are as follows: Runs (the average number of runs the team scored per game); ERA (the average number of runs the team allowed the opposing team to score per game); Completed Games (the total number of games with only one pitcher for the entire game); and Strikeouts (the total number of strikeouts for the season).
Which of the following independent variables are significant at the p < .05 level? SELECT ALL THAT APPLY
- Runs
- ERA
- Strikeouts
- Completed Games
ERA
The p-value column in the bottom table gives the significance level of each variable. The only p-values that are less than .05 are for the Intercept (which we do not assess for significance) and ERA. Thus, ERA is the only independent variable that is significant at p < .05. Note also that ERA is the only independent variable with a 95% confidence interval that does not contain 0.
A real estate developer has data on a number of U.S. national financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below.
If the developer wanted to create a regression model to predict housing starts from all the other financial variables, which of the following would be INDEPENDENT variables? (Select all that apply.)
- Year and Quarter
- Housing Starts (thousands)
- House Price Index
- Unemployment Rate
- Disposable Income
- Home Owner Vacancy Rates
House Price Index
Unemployment Rate
Disposable Income
Home Owner Vacancy Rates
House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates are the independent variables used to create the regression model.
Housing Starts (thousands) is the dependent variable used to create the regression model.
Year and Quarter is not included as a dependent or independent variable.
A real estate developer has data on a number of U.S. national financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below.
In order to create a regression model to analyze the relationship between housing starts and the other financial variables, which cell references should be entered?
- Input Y Range: A1:B81 / Input X Range: C1:F81
- Input Y Range: B1:F81 / Input X Range: A1:A81
- Input Y Range: C1:F81 / Input X Range: B1:B81
- Input Y Range: B1:B81 / Input X Range: C1:F81
Input Y Range: B1:B81 / Input X Range: C1:F81
The “Input Y Range” denotes the cell reference for the dependent variable, Housing Starts. The data for the dependent variable is in B1:B81. The “Input X Range” denotes the cell references for the independent variables: House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates. The data for the independent variables is in C1:F81.
Data contained in column A, Year and Quarter, are not included as a dependent or independent variable in the regression model.
The spreadsheet below contains data on a number of financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing prices), unemployment rate, average disposable income, and home owner vacancy rates.
Create a regression model to analyze the relationship between housing starts and the other financial variables. Be sure to include the residuals and residual plots in your analysis. Set the Output Range to H1.
From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B1:B81 and the Input X Range is C1:F81. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.
Below is a residual plot from a model predicting the price of diamonds from the depth of cut. Do you think the linear model is a good fit for the data?
- Yes, this looks like a linear relationship.
- No, this looks like a non-linear relationship.
- No, there is heteroskedasticity.
Yes, this looks like a linear relationship.
There is at least one aspect of the graph that should concern us.
No, this looks like a non-linear relationship.
The residuals are fairly evenly divided on both sides of 0; there does not appear to be a non-linear relationship.
No, there is heteroskedasticity.
There is evidence of heteroskedasticity; there is more variability at the lower values than at the higher values.
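The visual check described on this card can be roughed out numerically by comparing the spread of residuals at low and high values of the independent variable. This is a crude sketch on made-up residuals (not the diamond data), `spread_ratio` is a hypothetical helper, and it is not a formal statistical test.

```python
from statistics import pstdev


def spread_ratio(x_values, residuals):
    """Residual spread in the lower-x half divided by spread in the upper-x half."""
    pairs = sorted(zip(x_values, residuals))  # order the residuals by x
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return pstdev(low) / pstdev(high)


# Hypothetical residuals: wide scatter at low x, tight scatter at high x
x = [1, 2, 3, 4, 5, 6, 7, 8]
resid = [-8, 9, -7, 6, -1, 1, -1, 1]

ratio = spread_ratio(x, resid)
```

A ratio far from 1 (here well above 1) suggests the variability is not constant across the plot, which is the pattern the card describes as more variability at the lower values.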