Module 5 – Multiple Regression Flashcards

1
Q

Below is a residual plot from a model predicting the cost (in cents) of a standard postage stamp from the year the stamp was issued. Do you think the linear model is a good fit for the data?

  • Yes, this looks like a good regression model.
  • No, this looks like a non-linear relationship.
  • No, there is heteroskedasticity.
A

Yes, this looks like a good regression model.

There is at least one aspect of the graph that should concern us.

No, this looks like a non-linear relationship.

The residuals create a curved shape; they are not linear.

No, there is heteroskedasticity.

If there were heteroskedasticity, there would be more variability at some points than others; in this plot, there is little variability at any point.

2
Q

The owner of an electronics shop creates a regression model to help determine the price of a TV based on the size of its screen (in square inches), the quality of its picture (rated on a scale of 1 to 10 by a panel of judges) and the quality of its sound (also rated on a scale of 1 to 10 by a panel of judges). The regression equation is:

Price = 256 + 1.60 (Screen Size) + 0.55 (Picture Quality) + 0.32 (Sound Quality)

What is the coefficient for Picture Quality?

  • 256
  • 1.60
  • 0.55
  • 0.32
A

256

$256 is the y-intercept.

1.60

1.60 is the coefficient for Screen Size

0.55

0.55 is the coefficient for Picture Quality

0.32

0.32 is the coefficient for Sound Quality

3
Q

The table below contains a partial view of data about a random sample of 58 car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).

We want to create a regression model to analyze the relationship between CO2 and the specific attributes of the vehicle: the number of cylinders, the engine displacement, and the passenger volume. Based on this data table, how many independent variables are included in the model?

  • 5
  • 4
  • 3
  • 2
  • 1
A

3

There are three independent variables included in the model: cylinders, engine displacement, and passenger volume. The three independent variables are selected from the specified attributes we wish to include. Make and model variables are not included as independent variables for this multiple regression model.

4
Q

The table below contains a partial view of data about a random sample of 58 car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).

In order to create a regression model to analyze the relationship between CO2 and the independent variables of interest (the number of cylinders, the engine displacement, and the passenger volume), which cell references should be entered?

  • Input Y Range: F1:F59 / Input X Range: C1:E59
  • Input Y Range: C1:E59 / Input X Range: F1:F59
  • Input Y Range: F1:F59 / Input X Range: A1:E59
  • Input Y Range: A1:E59 / Input X Range: F1:F59
A

Input Y Range: F1:F59 / Input X Range: C1:E59

The “Input Y Range” denotes the cell reference for the dependent variable, CO2 Emissions. The data for the dependent variable is in F1:F59. The “Input X Range” denotes the cell references for the independent variables: the number of cylinders, the engine displacement, and the passenger volume. The data for the independent variables is in C1:E59.

Data contained in columns A and B, make and model, are not included as a dependent or independent variable in the regression model.

5
Q

The spreadsheet below contains data about a random sample of car makes and models, including number of cylinders, engine displacement (in cubic inches), passenger volume (in cubic feet), and CO2 emissions (in grams/km).

Create a regression model to analyze the relationship between CO2 and the independent variables of interest (the number of cylinders, the engine displacement, and the passenger volume). Be sure to include data labels, and include the residuals and residual plots in your analysis. Set the Output Range to H1.

A

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is F1:F59 and the Input X Range is C1:E59. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

6
Q

Below is a regression output table based on data from the 2014 Major League Baseball (MLB) season. The dependent variable is Win Percentage (the percentage of games won by MLB teams) in 2014. The independent variables are as follows: Runs (the average number of runs the team scored per game); ERA (the average number of runs the team allowed the opposing team to score per game); Completed Games (the total number of games with only one pitcher for the entire game); and strikeouts (the total number of strikeouts for the season).

Which is the best estimate of the approximate amount of variability in Win Percentage that is explained by the model?

  • 79.4%
  • 63.0%
  • 4.4%
  • 10.6%
A

79.4%

0.7936 or 79.4% is the multiple R. Which number best represents the variability accounted for by the model?

63.0%

0.6298 or 63.0% is the R-square, which indicates how much variability is accounted for by the model.

4.4%

0.0443 or 4.4% is the standard error of the regression. Which number best represents the variability accounted for by the model?

10.6%

10.63 is the F-value. Which number best represents the variability accounted for by the model?
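
The link between the first two answer choices can be checked with a little arithmetic. The sketch below assumes the usual reading of this kind of regression output, where the model has an intercept and R-square is the square of multiple R; the variable names are illustrative only.

```python
# Multiple R as quoted in the answer explanation.
multiple_r = 0.7936

# For a regression with an intercept, R-square is multiple R squared.
r_square = multiple_r ** 2

# Share of the variability in Win Percentage explained by the model.
print(f"R-square = {r_square:.4f}")
```

Squaring 0.7936 lands on roughly 0.63, which is why 63.0% rather than 79.4% is the share of variability explained.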
7
Q

If we have four sneaker colors (red, blue, green, and black) and we create dummy variables for red, blue, and green, which color is the base case?

  • Red
  • Blue
  • Green
  • Black
A

Red

The base case is the category that does not have a dummy variable. Red is one of the dummy variables.

Blue

The base case is the category that does not have a dummy variable. Blue is one of the dummy variables.

Green

The base case is the category that does not have a dummy variable. Green is one of the dummy variables.

Black

Black is the base case because it is the one category that is not a dummy variable.
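
As a rough illustration of the rule, the base case is whatever category is left over once the dummy variables are listed. The names `categories`, `dummies`, and `encode` below are hypothetical and not part of the course material.

```python
# All sneaker colors, and the ones that received a dummy variable.
categories = ["red", "blue", "green", "black"]
dummies = ["red", "blue", "green"]

# The base case is the category without a dummy variable.
base_case = [c for c in categories if c not in dummies]

def encode(color):
    """Return the 0/1 dummy values for one sneaker color."""
    return {f"is_{d}": int(color == d) for d in dummies}

print(base_case)
print(encode("black"))  # every dummy is 0 for the base case
print(encode("blue"))
```

A black sneaker is recognized by all three dummies being 0, so no fourth variable is needed.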

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

Use the multiple regression model

Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)

where House Size is in square feet and Distance from Boston is in miles, to predict the selling price of a house that is 1,500 square feet and 10 miles from Boston.

A

The expected selling price of a 1,500 square foot home that is 10 miles from Boston is B15+B16*1,500+B17*10=$453,397.59. You must link directly to the values in order to obtain the correct answer.
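
The same calculation can be sketched outside the spreadsheet. This version plugs in the rounded coefficients printed in the equation, so it is only an approximation of the cell-linked result; the function name `predict_price` is made up for the example.

```python
def predict_price(house_size_sqft, distance_miles):
    """Selling price from the rounded coefficients in the printed equation."""
    return 194_986.59 + 244.54 * house_size_sqft - 10_840.04 * distance_miles

rounded_estimate = predict_price(1_500, 10)
linked_answer = 453_397.59  # value given on the card, from the unrounded cells

print(f"Rounded-coefficient estimate: ${rounded_estimate:,.2f}")
print(f"Gap versus linked cells:      ${linked_answer - rounded_estimate:,.2f}")
```

The rounded coefficients give a figure a little over a dollar below the card's answer, which is presumably why the card insists on linking directly to the coefficient cells.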

9
Q

Use the multiple regression model

Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)

where House Size is in square feet and Distance from Boston is in miles, to predict the selling price of a house that is 3,000 square feet and 20 miles from Boston.

A

The expected selling price of a 3,000 square foot home that is 20 miles from Boston is B15+B16*3,000+B17*20=$711,808.59. You must link directly to the values in order to obtain the correct answer.

10
Q

Assume we have created two single linear regression models, and a multiple regression model to predict selling price based on House Size alone, Distance from Boston alone, or both. The three models are as follows, where House Size is in square feet and Distance from Boston is in miles:

Selling Price = 13,490.45 + 255.36 (House Size)

Selling Price = 686,773.86 – 15,162.92 (Distance from Boston)

Selling Price = 194,986.59 + 244.54 (House Size) – 10,840.04 (Distance from Boston)

House A and House B are the same size, but located in different neighborhoods: House B is five miles closer to Boston than House A. If the selling price of House A is $450,000, what would we expect to be the selling price of House B?

  • Approximately $396,000
  • Approximately $504,000
  • Approximately $526,000
  • Approximately $699,000
  • The answer cannot be determined without further information
A

Approximately $396,000

Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. Remember that as distance decreases, price increases so we would expect that House B would cost more than House A.

Approximately $504,000

Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. House B is five miles closer to Boston than House A so House B’s expected selling price is: House A’s selling price+net effect of distance on selling price ≈ $450,000+$10,840.04(5 miles) ≈ $450,000+$54,200.20 ≈ $504,200.20

Approximately $526,000

Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model.

Approximately $699,000

Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model. However, we do not need to include the intercept in this calculation.

The answer cannot be determined without further information

Since the two houses are the same size, to predict the expected difference in selling prices we should use -$10,840.04/mile, the net effect of distance on selling price (that is, the effect of distance on selling price controlling for house size), which can be found in the multiple regression model.
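
The reasoning behind the correct choice can be written out as a short calculation. It is a sketch that uses only the distance coefficient from the multiple regression model; the variable names are illustrative.

```python
# Net effect of distance on price, controlling for house size ($ per mile).
net_effect_per_mile = -10_840.04

price_a = 450_000          # known selling price of House A
miles_difference = -5      # House B is five miles closer to Boston

# Same size, so only the distance term changes; the intercept cancels out.
price_b = price_a + net_effect_per_mile * miles_difference

print(f"Expected price of House B: ${price_b:,.2f}")
```

Because both houses share the same size and intercept, only the distance term differs, giving approximately $504,000.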

11
Q

The census data below shows the number of building permits (in thousands) awarded monthly in each of the four major regions of the United States. The dataset includes a “Region” variable indicating the name of the region. Create a dummy variable called Midwest. Assign a 1 to observations from the Midwest and a 0 to all other observations.

A

There should be a 1 or 0 in cells D2:D89. In cell D2, enter the function =IF(B2="Midwest",1,0), then copy and paste this function into cells D3:D89.

12
Q

Now using the same data, create a second dummy variable called Northeast. Assign a 1 to the observations from the Northeast and a 0 to all other observations.

A

There should be a 1 or 0 in cells E2:E89. In cell E2, enter the function =IF(B2="Northeast",1,0), then copy and paste this function into cells E3:E89.

13
Q

Now using the same data, create a third dummy variable called South. Assign a 1 to the observations from the South and a 0 to all other observations.

A

There should be a 1 or 0 in cells F2:F89. In cell F2, enter the function =IF(B2="South",1,0), then copy and paste this function into cells F3:F89.

14
Q

Note that we are not creating a variable called West. Why do you think this is?

  • We are not interested in the data for the West.
  • There is less variability in data for the West, so it is less important to study.
  • We are leaving the West as the base case.
  • The p-value for the West is more than .05 so it is not significant.
A

We are not interested in the data for the West.

We are interested in all data.

There is less variability in data for the West, so it is less important to study.

Even if there were less variability in the West, we would want to know about permitting patterns.

We are leaving the West as the base case.

We only need three dummy variables: If we know that a region is not Midwest, Northeast, or South, we know for sure that it is West.

The p-value for the West is more than .05 so it is not significant.

We cannot know the p-value for the West before we run the analysis.
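
For readers who think in code rather than spreadsheet formulas, the three region dummies behave roughly like the sketch below. It mirrors the logic of the IF formulas used for Midwest, Northeast, and South; `DUMMY_REGIONS` and `region_dummies` are hypothetical names.

```python
# Regions that get a dummy variable; West is left out as the base case.
DUMMY_REGIONS = ("Midwest", "Northeast", "South")

def region_dummies(region):
    """Return (Midwest, Northeast, South) as 0/1 values for one observation."""
    return tuple(1 if region == name else 0 for name in DUMMY_REGIONS)

for region in ("Midwest", "Northeast", "South", "West"):
    print(region, region_dummies(region))
```

An observation from the West shows up as three zeros, so its effect is absorbed into the intercept and the other coefficients are read relative to it.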

15
Q

Below is a residual plot from a model predicting the weight (in pounds) of Red Sox players from their height (in inches). Do you think the linear model is a good fit for the data?

  • Yes, this looks like a linear relationship.
  • No, this looks like a non-linear relationship.
  • No, there is heteroskedasticity.
A

Yes, this looks like a linear relationship.

There is nothing in this scatterplot to concern us.

No, this looks like a non-linear relationship.

The residuals are fairly evenly divided on both sides of 0; there does not appear to be a non-linear relationship.

No, there is heteroskedasticity.

If there were heteroskedasticity, there would be a systematic difference in variability at different points of the graph; this is not the case with this graph.

16
Q

Below is a regression output table based on data from the 2014 Major League Baseball (MLB) season. The dependent variable is Win Percentage (the percentage of games won by MLB teams) in 2014. The independent variables are as follows: Runs (the average number of runs the team scored per game); ERA (the average number of runs the team allowed the opposing team to score per game); Completed Games (the total number of games with only one pitcher for the entire game); and strikeouts (the total number of strikeouts for the season).

Which of the following independent variables are significant at the p < .05 level? SELECT ALL THAT APPLY

  • Runs
  • ERA
  • Strikeouts
  • Completed Games
A

ERA

The p-value column in the bottom table gives the significance level of each variable. The only p-values that are less than .05 are for the Intercept (which we do not assess for significance) and ERA. Thus, ERA is the only independent variable that is significant at p < .05. Note also that ERA is the only independent variable with a 95% confidence interval that does not contain 0.

17
Q

A real estate developer has data on a number of U.S. National financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below.

If the developer wanted to create a regression model to predict housing starts from all the other financial variables, which of the following would be INDEPENDENT variables? (Select all that apply.)

  • Year and Quarter
  • Housing Starts (thousands)
  • House Price Index
  • Unemployment Rate
  • Disposable Income
  • Home Owner Vacancy Rates
A

House Price Index

Unemployment Rate

Disposable Income

Home Owner Vacancy Rates

House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates are the independent variables used to create the regression model.

Housing Starts (thousands) is the dependent variable used to create the regression model.

Year and Quarter is not included as a dependent or independent variable.

18
Q

A real estate developer has data on a number of U.S. National financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing selling prices), unemployment rate, average disposable income, and home owner vacancy rates. A partial view of the data is below.

In order to create a regression model to analyze the relationship between housing starts and the other financial variables, which cell references should be entered?

  • Input Y Range: A1:B81 / Input X Range: C1:F81
  • Input Y Range: B1:F81 / Input X Range: A1:A81
  • Input Y Range: C1:F81 / Input X Range: B1:B81
  • Input Y Range: B1:B81 / Input X Range: C1:F81
A

Input Y Range: B1:B81 / Input X Range: C1:F81

The “Input Y Range” denotes the cell reference for the dependent variable, Housing Starts. The data for the dependent variable is in B1:B81. The “Input X Range” denotes the cell references for the independent variables: House Price Index, Unemployment Rate, Disposable Income, and Home Owner Vacancy Rates. The data for the independent variables is in C1:F81.

Data contained in column A, Year and Quarter, are not included as a dependent or independent variable in the regression model.

19
Q

The spreadsheet below contains data on a number of financial variables for each quarter from 1995 to 2001. The variables are housing starts (in thousands), the housing price index (a measure of average housing prices), unemployment rate, average disposable income, and home owner vacancy rates.

Create a regression model to analyze the relationship between housing starts and the other financial variables. Be sure to include the residuals and residual plots in your analysis. Set the Output Range to H1.

A

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B1:B81 and the Input X Range is C1:F81. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

20
Q

Below is a residual plot from a model predicting the price of diamonds from the depth of cut. Do you think the linear model is a good fit for the data?

  • Yes, this looks like a linear relationship.
  • No, this looks like a non-linear relationship.
  • No, there is heteroskedasticity.
A

Yes, this looks like a linear relationship.

There is at least one aspect of the graph that should concern us.

No, this looks like a non-linear relationship.

The residuals are fairly evenly divided on both sides of 0; there does not appear to be a non-linear relationship.

No, there is heteroskedasticity.

There is evidence of heteroskedasticity; there is more variability at the lower values than at the higher values.

21
Q

Suppose we want to assign dummy variables to the seasons (Winter, Spring, Summer, Fall). How many dummy variables do we need?

  • 1
  • 2
  • 3
  • 4
A

1

We would want our dummy variables to account for all four seasons. One dummy variable would allow us to explore only two options.

2

We would want our dummy variables to account for all four seasons. Two dummy variables would allow us to explore only three options.

3

We always have one fewer dummy variable than the number of options. Since there are four seasons, there would be three dummy variables.

4

Even though there are four seasons, we always have one fewer dummy variable than the number of options.

22
Q

The owner of an electronics shop creates a regression model to help determine the price of a TV based on the size of its screen (in square inches), the quality of its picture (rated on a scale of 1 to 10 by a panel of judges) and the quality of its sound (also rated on a scale of 1 to 10 by a panel of judges). The regression equation is:

Price = 256 + 1.60 (Screen Size) + 0.55 (Picture Quality) + 0.32 (Sound Quality)

Which is the dependent variable?

  • Price
  • Screen Size
  • Picture Quality
  • Sound Quality
A

Price

We are trying to estimate the price of the TV, so Price is our dependent variable.

Screen Size

Screen Size is one of the variables that determine the price of the TV, so it is an independent variable.

Picture Quality

Picture Quality is one of the variables that determine the price of the TV, so it is an independent variable.

Sound Quality

Sound Quality is one of the variables that determine the price of the TV, so it is an independent variable.

23
Q

Dummy or Quantitative Variable?

Shoe Color

A

Dummy

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

24
Q

Dummy or Quantitative Variable?

NUMBER ON AN ATHLETE’S JERSEY

A

DUMMY

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

25
Q

Dummy or Quantitative Variable?

GENDER

A

DUMMY

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

26
Q

Dummy or Quantitative Variable?

ICE CREAM FLAVOR

A

DUMMY

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

27
Q
A
28
Q

Dummy or Quantitative Variable?

TIME TO RUN A MARATHON

A

QUANTITATIVE

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

29
Q

Dummy or Quantitative Variable?

HEIGHT

A

QUANTITATIVE

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

30
Q

Dummy or Quantitative Variable?

SIZE OF FLATSCREEN TV

A

QUANTITATIVE

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

31
Q

Dummy or Quantitative Variable?

HOURS SPENT STUDYING CORE

A

QUANTITATIVE

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

32
Q

Dummy or Quantitative Variable?

CALORIES IN DESSERTS

A

QUANTITATIVE

Time to run a marathon, height, size of flat–screen television, hours spent studying CORe, and calories in desserts are quantitative variables. Shoe color, number on an athlete’s jersey, gender, and ice cream flavor are categorical/qualitative variables and need to be transformed into dummy variables. Note that although athlete’s jerseys have numbers, those values cannot be interpreted as real numbers. For example, Eli Manning’s number is 10, whereas Peyton Manning’s was 18. However, you can’t interpret them to mean that Peyton is 80% more than Eli in some way.

33
Q
A
34
Q

A sporting goods store manager wants to forecast annual sneaker revenues based on the type of sport (running, tennis, or walking), color (red, blue, white, black, or violet) and its target audience (men or women). How many independent variables should the manager include in her multiple regression analysis? Please enter your answer as an integer; that is with no decimal point.

NUMBER:

A

7

Sales revenue is the dependent variable. Type of sport, color, and target audience are categorical variables which must be represented using dummy variables. Recall that it is necessary to use one fewer dummy variable than the number of options in a category. Thus, type of sport should be represented by 3-1=2 dummy variables, color should be represented by 5-1=4 dummy variables, and target audience should be represented by 2-1=1 dummy variable, for a total of 2+4+1=7 independent variables.
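
The count can be double-checked with a few lines of code. This is only a sketch of the "one fewer than the number of options" rule; the dictionary keys are illustrative labels.

```python
# Number of options in each categorical variable from the question.
options = {"type of sport": 3, "color": 5, "target audience": 2}

# Each categorical variable needs one fewer dummy than it has options.
dummy_counts = {name: k - 1 for name, k in options.items()}
total_independent_variables = sum(dummy_counts.values())

print(dummy_counts)
print("Total independent variables:", total_independent_variables)
```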

35
Q

An airport shuttle company forecasts the number of hours its drivers will work based on the distance to be driven (in miles) and the number of jobs (each job requires the pickup and drop-off of one set of passengers) using the following regression equation:

Travel time=-0.60+0.05(distance)+0.75(number of jobs)

On a given day, Victor and Sofia drive approximately the same distance but Sofia has two more jobs than Victor. If Victor worked for 4 hours, for how long can the company expect Sofia to work?

Please enter your answer rounded to one digit to the right of the decimal point. For example, if you think Sofia would work 236.7134 hours, enter 236.7.

NUMBER:

A

5.5

The only difference between the workloads of the two drivers is the number of jobs each has; Sofia has two additional jobs. Therefore the company can expect Sofia to work the four hours Victor worked, plus an additional 0.75 hours for each of the two additional jobs, that is, 4+0.75(2)=5.5 hours.
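
A short sketch makes the point that the distance term drops out of the comparison. The function `travel_time` simply restates the regression equation from the question; the distance of 50 miles and the job counts in the check are arbitrary values chosen for illustration.

```python
def travel_time(distance, jobs):
    """Hours worked, per the regression equation in the question."""
    return -0.60 + 0.05 * distance + 0.75 * jobs

victor_hours = 4.0
extra_jobs = 2

# Same distance, so only the jobs coefficient contributes to the difference.
sofia_hours = victor_hours + 0.75 * extra_jobs

# Check with an arbitrary shared distance: the gap is the same 1.5 hours.
gap = travel_time(50, 5) - travel_time(50, 3)

print(f"Sofia: {sofia_hours:.1f} hours (gap of {gap:.1f} hours)")
```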

36
Q

The organizer of a late night street fair in a popular tourist city wants to analyze the relationship between daily revenue and the following variables: the number of male visitors, the number of female visitors, the number of retail stands, the number of food (and beverage) stands, and the number of performances that take place on a given night. The regression output table is provided below. Based on these results and using a 10% significance level, the organizer thinks he can improve the model. He wants to try removing at least one variable from the analysis to create and compare new models. Which variable or variables would you recommend that he consider removing from the regression model? SELECT ALL THAT APPLY.

  • Intercept
  • Number of Male Visitors
  • Number of Female Visitors
  • Number of Retail Stands
  • Number of Food Stands
  • Number of Performances
A

Intercept

The intercept cannot be removed from the regression model unless there is good reason to believe that it is zero.

Number of Male Visitors

The p-value of “Number of Male Visitors”, 0.2016, is greater than 0.1 so the organizer should consider removing this variable from the regression model.

Number of Female Visitors

The p-value of “Number of Female Visitors”, 0.0018, is less than 0.1.

Number of Retail Stands

The p-value of “Number of Retail Stands”, 0.0000, is less than 0.1.

Number of Food Stands

The p-value of “Number of Food Stands”, 0.0390, is less than 0.1.

Number of Performances

The p-value of “Number of Performances”, 0.5412, is greater than 0.1 so the organizer should consider removing this variable from the regression model.
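
The screening rule applied across these answers can be summarized in code. The p-values are the ones quoted in the explanations above; the intercept is left out because it is not a candidate for removal, and the variable names are illustrative.

```python
# p-values quoted in the answer explanations.
p_values = {
    "Number of Male Visitors": 0.2016,
    "Number of Female Visitors": 0.0018,
    "Number of Retail Stands": 0.0000,
    "Number of Food Stands": 0.0390,
    "Number of Performances": 0.5412,
}

alpha = 0.10  # 10% significance level from the question

# Variables whose p-value exceeds alpha are candidates for removal.
candidates = [name for name, p in p_values.items() if p > alpha]

print(candidates)
```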

37
Q

The spreadsheet below contains the data about daily revenue, the number of male visitors, the number of female visitors, the number of retail stands, the number of food stands, and the number of performances that the organizer used to create the regression model. The organizer now wants to try excluding the number of male visitors and the number of performances from the model because those variables were not significant. Create a regression model to analyze the relationship between daily revenue and the remaining independent variables. Be sure to include the residuals and residual plots in your analysis.

A

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is F1:F40 and the Input X Range is B1:D40. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

38
Q

If the street fair organizer wanted to compare the explanatory power of the original model and the following new regression model, which value should he consult for the new model?

  • 0.9637
  • 0.9287
  • 0.9225
  • 0.0025
A

0.9637

0.9637 is the Multiple R value.

0.9287

0.9287 is the R2 value.

0.9225

It is important to use the Adjusted R2 to compare two regression models that have a different number of independent variables. 0.9225 is the Adjusted R2 of the new model.

0.0025

0.0025 is the p-value of the independent variable, “Number of Food Stands”.
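
To see why Adjusted R2 sits slightly below R2, the standard adjustment formula can be sketched as follows. The sample size of 39 is an assumption (inferred from a data range of rows 1 to 40 with a label row) and is not stated on the card; k = 3 reflects the three remaining independent variables.

```python
def adjusted_r2(r2, n, k):
    """Standard adjustment: penalize R2 for the number of predictors k."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = 0.9287   # R2 of the new model, from the answer choices
n = 39        # ASSUMED number of observations (not given on the card)
k = 3         # female visitors, retail stands, food stands

adj = adjusted_r2(r2, n, k)
print(f"Adjusted R2 is about {adj:.3f}")
```

Under those assumptions the result lands close to the 0.9225 shown in the answer choices; the small gap is consistent with R2 having been rounded.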
39
Q

Using the new model, forecast the daily revenue when there are 10 retail stands and 15 food stands open, and approximately 1,500 women visiting.

A

The expected daily revenue is B15+(1500*B16)+(10*B17)+(15*B18)=$49,485. You must link directly to values in order to obtain the correct answer.

40
Q

Given the new regression model, which of the following statements is true?

  • As the number of retail stands increases by one, the daily revenue increases by $2,298.36 on average.
  • As the number of retail stands increases by one, the daily revenue increases by $2,298.36 on average, provided that the number of female visitors and food stands remains constant.
  • As the number of female visitors increases by ten, the daily revenue increases by $29.34 on average.
  • As the number of female visitors increases by ten, the daily revenue increases by $29.34 on average, provided that the number of retail stands and food stands remains constant.
A

As the number of retail stands increases by one, the daily revenue increases by $2,298.36 on average.

This option describes a gross relationship whereas the multiple regression model describes a net relationship.

As the number of retail stands increases by one, the daily revenue increases by $2,298.36 on average, provided that the number of female visitors and food stands remains constant.

2,298.36 is the net effect of the number of retail stands on daily revenue. Thus, it describes the average increase in daily revenue as the number of retail stands increases by one while the other variables’ values remain constant.

As the number of female visitors increases by ten, the daily revenue increases by $29.34 on average.

This option describes a gross relationship whereas the multiple regression model describes a net relationship.

As the number of female visitors increases by ten, the daily revenue increases by $29.34 on average, provided that the number of retail stands and food stands remains constant.

The net effect on daily revenue of increasing the number of female visitors by ten would be $293.40, not $29.34.

41
Q

Based on the following partial regression output table, from which the information on the coefficients’ t-statistics and p-values has been removed, which of the independent variables are significant at the 95% confidence level? SELECT ALL THAT APPLY.

  • Intercept
  • Variable A
  • Variable B
  • Variable C
  • Variable D
  • None of the variables are significant at the 95% confidence level
A

Intercept

The intercept is not an independent variable.

Variable A

The 95% confidence interval for the variable’s coefficient does not contain 0, which indicates that Variable A is significant at the 95% confidence level. The p-value of Variable A (not shown) is 0.0001. Since it is less than 1–0.95=0.05, it confirms that the variable is significant at the 95% confidence level.

Variable B

The 95% confidence interval for the variable’s coefficient contains 0, which indicates that Variable B is not significant at the 95% confidence level. The p-value of Variable B (not shown) is 0.0735. Since it is greater than 1–0.95=0.05, it confirms that the variable is not significant at the 95% confidence level.

Variable C

The 95% confidence interval for the variable’s coefficient contains 0, which indicates that Variable C is not significant at the 95% confidence level. The p-value of Variable C (not shown) is 0.3357. Since it is greater than 1–0.95=0.05, it confirms that the variable is not significant at the 95% confidence level.

Variable D

The 95% confidence interval for the variable’s coefficient does not contain 0, which indicates that Variable D is significant at the 95% confidence level. The p-value of Variable D (not shown) is 0.0028. Since it is less than 1–0.95=0.05, it confirms that the variable is significant at the 95% confidence level.

None of the variables are significant at the 95% confidence level

Variables A and D are significant at the 95% confidence level.

42
Q

The manager of a direct mailing marketing firm wants to analyze whether, and if so how, the number of pieces of mail sent to households about a credit card promotion affects the number of credit card applications the firm receives. The manager collects data on the number of pieces of mail the firm has sent and number of applications received. Based on industry knowledge, the manager knows that customers don’t necessarily apply for the credit card in the week they receive the mailing, but believes that some may apply in the week following the initial mailing.

The manager wishes to perform a regression analysis using two independent variables: the number of pieces of mail sent during a specified week, and the number sent the week preceding the specified week. Create a new lagged variable in the highlighted column: “Pieces of Mail Sent the Preceding Week” to allow for the appropriate regression analysis.

A

To create the lagged variable, copy the data from C2:C32 and paste it into D3:D33. For example, 93,460 should be in cell D3; 85,220 should be in cell D4, and so on. Cell D2 should be blank and there is a new row of data that has an entry only in cell D33.
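
The copy-and-paste step is equivalent to shifting the column down by one row. A minimal Python sketch of that shift follows, assuming the weekly values are held in a plain list; the first two values are the ones named in this card and the third is a hypothetical stand-in for the rest of the column.

```python
def add_lag(current):
    """Return (current_col, lagged_col), padded with None where a value is missing."""
    values = list(current)
    current_col = values + [None]   # extra final row: only the lagged value exists
    lagged_col = [None] + values    # first row: no preceding week is available
    return current_col, lagged_col


def drop_incomplete(current_col, lagged_col):
    """Keep only the rows where both the current and the lagged value are present."""
    return [(c, l) for c, l in zip(current_col, lagged_col)
            if c is not None and l is not None]


# 93,460 and 85,220 come from the card; 90,000 is hypothetical.
mail_sent = [93460, 85220, 90000]
current_col, lagged_col = add_lag(mail_sent)
rows = drop_incomplete(current_col, lagged_col)
print(rows)
```

The two padded rows correspond to the blank cell D2 and the new row that has an entry only in D33; both have to be removed before running the regression.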

43
Q

The spreadsheet below contains the direct marketing firm’s data for the number of pieces of mail sent (both current and lagged) and the number of credit card applications received. The rows with missing values (the first and last rows) have been removed. Perform a regression analysis to analyze the impact of the number of pieces of mail sent in the current and preceding weeks on the number of applications received. Be sure to include labels, residuals, and residual plots in your analysis.

A

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is B1:B31 and the Input X Range is C1:D31. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

44
Q

Based on the resulting regression output table and residual plots (shown below), which of the following statements is FALSE? SELECT ALL THAT APPLY.

  • Both variables are significant at the 5% significance level.
  • There appears to be heteroskedasticity.
  • On average, the net effect of both independent variables is negative.
  • On average, the net effect of both independent variables is positive.
A

Both variables are significant at the 5% significance level.

“Pieces of Mail Sent the Preceding Week” is not significant because its p-value, 0.5446, is greater than 0.05.

There appears to be heteroskedasticity.

There appears to be heteroskedasticity because the residual plot for “Pieces of Mail Sent the Preceding Week” exhibits a funnel shape.

On average, the net effect of both independent variables is negative.

The coefficients for the independent variables are both positive so the net effect of these variables is not negative.

On average, the net effect of both independent variables is positive.

The coefficients for the independent variables are both positive so the net effect of these variables is positive.

45
Q

Which of the following statements about multicollinearity is TRUE? SELECT ALL THAT APPLY.

  • Multicollinearity occurs when two or more independent variables are highly correlated
  • Multicollinearity is usually not an issue when the regression model is only being used for forecasting
  • Multicollinearity is usually not an issue when the regression model is only being used to understand net relationships
  • Multicollinearity can typically be reduced by decreasing the sample size
  • Multicollinearity can typically be reduced by adding more independent variables
A

Multicollinearity occurs when two or more independent variables are highly correlated

Multicollinearity means that two or more of the independent variables are collinear, meaning they are highly correlated. One or more of the independent variables may not be significant because the variable with which it is correlated serves as a proxy variable.

Multicollinearity is usually not an issue when the regression model is only being used for forecasting

Multicollinearity is typically not a problem when the model is being used for forecasting, especially if the predictive power of the model is increased by the additional variable(s).

Multicollinearity is usually not an issue when the regression model is only being used to understand net relationships

Multicollinearity affects the estimates of the coefficients, thereby distorting the net relationships.

Multicollinearity can typically be reduced by decreasing the sample size

Multicollinearity can be reduced by increasing the sample size.

Multicollinearity can typically be reduced by adding more independent variables

Multicollinearity can be reduced by removing one or more of the collinear variables.
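
One common way to screen for multicollinearity is to compute the correlation between pairs of independent variables before building the model. The sketch below computes the Pearson correlation coefficient by hand; the two data columns and the variable names are hypothetical and are not taken from any card.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical independent variables that tend to move together.
house_size = [1200, 1500, 1700, 2100, 2600]
num_rooms = [4, 5, 6, 7, 9]
print(round(pearson_r(house_size, num_rooms), 3))
```

A correlation close to 1 or -1 between two independent variables suggests that one may be acting as a proxy for the other, which is the situation in which removing one of them is worth considering.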

46
Q

The manager of a furniture factory that operates a morning and evening shift seven days a week wants to forecast the number of chairs its factory workers will produce on a given day and shift. The production manager gathers chair production data from the factory and lists whether the production day was a weekday or a weekend (i.e., Saturday or Sunday), and whether the shift was in the morning or evening. Create a regression model to analyze the relationship between the number of chairs produced, whether a day was a weekday or weekend, and whether the shift was in the morning or evening. Be sure to include the residuals and residual plots in your analysis.

A

From the Data menu, select Data Analysis, then select Regression. The Input Y Range is A1:A62 and the Input X Range is B1:C62. You must check the Labels box to ensure that the regression output table is appropriately labeled. You must also check the Residuals and Residual Plots boxes so that you are able to analyze the residuals.

47
Q

Given the regression output table shown below, which of the following statements is true?

  • On weekdays, the average number of chairs produced per shift is 70.97 greater than on weekends.
  • On weekends, the average number of chairs produced per shift is 70.97 less than on weekdays, provided the shift time remains constant.
  • During the morning shift, the average number of chairs produced is 47.85 greater than during the evening shift.
  • During the morning shift, the average number of chairs produced is 47.85 less than during the evening shift, provided that whether or not it is a weekday remains constant.
A

On weekdays, the average number of chairs produced per shift is 70.97 greater than on weekends.

This option describes a gross effect whereas the multiple regression model describes a net relationship.

On weekends, the average number of chairs produced per shift is 70.97 less than on weekdays, provided the shift time remains constant.

70.97 is the net effect of the day being a weekday on the number of chairs produced per shift. On weekdays the Weekday variable is set to “1” in the regression. On weekends the Weekday variable is set to a “0” in the regression equation, so we essentially exclude this variable. Therefore, we can conclude that on weekdays, the average number of chairs produced is 70.97 greater than on weekends, provided that the shift time remains constant. Equivalently, we can conclude that on weekends, the average number of chairs produced is 70.97 less than on weekdays, provided that the shift time remains constant.

During the morning shift, the average number of chairs produced is 47.85 greater than during the evening shift.

This option describes a gross effect but the multiple regression model describes a net relationship.

During the morning shift, the average number of chairs produced is 47.85 less than during the evening shift, provided that whether or not it is a weekday remains constant.

47.85 is the net effect of a morning shift on the number of chairs produced, controlling for whether or not it is a weekday. Thus, during the morning shift, the average number of chairs produced is 47.85 greater (not less) than during the evening shift, provided that whether or not it is a weekday remains constant.

48
Q

The manager of a furniture factory that operates a morning and evening shift seven days a week wants to forecast the number of chairs its factory workers will produce on a given day and shift. The production manager gathers chair production data from the factory and lists whether the production day was a weekday or a weekend (i.e., Saturday or Sunday), and whether the shift was in the morning or evening. Using the regression model, forecast the number of chairs that will be produced on a Thursday during the evening shift.

A

The expected number of chairs that will be produced is B15+(1*B16)+(0*B17)=B15+B16=477. You must link directly to values in order to obtain the correct answer.

49
Q

The manager of a furniture factory that operates a morning and evening shift seven days a week wants to forecast the number of chairs its factory workers will produce on a given day and shift. The production manager gathers chair production data from the factory and lists whether the production day was a weekday or a weekend (i.e., Saturday or Sunday), and whether the shift was in the morning or evening. Using the regression model, forecast the number of chairs that will be produced during a Sunday morning shift.

A

The expected number of chairs that will be produced is B15+(0*B16)+(1*B17)=B15+B17=453. You must link directly to values in order to obtain the correct answer.
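
The dummy-variable forecasts in cards 48 and 49 can be sketched as a small function in which each dummy is either 1 or 0. The two coefficients are the ones quoted in card 47. The intercept is not shown in these cards, so the value below is an assumption, chosen only because it is consistent with both rounded forecasts (477 and 453).

```python
# Coefficients quoted in card 47.
COEF_WEEKDAY = 70.97   # Weekday dummy: 1 on weekdays, 0 on weekends
COEF_MORNING = 47.85   # Morning dummy: 1 for the morning shift, 0 for the evening shift
# Assumed intercept (cell B15 is not shown in the cards).
INTERCEPT = 405.6


def forecast_chairs(is_weekday: bool, is_morning: bool) -> float:
    """Forecast chairs produced; a dummy set to 0 drops its coefficient from the sum."""
    return (INTERCEPT
            + COEF_WEEKDAY * int(is_weekday)
            + COEF_MORNING * int(is_morning))


thursday_evening = forecast_chairs(is_weekday=True, is_morning=False)
sunday_morning = forecast_chairs(is_weekday=False, is_morning=True)
print(round(thursday_evening), round(sunday_morning))
```

Setting a dummy to 0 leaves only the intercept and the other coefficient, which is why the Excel formulas reduce to B15+B16 and B15+B17.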

50
Q

Below is some output from the regression on the furniture factory data. What does the R-square value tell us?

  • That we cannot reject the null hypothesis
  • That there is multicollinearity between the independent variables
  • That on average, 0.7059 more chairs are produced during weekday shifts than during weekend shifts.
  • That 71% of the variability in the number of chairs produced can be explained by whether the shift is in the morning or evening and whether it is a weekday shift or weekend shift.
A

That we cannot reject the null hypothesis

The significance level of the F-value provides information on whether we can reject the null hypothesis. The R-square value indicates what percentage of the variability in the dependent variable is explained by the regression line.

That there is multicollinearity between the independent variables

Multicollinearity involves correlation among individual independent variables; R-square provides information about the relationship between the regression line and the dependent variable.

That on average, 0.7059 more chairs are produced during weekday shifts than during weekend shifts.

The coefficient for the weekday variable, 70.97, tells us how many more chairs are produced on weekday shifts than on weekend shifts.

That 71% of the variability in the number of chairs produced can be explained by whether the shift is in the morning or evening and whether it is a weekday shift or weekend shift.

R-square indicates what percentage of the variability in the dependent variable is explained by the regression line.
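
As a rough illustration of "percentage of variability explained", R-square can be computed as one minus the ratio of unexplained variation to total variation. The data below are hypothetical and are not the furniture factory's data.

```python
def r_squared(actual, predicted):
    """R-square = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical chairs produced per shift and the model's predictions for them.
actual = [400, 450, 480, 520]
predicted = [410, 445, 470, 525]
print(round(r_squared(actual, predicted), 4))
```

A value of 0.7059 would mean that roughly 71% of the total variation around the mean is accounted for by the model's predictions.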

51
Q

If an independent variable has a p-value of 0.07, which of the following could represent the Lower 95% and the Upper 95% for that variable?

  • -14.52, -3.25
  • -14.52, 3.25
  • 3.25, 14.52
  • The answer cannot be determined without further information
A

-14.52, -3.25

The p-value, 0.07, is greater than 0.05 so the independent variable is not significant at the 5% significance level. Therefore, the 95% confidence interval for the coefficient of the independent variable must include zero. The interval between -14.52 and -3.25 does not contain zero.

-14.52, 3.25

The p-value, 0.07, is greater than 0.05 so the independent variable is not significant at the 5% significance level. Therefore, the 95% confidence interval for the coefficient of the independent variable must include zero. The interval between -14.52 and 3.25 contains zero.

3.25, 14.52

The p-value, 0.07, is greater than 0.05 so the independent variable is not significant at the 5% significance level. Therefore, the 95% confidence interval for the coefficient of the independent variable must include zero. The interval between 3.25 and 14.52 does not contain zero.

The answer cannot be determined without further information

The p-value, 0.07, is greater than 0.05 so the independent variable is not significant at the 5% significance level. Therefore, we know that the 95% confidence interval for the coefficient of the independent variable must include zero, so the answer can be determined: of the intervals listed, only the one between -14.52 and 3.25 contains zero.
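
The rule used throughout this card (a coefficient is significant at the 5% level exactly when its 95% confidence interval excludes zero) can be checked mechanically. A minimal sketch using the three intervals offered as options:

```python
def significant_at_95(lower: float, upper: float) -> bool:
    """True when the 95% confidence interval for the coefficient excludes zero."""
    return not (lower <= 0 <= upper)


# Candidate (Lower 95%, Upper 95%) pairs from the answer options.
options = [(-14.52, -3.25), (-14.52, 3.25), (3.25, 14.52)]

# A p-value of 0.07 means "not significant", so the interval must contain zero.
consistent = [interval for interval in options
              if not significant_at_95(*interval)]
print(consistent)
```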

52
Q

Which of the following is TRUE about the difference between analyzing residual plots for single variable regression models and analyzing residual plots for multiple regression models?

  • Multiple regression residual plots give insight into multicollinearity across the independent variables, whereas multicollinearity cannot occur in a single variable regression model or its residual plot.
  • Single variable regression plots give insight into the gross relationship between the independent and dependent variable, whereas multiple regression plots give insight into the net relationship, controlling for the other independent variables included in the regression model.
  • Multiple regression plots give insight into both heteroskedasticity and non-linearity, whereas single regression plots only give insight into heteroskedasticity.
  • In single regression plots, residuals are measured as the shortest distance from the regression line, whereas in multiple regression residuals are measured along the vertical axis.
A

Multiple regression residual plots give insight into multicollinearity across the independent variables, whereas multicollinearity cannot occur in a single variable regression model or its residual plot.

Residual plots cannot provide information about multicollinearity.

Single variable regression plots give insight into the gross relationship between the independent and dependent variable, whereas multiple regression plots give insight into the net relationship, controlling for the other independent variables included in the regression model.

In multiple regression, we see the effect of each independent variable, controlling for all the other variables in the model, or the net effect. This is reflected in the residual plots.

Multiple regression plots give insight into both heteroskedasticity and non-linearity, whereas single regression plots only give insight into heteroskedasticity.

All residual plots give insight into BOTH heteroskedasticity and non-linearity.

In single regression plots, residuals are measured as the shortest distance from the regression line, whereas in multiple regression residuals are measured along the vertical axis.

Residuals are measured the same way in both single and multiple regression; they are measured as distance from the regression line along the vertical axis.
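
In both cases a residual is simply the observed value minus the value the model predicts for that observation. A minimal sketch with hypothetical numbers:

```python
def residuals(actual, predicted):
    """Vertical distance from each observation to the model's prediction."""
    return [a - p for a, p in zip(actual, predicted)]


# Hypothetical observed and predicted values of the dependent variable.
observed = [477, 453, 520, 410]
fitted = [470, 460, 515, 412]
print(residuals(observed, fitted))
```

A residual plot charts these differences against an independent variable; a curved pattern points to non-linearity, and a funnel shape points to heteroskedasticity.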