CHAPTER 5 Regression for Describing and Forecasting Flashcards

1
Q

What is regression?

A

Finding the line of best fit through data to describe the relationship between two or more variables.

2
Q

What are the two primary uses of regression?

A
  • Description
  • Forecasting
3
Q

What is overfitting in regression?

A

A problem that arises when a model is too complex and fits the noise in the data rather than the underlying relationship.

4
Q

What does a regression equation express?

A

A linear relationship between a dependent variable and an independent variable.

5
Q

What are the regression parameters in a regression equation?

A
  • α (intercept)
  • β (slope)
6
Q

What does the intercept (α) represent in regression?

A

The predicted number of crimes on a day when the average temperature is 0 degrees Fahrenheit.

7
Q

What does the slope (β) represent in regression?

A

The amount that predicted crime increases with each degree Fahrenheit.

8
Q

What is the sum of squared errors (SSE)?

A

A measure of how well a regression line fits the data, calculated by summing the squared errors of each observation.
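
A minimal Python sketch of this calculation. The temperature and crime numbers are made up for illustration (they are not the chapter's Chicago data), and the function name is hypothetical:

```python
def sum_squared_errors(xs, ys, alpha, beta):
    """Sum of squared vertical distances between each y and the line alpha + beta * x."""
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data: daily temperature (degrees F) and number of crimes.
temps = [10, 30, 50, 70, 90]
crimes = [620, 680, 760, 800, 890]

# Two candidate lines: a flat line at the mean, and an upward-sloping line.
sse_flat = sum_squared_errors(temps, crimes, alpha=750.0, beta=0.0)
sse_sloped = sum_squared_errors(temps, crimes, alpha=585.0, beta=3.3)

print("SSE, flat line:  ", sse_flat)
print("SSE, sloped line:", sse_sloped)
```

The line with the smaller SSE fits these points better.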

9
Q

What is ordinary least squares (OLS) regression?

A

A method to find the values of α and β that minimize the sum of squared errors.
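
For a single explanatory variable, the minimizing values have a well-known closed form. The sketch below applies it to made-up data; the function and variable names are illustrative only:

```python
def ols_fit(xs, ys):
    """Return (alpha, beta) minimizing the sum of squared errors for y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: co-movement of x and y divided by the variation in x.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    # Intercept: chosen so the line passes through the point of means.
    alpha = y_bar - beta * x_bar
    return alpha, beta


def sum_squared_errors(xs, ys, alpha, beta):
    return sum((y - (alpha + beta * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical illustration data: daily temperature (degrees F) and number of crimes.
temps = [10, 30, 50, 70, 90]
crimes = [620, 680, 760, 800, 890]

alpha, beta = ols_fit(temps, crimes)
print("intercept:", alpha, "slope:", beta)
print("SSE at the OLS line:", sum_squared_errors(temps, crimes, alpha, beta))
```

Nudging either coefficient away from these values and recomputing the SSE gives a larger number, which is what "minimize" means here.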

10
Q

What are OLS regression coefficients?

A

The values of the parameters α and β that minimize the sum of squared errors.

11
Q

How do we interpret the OLS regression line?

A

It estimates the average number of crimes based on temperature and shows how crime changes with temperature.

12
Q

True or False: The regression line is always the best fit for all types of data.

A

False

13
Q

What is a conditional mean function?

A

A function that tells you the mean of some variable conditional on the value of another variable.
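
One simple way to estimate such a function is to group observations into bins of the conditioning variable and average within each bin. The sketch below assumes 10-degree temperature bins and made-up data:

```python
from collections import defaultdict


def conditional_means(xs, ys, bin_width):
    """Average y within each bin of x; keys are the lower edge of each bin."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        bin_start = (x // bin_width) * bin_width
        groups[bin_start].append(y)
    return {b: sum(v) / len(v) for b, v in sorted(groups.items())}


# Hypothetical illustration data: daily temperature (degrees F) and number of crimes.
temps = [31, 35, 38, 52, 57, 71, 74, 79]
crimes = [690, 700, 710, 750, 770, 800, 820, 810]

cmf = conditional_means(temps, crimes, bin_width=10)
for bin_start, avg in cmf.items():
    print(f"{bin_start}-{bin_start + 9} F: mean crimes = {avg}")
```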

14
Q

What is the relationship between the regression line and conditional means?

A

The regression line is the best linear approximation to the conditional means.

15
Q

What would you minimize to find the conditional median?

A

The sum of the absolute values of the errors.
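
The link between the loss and the summary statistic can be checked numerically for a single group of values. In this sketch (made-up numbers, brute-force search over a grid of guesses), the guess with the smallest sum of absolute errors is the median, while the guess with the smallest sum of squared errors is the mean:

```python
from statistics import mean, median


def sum_abs_errors(values, guess):
    return sum(abs(v - guess) for v in values)


def sum_sq_errors(values, guess):
    return sum((v - guess) ** 2 for v in values)


# Hypothetical values with one large outlier.
values = [2, 3, 4, 5, 100]

# Candidate guesses 0.0, 0.1, ..., 100.0.
candidates = [c / 10 for c in range(0, 1001)]

best_abs = min(candidates, key=lambda c: sum_abs_errors(values, c))
best_sq = min(candidates, key=lambda c: sum_sq_errors(values, c))

print("minimizes absolute errors:", best_abs, "| median:", median(values))
print("minimizes squared errors: ", best_sq, "| mean:  ", mean(values))
```

The outlier pulls the squared-error minimizer far more than the absolute-error minimizer.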

16
Q

What was a historical reason for using the sum of squared errors?

A

It is computationally easier to calculate using linear algebra.

17
Q

What does the regression line tell us about crime and temperature?

A

It predicts the average number of crimes based on temperature.

18
Q

What does it mean to regress crime on temperature?

A

To run an ordinary least squares regression where crime is the dependent variable and temperature is the independent variable.

19
Q

What is the significance of the slope of the regression line?

A

It indicates how much crime changes as the temperature changes.

20
Q

Fill in the blank: The line of best fit minimizes the _______.

A

sum of squared errors

21
Q

What is the graphical representation of errors in regression?

A

Vertical lines from data points to the regression line.

22
Q

In the context of regression, what does ‘parsimonious summary’ mean?

A

A concise representation of the data using as few parameters as possible.

23
Q

What happens if we use an arbitrary line in regression?

A

It will yield poor forecasts.

24
Q

What is the role of statistical software in OLS regression?

A

To calculate the values of α and β that minimize the sum of squared errors.

25
Q

What data was used to illustrate regression in this chapter?

A

Crime and temperature data in Chicago.

26
Q

What is the importance of the slope in the context of crime data?

A

It quantifies the increase in crime for each degree increase in temperature.

27
Q

What does the conditional mean function graph represent?

A

The average number of crimes for each temperature bin.

28
Q

What is one alternative to using the sum of squared errors?

A

Minimizing the sum of absolute values of the errors.

29
Q

What does OLS stand for in the context of regression?

A

Ordinary Least Squares.

30
Q

What does the intercept (α^OLS) in a regression equation represent?

A

The predicted turnout rate when the independent variable is zero.

31
Q

What is the predicted turnout rate for people aged zero according to the regression?

A

Approximately −14 percent.

32
Q

What does the slope (β^OLS) in a regression equation indicate?

A

The average change in turnout rate for each additional year of age.

33
Q

What is the average increase in voter turnout for each additional year of age between 19 and 68?

A

Just over 1 percentage point.

34
Q

How much more likely are 68-year-olds to vote compared to 18-year-olds?

A

Approximately 50 percentage points more likely.

35
Q

What is the turnout rate for 18-year-olds in the 2014 election?

A

Approximately 4.8 percent.

36
Q

What are the predicted turnout rates for 69- and 70-year-olds?

A

Approximately 57.3 percent and 58.3 percent, respectively.

37
Q

True or False: The regression line can accurately predict turnout rates for infants.

A

False

38
Q

What is one method to account for non-linearity in a regression analysis?

A

Fit separate regression lines for different segments of the data.

39
Q

What is one limitation of using a single linear regression for non-linear data?

A

It may produce a lot of error since it doesn’t capture the true relationship.

40
Q

What is a polynomial regression?

A

A regression that includes powers of the explanatory variable, such as age-squared or age-cubed.
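
A minimal sketch of the idea in plain Python: build columns for each power of the explanatory variable, then solve the least-squares normal equations. The data are made up (generated from an exact quadratic so the answer is easy to check), and both helper functions are illustrative rather than any particular library's API:

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def polynomial_fit(xs, ys, degree):
    """OLS coefficients [b0, b1, ..., b_degree] for y = b0 + b1*x + ... + b_degree*x**degree."""
    k = degree + 1
    powers = [[x ** p for p in range(k)] for x in xs]
    xtx = [[sum(row[i] * row[j] for row in powers) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(powers, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data: age in decades and a turnout rate that bends with age.
ages = [2, 3, 4, 5, 6, 7]
turnout = [10 + 8 * a - 0.5 * a ** 2 for a in ages]

linear = polynomial_fit(ages, turnout, degree=1)
quadratic = polynomial_fit(ages, turnout, degree=2)
print("linear coefficients:   ", linear)
print("quadratic coefficients:", quadratic)
```

Solving the normal equations directly becomes numerically fragile for high-order polynomials; in practice a statistics package would do this step.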

41
Q

What happens to the fit of the regression as more explanatory variables are added?

A

The fit typically improves, but it may lead to overfitting.

42
Q

What is overfitting in the context of regression analysis?

A

When a model becomes too complex and captures noise rather than the underlying relationship.

43
Q

What is the trade-off when adding more explanatory variables to a regression?

A

Increased complexity may lead to worse out-of-sample predictions.

44
Q

What is one reason why the tenth-order polynomial regression might perform poorly?

A

It captures meaningless correlations that do not hold in different data sets.

45
Q

What is the typical accuracy of polling averages in predicting presidential election outcomes?

A

Within 1 or 2 percentage points of the final vote share.

46
Q

Who is Nate Silver?

A

A journalist known for political data analytics and averaging polls.

47
Q

Fill in the blank: The relationship between age and voter turnout is approximately ______ from 18 to 68.

A

linear

48
Q

What is the average accuracy of polls on Election Day?

A

Within 1 or 2 percentage points of the final vote share.

49
Q

Who established himself as a giant of political data analytics?

A

Nate Silver.

50
Q

True or False: The Electoral College allows some candidates to win the election while losing the popular vote.

51
Q

What journal typically publishes a symposium before each presidential election?

A

PS: Political Science & Politics.

52
Q

What are some fundamental variables that might help predict vote share?

A
  • Economic growth
  • Incumbency status
  • Number of war casualties
53
Q

What does the r² statistic represent in regression analysis?

A

The proportion of variance in the dependent variable that can be explained by the independent variables.
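
In code, this is commonly computed as one minus the ratio of the sum of squared errors to the total variation of the outcome around its mean. A sketch with made-up data and a line assumed to have been fitted already:

```python
def r_squared(ys, predictions):
    """1 - SSE / SST: share of the variation in y accounted for by the predictions."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical illustration data and a previously fitted line y = 585 + 3.3 * x.
temps = [10, 30, 50, 70, 90]
crimes = [620, 680, 760, 800, 890]
predictions = [585 + 3.3 * t for t in temps]

r2 = r_squared(crimes, predictions)
print("r-squared:", round(r2, 4))
```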

54
Q

What was the average prediction error of the regression using ten independent variables?

A

1.7 percentage points.

55
Q

What happens to the average prediction error when out-of-sample testing is applied?

A

It jumps up from 1.7 to 5.6 percentage points.

56
Q

Fill in the blank: A naive prediction based on a simple average of past elections gets within ______ percentage points.

57
Q

What is overfitting in regression analysis?

A

Predicting a dependent variable with too many independent variables.

58
Q

What is the purpose of out-of-sample testing?

A

To assess the predictive accuracy of a model using data not included in the original regression.
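
One way to do this is a leave-one-out test: drop each observation in turn, fit the regression on the rest, and predict the dropped one. The sketch below uses made-up growth and vote-share numbers; leave-one-out is an assumption here, not necessarily the exact procedure used in the chapter:

```python
def ols_fit(xs, ys):
    """Return (alpha, beta) for the OLS line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    return y_bar - beta * x_bar, beta


def in_sample_error(xs, ys):
    """Mean absolute error when the line is fit and evaluated on the same data."""
    alpha, beta = ols_fit(xs, ys)
    return sum(abs(y - (alpha + beta * x)) for x, y in zip(xs, ys)) / len(xs)


def leave_one_out_error(xs, ys):
    """Mean absolute error when each point is predicted from a fit that excludes it."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        alpha, beta = ols_fit(train_x, train_y)
        errors.append(abs(ys[i] - (alpha + beta * xs[i])))
    return sum(errors) / len(errors)


# Hypothetical data: economic growth (%) and incumbent-party vote share (%).
growth = [0.5, 1.0, 2.0, 2.5, 3.0, 4.0]
vote_share = [47.0, 49.5, 50.0, 53.0, 52.0, 56.0]

in_err = in_sample_error(growth, vote_share)
out_err = leave_one_out_error(growth, vote_share)
print("in-sample error:    ", round(in_err, 2))
print("out-of-sample error:", round(out_err, 2))
```

The out-of-sample error is the larger of the two, and the gap tends to widen as more explanatory variables are added.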

59
Q

What is the intercept in a regression context?

A

The predicted value of the outcome when all explanatory variables are set to 0.

60
Q

What is the sum of squared errors (SSE)?

A

The total of squared differences between actual and predicted values.

61
Q

Who is credited with coining the term ‘regression’?

A

Francis Galton.

62
Q

What phenomenon did Galton discover related to regression?

A

Regression to the mean.

63
Q

What does the slope of the regression line indicate?

A

The sign and magnitude of the relationship between two variables.

64
Q

What is the common form for presenting regression results?

A

In a table.

65
Q

What is the conditional mean function?

A

A function that tells the mean of a variable given the value of other variables.

66
Q

What type of regression uses the method of ordinary least squares?

A

OLS regression.

67
Q

What is the root mean squared error (Root-MSE)?

A

The square root of the mean squared error, indicating average prediction deviation.
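
A short sketch of the calculation with made-up numbers. It divides the sum of squared errors by the number of observations; some statistical packages divide by the degrees of freedom instead, so their reported value can differ slightly:

```python
import math


def root_mse(ys, predictions):
    """Square root of the average squared prediction error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(ys, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical actual and predicted numbers of crimes.
actual = [620, 680, 760, 800, 890]
predicted = [618, 684, 750, 816, 882]

rmse = root_mse(actual, predicted)
print("Root-MSE:", round(rmse, 2))
```

The result is in the same units as the outcome (here, crimes), which makes it easier to read than the SSE.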

68
Q

What is an independent or explanatory variable?

A

A variable used to predict or explain the dependent variable.

69
Q

What is a dependent variable?

A

The variable associated with the outcome we are trying to describe, predict, or explain.

70
Q

What might cause random variables to appear correlated with outcomes in regression?

71
Q

What did Galton’s analysis of heights reveal?

A

Sons of tall fathers tend to be taller than average but shorter than their fathers.

72
Q

What is the goal of researchers when trying to predict election outcomes?

A

To find new variables to improve predictive power.

73
Q

What is the conditional mean function?

A

A function that tells you the mean (average) of some variable conditional on the value of some other variables.

74
Q

Define out-of-sample prediction.

A

Using regression (or another statistical technique) to predict the outcome for observations that were not included in the original data you used to generate your predictions.

75
Q

What is overfitting?

A

Attempting to predict a dependent variable with too many independent variables, so that variables appear to predict the dependent variable in the data but have no actual relationship with it in the world.

76
Q

What does the dataset SchoolingEarnings.csv provide?

A

The average annual earnings for 41- to 50-year-old men in the United States in 1980 at each level of schooling.

77
Q

What is the dependent variable in the regression exercise discussed?

78
Q

Fill in the blank: A _______ is a method to predict earnings using only years of schooling.

A

parsimonious way

79
Q

What is the first step in examining the relationship between earnings and schooling?

A

Start by making a scatter plot.

80
Q

What type of regression should be run to include schooling, schooling^2, schooling^3, and schooling^4?

A

Fourth-order polynomial regression.

81
Q

What is the purpose of running different regressions for some different ranges of schooling?

A

To see if those lines look meaningfully different from the predictions you get from a single regression including all the data.

82
Q

Does all the analysis make you think the simple linear approach was reasonable or unreasonable?

A

Subjective judgment based on analysis results.

83
Q

What is the goal of conducting out-of-sample tests?

A

To evaluate your prediction strategy.

84
Q

Using only data for those with twelve years of schooling or less, what should you predict?

A

Earnings for those with more than twelve years of schooling.

85
Q

Who wrote ‘The History of Statistics: The Measurement of Uncertainty before 1900’?

A

Stephen M. Stigler.

86
Q

True or False: Asking enough people to guess the weight of an ox will always yield a correct answer.

87
Q

What is the principle behind asking many non-experts for their guesses?

A

Their errors will cancel out and you’ll get a good answer.