CHAPTER 5 Regression for Describing and Forecasting Flashcards
What is regression?
Finding the line of best fit through data to describe the relationship between two or more variables.
What are the two primary uses of regression?
- Description
- Forecasting
What is overfitting in regression?
A problem that arises when a model is too complex and fits the noise in the data rather than the underlying relationship.
What does a regression equation express?
A linear relationship between a dependent variable and an independent variable.
What are the regression parameters in a regression equation?
- α (intercept)
- β (slope)
What does the intercept (α) represent in regression?
The predicted number of crimes on a day when the average temperature is 0 degrees Fahrenheit.
What does the slope (β) represent in regression?
The amount by which predicted crime increases with each additional degree Fahrenheit.
What is the sum of squared errors (SSE)?
A measure of how well a regression line fits the data, calculated by summing the squared errors of each observation.
What is ordinary least squares (OLS) regression?
A method to find the values of α and β that minimize the sum of squared errors.
What are OLS regression coefficients?
The values of the parameters α and β that minimize the sum of squared errors.
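As a rough illustration of the cards above, the sketch below computes α and β with the standard closed-form OLS formulas and checks that a nearby line has a larger sum of squared errors. The temperature and crime numbers are made up for the example, and the helper names `ols_fit` and `sse` are hypothetical, not from the chapter.

```python
def ols_fit(x, y):
    """Return (alpha, beta) minimizing the sum of squared errors for y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-movement of x and y divided by the spread of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta = s_xy / s_xx
    # Intercept: chosen so the line passes through the point of means.
    alpha = y_bar - beta * x_bar
    return alpha, beta


def sse(x, y, alpha, beta):
    """Sum of squared vertical distances from each data point to the line."""
    return sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: daily average temperature (degrees F) and crime counts.
temperature = [20, 35, 50, 65, 80]
crimes = [610, 680, 720, 800, 840]

alpha_ols, beta_ols = ols_fit(temperature, crimes)
best = sse(temperature, crimes, alpha_ols, beta_ols)
# A line with a slightly different slope fits worse (larger SSE).
worse = sse(temperature, crimes, alpha_ols, beta_ols + 0.5)
print(alpha_ols, beta_ols, best, worse)
```

In practice statistical software does this calculation for you; the point of the sketch is only that the OLS line is the one with the smallest SSE.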
How do we interpret the OLS regression line?
It estimates the average number of crimes based on temperature and shows how crime changes with temperature.
True or False: The regression line is always the best fit for all types of data.
False
What is a conditional mean function?
A function that tells you the mean of some variable conditional on the value of another variable.
What is the relationship between the regression line and conditional means?
The regression line is the best linear approximation to the conditional means.
What would you minimize to find the conditional median?
The sum of the absolute values of the errors.
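A small sketch of why the choice of loss matters: with made-up crime counts for days that share one temperature, the guess that minimizes squared errors is the mean, while the guess that minimizes absolute errors is the median. The data and function names are assumptions for illustration.

```python
from statistics import mean, median


def sum_squared(values, guess):
    return sum((v - guess) ** 2 for v in values)


def sum_absolute(values, guess):
    return sum(abs(v - guess) for v in values)


# Hypothetical crime counts on days with the same temperature; 1200 is an outlier.
crimes = [700, 710, 720, 730, 1200]

# Try every whole-number guess from 600 to 1300 and keep the best under each loss.
candidates = range(600, 1301)
best_sq = min(candidates, key=lambda g: sum_squared(crimes, g))
best_abs = min(candidates, key=lambda g: sum_absolute(crimes, g))

print(best_sq, mean(crimes))     # squared-error minimizer vs. the mean
print(best_abs, median(crimes))  # absolute-error minimizer vs. the median
```

Note how the outlier pulls the mean far more than the median.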
What was a historical reason for using the sum of squared errors?
It is computationally easier to minimize, because the solution can be found with linear algebra.
What does the regression line tell us about crime and temperature?
It predicts the average number of crimes based on temperature.
What does it mean to regress crime on temperature?
To run an ordinary least squares regression where crime is the dependent variable and temperature is the independent variable.
What is the significance of the slope of the regression line?
It indicates how much crime changes as the temperature changes.
Fill in the blank: The line of best fit minimizes the _______.
sum of squared errors
What is the graphical representation of errors in regression?
Vertical lines from data points to the regression line.
In the context of regression, what does ‘parsimonious summary’ mean?
A concise representation of the data that uses as few parameters as possible.
What happens if we use an arbitrary line in regression?
It will yield poor forecasts.
What is the role of statistical software in OLS regression?
To calculate the values of α and β that minimize the sum of squared errors.
What data was used to illustrate regression in this chapter?
Crime and temperature data in Chicago.
What is the importance of the slope in the context of crime data?
It quantifies the increase in crime for each degree increase in temperature.
What does the conditional mean function graph represent?
The average number of crimes for each temperature bin.
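One way to picture a binned conditional mean function is sketched below: group days into temperature bins and average the crime counts within each bin. The 10-degree bin width, the data, and the function name `conditional_means` are assumptions made for this example.

```python
from collections import defaultdict


def conditional_means(x, y, bin_width):
    """Average y within each bin of x; keys are the lower edge of each bin."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        lower_edge = (xi // bin_width) * bin_width
        groups[lower_edge].append(yi)
    return {edge: sum(vals) / len(vals) for edge, vals in sorted(groups.items())}


# Hypothetical daily observations.
temperature = [31, 34, 38, 42, 47, 49, 53, 58]
crimes = [640, 660, 650, 700, 690, 710, 740, 760]

print(conditional_means(temperature, crimes, bin_width=10))
# {30: 650.0, 40: 700.0, 50: 750.0}
```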
What is one alternative to using the sum of squared errors?
Minimizing the sum of absolute values of the errors.
What does OLS stand for in the context of regression?
Ordinary Least Squares.
What does the intercept (α^OLS) in a regression equation represent?
The predicted turnout rate when the independent variable is zero.
What is the predicted turnout rate for people aged zero according to the regression?
Approximately −14 percent.
What does the slope (β^OLS) in a regression equation indicate?
The average change in turnout rate for each additional year of age.
What is the average increase in voter turnout for each additional year of age between 19 and 68?
Just over 1 percentage point.
How much more likely are 68-year-olds to vote compared to 18-year-olds?
Approximately 50 percentage points more likely.
What is the turnout rate for 18-year-olds in the 2014 election?
Approximately 4.8 percent.
What are the predicted turnout rates for 69- and 70-year-olds?
Approximately 57.3 percent and 58.3 percent, respectively.
True or False: The regression line can accurately predict turnout rates for infants.
False.
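The sketch below shows why extrapolating the turnout line outside the ages it was fit on gives nonsense. The intercept of about −14 comes from the cards above; the slope of 1.03 is an assumed stand-in for "just over 1", not the chapter's reported estimate, and ages 0 and 120 are chosen only because they lie outside the observed range.

```python
ALPHA = -14.0  # approximate intercept quoted in the flashcards
BETA = 1.03    # assumption: illustrative slope, "just over 1" percentage point per year


def predicted_turnout(age):
    """Predicted turnout rate (percent) from the straight line alpha + beta * age."""
    return ALPHA + BETA * age


# Ages far outside the data produce impossible turnout rates.
for age in (0, 120):
    print(age, predicted_turnout(age))
```

A turnout rate below 0 percent or above 100 percent cannot occur, which is a reminder that a linear fit only describes the range of data it was estimated on.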
What is one method to account for non-linearity in a regression analysis?
Fit separate regression lines for different segments of the data.
What is one limitation of using a single linear regression for non-linear data?
It may produce a lot of error since it doesn’t capture the true relationship.
What is a polynomial regression?
A regression that includes powers of the explanatory variable, such as age-squared or age-cubed.
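A minimal sketch of polynomial regression, assuming made-up data that follow an exact quadratic curve: it builds columns for the powers of x and solves the OLS normal equations with a small hand-written solver. The helper names (`solve`, `poly_fit`, `predict`) are hypothetical; real analyses would normally rely on statistical software instead.

```python
def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - known) / M[r][r]
    return z


def poly_fit(x, y, degree):
    """OLS coefficients [b0, b1, ..., b_degree] for y = b0 + b1*x + b2*x^2 + ..."""
    X = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    # Normal equations: (X'X) b = X'y
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


def predict(coefs, xi):
    return sum(b * xi ** p for p, b in enumerate(coefs))


# Hypothetical curved data: y = 2 + 0.5 * x^2 exactly.
x = [0, 1, 2, 3, 4, 5]
y = [2 + 0.5 * xi ** 2 for xi in x]

line = poly_fit(x, y, degree=1)
quad = poly_fit(x, y, degree=2)
sse_line = sum((yi - predict(line, xi)) ** 2 for xi, yi in zip(x, y))
sse_quad = sum((yi - predict(quad, xi)) ** 2 for xi, yi in zip(x, y))
print(sse_line, sse_quad)  # the quadratic fit has (near) zero error here
```

Adding the squared term removes the error in this example only because the made-up data are exactly quadratic; with real, noisy data, higher powers always reduce in-sample error but can hurt out-of-sample predictions.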
What happens to the fit of the regression as more explanatory variables are added?
The fit typically improves, but it may lead to overfitting.
What is overfitting in the context of regression analysis?
When a model becomes too complex and captures noise rather than the underlying relationship.
What is the trade-off when adding more explanatory variables to a regression?
Increased complexity may lead to worse out-of-sample predictions.
What is one reason why the tenth-order polynomial regression might perform poorly?
It captures meaningless correlations that do not hold in different data sets.
What is the typical accuracy of polling averages in predicting presidential election outcomes?
Within 1 or 2 percentage points of the final vote share.
Who is Nate Silver?
A journalist known for political data analytics and averaging polls.
Fill in the blank: The relationship between age and voter turnout is approximately ______ from 18 to 68.
linear.
What is the average accuracy of polls on Election Day?
Within 1 or 2 percentage points of the final vote share.
Who established himself as a giant of political data analytics?
Nate Silver.
True or False: The Electoral College allows some candidates to win the election while losing the popular vote.
True.
What journal typically publishes a symposium before each presidential election?
PS: Political Science & Politics.
What are some fundamental variables that might help predict vote share?
- Economic growth
- Incumbency status
- Number of war casualties
What does the r² statistic represent in regression analysis?
The proportion of variance in the dependent variable that can be explained by the independent variables.
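As a hedged illustration, r² can be computed as one minus the ratio of the sum of squared errors to the total squared variation around the mean. The vote shares and fitted values below are invented, and the function name `r_squared` is an assumption for this sketch.

```python
def r_squared(actual, fitted):
    """1 - SSE / SST: share of the variation in `actual` accounted for by `fitted`."""
    y_bar = sum(actual) / len(actual)
    sse = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    sst = sum((a - y_bar) ** 2 for a in actual)
    return 1 - sse / sst


# Hypothetical vote shares (percent) and the values some regression predicts for them.
actual = [48.0, 52.0, 55.0, 45.0]
fitted = [49.0, 51.0, 54.0, 46.0]

print(r_squared(actual, fitted))  # 1 - 4/58, roughly 0.93
```

A high r² measures in-sample fit only; it says nothing by itself about how well the model predicts new data.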
What was the average prediction error of the regression using ten independent variables?
1.7 percentage points.
What happens to the average prediction error when out-of-sample testing is applied?
It jumps up from 1.7 to 5.6 percentage points.
Fill in the blank: A naive prediction based on a simple average of past elections gets within ______ percentage points.
4.6.
What is overfitting in regression analysis?
Predicting a dependent variable with too many independent variables.
What is the purpose of out-of-sample testing?
To assess the predictive accuracy of a model using data not included in the original regression.
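One common way to run an out-of-sample test is leave-one-out prediction: drop each observation in turn, fit the regression on the rest, and predict the dropped one. The sketch below assumes that approach (the chapter may do it differently) and uses invented growth and vote-share numbers with a single explanatory variable.

```python
def ols_fit(x, y):
    """Return (alpha, beta) for the least-squares line y = alpha + beta * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - beta * x_bar, beta


def in_sample_error(x, y):
    """Average absolute error when every point is also used to fit the line."""
    alpha, beta = ols_fit(x, y)
    return sum(abs(yi - (alpha + beta * xi)) for xi, yi in zip(x, y)) / len(x)


def leave_one_out_error(x, y):
    """Average absolute error when each point is predicted from a fit that excludes it."""
    errors = []
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        alpha, beta = ols_fit(x_train, y_train)
        errors.append(abs(y[i] - (alpha + beta * x[i])))
    return sum(errors) / len(errors)


# Hypothetical data: economic growth (%) and incumbent-party vote share (%).
growth = [0.5, 1.0, 2.0, 2.5, 3.0, 4.0]
vote_share = [47.0, 50.5, 49.0, 53.0, 51.5, 55.0]

print(in_sample_error(growth, vote_share))
print(leave_one_out_error(growth, vote_share))
```

The out-of-sample error is the larger of the two, and the gap tends to widen as more explanatory variables are added.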
What is the intercept in a regression context?
The predicted value of the outcome when all explanatory variables are set to 0.
What is the sum of squared errors (SSE)?
The total of squared differences between actual and predicted values.
Who is credited with coining the term ‘regression’?
Francis Galton.
What phenomenon did Galton discover related to regression?
Regression to the mean.
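Regression to the mean can be seen in a small simulation. Everything below is assumed for illustration: the average height of 69 inches, the spreads, and the rule that a son inherits half of his father's deviation from the average plus random noise. None of these numbers come from Galton's data.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))


rng = random.Random(0)  # fixed seed so the run is repeatable
MEAN_HEIGHT = 69.0      # inches (assumed)

fathers = [rng.gauss(MEAN_HEIGHT, 2.5) for _ in range(5000)]
# Assumption: a son keeps half of his father's deviation from the mean, plus noise.
sons = [MEAN_HEIGHT + 0.5 * (f - MEAN_HEIGHT) + rng.gauss(0, 2.2) for f in fathers]

slope = ols_slope(fathers, sons)

# Look only at unusually tall fathers and compare them with their sons.
tall = [(f, s) for f, s in zip(fathers, sons) if f > 74]
avg_tall_father = sum(f for f, _ in tall) / len(tall)
avg_their_sons = sum(s for _, s in tall) / len(tall)

print(slope, avg_tall_father, avg_their_sons)
```

The sons of very tall fathers are, on average, taller than the population mean but shorter than their fathers, and the regression slope is well below 1.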
What does the slope of the regression line indicate?
The sign and magnitude of the relationship between two variables.
What is the common form for presenting regression results?
In a table.
What type of regression uses the method of ordinary least squares?
OLS regression.
What is the root mean squared error (Root-MSE)?
The square root of the mean squared error, a measure of the typical size of a prediction error.
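A short sketch of the calculation, using invented actual and predicted values. This version divides by the number of observations; some statistical packages divide by the degrees of freedom instead, so reported values can differ slightly.

```python
from math import sqrt


def root_mse(actual, predicted):
    """Square root of the average squared prediction error (dividing by n)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical vote shares and predictions; the errors are -1, 2, -2, and 1.
actual = [50.0, 52.0, 47.0, 55.0]
predicted = [51.0, 50.0, 49.0, 54.0]

print(root_mse(actual, predicted))  # sqrt(2.5), about 1.58
```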
What is an independent or explanatory variable?
A variable used to predict or explain the dependent variable.
What is a dependent variable?
The variable associated with the outcome we are trying to describe, predict, or explain.
What might cause random variables to appear correlated with outcomes in regression?
Chance.
What did Galton’s analysis of heights reveal?
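Sons of unusually tall fathers tend to be taller than average but shorter than their fathers; likewise, sons of unusually short fathers tend to be shorter than average but taller than their fathers.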
What is the goal of researchers when trying to predict election outcomes?
To find new variables to improve predictive power.
What is the conditional mean function?
A function that tells you the mean (average) of some variable conditional on the value of some other variables.
Define out-of-sample prediction.
Using regression (or another statistical technique) to predict the outcome for observations that were not included in the original data you used to generate your predictions.
What is overfitting?
Attempting to predict a dependent variable with too many independent variables, so that variables appear to predict the dependent variable in the data but have no actual relationship with it in the world.
What does the dataset SchoolingEarnings.csv provide?
The average annual earnings for 41- to 50-year-old men in the United States in 1980 at each level of schooling.
What is the dependent variable in the regression exercise discussed?
Earnings.
Fill in the blank: A linear regression is a _______ to predict earnings using only years of schooling.
parsimonious way
What is the first step in examining the relationship between earnings and schooling?
Start by making a scatter plot.
What type of regression should be run to include schooling, schooling^2, schooling^3, and schooling^4?
Fourth-order polynomial regression.
What is the purpose of running separate regressions for different ranges of schooling?
To see if those lines look meaningfully different from the predictions you get from a single regression including all the data.
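The comparison could be sketched as follows. The numbers are invented and are NOT the values in SchoolingEarnings.csv; the cutoff at twelve years and the helper name `ols_fit` are also assumptions for this example.

```python
def ols_fit(x, y):
    """Return (alpha, beta) for the least-squares line y = alpha + beta * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    beta = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - beta * x_bar, beta


# Hypothetical (years of schooling, average earnings in $1000s).
schooling = [8, 9, 10, 11, 12, 13, 14, 15, 16]
earnings = [20, 21, 22, 23, 24, 27, 30, 33, 36]

cutoff = 12  # assumed split point
low = [(s, e) for s, e in zip(schooling, earnings) if s <= cutoff]
high = [(s, e) for s, e in zip(schooling, earnings) if s > cutoff]

_, slope_all = ols_fit(schooling, earnings)
_, slope_low = ols_fit(*zip(*low))    # line fit only to schooling <= cutoff
_, slope_high = ols_fit(*zip(*high))  # line fit only to schooling > cutoff

print(slope_low, slope_all, slope_high)  # 1.0 2.0 3.0
```

When the segment slopes differ this much from the pooled slope, a single straight line is hiding a change in the relationship.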
Does all the analysis make you think the simple linear approach was reasonable or unreasonable?
Subjective judgment based on analysis results.
What is the goal of conducting out-of-sample tests?
To evaluate your prediction strategy.
Using only data for those with twelve years of schooling or less, what should you predict?
Earnings for those with more than twelve years of schooling.
Who wrote ‘The History of Statistics: The Measurement of Uncertainty before 1900’?
Stephen M. Stigler.
True or False: Asking enough people to guess the weight of an ox will always yield a correct answer.
False.
What is the principle behind asking many non-experts for their guesses?
Their errors will cancel out and you’ll get a good answer.
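A small simulation, under assumed numbers, shows both halves of the last two cards: independent errors tend to cancel when guesses are averaged, but a bias shared by everyone does not. The true weight of 1200 pounds, the spread of the guesses, and the 100-pound shared bias are all invented for the sketch.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_WEIGHT = 1200.0     # hypothetical weight of the ox, in pounds

# Independent guessers: each is off by a random amount centered on zero.
guesses = [TRUE_WEIGHT + rng.gauss(0, 150) for _ in range(800)]
crowd_error = abs(sum(guesses) / len(guesses) - TRUE_WEIGHT)
typical_individual_error = sum(abs(g - TRUE_WEIGHT) for g in guesses) / len(guesses)

# Biased guessers: everyone overestimates by the same 100 pounds on top of the noise.
biased = [g + 100 for g in guesses]
biased_crowd_error = abs(sum(biased) / len(biased) - TRUE_WEIGHT)

print(crowd_error, typical_individual_error, biased_crowd_error)
```

Averaging shrinks the random part of the error but leaves the shared bias untouched, which is why a large crowd is not guaranteed to be right.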