Topics 20-23 Flashcards
Sample regression function
The sample regression function is an equation that represents the relationship between the dependent variable (Y) and the independent variable(s) (X), based only on the information contained in a sample drawn from the population.
In almost all cases, the slope and intercept coefficients of a sample regression function will differ from those of the population regression function. If the sample of X and Y variables is truly a random sample, then the differences between the sample coefficients and the population coefficients will be random as well.
OLS (Ordinary Least Squares) - slope and intercept
OLS estimation selects the intercept and slope coefficients that minimize the sum of squared residuals (the squared vertical distances between the observed Y values and the fitted regression line).
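A minimal sketch of the slope and intercept calculation, assuming numpy and two illustrative data arrays x and y (the values are hypothetical):

```python
import numpy as np

# Illustrative sample data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS slope: sample covariance of X and Y divided by sample variance of X
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# OLS intercept: the fitted line passes through the point of sample means
intercept = y.mean() - slope * x.mean()
```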
Assumptions of OLS
OLS regression requires a number of assumptions. Most of the major assumptions pertain to the regression model’s residual term (i.e., error term).
Three key assumptions are as follows:
- The expected value of the error term, conditional on the independent variable, is zero (E(εi|Xi) = 0).
- All (X, Y) observations are independent and identically distributed (i.i.d.).
- It is unlikely that large outliers will be observed in the data. Large outliers have the potential to create misleading regression results.
Additional assumptions include:
- A linear relationship exists between the dependent and independent variable.
- The model is correctly specified in that it includes the appropriate independent variable and does not omit variables.
- The independent variable is uncorrelated with the error terms.
- The variance of εi is constant for all Xi: Var(εi|Xi) = σ².
- No serial correlation of the error terms exists [i.e., Corr(εi, εi+1) = 0 for i = 1, 2, 3, …].
- The point is that knowing the value of the error for one observation reveals no information about the value of the error for any other observation.
- The error term is normally distributed.
Properties of OLS estimators
Since OLS estimators are derived from random samples, these estimators are also random variables because they vary from one sample to the next. Therefore, OLS estimators will have their own probability distributions (i.e., sampling distributions). These sampling distributions allow us to estimate population parameters, such as the population mean, the population regression intercept term, and the population regression slope coefficient.
Drawing multiple samples from a population will produce multiple sample means. The distribution of these sample means is referred to as the sampling distribution of the sample mean. The mean of this sampling distribution is used as an estimator of the population mean and is said to be an unbiased estimator of the population mean.
An unbiased estimator is one for which the expected value of the estimator is equal to the parameter you are trying to estimate.
Given the central limit theorem, for large sample sizes it is reasonable to assume that the sampling distribution of an OLS estimator approaches a normal distribution. OLS estimators are also consistent estimators.
A consistent estimator is one for which the accuracy of the parameter estimate increases as the sample size increases. Note that a general guideline for a large sample size in regression analysis is a sample greater than 100.
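A rough simulation sketch of these properties (the population parameters, error distribution, and sample sizes below are illustrative assumptions): many random samples are drawn, the OLS slope is estimated in each, and the estimates center on the true slope (unbiasedness) while their spread shrinks as the sample size grows (consistency).

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0                  # assumed population intercept and slope

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

for n in (30, 1000):                     # small versus large sample size
    slopes = []
    for _ in range(5000):                # repeated random samples from the population
        x = rng.normal(0.0, 1.0, n)
        eps = rng.normal(0.0, 1.0, n)    # i.i.d. errors with E(eps|x) = 0
        y = beta0 + beta1 * x + eps
        slopes.append(ols_slope(x, y))
    slopes = np.array(slopes)
    # Mean of the estimates is near beta1; the standard deviation falls as n rises
    print(n, slopes.mean(), slopes.std())
```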
Sum of squared residuals (SSR)
The sum of squared residuals (SSR), sometimes denoted SSE for sum of squared errors, results from placing a given intercept and slope coefficient into the regression equation, computing the residuals, squaring them, and summing them. It is represented by Σei². The sum is an indicator of how well the sample regression function explains the data.
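A short sketch of the SSR calculation, assuming numpy, illustrative data, and a hypothetical fitted intercept b0 and slope b1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b0, b1 = 0.1, 1.97                  # hypothetical fitted intercept and slope

residuals = y - (b0 + b1 * x)       # e_i = actual Y minus fitted Y
ssr = np.sum(residuals ** 2)        # sum of squared residuals
```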
Coefficient of determination
The coefficient of determination, represented by R², is a measure of the “goodness of fit” of the regression. It is interpreted as the percentage of the variation in the dependent variable that is explained by the independent variable(s). The underlying concept is that, for the dependent variable, there is a total sum of squares (TSS) around the sample mean. The regression equation explains some portion of that TSS. Since the explained portion is determined by the independent variables, which are assumed independent of the errors, the total sum of squares can be broken down as follows:
Total sum of squares = explained sum of squares + sum of squared residuals
Note: The sum of squared residuals (SSR) is also known as the sum of squared errors (SSE). In the same regard, the total sum of squares (TSS) is also known as the sum of squares total (SST), and the explained sum of squares (ESS) is also known as the regression sum of squares (RSS).
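A sketch of the decomposition and the resulting R², using the same kind of illustrative data as above and OLS-fitted coefficients (all numbers are assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the OLS line so that the decomposition TSS = ESS + SSR holds exactly
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)        # total variation around the sample mean
ess = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
ssr = np.sum((y - y_hat) ** 2)           # unexplained variation (squared residuals)

r_squared = ess / tss                    # equivalently: 1 - ssr / tss
```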
Standard Error of the Regression
The standard error of the regression (SER) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SER gauges the “fit” of the regression line. The smaller the standard error, the better the fit.
The SER is the standard deviation of the error terms in the regression. As such, SER is also referred to as the standard error of the residual, or the standard error of estimate (SEE).
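A sketch of the SER for a simple regression with one independent variable, where the degrees of freedom are n − 2 (data and fitted coefficients are illustrative, as in the earlier sketches):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

n = len(y)
ssr = np.sum(residuals ** 2)
ser = np.sqrt(ssr / (n - 2))    # residual standard deviation, adjusted for the 2 estimated coefficients
```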
Confidence intervals for regression coefficients
Hypothesis tests about regression coefficients
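A hedged sketch of both ideas for a slope coefficient, assuming a hypothetical estimated slope, its standard error, and n − 2 degrees of freedom from a simple regression:

```python
from scipy import stats

b1 = 0.64       # hypothetical estimated slope
se_b1 = 0.26    # hypothetical standard error of the slope
n = 36          # hypothetical sample size; df = n - 2 for a simple regression
df = n - 2

# Confidence interval: estimate +/- critical t-value times standard error
t_crit = stats.t.ppf(0.975, df)    # 95% two-tailed critical value
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Hypothesis test of H0: slope = 0
t_stat = (b1 - 0) / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
```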
Interpret p-value for linear regression
For two-tailed tests, the p-value is the probability that lies above the positive value of the computed test statistic plus the probability that lies below the negative value of the computed test statistic.
For example, by consulting the z-table, the probability that lies above a test statistic of 2.46 is: (1 − 0.9931) = 0.0069 = 0.69%. With a two-tailed test, this p-value is: 2 × 0.69% = 1.38%. Therefore, the null hypothesis can be rejected at any level of significance greater than 1.38%. However, with a level of significance of, say, 1%, we would fail to reject the null.
A very small p-value provides support for rejecting the null hypothesis. This would indicate a large test statistic that is likely greater than critical values for a common level of significance (e.g., 5%).
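A quick check of the arithmetic in the example above, using scipy's standard normal distribution:

```python
from scipy.stats import norm

test_statistic = 2.46
one_tail = 1 - norm.cdf(test_statistic)   # probability above +2.46, about 0.0069
p_value = 2 * one_tail                    # two-tailed p-value, about 0.0138 (1.38%)
```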
Confidence intervals for the predicted value of a dependent variable
Heteroskedasticity
Heteroskedasticity is present when the variance of the error term is not constant across observations, violating the constant-variance (homoskedasticity) assumption Var(εi|Xi) = σ².
Effect of Heteroskedasticity on Regression Analysis
There are several effects of heteroskedasticity you need to be aware of:
- The standard errors are usually unreliable estimates.
- The coefficient estimates (the bj) aren’t affected.
- If the standard errors are too small (while the coefficient estimates themselves are unaffected), the t-statistics will be too large, and the null hypothesis of no statistical significance will be rejected too often. The opposite is true if the standard errors are too large.
How to correct heteroskedasticity
Heteroskedasticity is not easy to correct.
The most common remedy, however, is to calculate robust standard errors. These robust standard errors are used to recalculate the t-statistics using the original regression coefficients.
Use robust standard errors to calculate t-statistics if there is evidence of heteroskedasticity.
By default, many statistical software packages apply homoskedastic standard errors unless the user specifies otherwise.
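A minimal sketch of requesting robust standard errors with the statsmodels package (the simulated data and the choice of the HC1 estimator are illustrative assumptions); comparing the two fits shows that the coefficients are unchanged while the standard errors and t-statistics are not:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
# Heteroskedastic errors: the error variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 200) * (1 + np.abs(x))

X = sm.add_constant(x)                           # add the intercept column
default_fit = sm.OLS(y, X).fit()                 # homoskedastic (default) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors

# Coefficients are identical; only the standard errors and t-statistics differ
print(default_fit.bse, robust_fit.bse)
print(default_fit.tvalues, robust_fit.tvalues)
```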
The Gauss-Markov theorem