Topics 20-23 Flashcards

1
Q

Sample regression function

A

The sample regression function is an equation that represents a relationship between the Y and X variable(s) that is based only on the information in a sample of the population.

In almost all cases the slope and intercept coefficients of a sample regression function will be different from that of the population regression function. If the sample of X and Y variables is truly a random sample, then the difference between the sample coefficients and the population coefficients will be random too.
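
A minimal simulation sketch of this idea (all numbers and variable names are illustrative): one random sample drawn from a known population line produces sample coefficients that are close to, but not equal to, the population coefficients.

  import numpy as np

  rng = np.random.default_rng(42)
  beta0, beta1 = 2.0, 0.5          # population intercept and slope (assumed for illustration)

  # Draw one random sample from the population model Y = beta0 + beta1*X + error
  x = rng.normal(10, 2, size=50)
  y = beta0 + beta1 * x + rng.normal(0, 1, size=50)

  # Sample regression function fitted by least squares
  b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
  b0 = y.mean() - b1 * x.mean()
  print(b0, b1)   # close to, but not equal to, 2.0 and 0.5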

2
Q

OLS (Ordinary Least Squares) - slope and intercept

A
3
Q

Assumptions of OLS

A

OLS regression requires a number of assumptions. Most of the major assumptions pertain to the regression model’s residual term (i.e., error term).

Three key assumptions are as follows:

  • The expected value of the error term, conditional on the independent variable, is zero (E(εi|Xi) = 0).
  • All (X, Y) observations are independent and identically distributed (i.i.d.).
  • It is unlikely that large outliers will be observed in the data. Large outliers have the potential to create misleading regression results.

Additional assumptions include:

  • A linear relationship exists between the dependent and independent variable.
  • The model is correctly specified in that it includes the appropriate independent variable and does not omit variables.
  • The independent variable is uncorrelated with the error terms.
  • The variance of εi is constant for all Xi: Var(εi|Xi) = σ2.
  • No serial correlation of the error terms exists [i.e., Corr(εi, εi+1) = 0 for i = 1, 2, 3, …]; that is, knowing the value of the error for one observation reveals nothing about the value of the error for another observation.
  • The error term is normally distributed.
4
Q

Properties of OLS estimators

A

Since OLS estimators are derived from random samples, these estimators are also random variables because they vary from one sample to the next. Therefore, OLS estimators will have their own probability distributions (i.e., sampling distributions). These sampling distributions allow us to estimate population parameters, such as the population mean, the population regression intercept term, and the population regression slope coefficient.

Drawing multiple samples from a population will produce multiple sample means. The distribution of these sample means is referred to as the sampling distribution of the sample mean. The mean of this sampling distribution is used as an estimator of the population mean and is said to be an unbiased estimator of the population mean.

An unbiased estimator is one for which the expected value of the estimator is equal to the parameter you are trying to estimate.

Given the central limit theorem, for large sample sizes, it is reasonable to assume that the sampling distribution will approach the normal distribution. This means that the estimator is also a consistent estimator.

A consistent estimator is one for which the accuracy of the parameter estimate increases as the sample size increases. Note that a general guideline for a large sample size in regression analysis is a sample greater than 100.
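
A small simulation sketch of unbiasedness and consistency (population values, sample sizes, and replication counts are illustrative): across many samples the average slope estimate is close to the true slope, and the spread of the estimates shrinks as the sample size grows.

  import numpy as np

  rng = np.random.default_rng(0)
  beta1 = 0.5                       # true (population) slope, assumed for illustration

  def slope_estimates(n, reps=2000):
      est = []
      for _ in range(reps):
          x = rng.normal(0, 1, n)
          y = 2.0 + beta1 * x + rng.normal(0, 1, n)
          est.append(np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1))
      return np.array(est)

  for n in (25, 100, 400):
      b1 = slope_estimates(n)
      print(n, round(b1.mean(), 3), round(b1.std(), 3))  # mean stays near 0.5; std falls as n rises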

5
Q

Sum of squared residuals (SSR)

A

The sum of squared residuals (SSR), sometimes denoted SSE for sum of squared errors, is obtained by placing a given intercept and slope coefficient into the equation, computing the residuals, squaring them, and summing the squares. It is represented by Σei2. The smaller the sum, the better the sample regression function explains the data.
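
A short sketch of the calculation (the data and the candidate coefficients are illustrative):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
  b0, b1 = 1.0, 1.0                 # a candidate intercept and slope

  residuals = y - (b0 + b1 * x)     # e_i = actual Y minus fitted Y
  ssr = np.sum(residuals ** 2)      # sum of squared residuals, Σei2
  print(ssr)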

6
Q

Coefficient of determination

A

The coefficient of determination, represented by R2, is a measure of the “goodness of fit” of the regression. It is interpreted as a percentage of variation in the dependent variable explained by the independent variable. The underlying concept is that for the dependent variable, there is a total sum of squares (TSS) around the sample mean. The regression equation explains some portion of that TSS. Since the explained portion is determined by the independent variables, which are assumed independent of the errors, the total sum of squares can be broken down as follows:

Total sum of squares = explained sum of squares + sum of squared residuals

! Sum of squared residuals (SSR) is also known as the sum of squared errors (SSE). In the same regard, total sum of squares (TSS) is also known as sum of squares total (SST), and explained sum of squares (ESS) is also known as regression sum of squares (RSS).
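
A sketch of the decomposition and of R2 computed from it (data are illustrative; the fitted values come from an OLS fit, for which the identity TSS = ESS + SSR holds exactly):

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
  y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
  b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
  b0 = y.mean() - b1 * x.mean()
  y_hat = b0 + b1 * x

  tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
  ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
  ssr = np.sum((y - y_hat) ** 2)          # sum of squared residuals
  r_squared = ess / tss                   # equivalently 1 - ssr / tss
  print(round(tss, 4), round(ess + ssr, 4), round(r_squared, 4))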

7
Q

Standard Error of the Regression

A

The standard error of the regression (SER) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SER gauges the “fit” of the regression line. The smaller the standard error, the better the fit.

The SER is the standard deviation of the error terms in the regression. As such, SER is also referred to as the standard error of the residual, or the standard error of estimate (SEE).
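
In the one-regressor case the SER is commonly computed as the square root of the sum of squared residuals divided by n - 2; a one-line sketch, assuming the OLS residuals are already available (values are illustrative):

  import numpy as np

  residuals = np.array([0.1, -0.2, 0.3, -0.1, -0.1])    # illustrative OLS residuals
  n = len(residuals)
  ser = np.sqrt(np.sum(residuals ** 2) / (n - 2))        # standard error of the regression
  print(ser)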

8
Q

Confidence intervals for regression coefficients

A
9
Q

Hypothesis tests about regression coefficients

A
10
Q

Interpret p-value for linear regression

A

For two-tailed tests, the p-value is the probability that lies above the positive value of the computed test statistic plus the probability that lies below the negative value of the computed test statistic.

For example, by consulting the z-table, the probability that lies above a test statistic of 2.46 is: (1 - 0.9931) = 0.0069 = 0.69%. With a two-tailed test, this p-value is: 2 × 0.69% = 1.38%. Therefore, the null hypothesis can be rejected at any level of significance greater than 1.38%. However, with a level of significance of, say, 1%, we would fail to reject the null.

A very small p-value provides support for rejecting the null hypothesis. This would indicate a large test statistic that is likely greater than critical values for a common level of significance (e.g., 5%).
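
The same two-tailed p-value can be reproduced with an exact normal CDF rather than a table (the test statistic 2.46 is taken from the example above; small rounding differences versus the table are expected):

  from scipy.stats import norm

  z = 2.46
  p_one_tail = 1 - norm.cdf(z)      # area above +2.46
  p_two_tail = 2 * p_one_tail       # add the matching area below -2.46
  print(round(p_one_tail, 4), round(p_two_tail, 4))   # about 0.0069 and 0.0139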

11
Q

Confidence intervals for the predicted value of a dependent variable

A
12
Q

Heteroskedasticity

A
13
Q

Effect of Heteroskedasticity on Regression Analysis

A

There are several effects of heteroskedasticity you need to be aware of:

  • The standard errors are usually unreliable estimates.
  • The coefficient estimates (the bj) aren’t affected.
  • If the standard errors are too small while the coefficient estimates are unaffected, the t-statistics will be too large, and the null hypothesis that a coefficient equals zero will be rejected too often. The opposite is true if the standard errors are too large.
14
Q

How to correct heteroskedasticity

A

Heteroskedasticity is not easy to correct.

The most common remedy, however, is to calculate robust standard errors. These robust standard errors are used to recalculate the t-statistics using the original regression coefficients.

Use robust standard errors to calculate t-statistics if there is evidence of heteroskedasticity.

By default, many statistical software packages apply homoskedastic standard errors unless the user specifies otherwise.
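
A sketch using statsmodels as one such package (data are simulated for illustration): the default fit reports homoskedasticity-only standard errors, and requesting a robust covariance type produces heteroskedasticity-robust ones while the coefficient estimates stay the same.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(1)
  x = rng.normal(0, 1, 200)
  y = 1.0 + 0.5 * x + rng.normal(0, 1 + np.abs(x), 200)   # error variance depends on x

  X = sm.add_constant(x)
  default_fit = sm.OLS(y, X).fit()                 # homoskedasticity-only standard errors
  robust_fit = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors

  print(default_fit.bse)   # coefficient estimates are identical; only the standard errors differ
  print(robust_fit.bse)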

15
Q

The Gauss-Markov theorem

A
16
Q

t-statistics for small sample sizes

A

In order to analyze a regression coefficient t-statistic when the sample size is small, we must assume the assumptions underlying linear regression hold.

In particular, in order to apply and interpret the t-statistic, the error terms must be homoskedastic (i.e., have constant variance) and must be normally distributed. If this is the case, the t-statistic can be computed using the default standard error (i.e., the homoskedasticity-only standard error), and it follows a t-distribution with n - 2 degrees of freedom.

In practice, error terms are rarely both homoskedastic and normally distributed. However, sample sizes are generally large enough to apply the central limit theorem, meaning that we can calculate t-statistics using homoskedasticity-only standard errors. In other words, with a large sample size, differences between the t-distribution and the standard normal distribution can be ignored.
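
A quick illustration with scipy of how the t-distribution with n - 2 degrees of freedom converges to the standard normal as the sample grows (the sample sizes are arbitrary):

  from scipy.stats import norm, t

  for n in (10, 30, 100, 1000):
      print(n, round(t.ppf(0.975, df=n - 2), 3))   # two-tailed 5% critical value
  print(round(norm.ppf(0.975), 3))                  # normal benchmark, about 1.96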

17
Q

Omitted variable bias

A

Omitting relevant factors from an ordinary least squares (OLS) regression can produce misleading or biased results. Omitted variable bias is present when two conditions are met:

  1. the omitted variable is correlated with an independent variable included in the model, and
  2. the omitted variable is a determinant of the dependent variable.

When relevant variables are absent from a linear regression model, the results will likely lead to incorrect conclusions, as the OLS estimators may not accurately portray the actual data.

Omitted variable bias violates the assumptions of OLS regression when the omitted variable is in fact correlated with the current independent (explanatory) variable(s). The reason is that omitted factors that partially explain the movement of the dependent variable become part of the regression’s error term, since they are not explicitly identified within the model. If the omitted variable is correlated with an included independent variable, then the error term will also be correlated with that independent variable, which biases its estimated slope coefficient.

! Recall, that according to the assumptions of linear regression, the independent variable must be uncorrelated with the error term.
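
A simulation sketch of the bias (all parameter values are illustrative): the true model uses X1 and X2, X2 is correlated with X1, and omitting X2 shifts the estimated coefficient on X1 away from its true value.

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(7)
  n = 5000
  x1 = rng.normal(0, 1, n)
  x2 = 0.8 * x1 + rng.normal(0, 1, n)        # omitted variable, correlated with x1
  y = 1.0 + 0.5 * x1 + 0.9 * x2 + rng.normal(0, 1, n)

  full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
  omitted = sm.OLS(y, sm.add_constant(x1)).fit()

  print(full.params[1])      # close to the true 0.5
  print(omitted.params[1])   # biased upward, roughly 0.5 + 0.9 * 0.8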

18
Q

The standard error of the regression (SER) for multiple regression case

A
19
Q

The multiple coefficient of determination

A
20
Q

Adjusted R2

A
21
Q

Assumptions of the multiple linear regression model

A

As with simple linear regression, most of the assumptions made with the multiple regression pertain to ε, the model’s error term:

  • A linear relationship exists between the dependent and independent variables.
  • The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
  • The expected value of the error term, conditional on the independent variables, is zero [i.e., E(ε|X1,X2,…Xk) = 0].
  • The variance of the error terms is constant for all observations [i.e., E(εi2) = σε2].
  • The error term for one observation is not correlated with that of another observation [i.e., E(εiεj) = 0, j ≠ i].
  • The error term is normally distributed.
22
Q

Multicollinearity (perfect, imperfect), dummy variable trap

A

Multicollinearity refers to the condition when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other. This condition distorts the standard error of the regression and the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.

If one of the independent variables is a perfect linear combination of the other independent variables, then the model is said to exhibit perfect multicollinearity. In this case, it will not be possible to find the OLS estimators necessary for the regression results.

An important consideration when performing multiple regression with dummy variables is the choice of the number of dummy variables to include in the model. Whenever we want to distinguish between n classes, we must use n - 1 dummy variables. Otherwise, the regression assumption of no exact linear relationship between independent variables would be violated. In general, if every observation is linked to only one class, all n dummy variables are included as regressors, and an intercept term exists, then the regression will exhibit perfect multicollinearity. This problem is known as the dummy variable trap.

Imperfect multicollinearity arises when two or more independent variables are highly correlated, but less than perfectly correlated. When conducting regression analysis, we need to be cognizant of imperfect multicollinearity since OLS estimators will be computed, but the resulting coefficients may be improperly estimated. In general, when using the term multicollinearity, we are referring to the imperfect case, since this regression assumption violation requires detecting and correcting.
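
A sketch of the dummy variable trap (the class labels are illustrative): including all n dummies plus an intercept makes the design matrix rank deficient, while dropping one dummy restores full column rank.

  import numpy as np
  import pandas as pd

  sector = pd.Series(["tech", "energy", "health", "tech", "energy", "health"])

  all_dummies = pd.get_dummies(sector, dtype=float)             # n dummy columns
  trap = np.column_stack([np.ones(len(sector)), all_dummies])   # intercept plus all n dummies
  ok = np.column_stack([np.ones(len(sector)),
                        pd.get_dummies(sector, drop_first=True, dtype=float)])   # n - 1 dummies

  print(np.linalg.matrix_rank(trap), trap.shape[1])   # rank < number of columns: perfect multicollinearity
  print(np.linalg.matrix_rank(ok), ok.shape[1])       # full column rank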

23
Q

Effect of Multicollinearity on Regression Analysis, Detecting Multicollinearity

A

As a result of multicollinearity, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant (i.e., a Type II error).

The most common sign of multicollinearity is the situation where t-tests indicate that none of the individual coefficients is significantly different from zero, while the R2 is high.

If the absolute value of the sample correlation between any two independent variables in the regression is greater than 0.7, multicollinearity is a potential problem. However, this only works if there are exactly two independent variables.

If there are more than two independent variables, while individual variables may not be highly correlated, linear combinations might be, leading to multicollinearity.

High correlation among the independent variables suggests the possibility of multicollinearity, but low correlation among the independent variables does not necessarily indicate multicollinearity is not present.
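
A minimal check of the pairwise-correlation rule described above (the data and the 0.7 threshold are illustrative; as noted, low pairwise correlations do not rule out multicollinearity among linear combinations):

  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(3)
  x1 = rng.normal(0, 1, 500)
  X = pd.DataFrame({"x1": x1,
                    "x2": 0.9 * x1 + 0.2 * rng.normal(0, 1, 500),   # highly correlated with x1
                    "x3": rng.normal(0, 1, 500)})

  corr = X.corr()
  flags = (corr.abs() > 0.7) & (corr.abs() < 1.0)   # off-diagonal pairs above the rule of thumb
  print(corr.round(2))
  print(flags)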

24
Q

Hypothesis Testing of Regression Coefficients in a multiple regression

A
25
Q

Interpreting p-Values in multiple regression

A

An alternative method of testing hypotheses about the coefficients is to compare the p-value to the significance level:

  • If the p-value is less than the significance level, the null hypothesis can be rejected.
  • If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
26
Q

Joint Hypothesis Testing

A

A joint hypothesis tests two or more coefficients at the same time.

For example, we could develop a null hypothesis for a linear regression model with three independent variables that sets two of the coefficients equal to zero: H0: b1 = 0 and b2 = 0, versus the alternative hypothesis that at least one of them is not equal to zero. If just one of the equalities in this null hypothesis does not hold, we can reject the entire null hypothesis.

A joint hypothesis test is preferred in this setting because testing the coefficients individually leads to a greater chance of rejecting the null hypothesis: instead of comparing one t-statistic to its corresponding critical value, we are testing two t-statistics, which gives an additional opportunity to reject the null. A robust method for joint hypothesis testing, especially when the independent variables are correlated, is the F-test.

The F-statistic can be used to test the null hypothesis that, jointly, all of the independent variables have no influence on the dependent variable. The test statistic is given by F = (ESS/df1)/(SSR/df2), where the numerator’s degrees of freedom (df1) equal the number of partial slope coefficients and the denominator’s degrees of freedom (df2) equal n minus the total number of estimated coefficients (slopes plus the intercept).
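
A sketch of the computation under these definitions (the sums of squares, sample size, and number of slope coefficients are illustrative):

  from scipy.stats import f

  ess, ssr = 240.0, 160.0     # explained and residual sums of squares
  n, k = 64, 3                # observations and partial slope coefficients

  df1 = k                     # numerator degrees of freedom
  df2 = n - (k + 1)           # denominator: n minus total coefficients (slopes plus intercept)
  f_stat = (ess / df1) / (ssr / df2)
  print(round(f_stat, 2), round(f.ppf(0.95, df1, df2), 2))   # compare to the 5% critical value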

27
Q

F-test in joint hypothesis testing

A

An F-test assesses how well the set of independent variables, as a group, explains the variation in the dependent variable. That is, the F-statistic is used to test whether at least one of the independent variables explains a significant portion of the variation of the dependent variable.

! It may have occurred to you that an easier way to test all of the coefficients simultaneously is to just conduct all of the individual t-tests and see how many of them you can reject. This is the wrong approach, however, because if you set the significance level for each t-test at 5%, for example, the significance level from testing them all simultaneously is NOT 5%, but rather some higher percentage.

!! When testing the hypothesis that all the regression coefficients are simultaneously equal to zero, the F-test is always a one-tailed test, despite the fact that it looks like it should be a two-tailed test because there is an equal sign in the null hypothesis.

28
Q

Coefficient of determination in multiple regression

A
29
Q

Specification bias

A

Specification bias refers to the fact that the slope coefficient and other statistics for a given independent variable in a simple regression are usually different from those for the same variable when it is included in a multiple regression.

30
Q

Interpret the R2 and adjusted R2 in a multiple regression (pitfalls)

A

When computing both the R2 and the adjusted R2, there are a few pitfalls to acknowledge, which could lead to invalid conclusions.

  1. If adding an additional independent variable to the regression improves the R2, this variable is not necessarily statistically significant.
  2. The R2 measure may be spurious, meaning that the independent variables may show a high R2; however, they are not the exact cause of the movement in the dependent variable.
  3. If the R2 is high, we cannot assume that we have found all relevant independent variables. Omitted variables may still exist, which would improve the regression results further.
  4. The R2 measure does not provide evidence that the most or least appropriate independent variables have been selected. Many factors go into finding the most robust regression model, including omitted variable analysis, economic theory, and the quality of data being used to generate the model.
31
Q

Restricted vs. Unrestricted Least Squares Models

A

A restricted least squares regression imposes a value on one or more coefficients with the goal of analyzing if the restriction is significant.

The R2 from the restricted regression is called a restricted R2 or Rr2.

For comparison, the unrestricted R2 from the specification that includes all of the independent variables is given the notation Rur2, and both are included in an F-statistic that can test whether the restriction is significant:
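
One standard way to write this F-statistic, using q for the number of restrictions and kur for the number of independent variables in the unrestricted model (notation introduced here for the formula), is:

F = [(Rur2 - Rr2) / q] / [(1 - Rur2) / (n - kur - 1)]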

32
Q

Interpret tests of a single restriction involving multiple coefficients

A

What if we wanted to test whether one coefficient was equal to another such that: H0 : b1 = b2?

  • The first approach is to directly test the restriction stated in the null. Some statistical packages can test this restriction and output a corresponding F-stat.
  • The second approach transforms the regression and uses the null hypothesis as an assumption to simplify the regression model. For example, in a regression with two independent variables, Yi = B0 + B1X1i + B2X2i + εi, we can add and subtract B2X1i to transform the regression into: Yi = B0 + (B1 - B2)X1i + B2(X1i + X2i) + εi. If the null hypothesis B1 = B2 is true, the coefficient (B1 - B2) on X1i is zero and that term drops out, leaving: Yi = B0 + B2(X1i + X2i) + εi. The test therefore changes from a single restriction involving multiple coefficients to a single restriction on just one coefficient (a sketch of this approach follows below).
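
A sketch of the transformed-regression approach with statsmodels (the data are simulated and the true coefficients are set equal, so the null is actually true; the combined regressor W = X1 + X2 follows the transformation above):

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(5)
  n = 1000
  x1 = rng.normal(0, 1, n)
  x2 = rng.normal(0, 1, n)
  y = 1.0 + 0.4 * x1 + 0.4 * x2 + rng.normal(0, 1, n)   # B1 = B2, so H0 is true

  w = x1 + x2                                  # combined regressor from the transformation
  X = sm.add_constant(np.column_stack([x1, w]))
  fit = sm.OLS(y, X).fit()

  # The coefficient on x1 now estimates (B1 - B2); its t-test is the test of H0: B1 = B2
  print(fit.params[1], fit.pvalues[1])
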
33
Q

Model Misspecification

A

Omitting relevant factors from a regression can produce misleading or biased results.

Similar to simple linear regression, omitted variable bias in multiple regressions will result if the following two conditions occur:

  • The omitted variable is a determinant of the dependent variable.
  • The omitted variable is correlated with at least one of the independent variables.