Topics 20-23 Flashcards
Sample regression function
The sample regression function is an equation that represents the relationship between the dependent variable (Y) and the independent variable(s) (X), based only on the information contained in a sample drawn from the population.
In almost all cases, the slope and intercept coefficients of a sample regression function will differ from those of the population regression function. If the sample of X and Y variables is truly a random sample, then the differences between the sample coefficients and the population coefficients will be random as well.
OLS (Ordinary Least Squares) - slope and intercept
OLS estimation selects the intercept and slope coefficients that minimize the sum of squared residuals (the squared vertical distances between the observed Y values and the fitted regression line).
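A minimal sketch of the slope and intercept calculation, assuming numpy and two illustrative data arrays x and y (the values are hypothetical):

```python
import numpy as np

# Illustrative sample data (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS slope: sample covariance of X and Y divided by sample variance of X
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# OLS intercept: the fitted line passes through the point of sample means
intercept = y.mean() - slope * x.mean()
```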
Assumptions of OLS
OLS regression requires a number of assumptions. Most of the major assumptions pertain to the regression model’s residual term (i.e., error term).
Three key assumptions are as follows:
- The expected value of the error term, conditional on the independent variable, is zero (E(εi|Xi) = 0).
- All (X, Y) observations are independent and identically distributed (i.i.d.).
- It is unlikely that large outliers will be observed in the data. Large outliers have the potential to create misleading regression results.
Additional assumptions include:
- A linear relationship exists between the dependent and independent variable.
- The model is correctly specified in that it includes the appropriate independent variable and does not omit variables.
- The independent variable is uncorrelated with the error terms.
- The variance of εi is constant for all Xi: Var(εi|Xi) = σ².
- No serial correlation of the error terms exists [i.e., Corr(εi, εi+1) = 0 for i = 1, 2, 3, …].
- The point is that knowing the value of the error for one observation reveals no information about the value of the error for any other observation.
- The error term is normally distributed.
Properties of OLS estimators
Since OLS estimators are derived from random samples, these estimators are also random variables because they vary from one sample to the next. Therefore, OLS estimators will have their own probability distributions (i.e., sampling distributions). These sampling distributions allow us to estimate population parameters, such as the population mean, the population regression intercept term, and the population regression slope coefficient.
Drawing multiple samples from a population will produce multiple sample means. The distribution of these sample means is referred to as the sampling distribution of the sample mean. The mean of this sampling distribution is used as an estimator of the population mean and is said to be an unbiased estimator of the population mean.
An unbiased estimator is one for which the expected value of the estimator is equal to the parameter you are trying to estimate.
Given the central limit theorem, for large sample sizes it is reasonable to assume that the sampling distribution of an OLS estimator approaches a normal distribution. OLS estimators are also consistent estimators.
A consistent estimator is one for which the accuracy of the parameter estimate increases as the sample size increases. Note that a general guideline for a large sample size in regression analysis is a sample greater than 100.
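A rough simulation sketch of these properties (the population parameters, error distribution, and sample sizes below are illustrative assumptions): many random samples are drawn, the OLS slope is estimated in each, and the estimates center on the true slope (unbiasedness) while their spread shrinks as the sample size grows (consistency).

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1 = 1.0, 2.0                  # assumed population intercept and slope

def ols_slope(x, y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

for n in (30, 1000):                     # small versus large sample size
    slopes = []
    for _ in range(5000):                # repeated random samples from the population
        x = rng.normal(0.0, 1.0, n)
        eps = rng.normal(0.0, 1.0, n)    # i.i.d. errors with E(eps|x) = 0
        y = beta0 + beta1 * x + eps
        slopes.append(ols_slope(x, y))
    slopes = np.array(slopes)
    # Mean of the estimates is near beta1; the standard deviation falls as n rises
    print(n, slopes.mean(), slopes.std())
```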
Sum of squared residuals (SSR)
The sum of squared residuals (SSR), sometimes denoted SSE for sum of squared errors, results from placing a given intercept and slope coefficient into the regression equation, computing the residuals, squaring them, and summing them. It is represented by Σei². The sum is an indicator of how well the sample regression function explains the data.
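A short sketch of the SSR calculation, assuming numpy, illustrative data, and a hypothetical fitted intercept b0 and slope b1:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b0, b1 = 0.1, 1.97                  # hypothetical fitted intercept and slope

residuals = y - (b0 + b1 * x)       # e_i = actual Y minus fitted Y
ssr = np.sum(residuals ** 2)        # sum of squared residuals
```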
Coefficient of determination
The coefficient of determination, represented by R², is a measure of the “goodness of fit” of the regression. It is interpreted as the percentage of the variation in the dependent variable that is explained by the independent variable(s). The underlying concept is that, for the dependent variable, there is a total sum of squares (TSS) around the sample mean. The regression equation explains some portion of that TSS. Since the explained portion is determined by the independent variables, which are assumed independent of the errors, the total sum of squares can be broken down as follows:
Total sum of squares = explained sum of squares + sum of squared residuals
Note: The sum of squared residuals (SSR) is also known as the sum of squared errors (SSE). In the same regard, the total sum of squares (TSS) is also known as the sum of squares total (SST), and the explained sum of squares (ESS) is also known as the regression sum of squares (RSS).
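A sketch of the decomposition and the resulting R², using the same kind of illustrative data as above and OLS-fitted coefficients (all numbers are assumptions):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the OLS line so that the decomposition TSS = ESS + SSR holds exactly
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)        # total variation around the sample mean
ess = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
ssr = np.sum((y - y_hat) ** 2)           # unexplained variation (squared residuals)

r_squared = ess / tss                    # equivalently: 1 - ssr / tss
```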
Standard Error of the Regression
The standard error of the regression (SER) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SER gauges the “fit” of the regression line. The smaller the standard error, the better the fit.
The SER is the standard deviation of the error terms in the regression. As such, SER is also referred to as the standard error of the residual, or the standard error of estimate (SEE).
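A sketch of the SER for a simple regression with one independent variable, where the degrees of freedom are n − 2 (data and fitted coefficients are illustrative, as in the earlier sketches):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

n = len(y)
ssr = np.sum(residuals ** 2)
ser = np.sqrt(ssr / (n - 2))    # residual standard deviation, adjusted for the 2 estimated coefficients
```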
Confidence intervals for regression coefficients
Hypothesis tests about regression coefficients
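A hedged sketch of both ideas for a slope coefficient, assuming a hypothetical estimated slope, its standard error, and n − 2 degrees of freedom from a simple regression:

```python
from scipy import stats

b1 = 0.64       # hypothetical estimated slope
se_b1 = 0.26    # hypothetical standard error of the slope
n = 36          # hypothetical sample size; df = n - 2 for a simple regression
df = n - 2

# Confidence interval: estimate +/- critical t-value times standard error
t_crit = stats.t.ppf(0.975, df)    # 95% two-tailed critical value
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# Hypothesis test of H0: slope = 0
t_stat = (b1 - 0) / se_b1
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df))
```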
Interpret p-value for linear regression
For two-tailed tests, the p-value is the probability that lies above the positive value of the computed test statistic plus the probability that lies below the negative value of the computed test statistic.
For example, by consulting the z-table, the probability that lies above a test statistic of 2.46 is: (1 − 0.9931) = 0.0069 = 0.69%. With a two-tailed test, this p-value is: 2 × 0.69% = 1.38%. Therefore, the null hypothesis can be rejected at any level of significance greater than 1.38%. However, with a level of significance of, say, 1%, we would fail to reject the null.
A very small p-value provides support for rejecting the null hypothesis. This would indicate a large test statistic that is likely greater than critical values for a common level of significance (e.g., 5%).
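A quick check of the arithmetic in the example above, using scipy's standard normal distribution:

```python
from scipy.stats import norm

test_statistic = 2.46
one_tail = 1 - norm.cdf(test_statistic)   # probability above +2.46, about 0.0069
p_value = 2 * one_tail                    # two-tailed p-value, about 0.0138 (1.38%)
```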
Confidence intervals for the predicted value of a dependent variable
Heteroskedasticity
Heteroskedasticity is present when the variance of the error term is not constant across observations, violating the constant-variance (homoskedasticity) assumption Var(εi|Xi) = σ².
Effect of Heteroskedasticity on Regression Analysis
There are several effects of heteroskedasticity you need to be aware of:
- The standard errors are usually unreliable estimates.
- The coefficient estimates (the bj) aren’t affected.
- If the standard errors are too small (while the coefficient estimates themselves are unaffected), the t-statistics will be too large, and the null hypothesis of no statistical significance will be rejected too often. The opposite is true if the standard errors are too large.
How to correct heteroskedasticity
Heteroskedasticity is not easy to correct.
The most common remedy, however, is to calculate robust standard errors. These robust standard errors are used to recalculate the t-statistics using the original regression coefficients.
Use robust standard errors to calculate t-statistics if there is evidence of heteroskedasticity.
By default, many statistical software packages apply homoskedastic standard errors unless the user specifies otherwise.
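A minimal sketch of requesting robust standard errors with the statsmodels package (the simulated data and the choice of the HC1 estimator are illustrative assumptions); comparing the two fits shows that the coefficients are unchanged while the standard errors and t-statistics are not:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 200)
# Heteroskedastic errors: the error variance grows with |x|
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 200) * (1 + np.abs(x))

X = sm.add_constant(x)                           # add the intercept column
default_fit = sm.OLS(y, X).fit()                 # homoskedastic (default) standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors

# Coefficients are identical; only the standard errors and t-statistics differ
print(default_fit.bse, robust_fit.bse)
print(default_fit.tvalues, robust_fit.tvalues)
```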
The Gauss-Markov theorem