4A Flashcards
Why do we need assumptions for regression analysis?
Any estimation method (like OLS) requires some assumptions to calculate regression coefficients and standard errors
Before you conduct a regression analysis, you should check if these assumptions are satisfied
If an assumption is violated, either the regression coefficients or the standard errors (or both) can be biased (i.e., systematically underestimated or overestimated)
Fortunately, there is usually a solution to adjust your estimates
Four assumptions for OLS regression
- homoscedasticity
- independent observations
- no large outliers
- normally distributed residuals
What is homoscedasticity?
Variance of the Y-variable does not depend on X
OLS assumes that the variance of the Y-variable is not predicted by X
Why is homoscedasticity a problematic assumption?
It is a problematic assumption because almost anything that could plausibly change Y could also change the variation in Y
What happens when this assumption is violated?
… (there is an error in the slide; ask the lecturer whether he will correct it)
Solution for the violation of homoscedasticity
Heteroscedasticity-robust standard errors can be obtained (with SPSS syntax)
The White-test
The White test tests the null hypothesis that there is homoscedasticity.
If the White test gives a p-value lower than 0.05, the null hypothesis is rejected and we conclude that the assumption of homoscedasticity has been violated.
SPSS output with heteroscedasticity-robust standard errors
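The logic of the White test can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure: it assumes a regression with a single X-variable, uses the common LM form of the test (sample size times the R-squared of a regression of the squared residuals on X and X squared), and uses made-up data. The function names and the data are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a * coef = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, n))
        coef[r] = (m[r][n] - tail) / m[r][r]
    return coef

def fitted_values(columns, y):
    """Regress y on the given columns by OLS and return the fitted values."""
    k = len(columns)
    xtx = [[sum(p * q for p, q in zip(columns[i], columns[j])) for j in range(k)]
           for i in range(k)]
    xty = [sum(p * yi for p, yi in zip(columns[i], y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return [sum(coef[j] * columns[j][i] for j in range(k)) for i in range(len(y))]

def white_test(x, y):
    """Return (LM statistic, p-value) for a regression of y on one x."""
    n = len(x)
    ones = [1.0] * n
    fitted = fitted_values([ones, x], y)
    sq_resid = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]
    # Auxiliary regression: can X and X squared predict the squared residuals?
    x_squared = [xi ** 2 for xi in x]
    aux_fitted = fitted_values([ones, x, x_squared], sq_resid)
    mean_sq = sum(sq_resid) / n
    ss_total = sum((v - mean_sq) ** 2 for v in sq_resid)
    if ss_total == 0:
        return 0.0, 1.0  # constant squared residuals: no sign of heteroscedasticity
    ss_resid = sum((v - f) ** 2 for v, f in zip(sq_resid, aux_fitted))
    lm = n * (1 - ss_resid / ss_total)
    # With two auxiliary regressors the statistic is compared with a
    # chi-square distribution with 2 degrees of freedom, whose upper tail
    # probability is exp(-lm / 2). This shortcut only holds for 2 df.
    return lm, math.exp(-lm / 2)

# Made-up data: the spread of y around the line 1 + 2x grows with x
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [3.2, 2.8, 5.4, 4.6, 7.8, 6.2, 10.6, 7.4, 14.2, 7.8]
lm_statistic, p_value = white_test(x, y)
print(f"LM = {lm_statistic:.2f}, p = {p_value:.4f}")
```

In this made-up example the p-value falls below 0.05, so homoscedasticity would be rejected. Statistical software may use a different variant of the test, so the numbers need not match SPSS output exactly.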
This table gives you your estimates with standard errors (and p-values) that are corrected for heteroscedasticity
The regression coefficients (both unstandardized and standardized) are unchanged
The standardized coefficients (Beta) are not in this table, but you can just obtain them from the regular regression menu/syntax
Heteroscedasticity-robust standard errors
If there is homoscedasticity, the robust standard errors are just as good as the regular standard errors. If there is heteroscedasticity, they are better
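To see how the two kinds of standard errors differ, here is a minimal sketch for a regression with one X-variable. It assumes the simplest robust variant (often called HC0); the variant your software applies may differ. The function name and the data are made up for illustration.

```python
import math

def slope_with_standard_errors(x, y):
    """Return the OLS slope with its regular and its robust (HC0) standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Regular standard error: assumes one common residual variance
    residual_variance = sum(e ** 2 for e in resid) / (n - 2)
    se_regular = math.sqrt(residual_variance / sxx)

    # Robust standard error: every observation keeps its own squared residual
    weighted = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, resid))
    se_robust = math.sqrt(weighted) / sxx
    return slope, se_regular, se_robust

# Made-up data: the spread of y around the line 1 + 2x grows with x
x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [3.2, 2.8, 5.4, 4.6, 7.8, 6.2, 10.6, 7.4, 14.2, 7.8]
slope, se_regular, se_robust = slope_with_standard_errors(x, y)
print(f"slope = {slope:.2f}, regular SE = {se_regular:.3f}, robust SE = {se_robust:.3f}")
```

The slope is the same in both cases; only the standard error changes. In this example the robust standard error is larger than the regular one, but with other data it can also be smaller.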
What are independent observations?
The sample size plays an important role in the calculation of the standard errors. The reasoning is that the larger your sample is, the less likely it is to differ from the population by chance.
This reasoning only holds if each observation truly provides a new piece of information.
This means that there should not be clusters of observations that all have the same or similar values.
All observations should be truly independent.
What happens when the assumption of independent observations is violated?
When this assumption is violated, the estimated standard errors will be too small
This implies that the p-value is too small and that you may incorrectly reject the null-hypothesis
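A small sketch of why this happens, using the standard error of a mean rather than a regression coefficient to keep the arithmetic short. The data are made up: the second sample is the first one copied four times, so it contains no new information, yet the formula treats it as four times as many observations.

```python
import math
import statistics

def standard_error_of_mean(values):
    """Standard deviation divided by the square root of the sample size."""
    return statistics.stdev(values) / math.sqrt(len(values))

independent = [4, 7, 5, 9, 6, 8, 3, 10]  # eight separate cases
clustered = independent * 4              # every case appears four times

se_independent = standard_error_of_mean(independent)
se_clustered = standard_error_of_mean(clustered)
print(f"8 independent observations: SE = {se_independent:.3f}")
print(f"32 clustered observations:  SE = {se_clustered:.3f}")
```

The standard error roughly halves even though nothing new was learned about the population, which is exactly the sense in which it is "too small".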
How to check the assumption of independent observations?
Think about how the data was collected
Errors are usually independent when these two conditions are satisfied:
1. There is only one observation for each case (e.g., person/country)
2. The cases were all sampled from the same population with the same procedure
The first condition is, for example, violated in panel data where the same respondents are interviewed every year
The second condition is, for example, violated when a dataset consists of several surveys from several countries that were put together
How to solve the violation of independent observations?
This problem can be solved by using a combination of cluster robust standard errors and control variables
Cluster robust standard errors adjust for the clustering within a unit that you specify (e.g., within countries)
In the example of the ESS, you would do two things:
- Estimate cluster robust standard errors with clustering “within countries”
- Add “country” as a control variable
Unfortunately, cluster robust standard errors are not readily available in SPSS (which is why we will not use them in this course)
If you ever need them, you can download an add-on for SPSS or switch to another program (e.g., Stata or R)
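For intuition only, the idea behind cluster robust standard errors can be sketched as follows. The sketch makes several simplifying assumptions: one X-variable, no control variables, and no small-sample correction (programs such as Stata or R usually apply one, so their numbers will differ). The data are a made-up panel in which ten people each answer three times with identical values; all names are hypothetical.

```python
import math
from collections import defaultdict

def slope_with_cluster_se(x, y, cluster):
    """Return the OLS slope with its regular and its cluster robust standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Regular standard error: treats all n rows as independent
    se_regular = math.sqrt(sum(e ** 2 for e in resid) / (n - 2) / sxx)

    # Cluster robust: first add up the contributions within each cluster,
    # so that rows from the same cluster count as one piece of information
    per_cluster = defaultdict(float)
    for xi, e, g in zip(x, resid, cluster):
        per_cluster[g] += (xi - x_bar) * e
    se_cluster = math.sqrt(sum(s ** 2 for s in per_cluster.values())) / sxx
    return slope, se_regular, se_cluster

# Made-up panel: 10 persons, each interviewed 3 times with identical answers
person_x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
person_y = [3.2, 2.8, 5.4, 4.6, 7.8, 6.2, 10.6, 7.4, 14.2, 7.8]
x, y, person_id = [], [], []
for wave in range(3):
    for pid, (xi, yi) in enumerate(zip(person_x, person_y)):
        x.append(xi)
        y.append(yi)
        person_id.append(pid)

slope, se_regular, se_cluster = slope_with_cluster_se(x, y, person_id)
print(f"slope = {slope:.2f}, regular SE = {se_regular:.3f}, cluster SE = {se_cluster:.3f}")
```

The regular standard error shrinks because it counts 30 rows as 30 independent observations, while the cluster robust version stays at the level you would get from the 10 persons alone.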
What are outliers?
The absence of large outliers
Outliers are observations that deviate extremely from the mean
Outliers can be detected by the z-score of an observation (i.e., how many standard deviations it is removed from the mean)
Observations with a z-score smaller than -3 or larger than +3 are commonly considered outliers
What happens when the assumption of no large outliers is violated?
When there are outliers in the data, these observations can have a disproportional effect on the estimates, such that the results mainly reflect the outliers and not the other cases.
This problem manifests itself both in the regression coefficients and the standard errors
This problem is especially severe when the outliers are extreme and/or if they constitute a large share of the sample
How to check for an outlier?
This assumption can be checked by simply calculating z-scores for all the variables (both X and Y) and checking if some z-scores are smaller than -3 or larger than +3
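A minimal sketch of this check for one variable, using made-up values and a hypothetical function name. In practice you would repeat it for every X-variable and for Y.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return (value, z-score) pairs whose z-score is beyond +/- threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    flagged = []
    for v in values:
        z = (v - mean) / sd
        if abs(z) > threshold:
            flagged.append((v, z))
    return flagged

# Made-up scores: nineteen values close to 50 and one extreme value
scores = [48, 52, 50, 47, 53, 49, 51, 50, 46, 54,
          50, 49, 51, 48, 52, 50, 47, 53, 50, 120]
outliers = flag_outliers(scores)
for value, z in outliers:
    print(f"value {value} has z-score {z:.2f}")
```

Note that the outlier itself pulls the mean and the standard deviation towards it, so in very small samples no z-score can exceed 3 at all.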
How to solve the outliers problem?
If you have outliers, you typically run two analyses:
1. An analysis with the outliers included
2. An analysis with the outliers removed
If the results are (substantively) similar, you have no problem
If the results differ, you should consider carefully why the outliers exist and how you would interpret the results with or without them
Unfortunately, there is no single course of action that works in all situations
Always motivate clearly what you did
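The two analyses can be sketched as follows, assuming one X-variable and the z-score rule of plus or minus 3 for both X and Y. The data and function names are made up for illustration.

```python
import statistics

def z_scores(values):
    """Return the z-score of every value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def ols_slope(x, y):
    """Slope of the OLS regression of y on x."""
    x_bar = statistics.mean(x)
    y_bar = statistics.mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# Made-up data: twenty cases exactly on the line y = x, plus one extreme case
x = list(range(1, 21)) + [20]
y = [float(v) for v in range(1, 21)] + [100.0]

# Keep only the cases whose z-scores on both X and Y are within +/- 3
z_x = z_scores(x)
z_y = z_scores(y)
keep = [i for i in range(len(x)) if abs(z_x[i]) <= 3 and abs(z_y[i]) <= 3]

slope_with = ols_slope(x, y)
slope_without = ols_slope([x[i] for i in keep], [y[i] for i in keep])
print(f"slope with outliers:    {slope_with:.2f}")
print(f"slope without outliers: {slope_without:.2f}")
```

Here a single extreme case nearly doubles the slope, so the two analyses differ substantively and you would have to explain which result you report and why.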
What are normally distributed residuals?
The assumption of normally distributed residuals
If the sample is not normally distributed AND too small, the sampling distribution is not normally distributed
More specifically, the residuals of the regression analysis have to be normally distributed when the sample is small
This is almost the same as saying that the Y-variable must be normally distributed, but not quite
What happens when this assumption is violated?
The assumption of normally distributed residuals only matters in small samples
When this assumption is violated, the sampling distribution is not normally distributed and the t-values, p-values, and confidence intervals do not make sense anymore.
How do you check this assumption?
We can test this with SPSS syntax that:
- Saves the residuals of the regression analysis
- Makes a histogram of them
- Asks for a formal test of normality
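The same three steps can be sketched outside SPSS. This sketch assumes the residuals have already been saved as a list of numbers, and it swaps in the Jarque-Bera test, which is a different normality test from the ones in the SPSS table; it is used here only because it needs no statistical library. Its p-value is a large-sample approximation and is rough in small samples. The data and function names are made up.

```python
import math

def histogram_counts(values, bins=5):
    """Count how many values fall into each of `bins` equally wide intervals."""
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return counts

def jarque_bera(residuals):
    """Return (test statistic, approximate p-value) for H0: normally distributed."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    statistic = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Upper tail of a chi-square distribution with 2 degrees of freedom
    return statistic, math.exp(-statistic / 2)

# Made-up saved residuals: one roughly bell-shaped set, one strongly skewed set
bell_shaped = [-2, -1.5, -1, -1, -0.5, -0.5, -0.5, 0, 0, 0, 0,
               0.5, 0.5, 0.5, 1, 1, 1.5, 2]
skewed = [-1] * 18 + [9, 9]

for name, residuals in [("bell-shaped", bell_shaped), ("skewed", skewed)]:
    print(name)
    for count in histogram_counts(residuals):
        print("  " + "#" * count)
    statistic, p_value = jarque_bera(residuals)
    print(f"  test statistic = {statistic:.2f}, p = {p_value:.4f}")
```

For the skewed residuals the p-value is far below 0.05, so normality would be rejected; for the bell-shaped residuals it is not.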
The histogram
The histogram compares the distribution of the data (the bars) to the normal distribution (the curve)
The test
Table: test of Normality
This table provides two tests of the null hypothesis that the data are normally distributed. If the p-values are smaller than 0.05, we reject the null hypothesis and conclude that the assumption of normality is violated.
How to solve a violation of this assumption?
You can either increase your sample size or move to other estimation methods, which we will not discuss in this course