Regression, event, panel, portfolio Flashcards
Regression analysis shows correlations, not causal relationships. Why?
Because the direction or nature of causality depends on a solid theory, not just statistical modeling.
Which are the OLS assumptions? (3)
- no perfect linear dependence among the independent variables (no perfect collinearity)
- exogeneity of covariates
- homoskedasticity
Explain Exogeneity of covariates
That the error u is not a function of X.
- The covariates (independent variables) don’t contain any information that predicts the error term (u).
- This ensures that the model is correctly specified, and the independent variables only explain the dependent variable, not the errors.
- For every data point, the expected value of the error term, given the independent variables, is zero.
- This ensures that errors are purely random and not systematically related to the covariates.
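Written as a formula, this is the zero conditional mean condition (for each observation i, with covariates x_i1, ..., x_ik):

```latex
E[u \mid X] = 0, \qquad E[u_i \mid x_{i1}, \dots, x_{ik}] = 0 \;\; \text{for every } i
```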
What is endogeneity?
Endogeneity is when the error term u is related to the independent variables → biased and inconsistent estimates
What is homoskedasticity?
aka the constant variance assumption.
The error u has the same variance given any value of the explanatory variables; the spread of the errors does not change across observations.
It is usually stated alongside the assumption that the errors are uncorrelated with each other (the covariance between any two error terms is zero), so that one error carries no information about another.
Homoskedasticity ensures that the regression treats all observations equally. If the variance changes across observations (heteroskedasticity), the coefficient estimates remain unbiased, but OLS is no longer efficient and the usual standard errors and tests become unreliable.
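As a formula, with σ² denoting the common error variance:

```latex
\operatorname{Var}(u \mid x_1, \dots, x_k) = \sigma^2
```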
What can you say about data generating process of covariates and errors?
The data in X can be a combination of constant and random variables
- OLS relies on variance in the covariates to estimate the relationship between independent variables and the dependent variable.
- If a covariate doesn’t vary (e.g., all values are the same), OLS cannot estimate its effect because it has no explanatory power.
What does the exogeneity assumption say?
The error term u is unrelated to the independent variables X.
It ensures that the model captures the true relationship between X and Y without bias from omitted variables.
Which are the OLS derivations?
- standard errors
- the t-test
- goodness-of-fit (R-squared)
What are standard errors?
Standard errors tell us how much the model’s predictions and estimates (like the coefficients) might vary due to random noise or limited data.
What is residual variance?
Measures how far the actual data points are from the model’s prediction on average (tells how much error is left after fitting the regression line).
What is the residual standard error?
The average size of the errors in the model’s predictions.
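A common way to write these two quantities, using SSR for the sum of squared residuals, n for the sample size, and k for the number of slope coefficients:

```latex
\hat{\sigma}^2 = \frac{SSR}{n - k - 1} = \frac{\sum_{i=1}^{n} \hat{u}_i^2}{n - k - 1},
\qquad
\text{RSE} = \hat{\sigma} = \sqrt{\hat{\sigma}^2}
```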
What is the t-test?
T-tests are used in regression to check whether a regression coefficient β is significantly different from zero. They help determine whether an independent variable contributes significantly to the model.
The p-value should be below the chosen significance level (commonly 0.05) for a variable to be considered statistically significant in most cases.
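The test statistic for a single coefficient is the estimate divided by its standard error:

```latex
t_{\hat{\beta}_j} = \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)} \quad \text{for } H_0: \beta_j = 0
```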
What is goodness-of-fit?
aka r squared or the coefficient of determination.
Used to evaluate how well a regression model fits the data. It helps you assess whether the model is good at predicting the dependent variable Y or if it leaves too much unexplained variability.
The value ranges from 0 to 1; a higher R² means the model explains a larger share of the variation in Y.
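In terms of the sum of squared residuals (SSR) and the total sum of squares (SST):

```latex
R^2 = 1 - \frac{SSR}{SST} = 1 - \frac{\sum_i \hat{u}_i^2}{\sum_i (y_i - \bar{y})^2}
```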
When is it good to use the adjusted R2?
Good to use when evaluating and comparing models with different numbers of independent variables, because the adjustment penalizes adding variables (plain R² never decreases when a variable is added).
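A minimal sketch of where these quantities (coefficients, standard errors, t-statistics, p-values, R², adjusted R²) show up in practice, assuming the statsmodels package and simulated data:

```python
import numpy as np
import statsmodels.api as sm

# Simulated data: y depends on x1 and x2 plus random noise
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(scale=1.0, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))  # add intercept column
model = sm.OLS(y, X).fit()

print(model.params)     # coefficient estimates (beta-hats)
print(model.bse)        # standard errors
print(model.tvalues)    # t-statistics
print(model.pvalues)    # p-values
print(model.rsquared, model.rsquared_adj)  # R-squared and adjusted R-squared
```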
Which are the OLS assumption violations? (6)+1
- non-linearity
- heteroskedasticity
- auto-correlated errors
- multicollinearity
- irrelevant variables (over-specified model)
- omitted variables (under-specified model)
other issues - outliers
What is heteroskedasticity?
Occurs when the variance of the error terms u in a regression model is not constant. So the “errors” (mistakes) in your regression model don’t have a consistent spread (their variability changes across observations).
Heteroskedasticity doesn’t bias the regression coefficients but it makes standard errors and hypothesis testing unreliable.
How can heteroskedasticity be addressed?
- Robust Standard Errors: Adjusts the standard errors to account for heteroskedasticity.
- Weighted Least Squares (WLS): Reweights observations to stabilize variance.
- Model Refinements: Modify the model to better explain the variability in the data.
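A hedged sketch of the first two remedies in statsmodels (heteroskedasticity-robust "HC" standard errors and WLS); the simulated data and the WLS weights here are illustrative, assuming the error variance grows with x:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=300)
y = 2 + 0.5 * x + rng.normal(scale=x, size=300)  # error spread grows with x
X = sm.add_constant(x)

# 1) Keep the OLS coefficients but use heteroskedasticity-robust standard errors
robust_fit = sm.OLS(y, X).fit(cov_type="HC3")

# 2) Weighted least squares: down-weight noisy observations
#    (weights assumed proportional to 1/variance; here Var ~ x^2)
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()

print(robust_fit.bse)   # robust standard errors
print(wls_fit.params)   # WLS coefficient estimates
```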
What are auto-correlated errors?
Autocorrelated errors occur when the errors (residuals) in a regression model are not independent but instead show a pattern or relationship over time. This violates one of the key assumptions in regression analysis, leading to unreliable results.
How can you solve auto-correlated errors?
- Adjust your model to directly address the source of autocorrelation (e.g., include lagged terms).
- Use robust standard errors (like Newey-West) to correct for the issues in residuals.
- Robust standard errors are a good default because they work under a variety of error conditions.
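A minimal sketch of Newey-West (HAC) standard errors in statsmodels; the AR(1) error process and the lag length of 4 are illustrative choices, not rules:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 250
x = rng.normal(size=n)

# Build autocorrelated errors with an AR(1) structure
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.7 * e[t - 1] + rng.normal()
y = 1.0 + 0.8 * x + e

X = sm.add_constant(x)
# HAC (Newey-West) covariance corrects the standard errors for autocorrelation
fit = sm.OLS(y, X).fit(cov_type="HAC", cov_kwds={"maxlags": 4})
print(fit.bse)
```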
What is multicollinearity?
Multicollinearity doesn’t violate the assumption of “no perfect linear dependence” (as long as predictors aren’t perfectly collinear), but it still causes numerical issues in estimating coefficients.
Large standard errors due to multicollinearity make it hard to determine the true effect of each variable, leading to unstable regression results.
Multicollinearity can be measured through VIF, Variance Inflation Factor. High VIF indicates severe multicollinearity and inflated standard errors.
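A small sketch of computing VIFs with statsmodels; a common (though not universal) rule of thumb treats a VIF above roughly 10 as a warning sign:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
# VIF for each column (index 0 is the constant, usually ignored)
vifs = [variance_inflation_factor(X, i) for i in range(X.shape[1])]
print(vifs)
```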
How can you solve multicollinearity?
- increase sample size N (this increases SST)
- remove or combine highly correlated variables
What are irrelevant variables?
An over-specified model occurs when irrelevant variables are included in the model. Denote the irrelevant variable z.
Including irrelevant variables (z) does not introduce bias in the coefficient estimates (β). However, it increases variance in the estimates due to sampling error, making the model less efficient.
Over-specifying the model adds unnecessary noise, which can affect the reliability and interpretability of the results.
What are omitted variables?
When the error term u is not purely random noise; it contains an omitted variable z, which creates bias.
Omitted variables create:
- Bias in coefficients:
Omitting a relevant variable z introduces bias in β^ because the effect of z is wrongly attributed to X.
The bias increases if z is strongly correlated with X or has a large γ (strong effect on y).
In contrast to over-specified models (where coefficients remain unbiased but less efficient), under-specified models produce biased estimates.
- Practical impact:
Omitting relevant variables can severely distort conclusions from the model, leading to incorrect inferences about the relationship between X and y.
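In the simplest case with one included regressor x and one omitted regressor z (true model y = β0 + β1 x + γz + e), the standard expression for the bias is:

```latex
E[\hat{\beta}_1] = \beta_1 + \gamma \, \delta_1,
\qquad
\delta_1 = \frac{\operatorname{Cov}(x, z)}{\operatorname{Var}(x)}
```

So the bias γδ1 vanishes only if z has no effect on y (γ = 0) or z is uncorrelated with x (δ1 = 0).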
What is the difference between sampling error and omitted variables bias?
Sampling error diminishes as the sample size N increases. That is not true of omitted variable bias: the bias is systematic and stems from the structure of the model itself, because a relevant variable that is correlated with X has been left out, so X picks up the effect of the omitted variable.
What are outliers?
Outliers are extreme values that deviate a lot from the rest of the data. Outliers can disrupt or distort the causal relationship between the dependent variable and the independent variables.
How can you treat outliers?
- Transformation:
Apply mathematical transformations to reduce the influence of extreme values. Example: Use the natural logarithm to compress large values and spread smaller ones.
- Trimming:
Remove extreme values (e.g., top and bottom 5% of the dataset) from the analysis.
- Winsorizing:
Replace extreme values with the nearest non-outlier value. Example: Cap values at the 95th percentile or floor them at the 5th percentile.
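A minimal sketch of the three treatments in numpy/pandas, assuming a numeric series x; the 5%/95% cutoffs are illustrative:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.2, 0.8, 1.5, 0.9, 1.1, 25.0, 1.3, 0.7])  # 25.0 is an outlier

# Transformation: the natural log compresses large values (requires positive data)
x_log = np.log(x)

# Trimming: drop observations outside the 5th-95th percentile range
lo, hi = x.quantile(0.05), x.quantile(0.95)
x_trimmed = x[(x >= lo) & (x <= hi)]

# Winsorizing: cap values at the 5th/95th percentiles instead of dropping them
x_winsorized = x.clip(lower=lo, upper=hi)

print(x_trimmed)
print(x_winsorized)
```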
What is the constant elasticity model?
The constant elasticity model is a type of regression model in which the relationship between the dependent variable and the independent variable(s) exhibits a constant percentage change (elasticity).
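It is usually written in log-log form, where the slope is the elasticity: a 1% change in x is associated with approximately a β1% change in y.

```latex
\log(y) = \beta_0 + \beta_1 \log(x) + u,
\qquad
\beta_1 = \frac{\partial \log(y)}{\partial \log(x)} \approx \frac{\%\Delta y}{\%\Delta x}
```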
What is the Gauss-Markov assumptions for simple regression?
Justifies the use of the OLS method rather than using a variety of competing estimators.
The Gauss-Markov theorem requires errors to have constant variance (homoskedasticity) and no correlation over time (no serial correlation). If errors are serially correlated, OLS is no longer the best estimator, and its standard errors and test statistics become invalid, even for large samples.
A1: Linear in Parameters
A2: Random Sampling
A3: Sample Variation in the Explanatory Variable
A4: Zero Conditional Mean
A5: Homoskedasticity
What is ordinary least squares?
OLS chooses the estimates to minimize the sum of squared residuals. The method of ordinary least squares is easily applied to estimate the multiple regression model. Each slope estimate measures the partial effect of the corresponding independent variable on the dependent variable, holding all other independent variables fixed.
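Formally, the OLS estimates solve the following minimization problem (with the familiar closed-form solution in matrix notation):

```latex
\min_{\beta_0, \dots, \beta_k} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_k x_{ik} \right)^2,
\qquad
\hat{\beta} = (X'X)^{-1} X'y
```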
What is the Frisch-Waugh theorem?
The general partialling-out result: the OLS slope on an independent variable can be obtained by first regressing that variable on all the other regressors, taking the residuals, and then regressing y on those residuals.
What does no perfect collinearity mean?
In the sample (and therefore in the population), none of the independent variables is constant, and there are no exact linear relationships among the independent variables
What is perfect collinearity?
If an independent variable is an exact linear combination of the other independent variables, then we say the model suffers from perfect collinearity, and it cannot be estimated by OLS (e.g., including the same variable measured both in dollars and in thousands of dollars).
What is zero conditional mean?
The error u has an expected value of zero given any values of the independent variables. When this assumption holds, we often say that we have exogenous explanatory variables.
What is micronumerosity?
Problems of small sample size
What is BLUE?
Best linear unbiased estimator. Under the Gauss-Markov assumptions, the OLS estimators are the best linear unbiased estimators (BLUEs).