Regression Analysis Flashcards
Do you know what the standard error of the coefficient captures?
The standard error of the coefficient is the standard deviation of the coefficient estimate (its sampling distribution). It measures how precisely the model estimates the coefficient’s unknown true value. The standard error of the coefficient is always positive.
The smaller the standard error, the more precise the estimate. Dividing the coefficient by its standard error yields a t-value. If the p-value associated with this t-statistic is less than your alpha level, you conclude that the coefficient is significantly different from zero.
Do you know how to calculate the t-statistic using a formula?
t = β̂1 / SE(β̂1), where β̂1 is the estimated coefficient and SE(β̂1) is its standard error.
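A quick check with made-up numbers: if β̂1 = 2.0 and SE(β̂1) = 0.5, then t = 2.0 / 0.5 = 4.0, well beyond the usual large-sample critical value of about 1.96 at the 5% level.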
Do you know what a p-value is and what it tells us?
A p-value is a statistical measure used to evaluate a hypothesis against observed data.
The p-value is the probability, assuming the null hypothesis is true, of obtaining results at least as extreme as those actually observed. It is not the probability that the null hypothesis is true, and (1 – the p-value) is not the probability that the alternative hypothesis is true; a low p-value also does not by itself show that the results are replicable.
The lower the p-value, the greater the statistical significance of the observed difference.
p-value = the probability, under the null hypothesis, that the test statistic (e.g. an F-value or z-score) exceeds the value observed in the sample; equivalently, p < α exactly when the observed statistic is above the corresponding critical value (the one that we find in the table in the book).
Do you know the difference between t-tests and F-tests?
The F-test and the t-test are two statistical tests used for hypothesis testing. They help researchers decide whether to reject the null hypothesis or fail to reject it.
- The t-test is a univariate hypothesis test, applied when the population standard deviation is not known and the sample size is small.
- The F-test is a statistical test that determines the equality of the variances of two normal populations.
- The t-statistic follows Student’s t-distribution under the null hypothesis.
- The F-statistic follows Snedecor’s F-distribution under the null hypothesis.
- The t-test is used to compare the means of two populations.
- The F-test is used to compare two population variances (and, in regression, the joint significance of several coefficients).
Do you know the null and alternative hypotheses behind the p-value of the F-test in Stata outputs? They are reported in the top right corner.
Decision rule based on the p-value:
- Reject H0 if p-value ≤ α
- Fail to reject H0 if p-value > α
F-test:
- H0: the βs are jointly equal to 0
- Ha: at least one β is different from 0
Do you know how to conduct an F-test in Stata using the test command after running a regression?
Yes. After running a regression, you write the command test followed by the independent variables you want to check (you want to see whether it makes sense to include them in the model, i.e. whether excluding them could induce OVB). You should keep them if the p-value < 0.05.
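A minimal hypothetical sketch (y, x1, x2, x3 are placeholder variable names):

    regress y x1 x2 x3
    test x2 x3    // H0: the coefficients on x2 and x3 are jointly zero

If the reported Prob > F is below 0.05, reject H0 and keep x2 and x3 in the model.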
Do you know how to use the factor variable notations in Stata?
Factor variables are categorical variables. You need to add the i. prefix before the variable(s) when running a regression, so Stata expands them into indicator (dummy) variables automatically.
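A hypothetical example (wage, region, and exper are placeholder variable names):

    regress wage i.region exper    // i.region is expanded into region dummies, with one base level omitted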
What is heteroskedasticity?
There is heteroskedasticity when the variance of the residuals is unequal across values of x (homoskedasticity is the opposite).
One of the assumptions of OLS is that there is no heteroskedasticity. To control for heteroskedasticity we add the option “robust” (or “r”) to the regression command.
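A minimal sketch (y and x are placeholder variable names):

    regress y x, robust    // reports heteroskedasticity-robust (Huber-White) standard errors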
Why is heteroskedasticity a problem?
Heteroskedasticity refers to a situation where the variance of the residuals is unequal over a range of measured values.
If heteroskedasticity exists, the errors do not have a constant variance; the usual OLS standard errors are then wrong, so hypothesis tests and confidence intervals based on them may be invalid.
Models involving a wide range of values tend to be more prone to heteroskedasticity.
Heteroskedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoskedasticity) –> one of the assumptions of the Gauss-Markov Theorem.
What are robust standard errors?
They are the standard errors that we obtain when we add the r (robust) option, i.e. when we control for heteroskedasticity. “Robust” standard errors are a technique for obtaining valid (consistent) standard errors of OLS coefficients under heteroskedasticity; the coefficient estimates themselves are unchanged.
What is OVB? What are the 2 conditions for OVB? Why is OVB a problem?
OVB arises when a relevant variable is omitted from the regression.
There are two conditions that together induce OVB (both must hold):
1. The omitted variable (Z) is a determinant of Y (i.e. Z is part of u); and
2. Z is correlated with the regressor X (i.e. corr(Z,X) ≠ 0)
Having an omitted variable in research can bias the estimated coefficients and lead the researcher to an erroneous conclusion. A classic example: in a regression of wages on education, omitting ability (which affects wages and is correlated with education) biases the education coefficient.
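A hypothetical Stata sketch (wage, educ, and exper are placeholder names; here exper plays the role of the omitted Z):

    regress wage educ          // if exper determines wage and corr(educ, exper) != 0, the educ coefficient is biased
    regress wage educ exper    // including the previously omitted variable removes this source of bias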
Do you know the differences between perfect and imperfect collinearity?
Perfect multicollinearity: when an independent variable or a set of independent variables predicts the value of another independent variable perfectly. There is redundant information: one implies the other (e.g. male and female dummies).
Imperfect multicollinearity: two independent variables are highly correlated, but the relationship is not perfect (e.g. height and weight).
What is a dummy variable trap?
Why is it a problem?
Dummy Variable Trap: when the number of dummy variables created equals the number of categories the categorical variable can take on, and all of them are included in the regression together with the intercept.
It is a problem because the dummies then sum to one and perfectly predict the constant term: this perfect multicollinearity makes it impossible to compute unique regression coefficients (and the associated p-values). The fix is to omit one category as the base.
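A hypothetical sketch (wage and region are placeholders; region has several categories):

    regress wage i.region    // factor notation drops one base category automatically, avoiding the trap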
Do you know why collinearity is problematic?
A key goal of regression analysis is to isolate the relationship between each independent variable and the dependent variable. The interpretation of a regression coefficient is that it represents the mean change in the dependent variable for each 1 unit change in an independent variable when you hold all of the other independent variables constant. However, when independent variables are correlated, it indicates that changes in one variable are associated with shifts in another variable. The stronger the correlation, the more difficult it is to change one variable without changing another. It becomes difficult for the model to estimate the relationship between each independent variable and the dependent variable independently because the independent variables tend to change in unison.
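A common diagnostic, sketched under the assumption of a plain OLS fit (y, x1, x2, x3 are placeholders): Stata’s estat vif reports variance inflation factors after regress.

    regress y x1 x2 x3
    estat vif    // rule of thumb: VIF values well above 10 signal troubling collinearity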
What is regression?
A statistical method that uses data to test whether a relationship exists between two or more variables, and to quantify it.
What are the objectives of regression analysis?
1) To estimate the effect of an independent variable on a dependent variable;
2) To test whether the effect is statistically different from zero (or from a certain value)
What is a random variable?
A variable whose values depend on the outcome of a probabilistic (random) event.
What does it mean when we say that X has a linear relationship with Y?
It means that X has the same effect on Y at all levels of X –> the degree of change is the same everywhere (same slope). In the model Y = β0 + β1X + u, a one-unit increase in X changes Y by β1 regardless of the starting level of X.