Econometrics - Theory Flashcards
Root MSE in Stata stands for:
SER
Total MS =
TSS
Residual SS =
SSR
Model SS =
ESS
TSS = ___ + ____
ESS + SSR
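The decomposition above can be checked numerically. A minimal Python sketch (simulated data, all numbers made up; Python stands in for Stata here):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])      # regression with an intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

tss = np.sum((y - y.mean()) ** 2)         # Total SS
ess = np.sum((y_hat - y.mean()) ** 2)     # Model (Explained) SS
ssr = np.sum((y - y_hat) ** 2)            # Residual SS

print(np.isclose(tss, ess + ssr))         # True: TSS = ESS + SSR
r_squared = ess / tss                     # R^2 = ESS/TSS, in [0, 1]
```

The identity holds exactly (up to floating-point error) whenever the regression includes an intercept.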
When analyzing Stata output, what do you have to assume unless specified otherwise?
That all three Least Squares Assumptions hold
and that the errors are homoskedastic
What is the range of R squared?
0 to 1
A stock with Beta > ____ is riskier
1
A stock with Beta < ______ is less risky than the market portfolio
1
An empirical analysis is externally valid if _________
the conclusions can be generalized to other populations and other settings
Are results/studies regarding health in the United States externally valid?
No, because the US health care system is atypical (for instance, health insurance coverage is not universal), so results from the US cannot readily be generalized to other settings
An empirical analysis is internally valid when statistical inference _________
about the causal effects is valid for the population
For internal validity why should estimators be unbiased and consistent?
Unbiasedness means the estimator does not systematically skew results, so estimates are on average equal to the true population value; consistency means that as the sample size increases the estimator converges to the true value, ensuring reliability.
The reason why we need the "large outliers are unlikely" assumption is to derive that the OLS estimator is ____________
asymptotically normally distributed
We cannot calculate the OLS estimator if _________
there is perfect multicollinearity, i.e. an exact linear relationship between the explanatory variables; so there cannot be perfect multicollinearity
The first OLS assumption is not an assumption but a ___________
REQUIREMENT
List the threats to internal validity
- omitted variables
- functional form misspecification
- measurement error
- sample selection
- simultaneous causality
All of the threats to internal validity lead to a violation of: ________
OLS assumption #1, which states that the error term has conditional mean zero given the explanatory variables (the error is unrelated to the regressors)
If there are important explanatory variables missing from the model then _______
our results are biased and inconsistent, and therefore internal validity is not ensured
If a regressor correlates with the error term then it is _______
endogenous
If we omit an exogenous variable (one uncorrelated with the included regressors),
the OLS estimates of the remaining coefficients stay unbiased; omitted variable bias requires correlation with an included regressor
Because labour market experience has a non-linear relationship with wages, if we only use linear parameters we will be dealing with which problem?
Functional form misspecification
What is Sample Selection Bias?
Sample selection bias occurs when the process of selecting data is related to the dependent variable beyond its relationship with the regressors, leading to correlation between regressors and the error term, affecting OLS estimators’ consistency.
Can you explain how Sample Selection Bias manifests?
It arises when the selection process affecting data availability is tied to the dependent variable. For instance, in the 1936 polling example, selecting phone numbers of car owners introduced bias because car owners with phones were more likely to support a specific political party.
How can the Sample Selection Bias problem be described?
It can be viewed either as a consequence of nonrandom sampling or as a missing data issue. For instance, a random sample of car owners with phones isn’t the same as a random sample of voters.
What’s the optimal solution to address Sample Selection Bias?
The best solution is to design studies to avoid it. For instance, estimating the mean height of undergraduates should involve a random sample of all undergraduates, not just those entering a basketball court.
Simultaneity bias occurs if causality ______ in both directions
runs
Is internal validity an issue here: You want to investigate health costs in the Netherlands and you have a sample drawn from all customers of health insurance companies of the Netherlands.
Health insurance is compulsory in the Netherlands, so there is no problem with the selectivity of the sample if the sample is randomly drawn from all insurance companies.
Can confidence intervals be constructed in the usual way if the OLS estimator includes a measurement error, w, with finite fourth moment?
Assuming the measurement error wi is homoskedastic and the LSA conditions hold, the standard errors are calculated correctly, and therefore so are the confidence intervals.
To establish whether omitted variables have a genuine effect we must look at and evaluate _________
t-values and p-values, and then the F test comparing the unrestricted (UR) and restricted (R) models
formula for t
t = (β̂₁ − β₁,₀) / SE(β̂₁)
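A quick numeric illustration of the t statistic t = (β̂₁ − β₁,₀)/SE(β̂₁), with made-up values for the estimate and its standard error:

```python
# Hypothetical numbers: estimated slope 0.48, null value 0, SE 0.15
b1_hat, b1_0, se = 0.48, 0.0, 0.15
t = (b1_hat - b1_0) / se
print(round(t, 2))        # 3.2

# Two-sided 5% test: reject H0 if |t| > 1.96
reject = abs(t) > 1.96
print(reject)             # True
```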
Is there a problem with internal validity here: You have a sample of adult males living in Amsterdam and you want to use this sample to estimate the average height of Dutch adult males.
Yes, because Amsterdam will not be representative of the entire Dutch population as there are a lot of students and expats. Furthermore, young people tend to be taller. Furthermore, people from below the large rivers (Lek, Waal and Maas) are known to be shorter than those from above these rivers.
X under measurement error =
Real X + w (measurement error term)
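A small simulation (all numbers made up; Python rather than Stata) showing why regressing on the mismeasured X = real X + w biases the OLS slope toward zero (attenuation):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x_true = rng.normal(size=n)              # true regressor, var = 1
w = rng.normal(size=n)                   # measurement error, var = 1
x_obs = x_true + w                       # what we actually observe
y = 1.0 * x_true + rng.normal(size=n)    # true slope is 1

slope = np.cov(x_obs, y)[0, 1] / np.var(x_obs, ddof=1)
# Attenuation factor var(x)/(var(x) + var(w)) = 1/2 here, so slope -> ~0.5
print(round(slope, 2))
```

The measurement error ends up in the error term and is correlated with the observed regressor, which is exactly the endogeneity problem described elsewhere in these cards.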
A small p-value for the F statistic in the White test suggests
strong evidence of heteroskedasticity (reject the null of homoskedasticity)
When dealing with a measurement error, how do you know if the confidence interval can be constructed in the usual way?
If the measurement error term, w, is homoskedastic, and if the LSA conditions hold
What happens when a redundant explanatory variable is added and it’s correlated with other variables in a model?
When a redundant variable is added and it is correlated with other variables, it leads to inefficiency in the model. For instance, if the added variable is negatively correlated with one variable (say 'Jap'), it might be positively correlated with another variable ('Time'). As a consequence, the standard errors of the coefficient estimators for 'Jap' and 'Time' increase, making these estimators less precise: the t-ratios move toward 0 as the standard errors become larger.
For our instrument to be valid we need to make sure that :
the covariance between x and z is unequal to zero and the covariance between z and error term is equal to zero
Why are instruments usually different from exogenous variables in IV regression?
Instruments need to satisfy two critical conditions: exogeneity (uncorrelated with the error term) and relevance (correlated with the endogenous variable). Exogenous variables are, by definition, uncorrelated with the error term, but they cannot serve as instruments: an instrument must be excluded from the initial regression model, and an included exogenous variable might not satisfy the relevance condition required for a valid instrument.
What are the conditions for a valid instrument?
Exogeneity –> uncorrelated with the error term
Relevance –> correlated with the explanatory variable
The instrument cannot be a part of the initial regression model
Explain the difference between exogenous and endogenous variables?
Exogenous variables are determined outside the model and are uncorrelated with the error term; endogenous variables are determined within the model. So if a variable does not depend on the other variables in the model, it is exogenous; if it does, it is endogenous.
Explain this stata command : ivreg S (T = TF TM) SP IP
This command runs an instrumental variable regression where S is the dependent variable, T is the endogenous regressor, TF and TM are the instruments for T, while SP and IP are exogenous variables. It’s specifying that T is endogenous and should be instrumented by TF and TM.
Write the Stata command that runs an instrumental variable regression:
ivreg S (T = TM TF) SP IP
where S is the dependent variable, T is the endogenous regressor instrumented by TM and TF, and SP and IP are the exogenous variables
Define endogeneity and explain why it’s a concern in regression analysis.
Endogeneity refers to a situation where an independent variable is correlated with the error term, leading to biased and inconsistent regression estimates due to omitted variable bias or simultaneous causation.
What does the first stage regression in 2SLS aim to accomplish?
The first stage regression in 2SLS aims to predict the potentially endogenous variable using instrumental variables, thereby creating adjusted values that aren’t correlated with the error term.
Which variables are used as instruments in the first stage of 2SLS, and what’s their role?
Instruments in the first stage of 2SLS are variables chosen for their lack of correlation with the error term but correlation with the potentially endogenous variable. For instance, TF and TM might be instruments for predicting T.
Why do we save predicted values in 2SLS regression, and what variable contains these values?
Predicted values are saved in the first stage to create a new variable (TFIT) that contains the predicted values of the potentially endogenous variable (T).
Describe the objective of the second stage regression in 2SLS.
The second stage regression in 2SLS seeks to estimate the relationship between the dependent variable and the predicted values of the potentially endogenous variable, while controlling for exogenous variables.
How does the second stage regression address endogeneity in the model?
By using the predicted values of the potentially endogenous variable from the first stage, the second stage regression addresses the endogeneity problem, providing consistent estimates of the effect of the potentially endogenous variable on the dependent variable.
What does the use of instrumental variables achieve in the context of endogeneity?
Instrumental variables isolate the variation in the potentially endogenous variable that is uncorrelated with the error term, allowing estimation of causal relationships in the presence of endogeneity.
Explain the difference between endogenous and exogenous variables in a regression model.
Endogenous variables are correlated with the error term, causing potential bias, while exogenous variables are not correlated with the error term and aren’t influenced by other variables in the model.
How does the 2SLS method contribute to obtaining unbiased estimates in regression analysis?
The 2SLS method contributes to obtaining unbiased estimates by first predicting the potentially endogenous variable using instruments in the first stage, then using these predicted values in the second stage to estimate the relationship between the variables, addressing endogeneity concerns.
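The two stages can be sketched by hand on simulated data. This is a generic illustration (variable names and numbers made up, not the S/T/TF/TM example; Python stands in for the Stata workflow):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
z = rng.normal(size=n)                      # instrument: relevant and exogenous
u = rng.normal(size=n)                      # structural error term
x = 0.8 * z + 0.8 * u + rng.normal(size=n)  # endogenous: correlated with u
y = 1.0 * x + u                             # true coefficient on x is 1

def slope(a, b):
    # bivariate OLS slope of b on a (intercept handled by demeaning)
    return np.cov(a, b)[0, 1] / np.var(a, ddof=1)

ols = slope(x, y)                           # biased, because cov(x, u) != 0

# Stage 1: regress x on the instrument z, keep the fitted values
x_fit = slope(z, x) * (z - z.mean()) + x.mean()
# Stage 2: regress y on the fitted values
tsls = slope(x_fit, y)

print(round(ols, 2), round(tsls, 2))        # OLS biased upward; 2SLS near 1.0
```

The fitted values inherit only the instrument's variation, which is uncorrelated with u, so the second-stage slope is consistent for the true coefficient.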
What is the purpose of testing the strength of instruments in a 2SLS regression?
The purpose of testing instrument strength in 2SLS regression is to assess whether the chosen instruments (TF and TM) are sufficiently correlated with the potentially endogenous variable (T). That is why T is regressed on TF and TM (to check for non-zero covariance).
How can researchers assess the strength of instrument variable instruments (TF and TM) in Stata?
Researchers can assess instrument strength in Stata by using the regress command to estimate the first stage regression and then employing the test command to check the joint significance of the instruments.
What does the command regress T TF TM SP IP accomplish in assessing instrument strength?
The command regress T TF TM SP IP runs a regression where T is regressed on TF, TM, SP, and IP, evaluating the relationship between the potentially endogenous variable and its instruments along with exogenous variables.
How does the F-statistic obtained from the test help evaluate instrument strength?
The F-statistic obtained from the test command helps evaluate the joint significance of the instruments. A larger F-statistic indicates greater explanatory power of the instruments in predicting T.
Why is an F-test used instead of a t-test in this context of instrument strength assessment of two IV (TF and TM)?
An F-test is used to assess joint significance because it checks whether both instruments together significantly contribute to explaining the variation in the potentially endogenous variable, unlike a t-test that examines individual coefficients.
What does a larger F-value suggest in the context of instrument strength testing?
A larger F-value suggests that the instruments (TF and TM) are stronger and more relevant in predicting the potentially endogenous variable T, providing more support for their validity in addressing endogeneity.
How does a significant F-statistic influence the credibility of instruments in 2SLS regression analysis?
A significant F-statistic strengthens the credibility of the instruments in 2SLS regression, indicating that they are sufficiently strong and relevant for predicting the potentially endogenous variable, thereby enhancing the reliability of the instrumental variable approach in addressing endogeneity.
When do we have over-identified models?
When the number of instruments exceeds the number of endogenous variables
When the model is over-identified we may not want to use all of the _____
instruments, because the more instruments we use, the larger the variance of the estimator becomes, making it less efficient
Formula for the degree of overidentification
number of instruments minus the number of endogenous regressors
Formula for the J statistic:
m × F, where m is the number of instruments and F is the partial F statistic on the instruments
the degrees of freedom of the asymptotic distribution of the J-statistic is :
m-k
When is it impossible to statistically test the hypothesis that the instruments are exogenous?
It becomes impossible to statistically test the hypothesis of instrument exogeneity when there are as many instruments as there are endogenous regressors, making the model exactly identified; in short, when m = k.
Why is it that if the coefficients are overidentified we can test the assumption that the instruments are exogenous?
if the coefficients are overidentified, it is possible to test the overidentifying restrictions— that is, to test the hypothesis that the “extra” instruments are exogenous under the maintained assumption that there are enough valid instruments to identify the coefficients of interest.
What does relevance mean ( one of the two validity conditions)?
The instrument should be correlated with the endogenous regressor
What does exogeneity mean (it is one of the two validity conditions)?
the instrument should be uncorrelated with error term u, or in other
words, there should be no direct effect of the instrument on the dependent variable Y through u (the error term).
When using maximum likelihood estimation, we have to make assumptions about the ________ of some variables
distribution
The F test is applicable when the number of instruments is _______ to the number of endogenous variables
at least equal
t critical value for significance level 5% (for one sided)
1.645
t critical value for significance level 5% (for two sided)
1.96
If we have a one - sided test with significance level 5%, then we should use a ____% confidence interval
90
When explaining why a measurement error causes correlation with error term:
remember to show this mathematically as well - S = β0 + β1(T − ν) + β2SP + β3IP + u
Examples of unordered discrete variables
type of credit card, choice of streaming service
Examples of ordered discrete variables
schooling level,
examples of binary variables
employed, having savings
Why might a linear model not be ideal for modeling probabilities?
A linear model isn’t ideal for probabilities because it can predict values beyond the bounds of 0 and 1, which are the limits for probabilities. This can lead to unrealistic predictions such as probabilities greater than 1 or less than 0. SO FITTED/PREDICTED VALUES MIGHT BE OUTSIDE INTERVAL (0,1)
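A tiny illustration with made-up LPM coefficients (hypothetical intercept −0.2 and slope 0.04), contrasted with a logit link that keeps every prediction inside (0, 1):

```python
import math

# Hypothetical LPM fit: P(y=1) = -0.2 + 0.04 * x (coefficients made up)
def lpm(x):
    return -0.2 + 0.04 * x

# Logit applies the logistic function to the same linear index
def logit(x):
    return 1.0 / (1.0 + math.exp(-(-0.2 + 0.04 * x)))

print(round(lpm(40), 2))   # 1.4  -> an impossible "probability"
print(round(lpm(-10), 2))  # -0.6 -> also impossible
print(0 < logit(40) < 1 and 0 < logit(-10) < 1)  # True
```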
What are the implications of LPM generating probabilities outside [0,1]?
Predicted probabilities outside this range can be nonsensical (less than 0 or greater than 1), challenging the fundamental laws of probability. This can lead to unrealistic interpretations of event likelihoods.
Why is binary error term an issue in the Linear Probability Model?
Because the error term can take on only two values (one for Y = 1 and one for Y = 0), it cannot be normally distributed, so a normal approximation is poor here; moreover, the error is heteroskedastic, so least squares is not efficient.
Linear probability models, logit models and probit models are all models where the ________ is a _______ variable
dependent, binary (thus dummy)
What is the advantage of logit model over LPM?
the bounded range of the predicted probability means that the logit model never produces fitted values outside (0, 1), unlike the LPM
What makes interpreting marginal effects straightforward in the LPM? (Interpreting marginal effects is a benefit)
The LPM’s coefficients directly represent how the probability of an event (binary dependent variable) changes for every one-unit change in an independent variable, making it easy to understand and communicate the impact of regressors
Why is the linear approximation of the LPM considered advantageous?
The LPM’s assumption of a linear relationship between independent variables and the probability of the dependent variable simplifies modeling and interpretation in scenarios where this linear approximation adequately captures the relationship.
Under what conditions are estimators from the LPM unbiased and consistent?
Assuming certain conditions, like no omitted variable bias, no multicollinearity, and no endogeneity, estimators in the LPM are unbiased, indicating they are, on average, accurate in estimating true population parameters. Additionally, they are consistent, becoming more precise with larger sample sizes.
What benefits does the simplicity of the LPM’s structure offer?
The model’s straightforward linear structure simplifies analysis and comprehension, making it accessible for those seeking a basic but interpretable approach to studying relationships between variables.
Solutions to fitted values being outside the interval (0,1) in the Linear probability model:
Using Maximum Likelihood estimation (i.e. logit or probit models)
What is the main reason why researchers apply MLE instead of LPM
The shortcoming of the LPM in that the predicted values can be outside the (0,1) interval/bound.
In regular OLS estimation methods when we want to look at the marginal effect of the expectation of y:
we look at the derivative of y with respect to the explanatory variables
The marginal effects in the logit and probit model are not _____ because of the non-linear functional form, but the sign is equal to the sign of the corresponding ___.
constant, β (the estimated slope). The nonlinear nature of these models means that the marginal effects change depending on the values of the variables involved; however, the direction of the impact, whether positive or negative, aligns with the signs of the coefficients.
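A sketch of the logit marginal effect dP/dx = β₁·P(x)·(1 − P(x)), with made-up coefficients, showing that the effect varies with x while its sign always matches β₁:

```python
import math

beta0, beta1 = -1.0, 0.8   # hypothetical logit coefficients

def p(x):
    # logistic probability P(y=1 | x)
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def marginal_effect(x):
    # dP/dx = beta1 * P(x) * (1 - P(x)) for the logit model
    return beta1 * p(x) * (1.0 - p(x))

# The magnitude changes with x, but the sign is always that of beta1 (> 0)
for x in (-2, 0, 2):
    print(round(marginal_effect(x), 3))
```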
Logit model and Probit model derive the same _____
curves
The LPM should be estimated with _____ standard errors
robust (heteroskedasticity-robust) ones, because the LPM error term is inherently heteroskedastic
The coefficients of the LPM will remain the same regardless of whether I use the ____ option or not
robust; only the standard errors change!
Why use robust regression in Stata for linear probability estimation?
Robust regression in Stata is valuable for linear probability models, especially when dealing with binary outcomes (0 or 1). It helps address issues like heteroskedasticity and outliers in the data. By employing robust regression techniques, the analysis produces more reliable coefficients and standard errors, mitigating the impact of outliers and potential biases.
What command in Stata is used for robust regression in linear probability estimation?
regress y x1 x2, robust
In this syntax, y represents the binary dependent variable, while x1 and x2 denote independent variables. The robust option prompts Stata to estimate the coefficients with robust standard errors.
Logit and Probit models are both inherently _________
homoskedastic
How can iterations in logistic regression be understood? (real life example)
Think of tuning a radio station for a clear signal.
Why is the probit model inherently homoskedastic?
In the probit model, the error term follows a normal (Gaussian) distribution. The normal distribution is characterized by constant variance, which means the spread or dispersion of the errors remains the same across various levels of the predictors.
What’s the analogy for the second iteration in logistic regression? (fine tuning radio example)
It’s like fine-tuning the radio to reduce static. The model fine-tunes how predictors affect outcomes, reducing “noise” and improving the understanding of what’s happening in the data.
How does log likelihood relate to these iterations?
Log likelihood values measure how much clearer the signal (or model fit) gets with each adjustment. The goal is to adjust until further changes don’t significantly improve the model’s clarity, indicating convergence.
What does the stata output prob>chi2 for logit and probit models mean?
This is the probability of obtaining the chi-square statistic given that the null hypothesis is true. In other words, it is the probability of obtaining this chi-square statistic (here, 71.05) if the independent variables, taken together, in fact had no effect on the dependent variable. So a small Prob > chi2 suggests evidence against the null.
When explaining how there is measurement error, expand the model with the measurement error and then compare it to the original one
Researchers can check whether instruments are strong enough by doing a ______ test
F test but the condition here is that the number of instruments is at least equal to the number of endogenous regressors. It’s also worth adding that the condition for the instruments to be strong is that the F value should be greater than 10
Can it be tested whether the two instruments are exogenous? If yes explain how, if no explain why
We can test the instruments for exogeneity when there are more instruments than endogenous regressors. In that case the coefficients are overidentified, and it is possible to test whether the "extra" instruments are exogenous, since it is possible that only one of the instruments is valid.
Steps for J test for exogeneity of instruments:
(i) Regress the IV-residuals on all exogenous variables TF TM SP IP.
(ii) Calculate the partial F statistic (F) of removing TF and TM from the regression.
(iii) Calculate J = mF = 2F.
(iv) If J > χ2[df = m − k = 2 − 1 = 1; α = 0.05] = 3.84, then reject exogeneity of the
instruments.
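The arithmetic of step (iii) and the decision in step (iv), with a hypothetical partial F value plugged in:

```python
# Worked numbers matching the steps above: m = 2 instruments, k = 1
# endogenous regressor; suppose the partial F from step (ii) is 1.3.
m, k = 2, 1
F = 1.3                       # hypothetical value
J = m * F                     # J statistic, step (iii)
chi2_crit = 3.84              # chi-square critical value, df = m - k = 1, 5%
print(J, J > chi2_crit)       # 2.6 False -> do not reject exogeneity
```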
When they ask you which regression model do you prefer, what Test should you use?
The f test for restricted and unrestricted model, it allows you to have a joint hypothesis
degrees of freedom for F test =
q (the number of restrictions)
What happens when leave out a regressor that is positively correlated with the dependent variable?
Since the regressor has a positive effect on the dependent variable, leaving it out means that another regressor will be overestimated, as part of that regressor's estimated effect is due not to the regressor itself but to the omitted variable. For this to happen, the omitted variable must be correlated with the other regressor.
Formula for F test within one sample
F = (R²/k) / ((1 − R²)/(n − k − 1)): the numerator is divided by k and the denominator by n − k − 1
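A numeric check of the overall-significance F statistic F = (R²/k) / ((1 − R²)/(n − k − 1)), with hypothetical values for R², k, and n:

```python
# Hypothetical values: R^2 = 0.25, k = 3 regressors, n = 104 observations
r2, k, n = 0.25, 3, 104
F = (r2 / k) / ((1 - r2) / (n - k - 1))
print(round(F, 2))  # 11.11
```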
Formula for the strength of the instrument when we only have one instrument :
F = t squared
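With a single instrument the first-stage F is just the squared t statistic of that instrument, so the rule-of-thumb check (F > 10) reduces to simple arithmetic. Hypothetical t value:

```python
# Hypothetical first-stage t statistic on the single instrument
t = 3.6
F = t ** 2                    # with one instrument, first-stage F = t^2
print(round(F, 2), F > 10)    # 12.96 True -> instrument counts as strong
```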
What 2 conditions need to be true for omitted variable bias to arise?
Another independent variable in the restricted model should be correlated with the omitted variable AND the omitted variable should be a determinant of the dependent variable
How can I test if an omitted variable is correlated with another independent variable in Stata?
Regress the candidate omitted variable on the other independent variables and test the significance of their coefficients; significant coefficients indicate correlation. Separately, an F test comparing the unrestricted model (including the omitted variable) against the restricted model (without it) shows whether the variable genuinely belongs in the regression.
Instruments that explain little variation in the endogenous regressor X are called __________
weak instruments
If the F-statistic is less than 10, the instruments are weak such that the TSLS estimate of the coefficient on X is ____ and no _______statistical inference about its true value can be made.
biased, valid
When does imperfect multicollinearity occur?
When two or more explanatory variables are highly, but not perfectly, correlated with each other
Exogeneity is tested using the ____ test
J