WWWW Flashcards
F-Distribution + F-Stat
Distribution:
- The ratio of two chi-squared distributed variables, each divided by its degrees of freedom
- Used to test whether group means are significantly different from one another (e.g., ANOVA)
Example: groups A, B and C are put on 10 mg, 5 mg and a placebo.
- Mean Square Between (MSB) = the variance between the group means
- Mean Square Error (MSE) = the average variance within the groups
F-stat = MSB/MSE. A large F-stat indicates that the population means may not all be equal.
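A minimal sketch of the computation on made-up measurements for the three groups above (scipy's f_oneway computes the same MSB/MSE ratio):

# Hypothetical one-way ANOVA: F-statistic for three dose groups (made-up data)
from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0]   # 10 mg
group_b = [4.2, 4.5, 4.0, 4.3]   # 5 mg
group_c = [3.9, 4.1, 3.8, 4.0]   # placebo

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)           # a large F suggests the group means are not all equal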
goodness of fit
Explained Sum of Squares (ESS)
- Sum of squared differences between the predicted values and the mean of the dependent variable
Sum of Squared Residuals (SSR)
- Sum of squared differences between the observed and predicted values
Total Sum of Squares (TSS)
- Sum of squared differences between the observed dependent variable and its mean
- TSS = SSR + ESS (and R² = ESS/TSS)
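A quick numerical check of the decomposition and of R² = ESS/TSS on made-up data:

# Verify TSS = ESS + SSR for a simple OLS fit (made-up data)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b1, b0 = np.polyfit(x, y, 1)            # OLS slope and intercept
y_hat = b0 + b1 * x

ess = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)          # sum of squared residuals
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares

print(tss, ess + ssr)                   # equal up to rounding
print("R^2 =", ess / tss)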
unbiased estimators conditions
linear in parameters,
random sampling,
sample variation in explanatory variable,
zero conditional mean
significance level
The significance level is the probability of rejecting the null hypothesis when it is in fact true. A 5% significance level means that we have a 5% probability of rejecting the null when it is true.
significance probability
The probability of drawing a statistic at least as adverse to the null hypothesis as the one you computed in your sample, assuming that the null hypothesis is true.
What is meant by the size of a test?
In hypothesis testing, the size of a test is the probability of committing a Type I error, that is, of incorrectly rejecting the null hypothesis when it is true.
p-value
- The p-value is the probability, assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. We reject the null when the p-value is below the significance level.
- We usually use a 5% significance level, meaning that if a medicine with no real effect is tested, the result will (wrongly) tell us it has an effect about 1 time in 20.
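A hedged sketch of how a two-sided p-value is obtained from a test statistic, using the standard normal approximation and made-up numbers:

# Two-sided p-value for a z statistic under H0: coefficient = 0 (made-up numbers)
from scipy import stats

estimate, se = 1.8, 0.7
z = estimate / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(z, p_value)     # reject H0 at the 5% level if p_value < 0.05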
critical value
The value of the test statistic that marks the boundary of the rejection region at a given significance level; if the computed statistic exceeds the critical value, the null hypothesis is rejected.
What are degrees of freedom?
- independent values that are free to vary in a data set
Intuitively:
A data set of four numbers where three of the values are 4, 4 and 4, and the average of the data is 4.
This means the last number must also be 4; it is not free to vary.
What happens to a confidence interval when the sample gets bigger and bigger?
It becomes narrower: as the sample size increases, the estimate becomes more precise.
How do we interpret Binary dependent variable regression
Interpreted as a conditional probability function, P(Y = 1 | X); how depends on which model:
- LPM: the coefficient is directly the change in the probability that Y = 1
- Probit: coefficients enter the standard normal cumulative probability function (cpf)
- Logit: coefficients enter the logistic cumulative probability function (lcpf)
- Probit/Logit coefficients are not marginal effects themselves; the sign gives the direction, and marginal effects must be computed separately (see the sketch below)
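A sketch comparing the three models on simulated data with statsmodels (the data and variable names are made up; get_margeff() turns Probit/Logit coefficients into marginal effects):

# LPM vs. Probit vs. Logit on a simulated binary outcome
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = (x + rng.normal(size=500) > 0).astype(int)   # binary dependent variable
X = sm.add_constant(x)

lpm    = sm.OLS(y, X).fit()      # coefficient = change in P(Y = 1) per unit of x
probit = sm.Probit(y, X).fit()   # coefficient enters the normal CDF
logit  = sm.Logit(y, X).fit()    # coefficient enters the logistic CDF

print(lpm.params)
print(probit.get_margeff().summary())   # marginal effects, comparable to the LPM slope
print(logit.get_margeff().summary())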
Standard errors in LPM are always
Heteroscedastic
Why is it called LPM
Because the probability that Y = 1 is a linear function of the regressors
What is cumulative distribution function
It is the probability that the variable takes a value less than or equal to a given value x
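For instance, for a standard normal variable (a small illustration with scipy):

# P(Z <= 1.96) for a standard normal random variable
from scipy import stats
print(stats.norm.cdf(1.96))   # about 0.975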
What does Probit and Logit regression allow for that LPM doesnt
Probit and Logit models allow for a nonlinear relationship between the regressors and the dependent variable, and keep the predicted probabilities between 0 and 1.
What is the z value
- the estimated coefficient divided by its standard error
- i.e., the number of standard deviations the estimate is away from 0 on a standard normal curve (Wald test)
- Rule of thumb: |z| should be over about 2 (1.96) and p under 0.05 for H0 to be rejected
Maximum likelihood: intuition for the mean
- Imagine that you have a line of observed values.
- Then imagine that you test every point on that line to see where you get the highest likelihood of observing the data.
- When all points are checked, you pick the one that maximizes the likelihood (see the sketch below).
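A toy sketch of that grid-search intuition for the mean of a normal sample, holding the standard deviation fixed at 1 (the data are made up):

# Grid-search the mean that maximizes the log-likelihood of the data
import numpy as np
from scipy import stats

data = np.array([4.9, 5.3, 4.7, 5.1, 5.6, 4.8])
candidates = np.linspace(3, 7, 401)        # points on the line to try

log_lik = [stats.norm.logpdf(data, loc=m, scale=1.0).sum() for m in candidates]
best = candidates[np.argmax(log_lik)]
print(best, data.mean())                   # the maximizer is (approximately) the sample mean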
logistic distribution is what
a continuous distribution
likelihood in statistics means
the probability of the observed data viewed as a function of the parameters; maximizing it means finding the parameter values (e.g., the mean or standard deviation of a distribution) that make the data most probable
How do we find the best regression line
By maximum likelihood (for Logit/Probit models; for the linear model, OLS minimizes the sum of squared residuals)
if p-value is < 0.05 (Probit/Logit)
there is a statistically significant association between the regressor and the dependent (response) variable
Unbiasedness
The expected value of the estimator equals the true population parameter, so on average the estimate hits the real value
What are the properties of the estimated parameters (Logit/Probit, maximum likelihood)?
- Consistent
- Unbiased and normally distributed in large samples
When to use Logit model
Use logit models whenever your dependent variable is binary (also called dummy), i.e., takes the values 0 or 1. Logit regression is a nonlinear regression model that forces the predicted probabilities to lie between 0 and 1.
what is the main advantage of panel data
- allows for (unobserved) heterogeneity across entities
- and controls for omitted variable bias from factors that do not change over time
In panel data, is omitted variables a problem?
No; assuming the omitted variable does not change over time, it drops out when we look at changes within an entity, so the change in Y must be caused by the observed factors
what is fixed effects
A regression performed on panel data that controls for being entity i by giving each entity its own intercept. The model can be entity demeaned, time demeaned, or both. All entities share the same slope, but have different intercepts.
standard errors panel
- The default standard errors are derived under the assumption that there is no autocorrelation and no heteroskedasticity
- so the default standard errors cannot be used
- that is why we use Clustered Standard Errors: they allow for heteroskedasticity and autocorrelation WITHIN an entity, but not across entities (see the sketch below)
- autocorrelation and heteroskedasticity do not affect the coefficient values, only the standard errors
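A hedged sketch of an entity fixed-effects regression with entity-clustered standard errors in statsmodels (the data and the entity/x/y names are made up for illustration):

# Entity fixed effects (dummy-variable form) with clustered standard errors
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "entity": np.repeat(np.arange(10), 5),   # 10 entities, 5 periods each
    "x": rng.normal(size=50),
})
df["y"] = 2 * df["x"] + 0.5 * df["entity"] + rng.normal(size=50)

fe = smf.ols("y ~ x + C(entity)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["entity"]}
)
print(fe.params["x"], fe.bse["x"])   # slope estimate with entity-clustered standard error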
panel assumptions
1. The error term must have conditional mean zero given ALL observations of the variable X for that entity: past, present and future
2. Variables are i.i.d. ACROSS entities:
- observations within an entity may be autocorrelated, but not correlated across entities
3. Large outliers are unlikely
4. No perfect multicollinearity
What are the conditions for a valid instrument?
RELEVANT and EXOGENOUS: the two conditions for a valid instrument:
- Instrument relevance: variation in the instrument is related to variation in X (corr(Z, X) ≠ 0).
- Instrument exogeneity: Z is correlated with Y solely through its correlation with X (corr(Z, u) = 0).
Relevant: the instrument actually affects X
Exogenous: the instrument affects Y only through X
How do we use IV?
- Z is correlated with X, but not with the error term. It has to satisfy the conditions for relevance and exogeneity.
Use Two Stage Least Squares:
- Regress X on the instrumental variable (X as dependent)
- Use the fitted values of X from this regression (X-hat) in the original regression
Let's call the instrument Z. If it satisfies the two conditions of relevance and exogeneity, we can estimate B1 using an IV estimator called two stage least squares (TSLS). TSLS is calculated in two stages. The first stage splits X into two parts: one part that is problematic and might be correlated with the error term, and one part that is problem-free. The second stage uses the problem-free part to estimate B1.
In the first stage you regress X on its instrument, which gives X-hat.
You then put X-hat into the original regression in place of X.
From our example, the intuition is that we now regress future salary only on the part of veteran status predicted by being drafted, which eliminates the bias that veterans usually earn less. (See the sketch below.)
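A rough sketch of the two stages on simulated data; the stages are done by hand here for intuition only (the standard errors from this manual second stage are not the correct TSLS standard errors, so dedicated IV routines are used in practice):

# Two stage least squares done manually on simulated data (intuition only)
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
z = rng.normal(size=n)                       # instrument: relevant and exogenous
u = rng.normal(size=n)                       # error term
x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # X is endogenous (correlated with u)
y = 2.0 * x + u                              # true causal effect of X on Y is 2

# Stage 1: regress X on Z, keep the fitted values X-hat (the "problem-free" part)
stage1 = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = stage1.fittedvalues
print("first-stage F:", stage1.fvalue)       # rule of thumb: F > 10 => not weak

# Stage 2: regress Y on X-hat; the slope is the TSLS estimate of B1
stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()
print("TSLS estimate:", stage2.params[1])    # close to 2, unlike plain OLS of y on x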
which model to calculate IV
TSLS
Two stage least squares regression
How do we test the instrument variables?
- Relevance
F-test
H0: the coefficients on the instruments in the first-stage regression are all 0,
if rejected, the instruments are relevant (rule of thumb: F > 10)
- Exogeneity
J-test
H0: all IVs are exogenous (test of overidentifying restrictions)
What are the two assumptions for including an instrumental variable?
It should be relevant (there should be a (strong) correlation to the explanatory variable) and exogenous (there should be no correlation to the error term)
What are some of the drawbacks of using IVs?
- hard to find a good IV that captures enough of the variation in the endogenous regressor while not being correlated with the error term
- IVs are often not well correlated with the endogenous variable (weak instruments)
IV variable need to satisfy which assumptions?
- Relevance
- Exogeneity
Consequences of weak instruments
If the instruments are weak, the TSLS estimator will be
- biased, and
- statistical inferences (standard errors, hypothesis tests, confidence intervals) can be misleading
Test for weak IVs with a single endogenous X
First-stage F-test on the instruments
H0: the instrument coefficients are all 0 (instruments irrelevant)
- rule of thumb: a first-stage F-statistic above 10 means the instruments are not weak
Relevancy means that
The variation in the instrument is relevant to the variation in X
What is the Least Square Assumptions?
Assumption 1: The Error Term has Conditional Mean of Zero
- Error term must not show any systematic pattern
- There must be no omitted variable bias
Assumption 2: (Xi, Yi), i = 1, ..., n, are Independently and Identically Distributed
- Independently: the observations are independent of each other and carry no information about each other. If you roll two dice, the value you get on the first die does not affect the value you get on the second.
- Identically distributed: each observation has the same probability distribution. With a deck of cards, the probability of drawing the king of diamonds is 1 in 52, and every participant has the same 1 in 52 chance.
Main: if you flip a coin 100 times, the probability of heads/tails is 50/50 on every throw (the coin has no memory), so the throws are “independent”. The probability stays the same for every throw, so they are “identically distributed”.
Assumption 3: Large Outliers Are Unlikely
X and Y have nonzero finite fourth moments (finite kurtosis); a few large outliers can otherwise give badly distorted estimates
Normality
Used to determine whether a data set is well modeled by a normal distribution and to compute how likely it is that a random variable underlying the data set is normally distributed.
What are the Gauss-Markov assumptions?
- Parameters are linear
- IID sampling
- No perfect multicollinearity
- Error term has zero conditional mean
- Homoskedasticity
- No autocorrelation
Why is regression analysis useful?
- quantify economic models
- provides the causal effect of, or relationship between, variables
- economic theory rarely gives precise numerical values, so it is better to turn to econometrics and regression analysis
- the causal effect between variables can be quantified and evaluated
- a key toolkit for scientists
what is Heteroscedasticity
- the error term does not have a constant variance.
Simultaneity bias
- one or more of the independent variables are jointly determined with the dependent variable
- X causes Y, but Y also causes X
- the two variables influence each other
- we do not get the real causal effect
- violates the zero conditional mean assumption
Supply/demand is a good example: quantity and price; investment and productivity; sales and advertising. This leads to a violation of LS.1 (zero conditional mean), hence our coefficient is biased.
Sample Selection bias
A type of bias that arises from choosing non-random data for statistical analysis, for example when people volunteer for a study: those who volunteer might share the same characteristics.
For example, you want to study veganism among undergraduate students. You send out a survey to the students in an art and culture class. Because this is not a random sample, it is not representative of the target population; these students might, for instance, be more liberal.
Measurement error in independent variable
There are often errors in the data
For example:
- Reporting error
- Coding error
- Estimation error
what is stationarity
- no trends or seasonality
- its statistical properties do not change over time
- constant mean and variance
What does heteroscedasticity lead to?
The coefficients don't change (they remain unbiased)
- But it leads to biased standard errors
- Biased standard errors make hypothesis testing, t-tests, p-values etc. unreliable
- OLS is no longer BLUE and the Gauss-Markov conditions fail (see the sketch below)
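A sketch of the usual fix, heteroskedasticity-robust standard errors in statsmodels (HC1 is one common variant; the data are simulated):

# Robust (HC1) standard errors: same coefficients, different standard errors
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 1 + 2 * x + rng.normal(scale=x, size=500)   # error variance grows with x
X = sm.add_constant(x)

default = sm.OLS(y, X).fit()                 # default SEs assume homoskedasticity
robust  = sm.OLS(y, X).fit(cov_type="HC1")   # heteroskedasticity-robust SEs
print(default.params, robust.params)         # identical coefficients
print(default.bse, robust.bse)               # standard errors differ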
Why do we want time-series to be stationary?
- a nonstationary series (e.g., a random walk) has no constant mean and its variance grows without bound, which gives badly biased and misleading results (see the simulation below)
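A small simulation of the problem (made up): a random walk's variance keeps growing with time instead of staying constant:

# A random walk is nonstationary: its variance grows with t
import numpy as np

rng = np.random.default_rng(0)
walks = rng.normal(size=(10000, 200)).cumsum(axis=1)   # 10,000 simulated random walks

print(walks[:, 9].var(), walks[:, 99].var(), walks[:, 199].var())
# variance at t = 10, 100, 200 is roughly 10, 100, 200 - not constant over time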
Threats to Internal Validity of Experiments
1) Failure to randomize
- treatment not randomly assigned, but based partly on characteristics or preferences
- e.g., assignment related to ethnic differences in last names
- or when vouchers are used
* Can test for randomization: check whether the coefficients on the control variables W are 0. If assignment is random, X will be uncorrelated with W.
2) Failure to follow treatment protocol / partial compliance
3) Attrition
4) Experimental effects / Hawthorne
5) Small Sample Sizes
- a small sample does not necessarily bias the estimator of the causal effect
- but it raises a threat to the validity of confidence intervals and hypothesis tests
What are the threats to External Validity for Idealized Experiments?
- Nonrepresentative sample
- the population studied and the population of interest might differ
- the sample may only include people with one type of characteristics
- Nonrepresentative program or policy
- the policy or program of interest must be similar to the program studied for the results to generalize
- the experiment might be small-scale and differ from the real-world program
- General equilibrium effects
- turning a small, temporary experiment into a widespread, permanent program can change its effect
- some programs only work with small groups
- e.g., a training program in Zimbabwe raised wages 40% in 10 villages; rolled out nationwide, more workers become skilled and the wage gains shrink
What is Quasi-Experiments / Natural Experiments?
Two types of Quasi-Experiments:
1. Whether an individual (entity) receives treatment is “as if” randomly assigned, possibly conditional on certain characteristics
“Treatment (d) is “as if” randomly assigned”
• For example, a new policy measure that is implemented in one area but not in another, whereby the implementation is “as if” randomly assigned.
- Does immigration reduce wages? Economic theory suggests that if the supply of labor increases, wages will fall. However, immigrants tend to go to cities with high labor demand, so the OLS estimator of the effect of immigration on wages will be biased. A quasi-experiment studied Cubans who moved to Miami: the causal effect on wages of an increase in immigration was estimated by comparing the change in wages of low-skilled workers in Miami to the change in wages of similar workers in comparable U.S. cities. No effect was found.
2. Whether an individual receives treatment is partially determined by another variable that is “as if” randomly assigned
“A variable (Z) that influences treatment (d) is “as if” randomly assigned: use IV regression”
• The variable that is “as if” randomly assigned can then be used as an instrumental variable in a 2SLS regression analysis.
What is the parallel trend assumption?
In the absence of treatment, the treated and control groups would have followed the same trend in the outcome. We cannot test this directly, but if the treatment and control firms follow similar trends before the treatment, the assumption is more plausible.
What is internal validity? what is external validity?
Internal validity refers to the validity of the findings within the research study. It is primarily concerned with controlling the extraneous variables and outside influences that may impact the outcome. External validity refers to the extent to which the results of a study can be generalized or applied to other members of the larger population being studied.
How do we use difference-in-differences?
Compare the change in the outcome (delta Y) for the treated group before vs. after treatment with the corresponding change for the control group; the difference between these two differences is the estimated treatment effect. Equivalently, regress the outcome on a treated dummy, a post-treatment dummy, and their interaction (see the sketch below).
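A sketch of that regression on made-up data (the column names are hypothetical); the coefficient on the interaction term treated:post is the difference-in-differences estimate:

# Difference-in-differences via the treated x post interaction (made-up data)
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "treated": np.repeat([0, 1], 200),   # control vs. treated group
    "post": np.tile([0, 1], 200),        # before vs. after treatment
})
df["y"] = (1 + 2 * df["treated"] + 0.5 * df["post"]
           + 3.0 * df["treated"] * df["post"]        # true treatment effect = 3
           + rng.normal(size=400))

did = smf.ols("y ~ treated * post", data=df).fit()
print(did.params["treated:post"])                    # close to 3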
MLE
maximum likelihood estimation (MLE) is a method of estimating the parameters of an assumed probability distribution, given some observed data.
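A minimal sketch of MLE in practice: estimating the mean and standard deviation of a normal distribution by minimizing the negative log-likelihood (the data are made up):

# MLE of a normal distribution's mean and std via numerical optimization
import numpy as np
from scipy import stats, optimize

data = np.array([4.9, 5.3, 4.7, 5.1, 5.6, 4.8, 5.2, 5.0])

def neg_log_lik(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf                    # keep the optimizer away from invalid values
    return -stats.norm.logpdf(data, loc=mu, scale=sigma).sum()

result = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)                          # close to data.mean() and data.std(), the ML estimates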