ECON 326 FINAL Flashcards
Why do we need econometrics? Class size example
Economics suggests important relationships with policy implications, but theory rarely indicates the quantitative magnitude of causal effects. Ideally these would be determined by experiment (randomized + controlled), but we almost always have only observational (non-experimental) data.
Ex. A decrease in class size increases student achievement, so the provincial government should create policy to decrease class sizes. But by how much? That is the quantitative effect.
Random Sampling must satisfy
Random Sampling must satisfy: no confounds; each unit has an equal chance of selection (ex. tasting the saltiness of a well-mixed soup)
* n > 25 (Law of Large Numbers)
* Random sample
* Identically distributed
* Independently distributed
Explain difference b/w Independent & Identically distributed? Example of coin flip?
- Independent: the value of one observation doesn't affect/depend on the value of another
- Identical: the probability distribution of outcomes is the same for every observation; the same process is used to collect the data
- (ex. flipping a coin: the previous result doesn't affect the next flip, and each flip has the same 50/50 probability distribution)
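The coin-flip example above can be simulated; a minimal R sketch (the seed and sample size are arbitrary choices for illustration):

```r
# i.i.d. coin flips: identical (same 50/50 process every flip) and
# independent (one outcome never changes the next flip's probability).
set.seed(326)                                  # arbitrary seed for reproducibility
flips <- rbinom(10000, size = 1, prob = 0.5)   # 10,000 independent fair-coin flips
mean(flips)                                    # sample mean is close to 0.5 (LLN)
```

By the Law of Large Numbers, the sample mean of these i.i.d. draws converges to the population probability 0.5 as n grows.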
Prove expectation of Y equals population regression equation
E[Yi | Xi = x] = E[b0 + b1Xi + ui | Xi = x] = b0 + b1x + E[ui | Xi = x] = b0 + b1x, since LSA #1 gives E[ui | Xi = x] = 0. So the population regression line b0 + b1x is the conditional expectation of Y given X.
3 Measures of Fit + Formula + Drawing
- Regression R² (0 = no fit, 1 = perfect): unitless fraction of the variance of Y that is explained by X; the higher the better
- Standard Error of the Regression (SER), in units of Y: magnitude of the typical regression residual, i.e. the spread of the distribution of the residuals u; approximately the sample standard deviation of the OLS residuals (it divides by n - 2); the lower the better
- Root Mean Squared Error (RMSE): same as the SER but divides by n instead of n - 2, so the two are nearly identical in large samples
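All three measures can be computed directly from an lm() fit; a minimal R sketch with made-up numbers (x, y, and fit are illustrative, not from the course dataset):

```r
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)                   # made-up outcomes
x <- c(1, 2, 3, 4, 5)
fit  <- lm(y ~ x)
r2   <- summary(fit)$r.squared                     # R^2: fraction of var(Y) explained by X
ser  <- sqrt(sum(resid(fit)^2) / (length(y) - 2))  # SER: divides SSR by n - 2
rmse <- sqrt(mean(resid(fit)^2))                   # RMSE: divides SSR by n
```

Note that ser matches summary(fit)$sigma, R's built-in residual standard error, and RMSE is always slightly smaller than SER.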
OLS Estimation + Proof
- OLS Estimation: choose the intercept and slope that minimize the sum of squared differences b/w the regression line and the actual data points.
- Minimizes the average squared difference b/w the actual values Yi and the predictions Ŷi = b0 + b1Xi based on the estimated line
- Given n points (Xi, Yi), find the line of best fit Ŷi = b0 + b1Xi that minimizes the sum of squared errors in Y, Σ(i=1..n) (Yi - Ŷi)² (the vertical distance b/w points & line)
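The minimization above has a closed-form solution; a hedged sketch showing that the textbook OLS formulas reproduce lm()'s estimates (the data are invented for illustration):

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4.1, 5.9, 8.2, 9.8)
# OLS formulas from minimizing the sum of squared errors:
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept
fit <- lm(y ~ x)   # lm() minimizes the same criterion, so estimates agree
</```>
```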
R Regression Code
library(lmtest)     # for coeftest()
library(sandwich)   # for vcovHC()
regression1 <- lm(dependent ~ independent, data = caschool)
summary(regression1)
coeftest(regression1, vcov = vcovHC(regression1, type = "HC1"))  # heteroskedasticity-robust SEs
Look at the R code result; interpret each element
Least Squares Assumption #1 for Causal Inference
E(u | X = x) = 0. In a randomized controlled experiment with a binary treatment, subjects are divided into treatment & control groups by random assignment (by computer), ensuring X is uncorrelated with all other determinants of Y, so there are no confounding variables (no OVB). All individual characteristics that make up u are distributed independently of X, so the conditional mean E(u | X = x) = 0: all other qualities and residuals cancel out across the two groups, implying b1 (estimated by the expected difference in means b/w treatment & control) is an unbiased estimator of the causal effect.
See graph in doc
Least Squares Assumption #2 for Causal Inference
(Xi, Yi) are independently & identically distributed (i.i.d.), which allows the Central Limit Theorem (CLT) to give the sampling distributions of the estimators of b0 & b1 under simple random sampling: all entities are selected from the same population (identically distributed) and at random, so the probability of selecting one school has no correlation with selecting another (independently distributed).
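The CLT can be illustrated by simulation; a hedged sketch using an exponential population (a deliberately skewed distribution, with seed and sizes chosen arbitrarily):

```r
# CLT with i.i.d. draws: even from a skewed population, the distribution
# of the sample mean is approximately normal for moderate n.
set.seed(326)
sample_means <- replicate(5000, mean(rexp(100, rate = 1)))  # 5000 samples of n = 100
# Population mean = 1; sd of the sample mean = 1/sqrt(100) = 0.1.
mean(sample_means)
sd(sample_means)
```

A histogram of sample_means is approximately bell-shaped around 1 despite the skewness of the underlying exponential population.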
Least Squares Assumption #3 for Causal Inference
Large outliers in X and/or Y are rare: E(X⁴) < ∞ and E(Y⁴) < ∞ (finite fourth moments). A large outlier could strongly influence the results or create meaningless values of b1; usually X & Y are bounded, which guarantees finite fourth moments.
Check a scatterplot for extreme values of X or Y and handle them by:
Trimming: removing the most extreme 1% of the data from each end
Winsorizing: replacing extreme values with less extreme values from within the data distribution, rather than removing them entirely, to mitigate their effect without discarding data points.
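Both treatments can be done with base R quantiles; a minimal sketch with invented data (two outliers are planted on purpose):

```r
set.seed(326)                            # arbitrary seed, reproducible example
y <- c(rnorm(98), 50, -40)               # 98 typical values plus two planted outliers
lo <- quantile(y, 0.01)                  # 1st percentile cutoff
hi <- quantile(y, 0.99)                  # 99th percentile cutoff
y_trimmed    <- y[y >= lo & y <= hi]     # trimming: drop the extreme tails entirely
y_winsorized <- pmin(pmax(y, lo), hi)    # winsorizing: cap values at the 1%/99% points
```

Trimming shrinks the sample, while winsorizing keeps n unchanged but pulls the planted outliers in to the cutoff values.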
See doc for graph.
Interpret b0 and b1
b0 is the predicted value of Y when X = 0 (often not economically meaningful if X = 0 lies outside the data)
b1 is the change in Y associated with a 1-unit change in X, holding all other factors/variables constant.
Heteroskedasticity means that:
A) homogeneity cannot be assumed automatically for the model.
B) the variance of the error term is not constant.
C) the observed units have different preferences.
D) agents are not all rational.
B
The power of the test is:
A) dependent on whether you calculate a t or a t2 statistic.
B) one minus the probability of committing a type I error.
C) a subjective view taken by the econometrician dependent on the situation.
D) one minus the probability of committing a type II error.
D
With i.i.d. sampling each of the following is true EXCEPT:
A) E(Ȳ) = μY.
B) var(Ȳ) = σ²Y/n.
C) E(Ȳ) < E(Y).
D) Ȳ is a random variable
C
The central limit theorem:
A) states conditions under which the standardized sample average of the i.i.d. variables Y1,…, Yn
becomes approximately the standard normal distribution.
B) postulates that the sample mean is a consistent estimator of the population mean μY.
C) only holds in the presence of the law of large numbers.
D) states conditions under which the standardized sample average of the i.i.d. variables Y1,…, Yn
becomes the Student t distribution
A
You have estimated a linear regression to understand the relationship between salary and
years of experience. You want to test the hypothesis:
* Null Hypothesis H0 : The effect of experience on salary is zero (β1=0).
* Alternative Hypothesis HA : Experience significantly affects salary (β1≠0).
Which of the following R commands will provide the t-statistic and p-value for this
hypothesis test?
A) summary(model)
B) coefficients(model)
C) confint(model)
D) t.test(company_data$salary, company_data$experience)
A
Which command will predict sales if the advertising budget is 1000 units?
A) predict(model, newdata = data.frame(advertising = 1000))
B) predict(model, newdata = list(advertising = 1000))
C) model$predict(1000)
D) predict(model, advertising = 1000)
A
Which command extracts the intercept and slope coefficients from the model?
A) coef(model)
B) summary(model)
C) model$coefficients
D) coefficients(model)
C
Which R command will show the detailed results (coefficients, residuals, R-squared, etc.) of
the regression?
A) summary(model)
B) print(model)
C) model$coefficients
D) coefficients(model)
A
Which of the following is the correct way to run a simple linear regression in R, where sales
is the dependent variable and advertising is the independent variable using the lm()
function?
A) lm(sales ~ advertising, data = dataset)
B) lm(advertising ~ sales, dataset)
C) lm(data = dataset, sales ~ advertising)
D) lm(dataset$sales, dataset$advertising)
A
To infer the political tendencies of the students at your college/university, you sample 150
of them. Only one of the following is a simple random sample. You:
A) make sure that the proportion of minorities are the same in your sample as in the
entire student body.
B) call every fiftieth person in the student directory at 9 a.m. If the person does not answer
the phone, you pick the next name listed, and so on.
C) go to the main dining hall on campus and interview students randomly there.
D) have your statistical package generate 150 random numbers in the range from 1 to the
total number of students in your academic institution, and then choose the corresponding
names in the student telephone directory
D
4 elements of an Ideal Randomized Controlled Experiment
- Ideal: subjects follow treatment protocol, perfect compliance, no errors in reporting
- Randomized: subjects from population of interest are randomly assigned to a treatment or control group so no confounding OVB
- Controlled: control group permits measuring differential effect of treatment
- Experiment: treatment assigned, subjects have no choice to avoid reverse causality & selection biases (those who are more likely to be in treatment group make up treatment group causing a bias)
Least Squares Assumption #4 for Causal Inference in Multiple Regression & how it can be violated & solutions
No perfect multicollinearity: violated when one regressor is an exact linear function of the other regressors (distinct from imperfect multicollinearity, where regressors are merely highly correlated)
1. Inserting the same variable twice: R reports NA for the duplicate coefficient (Stata drops it)
2. Dummy Variable Trap: when a set of mutually exclusive & exhaustive dummy variables is included along with the intercept, any one dummy can be perfectly predicted from the others plus the constant, giving perfect multicollinearity and making the individual dummy effects impossible to interpret (ex. income vs. provinces)
* Solution: modify the list of regressors: omit the intercept or omit one categorical group
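The dummy variable trap can be demonstrated in a few lines; a hedged sketch with made-up data (the data frame, values, and province dummy names are all illustrative):

```r
# Three exhaustive, mutually exclusive province dummies: bc + ab + on = 1
# for every row, which exactly reproduces the intercept column.
df <- data.frame(
  income = c(50, 60, 55, 70, 65, 80),
  bc     = c(1, 1, 0, 0, 0, 0),
  ab     = c(0, 0, 1, 1, 0, 0),
  on     = c(0, 0, 0, 0, 1, 1)
)
trap <- lm(income ~ bc + ab + on, data = df)  # perfect collinearity: one coefficient is NA
ok   <- lm(income ~ ab + on, data = df)       # fix: omit one category (bc is the base group)
```

In the fixed model, the intercept is the mean income of the omitted base group and each dummy coefficient is that province's difference from the base.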