ECON 326 Midterm Flashcards

1
Q

Why do we need econometrics? Class size example

A

Economics suggests important relationships, often with policy implications, but it rarely indicates the quantitative magnitude of causal effects. Ideally these magnitudes would be determined by a randomized controlled experiment, but almost always we only have observational (non-experimental) data.

Ex. A decrease in class size increases student achievement, so the provincial government should create policy to decrease class sizes, but by how much? That quantitative effect is what econometrics must estimate.

2
Q

Random Sampling must satisfy

A

Random sampling must satisfy: no confounds, and each member of the population has an equal chance of selection (ex. tasting the saltiness of a well-mixed soup)
* n > 25 so the Law of Large Numbers / CLT approximations apply
* Random sample
* Identically distributed
* Independently distributed

3
Q

Explain difference b/w Independent & Identically distributed? Example of coin flip?

A
  • Independent: the value of one observation doesn't affect/depend on the value of another
  • Identical: the probability of each outcome is the same, because the same process is used to collect the data
  • (ex. flipping a coin: the previous result doesn't affect the next flip, and each flip has the same 50/50 probability distribution)
4
Q

Prove expectation of Y equals population regression equation

A

E[Yi | Xi] = E[β0 + β1Xi + ui | Xi] = β0 + β1Xi + E[ui | Xi] = β0 + β1Xi, since β0 and β1Xi are non-random given Xi and the first least squares assumption gives E[ui | Xi] = 0, so the conditional expectation of Y is the population regression line.

5
Q

3 Measures of Fit + Formula + Drawing

A
  1. Regression R² (0 = no fit, 1 = perfect fit): the unitless fraction of the variance of Y that is explained by X, R² = ESS/TSS = 1 − SSR/TSS; the higher the better
  2. Standard Error of the Regression (SER): the magnitude of the typical regression residual in the units of Y (the spread of the distribution of the residuals û), essentially the sample standard deviation of the OLS residuals, SER = √( (1/(n−2)) Σᵢ ûi² ); the lower the better
  3. Root Mean Squared Error (RMSE): the same as the SER but without the degrees-of-freedom correction, dividing by n instead of n − 2: RMSE = √( (1/n) Σᵢ ûi² ) (see the R sketch below)
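A minimal R sketch of these three measures (made-up data; the names x, y, and fit are illustrative, not from the course files):

set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
u_hat <- resid(fit)                                # OLS residuals
r2   <- summary(fit)$r.squared                     # R^2 = 1 - SSR/TSS
ser  <- sqrt(sum(u_hat^2) / (length(u_hat) - 2))   # SER, with d.f. correction
rmse <- sqrt(mean(u_hat^2))                        # RMSE, no d.f. correction
c(R2 = r2, SER = ser, RMSE = rmse)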
6
Q

OLS Estimation + Proof

A
  • OLS estimation: chooses the line that minimizes the squared errors, i.e. fits the line so that the sum of squared differences b/w the regression line and the true data points is as small as possible.
  • Minimizes the avg. squared difference b/w the actual values (Yi) and the predictions Ŷi = b0 + b1Xi based on the estimated line
  • given n points (Xi, Yi), find the line of best fit Ŷi = b0 + b1Xi that minimizes the sum of squared errors in Y, Σᵢ (Yi − Ŷi)² (the vertical distance b/w points & line); see the R sketch below
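A minimal R sketch (hypothetical data, not course code) verifying that the OLS formulas b1 = cov(X, Y)/var(X) and b0 = Ȳ − b1·X̄ match what lm() reports:

set.seed(2)
x <- rnorm(50); y <- 1 + 0.5 * x + rnorm(50)
b1 <- cov(x, y) / var(x)       # slope: sample covariance over sample variance of X
b0 <- mean(y) - b1 * mean(x)   # intercept: forces the line through (X-bar, Y-bar)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                # should agree with the manual formulas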
7
Q

R Regression Code

A

library(lmtest)    # for coeftest()
library(sandwich)  # for vcovHC() heteroskedasticity-robust standard errors
regression1 <- lm(dependent ~ independent, data = caschool)
summary(regression1)
coeftest(regression1, vcov = vcovHC(regression1, type = "HC1"))

8
Q

Look at R code result, Interpret each element

A
9
Q

1st Least Squares Assumption for Causal Inference

A

E(u|X = x) = 0: the conditional mean of u given X is zero, i.e. all the individual characteristics that make up u are distributed independently of X, so there are no confounding variables (no OVB). The benchmark is a randomized controlled experiment: for a binary treatment, the causal effect is the expected difference in means b/w the treatment & control groups, where random assignment (by computer) ensures X is uncorrelated with all other determinants of Y. All other qualities and residuals then cancel out across the two groups, implying β̂1 is an unbiased estimator of the causal effect.

See graph in doc

10
Q

2nd Least Squares Assumption for Causal Inference

A

(Xi, Yi), i = 1, …, n are identically & independently distributed (i.i.d.), which allows the Central Limit Theorem (CLT) to deliver the sampling distributions of β̂0 & β̂1. This holds under simple random sampling: all entities are selected from the same population (identically distributed) and at random, so the probability of selecting one school is unrelated to the probability of selecting any other (independently distributed).

11
Q

3rd Least Squares Assumption for Causal Inference

A

Large outliers in X and/or Y are rare: E(X⁴) < ∞ and E(Y⁴) < ∞. A large outlier can strongly influence the results or produce meaningless values of β̂1; usually X & Y are bounded and therefore have finite fourth moments.
Check with a scatterplot and handle extreme values of X or Y by either:
Trimming: take 1% of the data off both ends
Winsorizing: replace extreme values with less extreme values from within the data distribution rather than removing them entirely, mitigating their effect without discarding data points.

See doc for graph.

12
Q

Interpret b0 and b1

A

b0 is the average value of Y when X=0

b1 is the change in Y associated with a 1-unit change in X, holding all other factors/variables constant.

13
Q

Heteroskedasticity means that:
A) homogeneity cannot be assumed automatically for the model.
B) the variance of the error term is not constant.
C) the observed units have different preferences.
D) agents are not all rational.

A

B

14
Q

The power of the test is:
A) dependent on whether you calculate a t or a t2 statistic.
B) one minus the probability of committing a type I error.
C) a subjective view taken by the econometrician dependent on the situation.
D) one minus the probability of committing a type II error.

A

D

15
Q

With i.i.d. sampling each of the following is true EXCEPT:
A) E(Ȳ) = μY.
B) var(Ȳ) = σ²Y/n.
C) E(Ȳ) < E(Y).
D) Ȳ is a random variable

A

C

16
Q

The central limit theorem:
A) states conditions under which a variable involving the sum of Y1, …, Yn i.i.d. variables
becomes the standard normal distribution.
B) postulates that the sample mean is a consistent estimator of the population mean μY.
C) only holds in the presence of the law of large numbers.
D) states conditions under which a variable involving the sum of Y1, …, Yn i.i.d. variables
becomes the Student t distribution.

A

A

17
Q

You have estimated a linear regression to understand the relationship between salary and
years of experience. You want to test the hypothesis:
* Null Hypothesis H0 : The effect of experience on salary is zero (β1=0).
* Alternative Hypothesis HA : Experience significantly affects salary (β1≠0).
Which of the following R commands will provide the t-statistic and p-value for this
hypothesis test?
A) summary(model)
B) coefficients(model)
C) confint(model)
D) t.test(company_data$salary, company_data$experience)

A

A

18
Q

Which command will predict sales if the advertising budget is 1000 units?
A) predict(model, newdata = data.frame(advertising = 1000))
B) predict(model, newdata = list(advertising = 1000))
C) model$predict(1000)
D) predict(model, advertising = 1000)

A

A

19
Q

Which command extracts the intercept and slope coefficients from the model?
A) coef(model)
B) summary(model)
C) model$coefficients
D) coefficients(model)

A

C

20
Q

Which R command will show the detailed results (coefficients, residuals, R-squared, etc.) of
the regression?
A) summary(model)
B) print(model)
C) model$coefficients
D) coefficients(model)

A

A

21
Q

Which of the following is the correct way to run a simple linear regression in R, where sales
is the dependent variable and advertising is the independent variable using the lm()
function?
A) lm(sales ~ advertising, data = dataset)
B) lm(advertising ~ sales, dataset)
C) lm(data = dataset, sales ~ advertising)
D) lm(dataset$sales, dataset$advertising)

22
Q

To infer the political tendencies of the students at your college/university, you sample 150
of them. Only one of the following is a simple random sample. You:
A) make sure that the proportion of minorities are the same in your sample as in the
entire student body.
B) call every fiftieth person in the student directory at 9 a.m. If the person does not answer
the phone, you pick the next name listed, and so on.
C) go to the main dining hall on campus and interview students randomly there.
D) have your statistical package generate 150 random numbers in the range from 1 to the
total number of students in your academic institution, and then choose the corresponding
names in the student telephone directory

23
Q

4 elements of an Ideal Randomized Controlled Experiment

A
  • Ideal: subjects follow the treatment protocol, with perfect compliance and no errors in reporting
  • Randomized: subjects from the population of interest are randomly assigned to a treatment or control group, so there is no confounding (no OVB)
  • Controlled: having a control group permits measuring the differential effect of the treatment
  • Experiment: the treatment is assigned and subjects have no choice, avoiding reverse causality & selection bias (where those more likely to benefit select into the treatment group, biasing the comparison)
24
Q

4th Least Squares Assumption for Causal Inference in Multiple Regression & How it can be violated & Solutions

A

No perfect multicollinearity: perfect multicollinearity arises when a regressor is an exact linear function of the other regressors (imperfect multicollinearity, where regressors are merely highly correlated, is allowed).
1. Inserting the same variable twice: R reports NA for the redundant coefficient and Stata drops it
2. Dummy Variable Trap: with a set of mutually exclusive & exhaustive dummy variables (e.g. income regressed on province dummies), each dummy can be perfectly predicted from the others plus the constant, so including all the dummies & a constant gives perfect multicollinearity and the individual dummy effects cannot be interpreted because of the redundancy with the intercept
* Solution: modify the list of regressors: omit the intercept or omit one category (the base group); see the R sketch below
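A minimal R sketch of the trap (made-up data; the province names and object names are illustrative): lm() returns NA for one manually built dummy when all of them plus an intercept are included, while dropping one dummy (the base group) estimates cleanly.

set.seed(3)
prov   <- sample(c("BC", "AB", "ON"), 200, replace = TRUE)
income <- 50 + 5 * (prov == "BC") + 2 * (prov == "AB") + rnorm(200)
d_bc <- as.numeric(prov == "BC")
d_ab <- as.numeric(prov == "AB")
d_on <- as.numeric(prov == "ON")
lm(income ~ d_bc + d_ab + d_on)   # perfect multicollinearity: one coefficient is NA
lm(income ~ d_bc + d_ab)          # omit one category (ON is the base group)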

25
Q

5 Multiple Regression Model Measures of Fit

A
  1. Actual = predicted + residual: Yi = Ŷi + ûi
  2. SER: standard deviation of the residuals with a d.f. correction (avg. spread), SER = √( (1/(n−k−1)) Σᵢ ûi² ), where k = # of regressors
  3. RMSE: standard deviation of the residuals without the d.f. correction, RMSE = √( (1/n) Σᵢ ûi² )
  4. R²: fraction of the variance of Y explained by the regressors, R² = ESS/TSS = 1 − SSR/TSS, where ESS = Σᵢ (Ŷi − Ȳ)², SSR = Σᵢ (Yi − Ŷi)², TSS = Σᵢ (Yi − Ȳ)². Issue: adding another regressor, even one only slightly correlated with Y, reduces SSR and so never lowers R²
  5. Adjusted R² (R̄²): R² with a degrees-of-freedom correction that penalizes adding another regressor; it doesn't necessarily increase when a regressor is added and is always smaller than the unadjusted R²: R̄² = 1 − ((n−1)/(n−k−1))·(SSR/TSS) (checked in the R sketch below)
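A minimal R sketch (hypothetical data; x2 is deliberately irrelevant) checking the adjusted-R² formula against what summary() reports:

set.seed(4)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)            # x2 does not affect y
fit <- lm(y ~ x1 + x2)
ssr <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)
k   <- 2                               # number of regressors
adj_r2_manual <- 1 - ((n - 1) / (n - k - 1)) * ssr / tss
c(manual = adj_r2_manual, from_summary = summary(fit)$adj.r.squared)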
26
Q

How to mathematically hold constant variables in multiple regression model

A

Partial derivatives

Holding X2 constant: β1 = ΔY/ΔX1 (the partial effect of X1 on Y)
Holding X1 constant: β2 = ΔY/ΔX2 (the partial effect of X2 on Y)

27
Q

3 Solutions to Omitted Variable Bias

A
  1. Randomized controlled experiment where the treatment is randomly assigned: often not feasible
  2. Cross-tabulation approach: control for the confounder by comparing cases that differ in the independent variable but have identical values of the confounding determinant; however, with several confounders you quickly run out of data
  3. Add the confounder as a regressor so it is no longer omitted: multiple regression
28
Q

Fill in Direction of Bias Table for Positive & Negative Correlation

A

Direction: determined by the relation b/w Z→X & b/w Z→Y; does it amplify or dampen the estimated relation b/w X→Y?
Downward bias: makes the relation seem more negative; arises when Z increases X and decreases Y, or decreases X and increases Y
Upward bias: makes the relation seem more positive; arises when Z increases X and increases Y, or decreases X and decreases Y
Overestimating & underestimating: the biased estimate ends up farther from or closer to β1 = 0 than the truth

29
Q

OVB impact on
1) Bad attention span on X-Media Usage –> Y-Academic Performance
2) Experience on X-Setup –> Y-Game Rank
3) PctEL (% English learners) on X-Class Size –> Y-Test Scores

A
  1. Downward bias: the estimate is more negative than the true effect. Overestimates the power of media usage on academic performance; one may falsely conclude that media usage has a greater effect on grades than it really does, since attention span plays a large role
  2. Upward bias: the estimate is more positive than the true effect. Overestimates the ability of a good setup to raise rank, since experience is a big factor too
  3. Direction: sign of (Z→X) times sign of (Z→Y); here the omitted PctEL leads to an overestimate of the class-size effect on test scores
30
Q

Omitted Variable Bias + Conditions + Formula

A

Omitted variable bias: β̂1 is biased and inconsistent even if n is large, E(β̂1) ≠ β1; β̂1 →p β1 + (σu/σX)·ρXu.
It arises when a variable Z left in u satisfies both:
1. Z is a determinant of Y (i.e. Z is part of u)
2. Z is correlated with X: corr(Z, X) ≠ 0, so ρXu ≠ 0

31
Q

TestScore = 698.9 − 2.28·STR, SE(β̂1) = 0.52. Is the slope significant? Use hypothesis testing, the p-value, and a confidence interval to show it.

A

H0: β1 = 0
H1: β1 ≠ 0
Hypothesis testing: t = (β̂1 − β1,0)/SE(β̂1) = (−2.28 − 0)/0.52 = −4.38; |t| = 4.38 > 2.58 (the 1% critical value), so reject the null

P-value: Pr(|t| > 4.38) is far below 5% (and below 1%), so reject the null

Confidence interval: β̂1 ± 1.96·SE(β̂1) = (−2.28 − 1.96(0.52), −2.28 + 1.96(0.52))
= (−3.30, −1.26); since the confidence interval doesn't include 0, the null is rejected. (An R check follows.)
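A minimal R sketch reproducing these numbers; the coefficient and standard error are taken from the card, so no data set is needed (the p-value uses the normal approximation):

b1  <- -2.28
se1 <- 0.52
t_stat <- (b1 - 0) / se1                 # about -4.38
p_val  <- 2 * pnorm(-abs(t_stat))        # two-sided p-value, normal approximation
ci_95  <- c(b1 - 1.96 * se1, b1 + 1.96 * se1)
c(t = t_stat, p = p_val, lower = ci_95[1], upper = ci_95[2])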

32
Q

5 Step Process for inference on the slope of the population regression line

A
  1. State the population object of interest & its estimator: β1, estimated by OLS β̂1, assuming the 3 least squares assumptions hold (E(u|X) = 0, i.i.d. sampling, rare large outliers)
  2. Derive the sampling distribution (normal for large n): β̂1 ~ N(β1, σ²β̂1), with mean E(β̂1) = β1 and variance σ²β̂1 = (1/n)·Var[(Xi − μX)ui] / [Var(Xi)]²
  3. Standard error of the estimator = square root of the estimated variance: SE(β̂1) = √(σ̂²β̂1)
  4. Construct the t-statistic & confidence interval for hypothesis testing: t = (estimator − hypothesized value)/(standard error of the estimator); for the mean, t = (Ȳ − μY,0)/(sY/√n); for the slope, t = (β̂1 − β1,0)/SE(β̂1); the 95% CI is β̂1 ± 1.96·SE(β̂1)
  5. Significance/hypothesis rejection: reject the null if |t| > 1.96 or the p-value < the significance level (e.g. 5%)
33
Q

Derive Residual, Mean and Variance of b1

34
Q

Sampling Uncertainty + What do you need to derive to find it

A
  • Sampling uncertainty: different samples yield different values of β̂0 & β̂1, which is quantified through hypothesis tests or confidence intervals; this requires finding the sampling distribution.
  • Distribution of the OLS estimator: the sampling distribution of β̂1 in large samples is normal; derive its mean & variance to compute significance via t = (β̂1 − β1,0)/SE(β̂1)
  • β̂1 ~ N(β1, σ²β̂1), and Z = (β̂1 − E(β̂1))/√Var(β̂1) ~ N(0, 1)
35
Q

Derive b_1, b_0

36
Q

OLS Blue

A

OLS is BLUE: the Best Linear Unbiased Estimator, i.e. the most efficient (smallest-variance) estimator among all linear unbiased estimators

37
Q

Causal Inference v. Prediction

A
  • Place different requirements on data, use same regression toolkit
  • Causal Inference: learning causal effect on Y of a change in X
  • Prediction: predicting the value of Y given X for an observation not in the data set
38
Q

Regression Error Types + Illustrate

A

Regression error/population error term (ui): consists of omitted factors other than X that influence Y (the OVB candidates) and measurement error in Y; it is the difference b/w the population regression line and the true data point.
* Unexplained variation/residuals: SSR = Σᵢ (Yi − Ŷi)² = Σᵢ ûi², with ûi = Yi − (b0 + b1Xi) = actual value of Y − predicted value of Y
* Explained variation: ESS = Σᵢ (Ŷi − Ȳ)²
* Total variation: TSS = Σᵢ (Yi − Ȳ)² = ESS + SSR (verified in the R sketch below)
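A minimal R sketch (hypothetical data) verifying the decomposition TSS = ESS + SSR for an OLS fit with an intercept:

set.seed(5)
x <- rnorm(80); y <- 3 - 1.5 * x + rnorm(80)
fit   <- lm(y ~ x)
y_hat <- fitted(fit)
tss <- sum((y - mean(y))^2)      # total variation
ess <- sum((y_hat - mean(y))^2)  # explained variation
ssr <- sum((y - y_hat)^2)        # unexplained variation (residuals)
c(TSS = tss, ESS_plus_SSR = ess + ssr)   # the two should match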

39
Q

X̄ = 434.49, sX = 294.67, n = 1744. Construct the 99% confidence interval.

A

Confidence interval: CI = [X̄ − 2.58·294.67/√1744, X̄ + 2.58·294.67/√1744] = [416.29, 452.69]
In 99% of repeated samples an interval constructed this way contains the true mean, so the true population average of weekly earnings is expected to lie in this interval.
A 90% confidence interval would be narrower, since it uses 1.64 instead of 2.58: X̄ ± 1.64·294.67/√1744 lies inside X̄ ± 2.58·294.67/√1744. (See the R sketch below.)
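A minimal R sketch plugging in the card's numbers (mean 434.49, sd 294.67, n = 1744 come from the card):

x_bar <- 434.49
s_x   <- 294.67
n     <- 1744
se    <- s_x / sqrt(n)
c(x_bar - 2.58 * se, x_bar + 2.58 * se)   # 99% CI, about (416.3, 452.7)
c(x_bar - 1.64 * se, x_bar + 1.64 * se)   # 90% CI, narrower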

40
Q

Is difference in average earnings statistically significant: see graph in tutorial 1 doc

A

Given the sample standard deviations, use the t-statistic:
H0: μ<45 − μ>45 = 0, H1: μ<45 − μ>45 > 0
t = (Ȳ<45 − Ȳ>45) / √( s²<45/2507 + s²>45/1237 ) = 4.62 > 2 (the critical value), so the difference is significant and the null hypothesis is rejected. (The R sketch below shows the computation with placeholder statistics.)
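A minimal R sketch of the two-sample t-statistic from summary statistics. The group means and standard deviations here are placeholders (the real numbers are in the tutorial doc); only the formula and the sample sizes follow the card:

ybar_young <- 500; s_young <- 300; n_young <- 2507   # under-45 group (illustrative values)
ybar_old   <- 450; s_old   <- 280; n_old   <- 1237   # 45-and-over group (illustrative values)
t_stat <- (ybar_young - ybar_old) /
          sqrt(s_young^2 / n_young + s_old^2 / n_old)
t_stat   # reject H0 if this exceeds the critical value (about 1.64 one-sided, 1.96 two-sided)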

41
Q

Sir Francis Galton, a cousin of Charles Darwin, examined the relationship between the height of children
and their parents towards the end of the 19th century. It is from this study that the name “regression”
originated. You decide to update his findings by collecting data from 110 college students, and estimate
the following relationship:
Studenth = 19.6 + 0.73 × Midparh, R² = 0.45, SER = 2.0
where Studenth is the height of students in inches, and Midparh is the average of the parental heights.
(Following Galton’s methodology, both variables were adjusted so that the average female height was
equal to the average male height.)
(a) Interpret the estimated coefficients.
(b) What is the meaning of the regression R2 ?
(c) What is the prediction for the height of a child whose parents have an average height of 70.06 inches?
(d) What is the interpretation of the SER here?
(e) Given the positive intercept and the fact that the slope lies between zero and one, what can you say
about the height of students who have quite tall parents? Those who have quite short parents?
(f) Galton was concerned about the height of the English aristocracy and referred to the above result as
“regression towards mediocrity.” Can you figure out what his concern was? Why do you think that we
refer to this result today as “Galton’s Fallacy”?

42
Q

Sample Midterm Q1

43
Q

Sample Midterm Q3

44
Q

Imagine that you were told that the t-statistic for the slope coefficient of the regression line = 698.9 – 2.28 × STR was 4.38. What are the units of measurement for the t-statistic?

A

D) standard deviations: the t-statistic divides the estimate by its standard error, which has the same units as the estimator, so the t-statistic is measured in standard deviations (standard errors) rather than in the units of Y or X.

45
Q

In general, the t-statistic has the following form:

A

C)
t = (estimator − hypothesized value) / SE(estimator), e.g. t = (Ȳ − μY,0)/SE(Ȳ) or t = (β̂1 − β1,0)/SE(β̂1)

46
Q

4 Reasons why Correlation doesn’t imply causation

A
  1. A hidden variable causes A & B to move together
  2. Coincidence
  3. B causes A (reverse causality)
  4. A causes B and B causes A (simultaneous causality), so the correlation is strong even though the one-way causal effect is weak
47
Q

Interpret 95% Confidence Interval

A

Interval that contains the true value 95% of the time when repeatedly sampled.

48
Q

Type 1, Type 2, P-value, Power

A
  • Type 1 error: Pr = α = the significance level; a false positive, rejecting the null when it is true
  • Type 2 error: Pr = β, where power = 1 − β; a false negative, failing to reject the null when it is false
  • Power = 1 − β: the probability of correctly rejecting the null when it is false
  • P-value/marginal significance level = Pr(|t| > |t^act|) = Pr( |Ȳ − μY,0|/(sY/√n) > |Ȳ^act − μY,0|/(sY/√n) ): the probability of drawing a statistic at least as extreme as the one actually observed; contains more information than a simple reject/don't-reject decision; reject if p < α
49
Q

Process of Student t-distribution

A

Student t-distribution: used if the distribution is normal, sampling is i.i.d., and n < 25 or the population variance is unknown
t = (Ȳ − μY,0)/(sY/√n), or for two means t = (Ȳ1 − Ȳ2)/√(s1²/n1 + s2²/n2)
Compute t-statistic
Compute degrees of freedom
Look up 5% critical value
If t-statistic exceeds this critical value reject

Note: the t-statistic comparing two means generally does not have an exact Student t distribution even if each group is normally distributed; use SE = √(s1²/n1 + s2²/n2) and the large-sample approximation. (The R sketch below uses t.test(), which handles the details.)
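A minimal R sketch (hypothetical small samples) running a two-sample test with t.test(), which computes the t-statistic, degrees of freedom, and p-value in one call:

set.seed(6)
group1 <- rnorm(15, mean = 10, sd = 2)   # small samples, so the t-distribution matters
group2 <- rnorm(18, mean = 9,  sd = 2)
t.test(group1, group2)   # Welch test: uses sqrt(s1^2/n1 + s2^2/n2) in the denominator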

50
Q

Sampling Distribution of the Sample Mean Ȳ

A

Sampling distribution of Ȳ: the distribution of Ȳ over the different possible samples of size n
* Unbiased: E(Ȳ) = μY
* Efficient: Ȳ has the smallest variance of all linear unbiased estimators of μY
* Consistent (Law of Large Numbers): Ȳ →p μY; as n increases the distribution of Ȳ becomes more tightly centered around μY, guaranteed when the sample is i.i.d. and the variance is finite

51
Q

4 Moments of Statistics

A
  1. Mean (1st moment): the expected value of Y, E(Y) = μY, the long-run average over repeated realizations
  2. Variance (2nd moment): E[(Y − μY)²] = σ²Y, the squared spread of the distribution. Sample variance s²Y = (1/(n−1)) Σᵢ (Yi − Ȳ)² estimates the population variance and is an unbiased estimator if the sample is i.i.d. and the 4th moment is finite. Standard deviation: σY = √variance. Standard error of Ȳ: SE(Ȳ) = √(s²Y/n) = sY/√n
  3. Skewness (3rd moment): E[(Y − μY)³]/σ³Y, the asymmetry of a distribution; 0 = symmetric, > 0 long right tail, < 0 long left tail
  4. Kurtosis (4th moment): E[(Y − μY)⁴]/σ⁴Y; 3 = normal distribution, > 3 heavy tails (leptokurtic); see the R sketch below
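A minimal R sketch (hypothetical normal data) computing the four moments by hand, following the formulas above rather than relying on an add-on package:

set.seed(7)
y <- rnorm(1000)
m  <- mean(y)                          # 1st moment
v  <- var(y)                           # 2nd moment (sample variance, n-1 divisor)
sk <- mean((y - m)^3) / sd(y)^3        # 3rd moment: skewness, roughly 0 for normal data
ku <- mean((y - m)^4) / sd(y)^4        # 4th moment: kurtosis, roughly 3 for normal data
c(mean = m, variance = v, skewness = sk, kurtosis = ku)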
52
Q

Covariance v. Correlation

Conditional Distribution v. Conditional Mean/Variance

A

Covariance:
cov(X, Y) = E[(X − μX)(Y − μY)] = σXY, the linear association b/w X & Y; its units are the units of X times the units of Y
The covariance of a variable with itself is its variance: cov(X, X) = E[(X − μX)(X − μX)] = E[(X − μX)²] = σ²X
Correlation: corr(X, Y) = cov(X, Y)/(σXσY) = σXY/(σXσY) = rXY, which is unitless and lies b/w −1 and 1

Conditional distribution: the distribution of Y given X
Conditional mean/variance: the mean E(Y|X) and variance E[(Y − E(Y|X))² | X] of the conditional distribution
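A minimal R sketch (hypothetical data) contrasting cov() and cor(): rescaling X changes the covariance but leaves the correlation unchanged, illustrating the units point above.

set.seed(8)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200)
cov(x, y); cor(x, y)        # covariance is in units of X times Y; correlation is unitless
cov(100 * x, y)             # covariance scales with the units of X
cor(100 * x, y)             # correlation is unchanged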