Midterm Flashcards
Right vs left skewed distribution
Right-skewed: mode < median < mean (order left to right along the axis)
Left-skewed: mean < median < mode
What is the variance
The arithmetic average of the squared differences of the data values from the mean
How to calculate standard error
Standard deviation divided by the square root of the sample size
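The formula above can be sketched in Python with a made-up sample (values purely illustrative):

```python
import math
import statistics

# Made-up sample, purely for illustration
sample = [4.0, 7.0, 6.0, 5.0, 8.0, 6.0]

sd = statistics.stdev(sample)        # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(len(sample))     # standard error of the mean
```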
What is standard deviation
Describes the spread of values in a continuous distribution - a sample or population
It is used as a descriptive statistic
What is a standard error
Is used to measure how accurately a sample statistic (such as the sample mean) represents the population
How do you calculate confidence bounds
Mean ± (t value × standard error)
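A minimal Python sketch of the confidence-bounds formula, with made-up mean, standard error, and t critical value:

```python
# All numbers are illustrative, not from real data
mean = 10.0
t_crit = 2.045           # t critical value for the relevant degrees of freedom
se = 1.2                 # standard error of the mean

lower = mean - t_crit * se
upper = mean + t_crit * se
```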
How to calculate standard error of the difference
Root of (se1 squared plus se2 squared)
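The same formula as a one-line Python sketch, with illustrative standard errors for two group means:

```python
import math

# Illustrative standard errors of two group means (made up)
se1, se2 = 0.8, 0.6
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
```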
How to calculate a chi square
Sum of ((observed frequency minus expected frequency) squared divided by expected frequency)
What does the chi square tell us about the null hypothesis
If the chi square statistic is larger than the critical value for the given degrees of freedom, the null hypothesis can be rejected
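The chi square sum can be sketched in Python with made-up observed and expected frequencies:

```python
# Made-up observed and expected frequencies, purely for illustration
observed = [18, 22, 20]
expected = [20, 20, 20]

# Sum of (O - E)^2 / E over all cells
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```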
Explain type I vs type II error
Type I error is one where you reject the null hypothesis when you shouldn't (false positive)
Type II error is one where you fail to reject the null hypothesis when you should (false negative)
What is the central limit theorem
Establishes that the means of repeated large samples are normally distributed even when the underlying distribution of the data is not normal
What is a confidence interval?
A range of values around the sample mean within which the population mean is believed to lie with a stated level of confidence (generally 95%)
Pearson’s correlation coefficient
Measures the association between two continuous variables
r is scaled between -1 and 1
r = covariance of x and y divided by (standard deviation of x × standard deviation of y)
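The covariance-over-spreads formula can be sketched directly in Python (the helper name `pearson_r` and the data are made up for illustration):

```python
import math

def pearson_r(x, y):
    """r = covariance of x and y divided by the product of their spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linear made-up data, so r comes out at 1
r = pearson_r([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```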
Correlation vs regressions
Correlation tells us how strongly associated two variables are
Regression can tell us, on average, how much a one unit increase in the independent variable changes the predicted value of the dependent variable
What does the line of best fit do
Minimizes the sum of squared vertical (Y) distances from each observation to the line
Why do we use y hat and how does it differ
Y hat means we are producing estimated y values
For actual values of y we need the error term, so y = a + bXi + ei
Standard error of the slope
Given by the root mean square error divided by the square root of the sum of squared deviations of x (roughly, RMSE over the standard deviation of x times √n)
T ratio
t= (b - ßH0)/s.e.
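The t ratio as a Python sketch, with a made-up slope estimate, null value, and standard error:

```python
# Illustrative numbers, not from real data
b = 1.8          # estimated slope
beta_h0 = 0.0    # value of the slope under the null hypothesis
se_b = 0.6       # standard error of the slope

t_ratio = (b - beta_h0) / se_b
```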
What are the five OLS assumptions
Linearity
Mean independence
Homoscedasticity
Uncorrelated disturbances
Normal disturbance
Explain linearity
Linearity - the dependent variable is a linear function of the x's plus a population error term, e.g. y = a + ß1x1 + ß2x2 + e
Pertains to linearity in the parameters
Explain mean independence
Zero conditional mean
The mean value of error does not depend on any of the x’s
Assume that E(e|x) = 0
Most important assumption because violations (1) can generate large bias in the estimates and (2) cannot be tested for without additional data. Common violations:
Omitted variable bias
Endogenous bias
Measurement error
Explain homoscedasticity
The variance of the error cannot depend on the x’s
The error variance (standard deviation squared) is constant
You want homoskedasticity
In a test for heteroskedasticity (e.g. Breusch-Pagan), the p value has to be > 0.05 to retain constant variance
Non constant variance
Biases the standard errors
Explain uncorrelated disturbances
The value of the error for any observation is uncorrelated with the value of the error for any other observation
Correlated errors can arise from connected observations, causal effects, or serial correlation
Correlated errors shrink standard errors: observations are assumed to be more independent than they are, creating a danger of Type I error
Explain normal disturbance
The disturbances, e, are distributed normally
Only the disturbances not the variables must be normally distributed
Normality is the least important assumption
How are the OLS assumptions related
Assumptions 1+2 give unbiased estimators
Assumptions 3+4 make OLS BLUE (best linear unbiased estimator): standard errors are at least as small as those produced by any other linear unbiased method
Assumption 5 implies that a t table or z table can be used to calculate p values
What does the dummy variable do
Helps with comparison of the means of y for different categories of x
Collider bias
Occurs when a treatment (independent) variable and outcome (dependent) variable or factors causing these each influence a common third variable and that variable (the collider) is controlled for by design or analysis
More general form of selection bias
Post treatment bias
While omitting relevant covariates can lead to omitted variable bias, including covariates that control for your causal mechanism can result in post-treatment bias
What is multicollinearity and what are the consequences
A situation where two or more explanatory variables in a multiple regression are highly linearly related
Does not bias your coefficient estimates
Inflates the standard errors of highly collinear variables
Induces unstable estimates
which OLS assumptions do time series tend to violate
Mean independence and the independence of errors
What is stationarity
A time series is weakly stationary if its mean and variance remain constant over time
What is a dummy variable
A variable that is coded 1 or 0
Regression outlier
An observation where the dependent value y is unusually extreme given its independent value x
In which direction is ß1 biased when x2 is omitted
ß2>0 and corr(x1,x2)>0 positive
ß2>0 and corr(x1,x2)<0 negative
ß2<0 and corr(x1,x2)>0 negative
ß2<0 and corr(x1,x2)<0 positive
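The first row of the table can be checked with a tiny made-up example (all numbers illustrative): the true model is y = 1·x1 + 2·x2 with ß2 > 0 and corr(x1, x2) > 0, so regressing y on x1 alone should bias the slope upward.

```python
# True model (made up): y = 1*x1 + 2*x2, with beta2 > 0
# and x2 positively correlated with x1 (here x2 = 0.5 * x1)
x1 = [0.0, 1.0, 2.0, 3.0, 4.0]
x2 = [0.5 * a for a in x1]
y = [1.0 * a + 2.0 * b for a, b in zip(x1, x2)]

# Short regression of y on x1 alone: slope = cov(x1, y) / var(x1)
n = len(x1)
m1, my = sum(x1) / n, sum(y) / n
slope = sum((a - m1) * (c - my) for a, c in zip(x1, y)) / sum((a - m1) ** 2 for a in x1)
# slope comes out above the true beta1 = 1.0, i.e. the bias is positive
```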
Interpret the different log-linear relationships
Level level : y=a+ßx one unit change in x leads to a ß unit change in y
Log linear: log(y) = a+ßx one unit change in x leads to a 100*ß percent change in y
Linear log : y=a+ßlog(x) one percent change in x leads to a ß/100 unit change in y
Log log : log(y)=a+ßlog(x) one percent change in x leads to a ß percent change in y
How do we interpret squared terms in non linear regression
For y = b1 + b2x + b3x^2:
If b2 is negative and b3 (the coefficient on x^2) is positive then y is convex (smiley)
If b2 is positive and b3 is negative then y is concave (frowny)
What does a time counter do
It draws out the trend in time series data
What is a unit root
How much of y is explained by its previous value
The y is almost exactly the same as its previous value
Also known as a random walk
Random walk
Same value today as yesterday with just a bit of randomness
Weakly dependent time series
A covariance stationary time series is weakly dependent if the correlation between x_t and x_(t+h) goes to zero sufficiently quickly as h increases
What are the two types of panel data
True panels - longitudinal data measuring the same units repeatedly over time
pooled cross sections - random surveys in multiple years with a new random sample each time
pooled cross sections
Advantages
Are amenable to OLS with only minor complications
Increased sample size increases accuracy of estimators and adds statistical power
Pitfalls
Distributions may change in different years
Panel heteroskedasticity
Fixed effects/within model
Subtract off the mean value of each group from each observation in a group
Equivalent to adding a dummy variable for each group
Superpower: yields within estimation, in which only the variation within groups is used for the coefficients
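The within transformation (subtracting each group's mean) can be sketched with a made-up two-group panel:

```python
# Made-up panel: two groups, three observations each
data = {"A": [3.0, 5.0, 7.0], "B": [10.0, 12.0, 14.0]}

demeaned = {}
for group, values in data.items():
    group_mean = sum(values) / len(values)
    demeaned[group] = [v - group_mean for v in values]
# Group-level differences vanish; only within-group variation remains
```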
What is persistence
Persistence in time series refers to the continuity of an effect after the cause is removed
Often related to the notion of memory properties of time series
Has an effect on standard errors and can lead to false positives and negatives
If an infinitesimally small shock influences future predictions of the time series for a very long time, you have a persistent time series process
How do you deal with persistence
Use lagged data
Make sure to model the trend in your data
How do you interpret a marginal effects plot
The y axis is the marginal effect of x on y dy/ dx
And the x axis is now the value of the conditioning variable
How do you calculate vif
VIF = 1/(1 - r^2)
VIF = 1/tolerance
Tolerance = 1 - r^2
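The VIF formula as a Python sketch, using an illustrative R² from the auxiliary regression of one predictor on the others:

```python
# Illustrative R^2 from regressing one predictor on the others (made up)
r_squared = 0.75
tolerance = 1 - r_squared
vif = 1 / tolerance          # = 1 / (1 - R^2)
```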