Descriptive analysis and linear regression Flashcards
What is cross-sectional data?
Data collected simultaneously from a random sample e.g. household surveys, population census
What is time-series data?
Observations of a variable at different times e.g. monthly inflation
What is pooled cross-sectional data?
Observing different random samples at different times
What is panel (longitudinal) data?
Observing the same random sample at different times
Write an example of a population regression function and a sample regression function.
Population: Yi = B1 + B2Xi + ui
Sample: Yi = b1 + b2Xi + ei
Interpret the terms Yi, Xi, ui, B1 and B2
Yi - the dependent/response/regressand variable
Xi - the independent/explanatory/regressor variable
ui - error term (we do not observe this)
B1 - the expected value of Yi when Xi=0
B2 - the average/expected change in Yi for a one-unit increase in Xi
Interpret E(Yi | Xi) = B1 + B2Xi
This is the conditional mean of Y given X, which represents the deterministic component of the regression function.
Define OLS
OLS is a method for estimating the parameters in a linear function which contains a set of independent variables and a dependent variable.
It does this by minimising the sum of the SQUARED differences between the observed values of the dependent variable in the sample and the fitted/predicted values given by the linear function.
In other words, the OLS method minimises the residual sum of squares.
Alternatively:
OLS finds the pair of estimates, b1 and b2, which minimises the Residual Sum of Squares (RSS), also called the Error Sum of Squares (ESS)
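A minimal sketch of this minimisation for one regressor, using the standard closed-form solution. The data and the function names `ols_simple` and `rss` are made up for illustration.

```python
def ols_simple(x, y):
    """Return (b1, b2) that minimise the RSS for the fitted line y = b1 + b2*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample co-variation of x and y, and variation of x (both about their means)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = s_xy / s_xx            # slope estimate
    b1 = y_bar - b2 * x_bar     # intercept estimate
    return b1, b2


def rss(x, y, b1, b2):
    """Residual sum of squares for a candidate pair (b1, b2)."""
    return sum((yi - (b1 + b2 * xi)) ** 2 for xi, yi in zip(x, y))


# Illustrative (made-up) data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2 = ols_simple(x, y)
best_rss = rss(x, y, b1, b2)
```

Any other pair of values, for example `rss(x, y, b1 + 0.1, b2)`, should give an RSS at least as large as `best_rss`.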
What is a confidence interval and how is it calculated?
A confidence interval is a range of values, calculated from the sample, that will contain the population parameter a specified proportion of the time over repeated samples, for example 95% of the time.
For example, the 95% confidence interval for B2:
[b2 - 1.96 x se(b2), b2 + 1.96 x se(b2)]
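The interval above can be sketched as follows. The estimate (0.50) and standard error (0.10) are hypothetical numbers, and 1.96 is the large-sample critical value from the standard normal distribution; in small samples a critical value from the t distribution with n - k degrees of freedom would be used instead.

```python
from statistics import NormalDist


def confidence_interval(estimate, std_error, critical_value=1.96):
    """Return (lower, upper) = estimate -/+ critical_value * std_error."""
    half_width = critical_value * std_error
    return estimate - half_width, estimate + half_width


# 1.96 is (approximately) the 97.5th percentile of the standard normal
z_975 = NormalDist().inv_cdf(0.975)

# Hypothetical values for b2 and se(b2)
lower, upper = confidence_interval(0.50, 0.10)
```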
What are the seven classical assumptions of OLS?
- Model is LINEAR in the regression coefficients (parameters)
- Regressors are assumed to be FIXED (i.e. NON-STOCHASTIC)
- EXOGENEITY
- HOMOSCEDASTICITY
- UNCORRELATED ERRORS
- NO MULTICOLLINEARITY
- NORMALITY OF ui
How can a population regression function be split up into components, and what are these components called?
Deterministic component: E(Yi | Xi) = B1 + B2Xi
(for the sample regression function this part relates to the fitted values of Y, denoted by Y-hat)
Random component: ui
What do the standard errors of the coefficient estimates measure?
What does a large standard error imply?
Standard errors measure how much the coefficient estimates would vary from sample to sample. Each standard error is an estimate of the standard deviation of the sampling distribution of that coefficient estimate.
The larger the standard error, the greater the variability and the less certainty there is about the true magnitude of the coefficient.
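A minimal sketch of how these standard errors can be computed for one regressor, assuming homoscedastic errors and using the usual textbook formulas. The data and the function name `ols_with_standard_errors` are made up for illustration.

```python
import math


def ols_with_standard_errors(x, y):
    """Return (b1, b2, se_b1, se_b2) for the fitted line y = b1 + b2*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b1 = y_bar - b2 * x_bar
    # Estimate of the error variance: RSS divided by degrees of freedom (n - 2)
    rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = rss / (n - 2)
    se_b2 = math.sqrt(sigma2_hat / s_xx)
    se_b1 = math.sqrt(sigma2_hat * (1.0 / n + x_bar ** 2 / s_xx))
    return b1, b2, se_b1, se_b2


# Illustrative (made-up) data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, b2, se_b1, se_b2 = ols_with_standard_errors(x, y)
```

Noisier data (a larger RSS) or less spread in X (a smaller `s_xx`) both give a larger `se_b2`.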
What does it mean that the regressors are fixed (non-stochastic)?
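Fixed regressors means that the values of X are treated as given constants rather than as draws from a probability distribution: the same X values would be used in every repeated sample, so the only source of randomness in Yi is ui. A variable is random (stochastic) only if we assume that some probability distribution p(x) describes how likely each of its possible values is. If we make no such assumption, it is not treated as a random variable.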
Define exogeneity (also covered in part 2 of course)
Exogeneity means that there is no systematic relationship between the error terms, u, and the independent variable, X.
E(ui | X) = 0
Define homoscedasticity
Homoscedasticity means that ui has constant variance given X:
var(ui | Xi) = σ^2
Write out formally that errors are uncorrelated.
Uncorrelated errors: cov(ui, uj | X) = 0 for all i ≠ j
Define multicollinearity.
What does multicollinearity imply if there is just one regressor?
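Multicollinearity means that there is a linear relationship between regressors (perfect multicollinearity if the relationship is exact).
With just one regressor, no multicollinearity means that Xi must take at least two different values in the data. If Xi were constant, it would be perfectly collinear with the intercept.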
Write out formally that ui follows a normal distribution.
Normality of ui: ui ∼ N(0, σ^2)
What does it mean if OLS is BLUE?
BLUE stands for Best Linear Unbiased Estimator.
This means that, among all LINEAR UNBIASED estimators, the OLS estimators are BEST in the sense of having MINIMUM VARIANCE (i.e. they are efficient), and they are UNBIASED: on average, the ESTIMATED parameters are EQUAL to their TRUE VALUES, i.e. E(bk) = Bk
Write out the formula for calculating the test statistic (t-ratio)
t = (b2 - B2)/se(b2), where B2 is the value of the coefficient under the null hypothesis
Under what conditions do we reject the null hypothesis in a hypothesis test?
We reject the null hypothesis if the ABSOLUTE value of the t-ratio is GREATER than the CRITICAL VALUE, which is determined by the specified significance level.
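The t-ratio and the rejection rule can be sketched together. The estimate, standard error and critical value below are hypothetical; 1.96 is the large-sample two-sided critical value at the 5% level.

```python
def t_ratio(estimate, null_value, std_error):
    """t = (b2 - B2) / se(b2), with B2 the coefficient value under the null."""
    return (estimate - null_value) / std_error


def reject_null(t, critical_value):
    """Reject when the absolute t-ratio exceeds the critical value."""
    return abs(t) > critical_value


# Hypothetical values: b2 = 0.50, null B2 = 0, se(b2) = 0.10
t = t_ratio(0.50, 0.0, 0.10)
decision = reject_null(t, 1.96)

# A much noisier estimate of the same size would not lead to rejection
t_noisy = t_ratio(0.50, 0.0, 0.40)
decision_noisy = reject_null(t_noisy, 1.96)
```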
Write out the equation for calculating the degrees of freedom
Degrees of freedom = n - k
n - number of observations
k - number of regression coefficients, including the intercept
Degrees of freedom are the number of values in the data that remain free to vary once the parameters have been estimated. They matter for hypothesis testing (e.g. t and chi-square tests) because they determine which reference distribution is used, and hence the critical value against which the test statistic is compared.
What is a type 1 error in hypothesis testing?
A type 1 error is the INCORRECT REJECTION of a TRUE NULL hypothesis.
In other words a “false positive”.
The type 1 error rate is the significance level (e.g. 5%)
What is a type 2 error in hypothesis testing?
How is a type 2 error denoted?
How does the type 2 error rate depend on the magnitude of the coefficient concerned?
A type 2 error is the FAILURE to REJECT a FALSE NULL hypothesis.
In other words a “false negative”.
A type 2 error is denoted β and relates to the power of a test (power = 1 - β)
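The type 2 error rate depends on the magnitude of the true coefficient. The further the true coefficient is from the value under the null, the more likely we are to reject the null, so the type 2 error rate is lower (and the power is higher).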
Interpret the P-value
The p-value is how UNLIKELY it would be to see a T-RATIO of that magnitude if the NULL hypothesis were TRUE.
Alternatively, the p-value is the PROBABILITY of obtaining a result EQUAL to or MORE EXTREME than what was actually observed, given a TRUE NULL hypothesis.
For example, a p-value of 0.027 means that there is a 2.7% chance that a result of this magnitude, or more extreme, would be obtained under the null. Therefore we reject the null at the 5% level but not at the 1% level.
Given a particular p-value, how do we decide whether to reject the null hypothesis or not?
Given a particular p-value, we REJECT the NULL at any SIGNIFICANCE LEVEL GREATER than the p-value.
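A minimal sketch of this rule. The p-value 0.027 is an illustrative number, and the two-sided p-value function assumes a large sample, so it uses the standard normal distribution in place of the t distribution.

```python
from statistics import NormalDist


def two_sided_p_value(t):
    """Large-sample approximation: P(|Z| >= |t|) for a standard normal Z."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def reject_null(p_value, alpha):
    """Reject the null when the significance level exceeds the p-value."""
    return p_value < alpha


p = 0.027
reject_at_5_percent = reject_null(p, 0.05)   # 0.05 > 0.027, so reject
reject_at_1_percent = reject_null(p, 0.01)   # 0.01 < 0.027, so do not reject

# A t-ratio of about 2.21 gives a p-value close to 0.027 under this approximation
p_from_t = two_sided_p_value(2.21)
```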
What is the notation for the significance level of a hypothesis test?
Significance level is denoted by α, alpha.
Define coverage probability
Coverage probability is the proportion of the time, over repeated samples, that the confidence interval contains the true value of interest.
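Coverage can be checked by simulation. The sketch below assumes a hypothetical population regression function (B1 = 1, B2 = 2, normal errors), fixed regressors, and the large-sample critical value 1.96; all names and parameter values are made up for illustration.

```python
import math
import random


def simulate_coverage(n_reps=2000, n=50, beta1=1.0, beta2=2.0, sigma=1.0, seed=0):
    """Share of simulated samples whose 95% interval for B2 contains beta2."""
    rng = random.Random(seed)
    x = [i / n for i in range(n)]          # fixed regressors, reused every sample
    x_bar = sum(x) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    hits = 0
    for _ in range(n_reps):
        # Draw a new sample: only the error term changes between samples
        y = [beta1 + beta2 * xi + rng.gauss(0.0, sigma) for xi in x]
        y_bar = sum(y) / n
        b2 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
        b1 = y_bar - b2 * x_bar
        rss = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
        se_b2 = math.sqrt(rss / (n - 2) / s_xx)
        if b2 - 1.96 * se_b2 <= beta2 <= b2 + 1.96 * se_b2:
            hits += 1
    return hits / n_reps


coverage = simulate_coverage()
```

With these settings `coverage` should come out close to 0.95. Using 1.96 instead of the exact t critical value is expected to make it slightly lower in small samples.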
What is a dummy variable? What type of information do dummy variables capture?
A dummy variable takes only two values, usually 0 and 1, to indicate the absence or presence of some categorical effect that is expected to affect the outcome variable.
Dummy variables capture qualitative information such as gender, ethnicity etc.
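A minimal sketch of how qualitative information can be coded as a 0/1 dummy. The category labels and the function name `make_dummy` are made up for illustration.

```python
def make_dummy(values, category):
    """Return 1 where the observation belongs to `category`, 0 otherwise."""
    return [1 if v == category else 0 for v in values]


# Hypothetical qualitative variable for four individuals
employment = ["employed", "unemployed", "employed", "employed"]

d_employed = make_dummy(employment, "employed")
```

The resulting list can then be used as a regressor Xi in the same way as a numerical variable.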
Name three ways to test a hypothesis.
- T-ratio
- P-value
- Confidence interval