Week 11: Core Skills In Regression Flashcards
What is regression known as?
Conditional Expectation Function = E(Y|X)
What is a conditional expectation function?
It tells us the expected (predicted) value of Y for some set of X variables.
Which variables do we include when using regression as a predictor?
All the variables, regardless of their statistical significance
What is another main use of regression?
To find marginal effects
Describe a marginal effect.
The impact of a one-unit change in X on E(Y|X).
What is the marginal effect in linear regression?
The coefficients on the variables.
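A minimal sketch in R, using data simulated purely for illustration: the coefficient on x is the estimated marginal effect of x on E(Y|X).

# Illustrative simulated data where the true marginal effect of x is 2
set.seed(1)
x <- rnorm(1000)
y <- 1 + 2 * x + rnorm(1000)
fit <- lm(y ~ x)
coef(fit)["x"]   # estimated coefficient on x = estimated marginal effect (close to 2)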
What is differentiation used for in statistics?
- Compute marginal effects from regressions
- Find the minimum or maximum point of mathematical functions (see the worked example below)
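For example, if E(Y|X) = β0 + β1x + β2x², differentiating with respect to x gives the marginal effect dE(Y|X)/dx = β1 + 2β2x, and setting this derivative to zero shows that the function reaches its minimum or maximum at x = -β1 / (2β2).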
Define ‘estimand’.
The unknown parameter(s) that we aim to estimate [e.g. E(Y)]
Define ‘estimator’.
Functions of sample data which we use to learn about
the estimands.
[e.g. the sample mean estimator (1/n) Σ_{i=1}^{n} y_i]
Define ‘estimate’.
Particular values of estimators that are realised in a given sample dataset.
[e.g. the sample mean µˆ computed from our particular dataset]
What are the estimands in regression?
The βs, the true population coefficients.
What are the estimators in regression?
The OLS estimator: the formula that computes the coefficients from the sample data.
What are the estimates in regression?
The βˆs, the estimated coefficients from our
regression.
Why is there uncertainty in statistics?
Due to the process of sampling, we observe only one of the many possible samples from the full population, so our sample mean or regression coefficient is an imprecise estimate of the true population estimand.
Why is the sampling distribution of the estimator important?
The sampling distribution of the estimator shows the probability of different estimates over repeated samples.
List the 4 sampling distribution facts.
- Mean is the true β
- Normally distributed (Central Limit Theorem)
- Can estimate its variance from the sample variance
- The standard error is its standard deviation
How do we create a sampling distribution in R?
By using simulation
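A minimal sketch, assuming we want the sampling distribution of a sample mean (the population mean of 5, sd of 2 and sample size of 100 are illustrative choices):

# Simulate 1000 repeated samples and store the sample mean from each
set.seed(123)
sample.means <- replicate(1000, mean(rnorm(100, mean = 5, sd = 2)))
hist(sample.means)   # approximately normal, centred near the true mean
sd(sample.means)     # the standard error of the sample mean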
How do we approximate the sampling distribution?
By calculating the standard error (the standard deviation of the sampling distribution).
Define the sample variance.
An unbiased estimator of the variance of the true sampling distribution of any coefficient βˆⱼ from a multiple regression.
When do we say β is statistically significant?
If, under the null hypothesis, an estimate at least as extreme would occur only 5% of the time or less, e.g. when |t| > 1.96.
What does the standard error tell us?
The bigger it is, the more uncertainty we have about the true value of β.
When do we reject the null hypothesis?
With α = 0.05, reject the null hypothesis when |t| > 1.96, i.e. when the p-value is below 0.05.
When do we fail to reject the null hypothesis?
If the 95% confidence interval contains 0, we cannot reject the null hypothesis.
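A brief illustration of these checks in R, assuming fit is a model already estimated with lm():

summary(fit)$coefficients   # includes the t value and Pr(>|t|) for each coefficient
confint(fit, level = 0.95)  # if an interval contains 0, we cannot reject the null for that coefficient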
What is the Pseudo-Bayesian Approach?
An approach that directly simulates the sampling distribution of the coefficients and uses the simulated draws themselves, rather than only an analytical standard error, to quantify uncertainty.
How would we take n draws in R?
rnorm(n,mean=,sd=)
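For instance, with purely illustrative argument values:

draws <- rnorm(1000, mean = 0, sd = 1)   # 1000 draws from a normal with mean 0 and sd 1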
When is simulation most useful?
When we want to show uncertainty about a
function of the coefficients such as the predicted outcome for a given set of X variables.
Write the 5 steps to using simulation.
- Estimate a regression model
- Create n simulations of the coefficients using the multivariate normal distribution [in R, use the sim() function in the arm package]
- For each of the n simulated coefficient vectors, calculate the function required, storing the results
- The 95% confidence interval runs from the 0.025 to the 0.975 quantile of the vector from (3) [95% of the simulated values lie within it]
- The standard error is the standard deviation of the vector from (3) [see the R sketch below]
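A minimal sketch of the five steps in R, assuming a model with four predictors; the data here are simulated purely for illustration, and the arm package supplies sim().

library(arm)
set.seed(42)

# Hypothetical data: four predictors and an outcome (illustrative only)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 1 + 0.5 * dat$x1 - 0.3 * dat$x2 + 0.2 * dat$x3 + 0.1 * dat$x4 + rnorm(n)

# 1. Estimate a regression model
fit <- lm(y ~ x1 + x2 + x3 + x4, data = dat)

# 2. Create 1000 simulations of the coefficients
sims <- sim(fit, n.sims = 1000)
coefs <- coef(sims)                           # 1000 x 5 matrix: each row is one simulation

# 3. For each simulated coefficient vector, calculate the function required:
#    here, the predicted outcome at a chosen (illustrative) set of X values
values <- c(1, 0, 1, 2, 0.5)                  # intercept term plus the four X values
pred.outcomes <- as.vector(coefs %*% values)  # 1000 simulated predicted outcomes

# 4. 95% confidence interval: the 0.025 and 0.975 quantiles of the simulated predictions
quantile(pred.outcomes, c(0.025, 0.975))

# 5. Standard error: the standard deviation of the simulated predictions
sd(pred.outcomes)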
What does the command %*% do in R?
Carries out matrix multiplication, e.g. (pred.outcomes <- values %*% t(coefs))
What is coefs?
A matrix with 1000 rows and five columns (each row is a simulation).
What is values?
A vector of X values used for prediction: 1 row and five columns (one entry per coefficient).
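Putting the dimensions together (continuing the hypothetical sketch above, where sims comes from arm's sim()):

coefs  <- coef(sims)                            # 1000 x 5: one row per simulated coefficient vector
values <- matrix(c(1, 0, 1, 2, 0.5), nrow = 1)  # 1 x 5: the X values used for prediction
pred.outcomes <- values %*% t(coefs)            # (1 x 5) %*% (5 x 1000) = 1 x 1000 predicted outcomes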