Descriptive Analysis and Linear Regression Flashcards
Linear Regression Model
Yi = B1 + B2X2i + ... + BkXki + ui
Yi = dependent variable
Xi = explanatory/independent/regressor
B1 = intercept/constant (average value of Y when X = 0)
B2 = slope coefficient
ui
stochastic error term
average effect of all unobserved variables
objective of regression analysis
estimate values of Bs based on sample data
OLS
Ordinary Least Squares - used to estimate regression coefficients
finds the values of B1 and B2 (estimated as b1 and b2) that minimise the RSS
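A minimal Python sketch of this idea, using simulated data (variable names and the numpy dependency are illustrative assumptions, not part of the flashcards):

```python
import numpy as np

# Simulated data for illustration: true model Y = 2 + 0.5*X + u
rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)

# Two-variable OLS formulas: b2 = cov(X, Y) / var(X), b1 = ybar - b2 * xbar
b2 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b1 = y.mean() - b2 * x.mean()

# These estimates minimise the residual sum of squares (RSS)
rss = np.sum((y - b1 - b2 * x) ** 2)
print(b1, b2, rss)
```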
OLS assumptions
- LRM is linear in its parameters
- regressors = fixed/non-stochastic
- exogeneity - expected value of error term = 0 given values of X
- homoscedasticity - constant variance of each u given values of X
- no multicollinearity - no linear relationship between regressors
- u follows normal distribution
OLS estimators are BLUE
best linear unbiased estimators
- estimators are linear functions of Y
- on average they are equal to the true parameter values i.e. unbiased
- they have minimum variance i.e. efficient
standard deviation of error term =
standard error
= sqrt(RSS/df) (the estimated error variance is RSS/df)
n-k
degrees of freedom
n = sample size
k = no. of regressors
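A short sketch computing the standard error of the regression from RSS and the degrees of freedom (statsmodels and the simulated data are assumptions for illustration):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data (illustrative); k = 2 coefficients incl. the constant
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)

res = sm.OLS(y, sm.add_constant(x)).fit()
rss = res.ssr                    # residual sum of squares
df = res.df_resid                # degrees of freedom = n - k
sigma_hat = np.sqrt(rss / df)    # standard error of the regression
print(df, sigma_hat)
```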
hypothesis testing
construct H0 and Ha, e.g. H0: B2 = 0 and Ha: B2 ≠ 0
t = b2/se(b2)
if |t| > critical value from the t-table
reject null
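A hedged sketch of the t-test on a slope coefficient, again on simulated data (scipy/statsmodels assumed):

```python
import statsmodels.api as sm
import numpy as np
from scipy import stats

# Simulated data (illustrative): test H0: B2 = 0 against Ha: B2 != 0
rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)

res = sm.OLS(y, sm.add_constant(x)).fit()
b2, se_b2 = res.params[1], res.bse[1]

t_stat = b2 / se_b2                                  # t = b2 / se(b2)
cv = stats.t.ppf(0.975, df=res.df_resid)             # two-sided 5% critical value
p_value = 2 * stats.t.sf(abs(t_stat), df=res.df_resid)

print(t_stat, cv, p_value)   # reject H0 if |t| > cv (equivalently, p < 0.05)
```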
type 1 error
incorrect rejection of true null
detecting an effect that is not present
type 2 error
failure to reject false null
failing to detect present effect
low p-value
suggests that the estimated coefficient is statistically significant
p-value < 0.01, 0.05, 0.1
statistically significant at 1%, 5%, 10% levels
dummy variables
0 = absence 1 = presence
e.g 1 if female, 0 if male
B2 measures the change in the dependent variable when going from male to female
b1 = estimated wage for men
b2 = estimated diff btw men and women
b1+b2 = estimated wage for women
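A minimal sketch of the wage/gender dummy example with simulated data (the numbers and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated wage data (illustrative): female is a 0/1 dummy
rng = np.random.default_rng(3)
n = 200
female = rng.integers(0, 2, n)
wage = 20 - 3 * female + rng.normal(0, 2, n)
df = pd.DataFrame({"wage": wage, "female": female})

res = smf.ols("wage ~ female", data=df).fit()
b1, b2 = res.params["Intercept"], res.params["female"]
print(b1)        # estimated average wage for men (female = 0)
print(b2)        # estimated difference between women and men
print(b1 + b2)   # estimated average wage for women (female = 1)
```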
if exogeneity assumption doesn’t hold
leads to biased estimates, therefore we need to adjust for omitted variables
quadratic terms
capture increasing/decreasing marginal effects
have to generate a new variable and add it to regression
marginal effect
first derivative of the regression function wrt the variable of interest
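A sketch of a quadratic term and its marginal effect, assuming a simulated wage/experience example (names and numbers are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative): diminishing marginal effect of experience on wage
rng = np.random.default_rng(4)
n = 300
exper = rng.uniform(0, 40, n)
wage = 10 + 0.8 * exper - 0.01 * exper**2 + rng.normal(0, 2, n)

# Generate the new squared variable and add it to the regression
df = pd.DataFrame({"wage": wage, "exper": exper, "exper2": exper**2})
res = smf.ols("wage ~ exper + exper2", data=df).fit()
b2, b3 = res.params["exper"], res.params["exper2"]

# Marginal effect = d(wage)/d(exper) = b2 + 2*b3*exper, evaluated at exper = 10
print(b2 + 2 * b3 * 10)
```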
interaction variable
constructed by multiplying two regressors
allows the magnitude of the effect X has on Y to vary depending on the level of another X
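A sketch of an interaction between a continuous regressor and a dummy, on simulated data (the education/gender setup is an illustrative assumption):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative): does the return to education differ by gender?
rng = np.random.default_rng(5)
n = 300
educ = rng.uniform(8, 20, n)
female = rng.integers(0, 2, n)
wage = 5 + 1.5 * educ - 2 * female - 0.5 * educ * female + rng.normal(0, 2, n)
df = pd.DataFrame({"wage": wage, "educ": educ, "female": female})

# The interaction variable is the product of the two regressors: educ * female
res = smf.ols("wage ~ educ + female + educ:female", data=df).fit()
b_educ = res.params["educ"]
b_int = res.params["educ:female"]

print(b_educ)          # effect of one more year of education for men
print(b_educ + b_int)  # effect of one more year of education for women
```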
interpreting
how does the regression function respond to a change in a variable
if the model is not linear (log-log)
transform into a log-log model so that it is linear in parameters
take logs of both sides and add an error term
log-lin model
dependent variable in logs – %
explanatory variables in levels – units
B2 measures relative change in output Q for an absolute change in input
lin-log model
estimates % growth in dependent variable for an absolute change in explanatory variable
lin-lin model
using a linear production function
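A sketch comparing the four functional forms on a simulated production example (output Q, input L; the data-generating process is an assumption for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated production data (illustrative): Q = 5 * L^0.7 * multiplicative error
rng = np.random.default_rng(6)
n = 200
L = rng.uniform(1, 100, n)
Q = 5 * L**0.7 * np.exp(rng.normal(0, 0.1, n))
df = pd.DataFrame({"Q": Q, "L": L, "lnQ": np.log(Q), "lnL": np.log(L)})

loglog = smf.ols("lnQ ~ lnL", data=df).fit()  # B2 = elasticity: % change in Q for a 1% change in L
loglin = smf.ols("lnQ ~ L", data=df).fit()    # B2 = relative change in Q for a unit change in L
linlog = smf.ols("Q ~ lnL", data=df).fit()    # B2/100 = absolute change in Q for a 1% change in L
linlin = smf.ols("Q ~ L", data=df).fit()      # B2 = absolute change in Q for a unit change in L

print(loglog.params["lnL"])   # should be close to the true elasticity of 0.7
```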
testing for linear combinations
se – t-stat – compare to critical value – create p-value – reject/don’t reject null
TSS
total sum of squares = ESS + RSS
sum of squared deviations from the sample mean = how well we could predict the outcome w/o any regressors
ESS
explained sum of squares = how much of that variation do our regressors predict
RSS
residual sum of squares = outcome variation that regressors don’t explain
R^2
ESS/TSS
overall measure of goodness-of-fit of the estimated regression line
how much of variation is explained by regressors
increases when you add more regressors
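A short sketch of the TSS = ESS + RSS decomposition and R^2, on simulated data (statsmodels attribute names are the assumed tooling, not part of the cards):

```python
import numpy as np
import statsmodels.api as sm

# Simulated data (illustrative)
rng = np.random.default_rng(7)
n = 100
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)

res = sm.OLS(y, sm.add_constant(x)).fit()

tss = res.centered_tss   # total sum of squares
ess = res.ess            # explained sum of squares
rss = res.ssr            # residual sum of squares

print(np.isclose(tss, ess + rss))   # TSS = ESS + RSS
print(ess / tss, res.rsquared)      # R^2 = ESS/TSS
```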
F-stat
tests the joint significance of all slope coefficients
(ESS/(k-1)) / (RSS/(n-k))
>critical value =reject null
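A sketch of the F-test computed by hand and checked against the packaged value (simulated data; scipy/statsmodels assumed):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Simulated data (illustrative) with two slope regressors, so k = 3 incl. the constant
rng = np.random.default_rng(8)
n = 100
X = rng.uniform(0, 10, size=(n, 2))
y = 1 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 1, n)

res = sm.OLS(y, sm.add_constant(X)).fit()
k = res.df_model + 1                              # number of estimated coefficients

F = (res.ess / (k - 1)) / (res.ssr / (n - k))     # (ESS/(k-1)) / (RSS/(n-k))
cv = stats.f.ppf(0.95, dfn=k - 1, dfd=n - k)      # 5% critical value

print(F, res.fvalue)   # matches statsmodels' F-statistic
print(F > cv)          # True -> reject H0 that all slope coefficients are zero
```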
dummy variable trap
a situation of perfect multicollinearity
to distinguish btw m categories we can only have m-1 dummies
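A minimal sketch of keeping m-1 dummies (pandas is an assumed tool; the category names are made up):

```python
import pandas as pd

# Illustrative data: a categorical variable with m = 3 categories
df = pd.DataFrame({"region": ["north", "south", "west", "south", "north", "west"]})

# drop_first=True keeps m - 1 = 2 dummies; including all 3 dummies plus a
# constant would create perfect multicollinearity (the dummy variable trap)
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies.head())
```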
perfect collinearity
perfect linear relationship between two or more regressors
one regressor can be exactly predicted from the others
imperfect collinearity
one explanatory variable approximately equals a linear combination of the other explanatory variables plus a small error term
consequences of multicollinearity in the data
larger standard errors – smaller t-ratio – wider CI – less likely to reject null
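A simulated illustration of the standard-error inflation (the data-generating process is an assumption chosen to make the point visible):

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustration: highly correlated regressors inflate standard errors
rng = np.random.default_rng(9)
n = 200

def se_of_b2(noise_sd):
    x1 = rng.normal(0, 1, n)
    x2 = x1 + rng.normal(0, noise_sd, n)   # smaller sd -> stronger collinearity
    y = 1 + 0.5 * x1 + 0.5 * x2 + rng.normal(0, 1, n)
    X = sm.add_constant(np.column_stack([x1, x2]))
    return sm.OLS(y, X).fit().bse[1]       # standard error on x1's coefficient

print(se_of_b2(1.0))    # weak collinearity: smaller standard error
print(se_of_b2(0.05))   # near-perfect collinearity: much larger standard error
```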
homoscedasticity
assumption that the error term has the same variance for all observations (doesn't always hold)
heteroscedasticity
error terms have unequal variances for different observations
consequences of heteroscedasticity
- OLS still consistent and unbiased
- se either too large or too small so t-stats, F-stats, p-values etc will be wrong
- OLS no longer efficient
dealing with heteroscedasticity
- use log transformation
- keep using OLS and compute heteroscedasticity-robust standard errors
- weighted least squares
using a logarithmic transformation of the outcome variable
e.g. ln(wage) - these variables tend to have more variance at higher values
continuing to use OLS and computing heteroscedasticity-robust standard errors
regress y on x
corrects the se to allow for heteroscedasticity
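A sketch of robust standard errors on simulated heteroscedastic data (the HC1 covariance choice and the data are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

# Simulated heteroscedastic data (illustrative): error variance grows with x
rng = np.random.default_rng(10)
n = 200
x = rng.uniform(1, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x, n)
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()                    # conventional standard errors
robust = sm.OLS(y, X).fit(cov_type="HC1")   # heteroscedasticity-robust standard errors

print(ols.bse[1], robust.bse[1])   # same coefficients, corrected standard errors
```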
weighted least squares
more efficient than OLS in the presence of heteroscedasticity
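A sketch of weighted least squares, assuming the error variance is known up to scale (a strong illustrative assumption):

```python
import numpy as np
import statsmodels.api as sm

# Simulated heteroscedastic data (illustrative): var(u|x) proportional to x^2
rng = np.random.default_rng(11)
n = 200
x = rng.uniform(1, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x, n)
X = sm.add_constant(x)

# Weight each observation by the inverse of its (assumed) error variance: 1/x^2
wls = sm.WLS(y, X, weights=1 / x**2).fit()
ols = sm.OLS(y, X).fit()

print(ols.bse[1], wls.bse[1])   # WLS is more efficient when the weights are right
```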
omission of relevant variables
they’ll be captured by the error term
if they are correlated with the included regressors, then the parameter estimates are biased and the exogeneity assumption doesn't hold
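A simulated illustration of omitted variable bias (the education/ability setup and all numbers are assumptions for the sketch):

```python
import numpy as np
import statsmodels.api as sm

# Simulated illustration: omitting a regressor that is correlated with an included one
rng = np.random.default_rng(12)
n = 500
ability = rng.normal(0, 1, n)
educ = 12 + 2 * ability + rng.normal(0, 1, n)        # educ correlated with ability
wage = 10 + 1.0 * educ + 3.0 * ability + rng.normal(0, 1, n)

full = sm.OLS(wage, sm.add_constant(np.column_stack([educ, ability]))).fit()
short = sm.OLS(wage, sm.add_constant(educ)).fit()    # ability omitted -> absorbed by the error term

print(full.params[1])    # close to the true effect of 1.0
print(short.params[1])   # biased upward because ability is correlated with educ
```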