RM TEST 4 Flashcards
- Explain why we automatically expect the presence of heteroskedasticity in the linear probability model.
All linear probability models are heteroskedastic by construction. The actual values of y_i are either 0 or 1, while the fitted values are probabilities anywhere between 0 and 1 (and sometimes even outside that range), so for a given x_i the residual can take only two values, 1 - p(x_i) or -p(x_i). Its variance is therefore Var(u_i | x_i) = p(x_i) * (1 - p(x_i)), which changes with the regressors: that is heteroskedasticity by definition.
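A minimal simulation with invented data makes this visible: the squared residuals from an LPM fit track the theoretical error variance p(1 - p), which changes with the fitted probability.

set.seed(1)
x <- runif(500)
y <- rbinom(500, size = 1, prob = 0.2 + 0.6 * x)   # data generated by a true LPM
fit <- lm(y ~ x)                                   # linear probability model
plot(fitted(fit), resid(fit)^2,
     xlab = "fitted probability", ylab = "squared residual")
curve(x * (1 - x), add = TRUE, col = "red")        # theoretical variance p(1 - p)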
- 1) What is the weighted least squares method?
2) Under what conditions do we consider using it instead of OLS?
3) What is the relationship between the weight of the ith observation and the variance of the ith random error?
1) Weighted least squares (WLS) is a generalization of ordinary least squares in which knowledge of the variance of the observations is incorporated into the regression.
2) Weighted linear regression should be used when the observation errors do not have a constant variance, i.e. when they violate the homoscedasticity requirement of linear regression.
3) The weight assigned to the ith observation is proportional to the inverse of the variance of the ith random error.
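A minimal WLS sketch in R with invented data; here the error variance is proportional to x, so (per point 3) the weights are proportional to 1/x:

set.seed(1)
x <- runif(200, 1, 10)
y <- 2 + 3 * x + rnorm(200, sd = sqrt(x))   # error variance grows with x
fit.wls <- lm(y ~ x, weights = 1 / x)       # weight_i proportional to 1 / Var(u_i)
summary(fit.wls)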
- Provide the formula for the logistic function and its inverse.
What is the name commonly used to refer to this inverse function?
Logistic function:
f(x) = 1 / (1 + e^-x)
Inverse function: (Logit)
logit(p) = ln( p / (1 - p))
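Both functions are built into base R: plogis() is the logistic function and qlogis() its inverse, the logit.

plogis(0)           # 0.5: logistic function 1 / (1 + e^-x) at x = 0
qlogis(0.5)         # 0: logit ln(p / (1 - p)) at p = 0.5
qlogis(plogis(2))   # 2, confirming the two are inverses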
- What is the impact of heteroskedasticity on the OLS estimator in linear regression? What are the implications for statistical inference?
The OLS coefficient estimates remain unbiased and consistent, so the main effects of heteroskedasticity concern the standard errors of the OLS estimator.
The usual standard errors (s.e.) are biased.
This invalidates everything that is calculated from the s.e.: t-tests and confidence intervals become unreliable.
F-statistics no longer follow the F (Fisher) distribution, and neither LM tests nor Wald tests work.
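A standard remedy keeps the OLS estimates but replaces the usual standard errors with heteroskedasticity-robust (White) ones. A minimal sketch, assuming a hypothetical fitted model fit and the sandwich and lmtest packages:

library(lmtest)
library(sandwich)
fit <- lm(y ~ x1 + x2, data = d)                 # hypothetical model and data
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # t-tests with robust s.e.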
- The following regression output was obtained in R using a cross-sectional data set on 327 used cars (Škodas) containing the variables price (price of a used car in CZK), km (kilometres travelled), age (age in years), and a categorical variable model with three levels: Felicia, Octavia and Superb.
Predict the price of a used Škoda Felicia that is 10 years old and has travelled 100,000 km.
lm(formula = log(price) ~ km + age + model, data = skoda)
Residuals:
Min 1Q Median 3Q Max
-0.46400 -0.13219 0.00324 0.10674 0.99017
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.667097877 0.044934932 281.899 < 2e-16 ***
km -0.000001221 0.000000287 -4.255 0.0000275 ***
age -0.124266214 0.007410084 -16.770 < 2e-16 ***
modelOctavia 0.580380316 0.027288705 21.268 < 2e-16 ***
modelSuperb 1.072510923 0.055222034 19.422 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1857 on 323 degrees of freedom
Multiple R-squared: 0.9196,Adjusted R-squared: 0.9186
F-statistic: 923.9 on 4 and 323 DF, p-value: < 2.2e-16
log(price) = 12.667097877 + (-0.000001221 * 100,000) + (-0.124266214 * 10) + (0.580380316 * 0) + (1.072510923 * 0)
log(price) = 12.667097877 - 0.1221 - 1.24266214
log(price) ≈ 11.302335737
To get the predicted price, we need to exponentiate the result:
price = exp(11.302335737) ≈ 81010.6 CZK
So, the predicted price of a used Škoda Felicia that is 10 years old and has travelled 100,000 km is approximately 81,011 CZK.
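The same prediction can be reproduced with predict() (a sketch; it assumes the fitted model object from the output above is stored as fit):

fit <- lm(log(price) ~ km + age + model, data = skoda)
new <- data.frame(km = 100000, age = 10, model = "Felicia")
exp(predict(fit, newdata = new))   # back-transform the log prediction, about 81011

Note that exponentiating a log prediction estimates the median rather than the mean of the price distribution; a mean prediction would additionally scale by an estimate of E[exp(u)].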
- Briefly describe the Breusch-Pagan and the White heteroskedasticity tests. What are the auxiliary regressions run in these tests? What is the null hypothesis of the test?
1) The Breusch-Pagan and White heteroskedasticity tests are used to check if the variance of the errors in a regression model is constant.
The only difference between White's test and the Breusch-Pagan test lies in the auxiliary regression: White's version also includes the squares and cross-products of the regressors, which the basic Breusch-Pagan auxiliary regression omits. Other than that, the steps are exactly the same.
2) The auxiliary regressions run in the tests are used to model the squared residuals as a function of the independent variables.
3) The null hypothesis of both tests is
H0: Variance is constant.
Alternative: The variance is not constant (heteroskedasticity)
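Both tests are available via bptest() in the lmtest package. A sketch, assuming a hypothetical model with regressors x1 and x2:

library(lmtest)
fit <- lm(y ~ x1 + x2, data = d)                      # hypothetical model and data
bptest(fit)                                           # Breusch-Pagan
bptest(fit, ~ x1 * x2 + I(x1^2) + I(x2^2), data = d)  # White: add squares and cross-products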
- Describe the procedure for obtaining LRM standard error estimates using the nonparametric bootstrap.
1) Resample. Create B bootstrap samples by sampling with replacement from the original data {r_1, ..., r_T}. Each bootstrap sample has T observations (the same number as the original sample).
2) Estimate θ. From each bootstrap sample b estimate θ and denote the resulting estimate θ(hat)*_b. There will be B such values: {θ(hat)*_1, ..., θ(hat)*_B}.
3) Compute statistics. Using {θ(hat)*_1, ..., θ(hat)*_B}, compute estimates of the bias, the standard error, and an approximate 95% confidence interval.
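A minimal sketch of the procedure in R, using invented data and the coefficients of a simple LRM as θ:

set.seed(1)
n <- 100; B <- 1000
d <- data.frame(x = runif(n))
d$y <- 1 + 2 * d$x + rnorm(n)
boot.est <- replicate(B, {
  db <- d[sample(n, replace = TRUE), ]            # resample T = n rows with replacement
  coef(lm(y ~ x, data = db))                      # re-estimate theta on the bootstrap sample
})
apply(boot.est, 1, sd)                            # bootstrap standard errors
t(apply(boot.est, 1, quantile, c(0.025, 0.975)))  # approximate 95% percentile CIs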
- What is the impact of heteroskedasticity on the statistical properties of the OLS estimator for the LRM, as well as on the t-statistics of individual LRM coefficients?
Heteroskedasticity does not bias the OLS coefficient estimates, but OLS is no longer efficient (not BLUE): the estimates are less precise and tend to fall further from the true population parameters. It also makes the usual standard error formula incorrect, which renders the t-statistics and p-values misleading.
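A small simulation (invented design) shows the practical consequence: with a true slope of zero but error variance that depends on x, the usual t-test rejects far more often than its nominal 5% level.

set.seed(1)
rej <- replicate(2000, {
  x <- rnorm(100)
  y <- rnorm(100, mean = 1, sd = exp(x))  # true slope on x is 0; variance depends on x
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"] < 0.05
})
mean(rej)  # rejection rate well above the nominal 0.05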
- Consider a sequence of random variables a_1, a_2, ... such that a_n -> N(0, 1) as n -> infinity. What type of convergence of random variables does this correspond to? Give an example of a sequence of random variables that satisfies this condition, but where no member a_n (for any finite n) has the distribution N(0, 1); by this condition, we exclude trivial sequences like a_n ~ N(0, 1) for all n >= 100.
This type of convergence is called convergence in distribution (weak convergence).
Example: let X_1, X_2, ... be i.i.d. random variables taking the values +1 and -1 with probability 1/2 each, and define a_n = (X_1 + ... + X_n) / n^0.5. Since E(X_i) = 0 and Var(X_i) = 1, the central limit theorem gives a_n -> N(0, 1) in distribution as n -> infinity. Yet for every finite n, a_n is a discrete random variable taking at most n + 1 values, so no member of the sequence has exactly the distribution N(0, 1).
(A single rescaled variable such as X_1 / n^0.5 would not work: its variance 1/n tends to 0, so that sequence converges to the constant 0, not to N(0, 1).)
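A quick simulation of the example (n = 1000 is an arbitrary choice):

set.seed(1)
a_n <- replicate(10000, sum(sample(c(-1, 1), 1000, replace = TRUE)) / sqrt(1000))
qqnorm(a_n); qqline(a_n)   # points lie close to the N(0, 1) reference line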
- The following regression output was obtained in R using a cross-sectional data set on 327 used cars (Škodas) containing the variables price (price of a used car in CZK), km (kilometres travelled), age (age in years), and a categorical variable model with three levels: Felicia, Octavia and Superb.
Find the 95% prediction interval for the price of a used Škoda Felicia that is 10 years old and has travelled 100,000 km.
lm(formula = price ~ I(km - 100000) + I(age - 10) + model, data = skoda)
Residuals:
Min 1Q Median 3Q Max
-175751 -46147 -6131 33287 523797
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33879.8598 11201.3571 3.025 0.00269 **
I(km - 100000) -0.5218 0.1163 -4.487 0.0000101 ***
I(age - 10) -29973.2684 3002.9421 -9.981 < 2e-16 ***
modelOctavia 112914.6631 11058.7678 10.210 < 2e-16 ***
modelSuperb 429508.6965 22378.7699 19.193 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 75240 on 323 degrees of freedom
Multiple R-squared: 0.8499,Adjusted R-squared: 0.8481
F-statistic: 457.4 on 4 and 323 DF, p-value: < 2.2e-16
Predicted price = 33879.8598 + (-0.5218 x 0) + (-29973.2684 x 0) = 33879.8598 CZK
t-value: from the t-distribution with 323 degrees of freedom, the two-sided 95% critical value is approximately 1.967.
Because the regressors are centred exactly at the prediction point (I(km - 100000), I(age - 10), with Felicia as the reference level of model), the intercept is the fitted value for this car, and the intercept's standard error (11201.3571) is the standard error of that fitted value. Note that the simple-regression interval formula with (x - x̄)^2 terms does not apply here, since the model has several regressors.
The prediction interval therefore combines the residual variance with the variance of the fitted value:
Prediction interval = 33879.86 ± 1.967 × √(75240^2 + 11201.3571^2) = 33879.86 ± 1.967 × 76069.2 ≈ 33879.86 ± 149628 ≈ (-115748, 183508) CZK.
Therefore, the 95% prediction interval for the price of a used Škoda Felicia that is 10 years old and has travelled 100,000 km is approximately (-115,748, 183,508) CZK. The negative lower bound is a symptom of modelling the raw price linearly; the log-price model from the earlier question avoids it.
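R computes the same interval directly (a sketch; it assumes the fitted model object from the output above is stored as fit):

fit <- lm(price ~ I(km - 100000) + I(age - 10) + model, data = skoda)
new <- data.frame(km = 100000, age = 10, model = "Felicia")
predict(fit, newdata = new, interval = "prediction", level = 0.95)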
- Variables in the cross-sectional data used to obtain the output below have the following meanings: entrepreneur = 1 if the respondent operates as a self-employed person or as a (co-)owner of a company (dummy variable), female = 1 for women and 0 for men, age is the respondent´s age in years.
Interpret the obtained regression coefficients.
lm(formula = entrepreneur ~ female + age, data = gem, subset = I(vek^2))
Residuals:
Min 1Q Median 3Q Max
-0.09017 -0.06370 -0.00922 0.00652 0.91985
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0472387 0.0157281 3.003 0.00274 **
female -0.0802351 0.0106628 -7.525 1.19e-13 ***
age 0.0007155 0.0003325 2.152 0.03163 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1563 on 979 degrees of freedom
(1018 observations deleted due to missingness)
Multiple R-squared: 0.05514,Adjusted R-squared: 0.05321
F-statistic: 28.56 on 2 and 979 DF, p-value: 8.773e-13
entrepreneur = 0.0472387 - 0.0802351 * female + 0.0007155 * age + error
The intercept of 0.0472387 represents the predicted probability of being an entrepreneur when female = 0 (male) and age = 0 (an extrapolation, since no respondents of age 0 exist in the data).
The coefficient for the female variable is -0.0802351, which indicates that being female is negatively associated with the probability of being an entrepreneur. Specifically, going from male to female lowers the predicted probability of being an entrepreneur by 0.0802351, i.e. by about 8.0 percentage points.
The coefficient for the age variable is 0.0007155, indicating that age is positively associated with the probability of being an entrepreneur. Specifically, each additional year of age raises the predicted probability of being an entrepreneur by 0.0007155, i.e. by about 0.07 percentage points.
The p-values for both coefficients are significant (p<0.05), suggesting that both gender and age have a significant impact on the likelihood of being an entrepreneur.
The model has a low adjusted R-squared value of 0.05321, indicating that only a small proportion of the variation in the outcome variable (entrepreneur) can be explained by the model.
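Plugging illustrative values into the estimated equation shows the magnitudes involved, and also exposes a well-known weakness of the LPM: predicted probabilities can fall outside [0, 1].

0.0472387 - 0.0802351 * 0 + 0.0007155 * 40  # man aged 40:   about 0.076
0.0472387 - 0.0802351 * 1 + 0.0007155 * 40  # woman aged 40: about -0.004, below 0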
- Using logistic regression in R, we estimate the probability that a randomly selected adult is starting a new business (variable startup coded as 1 = yes, 0 = no), depending on the following characteristics: self.efficacy asks about the respondent´s confidence in their own entrepreneurial skills/abilities, with possible values of Yes and No; age is for the age of the respondent in years; female = 1 for women and 0 for men; year is a variable distinguishing data from 2011 from data from 2006 (values 2006, 2011).
By how much approximately will the respondent’s odds ratio change with one year of age (with other explanatory variables held constant)?
e^(-0.0101) - 1 ≈ -0.01
The odds decrease by approximately 1%.
exp(-0.0101) ≈ 0.9899
This means that for each one-year increase in age, the odds of starting a new business are multiplied by a factor of about 0.9899, a fall of approximately 1%. In other words, holding all other explanatory variables constant, a one-year increase in age is associated with roughly a 1% decrease in the odds of starting a new business.
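The same arithmetic in R:

exp(-0.0101)              # 0.98995: odds multiplied by about 0.99 per extra year
100 * (exp(-0.0101) - 1)  # about -1, i.e. roughly a 1% fall in the odds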
- Using logistic regression in R, we estimate the probability that a randomly selected adult is starting a new business (variable startup coded as 1 = yes, 0 = no), depending on the characteristics listed in the previous question. Given that the estimated probability of starting a new business in 2006 is 0.25 and the estimated coefficient of year2011 is 1.2, what is the corresponding probability in 2011?
Odds in 2006 = Probability of success / (1 - Probability of success) = 0.25 / (1 - 0.25) = 0.25 / 0.75 = 1/3 ≈ 0.3333
Odds in 2011 = Odds in 2006 × year2011 odds ratio = 0.3333 × e^1.2 ≈ 0.3333 × 3.3201 ≈ 1.1067
Probability of success in 2011 = Odds in 2011 / (1 + Odds in 2011) = 1.1067 / (1 + 1.1067) ≈ 0.5253
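The whole chain in R:

odds06 <- 0.25 / 0.75        # odds in 2006
odds11 <- odds06 * exp(1.2)  # multiplied by the year2011 odds ratio
odds11 / (1 + odds11)        # probability in 2011: about 0.525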