Quantitative Methods Flashcards
In linear regression what is the confidence interval for the Y value
CI = Y_predicted +/- (t_critical x SE of forecast)
What does the t-test evaluate
Statistical significance of an individual parameter in the regression
What does the F-test evaluate
The effectiveness of the complete model (all independent variables jointly) in explaining Y
Is the dependent variable X or Y in a linear regression
Y
Explain what it means to say a “critical t-stat is distributed with n-k-1 degrees of freedom”
This is the critical t value against which the computed t-statistic is compared.
It is taken from the standard t-table using the chosen significance level and n-k-1 degrees of freedom (n observations, k slope coefficients, minus 1 for the intercept).
What expression does the line of best fit for a linear regression minimise
Sum of the squared errors between Y actual and Y predicted.
What is the SSE of a linear regression
Sum of the squared residuals
Sum of the squared errors between Y actual and Y predicted.
What is the first of six classic normal linear regression assumptions, concerning parameter independence
- The relationship between Y and X is linear in the parameters:
(1a) the parameters are not raised to powers other than 1, and
(1b) the parameters are separate, not functions of other parameters.
(X itself can be raised to powers other than 1.)
What is the second of six classic normal linear regression assumptions, concerning X, the independent variable
X is NOT RANDOM
X is not correlated with the Residuals
(note that Y can be correlated with the residuals)
Describe the relationship between “total variation of dependent variable” and “explained variation of dependent variable”
Total variation is the variation of the observed Y values around their mean.
Explained variation is the variation of the model-predicted Y values around that same mean.
The explained variation is the part of the total variation accounted for by the regression model.
Explain covariance X and Y
It is the sum of the cross products of the deviations of X and Y from their means,
Divided by n-1
Cov(X,Y) = Sum[(X - X_mean)(Y - Y_mean)] / (n-1)
What is the correlation coefficient of X,Y
It is Cov(X,Y) divided by the product of the sample standard deviations of X and Y: r = Cov(X,Y) / (s_X x s_Y)
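A minimal NumPy sketch of the two formulas above, using hypothetical x and y samples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical sample of X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical sample of Y
n = len(x)

# Sample covariance: sum of cross products of deviations from the means, over (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation: covariance divided by the product of the sample standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)  # matches np.cov(x, y)[0, 1] and np.corrcoef(x, y)[0, 1]
```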
For the error term of a linear regression what are the assumptions concerning correlation and variance
- Errors are uncorrelated
- Variance is the same for any observation
What 3 criteria must be satisfied for sample correlation coefficient to be valid
- The mean of X and Y is finite and constant
- The variance of X and Y is finite and constant
- The covariance between X and Y is finite and constant
Recall: Correl = Cov(X,Y) / (s_X x s_Y)
What is the t-statistic compared with?
How is it calculated
t statistic is compared with t-critical from tables
t-stat =
(b1 estimated - b1 hypothesised under the null) / (SE of b1 estimated)
When the hypothesised b1 = 0, t = b1_est / SE(b1_est)
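As a sketch, with a hypothetical slope estimate, standard error, and sample size:

```python
from scipy import stats

b1_est, se_b1 = 1.95, 0.42  # hypothetical slope estimate and its standard error
n = 30                      # hypothetical number of observations

t_stat = (b1_est - 0) / se_b1                 # null hypothesis: b1 = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # two-tailed, 5% significance

print(t_stat, t_crit, abs(t_stat) > t_crit)   # True => reject the null
```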
What is the similarity of an F-test with a t test in a simple regression
In a simple regression, the F-test is equivalent to the t-test of the slope coefficient (F = t squared)
Define “dependent variable”
The variable Y whose variation is explained by the independent variable, X.
Give three other names for the dependent variable.
Explained variable
Endogenous variable
Predicted variable
Define the “Independent variable”
The variable used to explain the dependent variable.
Give three other names for the Independent variable.
Explanatory variable
Exogenous variable
Predicting variable
What is the second of six classic normal linear regression assumptions, concerning the Independent variable and the residuals
The independent variable X is uncorrelated with the residuals
(note Y can be correlated with the residuals)
X must not be random
What is the third of six classic normal linear regression assumptions, concerning the expected value of the residual
The expected value of the residual=zero
[E(ε) = 0].
What is the fourth of six classic normal linear regression assumptions, concerning the variance of the residual
The variance of the residual is constant for all observations
Homoskedasticity.
NO HETEROSKEDASTICITY, e.g. where the residuals become more or less noisy across observations
What is the fifth of six classic normal linear regression assumptions, concerning the distribution of residual values
The Residuals are not correlated with each other (this means they are independently distributed)
e.g. NO SERIAL CORRELATION
What is the sixth of six classic normal linear regression assumptions, concerning the distribution of residual values
The distribution of the residuals is a normal distribution (with mean zero, per assumption 3)
Explain what the slope b1 is for a simple linear regression?
What is the expression for this slope coefficient in terms of variation of X and Y?
It is the change in Y due to a 1 unit change in X
b1=cov(X,Y)/var(X)
From a simple linear regression
Express the intercept b0
Express the slope b1
Y=b0 + b1.X
b0 = Y_mean - b1.X_mean
b1 is the slope =
Cov(X,Y)/Var(X)
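A short NumPy sketch of these two expressions, with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope = Cov(X,Y) / Var(X)
b0 = y.mean() - b1 * x.mean()                # intercept = Y_mean - b1 * X_mean

print(b0, b1)  # agrees with np.polyfit(x, y, 1)
```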
What is the covariance of x with itself, Cov(X,X)
Var (X)
For the SSE and the SEE
- What is the same?
- What is different?
- “E” in both is the error of the estimate = residual
Same: both are built from the residuals; SEE is a function of SSE.
Different: SSE is a sum of squares; SEE is a standard deviation.
SSE uses the sum of the squared residuals.
SEE uses the standard deviation of the residuals = sqrt[(SSE)/(n-2)].
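A quick sketch of both quantities from hypothetical actual and fitted Y values:

```python
import numpy as np

# Hypothetical actual and predicted Y values from a fitted simple regression
y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
n = len(y_actual)

sse = np.sum((y_actual - y_pred) ** 2)  # SSE: sum of squared residuals
see = np.sqrt(sse / (n - 2))            # SEE: standard deviation of the residuals

print(sse, see)
```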
What does SEE gauge?
Give two other names for this
Fit of the linear regression:
- Standard deviation of the residuals (the standard error of the estimate)
- Standard Error of the regression
For what type of regression will SEE be low
For a good fit, strong relationship between the Y and X variables
The standard deviation of the residuals will be low
For what type of regression will SEE be high
Low fit, weak relationship between variables X and Y
This means standard deviation of residuals will be high
What does the coefficient of determination show
R squared
(Explained variation of Y) / (Total variation of Y)
Describe sample Covariance
Covariance (X,Y) = Sum (X- Xmean)(Y- Ymean)/(n-1)
Describe sample variance
Sample Variance (X) =[Sum (X- Xmean)squared /(n-1)]
Which three conditions are necessary for valid correlation coefficient
- Mean of X and Y is finite and constant
- Variance of X and Y is finite and constant
- Covariance (X,Y) must be finite and constant
How is SEE calculated
Standard deviation of residuals
sqrt [SSE/(n-2)]
What is R squared?
What does it mean?
Coefficient of determination.
It is the explained variation as a percentage of the total variation of the dependent variable, i.e. the % of total variation that is explained by the independent variable(s).
R squared = 65% means (explained variation of Y) / (total variation of Y) = 0.65
How can R squared quickly be calculated for a simple linear regression with one independent variable?
R squared= r (correlation x,y) squared
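A one-line check of this shortcut, with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

r = np.corrcoef(x, y)[0, 1]  # sample correlation of X and Y
print(r ** 2)                # R squared for a one-variable regression
```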
What does the confidence interval of a regression coefficient show?
What is the test based on?
Whether the coefficient is statistically significant or not.
The test is based upon whether the coefficient is “statistically different from zero”.
If the coefficient is zero, that variable should not be in the regression because it is unrelated to Y.
How to show a coefficient is statistically different from Zero.
Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 35 samples
bi +/- (t_crit × SE_bi)
t_crit is obtained from student t where
Two tailed significance = 0.05
df = 35-2 = 33
If zero falls within the range, fail to reject the null hypothesis; otherwise
bi is statistically different from zero
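A sketch of this test, assuming a hypothetical coefficient estimate and standard error:

```python
from scipy import stats

bi, se_bi = 0.76, 0.31                         # hypothetical coefficient and its standard error
t_crit = stats.t.ppf(1 - 0.05 / 2, df=35 - 2)  # two-tailed 5% significance, df = 33

lower, upper = bi - t_crit * se_bi, bi + t_crit * se_bi
print((lower, upper), not (lower <= 0 <= upper))  # True => bi statistically different from zero
```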
How to show the true value of a coefficient is not Zero and that X explains Y
Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 36 samples
Compare estimated b1 with hypothetical b1=0
Null hypothesis is b1=0
Reject the null if t falls outside the range
- t_critical to + t_critical, i.e.
t_b1 < - t_crit or
t_b1 > + t_crit
t_b1 = (b1 - 0) / (SE of b1)
t_crit is taken from the t-table with df = 36 - 2 = 34 and two-tailed significance = 0.05
What is the df for error terms relative to number of observations for:
- Parameter estimate
- Predicted Y
For both, the degrees of freedom are adjusted for the number of parameters = number of slope coefficients plus the intercept.
For a simple regression, df = (n-2)
What is the null and alternative hypothesis for intercept term, b0
Hnull: b0=0
Ha: b0<>0
Explain R squared as function of explained, unexplained and total variation
R_squared = (explained variation) / (total variation)
= RSS/SST
R_squared = (Total variation - Unexplained variation) / (total variation)
=(SST-SSE)/SST
R_squared
=1-(unexplained/total)
= 1-(SSE/SST)
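A sketch verifying the decomposition on a small OLS fit (hypothetical data; SST = RSS + SSE holds exactly for an OLS fit with an intercept):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x                     # OLS fitted values

sst = np.sum((y - y.mean()) ** 2)        # total variation
rss = np.sum((y_pred - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_pred) ** 2)          # unexplained variation

print(rss / sst, 1 - sse / sst)          # both equal R squared
```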
Describe SSE
The SSE is the sum of all the Unexplained Variation
Sum of all the squared residuals (Y actual - Y predicted)
Describe the Total Variation
How else is it known?
This is the sum of all squared differences between actual Y and the mean of all Y = Sum (Y_actual - Y_mean) squared
SST
= explained (RSS) + unexplained (SSE)
Describe the explained variation
What else is it called?
This is the sum of the squared differences of predicted Y from mean of Y
Sum (Y_predicted - Y_mean) squared
RSS = Regression Sum of Squares = explained variation
How does the slope coefficient explain correlation between two variables
It does not. This is a trick question: the slope shares the sign of the correlation, but its magnitude depends on the scales of X and Y, so it does not measure correlation.
Explain how to calculate CI around a predicted Y
CI pred Y
= pred Y +/-
(Sf x t_crit)
Two tailed because it is either side of pred Y
Sf is Standard Error of the Forecast Pred Y
If the standard error of predicted Y is not given, what values are needed to calculate it
- n observations
- SEE (standard deviation of residuals)
- Variance and mean of X
- Xi for Predicted Y
Derive sf (standard error of forecast Y) using all of
- SEE,
- variance X
- Xi
- X mean
(Sf) squared =
SEE squared x [1+ 1/n + (Xi-X mean) squared /((n-1)× variance(X))]
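A sketch of this formula with hypothetical inputs, combined with the CI around predicted Y from the earlier card:

```python
import numpy as np
from scipy import stats

n, see = 30, 1.2          # hypothetical observations and standard error of estimate
x_mean, var_x = 4.0, 2.5  # hypothetical mean and sample variance of X
xi, y_pred = 6.0, 11.3    # hypothetical forecast point and predicted Y

sf = see * np.sqrt(1 + 1 / n + (xi - x_mean) ** 2 / ((n - 1) * var_x))
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

print(sf, (y_pred - t_crit * sf, y_pred + t_crit * sf))  # 95% CI around predicted Y
```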
Derive total variation of Y from
Unexplained + Explained
- Explained (RSS) = variation of Y_pred around Y_mean = Sum (Y_pred - Y_mean) squared
- Unexplained (SSE) = variation of actual Y around Y_pred = Sum (Y_actual - Y_pred) squared
Total variation SST = RSS + SSE
What is RSS
Regression Sum of Squares
The variation explained by the regression model
Sum (Y_pred - Y_mean) squared
What is SSE
The sum of the squared residuals
The part of the total variation of Yi around Y_mean that the model cannot explain (the part not captured by RSS)
Sum (Y_actual - Y_pred) squared
SSE=(MSE)x (n-k-1)
What is SST
It is the total variation of Y actual from Y mean
Sum (Y_actual - Y_mean) squared
SST= RSS + SSE
Calculate and interpret the standard error of the estimate (SEE).
SEE indicates certainty about predictions using the regression equation
It is the standard deviation of the residuals, derived from the SSE (the “sum of the squared residuals”): SEE = sqrt[SSE/(n-2)]
Calculate and interpret the coefficient of determination (R2).
R2 indicates confidence about estimates using the regression
It is the ratio of the variation “explained” by the model over the “total variation” of the observations against their mean (the variation due to the distribution of all the observations)
Describe the confidence interval for a regression coefficient, b1 pred
It is a range of values either side of the estimated coefficient, b1
C.I. = b1pred +/- (t_crit x standard error of b1 pred)
Formulate a null and alternative hypothesis about a population value of a regression coefficient and determine the appropriate test statistic and whether to reject the null hypothesis.
What part of the model effectiveness does F test determine
The effectiveness of the group of k independent variables
Explain MSE. What is the adjusted sample size? Explain SEE.
MSE = the sample mean of the squared residuals = SSE / (n-k-1)
The adjusted sample size = n-k-1
SEE = standard deviation of all the sampled residuals = sqrt(MSE)
What does a large F indicate
Good explanatory power
Why is F stat not often used for regressions with 1 independent variable?
The F-stat is the square of the t-stat, so rejection where F > F_crit implies the same conclusion as the t-test, |t| > t_crit
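A quick numerical check of the F = t squared equivalence (hypothetical t-statistic):

```python
from scipy import stats

t_stat, n = 2.4, 30                           # hypothetical slope t-stat and sample size

f_stat = t_stat ** 2                          # with 1 independent variable, F = t squared
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # 5% F critical value
t_crit = stats.t.ppf(0.975, df=n - 2)         # two-tailed 5% t critical value

print(f_crit, t_crit ** 2)                    # F_crit equals t_crit squared
print(f_stat > f_crit, abs(t_stat) > t_crit)  # same decision either way
```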
Outline limitations of simple linear regression
1. Parameter instability.
2. The standard six assumptions may not hold, particularly in the presence of heteroskedasticity and autocorrelation; both concern the reliability of the residuals.
3. Public knowledge limitation: widespread understanding of a relationship causes participants to act in ways that distort the relationship between the independent and dependent variables, so future use of the regression is compromised.
Note: multicollinearity is not a limitation of simple linear regression because it concerns correlation between variables (or functions of variables) in a multiple regression.
Compare Rsquared with F in terms of variation
R squared = explained variation / total variation
F = explained variation / unexplained variation, each divided by its degrees of freedom: F = (RSS/k) / (SSE/(n-k-1))
Explain the multiple regression null hypothesis and alternative hypothesis. How is this tested?
Null: all slope coefficients = zero.
Alternative: at least one slope coefficient is not zero.
Test: if the F statistic > F critical, reject the null; at least one slope coefficient is non-zero.
Explain adjusted R squared
R squared adjusted = 1 - (df_TSS / df_SSE)(1 - R squared), where df_TSS = n-1 and df_SSE = n-k-1
As k increases, df_SSE decreases
As k increases, df_TSS does not change
As k increases, (df_TSS / df_SSE) increases
So, for a given R squared, adjusted R squared decreases as k increases
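A minimal sketch of the formula, showing the penalty for a given R squared as k grows (hypothetical numbers):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R squared = 1 - (df_TSS / df_SSE) * (1 - R squared)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Same R squared, more regressors => lower adjusted R squared
print(adjusted_r_squared(0.65, n=40, k=2))
print(adjusted_r_squared(0.65, n=40, k=6))
```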
What are the drawbacks of multiple R squared
R squared increases (or at least never decreases) as independent variables are added, even variables with little real explanatory power, so it can overstate the fit of the model.
Can adj Rsquared be negative
Yes
Compare in 4 key points Rsquared with adjusted R squared
- Adj R squared always <= R squared
- R squared is always greater than adj R squared when k>0
- As useful variables are added, adjusted R squared initially increases, but it begins to decrease once additional variables add little explanatory power
- Where k=3 adjusted R squared is often max
Explain how dummy variables are evaluated by formulating the Hypothesis
- The omitted dummy variable is the reference class (remember Q4 not included in the regression equation example) so its implicit in the b0 which is always in the output.
- The hypothesis test applied to included dummy variables is whether or not they are statistically different to the reference class (in this case Q4)
- The slope coefficient for each included dummy represents the difference between that class and the omitted (reference) class
- So for Ho: b1 = 0, the class mean is b0 + b1, which equals b0 when b1 = 0; therefore Ho tests whether class 1 equals the reference class
- Ha: b1 <> 0 means class 1 differs from the reference class
If we fail to reject Ho (|t| <= t_crit), then b1 = 0, e.g. Q1 equals Q4 (the omitted dummy)
Which test does conditional heteroskedasticity make unreliable
The F-test (and also the t-tests of the coefficients, since the coefficient standard errors become unreliable)
What are the two types of serial correlation
- Positive serial correlation
- Negative serial correlation
What effects result from multicollinearity
- Slope coefficient estimates become unreliable
- The standard error of the slope coefficients, b_se, is higher than it should be
- The t-statistic (b / b_se) is lower than it should be
- Less likely to reject the null hypothesis that b = 0, since t is less likely to exceed t_crit
- Increase in Type II error
How do we detect multicollinearity
If the individual statistical significance of each slope coefficient is low, but the F-test and R squared indicate high overall significance, this is classic multicollinearity.
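A sketch of this pattern using statsmodels (my choice of library, not from the cards), with deliberately near-collinear regressors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly identical to x1: multicollinearity
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared, fit.f_pvalue)  # high R squared, highly significant F-test
print(fit.pvalues[1:])             # yet each slope is individually insignificant
```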
How do we correct for multicollinearity
Stepwise regression elimination of variables to minimise multicollinearity
Give 7 types of model misspecification