Quantitative Methods Flashcards

1
Q

In linear regression what is the confidence interval for the Y value

A

CI = Y_predicted +/- (t_critical x SE_forecast)

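As a quick check of this formula, here is a minimal Python sketch; the predicted Y, t-critical, and standard-error values below are purely illustrative assumptions, not values from the deck:

```python
def prediction_interval(y_pred, t_crit, se_forecast):
    """CI = Y_pred +/- t_crit * SE_forecast (two-tailed)."""
    half_width = t_crit * se_forecast
    return (y_pred - half_width, y_pred + half_width)

# Illustrative values: predicted Y = 10.0, t_crit = 2.0, SE of forecast = 1.5
lower, upper = prediction_interval(10.0, 2.0, 1.5)  # (7.0, 13.0)
```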
2
Q

What does the t-test evaluate

A

Statistical significance of an individual parameter in the regression

3
Q

What does the F-test evaluate

A

The effectiveness of the complete model to explain Y

4
Q

Is the dependent variable X or Y in a linear regression

A

Y

5
Q

Explain what it means to say a “critical t-stat is distributed with n-k-1 degrees of freedom”

A

This is the critical t value against which the t-statistic computed from the data is compared.

t-critical is read from the standard t-table for the chosen significance level with n-k-1 degrees of freedom, where n is the number of observations and k the number of slope coefficients (so n-2 for a simple regression).

6
Q

What expression does the line of best fit for a linear regression minimise

A

Sum of the squared errors (residuals) between the observed Y values and the Y values predicted by the regression.

7
Q

What is the SSE of a linear regression

A

Sum of the squared residuals, i.e. the squared errors between the observed Y values and the predicted Y values.

8
Q

What is the first of six classic normal linear regression assumptions, concerning parameter independence

A
  1. The relationship between Y and X is linear in the parameters:
    (1a) the parameters are not raised to powers other than 1, and
    (1b) the parameters are separate and not functions of other parameters.
  2. X itself can be raised to powers other than 1
9
Q

What is the second of six classic normal linear regression assumptions, concerning X, the independent variable

A

X is NOT RANDOM
X is not correlated with the Residuals
(note that Y can be correlated with the residuals)

10
Q

Describe the relationship between “total variation of dependent variable” and “explained variation of dependent variable”

A

Total variation is the sum of squared deviations of the observed Y values from the mean of Y.

Explained variation is the sum of squared deviations of the predicted Y values (from the regression model) from the mean of Y.

Explained variation is the part of total variation accounted for by the model.

11
Q

Explain covariance X and Y

A

It's the sum of the cross products of the deviations of X and Y from their means,

divided by n-1

Cov(X,Y) = Sum[(X - X_mean)(Y - Y_mean)] / (n-1)

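The sample covariance formula can be sketched in a few lines of Python; the data below are purely illustrative:

```python
def sample_covariance(x, y):
    """Cov(X,Y) = sum of cross products of deviations from the means / (n-1)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / (n - 1)

# Y is exactly 2X here, so Cov(X,Y) = 2 * Var(X) = 2.0
print(sample_covariance([1, 2, 3], [2, 4, 6]))  # 2.0
```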
12
Q

What is the correlation coefficient of X,Y

A

It's Cov(X,Y) divided by the product of the sample standard deviations of X and Y:

r = Cov(X,Y) / (s_X x s_Y)

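A minimal Python sketch of this ratio, with the covariance and standard deviations computed from scratch (illustrative data only):

```python
def sample_covariance(x, y):
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    return sum((a - xm) * (b - ym) for a, b in zip(x, y)) / (n - 1)

def sample_std(v):
    n = len(v)
    m = sum(v) / n
    return (sum((vi - m) ** 2 for vi in v) / (n - 1)) ** 0.5

def correlation(x, y):
    """r = Cov(X,Y) / (s_X * s_Y)."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))

# Y is an exact linear function of X, so r = 1
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```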
13
Q

For the error term of a linear regression what are the assumptions concerning correlation and variance

A
  1. Errors are uncorrelated
  2. Variance is the same for any observation
14
Q

What 3 criteria must be satisfied for sample correlation coefficient to be valid

A
  1. Mean of X and Y is finite and constant
  2. Variance of X and Y is finite and constant
  3. The covariance between X and Y is finite and constant

Re Correl = cov(X,Y)/(sX.sY)

15
Q

What is the t-statistic compared with?

How is it calculated

A

The t-statistic is compared with t-critical from the tables.

t-stat =
(b1 estimated - b1 hypothesized under the null) / (SE of b1 estimated)

When the hypothesized b1 = 0, t = b1_est / SE_b1_est

16
Q

What is the similarity of an F-test with a t test in a simple regression

A

F-test = t-test of the slope coefficient

17
Q

Define “dependent variable”

A

The variable Y whose variation is explained by the independent variable, X.

18
Q

Give three other names for the dependent variable.

A

Explained variable

Endogenous variable

Predicted variable

19
Q

Define the “Independent variable”

A

The variable used to explain the dependent variable.

20
Q

Give three other names for the Independent variable.

A

Explanatory variable
Exogenous variable
Predicting variable

21
Q

What is the second of six classic normal linear regression assumptions, concerning the Independent variable and the residuals

A

The independent variable X is uncorrelated with the residuals
(note Y can be correlated with the residuals)
X must not be random

22
Q

What is the third of six classic normal linear regression assumptions, concerning the expected value of the residual

A

The expected value of the residual=zero
[E(ε) = 0].

23
Q

What is the fourth of six classic normal linear regression assumptions, concerning the variance of the residual

A

The variance of the residual is constant for all observations (all values of X):
Homoskedasticity.
NO HETEROSKEDASTICITY, e.g. where the residuals get more or less noisy as X changes

24
Q

What is the fifth of six classic normal linear regression assumptions, concerning the distribution of residual values

A

The Residuals are not correlated with each other (this means they are independently distributed)
e.g. NO SERIAL CORRELATION

25
Q

What is the sixth of six classic normal linear regression assumptions, concerning the distribution of residual values

A

The distribution of the residuals is a normal distribution with mean zero.

26
Q

Explain what the slope b1 is for a simple linear regression?

What is the expression for this slope coefficient in terms of variation of X and Y?

A

It is the change in Y due to a 1 unit change in X

b1=cov(X,Y)/var(X)

27
Q

From a simple linear regression

Express the intercept b0

Express the slope b1

A

Y=b0 + b1.X

b0 = Y_mean - b1.X_mean

b1 is the slope =
Cov(X,Y)/Var(X)

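These two formulas can be sketched together in Python; the data below form an illustrative exact linear series (Y = 1 + 2X), so the recovered intercept and slope are known in advance:

```python
def ols_fit(x, y):
    """Simple OLS: b1 = Cov(X,Y)/Var(X), b0 = Y_mean - b1 * X_mean."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    cov = sum((a - xm) * (b - ym) for a, b in zip(x, y)) / (n - 1)
    var = sum((a - xm) ** 2 for a in x) / (n - 1)
    b1 = cov / var
    b0 = ym - b1 * xm
    return b0, b1

b0, b1 = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])  # Y = 1 + 2X exactly
print(b0, b1)  # 1.0 2.0
```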
28
Q

What is the covariance of x with itself, Cov(X,X)

A

Var (X)

29
Q

For the SSE and the SEE

  1. What is the same?
  2. What is different?
A
  1. “E” is error of the estimate = residual
    SEE is a function of SSE
  2. Sum of squares vs standard deviation

SSE uses the sum of the squared residuals

SEE uses the standard deviation of the residuals = sqrt[(SSE)/(n-2)].

30
Q

What does SEE gauge?
Give two other names for this

A

Fit of the linear regression:

  1. Standard deviation of the residuals (the standardized error)
  2. Standard Error of the regression
31
Q

For what type of regression will SEE be low

A

For a good fit, a strong relationship between the Y and X variables.

The standard deviation of the residuals will be low

32
Q

For what type of regression will SEE be high

A

For a poor fit, a weak relationship between variables X and Y.

This means the standard deviation of the residuals will be high

33
Q

What does the coefficient of determination show

A

R squared

(Explained variation of Y)/(Total variation of Y)

34
Q

Describe sample Covariance

A

Covariance (X,Y) = Sum (X- Xmean)(Y- Ymean)/(n-1)

35
Q

Describe sample variance

A
Sample Variance (X) 
= Sum[(X - X_mean) squared] / (n-1)
36
Q

Which three conditions are necessary for valid correlation coefficient

A
  1. Mean of X and Y is finite and constant
  2. Variance of X and Y is finite and constant
  3. Covariance (X,Y) must be finite and constant
37
Q

How is SEE calculated

A

Standard deviation of residuals

sqrt [SSE/(n-2)]

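A minimal Python sketch of this calculation, using illustrative residuals and k = 1 slope coefficient (so the divisor is n-2, as for a simple regression):

```python
def see(residuals, k=1):
    """SEE = sqrt(SSE / (n - k - 1)); for a simple regression k = 1."""
    sse = sum(e ** 2 for e in residuals)
    dof = len(residuals) - k - 1
    return (sse / dof) ** 0.5

# Illustrative residuals: SSE = 4, dof = 4 - 2 = 2, SEE = sqrt(2)
print(see([1.0, -1.0, 1.0, -1.0]))
```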
38
Q

What is R squared?
What does it mean?

A

Coefficient of determination.
It is the explained variation as a percentage of the total variation of the dependent variable, i.e. the % of total variation in Y that is explained by the independent variable(s).

R squared = 65% means (explained variation of Y)/(total variation of Y) = 0.65

39
Q

How can R squared quickly be calculated for a simple linear regression with one independent variable?

A

R squared= r (correlation x,y) squared

40
Q

What does the confidence interval of a regression coefficient show?

What is the test based on?

A

Whether the coefficient is statistically significant or not.

The test is based upon the coefficient not being zero, being “statistically different from zero”

If coefficient is zero that variable should not be in the regression because it is unrelated to Y.

41
Q

How to show a coefficient is statistically different from Zero.

Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 35 samples

A

bi +/- (t_crit × SE_bi)

t_crit is obtained from the student-t table where
Two-tailed significance = 0.05
df = 35-2 = 33

If zero falls within the range, fail to reject the null hypothesis; otherwise

bi is statistically different from zero

42
Q

How to show the true value of a coefficient is not Zero and that X explains Y

Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 36 samples

A

Compare estimated b1 with hypothesized b1 = 0

Null hypothesis is b1 = 0

Reject if t falls outside the range
-t_critical to +t_critical, i.e.

t_b1 < -t_crit or
t_b1 > +t_crit

t_b1 = (b1 - 0)/(SE b1)

t_crit from the t-table with
df = 36-2 = 34
Sig = 0.05 (two-tailed)

43
Q

What is the df for error terms relative to number of observations for:

  1. Parameter estimate
  2. Predicted Y
A

For both, the degrees of freedom are adjusted for the number of parameters = number of slope coefficients plus the intercept.

df = (n-2) for a simple regression

44
Q

What is the null and alternative hypothesis for intercept term, b0

A

Hnull: b0=0
Ha: b0<>0

45
Q

Explain R squared as function of explained, unexplained and total variation

A

R_squared = (explained variation) / (total variation)
= RSS/SST

R_squared = (Total variation - Unexplained variation) / (total variation)
=(SST-SSE)/SST

R_squared
=1-(unexplained/total)

= 1-(SSE/SST)

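The R squared decomposition above can be sketched in Python (illustrative data only):

```python
def r_squared(y_actual, y_pred):
    """R^2 = 1 - SSE/SST = explained / total variation."""
    y_mean = sum(y_actual) / len(y_actual)
    sst = sum((y - y_mean) ** 2 for y in y_actual)                  # total variation
    sse = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred))   # unexplained
    return 1 - sse / sst

print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect fit -> 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # predicting the mean -> 0.0
```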
46
Q

Describe SSE

A

The SSE is the sum of all the Unexplained Variation

Sum of all the squared residuals (Y actual - Y predicted)

47
Q

Describe the Total Variation

How else is it known?

A

This is the sum of all squared differences between actual Y and the mean of Y: Sum(Y actual - Y mean) squared

SST

= explained (RSS) + unexplained (SSE)

48
Q

Describe the explained variation
What else is it called?

A

This is the sum of the squared differences of predicted Y from mean of Y

Sum (Y_predicted - Y_mean) squared

RSS = Regression Sum of Squares (the explained variation)

49
Q

How does the slope coefficient explain correlation between two variables

A

It does not. This is a trick question

50
Q

Explain how to calculate CI around a predicted Y

A

CI pred Y
= pred Y +/-
(Sf x t_crit)

Two tailed because its either side of pred Y

Sf is Standard Error of the Forecast Pred Y

51
Q

If the standard error of predicted Y is not given, what values are needed to calculate it

A
  1. n observations
  2. SEE (standard deviation of residuals)
  3. Variance and mean of X
  4. Xi for Predicted Y
52
Q

Derive sf (standard error of forecast Y) using all of

  1. SEE,
  2. variance X
  3. Xi
  4. X mean
A

(Sf) squared =

SEE squared x [1+ 1/n + (Xi-X mean) squared /((n-1)× variance(X))]

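A minimal Python sketch of this formula; the inputs below are illustrative assumptions, and note that at Xi = X_mean the bracket collapses to 1 + 1/n:

```python
def se_forecast(see, n, x_i, x_mean, var_x):
    """sf^2 = SEE^2 * [1 + 1/n + (x_i - x_mean)^2 / ((n-1) * Var(X))]."""
    return see * (1 + 1 / n + (x_i - x_mean) ** 2 / ((n - 1) * var_x)) ** 0.5

# At x_i = x_mean with SEE = 1, n = 4: sf = sqrt(1 + 0.25) = sqrt(1.25)
print(se_forecast(1.0, 4, 2.0, 2.0, 1.0))
```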
53
Q

Derive total variation of Y from
Unexplained + Explained

A
  1. Explained = variation of Y pred around mean Y
  2. Unexplained = variation of actual Y around Y pred

Input formula 1 + 2 from above

54
Q

What is RSS

A

Regression Sum of Squares

The variation explained by the regression model

(Y pred-Ymean) squared

55
Q

What is SSE

A

The sum of the squared residuals

The part of the total variation of Yi around Y mean that the regression model cannot explain (the part not captured by RSS)

Sum(Y actual - Y pred) squared

SSE = MSE x (n-k-1)

56
Q

What is SST

A

It is the total variation of Y actual from Y mean

(Yactual - Ymean) squared

SST= RSS + SSE

57
Q

Calculate and interpret the standard error of the estimate (SEE).

A

SEE indicates the certainty of predictions made using the regression equation.

It is the standard deviation of the residuals: sqrt[SSE/(n-2)], where SSE is the sum of the squared residuals.

58
Q

Calculate and interpret the coefficient of determination (R2).

A

R2 indicates confidence about estimates using the regression

It is the ratio of the variation “explained” by the model over the “total variation” of the observations against their mean (the variation due to the distribution of all the observations)

59
Q

Describe the confidence interval for a regression coefficient, b1 pred

A

It is a range of values either side of the estimated coefficient, b1

C.I. = b1pred +/- (t_crit x standard error of b1 pred)

60
Q

Formulate a null and alternative hypothesis about a population value of a regression coefficient and determine the appropriate test statistic and whether to reject the null hypothesis.

A
61
Q

What part of the model effectiveness does F test determine

A

The effectiveness of the group of k independent variables

62
Q

Explain MSE. What is the adjusted sample size? Explain SEE.

A

MSE = the sample mean of the squared residuals = SSE/(n-k-1)

The adjusted sample size = n-k-1

SEE = standard deviation of the sampled residuals = sqrt(MSE)

63
Q

What does a large F indicate

A

Good explanation power

64
Q

Why is F stat not often used for regressions with 1 independent variable?

A

The F stat is the square of the t-stat, so rejecting where F > F_crit implies the same conclusion as the t-test rejecting where |t| > t_crit

65
Q

Outline limitations of simple linear regression

A
  1. Parameter instability
  2. The standard 6 assumptions do not hold, particularly in the presence of heteroskedasticity and autocorrelation; both concern the reliability of the residuals.
  3. Public knowledge limitation: widespread understanding causes participants to act in ways that distort the relationship between the independent and dependent variables, so future use of the regression is compromised.

Note: multicollinearity is not a limitation of simple linear regression because it concerns correlation between variables (or functions of variables) in a multiple regression.
66
Q

Compare Rsquared with F in terms of variation

A

Rsquared = Explained/Total variation

F = Explained/Unexplained variation (each adjusted for its degrees of freedom)

67
Q

Explain the multiple regression Null Hypothesis and Alternative Hypothesis. How is this tested?

A

Null: all slope coefficients = zero
Alternative: at least one slope coefficient is not zero

If F test > F crit, reject the Null Hypothesis and conclude at least one slope coefficient is non-zero

68
Q

Explain adjusted R squared

A

R squared adjusted = 1 - (df TSS / df SSE)(1 - R squared) = 1 - [(n-1)/(n-k-1)](1 - R squared)

As k increases, df SSE = n-k-1 decreases
As k increases, df TSS = n-1 does not change
As k increases, (df TSS / df SSE) increases
So as k increases, the penalty grows and adjusted R squared tends to decrease

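A minimal Python sketch of the adjustment, using illustrative values for R squared, n, and k:

```python
def adjusted_r_squared(r2, n, k):
    """Adj R^2 = 1 - [(n-1)/(n-k-1)] * (1 - R^2)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Illustrative: R^2 = 0.65 with n = 30 observations and k = 3 slope coefficients
print(adjusted_r_squared(0.65, 30, 3))
```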
69
Q

What are the drawbacks of multiple R squared

A

R squared never decreases as more independent variables are added, even if they are statistically insignificant, so it can overstate the model's explanatory power (hence adjusted R squared).

70
Q

Can adj Rsquared be negative

A

Yes

71
Q

Compare in 4 key points Rsquared with adjusted R squared

A
  1. Adj R squared always <= R squared
  2. R squared is always greater than adj R squared when k>0
  3. As k increases, adjusted R-squared increases but then begins to decrease
  4. Where k=3 adjusted R squared is often max
72
Q

Explain how dummy variables are evaluated by formulating the Hypothesis

A
  1. The omitted dummy variable is the reference class (remember Q4 not included in the regression equation example) so its implicit in the b0 which is always in the output.
  2. The hypothesis test applied to included dummy variables is whether or not they are statistically different to the reference class (in this case Q4)
  3. The slope coefficient for each included Dummy gives an output from the regression that represents a function of the included Dummy and the omitted dummy
  4. So Ho: b1 = 0 tests whether the included dummy's class mean (b0 + b1) equals the reference class mean (b0)
  5. Ha: b1 <> 0 means the class differs from the reference class

If we accept Ho (|t| <= t_crit) this means b1 = 0, e.g. Q1 equals Q4 (the omitted dummy)

73
Q

Which test does conditional heteroskedasticity make unreliable

A

F-test

74
Q

What are the two types of serial correlation

A

Positive and negative

75
Q

What effects result from multicollinearity

A

Slope coefficient estimates are unreliable

Standard Error of the slope coefficients, b_se, is higher than it should be

The t-stat (b / b_se) is therefore lower than it should be

So we are less likely to reject the null hypothesis that b=0, because the t-stat is less likely to exceed t_crit

This is an increase in Type II errors

76
Q

How do we detect multicollinearity

A

If the individual statistical significance of each slope coefficient is low but the F test and R squared indicate high significance, this is classic multicollinearity

77
Q

How do we correct for multicollinearity

A

Stepwise regression elimination of variables to minimise multicollinearity

78
Q

Give 7 types of model misspecification

A
79
Q

What is Unconditional Heteroskedasticity

A

Heteroskedasticity is the opposite of homoskedasticity (the level of variance of the residual is constant across all values of the independent variables)

Unconditional Heteroskedasticity is where the variance of the residuals has no pattern or relation to the level of the independent variable.

It does not cause a significant problem for the regression

80
Q

What is Conditional Heteroskedasticity?

A

The variance of the residual is related to the level (value) of the X variables.

Heteroskedasticity is the opposite of homoskedasticity (the level of variance of the residual is constant across all values of the independent variables)

Instead, with Conditional Heteroskedasticity, the variance of the residuals is linked to the values of the independent variables and is NOT constant.

For example variance of residuals could increase as the value of independent variables increase.

It does cause significant problems for a regression

81
Q

When is an AR(1) model a random walk

A

When b0 = 0 and b1 = 1

So AR(1): Xt = b0 + b1·Xt-1 + disturbance

becomes

Xt = Xt-1 + disturbance

82
Q

Discuss the mean-reverting level of a random walk

A

The mean-reverting level is = bo/(1-b1)

for a random walk, b0=0, b1=1

and the mean reverting level = b0/(1-b1)

= 0/0 = “undefined”.

83
Q

Give 7 assumptions for a multiple regression to be valid

A
  1. There must be a linear relationship between the dependent variable and the independent variables
  2. Independent variables are not random
  3. There is no linear relationship between the independent variables (e.g. one is not merely a function of the other)
  4. The expected value of the error term is zero
  5. The variance of the error term is constant
  6. The error terms are not correlated with each other
  7. The error terms are normally distributed
84
Q

What can cause heteroskedasticity?

A

When some samples are spread out more than others the variance of the residual changes.

85
Q

What are the implications of a random walk

A

The mean-reverting level, Xt = b0/(1-b1), is undefined (0/0), so the mean and variance grow with time, violating both simple and multiple regression assumptions.

Since b1 = 1, the time series has a unit root.

86
Q

Explain any t-test formulation in terms of estimated and hypothetical values

A

t = (estimated value - hypothetical value (or actual value)) / (standard error of the estimated variable)

standard error is the risk of the estimate being different to the actual value which is equal to the standard deviation of the error between the estimate and the actual value.

This is why things that affect the residual also affect the validity of the coefficient estimates.

87
Q

What is Beta 1 in a random walk?

A

It's the lag coefficient

88
Q

What type of errors are used for heteroskedasticity only

A

White corrected errors

89
Q

What type of errors are used for conditional heteroskedasticity and with seriel correlation

A

Hansen-White corrected errors

90
Q

What does generalized least squares attempt to correct

A

Conditional Heteroskedasticity

91
Q

What part of the regression does multicollinearity not affect

A

The slope coefficients themselves

92
Q

Discuss Hypothesis rationale

A

To prove that some factor is significant.

Formulate the model term as Beta x Factor; where Beta = zero, the factor drops out of the equation.

So a null hypothesis that Beta = 0 means the factor is insignificant.

Rejecting the null hypothesis where t > t_crit means Beta is non-zero, so the factor is significant.

93
Q

What is the practical effect of multicollinearity in hypothesis testing

A

Standard errors of the coefficients are inflated, so the t-stat is lower than it should be.

t-test statistic is smaller and so less likely to be greater than t crit

Less likely to reject a null hypothesis (that coefficient is not different to zero ho: b=0 and so b is not significant)

So more likely to conclude a variable is not significant when in fact it is significant

This is a type II error

94
Q

Explain a Type II error

A

Accepting a variable as insignificant when in fact it is significant, e.g. the t-stat is lower due to an artificially high SE

95
Q

Explain the rationale for detecting multicollinearity

A

The F test and R squared indicate high explanatory power but the individual coefficients do not. This happens when the independent variables are correlated, washing out individual effects while together explaining the model well.

96
Q

Explain what an unbiased estimator is

A

where the expected value for the parameter is equal to the actual value of the parameter

97
Q

Explain what a consistent estimator is

A

A consistent estimator is where the accuracy of the estimate increases as the sample size (n) increases

98
Q

Compare the problems between simple linear regression and multiple linear regression

A

Simple linear regression has only one independent variable but the problems are:

  1. Heteroskedasticity
  2. Serial Correlations

Multiple linear regression adds the problem of correlation between multiple independent variables:

  1. Multi-collinearity
99
Q

Why does conditional heteroskedasticity cause hypothesis testing errors

What type of Error?

A
  1. Standard error of the estimates is unreliable.

If SE is lower than it should be

  1. This means t-test is higher than it should be
  2. This means more likely to reject the hypothesis that beta is not significant
  3. This means more likely to consider a coefficient significant when in fact it is not significant
  4. This is a Type I error. (false positive - incorrectly rejects a true null hypothesis and concludes the beta is significant and not zero)

If SE is higher, since t-test will be lower, so more likely not to reject a false null hypothesis and to accept that beta is not significant.

  1. This is a Type II error (false negative - incorrectly accepts a false null hypothesis and concludes that beta is not significant, beta=0)
100
Q

What is the SEE? What is it approximately equal to? How is it related to MSE? How is it related to SSE?

A

SEE is the variation of the predicted Y around the regression line.

Approximately equal to the standard deviation of the residuals.

SEE = sqrt(MSE)
SEE = sqrt(SSE/dof), dof = n-k-1

101
Q

What does rejection of the F-test's Ho (all slopes = 0) mean?

A

Means at least one of the independent variables has statistically significant explanatory power (F > F crit)

102
Q

What is the ship “HMS regression” full of?

A

3 problems:
Heteroskedasticity
Multicollinearity
Serial correlation

103
Q

Which 3 problems are violations of the assumptions of multiple regression?

A
  1. Conditional Heteroskedasticity
  2. Multicollinearity
  3. Serial correlation
104
Q

What is the effect of heteroskedasticity

A

Conditional heteroskedasticity is worse because it is linked to the level (values) of the X variables.

The F test is unreliable.

The SEs around individual coefficients are unreliable (too large or too small):

t-stats will either be too large (SE too small), causing false rejection of Ho, i.e. concluding there is a statistical difference from zero when in truth there is not (false positive, Type I),

or too small (SE too large), causing false acceptance of Ho, i.e. concluding no significance when in truth there is (false negative, Type II).

105
Q

What are the two types of heteroskedasticity

A

Unconditional and Conditional

106
Q

What type of hypothesis error is more likely due to conditional heteroskedasticity?

A

Both Type 1 (reject Ho:b=0, when in reality it is true) and Type 2 (accept Ho: b=0, when in reality it is false)

107
Q

What is a Type 1 error?

A

False positive

False accepting difference

False rejecting similarity

Wrongly assume (t-tcrit)>0, positive, when it is not positive.

Wrongly assume t > tcrit

Wrongly reject H0 when it is really true and should be accepted

Wrongly assume there is a significant difference when really there is no significant difference

108
Q

What is a Type II error?

A

False-negative

False-accepting similarity

False- rejecting “no difference”

Wrongly assume (t-tcrit)<0, negative

Wrongly assume t < tcrit

Wrongly assume there is no difference (false negative) when really there is a difference

Wrongly accept H0 =X is true when it is really false

Wrongly accepting the “negative hypothesis” when it should be rejected because there is a difference (a positive).

109
Q

What is the “negative hypothesis”

A

The null hypothesis

“Not different” to the stated value

110
Q

What is a false positive?

A

Type I error

False rejection

Increased probability of rejecting H0 when it should be accepted

Wrongly accepting the positive and wrongly rejecting the negative.

Incorrectly assume there is a difference and the null hypothesis (negative) is wrong when in fact there is no difference and the null hypothesis is true.

Wrongly assume “positive” that (t > tcrit) when it is really false and (t ≤ tcrit) “negative” is actually true

Wrongly accepting (t - tcrit) > 0 “positive” as true when it is really false and negative

In reality (t - tcrit) ≤ 0 “negative” is true so the above is a “false positive”

This leads to (t > tcrit) rejecting Ho: b=0 when it should instead be accepted.

111
Q

What is a false negative?

A

Type II error

False acceptance

Increased probability of accepting H0 when it should be rejected

Accepting (t ≤ tcrit) as true when it is really false and (t > tcrit) is actually true

Wrongly accepting (t - tcrit) < 0 “negative” as true when it is really false

In reality (t - tcrit) ≥ 0 is true so the above is a “false negative”

This leads to accepting Ho: b=0 (t ≤ tcrit) when it is false, a false negative.

112
Q

What is the effect of serial correlation on hypothesis testing

A

Increased probability of Type 1 errors

113
Q

What is the effect of Multicollinearity on hypothesis testing

A

Increased probability of Type II errors

Increased probability of a false negative (t ≤ tcrit)

Increased probability of accepting H0

114
Q

What does the Breusch-Pagan test show

A

Heteroskedasticity

115
Q

What Rsquared is used in BP test

A

From a regression of the squared residuals (from the first regression) on the independent variables

116
Q

What is the BP stat equivalent for BP test of conditional heteroskedasticity

A
  1. Use the Chi-square BP critical value
  2. Degrees of freedom = k (one for a single independent variable)
  3. 5% one-tailed test

BP test stat = n x Rsquared (from the regression on the squared residuals)
117
Q

What is condition to reject null hypothesis for conditional heteroskedasticity

A

BP test stat > chi-square critical value. This rejects Ho of no conditional heteroskedasticity and concludes there is conditional heteroskedasticity.

118
Q

How to correct for conditional heteroskedasticity

A

Test the regression coefficients using: t_stat= coefficient / White-corrected SE. t crit from t tables with n-k-1 dof

119
Q

What is the effect of serial correlation on hypothesis testing

A
  1. Estimated SEs are smaller than actual
  2. The t-stat is therefore larger than reality
  3. Type I errors are more common (false positives)
  4. A false positive is where Ho is rejected too often
120
Q

What two methods are used to detect serial correlation

A
  1. Residual plots
  2. Durbin-Watson statistic
121
Q

Describe DW test

A

DW = 2(1-r), where r = the correlation between consecutive residuals
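A minimal Python sketch of the DW statistic computed directly from a residual series (illustrative residuals only; for a long series DW ≈ 2(1-r)):

```python
def durbin_watson(residuals):
    """DW = sum of squared successive differences / sum of squared residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Alternating residuals (negative serial correlation) push DW above 2;
# identical residuals (strong positive correlation) push it toward 0.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
print(durbin_watson([1.0, 1.0, 1.0]))         # 0.0
```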

122
Q

When is DW=2

A

When r = 0, i.e. when there is no serial correlation

123
Q

When is DW<2

A

When r is positive -> positive serial correlation

124
Q

When is DW>2

A

With negative serial correlation

125
Q

Explain the DW decision rule

A

Ho: No positive serial correlation

If DW stat < d_lower then reject Ho - conclude there is positive serial correlation

If DW stat > d_upper then fail to reject Ho - conclude there is no positive serial correlation

Inconclusive if d_lower < DW stat < d_upper

126
Q

Explain an autoregressive (AR) model.

A

An AR model regresses against prior periods of its own data series.

We drop notation of yt as the dependent variable and only use xt

A pth-order autoregression, AR(p), for x_t is: x_t = b0 + b1·x_(t-1) + b2·x_(t-2) + … + bp·x_(t-p)
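A minimal Python sketch of an AR(1) one-step forecast and its mean-reverting level, with the unit-root (random walk) case flagged; the coefficients below are illustrative:

```python
def ar1_forecast(b0, b1, x_prev):
    """One-step-ahead AR(1) forecast: x_t = b0 + b1 * x_(t-1)."""
    return b0 + b1 * x_prev

def mean_reverting_level(b0, b1):
    """x = b0 / (1 - b1); undefined when b1 = 1 (a random walk / unit root)."""
    if b1 == 1:
        raise ValueError("unit root: mean-reverting level is undefined")
    return b0 / (1 - b1)

print(ar1_forecast(1.0, 0.5, 4.0))        # 3.0
print(mean_reverting_level(1.0, 0.5))     # 2.0
```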

127
Q

Contrast random walk processes with covariance stationary processes.

A

Coefficient b1 = 1 (i.e., a unit root) implies nonstationarity: the series does not revert to a finite mean. Therefore, first difference a random walk with drift before using an AR model:

  • yt = xt − xt−1
  • yt = b0 + εt, b0 ≠ 0

A series with b1 > 1 is also nonstationary.
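
The differencing step can be sketched in numpy on a simulated random walk with drift (the drift b0 = 0.1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1000
drift = 0.1

# Random walk with drift: x_t = b0 + x_{t-1} + eps_t (unit root, nonstationary)
eps = rng.normal(size=n)
x = np.cumsum(drift + eps)

# First difference: y_t = x_t - x_{t-1} = b0 + eps_t, which is stationary
y = np.diff(x)
```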

128
Q

Calculate the predicted trend for a linear time series given the coefficients.

A

The independent variable in a linear trend changes at a constant rate with time:

  • yt = b0 + b1t + εt
  • where t = 1, 2, . . . , T
129
Q

Why are moving averages generally calculated?

A

To focus on the underlying trend by eliminating “noise” from a time series.

130
Q

Describe objectives, steps, and examples of preparing and wrangling data.

Unstructured: Text cleansing

A

Remove html tags: Most text data from web pages have html markup tags.

Remove punctuation: Most punctuation is unnecessary, but some marks may be useful for ML training.

Remove numbers: If numbers are in the text, they should be removed or substituted with an annotation /number/.

Remove white spaces: White spaces should be identified and removed to keep the text intact and clean.
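
The four cleansing steps above can be sketched with Python's re module (the sample sentence and regex patterns are illustrative, not a prescribed pipeline):

```python
import re

raw = "<p>Revenue grew 12% in 2023!!   See details...</p>"

text = re.sub(r"<[^>]+>", " ", raw)              # remove html tags
text = re.sub(r"\d+(\.\d+)?", "/number/", text)  # substitute numbers with an annotation
text = re.sub(r"[^\w/ ]", " ", text)             # remove punctuation (keep /number/ slashes)
text = re.sub(r"\s+", " ", text).strip()         # collapse white spaces
```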

131
Q

Identify the two sources of uncertainty when the regression model is used to make a prediction regarding the value of the dependent variable.

A
  1. The uncertainty inherent in the error term, ε.
  2. The uncertainty in the estimated parameters, b0 and b1.
132
Q

Calculate and interpret a confidence interval for the predicted value of the dependent variable.

A

Ypred ± tcrit × sf

sf = standard error of the forecast of Y
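
A numpy/scipy sketch of the forecast interval on simulated data, assuming the usual simple-regression forecast standard error sf² = SEE²[1 + 1/n + (Xf − Xmean)² / Σ(Xi − Xmean)²]:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 30
x = rng.normal(10, 2, n)
y = 3 + 0.8 * x + rng.normal(scale=1.0, size=n)

b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)
see = np.sqrt(np.sum(resid**2) / (n - 2))  # standard error of estimate, k = 1

x_f = 12.0
y_pred = b0 + b1 * x_f

# Standard error of the forecast
sf = see * np.sqrt(1 + 1 / n + (x_f - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

t_crit = stats.t.ppf(0.975, df=n - 2)  # 95% two-tailed
ci = (y_pred - t_crit * sf, y_pred + t_crit * sf)
```

Note that sf always exceeds SEE: the interval widens for the added parameter uncertainty and for forecasts far from the mean of X.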

133
Q

State and explain Step 2 in a data analysis project.

Step 2: Data collection

A

Data collection

Sourcing internal and external data;

Structuring the data in columns and rows for (Excel) tabular format.

134
Q

What is the OBJECTIVE of model training?

A

The objective of model training is to minimize forecasting errors.

135
Q

Explain Method Selection in model training

A

Method selection involves deciding which ML method(s) to use based on the classification task and type and size of data.

136
Q

Explain Performance Evaluation in model training

A

Performance evaluation uses complementary techniques to quantify and understand model performance.

137
Q

Explain Tuning in model training

A

Tuning seeks to improve model performance.

138
Q

Describe preparing, wrangling, and exploring text-based data for financial forecasting.

A

A corpus is any collection of raw text data, which can be organized into a table containing two columns.

The two columns are:

  1. (sentence) for text and
  2. (sentiment) for the corresponding sentiment class.

The separator character (@) splits the data into the text and sentiment class columns.

139
Q

Describe the two ways to determine whether a time series is covariance stationary.

A
  1. Examine whether the autocorrelations of the residuals are statistically significant at any lag.
  2. Conduct the Dickey-Fuller test for unit root (preferred approach).
140
Q

Describe objectives, methods, and examples of data exploration.

Feature engineering

A

Numbers are converted into a token such as “/number/.”

N-grams are discriminative multi-word patterns with their connection kept intact. For example, a bigram such as “stock market” treats the two adjacent words as one.

Named entity recognition (NER) algorithms analyze individual tokens and their surrounding semantics to tag an object class to the token.

Parts of speech (POS) uses language structure and dictionaries to tag every token with a corresponding part of speech. Some common POS tags are nouns, verbs, adjectives, and proper nouns.
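
The n-gram idea can be sketched in a few lines of Python (the tokens are illustrative):

```python
# Bigrams: adjacent token pairs kept intact as one feature,
# so "stock market" becomes the single token "stock_market"
tokens = ["the", "stock", "market", "rallied"]
bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
```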

141
Q

How is the out-of-sample forecasting performance of autoregressive models evaluated?

A

On the basis of their root mean square error (RMSE).

The RMSE for each model under consideration is calculated based on out-of-sample data.

The model with the lowest RMSE has the lowest forecast error and hence carries the most predictive power.
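
A minimal sketch of the RMSE comparison (the out-of-sample actuals and the two models' forecasts are made-up numbers):

```python
import numpy as np

# Hypothetical out-of-sample actuals and forecasts from two candidate AR models
actual = np.array([1.2, 0.8, 1.5, 1.1, 0.9])
model_a = np.array([1.0, 0.9, 1.4, 1.2, 1.0])
model_b = np.array([1.6, 0.3, 2.0, 0.6, 1.4])

def rmse(pred, act):
    """Root mean squared forecast error."""
    return np.sqrt(np.mean((pred - act) ** 2))

rmse_a = rmse(model_a, actual)
rmse_b = rmse(model_b, actual)
best = "A" if rmse_a < rmse_b else "B"  # lower RMSE -> more predictive power
```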

142
Q

Identify the two ways to correct for serial correlation in the regression residuals.

A
  • Hansen’s method - Adjusts standard errors for the coefficients.
    (a) The coefficients stay the same, but the standard errors change.
    (b) With positive serial correlation, the robust standard errors are larger.
  • Modify the regression equation to eliminate the serial correlation.
143
Q

What relationship does a linear regression postulate between the dependent and independent variables

A

The relationship is not necessarily causal (though it may be). Linear regression simply postulates a functional, i.e., associative, relationship between the variables.

144
Q

What does the coefficient of determination measure

A

The R-squared of the regression measures the proportion of the variance of the dependent variable explained by the independent variable.

145
Q

For a simple linear regression what is the correlation of X with Y given slope b1

A

r = ± the square root of R-squared. The sign is not given by R-squared; it is the sign of the slope coefficient b1.

146
Q

What is conditional heteroskedasticity

A

The variance of the residuals is related to the size of the independent variables

147
Q

What is multicollinearity

A

Two or more independent variables are correlated with each other

148
Q

Give 2 effects of multicollinearity

A
  1. Too many Type II errors (too often accepting Ho)
  2. Unreliable slope coefficients
149
Q

What is serial correlation

A

Correlation of one residual with the next

150
Q

Give two effects of positive serial correlation

A
  1. Too many Type I errors
  2. Slope coefficients still reliable
151
Q

What does it mean to say a slope coefficient is unbiased?

A

When the expected value of the estimate is equal to the true value of the parameter

152
Q

What are the 2 effects of model misspecifications

A
  1. Biased coefficients
  2. More Type II errors; cannot rely on hypothesis tests
153
Q

Describe a linear trend model in 4 points

A
  1. Uses time as an independent variable
  2. Yt = b0 + b1(t) + error
  3. Plagued by violations, like serial correlation (DW)
  4. Appropriate where data points are equally distributed above and below the line with constant mean, e.g., a mean-reverting percent-change time series
154
Q

What is the least likely result of model misspecification

A

Unbiased coefficients

155
Q

What are the six types of model misspecification in multiple regression

A
  1. Omitting a variable
  2. Not transforming a variable
  3. Incorrect data pooling
  4. Lagged dependent variable as independent variable
  5. Forecasting the past
  6. Independent variable that cannot be directly observed and is represented by a proxy with large error
156
Q

Give two examples of models that have a qualitative dependent variable

A
  1. Probit and logit
  2. Discriminant
157
Q

Does misspecification result from using a leading variable from a prior period?

A

No

158
Q

Describe using p value to reject Ho

A

P value < significance level

159
Q

What is Conditional Heteroskedasticity?

A

Variance of residual is related to the size of the independent variables

160
Q

How is conditional heteroskedasticity detected

A

Breusch-Pagan chi-square test: accept Ho (no conditional heteroskedasticity) if the BP stat, n × R², is less than the chi-square critical value.

161
Q

When is there conditional heteroskedasticity

A

When BP stat (n × R²) > chi-square crit

162
Q

What is a consistent estimator

A

Accuracy of parameter estimate increases as sample size increases

163
Q

What is the effect of conditional heteroskedasticity

A
  1. Too many Type I errors (false positives)
  2. Rejecting Ho: b = 0 when it is really true; accepting Ha when it is really false
  3. Because standard errors are underestimated
164
Q

Why is it harder to reject null under a two-tailed test than under a one tailed test

A

The rejection region at each side of the distribution is half of the size of the rejection region in a one tailed test

165
Q

What does covariance tell us

A

Whether X and Y move together directly or inversely (positive or negative association)

166
Q

What does it mean to say covariance is symmetric

A

The variation of x with y is the same as the variation of y with x: Cov(x, y) = Cov(y, x)

167
Q

What is the covariance of X with itself

A

Cov(x, x) = var(x)

168
Q

What are the implications of cov(x, y) =0

A

r = 0 and b1 = 0

169
Q

Explain sample covariance

A
  1. The expected value of X and Y is the mean of x and the mean of y (the best guess)
  2. For each sample, take the product of the deviations of X and Y from the mean of x and y
  3. Sum over i = 1 to n samples: (Xi − Xmean)(Yi − Ymean)
  4. Divide by n − 1

Or: Cov(X, Y) = r × sX × sY
170
Q

Give two limitations of covariance How are these resolved

A
  1. No helpful magnitude: covariance can range from negative to positive infinity
  2. Only gives the direction of the relationship (+ or −)
  3. Resolved by standardising by the standard deviations of x and y, which gives the correlation coefficient
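
Both routes to the sample covariance, and its link to correlation, can be checked in numpy (simulated data):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.8, size=200)

# Sample covariance: sum of cross products of deviations, divided by n - 1
n = len(x)
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Equivalent route: Cov(X, Y) = r * sX * sY
r = np.corrcoef(x, y)[0, 1]
cov_from_r = r * x.std(ddof=1) * y.std(ddof=1)
```

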
171
Q

What does a linear regression minimise

A

The sum of the squared residuals Min (SSE)

172
Q

Draw the SST triangle to obtain SSE and RSS

A

SST = RSS + SSE (total variation = explained variation + unexplained variation)

173
Q

If SEE goes down does the regression improve or deteriorate

A

Improves

174
Q

Give 3 components that make a confidence interval more narrow

A
  1. Lower SEE, because SE forecast is dominated by SEE
  2. Higher n, because n is in the denominator of a function of SEE
  3. Xforecast closer to Xmean
175
Q

Is critical t higher or lower for a two tailed test than a one tailed test

A

Higher for a two-tailed test, because the significance level is split between the two tails

176
Q

Is it more or less difficult to reject null hypothesis with a two tailed test or a one tailed test

A

More difficult to reject with a two tailed test because critical values are higher (the rejection region is split on two sides)

177
Q

What are the two sources of error in a simple linear regression

A

Estimates of b0 and b1

178
Q

Write a SLR in log-lin form

A

ln(Y) = b0 + b1X

179
Q

Write a SLR in lin-log form

A

Y = b0 + b1 ln(X)

180
Q

Write SLR in log-log form

A

ln(Y) = b0 + b1 ln(X)

181
Q

Describe Ypred in terms of a simple linear regression

A

Ypred = b0 + b1X1

182
Q

Describe an alternative way of calculating SEE using unexplained variation

A

Unexplained variation = sum of squared residuals = SSE, so SEE = sqrt[SSE / (n − 2)]

183
Q

How are SEE and SSE related?

A
  1. Both use the residuals
  2. SSE is the sum of all squared residuals
  3. SEE is the standard deviation of the residuals: SEE = sqrt[SSE / (n − k − 1)]
184
Q

Give two formula for variance of regression

A
  1. SSE / (n − k − 1)
  2. SEE squared
185
Q

What does R squared (coefficient of determination) describe?

A

The fraction of total variation explained by the regression: R² = 1 − SSE / SST = 1 − [sum squared (Ypred − Yact)] / [sum squared (Yact − Ymean)]

186
Q

What is the variance of Yact around Ymean

A

Total sum of squares (SST) = sum squared (Yact − Ymean)

187
Q

What is data wrangling What is the purpose

A

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another format, with the purpose of making it more appropriate and valuable for a variety of downstream purposes such as analytics.

188
Q

What is winsorization?

A

Cap and Floor to outliers
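
A winsorization sketch in numpy, assuming caps at the 5th and 95th percentiles (a common but arbitrary choice of cutoffs):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0, -50.0])

# Winsorize: floor at the 5th percentile, cap at the 95th percentile
lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)
```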

189
Q

What does standardisation of data require

A

Normal distribution

190
Q

What does standardisation do? How is it calculated?

A

Centers and rescales: (X − mean) / stdev

191
Q

Describe normalisation

A

Rescales between 0 and 1
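
Both rescalings can be sketched in numpy (the sample vector is arbitrary):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Standardisation: center on the mean, rescale by the standard deviation
standardized = (x - x.mean()) / x.std()

# Normalisation: rescale to the [0, 1] interval
normalized = (x - x.min()) / (x.max() - x.min())
```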

192
Q

What is a unigram

A

Single word token

193
Q

Does the numerator or denominator drive the sign of both the correlation and slope coefficients?

A

Because the denominators of both the slope and the correlation are positive, the sign of the slope and the correlation are driven by the numerator: If the covariance is positive, both the slope and the correlation are positive, and if the covariance is negative, both the slope and the correlation are negative.

194
Q

What is consistency of coefficients

A

The property that each estimated slope coefficient converges to its true value as the sample size increases

195
Q

The test-statistic to test whether the correlation is equal to zero is

A

t = r√(n − 2) / √(1 − r²), with n − 2 degrees of freedom
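
A quick numeric check of the formula, assuming a hypothetical sample with n = 32 and r = 0.45:

```python
import numpy as np
from scipy import stats

n, r = 32, 0.45

# t-stat for H0: correlation = 0, distributed with n - 2 degrees of freedom
t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

t_crit = stats.t.ppf(0.975, df=n - 2)  # 5% two-tailed
reject = abs(t_stat) > t_crit          # True -> correlation significantly nonzero
```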