Quantitative Methods Flashcards
In linear regression what is the confidence interval for the Y value
CI = Y_predicted +/- (t_critical x SE of forecast)
What does the t-test evaluate
Statistical significance of an individual parameter in the regression
What does the F-test evaluate
The effectiveness of the complete model (all independent variables jointly) in explaining Y
Is the dependent variable X or Y in a linear regression
Y
Explain what it means to say a “critical t-stat is distributed with n-k-1 degrees of freedom”
This is the critical t value against which the computed t-statistic is compared.
It is taken from the standard t-table using the chosen significance level and n-k-1 degrees of freedom (n observations, k slope coefficients, minus 1 for the intercept).
What expression does the line of best fit for a linear regression minimise
Sum of the squared errors between Y actual and Y predicted.
What is the SSE of a linear regression
Sum of the squared residuals
Sum of the squared errors between Y actual and Y predicted.
What is the first of six classic normal linear regression assumptions, concerning parameter independence
- The relationship between Y and X is linear in the parameters:
(1a) the parameters are not raised to powers other than 1, and
(1b) the parameters are separate, not functions of other parameters.
(X itself can be raised to powers other than 1.)
What is the second of six classic normal linear regression assumptions, concerning X, the independent variable
X is NOT RANDOM
X is not correlated with the Residuals
(note that Y can be correlated with the residuals)
Describe the relationship between “total variation of dependent variable” and “explained variation of dependent variable”
Total variation is the variation of the observed Y values around their mean.
Explained variation is the variation of the model-predicted Y values around that same mean.
The explained variation is the part of the total variation accounted for by the regression model.
Explain covariance X and Y
It is the sum of the cross products of the deviations of X and Y from their means,
Divided by n-1
Cov(X,Y) = Sum[(X - X_mean)(Y - Y_mean)] / (n-1)
What is the correlation coefficient of X,Y
It is Cov(X,Y) divided by the product of the sample standard deviations of X and Y: r = Cov(X,Y) / (s_X x s_Y)
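A minimal NumPy sketch of the two formulas above, using hypothetical x and y samples:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical sample of X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical sample of Y
n = len(x)

# Sample covariance: sum of cross products of deviations from the means, over (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Correlation: covariance divided by the product of the sample standard deviations
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

print(cov_xy, r)  # matches np.cov(x, y)[0, 1] and np.corrcoef(x, y)[0, 1]
```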
For the error term of a linear regression what are the assumptions concerning correlation and variance
- Errors are uncorrelated
- Variance is the same for any observation
What 3 criteria must be satisfied for sample correlation coefficient to be valid
- The mean of X and Y is finite and constant
- The variance of X and Y is finite and constant
- The covariance between X and Y is finite and constant
Recall: Correl = Cov(X,Y) / (s_X x s_Y)
What is the t-statistic compared with?
How is it calculated
t statistic is compared with t-critical from tables
t-stat =
(b1 estimated - b1 hypothesised under the null) / (SE of b1 estimated)
When the hypothesised b1 = 0, t = b1_est / SE(b1_est)
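As a sketch, with a hypothetical slope estimate, standard error, and sample size:

```python
from scipy import stats

b1_est, se_b1 = 1.95, 0.42  # hypothetical slope estimate and its standard error
n = 30                      # hypothetical number of observations

t_stat = (b1_est - 0) / se_b1                 # null hypothesis: b1 = 0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)  # two-tailed, 5% significance

print(t_stat, t_crit, abs(t_stat) > t_crit)   # True => reject the null
```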
What is the similarity of an F-test with a t test in a simple regression
In a simple regression, the F-test is equivalent to the t-test of the slope coefficient (F = t squared)
Define “dependent variable”
The variable Y whose variation is explained by the independent variable, X.
Give three other names for the dependent variable.
Explained variable
Endogenous variable
Predicted variable
Define the “Independent variable”
The variable used to explain the dependent variable.
Give three other names for the Independent variable.
Explanatory variable
Exogenous variable
Predicting variable
What is the second of six classic normal linear regression assumptions, concerning the Independent variable and the residuals
The independent variable X is uncorrelated with the residuals
(note Y can be correlated with the residuals)
X must not be random
What is the third of six classic normal linear regression assumptions, concerning the expected value of the residual
The expected value of the residual=zero
[E(ε) = 0].
What is the fourth of six classic normal linear regression assumptions, concerning the variance of the residual
The variance of the residual is constant for all observations
Homoskedasticity.
NO HETEROSKEDASTICITY, e.g. where the residuals become more or less noisy across observations
What is the fifth of six classic normal linear regression assumptions, concerning the distribution of residual values
The Residuals are not correlated with each other (this means they are independently distributed)
e.g. NO SERIAL CORRELATION
What is the sixth of six classic normal linear regression assumptions, concerning the distribution of residual values
The distribution of the residuals is a normal distribution (with mean zero, per assumption 3)
Explain what the slope b1 is for a simple linear regression?
What is the expression for this slope coefficient in terms of variation of X and Y?
It is the change in Y due to a 1 unit change in X
b1=cov(X,Y)/var(X)
From a simple linear regression
Express the intercept b0
Express the slope b1
Y=b0 + b1.X
b0 = Y_mean - b1.X_mean
b1 is the slope =
Cov(X,Y)/Var(X)
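A short NumPy sketch of these two expressions, with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # slope = Cov(X,Y) / Var(X)
b0 = y.mean() - b1 * x.mean()                # intercept = Y_mean - b1 * X_mean

print(b0, b1)  # agrees with np.polyfit(x, y, 1)
```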
What is the covariance of x with itself, Cov(X,X)
Var (X)
For the SSE and the SEE
- What is the same?
- What is different?
- “E” in both is the error of the estimate = residual
Same: both are built from the residuals; SEE is a function of SSE.
Different: SSE is a sum of squares; SEE is a standard deviation.
SSE uses the sum of the squared residuals.
SEE uses the standard deviation of the residuals = sqrt[(SSE)/(n-2)].
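A quick sketch of both quantities from hypothetical actual and fitted Y values:

```python
import numpy as np

# Hypothetical actual and predicted Y values from a fitted simple regression
y_actual = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
n = len(y_actual)

sse = np.sum((y_actual - y_pred) ** 2)  # SSE: sum of squared residuals
see = np.sqrt(sse / (n - 2))            # SEE: standard deviation of the residuals

print(sse, see)
```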
What does SEE gauge?
Give two other names for this
Fit of the linear regression:
- Standard deviation of the residuals (the standard error of the estimate)
- Standard Error of the regression
For what type of regression will SEE be low
For a good fit, strong relationship between the Y and X variables
The standard deviation of the residuals will be low
For what type of regression will SEE be high
Low fit, weak relationship between variables X and Y
This means standard deviation of residuals will be high
What does the coefficient of determination show
R squared
(Explained variation of Y) / (Total variation of Y)
Describe sample Covariance
Covariance (X,Y) = Sum (X- Xmean)(Y- Ymean)/(n-1)
Describe sample variance
Sample Variance (X) =[Sum (X- Xmean)squared /(n-1)]
Which three conditions are necessary for valid correlation coefficient
- Mean of X and Y is finite and constant
- Variance of X and Y is finite and constant
- Covariance (X,Y) must be finite and constant
How is SEE calculated
Standard deviation of residuals
sqrt [SSE/(n-2)]
What is R squared?
What does it mean?
Coefficient of determination.
It is the explained variation as a percentage of the total variation of the dependent variable, i.e. the % of total variation that is explained by the independent variable(s).
R squared = 65% means (explained variation of Y) / (total variation of Y) = 0.65
How can R squared quickly be calculated for a simple linear regression with one independent variable?
R squared= r (correlation x,y) squared
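A one-line check of this shortcut, with hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

r = np.corrcoef(x, y)[0, 1]  # sample correlation of X and Y
print(r ** 2)                # R squared for a one-variable regression
```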
What does the confidence interval of a regression coefficient show?
What is the test based on?
Whether the coefficient is statistically significant or not.
The test is based upon whether the coefficient is “statistically different from zero”.
If the coefficient is zero, that variable should not be in the regression because it is unrelated to Y.
How to show a coefficient is statistically different from Zero.
Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 35 samples
bi +/- (t_crit × SE_bi)
t_crit is obtained from student t where
Two tailed significance = 0.05
df = 35-2 = 33
If zero falls within the range, fail to reject the null hypothesis; otherwise
bi is statistically different from zero
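A sketch of this test, assuming a hypothetical coefficient estimate and standard error:

```python
from scipy import stats

bi, se_bi = 0.76, 0.31                         # hypothetical coefficient and its standard error
t_crit = stats.t.ppf(1 - 0.05 / 2, df=35 - 2)  # two-tailed 5% significance, df = 33

lower, upper = bi - t_crit * se_bi, bi + t_crit * se_bi
print((lower, upper), not (lower <= 0 <= upper))  # True => bi statistically different from zero
```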
How to show the true value of a coefficient is not Zero and that X explains Y
Explain how 95% confidence interval is calculated and used to test for null hypothesis of a slope coefficient bi from 36 samples
Compare estimated b1 with hypothetical b1=0
Null hypothesis is b1=0
Reject the null if t falls outside the range
- t_critical to + t_critical, i.e.
t_b1 < - t_crit or
t_b1 > + t_crit
t_b1 = (b1 - 0) / (SE of b1)
t_crit is taken from the t-table with df = 36 - 2 = 34 and two-tailed significance = 0.05
What is the df for error terms relative to number of observations for:
- Parameter estimate
- Predicted Y
For both, the degrees of freedom are adjusted for the number of parameters = number of slope coefficients plus the intercept.
For a simple regression, df = (n-2)
What is the null and alternative hypothesis for intercept term, b0
Hnull: b0=0
Ha: b0<>0
Explain R squared as function of explained, unexplained and total variation
R_squared = (explained variation) / (total variation)
= RSS/SST
R_squared = (Total variation - Unexplained variation) / (total variation)
=(SST-SSE)/SST
R_squared
=1-(unexplained/total)
= 1-(SSE/SST)
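A sketch verifying the decomposition on a small OLS fit (hypothetical data; SST = RSS + SSE holds exactly for an OLS fit with an intercept):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical X
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # hypothetical Y

b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_pred = b0 + b1 * x                     # OLS fitted values

sst = np.sum((y - y.mean()) ** 2)        # total variation
rss = np.sum((y_pred - y.mean()) ** 2)   # explained variation
sse = np.sum((y - y_pred) ** 2)          # unexplained variation

print(rss / sst, 1 - sse / sst)          # both equal R squared
```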
Describe SSE
The SSE is the sum of all the Unexplained Variation
Sum of all the squared residuals (Y actual - Y predicted)
Describe the Total Variation
How else is it known?
This is the sum of all squared differences between actual Y and the mean of all Y = Sum (Y_actual - Y_mean) squared
SST
= explained (RSS) + unexplained (SSE)
Describe the explained variation
What else is it called?
This is the sum of the squared differences of predicted Y from mean of Y
Sum (Y_predicted - Y_mean) squared
RSS = Regression Sum of Squares = explained variation
How does the slope coefficient explain correlation between two variables
It does not. This is a trick question: the slope shares the sign of the correlation, but its magnitude depends on the scales of X and Y, so it does not measure correlation.
Explain how to calculate CI around a predicted Y
CI pred Y
= pred Y +/-
(Sf x t_crit)
Two tailed because it is either side of pred Y
Sf is Standard Error of the Forecast Pred Y
If the standard error of predicted Y is not given, what values are needed to calculate it
- n observations
- SEE (standard deviation of residuals)
- Variance and mean of X
- Xi for Predicted Y
Derive sf (standard error of forecast Y) using all of
- SEE,
- variance X
- Xi
- X mean
(Sf) squared =
SEE squared x [1+ 1/n + (Xi-X mean) squared /((n-1)× variance(X))]
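A sketch of this formula with hypothetical inputs, combined with the CI around predicted Y from the earlier card:

```python
import numpy as np
from scipy import stats

n, see = 30, 1.2          # hypothetical observations and standard error of estimate
x_mean, var_x = 4.0, 2.5  # hypothetical mean and sample variance of X
xi, y_pred = 6.0, 11.3    # hypothetical forecast point and predicted Y

sf = see * np.sqrt(1 + 1 / n + (xi - x_mean) ** 2 / ((n - 1) * var_x))
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)

print(sf, (y_pred - t_crit * sf, y_pred + t_crit * sf))  # 95% CI around predicted Y
```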
Derive total variation of Y from
Unexplained + Explained
- Explained (RSS) = variation of Y_pred around Y_mean = Sum (Y_pred - Y_mean) squared
- Unexplained (SSE) = variation of actual Y around Y_pred = Sum (Y_actual - Y_pred) squared
Total variation SST = RSS + SSE
What is RSS
Regression Sum of Squares
The variation explained by the regression model
Sum (Y_pred - Y_mean) squared
What is SSE
The sum of the squared residuals
The part of the total variation of Yi around Y_mean that the model cannot explain (the part not captured by RSS)
Sum (Y_actual - Y_pred) squared
SSE=(MSE)x (n-k-1)
What is SST
It is the total variation of Y actual from Y mean
Sum (Y_actual - Y_mean) squared
SST= RSS + SSE
Calculate and interpret the standard error of the estimate (SEE).
SEE indicates certainty about predictions using the regression equation
It is the standard deviation of the residuals, derived from the SSE (the “sum of the squared residuals”): SEE = sqrt[SSE/(n-2)]
Calculate and interpret the coefficient of determination (R2).
R2 indicates confidence about estimates using the regression
It is the ratio of the variation “explained” by the model over the “total variation” of the observations against their mean (the variation due to the distribution of all the observations)
Describe the confidence interval for a regression coefficient, b1 pred
It is a range of values either side of the estimated coefficient, b1
C.I. = b1pred +/- (t_crit x standard error of b1 pred)
Formulate a null and alternative hypothesis about a population value of a regression coefficient and determine the appropriate test statistic and whether to reject the null hypothesis.
What part of the model effectiveness does F test determine
The effectiveness of the group of k independent variables
Explain MSE. What is the adjusted sample size? Explain SEE.
MSE = the sample mean of the squared residuals = SSE / (n-k-1)
The adjusted sample size = n-k-1
SEE = standard deviation of all the sampled residuals = sqrt(MSE)
What does a large F indicate
Good explanatory power
Why is F stat not often used for regressions with 1 independent variable?
The F-stat is the square of the t-stat, so rejection where F > F_crit implies the same conclusion as the t-test, |t| > t_crit
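A quick numerical check of the F = t squared equivalence (hypothetical t-statistic):

```python
from scipy import stats

t_stat, n = 2.4, 30                           # hypothetical slope t-stat and sample size

f_stat = t_stat ** 2                          # with 1 independent variable, F = t squared
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)  # 5% F critical value
t_crit = stats.t.ppf(0.975, df=n - 2)         # two-tailed 5% t critical value

print(f_crit, t_crit ** 2)                    # F_crit equals t_crit squared
print(f_stat > f_crit, abs(t_stat) > t_crit)  # same decision either way
```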
Outline limitations of simple linear regression
1. Parameter instability.
2. The standard six assumptions may not hold, particularly in the presence of heteroskedasticity and autocorrelation; both concern the reliability of the residuals.
3. Public knowledge limitation: widespread understanding of a relationship causes participants to act in ways that distort the relationship between the independent and dependent variables, so future use of the regression is compromised.
Note: multicollinearity is not a limitation of simple linear regression because it concerns correlation between variables (or functions of variables) in a multiple regression.
Compare Rsquared with F in terms of variation
R squared = explained variation / total variation
F = explained variation / unexplained variation, each divided by its degrees of freedom: F = (RSS/k) / (SSE/(n-k-1))
Explain the multiple regression null hypothesis and alternative hypothesis. How is this tested?
Null: all slope coefficients = zero.
Alternative: at least one slope coefficient is not zero.
Test: if the F statistic > F critical, reject the null; at least one slope coefficient is non-zero.
Explain adjusted R squared
R squared adjusted = 1 - (df_TSS / df_SSE)(1 - R squared), where df_TSS = n-1 and df_SSE = n-k-1
As k increases, df_SSE decreases
As k increases, df_TSS does not change
As k increases, (df_TSS / df_SSE) increases
So, for a given R squared, adjusted R squared decreases as k increases
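A minimal sketch of the formula, showing the penalty for a given R squared as k grows (hypothetical numbers):

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R squared = 1 - (df_TSS / df_SSE) * (1 - R squared)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r2)

# Same R squared, more regressors => lower adjusted R squared
print(adjusted_r_squared(0.65, n=40, k=2))
print(adjusted_r_squared(0.65, n=40, k=6))
```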
What are the drawbacks of multiple R squared
R squared increases (or at least never decreases) as independent variables are added, even variables with little real explanatory power, so it can overstate the fit of the model.
Can adj Rsquared be negative
Yes
Compare in 4 key points Rsquared with adjusted R squared
- Adj R squared always <= R squared
- R squared is always greater than adj R squared when k>0
- As useful variables are added, adjusted R squared initially increases, but it begins to decrease once additional variables add little explanatory power
- Where k=3 adjusted R squared is often max
Explain how dummy variables are evaluated by formulating the Hypothesis
- The omitted dummy variable is the reference class (remember Q4 not included in the regression equation example) so its implicit in the b0 which is always in the output.
- The hypothesis test applied to included dummy variables is whether or not they are statistically different to the reference class (in this case Q4)
- The slope coefficient for each included dummy represents the difference between that class and the omitted (reference) class
- So for Ho: b1 = 0, the class mean is b0 + b1, which equals b0 when b1 = 0; therefore Ho tests whether class 1 equals the reference class
- Ha: b1 <> 0 means class 1 differs from the reference class
If we fail to reject Ho (|t| <= t_crit), then b1 = 0, e.g. Q1 equals Q4 (the omitted dummy)
Which test does conditional heteroskedasticity make unreliable
The F-test (and also the t-tests of the coefficients, since the coefficient standard errors become unreliable)
What are the two types of serial correlation
- Positive serial correlation
- Negative serial correlation
What effects result from multicollinearity
- Slope coefficient estimates become unreliable
- The standard error of the slope coefficients, b_se, is higher than it should be
- The t-statistic (b / b_se) is lower than it should be
- Less likely to reject the null hypothesis that b = 0, since t is less likely to exceed t_crit
- Increase in Type II error
How do we detect multicollinearity
If the individual statistical significance of each slope coefficient is low, but the F-test and R squared indicate high overall significance, this is classic multicollinearity.
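A sketch of this pattern using statsmodels (my choice of library, not from the cards), with deliberately near-collinear regressors:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly identical to x1: multicollinearity
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()

print(fit.rsquared, fit.f_pvalue)  # high R squared, highly significant F-test
print(fit.pvalues[1:])             # yet each slope is individually insignificant
```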
How do we correct for multicollinearity
Stepwise regression elimination of variables to minimise multicollinearity
Give 7 types of model misspecification