Quantitative Methods Flashcards

Question

Mean regression sum of squares (MSR) Mean squared error (MSE)

Answer 1

The mean regression sum of squares (MSR) and mean squared error (MSE) are simply calculated as the appropriate sum of squares divided by its degrees of freedom.

Answer 2

The R² is the percentage of the total variation in the dependent variable explained by the independent variable(s).
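As a formula, and assuming the usual split of total variation (SST) into explained variation (RSS) and unexplained variation (SSE), a decomposition the card itself does not spell out, this can be sketched as:

$$R^2 = \frac{RSS}{SST} = \frac{SST - SSE}{SST}$$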

Answer 3

The slope coefficients in a multiple regression. Each slope coefficient is the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant.

Answer 4

* An F-test assesses how well a set of independent variables, as a group, explains the variation in the dependent variable. In multiple regression, the F-statistic is used to test whether at least one independent variable in a set of independent variables explains a significant portion of the variation of the dependent variable.
* F = MSR / MSE
* Always a one-tailed test
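A minimal sketch of how MSR, MSE, and the F-statistic fit together. It assumes the standard degrees of freedom for a regression with k independent variables and n observations (k for the regression sum of squares, n − k − 1 for the sum of squared errors); the function name and the sample numbers are hypothetical.

```python
def f_statistic(rss, sse, n, k):
    """Return (MSR, MSE, F) for a regression with k independent variables.

    rss: regression (explained) sum of squares
    sse: sum of squared errors (unexplained)
    n:   number of observations
    """
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 observations")
    msr = rss / k              # assumed degrees of freedom: k
    mse = sse / (n - k - 1)    # assumed degrees of freedom: n - k - 1
    return msr, mse, msr / mse


# Hypothetical numbers: 3 independent variables, 64 observations
msr, mse, f = f_statistic(rss=90.0, sse=120.0, n=64, k=3)
print(msr, mse, f)
```

The resulting F would then be compared with a one-tailed critical value from an F-table with k and n − k − 1 degrees of freedom.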

Answer 5

* Relationships change over time (parameter instability).
* Public knowledge of relationships eliminates their usefulness to traders.
* Assumption violations.

Answer 6

The p-value is the smallest level of significance for which the null hypothesis can be rejected.
* If the p-value is less than the significance level, the null hypothesis can be rejected.
* If the p-value is greater than the significance level, the null hypothesis cannot be rejected.

Answer 7

* Independent variables that fall into this category are called dummy variables and are often used to quantify the impact of qualitative events.
* Dummy variables are assigned a value of “0” or “1.”
* Whenever we want to distinguish between n classes, we must use n – 1 dummy variables. Otherwise, the regression assumption of no exact linear relationship between independent variables would be violated (multicollinearity).
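To illustrate the n – 1 rule, here is a small sketch that encodes four quarters with three dummy variables. The helper name and the choice of Q4 as the omitted (base) class are assumptions made for the example.

```python
def make_dummies(values, classes):
    """Encode a categorical series with len(classes) - 1 dummy columns.

    The last entry of `classes` is the omitted (base) class: it is
    represented by all dummies being 0.
    """
    kept = classes[:-1]
    return [[1 if v == c else 0 for c in kept] for v in values]


quarters = ["Q1", "Q2", "Q3", "Q4", "Q1"]
rows = make_dummies(quarters, ["Q1", "Q2", "Q3", "Q4"])
for quarter, row in zip(quarters, rows):
    print(quarter, row)
```

Adding a fourth dummy for Q4 would make the four columns sum to 1 in every row, which is an exact linear relationship with the intercept.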

Answer 8

Heteroskedasticity occurs when the variance of the residuals is not the same across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample. **Unconditional heteroskedasticity** occurs when the heteroskedasticity is not related to the level of the independent variables. It usually causes no major problems with the regression. **Conditional heteroskedasticity** is heteroskedasticity that is related to the level of (i.e., conditional on) the independent variables. It creates significant problems for statistical inference.

Answer 9

* The standard errors are usually unreliable estimates.
* The coefficient estimates (the $\hat{b}_j$) aren’t affected.
* If the standard errors are too small, but the coefficient estimates themselves are not affected, the t-statistics will be too large and the null hypothesis of no statistical significance is rejected too often. The opposite will be true if the standard errors are too large.
* The F-test is also unreliable.

Answer 10

1. Examining scatter plots of the residuals.
2. Using the Breusch-Pagan chi-square (χ²) test. The more common way to detect conditional heteroskedasticity is the Breusch-Pagan test, which calls for the regression of the squared residuals on the independent variables. If conditional heteroskedasticity is present, the independent variables will significantly contribute to the explanation of the squared residuals.

Answer 11

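BP chi-square test statistic: $BP = n \times R^2_{\text{resid}}$, with k degrees of freedom, where n is the number of observations, $R^2_{\text{resid}}$ is the R² from regressing the squared residuals on the independent variables, and k is the number of independent variables.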
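A quick sketch of the calculation. The inputs are hypothetical, and the critical value is the tabulated 5% chi-square value for 2 degrees of freedom, entered by hand rather than computed.

```python
def breusch_pagan_stat(n, r2_resid):
    """BP statistic: n times the R-squared from regressing the squared
    residuals on the independent variables."""
    if not 0.0 <= r2_resid <= 1.0:
        raise ValueError("R-squared must lie in [0, 1]")
    return n * r2_resid


# Hypothetical inputs: 60 observations, residual-regression R^2 of 0.10
bp = breusch_pagan_stat(n=60, r2_resid=0.10)
critical_value = 5.99  # chi-square table, 5% level, k = 2 degrees of freedom
print(bp, bp > critical_value)
```

A statistic above the critical value would lead to rejecting the null hypothesis of no conditional heteroskedasticity.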

Answer 12

* Robust standard errors (also called White-corrected standard errors or heteroskedasticity-consistent standard errors). These robust standard errors are then used to recalculate the t-statistics using the original regression coefficients.
* A second method to correct for heteroskedasticity is the use of generalized least squares, which attempts to eliminate the heteroskedasticity by modifying the original equation.

Answer 13

Serial correlation, also known as autocorrelation, refers to the situation in which the residual terms are correlated with one another.
* Positive serial correlation exists when a positive regression error in one time period increases the probability of observing a positive regression error for the next time period.
* Negative serial correlation occurs when a positive error in one period increases the probability of observing a negative error in the next period.

Answer 14

1. Heteroskedasticity
2. Serial correlation (autocorrelation)
3. Multicollinearity

Answer 15

* Residual plots: a scatter plot of residuals versus time
* Durbin-Watson statistic (DW)

Answer 16

If the sample size is very large: DW ≈ 2(1 − r), where r is the correlation coefficient between residuals from one period and those from the previous period.
* DW = 2: the error terms are homoskedastic and not serially correlated (r = 0).
* DW < 2: the error terms are positively serially correlated (r > 0).
* DW > 2: the error terms are negatively serially correlated (r < 0).
Decision rule for positive serial correlation, where dl and du are the lower and upper critical DW values:
* If DW < dl, the error terms are positively serially correlated (i.e., reject the null hypothesis of no positive serial correlation).
* If dl < DW < du, the test is inconclusive.
* If DW > du, there is no evidence that the error terms are positively correlated (i.e., fail to reject the null of no positive serial correlation).
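The sketch below computes DW directly from a residual series and compares it with the 2(1 − r) approximation. The formula used for DW (sum of squared changes in the residuals divided by the sum of squared residuals) is the standard definition but is not stated on the card, and the residual series is made up.

```python
def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


def lag1_correlation(residuals):
    """Approximate r as sum(e_t * e_{t-1}) / sum(e_t^2)."""
    num = sum(residuals[t] * residuals[t - 1]
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical residuals that alternate in sign (negative serial correlation)
alternating = [1.0, -1.0] * 50
dw = durbin_watson(alternating)
r = lag1_correlation(alternating)
print(dw, 2 * (1 - r))
```

For this series DW comes out well above 2 and close to 2(1 − r), which matches the negative serial correlation case in the list above.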

Answer 17

* **Adjust the coefficient standard errors**, using the **Hansen method**, which also corrects for conditional heteroskedasticity. Also called serial correlation consistent standard errors or Hansen-White standard errors.
* Only use the Hansen method if serial correlation is a problem. The White-corrected standard errors are preferred if only heteroskedasticity is a problem. If both conditions are present, use the Hansen method.
* **Improve the specification of the model.** The best way to do this is to explicitly incorporate the time-series nature of the data.

Answer 18

* Multicollinearity refers to the condition when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other.
* Even though multicollinearity does not affect the consistency of slope coefficients, such coefficients themselves tend to be unreliable. Additionally, the standard errors of the slope coefficients are artificially inflated. Hence, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant (i.e., a Type II error).
* The most common sign of multicollinearity is the situation where t-tests indicate that none of the individual coefficients is significantly different from zero, while the F-test is statistically significant and the R² is high.
* The most common method to correct for multicollinearity is to omit one or more of the correlated independent variables.

Answer 19

1. The functional form can be misspecified.
   * Important variables are omitted.
   * Variables should be transformed.
   * Data is improperly pooled.
2. Explanatory variables are correlated with the error term in time series models.
   * A lagged dependent variable is used as an independent variable.
   * A function of the dependent variable is used as an independent variable (“forecasting the past”).
   * Independent variables are measured with error.
3. Other time-series misspecifications that result in nonstationarity.

Answer 20

Model misspecification leads to biased and inconsistent regression coefficients, which in turn lead to unreliable hypothesis testing and inaccurate predictions.

Answer 21

A dummy variable that takes on a value of either zero or one. An ordinary regression model is not appropriate for situations that require a qualitative dependent variable. There are several different types of models that use a qualitative dependent variable.
* A **probit** model is based on the normal distribution, while a **logit** model is based on the logistic distribution.
* **Discriminant** models are similar to probit and logit models but make different assumptions regarding the independent variables. Discriminant analysis results in a linear function similar to an ordinary regression, which generates an overall score, or ranking, for an observation. The scores can then be used to rank or classify observations.

Answer 22

Penalized regression is a special case of the generalized linear model (GLM). Penalized regression models seek to minimize forecasting errors by reducing the problem of overfitting. They seek to minimize the sum of squared errors (same as in multiple regression models) plus a penalty value. This penalty value increases with the number of independent variables (features) used by the model.
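One common form of this idea is LASSO, which the card does not name, so treat it as an illustrative assumption. With K features, slope coefficients $\hat{b}_k$, and a hyperparameter $\lambda > 0$ chosen by the researcher, the quantity being minimized can be sketched as:

$$\sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2 + \lambda \sum_{k=1}^{K} \left|\hat{b}_k\right|$$

Every feature kept in the model with a nonzero coefficient adds to the second term, so a feature is retained only if it reduces the sum of squared errors by more than it adds to the penalty.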

Answer 23

Classification trees are appropriate when the target variable is categorical, while regression trees are appropriate when the target is continuous. More typically, classification trees are used when the target is binary. Classification trees assign observations to one of two possible classifications at each node. At the top of the tree, the top feature (the one most important in explaining the target) is selected and a cutoff value “c” is estimated. The tree stops when the error cannot be reduced further, resulting in a terminal node. A random forest is a collection of randomly generated classification trees from the same data set. Because each tree only uses a subset of features, random forests can mitigate the problem of overfitting. Using random forests can increase the signal-to-noise ratio because errors across different trees tend to cancel each other out.

Answer 24

1. Specify the algorithm.
2. Specify the hyperparameters (before the processing begins).
3. Divide data into training and validation samples. In the case of cross validation, the training and validation samples are randomly generated every learning cycle.
4. Evaluate the training using a performance parameter, P, in the validation sample.
5. Repeat the training until an adequate level of performance is achieved. In choosing the number of times to repeat, the researcher must use caution to avoid overfitting the model.
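The random re-splitting used in cross validation can be sketched as follows. The helper name, the 25% validation share, and the toy data are all assumptions for illustration; no particular learning algorithm is implied.

```python
import random


def train_validation_split(data, validation_share, rng):
    """Randomly split `data` into (training, validation) samples."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * validation_share)
    return shuffled[n_val:], shuffled[:n_val]


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(20))
for cycle in range(3):
    # a fresh random split is drawn on every learning cycle
    train, validation = train_validation_split(data, 0.25, rng)
    print(cycle, sorted(validation))
```

In a real workflow, the model would be fitted on `train` and the performance parameter P evaluated on `validation` inside the loop.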

Answer 25

A time series is a set of observations for a variable over successive periods of time. The series has a trend if a consistent pattern can be seen by plotting the data on a graph.

Answer 26

A linear trend is a time series pattern that can be graphed using a straight line. A downward-sloping line indicates a negative trend, while an upward-sloping line indicates a positive trend.
$y_t = b_0 + b_1 t + \varepsilon_t$
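A minimal sketch of estimating the trend coefficients by ordinary least squares, assuming the time index runs t = 1, ..., T. The function name and the sample series are hypothetical.

```python
def fit_linear_trend(y):
    """OLS fit of y_t = b0 + b1 * t for t = 1, ..., T. Returns (b0, b1)."""
    T = len(y)
    t = list(range(1, T + 1))
    t_bar = sum(t) / T
    y_bar = sum(y) / T
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    b0 = y_bar - b1 * t_bar
    return b0, b1


# Hypothetical series lying exactly on y_t = 10 + 2t
series = [10 + 2 * t for t in range(1, 9)]
b0, b1 = fit_linear_trend(series)
print(b0, b1)
```

A positive estimate of b1 corresponds to the upward-sloping case described above.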

Answer 27

Time series data, particularly financial time series, often display exponential growth (growth with continuous compounding). When a series exhibits exponential growth, it can be modeled as:
$y_t = e^{b_0 + b_1 t}$
We take the natural log of both sides of the equation and arrive at the log-linear model:
$\ln(y_t) = \ln\left(e^{b_0 + b_1 t}\right) = b_0 + b_1 t$
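The same least-squares idea applies after taking logs. The sketch below regresses ln(y) on t, again assuming t = 1, ..., T, and uses a made-up series that grows at a constant continuously compounded rate.

```python
import math


def fit_log_linear_trend(y):
    """OLS fit of ln(y_t) = b0 + b1 * t for t = 1, ..., T. Returns (b0, b1)."""
    T = len(y)
    t = list(range(1, T + 1))
    ln_y = [math.log(v) for v in y]  # requires every y_t > 0
    t_bar = sum(t) / T
    ln_bar = sum(ln_y) / T
    sxy = sum((ti - t_bar) * (li - ln_bar) for ti, li in zip(t, ln_y))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    b1 = sxy / sxx
    b0 = ln_bar - b1 * t_bar
    return b0, b1


# Hypothetical series with b0 = 0.5 and a 3% continuously compounded growth rate
series = [math.exp(0.5 + 0.03 * t) for t in range(1, 13)]
b0, b1 = fit_log_linear_trend(series)
forecast_13 = math.exp(b0 + b1 * 13)  # convert back from logs to levels
print(round(b0, 6), round(b1, 6), forecast_13)
```

Note that forecasts come out in log terms and have to be exponentiated to get back to the level of the series.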

Answer 28

When the dependent variable is regressed against one or more lagged values of itself, the resultant model is called an autoregressive (AR) model.
$x_t = b_0 + b_1 x_{t-1} + \varepsilon_t$

Answer 29

An autoregressive model is only valid when the time series being modeled is covariance stationary:
* Constant and finite expected value
* Constant and finite variance
* Constant and finite covariance between values at any given lag

Answer 30

It is necessary to calculate a one-step-ahead forecast before a two-step-ahead forecast can be calculated.
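For an AR(1) model this chaining can be sketched as below: the one-step-ahead forecast is fed back in as the input for the two-step-ahead forecast. The coefficients and the latest observation are hypothetical.

```python
def ar1_forecasts(b0, b1, x_t, steps):
    """Multi-step AR(1) forecasts: each forecast feeds the next one."""
    forecasts = []
    prev = x_t
    for _ in range(steps):
        prev = b0 + b1 * prev
        forecasts.append(prev)
    return forecasts


# Hypothetical estimated coefficients and most recent observation
one_step, two_step = ar1_forecasts(b0=1.2, b1=0.45, x_t=5.0, steps=2)
print(one_step, two_step)
```

Here the one-step forecast is 1.2 + 0.45 × 5.0, and the two-step forecast applies the same equation to that result.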

Answer 31

* Cannot use Durbin-Watson to test for serial correlation in AR models
* Use a t-test on residual autocorrelations
* If serial correlation exists, the model is incomplete
* Solution: Increase order of model by adding more lagged variables

Answer 32

The mean-reverting level is expressed as $x_t = \frac{b_0}{1 - b_1}$.
* If $x_t > \frac{b_0}{1 - b_1}$, the AR(1) model predicts that $x_{t+1}$ will be lower than $x_t$, and
* if $x_t < \frac{b_0}{1 - b_1}$, the model predicts that $x_{t+1}$ will be higher than $x_t$.
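A short derivation of where this level comes from, writing $x^*$ (a symbol introduced here, not on the card) for the value at which the AR(1) forecast equals the current value, and assuming $b_1 \neq 1$:

$$x^* = b_0 + b_1 x^* \;\Rightarrow\; x^*(1 - b_1) = b_0 \;\Rightarrow\; x^* = \frac{b_0}{1 - b_1}$$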

Answer 33

* **In-sample:** Data used to develop the model
* **Out-of-sample:** Any data outside the in-sample range
* Forecasting accuracy is measured by the square root of the mean squared error (RMSE). (=SEE)
* Use the model with the **lowest RMSE** based on **out-of-sample** forecasting errors
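A small sketch of the comparison, using made-up out-of-sample observations and forecasts from two hypothetical models labeled A and B.

```python
import math


def rmse(actual, forecast):
    """Square root of the mean squared forecast error."""
    errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(errors) / len(errors))


# Hypothetical out-of-sample data and two competing sets of forecasts
actual = [2.0, 4.0, 6.0, 8.0]
model_a = [2.0, 5.0, 6.0, 7.0]
model_b = [4.0, 4.0, 4.0, 8.0]
scores = {"A": rmse(actual, model_a), "B": rmse(actual, model_b)}
best = min(scores, key=scores.get)
print(scores, best)
```

Model A has the lower RMSE on this data, so it would be the one to prefer under the rule above.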

Answer 34

* Defining characteristic: $b_1 = 1$ (unit root)
* Without a drift: $b_0 = 0$
* With a drift: $b_0 \neq 0$
* A random walk is not covariance stationary.
* If a time series is a random walk (without a drift), the best forecast of $x_t$ that can be made in period t − 1 is $x_{t-1}$.

Answer 35

* (1) $x_t = b_0 + b_1 x_{t-1} + \varepsilon$
* (2) $x_t - x_{t-1} = b_0 + b_1 x_{t-1} - x_{t-1} + \varepsilon$
* $x_t - x_{t-1} = b_0 + (b_1 - 1) x_{t-1} + \varepsilon$
* $H_0: b_1 - 1 = 0$, $H_a: b_1 - 1 < 0$

Answer 36

If we believe a time series is a random walk, we can transform the data to a covariance stationary time series using a procedure called first differencing.
$y_t = x_t - x_{t-1} \Rightarrow y_t = \varepsilon_t$
Then, stating y in the form of an AR(1) model:
$y_t = b_0 + b_1 y_{t-1} + \varepsilon_t$, where $b_0 = b_1 = 0$
This transformed time series has a finite mean-reverting level of 0 / (1 − 0) = 0 and is, therefore, covariance stationary.
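First differencing itself is a one-line transformation. A minimal sketch with made-up price levels:

```python
def first_difference(x):
    """Return y_t = x_t - x_{t-1}; the result is one observation shorter."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


levels = [100, 103, 101, 106, 110]  # hypothetical series in levels
changes = first_difference(levels)
print(changes)
```

The differenced series has one fewer observation than the original because the first value has no predecessor.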

Answer 37

* When examining a single time series, such as an AR model, **autoregressive conditional heteroskedasticity (ARCH)** exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period.
* $\varepsilon_t^2 = a_0 + a_1 \varepsilon_{t-1}^2 + \mu_t$
* If $a_1$ is statistically significant, the time series is ARCH.
* ARCH can be used to predict the variance of the residuals in future periods (volatility):
* $\hat{\sigma}_{t+1}^2 = \hat{a}_0 + \hat{a}_1 \hat{\varepsilon}_t^2$
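The variance forecast can be sketched as a direct application of the fitted ARCH(1) equation. The coefficient values and the current residual below are hypothetical.

```python
def arch1_variance_forecast(a0, a1, resid_t):
    """Next-period residual variance implied by an ARCH(1) model."""
    return a0 + a1 * resid_t ** 2


# Hypothetical estimates a0 = 0.4, a1 = 0.5 and a current residual of 2.0
next_variance = arch1_variance_forecast(a0=0.4, a1=0.5, resid_t=2.0)
print(next_variance)
```

A large residual today (in either direction) raises the predicted variance for the next period whenever a1 is positive.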

Answer 38

1. Both time series are covariance stationary. **(Reliable)**
2. Only the dependent variable time series is covariance stationary. **(Not Reliable)**
3. Only the independent variable time series is covariance stationary. **(Not Reliable)**
4. Neither time series is covariance stationary and the two series are not cointegrated. **(Not Reliable)**
5. Neither time series is covariance stationary and the two series are cointegrated. **(Reliable)**

Answer 39

Cointegration means that two time series are economically linked (related to the same macro variables) or follow the same trend and that relationship is not expected to change. If two time series are cointegrated, the error term from regressing one on the other is covariance stationary and the t-tests are reliable.

Answer 40

1. If there is no seasonality or structural shift, use a trend model.
   * If the data plot on a straight line with an upward or downward slope, use a linear trend model.
   * If the data plot in a curve, use a log-linear trend model.
2. Run the trend analysis, compute the residuals, and test for serial correlation using the Durbin-Watson test.
   * If you detect no serial correlation, you can use the model.
   * If you detect serial correlation, you must use another model (e.g., AR).
3. If the data has serial correlation, reexamine the data for stationarity before running an AR model. If it is not stationary, treat the data for use in an AR model as follows:
   * If the data has a linear trend, first-difference the data.
   * If the data has an exponential trend, first-difference the natural log of the data.
   * If there is a structural shift in the data, run two separate models.
   * If the data has a seasonal component, incorporate the seasonality in the AR model.
4. After first-differencing in step 3, if the series is covariance stationary, run an AR(1) model and test for serial correlation and seasonality.
   * If there is no remaining serial correlation, you can use the model.
   * If you still detect serial correlation, incorporate lagged values of the variable (possibly including one for seasonality; e.g., for monthly data, add the 12th lag of the time series) into the AR model until you have removed (i.e., modeled) any serial correlation.
5. Test for ARCH. Regress the square of the residuals on squares of lagged values of the residuals and test whether the resulting coefficient is significantly different from zero.
   * If the coefficient is not significantly different from zero, you can use the model.
   * If the coefficient is significantly different from zero, ARCH is present. Correct using generalized least squares.
6. If you have developed two statistically reliable models and want to determine which is better at forecasting, calculate their out-of-sample RMSE.

Answer 41

1. Determine the probabilistic variables.
2. Define probability distributions for these variables (3 approaches to specify a distribution):
   * Historical data: Examination of past data may point to a distribution that is suitable for the probabilistic variable. This method assumes that the future values of the variable will be similar to its past.
   * Cross-sectional data: When past data is unavailable (or unreliable), we may estimate the distribution of the variable based on the values of the variable for peers.
   * Pick a distribution and estimate the parameters.
3. Check for correlations among variables. When there is a strong correlation between variables, we can either:
   * Allow only one of the variables to vary.
   * Build the rules of correlation into the simulation.
4. Run the simulations.
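The four steps can be sketched with a toy profit simulation. Everything specific here is an assumption made for illustration: the two probabilistic inputs (units sold and margin per unit), the distributions picked for them, their parameters, and the simplification that the two inputs are uncorrelated.

```python
import random
import statistics


def simulate_profit(n_runs, seed=42):
    """Simulate profit = units * margin over n_runs random draws."""
    rng = random.Random(seed)  # seeded so results are repeatable
    results = []
    for _ in range(n_runs):
        # Steps 1-2: probabilistic variables and their assumed distributions
        units = rng.gauss(1000, 100)             # normal: mean 1000, sd 100
        margin = rng.triangular(2.0, 6.0, 4.0)   # low, high, mode
        # Step 3: inputs are drawn independently (assumed uncorrelated)
        results.append(units * margin)
    return results


# Step 4: run the simulations and summarize the distribution of outcomes
outcomes = simulate_profit(10_000)
print(statistics.mean(outcomes), statistics.stdev(outcomes))
```

The output is a whole distribution of profit outcomes, which can be summarized by its mean and standard deviation or inspected directly.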

Answer 42

* The number of uncertain variables. The higher the number of probabilistic inputs, the greater the number of simulations needed.
* The types of distributions. The greater the variability in types of distributions, the greater the number of simulations needed.
* The range of outcomes. The wider the range of outcomes of the uncertain variables, the higher the number of simulations needed.

Answer 43

2 advantages:
* Better input quality.
* Provides a distribution of expected value rather than a point estimate.
3 constraints:
* Book value constraints
  * Regulatory capital requirements
  * Negative equity
* Earnings and cash flow constraints (internal & external)
* Market value constraints

Answer 44

* Input quality
* Inappropriate statistical distributions
* Non-stationary distributions
* Dynamic correlations

(70 cards)