Quantitative Methods Flashcards

1
Q

Describe a simple linear regression model and the roles of the dependent and independent variables in the model.

A

Linear regression provides an estimate of the linear relationship between an independent variable (the explanatory variable) and a dependent variable (the predicted variable).

2
Q

Describe the least squares criterion, how it is used to estimate regression coefficients, and their interpretation.

A

The least squares criterion selects the estimates b̂0 and b̂1 that minimize the sum of squared residuals (the squared vertical distances between observed and predicted values of the dependent variable).

The estimated intercept, b̂0, represents the value of the dependent variable at the point where the regression line intersects the axis of the dependent variable (usually the vertical axis).

The estimated slope coefficient, b̂1, is interpreted as the change in the dependent variable for a one-unit change in the independent variable.

3
Q

Explain the assumptions underlying the simple linear regression model, and describe how residuals and residual plots indicate if these assumptions may have been violated.

A

Assumptions made regarding simple linear regression include the following:

  • A linear relationship exists between the dependent and the independent variable.
  • The variance of the residual term is constant (homoskedasticity).
  • The residual term is free from serial correlation.
  • The residual term is normally distributed.

Residual Term = Error Term

4
Q

Interpret the coefficient of determination in a simple linear regression.

A

The coefficient of determination, R2, is the proportion of the total variation of the dependent variable that is explained by the regression:

R2 = RSS ÷ SST = (SST – SSE) ÷ SST
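A quick numeric sketch of this identity (the sums of squares below are made-up values, not from any card):

```python
# Hypothetical sums of squares, chosen only to illustrate
# R2 = RSS / SST = (SST - SSE) / SST.
sst = 100.0          # total variation of the dependent variable
sse = 36.0           # unexplained (residual) variation
rss = sst - sse      # explained variation

r_squared = rss / sst   # 0.64: the regression explains 64% of the variation
```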

5
Q

Interpret the F-statistic in a simple linear regression.

A

In simple linear regression, because there is only one independent variable (k = 1), the F-test tests the same null hypothesis as testing the statistical significance of b1 using the t-test:

H0: b1 = 0 versus

Ha: b1 ≠ 0.

With only one independent variable, F is calculated as:
F-Stat = MSR ÷ MSE with 1 and n − 2 degrees of freedom

6
Q

What are the calculations used in analysis of variance (ANOVA) in regression analysis?

A

RSS = Σ(ŷi − ȳ)²

SSE = Σ(yi − ŷi)²

RSS + SSE = SST

MSR = RSS ÷ k
MSE = SSE ÷ (n − k − 1)

F = MSR ÷ MSE
SEE = √[SSE ÷ (n − 2)]
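A minimal worked sketch of these ANOVA quantities for a simple regression (k = 1), using a small made-up data set:

```python
# ANOVA pieces for a simple linear regression on illustrative data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.7]
n, k = len(y), 1

mean_x = sum(x) / n
mean_y = sum(y) / n

# OLS slope and intercept (least squares estimates)
b1_num = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
b1_den = sum((xi - mean_x) ** 2 for xi in x)
b1 = b1_num / b1_den
b0 = mean_y - b1 * mean_x

y_hat = [b0 + b1 * xi for xi in x]
rss = sum((yh - mean_y) ** 2 for yh in y_hat)            # explained
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained
sst = rss + sse

msr = rss / k
mse = sse / (n - k - 1)
f_stat = msr / mse
see = math.sqrt(sse / (n - 2))   # standard error of estimate
```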

7
Q

Formulate a null and an alternative hypothesis about a population value of a regression coefficient, and determine whether the null hypothesis is rejected at a given level of significance.

A

We can assess a regression model by testing whether the population value of a regression coefficient is equal to a specific hypothesized value.

A t-test with n − 2 degrees of freedom is used to conduct hypothesis tests of the estimated regression parameters:

t = (b̂1 − b1) ÷ sb̂1

8
Q

Calculate and interpret the predicted value for the dependent variable, and a prediction interval for it, given an estimated linear regression model and a value for the independent variable.

A

A predicted value of the dependent variable, ŷp, is determined by inserting the forecast value of the independent variable, Xp, into the regression equation and calculating ŷp = b̂0 + b̂1Xp

The prediction interval for a predicted Y-value is [ŷp − (tc × Sf) < Y < ŷp + (tc × Sf)]

where Sf is the standard error of the forecast.
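As a sketch, with every input assumed for illustration (estimated coefficients, critical t-value, and standard error of the forecast are all made up):

```python
# Point forecast and prediction interval for a simple regression.
b0_hat, b1_hat = 0.5, 2.0   # assumed estimated coefficients
x_p = 4.0                   # assumed forecast value of the independent variable
y_hat_p = b0_hat + b1_hat * x_p   # point forecast

t_c = 2.306   # assumed two-tailed critical t (e.g., 95% with 8 df)
s_f = 0.75    # assumed standard error of the forecast

lower = y_hat_p - t_c * s_f
upper = y_hat_p + t_c * s_f   # interval: lower < Y < upper
```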

9
Q

What is a Cross-Sectional Regression?

A

Uses many observations of X and Y for different subjects (e.g., different companies, asset classes, investment funds, or countries) at a single point in time.

10
Q

What is Time-Series Regression?

A

Uses many observations from different time periods for the same subject. For example, using monthly data from many years to test whether a country’s inflation rate determines its short-term interest rate.

11
Q

Formulate a multiple regression equation to describe the relation between a dependent variable and several independent variables, and determine the statistical significance of each independent variable.

A

The multiple regression equation specifies a dependent variable as a linear function of two or more independent variables:

Yi = b0 + b1X1i + b2X2i + … + bkXki + εi

The intercept term is the value of the dependent variable when the independent variables are equal to zero. Each slope coefficient is the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant.
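The "holding the other independent variables constant" interpretation can be sketched with made-up coefficients:

```python
# Assumed estimated multiple regression: y = b0 + b1*x1 + b2*x2
b0, b1, b2 = 1.0, 0.5, -2.0

def y_hat(x1, x2):
    return b0 + b1 * x1 + b2 * x2

base = y_hat(3.0, 1.0)     # starting prediction
bumped = y_hat(4.0, 1.0)   # x1 raised one unit, x2 held constant
change = bumped - base     # equals the slope coefficient b1
```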

12
Q

Interpret estimated regression coefficients and their p-values.

A

The p-value is the smallest level of significance for which the null hypothesis can be rejected.

  • If the p-value is less than the significance level, the null hypothesis can be rejected.
  • If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
13
Q

Formulate a null and an alternative hypothesis about the population value of a regression coefficient, calculate the value of the test statistic, and determine whether to reject the null hypothesis at a given level of significance.

A

A t-test is used for hypothesis testing of regression parameter estimates:

tbj = (b̂j − bj) ÷ sb̂j

with n − k − 1 degrees of freedom

Testing for statistical significance means testing H0: bj = 0 vs. Ha: bj ≠ 0.

14
Q

Interpret the results of hypothesis tests of regression coefficients.

A

For a two-tailed test of a regression coefficient, if the t-statistic is between the upper and lower critical t-values, we cannot reject the null hypothesis. We cannot conclude that the regression coefficient is statistically significantly different from the null hypothesis value at the chosen significance level.

If the t-statistic is greater than the upper critical t-value or lower than the lower critical t-value, we can reject the null hypothesis and conclude that the regression coefficient is statistically significantly different from the null hypothesis value at the specified significance level.

15
Q

Calculate and interpret a predicted value for the dependent variable, given an estimated regression model and assumed values for the independent variables.

A

The value of dependent variable Y is predicted as:

Ŷ = b̂0 + b̂1X1 + b̂2X2 + … + b̂kXk

16
Q

Explain the assumptions of a multiple regression model.

A

Assumptions of multiple regression mostly pertain to the error term, εi.

  • A linear relationship exists between the dependent and independent variables.
  • The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
  • The expected value of the error term is zero.
  • The variance of the error terms is constant.
  • The error for one observation is not correlated with that of another observation.
  • The error term is normally distributed.
17
Q

Calculate and interpret the F-statistic, and describe how it is used in regression analysis.

A

The F-distributed test statistic can be used to test the significance of all (or any subset of) the independent variables (i.e., the overall fit of the model) using a one-tailed test:

F = MSR ÷ MSE = (RSS ÷ k) ÷ (SSE ÷ [n − k − 1])

with k and n − k − 1 degrees of freedom

18
Q

Contrast and interpret the R2 and adjusted R2 in multiple regression.

A

The coefficient of determination, R2, is the percentage of the variation in Y that is explained by the set of independent variables.

R2 increases as the number of independent variables increases—this can be a problem.

The adjusted R2 adjusts the R2 for the number of independent variables.

R2a = 1 − [((n − 1) ÷ (n − k − 1)) × (1 − R2)]
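A quick numeric sketch of the adjustment, with made-up values for n, k, and R2:

```python
# Adjusted R-squared penalizes additional independent variables.
n, k = 60, 5      # assumed sample size and number of independent variables
r2 = 0.40         # assumed unadjusted R-squared

adj_r2 = 1 - ((n - 1) / (n - k - 1)) * (1 - r2)
# adj_r2 is below r2 whenever k >= 1
```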

19
Q

Formulate and interpret a multiple regression, including qualitative independent variables.

A

Qualitative independent variables (dummy variables) capture the effect of a binary independent variable:

Slope coefficient is interpreted as the change in the dependent variable for the case when the dummy variable is one.

Use one less dummy variable than the number of categories.

20
Q

Explain how Conditional Heteroskedasticity affects statistical inference.

A

Conditional Heteroskedasticity: Residual variance related to level of independent variables

The Effect: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors.

Detection: Breusch-Pagan chi-square test statistic = n × R2, where R2 is from a regression of the squared residuals on the independent variables

Correction: Use White-corrected standard errors
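The Breusch-Pagan statistic itself is a one-line calculation; the sample size and auxiliary R² below are assumed for illustration:

```python
# Breusch-Pagan test: regress the squared residuals on the independent
# variables, then BP = n * R^2 of that auxiliary regression.
n = 100            # assumed number of observations
r2_resid = 0.08    # assumed R^2 from the auxiliary regression

bp_stat = n * r2_resid          # chi-square distributed with k df
critical_value = 3.841          # chi-square critical value, 1 df, 5% level
heteroskedastic = bp_stat > critical_value   # reject homoskedasticity?
```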

21
Q

Explain how serial correlation affects statistical inference.

A

Serial Correlation: Residuals are correlated

The Effect: Coefficients are consistent. Standard errors are underestimated. Too many Type I errors (positive correlation).

Detection: Durbin-Watson statistic, DW ≈ 2(1 − r), where r is the correlation between consecutive residuals

Correction: Use the Hansen method to adjust standard errors
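The Durbin-Watson statistic can be sketched directly from a residual series (the residuals below are made up; note they drift, suggesting positive serial correlation):

```python
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by the sum of squared residuals.
resid = [0.5, 0.4, 0.2, -0.1, -0.3, -0.2, 0.1, 0.3]

num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
den = sum(e ** 2 for e in resid)
dw = num / den   # near 0 => positive serial correlation; near 2 => none
```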

22
Q

Describe multicollinearity, and explain its causes and effects in regression analysis.

A

Multicollinearity: Two or more independent variables are correlated

The Effect: Coefficients are consistent (but unreliable). Standard errors are overestimated. Too many Type II errors.

Detection: Conflicting t- and F-statistics; high pairwise correlation between the independent variables (a reliable indicator only when k = 2)

Correction: Drop one of the correlated variables

23
Q

Describe how model misspecification affects the results of a regression analysis, and describe how to avoid common forms of misspecification.

A

There are six common misspecifications of the regression model that you should be aware of and able to recognize:

  • Omitting a variable.
  • Variable should be transformed.
  • Incorrectly pooling data.
  • Using lagged dependent variable as independent variable.
  • Forecasting the past.
  • Measuring independent variables with error.

The effects of the model misspecification on the regression results are basically the same for all of the misspecifications: regression coefficients are biased and inconsistent, which means we can’t have any confidence in our hypothesis tests of the coefficients or in the predictions of the model.

24
Q

Interpret an estimated logistic regression.

A

Qualitative dependent variables (e.g., bankrupt versus non-bankrupt) require methods other than ordinary least squares (e.g., logit analysis).

25
Q

Evaluate and interpret a multiple regression model and its results.

A

The values of the slope coefficients suggest the economic meaning of the relationship between the independent and dependent variables. However, the analyst should keep in mind that a regression may have statistical significance even when there is no practical economic significance in the relationship.

26
Q

Calculate and evaluate the predicted trend value for a time series, modeled as either a linear trend or a log-linear trend, given the estimated trend coefficients.

A

A time series is a set of observations for a variable over successive periods of time. A time series model captures the time series pattern and allows us to make predictions about the variable in the future.

27
Q

Describe factors that determine whether a linear or a log-linear trend should be used with a particular time series and evaluate limitations of trend models.

A

A simple linear trend model is: yt = b0 + b1t + εt, estimated for t = 1, 2, …, T.

A log-linear trend model, ln(yt) = b0 + b1t + εt, is appropriate for exponential data.

A plot of the data should be used to determine whether a linear or log-linear trend model should be used.

The primary limitation of trend models is that they are not useful if the residuals exhibit serial correlation.

28
Q

Explain the requirement for a time series to be covariance stationary and describe the significance of a series that is not stationary.

A

A time series is covariance stationary if its mean, variance, and covariances with lagged and leading values do not change over time. Covariance stationarity is a requirement for using AR models.

29
Q

Describe the structure of an autoregressive (AR) model of order p and calculate one- and two-period-ahead forecasts given the estimated coefficients.

A

Autoregressive time series multiperiod forecasts are calculated in the same manner as those for other regression models, but since the independent variable consists of a lagged variable, it is necessary to calculate a one-step-ahead forecast before a two-step-ahead forecast may be calculated. The calculation of successive forecasts in this manner is referred to as the chain rule of forecasting.

A one-period-ahead forecast for an AR(1) would be determined in the following manner:

x̂t+1 = b̂0 + b̂1xt

A two-period-ahead forecast for an AR(1) would be determined in the following manner:

x̂t+2 = b̂0 + b̂1x̂t+1
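The chain rule of forecasting can be sketched with assumed AR(1) coefficients and an assumed current value:

```python
# Chain rule of forecasting for an AR(1) model (all inputs assumed).
b0_hat, b1_hat = 1.0, 0.6
x_t = 5.0   # most recent observed value

x_t1 = b0_hat + b1_hat * x_t    # one-period-ahead forecast
x_t2 = b0_hat + b1_hat * x_t1   # two-period-ahead uses the forecast, not data
```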

30
Q

Explain how autocorrelations of the residuals can be used to test whether the autoregressive model fits the time series.

A

When an AR model is correctly specified, the residual terms will not exhibit serial correlation. If the residuals possess some degree of serial correlation, the AR model that produced the residuals is not the best model for the data being studied and the regression results will be problematic.

The procedure to test whether an AR time-series model is correctly specified involves three steps:

  1. Estimate the AR model being evaluated using linear regression.
  2. Calculate the autocorrelations of the model’s residuals.
  3. Test whether the autocorrelations are significant.
31
Q

Explain mean reversion and calculate a mean-reverting level.

A

A time series is mean reverting if it tends towards its mean over time. The mean reverting level for an AR(1) model is b0 ÷ (1−b1).
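With the same style of assumed AR(1) coefficients, the mean-reverting level is a one-line calculation:

```python
# Mean-reverting level of an AR(1): b0 / (1 - b1), coefficients assumed.
b0_hat, b1_hat = 1.0, 0.6
mean_reverting_level = b0_hat / (1 - b1_hat)
# If x_t is above this level, the model forecasts a decline; below, a rise.
```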

32
Q

Contrast in-sample and out-of-sample forecasts and compare the forecasting accuracy of different time-series models based on the root mean squared error criterion.

A

In-sample forecasts are made within the range of data used in the estimation. Out-of-sample forecasts are made outside of the time period for the data used in the estimation.

The root mean squared error criterion (RMSE) is used to compare the accuracy of autoregressive models in forecasting out-of-sample values. A researcher may have two autoregressive (AR) models, both of which seem to fit the data: an AR(1) model and an AR(2) model. To determine which model will more accurately forecast future values, we calculate the square root of the mean squared error (RMSE). The model with the lower RMSE for the out-of-sample data will have lower forecast error and will be expected to have better predictive power in the future.
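The RMSE comparison can be sketched with made-up out-of-sample forecast errors for the two candidate models:

```python
# Compare two models by root mean squared error of out-of-sample errors.
import math

errors_ar1 = [0.4, -0.3, 0.5, -0.2]   # assumed forecast errors, AR(1)
errors_ar2 = [0.1, -0.2, 0.3, -0.1]   # assumed forecast errors, AR(2)

def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# The model with the lower out-of-sample RMSE is preferred.
better = "AR(2)" if rmse(errors_ar2) < rmse(errors_ar1) else "AR(1)"
```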

33
Q

Explain the instability of coefficients of time-series models.

A

Most economic and financial time series data are not stationary. The degree of the nonstationarity depends on the length of the series and changes in the underlying economic environment.

34
Q

Describe characteristics of random walk processes and contrast them to covariance stationary processes.

A

A random walk time series is one for which the value in one period is equal to the value in the previous period, plus a random error. A random walk process does not have a mean-reverting level and is not covariance stationary.

35
Q

Describe implications of unit roots for time-series analysis, explain when unit roots are likely to occur and how to test for them, and demonstrate how a time series with a unit root can be transformed so it can be analyzed with an AR model.

A

A time series has a unit root if the coefficient on the lagged dependent variable is equal to one. A series with a unit root is not covariance stationary. Economic and finance time series frequently have unit roots. Data with a unit root must be first differenced before being used in a time series model.
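First differencing is itself a simple transformation; the price series below is made up:

```python
# First-difference a series with a unit root so the differences,
# rather than the levels, can be modeled with an AR model.
prices = [100.0, 102.0, 101.5, 104.0, 103.0]
diffs = [prices[t] - prices[t - 1] for t in range(1, len(prices))]
```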

36
Q

Describe the steps of the unit root test for nonstationarity and explain the relation of the test to autoregressive time-series models.

A

To determine whether a time series is covariance stationary, we can (1) run an AR model and examine the autocorrelations of the residuals, and/or (2) perform the Dickey-Fuller test.

37
Q

Explain how to test and correct for seasonality in a time-series model and calculate and interpret a forecasted value using an AR model with a seasonal lag.

A

Seasonality in a time series is tested by calculating the autocorrelations of error terms. A statistically significant lagged error term corresponding to the periodicity of the data indicates seasonality. Seasonality can be corrected by incorporating the appropriate seasonal lag term in an AR model.

If a seasonal lag coefficient is appropriate and corrects the seasonality, the AR model with the seasonal terms will have no statistically significant autocorrelations of error terms.

38
Q

Explain autoregressive conditional heteroskedasticity (ARCH) and describe how ARCH models can be applied to predict the variance of a time series.

A

ARCH is present if the variance of the residuals from an AR model is correlated across time. ARCH is detected by estimating:

ε̂²t = a0 + a1ε̂²t−1 + μt

If a1 is statistically significant, ARCH exists, and the variance of the errors can be predicted using:

σ̂²t+1 = â0 + â1ε̂²t

39
Q

Explain how time-series variables should be analyzed for nonstationarity and/or cointegration before use in a linear regression.

A

When working with two time series in a regression: (1) if neither time series has a unit root, then the regression can be used; (2) if only one series has a unit root, the regression results will be invalid; (3) if both time series have a unit root and are cointegrated, then the regression can be used; (4) if both time series have a unit root but are not cointegrated, the regression results will be invalid.

The Dickey-Fuller test with critical t-values calculated by Engle and Granger is used to determine whether two time series are cointegrated.

40
Q

Determine an appropriate time-series model to analyze a given investment problem and justify that choice.

A

The RMSE criterion is used to determine which forecasting model will produce the most accurate forecasts. The RMSE equals the square root of the average squared error.

41
Q

Describe supervised machine learning, unsupervised machine learning, and deep learning.

A

With supervised learning, inputs and outputs are identified for the computer, and the algorithm uses this labeled training data to model relationships.

With unsupervised learning, the computer is not given labeled data; rather, it is provided unlabeled data that the algorithm uses to determine the structure of the data.

With deep learning, algorithms such as neural networks and reinforcement learning learn from their own prediction errors; they are used for complex tasks such as image recognition and natural language processing.

42
Q

Describe overfitting and identify methods of addressing it.

A

In supervised learning, overfitting results from a large number of independent variables (features), resulting in an overly complex model that fits random noise in the training data, which improves in-sample forecasting accuracy. However, overfit models do not generalize well to new data (i.e., low out-of-sample R-squared).

To reduce overfitting, data scientists use complexity reduction and cross-validation. In complexity reduction, a penalty is imposed to exclude features that do not meaningfully contribute to out-of-sample prediction accuracy. This penalty increases with the number of independent variables used by the model.

43
Q

Describe supervised machine learning algorithms—including penalized regression, support vector machine, k-nearest neighbor, classification and regression tree, ensemble learning, and random forest—and determine the problems for which they are best suited.

A

Supervised learning algorithms include:

Penalized regression. Reduces overfitting by imposing a penalty on and reducing the nonperforming features.

Support vector machine (SVM). A linear classification algorithm that separates the data into one of two possible classifiers based on a model-defined hyperplane.

K-nearest neighbor (KNN). Used to classify an observation based on nearness to the observations in the training sample.

Classification and regression tree (CART). Used for classifying categorical target variables when there are significant nonlinear relationships among variables.

Ensemble learning. Combines predictions from multiple models, resulting in a lower average error rate.

Random forest. A variant of the classification tree whereby a large number of classification trees are trained using data bagged from the same data set.

44
Q

Describe unsupervised machine learning algorithms—including principal components analysis, k-means clustering, and hierarchical clustering—and determine the problems for which they are best suited.

A

Unsupervised learning algorithms include:

  • Principal components analysis. Summarizes the information in a large number of correlated factors into a much smaller set of uncorrelated factors, called eigenvectors.
  • K-means clustering. Partitions observations into k nonoverlapping clusters; a centroid is associated with each cluster.
  • Hierarchical clustering. Builds a hierarchy of clusters without any predefined number of clusters.
45
Q

Describe neural networks, deep learning nets, and reinforcement learning.

A

Neural networks comprise an input layer, hidden layers (which process the input), and an output layer. The nodes in the hidden layer are called neurons, which comprise a summation operator (that calculates a weighted average) and an activation function (a nonlinear function).

Deep learning networks are neural networks with many hidden layers (more than 20), useful for pattern, speech, and image recognition.

Reinforcement learning (RL) algorithms seek to learn from their own errors, thus maximizing a defined reward.

46
Q

Identify and explain steps in a data analysis project.

A

The steps involved in a data analysis project include:

  1. conceptualization of the modeling task
  2. data collection
  3. data preparation and wrangling
  4. data exploration
  5. model training
47
Q

Describe objectives, steps, and examples of preparing and wrangling data.

A

Data cleansing deals with missing, invalid, inaccurate, and non-uniform values as well as with duplicate observations. Data wrangling or preprocessing includes data transformation and scaling. Data transformation types include extraction, aggregation, filtration, selection, and conversion of data. Scaling is the conversion of data to a common unit of measurement. Common scaling techniques include normalization and standardization. Normalization scales variables between the values of 0 and 1, while standardization centers the variables at a mean of 0 and a standard deviation of 1. Unlike normalization, standardization is not sensitive to outliers, but it assumes that the variable distribution is normal.
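The two scaling techniques can be sketched on a small made-up sample:

```python
# Min-max normalization vs. standardization.
import math

x = [2.0, 4.0, 6.0, 8.0]

# Normalization rescales values to [0, 1].
lo, hi = min(x), max(x)
normalized = [(v - lo) / (hi - lo) for v in x]

# Standardization centers at mean 0 with standard deviation 1
# (population standard deviation used here for simplicity).
mean = sum(x) / len(x)
std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
standardized = [(v - mean) / std for v in x]
```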

48
Q

Describe objectives, methods, and examples of data exploration.

A

Data exploration involves exploratory data analysis (EDA), feature selection, and feature engineering (FE). EDA looks at summary statistics describing the data and any patterns or relationships that can be observed. Feature selection involves choosing only those features that meaningfully contribute to the model’s predictive power. FE optimizes the selected features.

49
Q

Describe objectives, steps, and techniques in model training.

A

Before model training, the model is conceptualized where ML engineers work with domain experts to identify data characteristics and relationships. ML seeks to identify patterns in the training data, such that the model is able to generalize to out-of-sample data. Model fitting errors can be caused by using a small training sample or by using an inappropriate number of features. Too few features may underfit the data, while too many features can lead to the problem of overfitting.

Model training involves model selection, model evaluation, and tuning.

50
Q

Describe preparing, wrangling, and exploring text-based data for financial forecasting.

A

Text processing involves removing HTML tags, punctuation, numbers, and white spaces. Text is then normalized by lowercasing words, removal of stop words, stemming, and lemmatization. Text wrangling involves tokenization of text. N-grams is a technique that defines a token as a sequence of words, and is applied when word sequence is important. A bag-of-words (BOW) procedure then collects all the tokens in a document. A document term matrix organizes text as structured data: rows represent documents, columns represent tokens, and cell values reflect the number of times a token appears in a document.

51
Q

Describe methods for extracting, selecting and engineering features from textual data.

A

Summary statistics for textual data include term frequency and co-occurrence. A word cloud is a visual representation of all the words in a BOW, such that words with higher frequency have a larger font size, allowing the analyst to determine which words are contextually more important. Feature selection can use tools such as document frequency, the chi-square test, and mutual information (MI). FE for text data includes identification of numbers, usage of N-grams, named entity recognition (NER), and parts of speech (POS) tokenization.

52
Q

Evaluate the fit of a machine learning algorithm.

A

Model performance can be evaluated by using error analysis. For a classification problem, a confusion matrix is prepared, and evaluation metrics such as precision, recall, accuracy score, and F1 score are calculated.

precision (P) = true positives / (false positives + true positives)

recall (R) = true positives / (true positives + false negatives)

accuracy = (true positives + true negatives) / (all positives and negatives)

F1 score = (2 × P × R) / (P + R)

The receiver operating characteristic (ROC) plots a curve showing the tradeoff between false positives and true positives.

Root mean square error (RMSE) is used when the target variable is continuous.
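The four classification metrics above can be computed directly from confusion-matrix counts (the counts below are assumed for illustration):

```python
# Evaluation metrics from an assumed confusion matrix.
tp, fp, tn, fn = 40, 10, 45, 5

precision = tp / (tp + fp)                   # of predicted positives, share correct
recall = tp / (tp + fn)                      # of actual positives, share found
accuracy = (tp + tn) / (tp + fp + tn + fn)   # share of all predictions correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R
```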