Quant Flashcards

Question

Evaluate limitations of trend models

Answer 1

We may reject the null hypothesis that all the slope coefficients equal zero based on the F-test even though individual slope coefficient t-tests cannot reject the null hypothesis. We may fail to reject the null hypothesis that all the slope coefficients equal zero based on the F-test even though individual slope coefficient t-tests reject the null hypothesis.

Answer 2

Plot the series to determine whether a linear or exponential trend seems most reasonable. Use the developed model for forecasting if the Durbin-Watson statistic indicates no significant serial correlation in the residuals.
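
As a rough illustration of these steps, the sketch below fits a linear trend by ordinary least squares and computes the Durbin-Watson statistic from the residuals. The function names and the sample series are hypothetical, and a log-linear (exponential) trend is fit the same way on the natural log of the series. A statistic near 2 suggests no serial correlation, but significance still has to be judged against Durbin-Watson critical values.

```python
import math


def fit_linear_trend(y):
    """OLS fit of y_t = b0 + b1 * t for t = 1..n; returns (b0, b1, residuals)."""
    n = len(y)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    b1 = sxy / sxx
    b0 = y_mean - b1 * t_mean
    residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
    return b0, b1, residuals


def durbin_watson(residuals):
    """Sum of squared successive residual differences over sum of squared residuals."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Hypothetical series, used only to show the calls.
series = [10.2, 11.9, 14.1, 15.8, 18.3, 19.7]

# Linear trend: y_t = b0 + b1 * t
b0, b1, resid = fit_linear_trend(series)
print("linear trend DW:", durbin_watson(resid))

# Log-linear trend: ln(y_t) = b0 + b1 * t
_, _, log_resid = fit_linear_trend([math.log(v) for v in series])
print("log-linear trend DW:", durbin_watson(log_resid))
```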

Answer 3

Multicollinearity occurs when two or more independent variables (or combinations of independent variables) in a regression model are highly correlated with each other.

Answer 4

Qualitative dependent variables are dummy variables representing a state of being (e.g., bankrupt or not). The probit model estimates the probability of a qualitative condition using a normal distribution while the logit model uses the heavier-tailed and higher kurtosis logistic distribution. Discriminant analysis uses a linear function like regression to create overall scores used to classify observations qualitatively.
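
To make the difference between the two link functions concrete, here is a minimal sketch that maps a linear score z (for example b0 + b1 * x, a hypothetical specification) to a probability under each model. The function names are illustrative, not taken from any library.

```python
import math


def logit_probability(z):
    """Logistic CDF: probability implied by a logit model for linear score z."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_probability(z):
    """Standard normal CDF: probability implied by a probit model for linear score z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Both give 0.5 at z = 0; far from zero the logistic keeps more
# probability in the tails than the normal does.
for z in (-4.0, 0.0, 4.0):
    print(z, logit_probability(z), probit_probability(z))
```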

Answer 5

Heteroskedasticity can lead to false inferences about the independent variable but does not affect the consistency of estimators of regression parameters. Both the F-test and t-tests can become unreliable, the latter due to bias introduced into standard errors of the regression coefficients.

Answer 6

Examine for statistically significant autocorrelation for any residual. Conduct the Dickey-Fuller test for unit root (preferred approach).
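
For reference, the Dickey-Fuller test can be written as follows for an AR(1) series. The symbols (b_0, b_1, g) are notation assumed here rather than taken from the card. Subtracting the lagged value from both sides turns the unit root condition b_1 = 1 into a test of g = 0.

```latex
\begin{aligned}
x_t &= b_0 + b_1 x_{t-1} + \varepsilon_t \\
x_t - x_{t-1} &= b_0 + g\,x_{t-1} + \varepsilon_t, \qquad g = b_1 - 1 \\
H_0 &: g = 0 \ \text{(unit root)} \qquad H_a : g < 0 \ \text{(no unit root)}
\end{aligned}
```

The t-statistic on g is compared with Dickey-Fuller critical values rather than the standard t-distribution values.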

Answer 7

On the basis of their root mean square error (RMSE). The RMSE for each model under consideration is calculated based on out-of-sample data. The model with the lowest RMSE has the lowest forecast error and hence carries the most predictive power.
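
A minimal sketch of this comparison, assuming each model's out-of-sample forecasts are already available as lists. The helper names and the numbers are hypothetical.

```python
import math


def rmse(actual, forecast):
    """Root mean square error between actual values and forecasts."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def best_model(actual, forecasts_by_model):
    """Return the name of the model with the lowest out-of-sample RMSE."""
    return min(forecasts_by_model, key=lambda name: rmse(actual, forecasts_by_model[name]))


# Hypothetical out-of-sample data and forecasts from two candidate models.
actual = [1.0, 2.0, 3.0, 4.0]
forecasts = {
    "model_a": [1.1, 1.9, 3.2, 3.8],
    "model_b": [1.5, 2.5, 2.0, 5.0],
}
print(best_model(actual, forecasts))
```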

Answer 8

The mean, variance, and covariance of the series with itself in the past or future must be constant and finite. Otherwise, model output has no economic meaning.

Answer 9

Use linear regression when 1) neither series (dependent or independent variable) has a unit root or 2) both series have a unit root but are cointegrated (i.e., share a common trend and have bounded divergence over time). Do not use linear regression if 1) either (but not both) series has a unit root or 2) both series have a unit root and are not cointegrated.
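
The rule above amounts to a small truth table. The sketch below encodes it as a hypothetical helper. The inputs are assumed to come from unit root and cointegration tests run beforehand.

```python
def can_use_linear_regression(dep_has_unit_root, indep_has_unit_root, cointegrated=False):
    """Apply the unit root / cointegration rule for regressing one time series on another."""
    if not dep_has_unit_root and not indep_has_unit_root:
        return True  # neither series has a unit root
    if dep_has_unit_root and indep_has_unit_root:
        return cointegrated  # both have unit roots: valid only if cointegrated
    return False  # exactly one series has a unit root


print(can_use_linear_regression(True, True, cointegrated=True))
```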

Answer 10

Hansen's method adjusts the standard errors of the coefficients: the coefficient estimates stay the same, but the standard errors change. With positive serial correlation, the robust standard errors are larger. Alternatively, modify the regression equation to eliminate the serial correlation.

Answer 11

Unstable parameters. No set criteria for determining p and q. Poor forecasting ability.

Answer 12

ARCH models are used to determine whether the variance of the error in one period depends on the variance of the error in previous periods.
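
As a sketch of the simplest case, ARCH(1), the squared residuals from the fitted model are regressed on their own first lag. The coefficient names a_0 and a_1 are notation assumed here.

```latex
\hat{\varepsilon}_t^{2} = a_0 + a_1\,\hat{\varepsilon}_{t-1}^{2} + u_t
```

If a_1 is statistically different from zero, the errors exhibit ARCH(1) effects, and next-period error variance can be predicted from the current squared residual:

```latex
\hat{\sigma}_{t+1}^{2} = \hat{a}_0 + \hat{a}_1\,\hat{\varepsilon}_t^{2}
```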

Answer 13

Data exploration is used to investigate and comprehend data distributions and relationships: Exploratory data analysis (EDA) is the first step in data exploration. Feature selection involves selecting only pertinent data for ML model training; fewer features create less complex models that require less time to train. Feature engineering involves creating new features by changing or transforming existing ones.

Answer 14

A corpus is any collection of raw text data. It can be organized into a table with two columns: "sentence" holds the text and "sentiment" holds the corresponding sentiment class. The separator character (@) splits each record into the text and sentiment class columns.
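
A minimal sketch of this organization, assuming each raw record is one line of the form text@sentiment. The function name and sample lines are hypothetical. Splitting on the last "@" is a design choice made here so that any "@" inside the sentence itself is preserved.

```python
def parse_corpus(lines, separator="@"):
    """Split raw 'text@sentiment' records into rows with sentence and sentiment columns."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank records
        sentence, sentiment = line.rsplit(separator, 1)
        rows.append({"sentence": sentence.strip(), "sentiment": sentiment.strip()})
    return rows


# Hypothetical raw records.
raw = [
    "Profit rose 10% in the second quarter@positive",
    "Sales fell sharply@negative",
]
for row in parse_corpus(raw):
    print(row)
```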

Answer 15

Dataset size: Small datasets can lead to underfitting because they are not sufficient to expose patterns in the data. Number of features: Too few features can lead to underfitting, and too many features can lead to overfitting.

Answer 16

A cleansed and preprocessed dataset is commonly partitioned into three sets in a 60:20:20 ratio: a training set (60%), a cross-validation set (20%), and a test set (20%).

Answer 17

A training set should include approximately 60% of the master dataset. A cross-validation set (CV set), used to tune and validate the model, should constitute approximately 20% of the master dataset. A test set uses the remaining data. The master dataset is split using random sampling techniques; for unsupervised learning, no splitting is needed.
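
One way to sketch the random 60:20:20 split with the standard library is shown below. The function name, the fixed seed, and the use of a simple shuffle as the random sampling technique are all assumptions made for illustration.

```python
import random


def split_dataset(rows, train_frac=0.6, cv_frac=0.2, seed=0):
    """Randomly partition rows into training, cross-validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train_frac)
    n_cv = round(len(shuffled) * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    test = shuffled[n_train + n_cv:]  # whatever remains
    return train, cv, test


train, cv, test = split_dataset(range(100))
print(len(train), len(cv), len(test))
```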

Answer 18

Remove HTML tags: Most text data from web pages contain HTML markup tags. Remove punctuation: Most punctuation is unnecessary, but some may be useful for ML training. Remove numbers: Numbers in the text should be removed or substituted with an annotation such as /number/. Remove white spaces: Extra white spaces should be identified and removed to keep the text intact and clean.
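
The four cleansing steps can be sketched with regular expressions as follows. The patterns, the temporary NUMTOKEN placeholder, and the function name are assumptions of this sketch. It also strips all punctuation, whereas a real pipeline might keep marks that carry meaning for the model.

```python
import re


def cleanse_text(raw):
    """Apply the four cleansing steps to one raw text string."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # 1. remove HTML tags
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUMTOKEN ", text)  # 2. mark numbers with a placeholder
    text = re.sub(r"[^\w\s]", " ", text)                   # 3. remove punctuation
    text = text.replace("NUMTOKEN", "/number/")            # restore the number annotation
    text = re.sub(r"\s+", " ", text).strip()               # 4. collapse extra white space
    return text


print(cleanse_text("<p>Revenue rose 12.5% to $1,200   million!</p>"))
```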

Answer 19

Deciding on the output of the model (i.e., future price movements), how the model will be used, who will use it, and how it will be incorporated into the investment process.

Answer 20

The objective of model training is to minimize forecasting errors: Method selection involves deciding which ML method(s) to use based on the classification task and type and size of data. Performance evaluation uses complementary techniques to quantify and understand model performance. Tuning seeks to improve model performance.

Answer 21

Two overall performance metrics are accuracy and F1 score; high scores suggest good performance. Accuracy is the percentage of correctly predicted classes out of total predictions: Accuracy = (TP + TN)/(TP + FP + TN + FN), where TP, FP, TN, and FN are the counts of true positives, false positives, true negatives, and false negatives. F1 score is the harmonic mean of precision (P) and recall (R): F1 score = (2 × P × R)/(P + R).
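
A short sketch of both metrics computed from confusion-matrix counts. It uses the standard definitions P = TP/(TP + FP) and R = TP/(TP + FN), which the card does not spell out. The counts are hypothetical.

```python
def accuracy(tp, fp, tn, fn):
    """Share of correct predictions out of all predictions."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were predicted positive."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p = precision(tp, fp)
    r = recall(tp, fn)
    return (2 * p * r) / (p + r)


# Hypothetical confusion-matrix counts.
tp, fp, tn, fn = 40, 10, 45, 5
print(accuracy(tp, fp, tn, fn), f1_score(tp, fp, fn))
```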

Answer 22

Incompleteness error: data are missing. Missing values and NAs (not applicable or not available values) must be either omitted, or replaced with “NA” and then deleted or substituted with imputed values. Invalidity error: data are outside of a meaningful range. Inaccuracy error: data are not a measure of true value. Inconsistency error: data conflict with other data points or reality. Non-uniformity error: data are not present in an identical format. Duplication error: duplicate observations are present and should be deleted.

Answer 23

Bias error refers to the extent to which the inferred relationship fits the training data. Algorithms with erroneous assumptions produce high bias from underfitting and high in-sample error, leading to poor predictive value. Variance error reflects how much the model’s results change in response to new data from validation and test samples. Unstable models pick up noise and spurious relationships, resulting in overfitting and high out-of-sample error. Base error arises from randomness in the data.

Answer 24

Numbers are converted into a token such as “/number/.” N-grams are discriminative multi-word patterns with their connection kept intact. For example, a bigram such as “stock market” treats the two adjacent words as one. A named entity recognition (NER) algorithm analyzes individual tokens and their surrounding semantics to tag each token with an object class. Parts-of-speech (POS) tagging uses language structure and dictionaries to tag every token with a corresponding part of speech. Some common POS tags are nouns, verbs, adjectives, and proper nouns.
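
The n-gram idea can be sketched in a few lines. The function name and the underscore used to join adjacent words into a single token are assumptions of this sketch.

```python
def ngrams(tokens, n=2, joiner="_"):
    """Return all runs of n adjacent tokens, each joined into a single token."""
    return [joiner.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["the", "stock", "market", "fell"]
print(ngrams(tokens))        # bigrams
print(ngrams(tokens, n=3))   # trigrams
```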