CFA L2 Quant Flashcards
True or false: With financial instruments, we can typically use a one-factor linear regression model?
False, typically we need a multiple regression model.
Multiple regression model
Regression models that allow us to see the effects of multiple independent variables on one dependent variable.
Ex: Can the 10-year growth in the S&P 500 (dependent variable (Y)) be explained by the trailing dividend payout ratio of the index’s stocks (independent variable 1 (X1)) and the yield curve slope (independent variable 2 (X2))?
What are the uses of multiple regression models?
- Identify relationships between variables.
- Forecast variables. (ex: forecast CFs or forecast probability of default)
- Test existing theories.
Standard error
A statistical measure of how well a sample statistic estimates the corresponding population parameter.
Residual (ε)
The difference between the observed Y value and the predicted Y value (ŷ).
ε = Y - ŷ
OR
Y - (b0 + b1x1 + b2x2 … + bnxn)
P-value
The smallest level of significance for which the null hypothesis can be rejected.
- If the p-value is less than the significance level (α), the null hypothesis can be rejected; if it's greater, we fail to reject it.
If the significance level is 5% and the p-value is .06, do we reject the null hypothesis?
No, we fail to reject the null hypothesis.
Assumptions underlying a multiple regression model:
- A linear relationship exists between the dependent and independent variables.
- The residuals are normally distributed.
- The variance of the error terms is constant.
- The residual of one observation ISN’T correlated w/ another.
- The independent variables ARE NOT random
- There is no linear relationship between any two or more independent variables.
Q-Q plot
A plot used to compare a variable’s distribution to a normal distribution. The residuals should lie along the diagonal line if they follow a normal distribution.
True or false: For a standard normal distribution, only 5% of the observations should fall below -2 standard deviations from 0?
False, only 5% of the observations should fall below -1.65 standard deviations.
Analysis of variance (ANOVA)
A statistical test used to assess the difference between the means of more than two groups. At its core, ANOVA allows you to simultaneously compare arithmetic means across groups. You can determine whether the differences observed are due to random chance or if they reflect genuine, meaningful differences.
- A one-way ANOVA uses one independent variable.
- A two-way ANOVA uses two or more independent variables.
Coefficient of determination (R^2)
The percentage of the total variation in the dependent variable explained by the independent variable(s).
R^2 = SSR/SST
OR
(SST - SSE) / SST
Ex: R^2 of 0.63 means that the model explains 63% of the variation in the dependent variable.
SSR = regression sum of squares. The sum of the squared differences between the predicted values and the mean of the dependent variable; it’s the variation in the dependent variable explained by the independent variable(s).
SSE = sum of squared errors. The sum of the squared residuals (the unexplained variation).
SST = total sum of squares. The total variation in the dependent variable; SST = SSR + SSE.
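A minimal Python sketch of the sum-of-squares decomposition and R^2 defined above, using hypothetical data and plain NumPy (the numbers are illustrative only):

```python
import numpy as np

# Hypothetical data: one independent variable x and a dependent variable y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 6.1])

b1, b0 = np.polyfit(x, y, 1)          # OLS slope and intercept
yhat = b0 + b1 * x                    # predicted values

sst = np.sum((y - y.mean()) ** 2)     # total variation
sse = np.sum((y - yhat) ** 2)         # unexplained variation
ssr = np.sum((yhat - y.mean()) ** 2)  # explained variation

r2 = ssr / sst                        # equals (sst - sse) / sst for an OLS fit with an intercept
print(f"SST={sst:.4f}  SSE={sse:.4f}  SSR={ssr:.4f}  R^2={r2:.4f}")
```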
Adjusted R^2
Since R^2 never decreases (and almost always increases) as more independent variables are added to the model, we must adjust it to penalize additional variables.
- If adding an additional independent variable causes the adjusted R^2 to decrease, it’s not worth adding that variable.
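A small sketch of the standard adjustment formula, assuming n observations and k independent variables (the R^2 values below are hypothetical):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - [(n - 1) / (n - k - 1)] * (1 - R^2)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Adding a variable that barely raises R^2 can leave adjusted R^2 flat or lower
print(adjusted_r2(0.63, n=40, k=3))   # ~0.599
print(adjusted_r2(0.64, n=40, k=4))   # ~0.599 -> the extra variable adds essentially nothing
```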
Overfitting
When R^2 is high because there is a large # of independent variables, rather than strong explanatory power.
Akaike’s information criterion (AIC)
Compares multiple regression models to determine which will forecast best.
Calculation: (n * ln(SSE/n)) + 2(k+1)
- Lower values indicate a better model.
- Higher k values result in higher values of the criteria.
Schwarz’s Bayesian information criteria (BIC)
Looks at multiple regression models and determines which has a better goodness of fit.
Calculation: (n * ln(SSE/n)) + (ln(n)*(k+1))
- Lower values indicate a better model.
- Higher k values result in higher values of the criteria.
- BIC imposes a higher penalty for overfitting than AIC.
- AIC and BIC are alternatives to R^2 and adjusted R^2 to determine the quality of the regression model.
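A small sketch computing both criteria from SSE, n, and k using the formulas above (the SSE values and model labels are hypothetical):

```python
import math

def aic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + 2 * (k + 1)

def bic(sse: float, n: int, k: int) -> float:
    return n * math.log(sse / n) + math.log(n) * (k + 1)

# Hypothetical comparison of two candidate models fit to the same data (n = 60)
for label, sse, k in [("Model A", 120.0, 3), ("Model B", 112.0, 5)]:
    print(label, round(aic(sse, n=60, k=k), 2), round(bic(sse, n=60, k=k), 2))
# Lower values are better; BIC penalizes Model B's extra variables more heavily (ln(60) > 2).
```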
Nested models
Models in which one model (the restricted model) uses a subset of the independent variables of another model (the full, or unrestricted, model).
Full model vs restricted model
Full model= A linear regression model that uses all k independent variables
Restricted model= A linear regression model that only uses some of the k independent variables
Joint F-Test
Measures how well a set of independent variables, as a group, explains the variation in the dependent variable. Put simply, it tests overall model significance.
Calculation: [ (SSErestricted - SSEunrestricted) / Q ] / [ (SSEunrestricted) / (n - k - 1) ]
* Q = # of excluded variables in the restricted model.
* Decision rule: reject the null hypothesis if F-stat > F critical value.
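A minimal sketch of the joint F-test using the formula above. The SSE values are hypothetical, and SciPy's F distribution supplies the one-tailed critical value:

```python
from scipy.stats import f

# Hypothetical inputs
sse_restricted = 150.0    # SSE of the model that omits q variables
sse_unrestricted = 120.0  # SSE of the full model with k independent variables
n, k, q = 60, 5, 2        # observations, variables in the full model, excluded variables

f_stat = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))
f_crit = f.ppf(0.95, dfn=q, dfd=n - k - 1)   # one-tailed test at 5% significance

print(f"F-stat = {f_stat:.2f}, critical value = {f_crit:.2f}")
print("Reject H0 (excluded variables are jointly significant)" if f_stat > f_crit
      else "Fail to reject H0")
```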
True or false: We could also use a t-test to evaluate the significance to see which variables are significant?
True, but the F-test provides a more meaningful evaluation since there is likely some amount of correlation among independent variables.
True of false: The F-test will tell us if at least one of the slope coefficients in a multiple regression model is statistically different from 0?
TRUE
True or false: When testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a two tailed test?
False, when testing the hypothesis that all the regression coefficients are simultaneously equal to 0, the F-test is always a one tailed test.
True or false: We can use the regression equation to make predictions about the dependent variable based on forecasted values of the independent variable?
True, we can make predictions.
Predicting the dependent variable from forecasted values of the independent variable:
ŷ = predicted value of the intercept + (X1 * estimated slope coefficient for X1) + (X2 * estimated slope coefficient for X2)…
Functional form misspecifications (A regression suffers from misspecification of the functional form when the functional form of the estimated regression model differs from the functional form of the population regression function):
- Omission of important independent variables: may lead to biased and inconsistent regression parameters OR serial correlation or heteroskedasticity in the residuals.
- Inappropriate variable form (ex: you may need to take the natural log of a variable): may lead to heteroskedasticity in the residuals. This can happen if there is no linear relationship between the independent & dependent variables.
- Inappropriate variable scaling (ex: common-size financial statements): May lead to heteroskedasticity in the residuals or multicollinearity.
- Data improperly pooled: May lead to heteroskedasticity or serial correlation in the residuals.
Heteroskedasticity
When the variance of the residuals is not constant across all observations in the sample. This happens when there are subsamples that are more spread out than the rest of the sample.
Unconditional heteroskedasticity
When the heteroskedasticity is not related to the level of the independent variables, meaning the variance of the residuals does not change systematically with the values of the independent variables.
- Although it’s a violation of our assumptions, it is usually not a big problem.
Conditional heteroskedasticity
Heteroskedasticity that is related to the level of the independent variables. Creates significant problems for statistical inference if not corrected properly.
- Conditional heteroskedasticity DOES NOT affect the slope coefficients. It DOES affect the computed F-stat and computed t-stat.
Effects of conditional heteroskedasticity
- If the pattern of heteroskedasticity is low (most observations on the plot have low residual variance): the standard errors of the coefficients become unreliable, usually by being underestimated. This makes the t-stats too large too often, so the null is rejected too often, a.k.a. Type I error.
- For the F-test (MSR/MSE), MSE is underestimated, so the F-stat is often too large, again leading to the null being rejected too often, a.k.a. Type I error.
- If the pattern of heteroskedasticity is high (most observations on the plot have high residual variance): the same errors occur, but in the opposite direction.
How to detect conditional heteroskedasticity
There are two methods of detection: examining scatter plots of the residuals and by using the Breusch-Pagan chi-square test.
How to use scatterplots to detect heteroskedasticity?
Look at a scatterplot of the residuals vs the independent variables. If the variation is constant there is no heteroskedasticity. If it’s not constant, there is heteroskedasticity.
Breusch-Pagan Chi-Square (BP) Test
A test used to detect heteroskedasticity. The BP test calls for the squared residuals (as the dependent variable) to be regressed on the original set of independent variables. If conditional heteroskedasticity is present, the independent variables will significantly contribute to the explanation of the variability in the squared residuals.
- We want a small R^2 from this regression (the BP test statistic = n × R^2 of the squared-residual regression).
- This is a one-tailed test because we are only concerned w/ large values.
- Use a chi-square dist. with k df
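A minimal sketch of the BP test done by hand with NumPy: regress the squared residuals on the original independent variables, then compare n × R^2 to a chi-square critical value with k df. The data below are simulated purely for illustration:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))
resid = rng.normal(size=n) * (1 + 0.5 * np.abs(X[:, 0]))  # hypothetical heteroskedastic residuals

# Auxiliary regression: squared residuals on the original independent variables (with intercept)
X1 = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X1, resid ** 2, rcond=None)
fitted = X1 @ beta
r2 = 1 - np.sum((resid ** 2 - fitted) ** 2) / np.sum((resid ** 2 - np.mean(resid ** 2)) ** 2)

bp_stat = n * r2                       # BP test statistic
crit = chi2.ppf(0.95, df=k)            # one-tailed chi-square critical value, k df
print(bp_stat, crit, bp_stat > crit)   # True -> reject H0 of no conditional heteroskedasticity
```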
How to correct heteroskedasticity?
We can use robust standard errors (also called White-corrected or heteroskedasticity-consistent standard errors).
Serial correlation/autocorrelation
When residuals are correlated with each other.
- Poses serious problems when using time series data.
Positive serial correlation
When a positive residual in one time period increases the probability of observing a positive residual in the next time period.
- This type of correlation typically results in coefficient standard errors that are too small, causing T-stats or F-stats to be too large, which will lead to type 1 errors.
Effect of serial correlation on model parameters
If one of the independent variables is a lagged value of the dependent variable, serial correlation causes the estimates of the slope coefficients to be inconsistent. If there is no such lag, the slope coefficient estimates remain consistent (though the standard errors are still affected).
How to detect serial correlation?
First, we can use a scatter plot, though this reveals only very dramatic cases. We can also use the Durbin-Watson (DW) statistic or a Breusch-Godfrey (BG) test. The DW statistic detects serial correlation at a single lag, whereas the BG test detects serial correlation at multiple lags.
- The lower limit for the DW table is 15 observations.
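A minimal sketch of computing the DW statistic from a series of residuals, assuming the standard formula DW = Σ(e_t − e_(t−1))² ÷ Σe_t² (a value near 2 indicates no first-order serial correlation); the residual series below are simulated:

```python
import numpy as np

def durbin_watson(resid: np.ndarray) -> float:
    """DW = sum of squared successive residual differences / sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(1)
white_noise = rng.normal(size=200)             # uncorrelated residuals
print(durbin_watson(white_noise))              # ~2.0
print(durbin_watson(np.cumsum(white_noise)))   # strongly positively correlated -> near 0
```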
Breusch-Godfrey (BG) Test
The BG Test regresses the residuals against the original set of independent variables, plus one or more additional variables representing lagged residuals.
Calculation: ε_t = a0 + a1X1t + a2X2t + … + ρ1ε_(t-1) + … + ρpε_(t-p) + u_t
- The null under the BG test is that there is no serial correlation (i.e., ρ1 = 0).
How to correct for serial correlation?
We can calculate robust standard errors, i.e., Newey-West corrected standard errors (also called heteroskedasticity- and serial-correlation-consistent standard errors).
Multicollinearity
When independent variables in a multiple regression are correlated w/ each other
- This inflates standard errors and lowers t-stats leading to the null failing to be rejected more often (type 2 error).
- Also causes the model’s coefficients to become unreliable.
- Multicollinearity has no effect on an F-stat
Effect of multicollinearity on model parameters
Multicollinearity DOES NOT affect the consistency of slope coefficients. Multicollinearity DOES make those estimates imprecise and unpredictable.
How to detect multicollinearity?
The most easily observable sign is when the t-tests indicate that none of the individual coefficients is significantly different from zero, but the F-test indicates that at least one coefficient is statistically significant and the R^2 is high. Individually the variables appear to explain little of the variation in the dependent variable, yet together they explain a lot; their high correlation with each other washes out the individual effects. More formally, we compute a variance inflation factor (VIF) for each of the independent variables.
Variance inflation factor (VIF)
Estimates how much the variance of an estimated regression coefficient is inflated by multicollinearity. We start by regressing one of the independent variables (making it the dependent variable) against the remaining independent variables.
VIF= 1 / (1 - Rj^2)
* A VIF of 1 (the minimum value) indicates the variable is uncorrelated with the other independent variables.
* VIF values >5 indicate further investigation.
* VIF values >10 indicate high correlation.
Rj^2 is the R^2 from regressing independent variable j on the remaining independent variables.
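A minimal sketch of the VIF computation: regress each independent variable on the others and apply VIF = 1 / (1 − Rj^2). The data are simulated, with x2 deliberately made collinear with x1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)   # deliberately collinear with x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def vif(X: np.ndarray, j: int) -> float:
    """Regress column j on the remaining columns (with intercept) and return 1/(1 - Rj^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    rj2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - rj2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])  # x1 and x2 show high VIFs
```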
How to correct multicollinearity?
The most common method to correct for multicollinearity is to omit one or more of the highly correlated independent variables. You can also use a proxy for one of the variables or increase the sample size.
True or false: The coefficient on a variable in a multiple regression is the amount of return attributable to the variable?
TRUE
True or false: Using actual instead of expected inflation will improve model specification?
False, using actual instead of expected inflation is likely to result in model misspecification.
Outliers vs high-leverage points
Outliers: Extreme observations in the dependent (Y) variable
High-leverage points: Extreme observations in the independent (X) variable
Leverage (in statistics)
A way of identifying extreme observations in the independent variable. Leverage measures the distance of the ith observation of an independent variable from its sample mean. Leverage values lie between 0 and 1; the closer to 1, the farther the observation is from the mean. If an observation’s leverage is greater than three times the average leverage (3 × (k+1)/n), it is considered potentially influential.
Studentized residuals
An alternative to leverage for identifying outliers. The studentized residual is the # of standard deviations a data point is from the regression line: for each data point, the residual ÷ its standard deviation gives the studentized residual. There are four main steps to this process:
1. Estimate the regression model using the original sample size and then delete one observation and re-estimate the regression. Perform this sequentially deleting a new observation each time.
2. Compare the actual Y values of the deleted observation to the predicted y-values. ei= Y-ŷ
3. The studentized residual is the residual in #2 ÷ standard deviation. t= ei / s
4. Compare the studentized residuals to critical values in a t-table using n-k-2 df. Points that fall in the rejection region are termed outliers and potentially influential.
Influential data points
Extreme observations that, when excluded, cause a significant change to model coefficients.
True or false: All outliers and high-leverage points are influential on the regression?
FALSE
Cook’s Distance
A composite metric for evaluating if a high leverage and/or outlier is influential. Cook’s distance measures how much the estimated values of the regression change if certain high leverage points or outliers are deleted from the sample.
Calculation:
D_i = [ e_i^2 / ((k+1) * MSE) ] * [ h_i / (1 - h_i)^2 ]
* h_i = leverage value for the ith observation
* e_i = the residual for the ith observation
- Values greater than √(k/n) indicate the observation is highly likely to be an influential data point.
- Generally, values > 1 indicate highly influential, whereas values > 0.5 indicate the need for further investigation.
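A minimal sketch of Cook's distance using the formula above, with leverages taken from the hat matrix of a small simulated regression (one observation is deliberately distorted so it stands out):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])   # intercept + k variables
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[0] += 8.0                                    # plant a potential outlier

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
mse = np.sum(resid ** 2) / (n - k - 1)
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix; leverages are on its diagonal
h = np.diag(H)

cooks_d = (resid ** 2 / ((k + 1) * mse)) * (h / (1 - h) ** 2)
print(np.argmax(cooks_d), cooks_d.max())       # observation 0 should stand out
```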
Dummy variables
Binary variables with only two options
- When assigning a numerical value, it can only be 0 and 1.
- Always use (n-1) dummy variables to avoid multicollinearity (i.e., 3 dummy variables for 4 quarters in a year).
- Ex: true/false; on/off
Dummy variables example:
EPS for four quarters:
EPS = 1.25 + 0.75Q1 - 0.20Q2 + 0.10Q3
Question 1: What is the predicted EPS for Q4?
Answer 1: EPS = 1.25 + 0.75(0) - 0.20(0) + 0.10(0) = 1.25
* omitted quarter shows as the intercept
Question 2: What is the predicted value for Q1?
Answer 2: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
Question 3: What is the predicted EPS for Q1 of next year?
Answer 3: EPS = 1.25 + 0.75(1) - 0.20(0) + 0.10(0) = 2.00
* This simple model uses average EPS for any specific quarter over the past ten years as a forecast of EPS in its respective quarter of the following year.
Logistic regression (logit) model
Estimates the probability of a DISCRETE binary variable occurring.
Calculation: ln(p/(1-p)) = b0 + b1x1 + b2x2 … + ε
* The intercept value is an estimate of log odds when the values of all independent variables is zero.
* The change in log odds when one of the independent variables change is dependent on the curvature of the function.
* Odds = p/(1-p) = e^ŷ
* Probability = odds / (1 + odds) = 1 / (1 + e^(-ŷ))
- Logit models assume that residuals have a logistic distribution- similar to a normal distribution but with fatter tails.
- Logit models are nonlinear
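A minimal sketch converting a fitted log-odds value into odds and a probability, per the relationships above (the coefficients and inputs are hypothetical):

```python
import math

# Hypothetical fitted logit model: log-odds = b0 + b1*x1 + b2*x2
b0, b1, b2 = -2.0, 0.8, 1.5
x1, x2 = 1.2, 0.5

log_odds = b0 + b1 * x1 + b2 * x2
odds = math.exp(log_odds)           # odds = p / (1 - p)
prob = odds / (1 + odds)            # equivalently 1 / (1 + e^(-log_odds))

print(round(log_odds, 3), round(odds, 3), round(prob, 3))
```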
Likelihood ratio (LR) test
Similar to joint F-test but for logit models. Measures the goodness of fit of a logit model.
Calculation= -2 * (log likelihood restricted model - log likelihood unrestricted model).
- Recall, the restricted model has fewer independent variables.
- Log-likelihood values are always negative; values closer to 0 indicate a better-fitting model.
- The LR test statistic itself is non-negative and is compared to a chi-square distribution.
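A small sketch of the LR statistic computed from two hypothetical log-likelihoods and compared to a chi-square critical value with df equal to the number of restricted variables:

```python
from scipy.stats import chi2

ll_restricted = -260.4     # hypothetical log-likelihood, logit model with fewer variables
ll_unrestricted = -254.1   # hypothetical log-likelihood, full logit model
q = 2                      # number of restricted (excluded) variables

lr_stat = -2 * (ll_restricted - ll_unrestricted)   # = 12.6; always non-negative
crit = chi2.ppf(0.95, df=q)
print(lr_stat, crit, lr_stat > crit)               # True -> the extra variables improve fit
```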
Time-series data
A set of observations taken periodically (most often at equal intervals) at different points in time.
- A key feature of a time series is that new data can be added w/o affecting the existing data.
- Trends can be found by plotting these observations on a graph.
Linear trend
One of two broad types of trend models. A time-series trend that can be graphed using a straight line. The independent variable is time. A downward-sloping line indicates a negative trend, and an upward-sloping line a positive trend.
Simplest form: y_t = b0 + b1(t) + ε
Log-linear trend model
One of two broad types of trend models. Used to model positive or negative exponential growth, i.e., growth at some constant rate. Exponential growth plots as a convex curve.
Simplest form: y_t = e^(b0 + b1(t))
* b1 is the constant rate of growth.
* Rather than trying to fit the nonlinear data with a linear (straight-line) regression, we take the natural log of both sides and transform it into a linear trend line called the log-linear model. This improves the predictive ability of the model.
Form: ln(y_t) = b0 + b1(t) + ε
- Financial time series data is often modeled using log-linear trend models.
How to determine if a linear or log-linear trend model should be used?
Plot the data. A linear trend model may be used if the data points are equally distributed above and below the regression line (ex: inflation data is usually modeled with a linear trend model). If, when plotted, the data plots with a curved shape, use a log-linear trend model (ex: financial data- stock indices and stock prices- are often modeled with log-linear trend models).
- If there is serial correlation, we will use an autoregressive model.
True or false: For a time series model without serial correlation, the DW statistic should be approximately equal to 0?
False, for a time series model without serial correlation, the DW statistic should be approximately equal to 2. A DW that significantly differs from 2 suggests that the residuals are correlated.
Autoregressive (AR) model
A time-series model that regresses the dependent variable against one or more lagged values of itself.
Ex: A regression of the sales of a firm against the sales of the firm in the previous month. In this model, past values are used to predict the current value of the variable.
Simplest form: x_t = b0 + b1·x_(t-1) + … + bp·x_(t-p) + ε
* Xt= value of time series at time t
* X_t-1= value of time series at time t-1
- DW test stat cannot be used to test for serial correlation in AR model.
Covariance stationary
An AR model is covariance stationary if:
* There is a constant and finite expected value: the expected value is constant over time.
* Constant and finite variance: the volatility around the time series’ mean is constant over time.
* The covariance between any two observations w/ equal distance apart will be equal.
True or false: A nonstationary time series can still produce meaningful results sometimes?
False, we need stationary covariance. A nonstationary time series will produce meaningless results.
True or false: We can use a DW or BG test to test for serial correlation in AR models?
False, we must use a t-test
- We can use a DW or BG test for a TREND model.
T-stat for residual autocorrelations in AR model:
correlation of the error term with the kth lagged error term ÷ (1 ÷ √n)
Standard error = 1 ÷ √n
* n = # of observations; compare the t-stat to critical values with n - 2 degrees of freedom.
- If data is monthly, check for 12 lags to see if there’s serial correlation. If quarterly, check for 4 lags.
- When there is statistically significant serial correlation in an AR model, it means that the model is incomplete. There’s still some pattern of data in the residuals that the model has failed to reveal.
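A minimal sketch of the t-test for residual autocorrelation at a given lag in an AR model, assuming a standard error of 1/√n as above. The residual series is simulated white noise, so most t-stats should be small:

```python
import numpy as np

def residual_autocorr_t(resid: np.ndarray, lag: int) -> float:
    """t-stat = autocorrelation at the given lag / (1 / sqrt(n))."""
    n = len(resid)
    r = np.corrcoef(resid[:-lag], resid[lag:])[0, 1]
    return r / (1 / np.sqrt(n))

rng = np.random.default_rng(4)
resid = rng.normal(size=120)                                          # hypothetical monthly AR-model residuals
t_stats = [residual_autocorr_t(resid, lag) for lag in range(1, 13)]   # check 12 lags for monthly data
print([round(t, 2) for t in t_stats])                                 # |t| > ~2 would flag serial correlation
```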
Mean reversion
When a time-series has a tendency to move towards its mean. In other words, the dependent variable has a tendency to decline when the current value is above the mean and rise when the current value is below the mean. If a time series is at its mean reverting level, the model predicts the next value of the time series will be the same as its current value.
Mean reverting level calculation
Xt = b0 ÷ (1 - b1)
- The model will not be covariance stationary if b1 = 1
- If Xt > than the mean reverting level, the model predicts that x_t+1 will be lower than Xt and vice versa.
- All covariance stationary time series have a finite mean-reverting level.
- As forecasts become more distant, the value of the forecast will be closer to the mean reverting level.
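A small sketch showing AR(1) forecasts converging toward the mean-reverting level b0 ÷ (1 − b1), with hypothetical coefficients:

```python
b0, b1 = 1.0, 0.6
mean_reverting_level = b0 / (1 - b1)    # = 2.5

x = 5.0                                 # current value above the mean-reverting level
forecasts = []
for _ in range(6):
    x = b0 + b1 * x                     # each one-step forecast moves closer to 2.5
    forecasts.append(round(x, 3))
print(mean_reverting_level, forecasts)  # [4.0, 3.4, 3.04, 2.824, 2.694, 2.617]
```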
In-sample forecasts
Forecasts that are within the range of data used to estimate the model. This is where we compare how accurate our model is in forecasting the actual data we used to develop the model.
Out-of-sample forecasts
Forecasts that are made outside of the sample period. This is where we compare how accurate a model is in forecasting the y-variable value for a time period outside the period used to develop the model.
Root mean squared error (RMSE)
Used to compare the accuracy of autoregressive models in forecasting out-of-sample values.
Ex: We have two AR models. To determine which model will more accurately forecast future values, we calculate the RMSE for the out-of-sample data.
- The model with the lower RMSE for the out-of-sample data will have lower forecast error and will be expected to have better predictive power in the future.
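A minimal sketch of RMSE computed on out-of-sample data for two hypothetical models' forecasts:

```python
import numpy as np

def rmse(actual: np.ndarray, forecast: np.ndarray) -> float:
    """Root mean squared error: sqrt of the mean squared forecast error."""
    return np.sqrt(np.mean((actual - forecast) ** 2))

actual = np.array([3.2, 3.5, 3.1, 3.8, 4.0])          # hypothetical out-of-sample values
model_a = np.array([3.0, 3.6, 3.3, 3.6, 4.2])
model_b = np.array([3.5, 3.1, 2.8, 4.2, 3.6])

print(rmse(actual, model_a), rmse(actual, model_b))   # the lower RMSE suggests better predictive power
```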
True or false: Financial and economic time series inherently exhibit some form of instability or nonstationarity.
True. Since financial/economic conditions are dynamic, the coefficients in one period may differ from those in another period. Models estimated over shorter time periods are usually more stable for this reason. When selecting a time-series sample, analysts should understand regulatory changes, changes to the economic environment, etc. If there have been large changes, the model may not be accurate.
True or false: There is a trade-off between statistical reliability in the long run and statistical stability in the short run?
True. Longer sample periods give more data and thus more statistical reliability, but shorter sample periods are more likely to be stable (the coefficients are less likely to have shifted over the period).
Random walk
When, in an AR model, the value of the dependent variable in one period is equal to the value of the series in the previous period plus a random error term.
Form: Xt = X_t-1 + ε
* b0 = 0
* b1 = 1
Random walk with a drift
The same concept as a random walk, but the intercept term is not equal to zero. Thus, the time series is expected to change each period by the drift (intercept) term plus the error term.
Form: Xt = b0 + X_t-1 + ε
* b1 = 1
True or false: A random walk with or w/o a drift is NOT covariance stationary?
True, random walks will always have a unit root which makes them not covariance stationary.
Why are unit roots problematic?
A unit root is when b1 = 1. If this occurs, then the mean reverting level (b0 ÷ (1 - b1)) is undefined.
How to determine whether a time series is covariance stationary:
- We can run an AR model and examine autocorrelations
- Perform a Dickey-Fuller test
- We cannot use a standard t-test of b1 = 1 (the Dickey-Fuller test transforms the regression and uses its own critical values).