Quant Flashcards

Question

Log regression model & Log-linear model

Answer 1

A regression that expresses the dependent and independent variables as natural logarithms & A time-series model in which the growth rate of the time series as a function of time is constant

Answer 2

A qualitative-dependent-variable multiple regression model based on the logistic probability distribution

Answer 3

With reference to regression, the set of variables included in the regression and the regression equation’s functional form

Answer 4

Regression assumption violation that occurs when two or more independent variables are highly (but not perfectly) correlated with each other

Answer 5

Serial correlation in which a positive error for one observation increases the chance of a negative error for another observation

Answer 6

The property of having characteristics such as mean and variance that are not constant through time

Answer 7

Serial correlation in which a positive error for one observation increases the chance of a positive error for another observation, and a negative error increases the chance of a negative error

Answer 8

A qualitative-dependent-variable multiple regression model based on the normal distribution

Answer 9

Dummy variables used as dependent variables rather than as independent variables

Answer 10

Time series in which the value of the series in one period is the value of the series in the previous period plus an unpredictable random error. In an AR(1) regression model, a random walk will have an estimated intercept coefficient (b0) near zero and a slope coefficient (b1) near 1

Answer 11

Standard errors of the estimated parameters of a regression that correct for the presence of heteroskedasticity in the regression’s error term

Answer 12

Errors that are correlated across observations in a regression model; correlation of a time series with its own past values

Answer 13

Error terms that are not correlated with the values of the independent variables in the regression model

Answer 14

A time series regressed on its own past values, in which the independent variable is a lagged value of the dependent variable

Answer 15

The two-period-ahead forecast is determined by first solving for the one-period-ahead forecast and then substituting it into the two-period-ahead forecast model
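
The substitution described above can be sketched for an AR(1) model of the form x(t+1) = b0 + b1 * x(t). The function name and the coefficient values are hypothetical and chosen only for illustration.

```python
def ar1_forecast(b0, b1, x_t, steps):
    """Forecast an AR(1) series `steps` periods ahead using the chain rule."""
    forecast = x_t
    path = []
    for _ in range(steps):
        # Each forecast becomes the input for the next period's forecast.
        forecast = b0 + b1 * forecast
        path.append(forecast)
    return path


# Hypothetical coefficients and latest observation, for illustration only.
one_ahead, two_ahead = ar1_forecast(b0=1.0, b1=0.5, x_t=4.0, steps=2)
print(one_ahead, two_ahead)  # 3.0 2.5
```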

Answer 16

Two time series that have a long-term financial or economic relationship such that they do not diverge from each other without bound in the long run

Answer 17

A time series whose expected value and variance are constant and finite in all periods, and whose covariance with itself for a fixed number of periods in the past or future is constant and finite in all periods

Answer 18

A transformation that subtracts the value of the time series in period t-1 from its value in period t
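
A minimal sketch of this transformation on a plain Python list; the function name and the sample values are assumptions made for illustration.

```python
def first_difference(series):
    """Return x(t) - x(t-1) for every t after the first observation."""
    return [curr - prev for prev, curr in zip(series, series[1:])]


levels = [100, 103, 101, 106]  # hypothetical series levels
print(first_difference(levels))  # [3, -2, 5]
```

The differenced series is one observation shorter than the original, because the first value has no predecessor.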

Answer 19

Residuals from a fitted time-series model within the same period used to fit the model

Answer 20

A trend in which the dependent variable changes at a constant rate with time

Answer 21

1. Run an AR model and examine autocorrelations | 2. Perform the Dickey-Fuller test

Answer 22

A pattern in a time-series that tends to repeat from year to year. E.g. monthly sales data for a retailer (Christmas season each year will have similar results all else constant)

Answer 23

To adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable

Answer 24

1. Calculate the relative valuation ratio for the comparable companies to determine their mean 2. Apply the mean of each ratio to the valuation variables of the target company to get estimated stock price for each valuation variable 3. Take the mean of the estimated stock price to arrive at your answer

Answer 25

1. Calculate the relative valuation ratios based on acquisition price and their mean 2. Multiply target company valuation variables with the mean multiples calculated in step 1 3. Calculate the mean estimated stock price to arrive at your answer

Answer 26

Occurs when examining a single time series, such as an AR model. Exists if the variance of the residuals in one period depends on the variance of the residuals in a previous period. Correct by using regression procedures that adjust for heteroskedasticity (e.g., generalized least squares)

Answer 27

Linear regression can be used if: 1. Both time series are covariance stationary 2. Neither time series is covariance stationary but the two series are cointegrated

Answer 28

1. Linear/Penalized regression 2. Logistic 3. CART 4. Logit 5. SVM 6. KNN 7. Ensemble & Random Forest

Answer 29

1. Principal Components Analysis | 2. Clustering

Answer 30

1. Conceptualization of the modeling task 2. Data Collection 3. Data Preparation and wrangling (*critical) 4. Data Exploration 5. Model Training

Answer 31

1. Text problem formulation 2. Data collection 3. Text preparation & wrangling (*critical) 4. Text exploration

Answer 32

Part of a neural network’s node that transforms the total net input into the final output. The activation function operates like a light dimmer switch that decreases or increases the strength of the input.

Answer 33

Mnemonic: think “conglomerate”. Clustering method that starts with each observation as its own cluster and then combines the two closest clusters (by distance) into one larger cluster. This process is repeated until all observations are clumped into a single large cluster

Answer 34

Process of adjusting weights in a neural network by moving backward through the network’s layers to reduce total error

Answer 35

Model error due to randomness in the data

Answer 36

Occurs with underfitting, when a model with one or very few features produces a poor approximation and high in-sample error.

Answer 37

Process where original training data set is used to generate n new training data sets. Data can overlap between data sets. Helps improve the stability of the predictions and reduce chances of overfitting a regression
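
The resampling step can be sketched as below, assuming sampling with replacement (the usual bootstrap convention, which the card does not state explicitly). The function name and seed are hypothetical.

```python
import random


def bootstrap_samples(data, n_sets, seed=0):
    """Generate n_sets new training sets by sampling the original with replacement."""
    rng = random.Random(seed)
    # Each new set has the same size as the original; observations may repeat
    # within a set and overlap across sets.
    return [rng.choices(data, k=len(data)) for _ in range(n_sets)]


original = [1, 2, 3, 4, 5]  # hypothetical training observations
for sample in bootstrap_samples(original, n_sets=3):
    print(sample)
```

A model would then be trained on each new set and the predictions averaged or put to a vote.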

Answer 38

The center of a cluster formed using the K-means clustering algorithm

Answer 39

Supervised machine learning technique commonly applied to binary classification or regression. Categorical target variable = classification tree; continuous target variable = regression tree

Answer 40

A variable that combines two or more variables that are statistically strongly related to each other.

Answer 41

Technique for estimating out-of-sample error directly by determining the error in validation samples

Answer 42

Algorithms based on complex neural networks that address highly complex tasks like image classification, face recognition, speech recognition, and natural language processing

Answer 43

A type of tree diagram that highlights the hierarchical relationships among the clusters

Answer 44

Set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation

Answer 45

The opposite of agglomerative clustering: a method that starts with all observations belonging to a single large cluster, which is then divided into two clusters based on a measure of distance. The process repeats until each cluster contains only one observation

Answer 46

A measure that gives the proportion of total variance in the initial dataset that is explained by each eigenvector

Answer 47

A vector that defines new mutually uncorrelated composite variables that are linear combinations of the original features

Answer 48

A supervised learning technique of combining predictions from a collection of models to achieve a more accurate prediction. Two types of ensemble methods: 1. Aggregation of heterogeneous learners (different algorithms combined together via a voting classifier) 2. Aggregation of homogeneous learners (same algorithm used on different training data)

Answer 49

A curve which shows in- and out-of-sample error rates on the y-axis plotted against model complexity

Answer 50

Opposite of backward propagation; adjusting weights in a neural network by moving forward through network layers to reduce total error of network

Answer 51

When a model retains its explanatory power when predicting out-of-sample

Answer 52

An iterative unsupervised learning procedure used for building a hierarchy of clusters

Answer 53

Data samples that are not used to train a model

Answer 54

Describes methods that reduce statistical variability in high-dimensional data estimation problems. Can be applied to both linear and non-linear models

Answer 55

A technique in which data are shuffled randomly and then divided into k equal sub-samples, with k-1 samples used as training samples and one sample used as a validation sample
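
A rough sketch of the splitting step using only the standard library. The function name and seed are hypothetical, and rotating the validation role across all k sub-samples is the usual convention rather than something the card states.

```python
import random


def k_fold_splits(data, k, seed=0):
    """Yield (training, validation) pairs, rotating the validation sub-sample."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    # Sub-samples are equal in size when len(data) is divisible by k.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


for training, validation in k_fold_splits(range(10), k=5):
    print(len(training), len(validation))
```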

Answer 56

A clustering algorithm that repeatedly partitions observations into a fixed number, k, of non-overlapping clusters

Answer 57

Dataset that contains matched sets of observed inputs or features (X variable) and the associated output (Y variable).

Answer 58

Type of penalized regression in which the penalty term is based on the sum of the absolute values of the regression coefficients, so the penalty increases in size with the number of included features. Minimizes the sum of squared residuals plus the penalty associated with the number of features (independent variables); think of adjusted R2

Answer 59

An example of unsupervised learning that summarizes the information in a large number of correlated factors into a smaller number of uncorrelated factors (eigenvectors). Principal components, or “eigenvectors”, are linear combinations of the original data set and cannot be easily labeled or interpreted

Answer 60

An unsupervised learning technique that groups observations into categories based on similarities in their attributes. Requires human judgement in defining what is similar. Used in investment management for diversification, by investing in assets from multiple clusters, and to analyze portfolio risk, as evidenced by a large portfolio allocation to a particular cluster. Examples of clustering include K-means clustering and hierarchical clustering

Answer 61

A type of supervised learning technique and a linear classification algorithm that separates data into one of two possible classes (buy vs. sell, default vs. no default, pass vs. fail). Maximizes the probability of making a correct prediction by determining the boundary farthest away from all observations. Used in investment management to classify debt issues, select stocks to short, and classify texts such as news articles or company press releases as positive or negative

Answer 62

Deals with reducing errors in the raw data. Errors in raw data for structured data include: i. Missing values ii. Invalid values iii. Inaccurate values iv. Inconsistent format v. Duplicates. Accomplished via automated, rules-based algorithms and human intervention

Answer 63

Data wrangling involves preprocessing data for model use. Preprocessing includes data transformation. Data transformation types include: i. Extraction of data based on parameter ii. Aggregation of related data using appropriate weights iii. Filtration by removing irrelevant observations and features iv. Conversion of data of different types (e.g., nominal or ordinal)

Answer 64

1. Remove HTML tags (if text is collected from web pages) 2. Remove punctuation 3. Remove numbers (digits are replaced with annotations). If numbers are important for the analysis, the values are extracted via text applications first. 4. Remove white spaces
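
The four steps can be sketched with regular expressions. This is an illustration rather than a production cleanser: the `NUM` annotation token is a hypothetical placeholder, and numbers are replaced before punctuation is stripped (a reordering of the list above) so that decimals such as 3.5 are treated as one number.

```python
import re


def cleanse_text(raw):
    """Apply the basic text-cleansing steps to a raw string."""
    text = re.sub(r"<[^>]+>", " ", raw)                # remove HTML tags
    text = re.sub(r"\d+(?:[.,]\d+)*", " NUM ", text)   # replace numbers with an annotation
    text = re.sub(r"[^\w\s]", " ", text)               # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()           # collapse extra white space
    return text


raw = "<p>Revenue rose 3.5% in 2023,  beating   estimates!</p>"
print(cleanse_text(raw))  # Revenue rose NUM in NUM beating estimates
```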

Answer 65

1. Lowercasing 2. Removal of stop words (e.g. the, is) 3. Stemming (convert all variations of a word into a common value) 4. Lemmatization (similar to stemming)

Answer 66

Evaluate the data set and determine the most appropriate way to configure it for model training. Steps include: 1. Understanding data properties, finding patterns or relationships, and planning modeling 2. Selecting the needed attributes of the data for model training (the more features, the higher the complexity and the longer the model training time) 3. Creating new features by transforming or combining multiple features

Answer 67

1. Parameter instability 2. Public knowledge of regression relationships may negate their future usefulness 3. If regression assumptions are violated, hypothesis tests and predictions based on linear regression will not be valid. Uncertainty as to whether an assumption has been violated happens often.

Answer 68

1. Method Selection - Choosing the appropriate algorithm given objective and data characteristics (e.g. supervised/unsupervised, type of data, size of data) 2. Performance Evaluation - Quantify and critique model performance 3. Tuning - process of implementing changes to improve model performance

Answer 69

Results from overfitting a model with too many features or noise-inducing features, causing high out-of-sample error

Answer 70

The ratio of true positives to all predicted positives. High precision is valued when the cost of a type I error is large. P = TP / (TP + FP)

Answer 71

Ratio of true positives to all actual positives. High recall is valued when the cost of a type II error is large. R = TP / (TP + FN)

Answer 72

Harmonic mean of precision and recall. Precision and recall together determine model accuracy. F1 = (2 x P x R) / (P + R). Accuracy = (TP + TN) / (TP + TN + FP + FN)
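
The precision, recall, F1, and accuracy formulas can be checked together with a short sketch. The function name and the confusion-matrix counts are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 score, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = (2 * precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Hypothetical counts, for illustration only.
p, r, f1, acc = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(round(p, 4), round(r, 4), round(f1, 4), round(acc, 4))
```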

Answer 73

A curve showing the trade off between False Positives and True Positives. The true positive rate (or recall) is on the Y-axis and false positive rate (FPR) is on the X-axis. Area under the curve (AUC) is a value from 0 to 1. The closer the AUC to 1, the higher the predictive accuracy of the model.

Answer 74

1. Determine probabilistic variables - uncertain input variables that influence the value of an investment 2. Define probability distributions for the variables and specify parameters for distribution 3. Check for correlation among variables using historical data 4. Run simulation

Answer 75

1. Historical data - assumes future values of the variable will be similar to its past 2. Cross-sectional data - estimate distribution of the variable based on the values of the variable for peers 3. Subjective specification of a distribution along with related parameters

Answer 76

1. Better input estimation - forces users to think about variability in estimates 2. A distribution rather than a point estimate - simulations highlight the inherent uncertainty in valuing risky assets and explain divergence in estimates

Answer 77

1. Minimum book value of equity - e.g. maintain minimum capital under Basel 2. Earnings and cash flow - externally and internally imposed restrictions on profitability 3. Market value - comparing the value of the business to the value of debt in all scenarios

Answer 78

1. GIGO - “garbage in garbage out” 2. Real data may not fit specified distribution 3. Non-stationarity - changes in market events may render model useless 4. Dynamic correlations - correlations between input variables may not be stable and if model is not factored for changes, output may be flawed

Answer 79

Considers all possible outcomes. Better suited for continuous risks, which can be sequential or concurrent. Allows for explicitly modeling correlations of input variables

Answer 80

Appropriate tool for measuring risk in an investment when risk is discrete and sequential. It cannot accommodate correlated variables. It can be used as a complement to risk-adjusted valuation or as a substitute for it

Answer 81

The value toward which a time series tends to move. If the value of the time series is greater (less) than the mean reverting level, the value is expected to decrease (increase) over time to its mean reverting level. Calculated as b0 / (1 - b1)
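
A small sketch of the calculation for an AR(1) model, assuming the absolute value of the slope is below 1 so that the level is defined. The function name and coefficients are hypothetical.

```python
def mean_reverting_level(b0, b1):
    """Return b0 / (1 - b1), the level an AR(1) series tends toward."""
    if abs(b1) >= 1:
        # With a slope of 1 (a random walk) the level is undefined.
        raise ValueError("mean reverting level requires |b1| < 1")
    return b0 / (1 - b1)


level = mean_reverting_level(b0=1.0, b1=0.5)
print(level)  # 2.0

# A value above the level is forecast to fall toward it.
x_above = 4.0
print(1.0 + 0.5 * x_above)  # 3.0
```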

Answer 82

Refers to the error that occurs when the data is not presented in an identical format

Answer 83

Process of rescaling numeric variables into the range [0, 1]. Xi(normalized) = (Xi - Xmin) / (Xmax - Xmin)
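
A minimal sketch of the rescaling, assuming the values are not all identical (otherwise the denominator is zero). The function name and inputs are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; range is zero")
    return [(x - lo) / (hi - lo) for x in values]


print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
```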

Answer 84

A hyperparameter whose value must be set before supervised learning of the regression model begins. It determines the balance between fitting the model and keeping the model parsimonious. When it equals 0, the penalized regression is equivalent to an OLS regression

Answer 85

Sum{(X - Xbar)(Y - Ybar)} / (n - 1)
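
The formula can be sketched directly in Python. The function name and the sample data are hypothetical.

```python
def sample_covariance(xs, ys):
    """Return the sum of (x - x_bar)(y - y_bar) divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


print(sample_covariance([1, 2, 3], [2, 4, 6]))  # 2.0
```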

(109 cards)