Quant Flashcards

1
Q

Analysis of Variance (ANOVA)

A

The analysis of the total variability of a dataset (such as observations on the dependent variable in a regression) into components representing different sources of variation; with reference to regression, ANOVA provides the inputs for an F-test of the significance of the regression as a whole.

2
Q

Dependent Variable

A

The variable whose variation about its mean is to be explained by the regression; the left-hand-side variable in a regression equation.

3
Q

Error Term

A

The portion of the dependent variable that is not explained by the independent variable(s) in the regression

4
Q

Estimated Parameters

A

With reference to a regression analysis, the estimated values of the population intercept and population slope coefficient(s) in a regression

5
Q

Fitted Parameters

A

With reference to a regression analysis, the estimated values of the population intercept and population coefficient(s) in a regression

6
Q

Independent Variable

A

A variable used to explain the dependent variable in a regression; a right-hand-side variable in a regression equation

7
Q

Linear Regression

A

Regression that models the straight-line relationship between the dependent and independent variable(s)

8
Q

Parameter Instability

A

The problem of population regression parameters changing over time

9
Q

Regression coefficient

A

The intercept and slope coefficient(s) of a regression

10
Q

Adjusted R2

A

A measure of goodness-of-fit of a regression that is adjusted for degrees of freedom and hence does not automatically increase when another independent variable is added to a regression

11
Q

Breusch-Pagan test

A

A test for conditional heteroskedasticity in the error term of a regression

12
Q

Categorical dependent variables

A

An alternative term for qualitative dependent variables

13
Q

Common size statements

A

Financial statements in which all elements (accounts) are stated as a percentage of revenue (for the income statement) or total assets (for the balance sheet)

14
Q

Conditional heteroskedasticity

A

Heteroskedasticity in error variance that is correlated with the values of the independent variable(s) in the regression

15
Q

Data Mining

A

The practice of determining a model by extensive searching through a dataset for statistically significant patterns

16
Q

Discriminant analysis

A

A multivariate classification technique used to discriminate between groups, such as companies that either will or will not become bankrupt during some time frame

17
Q

Dummy variable

A

A type of qualitative variable that takes on a value of 1 if a particular condition is true and 0 if that condition is false

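A minimal sketch of dummy-variable encoding, using a hypothetical list of observations with a "sector" attribute:

```python
# Hypothetical categorical observations
sectors = ["tech", "energy", "tech", "utilities"]

# Dummy variable: 1 if the condition (sector == "tech") is true, 0 if false
is_tech = [1 if s == "tech" else 0 for s in sectors]
print(is_tech)  # [1, 0, 1, 0]
```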
18
Q

First-Order Serial Correlation

A

Correlation between adjacent observations in a time series

19
Q

Generalized least squares

A

A regression estimation technique that addresses heteroskedasticity of the error term

20
Q

Assumptions of Linear Regression Model

A
  1. The relationship between the dependent variable and the independent variable is linear
  2. The independent variable is not random
  3. The expected value of the error term = 0
  4. The variance of the error term is the same for all observations
  5. The error term is not correlated across observations
  6. The error term is normally distributed
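As a sketch of the model these assumptions support, here is a simple linear regression fit by ordinary least squares using the textbook slope and intercept formulas; the data points are hypothetical, and the residual check illustrates assumption 3 (OLS residuals sum to approximately zero):

```python
# Hypothetical paired observations
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.1, 5.9, 8.2, 9.7]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope b1 = sum of cross-deviations / sum of squared x-deviations
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar  # intercept

# Residuals from an OLS fit sum to (approximately) zero
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(b0, b1, abs(sum(residuals)) < 1e-9)
```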
21
Q

Type I error

A

Rejecting the null hypothesis when it is true (i.e. null hypothesis should not be rejected)

22
Q

Type II error

A

Failing to reject the null hypothesis when it is false (i.e. null should be rejected)

23
Q

P-value

A

Smallest level of significance at which the null hypothesis can be rejected

24
Q

Heteroskedastic

A

With reference to the error term of a regression, having a variance that differs across observations, i.e. non-constant variance

Using heteroskedasticity-consistent (robust) standard errors corrects for this

25
Log regression model & Log-linear model
Log regression model: a regression that expresses the dependent and independent variables as natural logarithms. Log-linear model: a time-series model in which the growth rate of the time series as a function of time is constant
26
Logistic regression (logit model)
A qualitative-dependent-variable multiple regression model based on the logistic probability distribution
27
Model specification
With reference to regression, the set of variables included in the regression and the regression equation’s functional form
28
Multicollinearity
Regression assumption violation that occurs when two or more independent variables are highly (but not perfectly) correlated with each other
29
Negative serial correlation
Serial correlation in which a positive error for one observation increases the chance of a negative error for another observation
30
Non-stationarity
The property of having characteristics such as mean and variance that are not constant through time
31
Positive serial correlation
Serial correlation in which a positive error for one observation increases the chance of a positive error for another observation; likewise for negative errors
32
Probit regression
A qualitative-dependent-variable multiple regression model based on the normal distribution
33
Qualitative dependent variables
Dummy variables used as dependent variables rather than as independent variables
34
Random walk
A time series in which the value of the series in one period is the value of the series in the previous period plus an unpredictable random error. In an AR(1) regression model, a random walk will have an estimated intercept coefficient (b0) near zero and a slope coefficient (b1) near 1
35
Robust standard errors (a.k.a. White-corrected standard errors)
Standard errors of the estimated parameters of a regression that correct for the presence of heteroskedasticity in the regression’s error term
36
Serially Correlated
Errors that are correlated across observations in a regression model; correlation of a time series with its own past values
37
Unconditional heteroskedasticity
Heteroskedasticity of the error term that is not correlated with the values of the independent variable(s) in the regression model
38
Autoregressive model
A time series regressed on its own past values, in which the independent variable is a lagged value of the dependent variable
39
Chain rule of forecasting
The two-period-ahead forecast is determined by first solving for the one-period-ahead forecast and substituting it into the two-period-ahead forecast model
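A worked sketch of the chain rule for an AR(1) model x_t = b0 + b1 * x_(t-1), with hypothetical coefficients and current value:

```python
# Hypothetical AR(1) coefficients and current observed value
b0, b1 = 1.0, 0.5
x_t = 4.0

x_t1 = b0 + b1 * x_t    # one-period-ahead forecast
x_t2 = b0 + b1 * x_t1   # substitute it in for the two-period-ahead forecast
print(x_t1, x_t2)  # 3.0 2.5
```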
40
Cointegrated
Two time series that have a long-term financial or economic relationship such that they do not diverge from each other without bound in the long run
41
Covariance stationary
A time series whose expected value and variance are constant and finite in all periods, and whose covariance with itself for a fixed number of periods in the past or future is constant and finite in all periods
42
First-differencing
A transformation that subtracts the value of the time series in period t-1 from its value in period t
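A minimal sketch of first-differencing, applied to a short hypothetical series:

```python
# Hypothetical time series
series = [100.0, 103.0, 101.0, 106.0]

# Value in period t minus value in period t-1
diffs = [series[t] - series[t - 1] for t in range(1, len(series))]
print(diffs)  # [3.0, -2.0, 5.0]
```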
43
In-sample forecast errors
Residuals from a fitted time-series model within the same period used to fit the model
44
Linear trend
A trend in which the dependent variable changes at a constant rate with time
45
Unit Root Testing for Nonstationarity
1. Run an AR model and examine autocorrelations | 2. Perform the Dickey-Fuller test
46
Seasonality
A pattern in a time series that tends to repeat from year to year. E.g. monthly sales data for a retailer (the Christmas season each year will show similar results, all else constant)
47
How to correct for seasonality
To adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable
48
Steps to determine stock price of target company using relative valuation ratio approach given comparable companies
1. Calculate the relative valuation ratio for the comparable companies to determine their mean 2. Apply the mean of each ratio to the valuation variables of the target company to get estimated stock price for each valuation variable 3. Take the mean of the estimated stock price to arrive at your answer
49
How to arrive at fair acquisition price of target company using comparable transaction approach
1. Calculate the relative valuation ratios based on acquisition price and their mean 2. Multiply target company valuation variables with the mean multiples calculated in step 1 3. Calculate the mean estimated stock price to arrive at your answer
50
Autoregressive Conditional Heteroskedasticity (ARCH) How to Correct?
Occurs when examining a single time series, such as an AR model. Exists if the variance of the residuals in one period is dependent on the variance of the residuals in a previous period. Correct by using regression procedures that correct for heteroskedasticity (generalized least squares)
51
When can a linear regression be used?
Linear regression can be used if: 1. Both time series are covariance stationary, or 2. Neither time series is covariance stationary but the two series are cointegrated
52
Examples of Supervised Learning
1. Linear/Penalized regression 2. Logistic regression (logit) 3. CART 4. SVM 5. KNN 6. Ensemble learning & Random Forest
53
Types of Unsupervised Learning
1. Principal Components Analysis | 2. Clustering
54
Steps in data analysis project
1. Conceptualization of the modeling task 2. Data Collection 3. Data Preparation and wrangling (*critical) 4. Data Exploration 5. Model Training
55
Steps to analyze unstructured, text-based data
1. Text problem formulation 2. Data collection 3. Text preparation & wrangling (*critical) 4. Text exploration
56
Activation function
Part of the neural network's node that transforms the total net input into the final output. The activation function operates like a light-dimmer switch that decreases/increases the strength of the input.
57
Agglomerative clustering
Mnemonic: think "conglomerate." A clustering method that starts with each observation as its own cluster; the two closest clusters (by distance) are combined into one larger cluster. This process is repeated until all observations are clumped into a single large cluster
58
Backward propagation
Process of adjusting weights in a neural network by moving backward through the network's layers to reduce total error
59
Base error
Model error due to randomness in the data
60
Bias Error
Occurs with underfitting, due to one or very few features, producing a poor approximation and high in-sample error.
61
Bootstrap aggregating (“Bagging”)
A process where the original training data set is used to generate n new training data sets. Data can overlap between the data sets. Helps improve the stability of predictions and reduce the chance of overfitting
62
Centroid
The center of a cluster formed using the K-means clustering algorithm
63
Classification & Regression Tree (CART)
Supervised machine learning technique commonly applied to binary classification or regression. Categorical target variable = classification tree; continuous target variable = regression tree
64
Composite variable
A variable that combines two or more variables that are statistically strongly related to each other.
65
Cross-validation
Technique for estimating out-of-sample error directly by determining the error in validation samples
66
Deep Learning
Algorithms based on complex neural networks that address highly complex tasks like image classification, face recognition, speech recognition, and natural language processing
67
Dendrogram
A type of tree diagram that highlights the hierarchical relationships among the clusters
68
Dimension reduction
Set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation
69
Divisive clustering
The opposite of agglomerative clustering: a method that starts with all observations belonging to a single large cluster, which is then divided into two clusters based on a measure of distance. The process repeats until each cluster contains only one observation
70
Eigenvalue
A measure that gives the proportion of total variance in the initial dataset that is explained by each eigenvector
71
Eigenvector
A vector that defines new mutually uncorrelated composite variables that are linear combinations of the original features
72
Ensemble learning
A supervised learning technique of combining predictions from a collection of models to achieve a more accurate prediction. Two types of ensemble methods: 1. Aggregation of heterogeneous learners (different algorithms combined via a voting classifier) 2. Aggregation of homogeneous learners (the same algorithm used on different training data)
73
Fitting curve
A curve which shows in- and out-of-sample error rates on the y-axis plotted against model complexity
74
Forward propagation
Opposite of backward propagation: the process of moving input values forward through the network's layers to compute the network's output (backward propagation then adjusts the weights to reduce total error)
75
Generalize
When a model retains its explanatory power when predicting out-of-sample
76
Hierarchical clustering
An iterative unsupervised learning procedure used for building a hierarchy of clusters
77
Holdout samples
Data samples that are not used to train a model
78
Regularization
Describes methods that reduce statistical variability in high-dimensional data estimation problems. Can be applied to linear and non-linear models
79
K-fold cross-validation
A technique in which data are shuffled randomly and then divided into k equal sub-samples, with k-1 samples used as training samples and one sample used as a validation sample
80
K-means
A clustering algorithm that repeatedly partitions observations into a fixed number, k, of non-overlapping clusters
81
Labeled data set
Dataset that contains matched sets of observed inputs or features (X variables) and the associated output (Y variable).
82
Least Absolute Shrinkage & Selection Operator (LASSO)
Type of penalized regression that involves minimizing the sum of the absolute values of the regression coefficients plus a penalty term that increases in size with the number of included features. It minimizes the standard error of estimate and the value of the penalty associated with the number of features (independent variables); think of adjusted R2
83
Principal Component Analysis (PCA)
An example of unsupervised learning that summarizes the information in a large number of correlated factors into a small number of uncorrelated factors (eigenvectors). Principal components, or "eigenvectors," are linear combinations of the original data set and cannot be easily labeled or interpreted
84
Clustering
An unsupervised learning technique that groups observations into categories based on similarities in their attributes. Requires human judgment in defining what is similar. Used in investment management for diversification by investing in assets from multiple clusters, and to analyze portfolio risk evidenced by a large portfolio allocation to a particular cluster. Examples of clustering include k-means clustering and hierarchical clustering
85
Support Vector Machine (SVM)
A type of supervised learning technique and a linear classification algorithm that separates data into one of two possible classes (buy vs. sell, default vs. no default, pass vs. fail). Maximizes the probability of making a correct prediction by determining the boundary farthest away from all observations. Used in investment management to classify debt issues, short stocks, and classify texts such as news articles or company press releases as positive or negative
86
Data Cleansing
Deals with reducing errors in the raw data. Errors in raw structured data include: i. Missing values ii. Invalid values iii. Inaccurate values iv. Inconsistent formats v. Duplicates. Accomplished via automated, rules-based algorithms and human intervention
87
Data Wrangling & Transformation
Data wrangling involves preprocessing data for model use. Preprocessing includes data transformation. Data transformation types include: i. Extraction of data based on parameter ii. Aggregation of related data using appropriate weights iii. Filtration by removing irrelevant observations and features iv. Conversion of data of different types (e.g., nominal or ordinal)
88
Steps for Text Preparation or Cleaning
1. Remove HTML tags (if text collected from web pages) 2. Remove punctuation 3. Remove numbers (digits replaced with annotations). If numbers are important for the analysis, values are extracted via text applications first. 4. Remove white spaces
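The steps above can be sketched with the standard library's re module; the sample string and the "/number/" annotation token are illustrative assumptions:

```python
import re

# Hypothetical raw text scraped from a web page
raw = "<p>Revenue grew 12% in Q3!</p>   See details."

text = re.sub(r"<[^>]+>", " ", raw)        # 1. remove HTML tags
text = re.sub(r"[^\w\s]", " ", text)       # 2. remove punctuation
text = re.sub(r"\d+", " /number/ ", text)  # 3. replace digits with an annotation
text = re.sub(r"\s+", " ", text).strip()   # 4. collapse white spaces
print(text)
```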
89
Steps for Text Wrangling (i.e. normalization)
1. Lower casing 2. Removal of stop words (e.g. the, is) 3. Stemming (convert all variations of a word into a common value) 4. Lemmatization (similar to stemming)
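A minimal sketch of these normalization steps; the stop-word list and the crude suffix-stripping "stemmer" are illustrative assumptions, not a real stemming algorithm:

```python
# Illustrative stop-word list
STOP_WORDS = {"the", "is", "a", "of"}

def normalize(tokens):
    tokens = [t.lower() for t in tokens]                         # 1. lower casing
    tokens = [t for t in tokens if t not in STOP_WORDS]          # 2. stop-word removal
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # 3. naive stemming
    return tokens

print(normalize(["The", "markets", "is", "rising"]))  # ['market', 'rising']
```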
90
Data Exploration
Evaluate the data set and determine the most appropriate way to configure it for model training. Steps include: 1. Understanding data properties, finding patterns or relationships, and planning modeling 2. Selecting the needed attributes of the data for model training (the more features, the higher the complexity and the longer the model training time) 3. Creating new features by transforming or combining multiple features
91
Limitations of Regression Analysis
1. Parameter instability 2. Public knowledge of regression relationships may negate their future usefulness 3. If regression assumptions are violated, hypothesis tests and predictions based on linear regression will not be valid; it is often uncertain whether an assumption has been violated
92
Tasks for model training
1. Method Selection - Choosing the appropriate algorithm given objective and data characteristics (e.g. supervised/unsupervised, type of data, size of data) 2. Performance Evaluation - Quantify and critique model performance 3. Tuning - process of implementing changes to improve model performance
93
Variance error
Error resulting from overfitting the model with noise-inducing features/too many features, causing out-of-sample error
94
Precision
The ratio of true positives to all predicted positives. High precision is valued when the cost of a type I error is large. P = TP / (TP + FP)
95
Recall (a.k.a. True Positive Rate)
Ratio of true positives to all actual positives. High recall is valued when the cost of a type II error is large. R = TP / (TP + FN)
96
F1 score
Harmonic mean of precision and recall. Precision and recall together determine model accuracy. F1: (2 x P x R) / (P + R) Accuracy: (TP + TN) / (TP + TN + FP + FN)
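The four metrics above can be computed from confusion-matrix counts; the counts here are hypothetical:

```python
# Hypothetical confusion-matrix counts
TP, FP, FN, TN = 40, 10, 20, 30

precision = TP / (TP + FP)                   # 40 / 50 = 0.8
recall = TP / (TP + FN)                      # 40 / 60, about 0.667
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 70 / 100 = 0.7
print(precision, recall, f1, accuracy)
```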
97
Receiver Operating Characteristic (ROC)
A curve showing the trade-off between false positives and true positives. The true positive rate (or recall) is on the Y-axis and the false positive rate (FPR) is on the X-axis. Area under the curve (AUC) is a value from 0 to 1. The closer the AUC to 1, the higher the predictive accuracy of the model.
98
Steps in Simulation
1. Determine probabilistic variables - uncertain input variables that influence the value of an investment 2. Define probability distributions for the variables and specify parameters for distribution 3. Check for correlation among variables using historical data 4. Run simulation
99
3 approaches for specifying distribution
1. Historical data - assumes future values of the variable will be similar to its past 2. Cross-sectional data - estimate distribution of the variable based on the values of the variable for peers 3. Subjective specification of a distribution along with related parameters
100
Advantages of Simulations
1. Better input estimation - forces users to think about variability in estimates 2. A distribution rather than a point estimate - simulations highlight the inherent uncertainty in valuing risky assets and explain divergence in estimates
101
Examples of constraints in Simulations
1. Minimum book value of equity - e.g. maintain minimal capital under Basel 2. Earnings and cash flow - externally and internally imposed restrictions on profitability 3. Market value - comparing the value of the business to the value of debt in all scenarios
102
Problems in Simulation
1. GIGO - “garbage in garbage out” 2. Real data may not fit specified distribution 3. Non-stationarity - changes in market events may render model useless 4. Dynamic correlations - correlations between input variables may not be stable and if model is not factored for changes, output may be flawed
103
Benefits of Monte Carlo Simulation
Considers all possible outcomes. Better suited for continuous risks, which can be sequential or concurrent. Allows for explicitly modeling correlations of input variables
104
Decision Trees
Appropriate tool for measuring risk in an investment when risk is discrete and sequential. It cannot accommodate correlated variables. It can be used as a complement to risk-adjusted valuation or as a substitute for such valuation
105
Mean Reverting Level
The value toward which a time series shows a tendency to move. If the value of the time series is greater (less) than the mean-reverting level, the value is expected to decrease (increase) over time to its mean-reverting level. Calculated as b0 / (1 - b1)
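A worked sketch of the mean-reverting level for an AR(1) model with hypothetical coefficients:

```python
# Hypothetical AR(1) coefficients
b0, b1 = 2.0, 0.6

# Mean-reverting level = b0 / (1 - b1)
mean_reverting_level = b0 / (1 - b1)
print(mean_reverting_level)
```

With b0 = 2.0 and b1 = 0.6 the series tends toward 2.0 / 0.4 = 5.0; values above 5.0 are expected to fall toward it, and values below it to rise.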
106
Non-Uniformity Error
Refers to the error that occurs when the data is not presented in an identical format
107
Normalization
Process of rescaling numeric variables to the range [0, 1]. Xi(normalized) = (Xi - Xmin) / (Xmax - Xmin)
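A minimal sketch of min-max normalization on hypothetical data:

```python
# Hypothetical numeric variable
xs = [2.0, 4.0, 6.0, 10.0]
x_min, x_max = min(xs), max(xs)

# Rescale each value to the range [0, 1]
normalized = [(x - x_min) / (x_max - x_min) for x in xs]
print(normalized)  # [0.0, 0.25, 0.5, 1.0]
```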
108
Lambda
A hyperparameter whose value must be set before supervised learning of the penalized regression model begins. It determines the balance between fitting the model and keeping the model parsimonious. When lambda = 0, the penalized regression is equivalent to an OLS regression
109
Sample Covariance
Sum {(X - Xbar)(Y - Ybar)} / (n - 1)
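A minimal sketch of the sample covariance formula on hypothetical paired data:

```python
# Hypothetical paired observations
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sum of cross-deviations divided by (n - 1)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
print(cov_xy)  # 2.0
```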