Quant Flashcards
Analysis of Variance (ANOVA)
The analysis of the total variability of a dataset (such as observations on the dependent variable in a regression) into components representing different sources of variation; with reference to regression, ANOVA provides the inputs for an F-test of the significance of the regression as a whole.
Dependent Variable
The variable whose variation about its mean is to be explained by the regression; the left-hand-side variable in a regression equation.
Error Term
The portion of the dependent variable that is not explained by the independent variable(s) in the regression
Estimated Parameters
With reference to a regression analysis, the estimated values of the population intercept and population slope coefficient(s) in a regression
Fitted Parameters
With reference to a regression analysis, the estimated values of the population intercept and population coefficient(s) in a regression
Independent Variable
A variable used to explain the dependent variable in a regression; a right-hand-side variable in a regression equation
Linear Regression
Regression that models the straight-line relationship between the dependent and independent variable(s)
Parameter Instability
The problem or issue of population regression parameters that have changed over time
Regression coefficient
The intercept and slope coefficient(s) of a regression
Adjusted R2
A measure of goodness-of-fit of a regression that is adjusted for degrees of freedom and hence does not automatically increase when another independent variable is added to a regression
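A minimal sketch of the adjustment, assuming the standard formula adjusted R2 = 1 - [(n - 1) / (n - k - 1)] x (1 - R2), where n is the number of observations and k the number of independent variables. The function name and sample figures are illustrative and not from the cards.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R2 for a regression with n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (n - 1) / (n - k - 1) * (1.0 - r2)


# Hypothetical figures: the same R2 with more regressors gives a lower adjusted R2.
few_regressors = adjusted_r2(r2=0.60, n=50, k=2)
many_regressors = adjusted_r2(r2=0.60, n=50, k=10)
```

Holding R2 fixed, adding independent variables lowers adjusted R2, which is why it does not automatically increase when a variable is added.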
Breusch-Pagan test
A test for conditional heteroskedasticity in the error term of a regression
Categorical dependent variables
An alternative term for qualitative dependent variables
Common size statements
Financial statements in which all elements (accounts) are stated as a percentage of a key figure: revenue for the income statement or total assets for the balance sheet
Conditional heteroskedasticity
Heteroskedasticity in error variance that is correlated with the values of the independent variable(s) in the regression
Data Mining
The practice of determining a model by extensive searching through a dataset for statistically significant patterns
Discriminant analysis
A multivariate classification technique used to discriminate between groups, such as companies that either will or will not become bankrupt during some time frame
Dummy variable
A type of qualitative variable that takes on a value of 1 if a particular condition is true and 0 if that condition is false
First-Order Serial Correlation
Correlation between adjacent observations in a time series
Generalized least squares
A regression estimation technique that addresses heteroskedasticity of the error term
Assumptions of Linear Regression Model
- Relationship between dep variable and ind variable is linear
- Ind variable is not random
- Expected value of error term = 0
- Variance of the error term is same for all observations
- Error term is not correlated across observations
- Error term is normally distributed
Type I error
Rejecting the null hypothesis when it is true (i.e. null hypothesis should not be rejected)
Type II error
Failing to reject the null hypothesis when it is false (i.e. null should be rejected)
P-value
Smallest level of significance at which the null hypothesis can be rejected
Heteroskedastic
With reference to the error term of regression, having a variance that differs across observations - i.e. non-constant variance
Using heteroskedasticity-consistent (robust) standard errors corrects for this
Log-log regression model
&
Log-linear model
Log-log regression model: a regression that expresses the dependent and independent variables as natural logarithms
&
Log-linear model: a time-series model in which the growth rate of the time series as a function of time is constant
Logistic regression (logit model)
A qualitative-dependent-variable multiple regression model based on the logistic probability distribution
Model specification
With reference to regression, the set of variables included in the regression and the regression equation’s functional form
Multicollinearity
Regression assumption violation that occurs when two or more ind variables are highly (but not perfectly) correlated with each other
Negative serial correlation
Serial correlation in which a positive error for one observation increases the chance of a negative error for another observation
Non-stationarity
The property of having characteristics such as mean and variance that are not constant through time
Positive serial correlation
Serial correlation in which a positive error for one observation increases the chance of a positive error for another observation; likewise, a negative error increases the chance of another negative error
Probit regression
A qualitative-dependent-variable multiple regression model based on the normal distribution
Qualitative dependent variables
Dummy variables used as dependent variables rather than as independent variables
Random walk
Time series in which the value of the series in one period is the value of the series in the previous period plus an unpredictable random error
In an AR(1) regression model, a random walk will have an estimated intercept coefficient (b0) near zero and an estimated slope coefficient (b1) near 1
Robust standard errors (a.k.a. White-corrected standard errors)
Standard errors of the estimated parameters of a regression that correct for the presence of heteroskedasticity in the regression’s error term
Serially Correlated
Errors that are correlated across observations in a regression model
Correlation of a time series with its own past values
Unconditional heteroskedasticity
Heteroskedasticity of the error term that is not correlated with the values of the independent variable(s) in the regression
Autoregressive model
A time series regressed on its own past values; the ind variable is a lagged value of the dependent variable
Chain rule of forecasting
The two-period-ahead forecast is determined by first solving for the one-period-ahead forecast and then substituting it into the two-period-ahead forecast model
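The substitution can be sketched for an AR(1) model, assuming the form x(t+1) = b0 + b1 * x(t). The function name and the coefficient values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def ar1_forecast(b0: float, b1: float, x_t: float, steps: int) -> list[float]:
    """Forecast an AR(1) series `steps` periods ahead using the chain rule."""
    forecasts = []
    previous = x_t
    for _ in range(steps):
        # Each forecast feeds the next one as the lagged value.
        previous = b0 + b1 * previous
        forecasts.append(previous)
    return forecasts


# Hypothetical coefficients and current value.
one_ahead, two_ahead = ar1_forecast(b0=1.0, b1=0.5, x_t=4.0, steps=2)
```

Here the one-period forecast is 1.0 + 0.5 x 4.0 = 3.0, and the two-period forecast reuses it: 1.0 + 0.5 x 3.0 = 2.5.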
Cointegrated
Two time series that have a long-term financial or economic relationship such that they do not diverge from each other without bound in the long run
Covariance stationary
A time series whose expected value and variance are constant and finite in all periods, and whose covariance with itself for a fixed number of periods in the past or future is constant and finite in all periods
First-differencing
A transformation that subtracts the value of the time series in period t-1 from its value in period t
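A short sketch of the transformation in plain Python; the helper name and the sample series are illustrative only.

```python
def first_difference(series: list[float]) -> list[float]:
    """Return x(t) - x(t-1) for every t after the first observation."""
    return [current - previous for previous, current in zip(series, series[1:])]


levels = [100, 103, 101, 106]  # hypothetical price levels
changes = first_difference(levels)
```

The differenced series has one fewer observation than the original, since the first period has no prior value to subtract.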
In-sample forecast errors
Residuals from a fitted time-series model within the same period used to fit the model
Linear trend
A trend in which the dependent variable changes at a constant rate with time
Unit Root Testing for Nonstationarity
- Run an AR model and examine autocorrelations
- Perform the Dickey-Fuller test
Seasonality
A pattern in a time-series that tends to repeat from year to year.
E.g. monthly sales data for a retailer (Christmas season each year will have similar results all else constant)
How to correct for seasonality
To adjust for seasonality in an AR model, an additional lag of the dependent variable (corresponding to the same period in the previous year) is added to the original model as another independent variable
Steps to determine stock price of target company using relative valuation ratio approach given comparable companies
- Calculate the relative valuation ratio for the comparable companies to determine their mean
- Apply the mean of each ratio to the valuation variables of the target company to get estimated stock price for each valuation variable
- Take the mean of the estimated stock price to arrive at your answer
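The three steps above can be sketched as follows. All figures are made up, and using P/E and P/B as the valuation ratios is an assumption for illustration; the cards do not specify which ratios to use.

```python
from statistics import mean

# Hypothetical comparable companies (price, earnings and book value per share).
comparables = [
    {"price": 30.0, "eps": 2.0, "bvps": 10.0},
    {"price": 45.0, "eps": 3.0, "bvps": 18.0},
]
target = {"eps": 2.5, "bvps": 12.0}  # hypothetical target company

# Step 1: mean of each relative valuation ratio across the comparables.
mean_pe = mean(c["price"] / c["eps"] for c in comparables)
mean_pb = mean(c["price"] / c["bvps"] for c in comparables)

# Step 2: apply each mean ratio to the target's valuation variable.
price_from_pe = mean_pe * target["eps"]
price_from_pb = mean_pb * target["bvps"]

# Step 3: average the estimates to get the estimated stock price.
estimated_price = mean([price_from_pe, price_from_pb])
```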
How to arrive at fair acquisition price of target company using comparable transaction approach
- Calculate the relative valuation ratios based on acquisition price and their mean
- Multiply target company valuation variables with the mean multiples calculated in step 1
- Calculate the mean estimated stock price to arrive at your answer
Autoregressive Conditional Heteroskedasticity (ARCH)
How to Correct?
Can occur when examining a single time series, such as in an AR model
Exists if the variance of the residuals in one period depends on the variance of the residuals in a previous period
Correct by using regression procedures that correct for heteroskedasticity (e.g., generalized least squares)
When can a linear regression be used?
Linear regression can be used if:
- Both time series are covariance stationary
- Neither time series is covariance stationary but the two series are cointegrated
Examples of Supervised Learning
- Linear/Penalized regression
- Logistic regression (logit)
- CART
- SVM
- KNN
- Ensemble & Random Forest
Types of Unsupervised Learning
- Principal Components Analysis
- Clustering
Steps in data analysis project
- Conceptualization of the modeling task
- Data Collection
- Data Preparation and wrangling (*critical)
- Data Exploration
- Model Training
Steps to analyze unstructured, text-based data
- Text problem formulation
- Data collection
- Text preparation & wrangling (*critical)
- Text exploration
Activation function
Part of a neural network’s node that transforms the total net input into the final output.
The activation function operates like a light dimmer switch that decreases or increases the strength of the input.
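A minimal sketch of a single node, assuming a sigmoid activation (one common choice; the cards do not name a specific function). The weights, bias and inputs are hypothetical.

```python
import math


def sigmoid(net_input: float) -> float:
    """Squash any net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input))


def node_output(inputs: list[float], weights: list[float], bias: float) -> float:
    """Weighted sum of the inputs plus bias, passed through the activation function."""
    net_input = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(net_input)


# Hypothetical inputs and weights whose net input is exactly zero.
output = node_output(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
```

A net input of zero maps to 0.5; larger net inputs push the output toward 1 and smaller ones toward 0, which is the "dimmer" behavior described above.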
Agglomerative clustering
Mnemonic: think “conglomerate”
Clustering method that starts with each observation as its own cluster; the two closest clusters (by distance) are then combined into one larger cluster. This process is repeated until all observations are grouped into a single large cluster
Backward propagation
Process of adjusting weights in a neural network by moving backward through the network’s layer to reduce total error
Base error
Model error due to randomness in the data
Bias Error
Occurs with underfitting: a model with one or very few features produces a poor approximation and high in-sample error.
Bootstrap aggregating (“Bagging”)
Process where original training data set is used to generate n new training data sets.
Data can overlap between data sets. Helps improve the stability of the predictions and reduce chances of overfitting a regression
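The resampling step can be sketched as below, assuming each new training set is drawn with replacement and has the same size as the original. The function name, seed and data are hypothetical.

```python
import random


def bootstrap_samples(data: list, n_sets: int, seed: int = 0) -> list[list]:
    """Draw n_sets new training sets from data, sampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the illustration is repeatable
    return [[rng.choice(data) for _ in data] for _ in range(n_sets)]


training_data = [1, 2, 3, 4, 5]  # hypothetical observations
bags = bootstrap_samples(training_data, n_sets=3)
```

Because sampling is with replacement, an observation can appear several times in one set and in more than one set. In bagging, a model would then be trained on each set and the predictions combined (for example by averaging or voting).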
Centroid
The center of a cluster formed using the K-means clustering algorithm
Classification & Regression Tree (CART)
Supervised machine learning technique commonly applied to binary classifications or regression
Categorical target variable = classification tree
Continuous target variable = regression tree
Composite variable
A variable that combines two or more variables that are statistically strongly related to each other.
Cross-validation
Technique for estimating out-of-sample error directly by determining the error in validation samples
Deep Learning
Algorithms based on complex neural networks that address highly complex tasks like image classification, face recognition, speech recognition, and natural language processing
Dendrogram
A type of tree diagram that highlights the hierarchical relationships among the clusters
Dimension reduction
Set of techniques for reducing the number of features in a dataset while retaining variation across observations to preserve the information contained in that variation
Divisive clustering
Opposite of agglomerative clustering
Clustering method that starts with all observations belonging to a single large cluster, which is then divided into two clusters based on a measure of distance. The process repeats until each cluster contains only one observation
Eigenvalue
A measure that gives the proportion of total variance in the initial dataset that is explained by each eigenvector
Eigenvector
A vector that defines new mutually uncorrelated composite variables that are linear combinations of the original features
Ensemble learning
A supervised learning technique of combining predictions from a collection of models to achieve a more accurate prediction
Two types of ensemble methods:
- Aggregation of heterogeneous learners (different algorithms combined together via a voting classifier)
- Aggregation of homogeneous learners (same algorithm used on different training data)
Fitting curve
A curve which shows in- and out-of-sample error rates on the y-axis plotted against model complexity
Forward propagation
Opposite direction from backward propagation; passing inputs forward through the network's layers to compute the output (and hence the total error). The weights themselves are adjusted during backward propagation
Generalize
When a model retains its explanatory power when predicting out-of-sample
Hierarchical clustering
An iterative unsupervised learning procedure used for building a hierarchy of clusters
Holdout samples
Data samples that are not used to train a model
Regularization
Describes methods that reduce statistical variability in high-dimensional data estimation problems
Can be applied to both linear and non-linear models
K-fold cross-validation
A technique in which data are shuffled randomly and then divided into k equal sub-samples, with k-1 sub-samples used as training samples and one sub-sample used as a validation sample
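A minimal sketch of the splitting step using only the standard library. The function name and seed are hypothetical; when the number of observations is not divisible by k, this version lets fold sizes differ by at most one.

```python
import random


def k_fold_splits(data, k: int, seed: int = 0):
    """Yield (training, validation) pairs, one for each of the k folds."""
    items = list(data)
    random.Random(seed).shuffle(items)       # shuffle randomly first
    folds = [items[i::k] for i in range(k)]  # k roughly equal sub-samples
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation


splits = list(k_fold_splits(range(10), k=5))
```

Each observation lands in the validation sample exactly once and in the training sample k-1 times; a model would be fitted and scored once per split and the k validation errors averaged.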
K-means
A clustering algorithm that repeatedly partitions observations into a fixed number, k, of non-overlapping clusters
Labeled data set
Dataset that contains matched sets of observed inputs or features (X variable) and the associated output (Y variable).
Least Absolute Shrinkage & Selection Operator (LASSO)
Type of penalized regression in which the penalty term is the sum of the absolute values of the regression coefficients (scaled by lambda); the penalty increases in size with the number of included features
Minimizes the sum of squared residuals plus the penalty associated with the included features (independent variables); think of adjusted R2
Principal Component Analysis (PCA)
An example of unsupervised learning that summarizes the information in a large number of correlated factors into a smaller number of uncorrelated factors (eigenvectors)
Principal components or “eigenvectors” are linear combinations of the original data set and cannot be easily labeled or interpreted
Clustering
An unsupervised learning technique that groups observations into categories based on similarities in their attributes
Requires human judgement in defining what is similar.
Used in investment management for diversification by investing in assets from multiple clusters; analyze portfolio risk evidenced by a large portfolio allocation to a particular cluster
Examples of clustering include:
- K-means clustering
- Hierarchical clustering
Support Vector Machine (SVM)
A type of supervised learning technique and a linear classification algorithm that separates data into one of two possible classes (buy vs. sell, default vs. no default, pass vs. fail)
Maximizes probability of making a correct prediction by determining boundary farthest away from all observations
Used in investment management to classify debt issues, identify stocks to short, and classify texts such as news articles or company press releases as positive or negative
Data Cleansing
Deals with reducing errors in the raw data. Errors in raw data for structured data include:
i. Missing values
ii. Invalid values
iii. Inaccurate values
iv. Inconsistent format
v. Duplicates
Accomplished via automated, rules-based algorithms and human intervention
Data Wrangling & Transformation
Data wrangling involves preprocessing data for model use. Preprocessing includes data transformation.
Data transformation types include:
i. Extraction of data based on parameter
ii. Aggregation of related data using appropriate weights
iii. Filtration by removing irrelevant observations and features
iv. Conversion of data of different types (e.g., nominal or ordinal)
Steps for Text Preparation or Cleaning
- Remove HTML tags (if text collected from web pages)
- Remove punctuations
- Remove numbers (digits replaced with annotations). If numbers are important for analysis, values are extracted via text applications first.
- Remove white spaces
Steps for Text Wrangling (i.e. normalization)
- Lower casing
- Removal of stop words (e.g. the, is)
- Stemming (convert all variations of a word into a common value)
- Lemmatization (similar to stemming)
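The cleaning and wrangling steps can be sketched with the standard library. This is an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, the annotation token NUMBER is made up, and stemming and lemmatization are left out because they normally rely on a dedicated NLP library.

```python
import re

STOP_WORDS = {"the", "is", "a", "of", "in"}  # tiny illustrative list


def prepare_text(raw: str) -> list[str]:
    """Clean raw text, then normalize it into lower-case tokens without stop words."""
    text = re.sub(r"<[^>]+>", " ", raw)      # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation
    text = re.sub(r"\d+", " NUMBER ", text)  # replace digits with an annotation
    text = text.lower()                      # lower casing
    tokens = text.split()                    # splitting also drops extra white space
    return [token for token in tokens if token not in STOP_WORDS]


tokens = prepare_text("<p>The profit is up 12% in the quarter!</p>")
```

For the sample sentence the result is the token list profit, up, number, quarter.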
Data Exploration
Evaluate the data set and determine the most appropriate way to configure it for model training
Steps include:
- Understanding data properties, finding patterns or relationships, and planning modeling
- Select the needed attributes of the data for model training (the more features, the higher the complexity and the longer the model training time)
- Create new features by transforming or combining multiple features
Limitations of Regression Analysis
- Parameter instability
- Public knowledge of regression relationships may negate their future usefulness
- If regression assumptions are violated, hypothesis tests and predictions based on linear regression will not be valid. Uncertainty as to whether an assumption has been violated happens often.
Tasks for model training
- Method Selection - Choosing the appropriate algorithm given objective and data characteristics (e.g. supervised/unsupervised, type of data, size of data)
- Performance Evaluation - Quantify and critique model performance
- Tuning - process of implementing changes to improve model performance
Variance error
Results from overfitting: a model with too many features, or noise-inducing features, produces high out-of-sample error
Precision
The ratio of true positives to all predicted positives
High precision is valued when the cost of a type I error is large.
P = TP / (TP + FP)
Recall (a.k.a. True Positive Rate)
Ratio of true positives to all actual positives
High recall is valued when the cost of a type II error is large
R = TP / (TP + FN)
F1 score
Harmonic mean of precision and recall. Precision and recall together determine model accuracy.
F1: (2 x P x R) / (P + R)
Accuracy: (TP + TN) / (TP + TN + FP + FN)
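The four formulas can be computed directly from confusion-matrix counts. The function name and the counts are hypothetical, and the sketch assumes none of the denominators is zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts for a binary classifier evaluated on 100 observations.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts precision is 40/50 = 0.8, recall is 40/60 (about 0.67), F1 is 80/110 (about 0.73) and accuracy is 70/100 = 0.7.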
Receiver Operating Characteristic (ROC)
A curve showing the trade off between False Positives and True Positives. The true positive rate (or recall) is on the Y-axis and false positive rate (FPR) is on the X-axis.
Area under the curve (AUC) is a value from 0 to 1. The closer the AUC to 1, the higher the predictive accuracy of the model.
Steps in Simulation
- Determine probabilistic variables - uncertain input variables that influence the value of an investment
- Define probability distributions for the variables and specify parameters for distribution
- Check for correlation among variables using historical data
- Run simulation
3 approaches for specifying distribution
- Historical data - assumes future values of the variable will be similar to its past
- Cross-sectional data - estimate distribution of the variable based on the values of the variable for peers
- Subjective specification of a distribution along with related parameters
Advantages of Simulations
- Better input estimation - forces users to think about variability in estimates
- A distribution rather than a point estimate (i.e. a single value) - simulations highlight the inherent uncertainty in valuing risky assets and explain divergence in estimates
Examples of constraints in Simulations
- Minimum book value of equity - e.g. maintain minimum capital under Basel requirements
- Earnings and cash flow - externally and internally imposed restrictions on profitability
- Market Value - comparing value of business to value of debt in all scenarios
Problems in Simulation
- GIGO - “garbage in garbage out”
- Real data may not fit specified distribution
- Non-stationarity - changes in market events may render model useless
- Dynamic correlations - correlations between input variables may not be stable and if model is not factored for changes, output may be flawed
Benefits of Monte Carlo Simulation
- Considers all possible outcomes
- Better suited for continuous risks, which can be sequential or concurrent
- Allows for explicitly modeling correlations among input variables
Decision Trees
Appropriate tool for measuring risk in an investment when risk is discrete and sequential
It cannot accommodate correlated variables
It can be used as complements to risk-adjusted valuation or as substitutes to such valuation
Mean Reverting Level
The value that a time series tends to move toward. If the value of the time series is greater (less) than the mean reverting level, the value is expected to decrease (increase) over time toward its mean reverting level
Calculated as b0 / (1 - b1)
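A short sketch of the calculation, where b0 and b1 are taken to be the intercept and slope of an AR(1) model. The function name and the numbers are hypothetical, and the level is undefined when b1 equals 1 (the random walk case).

```python
def mean_reverting_level(b0: float, b1: float) -> float:
    """Mean reverting level of an AR(1) model x(t+1) = b0 + b1 * x(t)."""
    if b1 == 1:
        raise ValueError("no finite mean reverting level when b1 equals 1")
    return b0 / (1 - b1)


level = mean_reverting_level(b0=1.0, b1=0.5)

# A current value above the level is forecast to fall toward it.
current_value = 4.0
next_forecast = 1.0 + 0.5 * current_value
```

Here the level is 1.0 / (1 - 0.5) = 2.0, and the forecast of 3.0 lies between the current value of 4.0 and the level.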
Non-Uniformity Error
Refers to the error that occurs when the data is not presented in an identical format
Normalization
Process of rescaling numeric variables in the range of [0,1]
Xi(normalized) = (Xi - Xmin) / (Xmax - Xmin)
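A minimal sketch of min-max rescaling; the function name and the sample values are illustrative, and the sketch assumes the values are not all identical.

```python
def min_max_normalize(values: list[float]) -> list[float]:
    """Rescale values to the range [0, 1] using the minimum and maximum."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        raise ValueError("cannot normalize when all values are identical")
    return [(value - lowest) / (highest - lowest) for value in values]


scaled = min_max_normalize([10, 20, 40])
```

The minimum maps to 0, the maximum maps to 1, and 20 maps to (20 - 10) / (40 - 10), or one third.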
Lambda
A hyperparameter whose value must be set before supervised learning of the regression model begins
It determines the balance between fitting the model and keeping the model parsimonious
When lambda = 0, the penalized regression is equivalent to an OLS regression
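The role of lambda can be sketched with a LASSO-style objective, assuming the usual form: sum of squared residuals plus lambda times the sum of the absolute values of the slope coefficients. The function name and the numbers are hypothetical.

```python
def lasso_objective(residuals: list[float], coefficients: list[float], lam: float) -> float:
    """Sum of squared residuals plus lam times the sum of absolute slope coefficients."""
    sum_squared_residuals = sum(e * e for e in residuals)
    penalty = lam * sum(abs(b) for b in coefficients)
    return sum_squared_residuals + penalty


residuals = [1.0, -2.0]     # hypothetical regression residuals
coefficients = [0.5, -1.5]  # hypothetical slope coefficients (intercept excluded)

ols_value = lasso_objective(residuals, coefficients, lam=0.0)
penalized_value = lasso_objective(residuals, coefficients, lam=2.0)
```

With lambda set to 0 the penalty vanishes and the objective is just the OLS sum of squared residuals; a larger lambda puts more weight on keeping the coefficients small.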
Sample Covariance
Sum {(X-Xbar)(Y-Ybar)} / (n-1)
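A minimal sketch of the formula in plain Python; the function name and the data are illustrative, and at least two paired observations are assumed.

```python
def sample_covariance(xs: list[float], ys: list[float]) -> float:
    """Sample covariance: sum of cross-deviations from the means, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least two observations")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross_deviations = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross_deviations / (n - 1)


covariance = sample_covariance([1, 2, 3], [2, 4, 6])
```

For these values the cross-deviations sum to 4 and n - 1 is 2, giving a sample covariance of 2.0.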