Quant Flashcards
5 Assumptions to use a multiple regression model
1) Linearity
2) Homoskedasticity
3) Independence of Errors
4) Normality
5) Independence of Independent Variables
Linearity Assumption
The relationship between the independent variable(s) and dependent variable needs to be linear
Homoskedasticity Assumption
The variance of the regression residuals should be the same for all observations
Independence of Errors Assumption
The observations are independent of one another; equivalently, the regression errors are uncorrelated across observations
Normality Assumption
The regression residuals are normally distributed
Independence of Independent Variables Assumption
Independent variables are not random, and there is no exact linear relationship between two or more of the independent variables
Adjusted R-Squared
A version of R-squared that adjusts for the number of independent variables; it increases only when a newly added variable improves the model by more than chance alone
AIC v. BIC
AIC is preferred when the model is used for prediction
BIC is preferred when evaluating goodness of fit
Lower values are better for both
F Statistic
[(SSE of restricted - SSE of unrestricted) / q] / [SSE of unrestricted / (n - k - 1)]
where q is the number of restrictions and SSE is the sum of squared errors
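A minimal sketch of this joint test in Python; the function name and the example numbers are illustrative, and the two SSE values are assumed to come from already-fitted restricted and unrestricted models:

```python
from scipy.stats import f

def joint_f_test(sse_restricted, sse_unrestricted, q, n, k, alpha=0.05):
    """F = [(SSE_R - SSE_U) / q] / [SSE_U / (n - k - 1)]."""
    f_stat = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))
    f_crit = f.ppf(1 - alpha, q, n - k - 1)  # one-tailed critical value
    return f_stat, f_stat > f_crit           # True -> reject the null

# Hypothetical inputs: 2 restrictions, n = 50, k = 4 regressors in the unrestricted model
print(joint_f_test(sse_restricted=120.0, sse_unrestricted=100.0, q=2, n=50, k=4))
```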
T Stat when only given coefficient and standard error, and what is null hypothesis
coefficient / standard error; the null hypothesis is that the coefficient does not differ significantly from 0
Breusch Pagan Test (BP)
- What does it test for
- What is the formula
1) Conditional Heteroskedasticity - variance in residuals differs across observations
2) BP = n * R-squared, where the R-squared comes from regressing the squared residuals on the independent variables (chi-squared with k degrees of freedom, one-tailed)
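A minimal sketch of the BP statistic computed by hand, assuming `X` is an n-by-k NumPy array of the regressors and `resid` holds the residuals from the original fit (statsmodels also ships a ready-made `het_breuschpagan`):

```python
import statsmodels.api as sm
from scipy.stats import chi2

def breusch_pagan(resid, X, alpha=0.05):
    aux = sm.OLS(resid**2, sm.add_constant(X)).fit()  # squared residuals on the regressors
    n, k = X.shape
    bp = n * aux.rsquared              # BP = n * R-squared of the auxiliary regression
    crit = chi2.ppf(1 - alpha, k)      # chi-squared with k df, one-tailed
    return bp, bp > crit               # True -> conditional heteroskedasticity
```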
2 Types of Heteroskedasticity
1) Conditional - error variance is correlated with independent variables (much bigger problem) - high probability of Type 1 errors
2) Unconditional - less problematic, error variance is not correlated with the independent variables
Durbin-Watson Test (DW)
A test for first-order serial correlation in the residuals of a time series model
Breusch-Godfrey Test (BG)
A test used to detect autocorrelation up to a predesignated order of the lagged residuals in a time series model
Multicollinearity
When two or more independent variables are correlated with each other
Test for multicollinearity
Variance inflation factor (VIF)
VIF(j) = 1 / (1 - R-squared(j)), where R-squared(j) comes from regressing the jth independent variable on the remaining independent variables
Any value over 5 warrants investigation
Any value over 10 means multicollinearity is likely
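A minimal sketch computing VIFs by hand, assuming `X` is an n-by-k NumPy array of the independent variables:

```python
import numpy as np
import statsmodels.api as sm

def vifs(X):
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)  # all regressors except the jth
        r2 = sm.OLS(X[:, j], sm.add_constant(others)).fit().rsquared
        out.append(1.0 / (1.0 - r2))      # VIF(j) = 1 / (1 - R-squared(j))
    return out  # > 5 warrants investigation, > 10 suggests multicollinearity
```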
Two types of observations that may influence regression results
1) High Leverage Point
2) Outlier
Difference between high leverage point and outlier
A high leverage point has an extreme x value; an outlier has an extreme y value. A point can be both high leverage and an outlier
How to calculate if a point is high leverage
Leverage
If an observation's leverage exceeds 3 * (k + 1) / n, it is a high leverage point
k - independent variables
n - observations
When looking at regression, determine if independent variable is significantly different from 0
If |t stat| > critical t value (equivalently, if the p value < the significance level), it is significantly different from 0
T stat if not given is coefficient / standard error
Method to identify if an observation is an outlier, and what is the formula
Studentized deleted residuals
t(i) = e(i) / s(e), where e(i) is the residual with the ith observation deleted and s(e) is the standard deviation of all the residuals (i.e., the standard error)
If |t(i)| > 3, or greater than the critical t stat with n - k - 2 degrees of freedom, the observation is an outlier
When is an observation considered influential
If its exclusion from the sample causes substantial changes in the regression function
Cook’s D
Metric for identifying influential observations
Interpreting Cook’s D
If value is greater than 0.5, possibly influential
If value is greater than 1, likely influential
If value greater than SqRt(k/n), likely influential
Dummy Variable
Independent variable that takes on a value of either 0 or 1
also called indicator variable
Types of dummy Variables
1) Intercept Dummy
2) Slope Dummy
3) Interaction Term
Go from log odds to probability
1) Exponentiate it (odds = e^(log odds)); this is the odds
2) Take odds / (1 + odds); this is the probability
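A minimal sketch of the two-step conversion (the function name is illustrative):

```python
import math

def log_odds_to_prob(log_odds):
    odds = math.exp(log_odds)   # step 1: exponentiate to get the odds
    return odds / (1 + odds)    # step 2: odds / (1 + odds) = probability

print(log_odds_to_prob(0.0))  # 0.5 -- log odds of 0 means even odds
```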
Likelihood Ratio (LR) Test
A method to assess the fit of logistic regression models that is based on the log-likelihood metric that describes the model’s fit to the data
LR = -2 * (Log-likelihood of restricted model - log-likelihood of unrestricted model)
Calculate Standard Error of autocorrelations in time series
1 / sqrt(T), where T is number of observations, uniform for every observation
Covariance Stationary
A key assumption needed to make valid statistical inferences in time series models
1) Expected value must be constant and finite in all periods
2) Variance must be constant and finite in all periods
3) Covariance of the series with its own lagged values must be constant and finite in all periods
Autocorrelation
Correlations of a time series with its own past values
Mean reverting level of a time series
b(0) / (1-b(1))
Root Mean Squared Error (RMSE)
The square root of the average squared forecast error, used to compare the out-of-sample forecast performance of forecasting models
Smallest RMSE is most accurate
How to handle simple random walk without drift
First difference the time series because it makes it covariance stationary
Expected Value of simple random walk without drift
0
How to test for unit root
Dickey-Fuller Test
The null hypothesis is that a unit root is present, so rejecting the null is to say the time series has no unit root and is covariance stationary
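A minimal sketch using statsmodels' augmented Dickey-Fuller test, assuming `series` is a 1-D array or pandas Series of the time series values:

```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(series)
if p_value < 0.05:
    print("Reject the null: no unit root (covariance stationary)")
else:
    print("Fail to reject: a unit root is present")
```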
Unit Root
A time series with a unit root is a random walk and is not covariance stationary
A unit root is present when the lag coefficient (b1) equals 1; if |b1| >= 1, the series is not covariance stationary
Co-integration
If two series each have a unit root but some linear combination of them is covariance stationary, they are co-integrated, meaning they move together over the long term and a relationship can be established between the two
Mean Reverting Level
b(0) / (1-b(1)), where b0 and b1 are the coefficients in the model you’re referencing
How to interpret Durbin Watson
A value of 2 means there is no serial correlation
2-4 is negative correlation
0-2 is positive correlation
1.5-2.5 is safe zone where you can use the results
When can you not use the Durbin Watson Test in a time series
When one of the independent variables you are using is a lagged dependent variable
RMSE Calculation
1) Take the difference between the actual values and the forecasts
2) Square the differences
3) Sum the squares
4) Divide by the number of observations to get the mean
5) Take square root of the mean
The lower the RMSE the more accurate the model
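A minimal sketch of the five steps above, assuming `actual` and `forecast` are equal-length NumPy arrays of out-of-sample values:

```python
import numpy as np

def rmse(actual, forecast):
    errors = actual - forecast             # 1) difference between actual values and forecasts
    squared = errors ** 2                  # 2) square the differences
    mean_sq = squared.sum() / len(actual)  # 3) + 4) sum and divide by the number of observations
    return np.sqrt(mean_sq)                # 5) square root of the mean
```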
How to tell if model is covariance stationary based off regression results
Compute coefficient / standard error for each b term (or use the reported t stat) and compare to the critical t stat
In the Dickey-Fuller transformed regression (g = b1 - 1), if the t stat is not greater than the critical value, g is not significantly different from 0, so the series has a unit root and is not covariance stationary
Null hypothesis in Dickey Fuller Test
Null is that there is a unit root, so if the t stat is below the critical value, you fail to reject the null and conclude a unit root is present
In AR1 Model, how do you know if there is a unit root (random walk)
If B0 is 0 and B1 is 1
A bag of words
Representation of text that describes the occurrence of words within a document
Winsorization
The process of replacing extreme values and outliers with the maximum and minimum points
Recall
TP / (TP + FN) -> uses the first column of the confusion matrix only
Precision
TP / (TP + FP) -> uses the first row of the confusion matrix only
When would CART and random forests be used
classification of labeled data and regression
not used for unlabeled data
Low bias error but high variance are indicative of what
Overfitting
Tokenization
Splitting a given text into words (tokens) or characters
Which supervised learning technique requires no hyperparameter
SVM
Hyperparameter in LASSO
lambda
Hyperparameter in KNN
k
K means clustering
Unsupervised technique that partitions observations into a fixed number, k, of non-overlapping clusters. Each cluster is characterized by its center (centroid), and each observation is assigned to the cluster whose centroid it is closest to
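A minimal sketch with scikit-learn, assuming `X` is an n-by-features NumPy array; the number of clusters k is the hyperparameter fixed in advance:

```python
from sklearn.cluster import KMeans

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one centroid per cluster
print(km.labels_)           # the cluster each observation was assigned to
```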
What does the r stand for in DW equation 2(1-r)
The sample correlation between the regression residuals
What types of variables are logistic regression most suited for
Discrete (qualitative) dependent variables, where traditional regression is suited for continuous variables
Target vs. Features
In supervised learning, target is the y (dependent variable) and features are the x (independent variable)
Complexity
The number of features in a model
Bias Error
The degree to which a model fits the data
Base Error
Due to randomness in the data
Variance Error
How much the model's results change in response to new data from the validation and test samples
Learning Curve
A curve that plots the accuracy rate in the validation or test samples against the amount of training data
Soft Margin Misclassification
Adds a penalty to the objective function for observations that are misclassified in a SVM model
K Nearest Neighbor
A supervised learning technique that classifies a new observation by finding similarities between this observation and the existing data
Classification and Regression Tree (CART)
a supervised learning technique that can be used to predict either a categorical or continuous target variable, typically used for binary classification or regression
Pruning
a regularization technique used in CART models that removes low-value branches (those with little classifying power) to reduce overfitting
Ensemble Learning
Combining the predictions from a collection of models
Bagging
- bootstrap aggregating
- new training datasets are generated from the original training data by random sampling with replacement
Random forest classifier
A collection of a large number of decision trees via bagging
F1 Score
Harmonic mean of recall and precision
(2PR) / (P+R)
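A minimal sketch computing recall, precision, and F1 from raw confusion matrix counts (the counts here are hypothetical):

```python
def f1_score(tp, fp, fn):
    recall = tp / (tp + fn)     # first column of the confusion matrix
    precision = tp / (tp + fp)  # first row of the confusion matrix
    return 2 * precision * recall / (precision + recall)

print(f1_score(tp=80, fp=10, fn=20))  # ~0.84
```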
Principal Components Analysis (PCA)
An unsupervised technique used to reduce dimensions
Composite variable
a variable that combines two or more variables that are statistically strongly related to each other
Eigenvector
in the context of PCA, a vector that defines new mutually uncorrelated composite variables that are linear combinations of the original features
Eigenvalue
A measure that gives the proportion of total variance in the initial dataset that is explained by each eigenvector
Scree plot
A plot that shows the proportion of total variance in the data explained by each principal component
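A minimal sketch of a scree plot with scikit-learn, assuming `X` is a standardized n-by-features array; `explained_variance_ratio_` holds the per-component variance proportions the plot displays:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA().fit(X)
components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of total variance explained")
plt.show()
```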
Hierarchical Clustering
Iterative procedure used to build a hierarchy of clusters
Agglomerative clustering
a bottom-up hierarchical clustering method that begins with each observation being treated as its own cluster
Divisive clustering
A top-down hierarchical clustering method that starts with all observations belonging to a single large cluster
Dendrogram
a type of tree diagram used for visualizing a hierarchical cluster analysis
Summation operator
A functional part of a neural network’s node that multiplies each input value received by a weight and sums the weighted values to form the total net input, which is then passed to the activation function
Activation Function
A functional part of a neural network’s node that transforms the total net input received into the final output of the node
Backward propagation
The process of adjusting weights in a neural network, to reduce total error of the network, by moving backward through the network’s layers
Learning Rate
A parameter that affects the magnitude of adjustments in the weights in a neural network
Forward Propagation
The process of passing input values forward through the network's layers, applying weights and activation functions, to produce the network's output
Deep Neural Networks
Neural networks with many hidden layers, at least 2, but often more than 20
Reinforcement Learning
Machine learning in which a computer learns from interacting with itself or data generated by the same algorithm
3 Characteristics of Big Data
1) Volume
2) Variety
3) Velocity
Stemming
Process of converting inflected forms of a word into its base word (analyzing -> analyz)
Lemmatization
Process of converting inflected forms of a word into its morphological root (analyzing -> analyze)
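A minimal sketch of both with NLTK, assuming the WordNet data has been downloaded via nltk.download("wordnet"):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

print(PorterStemmer().stem("analyzing"))                    # "analyz"  (stem / base word)
print(WordNetLemmatizer().lemmatize("analyzing", pos="v"))  # "analyze" (morphological root)
```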
Bag-of-words
A collection of distinct set of tokens from all the texts in a sample dataset, but does not capture the position or sequence of those words
the next step after cleansing data
Document Term Matrix (DTM)
last step of text processing
uses the BOW
Matrix where each row belongs to a document and each column represents a token
N-grams
a representation of word sequences: unigram, bigram, trigram, etc.
False positive rate
FP / (TN+FP)
True positive rate
TP / (TP + FN)
When is precision useful
Where the cost of FP/Type 1 Error is high
When is Recall useful
When cost of FN/Type 2 error is high
What type of data is best used with SVM models
linearly separable data
Veracity
The accuracy of data
Inconsistency Error
The data conflicts with what it should be (e.g., "male" in a name column); a "doesn't make sense" data point
Non-Uniformity Error
Data not presented in same format
Extraction
A new variable is created using existing data
Difference in purpose between feature selection and feature engineering
Feature selection minimizes overfitting and feature engineering minimizes underfitting
Normalization Formula
(value - min) / (max - min)
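A minimal sketch, assuming `x` is a NumPy array:

```python
import numpy as np

def normalize(x):
    return (x - x.min()) / (x.max() - x.min())  # rescales every value to [0, 1]

print(normalize(np.array([2.0, 5.0, 8.0])))  # [0.  0.5 1. ]
```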
How much should be allocated to the training set when there is an absence of ground truth
0%; this is an unsupervised dataset (there are no labels, so there is nothing to train on)
Invalidity Error
When the result is outside the meaningful range
SEE formula
If the relationship between the dependent and independent variables is strong, the SEE will be low
SEE = sqrt(MSE)
MSE = SSE / (n - k - 1)
Formula for T-statistic for correlation coefficient
t = (r * sqrt(n - 2)) / sqrt(1 - r^2)
MSE Formula
SSE / (n - k - 1)
Degrees of freedom for error term
n - k - 1
MSR formula
RSS / k
F stat formula
MSR / MSE
MSR formula = RSS / k
MSE = SSE / (n - k - 1)
How many tails does the F test have
One; the F test is always one-tailed
What does rejection of the null hypothesis of F test mean
At least one of the coefficients is significantly different from 0, which supports the model's explanatory power
What is the effect of serial correlation
Type 1 errors (standard errors are underestimated, inflating t stats)
what is the effect of multicollinearity
Type 2 errors (inflated standard errors make coefficients look insignificant)
Two categories of supervised learning
1) Regression
2) Classification
What type of learning is regression and when would it be used
If the target variable is continuous (supervised learning)
What type of learning is classification and when would it be used
If the target variable is categorical or ordinal, such as company rating
(Supervised learning)
Two categories of unsupervised learning
1) Dimension reduction
2) Clustering
What type of learning technique is CART
supervised learning
What type of variables is CART used to predict
EITHER continuous or categorical
What type of learning technique is K-means and is it top down or bottom up
unsupervised / clustering / bottom up
What type of learning technique is principal component analysis and what is it good for
unsupervised / provides insight into the volatility contained in a data set
What type of learning technique is KNN
supervised
What type of learning technique is LASSO
supervised / regression
What is k-fold-cross-validation
A technique for mitigating the excess reduction of the training set size: the data are shuffled and divided into k sub-samples, and the model is trained on k - 1 of them and validated on the remaining one, rotating until each sub-sample has served once as the validation set
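A minimal sketch with scikit-learn, assuming `X` and `y` hold the training features and target (k = 5 folds here, and the estimator is illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

scores = cross_val_score(LinearRegression(), X, y, cv=5)  # one validation score per fold
print(scores.mean())  # average performance across the 5 folds
```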
Advantage of using CART over KNN
1) CART provides visual
2) CART does not require initial hyperparameters to be set
3) CART does not require a similarity measure to be specified
when is model generalization maximized
when prediction error on test data is minimized
What is high bias error and high variance error indicative of
underfitting
which error are linear functions more prone to
bias error
are linear functions more prone to underfitting or overfitting
underfitting
which ML technique makes use of root nodes, decision nodes, and terminal nodes
CART
Durbin Watson for AR(1) models
Indeterminable; the DW test cannot be used because the regressors include a lagged dependent variable
What modeling technique can you use on random walk patterns
first-differenced regression
what is the most common problem with trend models
serial correlation
when can you not calculate the mean reverting level
When b1 equals 1 (the denominator of b0 / (1 - b1) is zero); a series with b1 >= 1 has no mean reverting level
Stop word
A word that is so common in a text that it carries no meaning
Standardization in text processing
lowercasing, removing stop words, stemming and lemmatization
What problem do stemming and lemmatization address
data sparseness and low frequency tokens