Quant Flashcards
5 Assumptions to use a multiple regression model
1) Linearity
2) Homoskedasticity
3) Independence of Errors
4) Normality
5) Independence of Independent Variables
Linearity Assumption
The relationship between the independent variable(s) and dependent variable needs to be linear
Homoskedasticity Assumption
The variance of the regression residuals should be the same for all observations
Independence of Errors Assumption
The observations are independent of one another; equivalently, the regression errors are uncorrelated across observations
Normality Assumption
The regression residuals are normally distributed
Independence of Independent Variables Assumption
Independent variables are not random, and no exact linear relationship exists between two or more of them
Adjusted R-Squared
A version of R-squared that penalizes additional variables; it increases only when a new variable improves the model by more than would be expected by chance
AIC v. BIC
AIC is preferred when the model is used for prediction
BIC is preferred when evaluating goodness of fit
Lower values are better for both
F Statistic
[(SSE of restricted - SSE of unrestricted) / q] / [SSE of unrestricted / (n - k - 1)]
q = number of restrictions; SSE = sum of squared errors (residuals)
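A minimal worked sketch of this F-test in Python; every number below is hypothetical, not from a real regression:

```python
# Hypothetical worked example of the joint F-test for q restrictions:
# F = [(SSE_restricted - SSE_unrestricted) / q] / [SSE_unrestricted / (n - k - 1)]
sse_restricted = 120.0    # SSE of the model with q coefficients forced to 0
sse_unrestricted = 100.0  # SSE of the full (unrestricted) model
q = 2                     # number of restrictions (excluded variables)
n = 50                    # number of observations
k = 5                     # independent variables in the unrestricted model

f_stat = ((sse_restricted - sse_unrestricted) / q) / (sse_unrestricted / (n - k - 1))
print(f_stat)  # compare to the critical F with q and (n - k - 1) degrees of freedom
```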
T Stat when only given coefficient and standard error, and what is null hypothesis
t = coefficient / standard error; the null hypothesis is that the coefficient equals 0 (does not differ significantly from 0)
Breusch Pagan Test (BP)
- What does it test for
- What is the formula
1) Conditional heteroskedasticity - the variance of the residuals differs across observations and is correlated with the independent variables
2) BP = n * R-squared, where R-squared comes from regressing the squared residuals on the independent variables (chi-square distributed with k degrees of freedom)
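A hedged Python sketch of the BP procedure; the residuals and regressors below are simulated stand-ins, not real model output:

```python
import numpy as np
from scipy import stats

# Breusch-Pagan sketch: regress squared residuals on the independent
# variables, then compute BP = n * R-squared (chi-square with k df).
rng = np.random.default_rng(0)
n, k = 100, 2
X = rng.normal(size=(n, k))   # independent variables (simulated)
resid = rng.normal(size=n)    # residuals from the original regression (simulated)

Xc = np.column_stack([np.ones(n), X])   # add intercept
y = resid ** 2                          # squared residuals
beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
fitted = Xc @ beta
r2 = 1 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

bp = n * r2
p_value = stats.chi2.sf(bp, df=k)
print(bp, p_value)  # small p-value -> evidence of conditional heteroskedasticity
```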
2 Types of Heteroskedasticity
1) Conditional - error variance is correlated with independent variables (much bigger problem) - high probability of Type 1 errors
2) Unconditional - error variance is not correlated with the independent variables; less problematic
Durbin-Watson Test (DW)
A test for first-order serial correlation in a time series model
Breusch-Godfrey Test (BG)
A test used to detect autocorrelation up to a predesignated order of the lagged residuals in a time series model
Multicollinearity
When two or more independent variables are correlated with each other
Test for multicollinearity
Variance inflation factor (VIF)
1 / (1 - R-squared(j)), where R-squared(j) comes from regressing the jth independent variable on the remaining independent variables
Any value over 5 warrants investigation
Any value over 10 means multicollinearity is likely
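A small Python sketch of the VIF calculation on simulated data, with x2 deliberately built to be collinear with x1:

```python
import numpy as np

# VIF for variable j is 1 / (1 - R^2_j), where R^2_j comes from
# regressing X_j on the other independent variables.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # collinear with x1 by construction
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    Z = np.column_stack([np.ones(n), others])
    beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
    fitted = Z @ beta
    r2 = 1 - np.sum((X[:, j] - fitted) ** 2) / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    print(j, 1 / (1 - r2))  # > 5 warrants investigation; > 10 suggests multicollinearity
```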
Two types of observations that may influence regression results
1) High Leverage Point
2) Outlier
Difference between high leverage point and outlier
A high leverage point has an extreme x value; an outlier has an extreme y value. A point can be both high leverage and an outlier
How to calculate if a point is high leverage
Leverage
If leverage exceeds 3*(k+1)/n
k - independent variables
n - observations
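A short Python sketch computing leverage as the diagonal of the hat matrix H = X(X'X)^-1 X' on simulated data:

```python
import numpy as np

# Leverage is the diagonal of the hat matrix; flag observations
# with h_ii > 3 * (k + 1) / n.
rng = np.random.default_rng(2)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

threshold = 3 * (k + 1) / n
print(np.where(leverage > threshold)[0])  # indices of potential high-leverage points
```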
When looking at regression, determine if independent variable is significantly different from 0
If |t-stat| > critical t value (equivalently, if the p-value < the significance level), it is significantly different from 0
T stat if not given is coefficient / standard error
Method to identify if an observation is an outlier, and what is the formula
Studentized deleted residuals
t(i) = e(i) / s(e), where e(i) is the residual for observation i from a regression estimated with the ith observation deleted, and s(e) is the standard error of that regression (the standard deviation of its residuals)
If |t(i)| > 3, or greater than the critical t stat with n-k-2 degrees of freedom, the observation is an outlier
When is an observation considered influential
If its exclusion from the sample causes substantial changes in the regression function
Cook’s D
Metric for identifying influential observations
Interpreting Cook’s D
If value is greater than 0.5, possibly influential
If value is greater than 1, likely influential
If value greater than SqRt(k/n), likely influential
Dummy Variable
Independent variable that takes on a value of either 0 or 1
also called indicator variable
Types of dummy Variables
1) Intercept Dummy
2) Slope Dummy
3) Interaction Term
Go from log odds to probability
1) Raise e to the power of the log odds; this gives the odds
2) Take odds/(1+odds), this is probability
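A two-line Python illustration using a hypothetical log-odds value of 0.75:

```python
import math

log_odds = 0.75            # hypothetical fitted value from a logistic regression
odds = math.exp(log_odds)  # step 1: e raised to the log odds gives the odds
p = odds / (1 + odds)      # step 2: odds / (1 + odds) gives the probability
print(odds, p)
```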
Likelihood Ratio (LR) Test
A method to assess the fit of logistic regression models, based on the log-likelihood metric that describes the model's fit to the data
LR = -2 * (Log-likelihood of restricted model - log-likelihood of unrestricted model)
Calculate Standard Error of autocorrelations in time series
1 / sqrt(T), where T is the number of observations; the same for every lag
Covariance Stationary
A key assumption needed to make valid statistical inferences in time series models
1) Expected value must be constant and finite in all periods
2) Variance must be constant and finite in all periods
3) Covariance of the series with its own leading or lagged values must be constant and finite in all periods
Autocorrelation
Correlations of a time series with its own past values
Mean reverting level of a time series
b(0) / (1 - b(1)), where b0 is the intercept and b1 is the lag coefficient of the model
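A quick numeric check with hypothetical coefficients:

```python
# Hypothetical AR(1) model x_t = b0 + b1 * x_(t-1) + e_t with
# b0 = 1.2 and b1 = 0.4; the mean-reverting level is b0 / (1 - b1).
b0, b1 = 1.2, 0.4
print(b0 / (1 - b1))  # 2.0; forecasts converge toward this level
```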
Root Mean Squared Error (RMSE)
The square root of the average squared forecast error, used to compare the out-of-sample forecast performance of forecasting models
Smallest RMSE is most accurate
How to handle simple random walk without drift
First difference the time series because it makes it covariance stationary
Expected Value of simple random walk without drift
0 for the first-differenced series (the expected value of the error term is 0); the best forecast of x(t) itself is its previous value x(t-1)
How to test for unit root
Dickey-Fuller Test
The null hypothesis is that a unit root is present, so rejecting the null means the time series has no unit root and is covariance stationary
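A sketch using the augmented Dickey-Fuller test from statsmodels (assumed installed) on a simulated random walk:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# Null hypothesis of the ADF test: a unit root is present.
rng = np.random.default_rng(3)
random_walk = np.cumsum(rng.normal(size=500))  # built to have a unit root

adf_stat, p_value, *_ = adfuller(random_walk)
print(adf_stat, p_value)  # large p-value -> fail to reject -> unit root present
```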
Unit Root
A time series with a unit root is not covariance stationary; the classic example is a random walk
A unit root is present when the lag coefficient (b1) equals 1; if |b1| > 1, the series is explosive and also nonstationary
Co-integration
If two series each have a unit root but a linear combination of them is covariance stationary, they are co-integrated, meaning they share a common trend (move together) and a long-term relationship can be established between the two
How to interpret Durbin Watson
A value of 2 means there is no serial correlation
2-4 is negative correlation
0-2 is positive correlation
1.5-2.5 is safe zone where you can use the results
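A short Python sketch of the statistic, DW = sum((e_t - e_(t-1))^2) / sum(e_t^2), on simulated residuals:

```python
import numpy as np

# Durbin-Watson statistic computed directly from regression residuals.
rng = np.random.default_rng(4)
resid = rng.normal(size=100)  # residuals from a regression (simulated here)

dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(dw)  # near 2 -> no serial correlation; below 2 positive, above 2 negative
```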
When can you not use the Durbin Watson Test in a time series
When one of the independent variables you are using is a lagged dependent variable
RMSE Calculation
1) Take the difference between each actual value and its forecast
2) Square the differences
3) Sum the squares
4) Divide by the number of observations to get the mean
5) Take square root of the mean
The lower the RMSE the more accurate the model
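These steps as a minimal Python sketch, with hypothetical actuals and forecasts:

```python
import numpy as np

actual = np.array([1.0, 2.0, 3.0, 4.0])    # hypothetical out-of-sample values
forecast = np.array([1.1, 1.8, 3.2, 4.1])  # hypothetical model forecasts

errors = actual - forecast   # 1) difference between actuals and forecasts
mse = np.mean(errors ** 2)   # 2)-4) square, sum, divide by n
rmse = np.sqrt(mse)          # 5) square root of the mean
print(rmse)                  # lower RMSE -> more accurate model
```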
How to tell if model is covariance stationary based off regression results
Compute coefficient / standard error for each b term (or use the reported t-stat) and compare to the critical t value
If not greater, the coefficient is not significantly different from 0, and the series is not covariance stationary and has a unit root
Null hypothesis in Dickey Fuller Test
The null is that a unit root is present; if the test statistic does not exceed the critical value, you fail to reject the null and conclude a unit root is present
In AR1 Model, how do you know if there is a unit root (random walk)
If B0 is 0 and B1 is 1
A bag of words
Representation of text that describes the occurrence of words within a document
Winsorization
The process of replacing extreme values and outliers with specified maximum and minimum cutoff points (for example, the 95th and 5th percentile values)
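A Python sketch winsorizing at the 5th and 95th percentiles; those cutoffs are a common choice, not part of the definition:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=1000)  # hypothetical data with tails

lo, hi = np.percentile(data, [5, 95])
winsorized = np.clip(data, lo, hi)  # extreme values replaced by the cutoff points
print(winsorized.min(), winsorized.max())
```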
Recall
TP / (TP + FN) -> uses the first column of the confusion matrix only (actual positives)
Precision
TP / (TP + FP) -> uses the first row only (predicted positives)
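A tiny Python check of both metrics using hypothetical confusion-matrix counts:

```python
tp, fn, fp = 40, 10, 5  # hypothetical counts

recall = tp / (tp + fn)     # first column: share of actual positives found
precision = tp / (tp + fp)  # first row: share of predicted positives that are correct
print(recall, precision)
```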
When would CART and random forests be used
classification of labeled data and regression
not used for unlabeled data
Low bias error but high variance are indicative of what
Overfitting
Tokenization
Splitting a given text into words or characters (tokens)
Which supervised learning technique requires no hyperparameter
SVM
Hyperparameter in LASSO
lambda
Hyperparameter in KNN
k
K means clustering
An unsupervised technique that partitions observations into a fixed number, k, of non-overlapping clusters. Each cluster is characterized by its center (centroid), and each observation is assigned to the cluster with the closest centroid
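A minimal sketch using scikit-learn's KMeans (assumed installed) on simulated, unlabeled data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 2))  # hypothetical unlabeled observations

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the k centroids
print(km.labels_[:10])      # cluster assignment for the first 10 observations
```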