Quantitative Methods Flashcards
Formula for Multiple Regression
Coefficient of Determination
R2
Measure of Goodness of Fit
Sum of Squares Regression / Sum of Squares Total
Adjusted R2
Adjusts R2 by the degrees of freedom;
Does not automatically increase when variables are added
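The adjustment described above can be sketched in code; this is a minimal illustration of the standard formula (the function name is mine, not from the source):

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = number of observations, k = number of independent variables.
    Unlike R^2, this can fall when a weak variable is added."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```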
Akaike’s Information Criterion (AIC)
Measure of Model Parsimony, i.e., lower values indicate a better-fitting model
Preferred model for prediction purposes
Schwarz’s Bayesian Information Criterion (BIC or SBC)
Allows us to choose the best model among a set of models
Preferred when Goodness of Fit is Desired
Unrestricted Model
Full model with all independent variables
Restricted Model
Also called nested Models, they take the unrestricted model and exclude one or more variables
F-Distributed Test Statistic When Comparing Restricted & Unrestricted Models
q is the number of restrictions
General Linear F-test
Heteroskedasticity
The variance of the residuals differ across observations
Arises from Omitted Variables, Incorrect Functional Form, Extreme Values
Use Breusch-Pagan (BP) Test
Unconditional Heteroskedasticity
Error variance is not correlated with the Independent Variables
Not a problem for statistical inference
Conditional Heteroskedasticity
Error Variance is correlated with the independent variables
Inflated t-Statistics
Use Breusch-Pagan (BP) Test
Breusch-Pagan (BP) Test
Used to test for Heteroskedasticity;
1. Run Regression
2. Run another regression with the Dependent variable being the residuals squared from step 1
3. Use a Chi-Square Statistic to test the Null Hypothesis that there is no Heteroskedasticity
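The three steps above can be sketched with NumPy; this is an illustrative implementation (function names are mine), assuming `X` includes an intercept column:

```python
import numpy as np

def ols_resid_r2(y, X):
    """Fit OLS by least squares; return (R^2, residuals)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - np.sum(resid ** 2) / sst, resid

def breusch_pagan(y, X):
    """Step 1: run the original regression and keep the residuals.
    Step 2: regress the squared residuals on the same X.
    Step 3: n * R^2 of that auxiliary regression is chi-square
    distributed under H0 of no conditional heteroskedasticity."""
    _, resid = ols_resid_r2(y, X)
    r2_aux, _ = ols_resid_r2(resid ** 2, X)
    return len(y) * r2_aux
```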
Robust Standard Errors
Computed to correct for the effects of Heteroskedasticity
Serial Correlation
Regression Errors are correlated across observations
Typically seen in Time-Series Regressions
Use Durbin Watson (DW) Test or Breusch-Godfrey (BG) Test
Breusch-Godfrey (BG) Test
Used to Test for Serial Correlation;
1. Run the initial regression
2. Regress the Fitted Residuals from Step 1 (as the Dependent Variable) against the initial regressors + one or more lagged residuals
3. Test Hypothesis using Chi-Square Test
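A minimal NumPy sketch of these steps (illustrative, not from the source), again assuming `X` carries an intercept column:

```python
import numpy as np

def breusch_godfrey(y, X, lags=1):
    """Step 1: run the original regression and keep the residuals.
    Step 2: regress those residuals on the original regressors plus
    `lags` lagged residuals.  Step 3: (n - lags) * R^2 of the auxiliary
    regression is chi-square distributed under H0 of no serial
    correlation."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    # Lagged-residual columns; the first `lags` rows are dropped
    cols = [X[lags:]] + [resid[lags - j: n - j, None] for j in range(1, lags + 1)]
    Z = np.hstack(cols)
    u = resid[lags:]
    b, *_ = np.linalg.lstsq(Z, u, rcond=None)
    e = u - Z @ b
    r2 = 1 - np.sum(e ** 2) / np.sum((u - u.mean()) ** 2)
    return (n - lags) * r2
```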
Correcting for Serial Correlation
Serial-correlation consistent standard errors
Computed by Software Packages
Multicollinearity
Independent Variables are correlated to each other
Use variance inflation factor (VIF) to quantify multicollinearity issues
Variance Inflation Factor (VIF) Formula
Used to test for Multicollinearity
VIF>5 Prompts Investigation
VIF>10 Serious Multicollinearity issues
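The VIF for variable j is 1 / (1 − R²_j), where R²_j comes from regressing variable j on the remaining independent variables. A small NumPy sketch (illustrative naming):

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R^2 from regressing
    column j of X on the other columns (plus an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out.append(1 / (1 - r2))
    return out
```

Uncorrelated regressors give VIFs near 1; values above 5 or 10 flag the thresholds noted on the card.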
Correcting Multicollinearity
- Excluding 1 or more variables
- Using a different proxy for one of the variables
- Increasing sample size
No easy way to fix
High Leverage Point
Extreme Value of an Independent Variable
Outlier
Extreme Value of Dependent Variable
Leverage
Difference between the nth observation of an independent variable and the mean of that variable
Rule of Thumb: Leverage above 3 × (k + 1)/n is potentially influential
Studentized Residual
Way of testing for outliers
Cook’s Distance
Metric for identifying influential data points; measures how the estimated values of the regression change after deleting an observation
Logistic Transformation
Transforms the probability of a Qualitative Dependent Variable into log odds, which have a Linear relationship with the independent variables
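The logistic (logit) transformation and its inverse can be written directly; a minimal sketch (function names are mine):

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)).  This quantity is modeled as linear
    in the independent variables in a logistic regression."""
    return math.log(p / (1 - p))

def sigmoid(x):
    """Inverse of the logit: maps any real value back to a probability
    in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))
```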
Logistic Regression
Likelihood Ratio (LR) Test
Method to assess the fit of Logistic Regression models
Higher log-likelihood values (closer to 0) indicate a better-fitting model
Linear Trend Model Formula
Time Series with Linear Trend
Log-Linear Model Formula
Commonly used with time series that have exponential growth
Autoregressive (AR) Model Formula
Time series model regressed on its own past values
Covariance Stationary
- The expected value of the Time Series must be constant and finite in all periods
- The variance in the time series must be constant and finite in all periods
- The covariance of the time series with itself (autocovariance) must be constant and finite for a fixed number of periods into the future or the past
Mean Reversion level for AR (1) Model
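For an AR(1) model x(t) = b0 + b1·x(t−1) + ε(t), the mean reversion level is b0 / (1 − b1), found by setting x(t) = x(t−1). A one-line sketch (illustrative name):

```python
def ar1_mean_reversion_level(b0, b1):
    """For x_t = b0 + b1 * x_{t-1} + e_t with |b1| < 1 (covariance
    stationary), the series reverts toward b0 / (1 - b1)."""
    return b0 / (1 - b1)
```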
Random Walk
The value of a time series in one period is the same as the one in the previous time period, with an error term added.
Use Dickey Fuller Test
Dickey-Fuller Test
Used to test for a Unit root; If there is a unit root, then the time series is a random walk.
Test whether g = 0 in the regression x(t) − x(t−1) = b0 + g·x(t−1) + ε(t), where g = b1 − 1; failing to reject g = 0 indicates a unit root
n-period Moving Average Formula
Used to smooth out period to period fluctuations in time series models
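The n-period moving average is the mean of the most recent n observations. A minimal sketch (illustrative name):

```python
def moving_average(series, n):
    """n-period moving average: the mean of the most recent n
    observations, reported once n observations are available."""
    return [sum(series[i - n + 1: i + 1]) / n for i in range(n - 1, len(series))]
```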
Moving Average Time-Series (MA(1)) Model
AR(1) Model Adjusted for Quarterly Seasonality
Autoregressive Moving Average Models (ARMA)
Combines Autoregressive and Moving Average Time Series
Can be very unstable
Autoregressive Conditional Heteroskedasticity (ARCH(1)) Model
Way of Testing whether an AR Model's errors exhibit Conditional Heteroskedasticity
Cointegration
A Long-Term financial or economic relationship exists between two time series such that they do not diverge in the long run
Test for Cointegration between two time series that have a unit root
Supervised Learning
Infers patterns between inputs (features) and Outputs (targets); uses labeled data
Unsupervised Learning
Seeks to identify structure in unlabeled data; Used in
1. Dimension Reduction (reduce number of features)
2. Clustering
Guide to ML Algorithms
Overfitting
The model does not generalize well to new data
Bias Error
Degree to which the model fits the training data; produces underfitting and in-sample errors
Variance Error
How much the model’s results change in response to new data; Causes overfitting and out-of-sample errors
Base Error
Due to Randomness of Data
Cross-Validation
Method of reducing overfitting
K-Fold Cross Validation
Used to randomize the data into training and validation samples
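The splitting logic can be sketched in a few lines; this is an illustrative implementation (names are mine), shuffling the indices before partitioning:

```python
import random

def k_fold_splits(n, k, seed=0):
    """Shuffle indices 0..n-1 and partition them into k folds; each
    fold serves once as the validation set while the remaining k-1
    folds form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [([i for m, f in enumerate(folds) if m != j for i in f], folds[j])
            for j in range(k)]
```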
LASSO (Least Absolute Shrinkage and Selection Operator)
A type of Penalized Regression that applies a penalty, based on the sum of the absolute values of the coefficients, which grows as features are added to the regression
Hyperparameter
Parameter selected by the researcher before learning begins
Support Vector Machine
Optimally separates the data into two sets using the boundary with the widest margin
k-Nearest Neighbor (KNN)
Supervised learning technique used mostly for classification and sometimes for regression
Classification and Regression Tree (CART)
Supervised learning used in both classification and regression; commonly applied to binary classification or regression
Ensemble Learning
Combining the predictions from a collection of models
Bootstrap Aggregation (Bagging)
Technique where the original dataset is resampled (with replacement) to create n new datasets
Random Forest Classifier
Large number of decision trees trained via a bagging method
Principal Component Analysis (PCA)
Transform many highly correlated features of data into a smaller number of uncorrelated composite variables
Eigenvectors
Mutually uncorrelated composite variables that are linear combinations of the original features
Represents a direction
Eigenvalue
Represents the proportion of the total variance explained by each eigenvector
k-Means Clustering
A form of Unsupervised learning
Hierarchical Clustering
A form of unsupervised learning
Choosing an ML Algorithm Flowchart
ML Model Building Steps
- Conceptualization of the Modeling Task
- Data Collection
- Data Preparation and Wrangling
- Data Exploration
- Model Training
Text ML Model Building Steps
- Text Problem Formulation
- Data (Text) Curation
- Text Preparation and Wrangling
- Text Exploration
Trimming
When extreme values and outliers are removed from the dataset
Also called truncation
Winsorization
When extreme values or outliers are replaced by the maximum (minimum) values that are not outliers
Normalization Formula
Process of rescaling numeric variables in the range of [0,1]
Standardization Formula
Process of both centering and scaling the variables
Most effective when the data are approximately normally distributed
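Both rescaling formulas are short enough to sketch directly (illustrative names; the standardization here uses the population standard deviation):

```python
def normalize(xs):
    """Min-max rescaling to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Centering and scaling: (x - mean) / stdev."""
    m = sum(xs) / len(xs)
    sd = (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - m) / sd for x in xs]
```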
Confusion Matrix
Precision Formula
Recall Formula
Accuracy Formula
F1 Score Formula
Root Mean Squared Error
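The evaluation formulas on the cards above can be written out from the confusion-matrix cells (TP, FP, TN, FN); a minimal sketch with illustrative names:

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def f1(p, r):
    """F1 score = harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

def rmse(actual, predicted):
    """Root mean squared error of predictions against actuals."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted))
            / len(actual)) ** 0.5
```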