Quantitative Methods Flashcards
When should you use Logistic regression models?
If the dependent Y variable is discrete
If the dependent Y variable is qualitative (categorical), for example a binary outcome
When should you use Multiple regression models?
When the dependent variable is continuous (not discrete) and there is more than one explanatory variable (more than one independent variable).
When multiple independent variables determine the outcome of a single dependent variable.
- Dependent Y variable is continuous
- We have more than 1 independent X variable
Assumptions of regression models
L.I.I.N.H.
Linearity: The relationship between the dependent Y variable and the independent X variables is linear.
Independence of errors: Regression residuals are uncorrelated across observations.
Independence: The independent X variables are not random, and there is no exact linear relationship between 2 or more independent variables.
Normality: Regression residuals are normally distributed.
Homoscedasticity: The variance of the regression residuals is constant.
How to determine if a variable is significant?
|T-Stat| > critical t-value (roughly 2 at the 5% significance level in large samples)
Degrees of freedom for SSR
k (the number of independent variables)
Degrees of freedom for SST
N-1
Degrees of freedom for SSE
N-(K+1) = N-K-1
What will happen to adjusted R-Square if we add insignificant variables?
Adjusted R-Square decreases
R-Square formula
SSR/SST = Explained variation / Total variation
1 - (Unexplained variation / Total variation)
What kind of test is this?
H0: bi = Bi
Ha: bi ≠ Bi
Two tail test
What kind of test is this?
H0: bi <= Bi
Ha: bi > Bi
Right tail test
Mnemonic: the > in Ha points to the right tail.
What kind of test is this?
H0: bi >= Bi
Ha: bi < Bi
Left tail test
Mnemonic: the < in Ha points to the left tail.
Model Misspecification - Omitted variable
If we omit a significant variable from our model, the error term will capture the effect of the missing variable.
Model Misspecification - Inappropriate form of variable
Failing to account for non-linearity
Causes: Conditional heteroscedasticity
To fix it we can use natural log to transform the variable to be linear.
Model Misspecification - Inappropriate Scaling
Causes Conditional heteroscedasticity and multicollinearity
Model Misspecification - Inappropriate Pooling of Data
Causes Conditional heteroscedasticity and Serial correlation
What is Unconditional heteroscedasticity
Var(error) is not correlated with the independent variables.
No issue with inference.
What is Conditional heteroscedasticity
Var(error) is correlated with the independent X variables.
F-test is unreliable since MSE is a biased estimator of the true population variance.
In a time series, the variance at one time step can have a positive relationship with the variance at one or more previous time steps. This implies that periods of high variability tend to follow periods of high variability, and periods of low variability tend to follow periods of low variability.
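What does the Breusch-Pagan (BP) test do?
Tests for heteroskedasticity
The formula for the BP test statistic
n * R-Square, where R-Square comes from regressing the squared residuals on the independent variables. The statistic is chi-square distributed with k degrees of freedom.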
BP test
Test statistic > Critical value
Reject the null.
There is heteroskedasticity
- H0: No heteroskedasticity - homoskedasticity is present (constant variance)
- Ha: Heteroskedasticity
BP test
Test statistic < Critical value
Fail to reject the null
No heteroskedasticity - homoskedasticity is present (constant variance)
H0: No heteroskedasticity
Ha: Heteroskedasticity
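A minimal Python sketch of the BP decision rule. The function names and the inputs are illustrative assumptions: 100 observations, an auxiliary R-Square of 0.08, and 5.991 as the 5% chi-square critical value for 2 degrees of freedom.

```python
def bp_statistic(n: int, r_squared_resid: float) -> float:
    """BP statistic: n times the R-Square from regressing the squared
    residuals on the independent variables."""
    return n * r_squared_resid


def bp_decision(statistic: float, critical_value: float) -> str:
    """Compare the statistic with a chi-square critical value (k degrees of freedom)."""
    if statistic > critical_value:
        return "reject H0: conditional heteroskedasticity"
    return "fail to reject H0: no evidence of heteroskedasticity"


# Hypothetical inputs, not taken from the cards.
stat = bp_statistic(n=100, r_squared_resid=0.08)
print(stat, bp_decision(stat, critical_value=5.991))
```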
What is serial correlation?
Errors are correlated across observations
Positive Serial Correlation
A positive residual is most likely followed by a positive residual
A negative residual is most likely followed by a negative residual
Negative Serial Correlation
A negative residual is most likely followed by a positive residual
A positive residual is most likely followed by a negative residual
Multicollinearity
2 or more independent variables are highly correlated or there is an approximate linear relationship among the IVs.
Coefficients will be consistent but imprecise and unreliable
Inflated SE and insignificant T-Statistics, but possibly significant F-Statistics
How to detect multicollinearity?
Variance inflation factor
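VIF = 1 / (1 - R-Square), where R-Square comes from regressing one independent variable on the other independent variables
We want VIF as low as possible
> 5 Concerning
> 10 Multicollinearity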
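A small Python sketch of the VIF calculation and the rule-of-thumb cutoffs from the card. The function names and the sample R-Square values are assumptions for illustration.

```python
def vif(r_squared_j: float) -> float:
    """Variance inflation factor for one independent variable, given the
    R-Square from regressing it on the other independent variables."""
    if not 0.0 <= r_squared_j < 1.0:
        raise ValueError("R-Square must be in [0, 1)")
    return 1.0 / (1.0 - r_squared_j)


def vif_flag(value: float) -> str:
    """Apply the card's cutoffs: above 5 is concerning, above 10 is multicollinearity."""
    if value > 10:
        return "multicollinearity"
    if value > 5:
        return "concerning"
    return "acceptable"


for r2 in (0.5, 0.85, 0.95):  # hypothetical auxiliary R-Squares
    print(r2, round(vif(r2), 2), vif_flag(vif(r2)))
```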
How to fix multicollinearity?
- Increase sample size
- Excluding one or more of the regression variables.
- Use a different proxy for one of the variables
Formula and purpose of AIC
AIC = n * ln(SSE/n)+ 2(K+1)
AIC is better for forecasting purposes
Formula and purpose of BIC
BIC = n * ln(SSE/n) + Ln(n)(k+1)
Better for evaluating goodness-of-fit
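The two formulas can be sketched in Python as follows. The function names and the two candidate models are made up for illustration. For both criteria, the model with the lower value is preferred.

```python
import math


def aic(sse: float, n: int, k: int) -> float:
    """AIC = n * ln(SSE/n) + 2(k+1)."""
    return n * math.log(sse / n) + 2 * (k + 1)


def bic(sse: float, n: int, k: int) -> float:
    """BIC = n * ln(SSE/n) + ln(n)(k+1)."""
    return n * math.log(sse / n) + math.log(n) * (k + 1)


# Hypothetical candidates: (label, SSE, number of independent variables), n = 100.
candidates = [("small model", 120.0, 2), ("large model", 110.0, 6)]
for label, sse, k in candidates:
    print(label, round(aic(sse, 100, k), 3), round(bic(sse, 100, k), 3))
```

BIC charges ln(n) per parameter instead of 2, so it penalizes extra variables more heavily than AIC once n exceeds about 7.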
How do we test joint coefficients?
F-Stat
[(SSE restricted - SSE unrestricted) / q] / [SSE unrestricted / (N-k-1)], where q is the number of restrictions
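A short Python sketch of the joint F-statistic. The function name and the numbers are assumptions for illustration, with q as the number of coefficients restricted to zero and k as the number of independent variables in the unrestricted model.

```python
def joint_f_stat(sse_restricted: float, sse_unrestricted: float,
                 q: int, n: int, k: int) -> float:
    """F = [(SSE_r - SSE_u) / q] / [SSE_u / (n - k - 1)]."""
    numerator = (sse_restricted - sse_unrestricted) / q
    denominator = sse_unrestricted / (n - k - 1)
    return numerator / denominator


# Hypothetical inputs: dropping 2 of 4 variables raises SSE from 100 to 120.
print(joint_f_stat(sse_restricted=120.0, sse_unrestricted=100.0, q=2, n=55, k=4))
```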
What is a High leverage point?
Extreme value of independent variables
Observation that is outside the range of independent variables (x axis)
What is an outlier?
Extreme value in the dependent variable
Observation that is outside the range of the dependent variables (vertical Y range)
How do you detect and calculate a High leverage point?
Calculate the leverage measure for each observation
hi = 1/n + (Squared deviation of Xi from its mean / Sum of all squared deviations)
Potentially influential if hi > 3 x (K+1)/n
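A minimal Python sketch of the leverage measure for a regression with one independent variable, using squared deviations from the mean and the 3(k+1)/n cutoff. The function names and the data are illustrative assumptions.

```python
def leverages(x: list[float]) -> list[float]:
    """h_i = 1/n + (x_i - mean)^2 / sum of squared deviations."""
    n = len(x)
    mean_x = sum(x) / n
    ss_dev = sum((v - mean_x) ** 2 for v in x)
    return [1 / n + (v - mean_x) ** 2 / ss_dev for v in x]


def high_leverage_flags(x: list[float], k: int = 1) -> list[bool]:
    """Flag observations whose leverage exceeds 3 * (k + 1) / n."""
    cutoff = 3 * (k + 1) / len(x)
    return [h > cutoff for h in leverages(x)]


# Hypothetical data: the last observation sits far from the other X values.
x_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
print(high_leverage_flags(x_values))
```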
How do you detect and calculate an outlier?
Externally studentized residuals
- Delete each case i
- Calculate new regression
- Add deleted observation back in, calculate residual
- Calculate studentized residuals
T* = e* / se*
Potentially influential if ...
|T*| > Critical t (for small samples)
|T*| > 3 for large samples
How can we determine and find influential outliers
By calculating Cook's distance (aka Cook's D)
If Cook's D is …
Di > 0.5
Could be influential
If Cook's D is
Di > 1
Likely to be influential
If Cook's D is
Di > 2 x √(K/n)
Influential
What does an intercept dummy variable look like?
No interaction term
yi = b0 + b1x1 + b2x2 + d0D1 + epsilon
What does a slope dummy variable look like?
Interaction term
yi = b0 + b1x1 + d1x1D + epsilon
How do you interpret an independent variable’s slope coefficient in a logistic regression model
The change in the log odds that the event happens per unit change in the independent variable, holding all other independent variables constant.
The intercept in these logistic regressions is interpreted as the:
log odds of the ETF being a winning fund if all independent variables are zero.
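To make the log-odds interpretation concrete, here is a small Python sketch that converts log odds into a probability. The function names and the coefficients are hypothetical and do not come from any fitted model.

```python
import math


def log_odds(intercept: float, slopes: list[float], xs: list[float]) -> float:
    """Log odds = b0 + b1*x1 + ... + bk*xk."""
    return intercept + sum(b * x for b, x in zip(slopes, xs))


def probability(log_odds_value: float) -> float:
    """Convert log odds to a probability: p = 1 / (1 + e^(-log odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds_value))


# Hypothetical coefficients: intercept -1.0, one slope of 0.5.
base = log_odds(-1.0, [0.5], [2.0])    # log odds at x = 2
bumped = log_odds(-1.0, [0.5], [3.0])  # one unit higher: log odds rise by the slope
print(probability(base), probability(bumped))
```

With all independent variables at zero, the log odds equal the intercept.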
When to use a Log-Linear trend model?
When the dependent Y variable changes at a constant growth rate
When to use a Linear trend model?
When the dependent Y variable changes at a constant rate with time.
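A short Python sketch contrasting the two trend models. It assumes the usual forms y = b0 + b1*t for the linear trend and ln(y) = b0 + b1*t for the log-linear trend. The coefficients are made up.

```python
import math


def linear_trend(b0: float, b1: float, t: int) -> float:
    """Linear trend: y changes by a constant amount b1 each period."""
    return b0 + b1 * t


def log_linear_trend(b0: float, b1: float, t: int) -> float:
    """Log-linear trend: ln(y) = b0 + b1*t, so y grows at a constant rate."""
    return math.exp(b0 + b1 * t)


# Hypothetical coefficients.
change = linear_trend(100.0, 5.0, 2) - linear_trend(100.0, 5.0, 1)
ratio = log_linear_trend(math.log(100.0), 0.05, 2) / log_linear_trend(math.log(100.0), 0.05, 1)
print(change, ratio - 1)  # constant amount vs constant growth rate
```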
DW test for Serial correlation in linear/log-linear model hypothesis
H0: DW = 2 - No serial correlation (if we fail to reject the null)
Ha: DW ≠ 2 - We have serial correlation (if we reject the null)
Autoregressive AR model
A time series regressed on its own past values.
A statistical model is autoregressive if it predicts future values based on past values. For example, an autoregressive model might seek to predict a stock’s future prices based on its past performance.
What are the 3 properties we must satisfy to have “Covariance Stationary Series”
Mean, Variance, and Cov(yt, yt-s) must be constant and finite in all periods.
- The expected value of the time series must be constant and finite in all periods.
- The variance of the time series must be constant and finite in all periods.
- The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
What is “mean Reversion”
The value of the time series falls when it’s above its mean, and rises when it’s below its mean.
Mean reversion in finance suggests that various relevant phenomena such as asset prices and volatility of returns eventually revert to their long-term average levels.
The mean reversion theory has led to many investment strategies, from stock trading techniques to options pricing models.
Mean reversion trading tries to capitalize on extreme changes in the price of a particular security, assuming that it will revert to its previous state
Define the Mean reverting level …
Xt > b0/(1-b1)
The time series will decrease
Define the Mean reverting level …
Xt = b0/(1-b1)
The time series will remain the same
Define the Mean reverting level …
Xt < b0/(1-b1)
The time series will increase
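The three mean-reversion cards can be sketched in Python for an AR(1) model xt = b0 + b1*xt-1. The function names and coefficients are illustrative assumptions.

```python
def mean_reverting_level(b0: float, b1: float) -> float:
    """Mean reverting level of an AR(1) model: b0 / (1 - b1)."""
    if abs(b1) >= 1:
        raise ValueError("no finite mean reverting level when |b1| >= 1")
    return b0 / (1 - b1)


def expected_move(x_t: float, b0: float, b1: float) -> str:
    """Direction the series is expected to move from its current value."""
    level = mean_reverting_level(b0, b1)
    if x_t > level:
        return "decrease"
    if x_t < level:
        return "increase"
    return "remain the same"


# Hypothetical coefficients: b0 = 2, b1 = 0.5, so the level is 4.
for x in (5.0, 4.0, 3.0):
    print(x, expected_move(x, b0=2.0, b1=0.5))
```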
What is an “in-sample forecast”
Prediction
Predicted vs observed values within the sample used to generate the model
Models with a smaller variance of errors are more accurate
What is an “out-of-sample forecast”
Forecast
Forecast vs observed values outside the sample used to generate the model
Use Root Mean Squared Errors (RMSE) - used to compute out-of-sample forecasting performance. The smaller the RMSE, the better.
What 2 elements does Random Walk not have?
Finite mean reverting level, and finite variance
Which test do we use to test for unit root?
Dickey-Fuller test
When testing for Unit root
If the coefficient is |b1| < 1
No unit root - the time series is covariance stationary
When testing for Unit root
If the coefficient is b1 = 1
Unit root.
Time series is a random walk.
It is not covariance stationary
DW Test for SC
Result from model output (DW statistic) < DW critical value (dl)
Evidence of positive serial correlation
We can reject the null hypothesis of no positive serial correlation
DW Test for SC
Result from model output (DW statistic) > DW critical value (du)
No evidence of positive serial correlation
When are residuals not serially correlated in AR model test statistics?
|T-Stat| of each residual autocorrelation < Critical Value
When are residuals serially correlated in AR model test statistics?
|T-Stat| of any residual autocorrelation > Critical Value
The standard error of the autocorrelations is calculated as…
1/√T
where T represents the number of observations used in the regression
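A minimal Python sketch of the t-test on a residual autocorrelation, using the 1/√T standard error. The function names, the autocorrelation of 0.25, and the critical value of 2.0 are assumptions for illustration. A |t| above the critical value means the autocorrelation differs significantly from zero, so the residuals are serially correlated.

```python
import math


def autocorrelation_t_stat(autocorrelation: float, num_obs: int) -> float:
    """t = autocorrelation / (1 / sqrt(T))."""
    standard_error = 1.0 / math.sqrt(num_obs)
    return autocorrelation / standard_error


def is_serially_correlated(t_stat: float, critical_value: float) -> bool:
    """True when the autocorrelation is significantly different from zero."""
    return abs(t_stat) > critical_value


# Hypothetical inputs: residual autocorrelation of 0.25 with T = 100.
t_stat = autocorrelation_t_stat(0.25, 100)
print(t_stat, is_serially_correlated(t_stat, critical_value=2.0))
```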
Explain the DW test
The DW statistic is designed to detect positive serial correlation of the errors of a regression equation.
Under the null hypothesis of no positive serial correlation, the DW statistic is 2.0.
Positive serial correlation will lead to a DW statistic that is less than 2.0.
We do NOT want positive serial correlation !!!
The steps to calculate RMSE …
The steps to calculate RMSE are as follows:
- Take the difference between the actual and the forecast values. This is the error.
- Square the error.
- Sum the squared errors.
- Divide by the number of forecasts.
- Take the square root of the average.
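The steps above can be sketched in Python as follows. The function name and the sample values are illustrative assumptions.

```python
import math


def rmse(actual: list[float], forecast: list[float]) -> float:
    """Root mean squared error of a set of forecasts."""
    if not actual or len(actual) != len(forecast):
        raise ValueError("need two non-empty lists of equal length")
    # Steps 1-2: error = actual - forecast, then square it.
    squared_errors = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    # Steps 3-4: sum the squared errors and divide by the number of forecasts.
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    # Step 5: take the square root of the average.
    return math.sqrt(mean_squared_error)


# Hypothetical out-of-sample values.
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```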
Root Mean Squared Error (RMSE)
Root Mean Square Error steps
1. Square the errors (Actual - Forecasted)
2. Sum the squared errors and calculate the mean
3. Take the square root of the mean
4. Then we will have our RMSE - the smaller the RMSE, the better
A model’s accuracy in forecasting out-of-sample values is assessed using the root mean squared error (RMSE).
RMSE is the square root of the mean squared error. The model with the smallest RMSE is seen as the most accurate, as it is perceived to have better predictive power in the future.
What is a unit root
Is a stochastic trend in a time series
Random Walk with a drift
If a TS has a unit root, it shows a systematic pattern that is unpredictable
How do we transform TS into covariance stationary
By using first differencing
Regression 2: yt = b0 + b1yt−1 + εt,
where yt = xt − xt−1.
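First differencing itself is a one-line transformation. A small Python sketch with a made-up series:

```python
def first_difference(x: list[float]) -> list[float]:
    """y_t = x_t - x_(t-1); the result is one observation shorter than x."""
    return [x[t] - x[t - 1] for t in range(1, len(x))]


# Hypothetical price series.
prices = [100.0, 103.0, 101.0, 106.0]
print(first_difference(prices))
```

The differenced series is then used as the dependent variable in the AR regression.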
Can we test for positive SC if we have lag variables using DW test?
NO!!!
DW can be used for trend models (linear and log-linear), not for autoregressive models that use lagged values of the dependent variable
When testing for serial correlation using DW test.
0 ——-|dl|——–|du|——-2
Between 0 and dl = Positive SC
Between dl and du = We don’t know (inconclusive)
Between du and 2 = Okay
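A minimal Python sketch of the DW statistic and the decision regions above. It uses the standard formula DW = sum of squared changes in residuals / sum of squared residuals. The residuals and the critical values dl = 1.0 and du = 1.4 are hypothetical and not taken from a DW table.

```python
def durbin_watson(residuals: list[float]) -> float:
    """DW = sum((e_t - e_(t-1))^2) / sum(e_t^2)."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def dw_decision(dw: float, d_lower: float, d_upper: float) -> str:
    """Classify a DW statistic between 0 and 2 using the critical values."""
    if dw < d_lower:
        return "positive serial correlation"
    if dw > d_upper:
        return "no evidence of positive serial correlation"
    return "inconclusive"


# Hypothetical residuals that stay positive, then stay negative.
dw = durbin_watson([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
print(round(dw, 3), dw_decision(dw, d_lower=1.0, d_upper=1.4))
```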
Adjusted R square formula
1 - [(n-1)/(n-k-1)] x (1-R Squared)
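A short Python sketch of the formula. The function name and the inputs are assumptions for illustration.

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-Square = 1 - [(n-1)/(n-k-1)] * (1 - R-Square)."""
    return 1 - ((n - 1) / (n - k - 1)) * (1 - r_squared)


# Hypothetical model: R-Square of 0.80 with 50 observations.
print(adjusted_r_squared(0.80, n=50, k=4))
print(adjusted_r_squared(0.80, n=50, k=8))  # same fit, more variables: lower value
```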
Holding all other variables constant, the adjusted R-Square will decrease when all of the following variables increase except…
The number of observations
What does the BP test for ?
Conditional Heteroskedasticity
What is the most common problem in trend models?
Serial correlation
Trend models often have the limitation that their errors are serially correlated. This is because predictions in trend models are based solely on the time period, so they fail to account for significant patterns in the data such as recessions.
Hierarchical clustering is most likely used when the problem involves
Classifying unlabeled data
What is Supervised machine learning
Involves training an algorithm to take a set of inputs (x variables) and find a model that best relates them to outputs (Y variables)
Training algorithm - Set of inputs - find models that relates to outputs.
What is unsupervised machine learning
Same as supervised learning, but does not make use of labeled training data.
We give it data and expect the algorithm to make sense of it.
What is Overfitting
ML models can produce overly complex models that may fit the training data too well and thereby not generalize well to new data.
The prediction model of the training sample (in-sample data) is too complex.
The model fitted to the training data does not work well with new data.
Name Supervised ML Algorithms
Penalized regression
Support Vector Machine (SVM)
K - Nearest Neighbor
Classification and Regression Trees (CART)
Ensemble learning
Random Forest
Name unsupervised ML Algorithms
Principal components analysis
K-Means clustering
Hierarchical clustering
High Bias Error in ML
High Bias Error means the model does not fit the training data well.
High Variance Error in ML
High variance error means the model does not predict well on the test data
Name Dimension Reduction in ML
Principal components analysis (unsupervised ML)
Penalized Regression
(Supervised ML)
What does Penalized Regression do?
- Similar to maximizing adjusted R square.
- Dimension Reduction
- Eliminates/minimizes overfitting
Regression coefficients are chosen to minimize the sum of the squared error, plus a penalty term that increases with the number of included features
What is SVM
Support Vector Machine
It is used for classification, regression, and outlier detection
Classifying data that is not complex or non-linear.
It is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points.
Does not require any hyperparameter.
Maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations.
Outliers do not affect either the support vectors or the discriminant boundary.
What is K-Nearest Neighbor
Classification
Classify new observation by finding similarities in the existing data.
Makes no assumption about the distribution of the data.
It is non-parametric.
KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually.
This removes less relevant information.
What is CART
Classification and Regression Trees
Part of supervised ML
Typically applied when the target is binary.
If the goal is regression, the prediction would be the mean of the values of the terminal node.
Makes no assumption about the characteristics of the training data, so if left unconstrained, it can potentially learn the training data perfectly.
To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
What are the 3 types of layer in Neural Network
- Input layer
- Hidden layer
- Output layer
What are non-linear functions more susceptible to?
Variance error and overfitting
What are linear functions more susceptible to?
Bias error and underfitting
The main distinction between clustering and classification algorithms is that
The groups in clustering are determined by the data
In classification, they are determined by the analyst/researcher
What is K-Means clustering in ML?
K-means partitions observations into a fixed number, k, of non-overlapping clusters.
Each cluster is characterized by its centroid, and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
High bias error and high variance error are indicative of…
Underfitting
High bias error = model does not fit the training data well.
High variance error = model does not predict well on test data.
The combination of both results in an underfitted model.
Low bias error but high variance error is indicative of ..
Overfitting
Bias error = model does not fit the training data well.
Variance error = Model does not predict well on test data.
What are linear models more susceptible to?
Bias Error (underfitting)
What are non-linear models more prone to?
Variance Error
(overfitting)
What is Principal Components Analysis
It is part of unsupervised ML
Dimension Reduction
Used to reduce highly correlated features of data into a few main uncorrelated composite variables.
Steps in Big Data Analysis/Projects: Traditional with structured data.
Conceptualize the task -> Collect data -> Data preparation & processing -> Data exploration -> Model training.
Steps in Big Data Analysis/Projects: Textual Big Data.
Text problem formulation -> Data curation -> Text preparation and processing -> Text exploration -> Classifier output.
Preparation in structured data: Extraction
Creating a new variable from an already existing one to ease the analysis.
Example: Date of birth -> Age
Preparation in structured data: Aggregation
2 or more variables aggregated into one single variable.
Preparation in structured data: Filtration
Eliminate data rows which are not needed.
[We filter out the information that is not relevant]
Example: keeping CFA Lv 2 candidates only
Preparation in structured data: Selection
Eliminate data columns which are not needed.
Preparation in structured data: Conversion
Converting features to the appropriate data type: nominal, ordinal, integer, ratio, categorical.
Cleansing structured data: Incomplete
Missing entries
Cleansing structured data: Invalid
Outside a meaningful range
Cleansing structured data: Inconsistent
Some data conflicts with other data.
Cleansing structured data: Inaccurate
Not a true value
Cleansing structured data: Non-uniform
Non-identical data formats
Example: American date (M/D/Y) vs European (D/M/Y)
Cleansing structured data: Duplication
Multiple identical observations
Adjusting the range of a feature: Normalization
Rescales to the range 0-1
Sensitive to outliers.
(Xi - Xmin) / Range
(Xi - Xmin) / (Xmax - Xmin)
Adjusting the range of a feature: Standardization
Centers and rescales
Assumes the data is normally distributed
(Xi - u) / Standard deviation
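Both rescaling methods can be sketched with the Python standard library. The function names and the data are illustrative. The population standard deviation is used here as an assumption, because the card does not say whether the sample or population version is intended.

```python
import statistics


def normalize(values: list[float]) -> list[float]:
    """Min-max normalization: (Xi - Xmin) / (Xmax - Xmin), giving values in [0, 1]."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values: list[float]) -> list[float]:
    """Standardization: (Xi - mean) / standard deviation."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std_dev for v in values]


feature = [2.0, 4.0, 6.0, 8.0]  # hypothetical feature values
print(normalize(feature))
print(standardize(feature))
```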
Performance evaluation graph: Precision formula
P = TP / (TP + FP)
Remember: the denominator contains all predicted positives
Useful when the cost of a Type I error is high
Precision is the ratio of correctly predicted positive classes to all predicted positive classes.
Precision is useful in situations where the cost of FP or Type I Error is high.
For example, when an expensive product fails quality inspection (predicted class 1) and is scrapped, but it is actually perfectly good (actual class 0).
Performance evaluation graph: Recall formula
R = TP / (TP + FN)
Remember: recall has the opposite (FN) in the denominator
Sensitivity: useful when the cost of a Type II error is high.
Recall, also known as sensitivity, is the ratio of correctly predicted positive classes to all actual positive classes. Recall is useful in situations where the cost of FN or Type II Error is high.
For example, when an expensive product passes quality inspection (predicted class 0) and is sent to the valued customer, but it is actually quite defective (actual class 1).
Performance evaluation graph: Accuracy formula
(TP + TN) / (TP + FN + TN + FP)
Is the percentage of correctly predicted classes out of total predictions.
Receiver operating characteristics: False Positive Rate Formula
FP / (FP + TN)
Statement / (Statement + Opposite)
Receiver operating characteristics: True Positive Rate Formula
TP / (TP + FN)
Statement / (Statement + Opposite)
In big data projects, which measure is the most appropriate for regression method
RMSE
(Root Mean Square Error)
What is “trimming” in big data projects?
Removing the bottom and top 1% of observations on a feature in a data set.
What is “Winsorization” in big data projects?
Replacing the extreme values in a data set with the same maximum or minimum value
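A minimal Python sketch of trimming and winsorization. The function names, the `percent` parameter, and the data are assumptions. The default of 1 mirrors the 1% in the trimming card, while the example uses 10% so that a 10-value sample has something to cut.

```python
def trim(values: list[float], percent: float = 1) -> list[float]:
    """Drop the lowest and highest `percent` of observations (returned sorted)."""
    ordered = sorted(values)
    cut = int(len(ordered) * percent / 100)
    return ordered[cut:len(ordered) - cut]


def winsorize(values: list[float], percent: float = 1) -> list[float]:
    """Replace values beyond the cutoffs with the cutoff values themselves."""
    ordered = sorted(values)
    cut = int(len(ordered) * percent / 100)
    low, high = ordered[cut], ordered[-cut - 1]
    return [min(max(v, low), high) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical feature with one extreme value
print(trim(data, percent=10))
print(winsorize(data, percent=10))
```

Trimming shortens the data set, while winsorization keeps every observation and only caps the extremes.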
Confusion Matrix: F1 Score Formula
(2 x P x R) / (P + R)
is the harmonic mean of precision and recall.
F1 Score is more appropriate than Accuracy when there is an unequal class distribution in the dataset and it is necessary to measure the balance between Precision and Recall.
High scores on both of these metrics suggest good model performance.
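The confusion-matrix formulas in this deck (precision, recall, accuracy, false positive rate, and F1) can be sketched together in Python. The function name and the counts are made up for illustration.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the confusion-matrix metrics from the four cell counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "false_positive_rate": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 4))
```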
Confusion Matrix Display
              Actual 1   Actual 0
Predicted 1   TP         FP
Predicted 0   FN         TN
What is Mutual Information in big data projects?
How much info a token contributes to a class
Mutual Information (MI) measures how much information is contributed by a token to a class of text.
MI = 0 The token’s distribution in all text classes is the same.
MI = 1 The token in any one class tends to occur more often in only that particular class of text.
Feature Engineering
Final stage in Data Exploration
Numbers: Differentiate among types of numbers
N-Grams: Multi-word patterns kept intact
Named entity recognition (NER): Classifies tokens into classes such as money, time, organization.
How to deal with Class Imbalance?
The majority class can be under-sampled and the minority class can be over-sampled.
Tokenization is the process of
Splitting a given text into separate words or characters.
A token is equivalent to a word, and tokenization is the process of splitting the text into separate tokens.
the sequence of steps for text preprocessing is to produce
Tokens -> N-grams, from which to build a bag of words -> Input to a document term matrix.
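The sequence above can be sketched with the Python standard library. The function names, the whitespace tokenizer, and the two sample documents are simplifying assumptions. Real text preprocessing would also cover steps such as removing punctuation and stop words.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens (whitespace tokenizer, an assumption)."""
    return text.lower().split()


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Join each run of n consecutive tokens into one n-gram."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(documents: list[str], n: int = 1):
    """Build a bag of n-grams per document, then one count row per document."""
    bags = [Counter(ngrams(tokenize(doc), n)) for doc in documents]
    vocabulary = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, matrix


docs = ["market is up", "market is down down"]  # hypothetical documents
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```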
Big Data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 4 V’s. What are the 4 V’s?
Volume: refers to the quantity of data.
Variety: pertains to the array of available data sources.
Velocity: is the speed at which data is created (data in motion is hard to analyze compared to data at rest).
Veracity: related to the credibility and reliability of different data sources.
What is Exploratory Data Analysis (EDA), and in which stage is it in?
Stage 4 and first stage in Data Exploration
is the preliminary step in data exploration. Exploratory graphs, charts and other visualizations
such as heat maps and word clouds are designed to summarize and observe data.
What is Feature Selection, and in which stage is it in?
Stage 4 and Second stage in Data Exploration
is a process whereby only pertinent features from the dataset are selected for ML model training.
What is Feature Engineering, and in which stage is it in?
Stage 4 and Third and final stage in Data Exploration
Feature engineering is a process of creating new features by changing or transforming existing features. Feature engineering techniques systematically alter, decompose or combine existing features to produce more meaningful features.
Formula For F-Test
MSR / MSE
[ RSS / k ] / [ SSE / (n-(k+1)) ]
[ Regression / k ] / [ Residual / (n-(k+1)) ]
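A short Python sketch of the F-statistic from the sums of squares. The function name and the inputs are assumptions for illustration.

```python
def f_statistic(regression_ss: float, residual_ss: float, n: int, k: int) -> float:
    """F = MSR / MSE = (regression SS / k) / (residual SS / (n - (k + 1)))."""
    mean_square_regression = regression_ss / k
    mean_square_error = residual_ss / (n - (k + 1))
    return mean_square_regression / mean_square_error


# Hypothetical ANOVA table: regression SS = 80, residual SS = 20, n = 25, k = 4.
print(f_statistic(regression_ss=80.0, residual_ss=20.0, n=25, k=4))
```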
Hypothesis test for F test
H0: b1 = b2 = ... = bk = 0
Ha: at least one bi ≠ 0
F-stat > critical F: Reject the null.
F-stat < critical F: Fail to reject the null.
What are the 3 types of error in ML?
Bias error
Variance error
Base error
What is variance error in ML?
Variance error is how much the model’s results change in response to new data from validation and test samples.
Unstable models pick up noise and produce high variance, causing overfitting and ↑ out-of-sample error.
What is Bias error in ML?
Bias error is the degree to which a model fits the training data.
Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and ↑ in-sample error.
(Adding more training samples will not improve the model)
What is Base error in ML?
Base error is due to randomness in the data.
(Out-of-sample accuracy increases as the training sample size increases)
Name 2 ways to Preventing Overfitting in Supervised Machine Learning
Occam’s Razor: The problem solving principle that the simplest solution tends to be the correct one.
In supervised ML, it means preventing the algorithm from getting too complex during selection and training by limiting the no. of features and penalizing algorithms that are too complex or too flexible by constraining them to include only parameters that reduce out-of-sample error.
K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias.
The challenge is having a large enough data set to make both training and testing possible on representative samples.
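A minimal Python sketch of how k-fold cross validation splits a sample. The function name is an assumption, and the sketch leaves out shuffling, which is commonly applied before splitting.

```python
def k_fold_splits(n_samples: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, test))
        start += size
    return splits


for train_idx, test_idx in k_fold_splits(n_samples=10, k=5):
    print("train:", train_idx, "test:", test_idx)
```

Each observation lands in a test fold exactly once, so every data point is used for both training and validation.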