Quantitative Methods Flashcards

1
Q

When should you use Logistic regression models?

A

If the dependent Y variable is discrete (qualitative or categorical), for example a binary yes/no outcome.

2
Q

When should you use Multiple regression models?

A

When the dependent variable is continuous (not discrete) and there is more than one explanatory variable (more than one independent variable).

When multiple independent variables determine the outcome of a single dependent variable.

  • Dependent Y Variable is continuous
  • We have more than 1 independent X variable
3
Q

Assumption of Regression models

A

L.I.I.N.H.

Linearity: Relationship between dependent Y variable and Independent X variable is linear.

Independence of errors: Regression residuals are uncorrelated across observations.

Independence of independent variables: The independent X variables are not random, and there is no exact linear relationship between 2 or more independent variables.

Normality: Regression residuals are normally distributed.

Homoscedasticity: Constant variance of regression residuals

4
Q

How to determine if a variable is significant?

A

|T-Stat| > critical t-value (roughly 2 at the 5% significance level in large samples)

5
Q

Degrees of freedom for SSR

A

k (the number of independent variables), with SSR read as the regression (explained) sum of squares

6
Q

Degrees of freedom for SST

A

N-1

7
Q

Degrees of freedom for SSE

A

N-(K+1), that is N-K-1

8
Q

What will happen to adjusted R-Square if we add insignificant variables?

A

Adjusted R-Square decreases

9
Q

R-Square formula

A

SSR/SST = Explained variation / Total variation

1-(unexplained variation/total variation)
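
A minimal sketch of the two equivalent forms of R-Square, assuming SSR is the explained (regression) sum of squares and SSE is the unexplained (error) sum of squares, so that SST = SSR + SSE. The function names and input numbers are made up for illustration.

```python
def r_squared_explained(ssr: float, sst: float) -> float:
    """R-Square as explained variation over total variation."""
    return ssr / sst


def r_squared_unexplained(sse: float, sst: float) -> float:
    """R-Square as 1 minus unexplained variation over total variation."""
    return 1.0 - sse / sst


# Hypothetical sums of squares: SSR = 80, SSE = 20, so SST = 100.
ssr, sse = 80.0, 20.0
sst = ssr + sse
print(r_squared_explained(ssr, sst), r_squared_unexplained(sse, sst))
```

Both forms give the same number because the explained and unexplained parts add up to the total variation.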

10
Q

What kind of test is this?

H0: bi = Bi
Ha: bi ≠ Bi

A

Two tail test

11
Q

What kind of test is this?

H0: bi <= Bi
Ha: bi > Bi

A

Right tail test

Mnemonic: the > in Ha points to the right tail

12
Q

What kind of test is this?

H0: bi >= Bi
Ha: bi < Bi

A

Left tail test

Mnemonic: the < in Ha points to the left tail

13
Q

Model Misspecification - Omitted variable

A

If we omit a significant variable from our model, the error term will capture the effect of the missing variable.

14
Q

Model Misspecification - Inappropriate form of variable

A

Failing to account for non-linearity
Causes: Conditional heteroscedasticity

To fix it we can use natural log to transform the variable to be linear.

15
Q

Model Misspecification - Inappropriate Scaling

A

Causes Conditional heteroscedasticity and multicollinearity

16
Q

Model Misspecification - Inappropriate Pooling of Data

A

Causes Conditional heteroscedasticity and Serial correlation

17
Q

What is Unconditional heteroscedasticity

A

Var(error) is not correlated with the independent variables.
No issue with inference.

18
Q

What is Conditional heteroscedasticity

A

Var(error) is correlated with the independent X variables.

F-test is unreliable since MSE is a biased estimator of the true population variance.

Variance at one time step has a positive relationship with variance at one or more previous time steps. This implies that periods of high variability will tend to follow periods of high variability and periods of low variability will tend to follow periods of low variability.

19
Q

What does the Breusch-Pagan (BP) test do?

A

Tests for heteroskedasticity

20
Q

The formula for BP test statistics

A

n * R-Square, where R-Square comes from regressing the squared residuals on the independent variables
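
A minimal sketch of the BP calculation, assuming the R-Square comes from an auxiliary regression of the squared residuals on the independent variables and that the critical value is looked up from a chi-square table with k degrees of freedom. The function names and the sample numbers are hypothetical.

```python
def bp_statistic(n: int, r2_aux: float) -> float:
    """BP test statistic: number of observations times the auxiliary R-Square."""
    return n * r2_aux


def bp_reject_null(n: int, r2_aux: float, chi2_critical: float) -> bool:
    """True when the statistic exceeds the critical value (heteroskedasticity)."""
    return bp_statistic(n, r2_aux) > chi2_critical


# Hypothetical case: 60 observations, auxiliary R-Square of 0.10,
# chi-square critical value of 5.991 (5% level, 2 degrees of freedom).
print(bp_statistic(60, 0.10), bp_reject_null(60, 0.10, 5.991))
```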

21
Q

BP test
Test statistics > Critical value

A

Reject the null.
Heteroskedasticity is present
The variance of the residuals is not constant

  • H0: No heteroskedasticity - homoskedasticity is present
  • Ha: Heteroskedasticity
22
Q

BP test
Test statistics < Critical value

A

Fail to reject the null

No evidence of heteroskedasticity (homoskedasticity: constant variance)

H0: No heteroskedasticity
Ha: Heteroskedasticity

23
Q

What is serial correlation?

A

Errors are correlated across observations

24
Q

Positive Serial Correlation

A

A positive residual is most likely followed by a positive residual
A negative residual is most likely followed by a negative residual

25
Negative Serial Correlation
A negative residual is most likely followed by a positive residual. A positive residual is most likely followed by a negative residual.
26
Multicollinearity
2 or more independent variables are highly correlated, or there is an approximate linear relationship among the IVs. Coefficients will be consistent but imprecise and unreliable. Standard errors are inflated, giving insignificant T-Statistics but a possibly significant F-Statistic.
27
How to detect multicollinearity?
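Variance inflation factor: VIF = 1 / (1 - R-Square), where R-Square comes from regressing that independent variable on the other independent variables. We want VIF as low as possible. VIF > 5: concerning. VIF > 10: multicollinearity.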
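
A minimal sketch of the VIF rule of thumb, assuming the R-Square passed in comes from regressing one independent variable on all the others. The function names and labels are illustrative only.

```python
def vif(r2_j: float) -> float:
    """Variance inflation factor for one independent variable."""
    return 1.0 / (1.0 - r2_j)


def vif_flag(value: float) -> str:
    """Apply the card's thresholds: above 5 is concerning, above 10 is a problem."""
    if value > 10:
        return "multicollinearity"
    if value > 5:
        return "concerning"
    return "ok"


# Hypothetical auxiliary R-Square values.
for r2 in (0.50, 0.85, 0.95):
    print(r2, round(vif(r2), 2), vif_flag(vif(r2)))
```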
28
How to fix multicollinearity?
Increase the sample size. Exclude one or more of the regression variables. Use a different proxy for one of the variables.
29
Formula and purpose of AIC
AIC = n * ln(SSE/n) + 2(k+1). AIC is better for forecasting purposes.
30
Formula and purpose of BIC
BIC = n * ln(SSE/n) + ln(n)(k+1). BIC is better for evaluating goodness-of-fit.
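
A minimal sketch of both criteria side by side, using the formulas from the two cards above. The function names and the sample inputs are made up; lower values are preferred when comparing models fitted to the same data.

```python
import math


def aic(n: int, sse: float, k: int) -> float:
    """AIC = n * ln(SSE / n) + 2 * (k + 1)."""
    return n * math.log(sse / n) + 2 * (k + 1)


def bic(n: int, sse: float, k: int) -> float:
    """BIC = n * ln(SSE / n) + ln(n) * (k + 1)."""
    return n * math.log(sse / n) + math.log(n) * (k + 1)


# Hypothetical model: 100 observations, SSE of 50, 2 independent variables.
print(round(aic(100, 50.0, 2), 4), round(bic(100, 50.0, 2), 4))
```

BIC applies the heavier penalty per parameter once ln(n) exceeds 2, which is the case for n of 8 or more.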
31
How do we test joint coefficients?
F-Stat = [(SSE restricted - SSE unrestricted) / q] / [SSE unrestricted / (n - k - 1)], where q is the number of restrictions
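
A minimal sketch of the joint F-statistic, assuming q is the number of coefficients restricted to zero and k is the number of independent variables in the unrestricted model. The function name and inputs are hypothetical.

```python
def joint_f_stat(sse_restricted: float, sse_unrestricted: float,
                 q: int, n: int, k: int) -> float:
    """F-stat for testing q coefficients jointly."""
    numerator = (sse_restricted - sse_unrestricted) / q
    denominator = sse_unrestricted / (n - k - 1)
    return numerator / denominator


# Hypothetical case: dropping 2 of 4 variables raises SSE from 100 to 120
# in a sample of 55 observations.
print(joint_f_stat(120.0, 100.0, q=2, n=55, k=4))
```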
32
What is a High leverage point?
An extreme value of the independent variables. An observation that is outside the range of the independent variables (x axis).
33
What is an outlier?
An extreme value in the dependent variable. An observation that is outside the range of the dependent variable (vertical Y range).
34
How do you detect and calculate a High leverage point?
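Calculate the leverage measure for each observation: h = 1/n + (squared deviation of Xi from its mean / sum of all squared deviations), shown here for one independent variable. **Potentially influential if h > 3 x (k+1)/n**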
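
A minimal sketch of the leverage check for a regression with a single independent variable, assuming the one-variable leverage formula 1/n + (Xi - mean)^2 / sum of squared deviations and the 3 x (k+1)/n cutoff. Function names and data are made up.

```python
def leverage(xs: list[float]) -> list[float]:
    """Leverage of each observation in a one-variable regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    total_sq_dev = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x - mean_x) ** 2 / total_sq_dev for x in xs]


def high_leverage_indices(xs: list[float], k: int = 1) -> list[int]:
    """Indices whose leverage exceeds 3 * (k + 1) / n."""
    threshold = 3 * (k + 1) / len(xs)
    return [i for i, h in enumerate(leverage(xs)) if h > threshold]


# Hypothetical X values: ten ordinary points and one extreme point.
xs = list(range(1, 11)) + [50]
print(high_leverage_indices(xs))
```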
35
How do you detect and calculate an outlier?
**Externally studentized residuals** 1. Delete case i. 2. Calculate a new regression on the remaining observations. 3. Add the deleted observation back in and calculate its residual e* against the new regression. 4. Calculate the studentized residual T* = e* / se*. **Potentially influential if** |T*| > critical t (for small samples) or |T*| > 3 (for large samples).
36
How can we determine and find influential outliers?
By calculating Cook's distance (aka Cook's D)
37
If Cook's D is ... Di > 0.5
Could be influential
38
If Cook's D is ... Di > 1
Likely to be influential
39
If Cook's D is ... Di > 2 x √(k/n)
Influential
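
A minimal sketch that applies the three Cook's D rules of thumb from the cards above to a single observation. The function name, dictionary keys, and input values are illustrative assumptions.

```python
import math


def cooks_d_flags(d: float, k: int, n: int) -> dict[str, bool]:
    """Check one Cook's distance value against the three thresholds."""
    return {
        "could_be_influential": d > 0.5,
        "likely_influential": d > 1.0,
        "above_2_sqrt_k_over_n": d > 2 * math.sqrt(k / n),
    }


# Hypothetical observation: D = 0.6 with k = 2 variables and n = 50.
print(cooks_d_flags(0.6, k=2, n=50))
```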
40
What does an intercept dummy variable look like?
No interaction term: yi = b0 + b1x1 + b2x2 + d0D1 + epsilon
41
What does a slope dummy variable look like?
Interaction term: yi = b0 + b1x1 + d1x1D + epsilon
42
How do you interpret an independent variable's slope coefficient in a logistic regression model?
The change in the log odds that the event happens per unit change in the independent variable, holding all other independent variables constant.
43
The intercept in these logistic regressions is interpreted as the:
log odds of the ETF being a winning fund if all independent variables are zero.
44
When to use a Log-Linear trend model?
When the dependent Y variable changes at a constant growth rate
44
When to use a Linear trend model?
When the dependent Y variable changes at a constant rate with time.
45
DW test for Serial correlation in linear/log-linear model hypothesis
H0: DW = 2. If we fail to reject the null, there is no serial correlation. Ha: DW ≠ 2. If we reject the null, we have serial correlation.
45
Autoregressive AR model
A time series regressed on its own past values. A statistical model is autoregressive if it predicts future values based on past values. For example, an autoregressive model might seek to predict a stock's future prices based on its past performance.
46
What are the 3 properties we must satisfy to have "Covariance Stationary Series"
Mean, Variance, and Cov(yt, yt-s) must be constant and finite in all periods. 1. The expected value of the time series must be constant and finite in all periods. 2. The variance of the time series must be constant and finite in all periods. 3. The covariance of the time series with itself for a fixed number of periods in the past or future must be constant and finite in all periods.
47
What is "mean Reversion"
The value of the time series falls when it's above its mean, and rises when it's below its mean. Mean reversion in finance suggests that various relevant phenomena such as asset prices and volatility of returns eventually revert to their long-term average levels. The mean reversion theory has led to many investment strategies, from stock trading techniques to options pricing models. Mean reversion trading tries to capitalize on extreme changes in the price of a particular security, assuming that it will revert to its previous state
48
Define the Mean reverting level ... Xt > b0/(1-b1)
The time series will decrease
48
Define the Mean reverting level ... Xt = b0/(1-b1)
The time series will remain the same
49
Define the Mean reverting level ... Xt < b0/(1-b1)
The time series will increase
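
A minimal sketch of the three mean-reversion cases, assuming an AR(1) model xt+1 = b0 + b1 * xt with b1 below 1. The function names and coefficient values are hypothetical.

```python
def mean_reverting_level(b0: float, b1: float) -> float:
    """Mean-reverting level of an AR(1) series: b0 / (1 - b1)."""
    return b0 / (1.0 - b1)


def expected_move(x_t: float, b0: float, b1: float) -> str:
    """Direction the series is expected to move next period."""
    level = mean_reverting_level(b0, b1)
    if x_t > level:
        return "decrease"
    if x_t < level:
        return "increase"
    return "remain the same"


# Hypothetical AR(1) coefficients: b0 = 1.0, b1 = 0.5, so the level is 2.0.
for x in (3.0, 2.0, 1.0):
    print(x, expected_move(x, 1.0, 0.5))
```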
50
What is an "in-sample forecast"
Prediction: predicted values are compared with the observed values that were used to generate the model. Models with a smaller variance of errors are more accurate.
51
What is an "out-of-sample forecast"
Forecast: forecasts are compared with values outside the sample used to generate the model. Root Mean Squared Error (RMSE) is used to measure out-of-sample forecasting performance. The smaller the RMSE, the better.
52
What 2 elements does Random Walk not have?
Finite mean reverting level, and finite variance
53
Which test do we use to test for unit root?
Dickey-Fuller test
54
When testing for Unit root If the coefficient is |b1| < 1
No unit root - the time series is covariance stationary
55
When testing for Unit root If the coefficient is b1 = 1
Unit root. The time series is a random walk. It is not covariance stationary.
56
DW Test for SC: Result from model output (DW statistic) < DW Critical
**Evidence of Positive Serial Correlation**: we can reject the hypothesis of no positive serial correlation
57
DW Test for SC: Result from model output (DW statistic) > DW Critical
**NO Evidence of Positive Serial Correlation**
58
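When are residuals **not** serially correlated in AR model test statistics?
|T-Stat| < Critical Value (the residual autocorrelations are not significantly different from zero)
59
When are residuals serially correlated in AR model test statistics?
|T-Stat| > Critical Value (a residual autocorrelation is significantly different from zero)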
60
The standard error of the autocorrelations is calculated as...
1/√T **where T represents the number of observations used in the regression**
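
A minimal sketch of how the standard error feeds the t-statistic for a residual autocorrelation, assuming the t-statistic is the autocorrelation divided by 1/√T. The function names and sample numbers are made up.

```python
import math


def autocorr_standard_error(t_obs: int) -> float:
    """Standard error of a residual autocorrelation: 1 / sqrt(T)."""
    return 1.0 / math.sqrt(t_obs)


def autocorr_t_stat(autocorr: float, t_obs: int) -> float:
    """t-statistic for testing that a residual autocorrelation is zero."""
    return autocorr / autocorr_standard_error(t_obs)


# Hypothetical case: residual autocorrelation of 0.25 with T = 100.
print(autocorr_standard_error(100), autocorr_t_stat(0.25, 100))
```

A t-statistic whose absolute value exceeds the critical value points to serially correlated residuals.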
61
Explain the DW test
The DW statistic is designed to detect positive serial correlation of the errors of a regression equation. Under the null hypothesis of no positive serial correlation, the DW statistic is 2.0. Positive serial correlation will lead to a DW statistic that is less than 2.0. We do NOT want positive serial correlation !!!
62
The steps to calculate RMSE ...
The steps to calculate RMSE are as follows: 1. Take the difference between the actual and the forecast values. This is the error. 2. Square the error. 3. Sum the squared errors. 4. Divide by the number of forecasts. 5. Take the square root of the average.
63
Root Mean Squared Error (RMSE)
Root mean squared error steps: 1. Square the errors (Actual - Forecast). 2. Sum the squared errors and calculate the mean. 3. Take the square root of the mean. 4. This is the RMSE; the smaller the RMSE, the better. A model's accuracy in forecasting out-of-sample values is assessed using the root mean squared error (RMSE). RMSE is the square root of the mean squared error. The model with the smallest RMSE is seen as the most accurate, as it is perceived to have better predictive power in the future.
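
The RMSE steps can be sketched as below. This is an illustration with a made-up function name and made-up actual and forecast values, not a reference implementation.

```python
import math


def rmse(actual: list[float], forecast: list[float]) -> float:
    """Root mean squared error of a set of forecasts."""
    errors = [a - f for a, f in zip(actual, forecast)]  # step 1: errors
    squared = [e ** 2 for e in errors]                  # step 2: square them
    mean_squared = sum(squared) / len(squared)          # steps 3-4: sum, average
    return math.sqrt(mean_squared)                      # step 5: square root


# Hypothetical out-of-sample values.
print(round(rmse([10.0, 12.0, 14.0], [11.0, 12.0, 16.0]), 4))
```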
64
What is a unit root
A unit root is a stochastic trend in a time series, for example a random walk with a drift. If a time series has a unit root, it shows a systematic pattern that is unpredictable.
65
How do we transform TS into covariance stationary
By using first differencing. Regression 2: yt = b0 + b1yt−1 + εt, where yt = xt − xt−1.
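
A minimal sketch of first differencing, where each new value is yt = xt − xt−1. The function name and the sample series are hypothetical.

```python
def first_difference(xs: list[float]) -> list[float]:
    """Return y_t = x_t - x_(t-1); the result is one observation shorter."""
    return [curr - prev for prev, curr in zip(xs, xs[1:])]


# Hypothetical price series.
print(first_difference([100.0, 102.0, 105.0, 104.0]))
```

The differenced series is then the one used in the AR regression instead of the original levels.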
66
Can we test for positive SC if we have lag variables using DW test?
NO!!! DW can be used for trend models, but not for autoregressive (AR) models, which use lagged values of the dependent variable.
67
When testing for serial correlation using DW test.
0 -------|dl|--------|du|-------2. Between 0 and dl = positive SC. Between dl and du = we don't know (inconclusive). Between du and 2 = okay.
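
A minimal sketch of the DW decision zones for positive serial correlation, assuming dl and du are read from a DW table. The function name, labels, and critical values are made up, and values above 2 are simply treated as no evidence of positive serial correlation.

```python
def dw_decision(dw: float, dl: float, du: float) -> str:
    """Classify a DW statistic using the lower and upper critical values."""
    if dw < dl:
        return "positive serial correlation"
    if dw <= du:
        return "inconclusive"
    return "no evidence of positive serial correlation"


# Hypothetical critical values: dl = 1.5, du = 1.7.
for stat in (1.2, 1.6, 1.9):
    print(stat, dw_decision(stat, dl=1.5, du=1.7))
```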
68
Adjusted R square formula
1 - [(n-1)/(n-k-1)] x (1-R Squared)
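
A minimal sketch of the adjusted R-Square formula, with a made-up example showing that adding a variable which barely raises R-Square can lower the adjusted figure. The function name and numbers are hypothetical.

```python
def adjusted_r_squared(r2: float, n: int, k: int) -> float:
    """Adjusted R-Square = 1 - [(n - 1) / (n - k - 1)] * (1 - R-Square)."""
    return 1.0 - ((n - 1) / (n - k - 1)) * (1.0 - r2)


# Hypothetical models on 50 observations: a 5th variable lifts R-Square
# only from 0.800 to 0.801.
before = adjusted_r_squared(0.800, n=50, k=4)
after = adjusted_r_squared(0.801, n=50, k=5)
print(round(before, 4), round(after, 4))
```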
69
Holding all other variables constant, the adjusted R-Square will decrease when all of the following variables increase except...
The number of observations
70
What does the BP test for ?
Conditional Heteroskedasticity
71
What is the most common problem in trend models?
**Serial correlation** Trend models often have the limitation that their errors are serially correlated. This is because predictions in trend models are based solely on what time period it is, and thus they fail to account for significant patterns in the data such as a recession.
72
Hierarchical clustering is most likely used when the problem involves
Classifying unlabeled data
73
What is Supervised machine learning
Involves training an algorithm to take a set of inputs (x variables) and find a model that best relates them to outputs (Y variables) Training algorithm - Set of inputs - find models that relates to outputs.
74
What is unsupervised machine learning
Same as supervised learning, but does not make use of labeled training data. We give it data and expect the algorithm to make sense of it.
75
What is Overfitting
ML models can produce overly complex models that fit the training data too well and thereby do not generalize well to new data. The prediction model built on the training sample (in-sample data) is too complex, so it does not work well with new data.
76
Name Supervised ML Algorithms
Penalized regression Support Vector Machine (SVM) K - Nearest Neighbor Classification and Regression Trees (CART) Ensemble learning Random Forest
77
Name unsupervised ML Algorithms
Principal components analysis, K-Means clustering, Hierarchical clustering
78
High Bias Error in ML
High Bias Error means the model does not fit the training data well.
79
High Variance Error in ML
High variance error means the model does not predict well on the test data
80
Name Dimension Reduction in ML
Principal components analysis (unsupervised ML), Penalized regression (supervised ML)
81
What does Penalized Regression do?
Similar to maximizing adjusted R-Square. Dimension reduction. Eliminates/minimizes overfitting. Regression coefficients are chosen to minimize the sum of the squared errors, plus a penalty term that increases with the number of included features.
82
What is SVM
**Support Vector Machine** **Used for classification, regression, and outlier detection.** Suited to classifying data that is not complex or non-linear. It is a linear classifier that determines the hyperplane that optimally separates the observations into two sets of data points. Does not require any hyperparameter. Maximizes the probability of making a correct prediction by determining the boundary that is furthest from all observations. Outliers do not affect either the support vectors or the discriminant boundary.
83
What is K-Nearest Neighbor
Classification: classifies a new observation by finding similarities in the existing data. Makes no assumption about the distribution of the data. It is non-parametric. KNN results can be sensitive to the inclusion of irrelevant or correlated features, so it may be necessary to select features manually, thereby removing less relevant information.
84
What is CART
**Classification and Regression Trees** Part of supervised ML. Typically applied when the target is binary. If the goal is regression, the prediction would be the mean of the values of the terminal node. Makes no assumption about the characteristics of the training data, so if left unconstrained, it can potentially learn the training data perfectly. To avoid overfitting, regularization parameters can be added, such as the maximum depth of the tree.
85
What are the 3 types of layer in Neural Network
1. Input layer 2. Hidden layer 3. Output layer
86
What are non-linear functions more susceptible to?
Variance error and overfitting
87
What are linear functions more susceptible to?
Bias error and underfitting
88
The main distinction between clustering and classification algorithms is that
The groups in clustering are determined by the data. In classification, they are determined by the analyst/researcher.
89
What is K-Means clustering in ML?
K-means partitions observations into a fixed number, k, of non-overlapping clusters. Each cluster is characterized by its centroid, and each observation is assigned by the algorithm to the cluster with the centroid to which that observation is closest.
90
High bias error and high variance error are indicative of...
Underfitting. High bias error = the model does not fit the training data well. High variance error = the model does not predict well on test data. The combination of both results in an underfitted model.
90
Low bias error but high variance error is indicative of ..
Overfitting. Bias error = the model does not fit the training data well. Variance error = the model does not predict well on test data.
91
What are linear models more susceptible to?
Bias Error (underfitting)
92
What are non-linear models more prone to?
Variance Error (overfitting)
93
What is Principal Components Analysis
It is part of unsupervised ML. Dimension reduction: used to reduce highly correlated features of data into a few main uncorrelated composite variables.
94
Steps in Big Data Analysis/Projects: **Traditional with structured data**.
**Conceptualize the task** -> **Collect data** -> **Data preparation & processing** -> **Data exploration** -> **Model training.**
95
Steps in Big Data Analysis/Projects: **Textual Big Data**.
**Text problem formulation** -> **Data curation** -> **Text preparation and processing** -> **Text exploration** -> **Classifier output**.
96
Preparation in structured data: **Extraction**
Creating a new variable from an already existing one for easing the analysis. **Example**: Date of birth -> Age
97
Preparation in structured data: **Aggregation**
2 or more variables aggregated into one single variable.
98
Preparation in structured data: **Filtration**
Eliminate data rows which are not needed. [We filter out the information that is not relevant] Example: CFA Lv 2 candidates only
99
Preparation in structured data: **Selection**
Columns that can be eliminated
100
Preparation in structured data: **Conversion**
Converting between data types: nominal, ordinal, integer, ratio, categorical.
101
Cleansing structured data: **Incomplete**
Missing entries
102
Cleansing structured data: **Invalid**
Outside a meaningful range
103
Cleansing structured data: **Inconsistent**
Some data conflicts with other data.
103
Cleansing structured data: **Inaccurate**
Not a true value
104
Cleansing structured data: **Non-uniform**
Non-identical data formats, such as American dates (M/D/Y) vs European dates (D/M/Y)
105
Cleansing structured data: **Duplication**
Multiple identical observations
106
Adjusting the range of a feature: **Normalization**
**Rescales to the range 0-1.** Sensitive to outliers. (Xi - Xmin) / Range = (Xi - Xmin) / (Xmax - Xmin)
107
Adjusting the range of a feature: **Standardization**
Centers and rescales. Requires a normal distribution. (Xi - u) / Standard deviation
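
A minimal sketch of both rescaling methods from the two cards above. The function names and data are made up, and the population standard deviation is assumed here; a sample standard deviation could be used instead.

```python
import statistics


def normalize(xs: list[float]) -> list[float]:
    """Rescale to the range 0-1: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs: list[float]) -> list[float]:
    """Center and rescale: (Xi - mean) / standard deviation."""
    mu = statistics.mean(xs)
    sd = statistics.pstdev(xs)  # assumption: population standard deviation
    return [(x - mu) / sd for x in xs]


# Hypothetical feature values.
print(normalize([2.0, 4.0, 6.0, 10.0]))
print([round(z, 4) for z in standardize([2.0, 4.0, 6.0, 8.0])])
```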
108
Performance evaluation graph: **Precision formula**
**P** = TP / (TP + FP) **Remember**: the denominator contains only predicted positives. Precision is the ratio of correctly predicted positive classes to all predicted positive classes. Precision is useful in situations where the cost of FP or Type I Error is high. For example, when an expensive product fails quality inspection (predicted class 1) and is scrapped, but it is actually perfectly good (actual class 0).
109
Performance evaluation graph: **Recall formula**
**R** = TP / (TP + FN) **Remember**: recall has the opposite (FN) in the denominator. Recall, also known as **sensitivity**, is the ratio of correctly predicted positive classes to all actual positive classes. Recall is useful in situations where the cost of FN or Type II Error is high. For example, when an expensive product passes quality inspection (predicted class 0) and is sent to the valued customer, but it is actually quite defective (actual class 1).
110
Performance evaluation graph: **Accuracy formula**
(TP + TN) / (TP + FN + TN + FP) Is the percentage of correctly predicted classes out of total predictions.
111
Receiver operating characteristics: **False Positive Rate Formula**
FP / (FP + TN) Statement / (Statement + Opposite)
112
Receiver operating characteristics: **True Positive Rate Formula**
TP / (TP + FN) Statement / (Statement + Opposite)
113
In big data projects, which measure is the most appropriate for regression method
RMSE (Root Mean Square Error)
114
What is "**trimming**" in big data projects?
Removing the bottom and top 1% of observations on a feature in a data set.
115
What is "**Winsorization**" in big data projects?
Replacing the extreme values in a data set with the same maximum or minimum value
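
A minimal sketch contrasting trimming and winsorization. It assumes a simple count-based cutoff (the given percent of observations at each end) rather than interpolated percentiles; the function names and data are made up.

```python
def trim(values: list[float], percent: int = 1) -> list[float]:
    """Drop the bottom and top `percent` of observations (returned sorted)."""
    ordered = sorted(values)
    cut = len(ordered) * percent // 100
    return ordered[cut:len(ordered) - cut]


def winsorize(values: list[float], percent: int = 1) -> list[float]:
    """Clamp the extreme `percent` at each end to the nearest kept value."""
    ordered = sorted(values)
    cut = len(ordered) * percent // 100
    lo, hi = ordered[cut], ordered[len(ordered) - cut - 1]
    return [min(max(v, lo), hi) for v in values]


# Hypothetical feature: the integers 1 to 100.
data = list(range(1, 101))
print(len(trim(data)), len(winsorize(data)))
```

Trimming shortens the data set, while winsorization keeps every observation and only changes the extreme values.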
116
Confusion Matrix: F1 Score Formula
(2 x P x R) / (P + R). The F1 score is the harmonic mean of precision and recall. It is more appropriate than accuracy when there is an unequal class distribution in the dataset and it is necessary to measure the balance between precision and recall. High scores on both of these metrics suggest good model performance.
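
A minimal sketch that computes the confusion-matrix metrics from the cards above in one place. The function name, dictionary keys, and counts are hypothetical.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Precision, recall (TPR), FPR, accuracy and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also the true positive rate
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "f1": (2 * precision * recall) / (precision + recall),
    }


# Hypothetical counts: TP = 40, FP = 10, FN = 20, TN = 30.
for name, value in classification_metrics(40, 10, 20, 30).items():
    print(name, round(value, 4))
```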
117
Confusion Matrix Display
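Row 1 (predicted positive): TP, FP. Row 2 (predicted negative): FN, TN. Columns are actual positive, then actual negative.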
118
What is **Mutual Information** in big data projects?
**How much info a token contributes to a class.** Mutual Information (MI) measures how much information is contributed by a token to a class of text. **MI = 0**: the token's distribution in all text classes is the same. **MI = 1**: the token tends to occur more often in only one particular class of text.
119
Feature Engineering
Final stage in Data Exploration. **Numbers**: Differentiate among types of numbers. **N-Grams**: Multi-word patterns kept intact. **Named entity recognition (NER)**: Class: Money, Time, Organization.
120
How to deal with **Class Imbalance?**
The majority class can be under-sampled and the minority class can be over-sampled.
121
Tokenization is the process of
Splitting a given text into separate words or characters. A token is equivalent to a word, and tokenization is the process of splitting the text into separate tokens.
122
The sequence of steps for **text preprocessing** is to produce
Tokens -> N-grams, from which to build a bag of words -> Input to a document term matrix.
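
A minimal sketch of the tokens -> n-grams -> bag of words -> document term matrix sequence. It assumes simple lowercase whitespace tokenization with no other cleansing; the function names and sample sentences are made up.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens on whitespace."""
    return text.lower().split()


def ngrams(tokens: list[str], n: int) -> list[str]:
    """Join each run of n consecutive tokens with an underscore."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs: list[str], n: int = 1):
    """Return the vocabulary and one row of counts per document."""
    bags = [Counter(ngrams(tokenize(doc), n)) for doc in docs]
    vocab = sorted(set().union(*bags))
    return vocab, [[bag[term] for term in vocab] for bag in bags]


# Hypothetical documents.
docs = ["the market is up", "the market is down today"]
vocab, rows = document_term_matrix(docs)
print(vocab)
print(rows)
```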
123
Big Data differs from traditional data sources based on the presence of a set of characteristics commonly referred to as the 4 V's. **What are the 4 V's?**
**Volume**: refers to the quantity of data. **Variety**: pertains to the array of available data sources. **Velocity:** is the speed at which data is created (data in motion is hard to analyze compared to data at rest). **Veracity**: related to the credibility and reliability of different data sources.
124
What is **Exploratory Data Analysis (EDA)**, and in which stage is it in?
**Stage 4 and first stage in Data Exploration** is the preliminary step in data exploration. Exploratory graphs, charts and other visualizations such as heat maps and word clouds are designed to summarize and observe data.
125
What is **Feature Selection**, and in which stage is it in?
**Stage 4 and Second stage in Data Exploration** is a process whereby only pertinent features from the dataset are selected for ML model training. Feature
126
What is **Feature Engineering**, and in which stage is it in?
**Stage 4 and Third and final stage in Data Exploration** is a process of creating new features by changing or transforming existing features. Feature Engineering techniques systematically alter, decompose or combine existing features to produce more meaningful features.
127
Formula For F-Test
F = MSR / MSE = [RSS / k] / [SSE / (n - (k+1))] = [Regression sum of squares / k] / [Residual sum of squares / (n - (k+1))]
128
Hypothesis test for F test
H0: b1 = b2 = ... = bk = 0. Ha: at least one bj =/ 0. F-stat > critical F: reject the null. F-stat < critical F: fail to reject the null.
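
A minimal sketch of the overall F-test, assuming RSS is the regression sum of squares, SSE is the residual sum of squares, and the critical value is looked up from an F table. The function names, sample inputs, and the critical value used are hypothetical.

```python
def f_statistic(rss: float, sse: float, n: int, k: int) -> float:
    """F = MSR / MSE = (RSS / k) / (SSE / (n - (k + 1)))."""
    msr = rss / k
    mse = sse / (n - (k + 1))
    return msr / mse


def reject_all_slopes_zero(f_stat: float, f_critical: float) -> bool:
    """Reject H0 (all slope coefficients equal zero) when F exceeds the critical value."""
    return f_stat > f_critical


# Hypothetical regression: RSS = 80, SSE = 20, n = 23, k = 2.
f = f_statistic(80.0, 20.0, n=23, k=2)
print(f, reject_all_slopes_zero(f, f_critical=3.49))
```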
129
What are the 3 types of error in ML?
Bias error Variance error Base error
130
What is variance error in ML?
Variance Error or how much the model's results change in response to new data from validation and test samples. Unstable models pick up noise and produce high variance causing overfitting and ↑ out of-sample error.
131
What is Bias error in ML?
Bias Error or the degree to which a model fits the training data. Algorithms with erroneous assumptions produce high bias with poor approximation, causing underfitting and ↑ in-sample error. (Adding more training samples will not improve the model)
132
What is Base error in ML?
Base Error due to randomness in the data. (Out-of-sample accuracy increases as the training sample size increases)
133
Name 2 ways to Preventing Overfitting in Supervised Machine Learning
1. Occam's Razor: The problem-solving principle that the simplest solution tends to be the correct one. In supervised ML, it means preventing the algorithm from getting too complex during selection and training by limiting the number of features and penalizing algorithms that are too complex or too flexible by constraining them to include only parameters that reduce out-of-sample error. 2. K-Fold Cross Validation: This strategy comes from the principle of avoiding sampling bias. The challenge is having a large enough data set to make both training and testing possible on representative samples.
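
A minimal sketch of how K-fold cross validation splits the data, assuming the observations are already shuffled and that each fold serves as the test set exactly once. The function name and sizes are made up.

```python
def k_fold_splits(n_obs: int, k: int) -> list[tuple[list[int], list[int]]]:
    """Return (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_obs))
    fold_size = n_obs // k
    splits = []
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs any leftover observations.
        end = start + fold_size if fold < k - 1 else n_obs
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        splits.append((train, test))
    return splits


# Hypothetical data set of 6 observations split into 3 folds.
for train, test in k_fold_splits(6, 3):
    print("train", train, "test", test)
```

The model would be fitted on each training set and scored on the matching test set, with the scores averaged across folds.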