Quantitative Analysis Flashcards
Token
Word
Tokenization
Splitting a sentence into words
Document term matrix
Convert unstructured data into structured data
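A minimal sketch of tokenization and a document term matrix, using only the standard library. The function names, the regular expression, and the two sample sentences are made up for illustration. Real pipelines usually also remove stop words and stem tokens.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def document_term_matrix(documents):
    """Return (vocabulary, matrix): one row per document, one column per token."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)  # a missing token counts as 0
        matrix.append([counts[tok] for tok in vocabulary])
    return vocabulary, matrix

# Hypothetical sample documents
docs = ["Rates rise as inflation rises", "Inflation falls"]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```

Each row is one document and each column one token, so the unstructured text becomes a structured table of counts.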
5 steps of data analysis
Conceptualization of modeling task
Data collection
Data preparation and wrangling
Data exploration
Model training
Errors reduced by data cleansing
Missing, invalid, non-uniform and inaccurate
Data Normalization and Standardization
Parsimonious model
Parsimonious models are simple models with great explanatory and predictive power. They explain data with a minimum number of parameters, or predictor variables.
Techniques of feature engineering
Numbers - four-digit numbers, usually associated with years, are assigned the token number4
N-grams - multiword patterns, e.g., expansionary_monetary_policy
Named entity recognition (NER) - e.g., Microsoft > ORG
Parts of speech (POS) - e.g., Microsoft > proper noun, 1969 > cardinal number
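A rough sketch of the first two techniques (number tokens and N-grams). The helper names are hypothetical, and tagging every numeric token by its length (number2, number4, and so on) is an assumption that generalizes the number4 example. NER and POS tagging need a trained NLP library and are left out.

```python
def replace_numbers(tokens):
    """Replace purely numeric tokens with a length-tagged placeholder (1969 -> number4)."""
    return [f"number{len(tok)}" if tok.isdigit() else tok for tok in tokens]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["expansionary", "monetary", "policy", "in", "1969"]
print(replace_numbers(tokens))
print(ngrams(tokens[:3], 3))
```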
Feature selection methods
Frequency - number of documents with that token divided by total number of documents (document frequency DF)
Chi-square - rank tokens by usefulness to a class
Mutual information (MI) - if a token appears in all classes, it is not considered a useful discriminant and its MI equals 0.
Tokens associated with only one or a few classes have an MI approaching 1.
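A minimal sketch of the document frequency (DF) measure described above. The function name and the sample documents are illustrative. Chi-square and MI also need class labels and are not shown.

```python
def document_frequency(tokenized_docs):
    """DF per token: documents containing the token / total number of documents."""
    n_docs = len(tokenized_docs)
    doc_sets = [set(doc) for doc in tokenized_docs]
    vocabulary = set().union(*doc_sets)
    return {tok: sum(tok in doc for doc in doc_sets) / n_docs for tok in vocabulary}

# Hypothetical tokenized documents
sample_docs = [["rates", "rise"], ["rates", "fall"], ["growth", "slows"]]
df = document_frequency(sample_docs)
print(df["rates"], df["growth"])
```

Tokens with very high or very low DF are typical candidates for removal, since they add noise or carry little information.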
Steps of data exploration
1 exploratory data analysis
2 feature selection
3 feature engineering
One-hot-encoding (OHE) - transform categorical feature into a binary variable for machine processing
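One-hot encoding can be sketched in a few lines. The function name and the sector labels are made up. Libraries such as pandas or scikit-learn provide production versions.

```python
def one_hot_encode(values):
    """Return (categories, rows): one binary column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if value == cat else 0 for cat in categories] for value in values]
    return categories, rows

# Hypothetical categorical feature
sectors = ["tech", "energy", "tech"]
categories, encoded = one_hot_encode(sectors)
print(categories)
print(encoded)
```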
What is overfitting?
Issue with supervised ML that arises when a large number of features (independent variables) are included in the data sample. It decreases the accuracy of model forecasts on out-of-sample data: the model does not generalize well to new data (low out-of-sample R2).
What are the 3 tasks of model training?
1 method selection
Supervised learning - support vector machine (SVM) and Neural Networks (NNs)
Unsupervised learning - clustering, dimension reduction, anomaly detection
type of data
Numerical data - classification and regression trees (CART)
Text data - generalized linear model (GLM) and SVMs
Image data - NNs and deep learning methods
Size of data - SVMs suit large data sets with many observations and many features; NNs work better with a large number of observations and few features
2 Performance evaluation
3 Tuning - implement changes to improve performance
How to divide data set for supervised learning in model training process?
60% for model training
20% model validation and tuning
20% test out of sample performance
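A minimal sketch of the 60/20/20 split. The function name, the random shuffle, and the fixed seed are assumptions, and the proportions are only the rule of thumb from the card.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    train_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # the remainder, about 20%
    return train_set, validation_set, test_set

train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```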
Model fitting errors can be caused by:
Size of training sample (small data sets)
Number of features (small > underfitting, large > overfitting)
Method selection depends on three factors:
1 Supervised (training data contains ground truth or known outcome) or unsupervised learning (no target available)
2 Type of data
Numerical data (CART methods)
Text data (GLMs)
Image (Neural Networks and deep learning)
3 Size of data
Large data sets with many observations and features (SVMs)
Large number of observations and few features (NNs)
What are type 1 and type 2 errors?
Type 1 errors are false positives
Type 2 errors are false negatives
Formulas for model accuracy and F1 score
Formulas for precision and recall
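These four metrics follow from the confusion matrix counts: true positives (TP), false positives (FP, type 1 errors), false negatives (FN, type 2 errors), and true negatives (TN). The sketch below uses the standard definitions. The function name and the counts are made up.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary classifier."""
    precision = tp / (tp + fp)                  # share of predicted positives that are correct
    recall = tp / (tp + fn)                     # share of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# Hypothetical counts
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
print(metrics)
```

With these counts, precision is 0.75, recall is 0.60, accuracy is 0.70, and F1 is about 0.667.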
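AUD/GBP 1.5060 - 1.5067
Convert 1 mm GBP and 1 mm AUD
Up the bid and multiply (converting GBP, the base currency)
Down the ask and divide (converting AUD, the price currency)
1 mm GBP × 1.5060 = 1,506,000 AUD
1 mm AUD ÷ 1.5067 ≈ 663,702 GBP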
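A small sketch of the bid/ask rule, assuming the AUD/GBP quote means AUD (price currency) per 1 GBP (base currency). The function names are hypothetical.

```python
def base_to_price(amount_base, bid):
    """Sell the base currency to the dealer: up the bid and multiply."""
    return amount_base * bid

def price_to_base(amount_price, ask):
    """Buy the base currency from the dealer: down the ask and divide."""
    return amount_price / ask

bid, ask = 1.5060, 1.5067  # AUD per GBP
aud_received = base_to_price(1_000_000, bid)  # about 1,506,000 AUD
gbp_received = price_to_base(1_000_000, ask)  # about 663,702 GBP
print(round(aud_received, 2), round(gbp_received, 2))
```

In both directions the client gets the worse side of the quote, which is how the dealer earns the spread.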
Z statistic for 68%, 90%, 95%, 99% confidence levels
Is the t-statistic at 90%, 95%, 99% more or less than the Z statistic?
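The two-tailed critical z-values can be reproduced with the standard library, as sketched below. The function name is an assumption. For any finite degrees of freedom, the t critical value at the same confidence level is larger than the z-value and converges to it as the sample grows.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical z-value for a confidence level given as a fraction."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

for level in (0.68, 0.90, 0.95, 0.99):
    print(f"{level:.0%}: z = {z_critical(level):.3f}")
```

This gives roughly 1.645 for 90%, 1.960 for 95%, and 2.576 for 99%; the 68% value is close to 1.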
Is R2 or adjusted R2 better? Why?
R2 never decreases when variables are added, which may cause overfitting.
Adjusted R2 is better because it penalizes additional variables.
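A minimal sketch of the standard adjusted R2 formula, where n is the number of observations and k the number of independent variables. The function name and the numbers are made up.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical example: a fourth variable lifts R2 slightly but lowers adjusted R2
three_vars = adjusted_r2(r2=0.600, n=50, k=3)
four_vars = adjusted_r2(r2=0.605, n=50, k=4)
print(round(three_vars, 4), round(four_vars, 4))
```

Here R2 rises from 0.600 to 0.605, yet adjusted R2 falls, which signals that the extra variable adds little explanatory power.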
Effect of model misspecification
Assumptions of multiple regression
What is heteroskedasticity type 1 and 2?
What is serial correlation? What are the implications?
How to detect serial correlation?
What are the implications of multicollinearity?
How to detect multicollinearity?
Test F or
What is? Effect? Detection? Correction?
Conditional heteroskedasticity, serial correlation and multicollinearity
What is an outlier and what is a high-leverage point?
What is the RMSE criterion?
How to calculate mean reverting level?
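For an AR(1) model x_t = b0 + b1 * x_(t-1) + e_t, the mean-reverting level is b0 / (1 - b1). The sketch below assumes that model form. The function name and the coefficients are made up.

```python
def mean_reverting_level(b0, b1):
    """Mean-reverting level of an AR(1) model: b0 / (1 - b1)."""
    if abs(b1) >= 1:
        # With |b1| >= 1 the series is not covariance stationary and has no finite level
        raise ValueError("no finite mean-reverting level when |b1| >= 1")
    return b0 / (1 - b1)

# Hypothetical coefficients
level = mean_reverting_level(b0=1.2, b1=0.4)
print(level)
```

When the current value is above this level the model forecasts a decline, and when it is below the model forecasts a rise.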
ARCH
What is ARCH, its effect and how to correct it.
Autoregressive conditional heteroskedasticity exists when the variance of the residuals in one period depends on the variance of the residuals in the previous period.
How to test for serial correlation in an AR model? And how to fix it?
Can't use the Durbin-Watson (DW) statistic
Use a t-test on the residual autocorrelations. Add a lag or a seasonal lag
ML - relation between complexity and bias / variance
ML - What is generalization
ML model capacity to make accurate out of sample predictions
What is bagging? Why is it important?
How to calculate the accruals ratio and aggregate accruals
Aggregate accruals = NI - (CFO + CFI)
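A sketch of the cash-flow-based calculation, assuming the accruals ratio scales aggregate accruals by average net operating assets (NOA). The function names and figures are made up.

```python
def aggregate_accruals(ni, cfo, cfi):
    """Aggregate accruals (cash flow approach) = NI - (CFO + CFI)."""
    return ni - (cfo + cfi)

def accruals_ratio(ni, cfo, cfi, noa_begin, noa_end):
    """Accruals ratio = aggregate accruals / average net operating assets."""
    average_noa = (noa_begin + noa_end) / 2
    return aggregate_accruals(ni, cfo, cfi) / average_noa

# Hypothetical figures: NI 120, CFO 150, CFI -80, NOA 900 then 1,100
accruals = aggregate_accruals(ni=120, cfo=150, cfi=-80)
ratio = accruals_ratio(ni=120, cfo=150, cfi=-80, noa_begin=900, noa_end=1100)
print(accruals, ratio)
```

With these figures, aggregate accruals are 50 and the accruals ratio is 5%.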