Quantitative Analysis Flashcards
Token
A word; the basic unit of text in text analysis.
Tokenization
The process of splitting a sentence (or document) into tokens (words).
Document term matrix (DTM)
A matrix with documents as rows and tokens as columns; it converts unstructured text data into structured data.
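The DTM conversion can be sketched in pure Python. The two-document mini-corpus and the whitespace tokenizer are illustrative assumptions, not part of the card:

```python
# Build a document-term matrix (DTM): one row per document,
# one column per token, cells hold token counts.
docs = [
    "rates rise as inflation rises",
    "inflation expectations fall",
]

# Tokenize each document by whitespace (a deliberately simple scheme).
tokenized = [d.split() for d in docs]

# Vocabulary = sorted set of all tokens (the DTM's columns).
vocab = sorted({tok for doc in tokenized for tok in doc})

# Count occurrences of each vocabulary token in each document.
dtm = [[doc.count(tok) for tok in vocab] for doc in tokenized]

print(vocab)  # column labels
print(dtm)    # structured (rows x columns) representation of the text
```

Each row of `dtm` is now a structured numeric record of an unstructured document.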
5 steps of data analysis
1 Conceptualization of the modeling task
2 Data collection
3 Data preparation and wrangling
4 Data exploration
5 Model training
Errors reduced by data cleansing
Missing, invalid, non-uniform, and inaccurate values
Data normalization and standardization
Normalization rescales a variable to the range [0, 1]: (x − min) / (max − min). Standardization centers and scales a variable using its mean and standard deviation: (x − μ) / σ.
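Both rescalings can be sketched in a few lines of pure Python. The sample values are hypothetical, and the population (rather than sample) standard deviation is an assumption:

```python
# Normalization maps values onto [0, 1] using the min and max.
def normalize(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Standardization centers on the mean with unit standard deviation.
def standardize(xs):
    mean = sum(xs) / len(xs)
    # Population standard deviation (an assumption; sample sd is also common).
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

values = [2.0, 4.0, 6.0, 8.0]
print(normalize(values))    # first value -> 0.0, last value -> 1.0
print(standardize(values))  # mean 0, standard deviation 1
```

Normalization is sensitive to outliers (they set the min/max); standardization assumes the variable is roughly centrally distributed.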
Parsimonious model
Parsimonious models are simple models with great explanatory (predictive) power. They explain the data with a minimum number of parameters, or predictor variables.
Techniques of feature engineering
Numbers - numbers are replaced with length-based tokens; e.g., a four-digit number (usually a year) is assigned the token "number4"
N-grams - multiword patterns kept as a single token, e.g., expansionary_monetary_policy
Named entity recognition (NER) - tags tokens with entity types, e.g., Microsoft > ORG
Parts of speech (POS) - tags tokens with their parts of speech, e.g., Microsoft > proper noun, 1969 > cardinal number
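The n-gram technique above amounts to joining runs of consecutive tokens; a minimal sketch (the token list is a made-up example, and underscore joining follows the card's own notation):

```python
# Build n-grams by joining n consecutive tokens with underscores,
# e.g. the trigram "expansionary_monetary_policy".
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["expansionary", "monetary", "policy", "persists"]
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

Each n-gram then enters the document term matrix as a single feature, preserving multiword meaning that unigrams would lose.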
Feature selection methods
Frequency - number of documents containing the token divided by the total number of documents (document frequency, DF)
Chi-square - ranks tokens by their usefulness to a class
Mutual information (MI) - if a token appears in all classes, it is not a useful discriminant and its MI equals 0.
Tokens associated with one or a few classes have an MI approaching 1.
Steps of data exploration
1 exploratory data analysis
2 feature selection
3 feature engineering
One-hot encoding (OHE) - transforms a categorical feature into binary variables for machine processing
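One-hot encoding can be sketched in pure Python; the credit-rating values are a hypothetical categorical feature:

```python
# One-hot encode a categorical feature: each category becomes its own
# binary (0/1) column, with exactly one 1 per observation.
ratings = ["AAA", "BB", "AAA", "CCC"]
categories = sorted(set(ratings))  # column order: AAA, BB, CCC

encoded = [[1 if r == c else 0 for c in categories] for r in ratings]
print(categories)
print(encoded)
```

The resulting binary columns can be fed to models that require numeric inputs.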
What is overfitting?
An issue with supervised ML that results when too large a number of features (independent variables) is included in the data sample. An overfit model fits the training data too closely, so it does not generalize well to new data, decreasing the accuracy of forecasts on out-of-sample data (low out-of-sample R²).
What are the 3 tasks of model training?
1 Method selection
Supervised learning - support vector machines (SVMs) and neural networks (NNs)
Unsupervised learning - clustering, dimension reduction, anomaly detection
Type of data:
Numerical data - classification and regression trees (CART)
Text data - generalized linear models (GLMs) and SVMs
Image data - NNs and deep learning methods
Size of data - SVMs work better with large data sets (many observations and features); NNs with many observations and few features
2 Performance evaluation
3 Tuning - implement changes to improve performance
How is the data set divided for supervised learning in the model training process?
60% for model training
20% model validation and tuning
20% test out of sample performance
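The 60/20/20 split can be sketched in pure Python; the 100-observation data set and the random seed are illustrative assumptions:

```python
import random

# Split a data set 60/20/20 into training, validation, and test sets.
# Shuffling first avoids ordering bias (seeded here for reproducibility).
data = list(range(100))  # hypothetical 100 observations
random.Random(42).shuffle(data)

n = len(data)
train = data[: int(0.6 * n)]                     # 60% - model training
validation = data[int(0.6 * n): int(0.8 * n)]    # 20% - validation and tuning
test = data[int(0.8 * n):]                       # 20% - out-of-sample performance

print(len(train), len(validation), len(test))
```

The test set must stay untouched until the end; using it during tuning would leak out-of-sample information into the model.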
Model fitting errors can be caused by:
Size of the training sample (small data sets lead to underfitting)
Number of features (too few > underfitting, too many > overfitting)
Method selection in model training depends on:
1 Supervised vs. unsupervised learning - supervised when the training data contains a ground truth (known outcome); unsupervised when no target is available
2 Type of data
Numerical data (CART methods)
Text data (GLMs)
Image data (neural networks and deep learning)
3 Size of data
Large data sets with many observations and features (SVMs)
Large number of observations and few features (NNs)
What are Type I and Type II errors?
Type I errors are false positives.
Type II errors are false negatives.
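Both error types can be counted from predicted versus actual labels; a minimal sketch with made-up binary labels (1 = positive class):

```python
# Count Type I (false positive) and Type II (false negative) errors
# by comparing predicted labels with actual labels.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]

# Type I: model predicted positive, but the true label is negative.
false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
# Type II: model predicted negative, but the true label is positive.
false_negatives = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

print(false_positives, false_negatives)
```

These two counts are the off-diagonal cells of a confusion matrix and feed metrics such as precision and recall.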