Quantitative Analysis Flashcards
Token
A word; the basic unit of text in text analysis.
Tokenization
The process of splitting a sentence (or document) into tokens (words).
Document term matrix (DTM)
A matrix with documents as rows and tokens as columns; it converts unstructured text data into structured data.
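The DTM conversion can be sketched in pure Python. The two-document mini-corpus and the whitespace tokenizer are illustrative assumptions, not part of the card:

```python
# Build a document-term matrix (DTM): one row per document,
# one column per token, cells hold token counts.
docs = [
    "rates rise as inflation rises",
    "inflation expectations fall",
]

# Tokenize each document by whitespace (a deliberately simple scheme).
tokenized = [d.split() for d in docs]

# Vocabulary = sorted set of all tokens (the DTM's columns).
vocab = sorted({tok for doc in tokenized for tok in doc})

# Count occurrences of each vocabulary token in each document.
dtm = [[doc.count(tok) for tok in vocab] for doc in tokenized]

print(vocab)  # column labels
print(dtm)    # structured (rows x columns) representation of the text
```

Each row of `dtm` is now a structured numeric record of an unstructured document.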
5 steps of data analysis
1 Conceptualization of the modeling task
2 Data collection
3 Data preparation and wrangling
4 Data exploration
5 Model training
Errors reduced by data cleansing
Missing, invalid, non-uniform, and inaccurate values
Data normalization and standardization
Normalization rescales a variable to the range [0, 1]: (x − min) / (max − min). Standardization centers and scales a variable using its mean and standard deviation: (x − μ) / σ.
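Both rescalings can be sketched in a few lines of pure Python. The sample values are hypothetical, and the population (rather than sample) standard deviation is an assumption:

```python
# Normalization maps values onto [0, 1] using the min and max.
def normalize(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Standardization centers on the mean with unit standard deviation.
def standardize(xs):
    mean = sum(xs) / len(xs)
    # Population standard deviation (an assumption; sample sd is also common).
    sd = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / sd for x in xs]

values = [2.0, 4.0, 6.0, 8.0]
print(normalize(values))    # first value -> 0.0, last value -> 1.0
print(standardize(values))  # mean 0, standard deviation 1
```

Normalization is sensitive to outliers (they set the min/max); standardization assumes the variable is roughly centrally distributed.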
Parsimonious model
Parsimonious models are simple models with great explanatory (predictive) power. They explain the data with a minimum number of parameters, or predictor variables.
Techniques of feature engineering
Numbers - numbers are replaced with length-based tokens; e.g., a four-digit number (usually a year) is assigned the token "number4"
N-grams - multiword patterns kept as a single token, e.g., expansionary_monetary_policy
Named entity recognition (NER) - tags tokens with entity types, e.g., Microsoft > ORG
Parts of speech (POS) - tags tokens with their parts of speech, e.g., Microsoft > proper noun, 1969 > cardinal number
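The n-gram technique above amounts to joining runs of consecutive tokens; a minimal sketch (the token list is a made-up example, and underscore joining follows the card's own notation):

```python
# Build n-grams by joining n consecutive tokens with underscores,
# e.g. the trigram "expansionary_monetary_policy".
def ngrams(tokens, n):
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["expansionary", "monetary", "policy", "persists"]
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```

Each n-gram then enters the document term matrix as a single feature, preserving multiword meaning that unigrams would lose.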
Feature selection methods
Frequency - number of documents containing the token divided by the total number of documents (document frequency, DF)
Chi-square - ranks tokens by their usefulness to a class
Mutual information (MI) - if a token appears in all classes, it is not a useful discriminant and its MI equals 0.
Tokens associated with one or a few classes have an MI approaching 1.
Steps of data exploration
1 exploratory data analysis
2 feature selection
3 feature engineering
One-hot encoding (OHE) - transforms a categorical feature into binary variables for machine processing
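One-hot encoding can be sketched in pure Python; the credit-rating values are a hypothetical categorical feature:

```python
# One-hot encode a categorical feature: each category becomes its own
# binary (0/1) column, with exactly one 1 per observation.
ratings = ["AAA", "BB", "AAA", "CCC"]
categories = sorted(set(ratings))  # column order: AAA, BB, CCC

encoded = [[1 if r == c else 0 for c in categories] for r in ratings]
print(categories)
print(encoded)
```

The resulting binary columns can be fed to models that require numeric inputs.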
What is overfitting?
An issue with supervised ML that results when too large a number of features (independent variables) is included in the data sample. An overfit model fits the training data too closely, so it does not generalize well to new data, decreasing the accuracy of forecasts on out-of-sample data (low out-of-sample R²).
What are the 3 tasks of model training?
1 Method selection
Supervised learning - support vector machines (SVMs) and neural networks (NNs)
Unsupervised learning - clustering, dimension reduction, anomaly detection
Type of data:
Numerical data - classification and regression trees (CART)
Text data - generalized linear models (GLMs) and SVMs
Image data - NNs and deep learning methods
Size of data - SVMs work better with large data sets (many observations and features); NNs with many observations and few features
2 Performance evaluation
3 Tuning - implement changes to improve performance
How is the data set divided for supervised learning in the model training process?
60% for model training
20% model validation and tuning
20% test out of sample performance
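The 60/20/20 split can be sketched in pure Python; the 100-observation data set and the random seed are illustrative assumptions:

```python
import random

# Split a data set 60/20/20 into training, validation, and test sets.
# Shuffling first avoids ordering bias (seeded here for reproducibility).
data = list(range(100))  # hypothetical 100 observations
random.Random(42).shuffle(data)

n = len(data)
train = data[: int(0.6 * n)]                     # 60% - model training
validation = data[int(0.6 * n): int(0.8 * n)]    # 20% - validation and tuning
test = data[int(0.8 * n):]                       # 20% - out-of-sample performance

print(len(train), len(validation), len(test))
```

The test set must stay untouched until the end; using it during tuning would leak out-of-sample information into the model.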
Model fitting errors can be caused by:
Size of the training sample (small data sets lead to underfitting)
Number of features (too few > underfitting, too many > overfitting)
Method selection in model training depends on:
1 Supervised vs. unsupervised learning - supervised when the training data contains a ground truth (known outcome); unsupervised when no target is available
2 Type of data
Numerical data (CART methods)
Text data (GLMs)
Image data (neural networks and deep learning)
3 Size of data
Large data sets with many observations and features (SVMs)
Large number of observations and few features (NNs)
What are Type I and Type II errors?
Type I errors are false positives.
Type II errors are false negatives.
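Both error types can be counted from predicted versus actual labels; a minimal sketch with made-up binary labels (1 = positive class):

```python
# Count Type I (false positive) and Type II (false negative) errors
# by comparing predicted labels with actual labels.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]

# Type I: model predicted positive, but the true label is negative.
false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
# Type II: model predicted negative, but the true label is positive.
false_negatives = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

print(false_positives, false_negatives)
```

These two counts are the off-diagonal cells of a confusion matrix and feed metrics such as precision and recall.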