Quant 2.7 Flashcards
What is Big Data?
Big data includes all the data generated by financial markets, businesses, governments, individuals, sensors, the Internet of Things, and more.
Big data has grown exponentially over the past decade.
Investment managers are increasingly using big data (for example, using financial text data to forecast stock price movements).
Why is big data different from traditional data?
Because of three characteristics (the 3 V’s):
Volume - quantity of data
Variety - array of available data sources
Velocity - speed at which data are created
When using big data for prediction or inference, we have a 4th V
Veracity - Credibility and reliability of different data sources
What are the steps in executing a data analysis project with structured data?
There are two sources of data - internal & external
Then the steps to build a model include:
1. Conceptualization - What will the model do and who will use the model?
2. Data collection
3. Data prep & wrangling - Cleaning the data, filling in the missing elements, etc.
4. Data exploration
5. Model training
What are the steps in executing a data analysis project with unstructured data?
Unstructured data can come from many sources: news articles, social media, other documents, open data, etc.
Steps to build a model are similar to that of structured data:
1. Text problem formulation - Define the text classification problem (for example, coming up with a sentiment score)
2. Data curation
3. Text prep & wrangling - Cleaning and pre-processing tasks.
4. Text exploration
5. Model Training
What is Data Collection?
Collecting data from internal or external sources, based on the trade-offs between time, financial cost, and accuracy.
What is data preparation (cleansing)?
Examine, identify and mitigate errors in raw data
What is Data Wrangling (pre-processing)?
Performing transformations and critical processing steps on the cleansed data to make it ready for the ML model.
What types of errors can you expect during data preparation with structured data?
The types are: incompleteness (blanks), invalidity (a DOB with year 1900), inaccuracy (an answer other than Y or N in a yes/no field), inconsistency (a gender that conflicts with the name or title), non-uniformity (different formats for the same data type), and duplication.
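A minimal sketch of how four of these checks (incompleteness, invalidity, non-uniformity, duplication) might look in code. The records, field names, the 1920 cut-off year, and the helper `find_errors` are all hypothetical, chosen only for illustration.

```python
import re

# Assumed uniform date format for this sketch: YYYY-MM-DD.
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

# Hypothetical raw records (made-up values).
records = [
    {"id": 1, "name": "Mr. ABC", "dob": "1970-12-05", "state": "VA"},
    {"id": 2, "name": "Ms. XYZ", "dob": "15 Jan 1975", "state": "NY"},
    {"id": 3, "name": "", "dob": "1980-01-05", "state": "CA"},
    {"id": 4, "name": "Ms. MNO", "dob": "1900-01-05", "state": "NY"},
    {"id": 1, "name": "Mr. ABC", "dob": "1970-12-05", "state": "VA"},
]

def find_errors(rows, min_birth_year=1920):
    """Return (row_index, error_type) pairs for the checks in this sketch."""
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            # Duplication: an identical row has already been seen.
            issues.append((i, "duplication"))
            continue
        seen.add(key)
        if any(value == "" for value in row.values()):
            # Incompleteness: a field is blank.
            issues.append((i, "incompleteness"))
        if not ISO_DATE.match(row["dob"]):
            # Non-uniformity: the date is not in the assumed format.
            issues.append((i, "non-uniformity"))
        elif int(row["dob"][:4]) < min_birth_year:
            # Invalidity: the birth year is outside a meaningful range.
            issues.append((i, "invalidity"))
    return issues

print(find_errors(records))
```

Flagging is only the first half of the step. How each flagged row is mitigated (fix, fill, or drop) depends on the project.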
What are the types of transformations you can come across in the data wrangling step with structured data?
Some common terms you need to know are extraction (deriving a new variable from an existing one), aggregation (salary + other income = total income), filtration (keeping only the rows that meet a condition), selection (keeping only the columns you need), and conversion (all total income should be in USD).
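A small sketch of aggregation, filtration, selection, and conversion on a few rows. The client records, the helper `wrangle`, and the EUR to USD rate are assumptions made up for the example, not real data.

```python
# Assumed conversion rate for illustration only, not a real quote.
EUR_TO_USD = 1.25

# Hypothetical cleansed records.
clients = [
    {"name": "A", "state": "NY", "salary": 50000.0, "other_income": 5000.0, "currency": "USD"},
    {"name": "B", "state": "VA", "salary": 40000.0, "other_income": 0.0, "currency": "USD"},
    {"name": "C", "state": "NY", "salary": 70000.0, "other_income": 2000.0, "currency": "EUR"},
]

def wrangle(rows, state):
    """Apply filtration, aggregation, conversion, and selection to each row."""
    out = []
    for row in rows:
        # Filtration: keep only the rows for the requested state.
        if row["state"] != state:
            continue
        # Aggregation: salary + other income = total income.
        total = row["salary"] + row["other_income"]
        # Conversion: express every total in USD.
        if row["currency"] == "EUR":
            total *= EUR_TO_USD
        # Selection: keep only the columns the model needs.
        out.append({"name": row["name"], "total_income_usd": total})
    return out

print(wrangle(clients, "NY"))
# [{'name': 'A', 'total_income_usd': 55000.0}, {'name': 'C', 'total_income_usd': 90000.0}]
```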
How are outliers handled in data wrangling?
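By trimming (removing the extreme values) and winsorization (replacing the extreme values with the highest or lowest value that is not an outlier).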
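A rough sketch of both methods. It assumes the outliers are simply the k smallest and k largest observations; in practice the cut-off is often set by percentiles. The helper names `trim` and `winsorize` and the sample values are made up.

```python
def trim(values, k=1):
    """Drop the k smallest and k largest observations (returns sorted values)."""
    ordered = sorted(values)
    return ordered[k:len(ordered) - k]

def winsorize(values, k=1):
    """Replace the k smallest and k largest observations with the nearest kept value."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    # Clamp every value into [low, high]; the original order is preserved.
    return [min(max(v, low), high) for v in values]

data = [2, 95, 98, 100, 101, 103, 400]
print(trim(data))       # [95, 98, 100, 101, 103]
print(winsorize(data))  # [95, 95, 98, 100, 101, 103, 103]
```

Trimming shrinks the sample, while winsorization keeps the same number of observations.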
What do you mean by scaling?
Normalization and standardization are two methods we use for scaling in data wrangling.
Normalization = (Xi - Xmin) / (Xmax - Xmin), where the denominator is the range. The result lies between 0 and 1.
Standardization (less sensitive to outliers than normalization) = (Xi - μ) / σ, where μ is the mean and σ is the standard deviation.
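The two formulas can be sketched with the standard library. The sample values are made up, and using the population standard deviation (`pstdev`) rather than the sample version is an assumption of this sketch.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale to [0, 1]: (Xi - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def standardize(values):
    """Center and scale: (Xi - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]

values = [10, 20, 30, 40, 50]
print(normalize(values))    # [0.0, 0.25, 0.5, 0.75, 1.0]
print(standardize(values))  # mean 0, standard deviation 1
```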
What does unstructured data preparation include?
It involves removing unimportant text from the raw input: HTML tags, punctuation, most numbers, and white space.
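A minimal sketch of these cleansing steps using regular expressions. As a simplification it strips all numbers rather than "most", and the helper `clean_text` and the sample sentence are hypothetical.

```python
import re

def clean_text(raw):
    """Remove HTML tags, numbers, punctuation, and extra white space."""
    text = re.sub(r"<[^>]+>", " ", raw)            # HTML tags
    text = re.sub(r"\d+(?:[.,]\d+)*", " ", text)   # numbers such as 12.5 or 2023
    text = re.sub(r"[^\w\s]", " ", text)           # punctuation
    text = re.sub(r"\s+", " ", text).strip()       # collapse white space
    return text

raw = "<p>Profits rose 12.5% in 2023,   beating estimates!</p>"
print(clean_text(raw))  # Profits rose in beating estimates
```

Numbers are removed before punctuation so that a value like 12.5 is dropped as one unit.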
What does unstructured data wrangling include?
We start with tokenization, which splits a given text into separate tokens. A token is equivalent to a word.
Then we normalize the tokens. Normalization includes lowercasing, then removing stop words, then stemming ("increased" and "increasing" are both stemmed down to "increas", which raises the frequency of otherwise infrequent words by merging their variants), and finally lemmatization (which is seldom used).
After normalization, we’re left with a bag of words (BOW), which can be thought of as a distinct set of tokens or words where the sequence doesn’t come into play.
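The steps from tokenization to a bag of words can be sketched as below. The stop-word list is a tiny made-up sample, and the suffix-stripping `stem` function is a toy rule set for illustration, not a real stemming algorithm such as Porter's.

```python
# Tiny illustrative stop-word list (assumption, not a standard list).
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "in", "to", "are"}
# Toy stemming rules: strip the first matching suffix.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_bag_of_words(text):
    tokens = text.split()                                 # tokenization on white space
    tokens = [t.lower() for t in tokens]                  # lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    tokens = [stem(t) for t in tokens]                    # stemming
    return sorted(set(tokens))                            # distinct tokens, sequence ignored

print(to_bag_of_words("The market is increasing and profits increased"))
# ['increas', 'market', 'profit']
```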
What is a DTM?
A document term matrix (DTM) is a table where each row is a document (a source text from which the tokens were collected), each column is a token/word, and each cell holds the number of times that token occurs in that document.
This table is the end result of the text prep and wrangling process: a structured table that we can use in our model.
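A short sketch of building such a table from already-normalized tokens. The three token lists and the helper `document_term_matrix` are hypothetical.

```python
from collections import Counter

def document_term_matrix(documents):
    """documents: list of token lists. Returns (vocabulary, matrix of counts)."""
    # Columns: every distinct token across all documents, in sorted order.
    vocabulary = sorted({token for doc in documents for token in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        # One row per document; a missing token counts as 0.
        matrix.append([counts[token] for token in vocabulary])
    return vocabulary, matrix

docs = [
    ["profit", "increas", "profit"],
    ["market", "fall"],
    ["profit", "fall"],
]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)  # ['fall', 'increas', 'market', 'profit']
print(matrix)      # [[0, 1, 0, 2], [1, 0, 1, 0], [1, 0, 0, 1]]
```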
What are N-grams?
Tokens can also take the form of a sequence of words. It can be a two-word sequence (bigram) or a three-word sequence (trigram), and the bag of words thus becomes a bag of N-grams.
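A quick sketch of turning a token list into N-grams. Joining the words with an underscore is just one possible convention, and the helper `ngrams` and the sample tokens are made up.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["stock", "market", "fell", "sharply"]
print(ngrams(tokens, 2))  # ['stock_market', 'market_fell', 'fell_sharply']
print(ngrams(tokens, 3))  # ['stock_market_fell', 'market_fell_sharply']
```

Unlike single tokens, N-grams keep some of the word order, so "stock_market" stays one unit.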