Data Management Flashcards
2 types of sources:
- Proprietary data (owned by the company)
- Publicly available - no competitive advantage, good for a starting point
This is why most companies invest in labeling their own datasets.
Also why the flywheel model is super useful
What is Semi-supervised?
It means that without having labels, you train the model by changing the task to something the data itself provides (predicting one part of a sentence from the other, matching different crops of the same cat picture, etc.), as in the sketch below.
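A minimal sketch of the text case, assuming we simply mask one word per sentence and treat the hidden word as the target (the sentences and the masking scheme are made up for illustration):

```python
import random

# Illustrative unlabeled "corpus".
sentences = [
    "the cat sat on the mat",
    "data is the new oil",
]

def make_pair(sentence):
    # Hide one word and use it as the target: the label comes from the data itself.
    words = sentence.split()
    i = random.randrange(len(words))
    target = words[i]
    words[i] = "[MASK]"
    return " ".join(words), target

pairs = [make_pair(s) for s in sentences]
print(pairs)  # e.g. [('the cat sat on [MASK] mat', 'the'), ...]
```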
Image data augmentation
A must for training vision models.
Frameworks provide this.
Done on the CPU in parallel to the training on the GPU (see the sketch after this card).
For tabular - delete some cells to simulate missing data.
For text there are no well-established techniques; replace words with synonyms, change word order, etc.
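A minimal sketch of the image case using torchvision; the specific transforms, sizes, and parameters are illustrative, not a recommendation:

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative augmentation pipeline.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

img = Image.new("RGB", (256, 256))  # stand-in for a real photo
batch = torch.stack([augment(img) for _ in range(4)])  # each copy comes out different
print(batch.shape)  # torch.Size([4, 3, 224, 224])
```

In practice the pipeline is attached to a Dataset, and the DataLoader's worker processes (num_workers > 0) run it on the CPU while the GPU trains on the previous batch.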
Storage option - filesystem
Foundational layer of storage.
Data can be changed in place, and there is no built-in organization beyond directories.
Fastest option
Can be local to the machine, networked, or distributed.
Local data format
Binary data: just store as files (TensorFlow has something called TFRecords for batching files together; less important with NVMe drives).
Large tabular/text:
Parquet is widespread and recommended (see the sketch after this card).
Feather, from Apache Arrow, is up and coming.
HDF5 is old and not really relevant anymore.
Try to use the native TensorFlow / PyTorch dataset classes.
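A minimal sketch of writing and reading Parquet with pandas (the column names are made up; pandas needs pyarrow or fastparquet installed for this):

```python
import pandas as pd

# Made-up tabular data.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "label": ["cat", "dog", "cat"],
})

# Parquet is a compressed, columnar format.
df.to_parquet("labels.parquet")

# Reading back only the columns you need is cheap with a columnar format.
labels = pd.read_parquet("labels.parquet", columns=["label"])
print(labels)
```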
Object storage
An API over a filesystem (like Amazon S3).
Then you can store objects there without worrying about where they physically live.
Good for versioning and redundancy
Fast enough within the cloud.
Sometimes cheaper.
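A minimal sketch using boto3 against S3, assuming credentials are already configured; the bucket and key names are made up:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file as an object; the key is just a name, not a real filesystem path.
s3.upload_file("cat_001.jpg", "my-ml-datasets", "raw/images/cat_001.jpg")

# Download it again when needed (e.g., onto a fast local drive before training).
s3.download_file("my-ml-datasets", "raw/images/cat_001.jpg", "/tmp/cat_001.jpg")
```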
Database - what is it?
Fast, persistent, scalable storage and retrieval of structured data that will be accessed repeatedly.
Database - a good mental model for it and what should be stored
A mental model: everything is actually in RAM, and the software ensures it is logged to disk.
Not for binary data itself, but for references to it (see the sketch after this card).
Use Postgres
Don't use NoSQL
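A minimal sketch of the "references, not blobs" pattern with psycopg2 against a local Postgres; the DSN, table, and column names are made up:

```python
import psycopg2

conn = psycopg2.connect("dbname=mlmeta user=postgres")  # made-up DSN
cur = conn.cursor()

# Store metadata plus a *reference* to the binary blob, never the blob itself.
cur.execute("""
    CREATE TABLE IF NOT EXISTS images (
        id         SERIAL PRIMARY KEY,
        s3_key     TEXT NOT NULL,           -- where the actual file lives in object storage
        label      TEXT,                    -- annotation from the labeling process
        created_at TIMESTAMPTZ DEFAULT now()
    )
""")
cur.execute(
    "INSERT INTO images (s3_key, label) VALUES (%s, %s)",
    ("raw/images/cat_001.jpg", "cat"),
)
conn.commit()
cur.close()
conn.close()
```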
OLTP
Online transaction processing - databases
OLAP
Online analytical processing = warehouse
ETL
Extract, transform, load -
Also associated with warehouses.
The idea is to extract data from different sources, transform it to a common schema, and then load it into a warehouse (see the sketch after this card).
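A minimal ETL sketch in pandas, with a Parquet file standing in for the warehouse; the file names and schemas are made up:

```python
import pandas as pd

# Extract: pull data from two different (made-up) sources.
crm = pd.read_csv("crm_export.csv")          # columns: CustomerId, FullName
billing = pd.read_json("billing_dump.json")  # columns: cust_id, name

# Transform: map both sources onto one common schema.
crm = crm.rename(columns={"CustomerId": "customer_id", "FullName": "name"})
billing = billing.rename(columns={"cust_id": "customer_id"})
unified = pd.concat([crm[["customer_id", "name"]], billing[["customer_id", "name"]]])

# Load: write into the warehouse (a Parquet file stands in for it here).
unified.to_parquet("warehouse/customers.parquet")
```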
Data lake
ELT - extract, load, transform
Unlike a warehouse, here you first load the raw data, then transform it and move it to the place that needs it.
SQL and dataframe
Pandas dataframes are basically SQL in Python; each has its own strengths, so I should know both (see the sketch after this card).
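A minimal sketch of the same aggregation expressed in SQL and in pandas (the table and columns are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 5.0, 7.5],
})

# SQL:  SELECT customer_id, SUM(amount) AS total
#       FROM orders GROUP BY customer_id;
totals = (
    orders.groupby("customer_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)
print(totals)
```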
Lake house
This is the trend now - it’s both a data lake and a warehouse.
There is an open-source implementation called "Delta Lake" where you can store all kinds of data - structured, semi-structured, and unstructured. Everything.
Then it connects to the analytics tools, the ML tools, etc.
Data management summary:
Binary data (images, sound files, compressed text, etc.) is stored as objects.
Metadata (labels, user activity) is stored in a database.
If there are features that can't be obtained from the database (like logs), set up a data lake and a process to aggregate the needed data.
At training time, copy the data that is needed to a filesystem on a fast drive.