Data Management Flashcards
2 types of sources:
- Proprietary data (owned by the company)
- Publicly available - no competitive advantage, good for a starting point
This is why most companies invest in labeling their own datasets.
Also why the flywheel model is super useful
What is Semi-supervised?
It means that without having labels, you train the model by changing the task to something the data itself provides (predicting one part of a sentence from the other, matching different crops of the same cat picture, etc.), as in the sketch below.
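A minimal sketch of the text case, assuming we simply mask one word per sentence and treat the hidden word as the target (the sentences and the masking scheme are made up for illustration):

```python
import random

# Illustrative unlabeled "corpus".
sentences = [
    "the cat sat on the mat",
    "data is the new oil",
]

def make_pair(sentence):
    # Hide one word and use it as the target: the label comes from the data itself.
    words = sentence.split()
    i = random.randrange(len(words))
    target = words[i]
    words[i] = "[MASK]"
    return " ".join(words), target

pairs = [make_pair(s) for s in sentences]
print(pairs)  # e.g. [('the cat sat on [MASK] mat', 'the'), ...]
```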
Image data augmentation
A must for training vision models.
Frameworks provide this.
Done on the CPU in parallel to the training on the GPU (see the sketch after this card).
For tabular - delete some cells to simulate missing data.
For text there are no well-established techniques; replace words with synonyms, change word order, etc.
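A minimal sketch of the image case using torchvision; the specific transforms, sizes, and parameters are illustrative, not a recommendation:

```python
import torch
from torchvision import transforms
from PIL import Image

# Illustrative augmentation pipeline.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

img = Image.new("RGB", (256, 256))  # stand-in for a real photo
batch = torch.stack([augment(img) for _ in range(4)])  # each copy comes out different
print(batch.shape)  # torch.Size([4, 3, 224, 224])
```

In practice the pipeline is attached to a Dataset, and the DataLoader's worker processes (num_workers > 0) run it on the CPU while the GPU trains on the previous batch.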
Storage option - filesystem
Foundational layer of storage.
Data can be changed in place, and there is no built-in organization beyond directories.
Fastest option
Can be local to the machine, networked, or distributed.
Local data format
Binary data: just store as files (TensorFlow has something called TFRecords for batching files together; less important with NVMe drives).
Large tabular/text:
Parquet is widespread and recommended (see the sketch after this card).
Feather, from Apache Arrow, is up and coming.
HDF5 is old and not really relevant anymore.
Try to use the native TensorFlow / PyTorch dataset classes.
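A minimal sketch of writing and reading Parquet with pandas (the column names are made up; pandas needs pyarrow or fastparquet installed for this):

```python
import pandas as pd

# Made-up tabular data.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "label": ["cat", "dog", "cat"],
})

# Parquet is a compressed, columnar format.
df.to_parquet("labels.parquet")

# Reading back only the columns you need is cheap with a columnar format.
labels = pd.read_parquet("labels.parquet", columns=["label"])
print(labels)
```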
Object storage
An API over a filesystem (like Amazon S3).
Then you can store objects there without worrying about where they physically live.
Good for versioning and redundancy
Fast enough within the cloud.
Sometimes cheaper.
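A minimal sketch using boto3 against S3, assuming credentials are already configured; the bucket and key names are made up:

```python
import boto3

s3 = boto3.client("s3")

# Upload a local file as an object; the key is just a name, not a real filesystem path.
s3.upload_file("cat_001.jpg", "my-ml-datasets", "raw/images/cat_001.jpg")

# Download it again when needed (e.g., onto a fast local drive before training).
s3.download_file("my-ml-datasets", "raw/images/cat_001.jpg", "/tmp/cat_001.jpg")
```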
Database - what is it?
Fast, persistent, scalable storage and retrieval of structured data that will be accessed repeatedly.
Database - a good mental model for it and what should be stored
A mental model: everything is actually in RAM, and the software ensures it is logged to disk.
Not for binary data itself, but for references to it (see the sketch after this card).
Use Postgres
Don't use NoSQL
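A minimal sketch of the "references, not blobs" pattern with psycopg2 against a local Postgres; the DSN, table, and column names are made up:

```python
import psycopg2

conn = psycopg2.connect("dbname=mlmeta user=postgres")  # made-up DSN
cur = conn.cursor()

# Store metadata plus a *reference* to the binary blob, never the blob itself.
cur.execute("""
    CREATE TABLE IF NOT EXISTS images (
        id         SERIAL PRIMARY KEY,
        s3_key     TEXT NOT NULL,           -- where the actual file lives in object storage
        label      TEXT,                    -- annotation from the labeling process
        created_at TIMESTAMPTZ DEFAULT now()
    )
""")
cur.execute(
    "INSERT INTO images (s3_key, label) VALUES (%s, %s)",
    ("raw/images/cat_001.jpg", "cat"),
)
conn.commit()
cur.close()
conn.close()
```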
OLTP
Online transaction processing - databases
OLAP
Online analytical processing = warehouse
ETL
Extract, transform, load -
Also associated with warehouses.
The idea is to extract data from different sources, transform it to a common schema, and then load it into a warehouse (see the sketch after this card).
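A minimal ETL sketch in pandas, with a Parquet file standing in for the warehouse; the file names and schemas are made up:

```python
import pandas as pd

# Extract: pull data from two different (made-up) sources.
crm = pd.read_csv("crm_export.csv")          # columns: CustomerId, FullName
billing = pd.read_json("billing_dump.json")  # columns: cust_id, name

# Transform: map both sources onto one common schema.
crm = crm.rename(columns={"CustomerId": "customer_id", "FullName": "name"})
billing = billing.rename(columns={"cust_id": "customer_id"})
unified = pd.concat([crm[["customer_id", "name"]], billing[["customer_id", "name"]]])

# Load: write into the warehouse (a Parquet file stands in for it here).
unified.to_parquet("warehouse/customers.parquet")
```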
Data lake
ELT - extract, load, transform
Unlike a warehouse, here you first load the raw data, then transform it and move it to the place that needs it.
SQL and dataframe
Pandas dataframes are basically SQL in Python; each has its own strengths, so I should know both (see the sketch after this card).
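A minimal sketch of the same aggregation expressed in SQL and in pandas (the table and columns are made up):

```python
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 5.0, 7.5],
})

# SQL:  SELECT customer_id, SUM(amount) AS total
#       FROM orders GROUP BY customer_id;
totals = (
    orders.groupby("customer_id", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total"})
)
print(totals)
```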
Lake house
This is the trend now - it’s both a data lake and a warehouse.
There is an open-source implementation called "Delta Lake" where you can store all kinds of data - structured, semi-structured, and unstructured. Everything.
Then it connects to the analytics tools, the ML tools, etc.
Data management summary:
Binary data (images, sound files, compressed text, etc.) is stored as objects.
Metadata (labels, user activity) is stored in a database.
If there are features that can't be obtained from the database (like logs), set up a data lake and a process to aggregate the needed data.
At training time, copy the data that is needed to a filesystem on a fast drive.