Feature Stores Flashcards
What is a feature store?
A centralized repository for organizing, storing, and serving ML features.
What are the most common challenges with features?
- Hard to share and reuse
- Hard to reliably serve with low latency in production
- Training-serving skew
What types of data ingestion does Feature Store support?
Both batch and streaming (online) ingestion.
What is the purpose of the timestamp column in the feature store and is it always present?
It tracks how feature values change over time. It is not always present: if all feature values are ingested at the same time, you do not need a timestamp column and can instead set a single timestamp explicitly for the ingestion.
Which column represents the entity in BigQuery (BQ), Avro, and CSV files?
BQ - column name
Avro - the name defined by the schema associated with the binary data
CSV - first column
Are arrays supported as a feature type?
Yes, but not in CSV files. An array value cannot be null; use an empty array instead.
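The empty-array rule above can be handled in a pre-processing step. The following is a minimal sketch in plain Python, not tied to any feature store SDK; the function name `normalize_array_features` and the sample column `recent_genres` are hypothetical.

```python
def normalize_array_features(rows, array_columns):
    """Return copies of the rows with None replaced by [] in the given array columns."""
    cleaned = []
    for row in rows:
        fixed = dict(row)  # shallow copy so the input rows are left unchanged
        for col in array_columns:
            if fixed.get(col) is None:
                fixed[col] = []  # null arrays are not allowed, so use an empty array
        cleaned.append(fixed)
    return cleaned


# Hypothetical sample data: the second user has no array value.
rows = [
    {"user_id": "u1", "recent_genres": ["drama", "comedy"]},
    {"user_id": "u2", "recent_genres": None},
]
cleaned_rows = normalize_array_features(rows, ["recent_genres"])
```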
Can you ingest data directly from a data source to a feature store?
Yes, but you must pre-process the data first.
Briefly explain the process of creating a feature store
- Pre-process the data
- Make sure the column that represents the entity is present
- (Optional) Add a timestamp column
- Create a feature store
- Define the entity type
- Add features and their types
- Create an ingestion job to ingest features from the data source
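The steps above can be sketched with a toy in-memory model. This is only an illustration of the order of operations, not a real feature store API; the classes `FeatureStore` and `EntityType`, their methods, and the sample columns are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EntityType:
    name: str
    feature_types: dict = field(default_factory=dict)  # feature name -> Python type
    values: dict = field(default_factory=dict)  # entity id -> list of (timestamp, features)

    def add_feature(self, feature_name, value_type):
        self.feature_types[feature_name] = value_type

    def ingest(self, rows, entity_id_column, timestamp_column=None, default_timestamp=None):
        """Ingest rows; use the timestamp column if given, else one explicit timestamp."""
        for row in rows:
            if entity_id_column not in row:
                raise ValueError(f"missing entity column {entity_id_column!r}")
            ts = row[timestamp_column] if timestamp_column else default_timestamp
            if ts is None:
                raise ValueError("provide a timestamp column or a default timestamp")
            record = {}
            for feature_name, value_type in self.feature_types.items():
                value = row[feature_name]
                if not isinstance(value, value_type):
                    raise TypeError(f"{feature_name!r} must be {value_type.__name__}")
                record[feature_name] = value
            self.values.setdefault(row[entity_id_column], []).append((ts, record))


@dataclass
class FeatureStore:
    name: str
    entity_types: dict = field(default_factory=dict)

    def create_entity_type(self, name):
        self.entity_types[name] = EntityType(name)
        return self.entity_types[name]


# Pre-processed source rows with an entity column and no timestamp column.
rows = [
    {"user_id": "u1", "age": 34, "country": "DE"},
    {"user_id": "u2", "age": 27, "country": "FR"},
]

store = FeatureStore("demo_store")        # create a feature store
users = store.create_entity_type("user")  # define the entity type
users.add_feature("age", int)             # add features and their types
users.add_feature("country", str)
users.ingest(                             # ingest from the data source
    rows,
    entity_id_column="user_id",
    default_timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
```

Because the rows carry no timestamp column, a single timestamp is set explicitly for the whole ingestion, matching the timestamp card above.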
What is feature serving?
The process of exporting stored feature values for training or inference. Both online and batch serving are supported.
What are the main feature serving APIs and associated stores?
Online serving API - used for low-latency online predictions; it reads features from the online store
Batch serving API - used for model training; it reads features from the offline store
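The difference between the two serving paths can be sketched in plain Python. This is an illustration under assumptions, not any real serving API: the online read is modeled as "latest value per entity", and the batch read as a lookup of the value that was valid at a requested time, which is one common way to build training data. The names `history`, `online_read`, and `batch_read` are hypothetical.

```python
from bisect import bisect_right

# Hypothetical feature history: entity id -> list of (timestamp, value), sorted by timestamp.
history = {
    "u1": [(1, 10.0), (5, 12.0), (9, 15.0)],
    "u2": [(2, 7.0)],
}


def online_read(history, entity_id):
    """Online serving: return only the most recent value for one entity."""
    return history[entity_id][-1][1]


def batch_read(history, requests):
    """Batch serving: for each (entity_id, as_of) pair, return the latest value
    whose timestamp is <= as_of, or None if no value existed yet."""
    results = []
    for entity_id, as_of in requests:
        series = history.get(entity_id, [])
        timestamps = [ts for ts, _ in series]
        idx = bisect_right(timestamps, as_of)
        results.append(series[idx - 1][1] if idx > 0 else None)
    return results


latest = online_read(history, "u1")
training_values = batch_read(history, [("u1", 6), ("u2", 1), ("u1", 9)])
```

Reading values as of a past timestamp keeps training data consistent with what would have been available at prediction time, which helps limit training-serving skew.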
Is it possible to have multiple batch ingestion jobs running at the same time?
Yes, but only one per entity type.
What is data dredging?
Overloading the training dataset with many features without knowing whether they are actually related to the problem you are trying to solve.
What options do you have when converting categorical features into numerical ones?
One-hot (or multi-hot) encoding, or using Word2Vec to create an embedding vector.
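A minimal sketch of one-hot and multi-hot encoding in plain Python follows. The helper names and the sample `colors` data are hypothetical; the vocabulary is built from the sorted distinct values so that each category gets a stable index.

```python
def build_vocabulary(values):
    """Map each distinct category to a stable index (sorted order)."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Encode a single category as a vector with exactly one 1."""
    vector = [0] * len(vocab)
    vector[vocab[value]] = 1
    return vector


def multi_hot(values, vocab):
    """Encode a set of categories as a vector with a 1 for each category present."""
    vector = [0] * len(vocab)
    for value in values:
        vector[vocab[value]] = 1
    return vector


colors = ["red", "green", "blue", "green"]
vocab = build_vocabulary(colors)  # {"blue": 0, "green": 1, "red": 2}

green_vector = one_hot("green", vocab)            # [0, 1, 0]
red_blue_vector = multi_hot(["red", "blue"], vocab)  # [1, 0, 1]
```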
What is the rule of thumb for the minimum number of examples of a feature value?
Each feature value should appear in at least 5 examples.
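One way to check this rule of thumb is to count how often each value occurs and flag the rare ones. The sketch below uses only the standard library; the function name `rare_values` and the sample `countries` data are hypothetical.

```python
from collections import Counter


def rare_values(values, min_examples=5):
    """Return the feature values that appear in fewer than min_examples examples."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if n < min_examples}


# Hypothetical categorical feature: "IS" appears only twice.
countries = ["DE"] * 6 + ["FR"] * 5 + ["IS"] * 2
too_rare = rare_values(countries)
```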
What is data discretization?
Converting continuous values into a set of intervals (bins), each associated with a specific discrete value. Each interval should contain at least 5 examples. Watch out for outliers.
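Discretization can be sketched with fixed interval edges and a binary search. This is a minimal illustration, assuming hand-picked edges and open-ended first and last intervals so that outliers still fall into a bin; the names `discretize`, `ages`, and `edges` are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def discretize(value, edges):
    """Return the interval index for a value.

    Index 0 covers values below edges[0]; index len(edges) covers values
    at or above edges[-1], so outliers land in the open-ended outer bins.
    """
    return bisect_right(edges, value)


ages = [18, 19, 21, 22, 24, 31, 33, 35, 38, 39, 52, 55, 58, 61, 64, 97]
edges = [30, 50]  # intervals: below 30, 30 up to 50, 50 and above

bins = [discretize(age, edges) for age in ages]
counts = Counter(bins)

# Intervals that break the "at least 5 examples" rule of thumb.
under_filled = [b for b in range(len(edges) + 1) if counts[b] < 5]
```

Here the outlier 97 is absorbed by the last interval instead of forcing an extra, nearly empty bin.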