Feature Stores Flashcards

1
Q

What is a Feature Store?

A

A centralized repository for organizing, storing, and serving ML features.

2
Q

What are the most common challenges with features?

A
  • Hard to share and reuse
  • Hard to reliably serve with low latency in production
  • Training-serving skew
3
Q

What types of data ingestion does Feature Store support?

A

Both online and batch data ingestion

4
Q

What is the purpose of the timestamp column in the feature store and is it always present?

A

To track changes in feature values over time. It is not always present in the source data: if all feature values were generated at the same time, you can omit the column and set the timestamp explicitly at ingestion.

5
Q

Which column represents the entity in BigQuery, Avro, and CSV files?

A

BQ - column name
Avro - the name of the schema that represents binary data
CSV - first column

6
Q

Are arrays supported as the feature type?

A

Yes, but not in CSV files. You cannot include a null value for an array; it has to be an empty array.

7
Q

Can you ingest data directly from a data source to a feature store?

A

You can, but you have to first pre-process the data.

8
Q

Briefly explain the process of creating a feature store

A
  1. Pre-process the data
  2. Make sure that a column representing the entity is present
  3. (Optional) Add a timestamp column
  4. Create a feature store
  5. Define the entity type
  6. Add features and their types
  7. Create an ingestion job to ingest features from the data source (see the sketch below)
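
A rough sketch of these steps with the Vertex AI Python SDK (google-cloud-aiplatform); the project, IDs, column names, and BigQuery URI are placeholders, and exact method signatures may differ between SDK versions:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

# 4. Create a feature store (a fixed node count enables online serving)
fs = aiplatform.Featurestore.create(
    featurestore_id="my_featurestore", online_store_fixed_node_count=1
)

# 5. Define the entity type
users = fs.create_entity_type(entity_type_id="users")

# 6. Add features and their types
users.create_feature(feature_id="age", value_type="INT64")
users.create_feature(feature_id="country", value_type="STRING")

# 7. Ingest the pre-processed data (steps 1-3) from BigQuery
users.ingest_from_bq(
    feature_ids=["age", "country"],
    feature_time="feature_timestamp",                 # timestamp column from step 3
    bq_source_uri="bq://my-project.my_dataset.users_features",
    entity_id_field="user_id",                        # entity column from step 2
)
```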
9
Q

What is feature serving?

A

The process of exporting stored feature values for prediction or inference. Both online and batch serving are supported.

10
Q

What are the main feature serving APIs and associated stores?

A

Online serving API - used for low-latency online predictions; it uses the online store.
Batch serving API - used for model training; it uses the offline store.

11
Q

Is it possible to have multiple batch ingestion jobs running at the same time?

A

Yes, but only one per entity type.

12
Q

What is data dredging?

A

While training the model, you overload the dataset with many features without being sure whether they are even related to the problem you are trying to solve.

13
Q

What options do you have when converting categorical features into numerical?

A

One-hot (or multi-hot) encoding, or using Word2Vec to create an embedding vector.
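
A minimal one-hot sketch with pandas (the column and category names are illustrative):

```python
import pandas as pd

weather = pd.Series(["sunny", "rain", "snow", "rain"])
one_hot = pd.get_dummies(weather, prefix="weather")
# columns: weather_rain, weather_snow, weather_sunny (one binary column per category)
```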

14
Q

What is a rule of thumb for the minimum number of examples of a feature value?

A

Each feature value should appear in a minimum of 5 examples.

15
Q

What is data discretization?

A

Converting continuous values into a set of intervals, each associated with a specific discrete value. You should have at least 5 examples for each interval, and keep outliers in mind.
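
A short pandas sketch (the values and bin edges are made up), clipping an outlier before bucketing:

```python
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 52, 67, 71, 88, 120])        # 120 is an outlier
clipped = ages.clip(upper=90)                                  # cap outliers first
buckets = pd.cut(clipped, bins=[0, 18, 35, 60, 90],
                 labels=["child", "young_adult", "adult", "senior"])
```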

16
Q

What would be the approach to create one/multi-hot encoding in the following situations:
1. if you know the vocabulary keys upfront
2. if your data is already indexed, e.g. as 0-N integers
3. if you don't have a vocabulary of all possible values

A
  1. create one column per known vocabulary key
  2. use each integer index directly as its own bucket (one column per index)
  3. hash the values and take the hash modulo the number of buckets you want to create (see the sketch below)
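
A rough Python/pandas sketch of the three cases; the feature names, vocabulary, and bucket count are made up, and `zlib.crc32` is used only as an example of a stable hash:

```python
import zlib
import pandas as pd

# 1) Known vocabulary: one column per known key
colors = pd.Series(["red", "blue", "red", "green"])
known_vocab = pd.get_dummies(colors, prefix="color")

# 2) Already-indexed data (0..N integers): each index is its own bucket/column
indices = pd.Series([0, 3, 1, 3])
indexed = pd.get_dummies(indices, prefix="idx")

# 3) Unknown/open vocabulary: hash the value, then take it modulo the bucket count
NUM_BUCKETS = 16
def hashed_bucket(value: str) -> int:
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS
```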
17
Q

Imagine you have a feature with values from 0-20 that represents the number of months an ad was displayed. What happens if the ad was never posted? What is the difference between 2 months and not being posted at all?

A

You need to create an additional boolean feature column “posted” with values 1 or 0
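
A small pandas sketch of adding the indicator column (the column names are placeholders):

```python
import numpy as np
import pandas as pd

ads = pd.DataFrame({"months_displayed": [2, 14, np.nan]})       # NaN = never posted
ads["posted"] = ads["months_displayed"].notna().astype(int)     # 1 = posted, 0 = never
ads["months_displayed"] = ads["months_displayed"].fillna(0)     # 0 months once flagged
```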

18
Q

How do you treat categorical features that have null values?

A

One option would be to add an additional category such as “Not known” and use it in the one-hot encoding.
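
A minimal pandas sketch (the category names are illustrative):

```python
import pandas as pd

colors = pd.Series(["red", None, "blue"])
colors = colors.fillna("not_known")            # explicit category for missing values
one_hot = pd.get_dummies(colors, prefix="color")
```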

19
Q

How do you treat numerical features that have null values?

A

One option would be to set its value to the mean of that feature
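
A minimal mean-imputation sketch in pandas (the values are made up):

```python
import pandas as pd

ages = pd.Series([23.0, None, 41.0, 37.0])
ages = ages.fillna(ages.mean())   # replace missing values with the feature mean
```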

20
Q

How do you decide whether to create intervals for continuous numerical values or use them in raw format? Example: the age of a person.

A

It depends on the problem: if there is a non-linear relationship between the label and the continuous value, you should create intervals; otherwise use the raw value.

21
Q

What types of binning exist?

A

Fixed - manually define ranges based on domain knowledge
Adaptive - bins are defined automatically, e.g. quantile-based adaptive binning, which splits the data points into bins with roughly equal numbers of points
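
A pandas sketch contrasting the two (the values and bin edges are made up):

```python
import pandas as pd

incomes = pd.Series([12_000, 25_000, 31_000, 48_000, 75_000, 120_000, 300_000, 41_000])

# Fixed binning: ranges chosen manually from domain knowledge
fixed = pd.cut(incomes, bins=[0, 30_000, 80_000, 1_000_000],
               labels=["low", "mid", "high"])

# Adaptive (quantile-based) binning: roughly the same number of points per bin
adaptive = pd.qcut(incomes, q=4, labels=["q1", "q2", "q3", "q4"])
```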

22
Q

Name at least one approach to transforming a skewed dataset toward a normal distribution.

A

Use a log transform, y = log_b(x), where the base b is usually e ≈ 2.71828 (Euler's number).

This brings the dataset's distribution closer to a normal distribution.
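
A quick NumPy sketch (the data is made up):

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])
log_transformed = np.log(skewed)      # natural log, base e
# np.log1p(x) = log(1 + x) is a common variant when values can be 0
```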

23
Q

What is the difference between nominal and ordinal categorical data?

A

With nominal data there is no relation between categories: one doesn't come after another or have a bigger value (e.g. type of weather). With ordinal data the categories have a natural order (e.g. small < medium < large), even though the distances between them may not be meaningful.

24
Q

What is a “curse of dimensionality”?

A

When there are many features compared to the number of examples, which can lead to model overfitting.

25
Q

How to handle features that have high correlation?

A

Remove one of them

26
Q

What is a feature cross and what is mandatory to do to create a feature cross in BQML?

A

A feature cross combines the values of different features to create a new feature (e.g. hour of the day × day of the week). In BQML you need to CAST the values to STRING so they are treated as categorical features; each crossed value then gets its own coefficient.

27
Q

How to convert continuous numerical features into categorical features with BQML?

A

ML.BUCKETIZE - creates buckets given a feature and its split points.

28
Q

What is the purpose of the TRANSFORM clause in BQML?

A

Put all preprocessing steps in the TRANSFORM clause when creating/training the model in BQML. These transformations are then automatically applied during evaluation and prediction.
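
A sketch tying the last three cards together: a BQML CREATE MODEL statement with a TRANSFORM clause (bucketizing plus a feature cross), submitted from the Python BigQuery client. The project, dataset, table, and column names are placeholders:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default project credentials

sql = """
CREATE OR REPLACE MODEL `my_dataset.ad_model`
TRANSFORM(
  ML.BUCKETIZE(age, [18, 30, 45, 60]) AS age_bucket,                                   -- discretize
  CONCAT(CAST(hour AS STRING), '_', CAST(day_of_week AS STRING)) AS hour_day_cross,    -- feature cross
  label
)
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['label'])
AS SELECT age, hour, day_of_week, label FROM `my_dataset.training_data`
"""
client.query(sql).result()   # the TRANSFORM steps are re-applied at evaluation/prediction time
```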

29
Q

What is Dataflow?

A

It is a fully managed, serverless Google Cloud service based on Apache Beam, used to build data pre-processing pipelines. It separates pipeline definition from execution: you first define the whole pipeline and then run it. It supports various data sources such as GCS, file systems, and Pub/Sub, and it handles both batch and streaming data.
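
A minimal Apache Beam pipeline sketch in Python (the file paths are placeholders); the pipeline graph is defined first and only executed when run:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:   # executed when the block exits
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("input.csv")            # placeholder path
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "KeepValid" >> beam.Filter(lambda row: len(row) == 3)
        | "Format" >> beam.Map(lambda row: ",".join(row))
        | "Write" >> beam.io.WriteToText("output")                # placeholder prefix
    )
```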

30
Q

What is a PCollection?

A

A PCollection represents a (potentially distributed) dataset flowing through a Dataflow/Beam pipeline; transformations take PCollections as input and produce new ones. It does not hold the data in memory during transformations: it keeps references to the storage where the data is located, and elements are read as the transformations process them.

31
Q

How can you run a Dataflow pipeline?

A

You can run it locally (e.g. with the DirectRunner) or submit it as a job to the Dataflow service (DataflowRunner).
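
A sketch of choosing the runner through pipeline options; the project, region, and bucket are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution
local_options = PipelineOptions(runner="DirectRunner")

# Submit as a job to the Dataflow service
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",                  # placeholder
    region="us-central1",                  # assumption
    temp_location="gs://my-bucket/tmp",    # placeholder
)
```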

32
Q

Does tf.transform do preprocessing both during training and during evaluation/prediction?

A

Yes. The transformations defined in the preprocessing_fn are applied to the training data and also exported as part of the serving graph, so the same preprocessing runs at evaluation/prediction time, which prevents training-serving skew.
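
A tiny tf.Transform preprocessing_fn sketch (the feature names are illustrative):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Applied during training and embedded in the exported serving graph."""
    return {
        "age_scaled": tft.scale_to_z_score(inputs["age"]),
        "country_idx": tft.compute_and_apply_vocabulary(inputs["country"]),
    }
```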