Educative Machine Learning System Design - ML Primer Flashcards

1
Q

What should you expect in a machine learning interview?

A

Most major companies expect Machine Learning engineers to have solid engineering foundations and hands-on machine learning experience. Candidates typically go through a similar set of interview rounds: problem solving (LeetCode style), system design, machine learning knowledge, and machine learning system design.

2
Q

What’s the standard development cycle of machine learning?

A

It includes data collection, problem formulation, model creation, implementation of models, and enhancement of models.

3
Q

What are the 6 basic steps to approach Machine Learning System Design?

A

The 6 steps are:

  1. Problem statement
  2. Identify metrics
  3. Identify requirements
  4. Train and evaluate models
  5. Design high level system
  6. Scale the design

4
Q

What do you do during the “problem statement” step to approach Machine Learning System Design?

A

Asking questions is crucial to filling in any gaps and agreeing on goals.

It’s important to state the problem correctly. It is the candidate’s job to understand the intention of the design and why it is being optimized. It’s also important to make the right assumptions and discuss them explicitly with the interviewer.

Once we are clear on the problem statement, for example designing a Feed Ranking system, we can then start talking about relevant metrics such as user engagement.

5
Q

What do you do during the “identify metrics” step to approach Machine Learning System Design?

A

During the development phase, we need to quickly test model performance using offline metrics. You can start with popular metrics like log loss and AUC for binary classification, or RMSE and MAPE for forecasting problems.
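
For illustration, a minimal sketch of computing such offline metrics (assuming scikit-learn and NumPy, with toy arrays):

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score, mean_squared_error

# Binary classification: compare predicted probabilities against true labels.
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])
print("log loss:", log_loss(y_true, y_prob))
print("AUC:", roc_auc_score(y_true, y_prob))

# Forecasting: RMSE and MAPE on continuous targets.
y_actual = np.array([100.0, 150.0, 200.0])
y_forecast = np.array([110.0, 140.0, 190.0])
rmse = np.sqrt(mean_squared_error(y_actual, y_forecast))
mape = np.mean(np.abs((y_actual - y_forecast) / y_actual)) * 100
print("RMSE:", rmse, "MAPE (%):", mape)
```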

6
Q

What do you do during the “identify requirements” step to approach Machine Learning System Design? Which two requirements do you need to identify?

A
  • Training requirements
  • Inference requirements

Training Requirements

There are many components required to train a model end to end. These components include data collection, feature engineering, feature selection, and the loss function. For example, if we design a YouTube video recommendation model, users naturally watch only a small fraction of the recommended videos, so we end up with far more negative examples than positive ones. This raises the question:

How do we train models to handle class imbalance?

Once we deploy models in production, we will receive feedback in real time.

How do we monitor and make sure models don’t go stale?

Inference Requirements

Once models are deployed, we want to run inference with low latency (<100ms) and scale our system to serve millions of users.

How do we design inference components to provide high availability and low latency?

8
Q

What do you do during the “train and evaluate model” step to approach Machine Learning System Design?

A

3 components:

  • feature engineering
  • feature selection
  • models

For example, in Rental Search Ranking, we will discuss if we should use ListingID as embedding features. In Estimate Food Delivery Time, we will discuss how to handle the latitude and longitude features efficiently.

10
Q

What do you do during the “design high level system” step to approach Machine Learning System Design?

A

Goal: identify a minimal, viable design to demonstrate a working system.

In this stage, we need to think about the system components and how data flows through each of them. We need to explain why we decided to have these components and what their roles are.

  • For example, when designing Video Recommendation systems, we would need two separate components: the Video Candidate Generation Service and the Ranking Model Service.
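
As a rough sketch of that data flow (the service objects and their method names here are hypothetical):

```python
def recommend_videos(user_id, candidate_service, ranking_service, k=10):
    """Two-stage flow: generate candidates cheaply, then rank a small set precisely."""
    # Stage 1: the Video Candidate Generation Service narrows millions of videos
    # down to a few hundred candidates.
    candidates = candidate_service.generate(user_id, limit=500)

    # Stage 2: the Ranking Model Service scores each candidate with a heavier model.
    scored = ranking_service.score(user_id, candidates)

    # Return the top-k highest scoring videos to the caller.
    return sorted(scored, key=lambda item: item["score"], reverse=True)[:k]
```
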
11
Q

What do you do during the “scale the design” step to approach Machine Learning System Design?

A

In this stage, it’s crucial to understand the system’s bottlenecks and how to address them.

You can start by identifying:

  • Which components are likely to be overloaded?
  • How can we scale the overloaded components?
  • Is the system good enough to serve millions of users?
  • How would we handle it if some components become unavailable?
12
Q

Where in the 6 step process to approach machine learning system design is the “problem statement” step located?

A

The “problem statement” step is the 1st step of the process.

It is to be done before the “identify metrics” step.

13
Q

Where in the 6-step process to approach machine learning system design is the “identify metrics” step located?

A

It is the 2nd step of the 6-step process to approach machine learning system design.

It is to be done after the “problem statement” step and before the “identify requirements” step.

14
Q

Where in the 6-step process to approach machine learning system design is the “identify requirements” step located?

A

It is the 3rd step of the 6-step process to approach the machine learning system design interview.

It is to be done after the “identify metrics” step and before the “train and evaluate models” step.

15
Q

Where in the 6-step process to approach machine learning system design is the “train and evaluate models” step located?

A

It is the 4th step of the process.

It is to be done after the “identify requirements” step and before the “design high level system” step.

16
Q

Where in the 6-step process to approach machine learning system design is the “design high level system” step located?

A

It is the 5th step in the process.

It is to be done after the “train and evaluate models” step and before the “scale the design” step.

17
Q

Where in the 6-step process to approach machine learning system design is the “scale the design” step located?

A

It is the 6th and final step of the process.

It is to be done after the “design high level system” step.

18
Q

What is one hot encoding?

A

It converts categorical variables into a one-hot numeric array.

One-hot encoding is a very common technique in feature engineering. It is especially popular when dealing with categorical features of medium cardinality.

19
Q

What are common problems with one-hot encoding?

A
  • Expensive computation and high memory consumption are major problems with one-hot encoding. A high number of unique values will create high-dimensional feature vectors. For example, if there are one million unique values in a column, it will produce feature vectors that have a dimensionality of one million.
  • One-hot encoding is not suitable for Natural Language Processing tasks: the word dictionary (vocabulary) is usually very large, and we can’t use one-hot encoding to represent each word because the vectors would be too big to store in memory.
20
Q

What are some best practices for one-hot encoding?

A
  • Depending on the application, levels/categories that are not important can be grouped together in an “Other” class.
  • Make sure that the pipeline can handle unseen data in the test set.
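
A minimal sketch of both practices, assuming pandas and scikit-learn (the column name and rarity threshold are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"device": ["ios", "android", "ios", "android", "web", "kaios"]})
test = pd.DataFrame({"device": ["android", "fire_os"]})  # "fire_os" never seen in training

# Group rare categories into an "Other" class.
counts = train["device"].value_counts()
rare = counts[counts < 2].index.tolist()
train["device"] = train["device"].replace(rare, "Other")

# handle_unknown="ignore" encodes unseen test values as an all-zero vector
# instead of raising an error at inference time.
encoder = OneHotEncoder(handle_unknown="ignore")
train_vec = encoder.fit_transform(train[["device"]])
test_vec = encoder.transform(test[["device"]])
```
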
21
Q

At which step of the machine learning system design do you discuss feature selection and feature engineering?

A

You discuss it during the “identify requirements” step as well as the “train and evaluate models” step.

22
Q

What is feature hashing?

A

Feature hashing, also known as the hashing trick, converts text data or categorical attributes with high cardinality into a feature vector of arbitrary dimensionality.
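
A minimal sketch using scikit-learn’s FeatureHasher (the n_features value and event strings are arbitrary choices for illustration):

```python
from sklearn.feature_extraction import FeatureHasher

# Hash a high-cardinality categorical feature into a fixed 16-dimensional vector.
hasher = FeatureHasher(n_features=16, input_type="string")
events = [["user_login"], ["user_logout"], ["purchase_completed"]]
hashed = hasher.transform(events)   # sparse matrix of shape (3, 16)
print(hashed.toarray())
```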

23
Q

What are the benefits of feature hashing?

A

Feature hashing is very useful for features with high cardinality (hundreds or thousands of unique values). The hashing trick limits the growth in dimensionality and memory by allowing multiple values to be encoded as the same value (bucket).

24
Q

What are some problems associated with feature hashing?

A
  • One problem with hashing is collision. If the hash size is too small, more collisions will happen and negatively affect model performance. On the other hand, the larger the hash size, the more memory it consumes.
  • Collisions also affect model performance: with many collisions, a model won’t be able to differentiate coefficients between feature values. For example, the coefficients for “user login” and “user logout” might end up being the same, which makes no sense.
25
Q

What is a crossed feature?

A

A crossed feature, or conjunction, between two categorical variables of cardinality c1 and c2 is just another categorical variable of cardinality c1 × c2. If c1 and c2 are large, the conjunction feature has high cardinality, and the use of the hashing trick is even more critical in this case. Crossed features are therefore usually combined with the hashing trick to keep the dimensionality manageable.

As an example, suppose we have Uber pick-up data with latitude and longitude stored in the database, and we want to predict demand at a certain location. If we just use the latitude feature for learning, the model might learn that a city block at a particular latitude is more likely to have higher demand than others; the same holds for the longitude feature. However, a feature cross of longitude by latitude represents a well-defined city block, so the model can learn more accurately.
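
A rough sketch of crossing bucketized latitude and longitude and then hashing the result (the bucket size and hash space are arbitrary choices):

```python
import hashlib

HASH_SIZE = 10_000  # arbitrary hash space for the crossed feature

def crossed_location_feature(lat, lon, bucket_size=0.01):
    # Bucketize latitude and longitude, then cross them into one "city block" value.
    lat_bucket = int(lat // bucket_size)
    lon_bucket = int(lon // bucket_size)
    crossed = f"{lat_bucket}_x_{lon_bucket}"
    # Hashing trick: map the high-cardinality crossed value into a fixed-size space.
    # (hashlib gives a stable hash across processes, unlike Python's built-in hash().)
    return int(hashlib.md5(crossed.encode()).hexdigest(), 16) % HASH_SIZE

feature_index = crossed_location_feature(37.7749, -122.4194)
```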

26
Q

What is embedding?

A

Feature embedding is an emerging technique that aims to transform features from the original space into a new space to support effective machine learning. The purpose of embedding is to capture semantic meaning of features; for example, similar features will be close to each other in the embedding vector space.

27
Q

How do you generate/learn an embedding vector?

A

In popular deep learning frameworks like TensorFlow, you need to define the embedding dimension and the network architecture. Once defined, the network can learn the embeddings automatically.
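
A minimal Keras sketch (the vocabulary size, embedding dimension, and downstream layers are illustrative assumptions):

```python
import tensorflow as tf

NUM_VIDEOS = 100_000   # vocabulary size, e.g. number of distinct video IDs (illustrative)
EMBEDDING_DIM = 32     # dimension of the learned embedding space (illustrative)

model = tf.keras.Sequential([
    # Maps each integer video ID to a dense 32-dimensional vector; the vectors
    # are learned automatically while training the rest of the network.
    tf.keras.layers.Embedding(input_dim=NUM_VIDEOS, output_dim=EMBEDDING_DIM),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```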

28
Q

For numeric features, what can be done?

A

Normalization and Standardization

For numeric features, normalization rescales values into a fixed range, typically [-1, 1]. In some cases we instead want to normalize data into the range [0, 1].

If a feature’s distribution resembles a normal distribution, we can apply standardization (zero mean, unit variance).
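
A minimal sketch of both transforms with scikit-learn (the input values are toy numbers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [5.0], [10.0], [50.0]])

# Normalization: rescale values into a fixed range, here [-1, 1].
normalized = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

# Standardization: zero mean and unit variance, useful when the feature
# distribution resembles a normal distribution.
standardized = StandardScaler().fit_transform(x)
```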

29
Q

What are some problems associated with normalization?

A

In practice, normalization can cause an issue because the min and max values are usually outliers. One possible solution is “clipping”, where we choose “reasonable” values for the min and max.
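
A minimal clipping sketch with NumPy (the chosen bounds are arbitrary):

```python
import numpy as np

values = np.array([-250.0, 3.0, 7.0, 12.0, 9000.0])   # contains outliers

# Clip to "reasonable" bounds first so the outliers don't dominate min and max,
# then apply min-max normalization to the clipped values.
clipped = np.clip(values, a_min=0.0, a_max=100.0)
normalized = (clipped - clipped.min()) / (clipped.max() - clipped.min())
```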

30
Q

How do you handle imbalance class distribution?

A
  • Use class weights in loss function
  • Use naive resampling
  • Use synthetic resampling
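
A minimal sketch of the first two options, assuming a binary label column (synthetic resampling would typically use a library such as imbalanced-learn’s SMOTE):

```python
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_class_weight

df = pd.DataFrame({"label": [0] * 95 + [1] * 5})   # heavily imbalanced toy data

# Class weights in the loss function: weight classes inversely to their frequency.
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=df["label"])

# Naive resampling: oversample the minority class (or undersample the majority).
minority = df[df["label"] == 1]
oversampled = pd.concat([df, minority.sample(n=90, replace=True, random_state=0)])
```
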
31
Q

Training Pipeline

What kind of format should you store data in to enable high throughput at low cost?

A

One common solution is to store data in a column-oriented format like Parquet or ORC. These formats enable high throughput for ML and analytics use cases. In other use cases, the TFRecord format is widely used in the TensorFlow ecosystem.

32
Q

Training Pipeline

Why do we need data partitioning?

A

Parquet and ORC files are usually partitioned by time for efficiency, so we can avoid scanning through the whole dataset.

In the example below, we partition data by year and then by month. In practice, common AWS services such as Redshift and Athena support Parquet and ORC. In comparison to formats like CSV, Parquet can make queries up to 30x faster, save 99% of the cost, and reduce the amount of data scanned by 99%.

[Figure: Partition training data in Parquet format]
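
A minimal sketch of writing time-partitioned Parquet with pandas/pyarrow (the column names and output path are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2023, 2023, 2024],
    "month": [11, 12, 1],
    "user_id": [1, 2, 3],
    "label": [0, 1, 0],
})

# Writes one directory per year/month (e.g. year=2024/month=1/...), so queries
# filtered by time only scan the relevant partitions.
df.to_parquet("training_data", engine="pyarrow", partition_cols=["year", "month"])
```
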
33
Q

Training Pipeline: Choose the right loss function

What are common metrics for forecast problems?

A

The most common metrics are the Mean Absolute Percentage Error (MAPE) and the Symmetric Mean Absolute Percentage Error (SMAPE).
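
Their definitions, sketched in NumPy:

```python
import numpy as np

def mape(y_true, y_pred):
    # Mean Absolute Percentage Error
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def smape(y_true, y_pred):
    # Symmetric Mean Absolute Percentage Error
    return np.mean(2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred))) * 100
```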

34
Q

Training Pipeline: Choose the right loss function

What’s the most popular loss function for binary classification?

A

Cross-entropy
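
For reference, binary cross-entropy sketched in NumPy:

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    # Average negative log-likelihood of the true labels under the predicted probabilities.
    y_prob = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
```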

35
Q

Training Pipeline: Retraining requirements

What’s a common design pattern for retraining?

A

A common design pattern is to use a scheduler to retrain models on a regular basis, usually many times per day.

36
Q

Training Pipeline: Retraining requirements

What do you need to balance between when designing a retraining system?

A

Machine learning engineers need to make the training pipeline run fast and scale well with big data. When you design such a system, you need to balance between model complexity and training time.

37
Q

Inference

What’s a common pattern when designing an inference system?

A

During inference, one common pattern is to split the workload across multiple inference servers. The architecture is similar to the one used by load balancers, and this component is sometimes called an Aggregator Service.

[Figure: Dispatcher diagram]
38
Q

Inference

How does the aggregator service for inferencing work?

A
  1. Clients (the upstream process) send requests to the Aggregator Service. If the workload is too high, the Aggregator Service splits the workload and sends it to workers in the worker pool. The Aggregator Service can pick workers in one of the following ways:
    1. Workload
    2. Round Robin
    3. Request Parameter
  2. Wait for responses from the workers.
  3. Forward the response to the client.
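
A rough sketch of the round-robin option (the worker objects and their predict method are hypothetical):

```python
import itertools

class AggregatorService:
    """Splits inference requests across a pool of workers using round robin."""

    def __init__(self, workers):
        self._workers = itertools.cycle(workers)   # cycle endlessly over the pool

    def handle(self, request):
        worker = next(self._workers)            # pick the next worker (round robin)
        response = worker.predict(request)      # wait for the worker's response
        return response                         # forward the response to the client
```
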
39
Q

What is the non-stationary problem in inferencing?

A

In an online setting, data is always changing, so data distribution shift is common. Keeping models fresh is therefore crucial to achieving sustained performance. Based on how quickly model performance degrades, we can decide how often models need to be updated/retrained. One common algorithm that can be used in this setting is Bayesian Logistic Regression.

40
Q

What is Thompson Sampling?

A

In an Ad Click prediction use case, it’s beneficial to allow some exploration when recommending new ads. However, if there are too few ad conversions, it can reduce company revenue. This is the well-known exploration-exploitation trade-off. One common technique is Thompson Sampling, where at each time t we decide which action to take by sampling from the posterior distribution of each action’s reward.
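
A minimal Beta-Bernoulli Thompson Sampling sketch for picking an ad (the click/impression counts are toy numbers):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Beta posterior over each ad's click-through rate: (clicks + 1, impressions - clicks + 1).
ads = {
    "ad_a": (12 + 1, 400 - 12 + 1),
    "ad_b": (3 + 1, 90 - 3 + 1),
    "ad_c": (0 + 1, 10 - 0 + 1),   # barely explored, so its posterior is wide
}

# At time t, sample a plausible CTR for every ad and show the ad with the highest
# sample; uncertain ads occasionally win, which provides exploration.
samples = {ad: rng.beta(a, b) for ad, (a, b) in ads.items()}
chosen_ad = max(samples, key=samples.get)
```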

41
Q

What kind of metrics are used in offline environments?

A

During offline training and evaluation, we use metrics like log loss, MAE, and R2 to measure the goodness of fit. Once the model shows improvement, the next step is to move it to a staging/sandbox environment to test on a small percentage of real traffic.

42
Q

Online metrics

How do we evaluate models in online environments?

A

A/B Testing

This diagram shows one way to allocate traffic to different models in production. In reality, there may be a few dozen models, each getting a share of real traffic to serve online requests. This is one way to verify whether or not a model actually generates lift in the production environment.

[Figure: Allocate traffic for multiple models in production]
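
A minimal sketch of one such allocation scheme, hashing user IDs into buckets so each user consistently sees the same model (model names and traffic shares are illustrative):

```python
import hashlib

# Each model owns a slice of 100 traffic buckets; a given user always lands in
# the same bucket, so they consistently see the same model.
ALLOCATION = [("model_a", 90), ("model_b", 5), ("model_c", 5)]

def assign_model(user_id: str) -> str:
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    threshold = 0
    for model, share in ALLOCATION:
        threshold += share
        if bucket < threshold:
            return model
    return ALLOCATION[-1][0]
```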