Production Machine Learning Systems Flashcards

1
Q

What is concept drift?

A

It is a change in the relationship between a model's inputs and outputs. It is not necessarily connected to data drift; it can be influenced by many different things, such as hidden context - e.g. user behavior has changed over time under the influence of the strength of the economy, something that is not visible in the data.
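
A minimal illustrative sketch (not from the card; the window size and tolerance are assumptions): one crude way to surface concept drift in production is to compare the model's live error over a sliding window against the error measured at validation time.

```python
from collections import deque

import numpy as np

# Rolling window of per-prediction errors on freshly labelled data.
window = deque(maxlen=1000)

def drift_alert(recent_errors, baseline_error, tolerance=0.05):
    """Flag possible drift when the windowed mean error exceeds the
    baseline (e.g. validation MAE) by more than `tolerance`."""
    return np.mean(recent_errors) > baseline_error + tolerance

# Usage sketch: call window.append(abs(y_true - y_pred)) for each labelled
# prediction, then drift_alert(window, validation_mae) to trigger
# investigation or retraining.
```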

2
Q

What types of concept drift exist?

A

The types usually distinguished are sudden (abrupt) drift, gradual drift, incremental drift, and recurring (seasonal) drift.

3
Q

What are the two main types of distributed training architectures?

A

Data parallelism - split the training data across multiple worker nodes, each of which holds a full replica of the model
Model parallelism - when the model can't fit in memory, split the model itself across devices while every part works on the same data

4
Q

What are 2 common data parallelism approaches? Explain them in detail and when to use each.

A
  • Synchronous AllReduce architecture - each worker node processes a slice of every mini-batch. The workers split the workload but have to stay in sync, waiting for the others to finish their part before proceeding to the next mini-batch. This is a good fit for dense models with a lot of features, e.g. BERT.
  • Asynchronous parameter server architecture - nodes are split into worker nodes and parameter server nodes. The nodes are not in sync: each worker node takes a mini-batch together with the latest parameters from the parameter server nodes, and when training on that mini-batch completes, the parameters are updated. This architecture is a better fit for sparse models, i.e. models with relatively few features.
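
A rough sketch of how these two architectures map onto TensorFlow 2.x strategies (illustration only; the TF_CONFIG / cluster-resolver setup a real cluster needs is omitted):

```python
import tensorflow as tf

# Synchronous AllReduce: every worker holds a model replica and gradients
# are all-reduced across workers after each mini-batch.
sync_strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Asynchronous parameter server: variables live on parameter server tasks;
# workers pull the latest parameters and push updates independently.
# Requires a configured cluster, e.g.:
# ps_strategy = tf.distribute.ParameterServerStrategy(cluster_resolver)
```
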
5
Q

What is model parallelism?

A

It is a distributed training architecture where the model is split into layers (or groups of layers); each part trains on the same mini-batch and passes its activations to the other parts, and the parts have to stay in sync. It is used when the model is too big to fit in a single device's memory.
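
A minimal sketch of the idea in TensorFlow, assuming two visible GPUs (the split point and layer sizes are arbitrary): the first stage runs on GPU 0, the second on GPU 1, and activations cross between the devices.

```python
import tensorflow as tf

class TwoStageModel(tf.keras.Model):
    """Toy model-parallel split across two GPUs."""

    def __init__(self):
        super().__init__()
        self.stage1 = tf.keras.layers.Dense(1024, activation='relu')
        self.stage2 = tf.keras.layers.Dense(10)

    def call(self, x):
        with tf.device('/GPU:0'):
            h = self.stage1(x)     # first half of the model on GPU 0
        with tf.device('/GPU:1'):
            return self.stage2(h)  # activations move to GPU 1 for the rest
```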

6
Q

What are 4 types of TensorFlow distributed training strategies?

A
  • Mirrored strategy
  • Multi-worker mirrored strategy
  • TPU strategy
  • Parameter server strategy
7
Q

How does the Mirrored strategy for distributed training work?

A

It is used when there is a single machine with multiple GPUs. The model is replicated on each GPU, and each mini-batch is split across the GPUs. Parameters have to be kept in sync across the GPUs.
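
A minimal sketch (the model and batch size are placeholders):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # picks up all local GPUs
print('Replicas in sync:', strategy.num_replicas_in_sync)

with strategy.scope():  # variables created here are mirrored on every GPU
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mse')

# The global batch is split across replicas, so it is common to scale it:
global_batch_size = 64 * strategy.num_replicas_in_sync
```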

8
Q

How does the Multi-worker Mirrored strategy for distributed training work?

A

Almost the same as the Mirrored strategy; the only difference is that there are now multiple machines, each with multiple CPUs or GPUs, and all of them split the mini-batch. You need to define which machine is the chief (master) and which machines are worker nodes.
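
A sketch of the setup; the hostnames and ports are hypothetical, and every machine runs the same script with its own task entry in TF_CONFIG:

```python
import json
import os

import tensorflow as tf

os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'chief':  ['host0:2222'],            # coordinates checkpoints/logging
        'worker': ['host1:2222', 'host2:2222'],
    },
    'task': {'type': 'worker', 'index': 0},  # differs on each machine
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')
```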

9
Q

How does the TPU strategy for distributed training work?

A

The same as the Mirrored strategy; the only difference is that the workload is split across TPU cores. This strategy is optimised for the biggest workloads, and the main consideration is to make sure enough data can be fed in so that the TPU cores are not sitting idle.
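
A sketch, assuming a Cloud TPU reachable by name (the 'my-tpu' name is hypothetical):

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.TPUStrategy(resolver)
with strategy.scope():  # variables are placed on the TPU cores
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='adam', loss='mse')

# Keep the input pipeline fast (e.g. tf.data with prefetch) so the
# TPU cores are never starved for data.
```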
