W3-Machine Learning Modeling Pipelines in Production Flashcards

1
Q

What are the two basic ways to perform distributed training?

A

Data parallelism and model parallelism.

2
Q

How is data parallelism done?

A

Data parallelism is probably the easier of the two to implement. In this approach, you divide the data into partitions and copy the complete model to all of the workers; each worker operates on a different partition of the data, and the model updates are synchronized across workers.

This type of parallelism is model agnostic and can be applied to any neural network architecture. Usually, the scale of data parallelism corresponds to the batch size.
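A toy sketch of the idea (pure NumPy, with a made-up linear model and four simulated workers, run sequentially here for clarity): the full model is copied to every worker, each worker computes gradients on its own data partition, and the updates are synchronized by averaging.

```python
import numpy as np

# Toy data-parallel training of a linear model y ≈ X @ w with synchronous
# gradient averaging. Worker count, data, and learning rate are illustrative.
num_workers = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=4096)

w = np.zeros(10)                                           # one identical model copy per worker
shards = np.array_split(np.arange(len(X)), num_workers)    # partition the data

for step in range(200):
    grads = []
    for shard in shards:                                   # in reality each worker runs in parallel
        Xs, ys = X[shard], y[shard]
        grads.append(2 * Xs.T @ (Xs @ w - ys) / len(shard))
    w -= 0.01 * np.mean(grads, axis=0)                     # synchronize: average the workers' updates
```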

3
Q

How is model parallelism done?

A

In model parallelism, you segment the model into different parts that train concurrently on different workers, and every worker trains its part of the model on the same data.

Here, workers only need to synchronize the shared parameters, usually once for each forward or backpropagation step.
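A minimal sketch of the idea using TensorFlow device placement (the device names are assumptions and only apply if those devices exist; on a CPU-only machine both could be "/CPU:0"): different parts of one model are pinned to different devices, and every device sees the same batch.

```python
import tensorflow as tf

# Sketch of model parallelism: split one model's layers across two devices.
inputs = tf.keras.Input(shape=(128,))

with tf.device("/GPU:0"):                    # first part of the model on device 0
    x = tf.keras.layers.Dense(256, activation="relu")(inputs)

with tf.device("/GPU:1"):                    # second part of the model on device 1
    outputs = tf.keras.layers.Dense(10)(x)

model = tf.keras.Model(inputs, outputs)
# Both devices train on the same data; only the activations and shared
# parameters at the boundary between the parts need to be communicated.
```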

4
Q

There are two basic styles of distributed training using data parallelism. Name them, give an example of each, and explain how they work.

A

Synchronous and asynchronous.

In synchronous training, each worker trains on its current mini-batch of data, applies its own updates, communicates its updates to the other workers, and waits to receive and apply all of the updates from the other workers before proceeding to the next mini-batch. The all-reduce algorithm is an example of this.

In asynchronous training, all workers train independently over their mini-batches of data and update variables asynchronously. Asynchronous training tends to be more efficient, but can be more difficult to implement. The parameter server algorithm is an example of this.

5
Q

Two major disadvantages of asynchronous training are ____ and ____.

____ may not be a problem, since the speed up in asynchronous training may be enough to compensate. However, the ____ may be an issue depending on ____ and ____.

A

1. Reduced accuracy
2. Slower convergence (which means that more steps are required to converge)
3. Slow convergence
4. Accuracy loss
5. How much accuracy is lost
6. The requirements of the application

6
Q

To use distributed training, it’s not important that models become distribution-aware. True/False

A

False. To use distributed training, it’s important that models become distribution-aware.

Fortunately, high-level APIs like Keras and Estimators support distributed training, and adapting your models for it requires only a minimal amount of extra code.

7
Q

There are many different strategies for performing distributed training with TensorFlow. Name some of them.

A

The most commonly used are:
* one device strategy,
* mirrored strategy,
* parameter server strategy,
* multi-worker mirrored strategy,
* central storage strategy,
* TPU strategy.
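As a minimal, hedged sketch of how one of these is used (assuming TensorFlow 2.x with GPUs available; the tiny model is made up for illustration), wrapping model creation in a strategy scope is usually all that is needed:

```python
import tensorflow as tf

# Mirrored strategy: replicate the model on every visible GPU of one machine.
# The other strategies (e.g. tf.distribute.OneDeviceStrategy, tf.distribute.TPUStrategy)
# are swapped in the same way.
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are mirrored across the replicas.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(10,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then splits each global batch across the replicas and
# synchronizes the gradient updates (all-reduce) at every step.
```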

8
Q

Typical usage of this strategy is testing your code before
switching to other strategies that actually distribute your code.
Which method of data parallelism is this?

A

one device strategy

9
Q

This strategy is typically used for training on one machine with multiple GPUs.
Which method of data parallelism is this?

A

Mirrored Strategy

10
Q

Some machines are designated as workers and others as parameter servers.
Which method of data parallelism is this?

A

Parameter Server Strategy

Variables are stored on the parameter servers, and by default workers read and update these variables independently without synchronizing with each other. This is why parameter server style training is also sometimes referred to as asynchronous training.
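As a hedged sketch of how the roles are declared (the host names, ports, and task assignment below are placeholders, not real machines), TensorFlow reads the cluster layout from the TF_CONFIG environment variable on each machine:

```python
import json
import os

# Hypothetical cluster: two workers, one parameter server, one coordinator ("chief").
cluster = {
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    "ps": ["ps0.example.com:2222"],
    "chief": ["chief0.example.com:2222"],
}

# Every machine sets TF_CONFIG with the shared cluster spec plus its own role.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},   # this process would be worker 0
})

# The coordinator could then build the strategy from this configuration, e.g.:
# strategy = tf.distribute.ParameterServerStrategy(
#     tf.distribute.cluster_resolver.TFConfigClusterResolver())
```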

11
Q

____ are a key part of high-performance modeling, training, and inference, but they are also expensive, so it’s important to use them efficiently.

A

Accelerators

12
Q

What are the differences and similarities between inference pipelines and input pipelines?

GPT Summary

A

Inference pipelines are similar to input pipelines, but instead of being used to feed data to a model during training, they are used to feed new data to an already trained model in order to make predictions or perform other inference tasks.

Like input pipelines, the goal of inference pipelines is to efficiently utilize available hardware resources and minimize the time required to load and process data. Typically, the same issues around pre-processing data and efficiently using compute resources apply to both input and inference pipelines.

13
Q

Input pipelines are an essential part of training pipelines and inference pipelines, and ____ is one framework that can be used to design an efficient input pipeline.

GPT Summary

A

TensorFlow data or tf.data

14
Q

Why are input pipelines used?

GPT Summary

A

Input pipelines are used in order to use accelerators efficiently and to reduce the time required to load and pre-process data; they are an important part of many training and inference pipelines.

By optimizing pipelines, model training time can be significantly reduced, and resource utilization can be greatly improved.
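A small tf.data sketch of the usual optimizations (the file pattern and the parsing function are placeholders): parallel reads and map calls, caching, shuffling, batching, and prefetching so the accelerator is not left idle waiting for data.

```python
import tensorflow as tf

def parse_example(record):
    # Placeholder pre-processing step; a real pipeline would decode
    # TFRecords, images, text, etc. here.
    return record

# "train-*.tfrecord" is a hypothetical file pattern.
dataset = (
    tf.data.TFRecordDataset(tf.data.Dataset.list_files("train-*.tfrecord"))
    .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)  # parallel pre-processing
    .cache()                                                   # avoid re-reading and re-parsing
    .shuffle(10_000)
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)    # overlap data preparation with training on the accelerator
)
```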

15
Q

There is a pressing need for an efficient, scalable infrastructure that can enable large-scale deep learning and overcome the memory limitations on current accelerators. True/False

GPT Summary

A

True

Larger models, though effective, are limited by hardware constraints, especially memory limitations on current accelerators, and the gap between model growth and hardware improvement has increased the importance of parallelism.

16
Q

Why are the two basic ways to parallelize, data parallelism and model parallelism (data parallelism splits the input across workers, and model parallelism splits the model across workers), not used for enormous models? What are the suggested replacements?

GPT Summary

A

Both data and model parallelism have synchronization costs that lead to degraded performance, preventing their use for training enormous models. The synchronization turns communication between CPUs, GPUs, and accelerators into the bottleneck of the training process. The suggested replacements are gradient accumulation and swapping (gradient accumulation is sketched after this answer).

In parallel training, different portions of the model are processed by different hardware devices simultaneously, and the results of each device are combined at certain synchronization points to update the model’s parameters. However, synchronizing these updates can be expensive, both in terms of time and computational resources, and can result in degraded performance, especially in the case of enormous models.
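Gradient accumulation, one of the workarounds mentioned above, can be sketched as follows (the model, optimizer, loss function, and the choice of 4 micro-batches are assumptions): several small micro-batches stand in for one large batch that would not fit in accelerator memory, and the parameters are updated only once per virtual batch.

```python
import tensorflow as tf

ACCUM_STEPS = 4  # assumed number of micro-batches per "virtual" large batch

def accumulated_train_step(model, optimizer, loss_fn, micro_batches):
    # Sum gradients over ACCUM_STEPS micro-batches, then apply one update.
    accum = [tf.zeros_like(v) for v in model.trainable_variables]
    for x, y in micro_batches:
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True)) / ACCUM_STEPS
        grads = tape.gradient(loss, model.trainable_variables)
        accum = [a + g for a, g in zip(accum, grads)]
    optimizer.apply_gradients(zip(accum, model.trainable_variables))
```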

17
Q

What’s knowledge distillation?

A

Knowledge distillation involves training a smaller, more efficient model to mimic a larger, more complex model’s knowledge. The goal is to capture the same level of sophistication in a smaller model, making it easier to deploy in production environments such as mobile phones and edge devices.

18
Q

How does knowledge distillation work?

A

The teacher model is trained first, using a standard objective function that seeks to maximize its accuracy or a similar metric, and the student model uses an objective function that seeks to match the probability distribution of the teacher’s predictions.

19
Q

Why do we use softmax temperature in knowledge distillation?

A

Softmax temperature is used in knowledge distillation to help transfer knowledge from a larger, more complex model (the teacher model) to a smaller, simpler model (the student model).

When the teacher model makes predictions, the softmax temperature controls how “soft” or “hard” the probabilities are. A higher temperature will result in a softer probability distribution, meaning that the probabilities are more spread out among the classes. On the other hand, a lower temperature will result in a harder probability distribution, meaning that the probabilities are more concentrated on a single class.

In knowledge distillation, we use a higher temperature for the teacher model to generate a softer probability distribution. We then train the student model using this soft distribution as targets, rather than the hard targets (i.e., one-hot encoded labels). This allows the student model to learn from the teacher model’s knowledge more effectively, as the soft targets provide more information than the hard targets.

By gradually decreasing the temperature during training, the student model can eventually learn to generate hard predictions (i.e., one-hot encoded labels) that match the teacher model’s predictions. This can result in improved performance on the task, especially when the student model is smaller than the teacher model.
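A minimal sketch of the loss described above (the temperature value and the weighting between the two terms are illustrative assumptions): both teacher and student logits are softened with the same temperature, and the student is trained to match the teacher’s soft distribution alongside the usual hard-label loss.

```python
import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      temperature=4.0, alpha=0.1):
    # Soft targets: soften both distributions with the same temperature T.
    soft_teacher = tf.nn.softmax(teacher_logits / temperature)
    log_soft_student = tf.nn.log_softmax(student_logits / temperature)

    # Soft term: cross-entropy between the teacher's softened distribution and
    # the student's. The T^2 factor keeps gradients comparable across temperatures.
    soft_loss = -tf.reduce_mean(
        tf.reduce_sum(soft_teacher * log_soft_student, axis=-1)) * temperature ** 2

    # Hard term: the usual cross-entropy against the true labels.
    hard_loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, student_logits, from_logits=True))

    return alpha * hard_loss + (1.0 - alpha) * soft_loss
```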

20
Q

Give two examples of knowledge distillation techniques and briefly explain how they work.

A

1) The Two-Stage Multi-Teacher Knowledge Distillation method (TMKD) is a knowledge distillation technique that involves using multiple teachers (By ensembling the teacher models) in two stages to distill knowledge into a student model.

2) The Noisy Student method is a type of knowledge distillation technique used in machine learning. It involves training a student model on a larger and more diverse (noisier) dataset than the one used to train the teacher model.

21
Q

The TMKD advantage:

The Noisy Student advantage:

A

The TMKD: By ensembling the predictions of multiple teachers, the method can reduce the impact of overfitting or noisy predictions from individual teachers. Additionally, the two-stage approach of TMKD allows for the use of pre-trained teacher models, which can save time and resources compared to training all the models from scratch.

The Noisy Student: the technique can help address the issue of domain shift, where the distribution of the new dataset is different from the distribution of the original dataset used to train the teacher model.