W1- Machine Learning Data Lifecycle in Production Flashcards

Question 1

Q

A directed acyclic graph (DAG) is a directed graph with no cycles, and ML pipeline workflows are usually DAGs. True/False

ML Pipelines 01:55
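
As a rough illustration of why pipelines are modeled as DAGs, the sketch below uses Python's standard `graphlib` module to derive a run order from step dependencies and to reject a graph that contains a cycle. The step names and the `pipeline` dictionary are hypothetical and are not taken from the course.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}


def execution_order(steps):
    """Return a valid run order; raises CycleError if the graph is not a DAG."""
    return list(TopologicalSorter(steps).static_order())


def is_dag(steps):
    """True when the dependency graph has no cycles."""
    try:
        execution_order(steps)
        return True
    except CycleError:
        return False


order = execution_order(pipeline)

# A graph where "a" needs "b" and "b" needs "a" has no valid run order.
cyclic = {"a": {"b"}, "b": {"a"}}
```

Because every step here has exactly one predecessor, there is only one valid order. Wider graphs may admit several.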

Question 2

Q

What are machine learning orchestration tools used for?

ML Pipelines 02:17

Answer

A

Machine learning orchestration tools automate and manage workflows and pipeline infrastructure through a simple, collaborative interface. Examples: Argo, Airflow, TFX, etc.

TFX is an end-to-end platform for deploying production ML pipelines.

Question 3

Q

What are the -unique- challenges in production-grade ML?

Overview 10:30

Answer

A

Building an integrated ML system
Continuously operating it in production
Handling continuously changing data
Optimizing compute resource costs

Question 4

Q

Broken data (low-quality data) is the most common cause of problems in production ML systems. True/False

Importance of Data 06:52

Answer

A

True

Data collection is an important and critical first step in building ML systems.

Question 5

Q

A data pipeline is a series of data processing steps such as: (name 3)

Data Collection Assignment

Answer

A

Data collection
Data ingestion
Data preparation

Question 6

Q

Model performance degrades either slowly or suddenly. What are the causes of each?

Case Study: Degraded Model Performance 03:25

Answer

A

For slow (gradual) problems:
* Data changes (trends and seasonality / changes in the distribution of features / changes in the relative importance of features)
* World changes (style changes in dressing / scope and process changes / competitor or business changes)

For sudden problems:
* Data collection problems (bad sensor / bad log data / moved or disabled sensors)
* System problems (bad software updates / loss of network connectivity / system down)

Question 7

Q

Name the methods of labeling. (5)

Answer

A

Process feedback (direct labeling)
Human labeling
Semi-Supervised labeling
Active learning
Weak supervision

Last 3 are advanced methods

Process Feedback and Human Labeling 00:47

Question 8

Q

What’s an example of process feedback labeling, and an example of human labeling?

Process Feedback and Human Labeling 01:08

Answer

A

Process feedback: a very typical example is click-through rate, actual versus predicted. Suppose you are giving recommendations to a user. Did they actually click on the things you recommended? If they did, you can label the example positive. If they didn’t, you can label it negative.
Human labeling: you have humans look at the data and apply labels to it. For example, you can ask cardiologists to look at MRI images and label them.
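
The click-through example can be sketched in a few lines of Python. This is only an illustration: the `impressions` and `clicks` logs and the `label_from_clicks` helper are hypothetical names, and real logs would have to be joined on timestamps or request IDs.

```python
# Hypothetical serving log: (user, recommended_item) pairs that were shown.
impressions = [("u1", "shoes"), ("u1", "hat"), ("u2", "shoes"), ("u2", "scarf")]

# Hypothetical click log collected by the same system.
clicks = {("u1", "shoes"), ("u2", "scarf")}


def label_from_clicks(impressions, clicks):
    """Label each shown recommendation 1 if it was clicked, otherwise 0."""
    return [(user, item, int((user, item) in clicks)) for user, item in impressions]


labeled = label_from_clicks(impressions, clicks)

# Actual click-through rate, which can be compared with the predicted rate.
actual_ctr = sum(label for _, _, label in labeled) / len(labeled)
```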

Question 9

Q

What are drift and skew in data?

Detecting Data Issues 01:01

Answer

A

Drift is change in data over time. For example, if data is collected once a day, then a week or a month later the newly collected data may differ from the earlier data. The data has drifted.

Skew is the difference between two static versions from different sources of conceptually the same dataset. For example, it could be the difference between your training set and the data that you’re getting for prediction requests, your serving set.

Skew detection involves continuous evaluation of data coming to your server once you train your model.

Question 10

Q

What are the two main reasons for model decay? How is each defined?

Detecting Data Issues 02:14

Answer

A

Performance decay over time arises from differences between training and serving data. There are two main reasons for it:

Data drift: changes in the data, typically between training and serving.
Concept drift: changes in the world, that is, changes in the ground truth. (Example: online fraud detection during COVID. Far more activity moved online during that time, so many patterns previously deemed fraudulent became normal.)

Question 11

Q

When do we have schema skew?

Detecting Data Issues 04:45

Answer

A

Schema skew occurs when the training and serving data do not conform to the same schema. You might think this could never happen, but it can: you keep collecting data, things change, and suddenly you’re getting an integer where you are expecting a float, or a string where you are expecting a category.
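
A minimal sketch of a schema check in plain Python is shown below. It is not how TFDV implements validation. The `schema` dictionary, the feature names, and the `schema_anomalies` helper are all hypothetical, and the schema is written by hand here rather than inferred from training data.

```python
# Hypothetical schema: a Python type for numeric features,
# or a set of allowed values for a categorical feature.
schema = {"age": int, "price": float, "country": {"US", "DE", "JP"}}


def schema_anomalies(example, schema):
    """Return a list of readable schema violations for one serving example."""
    problems = []
    for name, expected in schema.items():
        if name not in example:
            problems.append(f"{name}: missing")
            continue
        value = example[name]
        if isinstance(expected, set):
            # Categorical feature: the value must be a known category.
            if value not in expected:
                problems.append(f"{name}: unexpected category {value!r}")
        elif type(value) is not expected:
            # Strict type check, so an int where a float is expected is flagged.
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"age": 31, "price": 9.99, "country": "DE"}
bad = {"age": 31, "price": 10, "country": "Germany"}
```

Calling `schema_anomalies(bad, schema)` reports both the integer price and the unknown country string, mirroring the two cases described in the answer.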

Question 12

Q

How are dataset shift, covariate shift and concept shift defined, using the features (x) and labels (y) probability distribution?

Detecting Data Issues 05:45

Answer

A

Dataset shift occurs when the joint probability of x (features) and y (labels) is not the same during training and serving. The data has shifted over time.

P_train(y,x) != P_serve(y,x)

Covariate shift refers to the change in distribution of the input variables present in training and serving data. In other words, it’s where the marginal distribution of x (features) is not the same during training and serving, but the conditional distribution remains unchanged.
* P_train(x) != P_serve(x)
* P_train(y|x) = P_serve(y|x)

Concept shift refers to a change in the relationship between the input and output variables as opposed to the differences in the Data Distribution or input itself. In other words, it’s when the conditional distribution of y (labels) given x (features) is not the same during training and serving, but the marginal distribution of x (features) remains unchanged.
* P_train(x) = P_serve(x)
* P_train(y|x) != P_serve(y|x)
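
The definitions above can be checked numerically on a toy discrete example, using the identity P(y,x) = P(y|x) P(x). The sketch below is illustrative only: the distributions are made up and `classify_shift` is a hypothetical helper. Note that in both special cases the joint distribution also differs, so both are instances of dataset shift.

```python
def joint(p_x, p_y_given_x):
    """Build P(y, x) = P(y | x) * P(x) for discrete x and y."""
    return {
        (x, y): p_x[x] * p
        for x, cond in p_y_given_x.items()
        for y, p in cond.items()
    }


def classify_shift(train, serve, tol=1e-9):
    """Compare (P(x), P(y|x)) pairs from training and serving."""
    (px_t, pyx_t), (px_s, pyx_s) = train, serve
    x_changed = any(abs(px_t[x] - px_s[x]) > tol for x in px_t)
    cond_changed = any(
        abs(pyx_t[x][y] - pyx_s[x][y]) > tol for x in pyx_t for y in pyx_t[x]
    )
    if x_changed and not cond_changed:
        return "covariate shift"
    if cond_changed and not x_changed:
        return "concept shift"
    if x_changed and cond_changed:
        return "dataset shift (both changed)"
    return "no shift"


# Made-up distributions over a feature x in {"a", "b"} and a label y in {0, 1}.
p_x = {"a": 0.5, "b": 0.5}
p_y_given_x = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}}
train = (p_x, p_y_given_x)

# Serving sees more "a" examples, but labels depend on x in the same way.
serve_covariate = ({"a": 0.8, "b": 0.2}, p_y_given_x)

# Serving sees the same mix of x, but the label relationship for "a" changed.
serve_concept = (p_x, {"a": {0: 0.6, 1: 0.4}, "b": {0: 0.2, 1: 0.8}})
```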

Question 13

Q

With TFDV (TensorFlow Data Validation) tool you can easily detect three different types of skew: ____ skew, ____ skew, and ____ skew.

TensorFlow Data Validation 02:00

Answer

A

schema, feature, distribution

Question 14

Q

When do schema skew, feature skew, and distribution skew happen? Give an example of each.

TensorFlow Data Validation 02:51

Answer

A

Schema skew occurs when the serving and training data don’t conform to the same schema. For example, a change in type, such as getting an int where you’re expecting a float, is a change in the feature itself.

Feature skew is when the values of features are different in the training and serving sets. For example, feature values are modified between training and serving time, or a transformation is applied in only one of the two.

Distribution skew is a change in the distribution of individual features in the dataset. A feature might have a range of 0-100 in the training data, and then at serving time you’re seeing values between 5 and 600. That is a change in the distribution of that feature. Changes in the mean, the median, or the standard deviation are all changes in distribution. Depending on how severe the change is, it may or may not be a problem.
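
To make the distribution skew example concrete, the sketch below compares summary statistics for one feature across training and serving data. The values are made up to echo the 0-100 versus 5-600 ranges above, and `summarize` and `out_of_range_fraction` are hypothetical helpers, not part of any library.

```python
from statistics import mean, median, pstdev


def summarize(values):
    """Basic per-feature statistics used to compare two datasets."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "std": pstdev(values),
    }


def out_of_range_fraction(train, serving):
    """Fraction of serving values that fall outside the range seen in training."""
    lo, hi = min(train), max(train)
    return sum(1 for v in serving if v < lo or v > hi) / len(serving)


# Made-up values for a single numeric feature.
train_values = [0, 20, 40, 60, 80, 100]
serving_values = [5, 50, 150, 300, 450, 600]

train_stats = summarize(train_values)
serving_stats = summarize(serving_values)
fraction_outside = out_of_range_fraction(train_values, serving_values)
```

Whether a given difference matters is a judgment call, so in practice the comparison would be made against a threshold chosen for each feature.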

Question 15

Q

What measure is typically used to determine the degree of data drift?

Issues in Training Data Assignment

Answer

A

Chebyshev distance (L-infinity)

TFDV performs skew or drift detection on categorical features and skew is expressed in terms of an L-infinity distance, which is also known as Chebyshev distance. (TensorFlow Data Validation 02:16)
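
The L-infinity (Chebyshev) distance between two categorical distributions is the largest absolute difference in relative frequency across all categories. The sketch below computes it in plain Python as an illustration of the measure, not of TFDV's internals. The payment-type data and the `THRESHOLD` value are hypothetical.

```python
from collections import Counter


def normalized_counts(values):
    """Relative frequency of each category in a list of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(train_values, serving_values):
    """Largest absolute difference in category frequency between two datasets."""
    p = normalized_counts(train_values)
    q = normalized_counts(serving_values)
    # A category missing from one dataset has frequency 0 there.
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


# Made-up categorical feature: payment type at training and serving time.
train_payment = ["card"] * 6 + ["cash"] * 4
serving_payment = ["card"] * 3 + ["cash"] * 5 + ["crypto"] * 2

distance = l_infinity_distance(train_payment, serving_payment)

THRESHOLD = 0.2  # hypothetical tolerance; chosen per feature in practice
skew_detected = distance > THRESHOLD
```

Here the frequencies are card 0.6 / cash 0.4 in training and card 0.3 / cash 0.5 / crypto 0.2 in serving, so the per-category differences are 0.3, 0.1 and 0.2, and the largest one is the distance.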

Question 16

Q

What does TFDV (TensorFlow Data Validation) do?

TFDV exercise notebook

Answer

A

TFDV helps to understand, validate, and monitor production machine learning data at scale. It provides insight into some key questions in the data analysis process such as:

What are the underlying statistics of my data?
What does my training dataset look like?
How do my evaluation and serving datasets compare to the training dataset?
How can I find and fix data anomalies?

(16 cards)