W1- Machine Learning Data Lifecycle in Production Flashcards
A directed acyclic graph (DAG) is a directed graph with no cycles, and ML pipeline workflows are usually DAGs. True/False
True
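A minimal sketch of why the DAG property matters: a pipeline without cycles always has a valid execution order. The step names below are hypothetical, and the standard-library `graphlib` module stands in for a real orchestrator such as Airflow or TFX.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical pipeline: each step maps to the set of steps it depends on.
pipeline = {
    "ingest": set(),
    "validate": {"ingest"},
    "transform": {"validate"},
    "train": {"transform"},
    "evaluate": {"train"},
}


def execution_order(graph):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(graph).static_order())


def is_dag(graph):
    """True when every step can be scheduled, i.e. the graph has no cycle."""
    try:
        execution_order(graph)
        return True
    except CycleError:
        return False


order = execution_order(pipeline)
print(order)
```

A graph such as `{"a": {"b"}, "b": {"a"}}` has no valid order, so `is_dag` returns `False` for it.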
Machine learning orchestration tools automate and manage workflows and pipeline infrastructure through a simple, collaborative interface. Examples: Argo, Airflow, TFX.
An end-to-end platform for deploying ML production pipelines
What are the unique challenges in production-grade ML?
Building an integrated ML system
Continuously operate it in production
Handle continuously changing data
Optimize compute resource costs
Broken data (low-quality data) is the most common cause of problems in the production ML systems. True/False
True
Data collection is an important and critical first step in building ML systems.
A data pipeline is a series of data processing steps such as: (name 3)
Data collection
Data ingestion
Data preparation
Model performance degrades either slowly or quickly. What are the causes of each?
For slow (gradual) problems:
* Data changes (trends and seasonality / changes in the distribution of features / changes in the relative importance of features)
* World changes (changes in clothing styles / changes in scope and processes / competitor or business changes)
For sudden problems:
* Data collection problems (bad sensor / bad log data / moved or disabled sensors)
* System problems (bad software update / loss of network connectivity / system down)
Name the methods of labeling. (5)
Process feedback (direct labeling)
Human labeling
Semi-Supervised labeling
Active learning
Weak supervision
The last three are advanced methods.
What is an example of process feedback labeling, and what is an example of human labeling?
- Process feedback: a typical example is click-through rate, comparing actual with predicted click-throughs. If you show recommendations to a user and they click a recommended item, you can label that example positive; if they do not, you can label it negative.
- Human labeling: people look at the data and apply labels to it. For example, you can ask cardiologists to look at MRI images and label them.
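The click-through example can be sketched as follows. The log format `(user_id, recommended_item, clicked)` is an assumption made for illustration; real serving logs will look different.

```python
def label_from_clicks(impressions):
    """Turn logged recommendation impressions into (features, label) pairs.

    The label is 1 if the user clicked the recommended item, otherwise 0.
    """
    return [((user, item), 1 if clicked else 0) for user, item, clicked in impressions]


# Hypothetical log rows: (user_id, recommended_item, clicked)
impressions = [
    ("u1", "item_a", True),
    ("u1", "item_b", False),
    ("u2", "item_a", False),
]

labeled = label_from_clicks(impressions)

# Actual click-through rate, to compare against the rate the model predicted.
actual_ctr = sum(label for _, label in labeled) / len(labeled)
print(labeled, actual_ctr)
```

The labels come for free from the system's own logs, which is what makes this "direct" labeling.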
What are drift and skew in data?
Drift is change in data over time. For example, if you collect data once a day, then a week or a month later the data may have changed; that is, it has drifted.
Skew is the difference between two static versions from different sources of conceptually the same dataset. For example, it could be the difference between your training set and the data that you’re getting for prediction requests, your serving set.
Skew detection involves continuously evaluating the data arriving at your server after you have trained your model.
What are the two main reasons for model decay, and how are they defined?
Performance decays over time because of differences between training and serving data. There are two main reasons:
- Data drift: changes in the data, typically between training and serving.
- Concept drift: changes in the world that change the ground truth. (Example: fraudulent online activity during COVID. So much more was done online during that time that many patterns previously deemed fraudulent became normal.)
When do we have schema skew?
Schema skew occurs when the training and serving data do not conform to the same schema. You might think this could never happen, but it can: data collection changes over time, and suddenly you are getting an integer where you were expecting a float, or a string where you were expecting a category.
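A minimal sketch of a schema check, assuming a hand-written schema that maps feature names to expected Python types. The schema, feature names, and helper are hypothetical; this is not the TFDV API, only an illustration of the idea.

```python
# Hypothetical schema derived from training data: feature name -> expected type.
TRAINING_SCHEMA = {"age": float, "country": str, "clicks": int}


def schema_anomalies(schema, example):
    """Return human-readable anomalies for one serving example."""
    anomalies = []
    for name, expected in schema.items():
        if name not in example:
            anomalies.append(f"{name}: missing")
        elif type(example[name]) is not expected:
            actual = type(example[name]).__name__
            anomalies.append(f"{name}: expected {expected.__name__}, got {actual}")
    # Features that appear at serving time but were never seen in training.
    for name in sorted(example.keys() - schema.keys()):
        anomalies.append(f"{name}: not in schema")
    return anomalies


# Serving example where "age" arrives as an integer instead of a float.
serving_example = {"age": 31, "country": "NL", "clicks": 4}
print(schema_anomalies(TRAINING_SCHEMA, serving_example))
```

The type comparison is deliberately strict, so an `int` arriving where a `float` is expected is reported rather than silently accepted.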
How are dataset shift, covariate shift and concept shift defined, using the features (x) and labels (y) probability distribution?
Dataset shift occurs when the joint probability of x (features) and y (labels) is not the same during training and serving. The data has shifted over time.
- Ptrain(y,x) != Pserve(y,x)
Covariate shift refers to the change in distribution of the input variables present in training and serving data. In other words, it’s where the marginal distribution of x (features) is not the same during training and serving, but the conditional distribution remains unchanged.
* Ptrain(x) != Pserve(x)
* Ptrain(y|x) = Pserve(y|x)
Concept shift refers to a change in the relationship between the input and output variables, as opposed to a change in the data distribution or the input itself. In other words, it’s when the conditional distribution of y (labels) given x (features) is not the same during training and serving, but the marginal distribution of x (features) remains unchanged.
* Ptrain(x) = Pserve(x)
* Ptrain(y|x) != Pserve(y|x)
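Because P(y,x) = P(y|x) P(x), a change in either factor changes the joint distribution, so covariate shift and concept shift are both special cases of dataset shift. The sketch below applies the definitions to small discrete distributions. The probability tables are made up for illustration, and the helper assumes that the training and serving tables share the same keys.

```python
def classify_shift(p_x_train, p_y_given_x_train, p_x_serve, p_y_given_x_serve, tol=1e-9):
    """Classify a shift from P(x) and P(y|x) tables (same keys assumed in both)."""

    def differs(a, b):
        return any(abs(a[k] - b[k]) > tol for k in a)

    x_changed = differs(p_x_train, p_x_serve)
    cond_changed = any(
        differs(p_y_given_x_train[x], p_y_given_x_serve[x]) for x in p_x_train
    )
    if x_changed and not cond_changed:
        return "covariate shift"
    if cond_changed and not x_changed:
        return "concept shift"
    if x_changed and cond_changed:
        return "dataset shift (both P(x) and P(y|x) changed)"
    return "no shift"


# Hypothetical training distributions for one binary feature and a binary label.
p_x_train = {"low": 0.7, "high": 0.3}
p_y_given_x_train = {
    "low": {"pos": 0.1, "neg": 0.9},
    "high": {"pos": 0.8, "neg": 0.2},
}

# Serving case 1: the inputs change, the relationship stays the same.
covariate = classify_shift(
    p_x_train, p_y_given_x_train, {"low": 0.4, "high": 0.6}, p_y_given_x_train
)

# Serving case 2: the inputs stay the same, the relationship changes.
p_y_given_x_new = {
    "low": {"pos": 0.5, "neg": 0.5},
    "high": {"pos": 0.8, "neg": 0.2},
}
concept = classify_shift(p_x_train, p_y_given_x_train, p_x_train, p_y_given_x_new)
print(covariate, concept)
```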
With the TensorFlow Data Validation (TFDV) tool, you can easily detect three different types of skew: ____ skew, ____ skew, and ____ skew.
schema, feature, distribution
When do schema skew, feature skew, and distribution skew happen? Give an example of each.
Schema skew occurs when the serving and training data do not conform to the same schema. For example, the type of a feature could change: you get an int where you were expecting a float.
Feature skew occurs when feature values differ between the training and serving sets. For example, feature values are modified between training and serving time, or a transformation is applied in only one of the two settings.
Distribution skew is a change in the distribution of individual features in the dataset. A feature might have a range of 0-100 in the training data, but at serving time you see values between 5 and 600; that is a change in the distribution of that feature. Changes in the mean, the median, or the standard deviation are all changes in distribution. Depending on how severe the change is, it may or may not be a problem.
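The range example can be checked with a few summary statistics. This is a rough sketch with made-up values, using only the standard library; the helper names are hypothetical and no particular alerting threshold is implied.

```python
import statistics


def summarize(values):
    """Summary statistics that commonly reveal a change in distribution."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
    }


def out_of_range_fraction(train_values, serve_values):
    """Fraction of serving values that fall outside the range seen in training."""
    lo, hi = min(train_values), max(train_values)
    outside = [v for v in serve_values if v < lo or v > hi]
    return len(outside) / len(serve_values)


# Made-up feature values: training spans 0-100, serving spans 5-600.
train = [0, 20, 40, 60, 80, 100]
serve = [5, 50, 150, 300, 450, 600]

train_stats = summarize(train)
serve_stats = summarize(serve)
fraction_outside = out_of_range_fraction(train, serve)
print(train_stats, serve_stats, fraction_outside)
```

Whether a given change is severe enough to act on depends on the model and the feature, so any cutoff would need to be chosen per use case.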
What measure is typically used to determine the degree of data drift?
Chebyshev distance (L-infinity)
TFDV performs skew and drift detection on categorical features and expresses the result as an L-infinity distance, also known as the Chebyshev distance. (TensorFlow Data Validation 02:16)
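A minimal sketch of the metric itself, assuming it is applied to the normalized value frequencies of one categorical feature. This illustrates the L-infinity distance only; it is not the TFDV API, and the feature values are made up.

```python
from collections import Counter


def normalized_counts(values):
    """Relative frequency of each category."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def l_infinity_distance(train_values, serve_values):
    """Largest absolute difference in relative frequency over all categories."""
    p = normalized_counts(train_values)
    q = normalized_counts(serve_values)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


# Made-up categorical feature: payment method in training vs. serving data.
train = ["card"] * 6 + ["cash"] * 4
serve = ["card"] * 3 + ["cash"] * 5 + ["voucher"] * 2

distance = l_infinity_distance(train, serve)
print(distance)
```

Here the frequencies differ by 0.3 for `card`, 0.1 for `cash`, and 0.2 for `voucher`, so the distance is the largest of these, about 0.3. Drift would be flagged when the distance exceeds a threshold chosen for that feature.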