W3- Machine Learning Data Lifecycle in Production Flashcards
In the event of unexpected pipeline behavior or errors, metadata can be leveraged to analyze the lineage of pipeline components and to help you debug issues. True/False
True
MLMD helps you understand and analyze all the interconnected parts of your ML pipeline, instead of analyzing them in isolation.
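For example, here is a minimal lineage-debugging sketch with the MLMD client (the SQLite path and the ‘Examples’ artifact type are assumptions): starting from an artifact, events lead back to the executions that produced or consumed it.

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the pipeline's metadata store (SQLite path is assumed).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = 'metadata.sqlite'
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Trace one Examples artifact (assumes the pipeline has already run).
artifact = store.get_artifacts_by_type('Examples')[0]

# Events connect artifacts to the executions that produced/consumed them.
events = store.get_events_by_artifact_ids([artifact.id])
execution_ids = {event.execution_id for event in events}
for execution in store.get_executions_by_id(execution_ids):
    print(execution.id, execution.type_id, execution.last_known_state)
```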
In addition to the executor, where your code runs, each component includes two additional parts: the driver and the publisher. What do these three parts do?
The executor is where the work of the component is done, and it’s what makes different components different. Whatever input the executor needs is provided by the driver, which fetches it from the metadata store. Finally, the publisher pushes the results of running the executor back into the metadata store. Most of the time, you won’t need to customize the driver or publisher.
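For intuition, a sketch using TFX’s Python function component API, where you write only the executor body and TFX supplies the driver and publisher (the component name and logic here are made up):

```python
from tfx import v1 as tfx

# You write only the executor logic. When this runs in a pipeline, the
# framework-supplied driver resolves `examples` from the metadata store,
# and the publisher records `copy` back into it.
@tfx.dsl.components.component
def ExamplePassThrough(
    examples: tfx.dsl.components.InputArtifact[
        tfx.types.standard_artifacts.Examples],
    copy: tfx.dsl.components.OutputArtifact[
        tfx.types.standard_artifacts.Examples],
):
    # Executor body: read from examples.uri, write under copy.uri.
    import shutil
    shutil.copytree(examples.uri, copy.uri, dirs_exist_ok=True)
```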
What’s MLMD?
MLMD is a library for tracking and retrieving metadata associated with ML developer and data scientist workflows.
MLMD can be used as an integral part of an ML pipeline or it can be used independently. When integrated with an ML pipeline, you have to explicitly interact with MLMD. True/False
False. MLMD can be used as an integral part of an ML pipeline, or it can be used independently. However, when integrated with an ML pipeline, you may not even interact with MLMD explicitly.
Objects which are stored in MLMD are referred to as ____. MLMD stores the properties of each artifact in a ____ and stores large objects like datasets on disk, in a file system, or in a block store.
artifacts
relational database
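A sketch of this split, following the MLMD getting-started pattern (the type name, property, and URI are illustrative): the artifact’s properties land in the database, while the payload stays at a URI on disk.

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = 'metadata.sqlite'
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Register an artifact type with a typed property.
data_type = metadata_store_pb2.ArtifactType()
data_type.name = 'DataSet'
data_type.properties['split'] = metadata_store_pb2.STRING
data_type_id = store.put_artifact_type(data_type)

# The artifact row holds properties and a URI; the dataset itself
# lives at that URI on disk, not inside the database.
artifact = metadata_store_pb2.Artifact()
artifact.type_id = data_type_id
artifact.uri = 'path/to/dataset'  # assumed location
artifact.properties['split'].string_value = 'train'
[artifact_id] = store.put_artifacts([artifact])
```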
When you’re working with ML metadata, you need to know how data flows between successive components. Each step in this data flow is described through an entity that you need to be familiar with. At the highest level of MLMD, there are three data entities that can be considered as units:
Artifacts
Executions
Contexts
Define each.
An artifact is data going in as input or generated as output of a component.
Each execution is a record of any component run during the ML pipeline workflow, along with its associated runtime parameters.
Artifacts and executions can be clustered together for each type of component separately. This grouping is referred to as the context.
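Continuing in the same vein, a sketch (all names illustrative) that records an execution and groups it under a context:

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = 'metadata.sqlite'
config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(config)

# Execution: a record of a component run with runtime properties.
trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = 'Trainer'
trainer_type.properties['state'] = metadata_store_pb2.STRING
trainer_type_id = store.put_execution_type(trainer_type)

execution = metadata_store_pb2.Execution()
execution.type_id = trainer_type_id
execution.properties['state'].string_value = 'RUNNING'
[execution_id] = store.put_executions([execution])

# Context: groups artifacts and executions, e.g. one experiment run.
experiment_type = metadata_store_pb2.ContextType()
experiment_type.name = 'Experiment'
experiment_type_id = store.put_context_type(experiment_type)

experiment = metadata_store_pb2.Context()
experiment.type_id = experiment_type_id
experiment.name = 'exp-1'  # context names must be unique per type
[experiment_id] = store.put_contexts([experiment])

# Associate the execution with the context.
association = metadata_store_pb2.Association()
association.execution_id = execution_id
association.context_id = experiment_id
store.put_attributions_and_associations([], [association])
```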
Definition of:
Feature store
Data Warehouse
Data Lake
Feature store: Central repository for storing documented, curated, and access-controlled features, specifically for ML
Data Warehouse: Subject-oriented repository for structured data, optimized for fast read
Data Lake: Repository of data stored in its raw, natural format
As data evolves during its life cycle, does addressing “Monitoring model and data provenance” help ML pipelines to operate properly?
C2-W3-Quiz
No. Monitoring provenance is an important aspect of ML pipelines, but it will not help in coping with evolving data.
Me: Evolving data means the data itself changes, and those changes need to be addressed; monitoring provenance doesn’t address any of the challenges created by data evolution.
TFX components interact with each other by getting artifact information from the metadata store. True/False
C2-W3-Assignment
True
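For instance, a minimal sketch (the input path is assumed): StatisticsGen consumes the examples channel of CsvExampleGen, and at run time that channel is resolved to concrete artifacts through the metadata store.

```python
from tfx import v1 as tfx

# Two standard TFX components. The `examples` channel below is resolved
# through the metadata store at run time: StatisticsGen looks up the
# Examples artifacts that CsvExampleGen published, rather than
# receiving them directly.
example_gen = tfx.components.CsvExampleGen(input_base='data/')
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
```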
What does ImportSchemaGen do?
C2-W3-Assignment
ImportSchemaGen is a TFX component to import a schema file into the pipeline.
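A minimal usage sketch (the schema path is an assumption):

```python
from tfx import v1 as tfx

# Import a hand-curated schema file as a Schema artifact that
# downstream components (e.g. ExampleValidator) can consume.
schema_importer = tfx.components.ImportSchemaGen(
    schema_file='path/to/schema.pbtxt')  # assumed path
```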