Orchestrating machine learning with pipelines Flashcards
What are pipelines used for?
In Azure Machine Learning, you run workloads as experiments that leverage data assets and compute resources. In an enterprise data science process, you’ll generally want to separate the overall process into individual tasks, and orchestrate these tasks as pipelines of connected steps. Pipelines are key to implementing an effective Machine Learning Operations (MLOps) solution in Azure, so you’ll explore how to define and run them in this module.
What is a pipeline in AML?
In AML, a pipeline is a workflow of machine learning tasks in which each task is implemented as a step.
Steps can be arranged sequentially or in parallel, enabling you to build sophisticated flow logic to orchestrate machine learning operations. Each step can be run on a specific compute target, making it possible to combine different types of processing as required to achieve an overall goal.
A pipeline can be executed as a process by running the pipeline as an experiment. Each step in the pipeline runs on its allocated compute target as part of the overall experiment run.
What examples of pipeline steps are there?
PythonScriptStep: Runs a specified Python script
EstimatorStep: Runs an estimator
DataTransferStep: Uses Azure Data Factory to copy data between data stores
DatabricksStep: Runs a notebook, script, or compiled JAR on a Databricks cluster
AdlaStep: Runs a U-SQL job in Azure Data Lake Analytics
How do you implement a pipeline?
To create a pipeline, you must first define each step and then create a pipeline that includes the steps. The specific configuration of each step depends on the step type. After defining the steps, you can assign them to a pipeline and run it as an experiment.
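For example, a minimal sketch of a two-step pipeline using the SDK v1 classes named above (the script names, the 'scripts' source directory, and the 'aml-cluster' compute target are hypothetical placeholders):

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Define two steps; each step can target its own compute
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster')
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train_model.py',
                         compute_target='aml-cluster')

# Assign the steps to a pipeline and run it as an experiment
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
experiment = Experiment(workspace=ws, name='training-pipeline')
pipeline_run = experiment.submit(pipeline)
```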
What is a PipelineData object?
The PipelineData object is a special kind of DataReference that:
- References a location in a datastore
- Creates a data dependency between pipeline steps
You can view a PipelineData object as an intermediary store for data that must be passed from one step to a subsequent step.
What is necessary to use a PipelineData object?
- Define a named PipelineData object that references a location in a datastore
- Specify the PipelineData object as an input or output for the steps that use it
- Pass the PipelineData object as a script parameter in steps that run scripts (and include code in those scripts to read or write data)
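A minimal sketch of these three requirements, reusing the hypothetical step and compute names from the earlier example (the scripts are assumed to write and read the folder passed in via the --out_folder and --in_folder arguments):

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# 1. Define a named PipelineData object referencing a location in a datastore
prepped_data = PipelineData('prepped_data', datastore=ws.get_default_datastore())

# 2. Specify it as an output of the first step and an input of the second,
# 3. and pass it as a script argument so the scripts can write/read the data
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster',
                         arguments=['--out_folder', prepped_data],
                         outputs=[prepped_data])
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train_model.py',
                         compute_target='aml-cluster',
                         arguments=['--in_folder', prepped_data],
                         inputs=[prepped_data])
```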
What is step reuse?
By default, the step output from a previous pipeline run is reused without rerunning the step, provided the script, source directory, and other parameters for the step have not changed.
Step reuse can reduce the time it takes to run a pipeline, but it can lead to stale results when changes to the underlying data sources have not been accounted for.
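A sketch of the two ways to control this in SDK v1, using the same hypothetical names as above:

```python
from azureml.pipeline.steps import PythonScriptStep

# allow_reuse=False forces this step to rerun on every pipeline submission
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster',
                         allow_reuse=False)
```

You can also force all steps in a pipeline to rerun for a single submission with experiment.submit(pipeline, regenerate_outputs=True).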
Where can you publish pipelines?
After you have created a pipeline, you can publish it to create a REST endpoint through which the pipeline can be run on demand.
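For example, assuming the pipeline object from the earlier sketch:

```python
# Publish the pipeline to create a REST endpoint
published_pipeline = pipeline.publish(name='training_pipeline',
                                      description='Model training pipeline',
                                      version='1.0')

# The endpoint URI through which the pipeline can be initiated on demand
rest_endpoint = published_pipeline.endpoint
```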
How do you initiate a published pipeline?
To initiate a published pipeline, you make an HTTP request to its REST endpoint, passing an authorization header with a token for a service principal with permission to run the pipeline, and a JSON payload specifying the experiment name.
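A sketch of such a request, assuming the rest_endpoint from the previous example and a hypothetical service principal (substitute real credentials):

```python
import requests
from azureml.core.authentication import ServicePrincipalAuthentication

# Hypothetical service principal with permission to run the pipeline
sp_auth = ServicePrincipalAuthentication(tenant_id='<tenant-id>',
                                         service_principal_id='<client-id>',
                                         service_principal_password='<client-secret>')
auth_header = sp_auth.get_authentication_header()

# The JSON payload specifies the experiment name for the run
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={'ExperimentName': 'run_training_pipeline'})
run_id = response.json()['Id']
```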
What do you create to use parameters for a pipeline?
To define parameters for a pipeline, create a PipelineParameter object for each parameter, and specify each parameter in at least one step.
You must define parameters for a pipeline before publishing it.
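A minimal sketch, extending the hypothetical training step from earlier with a regularization-rate parameter:

```python
from azureml.pipeline.core.graph import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# A named parameter with a default value, defined before publishing
reg_param = PipelineParameter(name='reg_rate', default_value=0.01)

train_step = PythonScriptStep(name='train model',
                              source_directory='scripts',
                              script_name='train_model.py',
                              compute_target='aml-cluster',
                              arguments=['--reg', reg_param])
```

When initiating the published pipeline, you can then override the default by adding a ParameterAssignments object to the JSON payload, for example {'ExperimentName': 'run_training_pipeline', 'ParameterAssignments': {'reg_rate': 0.1}}.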
What can you do to run a pipeline in periodic intervals?
To schedule a pipeline to run at periodic intervals, you must define a ScheduleRecurrence that determines the run frequency, and use it to create a Schedule.
In addition to running a pipeline at time-based intervals, it is also possible to trigger a pipeline run whenever data changes, by creating a Schedule that monitors a specified path on a datastore.
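A sketch of both options, assuming the published_pipeline from the earlier example and a hypothetical datastore named 'blob_data':

```python
from azureml.core import Datastore, Workspace
from azureml.pipeline.core import Schedule, ScheduleRecurrence

ws = Workspace.from_config()

# Run the published pipeline once a day
daily = ScheduleRecurrence(frequency='Day', interval=1)
daily_schedule = Schedule.create(ws, name='Daily Training',
                                 pipeline_id=published_pipeline.id,
                                 experiment_name='Training_Pipeline',
                                 recurrence=daily)

# Or trigger a run whenever data changes in a monitored datastore path
blob_store = Datastore(workspace=ws, name='blob_data')
reactive_schedule = Schedule.create(ws, name='Reactive Training',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='Training_Pipeline',
                                    datastore=blob_store,
                                    path_on_datastore='data/training')
```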