Orchestrating machine learning with pipelines Flashcards
What are pipelines used for?
In Azure Machine Learning, you run workloads as experiments that leverage data assets and compute resources. In an enterprise data science process, you’ll generally want to separate the overall process into individual tasks, and orchestrate these tasks as pipelines of connected steps. Pipelines are key to implementing an effective Machine Learning Operations (MLOps) solution in Azure, so you’ll explore how to define and run them in this module.
What is a pipeline in AML?
In AML, a pipeline is a workflow of machine learning tasks in which each task is implemented as a step.
Steps can be arranged sequentially or in parallel, enabling you to build sophisticated flow logic to orchestrate machine learning operations. Each step can be run on a specific compute target, making it possible to combine different types of processing as required to achieve an overall goal.
A pipeline can be executed as a process by running the pipeline as an experiment. Each step in the pipeline runs on its allocated compute target as part of the overall experiment run.
What examples of pipeline steps are there?
PythonScriptStep: Runs a specified Python script
EstimatorStep: Runs an estimator
DataTransferStep: Uses Azure Data Factory to copy data between data stores
DatabricksStep: Runs a notebook, script, or compiled JAR on a Databricks cluster
AdlaStep: Runs a U-SQL job in Azure Data Lake Analytics
How do you implement a pipeline?
To create a pipeline, you must first define each step and then create a pipeline that includes the steps. The specific configuration of each step depends on the step type. After defining the steps, you can assign them to a pipeline and run it as an experiment.
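For example, a minimal sketch of a two-step pipeline using the SDK v1 classes named above (the script names, the 'scripts' source directory, and the 'aml-cluster' compute target are hypothetical placeholders):

```python
from azureml.core import Experiment, Workspace
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# Define two steps; each step can target its own compute
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster')
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train_model.py',
                         compute_target='aml-cluster')

# Assign the steps to a pipeline and run it as an experiment
pipeline = Pipeline(workspace=ws, steps=[step1, step2])
experiment = Experiment(workspace=ws, name='training-pipeline')
pipeline_run = experiment.submit(pipeline)
```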
What is a PipelineData object?
The PipelineData object is a special kind of DataReference that:
- References a location in a datastore
- Creates a data dependency between pipeline steps
You can view a PipelineData object as an intermediary store for data that must be passed from one step to a subsequent step.
What is necessary to use a PipelineData object?
- Define a named PipelineData object that references a location in a datastore
- Specify the PipelineData object as an input or output for the steps that use it
- Pass the PipelineData object as a script parameter in steps that run scripts (and include code in those scripts to read or write data)
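A minimal sketch of these three requirements, reusing the hypothetical step and compute names from the earlier example (the scripts are assumed to write and read the folder passed in via the --out_folder and --in_folder arguments):

```python
from azureml.core import Workspace
from azureml.pipeline.core import PipelineData
from azureml.pipeline.steps import PythonScriptStep

ws = Workspace.from_config()

# 1. Define a named PipelineData object referencing a location in a datastore
prepped_data = PipelineData('prepped_data', datastore=ws.get_default_datastore())

# 2. Specify it as an output of the first step and an input of the second,
# 3. and pass it as a script argument so the scripts can write/read the data
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster',
                         arguments=['--out_folder', prepped_data],
                         outputs=[prepped_data])
step2 = PythonScriptStep(name='train model',
                         source_directory='scripts',
                         script_name='train_model.py',
                         compute_target='aml-cluster',
                         arguments=['--in_folder', prepped_data],
                         inputs=[prepped_data])
```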
What is step reuse?
By default, the step output from a previous pipeline run is reused without rerunning the step, provided the script, source directory, and other parameters for the step have not changed.
Step reuse can reduce the time it takes to run a pipeline, but it can lead to stale results when changes to the underlying data sources have not been accounted for.
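A sketch of the two ways to control this in SDK v1, using the same hypothetical names as above:

```python
from azureml.pipeline.steps import PythonScriptStep

# allow_reuse=False forces this step to rerun on every pipeline submission
step1 = PythonScriptStep(name='prepare data',
                         source_directory='scripts',
                         script_name='data_prep.py',
                         compute_target='aml-cluster',
                         allow_reuse=False)
```

You can also force all steps in a pipeline to rerun for a single submission with experiment.submit(pipeline, regenerate_outputs=True).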
Where can you publish pipelines?
After you have created a pipeline, you can publish it to create a REST endpoint through which the pipeline can be run on demand.
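For example, assuming the pipeline object from the earlier sketch:

```python
# Publish the pipeline to create a REST endpoint
published_pipeline = pipeline.publish(name='training_pipeline',
                                      description='Model training pipeline',
                                      version='1.0')

# The endpoint URI through which the pipeline can be initiated on demand
rest_endpoint = published_pipeline.endpoint
```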
How do you initiate a published pipeline?
To initiate a published pipeline, you make an HTTP request to its REST endpoint, passing an authorization header with a token for a service principal with permission to run the pipeline, and a JSON payload specifying the experiment name.
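A sketch of such a request, assuming the rest_endpoint from the previous example and a hypothetical service principal (substitute real credentials):

```python
import requests
from azureml.core.authentication import ServicePrincipalAuthentication

# Hypothetical service principal with permission to run the pipeline
sp_auth = ServicePrincipalAuthentication(tenant_id='<tenant-id>',
                                         service_principal_id='<client-id>',
                                         service_principal_password='<client-secret>')
auth_header = sp_auth.get_authentication_header()

# The JSON payload specifies the experiment name for the run
response = requests.post(rest_endpoint,
                         headers=auth_header,
                         json={'ExperimentName': 'run_training_pipeline'})
run_id = response.json()['Id']
```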
What do you create to use parameters for a pipeline?
To define parameters for a pipeline, create a PipelineParameter object for each parameter, and specify each parameter in at least one step.
You must define parameters for a pipeline before publishing it.
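A minimal sketch, extending the hypothetical training step from earlier with a regularization-rate parameter:

```python
from azureml.pipeline.core.graph import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# A named parameter with a default value, defined before publishing
reg_param = PipelineParameter(name='reg_rate', default_value=0.01)

train_step = PythonScriptStep(name='train model',
                              source_directory='scripts',
                              script_name='train_model.py',
                              compute_target='aml-cluster',
                              arguments=['--reg', reg_param])
```

When initiating the published pipeline, you can then override the default by adding a ParameterAssignments object to the JSON payload, for example {'ExperimentName': 'run_training_pipeline', 'ParameterAssignments': {'reg_rate': 0.1}}.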
What can you do to run a pipeline in periodic intervals?
To schedule a pipeline to run at periodic intervals, you must define a ScheduleRecurrence that determines the run frequency, and use it to create a Schedule.
In addition to running a pipeline at time-based intervals, it is also possible to trigger a pipeline run whenever data changes, by creating a Schedule that monitors a specified path on a datastore.
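A sketch of both options, assuming the published_pipeline from the earlier example and a hypothetical datastore named 'blob_data':

```python
from azureml.core import Datastore, Workspace
from azureml.pipeline.core import Schedule, ScheduleRecurrence

ws = Workspace.from_config()

# Run the published pipeline once a day
daily = ScheduleRecurrence(frequency='Day', interval=1)
daily_schedule = Schedule.create(ws, name='Daily Training',
                                 pipeline_id=published_pipeline.id,
                                 experiment_name='Training_Pipeline',
                                 recurrence=daily)

# Or trigger a run whenever data changes in a monitored datastore path
blob_store = Datastore(workspace=ws, name='blob_data')
reactive_schedule = Schedule.create(ws, name='Reactive Training',
                                    pipeline_id=published_pipeline.id,
                                    experiment_name='Training_Pipeline',
                                    datastore=blob_store,
                                    path_on_datastore='data/training')
```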