01-Getting_Started_with_Azure_ML Flashcards
Run Configuration
Defines the Python code execution environment for the script
E.g., sets a Conda environment with some default Python packages installed
# Create a new RunConfig object
from azureml.core.runconfig import RunConfiguration
experiment_run_config = RunConfiguration()
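A Conda environment with the packages the script needs can then be attached to the run configuration, along the lines of this sketch (package choices are an example):
# Add Conda/pip package dependencies to the run configuration
from azureml.core.conda_dependencies import CondaDependencies
packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                    pip_packages=['azureml-defaults'])
experiment_run_config.environment.python.conda_dependencies = packages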
Script Configuration
Identifies the Python script file to be run in the experiment, and the environment in which to run it
# Create a script config
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory=experiment_folder,
                      script='diabetes_experiment.py',
                      run_config=experiment_run_config)
How to Set Up Model Training
- Connect to Your Workspace
- Create folder for experiment files (data + training script)
- Create a Training Script
- Use an Estimator to Run the Script as an Experiment
- Register the Trained Model
Training/Entry Scripts
# Import libraries
# Get the experiment run context
# Load the training data
# Separate features and labels
# Split the data into training and test sets
# Create and train a model
# Score / predict with the model
# Evaluate and log metrics
# Save the model to the outputs folder
# Complete the run (run.complete())
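A minimal sketch of an entry script following this outline (assuming a local diabetes.csv with a 'Diabetic' label column and a simple logistic regression model):
from azureml.core import Run
import pandas as pd
import os
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Get the experiment run context
run = Run.get_context()

# Load the training data
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels, then split into training and test sets
X = diabetes.drop('Diabetic', axis=1).values
y = diabetes['Diabetic'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Create and train a model
model = LogisticRegression(solver='liblinear').fit(X_train, y_train)

# Score the model and log the accuracy metric
acc = accuracy_score(y_test, model.predict(X_test))
run.log('Accuracy', acc)

# Save the model to the outputs folder so it is uploaded with the run
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

# Complete the run
run.complete()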
Estimator
You can run experiment scripts using a RunConfiguration and a ScriptRunConfig, or you can use an Estimator, which abstracts both of these configurations in a single object to run the training experiment.
An estimator runs a training script
Create an estimator
from azureml.train.estimator import Estimator

estimator = Estimator(source_directory=training_folder,
                      entry_script='diabetes_training.py',
                      compute_target='local',
                      conda_packages=['scikit-learn'])
# Create an experiment
from azureml.core import Experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)
Create and Run an Experiment
experiment = Experiment(workspace = ws, name = experiment_name)
# Run the experiment
run = experiment.submit(config=estimator)
RunDetails widget
As with any experiment run, you can use the RunDetails widget to view information about the run and get a link to it in Azure Machine Learning studio
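For example, in a notebook (assuming the azureml-widgets package is installed):
# Show run details in the notebook
from azureml.widgets import RunDetails
RunDetails(run).show()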
Retrieve the metrics and outputs from the Run object.
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)
Output
Regularization Rate 0.01
Accuracy 0.774
AUC 0.8483377282451863
azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/8_azureml.log
outputs/diabetes_model.pkl
Register a Trained Model
Note that the outputs of the experiment include the trained model file (diabetes_model.pkl).
You can register a model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'],
                               'Accuracy': run.get_metrics()['Accuracy']})
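To confirm the registration (and to retrieve models later), the Model class can list registered models with their versions, tags, and properties:
# List registered models in the workspace
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        print('\t', tag_name, ':', model.tags[tag_name])
    for prop_name in model.properties:
        print('\t', prop_name, ':', model.properties[prop_name])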
Create a Parameterized Training Script
You can increase the flexibility of your training experiment by adding parameters to your entry script, enabling you to repeat the same training experiment with different settings
# Set regularization hyperparameter
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg
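The parsed value can then be used when training, for example in this sketch (assuming a scikit-learn logistic regression, where the regularization rate is applied as the inverse of C):
# Use the parsed regularization rate when training the model
# (assumes the X_train / y_train arrays prepared earlier in the script)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1/reg, solver='liblinear').fit(X_train, y_train)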
Use a Framework-Specific Estimator
You used a generic Estimator class to run the training script, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. In this case, you’re using Scikit-Learn, so you can use the SKLearn estimator. This means that you don’t need to specify the scikit-learn package in the configuration.
# Create an estimator
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=training_folder,
                    entry_script='diabetes_training.py',
                    script_params={'--reg_rate': 0.1},
                    compute_target='local')
Working with Data
Data is the foundation on which machine learning models are built. Managing data centrally in the cloud, and making it accessible to teams of data scientists who are running experiments and training models on multiple workstations and compute targets is an important part of any professional data science solution.
Datastore
In Azure ML, datastores are references to storage locations, such as Azure Storage blob containers. Every workspace has a default datastore - usually the Azure storage blob container that was created with the workspace.
If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.
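For example, a blob container can be registered as a datastore and made the default (a sketch; the account, container, and key values are placeholders):
# Register an Azure blob container as a datastore (placeholder values)
from azureml.core import Datastore

blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                  datastore_name='blob_data',
                                                  container_name='data_container',
                                                  account_name='az_store_acct',
                                                  account_key='<storage-account-key>')

# Optionally make it the default datastore for the workspace
ws.set_default_datastore('blob_data')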
You can use local data files to train a model, but when running training workloads automatically on cloud-based compute, it makes more sense to store the data centrally in the cloud and ingest it into the training script wherever it happens to be running.
Upload Data to a Datastore
You can upload files from your local file system to a datastore so that they will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.
default_ds.upload_files(files=['./data/diabetes.csv', './data/diabetes2.csv'],  # Upload the diabetes csv files in /data
                        target_path='diabetes-data/',  # Put it in a folder path in the datastore
                        overwrite=True,  # Replace existing files of the same name
                        show_progress=True)
Train a Model from a Datastore
When you uploaded the files in the code cell above, note that the code returned a data reference.
The data reference can be used to download the contents of the folder to the compute context where the data reference is being used
Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.
The entry script (via Estimator/experiment) will load the training data from the data reference passed to it as a parameter
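For example, a data reference to the uploaded folder can be created from the datastore (a sketch; as_download is used because this experiment runs on local compute):
# Get a data reference to the 'diabetes-data' folder, configured to download its contents
data_ref = default_ds.path('diabetes-data').as_download(path_on_compute='diabetes_data')
print(data_ref)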
# Set up the parameters
script_params = {
    '--regularization': 0.1,   # regularization rate
    '--data-folder': data_ref  # data reference to download files from datastore
}

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local')

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace=ws, name=experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)
Data reference
A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of where the script is being run, so that the script can access data in the datastore location.
The data reference can be used to download the contents of the folder to the compute context where the data reference is being used
Downloading data works well for small volumes of data that will be processed on local compute. When working with remote compute, you can also configure a data reference to mount the datastore location and read data directly from the data source.
Datasets
While you can read data directly from datastores, Azure Machine Learning provides a further abstraction for data in the form of datasets.
A dataset is a versioned reference to a specific set of data that you may want to use in an experiment.
Datasets can be tabular or file-based.
It’s easy to convert a tabular dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.
Create a Tabular Dataset
from azureml.core import Dataset
# Get the default datastore
default_ds = ws.get_default_datastore()

# Create a tabular dataset from the path on the datastore
tab_data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/*.csv'))
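To work with the data using common Python techniques, the tabular dataset can be converted to a Pandas dataframe, e.g. to preview the first rows:
# Preview the first 20 rows of the tabular dataset as a Pandas dataframe
tab_data_set.take(20).to_pandas_dataframe()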
Create a File Dataset
In some machine learning scenarios, you might need to work with data that is unstructured, or you may simply want to handle reading the data from files in your own code. To accomplish this, you can use a file dataset, which creates a list of file paths in a virtual mount point that you can use to read the data in the files.
# Create a file dataset from the path on the datastore
file_data_set = Dataset.File.from_files(path=(default_ds, 'diabetes-data/*.csv'))

# Get the files in the dataset
for file_path in file_data_set.to_path():
    print(file_path)
Register Datasets
You can register datasets to make them easily accessible to any experiment being run in the workspace.
You can view and manage datasets on the Datasets page for your workspace in Azure ML Studio or via code.
# Register the tabular dataset
try:
    tab_data_set = tab_data_set.register(workspace=ws,
                                         name='diabetes dataset',
                                         description='diabetes data',
                                         tags={'format': 'CSV'},
                                         create_new_version=True)
except Exception as ex:
    print(ex)

# Register the file dataset
try:
    file_data_set = file_data_set.register(workspace=ws,
                                           name='diabetes file dataset',
                                           description='diabetes files',
                                           tags={'format': 'CSV'},
                                           create_new_version=True)
except Exception as ex:
    print(ex)

print('Datasets registered')
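Registered datasets can then be listed and retrieved by name (and, optionally, by version), e.g.:
# List the registered datasets in the workspace
from azureml.core import Dataset

for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print(dataset.name, 'version', dataset.version)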
Train a Model from a Tabular Dataset
Now that you have datasets, you’re ready to start training models from them. You can pass datasets to scripts as inputs in the estimator being used to run the script.
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local',
                    # Pass the Dataset object as an input...
                    inputs=[diabetes_ds.as_named_input('diabetes')],
                    pip_packages=['azureml-dataprep[pandas]'])
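Inside the training script, the named input can then be retrieved from the run context and converted to a dataframe (a sketch):
# In the entry script: get the named dataset input and load it as a Pandas dataframe
from azureml.core import Run

run = Run.get_context()
diabetes = run.input_datasets['diabetes'].to_pandas_dataframe()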
Train a Model from a File Dataset
When you’re using a file dataset, the dataset input passed to the script represents a mount point containing file paths. How you read the data from these files depends on the kind of data in the files and what you want to do with it.
You can use the Python glob module to create a list of files in the virtual mount point defined by the dataset, and read them all into Pandas dataframes that are concatenated into a single dataframe.
For large volumes of data, you’d generally use the as_mount method to stream the files directly from the dataset source; but when running on local compute, you need to use the as_download option to download the dataset files to a local folder.
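In the entry script, that might look like this sketch (assuming the download folder path is passed in via a '--data-folder' script parameter parsed into args.data_folder):
# Read all CSV files from the downloaded dataset folder into one dataframe
import os
import glob
import pandas as pd

data_path = args.data_folder  # assumed '--data-folder' script parameter
all_files = glob.glob(os.path.join(data_path, '*.csv'))
diabetes = pd.concat((pd.read_csv(f) for f in all_files), sort=False)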
# Get the training dataset
diabetes_ds = ws.datasets.get("diabetes file dataset")

# Create an estimator
estimator = SKLearn(source_directory=experiment_folder,
                    entry_script='diabetes_training.py',
                    script_params=script_params,
                    compute_target='local',
                    inputs=[diabetes_ds.as_named_input('diabetes').as_download(path_on_compute='diabetes_data')],
                    pip_packages=['azureml-dataprep[pandas]'])
Working with Compute
When you run a script as an Azure Machine Learning experiment, you need to define the execution context for the experiment run. The execution context is made up of:
- The Python environment for the script; the compute running the script needs this environment, with all of the Python packages the script uses installed.
- The compute target on which the script will be run.
This could be the local workstation from which the experiment run is initiated, or a remote compute target such as a training cluster that is provisioned on-demand.
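For example, an on-demand training cluster can be provisioned with the SDK (a sketch; the cluster name and VM size are placeholders):
# Provision an Azure ML compute cluster, or reuse it if it already exists
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "aml-cluster"
try:
    # Use the existing cluster if there is one
    training_cluster = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    # Otherwise create a new one
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS2_V2', max_nodes=2)
    training_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    training_cluster.wait_for_completion(show_output=True)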
Define an Environment (Run Configuration)
When you run a Python script as an experiment in Azure Machine Learning, a Conda environment is automatically created to define the execution context for the script.
Azure Machine Learning provides a default environment that includes many common packages, including the azureml-defaults package that contains the libraries necessary for working with an experiment run, as well as popular packages like pandas and numpy.
You can also define your own environment and add packages by using conda or pip, to ensure your experiment has access to all the libraries it requires.
Example
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
# Create a Python environment for the experiment
diabetes_env = Environment("diabetes-experiment-env")
# Let Azure ML manage dependencies
diabetes_env.python.user_managed_dependencies = False
# Use a docker container
diabetes_env.docker.enabled = True

# Create a set of package dependencies (conda or pip as required)
diabetes_packages = CondaDependencies.create(conda_packages=['scikit-learn'],
                                             pip_packages=['azureml-defaults', 'azureml-dataprep[pandas]'])

# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_packages

print(diabetes_env.name, 'defined.')

# Register the environment
diabetes_env.register(workspace=ws)
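Once registered, the environment can be retrieved by name for later experiments (a short sketch):
# Retrieve a registered environment from the workspace by name
from azureml.core import Environment

registered_env = Environment.get(workspace=ws, name='diabetes-experiment-env')
print(registered_env.name)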
Use in Estimator:
# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
                      inputs=[diabetes_ds.as_named_input('diabetes')],
                      script_params=script_params,
                      compute_target='local',
                      environment_definition=diabetes_env,
                      entry_script='diabetes_training.py')

# Create an experiment
experiment = Experiment(workspace=ws, name='diabetes-training')

# Run the experiment
run = experiment.submit(config=estimator)