Path2.Mod1.c - Make Data Available - Creating Data Assets Flashcards
L BS DLG2 Ds (mnemonic: Local, Blob Storage, Data Lake Gen 2, Datastore)
Creating a Data Asset: URI File supported paths
- Local: ./<path to file>
- Blob Storage: wasbs://<account>.blob.core.windows.net/<container>/<folder>/<file>
- Data Lake Gen 2 storage: abfss://<file_system>@<account>.dfs.core.windows.net/<folder>/<file>
- Datastore: azureml://datastores/<name>/paths/<folder>/<file>
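For illustration, a minimal sketch of one of these paths plugged into the SDK code shown in the cards below; the datastore, folder, file, and asset names are hypothetical, and ml_client is assumed to be an authenticated MLClient:

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Hypothetical datastore (blob_training_data), folder, and file names, using the
# azureml://datastores/... format from the list above.
# ml_client is assumed to be an authenticated MLClient, as in the other cards.
my_data = Data(
    path='azureml://datastores/blob_training_data/paths/data-asset-path/diabetes.csv',
    type=AssetTypes.URI_FILE,
    description="Data asset pointing to a file on a datastore",
    name="diabetes-datastore-path",
    version="1"
)

ml_client.data.create_or_update(my_data)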
Behavior when creating a Local Data Asset
A copy of the local data is uploaded to the default datastore (workspaceblobstore), under a LocalUpload folder, so the asset stays available even when the local device is unavailable.
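A rough sketch of how to check where that copy ends up, assuming an authenticated ml_client and a hypothetical local file ./data/diabetes.csv; the entity returned by create_or_update should expose the uploaded cloud path:

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Hypothetical local file; ml_client is assumed to be an authenticated MLClient.
local_data = Data(
    path='./data/diabetes.csv',
    type=AssetTypes.URI_FILE,
    description="Data asset created from a local file",
    name="diabetes-local",
    version="1"
)

registered = ml_client.data.create_or_update(local_data)

# After the upload, the returned entity's path should point at the copy on
# workspaceblobstore (under a LocalUpload folder) rather than the local device.
print(registered.path)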
The context for using an MLTable Data Asset
When the schema of your data is complex or frequently changes.
For MLTable Data Assets, you specify the schema definition for reading the data. So instead of changing how to read the data for each script that uses it, you only change the schema stored in the Data Asset itself.
(T/F):
- Certain Azure ML features like Automated ML require an MLTable Data Asset to understand how to read its data
- MLTable Schemas are stored in an Azure Blob, then pulled in by your job via parameter input
- True
- False. You store the MLTable file in the same folder as the data you’re reading.
Describe what this code is doing:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path='<supported-path>',
    type=AssetTypes.URI_FILE,
    description="<description>",
    name="<name>",
    version="<version>"
)

ml_client.data.create_or_update(my_data)
Creates a URI_FILE Data Asset (the type parameter). <supported-path> is a placeholder for one of the supported paths (for example, a local device path).
Describe three things that this code is doing and give an alternative for when the input is in JSON:
import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()

df = pd.read_csv(args.input_data)
print(df.head(10))
- Uses argparse to create an input parameter called --input_data
- When submitting your job, set --input_data to your URI_FILE data asset (the job side of this wiring is sketched just after this card)
- Assuming a .csv file, it is then read into memory via pd.read_csv
- If your data is in JSON, use pd.read_json() instead
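The cards only show the script side of this wiring; a minimal sketch of the job side, assuming the script is saved as read_data.py under ./src and using the same placeholder style as the other cards for the asset, environment, and compute names:

from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes

# ml_client is assumed to be an authenticated MLClient; the placeholders
# follow the convention used elsewhere in these cards.
job = command(
    code="./src",  # folder containing read_data.py
    command="python read_data.py --input_data ${{inputs.input_data}}",
    inputs={
        "input_data": Input(type=AssetTypes.URI_FILE, path="azureml:<name>:<version>")
    },
    environment="<environment-name>",
    compute="<compute-name>",
    display_name="read-uri-file-data",
    experiment_name="read-uri-file-data"
)

returned_job = ml_client.jobs.create_or_update(job)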
Describe what this code is doing:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path='<supported-path>',
    type=AssetTypes.URI_FOLDER,
    description="<description>",
    name="<name>",
    version='<version>'
)

ml_client.data.create_or_update(my_data)
- Creates a URI_FOLDER Data Asset (the type parameter)
- <supported-path> is a placeholder for one of the supported paths, here pointing to a folder rather than a single file (for example, a local folder on your device)
Describe what this code is doing:
import argparse
import glob
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()

data_path = args.input_data
all_files = glob.glob(data_path + "/*.csv")
df = pd.concat((pd.read_csv(f) for f in all_files), sort=False)
- Uses argparse to create an input parameter --input_data
- When submitting your job, set --input_data to your URI_FOLDER data asset
- Uses glob to collect all the .csv files under that folder path into a list
- A generator expression reads each file with pd.read_csv, and pd.concat combines them into a single Pandas DataFrame
Describe what this code is doing:
type: mltable

paths:
  - pattern: ./*.txt

transformations:
  - read_delimited:
      delimiter: ','
      encoding: ascii
      header: all_files_same_headers
The MLTable file (YAML stored alongside the data) that defines how to read it: every .txt file in the current folder is read as a comma-delimited file with ASCII encoding, and all files share the same header.
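The same definition can also be authored from Python with the mltable package instead of hand-writing the YAML; a rough sketch, where the exact parameter names (paths, delimiter, encoding, header) are assumptions about the package's from_delimited_files helper:

import mltable

# Mirror the YAML above: every .txt file in the current folder, read as
# comma-delimited ASCII where all files share the same header.
tbl = mltable.from_delimited_files(
    paths=[{"pattern": "./*.txt"}],
    delimiter=",",
    encoding="ascii",
    header="all_files_same_headers"
)

# Writes the MLTable file into the given folder, next to the data it describes.
tbl.save("./")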
Describe what this code is doing:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

my_data = Data(
    path='<path-including-mltable-file>',
    type=AssetTypes.MLTABLE,
    description="<description>",
    name="<name>",
    version='<version>'
)

ml_client.data.create_or_update(my_data)
- Creates an MLTABLE Data Asset (the type parameter)
- <path-including-mltable-file> is a placeholder for the path to the folder that contains the MLTable file (for example, a local folder)
Describe what this code is doing:
import argparse
import mltable
import pandas

parser = argparse.ArgumentParser()
parser.add_argument("--input_data", type=str)
args = parser.parse_args()

tbl = mltable.load(args.input_data)
df = tbl.to_pandas_dataframe()
print(df.head(10))
- Uses argparse to create an input parameter called --input_data
- When submitting your job, set --input_data to your MLTable data asset (see the input sketch after this card)
- Loads the data through mltable.load, then converts it to a Pandas DataFrame (a common conversion approach) and prints the first 10 rows
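Wiring this script into a job looks like the URI_FILE example sketched earlier; only the input type changes. A minimal sketch of just the input definition, with the registered asset name left as a placeholder:

from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes

# Passed as inputs={"input_data": ...} in the command(...) call from the
# earlier sketch; the script's --input_data argument then receives the
# location that mltable.load reads from.
job_input = Input(type=AssetTypes.MLTABLE, path="azureml:<name>:<version>")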