Path2.Mod1.a - Make Data Available Flashcards
“S is for abS…”
Datastores; what they are
Datastores are abstractions for cloud data sources
WS Enc Lay
Datastores; their advantages and why they promote Best Practices
- Attached to Workspaces
- Securely encapsulate and store connection info. This makes it easy to connect to storage services without having to provide connection details in code (Best Practices: No hardcoding sensitive info).
- Serve as a protective layer between users and the underlying storage service (users connect to the datastore, not directly to the storage service)
BS FS DLG1 DLG2
Four Datastore Types
- Azure Blob Storage
- Azure File Share
- Azure Data Lake (Gen 1)
- Azure Data Lake (Gen 2)
“S comes after R…”, MD
Data Assets; what they are and what you get when you create one
Data Assets are references to where data is stored (datastores, storage services, public URLs, locally stored data). When you create a Data Asset, you get a reference to that data along with a copy of its metadata.
S&R AC V IO
Data Assets; four advantages they provide
- share and reuse data with team members
- Access data during training without worrying about connection strings or data path
- Versioning the data asset’s metadata
- Use when executing ML tasks as Azure ML Jobs; assets can be passed as either input or output to a Job.
UFI UFO MLT
Three Data Asset Types
- URI file: specific files
- URI folder: points to a folder
- MLTable: either a folder or file that includes a schema to read as tabular data
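The three types above can be sketched with the Azure ML Python SDK v2 (the `azure-ai-ml` package). This is a minimal sketch, not a definitive recipe: the helper name `build_data_asset`, the asset name, and the example path are hypothetical, and the SDK import is done lazily on the assumption that the package may not be installed where these notes are read.

```python
# The type strings mirror the SDK's AssetTypes constants.
DATA_ASSET_TYPES = ("uri_file", "uri_folder", "mltable")


def build_data_asset(name, path, asset_type, version="1", description=""):
    """Return a Data entity describing the asset (nothing is registered yet).

    Hypothetical helper for illustration; requires the azure-ai-ml package.
    """
    if asset_type not in DATA_ASSET_TYPES:
        raise ValueError(f"asset_type must be one of {DATA_ASSET_TYPES}")

    # Imported lazily so the sketch can be loaded without the SDK installed.
    from azure.ai.ml.entities import Data

    return Data(
        name=name,
        version=version,
        path=path,
        type=asset_type,
        description=description,
    )


# Usage (assumes an authenticated MLClient named ml_client and a
# placeholder path on the default blob datastore):
#   asset = build_data_asset(
#       name="diabetes-file",
#       path="azureml://datastores/workspaceblobstore/paths/data/diabetes.csv",
#       asset_type="uri_file",
#   )
#   ml_client.data.create_or_update(asset)
```

Registering the asset stores the reference and its metadata under a name and version, which is what makes sharing and versioning possible.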
CB IB
Two Datastore Authentication Methods
- Credential-Based - Uses a Service Principal, Shared-Access Signature (SAS) Token or an Account Key
- Identity-Based - Uses an Azure AD Identity or a Managed Identity (DefaultAzureCredential())
- Azure ML Datastores create the underlying Storage Account on creation if the account doesn’t exist (T/F)
- Datastores are not required when you have access to the underlying data as you can use storage URIs directly (T/F)
- If you have direct access to your data, then connecting directly to a Storage location during Notebook experimentation is preferred because it avoids unnecessary programmatic overhead (T/F)
- False. They are an abstraction over it
- True
- True
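As a sketch of the two authentication methods, and of the point that a datastore is only an abstraction over a storage account that already exists, the following uses the Azure ML Python SDK v2 (`azure-ai-ml`). The helper names, the datastore name, and the environment variable are hypothetical; the assumption here is that leaving `credentials` unset gives identity-based access, while supplying an account key gives credential-based access.

```python
def build_blob_datastore(name, account_name, container_name, account_key=None):
    """Return an AzureBlobDatastore entity pointing at an EXISTING container.

    Hypothetical helper for illustration; requires the azure-ai-ml package.
    """
    # Imported lazily so the sketch can be loaded without the SDK installed.
    from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore

    # Credential-based: attach an account key.
    # Identity-based: leave credentials unset.
    credentials = (
        AccountKeyConfiguration(account_key=account_key) if account_key else None
    )
    return AzureBlobDatastore(
        name=name,
        account_name=account_name,
        container_name=container_name,
        credentials=credentials,
    )


def register_datastore(ml_client, datastore):
    """Attach the datastore to the workspace behind ml_client."""
    return ml_client.create_or_update(datastore)


# Usage (assumes an authenticated MLClient named ml_client; the key is read
# from an environment variable rather than hardcoded):
#   import os
#   store = build_blob_datastore(
#       "training_blob", "mystorageaccount", "training-data",
#       account_key=os.environ.get("STORAGE_ACCOUNT_KEY"),
#   )
#   register_datastore(ml_client, store)
```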
h(s) a(s) a
Usage for Uniform Resource Identifiers (URIs) wrt Data
Three common protocols for URIs, and the one that doesn’t require authentication
Used to find and access your data. URIs are pointers to the location of your data
- http(s) - public/private data stores in Blob storage or public locations
- abfs(s) - data stores in Azure Data Lake Gen 2
- azureml - datastore (i.e. an existing storage account on Azure). When referring to a datastore, you won’t need to authenticate; the connection info is stored with it and Azure ML will use it automatically.
- Protocol to use when connecting directly to a folder or file, and when to apply authentication
- Protocol to use when connecting to a datastore where the authentication info and connection are stored
- http(s), when the container is set to private
- azureml
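The three protocols can be illustrated by building one URI of each kind. This is a minimal sketch using only the standard library; the helper names and the account, container, and path values are placeholders, not values from any real workspace.

```python
from urllib.parse import urlparse


def blob_https_uri(account, container, path):
    """Direct URI to a file or folder in Blob storage (http(s) protocol)."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


def adls_gen2_uri(account, filesystem, path):
    """Direct URI to a file or folder in Azure Data Lake Gen 2 (abfs(s) protocol)."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path}"


def datastore_uri(datastore_name, path):
    """URI that goes through a datastore (azureml protocol)."""
    return f"azureml://datastores/{datastore_name}/paths/{path}"


def uri_protocol(uri):
    """Return the protocol (scheme) part of a URI."""
    return urlparse(uri).scheme


examples = [
    blob_https_uri("mystorageaccount", "training-data", "diabetes.csv"),
    adls_gen2_uri("mystorageaccount", "training-data", "diabetes.csv"),
    datastore_uri("workspaceblobstore", "data/diabetes.csv"),
]
for uri in examples:
    print(uri_protocol(uri), "->", uri)
```

Only the last form relies on connection info stored with a datastore; the first two point straight at the storage service.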
BC FS, OD
The two built-in Workspace Datastores and how many are created by default.
The DataSet that creates another datastore if used
- Azure Storage Blob Containers and Azure Storage File Shares. Two of each (four in total) are created with the workspace
- Another datastore is added if you use Open Datasets
The default Datastore for Data Assets, the default one for Notebooks, and their respective prefix
- Data Assets => Blob Container Datastores. Prefixed azureml-blobstore
- Notebooks => File Share Datastores. Prefixed code-
Parquet Files
- What they are
- Why they are better than .csv files for ML tasks
- When reading in, all columns are converted to…
- Parquet files use a columnar file format with optimizations that speed up queries
- Compared to CSV files, which are stored in row format, Parquet files are stored in a compressed column format. Since ML tasks deal with Features, which are the columns of tabular data, ML tasks can process Parquet files more efficiently
- Nullable columns, for compatibility reasons (ex. Schema Merging)
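The row-versus-column point can be illustrated with a toy example in plain Python. This is not Parquet itself, only a sketch of the layout idea: the sample records and function names are made up for illustration.

```python
# Row layout (CSV-like): each record holds every feature.
rows = [
    {"age": 34, "income": 52000.0, "label": 1},
    {"age": 45, "income": 61000.0, "label": 0},
    {"age": 29, "income": 48000.0, "label": 1},
]


def to_columns(records):
    """Pivot row records into a column layout: one list per feature."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_from_rows(records, feature):
    """Visits every record to pull out a single feature."""
    return sum(record[feature] for record in records) / len(records)


def mean_from_columns(columns, feature):
    """Reads only the one column that is needed."""
    values = columns[feature]
    return sum(values) / len(values)


columns = to_columns(rows)
print(mean_from_rows(rows, "age"))
print(mean_from_columns(columns, "age"))
```

Both functions return the same mean, but the column version never touches the other features, which is the access pattern a columnar format is built around.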