Path2.Mod1.a - Make Data Available Flashcards

1
Q

“S is for abS…”

Datastores; what they are

A

Datastores are abstractions for cloud data sources

2
Q

WS Enc Lay

Datastores; their advantages and why they promote Best Practices

A
  • Attached to Workspaces
  • Securely encapsulate and store connection info, so you can connect to storage services without putting connection details in code (Best Practices: no hardcoded sensitive info).
  • Serve as a protective layer between users and the underlying storage service (users connect to the datastore, not directly to the storage service)
3
Q

BS FS DLG1 DLG2

Four Datastore Types

A
  • Azure Blob Storage
  • Azure File Share
  • Azure Data Lake (Gen 1)
  • Azure Data Lake (Gen 2)
4
Q

“S comes after R…”, MD

Data Assets; what they are and what you get when you create one

A

Data Assets are references to where data is stored (datastores, storage services, public URLs, or locally stored data). Creating one gives you a reference to that data along with a copy of its metadata.

5
Q

S&R AC V IO

Data Assets; four advantages they provide

A
  • Share and reuse data with team members
  • Access data during training without worrying about connection strings or data paths
  • Version the data asset’s metadata
  • Use when executing ML tasks as Azure ML Jobs; assets can be passed as either an input or an output of a Job.
6
Q

UFI UFO MLT

Three Data Asset Types

A
  • URI file: specific files
  • URI folder: points to a folder
  • MLTable: either a folder or file that includes a schema to read as tabular data
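
A minimal sketch of how the three types differ, assuming local paths and only the Python standard library. The helper guess_asset_type is hypothetical and not part of the Azure ML SDK. The returned strings follow the type names used in Azure ML v2 (uri_file, uri_folder, mltable). The sketch covers only the folder case of MLTable, where the folder contains a file named MLTable that holds the schema.

```python
from pathlib import Path


def guess_asset_type(path: str) -> str:
    """Hypothetical helper: guess which data asset type fits a local path."""
    p = Path(path)
    if p.is_dir():
        # A folder holding a file named "MLTable" carries a tabular schema.
        if (p / "MLTable").is_file():
            return "mltable"
        # Any other folder is referenced as a whole.
        return "uri_folder"
    # Anything that is not a directory is treated as one specific file.
    return "uri_file"
```

For example, guess_asset_type("data/train.csv") returns "uri_file" when that path is a file rather than a directory.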
7
Q

CB IB

Two Datastore Authentication Methods

A
  • Credential-Based - Uses a Service Principal, a Shared Access Signature (SAS) Token or an Account Key
  • Identity-Based - Uses your Azure AD (now Microsoft Entra ID) identity or a Managed Identity (e.g. via DefaultAzureCredential() from azure.identity)
8
Q
  • Azure ML Datastores create the underlying Storage Account on creation if the account doesn’t exist (T/F)
  • Datastores are not required when you have access to the underlying data as you can use storage URIs directly (T/F)
  • If you have direct access to your data, then connecting directly to a Storage location during Notebook experimentation is preferred because it avoids unnecessary programmatic overhead (T/F)
A
  • False. A datastore is an abstraction over an existing storage service; it does not create one
  • True
  • True
9
Q

h(s) a(s) a

Usage for Uniform Resource Identifiers (URIs) wrt Data

Three common protocols for URIs, and the one that doesn’t require authentication

A

Used to find and access your data. URIs are pointers to the location of your data

  • http(s) - public or private data in Azure Blob Storage, or data at public http(s) locations
  • abfs(s) - data in Azure Data Lake Gen 2
  • azureml - a datastore (i.e. a registered reference to an existing storage account on Azure). When referring to a datastore, you won’t need to authenticate; the connection info is stored with it and Azure ML uses it automatically.
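
The three protocols can be sketched as plain string builders, using only the Python standard library. The function names and all argument values (account, container, file system, datastore and path names) are hypothetical placeholders; the URI shapes follow the commonly documented Azure patterns for each protocol.

```python
def https_uri(account: str, container: str, path: str) -> str:
    """http(s): direct link to a blob in a storage account container."""
    return f"https://{account}.blob.core.windows.net/{container}/{path}"


def abfss_uri(account: str, file_system: str, path: str) -> str:
    """abfs(s): path inside an Azure Data Lake Gen 2 file system."""
    return f"abfss://{file_system}@{account}.dfs.core.windows.net/{path}"


def azureml_uri(datastore: str, path: str) -> str:
    """azureml: path resolved through a datastore registered in the workspace."""
    return f"azureml://datastores/{datastore}/paths/{path}"
```

Only the azureml form leaves the storage account out of the URI, because the datastore already holds the connection info.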
10
Q
  • Protocol to use when connecting directly to a folder or file, and when to apply authentication
  • Protocol to use when connecting to a datastore where the authentication info and connection are stored
A
  • http(s); authentication is needed when the container is set to private
  • azureml
11
Q

BC FS, OD

The two built-in Workspace Datastore types and how many of each are created by default.
The kind of dataset that adds another datastore when used

A
  • Azure Storage Blob Containers and Azure Storage File Shares. Two of each are created with the workspace (four datastores in total)
  • Another datastore is added if you use Open Datasets
12
Q

The default Datastore for Data Assets, the default one for Notebooks, and their respective prefix

A
  • Data Assets => Blob Container Datastores. Prefixed azureml-blobstore
  • Notebooks => File Share Datastores. Prefixed code-
13
Q

Parquet Files
- What they are
- Why they are better than .csv files for ML tasks
- When reading in, all columns are converted to…

A
  • Parquet files store data in a columnar format, with optimizations that speed up queries
  • csv files are row-oriented, whereas Parquet files are compressed and column-oriented. ML tasks work with Features, which are the columns of tabular data, so they can read and process Parquet data more efficiently
  • Nullable columns, for compatibility reasons (e.g. Schema Merging). This is how Spark reads Parquet files
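
To illustrate the row versus column layout idea only (this is not how Parquet is implemented), here is a small standard-library sketch. The records and the helper to_columns are made up for the example.

```python
# Row-oriented records, as a csv reader would yield them (made-up data).
rows = [
    {"age": 34, "income": 52000, "label": 0},
    {"age": 41, "income": 61000, "label": 1},
    {"age": 29, "income": 48000, "label": 0},
]


def to_columns(records):
    """Pivot row-oriented records into a column-oriented dict of lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: reading one feature means visiting every record.
ages_from_rows = [record["age"] for record in rows]

# Column layout: the same feature is already stored as one list.
ages_from_columns = columns["age"]
```

Both reads return the same values; the difference is that the column layout never touches the income or label data to get them.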