Design and prepare a machine learning solution (20–25%)

Number of data formats

Three: Tabular/structured, Semi-structured, Unstructured

Tabular/structured Data

Highly ordered data that follows a schema, e.g. CSV files or Excel spreadsheets

Semi-structured Data

Data organized as key-value pairs, like a dictionary (e.g. JSON)

Unstructured Data

Follows no fixed schema or rules: images, video, audio, documents

What to do when your data format isn't suitable?

Transform it to something more suitable
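
As a rough illustration of such a transformation, the sketch below flattens semi-structured (nested key-value) records into a tabular CSV using only the Python standard library. The sample records, the dotted column names, and the helper names `flatten` and `records_to_csv` are all hypothetical, chosen for this example rather than taken from any Azure service.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records):
    """Turn a list of nested records into CSV text with a shared header."""
    rows = [flatten(r) for r in records]
    # Build the column list as the union of all keys, in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)  # missing keys are filled with restval
    return buffer.getvalue()


# Hypothetical semi-structured input: records do not share the same keys.
records = [
    {"id": 1, "sensor": {"type": "temp", "value": 21.5}},
    {"id": 2, "sensor": {"type": "humidity"}, "site": "A"},
]
csv_text = records_to_csv(records)
print(csv_text)
```

Fields that a record lacks become empty cells, which is one simple way to impose a schema on data that did not have one.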

Most common ways to store data for model training

Three: Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database

Azure Blob Storage

Can store structured and unstructured data; cheapest option for storing unstructured data; the most basic storage service

Azure Data Lake Storage Gen2

Stores files such as CSVs as unstructured data. The hierarchical namespace makes it easy to grant people access to specific files or folders. Capacity is virtually limitless

Azure SQL Database

Only stores structured data, read as SQL tables. Ideal for data that doesn't change over time

What does Azure offer for compatible data formats?

Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Azure Cosmos DB Cassandra API, Azure Cosmos DB MongoDB API

Need semi-structured data with an on-demand schema

Azure Cosmos DB SQL API

Azure Synapse Analytics for pipelines

Also known as Azure Synapse Pipelines. Build data ingestion pipelines through the UI or by defining them in JSON. Makes it easy to extract, transform, and load (ETL) data from a source into a data store

Azure Databricks for pipelines

Code-first tool where you can use SQL, Python, or R to define pipelines. Fast because it runs on Spark clusters

Azure Machine Learning for Pipelines

Provides auto-scaling compute clusters. Create ETL pipelines in the Designer or from multiple scripts. Not as scalable as Synapse or Databricks

Six steps to train a model

1. Define problem 2. Get data 3. Prepare data 4. Train model 5. Deploy model 6. Monitor model

What questions to ask when defining the problem

Desired model output, what type of model is needed, what criteria make a model successful

Five common ML tasks

Regression, Classification, Forecasting, Computer Vision, NLP

4 services to train ML models

Azure ML studio, Azure Databricks, Azure Synapse Analytics, Azure Cognitive Services

Azure ML studio for training

Can use the UI, Python SDK, or CLI to manage workloads

Azure Databricks for training

Uses Spark compute for efficient processing

Azure Synapse Analytics

Primarily for ETL, but does have ML capabilities. Works well at scale with big data

Azure Cognitive Services for training

Collection of prebuilt ML models for tasks like image recognition. Models can be customized through transfer learning

How to save time and effort with pre-built models

Azure Cognitive Services

Keep ETL and Data Science within same service

Azure Synapse Analytics or Azure Databricks

Need full control of training and management

Azure ML or Azure Databricks

Want to use Python SDK

Azure ML

Want to use a UI

Azure ML
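
The service-selection cards above can be read as a small lookup table. The sketch below encodes them that way and intersects the suggestions when several requirements apply at once. The requirement keys and the function name `suggest_services` are hypothetical labels invented for this example; the mapping itself only restates the cards and is a study aid, not an official decision tool.

```python
# Requirement -> services suggested by the cards above.
RECOMMENDATIONS = {
    "prebuilt_models": {"Azure Cognitive Services"},
    "etl_and_data_science_together": {
        "Azure Synapse Analytics",
        "Azure Databricks",
    },
    "full_control": {"Azure ML", "Azure Databricks"},
    "python_sdk": {"Azure ML"},
    "ui": {"Azure ML"},
}


def suggest_services(requirements):
    """Return the services that satisfy every listed requirement."""
    candidates = None
    for requirement in requirements:
        options = RECOMMENDATIONS[requirement]
        if candidates is None:
            candidates = set(options)
        else:
            candidates &= options  # keep only services that fit all needs
    return sorted(candidates or [])


print(suggest_services(["full_control", "etl_and_data_science_together"]))
print(suggest_services(["python_sdk", "ui"]))
```

An empty result means no single service on these cards covers every requirement, so the needs would have to be split across services.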

CPU vs GPU

CPU is cheaper and sufficient for smaller tasks; GPU costs more but is more powerful for bigger tasks

General purpose or Memory optimized

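General purpose has a balanced CPU-to-memory ratio and suits testing and development with smaller datasets. Use memory optimized for larger datasets or when working in notebooks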

Spark compute

Spark clusters are distributed, so they process work in parallel. Can use GPU and CPU. Databricks and Azure Synapse Analytics use this

Prediction options for endpoints

Real-time or batch

Real-time predictions

Low latency, need the answer NOW! Good for customer-facing services. Works on a single row of input data

Batch predictions

Score new data in batches, save results as file. Good for forecasting or scheduled predictions. Works on multiple rows of data
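
To make the contrast between the two prediction modes concrete, here is a minimal sketch with a stand-in model. The `predict` rule, the function names, and the output file name are all hypothetical; a real endpoint would call a trained model instead. The real-time path returns one answer for one row, while the batch path scores many rows and saves the results as a file.

```python
import csv
import tempfile
from pathlib import Path


def predict(row):
    """Stand-in for a trained model (hypothetical linear rule)."""
    return 2.0 * row["x"] + 1.0


def score_realtime(row):
    """Real-time style: one row in, one prediction back immediately."""
    return predict(row)


def score_batch(rows, output_path):
    """Batch style: score many rows and write the results to a file."""
    with open(output_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["x", "prediction"])
        for row in rows:
            writer.writerow([row["x"], predict(row)])
    return output_path


single = score_realtime({"x": 3.0})
print(single)

out_file = Path(tempfile.gettempdir()) / "batch_predictions.csv"
score_batch([{"x": 0.0}, {"x": 1.0}, {"x": 2.0}], out_file)
print(out_file.read_text())
```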

Compute for real-time predictions

Container services like Azure Container Instances (ACI) or Azure Kubernetes Service (AKS). Compute is always on and always costing money

Compute for batch predictions

Clusters offer scoring in parallel. Compute spins down when not actively working
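
The idea of parallel scoring on a cluster that shuts down afterwards can be sketched locally. In the example below, a thread pool stands in for cluster nodes (an assumption made purely for illustration), the data is split into chunks, each chunk is scored by a worker, and the pool is released once the work is done. The `predict` rule and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def predict(x):
    """Stand-in for a trained model (hypothetical linear rule)."""
    return 2.0 * x + 1.0


def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def score_chunk(rows):
    """Score one chunk; each worker handles one chunk at a time."""
    return [predict(x) for x in rows]


def score_in_parallel(rows, chunk_size=2, workers=3):
    """Score chunks concurrently, then merge results in input order."""
    chunks = chunk(rows, chunk_size)
    # Leaving the `with` block shuts the pool down, loosely mirroring
    # batch compute that spins down when it is not actively working.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(score_chunk, chunks)  # map preserves order
        return [p for part in results for p in part]


predictions = score_in_parallel([0.0, 1.0, 2.0, 3.0, 4.0])
print(predictions)
```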

Design and prepare a machine learning solution (20–25%) Flashcards

(35 cards)