Design and prepare a machine learning solution (20–25%) Flashcards
Number of data formats
Three: Tabular/structured, Semi-structured, Unstructured
Tabular/structured Data
Highly ordered data that follows a schema. Examples: CSV files, Excel spreadsheets
Semi-structured Data
Data organized as key-value pairs, like a dictionary. Example: JSON
Unstructured Data
Follows no rules! Images, video, audio, documents
What to do when your data format sucks?
Transform it to something more suitable
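Example note: a minimal sketch of one such transformation, assuming the source is semi-structured JSON and the target is tabular CSV. The sample records and the `to_tabular` helper are hypothetical, made up for illustration; it uses only the Python standard library.

```python
import csv
import io
import json

# Hypothetical semi-structured input: the records do not share the same keys.
raw = '[{"id": 1, "name": "sensor-a", "temp": 21.5}, {"id": 2, "name": "sensor-b", "humidity": 40}]'

def to_tabular(json_text):
    """Flatten a JSON array of objects into CSV text with one fixed schema."""
    records = json.loads(json_text)
    # The schema is the union of all keys, in first-seen order.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills in the cells for keys a record does not have.
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return columns, buffer.getvalue()

columns, csv_text = to_tabular(raw)
print(csv_text)
```

Every row ends up with the same columns, and missing values become empty cells, which is what turns the key-value records into a table.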
Most common ways to store data for model training
Three: Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database
Azure Blob Storage
Can store both structured and unstructured data; cheapest option for storing unstructured data; the most basic storage option
Azure Data Lake Storage Gen2
Stores files such as CSVs and images as unstructured data. The hierarchical namespace makes it easy to grant people access to specific files or folders. Capacity is virtually limitless
Azure SQL Database
Stores only structured data, read and written as rows and columns in SQL tables. Ideal for data that doesn't change over time
What does Azure offer for data already in a specific database format?
Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Azure Cosmos DB Cassandra API, Azure Cosmos DB MongoDB API
Need semi-structured data with an on-demand schema
Azure Cosmos DB SQL API
Azure Synapse Analytics for pipelines
AKA Azure Synapse Pipelines. Build data ingestion pipelines through the UI or by defining them in JSON. Makes it easy to extract, transform, and load (ETL) data from a source into a data store
Azure Databricks for pipelines
Code-first tool where you can use SQL, Python, or R to define pipelines. Fast because it runs on Spark clusters
Azure Machine Learning for Pipelines
Provides auto-scaling compute clusters. Create ETL pipelines in the Designer or from multiple scripts. Not as scalable as Synapse or Databricks
Six steps to train a model
1. Define problem 2. Get data 3. Prepare data 4. Train model 5. Deploy model 6. Monitor model
What questions to ask when defining the problem
Desired model output, what type of model is needed, what criteria make a model successful
Five common ML tasks
Regression, Classification, Forecasting, Computer Vision, NLP
4 services to train ML models
Azure Machine Learning, Azure Databricks, Azure Synapse Analytics, Azure Cognitive Services
Azure Machine Learning for training
Can use the studio UI, the Python SDK, or the CLI to manage workloads
Azure Databricks for training
Uses Spark compute for efficient processing
Azure Synapse Analytics
Primarily for ETL, but does have ML capabilities. Works well at scale with big data
Azure Cognitive Services for training
Collection of prebuilt ML models for tasks like image recognition. Models can be customized through transfer learning
How to save time and effort with pre-built models
Azure Cognitive Services
Keep ETL and Data Science within same service
Azure Synapse Analytics or Azure Databricks
Need full control of training and management
Azure ML or Azure Databricks
Want to use Python SDK
Azure ML
Want to use a UI
Azure ML
CPU vs GPU
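CPU is cheaper and sufficient for smaller tabular datasets. GPU costs more but is more powerful, so use it for unstructured data like images or text, or for larger tabular datasets
General purpose or Memory optimized
General purpose has a balanced CPU-to-memory ratio and suits testing and development with smaller datasets. Memory optimized has a high memory-to-CPU ratio; use it for larger datasets or when working in notebooks
Spark compute
Spark clusters are distributed, so work runs in parallel across nodes. Can use GPU and CPU. Azure Databricks and Azure Synapse Analytics use this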
Prediction options for endpoints
Real-time or batch
Real-time predictions
Low latency, need the answer NOW! Good for customer-facing services. Typically scores a single row of input data per request
Batch predictions
Score new data in batches, save results as file. Good for forecasting or scheduled predictions. Works on multiple rows of data
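Example note: a minimal sketch contrasting the two scoring styles. The `predict` rule is a hypothetical stand-in for a trained model, and the function names are made up for illustration; this is not how an Azure endpoint is implemented.

```python
import csv
import io

def predict(row):
    """Hypothetical stand-in model: flags a row when amount exceeds 100."""
    return 1 if row["amount"] > 100 else 0

def score_realtime(row):
    """Real-time style: one row in, one prediction returned immediately."""
    return predict(row)

def score_batch(rows):
    """Batch style: score many rows and return the results as CSV text to save as a file."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["amount", "prediction"])
    for row in rows:
        writer.writerow([row["amount"], predict(row)])
    return buffer.getvalue()

single = score_realtime({"amount": 250})
batch_csv = score_batch([{"amount": 50}, {"amount": 150}, {"amount": 100}])
print(single)
print(batch_csv)
```

The same model sits behind both functions; what changes is how many rows arrive per call and whether the caller waits for a value or picks up a results file later.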
Compute for real-time predictions
Container services like Azure Container Instances (ACI) or Azure Kubernetes Service (AKS). Compute is always on, so it costs money even when idle
Compute for batch predictions
Clusters offer scoring in parallel. Compute spins down when not actively working
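Example note: a rough arithmetic sketch of why always-on compute costs more than compute that spins down. The hourly rate and run times are hypothetical numbers chosen for illustration, not real Azure prices, and the sketch assumes both options bill at the same rate, which may not hold in practice.

```python
HOURS_PER_DAY = 24

def always_on_cost(hourly_rate, days):
    """Real-time style compute: billed for every hour, busy or idle."""
    return hourly_rate * HOURS_PER_DAY * days

def batch_cost(hourly_rate, run_hours_per_day, days):
    """Batch style compute: billed only for the hours the cluster is running."""
    return hourly_rate * run_hours_per_day * days

rate = 0.50  # hypothetical price per compute hour, not a real Azure rate
realtime_total = always_on_cost(rate, days=30)
batch_total = batch_cost(rate, run_hours_per_day=2, days=30)
print(realtime_total, batch_total)
```

Under these assumed numbers, the always-on option is billed for 720 hours a month and the batch cluster for 60.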