Design and prepare a machine learning solution (20–25%) Flashcards
Number of data formats
Three: Tabular/structured, Semi-structured, Unstructured
Tabular/structured Data
Highly ordered data that has a schema. Examples: CSV files, Excel spreadsheets
Semi Structured Data
Data is organized as key-value pairs, like a dictionary. Example: JSON
Unstructured Data
Follows no rules! Images, video, audio, documents
What to do when your data format sucks?
Transform it to something more suitable
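One way to picture such a transformation is flattening semi-structured records into a tabular format. The sketch below uses only the Python standard library; the `records` data and the helper names `flatten` and `to_csv` are made up for illustration and are not part of any Azure API.

```python
import csv
import io

# Hypothetical semi-structured records: keys vary from record to record.
records = [
    {"id": 1, "name": "sensor-a", "reading": {"temp": 21.5, "unit": "C"}},
    {"id": 2, "name": "sensor-b", "reading": {"temp": 19.0}},
]


def flatten(record, parent_key=""):
    """Flatten nested dicts into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}.{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat


def to_csv(records):
    """Write flattened records as CSV text; missing values become empty cells."""
    rows = [flatten(r) for r in records]
    # Collect columns in first-seen order so the resulting schema is stable.
    columns = []
    for row in rows:
        for column in row:
            if column not in columns:
                columns.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(records)
print(csv_text)
```

The second record has no `reading.unit` key, so its cell is left empty: the tabular format forces every row into the same schema.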
Most common ways to store data for model training
Three: Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database
Azure Blob Storage
Can store both structured and unstructured data; cheapest option for storing unstructured data; most basic storage service
Azure Data Lake Storage Gen2
Stores files such as CSVs as unstructured data. Its hierarchical namespace makes it easy to give people access to a specific file or folder. Capacity is virtually limitless
Azure SQL Database
Only stores structured data, like a SQL table. Ideal for data that doesn’t change over time
What does Azure offer for compatible data formats?
Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Cosmos DB Cassandra API, Cosmos DB MongoDB API
Need Semi-structured data with on-demand schema
Cosmos DB SQL API
Azure Synapse Analytics for pipelines
AKA Azure Synapse Pipelines. Build data ingestion pipelines through the UI or define them in JSON. Makes it easy to extract, transform, and load (ETL) data from a source into a data store
Azure Databricks for pipelines
Code-first tool where you can use SQL, Python, or R to define pipelines. Fast because it uses Spark clusters
Azure Machine Learning for Pipelines
Provides auto-scaling compute clusters. Create ETL pipelines in the Designer or from multiple scripts. Not as scalable as Synapse or Databricks
Six steps to train a model
1. Define problem 2. Get data 3. Prepare data 4. Train model 5. Deploy model 6. Monitor model
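The six steps can be sketched end to end as a toy example. This is a minimal illustration in plain Python, not an Azure Machine Learning workflow: the data, the function names, and the `max_error` threshold are all made up, and "deploy" here just means returning a prediction function as a stand-in for a real endpoint.

```python
# Step 1. Define problem: predict a numeric value y from one feature x (regression).


def get_data():
    """Step 2. Get data: hypothetical in-memory rows standing in for a data store."""
    return [(1.0, 2.0), (2.0, 4.0), (None, 5.0), (3.0, 6.0), (4.0, 8.0)]


def prepare_data(rows):
    """Step 3. Prepare data: drop rows with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train_model(rows):
    """Step 4. Train model: fit y = slope * x + intercept by least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    variance = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def deploy_model(model):
    """Step 5. Deploy model: expose a prediction function (stand-in for an endpoint)."""
    slope, intercept = model
    return lambda x: slope * x + intercept


def monitor_model(predict, new_rows, max_error=1.0):
    """Step 6. Monitor model: flag retraining when mean absolute error is too large."""
    error = sum(abs(predict(x) - y) for x, y in new_rows) / len(new_rows)
    return error, error > max_error


predict = deploy_model(train_model(prepare_data(get_data())))
# Hypothetical new observations arriving after deployment.
error, needs_retraining = monitor_model(predict, [(5.0, 10.0), (6.0, 15.0)])
```

When the new observations drift away from what the model learned, the monitoring step reports a higher error, which is the signal to return to the earlier steps and retrain.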