Design and prepare a machine learning solution (20–25%) Flashcards

1
Q

number of data formats

A

Three: Tabular/structured, Semi-structured, Unstructured

2
Q

Tabular/structured Data

A

Highly ordered data that has a schema. Examples: CSV, Excel.

3
Q

Semi-structured Data

A

Data is organized as key-value pairs, like a dictionary

4
Q

Unstructured Data

A

Follows no rules! Images, video, audio, documents

5
Q

What to do when your data format sucks?

A

Transform it to something more suitable

6
Q

Most common ways to store data for model training

A

Three: Azure Blob Storage, Azure Data Lake Storage, Azure SQL Database

7
Q

Azure Blob Storage

A

Can store structured and unstructured data; cheapest option for storing unstructured data; the most basic storage option
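
For illustration, a minimal sketch of uploading a training file with the azure-storage-blob Python package; the connection string, container name, and file name below are placeholders.

```python
# Sketch: upload a local training file to Azure Blob Storage.
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("training-data")

# Works for structured (CSV) or unstructured (image, audio) files alike.
with open("train.csv", "rb") as data:
    container.upload_blob(name="train.csv", data=data, overwrite=True)
```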

8
Q

Azure Data Lake Storage Gen2

A

Stores files such as CSVs as unstructured data; the hierarchical namespace makes it easy to grant people access to specific items; capacity is effectively limitless

9
Q

Azure SQL Database

A

Stores only structured data, as SQL tables. Ideal for data that doesn’t change over time

10
Q

What does Azure offer for compatible data formats

A

Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Azure Cosmos DB Cassandra API, Azure Cosmos DB MongoDB API

11
Q

Need Semi-structured data with on-demand schema

A

Azure Cosmos DB SQL API
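
A hedged sketch of what "semi-structured with on-demand schema" looks like in practice, using the azure-cosmos Python package; the account endpoint, key, database/container names, and item fields are all placeholders.

```python
# Sketch: store a semi-structured (JSON / key-value) item in Cosmos DB via the SQL API.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com", credential="<account-key>")
container = client.get_database_client("ml-metadata").get_container_client("experiments")

# No schema has to be defined up front; each item just carries its own keys.
container.upsert_item({
    "id": "exp-001",
    "model": "classifier",
    "metrics": {"accuracy": 0.91, "auc": 0.95},
})
```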

12
Q

Azure Synapse Analytics for pipelines

A

AKA Azure Synapse Pipelines. Data ingestion pipelines can be built through the UI or defined in JSON. Makes it easy to ETL data from a source into a data store

13
Q

Azure Databricks for pipelines

A

Code-first tool where you can use SQL, Python, or R to define pipelines. Fast because it runs on Spark clusters
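
To make "code-first on Spark clusters" concrete, a rough PySpark sketch of the kind of cell you might run in a Databricks notebook; the paths and column names are made-up placeholders.

```python
# Sketch: a code-first ETL step running on a Spark cluster (e.g., a Databricks notebook cell).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # Databricks supplies this session automatically

# Read raw data, clean it, and write it back out, distributed across the cluster.
raw = spark.read.csv("/mnt/raw/sales.csv", header=True, inferSchema=True)
cleaned = (
    raw.dropna(subset=["amount"])
       .withColumn("amount", F.col("amount").cast("double"))
)
cleaned.write.mode("overwrite").parquet("/mnt/prepared/sales")
```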

14
Q

Azure Machine Learning for Pipelines

A

Provides auto-scaling compute clusters. Create ETL pipelines in the Designer or from multiple scripts. Not as scalable as Synapse or Databricks
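
For the script-based route, a hedged sketch using the Azure ML Python SDK v2 (azure-ai-ml); the workspace details, compute name, environment, and script names are assumptions, not values from the card, and a real pipeline would also wire the prep step's outputs into the training step.

```python
# Sketch: an Azure ML pipeline built from two scripts with the SDK v2.
from azure.ai.ml import MLClient, command, dsl
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>")

# Each step is a command job wrapping one script on an auto-scaling compute cluster.
prep_step = command(code="./src", command="python prep.py",
                    environment="<environment-name>@latest", compute="cpu-cluster")
train_step = command(code="./src", command="python train.py",
                     environment="<environment-name>@latest", compute="cpu-cluster")

@dsl.pipeline(description="Prepare data, then train a model")
def etl_and_train():
    prep_step()
    train_step()

ml_client.jobs.create_or_update(etl_and_train())
```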

15
Q

Six steps to train a model

A
1. Define problem
2. Get data
3. Prepare data
4. Train model
5. Deploy model
6. Monitor model

16
Q

What questions to ask when defining the problem

A

What is the desired model output? What type of model is needed? What criteria make the model successful?

17
Q

Five common ML tasks

A

Regression, Classification, Forecasting, Computer Vision, NLP

18
Q

Four services to train ML models

A

Azure ML studio, Azure Databricks, Azure Synapse Analytics, Azure Cognitive Services

19
Q

Azure ML studio for training

A

Can use the UI, Python SDK, or CLI to manage workloads
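
A minimal sketch of the Python SDK route (SDK v2, azure-ai-ml): connect to the workspace and submit a training script as a command job. The subscription, resource group, workspace, compute, and environment names are placeholders.

```python
# Sketch: submit a training script to Azure ML with the Python SDK v2.
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                              # folder containing train.py
    command="python train.py",
    environment="<environment-name>@latest",
    compute="cpu-cluster",
)

returned_job = ml_client.jobs.create_or_update(job)
print(returned_job.studio_url)                 # follow the same run in the studio UI
```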

20
Q

Azure Databricks for training

A

Uses Spark compute for efficient processing

21
Q

Azure Synapse Analytics for training

A

Primarily for ETL, but does have ML capabilities. Works well at scale with big data

22
Q

Azure Cognitive Services for training

A

Collection of prebuilt ML models for tasks like image recognition. Models can be customized through transfer learning
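
As a hedged example of "prebuilt model, no training required", a REST call to the Computer Vision image-analysis API; the resource endpoint, key, and image URL are placeholders, and the API version shown (v3.2) is an assumption about what your resource exposes.

```python
# Sketch: call a prebuilt Cognitive Services model (Computer Vision) over REST.
import requests

endpoint = "https://<your-resource>.cognitiveservices.azure.com"
key = "<your-key>"

response = requests.post(
    f"{endpoint}/vision/v3.2/analyze",
    params={"visualFeatures": "Tags,Description"},
    headers={"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"},
    json={"url": "https://example.com/some-image.jpg"},
)
print(response.json())   # tags and a caption produced by the prebuilt model
```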

23
Q

How to save time and effort with pre-built models

A

Azure Cognitive Services

24
Q

Keep ETL and Data Science within same service

A

Azure Synapse Analytics or Azure Databricks

25
Q

Need full control of training and management

A

Azure ML or Azure Databricks

26
Q

Want to use Python SDK

A

Azure ML

27
Q

Want to use a UI

A

Azure ML

28
Q

CPU vs GPU

A

CPU is cheaper and sufficient for smaller workloads; GPU costs more but is much faster for larger workloads such as deep learning

29
Q

General purpose or Memory optimized

A

General purpose is the default; use memory optimized for larger datasets or when developing in notebooks
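
The CPU/GPU and general-purpose/memory-optimized decision ultimately shows up as the VM size you pick for your compute. A hedged sketch with the Azure ML SDK v2; the cluster name, workspace details, and the specific VM sizes named in the comments are examples, not recommendations.

```python
# Sketch: provision an Azure ML compute cluster; the VM size encodes the hardware choice.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>")

cluster = AmlCompute(
    name="train-cluster",
    size="STANDARD_DS3_V2",   # general-purpose CPU; an E-series size would be memory optimized,
                              # an NC-series size would add GPUs
    min_instances=0,          # scale to zero when idle
    max_instances=4,
)
ml_client.compute.begin_create_or_update(cluster)
```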

30
Q

Spark compute

A

Spark clusters are distributed, so they work in parallel. Can use GPUs and CPUs. Databricks and Azure Synapse Analytics use this

31
Q

Prediction options for endpoints

A

Real-time or batch

32
Q

Real-time predictions

A

Low latency, you need the answer NOW! Good for customer-facing services. Works on a single row of input data

33
Q

batch predictions

A

Score new data in batches and save the results to a file. Good for forecasting or scheduled predictions. Works on multiple rows of data

34
Q

Compute for real-time predictions

A

Container services like Azure Container Instances (ACI) or Azure Kubernetes Service (AKS). Compute is always on and always costing money
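
To make "single row, low latency" concrete, a hedged sketch of scoring one record against an always-on real-time endpoint; the scoring URI, key, and input schema are placeholders that depend entirely on how the endpoint was deployed.

```python
# Sketch: send one row to a real-time (online) endpoint and get the prediction back immediately.
import requests

scoring_uri = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
headers = {
    "Authorization": "Bearer <endpoint-key>",
    "Content-Type": "application/json",
}

single_row = {"data": [{"age": 42, "income": 55000, "tenure_months": 18}]}  # illustrative schema

response = requests.post(scoring_uri, headers=headers, json=single_row)
print(response.json())
```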

35
Q

Compute for batch predictions

A

Clusters offer scoring in parallel. Compute spins down when not actively working
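
A hedged sketch of kicking off batch scoring through an Azure ML batch endpoint with the SDK v2; it assumes a batch endpoint and deployment backed by a compute cluster already exist, and the endpoint name, datastore path, and workspace details are placeholders.

```python
# Sketch: trigger a batch scoring job; the cluster scores rows in parallel and spins down afterwards.
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace>")

job = ml_client.batch_endpoints.invoke(
    endpoint_name="score-batch",
    input=Input(type="uri_folder", path="azureml://datastores/<datastore>/paths/new-data/"),
)
ml_client.jobs.stream(job.name)   # results are written to the job's output files
```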