Developing ML Solutions Flashcards

1
Q

What are the different stages in the ML Lifecycle?

A

The end-to-end machine learning lifecycle process includes the following phases:
- Business goal identification - define the business objective, what success looks like, the metrics to track, the budget, and the expected value.
- ML problem framing - convert the business problem into an ML problem, and confirm that ML is appropriate for it.
- Data processing (data collection, data preprocessing, and feature engineering) - collect data, convert it into a usable format, and engineer features.
- Model development (training, tuning, and evaluation) - iterative; can be repeated multiple times with additional feature engineering each time.
- Model deployment (inference and prediction)
- Model monitoring
- Model retraining - needed if the model no longer meets business goals or as new data becomes available; keeps the model accurate over time.

2
Q

What is Feature Engineering?

A

It is a step in the data processing phase of the ML lifecycle.

Feature engineering is the process of creating, transforming, extracting, and selecting variables from data.
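
A minimal pandas sketch of those four activities (the orders DataFrame and its columns are made up for illustration):

    import numpy as np
    import pandas as pd

    # Hypothetical raw data.
    orders = pd.DataFrame({
        "order_time": pd.to_datetime(["2024-01-05 09:30", "2024-01-06 22:15"]),
        "price": [20.0, 35.0],
        "quantity": [2, 1],
    })

    orders["total"] = orders["price"] * orders["quantity"]  # creating a variable
    orders["hour"] = orders["order_time"].dt.hour           # extracting a component
    orders["log_price"] = np.log(orders["price"])           # transforming a variable
    features = orders[["total", "hour", "log_price"]]       # selecting variables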

3
Q

In an ML model, what are hyperparameters?

A

Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, they are set manually before training a model. They differ from parameters, which are internal values derived automatically during the learning process rather than set by data scientists.

Examples of hyperparameters include the number of nodes and layers in a neural network and the number of branches in a decision tree. Hyperparameters determine key features such as model architecture, learning rate, and model complexity.
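
A small scikit-learn sketch of the distinction (the model choice and the values are illustrative, not prescriptive):

    from sklearn.neural_network import MLPClassifier

    # Hyperparameters: chosen by hand before training.
    model = MLPClassifier(
        hidden_layer_sizes=(32, 16),  # number of layers and nodes per layer
        learning_rate_init=0.001,     # learning rate
        max_iter=500,
    )

    # Parameters: derived automatically while fitting (toy XOR data).
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    y = [0, 1, 1, 0]
    model.fit(X, y)
    print(model.coefs_)  # learned weights -- parameters, not hyperparameters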

4
Q

How do you prep data for ML model training?

A

Split your data as follows (see the sketch below):
80% of the data to train the model
10% to validate (improve the model with each training iteration)
10% to test (measure final performance on unseen data)
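
A minimal sketch of an 80/10/10 split using scikit-learn (the placeholder X and y stand in for your features and labels):

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(200).reshape(100, 2), np.arange(100)  # placeholder data

    # Hold out 20%, then split that half-and-half into validation and test.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
    print(len(X_train), len(X_val), len(X_test))  # 80 10 10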

5
Q

How can Amazon SageMaker help with the ML Lifecycle?

A

In a single unified visual interface, you can perform the following tasks:
- Collect and prepare data.
- Build and train machine learning models.
- Deploy the models and monitor the performance of their predictions.

6
Q

What is Amazon SM Data Wrangler?

A
  • Low-code/no-code (LCNC) tool.
  • Provides an end-to-end solution to import, prepare, transform, featurize, and analyze data through a web interface.
  • SageMaker also integrates with Amazon EMR and AWS Glue.
  • With the SageMaker Processing API, customers can run scripts and notebooks to transform datasets.
  • This data analysis helps customers decide which features to use and which data to train the model on.
7
Q

What is Amazon SM Feature Store?

A

Helps data scientists, machine learning engineers, and general practitioners to create, share, and manage features for ML development.

8
Q

What are Amazon SM's model training and evaluation features?

A

SM offers features to train models using built-in algorithms (SM training jobs).
SM launches compute instances and uses your training code and data to train the model. Trained model artifacts are stored in an S3 bucket on completion.
SM JumpStart - provides pretrained models.
SM Canvas - LCNC tool for business analysts to build ML models.
SM Experiments - try different combinations of data, algorithms, and parameters and observe the impact on accuracy.
SM Automatic Model Tuning - runs many training jobs with different hyperparameters and measures performance using a chosen metric.
SM can also deploy models into a production environment.
SM Model Monitor - observes the quality of SageMaker ML models in production, with continuous or on-schedule monitoring.
SM Model Registry - central catalog for registering and versioning models.
SM Pipelines - model-building pipelines for end-to-end workflows.

9
Q

What is SM Studio?

A
  • The recommended way to access SM.
  • A web-based interface for developing ML applications: preparing data and training, deploying, and monitoring models.
10
Q

What are the different kinds of ML algorithms that SageMaker provides?

A
  1. Supervised learning (e.g. Regression, Classification, K-Nearest Neighbor)
  2. Unsupervised Learning (e.g. Clustering, Dimensionality Reduction, Embeddings, Anomaly Detection, etc.)
  3. Image Processing (e.g. Image Classification, Object Detection)
  4. Text Analysis
11
Q

How do you evaluate ML Models?

A
  1. Split the data into training, validation, and test sets (80-10-10 rule).
  2. Model fit - is the model overfitted, underfitted, or balanced? (see the sketch below)
  3. Specific metrics:
    3a) For classification problems - these could be accuracy, precision, recall, F1, AUC-ROC
    3b) For regression problems - mean squared error and R squared
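
A rough sketch of the model-fit check in step 2, using a deliberately unconstrained decision tree on toy data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Toy data; an unconstrained tree can memorize the training set.
    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("train accuracy:", model.score(X_train, y_train))  # typically ~1.0
    print("val accuracy:  ", model.score(X_val, y_val))      # noticeably lower => overfitting
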
12
Q

What is the difference between Bias and Variance?

A

Bias - the gap between predicted values and true values; error from overly simple assumptions.
Variance - how dispersed the predictions are; sensitivity to the particular training data.

Analogy: a bull's-eye target - high bias means shots cluster away from the center; high variance means shots scatter widely around it.

An overfitted model has high variance.
An underfitted model has high bias.
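
An illustrative sketch: fitting the same noisy data with a too-simple and a too-flexible model (the polynomial degrees are chosen arbitrarily):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
    y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)

    # Degree 1 is too rigid (high bias, underfits); degree 15 bends to the
    # noise in this particular sample (high variance, overfits).
    for degree in (1, 15):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
        print(degree, model.score(X, y))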

13
Q

What is a confusion matrix?

A

Evaluates model performance by classifying the predictions as:
True Positive
True Negative
False Positive
False Negative

So, a cat predicted as a cat = True Positive
A NOT cat predicted as a cat = False Positive
A NOT cat predicted as NOT cat = True Negative
A cat predicted as a NOT cat = False Negative

TPs and TNs are desirable. You want them to be as high as possible.
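
A minimal sketch with scikit-learn, using made-up cat/not-cat labels:

    from sklearn.metrics import confusion_matrix

    # Made-up data (1 = cat, 0 = NOT cat).
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1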

14
Q

What are Accuracy, Precision, and Recall?

A

These are different ways by which you evaluate model performance

Accuracy - how often the model was right (TP + TN as a proportion of all predictions).

Precision - of everything the model predicted as positive, the share that truly was (TP/(TP+FP)). For example, in email spam detection this matters: you do not want the model labeling a legitimate email as spam and preventing your users from seeing it. Use it when the impact of a FP is high.

Recall - also called sensitivity - the proportion of actual positives the model identifies (TP/(TP+FN)). Think of a model that predicts whether a patient has a terminal illness: you want as few FNs as possible (don't classify someone as OK when they have a terminal illness). Use it when the impact of a FN is high.
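
A worked example with hypothetical counts:

    # Hypothetical counts from a confusion matrix.
    tp, tn, fp, fn = 80, 90, 10, 20

    accuracy = (tp + tn) / (tp + tn + fp + fn)          # 170/200 = 0.85
    precision = tp / (tp + fp)                          # 80/90  ~ 0.89
    recall = tp / (tp + fn)                             # 80/100 = 0.80
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    print(accuracy, precision, recall, f1)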

15
Q

What is AUC-ROC?

A

Area Under the Curve - Receiver Operating Characteristic.
Essentially a probability curve that measures separability: the true positive rate plotted against the false positive rate (e.g. email spam classification).
True positive rate = the percentage of spam you capture.
False positive rate = the negative impact of spam filtering (users unable to see legitimate emails).
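
A minimal sketch with scikit-learn (labels and scores are made up):

    from sklearn.metrics import roc_auc_score

    # Made-up labels (1 = spam) and predicted spam probabilities.
    y_true = [0, 0, 1, 1, 0, 1]
    y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
    print(roc_auc_score(y_true, y_score))  # 1.0 = perfect separability, 0.5 = random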

16
Q

What metrics are used for regression models?

A

Mean squared error = the average of the squared differences between the predictions and the actual outcomes.
R squared = the fraction of variance accounted for by the model; a measure of the model's goodness of fit to the data.
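
A minimal sketch computing both with scikit-learn (the values are made up):

    from sklearn.metrics import mean_squared_error, r2_score

    # Made-up actual vs predicted values.
    y_true = [3.0, 5.0, 7.5, 10.0]
    y_pred = [2.5, 5.0, 8.0, 9.0]
    print(mean_squared_error(y_true, y_pred))  # average squared difference
    print(r2_score(y_true, y_pred))            # 1.0 = perfect fit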

17
Q

How do business metrics play a part in model evaluation?

A

Business metrics define what success is.
For example, a lot of false positives could mean an increase in operational costs.
So, if the business goal is to lower operational costs, you want a model that delivers low false positives.

18
Q

What is model deployment? What are the two types?

A

Integrating the model into a production environment so it can make predictions.

a) Self-hosted API - the customer deploys the model on their own infrastructure (on premises or in the cloud), setting up their own VMs, load balancers, databases, etc. Offers greater control but more operational overhead.
b) Fully managed environment - e.g. SageMaker, which abstracts away the underlying infrastructure. Potentially more expensive, but it simplifies the operational burden.

19
Q

What are SM's deployment capabilities?

A

1) One-click deployment
2) Auto scaling
3) Model hosting services
4) HTTPS endpoints

20
Q

What are the different types of inference that SM supports?

A
  1. Real-time - for workloads that require low-latency, real-time predictions (e.g. chatbots); see the sketch below.
  2. Batch transform - for inference on large datasets; no persistent endpoint is needed.
  3. Asynchronous - queues incoming requests; suited to large payloads that need up to 1 hour of processing time.
  4. On-demand serverless - for workloads with long idle periods between traffic spurts that can tolerate cold starts.
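
A sketch of a real-time inference call with boto3; the endpoint name and CSV payload are hypothetical:

    import boto3

    # Invoke a deployed SageMaker real-time endpoint.
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="my-endpoint",   # hypothetical endpoint name
        ContentType="text/csv",
        Body="5.1,3.5,1.4,0.2",
    )
    print(response["Body"].read())    # the model's prediction
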
21
Q

What is MLOps?

A

Combines people, technology, and processes to deliver collaborative ML solutions.
It is the practice of operationalizing and streamlining the end-to-end machine learning lifecycle from model development and deployment to monitoring and maintenance.
With it, you can improve delivery time, reduce defects, and make data science more productive.

22
Q

What is the Goal of MLOps?

A

Get ML workloads into production and keep them operating.
* Increase the pace of the model development lifecycle through automation.
* Improve quality metrics through testing and monitoring.
* Promote a culture of collaboration between data scientists, data engineers, software engineers, and IT operations.
* Provide transparency, explainability, auditability, and security of the models by using model governance.

23
Q

What are the benefits of MLOps?

A
  • Productivity - provides curated datasets for ML engineers and data scientists.
  • Reliability - CI/CD enables quick deployment with increased quality and consistency.
  • Repeatability - automation ensures a repeatable process.
  • Auditability - by versioning inputs, outputs, data sources, etc., you can track how a model was built.
  • Data and model quality - guards against model bias.
24
Q

Key principles of MLOps

A

1) Version control - track changes to assets such as code and data, with rollbacks to previous versions.
2) Automation - for repeatability, consistency, and scalability across all lifecycle stages, from data ingestion through training and testing.
3) CI/CD - continuously test and deploy assets.
4) Model governance - clear documentation, effective communication channels, and feedback mechanisms help align everyone and improve models over time. Includes checks for fairness, bias, and ethics; manages all aspects of the system for efficiency; and helps with compliance.

25
Q

What are the stages in an MLOps lifecycle?

A
  1. Data Preparation
  2. Model Build
  3. Model Evaluation
  4. Model Selection
  5. Deployment
  6. Monitoring