Developing ML Solutions Flashcards
What are the different stages in the ML Lifecycle?
The end-to-end machine learning lifecycle process includes the following phases:
- Business goal identification - what’s the business objective, what does success look like, what are the metrics? Budget? Value?
- ML problem framing - convert the business problem into an ML problem. Is ML appropriate for this business problem?
- Data processing (data collection, data preprocessing, and feature engineering) - collect data, convert it into a usable format, and engineer features.
- Model development (training, tuning, and evaluation) - iterative, can be performed multiple times with additional feature engineering each time.
- Model deployment (inference and prediction)
- Model monitoring
- Model retraining - needed if it does not meet business goals or as new data becomes available. Needed to ensure model remains accurate over time.
What is Feature Engineering?
It is a step in the data processing phase of the ML lifecycle.
Feature engineering is the process of creating, transforming, extracting, and selecting variables from data.
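As an illustration (the dataset and column names below are invented), a minimal pandas sketch of the create/transform/select steps:

```python
# A minimal sketch of feature engineering with pandas
# (columns and values are hypothetical, for illustration only).
import pandas as pd

df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-05", "2023-03-20"]),
    "last_login": pd.to_datetime(["2023-06-01", "2023-06-10"]),
    "plan": ["free", "pro"],
})

# Create: derive a new variable from existing ones.
df["days_active"] = (df["last_login"] - df["signup_date"]).dt.days

# Transform: encode a categorical variable as numeric columns.
df = pd.get_dummies(df, columns=["plan"])

# Select: keep only the engineered features for training.
features = df[["days_active", "plan_free", "plan_pro"]]
```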
In a ML model, what are hyperparameters?
Hyperparameters are external configuration variables that data scientists use to manage machine learning model training. Sometimes called model hyperparameters, they are set manually before training a model. They're different from parameters, which are internal values derived automatically during the learning process rather than set by data scientists.
Examples of hyperparameters include the number of nodes and layers in a neural network and the number of branches in a decision tree. Hyperparameters determine key features such as model architecture, learning rate, and model complexity.
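A small scikit-learn sketch contrasting the two, using a decision tree:

```python
# Hyperparameters vs. learned parameters, illustrated with
# scikit-learn's DecisionTreeClassifier.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: set manually BEFORE training (tree depth, split criterion).
model = DecisionTreeClassifier(max_depth=3, criterion="gini")

# Parameters: the split thresholds in the fitted tree are derived
# automatically from the data during fit().
model.fit(X, y)
print(model.get_depth())       # bounded by the max_depth hyperparameter
print(model.tree_.node_count)  # structure learned during training
```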
How do you prep data for ML model training?
Split your data as follows:
80% of data to train the model
10% to validate (improve the model with each training iteration)
10% to test.
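A minimal scikit-learn sketch, assuming a feature matrix X and labels y are already loaded:

```python
# One way to get an 80/10/10 split: hold out 20% first,
# then split the hold-out in half into validation and test.
from sklearn.model_selection import train_test_split

X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42)
# X_train is 80% of the data; X_val and X_test are 10% each.
```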
How can Amazon SageMaker help with the ML Lifecycle?
In a single unified visual interface, you can perform the following tasks:
- Collect and prepare data.
- Build and train machine learning models.
- Deploy the models and monitor the performance of their predictions.
What is Amazon SM Data Wrangler?
- A low-code/no-code (LCNC) tool.
- It provides an end-to-end solution to import, prepare, transform, featurize, and analyze data by using a web interface.
- SageMaker also integrates with Amazon EMR and AWS Glue.
- With the SageMaker Processing API, customers can run scripts and notebooks to transform datasets.
- This data analysis helps customers decide which features to define for the model and which data to train it on.
What is Amazon SM Feature Store?
Helps data scientists, machine learning engineers, and general practitioners to create, share, and manage features for ML development.
What are Amazon SM model training and evaluation features?
SM offers features to train models using built-in algorithms (via an SM Training Job).
SM launches compute instances and uses your training code and data to train the model; the trained model is stored in an S3 bucket once training completes (see the sketch after this list).
SM JumpStart - provides pretrained models.
SM Canvas - LCNC for business analysts to build ML Models.
SM Experiments - experiment with different combinations of data, algorithms, and parameters to observe impact on accuracy.
SM Model Tuning - runs many versions with different hyperparameters and measures performance using a metric
SM can also deploy the models in a production environment.
SM Model Monitor - observe the quality of SageMaker ML models in production; continuous or on-schedule monitoring
SM Model Registry - catalog and manage model versions.
SM Pipelines - model-building pipelines for end-to-end ML workflows.
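A hedged sketch of an SM Training Job using the SageMaker Python SDK with the built-in XGBoost algorithm - the role ARN, bucket paths, and version string below are placeholders, not verified values:

```python
# Sketch: launch a SageMaker training job with a built-in algorithm.
# Placeholders: role ARN, S3 paths; check current AWS docs for versions.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role

image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.7-1")

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/models/",  # trained model artifact lands here
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# SageMaker launches the compute, runs the algorithm against the data,
# and writes the trained model to the S3 output path.
estimator.fit({"train": "s3://my-bucket/data/train/"})
```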
What is SM Studio?
- Best way to access SM.
- Web-based interface for developing ML applications: preparing data and training, deploying, and monitoring models.
What are the different kinds of ML algorithms that SageMaker provides?
- Supervised learning (e.g. Regression, Classification, K-Nearest Neighbor)
- Unsupervised Learning (e.g. Clustering, Dimensionality Reduction, Embeddings, Anomaly Detection, etc.)
- Image Processing (e.g. Image Classification, Object Detection)
- Text Analysis
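As an example of the supervised-learning category above, a k-nearest-neighbors classifier sketched in scikit-learn (a stand-in for SM's built-in k-NN algorithm, not the SageMaker implementation itself):

```python
# Minimal supervised-learning sketch: k-nearest neighbors classification.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)  # k is a hyperparameter
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))  # classification accuracy on held-out data
```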
How do you evaluate ML Models?
- Split the data into Training, Validation and Test sets (80-10-10 rule).
- Model fit - e.g. is it overfitted, underfitted, or balanced?
- Specific Metrics
  - For classification problems, this could be Accuracy, Precision, Recall, F1, or AUC-ROC.
  - For regression problems, Mean Squared Error (MSE) and R-squared.
What is the difference between Bias and Variance?
Bias - the difference between the predicted values and the true values; a systematic error in the model's assumptions.
Variance - how dispersed the predictions are; how sensitive the model is to the particular training data it saw.
Analogy: a bull's-eye target - bias is how far the shots land from the center on average; variance is how spread out the shots are.
An overfitted model has high variance
An underfitted model has high bias
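An illustrative scikit-learn sketch (synthetic data) showing the trade-off by varying model complexity:

```python
# Vary model complexity (polynomial degree) and compare train vs.
# validation scores to see the bias/variance trade-off.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    # degree 1: high bias (underfit) -> typically poor on both sets.
    # degree 15: high variance (overfit) -> train score higher than validation.
    print(degree, model.score(X_tr, y_tr), model.score(X_val, y_val))
```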
What is a confusion matrix?
Evaluates model performance by classifying the predictions as:
True Positive
True Negative
False Positive
False Negative
So, a cat that is predicted as a cat = True Positive
So, a NOT cat that is predicted as a cat = False Positive
A NOT cat that is predicted as NOT cat = True Negative
A cat that is predicted as a NOT cat = False Negative
TPs and TNs are desirable. You want them to be as high as possible.
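A small scikit-learn sketch of the cat example (the labels below are invented for illustration):

```python
# Compute a confusion matrix for the cat / NOT-cat example.
from sklearn.metrics import confusion_matrix

y_true = ["cat", "cat", "not_cat", "not_cat", "cat", "not_cat"]
y_pred = ["cat", "not_cat", "cat", "not_cat", "cat", "not_cat"]

# Rows = actual class, columns = predicted class.
tn, fp, fn, tp = confusion_matrix(
    y_true, y_pred, labels=["not_cat", "cat"]).ravel()
print(tp, tn, fp, fn)  # TP=2, TN=2, FP=1, FN=1
```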
What are Accuracy, Precision, and Recall?
These are different ways by which you evaluate model performance
Accuracy - the proportion of predictions the model got right (i.e. TP and TN as a proportion of the total number of predictions).
Precision - considers only the positive predictions: of everything the model labeled positive, how many were actually positive (TP/(TP+FP))? In email spam detection this may be important - you do not want your model labeling a legitimate email as spam and preventing your users from seeing that email. Use precision when the impact of a FP is high.
Recall - also called sensitivity - the proportion of actual positives that are correctly identified (TP/(TP+FN)). Think about a model that needs to predict whether a patient has a terminal illness. You want as few FNs as possible (i.e. don't classify someone as OK if they have a terminal illness). Use recall when the impact of a FN is high.
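Computing all three directly from the confusion-matrix counts (using the TP=2, TN=2, FP=1, FN=1 values from the sketch above):

```python
# Metrics from raw confusion-matrix counts.
tp, tn, fp, fn = 2, 2, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were right
precision = tp / (tp + fp)                  # of predicted positives, how many were real
recall = tp / (tp + fn)                     # of actual positives, how many were found
print(accuracy, precision, recall)          # 0.666..., 0.666..., 0.666...
```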
What is AUC-ROC?
Area Under the Curve - Receiver Operating Characteristic (ROC) curve.
Essentially, it measures separability: how well the model distinguishes the positive class from the negative class across all classification thresholds.
The ROC curve plots the true positive rate against the false positive rate (e.g. email spam classification):
True positive rate = the percentage of spam you capture.
False positive rate = the cost of over-filtering (legitimate emails your users never see).
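A minimal scikit-learn sketch (the scores below are invented; 1 = spam, 0 = not spam):

```python
# AUC-ROC from predicted probabilities.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.1, 0.4, 0.8, 0.9, 0.65, 0.3]  # model's predicted spam probability

# 1.0 = perfect separability; 0.5 = no better than random guessing.
print(roc_auc_score(y_true, y_scores))
```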