Machine Learning Engineering Associate 2 Flashcards
Data Transformation, Integrity and Feature Engineering
Data Wrangler
Visual data preparation tool in Amazon SageMaker for exploring, transforming, and analyzing data
Glue
Fully managed extract, transform, and load (ETL) service
Glue DataBrew
Visual data preparation tool that makes it easy to clean and normalize data
Kinesis
Platform for streaming data on AWS
Lambda
Serverless compute service for running code without provisioning servers
SageMaker Ground Truth
Fully managed data labeling service for building accurate training datasets
Class imbalance
Situation where classes in a dataset are not represented equally
Server-side encryption
Data encryption performed by the storage service
Client-side encryption
Data encryption performed by the client before sending to storage
Data anonymization
Removing or irreversibly masking personally identifiable information in datasets so that individuals cannot be re-identified
Supervised learning
ML approach where the model is trained on labeled data
Unsupervised learning
ML approach where the model is trained on unlabeled data
Reinforcement learning
ML approach where an agent learns to make decisions by interacting with an environment
Feature importance
Measure of how much each feature contributes to the model’s predictions
SHAP values
SHapley Additive exPlanations, a game-theoretic approach to explaining machine learning model outputs
XGBoost
Gradient boosting algorithm known for speed and performance
Epoch
One complete pass through the entire training dataset
Early stopping
Technique to stop training when performance on a validation set stops improving
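The early stopping card can be illustrated with a minimal sketch. It assumes a list of per-epoch validation losses and a hypothetical `patience` parameter (the number of consecutive epochs without improvement that is tolerated); neither name comes from the deck.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never triggers.

    `patience` is an illustrative parameter: the number of consecutive
    epochs without a new best validation loss before stopping.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Validation loss improves for three epochs, then stalls for two.
stop_epoch = early_stopping_epoch([0.90, 0.70, 0.65, 0.66, 0.67, 0.60], patience=2)
```

With these losses the loop stops at epoch 5, before the later improvement at epoch 6 is ever seen, which is the usual trade-off when choosing a small patience.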
Distributed training
Spreading the training process across multiple compute resources
Hyperparameter tuning
Process of finding the best combination of hyperparameters for a model
Transfer learning
Using knowledge gained from solving one problem to solve a related problem
Dropout
Technique where randomly selected neurons are ignored during training
Weight decay
Regularization technique that penalizes large weights (typically an L2 penalty added to the loss function) to reduce overfitting
Random search
Randomly sampling hyperparameters from a defined search space
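A rough sketch of random search, assuming continuous hyperparameters sampled uniformly from (low, high) ranges. The `toy_objective` function and the `lr` and `dropout` names are hypothetical stand-ins for a real validation score and real hyperparameters.

```python
import random


def random_search(objective, space, n_trials=20, seed=0):
    """Sample each hyperparameter uniformly from its range and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params):
    # Hypothetical stand-in for a validation score; peaks at lr=0.1, dropout=0.3.
    return -((params["lr"] - 0.1) ** 2 + (params["dropout"] - 0.3) ** 2)


space = {"lr": (0.001, 0.5), "dropout": (0.0, 0.6)}
best_params, best_score = random_search(toy_objective, space, n_trials=50, seed=42)
```

Each trial is independent of the others, so trials can run in parallel; the cost is that no trial learns from earlier results.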
Bayesian optimization
Using a probabilistic model to guide the search for optimal hyperparameters
Confusion matrix
Table showing correct and incorrect predictions for each class
F1 score
Harmonic mean of precision and recall
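As a worked example of the confusion matrix and F1 cards, the sketch below counts the four confusion-matrix cells for a binary problem and derives F1 from them. The labels are made-up illustration data, and the function names are not from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
```

Here precision and recall are both 3/4, so F1 is also 0.75. True negatives do not enter the formula, which is why F1 is often preferred over accuracy on imbalanced classes.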
ROC
Receiver operating characteristic curve: graph of true positive rate against false positive rate across all classification thresholds
AUC
Area under the ROC curve, a measure of a classifier's ability to distinguish between classes
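One way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pair comparison under that interpretation; the scores are made-up illustration data, and real libraries use faster rank-based methods.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auc(y_true, scores)
```

Three of the four positive-negative pairs are ordered correctly, giving 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.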
Overfitting
Model performs well on training data but poorly on unseen data
Underfitting
Model performs poorly on both training and unseen data
Concept drift
Changes in the underlying relationships between input and output variables
Data drift
Changes in the statistical properties of the input data
A/B testing
Experiment where two variants of a model are compared to determine which performs better
CloudTrail
Service that records API calls and other account activity in AWS
Cost Explorer
Tool for visualizing, understanding, and managing AWS costs and usage over time
IAM roles
IAM identities with attached permission policies that trusted users, applications, or services can assume to obtain temporary credentials
Security groups
Virtual firewalls for controlling inbound and outbound traffic to AWS resources
Network ACLs
Optional layer of security that acts as a firewall for controlling traffic in and out of subnets
Least privilege access
Principle of giving users the minimum levels of access necessary to complete their tasks