ML Practice Test #1 Flashcards
Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference. For example, in an application that recommends a music playlist, features could include song ratings, listening duration, and listener demographics.
You can ingest data into SageMaker Feature Store from a variety of sources, such as application and service logs, clickstreams, sensors, and tabular data from Amazon Simple Storage Service (Amazon S3), Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks Delta Lake.
via - https://aws.amazon.com/sagemaker/feature-store/
Amazon SageMaker Clarify
SageMaker Clarify helps identify potential bias during data preparation without writing code. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features.
Amazon SageMaker Data Wrangler
Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface.
Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth offers the most comprehensive set of human-in-the-loop capabilities, allowing you to harness the power of human feedback across the ML lifecycle to improve the accuracy and relevancy of models. You can complete a variety of human-in-the-loop tasks with SageMaker Ground Truth, from data generation and annotation to model review, customization, and evaluation, either through a self-service or an AWS-managed offering.
Data Science Strategies - Data augmentation
data augmentation to increase the diversity of the training data
Data augmentation is a machine learning technique that changes the sample data slightly every time the model processes it. You can do this by changing the input data in small ways. When done in moderation, data augmentation makes the training samples appear unique to the model and prevents the model from memorizing the specific characteristics of individual samples. For example, you can apply transformations such as translation, flipping, and rotation to input images.
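The flipping and translation transformations mentioned above can be sketched on a tiny image stored as a list of rows. This is only an illustration: the helper names (`flip_horizontal`, `translate_right`, `augment`) and the 2x3 sample image are assumptions made for the example, not part of any AWS API.

```python
import random

def flip_horizontal(image):
    """Mirror each row of a 2D image (a list of rows)."""
    return [row[::-1] for row in image]

def translate_right(image, shift, fill=0):
    """Shift pixels right by `shift` columns, padding the left edge with `fill`."""
    width = len(image[0])
    shift = min(shift, width)
    return [[fill] * shift + row[: width - shift] for row in image]

def augment(image, rng):
    """Apply a random horizontal flip, then a random shift of 0 or 1 columns."""
    out = image
    if rng.random() < 0.5:
        out = flip_horizontal(out)
    return translate_right(out, rng.randint(0, 1))

# Hypothetical 2x3 grayscale image.
image = [[1, 2, 3],
         [4, 5, 6]]

rng = random.Random(0)  # seeded so repeated runs give the same variant
augmented = augment(image, rng)
```

Calling `augment` once per training pass gives the model a slightly different view of the same sample each time, while the label stays unchanged.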
Data Science Strategies - Early stopping
early stopping to prevent overfitting
Early stopping halts the training phase before the machine learning model learns the noise in the data. However, getting the timing right is important: stopping too early leaves the model underfit, and stopping too late lets it overfit, so in either case the model will not give accurate results.
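One common way to pick the stopping point is a patience rule on the validation loss. The sketch below assumes that rule; the function name, the `patience` parameter, and the loss values are hypothetical and chosen only to show the mechanics.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs. If that never happens, the
    last epoch index is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1

# Hypothetical validation losses: improvement stalls after epoch 2.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=2)
```

A larger `patience` tolerates more noise in the validation curve but risks training further into the overfitting region.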
Data Science Strategies - Ensembling
use ensembling to average predictions from multiple models
Ensembling combines predictions from several separate machine learning algorithms. Some models are called weak learners because their results are often inaccurate. Ensemble methods combine all the weak learners to get more accurate results. They use multiple models to analyze sample data and pick the most accurate outcomes. The two main ensemble methods are bagging and boosting. Boosting trains different machine learning models one after another to get the final result, while bagging trains them in parallel.
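Averaging predictions from several models can be sketched in a few lines. The scores below are made-up outputs from three hypothetical models; `average_predictions` and `majority_vote` are illustrative helper names, and real bagging or boosting libraries do considerably more than this.

```python
from collections import Counter

def average_predictions(model_outputs):
    """Average per-sample scores across models (soft voting)."""
    n_models = len(model_outputs)
    return [sum(scores) / n_models for scores in zip(*model_outputs)]

def majority_vote(model_labels):
    """Pick the most common label per sample across models (hard voting)."""
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*model_labels)]

# Hypothetical probability scores from three models for three samples.
probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.3],
    [0.8, 0.1, 0.7],
]

avg = average_predictions(probs)

# Threshold each model's scores at 0.5, then vote on the labels.
labels = [[1 if p >= 0.5 else 0 for p in model] for model in probs]
votes = majority_vote(labels)
```

For the third sample, one model disagrees with the other two, and both the averaged score and the vote follow the majority.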
Data Science Strategies - Pruning
You might identify several features or parameters that impact the final prediction when you build a model. Feature selection—or pruning—identifies the most important features within the training set and eliminates irrelevant ones. For example, to predict if an image is an animal or human, you can look at various input parameters like face shape, ear position, body structure, etc. You may prioritize face shape and ignore the shape of the eyes.
Data Science Strategies - Regularization
Regularization is a collection of training/optimization techniques that seek to reduce overfitting. These methods try to eliminate those factors that do not impact the prediction outcomes by grading features based on importance. For example, mathematical calculations apply a penalty value to features with minimal impact. Consider a statistical model attempting to predict the housing prices of a city in 20 years. Regularization would give a lower penalty value to features like population growth and average annual income but a higher penalty value to the average annual temperature of the city.
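The effect of a penalty can be seen in a one-feature example. The sketch below assumes an L2 (ridge) penalty, which is one common form of regularization and not necessarily the one the card has in mind; the data and the `ridge_weight` helper are hypothetical.

```python
def ridge_weight(xs, ys, penalty):
    """Closed-form weight for a one-feature linear model without intercept.

    Minimizes sum((y - w * x) ** 2) + penalty * w ** 2,
    which gives w = sum(x * y) / (sum(x * x) + penalty).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + penalty)

# Hypothetical data that follows y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w_unregularized = ridge_weight(xs, ys, penalty=0.0)   # 28 / 14
w_regularized = ridge_weight(xs, ys, penalty=14.0)    # 28 / 28
```

With no penalty the weight matches the data exactly; with a penalty the weight shrinks toward zero, which is how regularization limits the influence of a feature.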
AWS CloudFormation
Use AWS CloudFormation with nested stacks to automate the provisioning of SageMaker, EC2, and RDS resources, and configure outputs from one stack as inputs to another to enable communication between them
AWS CloudFormation with nested stacks allows you to modularize your infrastructure, making it easier to manage and reuse components. By passing outputs from one stack as inputs to another, you can automate the provisioning of resources while ensuring that all stacks can communicate effectively. This approach also enables consistent and scalable deployments across environments.
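As a rough sketch of passing one stack's output into another, the parent template below is built as a Python dictionary and serialized to JSON. The stack names, the `DatabaseEndpoint` output and parameter, and the `example-bucket` URLs are hypothetical; the child templates are assumed to declare that output and that parameter themselves.

```python
import json

# Parent template with two nested stacks. The second stack receives
# an output of the first one through Fn::GetAtt.
parent_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "DatabaseStack": {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                # Hypothetical location of the child template.
                "TemplateURL": "https://example-bucket.s3.amazonaws.com/database.json",
            },
        },
        "SageMakerStack": {
            "Type": "AWS::CloudFormation::Stack",
            "Properties": {
                # Hypothetical location of the child template.
                "TemplateURL": "https://example-bucket.s3.amazonaws.com/sagemaker.json",
                "Parameters": {
                    # Output of DatabaseStack becomes an input parameter here.
                    "DatabaseEndpoint": {
                        "Fn::GetAtt": ["DatabaseStack", "Outputs.DatabaseEndpoint"]
                    },
                },
            },
        },
    },
}

template_body = json.dumps(parent_template, indent=2)
```

Because `SageMakerStack` references an attribute of `DatabaseStack`, CloudFormation creates the database stack first and then hands its output to the dependent stack.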
AWS CDK (Cloud Development Kit)
AWS CDK allows you to define infrastructure using high-level programming languages, which is flexible and powerful. However, failing to configure inter-stack communication would lead to a disjointed deployment, where services may not function together as required.
AWS Elastic Beanstalk
AWS Elastic Beanstalk is a managed service for deploying applications, but it is not designed for orchestrating complex ML workflows with multiple resource types like SageMaker, EC2, and RDS. It also lacks fine-grained control over resource provisioning and inter-stack communication.
Boosting
Boosting is a method used in machine learning to reduce errors in predictive data analysis. Data scientists train machine learning software, called machine learning models, on labeled data to make guesses about unlabeled data. A single machine learning model might make prediction errors depending on the accuracy of the training dataset.
https://aws.amazon.com/what-is/boosting/
Boosting - Extreme Gradient Boosting (XGBoost)
Apply Extreme Gradient Boosting (XGBoost) for its ability to handle imbalanced datasets effectively through regularization, weighted classes, and optimized computational efficiency
XGBoost (eXtreme Gradient Boosting) is a popular and efficient open-source implementation of the gradient boosted trees algorithm. Gradient boosting is a supervised learning algorithm that tries to accurately predict a target variable by combining multiple estimates from a set of simpler models. The XGBoost algorithm performs well in machine learning competitions for the following reasons:
Its robust handling of a variety of data types, relationships, and distributions.
The variety of hyperparameters that you can fine-tune.
XGBoost is an extension of Gradient Boosting that includes additional features such as regularization, handling of missing values, and support for weighted classes, making it particularly well-suited for imbalanced datasets like fraud detection. It also offers significant computational efficiency, which is beneficial when working with large datasets.
via - https://aws.amazon.com/what-is/boosting/
https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
XGBoost is known for its ability to deliver high performance with relatively efficient training times, especially with techniques like early stopping and hyperparameter tuning. This approach balances the need for accuracy with reduced computational cost and training time, making it an ideal choice for this scenario.
XGBoost is a powerful gradient boosting algorithm that excels in structured data problems, such as fraud detection. It allows for custom objective functions, making it highly suitable for optimizing precision and recall, which are critical in imbalanced datasets. Additionally, XGBoost has built-in techniques for handling class imbalance, such as scale_pos_weight.
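A frequently used starting value for `scale_pos_weight` is the ratio of negative to positive examples in the training data. The sketch below only computes that ratio and places it in a parameter dictionary; the label counts are made up, the helper name is hypothetical, and the value would normally still be tuned.

```python
def suggested_scale_pos_weight(labels):
    """Return negatives / positives as a starting value for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return negatives / positives

# Hypothetical fraud dataset: 95 legitimate and 5 fraudulent transactions.
labels = [0] * 95 + [1] * 5

# Parameter dictionary in the shape XGBoost training expects.
params = {
    "objective": "binary:logistic",
    "eval_metric": "aucpr",
    "scale_pos_weight": suggested_scale_pos_weight(labels),
}
```

Here the positive class receives 19 times the weight of the negative class, so errors on the rare fraudulent transactions count more during training.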
Boosting - Adaptive Boosting (AdaBoost)
AdaBoost works by focusing on correcting the errors of weak classifiers, assigning more weight to misclassified instances in each iteration. However, it may struggle with noisy data and extreme class imbalance, as it can overemphasize hard-to-classify instances.
Boosting - Gradient Boosting
Gradient Boosting is a powerful technique that uses the gradient of the loss function to improve the model iteratively. While it can be adapted to handle class imbalance, it does not inherently provide the same level of flexibility and computational optimization as XGBoost for this specific problem.
Amazon SageMaker JumpStart
Use SageMaker JumpStart to deploy a pre-trained NLP model and use the built-in fine-tuning functionality with your custom dataset to create a customized sentiment analysis model
Amazon Bedrock is the easiest way to build and scale generative AI applications with foundation models. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.
Amazon SageMaker JumpStart is a machine learning (ML) hub that can help you accelerate your ML journey. With SageMaker JumpStart, you can evaluate, compare, and select FMs quickly based on pre-defined quality and responsibility metrics to perform tasks like article summarization and image generation. SageMaker JumpStart provides managed infrastructure and tools to accelerate scalable, reliable, and secure model building, training, and deployment of ML models.
Fine-tuning trains a pretrained model on a new dataset without training from scratch. This process, also known as transfer learning, can produce accurate models with smaller datasets and less training time.
SageMaker JumpStart is specifically designed for scenarios like this, where you can quickly deploy a pre-trained model and fine-tune it using your custom dataset. This approach allows you to leverage existing NLP models, reducing both development time and computational resources needed for training from scratch.
via - https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-fine-tune.html
confusion matrix
The confusion matrix illustrates in a table the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class. The confusion matrix is crucial for understanding the detailed performance of your model, especially in an imbalanced dataset. It allows you to calculate additional metrics such as precision, recall, and F1 score, which are essential for understanding how well your model handles false positives and false negatives.
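For a binary problem, the four cells of the confusion matrix can be counted directly. The labels and predictions below are made up for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
```

The `fp` and `fn` cells are the ones that accuracy alone hides.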
Amazon SageMaker Autopilot
Automatically create machine learning models with full visibility. Autopilot is now in SageMaker Canvas with integrated data preparation, multi-modality support, built-in visualizations and what-if analysis, and automation support for predictions.
Amazon SageMaker Autopilot is a feature set that simplifies and accelerates various stages of the machine learning workflow by automating the process of building and deploying machine learning models (AutoML).
accuracy
While accuracy is a common metric, it is not suitable for imbalanced datasets because it can be misleading. A model predicting the majority class most of the time can achieve high accuracy without effectively capturing the minority class (e.g., customers who make a purchase).
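A small made-up example shows how misleading accuracy can be. It assumes a dataset in which 5% of customers make a purchase and a model that always predicts the majority class.

```python
# Hypothetical imbalanced dataset: 95 non-purchasers (0), 5 purchasers (1).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class.
y_pred = [0] * 100

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall on the minority class: how many purchasers were found.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
actual_positives = sum(1 for t in y_true if t == 1)
recall = true_positives / actual_positives
```

The accuracy looks high even though the model never identifies a single purchaser, which the recall of zero makes visible.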
Precision and recall
Precision and recall are particularly important in an imbalanced dataset. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of actual positives that are correctly identified. Focusing on these metrics helps in assessing how well the model avoids false positives and false negatives, which is critical in your scenario.
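Both metrics, and the F1 score that combines them, follow directly from counts of true positives, false positives, and false negatives. The counts below are hypothetical, as is the helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
```

Here three quarters of the positive predictions are correct, but only 60% of the actual positives are found, and F1 sits between the two.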
RMSE
Root mean squared error
RMSE is a regression metric, not suitable for classification problems. In this scenario, you are dealing with a classification task, so metrics like precision, recall, and F1 score are more appropriate.
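For reference, RMSE is the square root of the mean squared difference between predicted and actual numeric values. The values below are made up.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical regression targets and predictions.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]

error = rmse(actual, predicted)  # sqrt((1 + 0 + 4) / 3)
```

Because the metric measures distance between numeric values, it has no natural meaning for class labels.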
AUC
The area under the curve (AUC) metric is used to compare and evaluate binary classification by algorithms that return probabilities, such as logistic regression. To map the probabilities into classifications, these are compared against a threshold value.
The relevant curve is the receiver operating characteristic curve. The curve plots the true positive rate (TPR) of predictions (or recall) against the false positive rate (FPR) as a function of the threshold value, above which a prediction is considered positive. Increasing the threshold results in fewer false positives, but more false negatives.
AUC is the area under this receiver operating characteristic curve. Therefore, AUC provides an aggregated measure of the model performance across all possible classification thresholds. AUC scores vary between 0 and 1. A score of 1 indicates perfect accuracy, and a score of one half (0.5) indicates that the prediction is not better than a random classifier.
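AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way by comparing every positive/negative pair; the labels and scores are made up, and the quadratic loop is meant for illustration, not for large datasets.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
```

Three of the four positive/negative pairs are ordered correctly, so the result lies between the random baseline of 0.5 and the perfect score of 1.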
Deep Neural Network
A deep neural network may provide high accuracy but typically requires significant computational resources and longer training times, leading to higher costs. This approach may not be feasible within a limited budget, especially with the need for frequent retraining.
Logistic regression
Logistic regression is simple and cost-effective but may not achieve the level of accuracy required for a critical application like fraud detection. This tradeoff might be too significant if accuracy is compromised.
SVM
support vector machine
SVMs with nonlinear kernels can be very accurate but are computationally intensive, particularly with large datasets. The increased training time and cost might outweigh the benefits, especially when there are more cost-effective alternatives like XGBoost.
Reference:
https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html
Data drift
Data drift occurs when the distribution of the input data changes over time, which can lead to the model receiving data that is different from what it was trained on.
To address data drift, you should use SageMaker Model Monitor to track changes in input data distribution
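The idea of comparing incoming data against a training-time baseline can be illustrated with a very simple check. This is not how SageMaker Model Monitor is implemented or invoked; the function name, the feature values, and the threshold of 2.0 are all assumptions made for the example.

```python
import statistics

def mean_shift_in_std(baseline, current):
    """Shift of the current mean, measured in baseline standard deviations."""
    base_mean = statistics.mean(baseline)
    base_std = statistics.pstdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has zero variance")
    return abs(statistics.mean(current) - base_mean) / base_std

# Hypothetical values of one numeric feature.
baseline = [10.0, 12.0, 11.0, 13.0, 9.0]          # seen during training
current_ok = [11.0, 10.0, 12.0, 12.0, 10.0]       # recent data, similar
current_shifted = [18.0, 20.0, 19.0, 21.0, 17.0]  # recent data, shifted

DRIFT_THRESHOLD = 2.0  # hypothetical alert threshold

drift_ok = mean_shift_in_std(baseline, current_ok) > DRIFT_THRESHOLD
drift_shifted = mean_shift_in_std(baseline, current_shifted) > DRIFT_THRESHOLD
```

Only the shifted sample crosses the threshold, which is the kind of signal that would prompt a closer look at the input data.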
Model drift
Model drift happens when the model’s underlying assumptions or parameters become outdated
Model drift occurs when the model’s performance degrades because its assumptions or parameters no longer align with the real-world data.
For model drift, you should periodically retrain the model using the latest data
What are the benefits of a transparent and explainable machine learning model?
They facilitate easier debugging and optimization
Transparent models allow developers to understand how inputs are transformed into outputs, making it easier to identify and correct errors or inefficiencies in the model. This capability is crucial for optimizing the model’s performance and ensuring it behaves as expected.
They foster trust and confidence in model predictions
When stakeholders can understand the decision-making process of a model, it builds trust in its predictions. Transparency is key in high-stakes scenarios, such as healthcare or finance, where understanding the rationale behind predictions is critical for acceptance and trust.
Opaque models
Opaque models, not transparent ones, are typically associated with enhanced security through obscurity.
Amazon SageMaker Model Registry
Leverage the SageMaker Model Registry to register, track, and manage different versions of models, capturing all relevant metadata, including data sources, hyperparameters, and training code
The SageMaker Model Registry is specifically designed for managing model versions in a systematic and organized manner. It allows you to register different versions of a model, track metadata such as data sources, hyperparameters, and training code, and ensure that each version is easily reproducible. This approach is ideal for regulatory environments where audit trails and model governance are critical.
With the Amazon SageMaker Model Registry you can do the following:
Catalog models for production.
Manage model versions.
Associate metadata, such as training metrics, with a model.
View information from Amazon SageMaker Model Cards in your registered models.
Manage the approval status of a model.
Deploy models to production.
Automate model deployment with CI/CD.
Share models with other users.
Incorrect options: