ML Practice Test #1 Flashcards

Question

Logistic regression

Answer 1

Logistic regression is simple and cost-effective but may not achieve the level of accuracy required for a critical application like fraud detection. This tradeoff might be too significant if accuracy is compromised.

Answer 2

Support vector machines (SVMs) with nonlinear kernels can be very accurate but are computationally intensive, particularly with large datasets. The increased training time and cost might outweigh the benefits, especially when there are more cost-effective alternatives like XGBoost. Reference: https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

Answer 3

Data drift occurs when the distribution of the input data changes over time, so the model receives data that differs from what it was trained on. To address data drift, use SageMaker Model Monitor to track changes in the input data distribution.
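As a rough illustration of what "a change in input data distribution" means numerically, the sketch below computes a Population Stability Index (PSI) between a baseline sample and a newer sample of one numeric feature. This is not how SageMaker Model Monitor works internally; the function name, the equal-width binning, and the 0.2 alert threshold used in the usage note are assumptions chosen for illustration.

```python
import math

def population_stability_index(baseline, current, bins=10):
    """PSI between two numeric samples, using equal-width bins taken from the baseline."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the baseline range
            counts[idx] += 1
        eps = 1e-6  # hypothetical floor to avoid log(0) for empty bins
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))
```

Identical samples give a PSI of 0, and a sample shifted well outside the baseline range gives a large value. A common rule of thumb (an assumption here, not an AWS setting) treats values above roughly 0.2 as worth investigating.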

Answer 4

Model drift happens when the model’s underlying assumptions or parameters become outdated.

Model drift occurs when the model’s performance degrades because its assumptions or parameters no longer align with real-world data. To address model drift, periodically retrain the model using the latest data.

Answer 5

They facilitate easier debugging and optimization.

Transparent models allow developers to understand how inputs are transformed into outputs, making it easier to identify and correct errors or inefficiencies in the model. This capability is crucial for optimizing the model’s performance and ensuring it behaves as expected.

They foster trust and confidence in model predictions.

When stakeholders can understand the decision-making process of a model, it builds trust in its predictions. Transparency is key in high-stakes scenarios, such as healthcare or finance, where understanding the rationale behind predictions is critical for acceptance and trust.

Answer 6

Opaque models, not transparent ones, are typically associated with enhanced security through obscurity.

Answer 7

Leverage the SageMaker Model Registry to register, track, and manage different versions of models, capturing all relevant metadata, including data sources, hyperparameters, and training code.

The SageMaker Model Registry is specifically designed for managing model versions in a systematic and organized manner. It allows you to register different versions of a model, track metadata such as data sources, hyperparameters, and training code, and ensure that each version is easily reproducible. This approach is ideal for regulatory environments where audit trails and model governance are critical.

With the Amazon SageMaker Model Registry you can do the following: Catalog models for production. Manage model versions. Associate metadata, such as training metrics, with a model. View information from Amazon SageMaker Model Cards in your registered models. Manage the approval status of a model. Deploy models to production. Automate model deployment with CI/CD. Share models with other users.

Answer 8

Script mode enables you to write custom training and inference code while still utilizing common ML framework containers maintained by AWS. SageMaker supports most of the popular ML frameworks through pre-built containers, and has taken the extra step to optimize them to work especially well on AWS compute and network infrastructure in order to achieve near-linear scaling efficiency. These pre-built containers also provide some additional Python packages, such as Pandas and NumPy, so you can write your own code for training an algorithm. These frameworks also allow you to install any Python package hosted on PyPI by including a requirements.txt file with your training code or to include your own code directories.

This is the correct approach for using the BYOC strategy with SageMaker. You build a Docker container that includes the required TensorFlow version and custom dependencies, then push the image to Amazon ECR. SageMaker can reference this image to create training jobs and deploy endpoints. By using Script Mode, you can execute your custom training script within the container, ensuring compatibility with your specific environment. via - https://aws.amazon.com/blogs/machine-learning/bring-your-own-model-with-amazon-sagemaker-script-mode/

Answer 9

Amazon SageMaker supports model versioning, which is crucial for tracking different iterations of your model. A canary release allows you to deploy the new model version to a small portion of users, minimizing risk by limiting exposure in case of issues. If the new version performs well, you can gradually increase traffic to it.

Answer 10

A blue/green deployment strategy is a best practice in model deployment. It allows you to deploy the new model version in parallel with the existing one, gradually shifting traffic to the new version while monitoring its performance. If issues are detected, you can quickly roll back to the previous version without disrupting service. In a blue/green deployment, SageMaker provisions a new fleet with the updates (the green fleet). Then, SageMaker shifts traffic from the old fleet (the blue fleet) to the green fleet. Once the green fleet operates smoothly for a set evaluation period (called the baking period), SageMaker terminates the blue fleet. You can specify Amazon CloudWatch alarms that SageMaker uses to monitor the green fleet. If an issue with the updated code trips any of the alarms, SageMaker initiates an auto-rollback to the blue fleet in order to maintain availability thereby minimizing risk. via - https://docs.aws.amazon.com/sagemaker/latest/dg/deployment-guardrails-blue-green.html
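The traffic-shifting and auto-rollback behaviour described above can be pictured with a toy simulation. This is only a sketch of the control flow, not SageMaker's implementation: `blue_green_shift`, the `steps` list, and the `alarm_fired` callback (a stand-in for a CloudWatch alarm checked during each baking period) are all hypothetical names.

```python
def blue_green_shift(steps, alarm_fired):
    """Shift traffic to the green fleet step by step; roll back if the alarm fires.

    steps: non-empty list of green traffic percentages, e.g. [10, 50, 100]
    alarm_fired: callable(green_percent) -> bool, stand-in for a CloudWatch alarm
    Returns (final_green_percent, history of green percentages).
    """
    history = []
    for pct in steps:
        history.append(pct)        # route pct% of traffic to green
        if alarm_fired(pct):       # check during the baking period for this step
            history.append(0)      # auto-rollback: all traffic returns to blue
            return 0, history
    return steps[-1], history      # green kept; blue can now be terminated
```

With a healthy green fleet the call `blue_green_shift([10, 50, 100], lambda pct: False)` ends at 100 percent green. If the alarm trips at the 50 percent step, the history ends with 0, which models the rollback to the blue fleet.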

Answer 11

Amazon SageMaker Data Wrangler reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes. With SageMaker Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow (including data selection, cleansing, exploration, visualization, and processing at scale) from a single visual interface. You can use SQL to select the data that you want from various data sources and import it quickly. Next, you can use the data quality and insights report to automatically verify data quality and detect anomalies, such as duplicate rows and target leakage. SageMaker Data Wrangler contains over 300 built-in data transformations, so you can quickly transform data without writing code.

With the SageMaker Data Wrangler data selection tool, you can quickly access and select your tabular and image data from various popular sources (such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, AWS Lake Formation, Snowflake, and Databricks) and over 50 other third-party sources (such as Salesforce, SAP, Facebook Ads, and Google Analytics). You can also write queries for data sources using SQL and import data directly into SageMaker from various file formats, such as CSV, Parquet, JSON, and database tables. via - https://aws.amazon.com/sagemaker/data-wrangler/

Answer 12

Amazon SageMaker Model Dashboard is a centralized portal, accessible from the SageMaker console, where you can view, search, and explore all of the models in your account. You can track which models are deployed for inference and if they are used in batch transform jobs or hosted on endpoints.

Answer 13

SageMaker Clarify helps identify potential bias during data preparation without writing code. You specify input features, such as gender or age, and SageMaker Clarify runs an analysis job to detect potential bias in those features.

Answer 14

Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference.

Answer 15

For a model to predict accurately, the data that it is making predictions on must have a similar distribution as the data on which the model was trained. Because data distributions can be expected to drift over time, deploying a model is not a one-time exercise but rather a continuous process. It is a good practice to continuously monitor the incoming data and retrain your model on newer data if you find that the data distribution has deviated significantly from the original training data distribution. If monitoring data to detect a change in the data distribution has a high overhead, then a simpler strategy is to retrain the model periodically, for example, daily, weekly, or monthly.

Answer 16

Area Under the (Receiver Operating Characteristic) Curve (AUC) represents an industry-standard accuracy metric for binary classification models. AUC measures the ability of the model to predict a higher score for positive examples as compared to negative examples. Because it is independent of the score cut-off, you can get a sense of the prediction accuracy of your model from the AUC metric without picking a threshold. The AUC metric returns a decimal value from 0 to 1. AUC values near 1 indicate an ML model that is highly accurate. Values near 0.5 indicate an ML model that is no better than guessing at random. Values near 0 are unusual to see, and typically indicate a problem with the data. Essentially, an AUC near 0 says that the ML model has learned the correct patterns, but is using them to make predictions that are flipped from reality ('0's are predicted as '1's and vice versa). The ROC curve is the plot of the true positive rate (TPR) against the false positive rate (FPR) at each threshold setting. via - https://aws.amazon.com/blogs/machine-learning/is-your-model-good-a-deep-dive-into-amazon-sagemaker-canvas-advanced-metrics/ An AUC close to 1.0 signifies that the model has excellent discriminatory power, meaning it can effectively distinguish between the positive class (defaulters) and the negative class (non-defaulters) across all thresholds. This is desirable in a classification task, especially in scenarios with class imbalance. via - https://docs.aws.amazon.com/machine-learning/latest/dg/binary-model-insights.html
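The statement that AUC measures the model's ability to score positive examples above negative ones can be checked directly: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The sketch below uses that pairwise definition. The function name is illustrative, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))
```

For labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. Negating the scores flips every pair and gives 0.25, which matches the "predictions flipped from reality" case described above.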

Answer 17

A ROC curve close to the diagonal line (AUC ~ 0.5) indicates that the model has no discriminatory power and is performing no better than random guessing. This suggests poor model performance, not that the model performs well across all thresholds.

Answer 18

CDD evaluates the disparity in positive prediction rates across demographic groups, conditioned on a specific feature like income, to detect bias that may not be apparent when only considering overall outcomes.

Conditional Demographic Disparity (CDD) measures the difference in positive prediction rates between demographic groups, while conditioning on relevant features like income. This allows you to identify subtle biases that might be masked when looking only at overall predictions, ensuring that the model's decisions are fair across different groups given their specific circumstances. via - https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-data-bias-metric-cddl.html
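A minimal sketch of the computation, assuming the formulation in which the demographic disparity (DD) for a group is its share of rejected outcomes minus its share of accepted outcomes, and CDD is the average of per-subgroup DD weighted by subgroup size. The record layout, function names, and the choice to return 0.0 for a subgroup with only one kind of outcome are assumptions made for this example.

```python
from collections import defaultdict

def demographic_disparity(rows, facet):
    """rows: list of (group, outcome) pairs, outcome 1 = accepted, 0 = rejected."""
    rejected = [g for g, y in rows if y == 0]
    accepted = [g for g, y in rows if y == 1]
    if not rejected or not accepted:
        return 0.0  # assumption: no measurable disparity when only one outcome occurs
    return rejected.count(facet) / len(rejected) - accepted.count(facet) / len(accepted)

def conditional_demographic_disparity(records, facet):
    """records: list of (group, stratum, outcome); size-weighted mean of per-stratum DD."""
    strata = defaultdict(list)
    for group, stratum, outcome in records:
        strata[stratum].append((group, outcome))
    total = len(records)
    return sum(len(rows) * demographic_disparity(rows, facet)
               for rows in strata.values()) / total
```

In the conditioning step, `stratum` plays the role of the feature being conditioned on (for example an income band). A positive result suggests the chosen group is over-represented among rejections within strata, even if the overall rates look balanced.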

Answer 19

Implement the entire ML workflow using Amazon SageMaker Pipelines, which provides integrated orchestration for data processing, model training, tuning, and deployment.

Amazon SageMaker Pipelines is a purpose-built workflow orchestration service to automate machine learning (ML) development. SageMaker Pipelines is specifically designed to orchestrate end-to-end ML workflows, integrating data processing, model training, hyperparameter tuning, and deployment in a seamless manner. It provides built-in versioning, lineage tracking, and support for continuous integration and delivery (CI/CD), making it the best choice for this use case. via - https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html

Answer 20

AWS Step Functions is a powerful service for orchestrating workflows, and it can integrate with SageMaker and Lambda. However, using Step Functions for the entire ML workflow adds complexity since it requires coordinating multiple services, whereas SageMaker Pipelines provides a more seamless, integrated solution for ML-specific workflows.

Answer 21

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. https://aws.amazon.com/glue/

Answer 22

Use Amazon SageMaker Managed Spot Training to dynamically allocate Spot Instances for the training job, automatically retrying any interrupted instances via checkpoints.

Managed Spot Training uses Amazon EC2 Spot Instances to run training jobs instead of On-Demand Instances. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long SageMaker waits for a job to run using Amazon EC2 Spot Instances. Spot Instances can be interrupted, causing jobs to take longer to start or finish. You can configure your managed spot training job to use checkpoints. SageMaker copies checkpoint data from a local path to Amazon S3. When the job is restarted, SageMaker copies the data from Amazon S3 back into the local path. The training job can then resume from the last checkpoint instead of restarting. via - https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/
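The checkpoint-and-resume behaviour can be sketched with a toy training loop. Nothing here calls SageMaker: `checkpoint_dir` is a hypothetical stand-in for the local path that SageMaker syncs with Amazon S3, the JSON file format is an assumption, and the "training" simply halves a loss value each epoch.

```python
import json
import os
import tempfile

def train(total_epochs, checkpoint_dir, interrupt_at=None):
    """Toy training loop that resumes from the last checkpoint in checkpoint_dir."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    state = {"epoch": 0, "loss": 1.0}
    if os.path.exists(path):                 # resume instead of restarting
        with open(path) as f:
            state = json.load(f)
    for epoch in range(state["epoch"], total_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            raise RuntimeError("simulated Spot interruption")
        state = {"epoch": epoch + 1, "loss": state["loss"] * 0.5}
        with open(path, "w") as f:           # checkpoint after every epoch
            json.dump(state, f)
    return state

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        try:
            train(5, d, interrupt_at=3)      # first attempt is interrupted
        except RuntimeError:
            pass
        print(train(5, d))                   # second attempt resumes at epoch 3
```

The second call only runs the remaining epochs, which is the saving that checkpoints provide when a Spot Instance is reclaimed mid-job.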

Answer 23

Deploy the real-time recommendation model using Amazon SageMaker endpoints to ensure low-latency, high-availability, and managed infrastructure for real-time inference.

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements. You can deploy your model to SageMaker hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support autoscaling. This makes it an ideal choice for the recommendation model, which must provide fast responses to user interactions with minimal downtime.

Answer 24

Deploy the generative AI model using Amazon Elastic Kubernetes Service (Amazon EKS) to leverage containerized microservices for high scalability and control over the deployment environment.

Amazon EKS is designed for containerized applications that need high scalability and flexibility. It is suitable for the generative AI model, which may require complex orchestration and scaling in response to varying demand, while giving you full control over the deployment environment. via - https://aws.amazon.com/blogs/containers/deploy-generative-ai-models-on-amazon-eks/

Answer 25

While AWS Lambda is excellent for serverless applications, it may not be the best choice for a fraud detection model if it requires continuous, low-latency processing or needs to handle very high throughput. Lambda is better suited for lightweight, event-driven tasks rather than long-running, complex inference jobs.

Answer 26

Amazon ECS is a good choice for containerized workloads but is generally more appropriate for batch processing or large-scale, stateless applications. It might not provide the low-latency and real-time capabilities needed for the recommendation model.

Answer 27

Use Amazon SageMaker’s multi-model endpoint to deploy multiple models on a single instance, reducing costs by sharing resources.

Amazon SageMaker’s multi-model endpoint allows you to deploy multiple models on a single instance. This can significantly reduce costs by sharing resources among models, but it may introduce slight increases in latency due to the need to load the correct model into memory. This tradeoff can be acceptable if cost savings are a priority and latency requirements are not ultra-strict. via - https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html
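The latency tradeoff comes from loading a model into memory the first time it is requested. The sketch below models that with a small in-memory cache. The least-recently-used eviction policy, the `ModelCache` class, and the `loader` callback are assumptions for illustration and do not describe SageMaker's internals.

```python
from collections import OrderedDict

class ModelCache:
    """Keeps at most `capacity` models in memory and loads the rest on demand."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical callable: model name -> model object
        self.models = OrderedDict()   # ordered from least to most recently used
        self.cold_loads = 0           # each cold load adds latency to that request

    def get(self, name):
        if name in self.models:
            self.models.move_to_end(name)    # warm hit: mark as most recently used
            return self.models[name]
        self.cold_loads += 1
        model = self.loader(name)            # slow path: fetch and load the model
        self.models[name] = model
        if len(self.models) > self.capacity:
            self.models.popitem(last=False)  # evict the least recently used model
        return model
```

Requests for models already in memory return immediately, while requests that trigger a load pay the extra latency. A larger `capacity` reduces cold loads at the cost of more memory per instance.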

Answer 28

Implement auto-scaling on a fleet of medium-sized instances, allowing the system to adjust resources based on real-time demand, balancing cost and performance dynamically.

Auto-scaling allows you to dynamically adjust the number of instances based on demand, which helps balance performance and cost. During peak times, more instances can be provisioned to maintain low latency, while during off-peak times, fewer instances are used, reducing costs. This strategy offers a flexible way to manage the tradeoffs between performance, cost, and latency. https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
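One way to picture how scaling follows demand is a simplified proportional rule: scale the fleet so that the per-instance load returns to a target value, within minimum and maximum bounds. This is a sketch of the idea only. Real target-tracking policies also apply cooldowns and other safeguards, and the function name, parameters, and default bounds are hypothetical.

```python
import math

def desired_instances(current, metric_per_instance, target_per_instance,
                      min_n=1, max_n=10):
    """Instance count that would bring the per-instance metric back to the target."""
    raw = math.ceil(current * metric_per_instance / target_per_instance)
    return max(min_n, min(max_n, raw))  # clamp to the configured fleet bounds
```

For example, two instances each handling 150 invocations per minute against a target of 100 would scale out to three, while four lightly loaded instances at 20 per minute would scale in to the minimum of one.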

Answer 29

While Amazon SageMaker Neo can optimize models for deployment on edge devices, it is not the best fit for this scenario. Neo is more suitable for low-latency, cost-effective deployments on devices with limited resources. In this scenario, the need for scalable, cloud-based infrastructure is more important.

Answer 30

The Linear Learner algorithm can handle classification tasks, and weighting classes can help with imbalance. However, it may not be as effective in capturing complex patterns in the data as more sophisticated algorithms like XGBoost. Linear Learner could be used for classification tasks, but predicting maintenance needs often involves detecting subtle anomalies rather than simple classification. Additionally, a binary classification model might not capture the complex patterns associated with potential failures.

Answer 31

Random Cut Forest (RCF) is designed for anomaly detection, which can be relevant for fraud detection. However, RCF is unsupervised and may not leverage the labeled data effectively, leading to suboptimal results in a supervised classification task like this.

Random Cut Forest (RCF) Algorithm to detect anomalies in sensor data that may indicate impending failures.

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data. Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points. They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the "regular" data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the "regular" data can often be described with a simple model.

Random Cut Forest (RCF) is specifically designed for detecting anomalies in data. This algorithm excels at identifying unexpected patterns in sensor data that could indicate the early stages of equipment failure. It’s particularly well-suited for scenarios where you need to react to unusual behaviors in near-real-time. via - https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

Answer 32

K-Nearest Neighbors (k-NN) can classify based on similarity, but it does not scale well with large datasets and may struggle with the high-dimensional, imbalanced nature of the data in this context.

Answer 33

Conduct exploratory data analysis (EDA) to understand the data distribution, address missing values, and assess the class imbalance before determining if an ML solution is feasible.

Conducting exploratory data analysis (EDA) is the most appropriate first step. EDA allows you to understand the data distribution, identify and address missing values, and assess the extent of the class imbalance. This process helps determine whether the available data is sufficient to build a reliable model and what preprocessing steps might be necessary. via - https://aws.amazon.com/blogs/machine-learning/exploratory-data-analysis-feature-engineering-and-operationalizing-your-data-flow-into-your-ml-pipeline-with-amazon-sagemaker-data-wrangler/

Answer 34

https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html

Answer 35

DeepAR is designed for forecasting future time series data, which could be useful for predicting future equipment behavior. However, it is not primarily used for anomaly detection, which is critical for identifying unusual patterns that precede failures.

Answer 36

Time Series K-Means can cluster similar time series patterns, but clustering alone does not provide the precision needed for real-time anomaly detection, which is crucial for predictive maintenance.

Answer 37

Bayesian Optimization is a technique based on Bayes’ theorem, which describes the probability of an event occurring related to current knowledge. When this is applied to hyperparameter optimization, the algorithm builds a probabilistic model from a set of hyperparameters that optimizes a specific metric. It uses regression analysis to iteratively choose the best set of hyperparameters. Bayesian Optimization is more efficient than Random Search for hyperparameter tuning, especially when dealing with complex models and large hyperparameter spaces. It learns from previous trials to predict the best set of hyperparameters, thus focusing the search more effectively. Narrowing the range of critical hyperparameters can further improve the chances of finding the optimal values, leading to better model convergence and performance. https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html

Answer 38

Random Search selects groups of hyperparameters randomly on each iteration. It works well when a relatively small number of the hyperparameters primarily determine the model outcome.

Answer 39

Grid Search works well, but it’s relatively tedious and computationally intensive, especially with large numbers of hyperparameters. It is less efficient than Bayesian Optimization for complex models. A wide range of hyperparameters without focus would result in more trials, but it is not guaranteed to find the best values, especially with a larger search space.
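To make the cost difference concrete, the sketch below runs an exhaustive grid search and a fixed-budget random search over the same small search space. The search space, the toy objective, and the function names are made up for illustration. Lower scores are treated as better.

```python
import itertools
import random

def grid_search(space, objective):
    """Evaluate every combination in the search space; returns ((score, params), trials)."""
    names = list(space)
    best, trials = None, 0
    for combo in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, combo))
        score = objective(params)
        trials += 1
        if best is None or score < best[0]:
            best = (score, params)
    return best, trials

def random_search(space, objective, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations; returns ((score, params), trials)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {n: rng.choice(values) for n, values in space.items()}
        score = objective(params)
        if best is None or score < best[0]:
            best = (score, params)
    return best, n_trials
```

With three learning rates and four depths, grid search always spends 12 trials, and the count multiplies with every added hyperparameter. Random search spends only the budget it is given but has no guarantee of hitting the best combination.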

Answer 40

Use Amazon SageMaker Serverless Inference, which minimizes costs during low-traffic periods while managing large infrequent spikes of requests efficiently. via - https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-deployment.html

Serverless Inference is designed to automatically scale the compute resources based on incoming requests, making it highly efficient for handling varying levels of traffic. It is cost-effective because you only pay for the compute time used when requests are being processed. This makes it an excellent choice for scenarios where traffic is unpredictable, with periods of low or no traffic. It is ideal for workloads that have idle periods between traffic spikes and can tolerate cold starts. via - https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html

Answer 41

Asynchronous Inference is ideal for handling large and long-running inference requests that do not require an immediate response. However, it may not be as cost-effective for handling fluctuating traffic where immediate scaling and low-latency are priorities.

Answer 42

Real-time inference is ideal for inference workloads where you have real-time, interactive, low latency requirements.

Answer 43

To get predictions for an entire dataset, you can use Batch transform with Amazon SageMaker.

Answer 44

Use Amazon SageMaker Pipelines to orchestrate the entire ML workflow, leveraging its built-in integration with SageMaker features like training, tuning, and deployment.

Amazon SageMaker Pipelines is a purpose-built workflow orchestration service to automate machine learning (ML) development. SageMaker Pipelines is specifically designed for orchestrating ML workflows. It provides native integration with SageMaker features like model training, tuning, and deployment. It also supports versioning, lineage tracking, and automatic execution of workflows, making it the ideal choice for managing end-to-end ML workflows in AWS. via - https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html

Answer 45

Airflow is a powerful orchestration tool that allows you to define complex workflows using custom DAGs. However, it requires significant setup and maintenance, and while it can integrate with AWS services, it does not provide the seamless, built-in integration with SageMaker that SageMaker Pipelines offers. Amazon Managed Workflows for Apache Airflow (Amazon MWAA): via - https://aws.amazon.com/managed-workflows-for-apache-airflow/

Answer 46

AWS Step Functions is a serverless orchestration service that can integrate with SageMaker and other AWS services. However, it is more general-purpose and lacks some of the ML-specific features, such as model lineage tracking and hyperparameter tuning, that are built into SageMaker Pipelines.

Answer 47

AWS Lambda is useful for triggering specific tasks, but manually managing each step of a complex ML workflow without a comprehensive orchestration tool is not scalable or maintainable. It does not provide the task dependency management, monitoring, and versioning required for an end-to-end ML workflow.

Answer 48

Using on-demand instances for training offers flexibility, allowing you to allocate resources only when needed, which is ideal for sporadic training jobs. For inference, provisioned resources with auto-scaling ensure that the system can handle varying traffic while controlling costs, as it can scale down during periods of low demand. via - https://aws.amazon.com/ec2/pricing/

Answer 49

Architecting a defense-in-depth security approach involves implementing multiple layers of security to protect generative AI applications. This includes input validation to prevent malicious data inputs, strict access controls to limit who can interact with the AI models, and continuous monitoring to detect and respond to security incidents. These measures can help address common vulnerabilities and meet the best practices for securing generative AI applications on AWS.

Answer 50

Select the Pipe input mode to stream the data directly from Amazon S3 to the training instances, allowing the model to start processing data immediately without requiring local storage for the entire dataset.

In pipe mode, data is pre-fetched from Amazon S3 at high concurrency and throughput, and streamed into a named pipe, which is also known as a First-In-First-Out (FIFO) pipe for its behavior. Each pipe may only be read by a single process. Pipe input mode is designed for large datasets, allowing data to be streamed directly from Amazon S3 into the training instances. This minimizes disk usage and allows training to begin immediately as the data streams in, making it ideal for your scenario where high throughput and efficiency are critical. via - https://docs.aws.amazon.com/sagemaker/latest/dg/model-access-training-data.html

Answer 51

The File input mode downloads the entire dataset to the training instance before starting the training job.

Answer 52

FastFile mode is useful for scenarios where you need rapid access to data with low latency, but it is best suited for workloads with many small files. You should note that FastFile mode can be used only while accessing data from Amazon S3 and not with Amazon FSx for Lustre.

Answer 53

Use Amazon SageMaker for both training and deployment, leverage auto-scaling endpoints for real-time inference, and apply SageMaker Pipelines for orchestrating end-to-end ML workflows, ensuring scalability and automation.

Amazon SageMaker provides a managed service for both training and deployment, which simplifies the infrastructure and reduces operational overhead. Auto-scaling endpoints in SageMaker ensure the system can handle increasing demand without manual intervention. SageMaker Pipelines automates the entire ML workflow, enabling continuous integration and delivery (CI/CD) practices, making the infrastructure scalable, maintainable, and cost-effective.

Answer 54

Use Amazon SageMaker Debugger to debug and improve model performance by addressing underlying problems such as overfitting, saturated activation functions, and vanishing gradients.

A machine learning (ML) training job can have problems such as overfitting, saturated activation functions, and vanishing gradients, which can compromise model performance. SageMaker Debugger provides tools to debug training jobs and resolve such problems to improve the performance of your model. Debugger also offers tools to send alerts when training anomalies are found, take actions against the problems, and identify the root cause of them by visualizing collected metrics and tensors. via - https://docs.aws.amazon.com/sagemaker/latest/dg/train-debugger.html

Answer 55

Focus on feature engineering by creating domain-specific features and use SageMaker Clarify to evaluate feature importance.

Feature engineering is one of the most effective ways to boost model performance, particularly in domain-specific applications like credit risk modeling. By creating more informative features, you can provide the model with better signals for prediction. SageMaker Clarify can be used to evaluate feature importance, helping you identify the most impactful features and further refine the model. via - https://aws.amazon.com/sagemaker/clarify/

(79 cards)