AWS ML Associate Flashcards
Performance metric: Measure the imbalance of positive outcomes between different facet values.
Difference in proportions of labels (DPL)
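Worked example (a minimal sketch, not SageMaker Clarify's implementation): DPL can be read as q_a - q_d, where q_a and q_d are the shares of positive labels in two facet groups. The function name, the toy labels, and the facet values "a" and "d" are assumptions made for illustration.

```python
def difference_in_proportions_of_labels(labels, facets, facet_a, facet_d, positive=1):
    """Return q_a - q_d, where q_x is the share of positive labels in facet x."""
    def positive_rate(facet_value):
        group = [y for y, f in zip(labels, facets) if f == facet_value]
        if not group:
            raise ValueError(f"no rows for facet value {facet_value!r}")
        return sum(1 for y in group if y == positive) / len(group)

    return positive_rate(facet_a) - positive_rate(facet_d)


# Toy data (hypothetical): 3 of 4 positives in facet "a", 1 of 4 in facet "d".
labels = [1, 1, 1, 0, 1, 0, 0, 0]
facets = ["a", "a", "a", "a", "d", "d", "d", "d"]

dpl = difference_in_proportions_of_labels(labels, facets, "a", "d")
print(dpl)  # 0.75 - 0.25 = 0.5
```

A value near 0 suggests balanced labels across the two facets; values toward +1 or -1 suggest one facet receives positive outcomes far more often.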
Performance metric: Identify the difference in the predicted outcome as an input feature changes.
Partial dependence plots (PDPs)
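Worked example (a rough sketch under stated assumptions): the numbers behind a PDP can be computed by fixing one feature to each value on a grid, leaving the other features as observed, and averaging the model's predictions. The toy model, data rows, and function names below are hypothetical.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)          # copy so the original data is untouched
            modified[feature_index] = value
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_model(row):
    # Hypothetical model: prediction = 2 * x0 + x1
    return 2.0 * row[0] + row[1]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pdp = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(pdp)  # [3.0, 5.0, 7.0]
```

Plotting the grid values against the returned averages gives the PDP curve; here it rises by 2 per unit of x0, matching the toy model's coefficient.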
Performance metric: Quantify the contribution of each feature in a prediction.
Shapley values
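Worked example (an illustrative sketch, assuming a tiny model): for a handful of features, Shapley values can be computed exactly by averaging each feature's marginal contribution over every ordering in which features are switched from a baseline to the instance's values. This brute-force enumeration is for intuition only and is not how a production tool would compute them; the toy model and inputs are hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]      # switch feature i from baseline to its real value
            value = predict(current)
            contributions[i] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


def toy_model(x):
    # Hypothetical model with an interaction between x0 and x2
    return 3.0 * x[0] + 2.0 * x[1] + x[0] * x[2]


instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
print(phi)  # [5.0, 4.0, 2.0]
```

The contributions sum to the difference between the prediction for the instance (11.0) and the prediction for the baseline (0.0), and the interaction term is split evenly between x0 and x2.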
What should you use for data processing if it involves TensorFlow or PyTorch?
SageMaker
What is the simplest way to prevent internet and data access to inference containers?
SageMaker network isolation mode
Create a baseline to monitor feature attribution drift in a SageMaker model. For instance, you want it to weigh personal income over credit history for loan approval. How do you do this?
Create a SHAP baseline using the ‘ModelExplainabilityMonitor’ class. This generates a feature attribution baseline, and the monitor reports a violation when the observed feature attributions drift away from it.
Tool used to check for bias and explainability in datasets and models
SageMaker Clarify
Used to visualize and analyze intermediate tensors, for example to identify specific poor classifications in a CNN and make adjustments to improve model performance.
SageMaker with TensorBoard
How do you strip PII from text-based user interactions?
Amazon Comprehend
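Worked example (a hedged sketch, not a production redactor): Comprehend's `detect_pii_entities` call returns entities with `Type`, `Score`, `BeginOffset`, and `EndOffset`, which can be used to mask the matching spans. The helper names, the score threshold, and the hand-written sample entities are assumptions; the AWS call sits in its own function so the demo runs offline.

```python
def redact_pii(text, entities, min_score=0.5):
    """Replace each detected span with its entity type, e.g. [EMAIL]."""
    redacted = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity.get("Score", 1.0) < min_score:
            continue
        start, end = entity["BeginOffset"], entity["EndOffset"]
        redacted = redacted[:start] + f"[{entity['Type']}]" + redacted[end:]
    return redacted


def detect_and_redact(text, language_code="en"):
    """Call Amazon Comprehend, then redact. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the offline demo below runs without AWS access

    client = boto3.client("comprehend")
    response = client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_pii(text, response["Entities"])


# Offline demo: entities written by hand in the shape Comprehend returns.
sample = "Email jane@example.com or call 555-0100."
email, phone = "jane@example.com", "555-0100"
sample_entities = [
    {"Type": "EMAIL", "Score": 0.99,
     "BeginOffset": sample.find(email), "EndOffset": sample.find(email) + len(email)},
    {"Type": "PHONE", "Score": 0.98,
     "BeginOffset": sample.find(phone), "EndOffset": sample.find(phone) + len(phone)},
]
clean = redact_pii(sample, sample_entities)
print(clean)  # Email [EMAIL] or call [PHONE].
```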
RNN training: Exploding gradients causing a convergence issue. What feature can help address this issue?
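SageMaker Debugger. Its built-in rules (such as exploding_tensor and vanishing_gradient) flag the problem during training so that you can apply a fix such as gradient clipping. SageMaker Training Compiler, by contrast, optimizes DL models to accelerate training by using ML GPU instances more efficiently; it does not address gradient instability.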
What instance types are supported by AWS Neuron SDKs for real-time inference on streaming video?
Inferentia instances (Inf2 family)
What is used to centralize and standardize model documentation?
SageMaker Model Cards
SageMaker Serverless Inference: What is the biggest consideration when deciding whether to use provisioned concurrency?
Low latency (avoiding cold starts)
(CloudWatch) What feature in the Logs Insights page is helpful in finding infrastructure monitoring through-lines in your query results?
The Patterns tab
What is the primary purpose of Capacity Blocks for machine learning (ML)?
Reserve GPU instances for short-duration machine learning workloads on a future date.
When using an embedded question to query a vector database for RAG, what should be returned?
The full text (not the embeddings) of the nearest-neighbor documents, which is used to augment the query
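Worked example (a toy sketch, assuming an in-memory store rather than a real vector database): the embeddings are used only to rank documents, and what gets returned and placed in the prompt is the stored text. The three-dimensional embeddings, document texts, and function names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_texts(query_embedding, store, k=2):
    """Rank by embedding similarity, but return the documents' text."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:k]]


def build_prompt(question, passages):
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {question}"


# Hypothetical store: each entry keeps the text alongside its embedding.
store = [
    {"text": "Inf2 instances run AWS Neuron workloads.", "embedding": [0.9, 0.1, 0.0]},
    {"text": "A histogram bins ranges of data.", "embedding": [0.0, 0.2, 0.9]},
    {"text": "Capacity Blocks reserve GPU instances.", "embedding": [0.8, 0.3, 0.1]},
]
query_embedding = [1.0, 0.2, 0.0]  # stand-in for the embedded question

passages = retrieve_texts(query_embedding, store, k=2)
prompt = build_prompt("Which instances support Neuron?", passages)
print(prompt)
```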
How can you use SageMaker Model Monitor to re-train your model?
Enable Data Capture, and use that data to retrain the model.
Exploratory data visualization that can be used for relationship analysis, identifying hidden patterns such as an increase in specific item purchases or periods of frequent transactions
Heat Map
Exploratory data visualization that helps with distribution analysis by binning ranges of data
Histogram
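Worked example (a small sketch with made-up numbers): binning assigns each value to a fixed-width range and counts how many values land in each range, which is what a histogram then draws. The function name, bin width, and transaction amounts are assumptions.

```python
def histogram(values, bin_width, start=0):
    """Count values per half-open bin [low, low + bin_width), keyed by each bin's lower edge."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        low = start + index * bin_width
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical transaction amounts
amounts = [3, 7, 12, 15, 18, 21, 22, 38]
bin_width = 10
bins = histogram(amounts, bin_width)

# Text rendering: one row per bin, one '#' per value in that bin.
for low, count in bins.items():
    print(f"{low:>3}-{low + bin_width:<3} {'#' * count}")
```

Here the counts per bin are 2, 3, 2, and 1, so most amounts fall in the 10 to 20 range.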
Which storage service should be used when you need concurrent access from multiple Amazon EC2 instances to a Windows File Server for distributed training of the model?
Amazon FSx for Windows File Server
Best way to improve Kinesis ingestion performance
Batching
What type of variables should be on the x-axis of a bar chart?
Categorical variables
Data format for ingesting streamed, unstructured data
JSON Lines
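Worked example (a minimal sketch): in this format each line is one complete JSON object, so a stream can be appended to and parsed record by record, and records do not need to share a schema. The event fields below are invented for illustration.

```python
import io
import json


def write_json_lines(records, stream):
    """Write one JSON object per line."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


def read_json_lines(stream):
    """Parse records one line at a time, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical events with differing fields
events = [
    {"user": "a1", "action": "click", "tags": ["promo"]},
    {"user": "b2", "action": "purchase", "amount": 19.99},
]

buffer = io.StringIO()  # stands in for a file or a stream
write_json_lines(events, buffer)
buffer.seek(0)
round_tripped = list(read_json_lines(buffer))
print(round_tripped == events)  # True
```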
When to choose FSx for Lustre over EFS
Only when very high performance, high data volumes, and extremely low latency are required. EFS is the go-to for distributed training, and S3 is the go-to for general storage.