Data Engineering Practice Exam 1 Flashcards
Confusion matrix
A table that illustrates the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class
Precision
The proportion of true positive predictions among all positive predictions
Recall
The proportion of actual positives that are correctly identified
Accuracy
The percentage of correct predictions out of all predictions made by the model
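Worked example for the four cards above (confusion matrix, precision, recall, accuracy). This is a minimal sketch for a binary problem; the function names and the sample labels are made up for illustration and do not come from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn): the four cells of a binary confusion matrix."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision(tp, fp):
    """True positives among all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives among all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    """Correct predictions among all predictions."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical labels, chosen so the three metrics differ.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print("confusion matrix cells:", tp, fp, fn, tn)
print("precision:", precision(tp, fp))
print("recall:", recall(tp, fn))
print("accuracy:", accuracy(tp, fp, fn, tn))
```

With these labels the model finds only half of the actual positives (recall 0.5) even though most of its predictions are correct (accuracy 0.625), which is why the metrics are studied separately.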
Root mean squared error (RMSE)
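A regression metric that measures the typical magnitude of the errors between predicted and actual values, computed as the square root of the mean of the squared errors; larger errors are penalized more heavily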
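Worked example for RMSE. A minimal sketch using only the standard library; the function name and the sample values are illustrative assumptions.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: the errors are 1, 0, -2, 0, so the mean squared error is 1.25.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 7.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.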
AUC-ROC curve
A tool to evaluate a model’s ability to distinguish between classes across various thresholds, particularly useful when class imbalance makes accuracy misleading
Blue/green deployment
A strategy that deploys a new version of a model in parallel with the existing one, then gradually shifts traffic to the new version while monitoring its performance
Canary release
A deployment strategy that initially routes only a small percentage of traffic to a new model version
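Worked example for canary routing. This is a generic sketch of percentage-based traffic splitting, not how any AWS service implements it; the `route` function, the request ID format, and the 10% setting are all assumptions made for illustration.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send roughly `canary_percent` percent of requests to the new version.

    Hashing the request ID makes the decision deterministic, so the same
    ID always lands on the same version.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical request IDs; about 10% should reach the canary version.
request_ids = [f"req-{i}" for i in range(10000)]
canary_hits = sum(route(r, 10) == "canary" for r in request_ids)
print("requests sent to canary:", canary_hits)
```

Raising `canary_percent` step by step toward 100 turns the canary into a gradual rollout; setting it back to 0 is the rollback.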
Amazon SageMaker Pipelines
A purpose-built workflow orchestration service to automate machine learning (ML) development
Amazon SageMaker Data Wrangler
A service that reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes
Pipe input mode
A data streaming method where data is pre-fetched from Amazon S3 at high concurrency and throughput, and streamed into a named pipe
File input mode
A data input method that downloads the entire dataset to the training instance before starting the training job
FastFile mode
A data access method for scenarios where rapid access to data with low latency is needed; best suited for workloads with many small files
Amazon SageMaker Serverless Inference
A deployment option that automatically scales compute resources based on incoming requests and is cost-effective for workloads with idle periods between traffic spikes
Amazon Bedrock
A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API
SageMaker JumpStart
A machine learning (ML) hub that provides managed infrastructure and tools to accelerate scalable, reliable, and secure model building, training, and deployment of ML models
XGBoost
An efficient implementation of the gradient boosted trees algorithm that performs well across a variety of data types, relationships, and distributions
Random Cut Forest (RCF)
An unsupervised algorithm for detecting anomalous data points within a dataset
Amazon SageMaker Feature Store
A fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models
Script mode
A SageMaker feature that enables writing custom training and inference code while still utilizing common ML framework containers maintained by AWS
Stacking
An ensemble learning technique where predictions from several base models are used as inputs to a meta-model to make the final prediction
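Worked example for stacking. This is a toy sketch: the three base models are hypothetical threshold rules, and the meta-model is a simple lookup table learned from their predictions. In practice the base models are trained models and the meta-model is usually fit on out-of-fold predictions.

```python
from collections import Counter, defaultdict


# Hypothetical base models: each is a fixed threshold rule on the features.
def base_a(x):
    return int(x[0] > 0.5)


def base_b(x):
    return int(x[1] > 0.5)


def base_c(x):
    return int(x[0] + x[1] > 1.2)


BASE_MODELS = [base_a, base_b, base_c]


def base_predictions(x):
    """The meta-model's input: one prediction per base model."""
    return tuple(model(x) for model in BASE_MODELS)


def fit_meta(X, y):
    """Learn, for each pattern of base predictions, the most common true label."""
    votes = defaultdict(Counter)
    for features, label in zip(X, y):
        votes[base_predictions(features)][label] += 1
    return {pattern: counts.most_common(1)[0][0] for pattern, counts in votes.items()}


def predict(meta, x, default=0):
    """Final prediction comes from the meta-model, not from any single base model."""
    return meta.get(base_predictions(x), default)


# Hypothetical training data: the label is 1 only when both features exceed 0.5.
X_train = [(0.9, 0.8), (0.7, 0.9), (0.9, 0.2), (0.1, 0.8), (0.2, 0.1), (0.6, 0.7)]
y_train = [1, 1, 0, 0, 0, 1]

meta = fit_meta(X_train, y_train)
print(predict(meta, (0.8, 0.9)))
print(predict(meta, (0.8, 0.3)))
```

Here `base_a` alone would label `(0.8, 0.3)` as positive, but the meta-model has learned that this pattern of base predictions corresponds to a negative label.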
Area Under the (Receiver Operating Characteristic) Curve (AUC)
An industry-standard accuracy metric for binary classification models that measures the ability of the model to predict a higher score for positive examples compared to negative examples
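Worked example for AUC. The definition above can be read literally: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below computes it that way; the function name and the sample scores are illustrative assumptions, and the pairwise loop is meant for small inputs only.

```python
def auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores: one positive (0.35) is ranked below one negative (0.4).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
print(auc(labels, scores))  # 5 of 6 pairs are ordered correctly
```

A value of 1.0 means every positive outscores every negative; 0.5 is what random scores would produce on average.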
Amazon SageMaker Model Registry
A service for cataloging, managing versions, and tracking metadata of models, as well as managing approval status and deployment of models
Bayesian Optimization
A technique based on Bayes’ theorem for hyperparameter optimization that builds a probabilistic model from a set of hyperparameters to optimize a specific metric
Data drift
Changes in the distribution of the input data over time
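Worked example for data drift. One common way to quantify a change in distribution for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical cumulative distribution functions. This is a sketch of that one technique, chosen for illustration; it is not a claim about which statistic any particular monitoring service uses, and the sample values are made up.

```python
from bisect import bisect_right


def ks_statistic(baseline, current):
    """Largest absolute gap between the two empirical CDFs (0 = same, 1 = disjoint)."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(current)

    def ecdf(sample, x):
        # Fraction of the sample that is <= x.
        return bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical feature values at training time and in production.
training = [1, 2, 3, 4, 5]
production_same = [1, 2, 3, 4, 5]
production_shifted = [3, 4, 5, 6, 7]

print(ks_statistic(training, production_same))
print(ks_statistic(training, production_shifted))
```

A drift check would compare the statistic against a threshold chosen for the feature and raise an alert when it is exceeded.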
Model drift
Degradation in model performance because its assumptions or parameters no longer align with the real-world data
Amazon SageMaker Model Monitor
A service for detecting data drift by tracking changes in data distribution
Conditional Demographic Disparity (CDD)
A metric that measures the difference in positive prediction rates between demographic groups, while conditioning on relevant features
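Worked example for CDD. The sketch below assumes the formulation used in the SageMaker Clarify documentation: within each subgroup, demographic disparity (DD) is the facet's share of rejected outcomes minus its share of accepted outcomes, and CDD is the average of the subgroup DD values weighted by subgroup size. The function names, the record layout, and the sample data are hypothetical, and treating a subgroup with no rejected or no accepted outcomes as DD = 0 is an assumption of this sketch.

```python
from collections import defaultdict


def demographic_disparity(records, facet):
    """records: list of (group, accepted). Share of rejections minus share of acceptances."""
    rejected = [group for group, accepted in records if not accepted]
    accepted = [group for group, accepted in records if accepted]
    if not rejected or not accepted:
        return 0.0  # assumption: an undefined DD contributes nothing
    return rejected.count(facet) / len(rejected) - accepted.count(facet) / len(accepted)


def conditional_demographic_disparity(rows, facet):
    """rows: list of (subgroup, group, accepted). Size-weighted average of per-subgroup DD."""
    by_subgroup = defaultdict(list)
    for subgroup, group, accepted in rows:
        by_subgroup[subgroup].append((group, accepted))
    total = len(rows)
    return sum(
        len(records) * demographic_disparity(records, facet)
        for records in by_subgroup.values()
    ) / total


# Hypothetical data: subgroup "x" is balanced, subgroup "y" rejects only group "d".
rows = [
    ("x", "d", False), ("x", "d", True), ("x", "other", False), ("x", "other", True),
    ("y", "d", False), ("y", "d", False), ("y", "other", True), ("y", "other", True),
]
print(conditional_demographic_disparity(rows, "d"))
```

Subgroup "x" has DD = 0 and subgroup "y" has DD = 1, so with equal subgroup sizes the weighted average is 0.5.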
Exploratory Data Analysis (EDA)
A process to understand data distribution, identify and address missing values, and assess the extent of class imbalance
Amazon SageMaker Debugger
A tool that provides debugging capabilities for training jobs, addressing problems such as overfitting, saturated activation functions, and vanishing gradients
AWS CloudFormation with nested stacks
A method to modularize infrastructure, making it easier to manage and reuse components while enabling communication between stacks