Data Engineering Practice Exam 1 Flashcards by Yitzchak Meirovich

Confusion matrix

A table that illustrates the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class

Precision

The proportion of true positive predictions among all positive predictions

Recall

The proportion of actual positives that are correctly identified

Accuracy

The percentage of correct predictions out of all predictions made by the model
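A small worked example may help connect the confusion matrix to precision, recall, and accuracy. The sketch below assumes a binary problem with made-up labels; the function names and sample data are illustrative and not taken from any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # predicted positive, actually positive
        elif pred == positive:
            fp += 1  # predicted positive, actually negative
        elif truth == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision(tp, fp):
    """True positives among all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """True positives among all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp, fp, fn, tn):
    """Correct predictions among all predictions."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


# Hypothetical labels, chosen only to make the three metrics differ.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 3 1 2 4
print(precision(tp, fp), recall(tp, fn), accuracy(tp, fp, fn, tn))
```

With these labels the model is right about most of its positive calls but misses two of the five actual positives, which is why recall comes out lower than precision.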

Root mean squared error (RMSE)

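A regression metric equal to the square root of the mean squared difference between predicted and actual values; it summarizes the typical size of the errors and penalizes large errors more heavily than small ones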
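As a quick illustration of RMSE, the following sketch computes it from two equal-length lists. The function name and the sample values are assumptions made for the example.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean of squared prediction errors."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: squared errors are 1, 0, 4, 1, so the mean is 1.5.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as each error of 1.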

AUC-ROC curve

A tool to evaluate a model’s ability to distinguish between classes across various thresholds, particularly useful in the presence of class imbalance

Blue/green deployment

A strategy that deploys a new version of a model in parallel with the existing one, then shifts traffic to the new version (all at once or in stages) while monitoring its performance

Canary release

A deployment strategy where a small percentage of traffic is redirected to a new model version initially
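The idea of sending a small share of traffic to a new version can be sketched with a deterministic router. This is only an illustration of the concept, not how any AWS service implements it; the `route` function, the request IDs, and the "canary"/"stable" labels are all hypothetical.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of requests to the new model version.

    Hashing the request ID keeps the decision stable, so the same
    ID is always routed to the same version.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical traffic: 10,000 request IDs with a 10% canary share.
decisions = [route(f"request-{i}", 10) for i in range(10_000)]
print(decisions.count("canary") / len(decisions))
```

Raising `canary_percent` in steps while watching error rates and latency mirrors the staged rollout that a canary release is meant to provide.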

Amazon SageMaker Pipelines

A purpose-built workflow orchestration service to automate machine learning (ML) development

Amazon SageMaker Data Wrangler

A service that reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes

Pipe input mode

A data streaming method where data is pre-fetched from Amazon S3 at high concurrency and throughput and streamed into a named pipe

File input mode

A data input method that downloads the entire dataset to the training instance before starting the training job

FastFile mode

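A data access method that streams data from Amazon S3 on demand while exposing it through a file system interface, so training can start without first downloading the entire dataset; generally better suited to sequential reads of larger files than to workloads with many small files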
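The three input modes are selected per input channel when a training job is defined. The sketch below builds one channel entry in the shape used by the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API; the helper name `training_channel`, the channel name, and the bucket URI are hypothetical, and no AWS call is made.

```python
ALLOWED_INPUT_MODES = {"File", "Pipe", "FastFile"}


def training_channel(s3_uri: str, input_mode: str = "File") -> dict:
    """Build one InputDataConfig channel entry for a training job request."""
    if input_mode not in ALLOWED_INPUT_MODES:
        raise ValueError(f"unsupported input mode: {input_mode}")
    return {
        "ChannelName": "train",  # illustrative channel name
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        # File: download everything first; Pipe: stream through a named pipe;
        # FastFile: stream on demand behind a file system interface.
        "InputMode": input_mode,
    }


# Hypothetical bucket and prefix.
channel = training_channel("s3://example-bucket/train/", "FastFile")
print(channel["InputMode"])
```

Only the `InputMode` value changes between the three modes; the training code may still need to read from a pipe rather than from files when Pipe mode is chosen.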

Amazon SageMaker Serverless Inference

A deployment option that automatically scales compute resources based on incoming requests; cost-effective for workloads with idle periods between traffic spikes

Amazon Bedrock

A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API

SageMaker JumpStart

A machine learning (ML) hub that provides managed infrastructure and tools to accelerate the scalable, reliable, and secure building, training, and deployment of ML models

XGBoost

An efficient implementation of the gradient boosted trees algorithm that performs well across a variety of data types, relationships, and distributions

Random Cut Forest (RCF)

An unsupervised algorithm for detecting anomalous data points within a data set

Amazon SageMaker Feature Store

A fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models

Script mode

A SageMaker feature that enables writing custom training and inference code while still utilizing common ML framework containers maintained by AWS

Stacking

An ensemble learning technique where predictions from several base models are used as inputs to a meta-model to make the final prediction

Area Under the (Receiver Operating Characteristic) Curve (AUC)

An industry-standard evaluation metric for binary classification models that measures the ability of the model to predict a higher score for positive examples than for negative examples
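The "higher score for positives than negatives" reading of AUC can be computed directly by comparing every positive/negative pair. The sketch below does exactly that, counting ties as half a win; it is quadratic in the number of examples and intended only as an illustration, with made-up labels and scores.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly by score."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive scored above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical data: one positive (0.6) is scored below one negative (0.7).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 8 of 9 pairs are ordered correctly
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scoring would be expected to give.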

Amazon SageMaker Model Registry

A service for cataloging models, managing model versions, and tracking model metadata, as well as managing the approval status and deployment of models

Bayesian Optimization

A technique based on Bayes’ theorem for hyperparameter optimization that builds a probabilistic model of how hyperparameter values affect a specific metric and uses it to choose which values to try next

Data drift

Changes in the distribution of the input data over time
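One simple way to quantify a change in a numeric feature's distribution is the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical cumulative distributions. The sketch below is a plain-Python illustration with invented data; it is an assumption chosen for the example, not a description of what any particular monitoring service computes.

```python
def ks_statistic(baseline, current):
    """Largest absolute gap between the two empirical CDFs (0.0 to 1.0)."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    for x in sorted(set(baseline) | set(current)):
        # Share of each sample that is less than or equal to x.
        cdf_baseline = sum(v <= x for v in baseline) / len(baseline)
        cdf_current = sum(v <= x for v in current) / len(current)
        max_gap = max(max_gap, abs(cdf_baseline - cdf_current))
    return max_gap


# Hypothetical feature values at training time and in production.
training_values = [1, 2, 3, 4, 5]
production_values = [4, 5, 6, 7, 8]

print(ks_statistic(training_values, training_values))    # 0.0, no drift
print(ks_statistic(training_values, production_values))  # larger gap, shifted data
```

A monitoring job could compare this value against a threshold chosen for the feature and raise an alert when it is exceeded.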

Model drift

Degradation in model performance because its assumptions or parameters no longer align with the real-world data

Amazon SageMaker Model Monitor

A service for detecting data drift by tracking changes in data distribution

Conditional Demographic Disparity (CDD)

A metric that measures the difference in positive prediction rates between demographic groups, while conditioning on relevant features
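A worked sketch may make the conditioning step concrete. It assumes one common formulation: within each subgroup, demographic disparity is the facet's share of rejected outcomes minus its share of accepted outcomes, and CDD is the average of those values weighted by subgroup size. The record layout, the function names, and the choice to treat a subgroup with no rejections or no acceptances as zero disparity are assumptions made for this example.

```python
from collections import defaultdict


def demographic_disparity(rows):
    """rows: list of (in_facet, accepted) booleans for one subgroup."""
    rejected = [in_facet for in_facet, accepted in rows if not accepted]
    approved = [in_facet for in_facet, accepted in rows if accepted]
    if not rejected or not approved:
        return 0.0  # assumption: undefined disparity is treated as zero
    # Facet's share of rejections minus facet's share of acceptances.
    return sum(rejected) / len(rejected) - sum(approved) / len(approved)


def conditional_demographic_disparity(records):
    """records: list of (subgroup, in_facet, accepted) tuples."""
    by_subgroup = defaultdict(list)
    for subgroup, in_facet, accepted in records:
        by_subgroup[subgroup].append((in_facet, accepted))
    weighted = sum(len(rows) * demographic_disparity(rows)
                   for rows in by_subgroup.values())
    return weighted / len(records)


# Hypothetical records: subgroup "A" is balanced, subgroup "B" is not.
records = [
    ("A", True, False), ("A", True, True),
    ("A", False, True), ("A", False, False),
    ("B", True, False), ("B", True, False),
    ("B", False, True), ("B", False, True),
]
print(conditional_demographic_disparity(records))  # 0.5
```

Subgroup "A" contributes a disparity of 0 and subgroup "B" a disparity of 1, so the size-weighted average over eight records is 0.5.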

Exploratory Data Analysis (EDA)

A process to understand data distribution, identify and address missing values, and assess the extent of class imbalance

Amazon SageMaker Debugger

A tool that provides debugging capabilities for training jobs, detecting problems such as overfitting, saturated activation functions, and vanishing gradients

AWS CloudFormation with nested stacks

A method to modularize infrastructure, making it easier to manage and reuse components while enabling communication between stacks

(31 cards)