Data Engineering Practice Exam 1 Flashcards

1
Q

Confusion matrix

A

A table that illustrates the number or percentage of correct and incorrect predictions for each class by comparing an observation’s predicted class and its true class

2
Q

Precision

A

The proportion of true positive predictions among all positive predictions

3
Q

Recall

A

The proportion of actual positives that are correctly identified

4
Q

Accuracy

A

The percentage of correct predictions out of all predictions made by the model
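
The four cards above (confusion matrix, precision, recall, accuracy) all derive from the same four counts. A minimal sketch in Python, assuming a binary classifier; the function name and the example counts are hypothetical and chosen only for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative,
    and true-negative counts of a binary confusion matrix.
    """
    total = tp + fp + fn + tn
    # Precision: true positives among all positive predictions.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: true positives among all actual positives.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Accuracy: correct predictions among all predictions.
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts, for illustration only.
precision, recall, accuracy = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is 40/50, recall is 40/60, and accuracy is 70/100. Returning 0.0 when a denominator is zero is a convention chosen here, not a standard.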

5
Q

Root mean squared error (RMSE)

A

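A regression metric equal to the square root of the mean of the squared differences between predicted and actual values; because errors are squared before averaging, large errors are penalized more heavily than small ones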
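
RMSE is the square root of the mean squared error. A minimal sketch in Python; the function name and sample values are hypothetical:

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Square each error, average the squares, then take the square root.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values: errors are 1, 0, and -2, so the mean squared error is 5/3.
score = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

The single error of 2 contributes four times as much to the mean as the error of 1, which is why RMSE is sensitive to outliers.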

6
Q

AUC-ROC curve

A

A tool to evaluate a model’s ability to distinguish between classes across various classification thresholds; often more informative than accuracy when classes are imbalanced

7
Q

Blue/green deployment

A

A strategy that deploys a new version of a model (green) in parallel with the existing one (blue), then shifts traffic to the new version, all at once or in stages, while monitoring its performance so that traffic can be rolled back if problems appear

8
Q

Canary release

A

A deployment strategy where a small percentage of traffic is initially redirected to a new model version, and the remaining traffic is shifted only after that version proves healthy
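
The traffic split behind a canary release can be illustrated with a small routing function. This is a sketch in Python under stated assumptions, not how any AWS service implements it: the function name, the "new"/"current" labels, and the request IDs are all hypothetical.

```python
import hashlib


def route_request(request_id, canary_percent):
    """Send roughly canary_percent of requests to the new version.

    Hashing the request ID makes the choice deterministic, so the same
    request ID is always routed to the same version.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # value in 0..99
    return "new" if bucket < canary_percent else "current"


# Hypothetical traffic: route 1,000 requests with a 10% canary.
routes = [route_request(f"req-{i}", canary_percent=10) for i in range(1000)]
canary_share = routes.count("new") / len(routes)
```

Raising `canary_percent` step by step until it reaches 100 completes the rollout; setting it back to 0 is the rollback.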

9
Q

Amazon SageMaker Pipelines

A

A purpose-built workflow orchestration service to automate machine learning (ML) development

10
Q

Amazon SageMaker Data Wrangler

A

A service that reduces the time it takes to aggregate and prepare tabular and image data for ML from weeks to minutes

11
Q

Pipe input mode

A

A data streaming method where data is pre-fetched from Amazon S3 at high concurrency and throughput, and streamed into a named pipe

12
Q

File input mode

A

A data input method that downloads the entire dataset to the training instance before starting the training job

13
Q

FastFile mode

A

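A data access mode that streams data from Amazon S3 on demand while exposing it to the training code as local files, so training can start without first downloading the entire dataset. It works best when data is read sequentially, and startup is faster when the S3 prefix holds fewer files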

14
Q

Amazon SageMaker Serverless Inference

A

A deployment option that automatically scales compute resources based on incoming requests, making it cost-effective for workloads with idle periods between traffic spikes

15
Q

Amazon Bedrock

A

A fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies through a single API

16
Q

SageMaker JumpStart

A

A machine learning (ML) hub that provides pretrained models and prebuilt solutions, along with managed infrastructure and tools, to accelerate scalable, reliable, and secure building, training, and deployment of ML models

17
Q

XGBoost

A

An efficient implementation of the gradient boosted trees algorithm that performs well in handling a variety of data types, relationships, and distributions

18
Q

Random Cut Forest (RCF)

A

An unsupervised algorithm for detecting anomalous data points within a data set

19
Q

Amazon SageMaker Feature Store

A

A fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models

20
Q

Script mode

A

A SageMaker feature that enables writing custom training and inference code while still utilizing common ML framework containers maintained by AWS

21
Q

Stacking

A

An ensemble learning technique where predictions from several base models are used as inputs to a meta-model to make the final prediction

22
Q

Area Under the (Receiver Operating Characteristic) Curve (AUC)

A

An industry-standard evaluation metric for binary classification models that measures the ability of the model to predict a higher score for positive examples compared to negative examples
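
Read literally, this definition says AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch in Python that computes it by comparing every positive/negative pair, counting ties as half; the function name and sample data are hypothetical, and the quadratic loop is for clarity rather than speed:

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive scored higher: correct ranking
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores: 3 of the 4 pairs are ranked correctly.
auc = auc_by_ranking([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 1.0 means every positive outranks every negative, and 0.5 is what random scoring would give on average.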

23
Q

Amazon SageMaker Model Registry

A

A service for cataloging models, managing their versions, and tracking their metadata, as well as managing the approval status and deployment of models

24
Q

Bayesian Optimization

A

A hyperparameter optimization technique based on Bayes’ theorem that builds a probabilistic model of the target metric from previously evaluated hyperparameter sets and uses it to choose the next set to try

25
Q

Data drift

A

Changes in the distribution of the input data over time
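
One common way to quantify such a change for a single numeric feature is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of a baseline sample and a current sample. The sketch below, in Python, is an illustration chosen here and not necessarily what any particular monitoring service computes; the function name and samples are hypothetical.

```python
def ks_statistic(baseline, current):
    """Largest gap between two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def ecdf(sample, x):
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    # The gap can only change at observed values, so check each one.
    points = sorted(set(baseline) | set(current))
    return max(abs(ecdf(baseline, x) - ecdf(current, x)) for x in points)


# Hypothetical feature values at training time and in production.
baseline = [1.0, 2.0, 3.0, 4.0]
shifted = [11.0, 12.0, 13.0, 14.0]
no_drift = ks_statistic(baseline, baseline)
full_drift = ks_statistic(baseline, shifted)
```

In practice the statistic would be compared against a threshold chosen for the feature, and a value near 1 signals that the two samples barely overlap.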

26
Q

Model drift

A

Degradation in model performance because its assumptions or parameters no longer align with the real-world data

27
Q

Amazon SageMaker Model Monitor

A

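A service that monitors models in production and detects data drift by tracking changes in data distribution. It can also monitor model quality, bias drift, and feature attribution drift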

28
Q

Conditional Demographic Disparity (CDD)

A

A metric that measures the difference in positive prediction rates between demographic groups while conditioning on relevant features

29
Q

Exploratory Data Analysis (EDA)

A

A process to understand data distribution, identify and address missing values, and assess the extent of class imbalance

30
Q

Amazon SageMaker Debugger

A

A tool that provides debugging capabilities for training jobs, addressing problems such as overfitting, saturated activation functions, and vanishing gradients

31
Q

AWS CloudFormation with nested stacks

A

A method to modularize infrastructure, making it easier to manage and reuse components while enabling stacks to communicate by passing outputs and parameters between them