Domain 1: AI/ML Fundamentals 20% Flashcards by Natasha WrightPope

The field of computer science dedicated to solving cognitive problems commonly associated with human intelligence, such as learning, creation, and image recognition.

Artificial intelligence

_____ is to create self-learning systems that derive meaning from data.

The goal of AI

Uses of AI

Question response
Create original content (text/images)
Quickly process vast amounts of data
Solve complex problems (fraud detection)
Perform repetitive/monotonous tasks
Finding patterns in data
Forecasting trends

_____ is a branch of AI and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving in accuracy. It is used to build computer systems that learn from data.

Machine learning

How are ML models trained?

By using large datasets to identify patterns and make predictions

_____ is a type of machine learning that is inspired by the human brain and uses layers of artificial neurons, called neural networks, to process information.

Deep learning

_____ are some of the things that deep learning models can do.

Recognizing human speech and identifying objects in images

AI uses

Predict pandemics
Monitor assembly lines
Monitor sensor data to determine when equipment might fail
Product recommendation and support info (search to solution)
Personalized content recommendations
Forecast demand
Detect fraud
HR
Translate language text

Using a technique called _____, an AI model can process historical data, also known as time series data, and predict future values.

regression analysis

Predictions that AI makes are called _____. Each one is an educated guess, so the model gives a probabilistic result.

inferences

A deviation from the expected pattern.

anomaly

_____ use AI to process images and video for object identification and facial recognition, as well as classification, recommendation, monitoring, and detection.

Computer vision applications

_____ is what allows machines to understand, interpret, and generate human language in a natural-sounding way.

Natural language processing

_____ can have seemingly intelligent conversations and generate original content like stories, images, videos, and even music.

Generative AI

_____ is the science of developing algorithms and statistical models that computer systems use to perform complex tasks without explicit instructions.

Machine learning

Computer systems use ML algorithms to _____ and _____.

process large quantities of historical data, and identify data patterns

Machine learning starts with a _____ that takes data as inputs, and generates an output.

mathematical algorithm

To train the ML algorithm to produce the output we expect, we give it known data, which consists of _____.

features

What is the task of the ML algorithm?

to find the correlation between the input data features and the known expected output

Adjustments are made to the ML model by changing _____ until the model reliably produces the expected output.

internal parameter values

When a trained model is able to make accurate predictions and produce output from new data that it hasn’t seen during training.

inference

This type of data is stored as rows in a table with columns, which can serve as the features for an ML model.

structured data

_____ can be text files like CSV, or stored in relational databases like Amazon Relational Database Service (Amazon RDS) or Amazon Redshift.

structured data

_____ can be queried using structured query language, or SQL.

structured data
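
To make the SQL point concrete, here is a minimal sketch that loads a few rows into an in-memory SQLite table and queries them. The table name, columns, and sample rows are made up for illustration; SQLite stands in for a managed database such as Amazon RDS.

```python
import sqlite3

def average_price_by_city(rows):
    """Load (city, bedrooms, price) rows into a table and query them with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE houses (city TEXT, bedrooms INTEGER, price REAL)")
    conn.executemany("INSERT INTO houses VALUES (?, ?, ?)", rows)
    # Each column could serve as a feature; here we just aggregate one of them.
    result = conn.execute(
        "SELECT city, AVG(price) FROM houses GROUP BY city ORDER BY city"
    ).fetchall()
    conn.close()
    return result

# Hypothetical sample rows
rows = [("Austin", 3, 300000.0), ("Austin", 4, 500000.0), ("Boise", 2, 250000.0)]
print(average_price_by_city(rows))
```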

_____ is the primary source for training data because it can store any type of data, is lower cost, and has virtually unlimited storage capacity.

Amazon S3

Unlike data in a table, _____ elements can have different attributes or missing attributes. An example is a text file that contains JSON, which stands for JavaScript Object Notation.

semi-structured data
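
As a small illustration of records with different or missing attributes, the sketch below parses a JSON array in which no two elements share the same set of keys. The field names and values are invented for the example.

```python
import json

# Hypothetical semi-structured records: each element has its own set of attributes.
raw = '''[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Ben", "tags": ["vip"]},
  {"id": 3}
]'''

def attribute_names(json_text):
    """Return the sorted attribute names of every record."""
    return [sorted(record.keys()) for record in json.loads(json_text)]

def get_field(json_text, field, default=None):
    """Read one attribute from every record, tolerating records that lack it."""
    return [record.get(field, default) for record in json.loads(json_text)]

print(attribute_names(raw))
print(get_field(raw, "email"))
```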

_____ and _____ (with MongoDB compatibility) are two examples of transactional databases built specifically for semi-structured data.

Amazon DynamoDB and Amazon DocumentDB

_____ is data that doesn’t conform to any specific data model and can’t be stored in table format. Some examples include images, video, and text files, or social media posts. It is typically stored as objects in an object storage system like Amazon S3.

Unstructured data

Breaks down text into individual units of words or phrases

tokenization
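
A minimal word-level sketch of tokenization follows. It is an assumption-laden simplification: it lowercases the text and keeps runs of letters, digits, and apostrophes, whereas production language models typically split text into subword units instead.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("The model doesn't overfit, usually."))
```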

_____ is important for training models that need to predict future trends. Each data record is labeled with a timestamp, and stored sequentially.

Time series data

Depending on the sampling rate, time series data captured for long periods can get quite large and be stored in _____ for model training.

Amazon S3

To create a machine learning model, we need to start with an algorithm which defines the _____.

mathematical relationship between outputs and inputs

The simple linear equation _____ defines the linear relationship between our independent variable, x, and the dependent variable, y.

y = mx + b

The slope, m, and intercept, b, are the model parameters that are adjusted iteratively during the training process to _____.

find the best-fitting model

To determine the best fitting model, we look for the parameter values that _____.

minimize the errors
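
The idea of choosing m and b to minimize the errors can be sketched with ordinary least squares, which is one standard way to fit y = mx + b (the cards do not say which method is used). The function names and data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept b that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def sum_squared_error(xs, ys, m, b):
    """Total squared difference between the predictions mx + b and the known outputs."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # points that lie exactly on y = 2x + 1
m, b = fit_line(xs, ys)
print(m, b, sum_squared_error(xs, ys, m, b))
```

Any other choice of m and b gives a larger sum of squared errors on the same data, which is what "best-fitting" means here.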

This training process produces model artifacts, which typically consist of trained parameters, a model definition that describes how to compute inferences, and other metadata.

model training

The _____, which are normally stored in Amazon S3, are packaged together with inference code to make a deployable model.

model artifacts

_____ is the software that implements the model, by reading the artifacts.

Inference code

The first is where an endpoint is always available to accept inference requests in real time. And the second is where a batch job is performing inference.

Two options for hosting a model

_____ is ideal for online inferences that have low latency and high throughput requirements. For this, your model is deployed on a persistent endpoint to handle a sustained flow of requests.

Real-time inference

_____ is suitable for offline processing when large amounts of data are available upfront, and you don’t need a persistent endpoint.

Batch

When you need a large number of inferences, and it’s okay to wait for the results, _____ can be more cost-effective.

batch processing

T/F: The main difference between real-time and batch is that with batch, the computing resources only run when processing the batch, and then they shut down.

True

T/F: With real-time inferencing, some compute resources are always running and available to process requests.

True

With _____, you train your model with data that is pre-labeled.

supervised learning

T/F: In supervised learning, the training data specifies both the input and the desired output of the algorithm.

True

What is the challenge with supervised learning?

labeling

What solution helps with the challenge of labeling?

Amazon SageMaker Ground Truth

SageMaker Ground Truth can leverage a crowdsourcing service called _____ that provides access to a large pool of affordable labor spread across the globe.

Amazon Mechanical Turk

_____ algorithms train on data that has features but is not labeled. They can spot patterns, group the data into clusters, and split the data into a certain number of groups.

Unsupervised learning

_____ is useful for use cases such as pattern recognition, anomaly detection, and automatically grouping data into categories.

Unsupervised learning

T/F: Unsupervised learning algorithms can also be used to clean and process data for further modeling automatically.

True

T/F: Unsupervised learning is often used for anomaly detection.

True

_____ is a machine learning method that is focused on autonomous decision making by an agent. The agent takes actions within an environment to achieve specific goals. The model learns through trial and error, and training does not require labeled input. Actions that an agent takes that move it closer to achieving the goal are rewarded.

Reinforcement learning

T/F: In reinforcement learning, to encourage learning during training, the agent must sometimes be allowed to pursue actions that might not result in rewards.

True

To teach developers about developing a reinforcement learning model, Amazon offers a model race car called _____ that you can teach to drive on a racetrack. With this, the car is the agent, and the track is the environment.

AWS DeepRacer

T/F: Both unsupervised and reinforcement learning work without labeled data.

True

T/F: Unsupervised learning algorithms receive inputs with no specified outputs during the training process.

True

T/F: Reinforcement learning has a predetermined end goal. While it takes an exploratory approach, the explorations are continuously validated and improved to increase the probability of reaching the end goal.

True

When a model performs better on training data than it does on new data, it is called _____, and it is said that the model does not generalize well.

overfitting

The best way to correct a model that is overfitting _____

is to train it with data that is more diverse

If you train your model for too long, it will start to overemphasize unimportant features called _____, which is another cause of overfitting.

noise

_____ is a type of error that occurs when the model cannot determine a meaningful relationship between the input and output data.

Underfitting

_____ models give inaccurate results for both the training dataset and new data.

Underfit

_____ is when there are disparities in the performance of a model across different groups. The results are skewed in favor of or against an outcome for a particular class.

Bias

The quality of a model depends on _____ and _____.

the underlying data quality and quantity

T/F: If a model is showing bias, the weight of features that are introducing noise can be directly adjusted by the data scientists.

True

_____, such as constraints against age and sex discrimination, should be identified at the beginning, before creating a model.

Fairness constraints

Training data should be inspected and evaluated for potential bias, and models need to be continually evaluated by checking their results for _____.

fairness

Deep learning is a type of machine learning that uses algorithmic structures called _____.

neural networks

In deep learning models, we use software modules called _____ to simulate the behavior of neurons.

nodes

_____ comprise layers of nodes, including an input layer, several hidden layers, and an output layer of nodes.

Deep neural networks

Every node in the neural network autonomously assigns _____ to each feature.

weights

With neural networks, information flows through the network in a _____ direction from input to output.

forward

Every node autonomously assigns weights to each feature.
Information flows forward through the network from input to output.
During training, the difference between the predicted output and the actual output is calculated.
The weights of the neurons are repeatedly adjusted to minimize the error.

How neural networks work
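
The four steps above can be sketched with a single linear node. This is a deliberately tiny illustration, not a deep network: the function name, learning rate, epoch count, and training samples are assumptions chosen so the result is easy to check.

```python
def train_neuron(samples, learning_rate=0.1, epochs=500):
    """Train one linear node on (features, target) pairs."""
    weights = [0.0] * len(samples[0][0])  # one weight per feature
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            # Forward pass: information flows from the inputs to the output.
            predicted = sum(w * x for w, x in zip(weights, features)) + bias
            # Difference between the predicted output and the actual output.
            error = predicted - target
            # Adjust the weights in the direction that reduces the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

# Hypothetical data generated from target = 2*x1 + 1*x2
samples = [([0, 0], 0), ([1, 0], 2), ([0, 1], 1), ([1, 1], 3)]
weights, bias = train_neuron(samples)
print([round(w, 3) for w in weights], round(bias, 3))
```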

_____ can excel at tasks like image classification and natural language processing where there is a need to identify the complex relationship between data objects.

Deep learning

What made deep learning a viable option?

low-cost cloud computing

Because anyone can now readily use powerful computing resources in the cloud, _____ have become the standard algorithmic approach to computer vision.

neural networks

A big advantage of deep learning models for computer vision is that _____.

they don’t need the relevant features given to them. They can identify patterns in images and extract the important features on their own.

The decision to use traditional machine learning or deep learning depends on _____.

the type of data you need to process

Traditional machine learning algorithms will generally perform well and be efficient when it comes to _____.

identifying patterns from structured data and labeled data

Deep learning solutions are more suitable for _____ data like images, videos, and text.

unstructured

Tasks for deep learning include _____.

image classification and natural language processing

Both types of machine learning use statistical algorithms, but only deep learning uses _____ to simulate human intelligence.

neural networks

Do deep learning models require a lot of work on selecting/extracting features?

No, because they are self-learning and extract the relevant features on their own.

_____ is accomplished by using deep learning models that are pre-trained on extremely large datasets containing strings of text or, in AI terms, _____.

Generative AI / sequences

Gen AI deep learning models use transformer neural networks, which change an input sequence, in Gen AI known as _____, into an output sequence, which is the response to your _____.

prompt

Traditional (recurrent) neural networks process the elements of a sequence sequentially, _____.

one word at a time

Transformers process the sequence in _____, which speeds up the training and allows much bigger datasets to be used.

parallel

They outperform other ML approaches to natural language processing. They excel at understanding human language so they can read long articles and summarize them. They are also great at generating text that’s similar to the way a human would. As a result, they are good at language translation and even writing original stories, letters, articles, and poetry. They even know computer programming languages and can write code for software developers.

Large language models

T/F: Complex models generally present a tradeoff between performance and interpretability.

True

T/F: Less complex models generally mean lower performance.

True

If a software application always produces the same output for the same input, it is said to be _____.

deterministic

A rule-based application is deterministic unless _____.

someone changes the rules

T/F: Because ML models are probabilistic, identical sets of input values can produce a variety of results that aren’t consistent.

True

If determinism is necessary, then a _____ is a better option.

rule-based system

If your dataset consists of features or attributes as inputs with labeled target values as outputs, then you have a _____ learning problem.

supervised

For a supervised learning problem, you train your model with _____.

data containing known inputs and outputs

If your target values are categorical, for example, one or more discrete values, then you have a _____ problem.

classification

If the target values you’re trying to predict are mathematically continuous, then you have a _____ problem.

regression

If your dataset consists of features or attributes as inputs that do not contain labels or target values, then you have an _____ problem.

unsupervised learning

How should the output be predicted in unsupervised learning problems?

Based on the pattern discovered in the input data.

The goal in unsupervised learning problems is to _____, such as groupings, within the data.

discover patterns

When your data needs to be separated into discrete groups, you have a _____ problem.

clustering

If you are seeking to spot outliers in your data, then you have an _____ problem.

anomaly detection

Classification problems are normally distinguished as _____ or _____.

binary or multiclass

_____ assigns an input to one of two predefined and mutually exclusive classes based on its attributes.

Binary classification

_____ estimates the value of a dependent target variable based on one or more other variables, or attributes that are correlated with it.

Regression

_____ is when there is a direct linear relationship between the inputs and output.

Linear regression

_____ uses a single independent variable, such as weight, to predict someone’s height.

Simple linear regression

If we have multiple independent variables, such as weight and age, then we have a _____ problem.

multiple linear regression

_____ can create a model that takes one or more features as an input to predict the price of a house.

Regression analysis

_____ is used to measure the probability of an event occurring.

Logistic regression

A logistic regression prediction is a value between zero and one, where zero indicates _____, and one indicates _____.

an event that is unlikely to happen / a maximum likelihood that it will happen
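
A minimal sketch of how a logistic regression prediction lands between zero and one: a weighted sum of the features is passed through the logistic (sigmoid) function. The weights, bias, and threshold below are hypothetical placeholders, not trained values.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, bias):
    """Probability of the event, given one weight per feature plus a bias."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(features, weights, bias, threshold=0.5):
    """Turn the probability into a binary class label."""
    return 1 if predict_probability(features, weights, bias) >= threshold else 0

# Hypothetical single-feature model
print(predict_probability([2.0], [1.5], -1.0))
print(classify([0.0], [1.5], -1.0))
```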

Logistic regression equations use _____ functions to compute the regression line from one or more independent variables.

logarithmic

Both logistic regression and linear regression require _____ for the models to become accurate in predictions.

a significant amount of labeled data

_____ is a class of techniques that are used to classify data objects into groups, called clusters. It attempts to find discrete groupings within data.

Cluster analysis

Members are as similar as possible to each other and as different as possible from members of other groups.
Define the features or attributes you want the algorithm to use to determine similarity.
Select a distance function to measure similarity, and specify the number of clusters or groups you want to analyze.

Cluster analysis
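
The steps above can be sketched with k-means, one common clustering algorithm (the cards do not name a specific one). This is a simplified version under stated assumptions: Euclidean distance as the distance function, the first k points as starting centroids, and a fixed number of iterations. The sample points are made up.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, cluster index per point)."""
    centroids = [points[i] for i in range(k)]  # naive initialization: first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members) for coord in zip(*members))
    return centroids, assignments

points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```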

_____ is the identification of rare items, events, or observations in the data, which raise suspicions, because they differ significantly from the rest of the data.

Anomaly detection
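
One simple way to flag observations that differ significantly from the rest is a z-score rule, sketched below. The threshold of 2 standard deviations and the sample readings are assumptions for illustration; managed services use more sophisticated models.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values that lie more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 50]  # hypothetical sensor readings
print(find_anomalies(readings))
```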

_____ is a pre-trained deep learning service for computer vision. It meets the needs of several common computer vision use cases without requiring customers to train their own models. It works with images, stored videos, and streaming videos, and it supports facial recognition.

Amazon Rekognition

Uses for Amazon Rekognition

Detect and label objects
Identify objects in real-time streaming video for security systems
Add labels for any text it sees, for example, a street sign
Flag questionable content for human review

_____ extracts text, handwriting, forms, and tabular data from scanned documents.

Amazon Textract

_____ is a natural language processing service that helps discover insights and relationships in text. A common use is analyzing customer feedback.

Amazon Comprehend

Common use case for Comprehend and Textract

detecting personally identifiable information (PII) in text

_____ helps build voice and text interfaces to engage with customers. Used for customer service chatbots and interactive voice response systems.

Amazon Lex

_____ is an automatic speech recognition service that supports over 100 languages. This is designed to process live and recorded audio or video input to provide high quality transcriptions for search and analysis. A common use case is to caption streaming audio in real time.

Amazon Transcribe

_____ turns text into natural-sounding speech in dozens of languages. It uses deep learning technologies to synthesize human speech.

Amazon Polly

Common use cases include converting articles to speech and prompting callers in interactive voice response systems.

Amazon Polly

_____ uses machine learning to perform an intelligent search of enterprise systems to quickly find content. It uses natural language processing to understand questions.

Amazon Kendra

_____ allows businesses to automatically generate personalized recommendations for their customers in industries such as retail, media, and entertainment.

Amazon Personalize

_____ fluently translates text between 75 different languages. It is built on a neural network that considers the entire context of the source sentence and the translation it has generated so far. It uses this information to create more accurate and fluent translations.

Amazon Translate

_____ is an AI service for time series forecasting. By providing it with historical time series data, you can predict future points in the series. Time series forecasting is useful in multiple domains.

Amazon Forecast

_____ helps to identify potentially fraudulent online activities such as online payment fraud and creation of fake accounts. It features pre-trained data models to detect fraud in online transactions, product reviews, checkout and payments, new accounts, and account takeovers.

Amazon Fraud Detector

_____ is a fully managed service to build generative AI applications on AWS, and it lets you choose from high performing foundation models trained by Amazon, Meta, and leading AI startups. You can customize a foundation model by providing your own training data or creating a knowledge base for the model to query.

Amazon Bedrock

When a generative AI model calls an external knowledge system to retrieve information outside its training data, this is called _____.

Retrieval Augmented Generation
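
The retrieval step can be sketched without any AI service at all. In the toy version below, "retrieval" is plain keyword overlap and the call to the model is left out; real systems typically search a knowledge base using vector embeddings. The function names, documents, and prompt wording are all hypothetical.

```python
import re

def words(text):
    """Lowercased set of the words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q = words(question)
    ranked = sorted(documents, key=lambda d: len(q & words(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Augment the prompt with retrieved context before it is sent to a model."""
    context = "\n".join(retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

docs = [
    "The refund window is 30 days from delivery.",
    "Support is open on weekdays from 9 to 5.",
]
print(build_prompt("How many days is the refund window?", docs))
```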

Use the _____ foundation model from Amazon to generate an image in response to a prompt.

Titan Image Generator

Use the _____ family of services when you need more customized machine learning models or workflows that go beyond the prebuilt functionalities offered by the core AI services.

Amazon SageMaker

Provides machine learning capabilities for data scientists and developers to prepare, build, train, and deploy high-quality ML models efficiently.
It comprises several services that are optimized for building and training custom machine learning models, which include data preparation and labeling, large-scale parallel training on multiple instances or GPU clusters, model deployment, and real-time inference endpoints.
To accelerate the development process, it offers pre-trained models that you can use as a starting point and reduce the resources needed for data preparation and model training.

Amazon SageMaker

What makes generative AI models more accurate and current with their responses?

Retrieval augmented generation

A _____ is a series of interconnected steps that start with a business goal and finish with operating a deployed ML model. It starts with defining the problem, collecting and preparing training data, training the model, deploying, and finally, monitoring it.

machine learning pipeline

Have a clear idea of the problem
Be able to measure business value against objectives and success criteria
Align stakeholders to gain consensus on the goal
Evaluate the organization’s ability to move forward with the target
Evaluate all options for achieving the goal
Considering cost, determine how accurate the outcomes will be
Ensure enough good training data is available
Perform a cost-benefit analysis

Determine if ML is best solution

With _____, you can create a custom classifier that uses your own categories by supplying it with your training data.

Amazon Comprehend

_____ lets you start with a fully trained foundation model. You can fine-tune this model with your own data using transfer learning.

Amazon Bedrock

_____ provides pre-trained AI foundation models and task-specific models for computer vision and natural language processing problem types. These are pre-trained on large public datasets.

SageMaker JumpStart

Fine-tuning the model with incremental training using your own dataset.

transfer learning

Identify the data needed and determine the options for collecting the data.
Know what training data you will need to develop your model and where it is generated and stored.
Know if it’s streaming data or whether you can load it in batch process.
Configure a process known as extract, transform, and load (ETL) to collect the data from possibly multiple sources and store it in a centralized repository.
Know if the data is labeled or how you will be able to label it.
Determine which characteristics of the dataset should be used as features to train the model.

Collecting/processing training data

___% of the data should be used for training the model, ___% should be set aside for model evaluation, and ___% for performing the final test before deploying the model to production.

80/10/10
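
A minimal sketch of the 80/10/10 split, assuming the records can be shuffled (time series data usually needs a chronological split instead). The function name and the fixed seed are illustrative choices.

```python
import random

def split_dataset(records, train=0.8, validation=0.1, seed=42):
    """Shuffle the records and split them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    evaluation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains, about 10%
    return training, evaluation, test

training, evaluation, test = split_dataset(range(100))
print(len(training), len(evaluation), len(test))
```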

T/F: You should reduce the features in your training data to only those that are needed for inference.

True

T/F: Features can be combined to further reduce the number of features.

True

_____ is a fully managed ETL service. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point this to your data stored on AWS.

AWS Glue

AWS Glue discovers your data and stores the associated metadata, the table definition, and schema in the _____.

AWS Glue Data Catalog

Generates the code to execute your data transformations and data loading processes.
Has built-in transformations for things like dropping duplicate records, filling in missing values, and splitting your dataset.
Can extract, transform, and load data from a large variety of data stores, which include relational databases, data warehouses, and other cloud, or even streaming services.
Can crawl your data sources and automatically determine the data schema by using classifiers.
Writes the schema to tables in the Data Catalog.

AWS Glue

The _____ tables include an index to the location, schema, and runtime metrics of your data. You use the information in this to create and monitor your ETL jobs.

AWS Glue Data Catalog

A visual data preparation tool that enables users to clean and normalize data without writing any code.
You can interactively discover, visualize, clean, and transform raw data.
Makes smart suggestions to help you identify data quality issues that can be difficult to find and time-consuming to fix.
Save transformation steps in a recipe, which you can update or reuse later with other datasets and deploy on a continuing basis.
Provides more than 250 built-in transformations, with a visual point-and-click interface for creating and managing data transformation jobs. These include removing nulls, replacing missing values, fixing schema inconsistencies, creating column-based functions and more.
Use to evaluate the quality of your data by defining rule sets and running profiling jobs.

AWS Glue DataBrew

Helps you build high-quality training datasets for your machine learning models.
Uses a machine learning model to label your training data. It will automatically label data that it can label, and the rest is given to a human workforce.

SageMaker Ground Truth

You can use _____ to prepare, featurize, and analyze your data, and you can simplify the feature engineering process by using a single visual interface. It contains over 300 built-in transformations so that you can quickly normalize, transform, and combine features without having to write any code.

Amazon SageMaker Canvas

Using the ____ data selection tool, you can choose the raw data that you want from various data sources and import it with a single click.

SageMaker Data Wrangler

Is a centralized store for features and associated metadata, so features can be easily discovered and reused.
Makes it easy to create, share, and manage features for ML development.
Accelerates this process by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm.
Create workflow pipelines that convert raw data into features and add them to feature groups.

Amazon SageMaker Feature Store

During training, the machine learning algorithm updates a set of numbers, known as _____. The goal is to update the parameters in the model in such a way that the inference matches the expected output.

parameters or weights

T/F: When teaching the model, the ML algorithm watches the weights and outputs from previous iterations, and shifts the weights to a direction that lowers the error in generated output.

True

What are the two conditions that stop the iterative ML algorithm process?

When a defined number of iterations have been run.
When the change in error is below a target value.
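
The weight updates and the two stop conditions can be sketched with plain gradient descent on a one-feature linear model. This is only an illustration, not how any particular SageMaker algorithm is implemented; the function name, learning rate, and tolerance are assumptions chosen for the example.

```python
def train_linear(xs, ys, learning_rate=0.01, max_iterations=10_000, tolerance=1e-9):
    """Fit y = w * x + b by gradient descent on mean squared error.

    Returns (w, b, iterations_run, reason_for_stopping).
    """
    w, b = 0.0, 0.0  # the parameters (weights) that training updates
    n = len(xs)
    previous_error = float("inf")
    for iteration in range(1, max_iterations + 1):
        predictions = [w * x + b for x in xs]
        error = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / n
        # Stop condition: the change in error is below the target value.
        if abs(previous_error - error) < tolerance:
            return w, b, iteration, "error change below tolerance"
        previous_error = error
        # Shift the weights in the direction that lowers the error.
        grad_w = 2 * sum((p - y) * x for p, y, x in zip(predictions, ys, xs)) / n
        grad_b = 2 * sum(p - y for p, y in zip(predictions, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    # Stop condition: the defined number of iterations has been run.
    return w, b, max_iterations, "iteration limit reached"
```

Calling `train_linear([0, 1, 2, 3, 4], [1, 3, 5, 7, 9], learning_rate=0.05)` should move `w` toward 2 and `b` toward 1, because the data follows y = 2x + 1.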

When there are multiple algorithms for a model, the best practice is to:

run many training jobs in parallel, by using different algorithms and settings (running experiments).

Each algorithm has a set of external parameters that affect its performance, known as _____.

hyperparameters

Who sets the hyperparameters and when?

data scientists before training the model

The optimal values for the hyperparameters can only be determined by _____.

running multiple experiments with different settings
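
A minimal sketch of running experiments with different hyperparameter settings follows. It assumes a toy one-dimensional k-nearest-neighbors classifier where `k` is the hyperparameter; the function names and the data layout (a list of `(value, label)` pairs) are made up for the example.

```python
def knn_predict(train, k, x):
    """Predict a 0/1 label for x by majority vote of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def validation_accuracy(train, validation, k):
    """Fraction of validation points that the model with this k labels correctly."""
    correct = sum(knn_predict(train, k, x) == label for x, label in validation)
    return correct / len(validation)


def run_experiments(train, validation, candidate_ks):
    """Run one experiment per hyperparameter value and pick the best by accuracy."""
    results = {k: validation_accuracy(train, validation, k) for k in candidate_ks}
    best_k = max(results, key=results.get)
    return best_k, results
```

The hyperparameter `k` is fixed before each run and is never learned from the data. It is chosen only by comparing the results of the experiments.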

Specify the URL of the S3 bucket containing your training data.
Specify the compute resources you want to use for training, and the output bucket for the model artifacts.
Specify the algorithm by giving SageMaker the path to a Docker container image that contains the training algorithm.

How to create a training job on SageMaker
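
The three pieces of information above map onto a training job request roughly as sketched below. The field names follow the boto3 `create_training_job` request shape as best understood here; the instance type, volume size, and runtime limit are illustrative assumptions, and every argument value (bucket URIs, image URI, role ARN) is a hypothetical placeholder supplied by the caller. The sketch only builds the request dictionary and does not call AWS.

```python
def build_training_job_request(job_name, image_uri, role_arn,
                               train_s3_uri, output_s3_uri, hyperparameters):
    """Assemble a request dict in the shape used by SageMaker's CreateTrainingJob."""
    return {
        "TrainingJobName": job_name,
        # The algorithm: path to the Docker container image with the training code.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "RoleArn": role_arn,
        # The training data: S3 location of the dataset.
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": train_s3_uri,
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }],
        # The output bucket for the model artifacts.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        # The compute resources used for training (assumed values).
        "ResourceConfig": {
            "InstanceType": "ml.m5.large",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyperparameter values are passed as strings.
        "HyperParameters": {name: str(value) for name, value in hyperparameters.items()},
    }
```

With valid credentials, a dictionary like this would typically be passed as keyword arguments to a boto3 SageMaker client, for example `client.create_training_job(**request)`.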

In the _____, you can specify the location of SageMaker provided algorithms and deep learning containers, or the location of your custom container, containing a custom algorithm, and set the hyperparameters required by the algorithm.

Amazon Elastic Container Registry, Amazon ECR

An _____ is a group of training runs, each with different inputs, parameters, and configurations. It features a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best-performing models.

experiment

_____, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, it uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create the model that performs best, as measured by a metric that you choose.

Amazon SageMaker automatic model tuning, AMT

Configure a tuning job that runs several training jobs inside a loop.
Specify completion criteria as the number of jobs that are no longer improving the metric, and the job will run until the completion criteria are satisfied.

How to use automatic model tuning:

Determine whether you need batch or real-time inferencing or both
Configure and manage the inference endpoint

How to deploy a model so it can be used for inferences

API Gateway can serve as the interface with the clients and forward requests to an _____, which is running the model.

AWS Lambda function

How do you use SageMaker inferencing?

Point SageMaker to your model artifacts in an S3 bucket and a Docker container image in Amazon ECR.

For real-time, asynchronous, and batch inference, SageMaker runs the model on _____, which can be inside an auto scaling group.

EC2 ML instances

For the serverless inference option, SageMaker runs your code on _____.

Lambda functions

_____ is ideal when you want to queue incoming requests and have large payloads with long processing times.

Amazon SageMaker Asynchronous Inference

_____ can be used to serve model inference requests in real time without directly provisioning compute instances, or configuring scaling policies to handle traffic variations.

Serverless inference

_____ is ideal for inference workloads where you need real-time interactive responses from your model. Use this for a persistent and fully managed endpoint REST API that can handle sustained traffic backed by the instance type of your choice.

Real-time inference

What are some reasons model quality can degrade over time?

data quality, model quality, and model bias

The model monitoring system must:

capture data
compare the data to the training set
define rules to detect issues
send alerts

T/F: For most ML models, a simple scheduled approach for re-training daily, weekly, or monthly is usually enough.

True

What should the monitoring system do?

detect data and concept drifts
initiate an alert
send it to an alarm manager system, which could automatically start a re-training cycle.

_____ is when there are significant changes to the data distribution compared to the data used for training.

Data drift
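
One simple way to picture a data drift check is to compare the mean of a feature in live data against the training baseline, measured in training standard deviations. This is a sketch only: it is not how SageMaker Model Monitor works internally, and the function names and the threshold of 2.0 are arbitrary assumptions.

```python
from statistics import mean, stdev


def mean_shift_in_std(training_values, live_values):
    """Shift of the live mean from the training mean, in training standard deviations."""
    baseline_std = stdev(training_values)
    return abs(mean(live_values) - mean(training_values)) / baseline_std


def has_data_drift(training_values, live_values, threshold=2.0):
    """Flag drift when the live mean moves more than `threshold` deviations away."""
    return mean_shift_in_std(training_values, live_values) > threshold
```

A production check would normally compare whole distributions for every feature, not just a single mean.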

_____ is when the properties of the target variables change.

Concept drift

_____ which is a capability of Amazon SageMaker, monitors models in production and detects errors so you can take remedial actions.

Amazon SageMaker Model Monitor

_____ is about taking the established best practices of software engineering and applying them to machine learning model development.
It’s about automating manual tasks, testing, and evaluating code before release, and responding automatically to incidents.
It can streamline model delivery across the machine learning development lifecycle.

MLOps

T/F: With MLOps, everything gets versioned, including the training data.

True

Monitoring deployments to detect potential issues
Automating re-training because of issues or data and code changes.

key MLOps principles

What are the major benefits of MLOps?

Productivity
Repeatability
Reliability
Auditability
Data/model quality

For _____, MLOps can improve auditability by versioning all inputs and outputs from data science experiments to source data to trained models.

compliance

_____ offers the ability to orchestrate SageMaker jobs and author reproducible ML pipelines.
These can deploy custom built models for inference in real time with low latency, run offline inferences with batch transform and track lineage of artifacts.
They can institute sound operational practices in deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface.

Amazon SageMaker Pipelines

You can create a pipeline using the _____ or _____. The pipeline can contain all the steps to build and deploy a model, and can also include conditional branches based on the output of a previous step.

SageMaker SDK for Python or define the pipeline using JSON

Pipelines can be viewed in _____.

SageMaker Studio

_____ is a source code repository that you can use for storing your inference code. It is comparable to GitHub, a third-party source code repository.

AWS CodeCommit

What is a repository for the feature definitions of your training data?

SageMaker Feature Store

_____ is a centralized repository for your trained models and history.

SageMaker Model Registry

_____ lets you define a workflow with a visual drag-and-drop interface. It gives you the ability to build serverless workflows that integrate various AWS services and custom application logic.

AWS Step Functions

_____ is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows.

Apache Airflow

With _____, you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security.

Amazon Managed Workflows for Apache Airflow

A _____ is used to summarize the performance of a classification model when it’s evaluated against test data. It is a table with the actual values typically across the top and the predicted values on the left.

confusion matrix

One metric that is sometimes used to judge a model’s performance is _____, which is simply the percentage of correct predictions. This measures how close the predicted class values are to the actual values.

accuracy

Values for accuracy metrics vary between _____. A value of _____ indicates perfect accuracy and _____ indicates complete inaccuracy.

zero and one / one / zero

The formula for accuracy is:

the number of true positives plus true negatives divided by the total number of predictions.

_____ measures how well an algorithm predicts true positives out of all the positives that it identifies.

Precision

The formula for precision is:

the number of true positives divided by the number of true positives, plus the number of false positives.

If we want to minimize the false negatives, then we can use a metric known as ___.

recall

The formula for recall is:

the number of true positives divided by the number of true positives plus the number of false negatives.

Recall is also known as _____ or the true positive rate.

sensitivity

False positives divided by the sum of the false positives and true negatives.

false positive rate (how many measured as fish out of images that weren’t fish)

The ratio of the true negatives to the sum of the false positives and true negatives.

true negative rate (how many measured as not fish of those that weren’t fish)
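
The formulas on the preceding cards can be computed directly from the four confusion matrix counts. The sketch below assumes the counts are already known and that no denominator is zero; the function name and the example counts in the note are made up.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common metrics from confusion matrix counts.

    tp/fp = true/false positives, tn/fn = true/false negatives.
    Assumes every denominator is non-zero.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # also called sensitivity or true positive rate
        "false_positive_rate": fp / (fp + tn),
        "true_negative_rate": tn / (fp + tn),
    }
```

For a fish detector with 40 true positives, 10 false positives, 45 true negatives, and 5 false negatives, this gives an accuracy of 0.85 and a precision of 0.8. The false positive rate and the true negative rate always sum to one.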

The _____ is used to compare and evaluate binary classification by algorithms that return probabilities, such as logistic regression.

area under the curve, also known as AUC metric

A _____ is a value that the model uses to make a decision between the two possible classes. It converts the probability of a sample being part of a class into a binary decision.

threshold

The curve that plots the true positive rate against the false positive rate across the range of thresholds is called the _____.

receiver operating characteristic (ROC) curve

AUC provides an aggregated measure of the model performance across the full range of thresholds, and the AUC scores vary between _____.

zero and one

With AUC, a score of one indicates _____ and a score of one half, or 0.5, indicates that _____.

perfect accuracy / the prediction is no better than a random classifier
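
A threshold and the AUC score can both be sketched in a few lines. The AUC function below relies on a known equivalence: the area under the ROC curve equals the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting as one half. The function names are assumptions for the example, and the pairwise loop is meant for small inputs only.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Convert predicted probabilities into binary class decisions."""
    return [1 if p >= threshold else 0 for p in probabilities]


def auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))
```

A model that gives every sample the same score has an AUC of 0.5, which matches the random-classifier case, while a model that ranks every positive above every negative has an AUC of 1.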

The distance between the line and the actual values in linear regression is the _____.

error

A metric that we can use to evaluate a linear regression model is called the _____. To compute it, we take the difference between the prediction and the actual value, square the difference, and then compute the average of all squared differences. These values are never negative.

mean squared error, MSE

The square root of the mean squared error. The advantage of using this is that the units match the dependent variable.

root mean squared error

Averages the absolute values of the errors, so it doesn’t emphasize the large errors.

mean absolute error
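
The three regression metrics differ only in how they aggregate the errors, which a short sketch makes concrete. The function name and the example numbers in the note are made up.

```python
from math import sqrt


def regression_errors(actual, predicted):
    """Return MSE, RMSE, and MAE for paired actual and predicted values."""
    errors = [p - a for p, a in zip(predicted, actual)]
    mse = sum(e * e for e in errors) / len(errors)
    return {
        "mse": mse,                # average of the squared errors
        "rmse": sqrt(mse),         # same units as the dependent variable
        "mae": sum(abs(e) for e in errors) / len(errors),  # average absolute error
    }
```

With actual values `[3, 5, 7, 9]` and predictions `[2, 5, 8, 13]`, the errors are -1, 0, 1, and 4, so the MSE is 4.5 and the MAE is 1.5. The single error of 4 dominates the MSE but not the MAE.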

____ help us quantify the value of a machine learning model to the business.

Business metrics

If you need a balance between precision and recall, because improving one usually comes at the cost of the other, what formula should you use?

F1 = 2 * (Precision * Recall) / (Precision + Recall)
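
The F1 score is the harmonic mean of precision and recall, that is, 2 * (Precision * Recall) / (Precision + Recall). A minimal sketch, with a guard for the case where both inputs are zero (returning 0.0 there is an assumption of this example):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both metrics are zero
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are high. For example, a precision of 0.8 with a recall of 0.5 gives an F1 of about 0.615, below the simple average of 0.65.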

What are the two major components of a deployable ML model?

Model artifacts, which are the output of model training. Inference code, which is the software that implements the model.

Having a deep understanding of a model’s inner mechanics and how and why it makes a prediction.

Interpretability

Which AWS AI service could you use to filter uploaded images that contain inappropriate content?

Amazon Rekognition

Business goal setting
Data preparation
Train and tune
Deploy and monitor

ML pipeline steps

What is the primary purpose of the AWS Glue Data Catalog?

It stores metadata for the data sources and targets for ETL jobs.

_____ runs your model on a Lambda function that incurs charges only for the length of time that it runs, and it’s most cost-effective for real-time inference when there are also periods of no or intermittent traffic.

Serverless inference

_____ is a natural language processing, or NLP, service that extracts insights and relationships from text data by using ML.

Amazon Comprehend

What is the ML lifecycle?

Business goal identification
ML problem framing
Data processing (data collection, data preprocessing, feature engineering)
Model development (training, tuning, evaluation)
Model deployment (inference, prediction)
Model monitoring

During ____, you perform explainability techniques and evaluate the accuracy and performance of the model. The goal of this stage is to determine if the model requires additional data fine-tuning, ML algorithm fine-tuning, or if the model is ready for deployment.

model evaluation

_____ is a stage in the ML development lifecycle that occurs after model deployment. During this stage, you monitor the model to identify issues that relate to data or model quality, and issues that relate to bias or feature attribution drift. The goal of this stage is to identify if the model maintains the necessary performance levels and identify when there is drift or model degradation.

Model monitoring

_____ is a stage in the ML development lifecycle that occurs after a model is trained, tuned, and evaluated. During this stage, you deploy the model into production to begin making predictions.

Model deployment

_____ is a step in the ML development lifecycle that occurs during the data preparation stage. During this step, you select and transform variables to create features or attributes.

Feature engineering

_____ creates features or variables that can help the model generate more accurate results and improve overall performance during model training.

Feature engineering