Domain 1: AI/ML Fundamentals 20% Flashcards

1
Q

The field of computer science dedicated to solving cognitive problems commonly associated with human intelligence, such as learning, creation, and image recognition.

A

Artificial intelligence

2
Q

_____ is to create self-learning systems that derive meaning from data.

A

The goal of AI

3
Q

Uses of AI

A
  1. Answer questions
  2. Create original content (text/images)
  3. Quickly process vast amounts of data
  4. Solve complex problems (fraud detection)
  5. Perform repetitive/monotonous tasks
  6. Find patterns in data
  7. Forecast trends
4
Q

_____ is a branch of AI and computer science that focuses on the use of data and algorithms to imitate the way humans learn, gradually improving in accuracy. The aim is to build computer systems that learn from data.

A

Machine learning

5
Q

How are ML models trained?

A

By using large datasets to identify patterns and make predictions

6
Q

_____ is a type of machine learning that is inspired by the human brain and uses layers of neural networks to process information.

A

Deep learning

7
Q

_____ are some of the things that deep learning models can do.

A

Recognizing human speech and recognizing objects in images

8
Q

AI uses

A
  1. Predict pandemics
  2. Monitor assembly lines
  3. Monitor sensor data to determine when equipment might fail
  4. Product recommendation and support info (search to solution)
  5. Personalized content recommendations
  6. Forecast demand
  7. Detect fraud
  8. HR
  9. Translate language text
9
Q

Using a technique called _____, an AI model can process historical data, also known as time series data, and predict future values.

A

regression analysis

10
Q

Predictions that AI makes are called _____. Each one is an educated guess, so the model gives a probabilistic result.

A

inferences

11
Q

A deviation from the expected pattern.

A

anomaly

12
Q

_____ use AI to process images and video for object identification and facial recognition, as well as classification, recommendation, monitoring, and detection.

A

Computer vision applications

13
Q

_____ is what allows machines to understand, interpret, and generate human language in a natural-sounding way.

A

Natural language processing

14
Q

_____ can have seemingly intelligent conversations and generate original content like stories, images, videos, and even music.

A

Generative AI

15
Q

_____ is the science of developing algorithms and statistical models that computer systems use to perform complex tasks without explicit instructions.

A

Machine learning

16
Q

Computer systems use ML algorithms to _____ and _____.

A

process large quantities of historical data, and identify data patterns

17
Q

Machine learning starts with a _____ that takes data as inputs, and generates an output.

A

mathematical algorithm

18
Q

To train the ML algorithm to produce the output we expect, we give it known data, which consists of _____.

A

features

19
Q

What is the task of the ML algorithm?

A

to find the correlation between the input data features and the known expected output

20
Q

Adjustments are made to the ML model by changing _____ until the model reliably produces the expected output.

A

internal parameter values

21
Q

When a trained model is able to make accurate predictions and produce output from new data that it hasn’t seen during training.

A

inference

22
Q

This type of data is stored as rows in a table with columns, which can serve as the features for an ML model.

A

structured data

23
Q

_____ can be text files like CSV, or stored in relational databases like Amazon Relational Database Service (Amazon RDS) or Amazon Redshift.

A

structured data

24
Q

_____ can be queried using structured query language, or SQL.

A

structured data
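
As a rough illustration of the structured-data cards above, the sketch below loads a few rows into an in-memory SQLite table and queries them with SQL. The `houses` table, its columns, and the values are invented for the example, and SQLite only stands in for a managed relational service such as Amazon RDS.

```python
import sqlite3

# In-memory database; SQLite is only a stand-in for a managed relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (sqft INTEGER, bedrooms INTEGER, price REAL)")

# Each row is one record; each column can serve as a feature (or as the label).
rows = [(1000, 2, 200000.0), (1500, 3, 290000.0), (2000, 4, 410000.0)]
conn.executemany("INSERT INTO houses VALUES (?, ?, ?)", rows)

# SQL query: select two columns for houses with at least 3 bedrooms.
large = conn.execute(
    "SELECT sqft, price FROM houses WHERE bedrooms >= 3 ORDER BY sqft"
).fetchall()
print(large)
```

Because every row has the same columns, a query like this can hand a model a clean feature matrix without any per-record special cases.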

25
Q

_____ is the primary source for training data because it can store any type of data, is lower cost, and has virtually unlimited storage capacity.

A

Amazon S3

26
Q

Unlike data in a table, _____ elements can have different attributes or missing attributes. An example is a text file that contains JSON, which stands for JavaScript Object Notation.

A

semi-structured data
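
To make "different attributes or missing attributes" concrete, here is a small sketch that parses two JSON records whose fields do not match. The records and field names are hypothetical.

```python
import json

# Two hypothetical records: the second has an extra attribute and lacks "email".
raw = """
[
  {"id": 1, "name": "Ana", "email": "ana@example.com"},
  {"id": 2, "name": "Raj", "tags": ["vip", "beta"]}
]
"""
records = json.loads(raw)

# Collect the attribute names each record actually carries.
keys_per_record = [sorted(r.keys()) for r in records]

# .get() tolerates a missing attribute instead of raising KeyError.
emails = [r.get("email") for r in records]
print(keys_per_record, emails)
```

Code that consumes semi-structured data usually has to handle absent fields explicitly, which is the main practical difference from a fixed table schema.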

27
Q

_____ and _____ (with MongoDB compatibility) are two examples of transactional databases built specifically for semi-structured data.

A

Amazon DynamoDB and Amazon DocumentDB

28
Q

_____ is data that doesn’t conform to any specific data model and can’t be stored in table format. Some examples include images, video, and text files, or social media posts. It is typically stored as objects in an object storage system like Amazon S3.

A

Unstructured data

29
Q

Breaks down text into individual units of words or phrases

A

tokenization
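
A minimal word-level tokenizer might look like the sketch below. It is only an illustration of the idea on this card: the `tokenize` helper is a made-up name, and tokenizers used with large language models typically split text into subword units rather than whole words.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

tokens = tokenize("Deep learning models don't need hand-picked features.")
print(tokens)
```

Punctuation is dropped and the hyphenated word is split in two, which shows that the choice of splitting rule directly shapes what the model sees.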

30
Q

_____ is important for training models that need to predict future trends. Each data record is labeled with a timestamp, and stored sequentially.

A

Time series data

31
Q

Depending on the sampling rate, time series data captured for long periods can get quite large and be stored in _____ for model training.

A

Amazon S3

32
Q

To create a machine learning model, we need to start with an algorithm which defines the _____.

A

mathematical relationship between outputs and inputs

33
Q

The simple linear equation _____ defines the linear relationship between our independent variable, x, and the dependent variable, y.

A

y=mx+b

34
Q

The slope, m, and intercept, b, are the model parameters that are adjusted iteratively during the training process to _____.

A

find the best-fitting model

35
Q

To determine the best fitting model, we look for the parameter values that _____.

A

minimize the errors
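
The last three cards (y = mx + b, iterative parameter adjustment, minimizing the errors) can be sketched as a tiny gradient descent loop. This is one common way to do the adjustment, not necessarily the one the course has in mind; the data, learning rate, and iteration count are arbitrary choices for the example.

```python
# Toy data generated from y = 2x + 1, so the ideal parameters are m = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

def mean_squared_error(m: float, b: float) -> float:
    """Average squared difference between predictions m*x + b and the known outputs."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

m, b = 0.0, 0.0          # initial parameter guesses
learning_rate = 0.05     # step size (an arbitrary choice for this toy data)

for _ in range(2000):
    # Gradients of the mean squared error with respect to m and b.
    grad_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Move each parameter a small step in the direction that reduces the error.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b

print(round(m, 3), round(b, 3), mean_squared_error(m, b))
```

The loop never sees the formula that generated the data; it recovers the slope and intercept only by repeatedly shrinking the error on known input/output pairs.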

36
Q

This process produces model artifacts, which typically consist of trained parameters, a model definition that describes how to compute inferences, and other metadata.

A

model training

37
Q

The _____, which are normally stored in Amazon S3, are packaged together with inference code to make a deployable model.

A

model artifacts

38
Q

_____ is the software that implements the model, by reading the artifacts.

A

Inference code

39
Q

The first is where an endpoint is always available to accept inference requests in real time. And the second is where a batch job is performing inference.

A

Two options for hosting a model

40
Q

_____ is ideal for online inferences that have low latency and high throughput requirements. For this, your model is deployed on a persistent endpoint to handle a sustained flow of requests.

A

Real-time inference

41
Q

_____ is suitable for offline processing when large amounts of data are available upfront, and you don’t need a persistent endpoint.

A

Batch

42
Q

When you need a large number of inferences, and it’s okay to wait for the results, _____ can be more cost-effective.

A

batch processing

43
Q

T/F: The main difference between real-time and batch is that with batch, the computing resources only run when processing the batch, and then they shut down.

A

True

44
Q

T/F: With real-time inferencing, some compute resources are always running and available to process requests.

A

True

45
Q

With _____, you train your model with data that is pre-labeled.

A

supervised learning

46
Q

T/F: In supervised learning, training data specifies both the input and the desired output of the algorithm.

A

True

47
Q

What is the challenge with supervised learning?

A

labeling

48
Q

What solution helps with the challenge of labeling?

A

Amazon SageMaker Ground Truth

49
Q

SageMaker Ground Truth can leverage a crowdsourcing service called _____ that provides access to a large pool of affordable labor spread across the globe.

A

Amazon Mechanical Turk

50
Q

_____ algorithms train on data that has features but is not labeled. They can spot patterns, group the data into clusters, and split the data into a certain number of groups.

A

Unsupervised learning

51
Q

_____ is useful for use cases such as pattern recognition, anomaly detection, and automatically grouping data into categories.

A

Unsupervised learning

52
Q

T/F: Unsupervised learning algorithms can also be used to clean and process data for further modeling automatically.

A

True

53
Q

T/F: Unsupervised learning is often used for anomaly detection.

A

True

54
Q

_____ is a machine learning method that is focused on autonomous decision making by an agent. The agent takes actions within an environment to achieve specific goals. The model learns through trial and error, and training does not require labeled input. Actions that an agent takes that move it closer to achieving the goal are rewarded.

A

Reinforcement learning

55
Q

T/F: In reinforcement learning, to encourage learning during training, the agent must sometimes be allowed to pursue actions that might not result in rewards.

A

True
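
One simple way to let an agent "sometimes pursue actions that might not result in rewards" is an epsilon-greedy rule, sketched below. The technique is swapped in here for illustration and is not named in the cards; the three-action environment, its reward values, and the value of `epsilon` are all made up.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: three actions with fixed rewards; action 2 is best.
rewards = [1.0, 2.0, 5.0]
estimates = [0.0, 0.0, 0.0]   # the agent's current reward estimate per action
counts = [0, 0, 0]            # how many times each action has been tried
epsilon = 0.2                 # fraction of steps spent exploring

for _ in range(500):
    if random.random() < epsilon:
        # Explore: try a random action, even one that currently looks unrewarding.
        action = random.randrange(len(rewards))
    else:
        # Exploit: pick the action with the highest estimated reward so far.
        action = max(range(len(rewards)), key=lambda a: estimates[a])
    reward = rewards[action]
    counts[action] += 1
    # Incremental average keeps a running estimate of each action's reward.
    estimates[action] += (reward - estimates[action]) / counts[action]

best_action = max(range(len(rewards)), key=lambda a: estimates[a])
print(best_action, counts)
```

Without the exploration branch the agent would keep choosing the first action that ever paid off and would never discover the better one.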

56
Q

To teach developers about developing a reinforcement learning model, Amazon offers a model race car called _____ that you can teach to drive on a racetrack. With this, the car is the agent, and the track is the environment.

A

AWS DeepRacer

57
Q

T/F: Both unsupervised and reinforcement learning work without labeled data.

A

True

58
Q

T/F: Unsupervised learning algorithms receive inputs with no specified outputs during the training process.

A

True

59
Q

T/F: Reinforcement learning has a predetermined end goal. While it takes an exploratory approach, the explorations are continuously validated and improved to increase the probability of reaching the end goal.

A

True

60
Q

When a model performs better on training data than it does on new data, it is called _____, and it is said that the model does not generalize well.

A

overfitting

61
Q

The best way to correct a model that is overfitting _____

A

is to train it with data that is more diverse

62
Q

If you train your model for too long, it will start to overemphasize unimportant features called _____, which is another way of overfitting.

A

noise

63
Q

_____ is a type of error that occurs when the model cannot determine a meaningful relationship between the input and output data.

A

Underfitting

64
Q

_____ models give inaccurate results for both the training dataset and new data.

A

Underfit
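
The overfitting and underfitting cards can be compared side by side with three toy models, sketched below under stated assumptions: the data is made up (outputs follow y = 2x, with noise added to the training outputs), the "overfit" model simply memorizes the training set, and the "underfit" model ignores its input.

```python
# Made-up data: outputs follow y = 2x, and the training outputs carry +/-0.5 noise.
train = [(0, 0.5), (1, 1.5), (2, 4.5), (3, 5.5), (4, 8.5), (5, 9.5)]
test = [(0.4, 0.8), (1.4, 2.8), (2.4, 4.8), (3.4, 6.8), (4.4, 8.8)]

def mse(model, data):
    """Mean squared error of a model (a function x -> y) on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

mean_x = sum(x for x, _ in train) / len(train)
mean_y = sum(y for _, y in train) / len(train)

def underfit(x):
    """Ignores the input and always predicts the average training output."""
    return mean_y

def overfit(x):
    """Memorizes the training set: returns the output stored for the nearest training x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Least-squares line through the training data (closed-form slope and intercept).
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)
intercept = mean_y - slope * mean_x

def linear(x):
    """A reasonable fit: captures the trend without memorizing the noise."""
    return slope * x + intercept

for name, model in [("underfit", underfit), ("overfit", overfit), ("linear", linear)]:
    print(name, round(mse(model, train), 3), round(mse(model, test), 3))
```

The memorizing model has zero error on the training data but a larger error on new data than the straight line, while the constant model is inaccurate on both sets.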

65
Q

_____ is when there are disparities in the performance of a model across different groups. The results are skewed in favor of or against an outcome for a particular class.

A

Bias

66
Q

The quality of a model depends on _____ and _____.

A

the underlying data quality and quantity

67
Q

T/F: If a model is showing bias, the weight of features that are introducing noise can be directly adjusted by the data scientists.

A

True

68
Q

_____, such as those that guard against age and sex discrimination, should be identified at the beginning, before creating a model.

A

Fairness constraints

69
Q

Training data should be inspected and evaluated for potential bias, and models need to be continually evaluated by checking their results for _____.

A

fairness

70
Q

Deep learning is a type of machine learning that uses algorithmic structures called _____.

A

neural networks

71
Q

In deep learning models, we use software modules called _____ to simulate the behavior of neurons.

A

nodes

72
Q

_____ comprise layers of nodes, including an input layer, several hidden layers, and an output layer of nodes.

A

Deep neural networks

73
Q

Every node in the neural network autonomously assigns _____ to each feature.

A

weights

74
Q

With neural networks, information flows through the network in a _____ direction from input to output.

A

forward

75
Q
  1. Every node autonomously assigns weights to each feature.
  2. Information flows forward through the network from input to output.
  3. During training, the difference between the predicted output and the actual output is calculated.
  4. The weights of the neurons are repeatedly adjusted to minimize the error.
A

How neural networks work
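
The four steps above can be sketched with a single node, which is far simpler than a real deep network (no hidden layers, no activation function). The training examples, learning rate, and epoch count are assumptions made for the example.

```python
# Training examples: (features, expected output), generated from 3*x1 - 2*x2.
examples = [((1.0, 0.0), 3.0), ((0.0, 1.0), -2.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 4.0)]

weights = [0.0, 0.0]   # one weight per input feature, adjusted during training
learning_rate = 0.1

def forward(features):
    """Forward pass: weighted sum of the inputs (a single linear node)."""
    return sum(w * x for w, x in zip(weights, features))

for _ in range(200):
    for features, expected in examples:
        predicted = forward(features)
        error = predicted - expected   # difference between predicted and actual output
        for i, x in enumerate(features):
            # Nudge each weight against its contribution to the error.
            weights[i] -= learning_rate * error * x

print([round(w, 3) for w in weights])
```

A deep network repeats the same idea across many nodes and layers, with the error propagated backward to decide how much each weight should change.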

76
Q

_____ can excel at tasks like image classification and natural language processing where there is a need to identify the complex relationship between data objects.

A

Deep learning

77
Q

What made deep learning a viable option?

A

low-cost cloud computing

78
Q

Because anyone can now readily use powerful computing resources in the cloud, _____ have become the standard algorithmic approach to computer vision.

A

neural networks

79
Q

A big advantage of deep learning models for computer vision is that _____.

A

they don’t need the relevant features given to them. They can identify patterns in images and extract the important features on their own.

80
Q

The decision to use traditional machine learning or deep learning depends on _____.

A

the type of data you need to process

81
Q

Traditional machine learning algorithms will generally perform well and be efficient when it comes to _____.

A

identifying patterns from structured data and labeled data

82
Q

Deep learning solutions are more suitable for _____ data like images, videos, and text.

A

unstructured

83
Q

Tasks for deep learning include _____.

A

image classification and natural language processing

84
Q

Both types of machine learning use statistical algorithms, but only deep learning uses _____ to simulate human intelligence.

A

neural networks

85
Q

Do deep learning models require a lot of work on selecting/extracting features?

A

No, because they are self-learning.

86
Q

_____ is accomplished by using deep learning models that are pre-trained on extremely large datasets containing strings of text or, in AI terms, _____.

A

Generative AI / sequences

87
Q

Gen AI deep learning models use transformer neural networks, which change an input sequence, in Gen AI known as _____, into an output sequence, which is the response to your _____.

A

prompt

88
Q

Traditional (recurrent) neural networks process the elements of a sequence sequentially, _____.

A

one word at a time

89
Q

Transformers process the sequence in _____, which speeds up the training and allows much bigger datasets to be used.

A

parallel

90
Q

They outperform other ML approaches to natural language processing. They excel at understanding human language so they can read long articles and summarize them. They are also great at generating text that’s similar to the way a human would. As a result, they are good at language translation and even writing original stories, letters, articles, and poetry. They even know computer programming languages and can write code for software developers.

A

Large language models

91
Q

T/F: Complex models generally present a tradeoff of performance compared with interpretability.

A

True

92
Q

T/F: Less complex models mean lower performance.

A

True

93
Q

If a software application always produces the same output for the same input, it is said to be _____.

A

deterministic

94
Q

A rule-based application is deterministic unless _____.

A

someone changes the rules

95
Q

T/F: With a probabilistic ML model, identical sets of input values can produce results that aren’t consistent.

A

True

96
Q

If determinism is necessary, then a _____ is a better option.

A

rule-based system

97
Q

If your dataset consists of features or attributes as inputs with labeled target values as outputs, then you have a _____ learning problem.

A

supervised

98
Q

For a supervised learning problem, you train your model with _____.

A

data containing known inputs and outputs

99
Q

If your target values are categorical, for example, one or more discrete values, then you have a _____ problem.

A

classification

100
Q

If the target values you’re trying to predict are mathematically continuous, then you have a _____ problem.

A

regression

101
Q

If your dataset consists of features or attributes as inputs that do not contain labels or target values, then you have an _____ problem.

A

unsupervised learning

102
Q

How should patterns be predicted in unsupervised learning problems?

A

Based on the pattern discovered in the input data.

103
Q

The goal in unsupervised learning problems is to _____, such as groupings, within the data.

A

discover patterns

104
Q

When your data needs to be separated into discrete groups, you have a _____ problem.

A

clustering

105
Q

If you are seeking to spot outliers in your data, then you have an _____ problem.

A

anomaly detection

106
Q

Classification problems are normally distinguished as _____ or _____.

A

binary or multiclass

107
Q

_____ assigns an input to one of two predefined and mutually exclusive classes based on its attributes.

A

Binary classification

108
Q

_____ estimates the value of a dependent target variable based on one or more other variables, or attributes that are correlated with it.

A

Regression

109
Q

_____ is when there is a direct linear relationship between the inputs and output.

A

Linear regression

110
Q

_____ uses a single independent variable, such as weight, to predict someone’s height.

A

Simple linear regression

111
Q

If we have multiple independent variables, such as weight and age, then we have a _____ problem.

A

multiple linear regression

112
Q

_____ can create a model that takes one or more features as an input to predict the price of a house.

A

Regression analysis

113
Q

_____ is used to measure the probability of an event occurring.

A

Logistic regression

114
Q

A logistic regression prediction is a value between zero and one, where zero indicates _____, and one indicates _____.

A

an event that is unlikely to happen / a maximum likelihood that it will happen

115
Q

Logistic regression equations use _____ functions to compute the regression line and can take one or more independent variables.

A

logarithmic
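
The logistic regression cards can be illustrated with the logistic (sigmoid) function, which turns a linear combination of inputs into a value between zero and one; equivalently, the logarithm of the odds is modeled as a line. The sketch below assumes a single feature and uses made-up "trained" parameters rather than fitting anything.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x: float, weight: float, bias: float) -> float:
    """Probability of the event for one feature x, given already-trained parameters."""
    return sigmoid(weight * x + bias)

# Hypothetical trained parameters, chosen only for illustration.
weight, bias = 1.5, -3.0

for x in (0.0, 2.0, 4.0):
    p = predict_probability(x, weight, bias)
    label = 1 if p >= 0.5 else 0   # binary classification by thresholding at 0.5
    print(x, round(p, 3), label)
```

Thresholding the probability is what turns this regression-style output into a binary classifier.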

116
Q

Both logistic regression and linear regression require _____ for the models to become accurate in predictions.

A

a significant amount of labeled data

117
Q

_____ is a class of techniques that are used to classify data objects into groups, called clusters. It attempts to find discrete groupings within data.

A

Cluster analysis

118
Q
  1. Members are as similar as possible to each other and as different as possible from members of other groups.
  2. Define features/attributes you want the algorithm to use to determine similarity.
  3. Select a distance function to measure similarity and specify number of clusters/groups you want to analyze.
A

Cluster analysis
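
The three points above (similar members, chosen features, a distance function plus a cluster count) map onto k-means, one widely used clustering algorithm. The cards do not name a specific algorithm, so k-means is an assumption here, as are the sample points and the choice to seed the centroids with the first k points.

```python
import math

def kmeans(points, k, iterations=10):
    """Very small k-means: Euclidean distance, first k points as starting centroids."""
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical 2-D feature vectors forming two visibly separate groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere; the grouping comes entirely from the distance function and the requested number of clusters.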

119
Q

_____ is the identification of rare items, events, or observations in the data, which raise suspicions, because they differ significantly from the rest of the data.

A

Anomaly detection
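
A very simple way to flag observations that "differ significantly from the rest of the data" is a z-score rule, sketched below. This is only one basic technique, the threshold of 2 standard deviations is an arbitrary choice, and the transaction amounts are invented.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []   # all values identical, so nothing can stand out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
print(find_anomalies(amounts))
```

Note that a large outlier inflates the mean and standard deviation it is measured against, which is one reason production systems tend to use more robust methods.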

120
Q

_____ is a pre-trained deep learning service for computer vision. It meets the needs of several common computer vision use cases without requiring customers to train their own models. It works with images, videos, and streaming video, and supports facial recognition.

A

Amazon Rekognition

121
Q

Uses for Amazon Rekognition

A
  1. Detect and label objects
  2. Security systems that identify objects in real-time streaming video
  3. Add labels for any text it sees, for example a street sign
  4. Flag questionable content for human review
122
Q

_____ extracts text, handwriting, forms, and tabular data from scanned documents.

A

Amazon Textract

123
Q

_____ is a natural language processing service that helps discover insights and relationships in text. A typical use is analyzing customer feedback.

A

Amazon Comprehend

124
Q

Common use case for Comprehend and Textract

A

detecting personally identifiable information (PII) in text

125
Q

_____ helps build voice and text interfaces to engage with customers. Used for customer service chatbots and interactive voice response systems.

A

Amazon Lex

126
Q

_____ is an automatic speech recognition service that supports over 100 languages. This is designed to process live and recorded audio or video input to provide high quality transcriptions for search and analysis. A common use case is to caption streaming audio in real time.

A

Amazon Transcribe

127
Q

_____ turns text into natural-sounding speech in dozens of languages. It uses deep learning technologies to synthesize human speech.

A

Amazon Polly

128
Q

Common use cases include converting articles to speech and prompting callers in interactive voice response systems.

A

Amazon Polly

129
Q

_____ uses machine learning to perform an intelligent search of enterprise systems to quickly find content. It uses natural language processing to understand questions.

A

Amazon Kendra

130
Q

_____ allows businesses to automatically generate personalized recommendations for their customers in industries such as retail, media, and entertainment.

A

Amazon Personalize

131
Q

_____ fluently translates text between 75 different languages. It is built on a neural network that considers the entire context of the source sentence and the translation it has generated so far. It uses this information to create more accurate and fluent translations.

A

Amazon Translate

132
Q

_____ is an AI service for time series forecasting. By providing it with historical time series data, you can predict future points in the series. Time series forecasting is useful in multiple domains.

A

Amazon Forecast

133
Q

_____ helps to identify potentially fraudulent online activities such as online payment fraud and creation of fake accounts. It features pre-trained data models to detect fraud in online transactions, product reviews, checkout and payments, new accounts, and account takeovers.

A

Amazon Fraud Detector

134
Q

_____ is a fully managed service to build generative AI applications on AWS, and it lets you choose from high performing foundation models trained by Amazon, Meta, and leading AI startups. You can customize a foundation model by providing your own training data or creating a knowledge base for the model to query.

A

Amazon Bedrock

135
Q

When a generative AI model calls an external knowledge system to retrieve information outside its training data, this is called _____.

A

Retrieval Augmented Generation
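
The retrieval step can be sketched without any model at all: look up relevant text, then place it in the prompt. Everything below is a toy under stated assumptions. The knowledge base, the word-overlap scoring, and the helper names are hypothetical, real systems usually retrieve with vector embeddings, and no foundation model is actually called.

```python
# Hypothetical knowledge base: in practice this would be an external document store.
knowledge_base = [
    "The return window for online orders is 30 days.",
    "Standard shipping takes 3 to 5 business days.",
    "Gift cards cannot be exchanged for cash.",
]

def words(text: str) -> set[str]:
    """Lowercased word set, with basic punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}

def retrieve(question: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the question (toy retrieval)."""
    return max(documents, key=lambda d: len(words(question) & words(d)))

def build_prompt(question: str, documents: list[str]) -> str:
    """Augment the user's question with retrieved context before sending it to a model."""
    context = retrieve(question, documents)
    return f"Context: {context}\nQuestion: {question}\nAnswer using only the context."

prompt = build_prompt("How many days does standard shipping take?", knowledge_base)
print(prompt)
```

The model's training data never changes; only the prompt does, which is why updating the knowledge base is enough to keep answers current.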

136
Q

Use the _____ foundation model from Amazon to generate an image in response to a prompt.

A

Titan Image Generator

137
Q

Use the _____ family of services when you need more customized machine learning models or workflows that go beyond the prebuilt functionalities offered by the core AI services.

A

Amazon SageMaker

138
Q
  1. Provides machine learning capabilities for data scientists and developers to prepare, build, train, and deploy high-quality ML models efficiently.
  2. It comprises several services that are optimized for building and training custom machine learning models, which include data preparation and labeling, large-scale parallel training on multiple instances or GPU clusters, model deployment, and real-time inference endpoints.
  3. To accelerate the development process, it offers pre-trained models that you can use as a starting point and reduce the resources needed for data preparation and model training.
A

Amazon SageMaker

139
Q

What makes generative AI models more accurate and current with their responses?

A

Retrieval augmented generation

140
Q

A _____ is a series of interconnected steps that start with a business goal and finish with operating a deployed ML model. It starts with defining the problem, collecting and preparing training data, training the model, deploying, and finally, monitoring it.

A

machine learning pipeline

141
Q
  1. Have a clear idea of the problem
  2. Be able to measure business value against objectives and success criteria
  3. Align stakeholders to gain consensus on the goal
  4. Evaluate the organization’s ability to move forward with the target
  5. Evaluate all options for achieving the goal
  6. Considering cost, determine how accurate outcomes will be
  7. Ensure enough good training data is available
  8. Perform a cost-benefit analysis
A

Determine if ML is best solution

142
Q

With _____, you can create a custom classifier that uses your own categories by supplying it with your training data.

A

Amazon Comprehend

143
Q

_____ lets you start with a fully trained foundation model. You can fine-tune this model with your own data using transfer learning.

A

Amazon Bedrock

144
Q

_____ provides pre-trained AI foundation models and task-specific models for computer vision and natural language processing problem types. These are pre-trained on large public datasets.

A

SageMaker JumpStart

145
Q

Fine-tuning the model with incremental training using your own dataset.

A

transfer learning

146
Q
  1. Identify the data needed and determine the options for collecting the data.
  2. Know what training data you will need to develop your model and where it is generated and stored.
  3. Know if it’s streaming data or whether you can load it in a batch process.
  4. Configure a process known as extract, transform, and load (ETL) to collect the data from possibly multiple sources and store it in a centralized repository.
  5. Know if the data is labeled or how you will be able to label it.
  6. Determine which characteristics of the dataset should be used as features to train the model.
A

Collecting/processing training data

147
Q

___% of the data should be used for training the model, ___% should be set aside for model evaluation, and ___% for performing the final test before deploying the model to production.

A

80/10/10
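
The 80/10/10 split can be sketched in a few lines. The `split_dataset` helper, its parameter names, and the seed are assumptions for the example; the records are just integers standing in for labeled examples.

```python
import random

def split_dataset(records, train=0.8, validation=0.1, seed=42):
    """Shuffle the records, then cut them into training, validation, and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    train_end = int(len(shuffled) * train)
    validation_end = train_end + int(len(shuffled) * validation)
    return shuffled[:train_end], shuffled[train_end:validation_end], shuffled[validation_end:]

records = list(range(100))   # stand-in for 100 labeled examples
train_set, validation_set, test_set = split_dataset(records)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before cutting matters when the source data is ordered, although time series data is usually split chronologically instead.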

148
Q

T/F: You should reduce the features in your training data to only those that are needed for inference.

A

True

149
Q

T/F: Features can be combined to further reduce the number of features.

A

True

150
Q

_____ is a fully managed ETL service. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point this to your data stored on AWS.

A

AWS Glue

151
Q

AWS Glue discovers your data and stores the associated metadata, the table definition, and schema in the _____.

A

AWS Glue Data Catalog

152
Q
  1. Generates the code to execute your data transformations and data loading processes.
  2. Has built-in transformations for things like dropping duplicate records, filling in missing values, and splitting your dataset.
  3. Can extract, transform, and load data from a large variety of data stores, which include relational databases, data warehouses, and other cloud, or even streaming services.
  4. Can crawl your data sources and automatically determine the data schema by using classifiers.
  5. Writes the schema to tables in the Data Catalog.
A

AWS Glue

153
Q

The _____ tables include an index to the location, schema, and runtime metrics of your data. You use the information in this to create and monitor your ETL jobs.

A

AWS Glue Data Catalog

154
Q
  1. A visual data preparation tool that enables users to clean and normalize data without writing any code.
  2. You can interactively discover, visualize, clean, and transform raw data.
  3. Makes smart suggestions to help you identify data quality issues that can be difficult to find and time-consuming to fix.
  4. Save transformation steps in a recipe, which you can update or reuse later with other datasets and deploy on a continuing basis.
  5. Provides more than 250 built-in transformations, with a visual point-and-click interface for creating and managing data transformation jobs. These include removing nulls, replacing missing values, fixing schema inconsistencies, creating column-based functions and more.
  6. Use to evaluate the quality of your data by defining rule sets and running profiling jobs.
A

AWS Glue DataBrew

155
Q
  1. Helps you build high-quality training datasets for your machine learning models.
  2. Uses a machine learning model to label your training data. It automatically labels the data that it can, and the rest is given to a human workforce.
A

SageMaker Ground Truth

156
Q

You can use _____ to prepare, featurize, and analyze your data, and you can simplify the feature engineering process by using a single visual interface. Contains over 300 built-in transformations so that you can quickly normalize, transform, and combine features without having to write any code.

A

Amazon SageMaker Canvas

157
Q

Using the ____ data selection tool, you can choose the raw data that you want from various data sources and import it with a single click.

A

SageMaker Data Wrangler

158
Q
  1. Is a centralized store for features and associated metadata, so features can be easily discovered and reused.
  2. Makes it easy to create, share, and manage features for ML development.
  3. Accelerates this process by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm.
  4. Create workflow pipelines that convert raw data into features and add them to feature groups.
A

Amazon SageMaker Feature Store

159
Q

During training, the machine learning algorithm updates a set of numbers, known as _____. The goal is to update the parameters in the model in such a way that the inference matches the expected output.

A

parameters or weights

160
Q

T/F: When training the model, the ML algorithm looks at the weights and outputs from previous iterations, and shifts the weights in a direction that lowers the error in the generated output.

A

True

161
Q

What are the two conditions that stop the iterative ML algorithm process?

A
  1. When a defined number of iterations have been run.
  2. When the change in error is below a target value.
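
The weight updates and the two stopping conditions can be sketched with a tiny gradient descent loop. This is only an illustrative example for a straight-line fit; the function name `fit_line`, the learning rate, and the tolerance are assumptions, not values from any particular service.

```python
def fit_line(xs, ys, learning_rate=0.05, max_iterations=10_000, tolerance=1e-9):
    """Fit y = weight * x + bias by gradient descent on the mean squared error."""
    weight, bias = 0.0, 0.0
    previous_error = float("inf")
    n = len(xs)
    iterations_run = 0
    # Stop condition 1: a defined number of iterations has been run.
    for iterations_run in range(1, max_iterations + 1):
        predictions = [weight * x + bias for x in xs]
        error = sum((p - y) ** 2 for p, y in zip(predictions, ys)) / n
        # Stop condition 2: the change in error is below a target value.
        if abs(previous_error - error) < tolerance:
            break
        previous_error = error
        # Shift the weights in the direction that lowers the error.
        grad_w = 2 / n * sum((p - y) * x for p, y, x in zip(predictions, ys, xs))
        grad_b = 2 / n * sum(p - y for p, y in zip(predictions, ys))
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias, iterations_run


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]   # generated from y = 2x + 1
weight, bias, iterations_run = fit_line(xs, ys)
print(round(weight, 2), round(bias, 2), iterations_run)
```

On this toy data the loop should exit through the tolerance check well before the iteration cap is reached.
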
162
Q

When there are multiple algorithms for a model, the best practice is to:

A

run many training jobs in parallel, by using different algorithms and settings (running experiments).

163
Q

Each algorithm has a set of external parameters that affect its performance, known as _____.

A

hyperparameters

164
Q

Who sets the hyperparameters and when?

A

data scientists before training the model

165
Q

The optimal values for the hyperparameters can only be determined by _____.

A

running multiple experiments with different settings
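
A minimal sketch of "running multiple experiments with different settings" is a grid search. Here `train_and_score` is a hypothetical stand-in for a real training job: it fits a single weight on toy data and returns the validation error for one combination of hyperparameters.

```python
from itertools import product


def train_and_score(learning_rate, iterations):
    """Toy training job: fit w in y = w * x and return the validation error."""
    train = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
    validation = [(4.0, 12.0), (5.0, 15.0)]
    w = 0.0
    for _ in range(iterations):
        grad = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in validation) / len(validation)


# Hyperparameters are set before training; each combination is one experiment.
search_space = {"learning_rate": [0.001, 0.01, 0.1], "iterations": [10, 100]}

results = []
for learning_rate, iterations in product(*search_space.values()):
    score = train_and_score(learning_rate, iterations)
    results.append((score, {"learning_rate": learning_rate, "iterations": iterations}))

# Lower validation error is better for this metric.
best_score, best_settings = min(results, key=lambda item: item[0])
print(best_settings)
```

Real tuning services search the space more cleverly than an exhaustive grid, but the idea of comparing runs on a chosen metric is the same.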

166
Q
  1. Specify the URL of the S3 bucket containing your training data.
  2. Specify the compute resources you want to use for training, and the output bucket for the model artifacts.
  3. Specify the algorithm by giving SageMaker the path to a Docker container image that contains the training algorithm.
A

How to create a training job on SageMaker
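
The three inputs above map onto the request that a training job takes. The sketch below only builds the request as a dictionary; the field names follow the SageMaker `CreateTrainingJob` API as I understand it, and every value (bucket names, role ARN, image URI, instance type, the `max_depth` hyperparameter) is a made-up placeholder. Check the current boto3 documentation before relying on it.

```python
def build_training_job_request(job_name, role_arn, image_uri, train_s3_uri, output_s3_uri):
    """Assemble the pieces a SageMaker training job needs (illustrative only)."""
    return {
        "TrainingJobName": job_name,
        "RoleArn": role_arn,
        # 3. The algorithm: path to a Docker container image with the training code.
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        # 1. The S3 location of the training data.
        "InputDataConfig": [
            {
                "ChannelName": "train",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": train_s3_uri,
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        # 2. Compute resources and the output bucket for the model artifacts.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 10,
        },
        "OutputDataConfig": {"S3OutputPath": output_s3_uri},
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        # Hyperparameter values are passed as strings (placeholder example).
        "HyperParameters": {"max_depth": "5"},
    }


request = build_training_job_request(
    job_name="example-training-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algo:latest",
    train_s3_uri="s3://example-bucket/train/",
    output_s3_uri="s3://example-bucket/output/",
)
# With credentials configured, the request could be submitted with
# boto3.client("sagemaker").create_training_job(**request); that call is not made here.
```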

167
Q

In the _____, you can specify the location of SageMaker provided algorithms and deep learning containers, or the location of your custom container, containing a custom algorithm, and set the hyperparameters required by the algorithm.

A

Amazon Elastic Container Registry, Amazon ECR

168
Q

An _____ is a group of training runs, each with different inputs, parameters, and configurations. It features a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best-performing models.

A

experiment

169
Q

_____, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset. To do this, it uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create the model that performs best, as measured by a metric that you choose.

A

Amazon SageMaker automatic model tuning, AMT

170
Q
  1. Configure a tuning job that runs several training jobs inside a loop.
  2. Specify completion criteria as the number of jobs that are no longer improving the metric, and the job will run until the completion criteria are satisfied.
A

How to use automatic model tuning:

171
Q
  1. Determine whether you need batch or real-time inferencing or both
  2. Configure and manage the inference endpoint
A

How to deploy a model so it can be used for inferences

172
Q

API Gateway can serve as the interface with the clients and forward requests to an _____, which is running the model.

A

AWS Lambda function

173
Q

How do you use SageMaker inferencing?

A

Point SageMaker to your model artifacts in an S3 bucket and a Docker container image in Amazon ECR.

174
Q

For real-time, asynchronous, and batch inference, SageMaker runs the model on _____, which can be inside an auto scaling group.

A

EC2 ML instances

175
Q

For the serverless inference option, SageMaker runs your code on _____.

A

Lambda functions

176
Q

_____ is ideal when you want to queue incoming requests and have large payloads with long processing times.

A

Amazon SageMaker Asynchronous Inference

177
Q

_____ can be used to serve model inference requests in real time without directly provisioning compute instances, or configuring scaling policies to handle traffic variations.

A

Serverless inference

178
Q

_____ is ideal for inference workloads where you need real-time interactive responses from your model. Use this for a persistent and fully managed endpoint REST API that can handle sustained traffic backed by the instance type of your choice.

A

Real-time inference

179
Q

What are some reasons model quality can degrade over time?

A

data quality, model quality, and model bias

180
Q

The model monitoring system must:

A
  1. capture data
  2. compare the data to the training set
  3. define rules to detect issues
  4. send alerts
181
Q

T/F: For most ML models, a simple scheduled approach for re-training daily, weekly, or monthly is usually enough.

A

True

182
Q

What should the monitoring system do?

A
  1. detect data and concept drifts
  2. initiate an alert
  3. send it to an alarm manager system, which could automatically start a re-training cycle.
183
Q

_____ is when there are significant changes to the data distribution compared to the data used for training.

A

Data drift
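
One crude way to picture data drift is to compare a feature's live values with its training values. The sketch below measures how far the live mean has moved, in units of the training standard deviation. The function name `mean_shift`, the sample numbers, and the alert threshold of 3 are all assumptions for illustration; this is not how any specific monitoring service computes drift.

```python
from statistics import mean, stdev


def mean_shift(training_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    return abs(mean(live_values) - mean(training_values)) / stdev(training_values)


training = [10, 11, 9, 10, 12, 10, 9, 11]
live_ok = [10, 10, 11, 9]
live_drifted = [15, 16, 14, 15]

ALERT_THRESHOLD = 3.0   # hypothetical rule: alert when the shift exceeds 3 deviations

for name, live in [("live_ok", live_ok), ("live_drifted", live_drifted)]:
    shift = mean_shift(training, live)
    print(name, round(shift, 2), "ALERT" if shift > ALERT_THRESHOLD else "ok")
```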

184
Q

_____ is when the properties of the target variables change.

A

Concept drift

185
Q

_____, which is a capability of Amazon SageMaker, monitors models in production and detects errors so you can take remedial actions.

A

Amazon SageMaker Model Monitor

186
Q
  1. _____ is about taking the established best practices of software engineering and applying them to machine learning model development.
  2. It’s about automating manual tasks, testing, and evaluating code before release, and responding automatically to incidents.
  3. It can streamline model delivery across the machine learning development lifecycle.
A

MLOps

187
Q

T/F: With MLOps, everything gets versioned, including the training data.

A

True

188
Q
  1. Monitoring deployments to detect potential issues
  2. Automating re-training because of issues or data and code changes.
A

key MLOps principles

189
Q

What are the major benefits of MLOps?

A

Productivity
Repeatability
Reliability
Auditability
Data/model quality

190
Q

For _____, MLOps can improve auditability by versioning all inputs and outputs from data science experiments to source data to trained models.

A

compliance

191
Q
  1. _____ offers the ability to orchestrate SageMaker jobs and author reproducible ML pipelines.
  2. These can deploy custom built models for inference in real time with low latency, run offline inferences with batch transform and track lineage of artifacts.
  3. They can institute sound operational practices in deploying and monitoring production workflows, deploying model artifacts, and tracking artifact lineage through a simple interface.
A

Amazon SageMaker Pipelines

192
Q

You can create a pipeline using the _____ or _____. The pipeline can contain all the steps to build and deploy a model, and can also include conditional branches based on the output of a previous step.

A

SageMaker SDK for Python or define the pipeline using JSON

193
Q

Pipelines can be viewed in _____.

A

SageMaker Studio

194
Q

_____ is a source code repository that you can use for storing your inference code. It is comparable to GitHub, a third-party source code repository.

A

AWS CodeCommit

195
Q

What is a repository for the feature definitions of your training data?

A

SageMaker Feature Store

196
Q

_____ is a centralized repository for your trained models and history.

A

SageMaker Model Registry

197
Q

_____ lets you define a workflow with a visual drag-and-drop interface. It gives you the ability to build serverless workflows that integrate various AWS services and custom application logic.

A

AWS Step Functions

198
Q

_____ is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows.

A

Apache Airflow

199
Q

With _____, you can use Apache Airflow and Python to create workflows without having to manage the underlying infrastructure for scalability, availability, and security.

A

Amazon Managed Workflows for Apache Airflow

200
Q

A _____ is used to summarize the performance of a classification model when it’s evaluated against test data. It is a table with the actual values typically across the top and the predicted values on the left.

A

confusion matrix

201
Q

One metric that is sometimes used to judge a model’s performance is _____, which is simply the percentage of correct predictions. This measures how close the predicted class values are to the actual values.

A

accuracy

202
Q

Values for accuracy metrics vary between _____. A value of _____ indicates perfect accuracy and _____ indicates complete inaccuracy.

A

zero and one / one / zero

203
Q

The formula for accuracy is:

A

the number of true positives plus true negatives divided by the total number of predictions.
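
To make the confusion matrix and the accuracy formula concrete, here is a small sketch that counts the four cells for a binary classifier. The labels and the function name `confusion_counts` are made up for the example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (true positives, false positives, true negatives, false negatives)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + fp + tn + fn)
print(tp, fp, tn, fn, accuracy)  # 3 1 5 1 0.8
```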

204
Q

_____ measures how well an algorithm predicts true positives out of all the positives that it identifies.

A

Precision

205
Q

The formula for precision is:

A

the number of true positives divided by the number of true positives, plus the number of false positives.

206
Q

If we want to minimize the false negatives, then we can use a metric known as ___.

A

recall

207
Q

The formula for recall is:

A

the number of true positives divided by the number of true positives plus the number of false negatives.

208
Q

Recall is also known as _____ or the true positive rate.

A

sensitivity
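
Precision and recall differ only in the denominator, which a short sketch makes easier to see. The counts below are invented for the example.

```python
def precision(tp, fp):
    """Of everything predicted positive, how much really was positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Of everything really positive, how much was found (true positive rate)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


tp, fp, fn = 8, 2, 4
p = precision(tp, fp)
r = recall(tp, fn)
print(p, round(r, 3))  # 0.8 0.667
```

Lowering false negatives raises recall, while lowering false positives raises precision.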

209
Q

False positives divided by the sum of the false positives and true negatives.

A

false positive rate (how many measured as fish out of images that weren’t fish)

210
Q

The ratio of the true negatives to the sum of the false positives and true negatives.

A

true negative rate (how many measured as not fish of those that weren’t fish)
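
The false positive rate and true negative rate share the same denominator (all actual negatives), so they always add up to one. A minimal sketch using invented counts for the fish example:

```python
def false_positive_rate(fp, tn):
    """Share of actual negatives that were wrongly predicted positive."""
    return fp / (fp + tn)


def true_negative_rate(fp, tn):
    """Share of actual negatives that were correctly predicted negative."""
    return tn / (fp + tn)


# Out of 20 images that were not fish, 2 were labeled fish and 18 were not.
fp, tn = 2, 18
fpr = false_positive_rate(fp, tn)
tnr = true_negative_rate(fp, tn)
print(fpr, tnr)  # 0.1 0.9
```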

211
Q

The _____ is used to compare and evaluate binary classification algorithms that return probabilities, such as logistic regression.

A

area under the curve, also known as AUC metric

212
Q

A _____ is a value that the model uses to make a decision between the two possible classes. It converts the probability of a sample being part of a class into a binary decision.

A

threshold
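
A threshold can be pictured as a single comparison applied to each predicted probability. In this sketch the probabilities are invented, and treating a value equal to the threshold as positive is an assumed convention.

```python
def classify(probabilities, threshold=0.5):
    """Turn class probabilities into binary decisions (1 = positive class)."""
    return [1 if p >= threshold else 0 for p in probabilities]


probabilities = [0.1, 0.4, 0.55, 0.8, 0.95]

print(classify(probabilities, threshold=0.5))  # [0, 0, 1, 1, 1]
print(classify(probabilities, threshold=0.9))  # [0, 0, 0, 0, 1]
```

Raising the threshold produces fewer positive predictions, which tends to trade recall for precision.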

213
Q

The curve that AUC measures the area under is called the _____.

A

receiver operating characteristic (ROC) curve

214
Q

AUC provides an aggregated measure of the model performance across the full range of thresholds, and the AUC scores vary between _____.

A

zero and one

215
Q

With AUC, a score of one indicates _____ and a score of one half, or 0.5, indicates that _____.

A

perfect accuracy / the prediction is no better than a random classifier
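
One way to compute AUC without plotting the curve is to use its rank interpretation: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise definition; the function name `auc` and the sample scores are assumptions for illustration.

```python
def auc(actual, scores):
    """Area under the ROC curve, computed by comparing every positive/negative pair."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))


actual = [0, 0, 1, 1]
print(auc(actual, [0.1, 0.2, 0.8, 0.9]))    # 1.0  (perfect ranking)
print(auc(actual, [0.1, 0.4, 0.35, 0.8]))   # 0.75
print(auc(actual, [0.5, 0.5, 0.5, 0.5]))    # 0.5  (no better than random)
```

The pairwise loop is quadratic in the number of examples, so it is suited to small illustrations rather than large datasets.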

216
Q

The distance between the line and the actual values in linear regression is the _____.

A

error

217
Q

A metric that we can use to evaluate a linear regression model is called the _____. To compute it, we take the difference between the prediction and actual value, square the difference, and then compute the average of all square differences. These values are always positive.

A

mean squared error, MSE

218
Q

The square root of the mean squared error. The advantage of using this is that the units match the dependent variable.

A

root mean squared error

219
Q

Averages the absolute values of the errors, so it doesn’t emphasize the large errors.

A

mean absolute error
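
The three regression metrics can be compared side by side on a small made-up example. Note how the single large error (4) pulls RMSE well above MAE.

```python
from math import sqrt


def mean_squared_error(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    return sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 13.0]   # errors: -1, 0, 1, 4

mse = mean_squared_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
mae = mean_absolute_error(actual, predicted)
print(mse, round(rmse, 3), mae)  # 4.5 2.121 1.5
```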

220
Q

____ help us quantify the value of a machine learning model to the business.

A

Business metrics

221
Q

If you need a balance between precision and recall, because improving one usually comes at the expense of the other, what formula should you use?

A

F1 = 2 * (Precision * Recall) / (Precision + Recall)
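
The F1 score is the harmonic mean of precision and recall, that is, 2 * (precision * recall) / (precision + recall). A minimal sketch with invented values:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * (precision * recall) / (precision + recall)


f1 = f1_score(0.8, 0.5)
print(round(f1, 4))  # 0.6154
```

Because it is a harmonic mean, F1 stays low unless both precision and recall are reasonably high.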

222
Q

What are the two major components of a deployable ML model?

A

Model artifacts, which are the output of model training. Inference code, which is the software that implements the model.

223
Q

Having a deep understanding of a model’s inner mechanics and how and why it makes a prediction.

A

Interpretability

224
Q

Which AWS AI service could you use to filter uploaded images that contain inappropriate content?

A

Amazon Rekognition

225
Q

Business goal setting
Data preparation
Train and tune
Deploy and monitor

A

ML pipeline steps

226
Q

What is the primary purpose of the AWS Glue Data Catalog?

A

It stores metadata for the data sources and targets for ETL jobs.

227
Q

_____ runs your model on a Lambda function that incurs charges only for the length of time that it runs, and it’s most cost-effective for real-time inference when there are also periods of no or intermittent traffic.

A

Serverless inference

228
Q

_____ is a natural language processing, or NLP, service that extracts insights and relationships from text data by using ML.

A

Amazon Comprehend

229
Q

What is the ML lifecycle?

A
  1. Business goal identification
  2. ML problem framing
  3. Data processing (data collection, data preprocessing, feature engineering)
  4. Model development (training, tuning, evaluation)
  5. Model deployment (inference, prediction)
  6. Model monitoring

Mnemonic: Be Proud, Dear, Don’t Do Mediocre (Business, Problem, Data, Development, Deployment, Monitoring).

230
Q

During ____, you perform explainability techniques and evaluate the accuracy and performance of the model. The goal of this stage is to determine if the model requires additional data fine-tuning, ML algorithm fine-tuning, or if the model is ready for deployment.

A

model evaluation

231
Q

_____ is a stage in the ML development lifecycle that occurs after model deployment. During this stage, you monitor the model to identify issues that relate to data or model quality, and issues that relate to bias or feature attribution drift. The goal of this stage is to identify if the model maintains the necessary performance levels and identify when there is drift or model degradation.

A

Model monitoring

232
Q

_____ is a stage in the ML development lifecycle that occurs after a model is trained, tuned, and evaluated. During this stage, you deploy the model into production to begin making predictions.

A

Model deployment

233
Q

_____ is a step in the ML development lifecycle that occurs during the data preparation stage. During this step, you select and transform variables to create features or attributes.

A

Feature engineering

234
Q

_____ creates features or variables that can help the model generate more accurate results and improve overall performance during model training.

A

Feature engineering
