Domain 1 Flashcards
Explain the AI relationship Venn diagram
Artificial Intelligence, Machine Learning, Deep Learning
Predictions that AI makes based on historical data
Inference
When AI recognizes a deviation from what has happened in the past
Anomaly detection
What are some AWS services that could provide structured input data for training ML models?
RDS, Redshift
What are some AWS services that could provide semi-structured input data for training ML models?
DynamoDB, Amazon DocumentDB (with MongoDB compatibility)
For structured, semi-structured, unstructured, and time-series data, where should you export data for training models?
S3
In machine learning, what describes the relationship between inputs and outputs?
An algorithm
Describe the machine learning training process
Known data -> features -> algorithm -> output
Describe the machine learning inference process, which comes after training
new data -> features -> model -> output
What two components are produced that make up a model?
Inference code + model artifacts
What type of inferencing provides low-latency, high throughput, and a persistent endpoint (also usually more expensive)?
Real-time
What type of inferencing is performed offline, uses large datasets, and typically runs on an infrequent schedule?
Batch transform
Training your model with data that is pre-labeled (for example, pictures labeled fish/not fish)
Supervised Learning
What is the challenge with supervised learning?
You need a lot of labeled data, and people to label it, which takes time and money
What is Amazon SageMaker Ground Truth?
A service that helps you with data labeling
What process uses data that has features but is not labeled and is good for pattern recognition, anomaly detection, and grouping data into categories?
Unsupervised learning
What process, unlike supervised and unsupervised learning, provides rewards to an agent when criteria are met, uses trial and error, allows the agent to make mistakes in order to learn, and has an end goal?
Reinforcement learning
What subservice of Ground Truth uses crowdsourcing to label data via affordable labor?
Amazon Mechanical Turk
A model telling you a fish is not a fish because it is out of water, a result of training being too specific and not having enough varied examples, is called what?
Overfitting
What is it called when a model cannot determine a meaningful relationship between the input and output data, which happens when you haven't trained the model long enough or with a large enough dataset?
Underfitting
What is bias?
When a model discriminates against a specific group because of a lack of fair representation in the data used to train the model
Also, if a model is showing bias, what can be done with features?
Data scientists can directly adjust the weight of features that are introducing noise. For example, they could completely remove gender from consideration.
Potential sources of discrimination, such as age and sex, should be identified at the beginning, before creating a model. These are known as
Fairness constraints
A type of machine learning that uses algorithmic structures called neural networks.
Deep learning
The three types of layers in deep neural networks
input layer, several hidden layers, and an output layer of nodes
Deep learning can excel at tasks like
image classification and natural language processing where there is a need to identify the complex relationship between data objects
A big advantage of deep learning models for computer vision is that
they don’t need the relevant features given to them.
Traditional machine learning algorithms will generally perform well and be efficient when
It comes to identifying patterns from structured data and labeled data. Examples include classification and recommendation systems.
On the other hand, deep learning solutions are more suitable for
unstructured data like images, videos, and text. Tasks for deep learning include image classification and natural language processing, where there is a need to identify the complex relationships between pixels and words.
Both are forms of machine learning, but only deep learning uses neural networks to simulate human intelligence.
Gen AI Notes
Gen AI uses transformer neural networks, which change an input sequence (known in Gen AI as a prompt) into an output sequence (the response to your prompt). Traditional neural networks process the elements of a sequence sequentially, one word at a time. Transformers process the sequence in parallel, which speeds up training and allows much bigger datasets to be used. They outperform other ML approaches to natural language processing. They excel at understanding human language, so they can read long articles and summarize them. They are also great at generating text that is similar to the way a human would write it. As a result, they are good at language translation and even writing original stories, letters, articles, and poetry. They even know computer programming languages and can write code for software developers.
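A minimal sketch of transformer-based summarization, using the open-source Hugging Face transformers library (an assumed tool, not part of these notes), only to illustrate the summarization capability described above:

```python
# Illustrative only: downloads a default pretrained transformer model on first use.
from transformers import pipeline

summarizer = pipeline("summarization")

article = (
    "Replace this placeholder with a long article. Transformers process the "
    "whole sequence in parallel, which is why they handle long text well and "
    "can condense it into a short summary."
)

summary = summarizer(article, max_length=60, min_length=20, do_sample=False)
print(summary[0]["summary_text"])
```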
Consider these use cases for AI/ML
Increasing business efficiency
Solving complex problems
Making better decisions
Consider AI/ML alternatives when
Costs outweigh benefits
Models cannot meet interpretability requirements
(can’t know how a neural network made a decision, so instead use a rules based system)
Systems must be deterministic (produces same output with the same input) rather than probabilistic
If your dataset consists of features or attributes as inputs with labeled target values as outputs, then you have this type of problem, in which you train your model with data containing known inputs and outputs.
A supervised learning problem
If your target values are categorical, for example, one or more discrete values, then you have a
classification problem (supervised learning)
If these target values you’re trying to predict are mathematically continuous, then you have a
regression problem.
Binary classification
assigns an input to one of two classes (for example, fish or not fish) based on the input attributes.
Multiclass classification
assigns an input to one of several classes based on the input attributes. An example is the prediction of the topic most relevant to a text document.
When your target values are mathematically continuous, then you have a
regression problem. Regression estimates the value of a dependent target variable based on one or more other variables. If we have multiple independent variables, such as weight and age, then we have a multiple linear regression problem.
Cluster analysis is a class of techniques used to classify data objects into groups, called clusters. It attempts to find discrete groupings within data. Members of a group are as similar as possible to one another, and as different as possible from members of other groups.
You define the features or attributes that you want the algorithm to use to determine similarity. Then you select a distance function to measure similarity and specify the number of clusters, or groups, you want for the analysis.
Cluster analysis
The identification of rare items, events, or observations in the data which raise suspicions because they differ significantly from the rest of the data
Anomaly detection
This service provides facial recognition, object detection, text detection, and content moderation
Amazon Rekognition
Extracts text, handwriting, etc from scanned documents
Amazon Textract
Extracts key phrases, entities, and sentiment
Amazon Comprehend
This service is pretrained to find PII
Amazon Comprehend
Converts Text to Speech
Polly
Converts Speech (Live and recorded) to Text
Transcribe
This AWS service provides intelligent document search and responds to questions with appropriate context
Amazon Kendra
Personalized product recommendations
Amazon Personalize
Translates between 75 languages, built on a neural network
Amazon Translate
Provided with historical time series data, this AWS service predicts future points in time series
Amazon Forecast
Detects fraud through checking online transactions, product reviews, checkout and payments, new accounts, and account takeover
Amazon Fraud Detector
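To make the managed AI services above concrete, here is a hedged boto3 sketch calling two of them; it assumes AWS credentials, region, and IAM permissions are already configured, and the input text is made up:

```python
import boto3

comprehend = boto3.client("comprehend")
translate = boto3.client("translate")

text = "I love this product, it works great!"

# Amazon Comprehend: detect the sentiment of the text.
sentiment = comprehend.detect_sentiment(Text=text, LanguageCode="en")
print(sentiment["Sentiment"])  # e.g. POSITIVE

# Amazon Translate: translate the same text into Spanish.
result = translate.translate_text(
    Text=text, SourceLanguageCode="en", TargetLanguageCode="es"
)
print(result["TranslatedText"])
```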
What is the first step in the AI/ML process?
Identify the business goal
When identifying the business goal, what two things should you do?
define success criteria
align stakeholders
What is the second step in the AI/ML process?
Frame the ML problem
When framing the ML problem, what four things should you do?
Define the ML task, including inputs, outputs and metrics
Determine feasibility
Start with the simplest model options
Do a cost benefit analysis
When approaching model selection, what should you do?
Start with the simplest option, such as hosted AI/ML services and pre-trained models. Fully customize only if needed.
To collect training data, you need to know these three things
Data sources
Data ingestion, including ETL
Labels
ETL includes
Gathering, transforming, and storing data in a new central location
What is likely one of the most time intensive parts of processing data?
Labeling, as you likely don’t already have the data labeled and need to do that
When pre-processing data, what types of things are you doing?
Looking for missing data, masking PII data, cleaning it, and splitting it.
What are the recommended splits for data?
80% for training the model
10% for model eval
10% for final testing before prod deploy
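A minimal sketch of the 80/10/10 split above, assuming scikit-learn and NumPy (tools not named in these notes) and dummy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # 1,000 examples with 5 features (dummy data)
y = np.random.randint(0, 2, 1000)    # binary labels (dummy data)

# 80% for training, 20% held back.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=42)
# Split the remaining 20% in half: 10% for model evaluation, 10% for final testing.
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```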
Feature engineering
Deciding which characteristics of the dataset should be used as features to train the model. This is the subset that is relevant and contributes to minimizing the error rate of a trained model. You should reduce the features in your training data to only those that are needed for inference. Features can also be combined to further reduce their number. Reducing the number of features reduces the amount of memory and computing power required for training.
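A small, hypothetical illustration of reducing features before training, assuming a pandas DataFrame with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({
    "weight_kg": [70, 80, 65],
    "weight_lb": [154, 176, 143],   # redundant: derivable from weight_kg
    "age": [34, 45, 29],
    "customer_id": [1, 2, 3],       # identifier, not predictive
})

# Keep only the features expected to contribute to the prediction.
features = df.drop(columns=["weight_lb", "customer_id"])
print(features.columns.tolist())
```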
What service is a cloud-optimized ETL service, contains its own data catalog, and has built-in transformations (dropping duplicate records, splitting data, etc.)?
AWS Glue
Describe the AWS Glue Data Catalog
Crawls source systems, discovers metadata and schemas, understands the source data. Only metadata is stored in the data catalog
For AWS Glue ETL jobs, what is a common destination location for transformed data?
S3
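A hedged boto3 sketch that kicks off an existing Glue ETL job whose transformed output lands in S3; the job name is hypothetical and the job itself (source, transforms, S3 target) is assumed to already exist:

```python
import boto3

glue = boto3.client("glue")

# Start a run of a pre-defined Glue ETL job (hypothetical name).
run = glue.start_job_run(JobName="my-etl-job")
print(run["JobRunId"])
```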
What service has data quality rules, visualization, and data preparation?
AWS Glue DataBrew
What service helps you prepare a well-labeled dataset for use in supervised learning? It uses machine learning to label what it can, then human labelers (such as Mechanical Turk) for what it can't.
Amazon SageMaker Ground Truth
What service can you use to simplify the feature engineering process, to import/prepare/transform/visualize and analyze features?
Amazon SageMaker Data Wrangler (now part of SageMaker Canvas)
Amazon SageMaker Feature Store
Amazon SageMaker Feature Store is a centralized store for features and associated metadata, so features can be easily discovered and reused. Feature Store makes it easy to create, share, and manage features for ML development. Feature Store accelerates this process by reducing repetitive data processing and curation work required to convert raw data into features for training an ML algorithm. You can create workflow pipelines that convert raw data into features and add them to feature groups.
A machine learning algorithm updates a set of numbers in such a way that the inference matches an expected output. These numbers are
Parameters
True or false: The training process requires only one training run.
False. This can't be done in one iteration, because the algorithm has not learned yet. It has no knowledge of how changing weights will shift the output closer toward the expected value. Therefore, it watches the weights and outputs from previous iterations, and shifts the weights in a direction that lowers the error in the generated output. This iterative process stops either when a defined number of iterations have been run, or when the change in error is below a target value.
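A toy illustration of that iterative process in plain NumPy (no AWS services involved): the weights are shifted in the direction that lowers the error, and training stops after a set number of iterations or when the change in error is tiny:

```python
import numpy as np

X = np.random.rand(100, 1)
y = 3 * X[:, 0] + 0.5            # expected outputs for a known relationship

w, b = 0.0, 0.0                  # parameters the algorithm will update
lr = 0.1                         # learning rate (a hyperparameter)
prev_error = float("inf")

for i in range(1000):            # stop after a defined number of iterations...
    pred = w * X[:, 0] + b
    error = np.mean((pred - y) ** 2)
    if abs(prev_error - error) < 1e-8:   # ...or when the change in error is below a target
        break
    prev_error = error
    # Shift the weights in the direction that lowers the error (gradient step).
    w -= lr * np.mean(2 * (pred - y) * X[:, 0])
    b -= lr * np.mean(2 * (pred - y))

print(w, b, error)
```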
What is known as running experiments?
There are usually multiple algorithms to consider for a model. The best practice is to run many training jobs in parallel, by using different algorithms and settings. This is known as running experiments, which helps you land on the best-performing solution
Each algorithm has a set of external parameters that affect its performance. These are set by the data scientists before training the model. These include adjusting things like how many neural layers and nodes there will be in a deep learning model. The optimal values can only be determined by running multiple experiments with different settings.
known as hyperparameters
To run a training job, what do you give SageMaker?
The URL of the S3 bucket containing your training data. You also specify the compute resources you want to use for training, and the output bucket for the model artifacts. You specify the algorithm by giving SageMaker the path to a Docker container image that contains the training algorithm; in Amazon Elastic Container Registry (Amazon ECR), this can be the location of a SageMaker-provided algorithm or deep learning container, or the location of your own custom container with a custom algorithm. You also need to set the hyperparameters required by the algorithm.
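A hedged sketch of supplying those inputs with the SageMaker Python SDK; the image URI, role ARN, and bucket names are placeholders, and exact arguments can vary by SDK version:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/my-algorithm:latest",  # ECR image with the training algorithm
    role="arn:aws:iam::123456789012:role/MySageMakerRole",                     # execution role (placeholder)
    instance_count=1,
    instance_type="ml.m5.xlarge",                                              # compute resources for training
    output_path="s3://my-bucket/model-artifacts/",                             # where model artifacts land
    hyperparameters={"epochs": "10", "learning_rate": "0.1"},                  # hyperparameters required by the algorithm
)

# Training data location in S3.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```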
A capability of Amazon SageMaker that lets you create, manage, analyze, and compare your machine learning experiments. An experiment is a group of training runs, each with different inputs, parameters, and configurations. It features a visual interface to browse your active and past experiments, compare runs on key performance metrics, and identify the best-performing models.
Amazon SageMaker Experiments
Amazon SageMaker Automatic Model Tuning (AMT)
Also known as hyperparameter tuning, it finds the best version of a model by running many training jobs on your dataset. To do this, AMT uses the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that create the best-performing model, as measured by a metric that you choose.
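A hedged sketch of AMT with the SageMaker Python SDK, reusing an estimator like the one in the earlier training sketch; the metric name and hyperparameter ranges are examples, not prescriptions:

```python
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                           # the Estimator from the earlier sketch
    objective_metric_name="validation:accuracy",   # metric AMT optimizes (example)
    hyperparameter_ranges={
        "learning_rate": ContinuousParameter(0.01, 0.3),
        "num_layers": IntegerParameter(2, 10),
    },
    max_jobs=20,                                   # total training jobs to run
    max_parallel_jobs=4,                           # experiments run in parallel
)

tuner.fit({"train": "s3://my-bucket/training-data/"})
```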
What is the most cost efficient way to run your model?
Batch inference
What are the ways you can deploy your model?
Batch inference
Real-time inference
Self-managed
Hosted (SageMaker inference)
What options are available for Amazon SageMaker inference?
Batch transform (offline inference, large datasets)
Asynchronous (long processing times, large payloads)
Serverless
(intermittent traffic, periods of no traffic)
Real-time
(live predictions, sustained traffic, low latency, consistent performance)
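Hedged sketches of two of these hosting options with the SageMaker Python SDK, assuming an estimator that has already been trained (as in the earlier sketch):

```python
# Real-time inference: a persistent, low-latency endpoint.
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
)
print(predictor.predict(b"...sample payload..."))

# Batch transform: offline inference over a large dataset in S3.
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform("s3://my-bucket/batch-input/", content_type="text/csv")
```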
What service can you use to monitor your model and be notified of suspected drift in your deployed model?
Amazon SageMaker Model Monitor
What is MLOps?
Infrastructure as code (IaC)
Rapid experimentation
Version control
Active performance monitoring
Automatic model retraining and validation when there are data or code changes
What are the benefits of MLOps
Productivity
Repeatability
Reliability
Auditability
Data and model quality
What service allows you to manage and build model pipelines, defining them with the Python SDK or JSON, and automating data processing, training jobs, model creation, and model registration?
Amazon SageMaker Model Building Pipelines
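A hedged sketch of defining a minimal pipeline with the SageMaker Python SDK; it assumes an estimator like the one in the earlier sketch, and the pipeline name and role ARN are placeholders:

```python
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep
from sagemaker.inputs import TrainingInput

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,  # an Estimator like the one in the earlier sketch
    inputs={"train": TrainingInput(s3_data="s3://my-bucket/training-data/")},
)

pipeline = Pipeline(name="my-model-pipeline", steps=[train_step])
pipeline.upsert(role_arn="arn:aws:iam::123456789012:role/MySageMakerRole")  # create or update the pipeline
pipeline.start()
```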
Name four repository options
CodeCommit
SageMaker Model Registry
SageMaker Feature Store
Third party
Name four options for orchestration
SageMaker Pipelines
Amazon Managed Workflows for Apache Airflow
AWS Step Functions
Third party
What is a confusion matrix?
A table with the actual values typically across the top and the predicted values down the left side, used to summarize the performance of a classification model when it is evaluated against test data.
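A small illustration of building a confusion matrix with scikit-learn (an assumed tool) for the fish / not-fish example used throughout these notes:

```python
from sklearn.metrics import confusion_matrix

actual    = ["fish", "fish", "not fish", "not fish", "fish", "not fish"]
predicted = ["fish", "not fish", "not fish", "fish", "fish", "not fish"]

# In scikit-learn's layout, rows are actual classes and columns are predicted classes.
print(confusion_matrix(actual, predicted, labels=["fish", "not fish"]))
```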
What is accuracy?
The percentage of predictions that were correct
What is precision?
Precision measures how well an algorithm predicts true positives out of all the positives that it identifies. The formula is the number of true positives divided by the number of true positives, plus the number of false positives.
What is Recall (TPR)?
If we want to minimize the false negatives, then we can use a metric known as recall. For example, we want to make sure that we don’t miss if someone has a disease and we say they don’t. The formula is the number of true positives divided by the number of true positives plus the number of false negatives.
Can you optimize a model for both precision and recall?
Not directly; there is a trade-off between them, but you can use the F1 score to balance both.
What is F1?
Combines recall and precision into one figure, allowing you to optimize on both of these
What is the False Positive Rate (FPR)?
The false positives divided by the sum of the false positives and true negatives. In our example, this metric shows us how the model is handling the images that are not fish. It is a measure of how many of the predictions were of fish out of the images that were not fish.
What is the True Negative Rate
Closely related to the false positive rate is the true negative rate, which is the ratio of the true negatives to the sum of the false positives and true negatives. It is a measure of how many of the predictions were of not fish out of the images that were not fish.
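A minimal sketch computing all of these metrics directly from hypothetical confusion-matrix counts for the fish / not-fish example:

```python
# Hypothetical counts: true positives, false positives, false negatives, true negatives.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + tn + fp + fn)        # percentage of correct predictions
precision = tp / (tp + fp)                        # true positives among predicted positives
recall = tp / (tp + fn)                           # true positive rate (TPR)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                              # false positive rate
tnr = tn / (fp + tn)                              # true negative rate

print(accuracy, precision, recall, f1, fpr, tnr)
```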
What is a Receiver Operating Characteristic (ROC) curve?