ML Fundamentals Flashcards
Allows people to store objects (files) in “buckets” (directories)
Amazon S3
What is this full path called: my_bucket/my_folder1/another_folder/my_file.txt
S3 Object Key
- Pattern for speeding up range queries (ex: AWS Athena)
- By Date: s3://bucket/my-dataset/year/month/day/hour/data_00.csv
- By Product: s3://bucket/my-data-set/product-id/data_32.csv
Amazon S3 Data Partitioning
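For illustration, a minimal Python sketch of building such date-partitioned keys; the dataset name is just a placeholder:

from datetime import datetime, timezone

def partitioned_key(dataset: str, ts: datetime, part: int) -> str:
    # Builds a key like my-dataset/2024/01/31/13/data_00.csv so range
    # queries (e.g., in Athena) only scan the matching prefixes.
    return (f"{dataset}/{ts.year:04d}/{ts.month:02d}/{ts.day:02d}/"
            f"{ts.hour:02d}/data_{part:02d}.csv")

print(partitioned_key("my-dataset", datetime.now(timezone.utc), 0))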
Durability or availability:
* If you store 10,000,000 objects with Amazon S3, you can on average
expect to incur a loss of a single object once every 10,000 years
* Same for all storage classes
Durability
Durability or availability:
* Measures how readily available a service is
* Varies depending on storage class
Availability
What S3 storage class is described below:
* 99.99% Availability
* Used for frequently accessed data
* Low latency and high throughput
* Sustain 2 concurrent facility failures
* Use Cases: Big Data analytics, mobile & gaming applications,
content distribution…
S3 Standard – General Purpose
What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access when needed
* Lower cost than S3 Standard
* 99.9% Availability
* Use cases: Disaster Recovery, backups
- Amazon S3 Standard-Infrequent Access (S3 Standard-IA)
What S3 Storage class:
* For data that is less frequently accessed, but requires rapid access when needed
* Lower cost than S3 Standard
* High durability (99.999999999%) in a single AZ; data lost when AZ is destroyed
* 99.5% Availability
* Use Cases: Storing secondary backup copies of on-premise data, or data you
can recreate
- Amazon S3 One Zone-Infrequent Access (S3 One Zone-IA)
What S3 Storage class:
* Small monthly monitoring and auto-tiering fee
* Moves objects automatically between Access Tiers based on usage
* There are no retrieval charges in S3 Intelligent-Tiering
S3 Intelligent-Tiering
Describe the S3 Intelligent-Tiering access tiers below:
*__________: default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* ______: objects not accessed for 90 days
* _________: configurable from 90 days to 700+ days
* ________: config. from 180 days to 700+ days
Frequent Access tier (automatic): default tier
* Infrequent Access tier (automatic): objects not accessed for 30 days
* Archive Instant Access tier (automatic): objects not accessed for 90 days
* Archive Access tier (optional): configurable from 90 days to 700+ days
* Deep Archive Access tier (optional): config. from 180 days to 700+ days
- Help you decide when to transition objects to the right storage class
- Recommendations for Standard and Standard-IA
- Does NOT work for One-Zone IA or Glacier
- Report is updated daily
- 24 to 48 hours to start seeing data analysis
- Good first step to put together Lifecycle Rules (or improve them)!
Amazon S3 Analytics
Bucket-wide rules from the S3 console; allows cross-account access
S3 Bucket Policies
_____ is a managed alternative to Apache Kafka
* Great for application logs, metrics, IoT, clickstreams
* Great for “real-time” big data
* Great for streaming processing frameworks (Spark, NiFi, etc…)
* Data is automatically replicated synchronously to 3 AZs
Amazon Kinesis
__________ low latency streaming ingest at scale
Kinesis Streams
________ perform real-time analytics on streams using SQL
Kinesis Analytics
_________ load streams into S3, Redshift, ElasticSearch & Splunk
Kinesis Firehose
______ meant for streaming video in real-time
Kinesis Video Streams
Kinesis Streams are divided into ordered ______
Shards
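For illustration, a minimal boto3 sketch of writing one record; the stream name is a placeholder:

import boto3, json

kinesis = boto3.client("kinesis")

# The PartitionKey is hashed to pick the shard; same key => same shard => ordering preserved.
kinesis.put_record(
    StreamName="my-stream",
    Data=json.dumps({"event": "click", "user": "u123"}).encode(),
    PartitionKey="u123",
)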
What are the two capacity modes for Kinesis Data streams?
Provisioned and On-Demand modes
What Kinesis data stream capacity mode is below:
*You choose the number of shards provisioned, scale manually or using API
* Each shard gets 1MB/s in (or 1000 records per second)
* Each shard gets 2MB/s out (classic or enhanced fan-out consumer)
* You pay per shard provisioned per hour
Provisioned
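A worked example of that provisioned-mode arithmetic in Python (the target throughput numbers are hypothetical):

import math

target_mb_per_s = 12          # hypothetical ingest rate, MB/s
target_records_per_s = 9000   # hypothetical records/s

# Each provisioned shard accepts 1 MB/s or 1,000 records/s, whichever limit is hit first.
shards_needed = max(math.ceil(target_mb_per_s / 1),
                    math.ceil(target_records_per_s / 1000))
print(shards_needed)  # 12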
What Kinesis data stream capacity mode is below:
* No need to provision or manage the capacity
* Default capacity provisioned (4 MB/s in or 4000 records per second)
* Scales automatically based on observed throughput peak during the last 30
days
* Pay per stream per hour & data in/out per GB
On-demand mode
What Kinesis service is this:
*Fully Managed Service, no administration
* Near Real Time (60 seconds latency minimum for non-full batches)
* Data Ingestion into Redshift / Amazon S3 / ElasticSearch / Splunk
* Automatic scaling
* Supports many data formats
* Data Conversions from CSV / JSON to Parquet / ORC (only for S3)
* Data Transformation through AWS Lambda (ex: CSV => JSON)
* Supports compression when target is Amazon S3 (GZIP, ZIP, and SNAPPY)
Kinesis Data Firehose
What's the difference between Kinesis Data Streams and Firehose?
*Streams
* Going to write custom code (producer / consumer)
* Real time (~200 ms latency for classic, ~70 ms latency for enhanced fan-out)
* Automatic scaling with On-demand Mode
* Data Storage for 1 to 365 days, replay capability, multi consumers
*Firehose
* Fully managed, send to S3, Splunk, Redshift, ElasticSearch
* Serverless data transformations with Lambda
* Near real time (lowest buffer time is 1 minute)
* Automated Scaling
* No data storage
What Kinesis tool is this:
Use cases
* Streaming ETL: select columns, make simple transformations, on streaming
data
* Continuous metric generation: live leaderboard for a mobile game
* Responsive analytics: look for certain criteria and build alerting (filtering)
Features
* Pay only for resources consumed (but it’s not cheap)
* Serverless; scales automatically
* Use IAM permissions to access streaming source and destination(s)
* SQL or Flink to write the computation
* Schema discovery
* Lambda can be used for pre-processing
Kinesis data analytics
For Kinesis Analytics, you Pay only for ______ (but it’s not cheap)
resources consumed
Is Amazon Kinesis serverless?
Yes
What Amazon data product has the below characteristics:
- Producers:
  - security camera, body-worn camera, AWS DeepLens, smartphone camera, audio feeds, images, RADAR data, RTSP camera
  - One producer per video stream
- Video playback capability
- Consumers:
  - build your own (MXNet, TensorFlow)
  - AWS SageMaker
  - Amazon Rekognition Video
- Keep data for 1 hour to 10 years
Kinesis Video Streams
__________ create real-time machine learning applications
Kinesis Data Streams
_____ ingest massive data near-real time
Kinesis Data Firehose
___________ real-time ETL / ML algorithms on
streams
Kinesis Data Analytics
___________ real-time video stream to create ML
applications
Kinesis Video Stream
- Metadata repository for all your tables
- Automated schema inference
- Schemas are versioned
- Integrates with Athena or Redshift Spectrum (schema & data discovery)
Glue Data Catalog
____ go through your data to infer schemas and partitions
* Works for JSON, Parquet, CSV, relational stores
Glue crawlers
Transform data, Clean Data, Enrich Data (before doing analysis)
* Generate ETL code in Python or Scala, you can modify the code
* Can provide your own Spark or PySpark scripts
* Target can be S3, JDBC (RDS, Redshift), or in Glue Data Catalog
* Fully managed, cost effective, pay only for the resources consumed
* Jobs are run on a serverless Spark platform
Glue ETL
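For illustration, a rough PySpark sketch of what a Glue ETL job script looks like; the database, table, and bucket names are placeholders and this only runs inside a Glue job environment:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_ctx = GlueContext(SparkContext.getOrCreate())
job = Job(glue_ctx)
job.init(args["JOB_NAME"], args)

# Read a table registered in the Glue Data Catalog
dyf = glue_ctx.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")

# Write it back to S3 as Parquet (a columnar target)
glue_ctx.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/output/"},
    format="parquet",
)
job.commit()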
What type of data store is this:
Data Warehousing, SQL
analytics (OLAP - Online
analytical processing)
Redshift
What type of data store is this:
Relational Store, SQL (OLTP -
Online Transaction Processing)
* Must provision servers in
advance
- RDS, Aurora
What type of data store is this:
NoSQL data store, serverless,
provision read/write capacity
* Useful to store a machine
learning model served by your
application
- DynamoDB
What type of data store is this:
Object storage
* Serverless, infinite storage
* Integration with most AWS
Services
S3
What type of data store is this:
- Indexing of data
- Search amongst data points
- Clickstream Analytics
OpenSearch (previously
ElasticSearch)
What type of data store is this:
- Caching mechanism
- Not really used for Machine
Learning
- ElastiCache
What AWS data service has the below features:
Destinations include S3, RDS,
DynamoDB, Redshift and EMR
* Manages task dependencies
* Retries and notifies on failures
* Data sources may be on-premises
* Highly available
AWS Data Pipeline
What are the differences between AWS Data Pipeline and AWS Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on the
ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift Spectrum
* Data Pipeline:
* Orchestration service
* More control over the environment, compute resources that run code, & code
* Allows access to EC2 or EMR instances (creates resources in your own
account)
What AWS data service is below:
- Run batch jobs as Docker images
- Dynamic provisioning of the instances (EC2 & Spot Instances)
- Optimal quantity and type based on volume and requirements
- No need to manage clusters, fully serverless
- You just pay for the underlying EC2 instances
AWS Batch
What is the difference between AWS Batch and Glue?
- Glue:
  - Glue ETL - Run Apache Spark code, Scala or Python based, focus on the ETL
  - Glue ETL - Do not worry about configuring or managing the resources
  - Data Catalog to make the data available to Athena or Redshift Spectrum
- Batch:
  - For any computing job regardless of the job (must provide Docker image)
  - Resources are created in your account, managed by Batch
  - For any non-ETL related work, Batch is probably better
What AWS data service has the below features:
- Quickly and securely migrate databases to AWS, resilient, self-healing
- The source database remains available during the migration
- Supports:
  - Homogeneous migrations: ex Oracle to Oracle
  - Heterogeneous migrations: ex Microsoft SQL Server to Aurora
- Continuous Data Replication using CDC
- You must create an EC2 instance to perform the replication tasks
AWS Database Migration Service - DMS
What is the difference between AWS DMS and Glue?
Glue:
* Glue ETL - Run Apache Spark code, Scala or Python based, focus on
the ETL
* Glue ETL - Do not worry about configuring or managing the resources
* Data Catalog to make the data available to Athena or Redshift
Spectrum
* AWS DMS:
* Continuous Data Replication
* No data transformation
* Once the data is in AWS, you can use Glue to transform it
What AWS Data service has the below features:
For data migrations from on-premises to AWS storage services
* A DataSync Agent is deployed as a VM and connects to your
internal storage
* NFS, SMB, HDFS
* Encryption and data validation
AWS DataSync
- An Internet of Things (IoT) thing
- Standard messaging protocol
- Think of it as how lots of sensor data might get transferred to your machine learning model
- The AWS IoT Device SDK can connect via ____
MQTT
What are the three major types of data?
- Numerical
- Categorical
- Ordinal
______ Represents some sort of quantitative
measurement
* Heights of people, page load times, stock
prices, etc.
Numerical
_______ is integer-based; often counts of some event.
* How many purchases did a customer make in a
year?
* How many times did I flip “heads”?
Discrete data
__________
* Has an infinite number of possible values
* How much time did it take for a user to check
out?
* How much rain fell on a given day?
- Continuous Data
___________ is Qualitative data that has no
inherent mathematical meaning
* Gender, Yes/no (binary data),
Race, State of Residence, Product
Category, Political Party, etc.
Categorical data
A mixture of numerical and
categorical
* Categorical data that has
mathematical meaning
* Example: movie ratings on a 1-5
scale.
* Ratings must be 1, 2, 3, 4, or 5
* But these values have mathematical
meaning; 1 means it’s a worse movie
than a 2.
Ordinal data
What AWS service has the below characteristics:
- Interactive query service for S3 (SQL)
- No need to load data, it stays in S3
- Presto under the hood
- Serverless!
- Supports many data formats
- CSV (human readable)
- JSON (human readable)
- ORC (columnar, splittable)
- Parquet (columnar, splittable)
- Avro (splittable)
- Unstructured, semi-structured, or structured
Amazon Athena
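For illustration, a minimal boto3 sketch of running an Athena query over data that stays in S3; the database and output bucket are placeholders:

import boto3

athena = boto3.client("athena")

# Start a SQL query; results are written to the given S3 location.
resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-query-results-bucket/"},
)
print(resp["QueryExecutionId"])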
What AWS service is used for the scenarios below?
- Ad-hoc queries of web logs
- Querying staging data before loading to Redshift
- Analyze CloudTrail / CloudFront / VPC / ELB etc. logs in S3
- Integration with Jupyter, Zeppelin, RStudio notebooks
- Integration with QuickSight
- Integration via ODBC / JDBC with other visualization tools
Amazon Athena
What AWS service has the below cost model?
Pay-as-you-go
* $5 per TB scanned
* Successful or cancelled queries
count, failed queries do not.
* No charge for DDL
(CREATE/ALTER/DROP etc.)
* Save LOTS of money by using
columnar formats
* ORC, Parquet
* Save 30-90%, and get better
performance
Athena
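Worked arithmetic for that pricing model (the scan sizes are made up):

price_per_tb = 5.00        # USD per TB scanned
raw_scan_tb = 2.0          # hypothetical scan over row-based CSV/JSON

# Columnar formats (ORC/Parquet) let Athena read only the needed columns,
# often cutting scanned bytes by 30-90%; assume a 70% reduction here.
columnar_scan_tb = raw_scan_tb * (1 - 0.70)

print(raw_scan_tb * price_per_tb)       # 10.0 USD
print(columnar_scan_tb * price_per_tb)  # 3.0 USD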
What AWS Service has the below characteristics:
- Fast, easy, cloud-powered business analytics service
- Allows all employees in an organization to:
  - Build visualizations
  - Perform ad-hoc analysis
  - Quickly get business insights from data
  - Anytime, on any device (browsers, mobile)
- Serverless
QuickSight
What is the in-memory calculation engine used by QuickSight?
SPICE
What QuickSight feature is described below:
Machine learning-powered
* Answers business questions with Natural
Language Processing
* “What are the top-selling items in Florida?”
* Offered as an add-on for given regions
* Personal training on how to use it is
required
* Must set up topics associated with
datasets
* Datasets and their fields must be NLP-friendly
* How to handle dates must be defined
QuickSight Q
What QuickSight feature is described below:
- Reports designed to be printed
- May span many pages
- Can be based on existing QuickSight dashboards
- New in Nov 2022
Paginated Reports
What AWS Service is this:
- Managed Hadoop framework on EC2 instances
- Includes Spark, HBase, Presto, Flink, Hive & more
- EMR Notebooks
- Several integration points with AWS
Amazon EMR (Elastic Map Reduce)
What is this called:
Applying your knowledge of the data – and the model you’re
using - to create better features to train your model with.
* Which features should I use?
* Do I need to transform these features in some way?
* How do I handle missing data?
* Should I create new features from the existing ones?
Feature engineering
What is the Curse of Dimensionality?
Too many features can be a problem –
leads to sparse data
* Every feature is a new dimension
* Much of feature engineering is selecting
the features most relevant to the
problem at hand
* This often is where domain knowledge
comes into play
What AI data cleansing concept is below:
Replace missing values with the mean value
from the rest of the column (columns, not rows!
A column represents a single feature; it only
makes sense to take the mean from other
samples of the same feature.)
* Fast & easy, won’t affect mean or sample size
of overall data set
* Median may be a better choice than mean
when outliers are present
Mean replacement
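A minimal pandas sketch of mean (and median) imputation on one feature column; the data is made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, 35, np.nan, 41, np.nan, 29]})

# Column-level imputation: fill missing ages with the mean of the observed ages.
df["age_mean"] = df["age"].fillna(df["age"].mean())

# Median is often the safer choice when the column has outliers.
df["age_median"] = df["age"].fillna(df["age"].median())
print(df)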
What are the cons of mean replacement?
Only works on column level, misses correlations
between features
* Can’t use on categorical features (imputing with
most frequent value can work in this case, though)
* Not very accurate
What solution to missing data is this:
If not many rows contain missing data…
* …and dropping those rows doesn’t bias your
data…
* …and you don’t have a lot of time…
* …maybe it’s a reasonable thing to do.
* But, it’s never going to be the right
answer for the “best” approach.
Dropping data
What are the three ways to solve missing data with machine learning techniques?
*KNN: Find K “nearest” (most similar) rows and average their values
* Assumes numerical data, not categorical
* There are ways to handle categorical data (Hamming distance), but
categorical data is probably better served by…
* Deep Learning
* Build a machine learning model to impute data for your machine learning
model!
* Works well for categorical data. Really well. But it’s complicated.
* Regression
* Find linear or non-linear relationships between the missing feature and other
features
* Most advanced technique: MICE (Multiple Imputation by Chained Equations)
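For illustration, a scikit-learn sketch of KNN-based and regression-style (MICE-like iterative) imputation on toy numerical data:

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [np.nan, 8.0]])

# KNN: average the missing feature over the K most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))

# Iterative imputation: model each missing feature from the other features.
print(IterativeImputer(random_state=0).fit_transform(X))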
What kind of data is this:
Large discrepancy between
“positive” and “negative”
cases
* i.e., fraud detection. Fraud is rare, and most rows will be not-fraud
* Don’t let the terminology
confuse you; “positive” doesn’t
mean “good”
* It means the thing you’re testing
for is what happened.
* If your machine learning model
is made to detect fraud, then
fraud is the positive case.
* Mainly a problem with neural
networks
unbalanced data
What technique for handling unbalanced data is described below:
Artificially generate new samples of the minority class using
nearest neighbors
* Run K-nearest-neighbors of each sample of the minority class
* Create a new sample from the KNN result (mean of the neighbors)
* Both generates new samples and undersamples majority class
* Generally better than just oversampling
SMOTE (Synthetic Minority Over-sampling TEchnique)
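A minimal sketch using the imbalanced-learn package (separate from scikit-learn) to apply SMOTE to a toy unbalanced dataset:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Toy dataset with roughly 5% positive ("fraud") cases.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print(Counter(y))

# SMOTE synthesizes new minority-class samples from nearest neighbors of existing ones.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y_res))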
If you have too many false positives, one
way to fix that is to simply increase that
_________
threshold
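A small scikit-learn sketch of raising that threshold; the model and data are toy:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression().fit(X, y)

probs = model.predict_proba(X)[:, 1]   # probability of the positive class

# Default cutoff is 0.5; a higher cutoff flags fewer positives,
# trading false positives for false negatives.
print((probs >= 0.5).sum(), (probs >= 0.8).sum())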
_____ is simply the average of the squared
differences from the mean
Variance
_____ is just the square root
of the variance.
Standard Deviation 𝜎
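A tiny worked example of both definitions in numpy:

import numpy as np

x = np.array([1, 4, 5, 4, 8])
mean = x.mean()                       # 4.4

variance = ((x - mean) ** 2).mean()   # average squared difference from the mean -> 5.04
std_dev = variance ** 0.5             # square root of the variance -> ~2.245

print(variance, std_dev)              # matches np.var(x) and np.std(x)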
Bucket observations together based
on ranges of values.
* Example: estimated ages of people
* Put all 20-somethings in one
classification, 30-somethings in another,
etc
Binning
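A minimal pandas sketch of binning ages into decades (the ages are made up):

import pandas as pd

ages = pd.Series([23, 27, 34, 38, 45, 52, 61])

# Bucket observations into ranges: [20,30), [30,40), ...
bins = [20, 30, 40, 50, 60, 70]
labels = ["20s", "30s", "40s", "50s", "60s"]
print(pd.cut(ages, bins=bins, labels=labels, right=False))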
Applying some function to a feature to make it
better suited for training
Transforming
Transforming data into some new
representation required by the
model
encoding
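For illustration, one common encoding is one-hot encoding of a categorical column (the data is made up):

import pandas as pd

df = pd.DataFrame({"state": ["FL", "CA", "FL", "NY"]})

# One binary column per category, since most models need numeric inputs.
print(pd.get_dummies(df, columns=["state"]))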
Some models prefer feature data to be
normally distributed around 0 (most
neural nets)
* Most models require feature data to at
least be scaled to comparable values
* Otherwise features with larger magnitudes
will have more weight than they should
* Example: modeling age and income as
features – incomes will be much higher
values than ages
Scaling/normalization
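A minimal scikit-learn sketch of scaling an age feature and an income feature to comparable values (numbers are made up):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = np.array([[25, 40000], [38, 72000], [52, 180000], [29, 55000]], dtype=float)

# StandardScaler: zero mean, unit variance per feature (what most neural nets prefer).
print(StandardScaler().fit_transform(X))

# MinMaxScaler: squashes each feature into [0, 1].
print(MinMaxScaler().fit_transform(X))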
Many algorithms benefit from
_____ their training data
* Otherwise they may learn from
residual signals in the training
data resulting from the order in
which they were collected
shuffling
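A short sketch of shuffling features and labels together so they stay aligned:

import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)
y = np.array([0, 0, 1, 1, 1])

# The same random permutation is applied to both arrays.
X_shuf, y_shuf = shuffle(X, y, random_state=42)
print(X_shuf, y_shuf)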
What is Ground Truth?
- Ground Truth manages humans who will label your data for training purposes
- Ground Truth creates its own model as images are labeled by people
- As this model learns, only images the model isn't sure about are sent to human labelers
Turnkey solution
* “Our team of AWS Experts” manages the workflow and team of labelers
* You fill out an intake form
* They contact you and discuss pricing
Ground Truth Plus
- AWS service for image recognition
- Automatically classify images
Rekognition
- AWS service for text analysis and topic modeling
- Automatically classify text by topics, sentiment
Comprehend
- Important data for search – figures out what terms are most relevant for a document
TF-IDF (stands for Term Frequency and Inverse Document Frequency)
______ just measures how often a word occurs in a document
- A word that occurs frequently is probably important to that document's meaning
Term Frequency
_____ is how often a word occurs in an entire
set of documents, i.e., all of Wikipedia or every web page
* This tells us about common words that just appear everywhere no
matter what the topic, like “a”, “the”, “and”, etc.
Document Frequency
Can you explain bi grams and tri grams?
An extension of TF-IDF is to not only compute relevancy for
individual words (terms) but also for bi-grams or, more
generally, n-grams.
* “I love certification exams”
* Unigrams: “I”, “love”, “certification”, “exams”
* Bi-grams: “I love”, “love certification”, “certification exams”
* Tri-grams: “I love certification”, “love certification exams”
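For illustration, a scikit-learn sketch of TF-IDF over unigrams and bi-grams (the sentences are toy):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I love certification exams", "I love machine learning"]

# ngram_range=(1, 2) emits unigrams and bi-grams; IDF down-weights terms found in every document.
vec = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vec.fit_transform(docs)
print(vec.get_feature_names_out())
print(tfidf.toarray().round(2))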
What are the three types of neural networks?
- Feedforward Neural Networks
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNNs)
What kind of activation function is this:
- It doesn’t really do anything
- Can’t do backpropagation
Linear
What kind of activation function is this:
- It’s on or off
- Can’t handle multiple classification – it’s binary after all
- Vertical slopes don’t work well with calculus!
Binary step function
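A tiny numpy sketch of the two activations described above:

import numpy as np

x = np.linspace(-2, 2, 5)

linear = x                             # identity: adds no non-linearity, gradient is constant
binary_step = (x >= 0).astype(float)   # hard on/off: gradient is zero almost everywhere

print(linear)
print(binary_step)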