Data Engineering - ML data repositories compared Flashcards
What are the three characteristics of storage most relevant to ML?
- Cost
- Availability - how long it takes for data to be ready for processing
- Usability - can the preferred ML + preprocessing tools access the storage + how quickly
What does availability mean in relation to storage?
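How long it takes for data to be ready for processing.
What does usability mean in relation to storage?
Whether the preferred ML and preprocessing tools can access the storage, and how quickly.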
Which 4 repositories can SageMaker accept data from?
- S3
- Amazon EFS
- Amazon FSx for Lustre
- EBS Volumes
Describe S3
S3 is an object data repository.
How are files stored in S3
Files are stored as single objects identified using a key
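A minimal sketch of the "objects identified by a key" idea, using a toy in-memory class. This is not the S3 API; `ToyObjectStore` and the keys are hypothetical and only illustrate that the namespace is flat and that "folders" are just shared key prefixes.

```python
# Toy in-memory model of an object store: a flat mapping from key to bytes.
# NOT the S3 API; all names here are hypothetical.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, body):
        # An object is written or replaced as a whole, never edited in place.
        self._objects[key] = body

    def get(self, key):
        return self._objects[key]

    def list_keys(self, prefix=""):
        # "Folders" are only a naming convention: keys sharing a prefix.
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ToyObjectStore()
store.put("training/2024/part-0001.csv", b"a,b\n1,2\n")
store.put("training/2024/part-0002.csv", b"a,b\n3,4\n")
store.put("raw/events.json", b"{}")
training_keys = store.list_keys("training/")
```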
What are the 4 advantages of S3?
Highly scalable, available, durable and low cost
Name the two steps of the S3 lifecycle
- Transition
- Expiration
What is the transition phase in the S3 lifecycle?
The process of moving datasets through storage classes with different characteristics, normally from highly available storage (S3 Standard) to cheaper storage (S3 Glacier) as the data gets older.
What is the expiration phase in the S3 lifecycle?
Data is deleted after a certain period. Important for regulatory requirements.
What is the order of S3 repositories during the transition phase?
- S3 - regular access, highly available
- S3 IA - Infrequent access, low value or easily recreated data
- Glacier + Glacier deep archive - long-term low-cost archiving
- Expire - delete data no longer needed or required by regulators.
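The transition order above can be sketched as a lifecycle rule. The dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; the prefix and the day thresholds are assumptions chosen for illustration, and the helper `build_lifecycle_rule` is hypothetical.

```python
def build_lifecycle_rule(prefix, ia_days=30, glacier_days=90,
                         deep_archive_days=180, expire_days=365):
    """Build one lifecycle rule: Standard -> IA -> Glacier -> Deep Archive -> expire."""
    days = [ia_days, glacier_days, deep_archive_days, expire_days]
    if days != sorted(set(days)):
        raise ValueError("each stage must start strictly later than the previous one")
    return {
        "ID": "age-out-" + prefix.strip("/"),
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Transition phase: move objects to cheaper storage classes as they age.
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
        # Expiration phase: delete objects once they are no longer needed.
        "Expiration": {"Days": expire_days},
    }


lifecycle_config = {"Rules": [build_lifecycle_rule("datasets/")]}
# With credentials and a real bucket, this could be applied with something like:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
# ("my-bucket" is a placeholder name.)
```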
Which S3 would you use for general purpose, regular access?
S3 Standard - for data that is regularly required and needs to be accessed instantly.
Which S3 would you use for unknown or changing access?
S3 Intelligent-Tiering - for data accessed in an unpredictable way.
What does S3 Intelligent-Tiering do?
It automatically moves data between instant-access and longer-term storage tiers depending on when the data was last accessed.
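As a toy illustration of access-based tiering (as opposed to the age-based lifecycle rules), the sketch below picks a tier from the number of days since an object was last accessed. The function `pick_tier`, the tier names, and the 30 and 90 day thresholds are assumptions for illustration, not a description of how the service is implemented.

```python
from datetime import date


def pick_tier(last_accessed, today, infrequent_after_days=30, archive_after_days=90):
    """Choose a tier from idle time; thresholds are illustrative assumptions."""
    idle_days = (today - last_accessed).days
    if idle_days >= archive_after_days:
        return "archive"
    if idle_days >= infrequent_after_days:
        return "infrequent"
    return "frequent"


today = date(2024, 6, 30)
tiers = {
    "hot.csv": pick_tier(date(2024, 6, 25), today),      # idle 5 days
    "cooling.csv": pick_tier(date(2024, 5, 1), today),   # idle 60 days
    "cold.csv": pick_tier(date(2024, 1, 1), today),      # idle 181 days
}
```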
Which S3 would you use for infrequent access?
S3 Standard-IA
Which S3 would you use for archiving data?
S3 Glacier + S3 Glacier Deep Archive - long-term low-cost archiving of data
Explain the usage of AWS Lake Formation
Used to rapidly set up a data lake with S3 as the data repository.
What type of data can AWS Lake Formation store?
Structured + unstructured data at scale.
What is AWS Lake Formation built on top of?
AWS Glue
What are the steps during the setup of Lake Formation?
- Find the input data sources
- Set up the S3 data lake
- Move the data to the S3 lake
- Crawl the data to determine its structure and build a data catalogue
- Perform ETL
- Set up security to protect the data
Describe the FSx for Lustre Storage
A high-performance combination of S3 and SSD storage. Data is presented as files to the ML models so processing can start immediately without having to wait for S3 to load.
Give the five features of FSx for Lustre
- high performance storage system
- low latency
- high throughput
- high IOPS
- multiple underlying storage types
Explain the Machine Learning use case of Amazon FSx for Lustre
For serving massive training data to SageMaker. The file store is concurrent, so multiple compute instances can work on the data at the same time. It integrates with SageMaker.
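A hedged sketch of how a training input channel might be described for S3 versus FSx for Lustre. The dictionaries follow the `InputDataConfig` channel shape used by the SageMaker `CreateTrainingJob` API as far as I am aware; the helper names, bucket, file system ID, and directory path are all hypothetical placeholders.

```python
def s3_channel(name, s3_uri):
    """Channel that reads training data from an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def fsx_channel(name, file_system_id, directory_path):
    """Channel that reads training data from an FSx for Lustre file system."""
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemType": "FSxLustre",
                "FileSystemAccessMode": "ro",  # read-only is enough for training input
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder identifiers, for illustration only.
channels = [
    s3_channel("validation", "s3://example-bucket/validation/"),
    fsx_channel("train", "fs-0123456789abcdef0", "/fsx/train"),
]
```

The FSx channel presents the data as files, so every training instance can read the same directory concurrently.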
Describe EBS Volumes
A virtual version of your computer's hard drive. Data is stored as files and fast access can be specified. The data can be backed up using snapshots and it's possible to set up RAID configurations.
What are instances created by SageMaker for SageMaker notebooks?
EC2 instances with EBS volumes
Describe EFS
The networked-drive counterpart to EBS: a managed NFS file system that can be mounted and accessed by multiple compute instances at the same time.
Name the different versions of EFS
Standard EFS and EFS IA (infrequent access)
Name the secondary data repositories in AWS?
RDS, DynamoDB, Redshift, Redshift Spectrum, Timestream, DocumentDB
Can a secondary data repository be directly ingested by SageMaker?
No, it has to be moved to another repository first, for example S3.
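A minimal sketch of the staging step, assuming the data is exported as CSV before being placed in S3. The standard-library `sqlite3` module stands in for the secondary repository, and the table, columns, and helper `export_query_to_csv` are hypothetical.

```python
import csv
import io
import sqlite3


def export_query_to_csv(conn, query):
    """Run a query and return its result set as CSV text with a header row."""
    cursor = conn.execute(query)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([col[0] for col in cursor.description])  # column names
    writer.writerows(cursor)
    return buffer.getvalue()


# sqlite3 is only a stand-in for a database such as RDS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
csv_text = export_query_to_csv(conn, "SELECT id, amount FROM sales ORDER BY id")

# The CSV could then be staged in S3, for example with:
#   boto3.client("s3").put_object(Bucket="my-bucket", Key="staging/sales.csv",
#                                 Body=csv_text.encode("utf-8"))
# ("my-bucket" and the key are placeholders.)
```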
Describe RDS
Amazon Relational Database Service - makes it easy to set up, operate and scale relational databases. AWS takes care of most of the admin and maintenance.
Which databases can RDS supply?
Open source (MySQL, PostgreSQL) and vendor-owned (Oracle, Microsoft SQL Server)
Name four use cases of RDS
- Data that is relational and structured
- Data Warehouse
- online transaction processing
- Running relational joins and complex updates
Describe DynamoDB
A NoSQL database where data is stored as key-value pairs. It treats each item as a list of attributes and values.
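To make "a list of attributes and values" concrete, here is a simplified sketch that converts a Python dictionary into the typed attribute-value format used by DynamoDB's low-level API (`S` for strings, `N` for numbers sent as strings, and so on). The function `to_attribute_value` is a hypothetical helper covering only a few types; boto3 ships its own `TypeSerializer` for real use.

```python
def to_attribute_value(value):
    """Wrap a Python value in a DynamoDB-style type descriptor (simplified)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are transmitted as strings
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


record = {"user_id": "u-42", "age": 30, "active": True, "tags": ["ml", "etl"]}
item = {name: to_attribute_value(value) for name, value in record.items()}
```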
Name the use cases of DynamoDB
- non-relational database
- Structured and less structured data
- Storing JSON objects
Describe Redshift
A fast, massively scalable data warehouse system. It stores structured data that can be accessed and manipulated with standard SQL.
List the 3 use cases of Redshift
- Data Warehouse
- Structured relational data
- complex analytical queries
Name and describe two useful commands for Redshift
UNLOAD - used to save a table to a set of files on S3
COPY - used to take data from an S3 bucket and place it in a Redshift table
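A rough sketch of what the two commands look like, built as SQL strings in Python. COPY loads files from S3 into a table, and UNLOAD writes the result of a query out to files in S3. The helper names, table, bucket, and IAM role ARN are hypothetical, and plain string formatting like this is for illustration only (it does not guard against SQL injection).

```python
def copy_from_s3(table, s3_uri, iam_role_arn):
    """Build a COPY statement that loads CSV files from S3 into a table."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV IGNOREHEADER 1;"
    )


def unload_to_s3(select_sql, s3_prefix, iam_role_arn):
    """Build an UNLOAD statement that writes a query result to files in S3."""
    # The query is passed as a quoted string, so inner single quotes are doubled.
    quoted = select_sql.replace("'", "''")
    return (
        f"UNLOAD ('{quoted}') TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' FORMAT AS CSV;"
    )


role = "arn:aws:iam::123456789012:role/ExampleRedshiftRole"  # placeholder ARN
copy_sql = copy_from_s3("sales", "s3://example-bucket/raw/sales/", role)
unload_sql = unload_to_s3(
    "SELECT * FROM sales WHERE region = 'EU'", "s3://example-bucket/export/sales_", role
)
```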
Describe Redshift Spectrum
Can be used for ad-hoc ETL. It uses the data catalogue to access raw data files in S3 with standard SQL queries, which can clean and transform the data structure.
Give the use cases of Redshift Spectrum
- Data Lake
- Semi-structured data
Describe Amazon Timestream
A serverless database for storing time series data, e.g. log data or IoT device readings. Data is stored and queried by time intervals. Accessing data is very fast.
What would you use Amazon Timestream for?
To identify trends, patterns, and anomalies in time series data
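To illustrate what "queried by time intervals" means, the sketch below groups timestamped readings into fixed-width bins and averages each bin, in plain Python. It does not use Timestream or its query language; `bin_average` and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_average(readings, interval):
    """Average (timestamp, value) readings within fixed-width time bins."""
    epoch = datetime(1970, 1, 1)
    totals = defaultdict(lambda: [0.0, 0])  # bin start -> [sum, count]
    for timestamp, value in readings:
        bin_index = (timestamp - epoch) // interval  # whole intervals since epoch
        bin_start = epoch + bin_index * interval
        totals[bin_start][0] += value
        totals[bin_start][1] += 1
    return {start: total / count for start, (total, count) in sorted(totals.items())}


readings = [
    (datetime(2024, 1, 1, 12, 0, 5), 10.0),
    (datetime(2024, 1, 1, 12, 0, 40), 20.0),
    (datetime(2024, 1, 1, 12, 1, 10), 30.0),
]
per_minute = bin_average(readings, timedelta(minutes=1))
```

A sudden jump between consecutive bins is the kind of anomaly such interval queries help surface.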
Describe DocumentDB
A repository optimised for storing and querying JSON documents. It is a MongoDB-compatible managed database hosted on AWS infrastructure, marketed by AWS as a way to migrate existing MongoDB workloads onto AWS.
State the use cases of DocumentDB
- Used to migrate MongoDB to AWS
- Store JSON documents
- Non-relational data and less structured data
Why is data in EBS and EFS so quickly available?
Because the data can be used directly without being moved
Which repositories support all types of data structures? (structured, semi-structured and unstructured)
S3, FSx for Lustre, EBS and EFS
Which repositories only support structured data?
RDS, Redshift and Timestream (schemaless)
Which repositories support only semi-structured data?
Redshift Spectrum
Which repositories support semi-structured + structured but not unstructured data?
DynamoDB + DocumentDB
Which repositories only support structured and unstructured data?
Lake Formation
What is the maximum amount of data an S3 bucket can hold?
unlimited
What data stores are suitable for structured data?
Any RDS database, e.g. Oracle, Microsoft SQL Server, MySQL and PostgreSQL
What types of data can DynamoDB support?
Data can be both structured and semi-structured
What does the Redshift UNLOAD command do?
Saves a table to a set of files on S3.