Data Engineering - ML data repositories compared Flashcards
What are the three characteristics of storage most relevant to ML?
- Cost
- Availability - how quickly data is ready for processing
- Usability - whether the preferred ML and preprocessing tools can access the storage, and how quickly
What does availability mean in relation to storage?
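How long it takes for data to be ready for processing.
What does usability mean in relation to storage?
Whether the preferred ML and preprocessing tools can access the storage, and how quickly.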
Which 4 repositories can SageMaker accept data from?
- S3
- Amazon EFS
- Amazon FSx for Lustre
- EBS Volumes
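As a rough illustration of how these sources are handed to a training job, the sketch below builds input channel dictionaries in the shape used by the `InputDataConfig` parameter of the SageMaker `CreateTrainingJob` API. The bucket name, file system ID, and directory path are made-up placeholders, and the helper function names are hypothetical. EBS is left out on the assumption that it is the storage volume attached to the instance rather than an input channel.

```python
def s3_channel(name, s3_uri):
    """Build one training input channel that reads from an S3 prefix."""
    return {
        "ChannelName": name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


def file_system_channel(name, file_system_id, directory_path, fs_type):
    """Build one read-only channel backed by EFS or FSx for Lustre."""
    if fs_type not in ("EFS", "FSxLustre"):
        raise ValueError("fs_type must be 'EFS' or 'FSxLustre'")
    return {
        "ChannelName": name,
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": file_system_id,
                "FileSystemAccessMode": "ro",
                "FileSystemType": fs_type,
                "DirectoryPath": directory_path,
            }
        },
    }


# Placeholder values, for illustration only.
channels = [
    s3_channel("train", "s3://example-ml-bucket/train/"),
    file_system_channel(
        "validation", "fs-0123456789abcdef0", "/fsx/validation", "FSxLustre"
    ),
]
```

The resulting `channels` list could then be passed as `InputDataConfig` when creating a training job with boto3.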
Describe S3
S3 is an object data repository.
How are files stored in S3?
Files are stored as single objects identified using a key
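To make the key idea concrete, here is a minimal sketch that splits an S3 URI into its bucket and object key, then uses that pair to fetch the object. The bucket and key shown are invented examples, and `split_s3_uri` and `download_object` are hypothetical helper names. The download helper relies on boto3's `download_file` and would need AWS credentials to actually run.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def download_object(uri, local_path):
    """Fetch one object by its key. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so split_s3_uri works without boto3

    bucket, key = split_s3_uri(uri)
    boto3.client("s3").download_file(bucket, key, local_path)


# The "folders" in the key are just part of one flat key string.
bucket, key = split_s3_uri("s3://example-ml-bucket/datasets/train/part-0001.csv")
```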
What are the 4 advantages of S3?
Highly scalable, available, durable and low cost
Name the two steps of the S3 lifecycle
- Transition
- Expiration
What is the transition phase in the S3 lifecycle?
The process of moving datasets through storage classes with different characteristics, normally from highly available storage (S3 Standard) to cheaper storage (S3 Glacier) as the data gets older.
What is the expiration phase in the S3 lifecycle?
Data is deleted after a certain period. Important for regulatory requirements.
What is the order of S3 repositories during the transition phase?
- S3 - regular access, highly available
- S3 Standard-IA - infrequent access, low-value or easily recreated data
- S3 Glacier + S3 Glacier Deep Archive - long-term, low-cost archiving
- Expire - delete data no longer needed or required by regulators.
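The order above can be sketched as a single lifecycle rule. The snippet builds a configuration dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The day thresholds (30, 90, 180, 365), the rule ID, and the prefix are illustrative assumptions, not recommendations, and the function names are hypothetical.

```python
def build_lifecycle_configuration(prefix):
    """Return a lifecycle configuration with transition and expiration steps."""
    return {
        "Rules": [
            {
                "ID": "age-out-training-data",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Transition: move objects to cheaper classes as they age.
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Expiration: delete objects once they are no longer needed.
                "Expiration": {"Days": 365},
            }
        ]
    }


def apply_lifecycle(bucket, configuration):
    """Attach the configuration to a bucket. Requires boto3 and credentials."""
    import boto3  # imported lazily so the builder works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=configuration
    )


config = build_lifecycle_configuration("datasets/")
```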
Which S3 would you use for general purpose, regular access?
S3 Standard - for data that is regularly required and needs to be accessed instantly.
Which S3 would you use for unknown or changing access?
S3 Intelligent-Tiering - for data accessed in an unpredictable way.
What does S3 Intelligent-Tiering do?
It automatically moves data between instant-access and longer-term storage tiers depending on when the data was last accessed.
Which S3 would you use for infrequent access?
S3 Standard-IA
Which S3 would you use for archiving data?
S3 Glacier + S3 Glacier Deep Archive - long-term, low-cost archiving of data
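The four cards above amount to a lookup from access pattern to storage class. A minimal sketch follows, using the storage class identifiers from the S3 API. The access pattern labels (`"regular"`, `"unknown"`, and so on) and the function names are invented for this example.

```python
# Access pattern labels are hypothetical; the values are S3 API identifiers.
STORAGE_CLASS_BY_ACCESS_PATTERN = {
    "regular": "STANDARD",
    "unknown": "INTELLIGENT_TIERING",
    "infrequent": "STANDARD_IA",
    "archive": "GLACIER",
    "deep_archive": "DEEP_ARCHIVE",
}


def storage_class_for(access_pattern):
    """Map an access pattern label to an S3 storage class identifier."""
    try:
        return STORAGE_CLASS_BY_ACCESS_PATTERN[access_pattern]
    except KeyError:
        raise ValueError(f"unknown access pattern: {access_pattern!r}") from None


def upload_with_class(local_path, bucket, key, access_pattern):
    """Upload a file directly into the chosen class. Requires boto3."""
    import boto3  # imported lazily so the lookup works without boto3

    boto3.client("s3").upload_file(
        local_path,
        bucket,
        key,
        ExtraArgs={"StorageClass": storage_class_for(access_pattern)},
    )
```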
Explain the usage of AWS Lake Formation
Used to rapidly set up a data lake with S3 as the data repository.
What type of data can AWS Lake Formation store?
Structured and unstructured data at scale.
What is AWS Lake Formation built on top of?
AWS Glue
What are the steps during the setup of Lake Formation?
- Find the input data sources
- Set up the S3 data lake
- Move the data to the S3 lake
- Crawl the data to determine its structure and build a data catalogue
- Perform ETL
- Set up security to protect the data
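A few of these steps can be sketched with boto3: registering the S3 location, crawling it into a Glue catalogue database, and granting a permission. This is only an outline under stated assumptions. All ARNs, names, and paths are caller-supplied placeholders, the helper function names are hypothetical, and moving the data and the ETL jobs are not shown.

```python
def register_request(bucket_arn):
    """Arguments for registering an S3 location with Lake Formation."""
    return {"ResourceArn": bucket_arn, "UseServiceLinkedRole": True}


def crawler_request(name, role_arn, database, s3_path):
    """Arguments for a Glue crawler that catalogues data under s3_path."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def grant_request(principal_arn, database):
    """Arguments granting one principal DESCRIBE on the catalogue database."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"Database": {"Name": database}},
        "Permissions": ["DESCRIBE"],
    }


def set_up_lake(bucket_arn, s3_path, database, crawler_name, role_arn, principal_arn):
    """Run the register, crawl, and secure steps. Requires boto3 and credentials."""
    import boto3  # imported lazily so the request builders work without boto3

    glue = boto3.client("glue")
    lake = boto3.client("lakeformation")

    lake.register_resource(**register_request(bucket_arn))
    glue.create_database(DatabaseInput={"Name": database})
    glue.create_crawler(**crawler_request(crawler_name, role_arn, database, s3_path))
    glue.start_crawler(Name=crawler_name)
    lake.grant_permissions(**grant_request(principal_arn, database))
```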
Describe the FSx for Lustre Storage
A high-performance file system that combines S3 with SSD storage. Data is presented to the ML models as files, so processing can start immediately without waiting for the data to load from S3.