AWS ML Eng Assoc Data Eng and Storage Flashcards
AWS S3
Amazon S3 is an object storage service that offers industry-leading scalability, durability, availability, security, and performance. Customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.
S3 Buckets
S3 stores data as objects within buckets. You can store any type of object in a bucket, such as text files, photo files, video files, etc. Buckets are created in a specific AWS Region and can be accessed from anywhere.
S3 Object Key
The name of the object is known as the key. The combination of bucket name and key uniquely identifies the object.
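A minimal boto3 sketch of this idea, with made-up bucket and key names: the bucket name plus the object key is what addresses the object.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical names for illustration.
    bucket = "my-ml-training-data"     # bucket names are globally unique
    key = "datasets/2024/train.csv"    # the object key; bucket + key identify the object

    # Upload a local file as an object, then read it back by bucket + key.
    s3.upload_file("train.csv", bucket, key)
    obj = s3.get_object(Bucket=bucket, Key=key)
    print(obj["ContentLength"])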
EBS Volumes
EBS (Elastic Block Store) volumes are network drives that can be attached to and detached from EC2 instances. They allow persistent data storage, even after the instance is terminated.
EBS Volumes Use Case
EBS volumes allow you to persist data, even after the instance is terminated. This is helpful when you need to recreate an instance and mount the same EBS volume from before to get your data back.
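A hedged boto3 sketch of that recovery pattern (the volume and instance IDs are placeholders; a volume can only attach to an instance in the same Availability Zone):

    import boto3

    ec2 = boto3.client("ec2")

    volume_id = "vol-0123456789abcdef0"    # placeholder EBS volume ID
    instance_id = "i-0123456789abcdef0"    # placeholder replacement instance ID

    # Detach the volume from the old instance, then attach it to the new one
    # so the data on it becomes available again.
    ec2.detach_volume(VolumeId=volume_id)
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device="/dev/sdf")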
EFS (Elastic File System)
EFS is a managed NFS (Network File System) that can be mounted on many EC2 instances across multiple Availability Zones (AZs). It's highly available, scalable, and expensive (about 3x the cost of GP2 EBS volumes). Use cases include content management, web serving, data sharing, and WordPress.
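A rough boto3 sketch, assuming placeholder subnet and security group IDs: create the file system, then add a mount target in each AZ/subnet so instances in those AZs can mount it over NFS (the actual mount happens on the instance, e.g. with the NFS client or amazon-efs-utils).

    import boto3

    efs = boto3.client("efs")

    # Create the shared file system.
    fs = efs.create_file_system(CreationToken="demo-efs",
                                PerformanceMode="generalPurpose")

    # One mount target per subnet/AZ that needs access (IDs are placeholders).
    efs.create_mount_target(
        FileSystemId=fs["FileSystemId"],
        SubnetId="subnet-0123456789abcdef0",
        SecurityGroups=["sg-0123456789abcdef0"],
    )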
Data Ingestion and Storage
All machine learning starts with potentially large amounts of data that need to be stored in a central repository in a scalable, secure manner. This section covers types of data, properties of data, storage strategies like data warehouses, data lakes, and data lakehouses, as well as pipelines for extracting, transforming, and loading data.
3 Types of Data
Structured (organized in a defined schema, found in relational databases); Unstructured (data without a predefined schema, like raw text files, videos, audio); Semi-structured (has some structure like tags or hierarchies but needs work to extract, like XML, JSON, log files).
3 V’s of Data Properties
Volume (amount/size of data), Velocity (speed at which data is generated/processed), Variety (different types/sources of data).
Data Warehouse
A centralized repository optimized for analysis, where data from different sources is stored in a structured format. Designed for complex queries and analytics. Data is cleaned, transformed, and loaded using an ETL process. Typically uses a star or snowflake schema.
Data Lake
Stores a vast amount of raw data in its native format (structured, semi-structured, unstructured). Data is loaded as-is without a predefined schema. Supports batch, real-time, and stream processing for data transformation and exploration.
Data Lakehouse
A hybrid architecture combining features of data lakes and data warehouses. Supports structured and unstructured data, schema-on-write and schema-on-read, detailed analytics, and machine learning. Built on cloud/distributed architectures.
Data Mesh
An organizational paradigm for decentralized data management. Individual teams own their data and offer it as data products to others in the organization. Promotes domain ownership, federated governance, and self-service data infrastructure.
ETL (Extract, Transform, Load)
A process in data integration that extracts data from sources, transforms it into a desired format, and loads it into a target data repository. Used for data warehouses; contrast with ELT for data lakes.
ELT (Extract, Load, Transform)
A process that extracts data from sources, loads it into a data repository in its raw format, and transforms it later as needed. Used for data lakes.
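A toy pandas sketch contrasting the order of steps in ETL versus ELT (the source file, columns, and output paths are made up):

    import pandas as pd

    # ETL: transform before loading into the warehouse (schema known upfront).
    raw = pd.read_csv("orders.csv")                          # Extract (hypothetical source)
    clean = raw.dropna(subset=["order_id"]).assign(
        total=lambda df: df["quantity"] * df["unit_price"]
    )                                                        # Transform
    clean.to_csv("warehouse/orders.csv", index=False)        # Load

    # ELT: load the raw data into the lake first, transform later as needed.
    raw.to_csv("datalake/raw/orders.csv", index=False)       # Load as-is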
Structured Data Use Case
Database tables, CSV files (if consistent columns), Excel spreadsheets (organized rows and columns).
Unstructured Data Use Case
Raw text files (books, websites, social media), videos, audio files, emails, Word docs (need to extract meaning first).
Semi-structured Data Use Case
XML and JSON files (can have varying schemas), structured email headers, log files (structure not always consistent).
Data Lake Use Case
Advanced analytics, machine learning, data discovery systems.
Data Warehouse Use Case
Business intelligence and analytics when you know the schema upfront and require fast, complex queries from structured data sources.
S3 Performance Mnemonic
Use multipart upload for large files (recommended above 100 MB, required above 5 GB); use byte-range fetches (GET with a Range header) to parallelize downloads or retrieve only part of an object; use S3 Transfer Acceleration to speed up long-distance transfers via edge locations.
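A boto3 sketch of all three techniques, assuming placeholder bucket, key, and file names:

    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    s3 = boto3.client("s3")

    # Multipart upload: kicks in automatically above the configured threshold.
    cfg = TransferConfig(multipart_threshold=100 * 1024 * 1024)   # ~100 MB
    s3.upload_file("big_model.tar.gz", "my-bucket", "models/big_model.tar.gz",
                   Config=cfg)

    # Byte-range fetch: download only the first 1 MB of the object.
    part = s3.get_object(Bucket="my-bucket", Key="models/big_model.tar.gz",
                         Range="bytes=0-1048575")

    # Transfer Acceleration: point the client at the accelerate endpoint
    # (the bucket itself must have acceleration enabled).
    s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))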
S3 Object Encryption
Server-side encryption: SSE-S3 (S3-managed keys, the default), SSE-KMS (keys managed in AWS KMS), SSE-C (customer-provided keys supplied with each request). Client-side encryption: you encrypt the data before uploading and manage the keys yourself.
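A small boto3 sketch of requesting SSE-S3 and SSE-KMS on upload (bucket, keys, and the KMS key alias are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # SSE-S3: S3-managed keys (AES-256).
    s3.put_object(Bucket="my-bucket", Key="data/a.csv", Body=b"...",
                  ServerSideEncryption="AES256")

    # SSE-KMS: encrypt with a KMS key you control (hypothetical key alias).
    s3.put_object(Bucket="my-bucket", Key="data/b.csv", Body=b"...",
                  ServerSideEncryption="aws:kms",
                  SSEKMSKeyId="alias/my-data-key")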
Kinesis Data Streams
Used to collect and process large streams of data records in real time. Producers send data to Kinesis Data Streams, which is then consumed by consumers. Data is retained for 1 to 365 days (24 hours by default) and can be replayed.
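A minimal boto3 sketch of the producer and consumer sides, assuming a hypothetical stream name and shard ID:

    import json
    import boto3

    kinesis = boto3.client("kinesis")
    stream = "clickstream"   # hypothetical stream name

    # Producer: write a record; the partition key determines the shard.
    kinesis.put_record(StreamName=stream,
                       Data=json.dumps({"user": "u1", "event": "click"}).encode(),
                       PartitionKey="u1")

    # Consumer: read from the start of a shard (replay is possible within
    # the retention window).
    it = kinesis.get_shard_iterator(StreamName=stream,
                                    ShardId="shardId-000000000000",
                                    ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    records = kinesis.get_records(ShardIterator=it)["Records"]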
Kinesis Data Streams Architecture Metaphor
Imagine a river with many streams flowing into it (the producers) and then many people trying to collect water from the river (the consumers). The river stores water for 1-365 days.