1 AWS ML Eng Assoc Data Eng and Storage Flashcards
AWS S3
Amazon S3 is an object storage service that offers industry-leading scalability; durability; availability; security; and performance. Customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases; such as data lakes; websites; mobile applications; backup and restore; archive; enterprise applications; IoT devices; and big data analytics.
S3 Buckets
S3 stores data as objects within buckets. You can store any type of object in a bucket; such as text files; photo files; video files; etc. Buckets are created in a specific AWS Region and can be accessed from anywhere.
S3 Object Key
The name of the object is known as the key. The combination of bucket name and key uniquely identifies the object.
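A minimal sketch of how a bucket name and key together identify an object; using the common s3://bucket/key URI form. The helper name split_s3_uri and the example bucket are hypothetical.

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Drop only the single slash that separates bucket from key.
    # Slashes inside the key are part of the object's name.
    return parsed.netloc, parsed.path[1:]

bucket, key = split_s3_uri("s3://example-bucket/raw/2024/events.json")
print(bucket)  # example-bucket
print(key)     # raw/2024/events.json
```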
EBS Volumes
EBS (Elastic Block Store) volumes are network drives that can be attached to and detached from EC2 instances. They allow persistent data storage; even after the instance is terminated.
EBS Volumes Use Case
EBS volumes allow you to persist data; even after the instance is terminated. This is helpful when you need to recreate an instance and mount the same EBS volume from before to get your data back.
EFS (Elastic File System)
EFS is a managed NFS (Network File System) that can be mounted on many EC2 instances across multiple Availability Zones (AZs). It’s highly available; scalable; and expensive (about 3x the cost of GP2 EBS volumes). Use cases include content management; web serving; data sharing; and WordPress.
Data Ingestion and Storage
All machine learning starts with potentially large amounts of data that need to be stored in a central repository in a scalable; secure manner. This section covers types of data; properties of data; storage strategies like data warehouses; data lakes; and data lakehouses; as well as pipelines for extracting; transforming; and loading data.
3 Types of Data
Structured (organized in a defined schema; found in relational databases); Unstructured (data without a predefined schema; like raw text files; videos; audios); Semi-structured (has some structure like tags or hierarchies; but needs work to extract; like XML; JSON; log files).
3 V’s of Data Properties
Volume (amount/size of data); Velocity (speed at which data is generated/processed); Variety (different types/sources of data).
Data Warehouse
A centralized repository optimized for analysis; where data from different sources is stored in a structured format. Designed for complex queries and analytics. Data is cleaned; transformed and loaded using an ETL process. Typically uses a star or snowflake schema.
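A small sketch of a star schema: one fact table joined to dimension tables for an analytic query. It uses an in-memory SQLite database purely for illustration; the table names and rows are made up and a real warehouse would use an engine such as Redshift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
-- The fact table holds measures plus foreign keys to each dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
INSERT INTO fact_sales VALUES (1, 10, 12.0), (1, 11, 8.0), (2, 11, 30.0);
""")

# Typical warehouse query: aggregate the facts; grouped by dimension attributes.
totals = conn.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
""").fetchall()
print(totals)  # [('books', 2023, 12.0), ('books', 2024, 8.0), ('games', 2024, 30.0)]
```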
Data Lake
Stores a vast amount of raw data in its native format (structured; semi-structured; unstructured). Data is loaded as-is without predefined schema. Supports batch; real-time and stream processing for data transformation and exploration.
Data Lakehouse
A hybrid architecture combining features of data lakes and data warehouses. Supports structured and unstructured data; schema-on-write and schema-on-read; detailed analytics and machine learning. Built on cloud/distributed architectures.
Data Mesh
An organizational paradigm for decentralized data management. Individual teams own their data and offer it as data products to others in the organization. Promotes domain ownership; federated governance; and self-service data infrastructure.
ETL (Extract; Transform; Load)
A data integration process that extracts data from sources; transforms it into a desired format; and loads it into a target data repository. Typically used for data warehouses. Contrast with ELT; which is typically used for data lakes.
ELT (Extract; Load; Transform)
A process that extracts data from sources; loads it into a data repository in its raw format; and transforms it later as needed. Used for data lakes.
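The difference between ETL and ELT is only where the transform step sits. A rough sketch; assuming two in-memory SQLite databases stand in for the warehouse and the lake; with made-up source rows and a hypothetical transform function:

```python
import sqlite3

source = [" Alice ,42", "bob,17", " Carol,n/a"]  # hypothetical raw rows

def transform(line):
    """Clean one raw line into a (name, age) tuple; age is None if unparseable."""
    name, age = line.split(",")
    age = age.strip()
    return name.strip().title(), int(age) if age.isdigit() else None

# ETL: transform inside the pipeline; load only cleaned rows.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE people (name TEXT, age INTEGER)")
warehouse.executemany("INSERT INTO people VALUES (?, ?)",
                      [transform(line) for line in source])
etl_rows = warehouse.execute(
    "SELECT name, age FROM people ORDER BY rowid").fetchall()

# ELT: load the raw lines as-is; transform later when the data is read.
lake = sqlite3.connect(":memory:")
lake.execute("CREATE TABLE raw_people (line TEXT)")
lake.executemany("INSERT INTO raw_people VALUES (?)",
                 [(line,) for line in source])
elt_rows = [transform(row[0]) for row in
            lake.execute("SELECT line FROM raw_people ORDER BY rowid")]

print(etl_rows)  # [('Alice', 42), ('Bob', 17), ('Carol', None)]
print(etl_rows == elt_rows)  # True
```

Both paths end with the same cleaned rows. The ELT path also keeps the raw lines; so a different transform can be applied later without re-extracting.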
Structured Data Use Case
Database tables; CSV files (if consistent columns); Excel spreadsheets (organized rows and columns).
Unstructured Data Use Case
Raw text files (books; websites; social media); videos; audio files; emails; Word documents (meaning has to be extracted before analysis).
Semi-structured Data Use Case
XML and JSON files (can have varying schemas); structured email headers; log files (structure not always consistent).
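Why semi-structured data "needs work to extract": a short sketch that flattens JSON records with varying fields into rows with a fixed set of columns. The records and the target columns are hypothetical.

```python
import json

raw_lines = [
    '{"user": "a", "event": "click", "meta": {"page": "/home"}}',
    '{"user": "b", "event": "purchase", "amount": 19.99}',
]

def to_row(line: str) -> dict:
    """Map one JSON record onto a fixed schema; missing fields become None."""
    record = json.loads(line)
    return {
        "user": record.get("user"),
        "event": record.get("event"),
        "page": record.get("meta", {}).get("page"),  # nested and optional
        "amount": record.get("amount"),              # only on purchases
    }

rows = [to_row(line) for line in raw_lines]
for row in rows:
    print(row)
```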
Data Lake Use Case
Advanced analytics; machine learning; data discovery systems.
Data Warehouse Use Case
Business intelligence; analytics when you know the schema upfront and require fast; complex queries from structured data sources.
S3 Performance Mnemonic
Use multipart upload for files larger than 100MB (it is required above 5GB); use byte-range fetches to parallelize downloads; use S3 Transfer Acceleration to speed up transfers over long distances.
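Multipart upload and byte-range fetches both rest on the same idea: cut the object into fixed-size chunks that can move in parallel. A sketch of the chunking arithmetic only; with an assumed 8 MiB part size and a hypothetical helper name. With boto3; each string could be passed as the Range parameter of get_object.

```python
MIB = 1024 * 1024

def byte_ranges(total_size: int, part_size: int = 8 * MIB) -> list:
    """Return HTTP Range header values that together cover total_size bytes."""
    ranges = []
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges

print(byte_ranges(20 * MIB))
# ['bytes=0-8388607', 'bytes=8388608-16777215', 'bytes=16777216-20971519']
```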
S3 Object Encryption
Server-side encryption: SSE-S3 (keys owned and managed by S3); SSE-KMS (keys managed in AWS KMS); SSE-C (customer-provided keys). Client-side encryption: you encrypt the data before uploading it and manage the keys yourself.
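A sketch of how the three server-side modes show up as request parameters. The helper sse_put_args is hypothetical and only builds a dict; the keys it returns are meant to match the boto3 put_object parameter names; so the result could be used as s3.put_object(Bucket=...; Key=...; Body=...; **args). No AWS call is made here.

```python
from typing import Optional

def sse_put_args(mode: str, kms_key_id: Optional[str] = None,
                 customer_key: Optional[bytes] = None) -> dict:
    """Build the encryption-related keyword arguments for an S3 upload."""
    if mode == "SSE-S3":
        return {"ServerSideEncryption": "AES256"}
    if mode == "SSE-KMS":
        args = {"ServerSideEncryption": "aws:kms"}
        if kms_key_id:  # otherwise the AWS managed key for S3 is used
            args["SSEKMSKeyId"] = kms_key_id
        return args
    if mode == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        # The same key must be supplied again on every read of the object.
        return {"SSECustomerAlgorithm": "AES256", "SSECustomerKey": customer_key}
    raise ValueError(f"unknown mode: {mode!r}")

print(sse_put_args("SSE-S3"))  # {'ServerSideEncryption': 'AES256'}
```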
Kinesis Data Streams
Used to collect and process large streams of data records in real time. Producers send records to a stream and consumers read them. Records are retained for 1 to 365 days (24 hours by default) and can be replayed.
Kinesis Data Streams Architecture Metaphor
Imagine a river with many streams flowing into it (the producers) and then many people trying to collect water from the river (the consumers). The river stores water for 1-365 days.
Kinesis Data Streams Shards
A Kinesis Data Stream is made up of shards - each shard gets 1MB/s write and 2MB/s read capacity. You provision the number of shards required for your stream.
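The shard arithmetic can be sketched as follows; using only the per-shard figures from this card (1MB/s write; 2MB/s read). The function name is hypothetical; and the per-shard records-per-second limit is ignored here.

```python
import math

WRITE_MB_PER_SHARD = 1.0
READ_MB_PER_SHARD = 2.0

def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that covers both the write and the read throughput."""
    for_writes = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    return max(1, for_writes, for_reads)

print(shards_needed(write_mb_per_s=4.5, read_mb_per_s=6.0))  # 5
```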
Kinesis Data Streams Producers
SDK; Kinesis Producer Library (KPL) with record aggregation/batching; Kinesis Agent for sending logs; 3rd party libs like Spark
Kinesis Data Streams Consumers
SDK; Kinesis Client Library (KCL) with shard discovery; Kinesis Connector Library; Kinesis Data Firehose; AWS Lambda; Spark
Kinesis Enhanced Fan-Out Consumers
A feature that gives each registered consumer its own 2MB/s per shard instead of sharing the 2MB/s across all consumers. Scales better with many consumers and lowers latency to roughly 70ms.
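The effect on read capacity; as simple arithmetic. A sketch that assumes shared consumers split the 2MB/s per shard evenly; the function name is hypothetical.

```python
READ_MB_PER_SHARD = 2.0

def read_mb_per_consumer(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput available to each consumer across the whole stream."""
    total = shards * READ_MB_PER_SHARD
    if enhanced_fan_out:
        return total          # every consumer gets the full per-shard rate
    return total / consumers  # shared mode: consumers split the capacity

print(read_mb_per_consumer(shards=4, consumers=4, enhanced_fan_out=False))  # 2.0
print(read_mb_per_consumer(shards=4, consumers=4, enhanced_fan_out=True))   # 8.0
```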
Kinesis Scaling
You can reshard (split or merge shards) to increase or decrease stream capacity. Resharding is not instantaneous; so plan capacity changes ahead of time.
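What a split or merge does to the key space; as a rough sketch. Kinesis maps each partition key to a 128-bit integer with MD5 and each shard owns a range of those integers. The helper names are hypothetical and reading the digest as a big-endian integer is an assumption of this sketch.

```python
import hashlib

MAX_HASH = 2**128 - 1  # shards partition the range 0..MAX_HASH

def hash_key(partition_key: str) -> int:
    """Map a partition key to a 128-bit integer via MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")

def split(shard):
    """Split one (start, end) hash range into two adjacent child ranges."""
    start, end = shard
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)

def merge(left, right):
    """Merge two adjacent hash ranges back into one."""
    if left[1] + 1 != right[0]:
        raise ValueError("only adjacent shards can be merged")
    return (left[0], right[1])

def owner(shards, partition_key: str) -> int:
    """Index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    return next(i for i, (start, end) in enumerate(shards) if start <= h <= end)

parent = (0, MAX_HASH)
children = list(split(parent))
# The same key always lands on the same shard for a given shard layout.
print(owner(children, "user-42") == owner(children, "user-42"))  # True
print(merge(*children) == parent)  # True
```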
Kinesis Security
IAM for producer/consumer permissions; HTTPS for encryption in-flight; KMS for encryption at rest; Manual client-side encryption; VPC endpoints
Kinesis Resharding Gotcha
If a consumer has not read all of the parent shard's data before resharding; it may read child shard data first; which leads to out-of-order records for a partition key. Make sure the consumer reads the parent shard to the end before moving on to its children.
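The "parents first" rule amounts to a topological order over the shard lineage. A sketch with a made-up lineage; assuming the consumer knows each shard's parents (the shard ids below are hypothetical):

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: shard id -> set of parent shard ids.
# shard-1 was split into shard-2 and shard-3; which were later merged into shard-4.
parents = {
    "shard-1": set(),
    "shard-2": {"shard-1"},
    "shard-3": {"shard-1"},
    "shard-4": {"shard-2", "shard-3"},
}

# static_order() yields a shard only after all of its parents have been yielded;
# so draining shards in this order never reads a child before its parent.
read_order = list(TopologicalSorter(parents).static_order())
print(read_order)
```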
Kinesis Data Firehose
Near real-time service to capture; transform and load streaming data into destinations like S3; Redshift; OpenSearch and Splunk. Fully managed; serverless and auto-scaling; with optional data transformation using Lambda. It does not store data; it batches records before delivery.
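"Batches data before delivery" is why Firehose is near real-time rather than real-time. A toy sketch of size-or-age buffering; the class and its thresholds are hypothetical and it only checks thresholds when a record arrives; whereas a real service would also flush on a timer.

```python
from typing import List, Optional

class DeliveryBuffer:
    """Toy buffer: flush when size or age crosses a threshold; whichever is first."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records: List[bytes] = []
        self.size = 0
        self.opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Add a record; return a batch to deliver if a threshold was hit."""
        if self.opened_at is None:
            self.opened_at = now  # the clock starts with the first buffered record
        self.records.append(record)
        self.size += len(record)
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            batch = self.records
            self.records, self.size, self.opened_at = [], 0, None
            return batch
        return None

buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
print(buf.add(b"abcd", now=0.0))  # None
print(buf.add(b"efgh", now=1.0))  # None
print(buf.add(b"ijkl", now=2.0))  # [b'abcd', b'efgh', b'ijkl'] (size threshold)
```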
Kinesis Analytics
For running SQL queries to analyze; filter and transform streaming data. Can integrate with S3; Lambda and other services.
MSK (Amazon Managed Streaming for Apache Kafka)
A managed Apache Kafka service on AWS. Offers more configuration options than Kinesis; for example message sizes can be configured up to 10MB (vs 1MB per record on Kinesis). You provision brokers and topics yourself as needed. More complex but also more flexible than Kinesis.
MSK Use Case vs Kinesis
Choose MSK over Kinesis if you need message sizes larger than 1MB; need more custom configuration; or need tighter control over scaling brokers and topics.