1 AWS ML Eng Assoc Data Eng and Storage Flashcards by Yitzchak Meirovich

AWS S3

Amazon S3 is an object storage service that offers industry-leading scalability; durability; availability; security; and performance. Customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases; such as data lakes; websites; mobile applications; backup and restore; archive; enterprise applications; IoT devices; and big data analytics.

S3 Buckets

S3 stores data as objects within buckets. You can store any type of object in a bucket; such as text files; photo files; video files; etc. Buckets are created in a specific AWS Region and can be accessed from anywhere.

S3 Object Key

The name of the object is known as the key. The combination of bucket name and key uniquely identifies the object (together with a version ID when versioning is enabled on the bucket).
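
As a minimal sketch of how a bucket name and key together address an object, the snippet below splits an `s3://` URI into its two parts. The helper `parse_s3_uri`, the `S3Location` type, and the bucket and key names are hypothetical, chosen only for illustration; the point is that everything after the bucket name, slashes included, is the key.

```python
from typing import NamedTuple


class S3Location(NamedTuple):
    bucket: str
    key: str


def parse_s3_uri(uri: str) -> S3Location:
    """Split an s3://bucket/key URI into its bucket name and object key."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    # Only the first slash separates bucket from key; later slashes belong to the key.
    bucket, sep, key = uri[len(prefix):].partition("/")
    if not bucket or not sep or not key:
        raise ValueError(f"URI must contain both a bucket and a key: {uri!r}")
    return S3Location(bucket, key)


# Hypothetical bucket and key, for illustration only.
loc = parse_s3_uri("s3://example-ml-bucket/raw/2024/train.csv")
print(loc.bucket)
print(loc.key)
```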

EBS Volumes

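EBS (Elastic Block Store) volumes are network drives that can be attached to and detached from EC2 instances. Data on a volume can persist after the instance is terminated (by default the root volume is deleted on termination; additional attached volumes are kept). Each volume is tied to a single Availability Zone.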

EBS Volumes Use Case

EBS volumes allow you to persist data; even after the instance is terminated. This is helpful when you need to recreate an instance: attach the same EBS volume to the new instance and mount it to get your data back.

EFS (Elastic File System)

EFS is a managed NFS (Network File System) that can be mounted on many EC2 instances across multiple Availability Zones (AZs). It’s highly available; scalable; and expensive (about 3x the cost of GP2 EBS volumes). Use cases include content management; web serving; data sharing; and WordPress.

Data Ingestion and Storage

All machine learning starts with potentially large amounts of data that need to be stored in a central repository in a scalable; secure manner. This section covers types of data; properties of data; storage strategies like data warehouses; data lakes; and data lakehouses; as well as pipelines for extracting; transforming; and loading data.

3 Types of Data

Structured (organized in a defined schema; found in relational databases); Unstructured (data without a predefined schema; like raw text files; videos; audios); Semi-structured (has some structure like tags or hierarchies; but needs work to extract; like XML; JSON; log files).
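
The three types can be contrasted with a small sketch using only the Python standard library. The sample data is made up for illustration: a CSV with fixed columns stands in for structured data, JSON records with varying fields for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: every row follows the same fixed columns.
csv_text = "user_id,age\n1,34\n2,27\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
ages = [int(r["age"]) for r in rows]

# Semi-structured: records are tagged, but fields can vary per record.
json_lines = ['{"user_id": 1, "tags": ["a", "b"]}', '{"user_id": 2}']
records = [json.loads(line) for line in json_lines]
tag_counts = [len(r.get("tags", [])) for r in records]

# Unstructured: free text has no fields; meaning must be extracted first.
review = "Great product, arrived late though."
word_count = len(review.split())

print(ages, tag_counts, word_count)
```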

3 V’s of Data Properties

Volume (amount/size of data); Velocity (speed at which data is generated/processed); Variety (different types/sources of data).

Data Warehouse

A centralized repository optimized for analysis; where data from different sources is stored in a structured format. Designed for complex queries and analytics. Data is cleaned; transformed and loaded using an ETL process. Typically uses a star or snowflake schema.

Data Lake

Stores a vast amount of raw data in its native format (structured; semi-structured; unstructured). Data is loaded as-is without predefined schema. Supports batch; real-time and stream processing for data transformation and exploration.

Data Lakehouse

A hybrid architecture combining features of data lakes and data warehouses. Supports structured and unstructured data; schema-on-write and schema-on-read; detailed analytics and machine learning. Built on cloud/distributed architectures.

Data Mesh

An organizational paradigm for decentralized data management. Individual teams own their data and offer it as data products to others in the organization. Promotes domain ownership; federated governance; and self-service data infrastructure.

ETL (Extract; Transform; Load)

A process in data integration that extracts data from sources; transforms it into a desired format; and loads it into a target data repository. Typically used for data warehouses. Contrast with ELT; which is typical for data lakes.

ELT (Extract; Load; Transform)

A process that extracts data from sources; loads it into a data repository in its raw format; and transforms it later as needed. Used for data lakes.
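
The difference in ordering between ETL and ELT can be sketched with in-memory lists standing in for the warehouse and the lake. The functions `clean`, `etl`, and `elt` and the sample records are hypothetical; a real pipeline would read from and write to actual stores.

```python
from typing import Callable

Record = dict


def clean(record: Record) -> Record:
    """Example transform: normalize the name and cast the amount to float."""
    return {"name": record["name"].strip().title(), "amount": float(record["amount"])}


def etl(source: list[Record], transform: Callable[[Record], Record]) -> list[Record]:
    """ETL: transform on the way in, so the target only holds curated rows."""
    return [transform(r) for r in source]


def elt(source: list[Record]) -> list[Record]:
    """ELT: load raw records untouched; transformation is deferred."""
    return list(source)


raw = [{"name": "  alice ", "amount": "10.5"}, {"name": "BOB", "amount": "3"}]
warehouse = etl(raw, clean)
lake = elt(raw)
# With ELT, the transform runs later, at read time, over the raw copy.
curated_on_read = [clean(r) for r in lake]
print(warehouse)
```

Both paths end with the same curated rows here; what differs is whether the repository stores the raw or the transformed form.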

Structured Data Use Case

Database tables; CSV files (if consistent columns); Excel spreadsheets (organized rows and columns).

Unstructured Data Use Case

Raw text files (books; websites; social media); videos; audio files; emails; Word documents. In each case the meaning has to be extracted before the data can be analyzed.

Semi-structured Data Use Case

XML and JSON files (can have varying schemas); structured email headers; log files (structure not always consistent).

Data Lake Use Case

Advanced analytics; machine learning; data discovery systems.

Data Warehouse Use Case

Business intelligence; analytics when you know the schema upfront and require fast; complex queries from structured data sources.

S3 Performance Mnemonic

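Use multipart upload for big files (recommended above 100 MB) so parts upload in parallel; use byte-range fetches to parallelize downloads or read only part of an object; use S3 Transfer Acceleration to speed up transfers over long distances.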
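
Multipart uploads and byte-range fetches both rest on the same arithmetic: cutting an object into contiguous byte ranges. The sketch below computes HTTP `Range` header values for a given object size; the helper name `byte_ranges` and the sizes are assumptions for illustration, and no request is actually sent.

```python
def byte_ranges(total_size: int, part_size: int) -> list[str]:
    """Return HTTP Range header values that cover total_size bytes in parts."""
    if total_size < 0 or part_size <= 0:
        raise ValueError("total_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1  # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


MB = 1024 * 1024
# A hypothetical 250 MB object fetched in 100 MB ranges: two full parts and a remainder.
parts = byte_ranges(250 * MB, 100 * MB)
print(len(parts))
print(parts[0])
print(parts[-1])
```

Each range could then be requested or uploaded by a separate worker, which is where the parallel speedup comes from.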

S3 Object Encryption

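Server-side encryption: SSE-S3 (keys managed by S3); SSE-KMS (keys held in AWS KMS, either AWS managed or customer managed); SSE-C (customer-provided keys sent with each request). Client-side encryption: you encrypt before upload and manage the keys and the encryption yourself.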
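
At the S3 REST API level, the server-side option is selected through request headers. The sketch below only builds the header dictionary for each mode and sends nothing; the function name `sse_headers`, its parameters, and the all-zero demo key are assumptions for illustration.

```python
import base64
import hashlib
from typing import Optional


def sse_headers(mode: str, kms_key_id: Optional[str] = None,
                customer_key: Optional[bytes] = None) -> dict[str, str]:
    """Build the S3 request headers that select a server-side encryption mode."""
    if mode == "SSE-S3":
        return {"x-amz-server-side-encryption": "AES256"}
    if mode == "SSE-KMS":
        headers = {"x-amz-server-side-encryption": "aws:kms"}
        if kms_key_id:  # omitted: S3 falls back to the AWS managed key
            headers["x-amz-server-side-encryption-aws-kms-key-id"] = kms_key_id
        return headers
    if mode == "SSE-C":
        if customer_key is None or len(customer_key) != 32:
            raise ValueError("SSE-C needs a 32-byte (256-bit) key")
        key_b64 = base64.b64encode(customer_key).decode()
        md5_b64 = base64.b64encode(hashlib.md5(customer_key).digest()).decode()
        return {
            "x-amz-server-side-encryption-customer-algorithm": "AES256",
            "x-amz-server-side-encryption-customer-key": key_b64,
            "x-amz-server-side-encryption-customer-key-MD5": md5_b64,
        }
    raise ValueError(f"unknown mode: {mode!r}")


print(sse_headers("SSE-S3"))
# Demo key of 32 zero bytes; a real key must be random and kept secret.
print(sorted(sse_headers("SSE-C", customer_key=bytes(32))))
```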

Kinesis Data Streams

Used to collect and process large streams of data records in real time. Producers send records to a stream; consumers then read them from it. Data is retained for 1-365 days (24 hours by default) and can be replayed.

Kinesis Data Streams Architecture Metaphor

Imagine a river with many streams flowing into it (the producers) and then many people trying to collect water from the river (the consumers). The river stores water for 1-365 days.

Kinesis Data Streams Shards

A Kinesis Data Stream is made up of shards. Each shard gets 1MB/s (or 1;000 records per second) of write capacity and 2MB/s of read capacity. In provisioned mode you choose the number of shards required for your stream.
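
The per-shard limits turn sizing into simple arithmetic: take whichever of the write or read requirement needs more shards. A minimal sketch, with a hypothetical helper `shards_needed` and made-up throughput figures:

```python
import math

WRITE_MB_PER_SHARD = 1.0  # per-shard write capacity from the card above
READ_MB_PER_SHARD = 2.0   # per-shard read capacity from the card above


def shards_needed(write_mb_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies both the write and the read rate."""
    by_write = math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD)
    by_read = math.ceil(read_mb_per_s / READ_MB_PER_SHARD)
    return max(1, by_write, by_read)


# 4.5 MB/s of writes needs 5 shards; 6 MB/s of reads needs only 3.
print(shards_needed(4.5, 6.0))
```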

Kinesis Data Streams Producers

SDK; Kinesis Producer Library (KPL) with record aggregation/batching; Kinesis Agent for sending logs; 3rd party libs like Spark

Kinesis Data Streams Consumers

SDK; Kinesis Client Library (KCL) with shard discovery; Kinesis Connector Library; Kinesis Data Firehose; AWS Lambda; Spark

Kinesis Enhanced Fan-Out Consumers

A feature that gives each registered consumer its own 2MB/s per shard instead of sharing the 2MB/s across all consumers. Provides better scaling and lower latency (about 70ms; versus roughly 200ms for standard shared consumers).
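
The effect on read throughput can be worked out directly. The sketch below compares the share each consumer gets with and without enhanced fan-out; `read_mb_per_consumer` is a hypothetical helper, and the shard and consumer counts are made up.

```python
READ_MB_PER_SHARD = 2.0  # per-shard read capacity


def read_mb_per_consumer(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Read throughput available to each consumer across the whole stream."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must both be at least 1")
    total = shards * READ_MB_PER_SHARD
    # Standard consumers split the capacity; enhanced fan-out gives each the full rate.
    return total if enhanced_fan_out else total / consumers


print(read_mb_per_consumer(3, 4, enhanced_fan_out=False))
print(read_mb_per_consumer(3, 4, enhanced_fan_out=True))
```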

Kinesis Scaling

You can reshard (split or merge shards) to increase or decrease stream capacity. Resharding is not instantaneous; so plan capacity changes ahead of time.
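
Kinesis routes a record by taking the MD5 hash of its partition key as a 128-bit integer and finding the shard whose hash key range contains it. The sketch below models shards as `(start, end)` ranges and splits one at its midpoint. The midpoint split, the helper names, and the key `device-42` are assumptions for illustration; the real split operation lets you choose the new starting hash key.

```python
import hashlib

MAX_HASH = 2**128 - 1  # partition keys hash into a 128-bit space
Shard = tuple[int, int]  # inclusive (start, end) hash key range


def hash_key(partition_key: str) -> int:
    """MD5 of the partition key, read as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")


def even_shards(n: int) -> list[Shard]:
    """Divide the hash key space into n contiguous ranges."""
    size = (MAX_HASH + 1) // n
    return [(i * size, MAX_HASH if i == n - 1 else (i + 1) * size - 1) for i in range(n)]


def split(shard: Shard) -> tuple[Shard, Shard]:
    """Split one range at its midpoint into two child ranges."""
    start, end = shard
    if start >= end:
        raise ValueError("range too small to split")
    mid = (start + end) // 2
    return (start, mid), (mid + 1, end)


def find_shard(shards: list[Shard], partition_key: str) -> int:
    """Index of the shard whose range contains the key's hash."""
    h = hash_key(partition_key)
    return next(i for i, (start, end) in enumerate(shards) if start <= h <= end)


shards = even_shards(2)
parent = find_shard(shards, "device-42")
children = split(shards[parent])
resharded = shards[:parent] + list(children) + shards[parent + 1:]
# The same key now lands in exactly one of the two children of its old shard.
print(parent, find_shard(resharded, "device-42"))
```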

Kinesis Security

IAM for producer/consumer permissions; HTTPS for encryption in-flight; KMS for encryption at rest; Manual client-side encryption; VPC endpoints

Kinesis Resharding Gotcha

If a consumer hasn't read all of the parent shard's data before resharding; it may read child shard data first; leading to out-of-order records for a partition key. Make sure the consumer reads the parent shard to the end before it starts on the children.

Kinesis Data Firehose

Near real-time service to capture; transform and load streaming data into destinations like S3; Redshift; OpenSearch and Splunk. Fully managed; auto-scaling; serverless data transformation with Lambda. No data storage; batches data before delivery.
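
The batching behind "near real-time" can be sketched as a buffer that flushes when it reaches a size limit or when its oldest record has waited long enough. This is a simplified model, not the service's implementation: the `Batcher` class and its limits are hypothetical, the clock is passed in explicitly, and the age check only runs when a record arrives, whereas a real delivery stream flushes on a timer.

```python
from typing import Optional


class Batcher:
    """Buffer records and flush on a size limit or an age limit."""

    def __init__(self, max_bytes: int, max_seconds: float) -> None:
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self._buffer: list[bytes] = []
        self._size = 0
        self._opened_at = 0.0

    def add(self, record: bytes, now: float) -> Optional[list[bytes]]:
        """Add a record; return the flushed batch if a limit was reached."""
        if not self._buffer:
            self._opened_at = now  # age is measured from the first buffered record
        self._buffer.append(record)
        self._size += len(record)
        too_big = self._size >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        if too_big or too_old:
            batch, self._buffer, self._size = self._buffer, [], 0
            return batch
        return None


b = Batcher(max_bytes=10, max_seconds=60)
print(b.add(b"aaaa", now=0.0))   # buffered, nothing delivered yet
print(b.add(b"bbbb", now=1.0))   # still under both limits
print(b.add(b"cccc", now=2.0))   # size limit reached: all three delivered together
```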

Kinesis Analytics

For running SQL queries to analyze; filter and transform streaming data. Can integrate with S3; Lambda and other services.

MSK (Amazon Managed Streaming for Apache Kafka)

A managed Apache Kafka service on AWS. Provides more configuration options than Kinesis like larger message sizes up to 10MB (vs 1MB on Kinesis). Can manually provision brokers/topics as needed. More complex but also more flexible than Kinesis.

MSK Use Case vs Kinesis

Choose MSK over Kinesis if you need larger than 1MB message sizes; need more custom configurations; or need tighter control over scaling brokers and topics.

(35 cards)