1 AWS ML Eng Assoc Data Eng and Storage Flashcards

1
Q

AWS S3

A

Amazon S3 is an object storage service that offers industry-leading scalability, durability, availability, security, and performance. Customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as data lakes, websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics.

2
Q

S3 Buckets

A

S3 stores data as objects within buckets. You can store any type of object in a bucket, such as text files, photo files, video files, etc. Buckets are created in a specific AWS Region and can be accessed from anywhere.

3
Q

S3 Object Key

A

The name of the object is known as the key. The combination of bucket name and key uniquely identifies the object.
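A minimal boto3 sketch of this bucket + key addressing (the bucket name, key, and payload are placeholders; credentials and an existing bucket are assumed):

```python
import boto3

s3 = boto3.client("s3")

# The bucket name plus the object key uniquely identify an object.
s3.put_object(
    Bucket="my-example-bucket",
    Key="raw/2024/events.json",
    Body=b'{"event": "click"}',
)

# Retrieve the same object using the same bucket/key pair.
obj = s3.get_object(Bucket="my-example-bucket", Key="raw/2024/events.json")
print(obj["Body"].read())
```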

4
Q

EBS Volumes

A

EBS (Elastic Block Store) volumes are network drives that can be attached to and detached from EC2 instances. They allow persistent data storage, even after the instance is terminated.
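A hedged boto3 sketch of that attach/detach lifecycle; the Availability Zone, instance ID, and device name are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Create a volume in the same AZ as the target instance.
vol = ec2.create_volume(AvailabilityZone="us-east-1a", Size=20, VolumeType="gp3")
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# Attach it; the data persists even if this instance is later terminated.
ec2.attach_volume(
    VolumeId=vol["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)

# Detach and re-attach to a new instance to get the data back.
ec2.detach_volume(VolumeId=vol["VolumeId"])
```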

5
Q

EBS Volumes Use Case

A

EBS volumes allow you to persist data, even after the instance is terminated. This is helpful when you need to recreate an instance and mount the same EBS volume from before to get your data back.

6
Q

EFS (Elastic File System)

A

EFS is a managed NFS (Network File System) that can be mounted on many EC2 instances across multiple Availability Zones (AZs). It’s highly available, scalable, and expensive (about 3x the cost of GP2 EBS volumes). Use cases include content management, web serving, data sharing, and WordPress.
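A rough boto3 sketch of provisioning an EFS file system plus one mount target (the subnet and security group IDs are placeholders; in practice you would wait for the file system to become available before adding mount targets):

```python
import boto3

efs = boto3.client("efs")

# Create the file system; CreationToken makes the request idempotent.
fs = efs.create_file_system(CreationToken="demo-efs", PerformanceMode="generalPurpose")

# Expose a mount target in one subnet; repeat per AZ for multi-AZ access.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
# Instances in the VPC can then mount it over NFS (e.g. via the amazon-efs-utils helper).
```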

7
Q

Data Ingestion and Storage

A

All machine learning starts with potentially large amounts of data that need to be stored in a central repository in a scalable, secure manner. This section covers types of data, properties of data, storage strategies like data warehouses, data lakes, and data lakehouses, as well as pipelines for extracting, transforming, and loading data.

8
Q

3 Types of Data

A

Structured (organized in a defined schema, found in relational databases); Unstructured (no predefined schema, e.g. raw text files, videos, audio files); Semi-structured (has some structure like tags or hierarchies, but needs work to extract, e.g. XML, JSON, log files).

9
Q

3 V’s of Data Properties

A

Volume (amount/size of data); Velocity (speed at which data is generated/processed); Variety (different types/sources of data).

10
Q

Data Warehouse

A

A centralized repository optimized for analysis, where data from different sources is stored in a structured format. Designed for complex queries and analytics. Data is cleaned, transformed, and loaded using an ETL process. Typically uses a star or snowflake schema.

11
Q

Data Lake

A

Stores a vast amount of raw data in its native format (structured, semi-structured, unstructured). Data is loaded as-is without a predefined schema. Supports batch, real-time, and stream processing for data transformation and exploration.

12
Q

Data Lakehouse

A

A hybrid architecture combining features of data lakes and data warehouses. Supports structured and unstructured data, schema-on-write and schema-on-read, detailed analytics, and machine learning. Built on cloud/distributed architectures.

13
Q

Data Mesh

A

An organizational paradigm for decentralized data management. Individual teams own their data and offer it as data products to others in the organization. Promotes domain ownership, federated governance, and self-service data infrastructure.

14
Q

ETL (Extract; Transform; Load)

A

A process in data integration that extracts data from sources, transforms it into a desired format, and loads it into a target data repository. Typically used for data warehouses; contrast with ELT for data lakes.
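A toy Python ETL sketch, not tied to any AWS service: extract from a CSV, transform in memory, load into a SQLite table standing in for a warehouse (file names and columns are made up). In ELT, the raw rows would be loaded first and transformed inside the target store:

```python
import csv
import sqlite3

# Extract: read raw rows from a source CSV (path is a placeholder).
with open("orders.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape into the warehouse schema before loading.
cleaned = [
    {"order_id": int(r["id"]), "amount_usd": round(float(r["amount"]), 2)}
    for r in rows
    if r.get("amount")
]

# Load: write the transformed rows into the target store.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount_usd REAL)")
con.executemany("INSERT INTO orders VALUES (:order_id, :amount_usd)", cleaned)
con.commit()
```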

15
Q

ELT (Extract; Load; Transform)

A

A process that extracts data from sources, loads it into a data repository in its raw format, and transforms it later as needed. Used for data lakes.

16
Q

Structured Data Use Case

A

Database tables; CSV files (if consistent columns); Excel spreadsheets (organized rows and columns).

17
Q

Unstructured Data Use Case

A

Raw text files (books, websites, social media); videos; audio files; emails; Word docs (need to extract meaning first).

18
Q

Semi-structured Data Use Case

A

XML and JSON files (can have varying schemas); structured email headers; log files (structure not always consistent).

19
Q

Data Lake Use Case

A

Advanced analytics; machine learning; data discovery systems.

20
Q

Data Warehouse Use Case

A

Business intelligence and analytics, when you know the schema upfront and require fast, complex queries from structured data sources.

21
Q

S3 Performance Mnemonic

A

Use multipart upload for big files (recommended above 100 MB, required above 5 GB); get byte ranges to parallelize downloads; use S3 Transfer Acceleration to speed up long-distance transfers through edge locations.
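A hedged boto3 sketch of these three techniques (bucket, key, and file names are placeholders; acceleration must also be used via the bucket's accelerate endpoint when making requests):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart upload: upload_file splits objects above the threshold into parallel parts.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,
    multipart_chunksize=16 * 1024 * 1024,
)
s3.upload_file("big_dataset.parquet", "my-example-bucket", "data/big_dataset.parquet", Config=config)

# Byte-range fetch: download only the first 1 MB of the object.
part = s3.get_object(
    Bucket="my-example-bucket",
    Key="data/big_dataset.parquet",
    Range="bytes=0-1048575",
)

# Enable Transfer Acceleration on the bucket.
s3.put_bucket_accelerate_configuration(
    Bucket="my-example-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)
```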

22
Q

S3 Object Encryption

A

Server-side encryption: SSE-S3 (AWS-managed keys), SSE-KMS (KMS keys you manage), SSE-C (customer-provided keys sent with each request). Client-side encryption, where you encrypt the data and manage the keys yourself before uploading.
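A small boto3 sketch of requesting SSE-S3 and SSE-KMS on upload (bucket, key, and the KMS key alias are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-example-bucket", "secure/report.csv"  # placeholders

# SSE-S3: S3 encrypts the object with keys it manages (AES-256).
s3.put_object(Bucket=bucket, Key=key, Body=b"col1,col2\n", ServerSideEncryption="AES256")

# SSE-KMS: encrypt with a KMS key you control.
s3.put_object(
    Bucket=bucket, Key=key, Body=b"col1,col2\n",
    ServerSideEncryption="aws:kms", SSEKMSKeyId="alias/my-data-key",
)
# SSE-C requires sending your own key (and its MD5) with every request;
# client-side encryption means encrypting the bytes yourself before put_object.
```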

23
Q

Kinesis Data Streams

A

Used to collect and process large streams of data records in real time. Producers write records into the stream, and consumers read them. Data is retained for 1-365 days and can be replayed.
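A minimal boto3 producer sketch (stream name, partition key, and payload are placeholders):

```python
import boto3
import json

kinesis = boto3.client("kinesis")

# The partition key determines which shard receives the record.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u-42", "action": "click"}).encode(),
    PartitionKey="u-42",
)
```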

24
Q

Kinesis Data Streams Architecture Metaphor

A

Imagine a river with many streams flowing into it (the producers) and then many people trying to collect water from the river (the consumers). The river stores water for 1-365 days.

25
Q

Kinesis Data Streams Shards

A

A Kinesis Data Stream is made up of shards - each shard gets 1MB/s write and 2MB/s read capacity. You provision the number of shards required for your stream.
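A short boto3 sketch of provisioning shards when creating a stream (stream name and shard count are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# 4 shards: roughly 4 MB/s write and 8 MB/s shared read capacity for the stream.
kinesis.create_stream(StreamName="clickstream", ShardCount=4)
kinesis.get_waiter("stream_exists").wait(StreamName="clickstream")
```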

26
Q

Kinesis Data Streams Producers

A

AWS SDK; Kinesis Producer Library (KPL) with record aggregation/batching; Kinesis Agent for shipping logs; third-party libraries like Spark

27
Q

Kinesis Data Streams Consumers

A

SDK; Kinesis Client Library (KCL) with shard discovery; Kinesis Connector Library; Kinesis Data Firehose; AWS Lambda; Spark
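A minimal polling consumer sketch with the plain SDK (stream name and shard ID are placeholders); in real applications the KCL handles shard discovery and checkpointing for you:

```python
import boto3

kinesis = boto3.client("kinesis")

# Start reading one shard from its oldest available record.
it = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=it, Limit=100)
for record in resp["Records"]:
    print(record["SequenceNumber"], record["Data"])
```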

28
Q

Kinesis Enhanced Fan-Out Consumers

A

A feature that gives each registered consumer its own 2MB/s of read throughput per shard, instead of all consumers sharing the shard's 2MB/s. Records are pushed to consumers, providing better scaling and lower latency (about 70ms).
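A hedged boto3 sketch of registering an enhanced fan-out consumer (the stream ARN and consumer name are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Register a dedicated consumer; it then gets its own 2 MB/s per shard.
consumer = kinesis.register_stream_consumer(
    StreamARN="arn:aws:kinesis:us-east-1:123456789012:stream/clickstream",
    ConsumerName="analytics-service",
)
# Records are then pushed over HTTP/2 via subscribe_to_shard using this consumer ARN.
print(consumer["Consumer"]["ConsumerARN"])
```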

29
Q

Kinesis Scaling

A

You can reshard (split or merge shards) to increase or decrease stream capacity. This requires planning, as resharding is not instantaneous.
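A one-call boto3 sketch of uniform resharding (stream name and target count are placeholders):

```python
import boto3

kinesis = boto3.client("kinesis")

# Double capacity from 4 to 8 shards; the operation runs asynchronously.
kinesis.update_shard_count(
    StreamName="clickstream",
    TargetShardCount=8,
    ScalingType="UNIFORM_SCALING",
)
```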

30
Q

Kinesis Security

A

IAM for producer/consumer permissions; HTTPS for encryption in-flight; KMS for encryption at rest; Manual client-side encryption; VPC endpoints

31
Q

Kinesis Resharding Gotcha

A

If a consumer hasn’t read all of a parent shard’s data after resharding, it may read child shard data first, leading to out-of-order records for a partition key. Make sure consumers read the parent shard to the end before moving on to the child shards.

32
Q

Kinesis Data Firehose

A

Near-real-time service to capture, transform, and load streaming data into destinations like S3, Redshift, OpenSearch, and Splunk. Fully managed, auto-scaling, with serverless data transformation via Lambda. No data storage; data is batched before delivery.
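A minimal boto3 sketch of writing to an existing Firehose delivery stream (the stream name and payload are placeholders):

```python
import boto3
import json

firehose = boto3.client("firehose")

# Firehose buffers records and delivers them in batches to the configured destination (e.g. S3).
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",
    Record={"Data": (json.dumps({"user": "u-42", "action": "click"}) + "\n").encode()},
)
```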

33
Q

Kinesis Analytics

A

For running SQL queries to analyze, filter, and transform streaming data. Can integrate with S3, Lambda, and other services.

34
Q

MSK (Amazon Managed Streaming for Apache Kafka)

A

A managed Apache Kafka service on AWS. Provides more configuration options than Kinesis, such as larger message sizes of up to 10MB (vs 1MB on Kinesis). You manually provision brokers and topics as needed. More complex but also more flexible than Kinesis.
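A rough sketch of publishing to an MSK topic with the third-party kafka-python client (broker hostname, port, topic, and TLS settings are placeholders and depend on the cluster's configuration):

```python
from kafka import KafkaProducer  # third-party kafka-python package

# Connect to the cluster's bootstrap brokers over TLS and publish one message.
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9094"],
    security_protocol="SSL",
)
producer.send("clickstream", value=b'{"user": "u-42", "action": "click"}')
producer.flush()
```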

35
Q

MSK Use Case vs Kinesis

A

Choose MSK over Kinesis if you need larger than 1MB message sizes, need more custom configuration, or need tighter control over scaling brokers and topics.