Data Ingestion and Storage Flashcards by Chris Lombardi

What are the three types of data?

Structured
Unstructured
Semi-Structured

What is semi-structured data?

XML, JSON, log files with varied formats, etc.

What is unstructured data?

Video files, emails, text files with no fixed format.

What are the three properties of data?

Volume
Velocity
Variety

What is meant by the term data variety?

Different types of data formats and sources.

Is a data warehouse good for storing structured information?

Yes

Is a data warehouse good for storing files and images?

No. Unstructured files and images are better suited to a data lake.

Is a data warehouse good for OLAP?

Yes

If you have structured, semi-structured, and unstructured data, what is the best way to store it?

A data lake

Do you use ETL or ELT for a Data Warehouse?

ETL

Why is ELT used for a Data Lake?

Data is loaded in its raw form, and you need to read each file to know its format, so transformation happens after loading.

What is more expensive, a data lake or data warehouse?

Usually, a data warehouse

What is an example of an AWS Data Lakehouse?

Amazon S3 with Redshift Spectrum

What is a Data Mesh?

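An approach to the governance and organization of data in which individual domains own and publish their data as products, under shared governance standards.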

What is Avro?

A binary format that stores both the data and its schema.

What is Parquet?

A columnar storage format optimized for analytics.

What is the S3 Key?

The full path of the file: everything after the bucket name, up to and including the file name.

What are the two parts to an S3 Key?

The Prefix and Object name

What is a Prefix?

The path of the file, but not the bucket or file name.
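
As a rough illustration of how a key breaks down into a prefix and an object name, here is a minimal sketch. The helper name `split_s3_key` and the sample key are made up for this example; the sketch simply splits at the last `/`.

```python
def split_s3_key(key: str) -> tuple[str, str]:
    """Split an S3 key into (prefix, object_name) at the last '/'.

    Hypothetical helper for illustration only; S3 itself treats the key
    as a single flat string.
    """
    prefix, sep, name = key.rpartition("/")
    return prefix + sep, name


# Sample key (invented): the bucket name is not part of the key.
prefix, name = split_s3_key("my_folder/another_folder/my_file.txt")
print(prefix)  # my_folder/another_folder/
print(name)    # my_file.txt
```

A key with no `/` in it yields an empty prefix and the whole key as the object name.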

Is versioning required for S3 replication?

Yes

Will S3 replication work on existing objects?

No, not by default. Existing objects can be replicated with an S3 batch operation (S3 Batch Replication).

What is Glacier Instant Retrieval?

Low-cost archive storage with millisecond retrieval.

What is the minimum storage duration for Glacier Instant Retrieval?

90 days

What is Glacier Flexible Retrieval?

Formerly called just Glacier. It has three retrieval tiers:

Expedited (1 - 5 min)
Standard (3 - 5 hours)
Bulk (5 - 12 hours)

What are the Glacier Deep Archive tiers?

Standard (12 hours)
Bulk (48 hours)

When S3 sends event notifications to SQS, SNS, or Lambda, what kind of access control policy is needed and where is it configured?

A resource-based policy on the target service, e.g., the access policy of the SQS queue.

What file sizes are recommended for S3 multi-part uploads?

Recommended for files of 100 MB or more. Required for files larger than 5 GB.
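
The size thresholds can be expressed as a small decision helper. This is only a sketch: the function name `multipart_advice` and its return labels are invented here, and the 5 GB cutoff reflects the size limit of a single PUT.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def multipart_advice(size_bytes: int) -> str:
    """Classify an upload by size (hypothetical helper, not an AWS API)."""
    if size_bytes > 5 * GB:
        # A single PUT cannot exceed 5 GB, so multi-part is the only option.
        return "required"
    if size_bytes >= 100 * MB:
        # Above roughly 100 MB, parallel parts usually pay off.
        return "recommended"
    return "optional"


for size in (20 * MB, 250 * MB, 8 * GB):
    print(size, multipart_advice(size))
```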

How does S3 Transfer Acceleration work?

It sends data to the nearest edge location, which then forwards it to the target bucket over the AWS network.

What is an S3 Byte Range Fetch?

Allows you to fetch parts of a file. Good for only downloading partial data like headers.
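
A byte-range fetch is driven by a standard HTTP `Range` header. The sketch below only builds that header value; the helper name `range_header` is an assumption for this example. With boto3, the same string would typically be passed as the `Range` argument of `get_object`.

```python
def range_header(start: int, length: int) -> dict[str, str]:
    """Build an HTTP Range header for `length` bytes beginning at `start`.

    Hypothetical helper: HTTP byte ranges are inclusive on both ends,
    so the last byte requested is start + length - 1.
    """
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    return {"Range": f"bytes={start}-{start + length - 1}"}


# Request only the first 1 KiB of an object, e.g. to inspect a file header.
print(range_header(0, 1024))  # {'Range': 'bytes=0-1023'}
```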

What is the difference between AWS SSE-KMS and SSE-C?

SSE-KMS uses a key managed in AWS KMS. SSE-C uses a key that the customer provides with each request, which could come from your own HSM.

Is SSE-S3 enabled by default on S3 buckets?

Yes

Are there rate based limitations that can cause throttling in KMS?

Yes

How can you force encryption in transit?

A resource (bucket) policy that denies requests when aws:SecureTransport is false.
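
One common way to write such a bucket policy is a Deny statement conditioned on `aws:SecureTransport`. The sketch below just builds the policy JSON; the function name and the bucket name `example-bucket` are placeholders, not values from this deck.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> str:
    """Return a bucket policy (as JSON) that rejects non-HTTPS requests."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is false when the request is plain HTTP.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)


print(deny_insecure_transport_policy("example-bucket"))
```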

What are S3 Access Points?

Named endpoints attached to a bucket, each with its own policy. They are commonly used to grant access to specific prefixes in the bucket.

How are S3 Access Point permissions managed?

Through Access Point Policies. These are resource based policies.

Can S3 access points be private?

Yes, by setting the access point's network origin to a VPC.

Do VPC endpoints have resource based policies?

Yes.

What is an S3 Object Lambda?

It is used to change the object before it is retrieved by the calling application.

What protocol does AWS EFS use?

NFS

Is EFS more expensive than EBS?

Yes, around 3x.

Does EFS scale automatically?

Yes

What are the performance modes for EFS?

General Purpose
Max I/O (for big data and media processing)

What are the throughput modes for EFS?

Bursting
Provisioned
Elastic

Does EFS support lifecycle policies?

Yes.

What is Amazon FSx?

It allows you to launch 3rd party high-performance file systems on AWS.

What are the four file systems that FSx supports?

Lustre, NetApp ONTAP, Windows File Server, OpenZFS

Does FSx for Windows File Server support SMB?

Yes

Does FSx for Windows support Active Directory?

Yes

Can FSx for Windows File Server be mounted on Linux?

Yes

Can FSx be accessed from on-premises?

Yes, using Direct Connect or a VPN.

What is a key use case for FSx for Lustre?

High-Performance computing.

What is an FSx scratch file system?

Temporary storage. Data is not replicated.

What FSx file system type supports point-in-time instant cloning?

FSx for NetApp ONTAP and FSx for OpenZFS

What are the two parts that make up a record in Kinesis Data Streams when you send the record to the stream?

A partition key and a data blob (up to 1 MB)

How large is a Data Blob in Kinesis Data Streams?

Up to 1 MB

What is the throughput for Kinesis Data Streams from the producer to the stream?

1 MB per second or 1,000 records per second, per shard.
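
These write limits apply per shard, so a quick way to size a stream is to take whichever limit is hit first. The function below is a back-of-the-envelope sketch; its name and the sample workload figures are assumptions for illustration.

```python
import math


def shards_for_ingest(mb_per_sec: float, records_per_sec: float) -> int:
    """Estimate the shards needed to absorb a producer workload.

    Each shard accepts up to 1 MB/s or 1,000 records/s, whichever
    limit is reached first.
    """
    by_bytes = math.ceil(mb_per_sec / 1.0)
    by_records = math.ceil(records_per_sec / 1000)
    return max(1, by_bytes, by_records)


# 3.5 MB/s of 2,500 records/s: the byte limit dominates (4 shards).
print(shards_for_ingest(3.5, 2500))
# 0.2 MB/s of 4,200 small records/s: the record limit dominates (5 shards).
print(shards_for_ingest(0.2, 4200))
```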

What are the parts that make up a record in Kinesis Data Streams when you send a record to the consumer?

Partition Key
Sequence Number
Data Blob

What is the throughput for Kinesis Data Streams from the Stream to the Consumer?

2 MB per second per shard, shared across all consumers, OR 2 MB per second per shard per consumer when using enhanced fan-out.

What is the max retention in Kinesis?

365 Days

What are the two capacity modes in Kinesis?

Provisioned
On-Demand (scales based on the peak throughput over the last 30 days)

What are three ways to produce records to Kinesis Data Streams?

The AWS SDK, the Kinesis Producer Library (KPL), and the Kinesis Agent

Does the PutRecords API call have a batch option?

Yes. This allows you to increase throughput.

What error will you get when you exceed the allowed throughput for PutRecords?

ProvisionedThroughputExceededException

What is a good use case for the Kinesis Producer SDK?

Low throughput, where higher latency is acceptable.

How can you address a hot shard that causes a ProvisionedThroughputExceededException error?

Retry with exponential backoff
Increase the number of shards
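
Exponential backoff means waiting longer after each throttled attempt before retrying. The sketch below is generic: `ThroughputExceeded` is a stand-in exception defined here (not the real SDK class), and `put` is any callable supplied by the caller.

```python
import random
import time


class ThroughputExceeded(Exception):
    """Stand-in for the throttling error raised by the real client."""


def put_with_backoff(put, record, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call put(record), retrying throttled attempts with growing delays."""
    for attempt in range(max_attempts):
        try:
            return put(record)
        except ThroughputExceeded:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt)  # 0.1, 0.2, 0.4, ...
            # Random jitter keeps many producers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay))
```

Backoff only smooths out short bursts. If one shard stays hot, adding shards is the more durable fix.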

What languages does the KPL support?

C++ and Java

Does the KPL support asynchronous operations?

Yes

Can the KPL compress records?

No, not natively. Compression must be implemented by the user.

If a producer uses the KPL, what are the requirements for the consumer?

The aggregated records MUST be decoded (de-aggregated) by the consumer, using the KCL or a helper library.

What does batching in the KPL do?

It can aggregate messages into one record that is less than 1MB.
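
The idea behind aggregation can be sketched as packing small messages into groups that each stay within the record size limit. This is a simplified illustration only: the real KPL uses its own binary record format and accounts for framing overhead, and the function name `aggregate` is made up here.

```python
MAX_RECORD_BYTES = 1024 * 1024  # 1 MB record size limit


def aggregate(messages: list[bytes], limit: int = MAX_RECORD_BYTES) -> list[list[bytes]]:
    """Group messages, in order, so each group's total size fits in one record."""
    batches: list[list[bytes]] = []
    current: list[bytes] = []
    size = 0
    for msg in messages:
        if len(msg) > limit:
            raise ValueError("a single message exceeds the record size limit")
        if current and size + len(msg) > limit:
            # The next message would overflow: close the current group.
            batches.append(current)
            current, size = [], 0
        current.append(msg)
        size += len(msg)
    if current:
        batches.append(current)
    return batches


# Tiny limit used only to make the grouping visible.
print(aggregate([b"aaaa", b"bbbb", b"cccc"], limit=10))
# [[b'aaaa', b'bbbb'], [b'cccc']]
```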

How can you control batch sending delays in Kinesis?

The RecordMaxBufferedTime configuration option (default 100 ms).

When is batching in the KPL not a good idea?

When the application cannot tolerate the added latency.

Can Apache Spark, Log4J appenders, Flume, or Kafka be used as a consumer for Kinesis Data Streams?

Yes

What is the throughput for the GetRecords API call?

Up to 10 MB of data or up to 10,000 records per call.

What is the GetRecords API call limit?

A maximum of 5 GetRecords API calls per shard per second, which means about 200 ms of latency.

Does the KCL support checkpointing?

Yes. It uses DynamoDB for this.

What does the ExpiredIteratorException indicate in relation to Kinesis Client Library?

That the DynamoDB table used by the KCL is under-provisioned: its write capacity (WCU) needs to be increased to support checkpointing.

Can Lambda read from a Kinesis Data Stream?

Yes

What is Kinesis Enhanced Fan-Out for Consumers?

Each consumer gets its own 2 MB per second of read throughput per shard.

What is the latency with Kinesis Enhanced Fanout for Consumers?

About 70 ms

What is a good use case for standard consumers?

An application that can tolerate 200ms latency.

Can you split shards in Kinesis?

Yes, using the SplitShard operation.

When will the original (parent) shard be removed after a split?

When its data has expired.

What is a common root cause of reading data out of order in Kinesis Data Streams?

The parent shard was not fully read before the consumer moved on to the child shards after a resharding event.

Can resharding be done in parallel?

No. Resharding operations run one at a time, and each takes a few seconds.

What is a common root cause for duplicates coming from a Kinesis Producer?

Network timeouts that cause the producer to retry a write that had already succeeded. This can be handled by embedding a unique record ID in the data and deduplicating on the consumer side.
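
A minimal sketch of that approach: the producer wraps each payload with a unique ID, and the consumer skips IDs it has already seen. The field names `record_id` and `payload` and both function names are assumptions for this example. A real consumer would keep the seen IDs in durable storage rather than in memory.

```python
import json
import uuid


def make_record(payload: dict) -> bytes:
    """Producer side: attach a unique ID before the record is sent."""
    envelope = {"record_id": str(uuid.uuid4()), "payload": payload}
    return json.dumps(envelope).encode("utf-8")


def deduplicate(raw_records):
    """Consumer side: yield each payload once, dropping repeated IDs."""
    seen = set()
    for raw in raw_records:
        record = json.loads(raw)
        if record["record_id"] in seen:
            continue  # duplicate caused by a producer retry
        seen.add(record["record_id"])
        yield record["payload"]


first = make_record({"n": 1})
second = make_record({"n": 2})
# The first record arrives twice, as it would after a retried write.
print(list(deduplicate([first, first, second])))  # [{'n': 1}, {'n': 2}]
```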

What is a common root cause for duplicates coming from a Kinesis Consumer?

A worker terminates unexpectedly
Worker instances are added or removed
Shards are merged or split
The application is deployed

How do you handle data transformation with Amazon Data Firehose?

Using a Lambda function.

What are some common destinations for Amazon Data Firehose?

S3, Redshift, Amazon OpenSearch Service

Can data be backed up to an S3 bucket in Amazon Data Firehose?

Yes. Either only the failed records or all records can be stored in a backup bucket.

Is Amazon Data Firehose real-time?

No. It is near-real-time.

What data formats are supported by Amazon Data Firehose?

CSV, JSON, Parquet, Avro, Raw Text, Binary data.

Can Amazon Data Firehose perform data compression or conversions?

Yes. It can convert JSON to Parquet or ORC, and compress with GZIP or Snappy.

What is the Flink framework used for?

Processing data streams

Is Managed Service for Apache Flink serverless?

Yes

What are some common use cases for the Managed Service for Apache Flink?

Streaming ETL

What is RANDOM_CUT_FOREST used for in Kinesis Analytics?

Anomaly detection.

What is Amazon Managed Streaming for Apache Kafka?

It is a fully managed Apache Kafka cluster. It is an alternative to Kinesis.

What is the maximum size for an Amazon Managed Streaming for Apache Kafka message?

1 MB by default, configurable to larger sizes such as 10 MB.

Kafka producers send data to what?

Brokers, which host the topic partitions. Partitions play a role similar to shards in Kinesis.

Can you encrypt data in transit with Amazon MSK?

Yes. This is enabled by default.

(101 cards)