Data Ingestion and Storage Flashcards

1
Q

What are the three types of data?

A

Structured
Unstructured
Semi-Structured

2
Q

What is semi-structured data?

A

XML, JSON, log files with varied formats, etc.

3
Q

What is unstructured data?

A

Video files, emails, text files with no fixed format.

4
Q

What are the three properties of data?

A

Volume
Velocity
Variety

5
Q

What is meant by the term data variety?

A

Different types of data formats and sources.

6
Q

Is a data warehouse good for storing structured information?

A

Yes

7
Q

Is a data warehouse good for storing files and images?

A

No

8
Q

Is a data warehouse good for OLAP?

A

Yes

9
Q

If you have structured, semi-structured, and unstructured data, what is the best way to store it?

A

A data lake

10
Q

Do you use ETL or ELT for a Data Warehouse?

A

ETL

11
Q

Why is ELT used for a Data Lake?

A

The raw file must be loaded first and then read to determine its format; the schema is applied on read (ELT) rather than before loading (ETL).

12
Q

What is more expensive, a data lake or data warehouse?

A

Usually, a data warehouse

13
Q

What is an example of an AWS Data Lakehouse?

A

AWS S3 with Redshift Spectrum

14
Q

What is a Data Mesh?

A

A decentralized approach to data governance and organization: individual domain teams own and serve their data as a product.

15
Q

What is Avro?

A

A row-based binary format that stores the data together with its schema.

16
Q

What is Parquet?

A

A columnar storage format optimized for analytics.

17
Q

What is the S3 Key?

A

The full path of the object: everything after the bucket name, up to and including the object name.

18
Q

What are the two parts to an S3 Key?

A

The Prefix and Object name

19
Q

What is a Prefix?

A

The path of the file, but not the bucket or file name.

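The key/prefix/object breakdown in the last few cards can be sketched with a small helper (the URI and names below are hypothetical):

```python
def split_s3_uri(uri: str):
    """Split s3://bucket/prefix/.../object into bucket, key, prefix, object name."""
    assert uri.startswith("s3://")
    bucket, _, key = uri[len("s3://"):].partition("/")
    prefix, _, object_name = key.rpartition("/")
    return bucket, key, prefix, object_name

bucket, key, prefix, obj = split_s3_uri("s3://my-bucket/logs/2024/app.log")
# key    -> "logs/2024/app.log"  (everything after the bucket name)
# prefix -> "logs/2024"          (the path, without bucket or object name)
# obj    -> "app.log"
```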
20
Q

Is versioning required for S3 replication?

A

Yes

21
Q

Will S3 replication work on existing objects?

A

No; by default only new objects are replicated. Existing objects can be replicated with S3 Batch Replication.

22
Q

What is Glacier Instant Retrieval?

A

Low-cost archive storage with millisecond retrieval.

23
Q

What is the minimum storage duration for Glacier Instant Retrieval?

A

90 days

24
Q

What is Glacier Flexible Retrieval?

A

Formerly known simply as Glacier. It now has three retrieval tiers:

Expedited (1-5 minutes)
Standard (3-5 hours)
Bulk (5-12 hours)

25
Q

What are the Glacier Deep Archive tiers?

A

Standard (12 hours)
Bulk (48 hours)

26
Q

When S3 executes SQS, SNS, or Lambda, what kind of access control policy is needed and where is it configured?

A

A resource-based policy on the target service (e.g., an SQS queue access policy).

27
Q

What file sizes are recommended for S3 multi-part uploads?

A

100 MB or greater is recommended; multi-part upload is required for files of 5 GB or larger.

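Splitting a large file into parts for a multi-part upload is simple arithmetic; a sketch (the part size and file sizes below are made up for illustration):

```python
import math

def plan_multipart(total_size: int, part_size: int = 100 * 1024 * 1024):
    """Return (number_of_parts, last_part_size) for a multi-part upload.

    Illustrative only: real S3 parts must be 5 MiB-5 GiB (except the last),
    and multi-part upload is required above 5 GB total.
    """
    parts = math.ceil(total_size / part_size)
    last = total_size - (parts - 1) * part_size
    return parts, last

# A 250-byte "file" with 100-byte parts splits into 3 parts,
# the last of which holds the remaining 50 bytes.
print(plan_multipart(250, 100))
```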
28
Q

How does S3 Transfer Acceleration work?

A

It uploads data to the nearest edge location, which then forwards it to the target bucket over the AWS private network.

29
Q

What is an S3 Byte-Range Fetch?

A

It allows you to fetch part of a file by specifying a byte range. Useful for downloading only partial data, such as a file header.

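A byte-range fetch is driven by the HTTP `Range` header. This sketch only generates the header values for a chunked download (the sizes are illustrative; the commented `get_object` call shows where such a value would be used):

```python
def range_headers(total_size: int, chunk_size: int):
    """Yield HTTP Range header values covering an object of total_size bytes."""
    for start in range(0, total_size, chunk_size):
        end = min(start + chunk_size, total_size) - 1
        yield f"bytes={start}-{end}"

# e.g. fetching only the first 1 KB of an object (say, a header) with boto3:
# s3.get_object(Bucket="my-bucket", Key="big/file.bin", Range="bytes=0-1023")
print(list(range_headers(250, 100)))
# -> ['bytes=0-99', 'bytes=100-199', 'bytes=200-249']
```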
30
Q

What is the difference between AWS SSE-KMS and SSE-C?

A

SSE-KMS uses a key managed in AWS KMS; SSE-C uses a customer-provided key sent with each request, which could come from your own HSM.

31
Q

Is SSE-S3 enabled by default on S3 buckets?

A

Yes

32
Q

Are there rate based limitations that can cause throttling in KMS?

A

Yes

33
Q

How can you force encryption in transit?

A

A resource-based (bucket) policy that denies non-HTTPS requests.

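A bucket policy that forces encryption in transit typically denies any request where `aws:SecureTransport` is false. A sketch of that policy built as a Python dict (the bucket name is hypothetical):

```python
import json

bucket = "example-bucket"  # hypothetical bucket name
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyInsecureTransport",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:*",
        # Deny on both the bucket itself and every object in it
        "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        # aws:SecureTransport is "false" when the request is plain HTTP
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }],
}
print(json.dumps(policy, indent=2))
```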
34
Q

What are S3 Access Points?

A

Endpoints that simplify access management at scale; each access point has its own policy and typically points to a specific prefix in your bucket.

35
Q

How are S3 Access Point permissions managed?

A

Through Access Point Policies. These are resource-based policies.

36
Q

Can S3 access points be private?

A

Yes, by restricting the access point's network origin to a VPC.

37
Q

Do VPC endpoints have resource-based policies?

A

Yes.

38
Q

What is an S3 Object Lambda?

A

It transforms the object with a Lambda function before it is returned to the calling application.

39
Q

What protocol does AWS EFS use?

A

NFS

40
Q

Is EFS more expensive than EBS?

A

Yes, around 3x.

41
Q

Does EFS scale automatically?

A

Yes

42
Q

What are the performance modes for EFS?

A

General Purpose
Max I/O for big data and media processing

43
Q

What are the throughput modes for EFS?

A

Bursting
Provisioned
Elastic

44
Q

Does EFS support lifecycle policies?

A

Yes.

45
Q

What is Amazon FSx

A

It allows you to launch third-party high-performance file systems on AWS.

46
Q

What are the four file systems that FSx supports?

A

Lustre, NetApp ONTAP, Windows File Server, and OpenZFS

47
Q

Does FSx for Windows File Server support SMB?

A

Yes

48
Q

Does FSx for Windows support Active Directory?

A

Yes

49
Q

Can FSx for Windows be launched on Linux?

A

Yes

50
Q

Can FSx be accessed from on-premises?

A

Yes, using Direct Connect or a VPN.

51
Q

What is a key use case for FSx for Lustre?

A

High-performance computing (HPC).

52
Q

What is an FSx scratch file system?

A

Temporary storage. Data is not replicated.

53
Q

What FSx file system type supports point-in-time instant cloning?

A

FSx for NetApp ONTAP and FSx for OpenZFS

54
Q

What are the two parts that make up a record in Kinesis Data Streams when you send the record to the stream?

A

A partition key and a data blob (up to 1 MB).
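The partition key decides which shard a record lands on: Kinesis hashes it with MD5 into a 128-bit number that falls into one shard's hash-key range. A rough sketch, assuming evenly split ranges (the real service lets ranges be uneven after resharding):

```python
import hashlib

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Sketch of Kinesis-style routing: the MD5 hash of the partition key,
    read as a 128-bit integer, falls into exactly one shard's hash-key range.
    Assumes the hash space is split evenly across shards."""
    h = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return min(h // (2 ** 128 // shard_count), shard_count - 1)
```

Records with the same partition key always land on the same shard, which is what preserves per-key ordering.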

55
Q

How large is a Data Blob in Kinesis Data Streams?

A

Up to 1 MB.

56
Q

What is the throughput for Kinesis Data Streams from the producer to the stream?

A

1 MB/s or 1,000 records per second, per shard.

57
Q

What are the parts that make up a record in Kinesis Data Streams when you send a record to the consumer?

A

Partition Key
Sequence Number
Data Blob

58
Q

What is the throughput for Kinesis Data Streams from the Stream to the Consumer?

A

2 MB/s per shard, shared across all consumers
OR
2 MB/s per shard per consumer with enhanced fan-out

59
Q

What is the max retention in Kinesis?

A

365 Days

60
Q

What are the two capacity modes in Kinesis?

A

Provisioned

On-Demand (capacity scales based on the observed peak over the last 30 days)

61
Q

What are three producer options for Kinesis Data Streams?

A

Kinesis SDK, Kinesis Producer Library, and Kinesis Agent

62
Q

Does the PutRecords API call support batching?

A

Yes. This allows you to increase throughput.

63
Q

What error will you get when you exceed the allowed throughput for PutRecords?

A

ProvisionedThroughputExceededException

64
Q

What is a good use case for the Kinesis Producer SDK?

A

Low-throughput applications that can tolerate higher latency.

65
Q

How can you address a hot shard that causes a ProvisionedThroughputExceededException?

A

Retries with exponential backoff

Increase the number of shards

Use a more evenly distributed partition key
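The backoff retry above can be sketched as follows (`send` is a hypothetical stand-in for the real PutRecords call, and RuntimeError stands in for the throttling exception):

```python
import random
import time

def put_with_backoff(send, record, max_attempts=5):
    """Retry `send` with exponential backoff plus jitter, the usual client-side
    response to ProvisionedThroughputExceededException. `send` is any callable
    standing in for the real PutRecords request."""
    for attempt in range(max_attempts):
        try:
            return send(record)
        except RuntimeError:  # stand-in for the throttling exception
            # Wait 2^attempt units, plus random jitter to avoid thundering herds
            time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
    raise RuntimeError("still throttled after max_attempts retries")
```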

66
Q

What languages does the KPL support?

A

C++ and Java

67
Q

Does the KPL support asynchronous operations?

A

Yes

68
Q

Can the KPL compress records?

A

No; any compression must be implemented by the user.

69
Q

If a producer uses the KPL, what are the requirements for the consumer?

A

The records must be decoded by the consumer using the KCL.

70
Q

What does batching in the KPL do?

A

It aggregates multiple messages into a single record of up to 1 MB, increasing throughput.
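The batching idea can be illustrated without the protobuf envelope the real KPL uses; this sketch just packs byte strings into batches of at most 1 MB:

```python
def aggregate(records, max_bytes=1_000_000):
    """Sketch of KPL-style aggregation: pack many small user records into
    combined batches of at most max_bytes each. (The real KPL wraps records
    in a protobuf envelope; here we only group raw byte strings.)"""
    batches, current, size = [], [], 0
    for rec in records:
        # Start a new batch when adding this record would exceed the limit
        if size + len(rec) > max_bytes and current:
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches
```

Fewer, larger records mean fewer PutRecords calls, at the cost of buffering latency (controlled by RecordMaxBufferedTime).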

71
Q

How can you control batch sending delays in Kinesis?

A

The RecordMaxBufferedTime configuration option.

72
Q

When is batching in the KPL not a good idea?

A

When the application cannot tolerate the added latency.

73
Q

Can Apache Spark, Log4J appenders, Flume, or Kafka be used as a consumer for Kinesis Data Streams?

A

Yes

74
Q

What is the throughput for the GetRecords API call?

A

Up to 10 MB of data or up to 10,000 records per call.

75
Q

What is the GetRecords API call limit?

A

A maximum of 5 GetRecords API calls per shard per second, which implies roughly 200 ms of latency.

76
Q

Does the KCL support checkpointing?

A

Yes. It leverages DynamoDB for this.

77
Q

What does the ExpiredIteratorException indicate in relation to Kinesis Client Library?

A

That checkpointing is falling behind, typically because the WCU of the underlying DynamoDB table needs to be increased.

78
Q

Can Lambda read from a Kinesis Data Stream?

A

Yes

79
Q

What is Kinesis Enhanced Fan-Out for Consumers?

A

Each consumer gets a dedicated 2 MB/s per shard.

80
Q

What is the latency with Kinesis Enhanced Fanout for Consumers?

A

70ms

81
Q

What is a good use case for standard consumers?

A

An application that can tolerate 200ms latency.

82
Q

Can you split shards in kinesis?

A

Yes, using the shard split operation.

83
Q

When will a split shard be removed?

A

When its data expires at the end of the retention period.

84
Q

What is a common root cause of reading data out of order in Kinesis Data Streams?

A

Data was not fully read after a resharding event.

85
Q

Can resharding be done in parallel?

A

No. It takes a few seconds for resharding operations.

86
Q

What is a common root cause for duplicates coming from a Kinesis Producer?

A

Network timeouts. This can be handled by embedding a unique record ID in the data and deduplicating on the consumer side.
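Consumer-side deduplication on an embedded ID might look like this (the record shape and `id` field are hypothetical):

```python
import uuid

def make_record(payload: dict) -> dict:
    """Producer side: embed a unique ID so consumers can drop duplicates."""
    return {"id": str(uuid.uuid4()), **payload}

def dedupe(records):
    """Consumer side: keep only the first occurrence of each embedded ID."""
    seen, out = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            out.append(rec)
    return out
```

A retried producer send yields two records with the same ID, so the consumer keeps just one.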

87
Q

What is a common root cause for duplicates coming from a Kinesis Consumer?

A

When a worker terminates
Worker instances are added or removed
Shards are merged or split
The application is deployed

88
Q

How do you handle data transformation with Amazon Firehose?

A

With a Lambda function that transforms each record.
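A Firehose transformation Lambda receives base64-encoded records and must return each one with its `recordId`, a `result`, and re-encoded `data`. A minimal sketch, assuming a JSON payload with a hypothetical `msg` field:

```python
import base64
import json

def handler(event, context):
    """Sketch of a Firehose transformation Lambda: decode each record,
    modify it (here: uppercase a hypothetical "msg" field), and return it
    with result "Ok" in the response shape Firehose expects."""
    out = []
    for rec in event["records"]:
        payload = json.loads(base64.b64decode(rec["data"]))
        payload["msg"] = payload.get("msg", "").upper()
        out.append({
            "recordId": rec["recordId"],   # must echo the incoming ID
            "result": "Ok",                # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": out}
```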

89
Q

What are some common destinations for Amazon Data Firehose?

A

S3, Redshift, and Amazon OpenSearch.

90
Q

Can data be replicated to an S3 bucket in Amazon Firehose?

A

Yes. Both failed and all records can be stored in a backup bucket.

91
Q

Is Amazon Data Firehose real-time?

A

No, it is near-real-time.

92
Q

What data formats are supported by Amazon Data Firehose?

A

CSV, JSON, Parquet, Avro, Raw Text, Binary data.

93
Q

Can Amazon Data Firehose perform data compression or conversions?

A

Yes:

Format conversion to Parquet or ORC

Compression with gzip or Snappy

94
Q

What is the Flink framework used for?

A

Processing data streams

95
Q

Is Managed Service for Apache Flink serverless?

A

Yes

96
Q

What are some common use cases for the Managed Service for Apache Flink

A

Streaming ETL, continuous metric generation, and real-time analytics on streams.

97
Q

What is RANDOM_CUT_FOREST used for in Kinesis Analytics?

A

Anomaly detection.

98
Q

What is Amazon Managed Streaming for Apache Kafka?

A

A fully managed Apache Kafka cluster; an alternative to Kinesis Data Streams.

99
Q

What is the maximum size for an Amazon Managed Streaming for Kafka message?

A

1 MB by default; configurable up to 10 MB.

100
Q

Kafka producers send data to what?

A

Brokers, which host topic partitions; partitions play a role similar to shards in Kinesis.

101
Q

Can you encrypt data in-transit with AWS Kafka?

A

Yes. This is enabled by default.