Data Ingestion and Storage Flashcards
What are the three types of data?
Structured
Unstructured
Semi-Structured
What is semi-structured data?
XML, JSON, log files with varied formats, etc.
What is unstructured data?
Video files, emails, text files with no fixed format.
What are the three properties of data?
Volume
Velocity
Variety
What is meant by the term data variety?
Different types of data formats and sources.
Is a data warehouse good for storing structured information?
Yes
Is a data warehouse good for storing files and images?
No
Is a data warehouse good for OLAP?
Yes
If you have structured, semi-structured, and unstructured data, what is the best way to store it?
A data lake
Do you use ETL or ELT for a Data Warehouse?
ETL
Why is ELT used for a Data Lake?
Raw data is loaded as-is and the schema is applied when the data is read (schema on read), so the transformation happens after loading.
What is more expensive, a data lake or data warehouse?
Usually, a data warehouse
What is an example of an AWS Data Lakehouse?
Amazon S3 with Redshift Spectrum
What is a Data Mesh?
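A decentralized approach to the governance and organization of data: individual domain teams own their data and expose it as data products under shared (federated) governance standards.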
What is Avro?
A row-based binary format that stores the data together with its schema.
What is Parquet?
A columnar storage format optimized for analytics.
What is the S3 Key?
The full path of the file: everything after the bucket name, up to and including the file name.
What are the two parts to an S3 Key?
The Prefix and Object name
What is a Prefix?
The path of the file, but not the bucket or file name.
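Example (a minimal sketch): splitting a key into its prefix and object name. The function name split_s3_key and the sample key are made up for illustration, and the sketch assumes "/" is the delimiter.

```python
def split_s3_key(key: str) -> tuple[str, str]:
    """Split an S3 key into (prefix, object_name)."""
    # Everything up to and including the last "/" is the prefix;
    # whatever follows is the object name.
    prefix, sep, name = key.rpartition("/")
    return prefix + sep, name


print(split_s3_key("my_folder1/another_folder/my_file.txt"))
# ('my_folder1/another_folder/', 'my_file.txt')
```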
Is versioning required for S3 replication?
Yes
Will S3 replication work on existing objects?
No, not by default. Existing objects can be replicated with S3 Batch Replication.
What is Glacier Instant Retrieval?
Low-cost storage with millisecond retrieval.
What is the minimum storage duration for Glacier Instant Retrieval?
90 days
What is Glacier Flexible Retrieval?
Formerly called Amazon S3 Glacier. It has three retrieval tiers:
Expedited (1 - 5 min)
Standard (3 - 5 hours)
Bulk (5 - 12 hours)
What are the Glacier Deep Archive tiers?
Standard (12 hours)
Bulk (48 hours)
When S3 sends event notifications to SQS, SNS, or Lambda, what kind of access control policy is needed and where is it configured?
A resource-based policy on the target service (e.g., an SQS queue policy).
What file sizes are recommended for S3 multi-part uploads?
100 MB or greater. Multipart upload is required for files larger than 5 GB.
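Example (a rough sketch of the size thresholds, not an AWS API): choosing an upload strategy from the object size. The function names and the 100 MB default part size are assumptions made for illustration.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB

RECOMMENDED_THRESHOLD = 100 * MB  # multipart is recommended from this size
SINGLE_PUT_LIMIT = 5 * GB         # a single PUT cannot exceed this size


def upload_strategy(size_bytes: int) -> str:
    """Return which upload style the size thresholds suggest."""
    if size_bytes > SINGLE_PUT_LIMIT:
        return "multipart (required)"
    if size_bytes >= RECOMMENDED_THRESHOLD:
        return "multipart (recommended)"
    return "single PUT"


def part_count(size_bytes: int, part_size: int = 100 * MB) -> int:
    """Number of parts for a hypothetical fixed part size."""
    return math.ceil(size_bytes / part_size)


print(upload_strategy(200 * MB), part_count(250 * MB))
```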
How does S3 Transfer Acceleration work?
It sends data to the nearest edge location, which forwards it to the target bucket over the AWS private network.
What is an S3 Byte Range Fetch?
Allows you to fetch parts of a file. Good for only downloading partial data like headers.
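Example (a minimal sketch): building the HTTP Range header values that a byte-range fetch would send, one per chunk. The helper name byte_ranges is hypothetical; only the "bytes=start-end" header format is standard.

```python
def byte_ranges(total_size: int, chunk_size: int) -> list[str]:
    """Return one 'bytes=start-end' Range value per chunk of the object."""
    ranges = []
    for start in range(0, total_size, chunk_size):
        # The end offset in a Range header is inclusive.
        end = min(start + chunk_size, total_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(10, 4))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Requesting only the first range is how you would read just a file header.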
What is the difference between AWS SSE-KMS and SSE-C?
SSE-KMS uses a key managed in AWS KMS. SSE-C uses a key that the customer provides with each request, which could come from your own HSM.
Is SSE-S3 enabled by default on S3 buckets?
Yes
Are there rate based limitations that can cause throttling in KMS?
Yes
How can you force encryption in transit?
A resource (bucket) policy that denies any request where aws:SecureTransport is false.
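Example (a sketch, assuming the common deny-on-aws:SecureTransport pattern): building such a bucket policy as a Python dict. The function name, the Sid, and the bucket name "example-bucket" are placeholders.

```python
import json


def deny_insecure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that rejects requests not sent over HTTPS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",  # placeholder statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


print(json.dumps(deny_insecure_transport_policy("example-bucket"), indent=2))
```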
What are S3 Access Points?
Named endpoints for a bucket, each with its own access policy, typically scoped to specific prefixes in the bucket.
How are S3 Access Point permissions managed?
Through Access Point Policies. These are resource based policies.
Can S3 access points be private?
Yes, by setting the access point's network origin to VPC.
Do VPC endpoints have resource based policies?
Yes.
What is an S3 Object Lambda?
It is used to change the object before it is retrieved by the calling application.
What protocol does AWS EFS use?
NFS
Is EFS more expensive than EBS?
Yes, around 3x.
Does EFS scale automatically?
Yes
What are the performance modes for EFS?
General Purpose
Max I/O for big data and media processing
What are the throughput modes for EFS?
Bursting
Provisioned
Elastic
Does EFS support lifecycle policies?
Yes.
What is Amazon FSx?
It allows you to launch 3rd party high-performance file systems on AWS.
What are the four file systems that FSx supports?
Lustre, NetApp ONTAP, Windows File Server, OpenZFS
Does FSx for Windows File Server support SMB?
Yes
Does FSx for Windows support Active Directory?
Yes
Can FSx for Windows be mounted on Linux instances?
Yes
Can FSx be accessed from on-premises?
Yes, using Direct Connect or a VPN.
What is a key use case for FSx for Lustre?
High-Performance computing.
What is an FSx scratch file system?
Temporary storage. Data is not replicated.
What FSx file system type supports point-in-time instant cloning?
FSx for NetApp ONTAP and FSx for OpenZFS
What are the two parts that make up a record in Kinesis Data Streams when you send the record to the stream?
A partition key and a data blob (up to 1 MB)
How large is a data blob in Kinesis Data Streams?
Up to 1 MB
What is the throughput for Kinesis Data Streams from the producer to the stream?
1 MB per second or 1,000 records per second, per shard.
What are the parts that make up a record in Kinesis Data Streams when you send a record to the consumer?
Partition Key
Sequence Number
Data Blob
What is the throughput for Kinesis Data Streams from the Stream to the Consumer?
2 MB per second per shard, shared across all consumers (standard)
OR
2 MB per second per shard per consumer, using enhanced fan-out.
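Example (a back-of-the-envelope sketch, assuming provisioned mode and standard shared consumers): estimating a shard count from the per-shard limits on these cards. The function name and the sample workload numbers are invented.

```python
import math

WRITE_MB_PER_SHARD = 1.0       # producer to stream, MB per second per shard
WRITE_RECORDS_PER_SHARD = 1000  # producer to stream, records per second per shard
READ_MB_PER_SHARD = 2.0        # stream to consumers, MB per second per shard (shared)


def shards_needed(write_mb_per_s: float, records_per_s: float, read_mb_per_s: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(write_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_per_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 4.5 MB/s in, 2,000 records/s, 6 MB/s read in total.
print(shards_needed(4.5, 2000, 6))  # 5
```

In this sample the write bandwidth is the binding limit, so it sets the shard count.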
What is the max retention in Kinesis?
365 Days
What are the two capacity modes in Kinesis?
Provisioned
On-Demand: capacity scales automatically based on the observed throughput peak over the last 30 days.
What are three flavors of Kinesis producers?
The AWS SDK, the Kinesis Producer Library (KPL), and the Kinesis Agent
Does the PutRecord API call have a batch option?
Yes, PutRecords. Batching multiple records into one call lets you increase throughput.
What error will you get when you exceed the allowed throughput for PutRecords?
ProvisionedThroughputExceededException
What is a good use case for the Kinesis Producer SDK?
Low throughput, higher latency
How can you address a hot shard that causes a ProvisionedThroughputExceededException error?
Exponential backoff
Increase Shards
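Example (a minimal sketch of exponential backoff, not tied to any SDK): computing the wait before each retry after a throttling error. The function name, the base delay, the cap, and the use of random jitter are all assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delays(
    max_retries: int,
    base: float = 0.1,
    cap: float = 5.0,
    jitter: bool = True,
    rng: Optional[random.Random] = None,
) -> list[float]:
    """Return the delay in seconds to wait before each retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        # Double the delay on every attempt, but never exceed the cap.
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # Randomize so that many throttled producers do not retry in lockstep.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5, base=0.1, cap=1.0, jitter=False))
# [0.1, 0.2, 0.4, 0.8, 1.0]
```

Backoff only smooths out retries. If one partition key dominates the traffic, adding shards will not help until the keys are spread more evenly.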
What languages does the KPL support?
C++ and Java
Does the KPL support asynchronous operations?
Yes
Can the KPL compress records?
No. Compression must be implemented by the user.
If a producer uses the KPL, what are the requirements for the consumer?
The records MUST be decoded (de-aggregated) by the consumer, using the KCL or a de-aggregation helper library.
What does batching in the KPL do?
It can aggregate messages into one record that is less than 1MB.
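Example (a simplified illustration of the idea only): greedily packing small messages into batches that stay within a size limit. This is not the real KPL aggregation format, and the function name and the tiny limit used in the demo are made up.

```python
MAX_RECORD_BYTES = 1024 * 1024  # 1 MB limit for a single record


def aggregate(messages: list[bytes], limit: int = MAX_RECORD_BYTES) -> list[list[bytes]]:
    """Group messages, in order, into batches whose total size fits the limit."""
    batches: list[list[bytes]] = []
    current: list[bytes] = []
    current_size = 0
    for msg in messages:
        if len(msg) > limit:
            raise ValueError("a single message exceeds the record size limit")
        # Start a new batch when adding this message would overflow the current one.
        if current and current_size + len(msg) > limit:
            batches.append(current)
            current, current_size = [], 0
        current.append(msg)
        current_size += len(msg)
    if current:
        batches.append(current)
    return batches


# A 10-byte limit keeps the demo readable.
print(aggregate([b"aaaa", b"bbbb", b"cccc", b"dd"], limit=10))
# [[b'aaaa', b'bbbb'], [b'cccc', b'dd']]
```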
How can you control batch sending delays in Kinesis?
The RecordMaxBufferedTime configuration option (default 100 ms).
When is batching in the KPL not a good idea?
When the application cannot tolerate the added latency.
Can Apache Spark, Log4J appenders, Flume, or Kafka be used as a consumer for Kinesis Data Streams?
Yes
What is the throughput for the GetRecords API call?
Up to 10 MB of data or up to 10,000 records per call.
What is the GetRecords API call limit?
Maximum of 5 GetRecords API calls per shard per second. 200ms latency
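Example (simple arithmetic, assuming standard consumers split a shard's limits evenly): how the 5 calls per second and 2 MB per second shard limits divide among several consumers. The function name and the even-split assumption are illustrative.

```python
CALLS_PER_SHARD_PER_SECOND = 5  # GetRecords call limit per shard
READ_MB_PER_SHARD = 2.0         # shared read throughput per shard


def standard_consumer_limits(num_consumers: int) -> dict:
    """Per-consumer polling interval and throughput when sharing one shard."""
    calls_per_second = CALLS_PER_SHARD_PER_SECOND / num_consumers
    return {
        "poll_interval_ms": 1000 / calls_per_second,
        "mb_per_second": READ_MB_PER_SHARD / num_consumers,
    }


print(standard_consumer_limits(1))  # {'poll_interval_ms': 200.0, 'mb_per_second': 2.0}
print(standard_consumer_limits(5))  # {'poll_interval_ms': 1000.0, 'mb_per_second': 0.4}
```

With five consumers on one shard, each can poll only about once per second, which is one reason to consider enhanced fan-out.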
Does the KCL support checkpointing?
Yes. It uses DynamoDB for this.
What does the ExpiredIteratorException indicate in relation to Kinesis Client Library?
That the write capacity (WCU) of the KCL's DynamoDB table needs to be increased to support checkpointing.
Can Lambda read from a Kinesis Data Stream?
Yes
What is Kinesis Enhanced Fan-Out for consumers?
Each consumer gets its own 2 MB per second per shard.
What is the latency with Kinesis Enhanced Fanout for Consumers?
70ms
What is a good use case for standard consumers?
An application that can tolerate 200ms latency.
Can you split shards in Kinesis?
Yes, using the split operation.
When will a split shard be removed?
The old (parent) shard is closed after the split and is deleted once its data has expired.
What is a common root cause of reading data out of order in Kinesis Data Streams?
After a resharding event, the consumer started reading from the child shards before it had fully read the parent shard.
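Example (a sketch, assuming each shard after a split records a single parent ID): ordering shards so that every parent is fully read before its children. The function name and the shard IDs are hypothetical, and merges, which have two parents, are not handled.

```python
from typing import Optional


def read_order(parents: dict[str, Optional[str]]) -> list[str]:
    """Order shard IDs so that each parent comes before its children."""
    ordered: list[str] = []
    visited: set[str] = set()

    def visit(shard_id: str) -> None:
        if shard_id in visited:
            return
        parent = parents.get(shard_id)
        # Read the parent first if it still exists in the stream.
        if parent is not None and parent in parents:
            visit(parent)
        visited.add(shard_id)
        ordered.append(shard_id)

    for shard_id in sorted(parents):
        visit(shard_id)
    return ordered


print(read_order({"child-a": "parent", "child-b": "parent", "parent": None}))
# ['parent', 'child-a', 'child-b']
```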
Can resharding be done in parallel?
No. Only one resharding operation can run at a time, and each one takes a few seconds.
What is a common root cause for duplicates coming from a Kinesis Producer?
Network Timeouts. This can be prevented by embedding a unique record ID in the data to deduplicate on the consumer side.
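Example (a minimal sketch of consumer-side deduplication): dropping records whose embedded ID has already been seen. The "record_id" field name and the in-memory set are assumptions. A real consumer would likely need a durable store that survives restarts.

```python
from typing import Optional


def deduplicate(records: list[dict], seen_ids: Optional[set] = None) -> list[dict]:
    """Return records in order, skipping any whose record_id was already seen."""
    seen = seen_ids if seen_ids is not None else set()
    unique = []
    for record in records:
        record_id = record["record_id"]  # unique ID embedded by the producer
        if record_id in seen:
            continue  # a retry after a network timeout produced this duplicate
        seen.add(record_id)
        unique.append(record)
    return unique


batch = [
    {"record_id": "a1", "value": 10},
    {"record_id": "a1", "value": 10},  # duplicate from a producer retry
    {"record_id": "b2", "value": 20},
]
print(deduplicate(batch))
# [{'record_id': 'a1', 'value': 10}, {'record_id': 'b2', 'value': 20}]
```

Passing the same seen_ids set across calls keeps duplicates out across batches.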
What is a common root cause for duplicates coming from a Kinesis Consumer?
When a worker terminates
Worker instances are added or removed
Shards are merged or Split
The application is deployed
How do you handle data transformation with Amazon Data Firehose?
Using a Lambda function
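Example (a sketch of a transformation handler, assuming incoming records are JSON objects): each record arrives base64-encoded with a recordId, and the handler returns the same recordId with a result status and the re-encoded data. The "processed" field is a made-up transformation.

```python
import base64
import json


def lambda_handler(event, context):
    """Add a flag to each JSON record and hand it back to Firehose."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["processed"] = True  # hypothetical transformation
            data = (json.dumps(payload) + "\n").encode("utf-8")
            result = "Ok"
            encoded = base64.b64encode(data).decode("utf-8")
        except (ValueError, TypeError):
            # Return the record untouched and mark it as failed.
            result = "ProcessingFailed"
            encoded = record["data"]
        output.append({"recordId": record["recordId"], "result": result, "data": encoded})
    return {"records": output}


sample = {"records": [{"recordId": "1", "data": base64.b64encode(b'{"a": 1}').decode()}]}
print(lambda_handler(sample, None))
```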
What are some common destinations for Amazon Data Firehose?
S3, Redshift, Amazon OpenSearch
Can data be backed up to an S3 bucket in Amazon Data Firehose?
Yes. Both failed and all records can be stored in a backup bucket.
Is Amazon Data Firehose real-time?
No. It is near-real-time.
What data formats are supported by Amazon Data Firehose?
CSV, JSON, Parquet, Avro, Raw Text, Binary data.
Can Amazon Data Firehose perform data compression or conversions?
Yes:
Record format conversion from JSON to Parquet or ORC
Compression with gzip or Snappy
What is the Flink framework used for?
Processing data streams
Is Managed Service for Apache Flink serverless?
Yes
What are some common use cases for the Managed Service for Apache Flink?
Streaming ETL
What is RANDOM_CUT_FOREST used for in Kinesis Analytics?
Anomaly detection.
What is Amazon Managed Streaming for Apache Kafka?
It is a fully managed Apache Kafka service. It is an alternative to Kinesis.
What is the maximum size for an Amazon Managed Streaming for Apache Kafka message?
1 MB by default, configurable to a higher value (for example, 10 MB)
Kafka producers send data to what?
Brokers. They act similar to shards in Kinesis.
Can you encrypt data in transit with Amazon MSK?
Yes. This is enabled by default.