Data Ingestion and Storage Flashcards
What are the three types of data?
Structured
Unstructured
Semi-Structured
What is semi-structured data?
Data with some organizing structure, such as tags or key-value pairs, but no rigid schema. Examples: XML, JSON, and log files with varied formats.
What is unstructured data?
Video files, emails, text files with no fixed format.
What are the three properties of data?
Volume
Velocity
Variety
What is meant by the term data variety?
Different types of data formats and sources.
Is a data warehouse good for storing structured information?
Yes
Is a data warehouse good for storing files and images?
No
Is a data warehouse good for OLAP?
Yes
If you have structured, semi-structured, and unstructured data, what is the best way to store it?
A data lake
Do you use ETL or ELT for a Data Warehouse?
ETL
Why is ELT used for a Data Lake?
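The data is loaded in its raw form, and its format is only known once the file is read, so the transformation happens after loading (schema-on-read).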
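A minimal sketch of the load-first, transform-later idea behind ELT, assuming the raw records land in the lake as JSON lines. The sample records and the `read_with_schema` helper are hypothetical and only illustrate the pattern.

```python
import json

# Raw records are loaded as-is (the "L" step); no schema is enforced on write.
raw_lake = [
    '{"user": "ana", "amount": "12.50"}',
    '{"user": "bo", "amount": "3", "coupon": "X1"}',
]

def read_with_schema(lines):
    """Apply a schema at read time (the "T" step): parse and cast each record."""
    for line in lines:
        record = json.loads(line)
        # Keep only the fields this reader cares about and cast the amount.
        yield {"user": record["user"], "amount": float(record["amount"])}

rows = list(read_with_schema(raw_lake))
```

A different reader could apply a different schema to the same raw lines, which is the flexibility a data lake trades for the upfront structure of a warehouse.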
What is more expensive, a data lake or data warehouse?
Usually, a data warehouse
What is an example of an AWS Data Lakehouse?
Amazon S3 with Redshift Spectrum
What is a Data Mesh?
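A decentralized approach to the governance and organization of data, in which individual domain teams own and publish their data as products under shared governance standards.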
What is Avro?
A row-based binary format that stores the data together with its schema.
What is Parquet?
A columnar storage format optimized for analytics.
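A conceptual sketch of row-oriented versus column-oriented layout, using plain Python structures and made-up sample data. It does not read or write real Avro or Parquet files, which add encoding, compression, and metadata on top of this idea.

```python
# The same three records, laid out two ways (hypothetical sample data).
rows = [
    {"id": 1, "region": "us", "sales": 10.0},
    {"id": 2, "region": "eu", "sales": 7.5},
    {"id": 3, "region": "us", "sales": 2.5},
]

# Row-oriented layout (Avro-like): each record is stored together, as above.
# Column-oriented layout (Parquet-like): each field is stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# An aggregate over one field touches a single column in the columnar layout...
total_columnar = sum(columns["sales"])

# ...but has to walk every whole record in the row layout.
total_row = sum(r["sales"] for r in rows)
```

Both totals are the same; the difference is how much data has to be scanned, which is why columnar formats suit analytics queries that read a few fields across many records.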
What is the S3 Key?
The full path of the object: everything after the bucket name, up to and including the file name.
What are the two parts to an S3 Key?
The Prefix and Object name
What is a Prefix?
The path of the file, but not the bucket or file name.
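The key, prefix, and object name can be illustrated with a small helper. This is a sketch under the assumption that the prefix is everything up to and including the last `/`; the function name `split_s3_key` and the sample key are hypothetical.

```python
from typing import Tuple

def split_s3_key(key: str) -> Tuple[str, str]:
    """Split an S3 key into (prefix, object name) at the last '/'."""
    prefix, slash, name = key.rpartition("/")
    # Keep the trailing slash on the prefix; a key with no '/' has an empty prefix.
    return prefix + slash, name

prefix, name = split_s3_key("logs/2024/01/app.log")
```

For the object `s3://example-bucket/logs/2024/01/app.log`, the key is `logs/2024/01/app.log`, the prefix is `logs/2024/01/`, and the object name is `app.log`.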
Is versioning required for S3 replication?
Yes
Will S3 replication work on existing objects?
Not by default. Only objects written after replication is enabled are replicated. Existing objects can be replicated with S3 Batch Replication.
What is Glacier Instant Retrieval?
Low-cost archive storage with millisecond retrieval.
What is the minimum storage duration for Glacier Instant Retrieval?
90 days
What is Glacier Flexible Retrieval?
Formerly called Amazon S3 Glacier. It has three retrieval tiers:
Expedited (1 - 5 min)
Standard (3 - 5 hours)
Bulk (5 - 12 hours)
What are the Glacier Deep Archive retrieval tiers?
Standard (within 12 hours)
Bulk (within 48 hours)
When S3 sends event notifications to SQS, SNS, or Lambda, what kind of access control policy is needed and where is it configured?
A resource-based policy on the target service, e.g., an SQS queue policy.
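As an illustration, a queue policy of this kind might look like the sketch below. The helper name `s3_to_sqs_policy` and the ARNs are placeholders, and the `aws:SourceArn` condition is an added assumption that restricts the permission to one bucket.

```python
import json

def s3_to_sqs_policy(queue_arn: str, bucket_arn: str) -> str:
    """Build a queue policy that lets S3 send messages for one bucket."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The S3 service, not an IAM user or role, is the caller.
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only accept notifications that originate from this bucket.
                "Condition": {"ArnLike": {"aws:SourceArn": bucket_arn}},
            }
        ],
    }
    return json.dumps(policy, indent=2)

# Placeholder ARNs (hypothetical account, queue, and bucket names).
policy_json = s3_to_sqs_policy(
    "arn:aws:sqs:us-east-1:111122223333:example-queue",
    "arn:aws:s3:::example-bucket",
)
```

The policy is attached to the queue itself, which is why no IAM role needs to be given to S3.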
What file sizes are recommended for S3 multi-part uploads?
100 MB or greater. Multipart upload is required for files larger than 5 GB.
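A rough sketch of how a file might be divided into parts for a multipart upload. The `plan_parts` helper and the 100 MiB default part size are illustrative choices; the 5 MiB minimum part size and the 10,000-part limit are S3 limits. No upload is performed here.

```python
import math

MIN_PART_SIZE = 5 * 1024**2   # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000            # S3 maximum number of parts per upload

def plan_parts(file_size: int, part_size: int = 100 * 1024**2):
    """Return (part_number, start_byte, end_byte) tuples covering the file."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    count = math.ceil(file_size / part_size)
    if count > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    # Part numbers start at 1; byte offsets are inclusive.
    return [
        (n + 1, n * part_size, min((n + 1) * part_size, file_size) - 1)
        for n in range(count)
    ]

parts = plan_parts(250 * 1024**2)  # a 250 MiB file in 100 MiB parts
```

Each part can then be uploaded independently and in parallel, and a failed part can be retried without resending the whole file.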
How does S3 Transfer Acceleration work?
It sends data to the nearest edge location, which then forwards it to the bucket over the AWS private network.
What is an S3 Byte Range Fetch?
Allows you to fetch parts of a file. Good for only downloading partial data like headers.
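Byte-range fetches rely on the standard HTTP `Range` header. The sketch below only builds that header; the `range_header` helper is hypothetical, and sending the request to S3 is left out.

```python
def range_header(start: int, length: int) -> dict:
    """Build an HTTP Range header for `length` bytes starting at `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    # Range end offsets are inclusive, hence the -1.
    return {"Range": f"bytes={start}-{start + length - 1}"}

header = range_header(0, 512)  # first 512 bytes, e.g. a file header
```

Several such ranges can also be requested in parallel to speed up the download of a large object.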
What is the difference between AWS SSE-KMS and SSE-C?
SSE-KMS uses a key managed in AWS KMS. SSE-C uses a key that the customer provides with each request, which could come from your own HSM.
Is SSE-S3 enabled by default on S3 buckets?
Yes
Are there rate based limitations that can cause throttling in KMS?
Yes
How can you force encryption in transit?
With a resource (bucket) policy that denies any request where aws:SecureTransport is false.
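One common shape for such a bucket policy is sketched below, using the `aws:SecureTransport` condition key to deny requests that do not arrive over HTTPS. The helper name `deny_insecure_transport` and the bucket name are placeholders.

```python
import json

def deny_insecure_transport(bucket: str) -> str:
    """Build a bucket policy that denies any request not made over HTTPS."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:SecureTransport is false for plain HTTP requests.
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }
    return json.dumps(policy, indent=2)

policy_json = deny_insecure_transport("example-bucket")
```

Because the statement is an explicit deny, it overrides any allow granted elsewhere.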
What are S3 Access Points?
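Named network endpoints attached to a bucket, each with its own access policy. They are often used to scope access to specific prefixes in the bucket.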
How are S3 Access Point permissions managed?
Through Access Point Policies. These are resource-based policies.
Can S3 access points be private?
Yes, by setting the access point's network origin to a VPC, so it only accepts requests from within that VPC.
Do VPC endpoints have resource based policies?
Yes.
What is an S3 Object Lambda?
It is used to change the object before it is retrieved by the calling application.
What protocol does AWS EFS use?
NFS
Is EFS more expensive than EBS?
Yes, around 3x.