Data Ingestion and Storage Flashcards
What are the three types of data?
Structured
Unstructured
Semi-Structured
What is semi-structured data?
Data with some organizing structure, such as tags or keys, but no rigid schema. Examples: XML, JSON, and log files with varied formats.
What is unstructured data?
Video files, emails, text files with no fixed format.
What are the three properties of data?
Volume
Velocity
Variety
What is meant by the term data variety?
Different types of data formats and sources.
Is a data warehouse good for storing structured information?
Yes
Is a data warehouse good for storing files and images?
No
Is a data warehouse good for OLAP?
Yes
If you have structured, semi-structured, and unstructured data, what is the best way to store it?
A data lake
Do you use ETL or ELT for a Data Warehouse?
ETL
Why is ELT used for a Data Lake?
Raw data is loaded as-is, and its format is only known once the file is read, so the transformation happens after loading (schema-on-read).
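As a rough illustration of why a data lake transforms after loading, the sketch below stores raw records untouched and applies a schema only when they are read. The sample records, the `lake` list, and the `read_with_schema` helper are all hypothetical stand-ins, not part of any AWS API.

```python
import json

# Hypothetical raw events with differing shapes, as they might land in a data lake.
raw_lines = [
    '{"user": "ana", "amount": "12.50"}',
    '{"user": "bo", "amount": 7, "coupon": "SPRING"}',
]

# Load: keep the raw lines exactly as received (no schema enforced on write).
lake = list(raw_lines)


def read_with_schema(lines):
    """Transform at read time: parse each record and project it onto a schema."""
    for line in lines:
        record = json.loads(line)
        # Keep only the fields the reader cares about and normalize their types.
        yield {"user": str(record["user"]), "amount": float(record["amount"])}


rows = list(read_with_schema(lake))
print(rows)
```

In an ETL flow the same normalization would run before the data is written, so the warehouse only ever holds records that already match its schema.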
What is more expensive, a data lake or data warehouse?
Usually, a data warehouse
What is an example of an AWS Data Lakehouse?
Amazon S3 with Redshift Spectrum
What is a Data Mesh?
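An approach to the governance and organization of data in which individual domain teams own and manage their data as products, under shared governance standards.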
What is Avro?
A binary, row-based format that stores the data together with its schema.
What is Parquet?
A columnar storage format optimized for analytics.
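To make the row versus column distinction concrete, here is a conceptual sketch using plain Python structures. It does not read or write real Avro or Parquet files; the sample records and variable names are made up for illustration.

```python
# Row-oriented layout (the idea behind Avro): each record is stored together.
rows = [
    {"id": 1, "region": "eu", "sales": 100},
    {"id": 2, "region": "us", "sales": 250},
    {"id": 3, "region": "eu", "sales": 75},
]

# Column-oriented layout (the idea behind Parquet): each column is stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytic query over one field must visit every record in the row layout...
total_from_rows = sum(row["sales"] for row in rows)

# ...but only needs to touch a single list in the column layout.
total_from_columns = sum(columns["sales"])

print(total_from_rows, total_from_columns)
```

Reading one column without scanning the others is the main reason columnar formats suit analytics, while row formats suit writing or streaming whole records.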
What is the S3 Key?
The full path of the object: everything after the bucket name, up to and including the file name.
What are the two parts to an S3 Key?
The Prefix and Object name
What is a Prefix?
The path of the file, but not the bucket or file name.
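The relationship between bucket, key, prefix, and object name can be sketched with plain string handling. The `split_s3_uri` helper and the example URI are hypothetical and only illustrate the naming; they are not part of any AWS SDK.

```python
def split_s3_uri(uri: str) -> dict:
    """Split an s3:// URI into bucket, key, prefix, and object name."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError("expected an s3:// URI")
    # Everything before the first "/" is the bucket; the rest is the key.
    bucket, _, key = uri[len(scheme):].partition("/")
    # The prefix is the part of the key up to and including the last "/".
    prefix, sep, object_name = key.rpartition("/")
    return {
        "bucket": bucket,
        "key": key,
        "prefix": prefix + sep,
        "object_name": object_name,
    }


parts = split_s3_uri("s3://my-bucket/logs/2024/01/app.log")
print(parts)
```

For this example the key is `logs/2024/01/app.log`, the prefix is `logs/2024/01/`, and the object name is `app.log`.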
Is versioning required for S3 replication?
Yes
Will S3 replication work on existing objects?
Not by default. Only new objects are replicated once replication is enabled; existing objects can be replicated with S3 Batch Replication.
What is Glacier Instant Retrieval?
Low-cost archive storage with millisecond retrieval.
What is the minimum storage duration for Glacier Instant Retrieval?
90 days
What is Glacier Flexible Retrieval?
Formerly named S3 Glacier. It has three retrieval tiers:
Expedited (1 - 5 min)
Standard (3 - 5 hours)
Bulk (5 - 12 hours)