Storage and Data Management: 22% (Redshift, S3, Lake Formation, Glue Data Catalog, HDFS, EMRFS) Flashcards
Be able to: a) determine the operational characteristics of a storage solution for analytics; b) determine data access and retrieval patterns; c) select an appropriate data layout, schema, structure, and format; d) define a data lifecycle based on usage patterns and business requirements; e) determine an appropriate system for cataloging data and managing metadata
Which services are appropriate for building data lakes on AWS?
S3, Lake Formation
Which services are appropriate for building data warehouses on AWS?
Redshift
Which storage service is appropriate for highly structured data serving as a single point of truth?
Redshift
Name three roles that Lake Formation fills
a) organising and curating ingested data
b) securing lake data
c) orchestrating transformation jobs with other services
What sort of data can be stored in an S3 data lake: structured, semi-structured, or unstructured?
All three
Is Lake Formation used to create ETL operations?
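No. Lake Formation relies on AWS Glue to run ETL jobs; it organises, secures and orchestrates rather than performing the transformations itself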
Name three user-defined components of an S3 object URL
a) region
b) bucket name
c) object key
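As an illustration, the three components can be combined into a virtual-hosted-style object URL. This is a minimal sketch: `s3_object_url` is a hypothetical helper, and the bucket, region and key values are made-up placeholders.

```python
from urllib.parse import quote


def s3_object_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style S3 URL from its three user-defined parts."""
    # Percent-encode the key but keep '/' so the prefix structure stays readable.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


# Placeholder values, chosen only for illustration.
url = s3_object_url("example-bucket", "eu-west-1", "raw/2024/events 01.json")
print(url)
```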
Is Redshift a row-oriented or columnar database?
Columnar. Redshift is a relational (SQL) data warehouse that stores its data by column rather than by row
Name the key difference between columnar and row-oriented databases
Row-oriented databases are optimised for fast retrieval of whole rows, typically for transactional applications
Columnar databases are optimised for fast retrieval of individual columns, typically for analytical applications
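The difference in layout can be sketched with plain Python structures. This is only a toy model of the idea, not how any particular database stores data; the table and its values are invented for illustration.

```python
# Three records of a hypothetical sales table.
rows = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
]

# Row-oriented layout: each record is kept together,
# so reading one complete order is a single lookup.
order_2 = rows[1]

# Column-oriented layout: the values of each column are kept together.
columns = {name: [record[name] for record in rows] for name in rows[0]}

# An analytical aggregate only needs to read the one column it uses.
total_amount = sum(columns["amount"])

print(order_2)
print(total_amount)
```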
Name two Apache wide-column (column-family) databases that can be hosted on AWS
Cassandra and HBase
What is the fastest way to load data into Redshift?
Use the COPY command to bulk-load multiple compressed files from S3 in parallel
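A bulk load of this kind might look like the statement built below. This is a sketch under assumptions: `build_copy_statement` is a hypothetical helper, the table name, S3 prefix and IAM role ARN are placeholders, and the files are assumed to be gzip-compressed and pipe-delimited.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that loads every file under an S3 prefix."""
    # Illustration only: the values are interpolated as-is, so use trusted inputs.
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"        # a key prefix matches all files that start with it
        f"IAM_ROLE '{iam_role_arn}'\n"
        "GZIP\n"                       # assumes the input files are gzip-compressed
        "DELIMITER '|';"               # assumes pipe-delimited text
    )


# Placeholder names, not real resources.
copy_sql = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/part-",
    "arn:aws:iam::123456789012:role/ExampleRedshiftRole",
)
print(copy_sql)
```

Because the FROM clause is a prefix, files such as `part-0000.gz` and `part-0001.gz` would all be picked up by the one statement.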
How can a manifest file be used with the Redshift COPY command?
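A manifest is a JSON file that lists the exact S3 objects to load, so COPY loads only those files (which may sit in different buckets or prefixes) and can fail if a file marked as mandatory is missing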
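A minimal sketch of generating a manifest and the matching COPY statement follows. `build_manifest` is a hypothetical helper, and the bucket names, object keys, table name and IAM role ARN are placeholders; the files are assumed to be gzip-compressed.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return a COPY manifest as a dict with one entry per S3 object."""
    return {"entries": [{"url": url, "mandatory": mandatory} for url in urls]}


# Placeholder objects; a manifest may reference more than one bucket.
manifest = build_manifest([
    "s3://example-bucket-a/sales/2024-01.gz",
    "s3://example-bucket-b/sales/2024-02.gz",
])
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)

# COPY is pointed at the manifest object rather than at a data prefix,
# and the MANIFEST keyword tells Redshift to treat that object as a file list.
copy_sql = (
    "COPY sales\n"
    "FROM 's3://example-bucket-a/manifests/sales.manifest'\n"
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftRole'\n"
    "GZIP\n"
    "MANIFEST;"
)
print(copy_sql)
```

The manifest JSON would be uploaded to S3 at the path named in the FROM clause before the COPY statement is run.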