Analytics Flashcards
What is AWS Glue?
A serverless ETL service for discovering, preparing, and combining data for analytics, machine learning, and application development.
What are the key components of AWS Glue?
Data Catalog, Crawler, ETL Engine, Studio, Data Quality, DataBrew, Workflows
What is the purpose of the Glue Data Catalog?
Stores table definitions and schemas for data located in various sources like S3, RDS, and DynamoDB.
What does a Glue Crawler do?
Scans data stores to infer schemas and create table definitions in the Data Catalog.
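To make the idea of schema inference concrete, here is a minimal plain-Python sketch. It is not the Glue crawler or its API: the function name `infer_schema`, the sample records, and the fall-back-to-string rule are all assumptions chosen for illustration.

```python
from typing import Any, Dict, List


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Infer a column -> type-name mapping from sample records.

    Illustrative only: a real crawler relies on classifiers and writes
    its result to the Data Catalog rather than returning a dict.
    """
    schema: Dict[str, str] = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # a null carries no type information
            type_name = type(value).__name__
            seen = schema.get(column)
            if seen is None:
                schema[column] = type_name
            elif seen != type_name:
                # Conflicting types across records: fall back to string
                schema[column] = "str"
    return schema


# Hypothetical sample data standing in for files in a data store
sample = [
    {"id": 1, "name": "a", "price": 9.5},
    {"id": 2, "name": None, "price": 3.0},
]
table_definition = infer_schema(sample)
```

The resulting mapping plays the role of a table definition: one entry per column, with a type inferred from the data rather than declared up front.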
What programming languages are supported for Glue ETL jobs?
Scala and Python
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords with a schema, similar to a Spark DataFrame but with additional ETL capabilities.
What are some common transformations in Glue ETL?
DropFields, DropNullFields, Filter, Join, Map, FindMatches ML, format conversions
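The behavior of a few of these transformations can be sketched on plain lists of dictionaries. This is an analogy only, not the awsglue library: the helper names below are hypothetical, and `drop_null_fields` assumes the "null in every record" interpretation of DropNullFields.

```python
from typing import Any, Callable, Dict, Iterable, List

Record = Dict[str, Any]


def drop_fields(records: List[Record], fields: Iterable[str]) -> List[Record]:
    """Remove the named fields from every record (DropFields analog)."""
    unwanted = set(fields)
    return [{k: v for k, v in r.items() if k not in unwanted} for r in records]


def drop_null_fields(records: List[Record]) -> List[Record]:
    """Remove fields that are null in every record (DropNullFields analog)."""
    all_keys = {k for r in records for k in r}
    null_keys = {k for k in all_keys if all(r.get(k) is None for r in records)}
    return drop_fields(records, null_keys)


def filter_records(records: List[Record],
                   predicate: Callable[[Record], bool]) -> List[Record]:
    """Keep only records for which the predicate holds (Filter analog)."""
    return [r for r in records if predicate(r)]


def map_records(records: List[Record],
                fn: Callable[[Record], Record]) -> List[Record]:
    """Apply a function to a copy of each record (Map analog)."""
    return [fn(dict(r)) for r in records]


# Hypothetical input rows
rows = [
    {"id": 1, "tmp": "x", "note": None, "amount": 10},
    {"id": 2, "tmp": "y", "note": None, "amount": -5},
]
cleaned = drop_null_fields(drop_fields(rows, ["tmp"]))
positive = filter_records(cleaned, lambda r: r["amount"] > 0)
result = map_records(positive, lambda r: {**r, "amount_cents": r["amount"] * 100})
```

Chaining the helpers mirrors how transformations are usually composed in an ETL script: each step takes a dataset and returns a new one.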
How can you modify the Data Catalog using Glue ETL scripts?
By using options like enableUpdateCatalog, partitionKeys, and updateBehavior to add partitions, update schemas, or create new tables.
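A small sketch of how these options might be assembled in a Python job script. The helper `catalog_update_options` is hypothetical, and the database and table names in the trailing comment are placeholders. The set of accepted `updateBehavior` values is stated from memory of the Glue documentation and should be verified there.

```python
from typing import Dict, List, Optional

# Assumed set of accepted values; verify against the Glue documentation
VALID_UPDATE_BEHAVIORS = {"UPDATE_IN_DATABASE", "LOG"}


def catalog_update_options(
    partition_keys: Optional[List[str]] = None,
    update_behavior: str = "UPDATE_IN_DATABASE",
) -> Dict[str, object]:
    """Build the options dict that asks a Glue sink to update the Data Catalog."""
    if update_behavior not in VALID_UPDATE_BEHAVIORS:
        raise ValueError(f"unsupported updateBehavior: {update_behavior!r}")
    options: Dict[str, object] = {
        "enableUpdateCatalog": True,
        "updateBehavior": update_behavior,
    }
    if partition_keys:
        options["partitionKeys"] = list(partition_keys)
    return options


options = catalog_update_options(["year", "month"])

# Inside a Glue job, a dict like this is typically passed to the writer, e.g.:
#   glue_context.write_dynamic_frame_from_catalog(
#       frame=frame,
#       database="my_db",          # placeholder name
#       table_name="my_table",     # placeholder name
#       additional_options=options,
#   )
```

Keeping the options in one validated dict makes it harder to misspell a key or pass an unsupported behavior value.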
What are AWS Glue Development Endpoints?
Environments for interactively developing and testing Glue ETL scripts, typically through a connected notebook or IDE.
What are Glue Job Bookmarks used for?
Persisting state from a job run to prevent reprocessing of old data and ensure incremental processing.
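The idea behind a bookmark can be sketched in plain Python: remember the newest item already processed, and skip anything at or before it on the next run. This is a conceptual model, not Glue's actual bookmark implementation. The function name, the `last_processed` key, and the use of modification timestamps are assumptions.

```python
from typing import Dict, List, Tuple


def run_incremental(
    objects: List[Tuple[str, int]],
    bookmark: Dict[str, int],
) -> List[str]:
    """Return keys of objects newer than the bookmark, then advance it.

    objects:  (key, last_modified_epoch_seconds) pairs
    bookmark: mutable state persisted between runs
    """
    last = bookmark.get("last_processed", 0)
    new_items = [(key, ts) for key, ts in objects if ts > last]
    if new_items:
        # Persist the newest timestamp seen so later runs skip these items
        bookmark["last_processed"] = max(ts for _, ts in new_items)
    return [key for key, _ in new_items]


bookmark: Dict[str, int] = {}

# First run: everything is new
first_run = run_incremental([("a.csv", 100), ("b.csv", 200)], bookmark)

# Second run: only the object added since the first run is processed
second_run = run_incremental(
    [("a.csv", 100), ("b.csv", 200), ("c.csv", 300)], bookmark
)
```

In this sketch the state lives in a dict. In a managed service the equivalent state is stored for you between job runs.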
How are AWS Glue costs calculated?
Crawlers and ETL jobs are billed per DPU-hour, metered by the second, with additional charges for Data Catalog storage beyond the free tier and for development endpoints.
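Glue meters compute in data processing units (DPUs), so a rough job-cost estimate is DPUs times billed hours times the hourly rate. The sketch below works through that arithmetic. The function name and the `minimum_seconds` parameter are hypothetical, and the 0.44 rate is a placeholder assumption, not a quoted price. Check current regional pricing before relying on it.

```python
def glue_job_cost(
    dpus: float,
    runtime_seconds: int,
    rate_per_dpu_hour: float,
    minimum_seconds: int = 0,
) -> float:
    """Estimate the cost of one run: DPUs x billed hours x hourly rate.

    minimum_seconds models a minimum billing duration, if one applies.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour


# 10 DPUs for 15 minutes at an assumed placeholder rate of 0.44 per DPU-hour
cost = glue_job_cost(dpus=10, runtime_seconds=900, rate_per_dpu_hour=0.44)
```

With these inputs the job is billed for 2.5 DPU-hours, so the estimate comes to roughly 1.10 in whatever currency the rate is expressed in.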
What is AWS Glue Studio?
A visual interface for creating, running, and monitoring ETL jobs with little or no hand-written code.
What is the purpose of AWS Glue Data Quality?
To define and enforce data quality rules within Glue jobs using DQDL.
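A DQDL ruleset is text of the form `Rules = [ ... ]`. As a rough illustration, the Python sketch below assembles one from individual rule strings. The helper `build_ruleset` and the column name `order_id` are hypothetical, and the three rules shown are assumed to be valid DQDL, so check them against the DQDL reference before use.

```python
from typing import List


def build_ruleset(rules: List[str]) -> str:
    """Join individual rule strings into a 'Rules = [ ... ]' ruleset."""
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n    ".join(rules)
    return "Rules = [\n    " + body + "\n]"


# Hypothetical rules for a table with an order_id column
ruleset = build_ruleset([
    'IsComplete "order_id"',   # no nulls in the column
    'IsUnique "order_id"',     # no duplicate values
    "RowCount > 0",            # the dataset is not empty
])
print(ruleset)
```

Building the ruleset as a string keeps the rules under version control alongside the job script that evaluates them.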
What is AWS Glue DataBrew?
A visual data preparation tool with over 250 pre-built transformations for cleaning and normalizing data.
What is AWS Lake Formation?
A service that simplifies the setup and management of secure data lakes.