AWS Glue Flashcards
What is AWS Glue?
It is a fully managed ETL service. It consists of a Spark ETL engine, a central metadata repository (the Glue Data Catalog), and a flexible scheduler.
Why use AWS Glue?
AWS Glue offers a fully managed, serverless ETL tool. This removes the operational overhead and the barriers to entry when you need an ETL service in AWS.
What is Glue Data Catalog?
It is a persistent Metadata Store.
It is a managed service that lets you store, annotate, and share metadata which can be used to query and transform data.
There is one AWS Glue Data Catalog per AWS account in each AWS Region.
Identity and Access Management (IAM) policies control access.
Can be used for data governance.
What kinds of metadata does the Glue Data Catalog store?
Data Location
Schema
Data Types
Data Classification
What are AWS Glue Databases?
A set of associated Data Catalog table definitions organized into a logical group.
For example, the tables describing the data in an S3 folder called raw_data could be grouped into one AWS Glue Database.
What is an AWS Glue Table?
The metadata definition that represents your data. It saves things like:
- schema (column name, data type, …)
Tables are part of AWS Glue Databases, which are part of the Glue Data Catalog.
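To make the table idea concrete, here is a minimal sketch of a table definition as a Python dictionary. It follows the shape of the TableInput argument accepted by boto3's glue.create_table call; the table name, columns, and S3 location are all assumptions invented for this example.

```python
# Hypothetical table definition; every name and the S3 location are made up.
table_input = {
    "Name": "sales",
    "TableType": "EXTERNAL_TABLE",
    "StorageDescriptor": {
        "Location": "s3://sales/",
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
    },
    # Partition columns are listed separately from the regular columns.
    "PartitionKeys": [
        {"Name": "year", "Type": "string"},
        {"Name": "month", "Type": "string"},
        {"Name": "day", "Type": "string"},
    ],
}


def column_names(table):
    """Return regular column names followed by partition column names."""
    regular = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
    partition = [p["Name"] for p in table["PartitionKeys"]]
    return regular + partition
```

With AWS credentials configured, a dictionary like this could be passed as boto3.client("glue").create_table(DatabaseName="raw_data", TableInput=table_input).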
What is an AWS Glue Crawler?
A program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the AWS Glue Data Catalog.
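As a rough illustration of the "prioritized list of classifiers" idea, the sketch below tries classifier functions in order and uses the first one that recognizes a data sample. The function names, the priority order, and the detection logic are assumptions made for this example, not the crawler's actual implementation.

```python
import csv
import io
import json


def classify_json(sample):
    """Return a schema dict if the first line parses as a JSON object, else None."""
    try:
        record = json.loads(sample.splitlines()[0])
    except (ValueError, IndexError):
        return None
    if not isinstance(record, dict):
        return None
    return {
        "classification": "json",
        "columns": {key: type(value).__name__ for key, value in record.items()},
    }


def classify_csv(sample):
    """Return a schema dict if the sample looks like CSV with a header row, else None."""
    rows = list(csv.reader(io.StringIO(sample)))
    if len(rows) < 2 or len(rows[0]) < 2:
        return None
    return {
        "classification": "csv",
        "columns": {name: "string" for name in rows[0]},
    }


# Hypothetical priority order: earlier classifiers win.
CLASSIFIERS = [classify_json, classify_csv]


def infer_schema(sample):
    """Run the classifiers in priority order and return the first match."""
    for classifier in CLASSIFIERS:
        schema = classifier(sample)
        if schema is not None:
            return schema
    return {"classification": "UNKNOWN", "columns": {}}
```

For instance, infer_schema("id,name\n1,a\n") falls through the JSON classifier and is recognized by the CSV classifier.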
What are Partitions?
Partitions are logical entities: columns in the Glue table. The S3 folders where the data is stored, which are physical entities, are mapped to these partition columns. For example:
s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2
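The mapping from folder names to partition columns can be sketched in a few lines of Python. The helper name parse_partitions is an assumption for this example; it simply treats every key=value path segment as one partition column.

```python
def parse_partitions(s3_path):
    """Extract key=value path segments from an S3 path as partition columns."""
    partitions = {}
    for segment in s3_path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions


print(parse_partitions("s3://sales/year=2019/month=Jan/day=1"))
# {'year': '2019', 'month': 'Jan', 'day': '1'}
```

Each key (year, month, day) becomes a column in the table, and each folder supplies that column's value for the files stored under it.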
What are AWS Glue Connections?
A Data Catalog object that contains the properties that are required to connect to a particular data store.
What is AWS Glue ETL?
AWS Glue ETL supports extracting data from various sources, transforming it to meet your business needs, and loading it into a destination of your choice.
What is the AWS Glue ETL engine?
An Apache Spark engine that distributes big data workloads across worker nodes.
What are AWS Glue DPUs?
A DPU (Data Processing Unit) is the unit of compute capacity for Glue jobs. 1 DPU is equivalent to 4 vCPUs and 16 GB of memory.
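The DPU arithmetic is easy to check with a small helper. The function name dpu_capacity is an assumption for this example; it only applies the 4 vCPU and 16 GB figures stated above.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16


def dpu_capacity(dpus):
    """Return the total vCPUs and memory (GB) for a given number of DPUs."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }


print(dpu_capacity(10))
# {'vcpus': 40, 'memory_gb': 160}
```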
What are AWS Glue job bookmarks?
Job bookmarks track data that has already been processed during a previous run of an ETL job by persisting state information from the job run.
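The idea behind bookmarks can be sketched conceptually: keep a record of what was processed, and skip it on the next run. The snippet below is a simplified illustration with a hypothetical run_job function and a plain list as the persisted state; it is not how Glue stores bookmark state internally.

```python
def run_job(available_files, bookmark):
    """Process only files not recorded in the bookmark.

    Returns the files processed in this run and the updated bookmark.
    """
    already_done = set(bookmark)
    to_process = [f for f in available_files if f not in already_done]
    new_bookmark = sorted(already_done.union(to_process))
    return to_process, new_bookmark


# First run: nothing has been processed yet.
processed, state = run_job(["day=1.csv", "day=2.csv"], [])
# Second run: only the newly arrived file is picked up.
processed_again, state = run_job(["day=1.csv", "day=2.csv", "day=3.csv"], state)
```

After the second call, processed_again contains only day=3.csv, because the other two files were recorded in the state returned by the first run.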
What is AWS Glue Data Quality?
It monitors the quality of your data using rules written in the Data Quality Definition Language (DQDL). It is built on the open-source Deequ framework.
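A DQDL ruleset is plain text of the form Rules = [ ... ]. As a small sketch, the hypothetical helper below assembles such a ruleset string from individual rules; the column name order_id is an assumption for this example, and the rules shown (IsComplete, IsUnique, RowCount) are only a sample of the rule types.

```python
def build_ruleset(rules):
    """Join individual DQDL rules into a single ruleset string."""
    body = ",\n".join("    " + rule for rule in rules)
    return "Rules = [\n" + body + "\n]"


ruleset = build_ruleset([
    'IsComplete "order_id"',  # no missing values in the column
    'IsUnique "order_id"',    # no duplicate values in the column
    "RowCount > 0",           # the dataset is not empty
])
print(ruleset)
```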
What are AWS Glue Triggers?
A trigger initiates an ETL job. Triggers can be defined based on a scheduled time or an event.
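As an example of a scheduled trigger, the sketch below builds the keyword arguments for boto3's glue.create_trigger call without contacting AWS. The helper name and the trigger and job names are assumptions for this example.

```python
def scheduled_trigger_request(name, job_name, cron_expression):
    """Build keyword arguments for a scheduled Glue trigger."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        # Glue schedules use a six-field cron expression wrapped in cron(...).
        "Schedule": f"cron({cron_expression})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# Hypothetical names: run the job "sales_etl" every day at 12:00 UTC.
request = scheduled_trigger_request("daily_sales", "sales_etl", "0 12 * * ? *")
```

With AWS credentials configured, the request could then be sent with boto3.client("glue").create_trigger(**request).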