AWS Glue Flashcards

1
Q

What is AWS Glue?

A

A fully managed ETL service. It consists of a Spark ETL engine, a central metadata repository (the Glue Data Catalog), and a flexible scheduler.

2
Q

Why use AWS Glue?

A

AWS Glue offers a fully managed, serverless ETL tool. This removes the infrastructure overhead and lowers the barrier to entry when you need an ETL service in AWS.

3
Q

What is Glue Data Catalog?

A

It is a persistent Metadata Store.

It is a managed service that lets you store, annotate, and share metadata which can be used to query and transform data.

There is one AWS Glue Data Catalog per AWS account in each AWS Region.

Identity and Access Management (IAM) policies control access.

Can be used for data governance.

4
Q

What kinds of metadata does the Glue Data Catalog store?

A

Data Location
Schema
Data Types
Data Classification

5
Q

What are AWS Glue Databases?

A

A set of associated Data Catalog table definitions organized into a logical group.

For example, the tables that describe the data under one S3 folder (such as a folder called raw_data) can be grouped into a single AWS Glue Database.

6
Q

What is an AWS Glue Table?

A

The metadata definition that represents your data. It stores things like:
- schema (column name, data type, …)

Tables belong to AWS Glue Databases, which in turn belong to the Glue Data Catalog.

7
Q

What is an AWS Glue Crawler?

A

A program that connects to a data store (source or target), works through a prioritized list of classifiers to determine the schema of your data, and then creates metadata tables in the AWS Glue Data Catalog.
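
The "prioritized list of classifiers" idea can be sketched in plain Python. This is only an illustration of the control flow, not how Glue implements it: the functions classify_json, classify_csv and detect_format, and the "unknown" fallback label, are hypothetical names made up for this card.

```python
import json

def classify_json(sample):
    """Hypothetical classifier: match if the sample parses as a JSON object or array."""
    try:
        parsed = json.loads(sample)
    except ValueError:
        return None
    return "json" if isinstance(parsed, (dict, list)) else None

def classify_csv(sample):
    """Hypothetical classifier: match if the first line contains a comma."""
    lines = sample.splitlines()
    return "csv" if lines and "," in lines[0] else None

# Classifiers are tried in priority order; the first one that matches wins.
CLASSIFIERS = [classify_json, classify_csv]

def detect_format(sample):
    for classifier in CLASSIFIERS:
        result = classifier(sample)
        if result is not None:
            return result
    return "unknown"  # assumed fallback label when nothing matches

print(detect_format('{"id": 1}'))       # json
print(detect_format("id,name\n1,Ann"))  # csv
```

The order of the list matters: a classifier placed earlier gets the first chance to claim the data.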

8
Q

What are Partitions?

A

Folders on S3, where the data is stored, are physical entities. They are mapped to partitions, which are logical entities, i.e. columns in the Glue table. For example:

s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2
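
To make the folder-to-column mapping concrete, here is a minimal Python sketch that reads the key=value segments out of a path like the ones above. The helper name parse_partitions is made up for illustration; it is not part of any AWS API.

```python
def parse_partitions(s3_uri):
    """Return the key=value path segments of an S3 URI as a dict."""
    # Drop the scheme, then look at each path segment in turn.
    path = s3_uri.split("://", 1)[-1]
    partitions = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions

print(parse_partitions("s3://sales/year=2019/month=Jan/day=1"))
# {'year': '2019', 'month': 'Jan', 'day': '1'}
```

Each key (year, month, day) corresponds to a partition column, and each folder supplies the value of that column for the data stored under it.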

9
Q

What are AWS Glue Connections?

A

A Data Catalog object that contains the properties that are required to connect to a particular data store.

10
Q

What is AWS Glue ETL?

A

AWS Glue ETL supports extracting data from various sources, transforming it to meet your business needs, and loading it into a destination of your choice.

11
Q

What is the AWS Glue ETL engine?

A

An Apache Spark engine that distributes big data workloads across worker nodes.

12
Q

What are AWS Glue DPUs?

A

A DPU (Data Processing Unit) is the unit of compute capacity in AWS Glue. 1 DPU is equivalent to 4 vCPUs and 16 GB of memory.
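
As a quick worked example of that ratio, the sketch below converts a DPU count into total vCPUs and memory. The function name dpu_capacity is made up for illustration.

```python
VCPUS_PER_DPU = 4
MEMORY_GB_PER_DPU = 16

def dpu_capacity(dpus):
    """Total vCPUs and memory (GB) for a given number of DPUs."""
    return {
        "vcpus": dpus * VCPUS_PER_DPU,
        "memory_gb": dpus * MEMORY_GB_PER_DPU,
    }

print(dpu_capacity(10))  # {'vcpus': 40, 'memory_gb': 160}
```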

13
Q

What are AWS Glue job bookmarks?

A

Job bookmarks track data that has already been processed during a previous run of an ETL job by persisting state information from the job run.
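
The idea of persisting state between runs can be sketched in a few lines of Python. This is a conceptual illustration only: the run_job function and the JSON list of processed keys are assumptions made for this card and do not reflect how Glue actually stores bookmark state.

```python
import json

def run_job(all_keys, state_json="[]"):
    """Process only keys not seen in an earlier run.

    Returns the keys processed in this run and the updated state to persist.
    """
    seen = set(json.loads(state_json))
    new_keys = [key for key in all_keys if key not in seen]
    new_state = json.dumps(sorted(seen | set(new_keys)))
    return new_keys, new_state

# First run: nothing has been processed yet, so both files are picked up.
processed, state = run_job(["day=1.csv", "day=2.csv"])
print(processed)  # ['day=1.csv', 'day=2.csv']

# Second run: only the file that arrived since the last run is processed.
processed, state = run_job(["day=1.csv", "day=2.csv", "day=3.csv"], state)
print(processed)  # ['day=3.csv']
```

Because the state is carried from one run to the next, rerunning the job over the same input does not reprocess old data.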

14
Q

What is AWS Glue Data Quality?

A

Monitors the quality of your data. Rules are written in the Data Quality Definition Language (DQDL), and the feature is built on the open-source Deequ framework.

15
Q

What are AWS Glue Triggers?

A

A trigger initiates an ETL job. Triggers can fire at a scheduled time or in response to an event.

16
Q

What is AWS Glue DataBrew?

A

It is a visual data preparation tool that makes it easier for data analysts and data scientists to clean and normalize data.
