Glue Flashcards
What is a Glue crawler?
An AWS Glue crawler is a program that connects to a data source, extracts metadata (like the structure or schema of the data), and then writes this metadata to the AWS Glue Data Catalog. Essentially, it’s a tool that automates the process of gathering metadata from various data sources.
Crawlers can access data stored in various formats in locations like Amazon S3, Amazon RDS, Amazon DynamoDB, and other JDBC-accessible databases. They scan the data in these sources and infer a schema that can be used for querying and processing the data.
Furthermore, crawlers can be scheduled to run periodically. This means they can automatically keep the AWS Glue Data Catalog up-to-date if new data is added to the data stores, or if the schema of the data changes.
In short, Glue crawlers automate the process of discovering and cataloging data, which makes the data easier to analyze and process.
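For illustration, here is a minimal sketch of defining and starting a crawler with boto3; the crawler name, IAM role, database, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler over a hypothetical S3 prefix. The schedule is a cron
# expression, so the Data Catalog stays current as new data arrives.
glue.create_crawler(
    Name="sales-crawler",                            # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 3 * * ? *)",                    # daily at 03:00 UTC
)

# Run it on demand; the inferred schema lands in the Glue Data Catalog.
glue.start_crawler(Name="sales-crawler")
```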
What is Glue built on?
It’s based on Apache Spark
Can Glue integrate with Hive?
Yes
Can Hive metastore be imported into Glue Data Catalog?
Yes
Which languages can be used in Glue for coding ETL processes?
Scala and Python
Can Glue ETL be event-driven?
Yes. A Glue job can be started in response to events, for example from a Lambda function triggered by an S3 upload, as sketched below.
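A minimal sketch, assuming a Lambda function subscribed to S3 object-created events; the job name and argument are hypothetical:

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Start a Glue job for each newly uploaded S3 object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # "etl-job" is a hypothetical job name; the argument is read in
        # the job script with getResolvedOptions.
        glue.start_job_run(
            JobName="etl-job",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
```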
How do you improve the performance of the underlying Spark jobs in Glue?
You can increase the number of DPUs (data processing units) allocated to the job, which adds Spark workers and parallelism.
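For example, capacity can be set per run; the job name and sizes below are illustrative, and each G.1X worker corresponds to 1 DPU:

```python
import boto3

glue = boto3.client("glue")

# Request 10 G.1X workers (10 DPUs) for this run instead of the job default.
glue.start_job_run(
    JobName="etl-job",       # hypothetical job name
    WorkerType="G.1X",
    NumberOfWorkers=10,
)
```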
How can you identify performance issues in Glue jobs?
Use CloudWatch metrics. With job metrics enabled, Glue publishes Spark driver and executor metrics (memory, CPU, data movement) to CloudWatch, which you can use to profile a job.
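A hedged sketch of enabling metrics for a run via Glue’s documented --enable-metrics special parameter; the job name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Start a run with job metrics enabled; Glue then publishes Spark
# driver/executor metrics to CloudWatch under the "Glue" namespace.
glue.start_job_run(
    JobName="etl-job",                       # hypothetical job name
    Arguments={"--enable-metrics": "true"},  # Glue special parameter
)
```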
Can Glue UI-generated code be modified?
Yes. The generated script can be edited and customized, although a job whose script has been edited by hand can no longer be managed from the visual editor.
What is ResolveChoice?
ResolveChoice is a method available on AWS Glue’s DynamicFrame. It’s used to handle scenarios where a single column in the dataset could contain values of different types, a situation commonly called a ‘choice type’. ResolveChoice offers several actions to resolve a choice type:
* make_cols: Creates a new column for each type found in the original column (for example, columnA_int and columnA_string).
* make_struct: Replaces the column with a struct that holds each of the possible types.
* cast: Casts every record in the column to a given type; records that can’t be cast are replaced with null.
* project: Projects the column to a single type, dropping values of the other types.
The specs parameter takes a list of (column, action) tuples, so different columns can be resolved in different ways.
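A minimal sketch of resolving a choice type in a Glue job script; the database, table, and column names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table whose "user_id" column is sometimes read
# as an int and sometimes as a string (a choice type).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events"
)

# Cast the ambiguous column to long; values that cannot be cast become null.
resolved = dyf.resolveChoice(specs=[("user_id", "cast:long")])

# Alternative: split it into user_id_int and user_id_string columns.
# resolved = dyf.resolveChoice(specs=[("user_id", "make_cols")])
```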
Can the Data Catalog be modified?
The result of an ETL (Extract, Transform, Load) job in AWS Glue sometimes involves modifications to the data catalog itself, such as adding fields or tables. If changes to the schema or partitioning need to be made after a crawler has defined the schema, there are two primary ways to accomplish this:
- Rerun the crawler.
- Use an ETL script to explicitly add partitions and update the table schema, using the enableUpdateCatalog, partitionKeys, and updateBehavior options on the job’s sink (see the sketch below). The ETL script can also create new tables by calling setCatalogInfo.
However, modifying the Data Catalog from ETL has its limitations. It can only be done when Amazon S3 is the underlying data store, it is limited to the JSON, CSV, Avro, and Parquet formats (Parquet requires a special writer), and nested schemas are not supported.
In summary, AWS Glue ETL can modify the Data Catalog within these conditions and restrictions, and where they are met, updating the catalog directly from the script eliminates the need to rerun the crawler.
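A hedged sketch of a sink that writes JSON to S3 and updates the catalog as it writes; the bucket, database, table, and partition columns are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source frame; it is assumed to contain year/month
# columns to partition by.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events"
)

# The sink updates the table schema and adds new partitions in the
# Data Catalog as it writes, so no crawler rerun is needed.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",        # hypothetical output location
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month"],
)
sink.setCatalogInfo(catalogDatabase="mydb", catalogTableName="events_out")
sink.setFormat("json")
sink.writeFrame(dyf)
```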
What is a Glue development endpoint?
An AWS Glue development endpoint is an interactive environment for developing and testing your ETL scripts. It provisions an Apache Spark setup where you can iteratively run and debug your code. This environment can be connected with a notebook or an IDE, and the developed scripts can be converted into ETL jobs for AWS Glue to run. It’s useful for complex ETL tasks that require debugging and validation.
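As an illustration, a development endpoint can be provisioned with boto3; the endpoint name, role, and key below are hypothetical, and endpoints bill for as long as they exist, so delete them when done:

```python
import boto3

glue = boto3.client("glue")

# Provision a small endpoint (5 DPUs); attach a notebook or SSH in to
# iterate on ETL scripts interactively.
glue.create_dev_endpoint(
    EndpointName="dev-ep",                            # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/GlueRole",
    NumberOfNodes=5,
    PublicKey="ssh-rsa AAAA...",                      # your SSH public key
)

# Tear it down to stop the charges:
glue.delete_dev_endpoint(EndpointName="dev-ep")
```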
Glue Cost
AWS Glue charges are based on the time Glue crawlers spend extracting schemas and the time ETL jobs spend processing and transforming data, billed by the DPU-hour. The first million objects stored in the Glue Data Catalog are free, but large-scale data applications may exceed this allowance quickly. If you use a development endpoint to edit ETL scripts from a notebook, you are billed per minute for as long as the endpoint is provisioned, so shut endpoints down when not in use to avoid unnecessary charges.
Glue Anti-pattern
AWS Glue is designed to be flexible and general-purpose, utilizing Apache Spark for ETL operations. If you need to use other ETL engines, such as Hive or Pig, it’s recommended to use Data Pipeline or EMR for data processing. While Spark is versatile enough for most needs, the use of Hive, Pig, or other tools might be necessary for legacy code or system integration.
What is Glue Data Quality?
AWS Glue Data Quality is a serverless, cost-effective, petabyte-scale data quality tool that helps you identify, measure, and monitor the quality of your data. It is built on top of Deequ, an open-source framework for data quality, and it provides a variety of features to help you assess the quality of your data, including:
* Data profiling: automatically computes statistics for your data, such as the number of rows, the number of unique values, and the distribution of values. This information can help you identify potential problems, such as missing values, duplicate values, and outliers.
* Data quality rules: a set of pre-built rules that check for common problems, such as missing values, duplicate values, and out-of-range values. You can also create your own data quality rules.
* Data quality monitoring: tracks your data quality over time and alerts you when it deteriorates, so you can identify and fix problems before they impact your business.
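Rules are written in DQDL (Data Quality Definition Language) and can be evaluated inside a Glue job. A minimal sketch, assuming the awsgluedq transform available in Glue jobs; the table and column names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="customers"   # hypothetical table
)

# A small DQDL ruleset: non-empty table, complete and unique IDs,
# and a range check on a numeric column.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "customer_id",
    IsUnique "customer_id",
    ColumnValues "age" between 0 and 120
]
"""

# Evaluate the rules against the frame; results can also be published
# to CloudWatch for ongoing monitoring.
results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "customers_check"},
)
```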