Glue Flashcards
What is a Glue crawler?
An AWS Glue crawler is a program that connects to a data source, extracts metadata (like the structure or schema of the data), and then writes this metadata to the AWS Glue Data Catalog. Essentially, it’s a tool that automates the process of gathering metadata from various data sources.
Crawlers can access data stored in various formats in locations like Amazon S3, Amazon RDS, Amazon DynamoDB, and other JDBC-accessible databases. They scan the data in these sources and infer a schema that can be used for querying and processing the data.
Furthermore, crawlers can be scheduled to run periodically. This means they can automatically keep the AWS Glue Data Catalog up-to-date if new data is added to the data stores, or if the schema of the data changes.
In short, Glue crawlers automate the process of discovering and cataloging data, which makes the data easier to analyze and process.
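For illustration, here is a minimal sketch of defining and starting a crawler with boto3; the crawler name, IAM role, database, and S3 path are all hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Define a crawler over a hypothetical S3 prefix. The schedule is a cron
# expression, so the Data Catalog stays current as new data arrives.
glue.create_crawler(
    Name="sales-crawler",                            # hypothetical name
    Role="arn:aws:iam::123456789012:role/GlueRole",  # hypothetical role
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales/"}]},
    Schedule="cron(0 3 * * ? *)",                    # daily at 03:00 UTC
)

# Run it on demand; the inferred schema lands in the Glue Data Catalog.
glue.start_crawler(Name="sales-crawler")
```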
What is Glue built on?
It’s based on Apache Spark
Can Glue integrate with Hive?
Yes
Can Hive metastore be imported into Glue Data Catalog?
Yes
Which languages can be used in Glue for coding ETL processes?
Scala and Python
Can Glue ETL be event-driven?
Yes. A Glue job can be started in response to events, for example from a Lambda function triggered by an S3 upload, as sketched below.
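A minimal sketch, assuming a Lambda function subscribed to S3 object-created events; the job name and argument are hypothetical:

```python
import boto3

glue = boto3.client("glue")

def handler(event, context):
    """Start a Glue job for each newly uploaded S3 object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # "etl-job" is a hypothetical job name; the argument is read in
        # the job script with getResolvedOptions.
        glue.start_job_run(
            JobName="etl-job",
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
```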
How do you improve the performance of the underlying Spark jobs in Glue?
You can increase the number of DPUs (data processing units) allocated to the job, which adds Spark workers and parallelism.
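For example, capacity can be set per run; the job name and sizes below are illustrative, and each G.1X worker corresponds to 1 DPU:

```python
import boto3

glue = boto3.client("glue")

# Request 10 G.1X workers (10 DPUs) for this run instead of the job default.
glue.start_job_run(
    JobName="etl-job",       # hypothetical job name
    WorkerType="G.1X",
    NumberOfWorkers=10,
)
```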
How can you identify performance issues in Glue jobs?
Use CloudWatch metrics. With job metrics enabled, Glue publishes Spark driver and executor metrics (memory, CPU, data movement) to CloudWatch, which you can use to profile a job.
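A hedged sketch of enabling metrics for a run via Glue’s documented --enable-metrics special parameter; the job name is hypothetical:

```python
import boto3

glue = boto3.client("glue")

# Start a run with job metrics enabled; Glue then publishes Spark
# driver/executor metrics to CloudWatch under the "Glue" namespace.
glue.start_job_run(
    JobName="etl-job",                       # hypothetical job name
    Arguments={"--enable-metrics": "true"},  # Glue special parameter
)
```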
Can Glue UI-generated code be modified?
Yes. The generated script can be edited and customized, although a job whose script has been edited by hand can no longer be managed from the visual editor.
What is ResolveChoice?
ResolveChoice is a method available on AWS Glue’s DynamicFrame. It’s used to handle scenarios where a single column in the dataset could contain values of different types, a situation commonly called a ‘choice type’. ResolveChoice offers several actions to resolve a choice type:
* make_cols: Creates a new column for each type found in the original column (for example, columnA_int and columnA_string).
* make_struct: Replaces the column with a struct that holds each of the possible types.
* cast: Casts every record in the column to a given type; records that can’t be cast are replaced with null.
* project: Projects the column to a single type, dropping values of the other types.
The specs parameter takes a list of (column, action) tuples, so different columns can be resolved in different ways.
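A minimal sketch of resolving a choice type in a Glue job script; the database, table, and column names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table whose "user_id" column is sometimes read
# as an int and sometimes as a string (a choice type).
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events"
)

# Cast the ambiguous column to long; values that cannot be cast become null.
resolved = dyf.resolveChoice(specs=[("user_id", "cast:long")])

# Alternative: split it into user_id_int and user_id_string columns.
# resolved = dyf.resolveChoice(specs=[("user_id", "make_cols")])
```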
Can the Data Catalog be modified?
The result of an ETL (Extract, Transform, Load) job in AWS Glue sometimes involves modifications to the data catalog itself, such as adding fields or tables. If changes to the schema or partitioning need to be made after a crawler has defined the schema, there are two primary ways to accomplish this:
- Rerun the crawler.
- Use an ETL script to explicitly add partitions and update the table schema, using the enableUpdateCatalog, partitionKeys, and updateBehavior options on the job’s sink (see the sketch below). The ETL script can also create new tables by calling setCatalogInfo.
However, modifying the Data Catalog from ETL has its limitations. It can only be done when Amazon S3 is the underlying data store, it is limited to the JSON, CSV, Avro, and Parquet formats (Parquet requires a special writer), and nested schemas are not supported.
In summary, AWS Glue ETL can modify the Data Catalog within these conditions and restrictions, and where they are met, updating the catalog directly from the script eliminates the need to rerun the crawler.
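A hedged sketch of a sink that writes JSON to S3 and updates the catalog as it writes; the bucket, database, table, and partition columns are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Hypothetical source frame; it is assumed to contain year/month
# columns to partition by.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="events"
)

# The sink updates the table schema and adds new partitions in the
# Data Catalog as it writes, so no crawler rerun is needed.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/output/",        # hypothetical output location
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["year", "month"],
)
sink.setCatalogInfo(catalogDatabase="mydb", catalogTableName="events_out")
sink.setFormat("json")
sink.writeFrame(dyf)
```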
What is a Glue development endpoint?
An AWS Glue development endpoint is an interactive environment for developing and testing your ETL scripts. It provisions an Apache Spark setup where you can iteratively run and debug your code. This environment can be connected with a notebook or an IDE, and the developed scripts can be converted into ETL jobs for AWS Glue to run. It’s useful for complex ETL tasks that require debugging and validation.
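As an illustration, a development endpoint can be provisioned with boto3; the endpoint name, role, and key below are hypothetical, and endpoints bill for as long as they exist, so delete them when done:

```python
import boto3

glue = boto3.client("glue")

# Provision a small endpoint (5 DPUs); attach a notebook or SSH in to
# iterate on ETL scripts interactively.
glue.create_dev_endpoint(
    EndpointName="dev-ep",                            # hypothetical name
    RoleArn="arn:aws:iam::123456789012:role/GlueRole",
    NumberOfNodes=5,
    PublicKey="ssh-rsa AAAA...",                      # your SSH public key
)

# Tear it down to stop the charges:
glue.delete_dev_endpoint(EndpointName="dev-ep")
```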
Glue Cost
AWS Glue charges are based on the time Glue crawlers spend extracting schemas and the time ETL jobs spend processing and transforming data, billed by the DPU-hour. The first million objects stored in the Glue Data Catalog are free, but large-scale data applications may exceed this allowance quickly. If you use a development endpoint to edit ETL scripts from a notebook, you are billed per minute for as long as the endpoint is provisioned, so shut endpoints down when not in use to avoid unnecessary charges.
Glue Anti-pattern
AWS Glue is designed to be flexible and general-purpose, utilizing Apache Spark for ETL operations. If you need to use other ETL engines, such as Hive or Pig, it’s recommended to use Data Pipeline or EMR for data processing. While Spark is versatile enough for most needs, the use of Hive, Pig, or other tools might be necessary for legacy code or system integration.
What is Glue Data Quality?
AWS Glue Data Quality is a serverless, cost-effective, petabyte-scale data quality tool that helps you identify, measure, and monitor the quality of your data. It is built on top of Deequ, an open-source framework for data quality, and it provides a variety of features to help you assess the quality of your data, including:
* Data profiling: automatically computes statistics for your data, such as the number of rows, the number of unique values, and the distribution of values. This information can help you identify potential problems, such as missing values, duplicate values, and outliers.
* Data quality rules: a set of pre-built rules that check for common problems, such as missing values, duplicate values, and out-of-range values. You can also create your own data quality rules.
* Data quality monitoring: tracks your data quality over time and alerts you when it deteriorates, so you can identify and fix problems before they impact your business.
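Rules are written in DQDL (Data Quality Definition Language) and can be evaluated inside a Glue job. A minimal sketch, assuming the awsgluedq transform available in Glue jobs; the table and column names are hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from awsgluedq.transforms import EvaluateDataQuality

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="mydb", table_name="customers"   # hypothetical table
)

# A small DQDL ruleset: non-empty table, complete and unique IDs,
# and a range check on a numeric column.
ruleset = """
Rules = [
    RowCount > 0,
    IsComplete "customer_id",
    IsUnique "customer_id",
    ColumnValues "age" between 0 and 120
]
"""

# Evaluate the rules against the frame; results can also be published
# to CloudWatch for ongoing monitoring.
results = EvaluateDataQuality.apply(
    frame=dyf,
    ruleset=ruleset,
    publishing_options={"dataQualityEvaluationContext": "customers_check"},
)
```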