Analytics | AWS Glue Flashcards
What is AWS Glue?
General
AWS Glue | Analytics
AWS Glue is a fully-managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the time-consuming steps of data preparation for analytics. AWS Glue automatically discovers and profiles your data via the Glue Data Catalog, recommends and generates ETL code to transform your source data into target schemas, and runs the ETL jobs on a fully managed, scale-out Apache Spark environment to load your data into its destination. It also allows you to setup, orchestrate, and monitor complex data flows.
How do I get started with AWS Glue?
General
AWS Glue | Analytics
To start using AWS Glue, simply sign into the AWS Management Console and navigate to “Glue” under the “Analytics” category. You can follow one of our guided tutorials that will walk you through an example use case for AWS Glue. You can also find sample ETL code in our GitHub repository under AWS Labs.
What are the main components of AWS Glue?
General
AWS Glue | Analytics
AWS Glue consists of a Data Catalog which is a central metadata repository, an ETL engine that can automatically generate Scala or Python code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries. Together, these automate much of the undifferentiated heavy lifting involved with discovering, categorizing, cleaning, enriching, and moving data, so you can spend more time analyzing your data.
When should I use AWS Glue?
General
AWS Glue | Analytics
You should use AWS Glue to discover properties of the data you own, transform it, and prepare it for analytics. Glue can automatically discover both structured and semi-structured data stored in your data lake on Amazon S3, data warehouse in Amazon Redshift, and various databases running on AWS. It provides a unified view of your data via the Glue Data Catalog that is available for ETL, querying and reporting using services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Glue automatically generates Scala or Python code for your ETL jobs that you can further customize using tools you are already familiar with. AWS Glue is serverless, so there are no compute resources to configure and manage.
What data sources does AWS Glue support?
AWS Glue Data Catalog
AWS Glue | Analytics
AWS Glue natively supports data stored in Amazon Aurora, Amazon RDS for MySQL, Amazon RDS for Oracle, Amazon RDS for PostgreSQL, Amazon RDS for SQL Server, Amazon Redshift, and Amazon S3, as well as MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases in your Virtual Private Cloud (Amazon VPC) running on Amazon EC2. The metadata stored in the AWS Glue Data Catalog can be readily accessed from Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. You can also write custom Scala or Python code and import custom libraries and Jar files into your Glue ETL jobs to access data sources not natively supported by AWS Glue. For more details on importing custom libraries, refer to our documentation.
What is the AWS Glue Data Catalog?
AWS Glue Data Catalog
AWS Glue | Analytics
The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given data set, you can store its table definition, physical location, add business relevant attributes, as well as track how this data has changed over time.
The AWS Glue Data Catalog is Apache Hive Metastore compatible and is a drop-in replacement for the Apache Hive Metastore for Big Data applications running on Amazon EMR. For more information on setting up your EMR cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, click here.
The AWS Glue Data Catalog also provides out-of-box integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL and also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a common view of your data between these services.
How do I get my metadata into the AWS Glue Data Catalog?
AWS Glue Data Catalog
AWS Glue | Analytics
AWS Glue provides a number of ways to populate metadata into the AWS Glue Data Catalog. Glue crawlers scan various data stores you own to automatically infer schemas and partition structure and populate the Glue Data Catalog with corresponding table definitions and statistics. You can also schedule crawlers to run periodically so that your metadata is always up-to-date and in-sync with the underlying data. Alternately, you can add and update table details manually by using the AWS Glue Console or by calling the API. You can also run Hive DDL statements via the Amazon Athena Console or a Hive client on an Amazon EMR cluster. Finally, if you already have a persistent Apache Hive Metastore, you can perform a bulk import of that metadata into the AWS Glue Data Catalog by using our import script.
What are AWS Glue crawlers?
AWS Glue Data Catalog
AWS Glue | Analytics
An AWS Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing table, and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
How do I import data from my existing Apache Hive Metastore to the AWS Glue Data Catalog?
AWS Glue Data Catalog
AWS Glue | Analytics
You simply run an ETL job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.
Do I need to maintain my Apache Hive Metastore if I am storing my metadata in the AWS Glue Data Catalog?
AWS Glue Data Catalog
AWS Glue | Analytics
No. AWS Glue Data Catalog is Apache Hive Metastore compatible. You can point to the Glue Data Catalog endpoint and use it as an Apache Hive Metastore replacement. For more information on how to configure your cluster to use AWS Glue Data Catalog as an Apache Hive Metastore, please read our documentation here.
If I am already using Amazon Athena or Amazon Redshift Spectrum and have tables in Amazon Athena’s internal data catalog, how can I start using the AWS Glue Data Catalog as my common metadata repository?
Extract, transform, and load (ETL)
AWS Glue | Analytics
Before you can start using AWS Glue Data Catalog as a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and AWS Glue, you must upgrade your Amazon Athena data catalog to AWS Glue Data Catalog. The steps required for the upgrade are detailed here.
What programming language can I use to write my ETL code for AWS Glue?
Extract, transform, and load (ETL)
AWS Glue | Analytics
You can use either Scala or Python.
How can I customize the ETL code generated by AWS Glue?
Extract, transform, and load (ETL)
AWS Glue | Analytics
AWS Glue’s ETL script recommendation system generates Scala or Python code. It leverages Glue’s custom ETL library to simplify access to data sources as well as manage job execution. You can find more details about the library in our documentation. You can write ETL code using AWS Glue’s custom library or write arbitrary code in Scala or Python by using inline editing via the AWS Glue Console script editor, downloading the auto-generated code, and editing it in your own IDE. You can also start with one of the many samples hosted in our Github repository and customize that code.
Can I import custom libraries as part of my ETL script?
Extract, transform, and load (ETL)
AWS Glue | Analytics
Yes. You can import custom Python libraries and Jar files into your AWS Glue ETL job. For more details, please check our documentation here.
Can I bring my own code?
Extract, transform, and load (ETL)
AWS Glue | Analytics
Yes. You can write your own code using AWS Glue’s ETL library, or write your own Scala or Python code and upload it to a Glue ETL job. For more details, please check our documentation here.