Analytics Flashcards
What is AWS Glue?
A serverless ETL service for discovering, preparing, and combining data for analytics, machine learning, and application development.
What are the key components of AWS Glue?
Data Catalog, Crawler, ETL Engine, Studio, Data Quality, DataBrew, Workflows
What is the purpose of the Glue Data Catalog?
Stores table definitions and schemas for data located in various sources like S3, RDS, and DynamoDB.
What does a Glue Crawler do?
Scans data stores to infer schemas and create table definitions in the Data Catalog.
What programming languages are supported for Glue ETL jobs?
Scala and Python
What is a DynamicFrame in AWS Glue?
A collection of DynamicRecords with a schema, similar to a Spark DataFrame but with additional ETL capabilities.
What are some common transformations in Glue ETL?
DropFields, DropNullFields, Filter, Join, Map, FindMatches ML, format conversions
How can you modify the Data Catalog using Glue ETL scripts?
By using options such as enableUpdateCatalog, partitionKeys, and updateBehavior to add partitions, update schemas, or create new tables.
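As a minimal sketch of how these three options fit together, the helper below builds the options dictionary in plain Python. The helper name catalog_update_options and the partition columns year and month are hypothetical; the option keys and the updateBehavior values UPDATE_IN_DATABASE and LOG are the ones named in the Glue documentation.

```python
def catalog_update_options(partition_keys, update_schema=True):
    """Build an options dict for a Glue write that also updates the Data Catalog.

    Hypothetical helper for illustration; only the dict keys and the
    updateBehavior values come from AWS Glue itself.
    """
    return {
        # Ask Glue to create or update the catalog table during the write.
        "enableUpdateCatalog": True,
        # Columns used to partition the output and register new partitions.
        "partitionKeys": list(partition_keys),
        # UPDATE_IN_DATABASE overwrites the catalog schema; LOG only logs the change.
        "updateBehavior": "UPDATE_IN_DATABASE" if update_schema else "LOG",
    }


options = catalog_update_options(["year", "month"])  # hypothetical partition columns
print(options)
```

Inside a Glue job, a dictionary like this would typically be passed as additional_options to glueContext.write_dynamic_frame_from_catalog; that call needs the Glue runtime, so it is left out of the sketch.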
What are AWS Glue Development Endpoints?
Notebook-based environments for developing and testing Glue ETL scripts.
What are Glue Job Bookmarks used for?
Persisting state from a job run to prevent reprocessing of old data and ensure incremental processing.
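The idea behind a bookmark can be modelled in a few lines of plain Python. This is a conceptual sketch only, not how Glue implements bookmarks internally: the function run_incremental, the object keys, and the integer timestamps are all made up for illustration.

```python
def run_incremental(objects, bookmark):
    """Process only objects modified after the bookmark.

    Returns the keys processed in this run and the advanced bookmark.
    Conceptual model only; Glue tracks its own state per job and source.
    """
    new = [o for o in objects if o["modified"] > bookmark]
    # Advance the bookmark to the newest object seen; keep it if nothing is new.
    new_bookmark = max((o["modified"] for o in new), default=bookmark)
    return [o["key"] for o in new], new_bookmark


# Hypothetical S3 objects with integer "last modified" times.
objects = [{"key": "a.csv", "modified": 1}, {"key": "b.csv", "modified": 2}]
first_run, mark = run_incremental(objects, bookmark=0)

objects.append({"key": "c.csv", "modified": 3})
second_run, mark = run_incremental(objects, mark)

print(first_run)   # ['a.csv', 'b.csv']
print(second_run)  # ['c.csv']
```

In an actual Glue job, bookmarks are switched on with the job parameter --job-bookmark-option set to job-bookmark-enable, and Glue persists the state between runs.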
How are AWS Glue costs calculated?
Primarily billed by the second for crawler and ETL jobs, with additional charges for Data Catalog storage beyond the free tier and for development endpoints.
What is AWS Glue Studio?
A visual interface for creating, running, and monitoring Glue ETL jobs with little or no hand-written code.
What is the purpose of AWS Glue Data Quality?
To define and enforce data quality rules within Glue jobs using DQDL.
What is AWS Glue DataBrew?
A visual data preparation tool with over 250 pre-built transformations for cleaning and normalizing data.
What is AWS Lake Formation?
A service that simplifies the setup and management of secure data lakes.
What are Governed Tables in AWS Lake Formation?
A Lake Formation table type for data in S3 that supports ACID transactions and works with fine-grained access control.
What is Amazon Athena?
A serverless interactive query service that allows you to analyze data in S3 using standard SQL.
What data formats are supported by Athena?
CSV, TSV, JSON, ORC, Parquet, and Avro, including compressed files (for example gzip or Snappy).
How is Amazon Athena priced?
Pay-as-you-go, based on the amount of data scanned by queries.
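A quick worked example shows why scanning less data lowers the bill. The rate of 5 USD per TB is an assumption used only for the arithmetic (check the current Athena pricing page for your region), and the function name athena_query_cost is hypothetical. Any per-query minimum charge is ignored.

```python
TB = 1024 ** 4  # bytes in one terabyte (binary)


def athena_query_cost(bytes_scanned, usd_per_tb=5.0):
    """Estimate query cost from bytes scanned.

    usd_per_tb is an assumed rate for illustration, not a quoted price.
    """
    return bytes_scanned / TB * usd_per_tb


# Same hypothetical query, before and after partitioning and columnar conversion.
full_scan = athena_query_cost(2 * TB)    # scans the whole 2 TB data set
pruned = athena_query_cost(0.2 * TB)     # scans only 10% of it

print(f"full scan: ${full_scan:.2f}, pruned: ${pruned:.2f}")
```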
What are Athena Workgroups?
A way to organize users, teams, and workloads in Athena, allowing for cost tracking and query access control.
How can you optimize Athena query performance?
Use columnar data formats, partition data effectively, and minimize the number of small files.
What is the CREATE TABLE AS SELECT (CTAS) statement used for in Athena?
Creating a new table from the results of a query, including data format conversion.
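As a rough illustration of a CTAS statement that converts a table to Parquet, the sketch below assembles the SQL text in Python. The helper build_ctas, the table names, and the S3 location are hypothetical; the WITH properties format and external_location are standard Athena CTAS table properties.

```python
def build_ctas(new_table, source_table, output_location, fmt="PARQUET"):
    """Return an Athena CTAS statement that rewrites source_table in a new format.

    Hypothetical helper; it uses plain string formatting, so it should only
    be given trusted identifiers.
    """
    return (
        f"CREATE TABLE {new_table}\n"
        f"WITH (format = '{fmt}', external_location = '{output_location}')\n"
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical table names and bucket.
sql = build_ctas("logs_parquet", "logs_csv", "s3://example-bucket/logs-parquet/")
print(sql)
```

The resulting text could be pasted into the Athena query editor or submitted through an SDK; the sketch only builds the string and does not run it.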
What is Apache Spark?
A distributed framework for big data processing, known for its in-memory caching and optimized query execution.
What are the core components of Apache Spark?
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX
What is Spark Structured Streaming?
A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
What is Amazon EMR?
A managed Hadoop framework on EC2 that allows you to run big data frameworks like Spark, Hive, Presto, and Flink.
What are the different node types in an EMR cluster?
Master node, Core node, Task node
What is EMRFS?
A file system that allows you to access data in S3 as if it were HDFS.
What is EMR Managed Scaling?
A feature that automatically scales EMR clusters based on workload demands.
What is EMR Serverless?
A way to run Spark and Hive applications on EMR without managing clusters.
What is Amazon Kinesis Data Streams?
A service for real-time data streaming with high throughput and low latency.
What are the two capacity modes for Kinesis Data Streams?
Provisioned mode and On-demand mode
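In provisioned mode you choose the shard count yourself. A minimal sizing sketch, assuming the standard per-shard write limits of 1 MB/s and 1,000 records/s, might look like this; the function name shards_needed and the sample workloads are made up for illustration.

```python
import math

SHARD_WRITE_MB_PER_SEC = 1.0       # per-shard ingest limit by size
SHARD_WRITE_RECORDS_PER_SEC = 1000  # per-shard ingest limit by record count


def shards_needed(write_mb_per_sec, records_per_sec):
    """Smallest shard count that satisfies both per-shard write limits."""
    by_bytes = math.ceil(write_mb_per_sec / SHARD_WRITE_MB_PER_SEC)
    by_records = math.ceil(records_per_sec / SHARD_WRITE_RECORDS_PER_SEC)
    return max(1, by_bytes, by_records)


print(shards_needed(4.5, 2000))  # 5: limited by throughput in MB/s
print(shards_needed(0.5, 3500))  # 4: limited by record count
```

In on-demand mode this calculation is handled by the service, which adjusts capacity as traffic changes.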
What is the Kinesis Producer Library (KPL)?
A library for building high-performance producers that send data to Kinesis Data Streams.
What is Kinesis Enhanced Fan-Out?
A feature that allows multiple consumers to each get dedicated throughput from a Kinesis stream.
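The difference is easiest to see with numbers. The sketch below assumes the usual read limit of 2 MB/s per shard, shared among all standard consumers but dedicated to each enhanced fan-out consumer; the function name and the stream size are hypothetical.

```python
SHARD_READ_MB_PER_SEC = 2.0  # per-shard read throughput


def read_mb_per_sec_per_consumer(shards, consumers, enhanced_fan_out):
    """Read throughput available to each consumer of a stream."""
    total = shards * SHARD_READ_MB_PER_SEC
    if enhanced_fan_out:
        # Each registered consumer gets its own full share of every shard.
        return total
    # Standard consumers split the per-shard throughput between them.
    return total / consumers


# Hypothetical stream: 4 shards read by 4 consumers.
shared = read_mb_per_sec_per_consumer(4, 4, enhanced_fan_out=False)
dedicated = read_mb_per_sec_per_consumer(4, 4, enhanced_fan_out=True)
print(shared, dedicated)  # 2.0 8.0
```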
What is Amazon Kinesis Data Firehose (now Amazon Data Firehose)?
A fully managed service for loading streaming data into data stores like S3, Redshift, and OpenSearch.
What is Amazon Kinesis Data Analytics?
A service for analyzing streaming data in real-time using SQL or Apache Flink.
What are Reference Tables in Kinesis Data Analytics?
A way to join streaming data with static data stored in S3 for enrichment and lookups.
What is Amazon Managed Service for Apache Flink?
A serverless service for running Apache Flink stream-processing applications written in Java, Scala, or Python.
What is the RANDOM_CUT_FOREST function used for in Kinesis Data Analytics?
Anomaly detection on numeric data within a streaming data set.