Analytics Flashcards

1
Q

What is AWS Glue?

A

A serverless ETL service for discovering, preparing, and combining data for analytics, machine learning, and application development.

2
Q

What are the key components of AWS Glue?

A

Data Catalog, Crawler, ETL Engine, Studio, Data Quality, DataBrew, Workflows

3
Q

What is the purpose of the Glue Data Catalog?

A

Stores table definitions and schemas for data located in various sources like S3, RDS, and DynamoDB.

4
Q

What does a Glue Crawler do?

A

Scans data stores to infer schemas and create table definitions in the Data Catalog.

5
Q

What programming languages are supported for Glue ETL jobs?

A

Scala and Python

6
Q

What is a DynamicFrame in AWS Glue?

A

A collection of self-describing DynamicRecords, similar to a Spark DataFrame but with additional ETL-oriented capabilities and no requirement for a fixed schema up front.

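A minimal sketch of creating a DynamicFrame inside a Glue job; the database and table names ("sales_db", "orders") are hypothetical Data Catalog entries.

```python
# Runs inside an AWS Glue job; the awsglue library is only available there.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# "sales_db" / "orders" are hypothetical Data Catalog names.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db",
    table_name="orders",
)

orders.printSchema()       # each record is self-describing
orders_df = orders.toDF()  # convert to a regular Spark DataFrame when needed
```
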
7
Q

What are some common transformations in Glue ETL?

A

DropFields, DropNullFields, Filter, Join, Map, FindMatches ML, format conversions

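A small sketch of two of these built-in transforms applied to a DynamicFrame; the field names and filter condition are hypothetical.

```python
# Inside an AWS Glue job; field names below are hypothetical.
from awsglue.context import GlueContext
from awsglue.transforms import DropFields, Filter
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# Drop columns that downstream consumers do not need.
trimmed = DropFields.apply(frame=orders, paths=["internal_note", "debug_flag"])

# Keep only completed orders.
completed = Filter.apply(frame=trimmed, f=lambda row: row["status"] == "COMPLETED")
```
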
8
Q

How can you modify the Data Catalog using Glue ETL scripts?

A

By using options like enableUpdateCatalog, partitionKeys, and updateBehavior to add partitions, update schemas, or create new tables.

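A hedged sketch of writing through the catalog with those options; the target database, table, and partition keys are hypothetical, and the exact option set should be checked against the Glue documentation for your writer.

```python
# Inside an AWS Glue job; database/table/partition names are hypothetical.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders")

# Write back through the catalog, letting the job add new partitions and
# update the table schema as part of the run.
glue_context.write_dynamic_frame_from_catalog(
    frame=orders,
    database="sales_db",
    table_name="orders_curated",
    additional_options={
        "enableUpdateCatalog": True,
        "updateBehavior": "UPDATE_IN_DATABASE",
        "partitionKeys": ["region", "year"],
    },
)
```
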
9
Q

What are AWS Glue Development Endpoints?

A

Long-running Glue environments, typically accessed from notebooks, for interactively developing and testing Glue ETL scripts.

10
Q

What are Glue Job Bookmarks used for?

A

Persisting state from a job run to prevent reprocessing of old data and ensure incremental processing.

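A skeleton of a bookmarked job, assuming bookmarks are enabled as a job property; the names used for the source table and transformation_ctx are hypothetical.

```python
# Bookmarks are turned on in the job settings; the script supplies
# transformation_ctx values and commits the bookmark at the end.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# transformation_ctx lets the bookmark track which input data was already read.
new_orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders",
    transformation_ctx="read_orders")

# ... transforms and writes ...

job.commit()  # persists bookmark state for the next run
```
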
11
Q

How are AWS Glue costs calculated?

A

Crawlers and ETL jobs are billed per DPU-hour, charged by the second; the Data Catalog incurs charges for storage and requests beyond the free tier, and development endpoints are billed for the time they are provisioned.

12
Q

What is AWS Glue Studio?

A

A visual interface for creating and managing ETL workflows without writing code.

13
Q

What is the purpose of AWS Glue Data Quality?

A

To define and enforce data quality rules within Glue jobs using the Data Quality Definition Language (DQDL).

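A rough sketch of what a DQDL ruleset looks like; the column names and thresholds are hypothetical, and the exact rule types available should be checked against the DQDL reference before passing the ruleset to Glue's data quality evaluation.

```python
# A DQDL ruleset expressed as a string, as it might be supplied to a Glue
# data quality evaluation; "order_id" and "quantity" are hypothetical columns.
ruleset = """
Rules = [
    IsComplete "order_id",
    Uniqueness "order_id" > 0.99,
    ColumnValues "quantity" > 0
]
"""
```
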
14
Q

What is AWS Glue DataBrew?

A

A visual data preparation tool with over 250 pre-built transformations for cleaning and normalizing data.

15
Q

What is AWS Lake Formation?

A

A service that simplifies the setup and management of secure data lakes.

16
Q

What are Governed Tables in AWS Lake Formation?

A

A new type of S3 table that supports ACID transactions and fine-grained access control.

17
Q

What is Amazon Athena?

A

A serverless interactive query service that allows you to analyze data in S3 using standard SQL.

18
Q

What data formats are supported by Athena?

A

CSV, TSV, JSON, ORC, Parquet, Avro, and various compression formats.

19
Q

How is Amazon Athena priced?

A

Pay-as-you-go, based on the amount of data scanned by queries.

20
Q

What are Athena Workgroups?

A

A way to organize users, teams, and workloads in Athena, allowing for cost tracking and query access control.

21
Q

How can you optimize Athena query performance?

A

Use columnar data formats, partition data effectively, and minimize the number of small files.

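A hedged sketch of these ideas in practice: querying a partitioned, Parquet-backed table through boto3 so Athena scans only the columns and partitions it needs. The database, table, partition columns, and results bucket are hypothetical.

```python
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="""
        SELECT order_id, total                 -- read only needed columns (columnar format)
        FROM orders_parquet
        WHERE year = '2024' AND month = '06'   -- partition predicates prune the scan
    """,
    QueryExecutionContext={"Database": "sales_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```
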
22
Q

What is the CREATE TABLE AS SELECT (CTAS) statement used for in Athena?

A

Creating a new table from the results of a query, including data format conversion.

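A minimal CTAS sketch that converts a CSV-backed table to partitioned Parquet and registers the new table in one statement; the table names, bucket, and partition column are hypothetical.

```python
# CTAS statement expressed as a string; it can be run like any other
# Athena query. Partition columns must appear last in the SELECT list.
ctas_query = """
CREATE TABLE orders_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-bucket/orders_parquet/',
    partitioned_by = ARRAY['year']
) AS
SELECT order_id, total, year
FROM orders_csv
"""
```
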
23
Q

What is Apache Spark?

A

A distributed processing framework for big data processing, known for its in-memory caching and optimized query execution.

24
Q

What are the core components of Apache Spark?

A

Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX

25
Q

What is Spark Structured Streaming?

A

A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

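A minimal Structured Streaming sketch using Spark's built-in "rate" source, which simply emits timestamped rows; a real job would read from Kafka, Kinesis, files, or another streaming source.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

# The "rate" source generates rows with a timestamp and a value column.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Count events per 1-minute window; the engine manages the streaming state.
counts = events.groupBy(window(events.timestamp, "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```
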
26
Q

What is Amazon EMR?

A

A managed Hadoop framework on EC2 that allows you to run big data frameworks like Spark, Hive, Presto, and Flink.

27
Q

What are the different node types in an EMR cluster?

A

Master node, Core node, Task node

28
Q

What is EMRFS?

A

A file system that allows you to access data in S3 as if it were HDFS.

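A small sketch of what this looks like from a Spark job running on an EMR cluster, where s3:// paths are served by EMRFS; the bucket and prefixes are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-sketch").getOrCreate()

# On EMR, these S3 paths are read and written through EMRFS.
df = spark.read.parquet("s3://my-data-lake/orders/")
df.write.mode("overwrite").parquet("s3://my-data-lake/orders_curated/")
```
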
29
Q

What is EMR Managed Scaling?

A

A feature that automatically scales EMR clusters based on workload demands.

30
Q

What is EMR Serverless?

A

A deployment option for running Spark and Hive applications on EMR without provisioning or managing clusters.

31
Q

What is Amazon Kinesis Data Streams?

A

A service for real-time data streaming with high throughput and low latency.

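A hedged sketch of a simple producer using boto3 (the higher-throughput KPL covered later is a Java library); the stream name and record payload are hypothetical.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

record = {"order_id": "123", "total": 42.5}
kinesis.put_record(
    StreamName="orders-stream",
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["order_id"],  # determines which shard receives the record
)
```
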
32
Q

What are the two capacity modes for Kinesis Data Streams?

A

Provisioned mode and On-demand mode

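A brief sketch of creating a stream in each mode via boto3; the stream names and shard count are hypothetical.

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you choose (and pay for) a fixed number of shards.
kinesis.create_stream(StreamName="orders-provisioned", ShardCount=2)

# On-demand mode: capacity scales automatically with traffic.
kinesis.create_stream(
    StreamName="orders-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```
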
33
Q

What is the Kinesis Producer Library (KPL)?

A

A library for building high-performance producers that send data to Kinesis Data Streams.

34
Q

What is Kinesis Enhanced Fan-Out?

A

A feature that allows multiple consumers to each get dedicated throughput from a Kinesis stream.

35
Q

What is Amazon Kinesis Data Firehose?

A

A fully managed service for loading streaming data into data stores like S3, Redshift, and OpenSearch.

36
Q

What is Amazon Kinesis Data Analytics?

A

A service for analyzing streaming data in real-time using SQL or Apache Flink.

37
Q

What are Reference Tables in Kinesis Data Analytics?

A

A way to join streaming data with static data stored in S3 for enrichment and lookups.

38
Q

What is the Managed Service for Apache Flink?

A

A serverless service for running Apache Flink applications, written in Java, Scala, or Python, for stream processing.

39
Q

What is the RANDOM_CUT_FOREST function used for in Kinesis Data Analytics?

A

Anomaly detection on numeric data within a streaming data set.
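
A rough sketch of the shape this takes in a legacy Kinesis Data Analytics SQL application; the stream and column names are hypothetical, and the exact syntax and optional parameters should be checked against the RANDOM_CUT_FOREST reference.

```python
# The SQL below is held as a string only to illustrate the general pattern:
# the function is applied over a cursor on the input stream and emits an
# ANOMALY_SCORE column alongside the original fields.
anomaly_query = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("metric" DOUBLE, "ANOMALY_SCORE" DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "metric", "ANOMALY_SCORE"
FROM TABLE(
    RANDOM_CUT_FOREST(CURSOR(SELECT STREAM "metric" FROM "SOURCE_SQL_STREAM_001"))
);
"""
```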