Athena Flashcards by Sindhusha Boyapati

What is Amazon Athena?

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL

Athena is _________ (on-server/serverless)

serverless

Athena infrastructure

Athena has no infrastructure to set up or manage, and you pay only for the queries you run.

Athena scaling

Athena scales automatically—running queries in parallel—so results are fast, even with large datasets and complex queries.

Data formats that can be analyzed using Athena

Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3.

Do you have to load the data into Athena to analyze the data stored in S3?

You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena.

Athena integrates with _______ for easy data visualization.

Amazon QuickSight

Athena integrates with the AWS Glue Data Catalog, which offers______________

a persistent metadata store for your data in Amazon S3.

What does Athena's integration with the AWS Glue Data Catalog allow?

It allows you to create tables and query data in Athena based on a central metadata store available throughout your Amazon Web Services account and integrated with the ETL and data discovery features of AWS Glue.

Athena will use a default library called _________ to do the actual work of parsing the data.

LazySimpleSerDe

To use a regex in your CREATE TABLE statement, use syntax like the following.

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
  WITH SERDEPROPERTIES ("input.regex" = "regular_expression")
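
To see how the capture groups of an input.regex line up with table columns, here is a minimal Python sketch. It is an illustration only: Athena's RegexSerDe uses Java regular expressions, and the log line, pattern, and column names below are hypothetical. Non-matching lines are modeled here as rows of empty (None) values.

```python
import re

# Hypothetical pattern: each capture group corresponds to one table column, in order.
INPUT_REGEX = r"^(\S+) (\S+) \[([^\]]+)\] (\d{3})$"
COLUMNS = ["client_ip", "user_name", "request_time", "status"]


def parse_line(line):
    """Return a column -> value mapping, or all-None values if the line does not match."""
    match = re.match(INPUT_REGEX, line)
    if match is None:
        return dict.fromkeys(COLUMNS)
    return dict(zip(COLUMNS, match.groups()))


row = parse_line("10.0.0.1 alice [2021-03-01T12:00:00Z] 200")
```

Every captured value comes back as text, so the number of capture groups should equal the number of columns declared for the table.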

The tables and databases that you work with in Athena to run queries are based on __________.

metadata

What is metadata?

Metadata is data about the underlying data in your dataset.

How that metadata describes your dataset is called the ___________.

schema

In Athena, we call a system for organizing metadata a ________ or _________

data catalog or a metastore.

The combination of a dataset and the data catalog that describes it is called a ____________

data source.

The relationship of metadata to an underlying dataset depends on the type of _______ that you work with.

data source

Types of data sources

Relational data sources, like MySQL, PostgreSQL, and SQL Server, tightly integrate the metadata with the dataset.
Other data sources, like those built using Hive, allow you to define metadata on the fly when you read the dataset.

Athena uses the _______ to store and retrieve table metadata for the Amazon S3 data in your Amazon Web Services account.

AWS Glue Data Catalog

How does table metadata help the Athena query engine?

The table metadata lets the Athena query engine know how to find, read, and process the data that you want to query.

What is AWS Glue?

AWS Glue is a fully managed ETL service.

What are AWS Glue crawlers?

AWS Glue crawlers automatically infer database and table schema from your data in Amazon S3 and store the associated metadata in the AWS Glue Data Catalog.

How do you create database and table schemas in the AWS Glue Data Catalog?

To create database and table schemas in the AWS Glue Data Catalog, you can:

1. Run AWS Glue crawlers on your data source from within Athena.
2. Run DDL queries directly in the Athena query editor.

Under the hood, Athena uses _________ to process DML statements and _________ to process the DDL statements that create and modify schema.

Presto; Hive

When you create schema in AWS Glue to query in Athena, you can use the AWS Glue Catalog Manager to ________, but at this time ________ and ________ cannot be changed using the AWS Glue console.

rename columns; table names and database names

How do you rename a database or table?

To rename a database or table, create a new database or table and copy the tables or data to it.

Athena does not recognize __________ that you specify for an AWS Glue crawler.

exclude patterns

If Athena detects that the schema of a partition differs from the schema of the table, Athena may not be able to process the query and fails with ______________

HIVE_PARTITION_SCHEMA_MISMATCH.

You can use the ___________ for external Hive metastore to query datasets in Amazon S3 that use an _____________

Amazon Athena data connector; Apache Hive metastore.

The connection from Lambda to your Hive metastore is secured by a ________ and does not use the ___________

private Amazon VPC channel; public internet.

Can you use the AWS Glue Data Catalog and external Hive metastores in the same Athena query?

Yes

How can you use the syntax database.table instead of catalog.database.table?

Specify a catalog in the query execution context as the current default catalog.
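
As a rough illustration of setting a default catalog, the sketch below builds the arguments for the Athena StartQueryExecution API call, whose QueryExecutionContext accepts a Catalog and a Database. The catalog name, database name, table, and bucket are hypothetical, and the boto3 call is left as a comment because it needs an AWS account and credentials.

```python
def build_query_request(sql, catalog, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API call."""
    return {
        "QueryString": sql,
        # The execution context supplies the default catalog and database,
        # so the query text can use database.table without a catalog prefix.
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_query_request(
    sql="SELECT * FROM sales_db.orders LIMIT 10",
    catalog="my_hive_catalog",  # hypothetical registered catalog name
    database="sales_db",  # hypothetical database name
    output_location="s3://example-bucket/athena-results/",  # hypothetical bucket
)

# With boto3 installed and credentials configured, this could be submitted as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

Note that the catalog name appears only in the execution context, not in the SQL text.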

How does Athena interact with an external Hive metastore?

1. You create a Lambda function that connects Athena to the Hive metastore inside your VPC.
2. You register a unique catalog name for your Hive metastore and a corresponding function name in your account.
3. When you run an Athena DML or DDL query that uses the catalog name, the Athena query engine calls the Lambda function that you associated with the catalog name.
4. Using AWS PrivateLink, the Lambda function communicates with the Hive metastore in your VPC and receives responses to metadata requests.

When using the Athena data connector with an external Hive metastore, the maximum number of registered catalogs that you can have is _________

1000

Hive views and Athena views.

Hive views are not compatible with Athena views and are not supported.

Kerberos authentication for __________ is not supported.

Hive metastore

What is a spill location?

Because of the limit on Lambda function response sizes, responses larger than the threshold spill into an Amazon S3 location that you specify when you create the Lambda function.
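
The spill idea can be sketched in a few lines of Python. This is not the connector's actual code: the threshold value, the key names, and the dictionary standing in for an S3 bucket are all hypothetical, chosen only to show the "return inline if small, otherwise write elsewhere and return a pointer" pattern.

```python
import json

SPILL_THRESHOLD_BYTES = 1024  # hypothetical threshold, kept small for the demo


def respond(rows, spill_store, spill_key):
    """Return rows inline when small, otherwise spill them and return a pointer."""
    payload = json.dumps(rows).encode("utf-8")
    if len(payload) <= SPILL_THRESHOLD_BYTES:
        return {"spilled": False, "rows": rows}
    spill_store[spill_key] = payload  # stand-in for writing an object to S3
    return {"spilled": True, "location": spill_key}


fake_s3 = {}  # in-memory stand-in for the spill location
small = respond([{"id": 1}], fake_s3, "spill/query-1/block-0")
large = respond(
    [{"id": i, "name": "x" * 50} for i in range(100)],
    fake_s3,
    "spill/query-2/block-0",
)
```

The caller reads the rows directly from a small response and fetches them from the spill location when the response only carries a pointer.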

If you have data in sources other than Amazon S3, you can use __________ to query the data in place

Athena Federated Query

Where else can Athena Federated Query be used?

It can be used to build pipelines that extract data from multiple data sources and store them in Amazon S3.

With Athena Federated Query, you can run SQL queries across data stored in _______________

relational (SQL), non-relational (NoSQL), object (S3), and custom data sources.

How does Athena run Federated Queries

Athena uses data source connectors that run on AWS Lambda to run federated queries.

What is a data source connector?

A data source connector is a piece of code that can translate between your target data source and Athena.

Data source connector can be deemed as an extension of _______________

Athena Query Engine

List of Prebuilt Athena data source connectors

1. Amazon CloudWatch Logs
2. Amazon DynamoDB
3. Amazon DocumentDB
4. Amazon RDS, and JDBC-compliant relational data sources such as MySQL and PostgreSQL

The prebuilt connectors are available under the Apache 2.0 license.

Using ____________ you can write custom connectors.

Athena Query Federation SDK

To choose, configure, and deploy a data source connector to your account, you can use ____________

1. Athena or Lambda consoles | 2. AWS Serverless Application Repository

Can you include multiple catalogs from multiple data sources in a single query?

Yes.

How can you include multiple catalogs from multiple data sources in the same query?

Using Athena Federated Queries.

How is an Athena Federated Query executed once a query is submitted against a data source?

Athena invokes the corresponding connector to identify the parts of the tables that need to be read, manages parallelism, and pushes down filter predicates.

Connectors use _____________ as the format for returning data requested in a query, which enables connectors to be implemented in languages such as __________

Apache Arrow; | C, C++, Java, Python, and Rust

Connectors are processed in __________

Lambda

Athena Federated Query is supported only on ____________

Athena engine version 2.

How can you use views with federated data sources?

You cannot use views with federated data sources.

To control access to data catalogs, use _______________

resource-level IAM permissions or identity-based IAM policies.

Athena uses an approach known as _________ for schema reading

schema-on-read, which means a schema is projected on to your data at the time you run a query.
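
A small Python sketch can make schema-on-read concrete. The raw text, column names, and types below are hypothetical; the point is only that the schema lives apart from the data and is applied when the data is read.

```python
import csv
import io

# Raw data as it might sit in an object; nothing in the text declares types.
RAW = "1,widget,9.99\n2,gadget,24.50\n"

# Hypothetical schema, kept separately from the data (as table metadata would be).
SCHEMA = [("id", int), ("name", str), ("price", float)]


def read_with_schema(raw_text, schema):
    """Apply column names and types to raw rows at read time."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value) for (name, cast), value in zip(schema, fields)})
    return rows


rows = read_with_schema(RAW, SCHEMA)
# The same text can be read under a different schema without being rewritten.
as_text = read_with_schema(RAW, [("id", str), ("name", str), ("price", str)])
```

Changing the schema changes how the rows are interpreted, while the stored data stays exactly as it was.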

______ does not modify your data in Amazon S3.

Athena

How can Athena query a previous version of an object in an S3 bucket?

Athena cannot query previous versions. It can only query current versions.

Athena supports querying objects that are stored with __________ in the same bucket specified by the LOCATION clause.

multiple storage classes

Athena supports querying data in ____________

Requester Pays buckets.

Athena does not support querying the data in the ____________ storage classes.

S3 Glacier or S3 Glacier Deep Archive

All tables in Athena are _________. Only tables with the _________ keyword can be created.

EXTERNAL; EXTERNAL

If you are interacting with Apache Spark, then your table names and table column names must be _________

lowercase.

Special characters other than __________ are not supported for Athena databases, tables or column names.

underscore (_)
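
As a quick way to screen names before writing DDL, the following Python sketch checks a name against the rule on this card. The pattern is an assumption made for illustration (letters, digits, and underscores only); it does not cover other naming rules Athena may apply.

```python
import re

# Assumption for this sketch: only letters, digits, and underscores are accepted.
NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_supported_name(name):
    """Return True when the name contains no special character other than underscore."""
    return NAME_PATTERN.fullmatch(name) is not None


checks = {name: is_supported_name(name) for name in ["web_logs", "web-logs", "web logs", "logs2021"]}
```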

Specifying location for Athena tables

1. A path to an Amazon S3 folder: s3://bucketname/folder/
2. An Amazon S3 access point alias: s3://access-point-name-metadata-s3alias/folder/

Your source data may be grouped into Amazon S3 folders called _________ based on a set of columns.

partitions
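
One common way to lay out such folders is the Hive-style column=value convention. Assuming that layout (the card itself does not specify one), the Python sketch below pulls the partition columns out of a hypothetical object key.

```python
def partition_values(object_key, table_prefix):
    """Extract column=value partition pairs from an object key under a table prefix."""
    relative = object_key[len(table_prefix):]
    values = {}
    for segment in relative.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            column, value = segment.split("=", 1)
            values[column] = value
    return values


parts = partition_values(
    "logs/year=2021/month=03/day=01/part-0000.parquet",  # hypothetical object key
    "logs/",
)
```

A query that filters on these columns only needs to read the folders whose values match.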

If the S3 path is in _______, MSCK REPAIR TABLE doesn't add the partitions to the AWS Glue Data Catalog.

camel case

What is partition projection?

In partition projection, partition values and locations are calculated from configuration rather than read from a repository like the AWS Glue Data Catalog.
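
To illustrate "calculated from configuration", here is a minimal Python sketch that derives partition values and locations from a date range and a location template, with no catalog lookup. The configuration keys, bucket, and column name are hypothetical; the ${column} placeholder is modeled loosely on the storage location template used in projection settings.

```python
from datetime import date, timedelta

# Hypothetical projection configuration for one date-typed partition column.
CONFIG = {
    "column": "dt",
    "range_start": date(2021, 3, 1),
    "range_end": date(2021, 3, 3),
    "location_template": "s3://example-bucket/logs/${dt}/",
}


def project_partitions(config):
    """Compute (value, location) pairs from configuration alone."""
    placeholder = "${" + config["column"] + "}"
    current = config["range_start"]
    projected = []
    while current <= config["range_end"]:
        value = current.isoformat()
        location = config["location_template"].replace(placeholder, value)
        projected.append((value, location))
        current += timedelta(days=1)
    return projected


partitions = project_partitions(CONFIG)
```

Because the list is computed in memory, a location appears in it whether or not any objects exist under that path.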

How can partition projection reduce the runtime of queries?

Because in-memory operations are often faster than remote operations, partition projection can reduce the runtime of queries against highly partitioned tables.

_______ eliminates the need to specify partitions manually in AWS Glue or an external Hive metastore.

Partition projection

What happens if a projected partition does not exist in Amazon S3?

If a projected partition does not exist in Amazon S3, Athena will still project the partition. Athena does not throw an error, but no data is returned.

What happens if too many projected partitions are empty?

If too many of your partitions are empty, performance can be slower compared to traditional AWS Glue partitions. If more than half of your projected partitions are empty, it is recommended that you use traditional partitions.

Partition projection is usable only when the table is queried through ______. If the same table is read through another service such as Amazon Redshift Spectrum or Amazon EMR, the __________ is used.

Athena; standard partition metadata

Athena Flashcards

(72 cards)