Amazon Athena | Creating Tables, Data Formats and Partitions Flashcards
Can I use Amazon Athena to query data that I process using Amazon EMR?
Creating Tables, Data Formats and Partitions
Amazon Athena | Analytics
Yes, Amazon Athena supports many of the same data formats as Amazon EMR. Athena’s data catalog is Hive metastore compatible. If you’re using EMR and already have a Hive metastore, you simply execute your DDL statements on Amazon Athena, and then you can start querying your data right away without impacting your Amazon EMR jobs.
How do I create tables and schemas for my data on Amazon S3?
Amazon Athena uses Apache Hive DDL to define tables. You can run DDL statements using the Athena console, via a JDBC driver, via the API, or using the Athena create table wizard. If you use the AWS Glue Data Catalog with Athena, you can also use Glue crawlers to automatically infer schemas and partitions. An AWS Glue crawler connects to a data store, progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can run periodically to detect the availability of new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions. You can customize Glue crawlers to classify your own file types.
When you create a new table schema in Amazon Athena the schema is stored in the Data Catalog and used when executing queries, but it does not modify your data in S3. Athena uses an approach known as schema-on-read, which allows you to project your schema onto your data at the time you execute a query. This eliminates the need for any data loading or transformation. Learn more about creating tables.
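As a minimal sketch of schema-on-read, the DDL below defines a table over CSV files that already exist in S3 (the bucket, table, and column names are hypothetical). Running it stores metadata in the Data Catalog but does not touch the underlying objects:

```sql
-- Hypothetical example: project a schema onto existing CSV files in S3.
-- No data is loaded or moved; Athena applies this schema at query time.
CREATE EXTERNAL TABLE IF NOT EXISTS sales (
  order_id   STRING,
  amount     DOUBLE,
  order_date STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/sales/';
```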
What data formats does Amazon Athena support?
Amazon Athena supports a wide variety of data formats such as CSV, TSV, JSON, and text files, and also supports open-source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats. By compressing, partitioning, and using columnar formats you can improve performance and reduce your costs.
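One common way to move from a row-based format to a columnar one is a CREATE TABLE AS SELECT (CTAS) statement. The sketch below assumes a CSV-backed table named `sales` already exists; the table name, bucket path, and compression choice are illustrative:

```sql
-- Hypothetical example: rewrite a CSV-backed table as Snappy-compressed
-- Parquet so queries scan fewer bytes (columnar layout + compression).
CREATE TABLE sales_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://my-example-bucket/sales-parquet/'
) AS
SELECT * FROM sales;
```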
What kind of data types does Amazon Athena support?
Amazon Athena supports both simple data types such as INTEGER, DOUBLE, and VARCHAR, and complex data types such as MAP, ARRAY, and STRUCT.
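To illustrate the complex types, the sketch below defines a table whose columns use STRUCT, ARRAY, and MAP (all names and the JSON layout are hypothetical):

```sql
-- Hypothetical example: a table mixing simple and complex types
-- over JSON data in S3.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  event_id   INT,
  user       STRUCT<name: STRING, age: INT>,
  tags       ARRAY<STRING>,
  attributes MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-example-bucket/events/';
```

Nested fields are then addressed directly in queries, e.g. `SELECT user.name, attributes['color'] FROM events`.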
Can I run any Hive Query on Athena?
Amazon Athena uses Hive only for DDL (Data Definition Language) — that is, for the creation, modification, and deletion of tables and/or partitions. See the Athena documentation for a complete list of supported statements. Athena uses Presto when you run SQL queries on Amazon S3. You can run ANSI-compliant SQL SELECT statements to query your data in Amazon S3.
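For example, a standard analytical SELECT like the following (against the hypothetical `sales` table used earlier) is executed by Presto, not Hive:

```sql
-- Hypothetical example: an ANSI SQL query over data in S3.
SELECT order_date, SUM(amount) AS total
FROM sales
WHERE amount > 100
GROUP BY order_date
ORDER BY total DESC
LIMIT 10;
```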
What is a SerDe?
SerDe stands for Serializer/Deserializer: a library that tells Hive how to interpret a data format. Hive DDL statements require you to specify a SerDe so that the system knows how to interpret the data that you’re pointing to. Amazon Athena uses SerDes to interpret the data read from Amazon S3. The concept of SerDes in Athena is the same as the concept used in Hive. Amazon Athena supports the following SerDes:
Apache Web Logs: “org.apache.hadoop.hive.serde2.RegexSerDe”
CSV: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
TSV: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
Custom Delimiters: “org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe”
Parquet: “org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe”
ORC: “org.apache.hadoop.hive.ql.io.orc.OrcSerde”
JSON: “org.apache.hive.hcatalog.data.JsonSerDe” or “org.openx.data.jsonserde.JsonSerDe”
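A SerDe is named in the ROW FORMAT SERDE clause of the table DDL. The sketch below uses the RegexSerDe from the list above to parse Apache web logs; the table name, columns, and regular expression are illustrative only, not a complete Apache log pattern:

```sql
-- Hypothetical example: use RegexSerDe to parse log lines into columns.
-- The regex here is a simplified illustration, not a full Apache log format.
CREATE EXTERNAL TABLE IF NOT EXISTS apache_logs (
  client_ip STRING,
  request   STRING,
  status    STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex' = '([^ ]*) .* "([^"]*)" ([0-9]*).*'
)
LOCATION 's3://my-example-bucket/logs/';
```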
Can I add my own SerDe (Serializer/Deserializer) to Amazon Athena?
Currently, you cannot add your own SerDe to Amazon Athena. We appreciate your feedback, so if there are any SerDes you would like to see added, please contact the Athena team at Athena-feedback@amazon.com.
I created Parquet/ORC files using Spark/Hive. Will I be able to query them via Athena?
Yes, Parquet and ORC files created via Spark can be read in Athena.
I have data coming from Kinesis Firehose. How can I query it using Athena?
If your Kinesis Firehose data is stored in Amazon S3, you can query it using Amazon Athena. Simply create a schema for your data in Athena and start querying. We recommend that you organize the data into partitions to optimize performance. You can add partitions created by Kinesis Firehose using ALTER TABLE DDL statements. Learn more about partitions.
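Kinesis Firehose writes objects under date-based S3 prefixes, so each prefix can be registered as a partition. The sketch below assumes a hypothetical table `firehose_events` partitioned by year, month, and day, with the bucket path and dates as placeholders:

```sql
-- Hypothetical example: register one Firehose date prefix as a partition.
ALTER TABLE firehose_events ADD IF NOT EXISTS
PARTITION (year = '2023', month = '01', day = '15')
LOCATION 's3://my-example-bucket/firehose/2023/01/15/';
```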
Does Amazon Athena support data partitioning?
Yes. Amazon Athena allows you to partition your data on any column. Partitions allow you to limit the amount of data each query scans, leading to cost savings and faster performance. You can specify your partitioning scheme using the PARTITIONED BY clause in the CREATE TABLE statement. Learn more about partitioning data.
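As a sketch, the PARTITIONED BY clause looks like this (table, column, and bucket names are hypothetical); note that the partition column is declared outside the regular column list:

```sql
-- Hypothetical example: a table partitioned by a date string.
-- Queries that filter on dt scan only the matching S3 prefixes.
CREATE EXTERNAL TABLE IF NOT EXISTS logs (
  request_id STRING,
  status     INT
)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://my-example-bucket/logs/';
```

A query such as `SELECT count(*) FROM logs WHERE dt = '2023-01-15'` would then read only the data under that partition's prefix.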
How do I add new data to an existing table in Amazon Athena?
If your data is partitioned, you will need to run a metadata query (ALTER TABLE ADD PARTITION) to add each partition to Athena once new data becomes available on Amazon S3. If your data is not partitioned, simply adding the new data (or files) to the existing prefix makes it available to Athena automatically. Learn more about partitioning data.