Amazon EMR | Kinesis Connector Flashcards
Which versions of HBase are supported on Amazon EMR?
Kinesis Connector
Amazon EMR | Analytics
Amazon EMR supports HBase 0.94.7 and HBase 0.92.0. To use HBase 0.94.7 you must specify AMI version 3.0.0. If you are using the CLI you must use version 2013-10-07 or later.
What does EMR Connector to Kinesis enable?
Kinesis Connector
Amazon EMR | Analytics
The connector enables EMR to directly read and query data from Kinesis streams. You can now perform batch processing of Kinesis streams using existing Hadoop ecosystem tools such as Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.
What does the EMR connector to Kinesis enable that I couldn’t have done before?
Kinesis Connector
Amazon EMR | Analytics
Reading and processing data from a Kinesis stream would require you to write, deploy and maintain independent stream processing applications. These take time and effort. However, with this connector, you can start reading and analyzing a Kinesis stream by writing a simple Hive or Pig script. This means you can analyze Kinesis streams using SQL! Of course, other Hadoop ecosystem tools could be used as well. You don’t need to developed or maintain a new set of processing applications.
Who will find this functionality useful?
Kinesis Connector
Amazon EMR | Analytics
The following types of users will find this integration useful:
Hadoop users who are interested in utilizing the extensive set of Hadoop ecosystem tools to analyze Kinesis streams.
Kinesis users who are looking for an easy way to get up and running with stream processing and ETL of Kinesis data.
Business analysts and IT professionals who would like to perform ad-hoc analysis of data in Kinesis streams using familiar tools like SQL (via Hive) or scripting languages like Pig.
What are some use cases for this integration?
Kinesis Connector
Amazon EMR | Analytics
The following are representative use cases are enabled by this integration:
Streaming Log Analysis: You can analyze streaming web logs to generate a list of top 10 error type every few minutes by region, browser, and access domains.
Complex Data Processing Workflows: You can join Kinesis stream with data stored in S3, Dynamo DB tables, and HDFS. You can write queries that join clickstream data from Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads that are displayed on particular websites.
Ad-hoc Queries: You can periodically load data from Kinesis into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.
What EMR AMI version do I need to be able to use the connector?
Kinesis Connector
Amazon EMR | Analytics
You need to use EMR’s AMI version 3.0.4 and later.
Is this connector a stand-alone tool?
Kinesis Connector
Amazon EMR | Analytics
No, it is a built in component of the Amazon distribution of Hadoop and is present on EMR AMI versions 3.0.4 and later. Customer simply needs to spin up a cluster with AMI version 3.0.4 or later to start using this feature.
What data format is required to allow EMR to read from a Kinesis stream?
Kinesis Connector
Amazon EMR | Analytics
The EMR Kinesis integration is not data format specific. You can read data in any format. Individual Kinesis records are presented to Hadoop as standard records that can be read using any Hadoop MapReduce framework. Individual frameworks like Hive, Pig and Cascading have built in components that help with serialization and deserialization, making it easy for developers to query data from many formats without having to implement custom code. For example, in Hive users can read data from JSON files, XML files and SEQ files by specifying the appropriate Hive SerDe when they define a table. Pig has a similar component called Loadfunc/Evalfunc and Cascading has a similar component called a Tap. Hadoop users can leverage the extensive ecosystem of Hadoop adapters without having to write format specific code. You can also implement custom deserialization formats to read domain specific data in any of these tools.
How do I analyze a Kinesis stream using Hive in EMR?
Kinesis Connector
Amazon EMR | Analytics
Create a table that references a Kinesis stream. You can then analyze the table like any other table in Hive. Please see our tutorials for page more details.
Using Hive, how do I create queries that combine Kinesis stream data with other data source?
Kinesis Connector
Amazon EMR | Analytics
First create a table that references a Kinesis stream. Once a Hive table has been created, you can join it with tables mapping to other data sources such as Amazon S3, Amazon Dynamo DB, and HDFS. This effectively results in joining data from Kinesis stream to other data sources.
Is this integration only available for Hive?
Kinesis Connector
Amazon EMR | Analytics
No, you can use Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.
How do I setup scheduled jobs to run on a Kinesis stream?
Kinesis Connector
Amazon EMR | Analytics
The EMR Kinesis input connector provides features that help you configure and manage scheduled periodic jobs in traditional scheduling engines such as Cron. For example, you can develop a Hive script that runs every N minutes. In the configuration parameters for a job, you can specify a Logical Name for the job. The Logical Name is a label that will inform the EMR Kinesis input connector that individual instances of the job are members of the same periodic schedule. The Logical Name allows the process to take advantage of iterations, which are explained next.
Since MapReduce is a batch processing framework, to analyze a Kinesis stream using EMR, the continuous stream is divided in to batches. Each batch is called an Iteration. Each Iteration is assigned a number, starting with 0. Each Iteration’s boundaries are defined by a start sequence number and end sequence number. Iterations are then processed sequentially by EMR.
In the event of an attempt’s failure, the EMR Kinesis input connector will re-try the iteration within the Logical Name from the known start sequence number of the iteration. This functionality ensures that successive attempts on the same iteration will have precisely the same input records from the Kinesis stream as the previous attempts. This guarantees idempotent (consistent) processing of a Kinesis stream.
You can specify Logical Names and Iterations as runtime parameters in your respective Hadoop tools. For example, in the tutorial section “Running queries with checkpoints”, the code sample shows a scheduled Hive query that designates a Logical Name for the query and increments the iteration with each successive run of the job.
Additionally, a sample cron scheduling script is provided in the tutorials.
Where is the metadata for Logical Names and Iterations stored?
Kinesis Connector
Amazon EMR | Analytics
The metadata that allows the EMR Kinesis input connector to work in scheduled periodic workflows is stored in Amazon DynamoDB. You must provision an Amazon Dynamo DB table and specify it as an input parameter to the Hadoop Job. It is important that you configure appropriate IOPS for the table to enable this integration. Please refer to the getting started tutorial for more information on setting up your Amazon Dynamo DB table.
What happens when an iteration processing fails?
Kinesis Connector
Amazon EMR | Analytics
Iterations identifiers are user-provided values that map to specific boundary (start and end sequence numbers) in a Kinesis stream. Data corresponding to these boundaries is loaded in the Map phase of the MapReduce job. This phase is managed by the framework and will be automatically re-run (three times by default) in case of job failure. If all the retries fail, you would still have options to retry the processing starting from last successful data boundary or past data boundaries. This behavior is controlled by providing kinesis.checkpoint.iteration.no parameter during processing. Please refer to the getting started tutorial for more information on how this value is configured for different tools in the Hadoop ecosystem.
Can I run multiple queries on the same iteration?
Kinesis Connector
Amazon EMR | Analytics
Yes, you can specify a previously run iteration by setting the kinesis.checkpoint.iteration.no parameter in successive processing. The implementation ensures that successive runs on the same iteration will have precisely the same input records from the Kinesis stream as the previous runs.