Analytics | Amazon EMR Flashcards

Question

Does Amazon EMR support third-party software packages? Developing Amazon EMR | Analytics

Answer 1

Yes. The recommended way to install third-party software packages on your cluster is to use Bootstrap Actions. Alternatively you can package any third party libraries directly into your Mapper or Reducer executable. You can also upload statically compiled executables using the Hadoop distributed cache mechanism.

Answer 2

For the latest versions supported by Amazon EMR, please reference the documentation.

Answer 3

Yes. Amazon EMR is active with the open source community and contributes many fixes back to the Hadoop source.

Answer 4

Amazon EMR periodically updates its supported version of Hadoop based on the Hadoop releases by the community. Amazon EMR may choose to skip some Hadoop releases.

Answer 5

Amazon EMR service retires support for old Hadoop versions several months after deprecation. However, Amazon EMR APIs are backward compatible, so if you build tools on top of these APIs, they will work even when Amazon EMR updates the Hadoop version it’s using.

Answer 6

You first select the cluster you want to debug, then click on the "Debug" button to access the debug a cluster window in the AWS Management Console. This will enable you to track progress and identify issues in steps, jobs, tasks, or task attempts of your clusters. Alternatively you can SSH directly into the Amazon Elastic Compute Cloud (Amazon EC2) instances that are running your cluster and use your favorite command-line debugger to troubleshoot the cluster.

Answer 7

The cluster debug tool is a part of the AWS Management Console where you can track progress and identify issues in steps, jobs, tasks, or task attempts of your clusters. To access the cluster debug tool, first select the cluster you want to debug and then click on the "Debug" button.

Answer 8

To enable debugging you need to set "Enable Debugging" flag when you create a cluster in the AWS Management Console. Alternatively, you can pass the --enable-debugging and --log-uri flags in the Command Line Client when creating a cluster.

Answer 9

Please reference the AWS Management Console section of the Developer’s Guide for instructions on how to access and use the debug a cluster window.

Answer 10

You can debug all types of clusters currently supported by Amazon EMR including custom jar, streaming, Hive, and Pig.

Answer 11

Amazon EMR stores state information about Hadoop jobs, tasks and task attempts under your account in Amazon SimpleDB. You can subscribe to Amazon SimpleDB here.

Answer 12

You will be able to browse cluster steps and step logs but will not be able to browse Hadoop jobs, tasks, or task attempts if you are not subscribed to Amazon SimpeDB.

Answer 13

Yes. You can delete Amazon SimpleDB domains that Amazon EMR created on your behalf. Please reference the Amazon SimpleDB documentation for instructions.

Answer 14

You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.

Answer 15

Hadoop system logs as well as user logs will be placed in the Amazon S3 bucket which you specify when creating a cluster.

Answer 16

No. At this time Amazon EMR does not compress logs as it moves them to Amazon S3.

Answer 17

Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon EMR also provides Hive-based access to data in DynamoDB.

Answer 18

No. As each cluster and input data is different, we cannot estimate your job duration.

Answer 19

As with the rest of AWS, you pay only for what you use. There is no minimum fee and there are no up-front commitments or long-term contracts. Amazon EMR pricing is in addition to normal Amazon EC2 and Amazon S3 pricing. For Amazon EMR pricing information, please visit EMR's pricing page. Amazon EC2, Amazon S3 and Amazon SimpleDB charges are billed separately. Pricing for Amazon EMR is per-second consumed for each instance type (with a one-minute minimum), from the time cluster is requested until it is terminated. For additional details on Amazon EC2 Instance Types, Amazon EC2 Spot Pricing, Amazon EC2 Reserved Instances Pricing, Amazon S3 Pricing, or Amazon SimpleDB Pricing, follow the links below: Amazon EC2 Instance Types Amazon EC2 Reserved Instances Pricing Amazon EC2 Spot Instances Pricing Amazon S3 Pricing Amazon SimpleDB Pricing

Answer 20

Billing commences when Amazon EMR starts running your cluster. You are only charged for the resources actually consumed. For example, let’s say you launched 100 Amazon EC2 Standard Small instances for an Amazon EMR cluster, where the Amazon EMR cost is an incremental $0.015 per hour. The Amazon EC2 instances will begin booting immediately, but they won’t necessarily all start at the same moment. Amazon EMR will track when each instance starts and will check it into the cluster so that it can accept processing tasks. In the first 10 minutes after your launch request, Amazon EMR either starts your cluster (if all of your instances are available) or checks in as many instances as possible. Once the 10 minute mark has passed, Amazon EMR will start processing (and charging for) your cluster as soon as 90% of your requested instances are available. As the remaining 10% of your requested instances check in, Amazon EMR starts charging for those instances as well. So, in the above example, if all 100 of your requested instances are available 10 minutes after you kick off a launch request, you’ll be charged $1.50 per hour (100 \* $0.015) for as long as the cluster takes to complete. If only 90 of your requested instances were available at the 10 minute mark, you’d be charged $1.35 per hour (90 \* $0.015) for as long as this was the number of instances running your cluster. When the remaining 10 instances checked in, you’d be charged $1.50 per hour (100 \* $0.015) for as long as the balance of the cluster takes to complete. Each cluster will run until one of the following occurs: you terminate the cluster with the TerminateJobFlows API call (or an equivalent tool), the cluster shuts itself down, or the cluster is terminated due to software or hardware failure.

Answer 21

You can track your usage in the Billing & Cost Management Console.

Answer 22

On the AWS Management Console, every cluster has a Normalized Instance Hours column that displays the approximate number of compute hours the cluster has used, rounded up to the nearest hour. Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small usage = 1 hour normalized compute time. The following table outlines the normalization factor used to calculate normalized instance hours for the various instance sizes: Instance Size Normalization Factor Small 1 Medium 2 Large 4 Xlarge 8 2xlarge 16 4xlarge 32 8xlarge 64 For example, if you run a 10-node r3.8xlarge cluster for an hour, the total number of Normalized Instance Hours displayed on the console will be 640 (10 (number of nodes) x 64 (normalization factor) x 1 (number of hours tthat the cluster ran) = 640). This is an approximate number and should not be used for billing purposes. Please refer to the Billing & Cost Management Console for billable Amazon EMR usage. Note that we recently changed the normalization factor to accurately reflect the weights of the instances, and the normalization factor does not affect your monthly bill.

Answer 23

Yes. Amazon EMR seamlessly supports On-Demand, Spot, and Reserved Instances. Click here to learn more about Amazon EC2 Reserved Instances. Click here to learn more about Amazon EC2 Spot Instances.

Answer 24

Except as otherwise noted, our prices are exclusive of applicable taxes and duties, including VAT and applicable sales tax. For customers with a Japanese billing address, use of AWS services is subject to Japanese Consumption Tax. Learn more.

Answer 25

Amazon EMR starts your instances in two Amazon EC2 security groups, one for the master and another for the slaves. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The slaves start in a separate security group, which only allows interaction with the master instance. By default both security groups are set up to not allow access from external sources including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Click here to learn more about EC2 security groups.

Answer 26

Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon EMR customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon EMR always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data compression tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.

Answer 27

Yes. AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The AWS API call history produced by CloudTrail enables security analysis, resource change tracking, and compliance auditing. Learn more about CloudTrail at the AWS CloudTrail detail page, and turn it on via CloudTrail's AWS Management Console.

Answer 28

Amazon EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in the same zone improves performance of the jobs flows because it provides a higher data access rate. By default, Amazon EMR chooses the Availability Zone with the most available resources in which to run your cluster. However, you can specify another Availability Zone if required.

Answer 29

For a list of the supported Amazon EMR AWS regions, please visit the AWS Region Table for all AWS global infrastructure.

Answer 30

When creating a cluster, typically you should select the Region where your data is located.

Answer 31

Yes you can. If you transfer data from one region to the other you will be charged bandwidth charges. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.

Answer 32

The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support spot instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.

Answer 33

Customers upload their input data and a data processing application into Amazon S3. Amazon EMR then launches a number of Amazon EC2 instances as specified by the customer. The service begins the cluster execution while pulling the input data from Amazon S3 using S3N protocol into the launched Amazon EC2 instances. Once the cluster is finished, Amazon EMR transfers the output data to Amazon S3, where customers can then retrieve it or use as input in another cluster.

Answer 34

Amazon EMR uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, comprised of one master and multiple slaves. Amazon EMR runs Hadoop software on these instances. The master node divides input data into blocks, and distributes the processing of the blocks to the slave node. Each slave node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted and partitioned and sent to processes which apply the reducer function to it. These processes also run on the slave nodes. Finally, the output from the reducer tasks is collected in files. A single "cluster" may involve a sequence of such MapReduce steps.

Answer 35

Amazon EMR manages an Amazon EC2 cluster of compute instances using Amazon’s highly available, proven network infrastructure and datacenters. Amazon EMR uses industry proven, fault-tolerant Hadoop software as its data processing engine. Hadoop splits the data into multiple subsets and assigns each subset to more than one Amazon EC2 instance. So, if an Amazon EC2 instance fails to process one subset of data, the results of another Amazon EC2 instance can be used.

Answer 36

Amazon EMR starts resource provisioning of Amazon EC2 On-Demand instances almost immediately. If the instances are not available, Amazon EMR will keep trying to provision the resources for your cluster until they are provisioned or you cancel your request. The instance provisioning is done on a best-efforts basis and depends on the number of instances requested, time when the cluster is created, and total number of requests in the system. After resources have been provisioned, it typically takes fewer than 15 minutes to start processing. In order to guarantee capacity for your clusters at the time you need it, you may pay a one-time fee for Amazon EC2 Reserved Instances to reserve instance capacity in the cloud at a discounted hourly rate. Like On-Demand Instances, customers pay usage charges only for the time when their instances are running. In this way, Reserved Instances enable businesses with known instance requirements to maintain the elasticity and flexibility of On-Demand Instances, while also reducing their predictable usage costs even further.

Answer 37

Amazon EMR supports 12 EC2 instance types including Standard, High CPU, High Memory, Cluster Compute, High I/O, and High Storage. Standard Instances have memory to CPU ratios suitable for most general-purpose applications. High CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. High Memory instances offer large memory sizes for high throughput applications. Cluster Compute instances have proportionally high CPU with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. High Storage instances offer 48 TB of storage across 24 disks and are ideal for applications that require sequential access to very large data sets such as data warehousing and log processing. See the EMR pricing page for details on available instance types and pricing per region.

Answer 38

When choosing instance types, you should consider the characteristics of your application with regards to resource utilization and select the optimal instance family. One of the advantages of Amazon EMR with Amazon EC2 is that you pay only for what you use, which makes it convenient and inexpensive to test the performance of your clusters on different instance types and quantity. One effective way to determine the most appropriate instance type is to launch several small clusters and benchmark your clusters.

Answer 39

The number of instances to use in your cluster is application-dependent and should be based on both the amount of resources required to store and process your data and the acceptable amount of time for your job to complete. As a general guideline, we recommend that you limit 60% of your disk space to storing the data you will be processing, leaving the rest for intermediate output. Hence, given 3x replication on HDFS, if you were looking to process 5 TB on m1.xlarge instances, which have 1,690 GB of disk space, we recommend your cluster contains at least (5 TB \* 3) / (1,690 GB \* .6) = 15 m1.xlarge core nodes. You may want to increase this number if your job generates a high amount of intermediate data or has significant I/O requirements. You may also want to include additional task nodes to improve processing performance. See Amazon EC2 Instance Types for details on local instance storage for each instance type configuration.

Answer 40

The time to run your cluster will depend on several factors including the type of your cluster, the amount of input data, and the number and type of Amazon EC2 instances you choose for your cluster.

Answer 41

No. If the master node goes down, your cluster will be terminated and you’ll have to rerun your job. Amazon EMR currently does not support automatic failover of the master nodes or master node state recovery. In case of master node failure, the AWS Management console displays "The master node was terminated" message which is an indicator for you to start a new cluster. Customers can instrument check pointing in their clusters to save intermediate data (data created in the middle of a cluster that has not yet been reduced) on Amazon S3. This will allow resuming the cluster from the last check point in case of failure.

Answer 42

Yes. Amazon EMR is fault tolerant for slave failures and continues job execution if a slave node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.

Answer 43

Yes. You can SSH onto your cluster nodes and execute Hadoop commands directly from there. If you need to SSH into a slave node, you have to first SSH to the master node, and then SSH into the slave node.

Answer 44

At this time, Amazon EMR supports Debian/Lenny in 32 and 64 bit modes. We are always listening to customer feedback and will add more capabilities over time to help our customers solve their data crunching business problems.

Answer 45

Bootstrap Actions is a feature in Amazon EMR that provides users a way to run custom set-up prior to the execution of their cluster. Bootstrap Actions can be used to install software or configure instances before running your cluster. You can read more about bootstrap actions in EMR's Developer Guide.

Answer 46

You can write a Bootstrap Action script in any language already installed on the cluster instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a cluster. Please refer to the "Developer’s Guide": http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/ for details on how to use Bootstrap Actions.

Answer 47

The EMR default Hadoop configuration is appropriate for most workloads. However, based on your cluster’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your cluster tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your cluster on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.

Answer 48

Yes. Slave nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks and (2) task nodes, which only run Hadoop tasks. While a cluster is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or though the command line client. Please refer to the Resizing Running clusters section in the Developer’s Guide for details on how to modify the size of your running cluster.

Answer 49

As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your cluster completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis.

Answer 50

There are several scenarios where you may want to modify the number of slave nodes in a running cluster. If your cluster is running slower than expected, or timing requirements change, you can increase the number of core nodes to increase cluster performance. If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your cluster’s varying capacity requirements.

Answer 51

Yes. You may include a predefined step in your workflow that automatically resizes a cluster between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of slave nodes that will execute a given cluster step.

Answer 52

To create a new cluster that is visible to all IAM users within the EMR CLI: Add the --visible-to-all-users flag when you create the cluster. For example: elastic-mapreduce --create --visible-to-all-users. Within the Management Console, simply select "Visible to all IAM Users" on the Advanced Options pane of the Create cluster Wizard. To make an existing cluster visible to all IAM users you must use the EMR CLI. Use --set-visible-to-all-users and specify the cluster identifier. For example: elastic-mapreduce --set-visible-to-all-users true --jobflow j-xxxxxxx. This can only be done by the creator of the cluster. To learn more, see the Configuring User Permissions section of the EMR Developer Guide.

Answer 53

You can add tags to an active Amazon EMR cluster. An Amazon EMR cluster consists of Amazon EC2 instances, and a tag added to an Amazon EMR cluster will be propagated to each active Amazon EC2 instance in that cluster. You cannot add, edit, or remove tags from terminated clusters or terminated Amazon EC2 instances which were part of an active cluster.

Answer 54

No, Amazon EMR does not support resource-based permissions by tag. However, it is important to note that propagated tags to Amazon EC2 instances behave as normal Amazon EC2 tags. Therefore, an IAM Policy for Amazon EC2 will act on tags propagated from Amazon EMR if they match conditions in that policy.

Answer 55

You can add up to ten tags on an Amazon EMR cluster.

Answer 56

Yes, Amazon EMR propagates the tags added to a cluster to that cluster's underlying EC2 instances. If you add a tag to an Amazon EMR cluster, it will also appear on the related Amazon EC2 instances. Likewise, if you remove a tag from an Amazon EMR cluster, it will also be removed from its associated Amazon EC2 instances. However, if you are using IAM policies for Amazon EC2 and plan to use Amazon EMR's tagging functionality, you should make sure that permission to use the Amazon EC2 tagging APIs CreateTags and DeleteTags is granted.

Answer 57

Select the tags you would like to use in your AWS billing report here. Then, to see the cost of your combined resources, you can organize your billing information based on resources that have the same tag key values.

Answer 58

An Amazon EC2 instance associated with an Amazon EMR cluster will have two system tags: aws:elasticmapreduce:instance-group-role=CORE Key = instance-group role ; Value = [CORE or TASK] aws:elasticmapreduce:job-flow-id=j-12345678 Key = job-flow-id ; Value = [JobFlowID]

Answer 59

Yes, you can add or remove tags directly on Amazon EC2 instances that are part of an Amazon EMR cluster. However, we do not recommend doing this, because Amazon EMR’s tagging system will not sync the changes you make to an associated Amazon EC2 instance directly. We recommend that tags for Amazon EMR clusters be added and removed from the Amazon EMR console, CLI, or API to ensure that the cluster and its associated Amazon EC2 instances have the correct tags.

Answer 60

Most EC2 instances have fixed storage capacity attached to an instance, known as an "instance store". You can now add EBS volumes to the instances in your Amazon EMR cluster, allowing you to customize the storage on an instance. The feature also allows you to run Amazon EMR clusters on EBS-Only instance families such as the M4 and C4.

Answer 61

You will benefit by adding EBS volumes to an instance in the following scenarios: Your processing requirements are such that you need a large amount of HDFS (or local) storage that what is available today on an instance. With support for EBS volumes, you will be able to customize the storage capacity on an instance relative to the compute capacity that the instance provides. Optimizing the storage on an instance will allow you to save costs. You are running on an older generation instance family (such as the M1 and M2 family) and want to move to latest generation instance family but are constrained by the storage available per node on the next generation instance types. Now you can use any of the new generation instance type and add EBS volumes to optimize the storage. Internal benchmarks indicate that you can save cost and improve performance by moving from an older generation instance family (M1 or M2) to a new generation one (M4, C4 & R3). The Amazon EMR team recommends that you run your application to arrive at the right conclusion. You want to use or migrate to the next-generation EBS-Only M4 and C4 family.

Answer 62

Currently, Amazon EMR will delete volumes once the cluster is terminated. If you want to persist data outside the lifecycle of a cluster, consider using Amazon S3 as your data store.

Answer 63

Amazon EMR allows you to use different EBS Volume Types: General Purpose SSD (GP2), Magnetic and Provisioned IOPS (SSD).

Answer 64

Amazon EMR will delete the volumes once the EMR cluster is terminated.

Answer 65

Yes, You can add EBS volumes to instances that have an instance store.

Answer 66

No, currently you can only add EBS volumes when launching a cluster.

Answer 67

The EBS API allows you to Snapshot a cluster. However, Amazon EMR currently does not allow you to restore from a snapshot.

Answer 68

No, encrypted volumes are not supported in the current release.

Answer 69

Removing an attached volume from a running cluster will be treated as a node failure. Amazon EMR will replace the node and the EBS volume with each of the same.

Answer 70

Hive is an open source datawarehouse and analytics package that runs on top of Hadoop. Hive is operated by a SQL-based language called Hive QL that allows users to structure, summarize, and query data sources stored in Amazon S3. Hive QL goes beyond standard SQL, adding first-class support for map/reduce functions and complex extensible user-defined data types like Json and Thrift. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Hive allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3.

Answer 71

Using Hive with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon EMR. With Amazon EMR, you can turn your Hive applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.

Answer 72

Traditional RDBMS systems provide transaction semantics and ACID properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly. They provide for fast update of small amounts of data and for enforcement of referential integrity constraints. Typically they run on a single large machine and do not provide support for executing map and reduce functions on the table, nor do they typically support acting over complex user defined data types. In contrast, Hive executes SQL-like queries using MapReduce. Consequently, it is optimized for doing full table scans while running on a cluster of machines and is therefore able to process very large amounts of data. Hive provides partitioned tables, which allow it to scan a partition of a table rather than the whole table if that is appropriate for the query it is executing. Traditional RDMS systems are best for when transactional semantics and referential integrity are required and frequent small updates are performed. Hive is best for offline reporting, transformation, and analysis of large data sets; for example, performing click stream analysis of a large website or collection of websites. One of the common practices is to export data from RDBMS systems into Amazon S3 where offline analysis can be performed using Amazon EMR clusters running Hive.

Answer 73

The best place to start is to review our written documentation located here.

Answer 74

Yes. There are four new features which make Hive even more powerful when used with Amazon EMR, including: a/ The ability to load table partitions automatically from Amazon S3. Previously, to import a partitioned table you needed a separate alter table statement for each individual partition in the table. Amazon EMR a now includes a new statement type for the Hive language: "alter table recover partitions." This statement allows you to easily import tables concurrently into many clusters without having to maintain a shared meta-data store. Use this functionality to read from tables into which external processes are depositing data, for example log files. b/ The ability to specify an off-instance metadata store. By default, the metadata store where Hive stores its schema information is located on the master node and ceases to exist when the cluster terminates. This feature allows you to override the location of the metadata store to use, for example a MySQL instance that you already have running in EC2. c/ Writing data directly to Amazon S3. When writing data to tables in Amazon S3, the version of Hive installed in Amazon EMR writes directly to Amazon S3 without the use of temporary files. This produces a significant performance improvement but it means that HDFS and S3 from a Hive perspective behave differently. You cannot read and write within the same statement to the same table if that table is located in Amazon S3. If you want to update a table located in S3, then create a temporary table in the cluster’s local HDFS filesystem, write the results to that table, and then copy them to Amazon S3. d/ Accessing resources located in Amazon S3. The version of Hive installed in Amazon EMR allows you to reference resources such as scripts for custom map and reduce operations or additional libraries located in Amazon S3 directly from within your Hive script (e.g., add jar s3://elasticmapreduce/samples/hive-ads/libs/jsonserde.jar).

Answer 75

There are two types of clusters supported with Hive: interactive and batch. In an interactive mode a customer can start a cluster and run Hive scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Hive script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.

Answer 76

Both batch and interactive clusters can be started from AWS Management Console, EMR command line client, or APIs. Please refer to the Hive section in the Release Guide for more details on launching a Hive cluster.

Answer 77

Hive and PIG both provide high level data-processing languages with support for complex data types for operating on large datasets. The Hive language is a variant of SQL and so is more accessible to people already familiar with SQL and relational databases. Hive has support for partitioned tables which allow Amazon EMR clusters to pull down only the table partition relevant to the query being executed rather than doing a full table scan. Both PIG and Hive have query plan optimization. PIG is able to optimize across an entire scripts while Hive queries are optimized at the statement level. Ultimately the choice of whether to use Hive or PIG will depend on the exact requirements of the application domain and the preferences of the implementers and those writing queries.

Answer 78

Amazon EMR supports multiple versions of Hive, including version 0.11.0.

Answer 79

No. Hive does not support concurrently writing to tables. You should avoid concurrently writing to the same table or reading from a table while you are writing to it. Hive has non-deterministic behavior when reading and writing at the same time or writing and writing at the same time.

Answer 80

Yes. You can read data in Amazon S3 within a Hive script by having ‘create external table’ statements at the top of your script. You need a create table statement for each external resource that you access.

Answer 81

Amazon EMR provides a unique capability for you to use both methods. On the one hand one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate cluster tuned to the specific task sharing data sources stored in Amazon S3.

Answer 82

No. You must upload the script or jar to Amazon S3 or to the cluster’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.

Answer 83

Yes. You run a cluster in a manual termination mode so it will not terminate between Hive steps. To reduce the risk of data loss we recommend periodically persisting all of your important data in Amazon S3. It is good practice to regularly transfer your work to a new cluster to test you process for recovering from master node failure.

Answer 84

Yes. Hive scripts executed by multiple users on separate clusters may contain create external table statements to concurrently import source data residing in Amazon S3.

Answer 85

Yes. In the batch mode, steps are serialized. Multiple users can add Hive steps to the same cluster, however, the steps will be executed serially. In interactive mode, several users can be logged on to the same cluster and execute Hive statements concurrently.

Answer 86

Yes. Data can be shared using standard Amazon S3 sharing mechanism described here http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3\_ACLs.html

Answer 87

Yes. Hive provides JDBC drive, which can be used to programmatically execute Hive statements. To start a JDBC service in your cluster you need to pass an optional parameter in the Amazon EMR command line client. You also need to establish an SSH tunnel because the security group does not permit external connections.

Answer 88

We run a select set of packages from Debian/stable including security patches. We will upgrade a package whenever it gets upgraded in Debian/stable. The "r-recommended" package on our image is up to date with Debian/stable (http://packages.debian.org/search?keywords=r-recommended).

Answer 89

Yes. You can use Bootstrap Actions to install updates to packages on your clusters.

Answer 90

Yes. Simply define an external Hive table based on your DynamoDB table. You can then use Hive to analyze the data stored in DynamoDB and either load the results back into DynamoDB or archive them in Amazon S3. For more information please visit our Developer Guide.

Answer 91

Impala is an open source tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. Instead of using MapReduce, it leverages a massively parallel processing (MPP) engine similar to that found in traditional relational database management systems (RDBMS). With this architecture, you can query your data in HDFS or HBase tables very quickly, and leverage Hadoop’s ability to process diverse data types and provide schema at runtime. This lends Impala to interactive, low-latency analytics. In addition, Impala uses the Hive metastore to hold information about the input data, including the partition names and data types. Also, Impala on Amazon EMR requires AMIs running Hadoop 2.x or greater. Click here to learn more about Impala.

Answer 92

Similar to using Hive with Amazon EMR, leveraging Impala with Amazon EMR can implement sophisticated data-processing applications with SQL syntax. However, Impala is built to perform faster in certain use cases (see below). With Amazon EMR, you can use Impala as a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence. Here are three use cases: Use Impala instead of Hive on long-running clusters to perform ad hoc queries. Impala reduces interactive queries to seconds, making it an excellent tool for fast investigation. You could run Impala on the same cluster as your batch MapReduce workflows, use Impala on a long-running analytics cluster with Hive and Pig, or create a cluster specifically tuned for Impala queries. Use Impala instead of Hive for batch ETL jobs on transient Amazon EMR clusters. Impala is faster than Hive for many queries, which provides better performance for these workloads. Like Hive, Impala uses SQL, so queries can easily be modified from Hive to Impala. Use Impala in conjunction with a third party business intelligence tool. Connect a client ODBC or JDBC driver with your cluster to use Impala as an engine for powerful visualization tools and dashboards. Both batch and interactive Impala clusters can be created in Amazon EMR. For instance, you can have a long-running Amazon EMR cluster running Impala for ad hoc, interactive querying or use transient Impala clusters for quick ETL workflows.

Answer 93

Traditional relational database systems provide transaction semantics and database atomicity, consistency, isolation, and durability (ACID) properties. They also allow tables to be indexed and cached so that small amounts of data can be retrieved very quickly, provide for fast updates of small amounts of data, and for enforcement of referential integrity constraints. Typically, they run on a single large machine and do not provide support for acting over complex user defined data types. Impala uses a similar distributed query system to that found in RDBMSs, but queries data stored in HDFS and uses the Hive metastore to hold information about the input data. As with Hive, the schema for a query is provided at runtime, allowing for easier schema changes. Also, Impala can query a variety of complex data types and execute user defined functions. However, because Impala processes data in-memory, it is important to understand the hardware limitations of your cluster and optimize your queries for the best performance.

Answer 94

Impala executes SQL queries using a massively parallel processing (MPP) engine, while Hive executes SQL queries using MapReduce. Impala avoids Hive’s overhead from creating MapReduce jobs, giving it faster query times than Hive. However, Impala uses significant memory resources and the cluster’s available memory places a constraint on how much memory any query can consume. Hive is not limited in the same way, and can successfully process larger data sets with the same hardware. Generally, you should use Impala for fast, interactive queries, while Hive is better for ETL workloads on large datasets. Impala is built for speed and is great for ad hoc investigation, but requires a significant amount of memory to execute expensive queries or process very large datasets. Because of these limitations, Hive is recommended for workloads where speed is not as crucial as completion. Click here to view some performance benchmarks between Impala and Hive.

Answer 95

No, Impala requires Hadoop 2, and will not run on a cluster with an AMI running Hadoop 1.x.

Answer 96

For the best experience with Impala, we recommend using memory-optimized instances for your cluster. However, we have shown that there are performance gains over Hive when using standard instance types as well. We suggest reading our Performance Testing and Query Optimization section in the Amazon EMR Developer’s Guide to better estimate the memory resources your cluster will need with regards to your dataset and query types. The compression type, partitions, and the actual query (number of joins, result size, etc.) all play a role in the memory required. You can use the EXPLAIN statement to estimate the memory and other resources needed for an Impala query.

Answer 97

If you run out of memory, queries fail and the Impala daemon installed on the affected node shuts down. Amazon EMR then restarts the daemon on that node so that Impala will be ready to run another query. Your data in HDFS on the node remains available, because only the daemon running on the node shuts down, rather than the entire node itself. For ad hoc analysis with Impala, the query time can often be measured in seconds; therefore, if a query fails, you can discover the problem quickly and be able to submit a new query in quick succession.

Answer 98

Yes, Impala supports user defined functions (UDFs). You can write Impala specific UDFs in Java or C++. Also, you can modify UDFs or user-defined aggregate functions created for Hive for use with Impala. For information about Hive UDFs, click here.

Answer 99

Impala queries data in HDFS or in HBase tables.

Answer 100

Yes, you can set up a multitenant cluster with Impala and MapReduce. However, you should be sure to allot resources (memory, disk, and CPU) to each application using YARN on Hadoop 2.x. The resources allocated should be dependent on the needs for the jobs you plan to run on each application.

Answer 101

While you can use ODBC drivers, Impala is also a great engine for third-party tools connected through JDBC. You can download and install the Impala client JDBC driver from http://elasticmapreduce.s3.amazonaws.com/libs/impala/1.2.1/impala-jdbc-1.2.1.zip. From the client computer where you have your business intelligence tool installed, connect the JDBC driver to the master node of an Impala cluster using SSH or a VPN on port 21050. For more information, see Open an SSH Tunnel to the Master Node.

Answer 102

Pig is an open source analytics package that runs on top of Hadoop. Pig is operated by a SQL-like language called Pig Latin, which allows users to structure, summarize, and query data sources stored in Amazon S3. As well as SQL-like operations, Pig Latin also adds first-class support for map/reduce functions and complex extensible user defined data types. This capability allows processing of complex and even unstructured data sources such as text documents and log files. Pig allows user extensions via user-defined functions written in Java and deployed via storage in Amazon S3.

Answer 103

Using Pig with Amazon EMR, you can implement sophisticated data-processing applications with a familiar SQL-like language and easy to use tools available with Amazon EMR. With Amazon EMR, you can turn your Pig applications into a reliable data warehouse to execute tasks such as data analytics, monitoring, and business intelligence tasks.

Answer 104

The best place to start is to review our written documentation located here.

Answer 105

Yes. There are three new features which make Pig even more powerful when used with Amazon EMR, including: a/ Accessing multiple filesystems. By default a Pig job can only access one remote file system, be it an HDFS store or S3 bucket, for input, output and temporary data. EMR has extended Pig so that any job can access as many file systems as it wishes. An advantage of this is that temporary intra-job data is always stored on the local HDFS, leading to improved perfomance. b/ Loading resources from S3. EMR has extended Pig so that custom JARs and scripts can come from the S3 file system, for example "REGISTER s3:///my-bucket/piggybank.jar" c/ Additional Piggybank function for String and DateTime processing.

Answer 106

There are two types of clusters supported with Pig: interactive and batch. In an interactive mode a customer can start a cluster and run Pig scripts interactively directly on the master node. Typically, this mode is used to do ad hoc data analyses and for application development. In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the cluster. Typically, batch mode is used for repeatable runs such as report generation.

Answer 107

Both batch and interactive clusters can be started from AWS Management Console, EMR command line client, or APIs.

Answer 108

Amazon EMR supports multiple versions of Pig.

Answer 109

Yes, you are able to write to the same bucket from two concurrent clusters.

Answer 110

Yes, you are able to read the same data in S3 from two concurrent clusters.

Answer 111

Yes. Data can be shared using standard Amazon S3 sharing mechanism described here http://docs.amazonwebservices.com/AmazonS3/latest/index.html?S3\_ACLs.html

Answer 112

Amazon EMR provides a unique capability for you to use both methods. On the one hand one large cluster may be more efficient for processing regular batch workloads. On the other hand, if you require ad-hoc querying or workloads that vary with time, you may choose to create several separate cluster tuned to the specific task sharing data sources stored in Amazon S3.

Answer 113

No. You must upload the script or jar to Amazon S3 or to the cluster’s master node before it can be referenced. For uploading to Amazon S3 you can use tools including s3cmd, jets3t or S3Organizer.

Answer 114

Yes. You run a cluster in a manual termination mode so it will not terminate between Pig steps. To reduce the risk of data loss we recommend periodically persisting all important data in Amazon S3. It is good practice to regularly transfer your work to a new cluster to test you process for recovering from master node failure.

Answer 115

No. Pig does not support access through JDBC.

Answer 116

HBase is an open source, non-relational, distributed database modeled after Google's BigTable. It was developed as part of Apache Software Foundation's Hadoop project and runs on top of Hadoop Distributed File System(HDFS) to provide BigTable-like capabilities for Hadoop. HBase provides you a fault-tolerant, efficient way of storing large quantities of sparse data using column-based compression and storage. In addition, HBase provides fast lookup of data because data is stored in-memory instead of on disk. HBase is optimized for sequential write operations, and it is highly efficient for batch inserts, updates, and deletes. HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to Hadoop jobs. HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).

Answer 117

With Amazon EMR you can back up HBase to Amazon S3 (full or incremental, manual or automated) and you can restore from a previously created backup. Learn more about HBase and EMR.

Answer 118

Amazon EMR supports HBase 0.94.7 and HBase 0.92.0. To use HBase 0.94.7 you must specify AMI version 3.0.0. If you are using the CLI you must use version 2013-10-07 or later.

Answer 119

The connector enables EMR to directly read and query data from Kinesis streams. You can now perform batch processing of Kinesis streams using existing Hadoop ecosystem tools such as Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.

Answer 120

Reading and processing data from a Kinesis stream would require you to write, deploy and maintain independent stream processing applications. These take time and effort. However, with this connector, you can start reading and analyzing a Kinesis stream by writing a simple Hive or Pig script. This means you can analyze Kinesis streams using SQL! Of course, other Hadoop ecosystem tools could be used as well. You don’t need to developed or maintain a new set of processing applications.

Answer 121

The following types of users will find this integration useful: Hadoop users who are interested in utilizing the extensive set of Hadoop ecosystem tools to analyze Kinesis streams. Kinesis users who are looking for an easy way to get up and running with stream processing and ETL of Kinesis data. Business analysts and IT professionals who would like to perform ad-hoc analysis of data in Kinesis streams using familiar tools like SQL (via Hive) or scripting languages like Pig.

Answer 122

The following are representative use cases are enabled by this integration: Streaming Log Analysis: You can analyze streaming web logs to generate a list of top 10 error type every few minutes by region, browser, and access domains. Complex Data Processing Workflows: You can join Kinesis stream with data stored in S3, Dynamo DB tables, and HDFS. You can write queries that join clickstream data from Kinesis with advertising campaign information stored in a DynamoDB table to identify the most effective categories of ads that are displayed on particular websites. Ad-hoc Queries: You can periodically load data from Kinesis into HDFS and make it available as a local Impala table for fast, interactive, analytic queries.

Answer 123

You need to use EMR’s AMI version 3.0.4 and later.

Answer 124

No, it is a built in component of the Amazon distribution of Hadoop and is present on EMR AMI versions 3.0.4 and later. Customer simply needs to spin up a cluster with AMI version 3.0.4 or later to start using this feature.

Answer 125

The EMR Kinesis integration is not data format specific. You can read data in any format. Individual Kinesis records are presented to Hadoop as standard records that can be read using any Hadoop MapReduce framework. Individual frameworks like Hive, Pig and Cascading have built in components that help with serialization and deserialization, making it easy for developers to query data from many formats without having to implement custom code. For example, in Hive users can read data from JSON files, XML files and SEQ files by specifying the appropriate Hive SerDe when they define a table. Pig has a similar component called Loadfunc/Evalfunc and Cascading has a similar component called a Tap. Hadoop users can leverage the extensive ecosystem of Hadoop adapters without having to write format specific code. You can also implement custom deserialization formats to read domain specific data in any of these tools.

Answer 126

Create a table that references a Kinesis stream. You can then analyze the table like any other table in Hive. Please see our tutorials for page more details.

Answer 127

First create a table that references a Kinesis stream. Once a Hive table has been created, you can join it with tables mapping to other data sources such as Amazon S3, Amazon Dynamo DB, and HDFS. This effectively results in joining data from Kinesis stream to other data sources.

Answer 128

No, you can use Hive, Pig, MapReduce, Hadoop Streaming, and Cascading.

Answer 129

The EMR Kinesis input connector provides features that help you configure and manage scheduled periodic jobs in traditional scheduling engines such as Cron. For example, you can develop a Hive script that runs every N minutes. In the configuration parameters for a job, you can specify a Logical Name for the job. The Logical Name is a label that will inform the EMR Kinesis input connector that individual instances of the job are members of the same periodic schedule. The Logical Name allows the process to take advantage of iterations, which are explained next. Since MapReduce is a batch processing framework, to analyze a Kinesis stream using EMR, the continuous stream is divided in to batches. Each batch is called an Iteration. Each Iteration is assigned a number, starting with 0. Each Iteration’s boundaries are defined by a start sequence number and end sequence number. Iterations are then processed sequentially by EMR. In the event of an attempt’s failure, the EMR Kinesis input connector will re-try the iteration within the Logical Name from the known start sequence number of the iteration. This functionality ensures that successive attempts on the same iteration will have precisely the same input records from the Kinesis stream as the previous attempts. This guarantees idempotent (consistent) processing of a Kinesis stream. You can specify Logical Names and Iterations as runtime parameters in your respective Hadoop tools. For example, in the tutorial section "Running queries with checkpoints", the code sample shows a scheduled Hive query that designates a Logical Name for the query and increments the iteration with each successive run of the job. Additionally, a sample cron scheduling script is provided in the tutorials.

Answer 130

The metadata that allows the EMR Kinesis input connector to work in scheduled periodic workflows is stored in Amazon DynamoDB. You must provision an Amazon Dynamo DB table and specify it as an input parameter to the Hadoop Job. It is important that you configure appropriate IOPS for the table to enable this integration. Please refer to the getting started tutorial for more information on setting up your Amazon Dynamo DB table.

Answer 131

Iterations identifiers are user-provided values that map to specific boundary (start and end sequence numbers) in a Kinesis stream. Data corresponding to these boundaries is loaded in the Map phase of the MapReduce job. This phase is managed by the framework and will be automatically re-run (three times by default) in case of job failure. If all the retries fail, you would still have options to retry the processing starting from last successful data boundary or past data boundaries. This behavior is controlled by providing kinesis.checkpoint.iteration.no parameter during processing. Please refer to the getting started tutorial for more information on how this value is configured for different tools in the Hadoop ecosystem.

Answer 132

Yes, you can specify a previously run iteration by setting the kinesis.checkpoint.iteration.no parameter in successive processing. The implementation ensures that successive runs on the same iteration will have precisely the same input records from the Kinesis stream as the previous runs.

Answer 133

In the event that the beginning sequence number and/or end sequence number of an iteration belong to records that have expired from the Kinesis steam, the Hadoop job will fail. You would need to use a different Logical Name to process data from the beginning of the Kinesis stream.

Answer 134

No. The EMR Kinesis connector currently does not support writing data back into a Kinesis stream.

Answer 135

The Hadoop MapReduce framework is a batch processing system. As such, it does not support continuous queries. However there is an emerging set of Hadoop ecosystem frameworks like Twitter Storm and Spark Streaming that enable to developers build applications for continuous stream processing. A Storm connector for Kinesis is available at on GitHub here and you can find a tutorial explaining how to setup Spark Streaming on EMR and run continuous queries here. Additionally, developers can utilize the Kinesis client library to develop real-time stream processing applications. You can find more information on developing custom Kinesis applications in the Kinesis documentation here.

Answer 136

Yes. You can read streams from another AWS account by specifying the appropriate access credentials of the account that owns the Kinesis stream. By default, the Kinesis connector utilizes the user-supplied access credentials that are specified when the cluster is created. You can override these credentials to access streams from other AWS Accounts by setting the kinesis.accessKey and kinesis.secretKey parameters. The following examples show how to set the kinesis.accessKey and kinesis.secretKey parameters in Hive and Pig. Code sample for Hive: ... STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler' TBLPROPERTIES( "kinesis.accessKey"="AwsAccessKey", "kinesis.secretKey"="AwsSecretKey", ); Code sample for Pig: … raw\_logs = LOAD 'AccessLogStream' USING com.amazon.emr.kinesis.pig.Kin esisStreamLoader('kinesis.accessKey=AwsAccessKey', 'kinesis.secretKey=AwsSecretKey' ) AS (line:chararray);

Answer 137

Yes, a customer can run multiple parallel queries on the same stream by using separate logical names for each query. However, reading from a shard within a Kinesis stream is subjected to a rate limit of of 2MB/sec. Thus, if there are N parallel queries running on the same stream, each one would get roughly (2/N) MB/sec egress rate per shard on the stream. This may slow down the processing and in some cases fail the queries as well.

Answer 138

Yes, for example in Hive, you can create two tables mapping to two different Kinesis streams and create joins between the tables.

Answer 139

Yes. The implementation handles split and merge events. The Kinesis connector ties individual Kinesis shards (the logical unit of scale within a Kinesis stream) to Hadoop MapReduce map tasks. Each unique shard that exists within a stream in the logical period of an Iteration will result in exactly one map task. In the event of a shard split or merge event, Kinesis will provision new unique shard Ids. As a result, the MapReduce framework will provision more map tasks to read from Kinesis. All of this is transparent to the user.

Answer 140

The implementation allows you to configure a parameter called kinesis.nodata.timeout. For example, consider a scenario where kinesis.nodata.timeout is set to 2 minutes and you want to run a Hive query every 10 minutes. Additionally, consider some data has been written to the stream since the last iteration (10 minutes ago). However, currently no new records are arriving, i.e. there is a silence in the stream. In this case, when the current iteration of the query launches, the Kinesis connector would find that no new records are arriving. The connector will keep polling the stream for 2 minutes and if no records arrive for that interval then it will stop and process only those records that were already read in the current batch of stream. However, if new records start arriving before kinesis.nodata.timeout interval is up, then the connector will wait for an additional interval corresponding to a parameter called kinesis.iteration.timeout. Please look at the tutorials to see how to define these parameters.

Answer 141

In the event of a processing failure, you can utilize the same tools they currently do when debugging Hadoop Jobs. Including the Amazon EMR web console, which helps identify and access error logs. More details on debugging an EMR job can be found here.

Answer 142

The job would fail and the exception would show up in error logs for the job.

Answer 143

The job would fail and the exception would show up in error logs for the job.

Analytics | Amazon EMR Flashcards

(167 cards)