Amazon EMR | Developing Flashcards
How many clusters can I run simultaneously?
Developing
Amazon EMR | Analytics
You can start as many clusters as you like. You are limited to 20 instances across all your clusters. If you need more instances, complete the Amazon EC2 instance request form and your use case and instance increase will be considered. If your Amazon EC2 limit has already been raised, the new limit will be applied to your Amazon EMR clusters.
Where can I find code samples?
Check out the sample code in these Articles and Tutorials.
How do I develop a data processing application?
You can develop a data processing job on your desktop using, for example, Eclipse or NetBeans plug-ins such as IBM MapReduce Tools for Eclipse (http://www.alphaworks.ibm.com/tech/mapreducetools). These tools make it easy to develop and debug MapReduce jobs and test them locally on your machine. Alternatively, you can develop your job directly on Amazon EMR using a cluster of one or more instances.
What is the benefit of using the Command Line Tools or APIs vs. AWS Management Console?
The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running clusters, to create additional custom functionality around clusters (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon EMR customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your clusters directly from a web browser.
Can I add steps to a cluster that is already running?
Yes. Once the cluster is running, you can add more steps to it via the AddJobFlowSteps API, which appends the new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your cluster or for debugging.
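As a rough illustration, the call might look like the following Python sketch. It assumes the boto3 SDK, whose EMR client exposes AddJobFlowSteps as add_job_flow_steps; the helper names, the JAR location, and the cluster ID in the usage comment are hypothetical placeholders, not values from this FAQ.

```python
def build_jar_step(name, jar_uri, args):
    """Return one step definition in the shape AddJobFlowSteps accepts.

    jar_uri and args are placeholders for your own custom JAR and its
    command-line arguments.
    """
    return {
        "Name": name,
        # CONTINUE keeps the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": jar_uri, "Args": list(args)},
    }


def add_steps(emr_client, cluster_id, steps):
    """Append steps to the end of a running cluster's step sequence.

    emr_client is assumed to be a boto3 EMR client, for example
    boto3.client("emr"). Returns the IDs of the newly added steps.
    """
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=steps)
    return response["StepIds"]


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   step = build_jar_step("reprocess", "s3://my-bucket/my-job.jar", ["--date", "2024-01-01"])
#   add_steps(boto3.client("emr"), "j-XXXXXXXXXXXXX", [step])
```

Passing the client in as a parameter keeps the sketch usable with any credentials or region setup, and makes it easy to decide at run time which step to add next based on the result of an earlier one.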
Can I run a persistent cluster?
Yes. Amazon EMR clusters that are started with the --alive flag will continue running until explicitly terminated. This allows customers to add steps to a cluster on demand. You may want to use this to debug your application without having to repeatedly wait for cluster startup. You may also use a persistent cluster to run a long-running data warehouse. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.
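When launching through the API rather than the command line, the equivalent setting in the RunJobFlow request is KeepJobFlowAliveWhenNoSteps. The sketch below builds such a request for boto3's run_job_flow; the function name, release label, instance type, instance count, and role names are assumptions chosen for illustration and should be replaced with values valid for your account.

```python
def persistent_cluster_request(name, instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for a RunJobFlow call that keeps the cluster alive.

    All defaults here are illustrative assumptions, not recommendations.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumed release; check the documentation
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # The cluster stays up after its steps finish, so more steps
            # can be added on demand until it is explicitly terminated.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   request = persistent_cluster_request("debug-cluster")
#   cluster_id = boto3.client("emr").run_job_flow(**request)["JobFlowId"]
```

Remember that a persistent cluster accrues charges until you terminate it.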
Can I be notified when my cluster is finished?
You can sign up for Amazon SNS and have the cluster post to your SNS topic when it is finished. You can also view your cluster's progress in the AWS Management Console, or you can use the Command Line, SDK, or APIs to get the status of the cluster.
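One way to combine the two options is to poll the cluster status from your own script and publish to an SNS topic once the cluster reaches a terminal state. This is a client-side sketch, not the cluster posting to SNS itself. It assumes boto3 EMR and SNS clients (describe_cluster and publish); the function name, topic ARN, and polling interval are hypothetical.

```python
import time

# States after which a cluster will not run any more work.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_and_notify(emr_client, sns_client, cluster_id, topic_arn, poll_seconds=30):
    """Poll a cluster until it terminates, then publish its final state to SNS.

    emr_client and sns_client are assumed to be boto3 clients for "emr"
    and "sns". Returns the final cluster state.
    """
    while True:
        cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
        state = cluster["Status"]["State"]
        if state in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)

    sns_client.publish(
        TopicArn=topic_arn,
        Subject=f"EMR cluster {cluster_id} finished",
        Message=f"Cluster {cluster_id} ended in state {state}.",
    )
    return state


# Hypothetical usage (requires boto3 and AWS credentials):
#   import boto3
#   wait_and_notify(boto3.client("emr"), boto3.client("sns"),
#                   "j-XXXXXXXXXXXXX", "arn:aws:sns:us-east-1:123456789012:my-topic")
```

For a persistent cluster you would likely watch for the WAITING state (idle, all steps done) instead of termination.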
What programming languages does Amazon EMR support?
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer’s Guide for instructions on using Hadoop Streaming.
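To give a feel for Hadoop Streaming, here is a minimal word-count sketch in Python. Streaming passes input records to the mapper on standard input and expects tab-separated key/value lines on standard output; the reducer receives those lines sorted by key. The function names and the "map"/"reduce" argument convention are assumptions made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each word.

    Relies on the input being sorted by key, which Hadoop guarantees
    between the map and reduce phases.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def main(argv, stdin, stdout):
    """Run as mapper or reducer depending on the first argument."""
    role = argv[1] if len(argv) > 1 else "map"
    func = map_lines if role == "map" else reduce_lines
    for out in func(stdin):
        stdout.write(out + "\n")


# To use this file as a streaming script, call
#   main(sys.argv, sys.stdin, sys.stdout)
# under an `if __name__ == "__main__":` guard, then pass
# "wordcount.py map" as the mapper and "wordcount.py reduce" as the reducer.
```

You can test the same logic locally by piping a text file through the mapper, `sort`, and the reducer before running it on a cluster.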
What OS versions are supported with Amazon EMR?
At this time, Amazon EMR supports Debian/Squeeze in 32-bit and 64-bit modes.
Can I view the Hadoop UI while my cluster is running?
Yes. Please refer to the Hadoop UI section in the Developer’s Guide for instructions on how to access the Hadoop UI.
Does Amazon EMR support third-party software packages?
Yes. The recommended way to install third-party software packages on your cluster is to use Bootstrap Actions. Alternatively, you can package any third-party libraries directly into your Mapper or Reducer executable. You can also upload statically compiled executables using the Hadoop distributed cache mechanism.
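For reference, a bootstrap action is declared as part of the cluster launch request. The sketch below shows the structure that boto3's run_job_flow accepts in its BootstrapActions parameter; the helper name, script location, and arguments are hypothetical.

```python
def bootstrap_action(name, script_uri, args=()):
    """Describe one bootstrap action: a script run on each node at startup.

    script_uri is a placeholder for a script you have uploaded,
    typically to Amazon S3.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_uri, "Args": list(args)},
    }


# Hypothetical usage: add the action to an existing RunJobFlow request dict.
#   request["BootstrapActions"] = [
#       bootstrap_action("install-libs", "s3://my-bucket/install-libs.sh", ["--with-numpy"]),
#   ]
```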
Which Hadoop versions does Amazon EMR support?
For the latest versions supported by Amazon EMR, please refer to the documentation.
Does Amazon contribute Hadoop improvements to the open source community?
Yes. Amazon EMR is active with the open source community and contributes many fixes back to the Hadoop source.
Does Amazon EMR update the version of Hadoop it supports?
Amazon EMR periodically updates its supported version of Hadoop based on the Hadoop releases by the community. Amazon EMR may choose to skip some Hadoop releases.