Analytics | Amazon EMR Flashcards
What is Amazon EMR?
General
Amazon EMR | Analytics
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
What can I do with Amazon EMR?
General
Amazon EMR | Analytics
Using Amazon EMR, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon EMR lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Amazon EMR is ideal for problems that necessitate the fast and efficient processing of large amounts of data. The web service interfaces allow you to build processing workflows, and programmatically monitor progress of running clusters. In addition, you can use the simple web interface of the AWS Management Console to launch your clusters and monitor processing-intensive computation on clusters of Amazon EC2 instances.
Who can use Amazon EMR?
General
Amazon EMR | Analytics
Anyone who requires simple access to powerful data analysis can use Amazon EMR. You don’t need any software development experience to experiment with several sample applications available in the Developer Guide and on the AWS Big Data Blog.
What can I do with Amazon EMR that I could not do before?
General
Amazon EMR | Analytics
Amazon EMR significantly reduces the complexity of the time-consuming set-up, management, and tuning of Hadoop clusters or the compute capacity upon which they sit. You can instantly spin up large Hadoop clusters which will start processing within minutes, not hours or days. When your cluster finishes its processing, unless you specify otherwise, it will be automatically terminated so you are not paying for resources you no longer need.
Using this service you can quickly perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
As a software developer, you can also develop and run your own more sophisticated applications, allowing you to add functionality such as scheduling, workflows, monitoring, or other features.
What is the data processing engine behind Amazon EMR?
General
Amazon EMR | Analytics
Amazon EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named “MapReduce,” where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster. This framework has been widely used by developers, enterprises and startups and has proven to be a reliable software platform for processing up to petabytes of data on clusters of thousands of commodity machines.
What is an Amazon EMR cluster?
General
Amazon EMR | Analytics
Amazon EMR historically referred to an Amazon EMR cluster (and all processing steps assigned to it) as a “job flow”. Every cluster, or job flow, has a unique identifier that starts with “j-”.
What is a cluster step?
General
Amazon EMR | Analytics
A cluster step is a user-defined unit of processing, mapping roughly to one algorithm that manipulates the data. A step is a Hadoop MapReduce application implemented as a Java jar or a streaming program written in Java, Ruby, Perl, Python, PHP, R, or C++. For example, to count the frequency with which words appear in a document, and output them sorted by the count, the first step would be a MapReduce application which counts the occurrences of each word, and the second step would be a MapReduce application which sorts the output from the first step based on the counts.
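The two-step word count described above can be sketched locally in plain Python, with each function standing in for one MapReduce step. This is only an illustration of how the steps chain together, not how Amazon EMR executes them; the function names `count_words` and `sort_by_count` are hypothetical.

```python
from collections import Counter


def count_words(lines):
    """Step 1: count the occurrences of each word in the input lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return dict(counts)


def sort_by_count(counts):
    """Step 2: sort the output of step 1 by count, highest first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    document = ["the quick brown fox", "the lazy dog", "the fox"]
    # The output of the first step is the input of the second step.
    for word, count in sort_by_count(count_words(document)):
        print(f"{word}\t{count}")
```

On a real cluster, each step would read its input from and write its output to storage such as Amazon S3, rather than passing values in memory.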
What are different cluster states?
General
Amazon EMR | Analytics
STARTING – The cluster provisions, starts, and configures EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING – The cluster is in the process of shutting down.
TERMINATED – The cluster was shut down without error.
TERMINATED_WITH_ERRORS – The cluster was shut down with errors.
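A program that monitors a cluster typically polls its state until the cluster is idle or has shut down. The following is a minimal sketch of such a loop. The `get_state` callable is an assumption: the caller supplies it (for example, as a thin wrapper around an EMR API call) and it returns one of the state names listed above. Grouping the states into "terminal" ones is this sketch's own convention.

```python
import time

# States after which the cluster will not change again (sketch's own grouping).
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_state() until the cluster is WAITING or in a terminal state.

    get_state: caller-supplied callable returning a cluster state name.
    sleep: injectable so the loop can be exercised without real delays.
    """
    for _ in range(max_polls):
        state = get_state()
        if state == "WAITING" or state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("cluster did not settle within the polling budget")


if __name__ == "__main__":
    # Simulated state sequence instead of a real cluster.
    simulated = iter(["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"])
    print(wait_for_cluster(lambda: next(simulated), sleep=lambda seconds: None))
```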
What are different step states?
Launching a Cluster
Amazon EMR | Analytics
PENDING – The step is waiting to be run.
RUNNING – The step is currently running.
COMPLETED – The step completed successfully.
CANCELLED – The step was cancelled before running because an earlier step failed or cluster was terminated before it could run.
FAILED – The step failed while running.
How can I access Amazon EMR?
Launching a Cluster
Amazon EMR | Analytics
You can access Amazon EMR by using the AWS Management Console, Command Line Tools, SDKs, or the EMR API.
How can I launch a cluster?
Launching a Cluster
Amazon EMR | Analytics
You can launch a cluster through the AWS Management Console by filling out a simple cluster request form. In the request form, you specify the name of your cluster, the location in Amazon S3 of your input data, your processing application, your desired data output location, and the number and type of Amazon EC2 instances you’d like to use. Optionally, you can specify a location to store your cluster log files and an SSH key to log in to your cluster while it is running. Alternatively, you can launch a cluster using the RunJobFlow API or the ‘create’ command in the Command Line Tools.
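The fields of the request form map fairly directly onto a RunJobFlow request. The sketch below only assembles the request as a Python dictionary; it does not call AWS. The helper name `build_run_job_flow_request`, the step name, and all S3 paths are hypothetical placeholders. A real request needs further fields that the text above does not cover (such as a release label and IAM roles), and could then be sent with, for example, the `run_job_flow` method of boto3's EMR client.

```python
def build_run_job_flow_request(name, jar_uri, input_uri, output_uri,
                               instance_type, instance_count,
                               log_uri=None, key_name=None):
    """Assemble keyword arguments shaped like a RunJobFlow request."""
    instances = {
        "MasterInstanceType": instance_type,
        "SlaveInstanceType": instance_type,
        "InstanceCount": instance_count,
        # Shut the cluster down once its steps have finished.
        "KeepJobFlowAliveWhenNoSteps": False,
    }
    if key_name:
        instances["Ec2KeyName"] = key_name  # optional SSH key for logging in

    request = {
        "Name": name,
        "Instances": instances,
        "Steps": [{
            "Name": "process-input",  # hypothetical step name
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": jar_uri, "Args": [input_uri, output_uri]},
        }],
    }
    if log_uri:
        request["LogUri"] = log_uri  # optional location for cluster log files
    return request


if __name__ == "__main__":
    # All bucket names and paths below are placeholders.
    print(build_run_job_flow_request(
        name="example-cluster",
        jar_uri="s3://example-bucket/app.jar",
        input_uri="s3://example-bucket/input/",
        output_uri="s3://example-bucket/output/",
        instance_type="m1.xlarge",
        instance_count=3,
        log_uri="s3://example-bucket/logs/",
    ))
```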
How can I get started with Amazon EMR?
Launching a Cluster
Amazon EMR | Analytics
To sign up for Amazon EMR, click the “Sign Up Now” button on the Amazon EMR detail page http://aws.amazon.com/elasticmapreduce. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon EMR; if you are not already signed up for these services, you will be prompted to do so during the Amazon EMR sign-up process. After signing up, please refer to the Amazon EMR documentation, which includes our Getting Started Guide – the best place to get going with the service.
How can I terminate a cluster?
Launching a Cluster
Amazon EMR | Analytics
At any time, you can terminate a cluster via the AWS Management Console by selecting a cluster and clicking the “Terminate” button. Alternatively, you can use the TerminateJobFlows API. If you terminate a running cluster, any results that have not been persisted to Amazon S3 will be lost and all Amazon EC2 instances will be shut down.
Does Amazon EMR support multiple simultaneous clusters?
Launching a Cluster
Amazon EMR | Analytics
Yes. At any time, you can create a new cluster, even if you’re already running one or more clusters.
How many clusters can I run simultaneously?
Developing
Amazon EMR | Analytics
You can start as many clusters as you like, but by default you are limited to 20 instances across all your clusters. If you need more instances, complete the Amazon EC2 instance request form and your use case and instance increase will be considered. If your Amazon EC2 limit has already been raised, the new limit will be applied to your Amazon EMR clusters.
Where can I find code samples?
Developing
Amazon EMR | Analytics
Check out the sample code in these Articles and Tutorials.
How do I develop a data processing application?
Developing
Amazon EMR | Analytics
You can develop a data processing job on your desktop, for example, using Eclipse or NetBeans plug-ins such as IBM MapReduce Tools for Eclipse (http://www.alphaworks.ibm.com/tech/mapreducetools). These tools make it easy to develop and debug MapReduce jobs and test them locally on your machine. Additionally, you can develop your cluster directly on Amazon EMR using one or more instances.
What is the benefit of using the Command Line Tools or APIs vs. AWS Management Console?
Developing
Amazon EMR | Analytics
The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running clusters, to create additional custom functionality around clusters (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon EMR customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your clusters directly from a web browser.
Can I add steps to a cluster that is already running?
Developing
Amazon EMR | Analytics
Yes. Once the job is running, you can optionally add more steps to it via the AddJobFlowSteps API. The AddJobFlowSteps API will add new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your cluster or for debugging.
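As a rough illustration of the conditional logic mentioned above, the sketch below builds an AddJobFlowSteps-shaped request only when the previous step has completed. It does not call AWS. The helper name, the step name, and the cluster identifier are hypothetical; with boto3, the resulting dictionary could be passed to the EMR client's `add_job_flow_steps` method.

```python
def build_add_steps_request(cluster_id, previous_step_state, jar_uri, args):
    """Return an AddJobFlowSteps-shaped request, or None to add nothing.

    A follow-up step is only added when the previous step is COMPLETED.
    """
    if previous_step_state != "COMPLETED":
        return None
    return {
        "JobFlowId": cluster_id,
        "Steps": [{
            "Name": "follow-up-step",  # hypothetical step name
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {"Jar": jar_uri, "Args": list(args)},
        }],
    }


if __name__ == "__main__":
    # "j-EXAMPLE" and the S3 paths are placeholders, not real resources.
    print(build_add_steps_request(
        "j-EXAMPLE", "COMPLETED",
        "s3://example-bucket/sort.jar",
        ["s3://example-bucket/counts/", "s3://example-bucket/sorted/"],
    ))
```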
Can I run a persistent cluster?
Developing
Amazon EMR | Analytics
Yes. Amazon EMR clusters that are started with the --alive flag will continue until explicitly terminated. This allows customers to add steps to a cluster on demand. You may want to use this to debug your application without having to repeatedly wait for cluster startup. You may also use a persistent cluster to run a long-running data warehouse cluster. This can be combined with data warehouse and analytics packages that run on top of Hadoop, such as Hive and Pig.
Can I be notified when my cluster is finished?
Developing
Amazon EMR | Analytics
You can sign up for Amazon SNS and have the cluster post to your SNS topic when it is finished. You can also view your cluster progress on the AWS Management Console, or you can use the Command Line, SDK, or APIs to get the status of the cluster.
What programming languages does Amazon EMR support?
Developing
Amazon EMR | Analytics
You can use Java to implement Hadoop custom jars. Alternatively, you may use other languages including Perl, Python, Ruby, C++, PHP, and R via Hadoop Streaming. Please refer to the Developer’s Guide for instructions on using Hadoop Streaming.
What OS versions are supported with Amazon EMR?
Developing
Amazon EMR | Analytics
At this time Amazon EMR supports Debian/Squeeze in 32 and 64 bit modes.
Can I view the Hadoop UI while my cluster is running?
Developing
Amazon EMR | Analytics
Yes. Please refer to the Hadoop UI section in the Developer’s Guide for instructions on how to access the Hadoop UI.
Does Amazon EMR support third-party software packages?
Developing
Amazon EMR | Analytics
Yes. The recommended way to install third-party software packages on your cluster is to use Bootstrap Actions. Alternatively you can package any third party libraries directly into your Mapper or Reducer executable. You can also upload statically compiled executables using the Hadoop distributed cache mechanism.
Which Hadoop versions does Amazon EMR support?
Developing
Amazon EMR | Analytics
For the latest versions supported by Amazon EMR, please reference the documentation.
Does Amazon contribute Hadoop improvements to the open source community?
Developing
Amazon EMR | Analytics
Yes. Amazon EMR is active with the open source community and contributes many fixes back to the Hadoop source.
Does Amazon EMR update the version of Hadoop it supports?
Developing
Amazon EMR | Analytics
Amazon EMR periodically updates its supported version of Hadoop based on the Hadoop releases by the community. Amazon EMR may choose to skip some Hadoop releases.
How quickly does Amazon EMR retire support for old Hadoop versions?
Debugging
Amazon EMR | Analytics
The Amazon EMR service retires support for old Hadoop versions several months after deprecation. However, Amazon EMR APIs are backward compatible, so if you build tools on top of these APIs, they will work even when Amazon EMR updates the Hadoop version it’s using.
How can I debug my cluster?
Debugging
Amazon EMR | Analytics
First select the cluster you want to debug, then click the “Debug” button to open the “Debug a Cluster” window in the AWS Management Console. This will enable you to track progress and identify issues in steps, jobs, tasks, or task attempts of your clusters. Alternatively, you can SSH directly into the Amazon Elastic Compute Cloud (Amazon EC2) instances that are running your cluster and use your favorite command-line debugger to troubleshoot the cluster.
What is the cluster debug tool?
Debugging
Amazon EMR | Analytics
The cluster debug tool is a part of the AWS Management Console where you can track progress and identify issues in steps, jobs, tasks, or task attempts of your clusters. To access the cluster debug tool, first select the cluster you want to debug and then click on the “Debug” button.
How can I enable debugging of my cluster?
Debugging
Amazon EMR | Analytics
To enable debugging, you need to set the “Enable Debugging” flag when you create a cluster in the AWS Management Console. Alternatively, you can pass the --enable-debugging and --log-uri flags in the Command Line Client when creating a cluster.
Where can I find instructions on how to use the “Debug a Cluster” window?
Debugging
Amazon EMR | Analytics
Please refer to the AWS Management Console section of the Developer’s Guide for instructions on how to access and use the “Debug a Cluster” window.
What types of clusters can I debug with the “Debug a Cluster” window?
Debugging
Amazon EMR | Analytics
You can debug all types of clusters currently supported by Amazon EMR including custom jar, streaming, Hive, and Pig.
Why do I have to sign up for Amazon SimpleDB to use cluster debugging?
Debugging
Amazon EMR | Analytics
Amazon EMR stores state information about Hadoop jobs, tasks and task attempts under your account in Amazon SimpleDB. You can subscribe to Amazon SimpleDB here.
Can I use the cluster debugging feature without an Amazon SimpleDB subscription?
Debugging
Amazon EMR | Analytics
You will be able to browse cluster steps and step logs but will not be able to browse Hadoop jobs, tasks, or task attempts if you are not subscribed to Amazon SimpleDB.
Can I delete historical cluster data from Amazon SimpleDB?
Managing Data
Amazon EMR | Analytics
Yes. You can delete Amazon SimpleDB domains that Amazon EMR created on your behalf. Please reference the Amazon SimpleDB documentation for instructions.
How do I get my data into Amazon S3?
Managing Data
Amazon EMR | Analytics
You can use Amazon S3 APIs to upload data to Amazon S3. Alternatively, you can use many open source or commercial clients to easily upload data to Amazon S3.
How do I get logs for completed clusters?
Managing Data
Amazon EMR | Analytics
Hadoop system logs as well as user logs will be placed in the Amazon S3 bucket which you specify when creating a cluster.
Do you compress logs?
Managing Data
Amazon EMR | Analytics
No. At this time Amazon EMR does not compress logs as it moves them to Amazon S3.
Can I load my data from the internet or somewhere other than Amazon S3?
Billing
Amazon EMR | Analytics
Yes. Your Hadoop application can load the data from anywhere on the internet or from other AWS services. Note that if you load data from the internet, EC2 bandwidth charges will apply. Amazon EMR also provides Hive-based access to data in DynamoDB.
Can Amazon EMR estimate how long it will take to process my input data?
Billing
Amazon EMR | Analytics
No. As each cluster and input data is different, we cannot estimate your job duration.
How much does Amazon EMR cost?
Billing
Amazon EMR | Analytics
As with the rest of AWS, you pay only for what you use. There is no minimum fee and there are no up-front commitments or long-term contracts. Amazon EMR pricing is in addition to normal Amazon EC2 and Amazon S3 pricing.
For Amazon EMR pricing information, please visit EMR’s pricing page.
Amazon EC2, Amazon S3 and Amazon SimpleDB charges are billed separately. Pricing for Amazon EMR is per-second consumed for each instance type (with a one-minute minimum), from the time the cluster is requested until it is terminated. For additional details on Amazon EC2 Instance Types, Amazon EC2 Spot Pricing, Amazon EC2 Reserved Instances Pricing, Amazon S3 Pricing, or Amazon SimpleDB Pricing, follow the links below:
Amazon EC2 Instance Types
Amazon EC2 Reserved Instances Pricing
Amazon EC2 Spot Instances Pricing
Amazon S3 Pricing
Amazon SimpleDB Pricing
When does billing of my Amazon EMR cluster begin and end?
Billing
Amazon EMR | Analytics
Billing commences when Amazon EMR starts running your cluster. You are only charged for the resources actually consumed. For example, let’s say you launched 100 Amazon EC2 Standard Small instances for an Amazon EMR cluster, where the Amazon EMR cost is an incremental $0.015 per hour. The Amazon EC2 instances will begin booting immediately, but they won’t necessarily all start at the same moment. Amazon EMR will track when each instance starts and will check it into the cluster so that it can accept processing tasks.
In the first 10 minutes after your launch request, Amazon EMR either starts your cluster (if all of your instances are available) or checks in as many instances as possible. Once the 10 minute mark has passed, Amazon EMR will start processing (and charging for) your cluster as soon as 90% of your requested instances are available. As the remaining 10% of your requested instances check in, Amazon EMR starts charging for those instances as well.
So, in the above example, if all 100 of your requested instances are available 10 minutes after you kick off a launch request, you’ll be charged $1.50 per hour (100 * $0.015) for as long as the cluster takes to complete. If only 90 of your requested instances were available at the 10 minute mark, you’d be charged $1.35 per hour (90 * $0.015) for as long as this was the number of instances running your cluster. When the remaining 10 instances checked in, you’d be charged $1.50 per hour (100 * $0.015) for as long as the balance of the cluster takes to complete.
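The arithmetic in the example above can be reproduced with a short sketch. The rate is the $0.015 per instance-hour figure from the example; the function name is hypothetical, and `Decimal` is used only to keep the currency arithmetic exact.

```python
from decimal import Decimal

# Incremental Amazon EMR rate per instance-hour, taken from the example above.
EMR_RATE_PER_INSTANCE_HOUR = Decimal("0.015")


def hourly_emr_charge(instances_checked_in, rate=EMR_RATE_PER_INSTANCE_HOUR):
    """Hourly EMR charge for the instances currently checked in to the cluster."""
    return instances_checked_in * rate


if __name__ == "__main__":
    for instances in (90, 100):
        print(f"{instances} instances: ${hourly_emr_charge(instances):.2f} per hour")
```

This covers only the incremental Amazon EMR charge; Amazon EC2 and Amazon S3 charges are billed separately.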
Each cluster will run until one of the following occurs: you terminate the cluster with the TerminateJobFlows API call (or an equivalent tool), the cluster shuts itself down, or the cluster is terminated due to software or hardware failure.
Where can I track my Amazon EMR, Amazon EC2 and Amazon S3 usage?
Billing
Amazon EMR | Analytics
You can track your usage in the Billing & Cost Management Console.
How do you calculate the Normalized Instance Hours displayed on the console?
Billing
Amazon EMR | Analytics
On the AWS Management Console, every cluster has a Normalized Instance Hours column that displays the approximate number of compute hours the cluster has used, rounded up to the nearest hour. Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small usage = 1 hour normalized compute time. The following table outlines the normalization factor used to calculate normalized instance hours for the various instance sizes:
Instance Size    Normalization Factor
Small            1
Medium           2
Large            4
Xlarge           8
2xlarge          16
4xlarge          32
8xlarge          64
For example, if you run a 10-node r3.8xlarge cluster for an hour, the total number of Normalized Instance Hours displayed on the console will be 640 (10 (number of nodes) x 64 (normalization factor) x 1 (number of hours that the cluster ran) = 640).
This is an approximate number and should not be used for billing purposes. Please refer to the Billing & Cost Management Console for billable Amazon EMR usage. Note that we recently changed the normalization factor to accurately reflect the weights of the instances, and the normalization factor does not affect your monthly bill.
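The normalization table and the worked example above can be expressed as a small calculation. This is a sketch only: the function name is hypothetical, and rounding the running time up to whole hours is an assumption based on the "rounded up to the nearest hour" wording, not a statement of how the console computes the figure.

```python
import math

# Normalization factors from the table above, keyed by instance size.
NORMALIZATION_FACTOR = {
    "small": 1, "medium": 2, "large": 4, "xlarge": 8,
    "2xlarge": 16, "4xlarge": 32, "8xlarge": 64,
}


def normalized_instance_hours(instance_type, node_count, hours):
    """Approximate Normalized Instance Hours for a uniform cluster.

    instance_type: a name such as "r3.8xlarge"; only the size suffix is used.
    hours: running time, rounded up to whole hours (an assumption).
    """
    size = instance_type.split(".")[-1]
    return node_count * NORMALIZATION_FACTOR[size] * math.ceil(hours)


if __name__ == "__main__":
    print(normalized_instance_hours("r3.8xlarge", node_count=10, hours=1))
```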
Does Amazon EMR support Amazon EC2 On-Demand, Spot, and Reserved Instances?
Billing
Amazon EMR | Analytics
Yes. Amazon EMR seamlessly supports On-Demand, Spot, and Reserved Instances. Click here to learn more about Amazon EC2 Reserved Instances. Click here to learn more about Amazon EC2 Spot Instances.
Do your prices include taxes?
Security
Amazon EMR | Analytics
Except as otherwise noted, our prices are exclusive of applicable taxes and duties, including VAT and applicable sales tax. For customers with a Japanese billing address, use of AWS services is subject to Japanese Consumption Tax. Learn more.
How do I prevent other people from viewing my data during cluster execution?
Security
Amazon EMR | Analytics
Amazon EMR starts your instances in two Amazon EC2 security groups, one for the master and another for the slaves. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The slaves start in a separate security group, which only allows interaction with the master instance. By default both security groups are set up to not allow access from external sources including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Click here to learn more about EC2 security groups.
How secure is my data?
Security
Amazon EMR | Analytics
Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon EMR customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon EMR always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data encryption tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.
Can I get a history of all EMR API calls made on my account for security or compliance auditing?
Regions & Availability Zones
Amazon EMR | Analytics
Yes. AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The AWS API call history produced by CloudTrail enables security analysis, resource change tracking, and compliance auditing. Learn more about CloudTrail at the AWS CloudTrail detail page, and turn it on via CloudTrail’s AWS Management Console.
How does Amazon EMR make use of Availability Zones?
Regions & Availability Zones
Amazon EMR | Analytics
Amazon EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in the same zone improves performance of the job flows because it provides a higher data access rate. By default, Amazon EMR chooses the Availability Zone with the most available resources in which to run your cluster. However, you can specify another Availability Zone if required.
In what Regions is Amazon EMR available?
Regions & Availability Zones
Amazon EMR | Analytics
For a list of the supported Amazon EMR AWS regions, please visit the AWS Region Table for all AWS global infrastructure.
Which Region should I select to run my clusters?
Regions & Availability Zones
Amazon EMR | Analytics
When creating a cluster, typically you should select the Region where your data is located.
Can I use EU data in a cluster running in the US region and vice versa?
Regions & Availability Zones
Amazon EMR | Analytics
Yes, you can. If you transfer data from one region to the other, bandwidth charges will apply. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.
What is different about the AWS GovCloud (US) region?
Managing your Cluster
Amazon EMR | Analytics
The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support spot instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.
How does Amazon EMR use Amazon EC2 and Amazon S3?
Managing your Cluster
Amazon EMR | Analytics
Customers upload their input data and a data processing application into Amazon S3. Amazon EMR then launches a number of Amazon EC2 instances as specified by the customer. The service begins the cluster execution while pulling the input data from Amazon S3 using S3N protocol into the launched Amazon EC2 instances. Once the cluster is finished, Amazon EMR transfers the output data to Amazon S3, where customers can then retrieve it or use as input in another cluster.
How is a computation done in Amazon EMR?
Managing your Cluster
Amazon EMR | Analytics
Amazon EMR uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, comprising one master and multiple slaves. Amazon EMR runs Hadoop software on these instances. The master node divides input data into blocks, and distributes the processing of the blocks to the slave nodes. Each slave node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted and partitioned and sent to processes which apply the reducer function to it. These processes also run on the slave nodes. Finally, the output from the reducer tasks is collected in files. A single “cluster” may involve a sequence of such MapReduce steps.
How reliable is Amazon EMR?
Managing your Cluster
Amazon EMR | Analytics
Amazon EMR manages an Amazon EC2 cluster of compute instances using Amazon’s highly available, proven network infrastructure and datacenters. Amazon EMR uses industry proven, fault-tolerant Hadoop software as its data processing engine. Hadoop splits the data into multiple subsets and assigns each subset to more than one Amazon EC2 instance. So, if an Amazon EC2 instance fails to process one subset of data, the results of another Amazon EC2 instance can be used.
How quickly will my cluster be up and running and processing my input data?
Managing your Cluster
Amazon EMR | Analytics
Amazon EMR starts resource provisioning of Amazon EC2 On-Demand instances almost immediately. If the instances are not available, Amazon EMR will keep trying to provision the resources for your cluster until they are provisioned or you cancel your request. The instance provisioning is done on a best-efforts basis and depends on the number of instances requested, time when the cluster is created, and total number of requests in the system. After resources have been provisioned, it typically takes fewer than 15 minutes to start processing.
In order to guarantee capacity for your clusters at the time you need it, you may pay a one-time fee for Amazon EC2 Reserved Instances to reserve instance capacity in the cloud at a discounted hourly rate. Like On-Demand Instances, customers pay usage charges only for the time when their instances are running. In this way, Reserved Instances enable businesses with known instance requirements to maintain the elasticity and flexibility of On-Demand Instances, while also reducing their predictable usage costs even further.
Which Amazon EC2 instance types does Amazon EMR support?
Managing your Cluster
Amazon EMR | Analytics
Amazon EMR supports 12 EC2 instance types including Standard, High CPU, High Memory, Cluster Compute, High I/O, and High Storage. Standard Instances have memory to CPU ratios suitable for most general-purpose applications. High CPU instances have proportionally more CPU resources than memory (RAM) and are well suited for compute-intensive applications. High Memory instances offer large memory sizes for high throughput applications. Cluster Compute instances have proportionally high CPU with increased network performance and are well suited for High Performance Compute (HPC) applications and other demanding network-bound applications. High Storage instances offer 48 TB of storage across 24 disks and are ideal for applications that require sequential access to very large data sets such as data warehousing and log processing. See the EMR pricing page for details on available instance types and pricing per region.
How do I select the right Amazon EC2 instance type?
Managing your Cluster
Amazon EMR | Analytics
When choosing instance types, you should consider the characteristics of your application with regard to resource utilization and select the optimal instance family. One of the advantages of Amazon EMR with Amazon EC2 is that you pay only for what you use, which makes it convenient and inexpensive to test the performance of your clusters on different instance types and quantities. One effective way to determine the most appropriate instance type is to launch several small clusters and benchmark them.
How do I select the right number of instances for my cluster?
Managing your Cluster
Amazon EMR | Analytics
The number of instances to use in your cluster is application-dependent and should be based on both the amount of resources required to store and process your data and the acceptable amount of time for your job to complete. As a general guideline, we recommend that you use no more than 60% of your disk space for storing the data you will be processing, leaving the rest for intermediate output. Hence, given 3x replication on HDFS, if you were looking to process 5 TB on m1.xlarge instances, which have 1,690 GB of disk space, we recommend your cluster contains at least (5 TB * 3) / (1,690 GB * 0.6) ≈ 14.8, which rounds up to 15 m1.xlarge core nodes. You may want to increase this number if your job generates a high amount of intermediate data or has significant I/O requirements. You may also want to include additional task nodes to improve processing performance. See Amazon EC2 Instance Types for details on local instance storage for each instance type configuration.
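The sizing guideline above can be written as a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and 1 TB is treated as 1,000 GB so that the result matches the figure in the text.

```python
import math


def min_core_nodes(data_tb, disk_gb_per_node, replication=3,
                   usable_fraction=0.6, gb_per_tb=1000):
    """Minimum number of core nodes needed to hold the replicated input data.

    usable_fraction: share of each node's disk used for input data;
    the remainder is left for intermediate output.
    gb_per_tb: 1 TB is taken as 1,000 GB (an assumption).
    """
    required_gb = data_tb * gb_per_tb * replication
    usable_gb_per_node = disk_gb_per_node * usable_fraction
    return math.ceil(required_gb / usable_gb_per_node)


if __name__ == "__main__":
    # 5 TB of input on m1.xlarge instances with 1,690 GB of disk each.
    print(min_core_nodes(data_tb=5, disk_gb_per_node=1690))
```

The result is a lower bound for storage only; jobs with heavy intermediate output or I/O may need more nodes.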
How long will it take to run my cluster?
Managing your Cluster
Amazon EMR | Analytics
The time to run your cluster will depend on several factors including the type of your cluster, the amount of input data, and the number and type of Amazon EC2 instances you choose for your cluster.
If the master node in a cluster goes down, can Amazon EMR recover it?
Managing your Cluster
Amazon EMR | Analytics
No. If the master node goes down, your cluster will be terminated and you’ll have to rerun your job. Amazon EMR currently does not support automatic failover of the master nodes or master node state recovery. In case of master node failure, the AWS Management Console displays the message “The master node was terminated”, which is an indicator for you to start a new cluster. Customers can implement checkpointing in their clusters to save intermediate data (data created in the middle of a cluster that has not yet been reduced) on Amazon S3. This will allow resuming the cluster from the last checkpoint in case of failure.
If a slave node goes down in a cluster, can Amazon EMR recover from it?
Managing your Cluster
Amazon EMR | Analytics
Yes. Amazon EMR is fault tolerant for slave failures and continues job execution if a slave node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.