EMR Flashcards
What is EMR?
Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing large amounts of data. It runs open-source frameworks such as Apache Spark and Apache Hadoop, along with several other leading open-source tools. It supports data processing tasks such as web indexing, data transformation (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.
Features of EMR?
Scalability: AWS EMR allows you to quickly and easily scale your processing capacity. You can add or remove cluster instances as your needs change, and you only pay for what you use.
Flexibility: EMR supports multiple big data frameworks, including Apache Spark, Hadoop, HBase, Presto, and Flink. It also integrates with other AWS services like AWS Glue, Amazon S3, DynamoDB, and more.
Speed: EMR is designed to process large data sets quickly and efficiently. It distributes the data and processing across a resizable cluster of Amazon EC2 instances.
Security: AWS EMR ensures data is stored securely, with options for encryption at rest and in transit. It is also integrated with AWS Lake Formation to provide granular data access control.
Cost-Effective: With AWS EMR, you can use EC2 Spot Instances to save on computing costs. You also have the option to use Reserved Instances for long-term workloads, or On-Demand Instances for short-term workloads.
What are the different components of EMR Cluster?
An Amazon Elastic MapReduce (EMR) cluster is essentially a collection of Amazon EC2 instances, known as nodes, that are running Hadoop. Each node in the cluster has a specific role or node type: master node, core node, and task node.
- Master Node: Every EMR cluster will always have at least one master node. The master node manages the cluster, runs software components to coordinate data distribution, and supervises tasks among other nodes for processing. It monitors the overall health of the cluster and tracks the status of tasks. A minimal single-node EMR cluster could just consist of a master node doing everything.
- Core Node: Core nodes run tasks and store data in the Hadoop Distributed File System (HDFS) or in the EMR File System (EMRFS) which enables data writing into S3. These nodes perform the actual work of processing and storing data across the cluster. In a multi-node cluster, there will be at least one core node.
- Task Node: Task nodes only run tasks and do not store any data in HDFS or EMRFS. These nodes are optional and are typically added when there’s a need for more processing capacity but no additional storage. Task nodes can be especially beneficial in EMR, since EMR often uses S3 for storage; using task nodes saves money because you don’t pay for unnecessary storage.
It’s also worth noting that there’s no risk of data loss when removing a task node, since it doesn’t store any data. Task nodes are therefore well suited to spot instances, which are an efficient way to add capacity and cut costs on an EMR cluster; this may appear in the exam. If a spot instance goes down, it doesn’t affect the data or the functioning of the cluster, since task nodes provide only extra processing capacity. Hence, using spot instances for task nodes is a recommended strategy for cost-efficient, dynamic cluster expansion.
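The node-role distinctions above can be sketched as a small model (a hypothetical illustration, not an AWS API): only task nodes can be removed on the fly without risking HDFS data loss, which is why they pair so well with spot instances.

```python
# Hypothetical model of EMR node roles (illustration only, not an AWS API).
NODE_ROLES = {
    "master": {"runs_tasks": False, "stores_hdfs": False, "manages_cluster": True},
    "core":   {"runs_tasks": True,  "stores_hdfs": True,  "manages_cluster": False},
    "task":   {"runs_tasks": True,  "stores_hdfs": False, "manages_cluster": False},
}

def safe_to_remove(node_type: str) -> bool:
    """A node is safe to remove on the fly only if it holds no HDFS/EMRFS
    blocks and is not the node managing the cluster."""
    role = NODE_ROLES[node_type]
    return not role["stores_hdfs"] and not role["manages_cluster"]

print(safe_to_remove("task"))    # True: no data stored, a good fit for spot
print(safe_to_remove("core"))    # False: removing it removes HDFS blocks
print(safe_to_remove("master"))  # False: it coordinates the whole cluster
```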
What are the different types of clusters?
There are two main ways of using Amazon Elastic MapReduce (EMR): transient clusters and long-running clusters.
Transient Clusters: Transient clusters are temporary and automatically terminate once all assigned steps are complete. When setting up a transient cluster, you specify the type of hardware for the EMR cluster and define the processing steps. The cluster carries out these steps—such as loading data, processing data, and storing results—and shuts down automatically when it’s done. This strategy is cost-effective as you only pay for the time the cluster is operational. If you occasionally run one-off jobs, transient clusters are a good choice. They spin up resources, execute your job, and then shut down, potentially saving money.
Long-Running Clusters: If you require a persistent data warehouse with continuous or periodic processing of large data sets, a long-running cluster is more suitable. In this scenario, you spin up a cluster with specified parameters and leave it running until manually terminated. To address occasional spikes in capacity needs, you can add more task nodes using spot instances. For long-running clusters, you can use reserved instances to save more money if you plan to keep the cluster operational for a prolonged period. By default, termination protection is enabled and auto termination is disabled on a long-running cluster, ensuring its preservation as long as possible.
In summary, choose transient clusters for predefined, one-time tasks, and long-running clusters for continuous data processing and access in a more persistent environment.
How can you interact with EMR?
There are two primary ways to use and interact with Amazon Elastic MapReduce (EMR):
- Direct Interaction with Master Node: When launching an EMR cluster, you select the desired frameworks and applications, such as Apache Spark. The cluster automatically installs these when spinning up. If you have a long-lived cluster, you can connect directly to the master node and run your jobs from there. This approach is especially suitable for those comfortable with the command-line interface. For instance, you could set up a Spark-enabled EMR cluster, connect to the master node, and initiate your Spark driver script to leverage the full power of the cluster.
- Using AWS Console: Alternatively, you can submit steps via the AWS console. This process can be done purely graphically through the console. Basic tasks such as processing data in S3 or from the Hadoop Distributed File System (HDFS) can be defined as steps. Once defined, you can initiate these steps via the AWS console without needing to connect directly to the master node or use the command line. This data can then be output to S3 or another location.
In summary, there are two main ways to use EMR: directly interacting with the master node (usually for command-line users) or defining and initiating steps through the AWS console. The choice between the two largely depends on the user’s comfort with the command line and their specific use case.
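A third, programmatic route is to submit steps through the EMR API. As a sketch, a step in the shape that boto3's `add_job_flow_steps` call expects can be built as plain data; the bucket, script name, and cluster ID below are hypothetical, and no AWS call is made here.

```python
def spark_step(name: str, script_s3_uri: str) -> dict:
    """Build a step in the shape boto3's EMR add_job_flow_steps expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",    # leave the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command launcher
            "Args": ["spark-submit", script_s3_uri],
        },
    }

step = spark_step("etl-job", "s3://my-bucket/scripts/etl.py")
# With AWS credentials configured, this could be submitted as:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
print(step["Name"])  # etl-job
```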
What are the different kinds of storage options in EMR?
- HDFS (Hadoop Distributed File System)
- EMRFS (EMR File System)
In summary, you can use either HDFS or EMRFS for storage in an EMR cluster, with the main distinction being data persistence after cluster termination. The improved consistency of S3 has simplified using EMRFS.
Details of HDFS?
Since EMR is fundamentally a Hadoop cluster running on EC2, you can use HDFS for data storage. It utilizes the local storage of each instance, distributing the storage across the cluster.
Files are stored in blocks, and these blocks are distributed across the cluster.
To ensure redundancy, multiple copies of each file block are stored across instances. However, the storage is ephemeral, meaning all data is lost when the cluster is shut down. Despite this limitation, HDFS can still be used for caching intermediate results or workloads with substantial random IO.
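The block-and-replication scheme can be illustrated with some quick storage math. The values used here are the common Hadoop defaults (128 MB blocks, replication factor 3); actual values depend on configuration.

```python
import math

# Common Hadoop defaults; actual EMR values depend on cluster configuration.
def hdfs_blocks(file_mb: float, block_mb: int = 128) -> int:
    """Number of fixed-size blocks a file is split into."""
    return math.ceil(file_mb / block_mb)

def raw_storage_mb(file_mb: float, replication: int = 3) -> float:
    """Raw cluster storage consumed once every block is replicated."""
    return file_mb * replication

print(hdfs_blocks(1000))     # 8 blocks for a ~1 GB file
print(raw_storage_mb(1000))  # 3000 MB of raw HDFS capacity used
```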
Details of EMR File System?
EMRFS enables the use of S3 as if it were HDFS, providing persistent storage even after cluster termination.
There used to be a consistency issue when multiple nodes tried to write to the same S3 location simultaneously. This was resolved with the introduction of EMRFS Consistent View, which uses DynamoDB to track file access consistency. However, this added complexity and required careful management of read/write capacity for DynamoDB. As of 2021, Amazon S3 itself guarantees strong consistency, obviating the need for EMRFS Consistent View.
What are other alternate storage options for EMR storage clusters?
Beyond HDFS and EMRFS, there are other storage options for Amazon Elastic MapReduce (EMR) clusters:
- Local File System: This is a fast option for data storage, but it’s ephemeral and only suitable for temporary data, like temporary buffers or caches. Data in the local file system will not be backed up and will be lost when the cluster is terminated.
- Elastic Block Store (EBS) for HDFS: This option allows EBS-only instance types (e.g., M4, C4) to be used for HDFS storage. However, like the local file system, EBS storage is deleted when the cluster is terminated. EBS volumes can only be attached when launching a cluster, so you cannot add EBS storage to a running cluster to expand capacity later. If an EBS volume is manually detached while the cluster is running, EMR treats it as a failure and automatically replaces it, so the cluster is resilient to that failure mode.
In summary, both the local file system and EBS for HDFS are transient storage options that don’t persist data after cluster termination. For persistent storage that survives after cluster termination, EMRFS with S3 should be used.
How does Amazon EMR charge?
Amazon Elastic MapReduce (EMR) pricing is quoted per instance-hour (billed per second, with a one-minute minimum). The longer the cluster runs, the higher the cost, and costly instance types such as GPU instances can make this particularly expensive.
How can you save money while running tasks in Amazon EMR?
A recommended approach is to run tasks as a set of steps on a transient cluster that starts automatically and terminates when the steps complete. This minimizes the runtime of the cluster and thus reduces cost.
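A back-of-the-envelope comparison shows why minimizing runtime matters. The node count and hourly rate below are hypothetical, and real EMR pricing is the EC2 price plus an EMR surcharge.

```python
def cluster_cost(nodes: int, hourly_rate: float, hours: float) -> float:
    """Simple cost model: every node bills at the same hourly rate."""
    return nodes * hourly_rate * hours

# A 10-node cluster at a hypothetical $0.27/instance-hour:
# a transient cluster running a 2-hour job daily for 30 days,
# versus the same cluster left running 24/7 for the month.
transient = cluster_cost(nodes=10, hourly_rate=0.27, hours=2 * 30)
always_on = cluster_cost(nodes=10, hourly_rate=0.27, hours=24 * 30)
print(round(transient, 2))  # 162.0
print(round(always_on, 2))  # 1944.0
```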
What happens in Amazon EMR when a core node fails?
EMR will automatically provision a new node in case of a core node failure, allowing tasks to pick up where they left off.
What is the best option to add capacity in Amazon EMR?
The best option to add capacity in Amazon EMR is often to add or remove task nodes on the fly. Task nodes are similar to core nodes but lack their own HDFS storage capacity.
How can you increase processing capacity if you are using EMRFS with S3 for persistent storage?
You can increase processing capacity by using task nodes. Even though task nodes don’t increase storage capacity, they can help in increasing the processing capacity.
How can you handle temporary surges in processing needs in Amazon EMR?
Adding and removing task nodes can effectively handle temporary surges in processing needs, for example during a high-traffic season for an e-commerce website.
How can you increase both processing and HDFS storage capacity in Amazon EMR?
You can increase both processing and HDFS storage capacity by resizing the cluster’s core nodes.
What is the risk of adding and removing core nodes on the fly in Amazon EMR?
Adding and removing core nodes on the fly in Amazon EMR carries the risk of data loss if using HDFS storage, as removing a core node also removes the underlying storage.
When was Managed Scaling in Amazon EMR introduced and what did it replace?
Managed Scaling in Amazon EMR was introduced in 2020, replacing the previous ‘EMR Automatic Scaling’ that was based on CloudWatch metrics.
What were the limitations of Amazon EMR automatic scaling before 2020?
Prior to 2020, automatic scaling in Amazon EMR could only add or remove capacity within instance groups. It did not support mixed instance types.
What does EMR Managed Scaling support?
EMR Managed Scaling supports instance groups as well as instance fleets. It can scale spot instances, on-demand instances, and instances covered by a Savings Plan up and down within the same cluster. This applies to Spark, Hive, and YARN workloads in EMR.
How does Managed Scaling in EMR scale up?
When scaling up, Managed Scaling first tries to add core nodes. If it reaches the limit, it then adds task nodes, up to the maximum number of units specified by the user.
How does Managed Scaling in EMR scale down?
Scaling down with Managed Scaling starts by removing task nodes and then core nodes, adhering to the minimum constraints set by the user.
What is the order of node removal when scaling down with Managed Scaling?
Spot nodes will always be removed before on-demand instances when scaling down with Managed Scaling.
What configuration does Managed Scaling in EMR allow?
Managed Scaling allows specifying a maximum and minimum number of units (core nodes and task nodes), and it can be applied across a fleet, not just an instance group.
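The scale-up and scale-down ordering described above can be sketched as a toy model. This is only an illustration of the stated rules, not AWS's actual scaling algorithm.

```python
def scale_up(cluster, add_units, core_limit, max_units):
    """Add capacity: core nodes first, then task nodes, up to max_units."""
    while add_units > 0 and sum(cluster.values()) < max_units:
        if cluster["core"] < core_limit:
            cluster["core"] += 1   # core nodes are added first
        else:
            cluster["task"] += 1   # then task nodes
        add_units -= 1
    return cluster

def scale_down(nodes, remove, min_units):
    """Remove capacity: task nodes before core nodes, spot before
    on-demand, never dropping below min_units.
    nodes is a list of (role, market) tuples, e.g. ("task", "spot")."""
    removal_order = {("task", "spot"): 0, ("task", "on-demand"): 1,
                     ("core", "spot"): 2, ("core", "on-demand"): 3}
    remaining = sorted(nodes, key=lambda n: removal_order[n])
    while remove > 0 and len(remaining) > min_units:
        remaining.pop(0)
        remove -= 1
    return remaining

cluster = {"core": 2, "task": 0}
scale_up(cluster, add_units=4, core_limit=4, max_units=10)
print(cluster)  # {'core': 4, 'task': 2}

nodes = [("core", "on-demand"), ("core", "on-demand"),
         ("task", "on-demand"), ("task", "spot")]
print(scale_down(nodes, remove=2, min_units=1))  # only the core nodes remain
```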
What are the key modules of Hadoop architecture?
Hadoop architecture comprises several modules: Hadoop Common (or Hadoop Core), Hadoop Distributed File System (HDFS), YARN, and MapReduce. These form the basis of Hadoop.
What is Hadoop Common or Hadoop Core?
Hadoop Common or Hadoop Core includes libraries and utilities that other Hadoop modules build on. It provides all the file system and operating system level abstractions needed on top of the cluster, along with the JAR files and scripts required to start Hadoop.
What is HDFS in Hadoop?
Hadoop Distributed File System (HDFS) is a distributed, scalable file system that stores blocks of data across instances in the cluster. It ensures data redundancy by storing multiple copies of those blocks on different instances. However, on Amazon EMR, HDFS is ephemeral and data will be lost upon terminating the cluster.
What is YARN in Hadoop?
YARN (Yet Another Resource Negotiator) is an abstraction layer added in Hadoop 2.0 between MapReduce and HDFS. It allows more than one data processing framework and centrally manages cluster resources.
What is MapReduce in Hadoop?
MapReduce is a core data processing framework in Hadoop for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is made up of mappers (which map data to sets of key value pairs - the intermediate results of processing) and reducers (which combine those intermediate results, apply additional algorithms, and produce the final output).
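The mapper/reducer flow can be shown in miniature with the classic word-count example. Hadoop would distribute these phases across the cluster; here they run in a single process just to illustrate the data flow.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Map phase: emit a (word, 1) key/value pair for every word."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: combine the intermediate counts for one key."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
# Map: produce intermediate key/value pairs
pairs = [kv for line in lines for kv in mapper(line)]
# Shuffle/sort: group intermediate pairs by key
pairs.sort(key=itemgetter(0))
# Reduce: sum each group into the final output
result = dict(reducer(key, (count for _, count in group))
              for key, group in groupby(pairs, key=itemgetter(0)))
print(result["the"])  # 3
print(result["fox"])  # 2
```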
What has largely supplanted MapReduce for distributed file processing on a Hadoop cluster?
Apache Spark has largely supplanted MapReduce for distributed file processing on a Hadoop cluster due to its faster speed, extensibility, and more versatile capabilities.
Amazon EMR Serverless
EMR Serverless is a deployment option in Amazon EMR that automatically scales resources up and down, letting data analysts and engineers run open-source big data analytics frameworks without having to manage clusters or servers.