EMR Flashcards

1
Q

What is EMR?

A

Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing large amounts of data. It uses open-source tools such as Apache Spark and Hadoop, along with several other leading open-source frameworks, and supports data processing tasks such as web indexing, data transformation (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

2
Q

Features of EMR?

A

Scalability: AWS EMR allows you to quickly and easily scale your processing capacity. You can add or remove cluster instances as your needs change, and you only pay for what you use.

Flexibility: EMR supports multiple big data frameworks, including Apache Spark, Hadoop, HBase, Presto, and Flink. It also integrates with other AWS services like AWS Glue, Amazon S3, DynamoDB, and more.

Speed: EMR is designed to process large data sets quickly and efficiently. It distributes the data and processing across a resizable cluster of Amazon EC2 instances.

Security: AWS EMR ensures data is stored securely, with options for encryption at rest and in transit. It is also integrated with AWS Lake Formation to provide granular data access control.

Cost-Effective: With AWS EMR, you can use EC2 Spot Instances to save on computing costs. You also have the option to use Reserved Instances for long-term workloads, or On-Demand Instances for short-term workloads.

3
Q

What are the different components of EMR Cluster?

A

An Amazon Elastic MapReduce (EMR) cluster is essentially a collection of Amazon EC2 instances, known as nodes, that are running Hadoop. Each node in the cluster has a specific role or node type: master node, core node, and task node.

  1. Master Node: Every EMR cluster will always have at least one master node. The master node manages the cluster, runs software components to coordinate data distribution, and supervises tasks among other nodes for processing. It monitors the overall health of the cluster and tracks the status of tasks. A minimal single-node EMR cluster could just consist of a master node doing everything.
  2. Core Node: Core nodes run tasks and store data in the Hadoop Distributed File System (HDFS) or in the EMR File System (EMRFS) which enables data writing into S3. These nodes perform the actual work of processing and storing data across the cluster. In a multi-node cluster, there will be at least one core node.
  3. Task Node: Task nodes, a relatively new addition, only run tasks without storing any data in HDFS or EMRFS. These nodes are optional and are typically added when there’s a need for more processing capacity but no additional storage. Task nodes can be especially beneficial in EMR as it often uses S3 for storage. Using task nodes helps save money as you don’t pay for unnecessary storage.

It's also worth noting that there's no risk of data loss when removing a task node, as it doesn't store any data. Task nodes are well suited to spot instances, which are an efficient way to add capacity and cut costs on an EMR cluster; this may come up in the exam. If a spot instance goes down, it doesn't affect the data or the functioning of the cluster, since task nodes only provide extra processing capacity. Hence, using spot instances for task nodes is a recommended strategy for cost-efficient and dynamic cluster expansion.

4
Q

What are the different types of clusters?

A

There are two main ways of using Amazon Elastic MapReduce (EMR): transient clusters and long-running clusters.

Transient Clusters: Transient clusters are temporary and automatically terminate once all assigned steps are complete. When setting up a transient cluster, you specify the type of hardware for the EMR cluster and define the processing steps. The cluster carries out these steps—such as loading data, processing data, and storing results—and shuts down automatically when it’s done. This strategy is cost-effective as you only pay for the time the cluster is operational. If you occasionally run one-off jobs, transient clusters are a good choice. They spin up resources, execute your job, and then shut down, potentially saving money.

Long-Running Clusters: If you require a persistent data warehouse with continuous or periodic processing of large data sets, a long-running cluster is more suitable. In this scenario, you spin up a cluster with specified parameters and leave it running until manually terminated. To address occasional spikes in capacity needs, you can add more task nodes using spot instances. For long-running clusters, you can use reserved instances to save more money if you plan to keep the cluster operational for a prolonged period. By default, termination protection is enabled and auto termination is disabled on a long-running cluster, ensuring its preservation as long as possible.

In summary, choose transient clusters for predefined, one-time tasks, and long-running clusters for continuous data processing and access in a more persistent environment.
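
To make this concrete, here is a minimal boto3 sketch of launching a transient cluster that runs its steps and then terminates itself. The release label, instance types, roles, and S3 paths are placeholder assumptions, not values from the card.

  import boto3

  emr = boto3.client("emr", region_name="us-east-1")

  # Launch a transient cluster: because KeepJobFlowAliveWhenNoSteps is False,
  # the cluster shuts down automatically once the listed steps finish.
  response = emr.run_job_flow(
      Name="nightly-etl-transient",
      ReleaseLabel="emr-6.9.0",
      Applications=[{"Name": "Spark"}],
      Instances={
          "InstanceGroups": [
              {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
              {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
          ],
          "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps are done
      },
      Steps=[
          {
              "Name": "spark-etl",
              "ActionOnFailure": "TERMINATE_CLUSTER",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
              },
          }
      ],
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )
  print(response["JobFlowId"])  # you pay only while the cluster is running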

5
Q

How can you interact with EMR?

A

Two primary ways to use and interact with Amazon Elastic MapReduce (EMR):

  1. Direct Interaction with Master Node: When launching an EMR cluster, you select the desired frameworks and applications, such as Apache Spark. The cluster automatically installs these when spinning up. If you have a long-lived cluster, you can connect directly to the master node and run your jobs from there. This approach is especially suitable for those comfortable with the command-line interface. For instance, you could set up a Spark-enabled EMR cluster, connect to the master node, and initiate your Spark driver script to leverage the full power of the cluster.
  2. Using AWS Console: Alternatively, you can submit steps via the AWS console. This process can be done purely graphically through the console. Basic tasks such as processing data in S3 or from the Hadoop Distributed File System (HDFS) can be defined as steps. Once defined, you can initiate these steps via the AWS console without needing to connect directly to the master node or use the command line. This data can then be output to S3 or another location.

In summary, there are two main ways to use EMR: directly interacting with the master node (usually for command-line users) or defining and initiating steps through the AWS console. The choice between the two largely depends on the user’s comfort with the command line and their specific use case.
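
Steps can also be submitted programmatically rather than through the console. Below is a minimal boto3 sketch of adding a step to an already-running cluster; the cluster ID, script path, and output location are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Submit a step to an existing cluster without connecting to the master node.
  emr.add_job_flow_steps(
      JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      Steps=[
          {
              "Name": "process-s3-data",
              "ActionOnFailure": "CONTINUE",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": [
                      "spark-submit",
                      "s3://my-bucket/scripts/process.py",
                      "--output", "s3://my-bucket/output/",
                  ],
              },
          }
      ],
  )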

6
Q

What are the different kinds of storage options in EMR?

A
  1. HDFS (Hadoop Distributed File System)
  2. EMRFS (EMR File System)

In summary, you can use either HDFS or EMRFS for storage in an EMR cluster, with the main distinction being data persistence after cluster termination. The improved consistency of S3 has simplified using EMRFS.

7
Q

Details of HDFS?

A

Since EMR is fundamentally a Hadoop cluster running on EC2, you can use HDFS for data storage. It utilizes the local storage of each instance, distributing the storage across the cluster.
Files are stored in blocks, and these blocks are distributed across the cluster.
To ensure redundancy, multiple copies of each file block are stored across instances. However, the storage is ephemeral, meaning all data is lost when the cluster is shut down. Despite this limitation, HDFS can still be used for caching intermediate results or workloads with substantial random IO.

8
Q

Details of EMR File System?

A

EMRFS enables the use of S3 as if it were HDFS, providing persistent storage even after cluster termination.
There used to be a consistency issue when multiple nodes tried to write to the same S3 location simultaneously. This was resolved with the introduction of EMRFS Consistent View, which uses DynamoDB to track file access consistency. However, this added complexity and required careful management of read/write capacity for DynamoDB. As of 2021, Amazon S3 itself guarantees strong consistency, obviating the need for EMRFS Consistent View.

9
Q

What are the alternative storage options for EMR clusters?

A

Amazon Elastic MapReduce (EMR) clusters also support two other storage options:

  1. Local File System: This is a fast option for data storage, but it’s ephemeral and only suitable for temporary data, like temporary buffers or caches. Data in the local file system will not be backed up and will be lost when the cluster is terminated.
  2. Elastic Block Store (EBS) for HDFS: This option allows the use of EBS-only instance types (e.g., M4, C4) for data storage. However, like the local file system, EBS storage is also deleted when the cluster is terminated. EBS volumes can only be attached when launching a cluster, so there’s no possibility to expand storage capacity later. If an EBS volume is manually detached while running, EMR will treat it as a failure and automatically replace it, showing resilience to this failure mode.

In summary, both the local file system and EBS for HDFS are transient storage options that don’t persist data after cluster termination. For persistent storage that survives after cluster termination, EMRFS with S3 should be used.

10
Q

How does Amazon EMR charge?

A

Amazon Elastic MapReduce (EMR) charges by the hour. The longer the cluster runs, the higher the cost. Costly instance types like GPU instances can make this particularly expensive.

11
Q

How can you save money while running tasks in Amazon EMR?

A

Running tasks as a set of steps that automatically start and stop a cluster when done is recommended. This minimizes the runtime of the cluster and thus reduces cost.

12
Q

What happens in Amazon EMR when a core node fails?

A

EMR will automatically provision a new node in case of a core node failure, allowing tasks to pick up where they left off.

13
Q

What is the best option to add capacity in Amazon EMR?

A

The best option to add capacity in Amazon EMR is often to add or remove task nodes on the fly. Task nodes are similar to core nodes but lack their own HDFS storage capacity.
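
A minimal boto3 sketch of adding spot task nodes to a running cluster on the fly; the cluster ID, instance type, and count are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Add a spot-priced TASK instance group; because task nodes hold no HDFS data,
  # they can later be shrunk or removed without risking data loss.
  emr.add_instance_groups(
      JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      InstanceGroups=[
          {
              "Name": "extra-task-capacity",
              "InstanceRole": "TASK",
              "Market": "SPOT",
              "InstanceType": "m5.xlarge",
              "InstanceCount": 4,
          }
      ],
  )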

14
Q

How can you increase processing capacity if you are using EMRFS with S3 for persistent storage?

A

You can increase processing capacity by using task nodes. Even though task nodes don’t increase storage capacity, they can help in increasing the processing capacity.

15
Q

How can you handle temporary surges in processing needs in Amazon EMR?

A

Adding and removing task nodes can effectively handle temporary surges in processing needs, for example during a high-traffic season for an e-commerce website.

16
Q

How can you increase both processing and HDFS storage capacity in Amazon EMR?

A

You can increase both processing and HDFS storage capacity by resizing the cluster’s core nodes.

17
Q

What is the risk of adding and removing core nodes on the fly in Amazon EMR?

A

Adding and removing core nodes on the fly in Amazon EMR carries the risk of data loss if using HDFS storage, as removing a core node also removes the underlying storage.

18
Q

When was Managed Scaling in Amazon EMR introduced and what did it replace?

A

Managed Scaling in Amazon EMR was introduced in 2020, replacing the previous ‘EMR Automatic Scaling’ that was based on CloudWatch metrics.

19
Q

What were the limitations of Amazon EMR automatic scaling before 2020?

A

Prior to 2020, automatic scaling in Amazon EMR could only add or remove capacity within instance groups. It did not support mixed instance types.

20
Q

What does EMR Managed Scaling support?

A

EMR Managed Scaling supports instance groups as well as instance fleets. It can scale spot instances, on-demand instances, and instances covered by a Savings Plan up and down within the same cluster. This applies to Spark, Hive, and YARN workloads on EMR.

21
Q

How does Managed Scaling in EMR scale up?

A

When scaling up, Managed Scaling first tries to add core nodes. If it reaches the limit, it then adds task nodes, up to the maximum number of units specified by the user.

22
Q

How does Managed Scaling in EMR scale down?

A

Scaling down with Managed Scaling starts by removing task nodes and then core nodes, adhering to the minimum constraints set by the user.

23
Q

What is the order of node removal when scaling down with Managed Scaling?

A

Spot nodes will always be removed before on-demand instances when scaling down with Managed Scaling.

24
Q

What configuration does Managed Scaling in EMR allow?

A

Managed Scaling allows specifying a maximum and minimum number of units (core nodes and task nodes), and it can be applied across a fleet, not just an instance group.
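
A minimal boto3 sketch of attaching a managed scaling policy to a cluster; the cluster ID and the unit limits are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Capacity is expressed in units; UnitType can also be VCPU or InstanceFleetUnits.
  emr.put_managed_scaling_policy(
      ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      ManagedScalingPolicy={
          "ComputeLimits": {
              "UnitType": "Instances",
              "MinimumCapacityUnits": 2,
              "MaximumCapacityUnits": 10,
              "MaximumCoreCapacityUnits": 4,      # capacity beyond this comes from task nodes
              "MaximumOnDemandCapacityUnits": 4,  # capacity beyond this comes from spot
          }
      },
  )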

25
Q

What are the key modules of Hadoop architecture?

A

Hadoop architecture comprises several modules: Hadoop Common (or Hadoop Core), Hadoop Distributed File System (HDFS), YARN, and MapReduce. These form the basis of Hadoop.

26
Q

What is Hadoop Common or Hadoop Core?

A

Hadoop Common or Hadoop Core includes libraries and utilities that other Hadoop modules build on. It provides all the file system and operating system level abstractions needed on top of the cluster, along with the JAR files and scripts required to start Hadoop.

27
Q

What is HDFS in Hadoop?

A

Hadoop Distributed File System (HDFS) is a distributed, scalable file system that stores blocks of data across instances in the cluster. It ensures data redundancy by storing multiple copies of those blocks on different instances. However, on Amazon EMR, HDFS is ephemeral and data will be lost upon terminating the cluster.

28
Q

What is YARN in Hadoop?

A

YARN (Yet Another Resource Negotiator) is an abstraction layer added in Hadoop 2.0 between MapReduce and HDFS. It allows more than one data processing framework and centrally manages cluster resources.

29
Q

What is MapReduce in Hadoop?

A

MapReduce is a core data processing framework in Hadoop for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is made up of mappers (which map data to sets of key value pairs - the intermediate results of processing) and reducers (which combine those intermediate results, apply additional algorithms, and produce the final output).
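
To make the mapper/reducer split concrete, here is a classic Hadoop Streaming style word count sketch in Python. The two-script layout and file names are illustrative assumptions, not part of the card.

  # mapper.py - emits one "word<TAB>1" pair per word read from stdin
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print(f"{word}\t1")

  # reducer.py - Hadoop Streaming delivers mapper output sorted by key,
  # so counts for each word can be summed as the input streams through
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print(f"{current_word}\t{current_count}")
          current_word, current_count = word, int(count)
  if current_word is not None:
      print(f"{current_word}\t{current_count}")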

30
Q

What has largely supplanted MapReduce for distributed file processing on a Hadoop cluster?

A

Apache Spark has largely supplanted MapReduce for distributed file processing on a Hadoop cluster due to its faster speed, extensibility, and more versatile capabilities.

31
Q

Amazon EMR Serverless

A

A serverless option in Amazon EMR that automatically scales resources up and down for data analysts and engineers to run open-source big data analytics frameworks without having to manage clusters or servers.

32
Q

Frameworks supported by EMR Serverless

A

Apache Spark and Apache Hive. (Frameworks such as HBase, Pig, and Zeppelin still require a regular EMR cluster.)

33
Q

Creating an EMR Serverless Application

A

To use EMR Serverless, you first create an EMR Serverless application. This can be done using the AWS Management Console, the AWS CLI, or the AWS SDKs. Jobs are then submitted to the application, typically as Apache Spark or Apache Hive workloads.

34
Q

Job submission to EMR Serverless

A

When a job is submitted to EMR Serverless, Amazon EMR will automatically provision the resources needed to run the job. The resources will be scaled up and down as needed and released when the job is finished.

35
Q

Monitoring Jobs in EMR Serverless

A

The progress of jobs can be monitored using the AWS Management Console, the AWS CLI, or the AWS SDKs. Logs for jobs can also be viewed.

36
Q

Benefits of EMR Serverless

A

Ease of use due to no need to manage clusters, cost-effectiveness as you only pay for resources used, and scalability as resources are automatically scaled up and down as needed.

37
Q

How do you use EMR Serverless?

A
  • You interact with the AWS CLI to create a serverless application. At the time this was written, only the CLI was supported, with console and SDK support expected to follow.
  • A job execution role is set up in IAM, ensuring the job has permission to access Amazon EMR Serverless, the scripts and data in S3, Glue metadata (if required), and KMS keys for encryption.
  • An EMR Serverless application is then created using Spark, Hive, or any other preferred framework.
  • The job is fed in through an EMR job request, with a link to the Spark script or Hive query.
  • Here's a command line example of invoking an EMR Serverless job: aws emr-serverless start-job-run, passing in the application ID and the execution role ARN (see the sketch after this list).
  • Under job driver, the entry point is the path to the script. Arguments can be passed in as parameters. The Spark submit parameters can be overridden if needed.
  • Even in the serverless setup, the user maintains control over parameters such as executor cores and driver cores.
  • Configuration overrides specific to EMR Serverless can also be sent, for instance directing the job's logs to a specific path in S3.
  • Upon completion, the outputs and logs are stored in the pre-specified locations.
  • The system can be shut down when not in use, similar to standard EMR.
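
Below is a rough boto3 equivalent of that start-job-run call, assuming an EMR Serverless application already exists. The application ID, role ARN, and S3 paths are placeholder assumptions.

  import boto3

  serverless = boto3.client("emr-serverless")

  response = serverless.start_job_run(
      applicationId="00f1abcdefghijkl",  # hypothetical application ID
      executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
      jobDriver={
          "sparkSubmit": {
              "entryPoint": "s3://my-bucket/scripts/etl.py",
              "entryPointArguments": ["--input", "s3://my-bucket/input/"],
              # executor and driver sizing stays under your control, even serverless
              "sparkSubmitParameters": "--conf spark.executor.cores=2 --conf spark.driver.cores=2",
          }
      },
      configurationOverrides={
          "monitoringConfiguration": {
              "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
          }
      },
  )
  print(response["jobRunId"])
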
38
Q

EMR on EKS - What is it?

A

EMR on EKS is a serverless approach to EMR that allows you to submit a Spark job on Elastic Kubernetes Service without having to worry about provisioning clusters. It provides the benefits of Kubernetes as well as the fully managed aspect of EMR, enabling an automated setup for Spark applications within EKS.

39
Q

Advantages of EMR on EKS

A

EMR on EKS allows resource sharing between Spark and other applications running in Kubernetes, potentially making more efficient use of the hardware and saving costs. It also integrates with various Amazon services and can be spread across multiple availability zones.

40
Q

Deployment of EMR workload to Amazon EKS

A

With a few clicks in the console, you can choose the Apache Spark version and deploy an EMR workload to Amazon EKS; EMR automatically packages the workload into a container. EMR provides prebuilt connectors for integrating with other AWS services, manages the deployment of the container on the EKS cluster, and takes care of scaling, logging, and monitoring of the workload.

41
Q

What is Apache Spark?

A
  1. Apache Spark is an open source distributed processing framework used for big data workloads.
  2. Spark sits alongside MapReduce in the Hadoop stack and can replace it for analysis tasks in big data processing.
  3. Spark outperforms MapReduce due to in-memory caching and a query execution optimizer, resulting in more efficient operations.
  4. Apache Spark supports programming languages such as Java, Scala, Python, and R, with Scala and Python being more popular.
  5. Code in Spark is reusable across different applications due to its software architecture.
  6. Spark supports batch processing, interactive queries, real-time analytics, machine learning through the MLlib library, and graph processing via GraphX.
  7. Spark Streaming and Structured Streaming allow real-time processing of data from a stream, and can integrate with Kinesis or Kafka on EMR.
  8. Spark is not suitable for Online Transaction Processing (OLTP), i.e., handling thousands of transactions per second. Instead, it’s designed for Online Analytical Processing (OLAP), running longer queries for analysis.
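
To illustrate the points above, here is a minimal PySpark sketch of an OLAP-style aggregation. The S3 path and column names are placeholder assumptions; on EMR this would typically be launched with spark-submit.

  from pyspark.sql import SparkSession

  # On EMR, the session runs on the cluster and resources are managed by YARN.
  spark = SparkSession.builder.appName("orders-summary").getOrCreate()

  # Read data from S3 via EMRFS (path is a placeholder).
  orders = spark.read.option("header", True).csv("s3://my-bucket/orders/")

  # In-memory caching is one reason Spark outperforms plain MapReduce for iterative work.
  orders.cache()

  # An analytical aggregation (OLAP) rather than OLTP-style point lookups.
  orders.groupBy("customer_id").count().orderBy("count", ascending=False).show(10)

  spark.stop()
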
42
Q

What is an RDD in Apache Spark?

A

RDD stands for Resilient Distributed Dataset and is a fundamental data structure in Apache Spark. It represents an immutable, partitioned collection of objects that can be processed in parallel across a cluster of machines. RDDs provide fault tolerance by allowing the data to be automatically recovered in case of failures.

Key characteristics of RDDs include:

  1. Resilient: RDDs can recover lost data partitions by utilizing lineage information to rebuild lost partitions.
  2. Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel processing and data locality optimization.
  3. Immutable: RDDs are read-only and cannot be modified once created. However, new RDDs can be derived from existing ones through transformations.
  4. Lazily Evaluated: RDDs support lazy evaluation, meaning that transformations on RDDs are not executed immediately but are computed only when an action requires the result.

RDDs provide a programming interface for performing operations such as transformations (e.g., map, filter, reduceByKey) and actions (e.g., count, collect, reduce, save). These operations enable developers to perform distributed data processing tasks in a concise and scalable manner.

It’s worth noting that while RDDs were the primary data abstraction in earlier versions of Apache Spark, newer versions introduced higher-level APIs like DataFrames and Datasets that provide optimizations and a more structured approach to working with data.
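
A minimal PySpark sketch of the RDD behaviour described above: transformations are lazy and only actions trigger execution. The numbers and partition count are arbitrary.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
  sc = spark.sparkContext

  # Build a partitioned RDD and chain lazy transformations.
  numbers = sc.parallelize(range(1, 1001), numSlices=8)  # distributed across partitions
  evens = numbers.filter(lambda n: n % 2 == 0)           # transformation (lazy)
  squares = evens.map(lambda n: n * n)                    # transformation (lazy)

  # Actions trigger execution; lineage lets Spark rebuild lost partitions on failure.
  print(squares.count(), squares.reduce(lambda a, b: a + b))

  spark.stop()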

43
Q

What is Spark Streaming?

A

Key points about Spark Streaming:

  1. Spark Streaming is a component of Apache Spark designed for real-time processing and analysis of streaming data.
  2. It operates on micro-batches, allowing developers to use batch processing operations on continuous data streams.
  3. Spark Streaming provides fault tolerance and scalability, making it suitable for handling large-scale streaming data.
  4. It integrates seamlessly with other components of Apache Spark, such as Spark SQL, MLlib, and GraphX, enabling unified processing of both batch and streaming data.
  5. It supports windowed operations, allowing data to be processed over specific time intervals or sliding windows.
  6. Spark Streaming offers exactly-once processing semantics, ensuring that each data record is processed only once, even in the presence of failures.
  7. It provides connectors for various external systems, including Kafka, Flume, and Amazon Kinesis, enabling easy integration with different data sources and sinks.
  8. Spark Streaming is widely used for real-time data processing applications, such as real-time analytics, fraud detection, log analysis, and monitoring.
  9. It offers a high-level API in Scala, Java, Python, and R, making it accessible to developers with different programming backgrounds.
  10. With its speed, scalability, and flexibility, Spark Streaming has become a popular choice for building real-time streaming applications in the big data ecosystem.
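
A minimal sketch using the newer Structured Streaming API, with the built-in rate source standing in for Kafka or Kinesis; the window size and row rate are arbitrary assumptions.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import window

  spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

  # The rate source generates test rows; in practice this would be a Kafka or
  # Kinesis source configured through the appropriate connector on the cluster.
  events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

  # Windowed aggregation over event timestamps, processed as micro-batches.
  counts = events.groupBy(window(events.timestamp, "1 minute")).count()

  query = counts.writeStream.outputMode("complete").format("console").start()
  query.awaitTermination()  # runs until stopped
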
44
Q

Spark SQL?

A

The key points about Spark SQL:

  1. Spark SQL is a component of Apache Spark that provides a programming interface for working with structured and semi-structured data.
  2. It allows developers to query structured data using SQL syntax and leverage the power of Spark’s distributed processing capabilities.
  3. Spark SQL supports various data sources, including Hive, Avro, Parquet, JSON, and JDBC, enabling seamless integration with existing data systems.
  4. It provides a DataFrame API, which is an abstraction over distributed collections of data, allowing developers to manipulate structured data using familiar SQL-like operations.
  5. Spark SQL supports both batch and streaming data processing, enabling unified processing of structured data from different sources.
  6. It optimizes queries using a cost-based optimizer, leveraging techniques like predicate pushdown, column pruning, and join reordering to improve query performance.
  7. It supports complex data types, user-defined functions (UDFs), and custom aggregations, allowing developers to handle complex data transformations and analytics.
  8. Spark SQL can be seamlessly integrated with other Spark components, such as Spark MLlib for machine learning and Spark Streaming for real-time data processing.
  9. It provides interoperability with popular data analysis tools and libraries, such as Apache Hive, Apache Kafka, and Apache Parquet.
  10. Spark SQL is widely used in various applications, including data exploration, ad-hoc querying, data integration, and ETL (Extract, Transform, Load) processes, offering a unified platform for data processing and analytics in the Apache Spark ecosystem.
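
A minimal Spark SQL sketch showing the same aggregation expressed as a SQL query and via the DataFrame API; the Parquet path and column names are placeholder assumptions.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

  # Load structured data into a DataFrame (path is a placeholder).
  sales = spark.read.parquet("s3://my-bucket/sales/")

  # Query with familiar SQL syntax by registering a temporary view...
  sales.createOrReplaceTempView("sales")
  spark.sql("""
      SELECT region, SUM(amount) AS total
      FROM sales
      GROUP BY region
      ORDER BY total DESC
      LIMIT 5
  """).show()

  # ...or express the same thing with DataFrame operations.
  (sales.groupBy("region").sum("amount")
        .withColumnRenamed("sum(amount)", "total")
        .orderBy("total", ascending=False)
        .show(5))

  spark.stop()
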
45
Q

What is Hive?

A
  • Hive is a tool that allows executing SQL-like queries on unstructured data stored in HDFS or in S3 (in the case of EMR), sitting on top of Hadoop YARN.
  • Hive uses MapReduce or Tez as an underlying engine to distribute the processing of SQL queries on the data.
  • Tez is an alternative to MapReduce and provides faster processing using in-memory directed acyclic graphs.
  • Hive provides a familiar SQL syntax, called HiveQL, and an interactive interface for querying data.
  • It is scalable and suitable for data warehouse and OLAP applications.
  • Hive is not as fast as technologies like Apache Spark, but it is easier to use for simple OLAP queries.
  • Hive is optimized and extensible, supporting user-defined functions and providing interfaces like Thrift server, JDBC, and ODBC drivers.
  • It can be accessed by external applications for analytics or web services.
  • However, Hive is not designed for Online Transaction Processing (OLTP) and should not be used for high-frequency, real-time queries.
46
Q

Exam Question

What is the Hive Metastore?

A

The Hive Metastore is what imposes structure on otherwise unstructured data. It stores information about the columns, data types, and other details that define the structure of the data, and it acts as a reference point for querying the underlying data, such as CSV files, as if it were a SQL table. For example, a structured table can be created in Hive over raw ratings CSV data by specifying column names, data types, the data format, and the data's location. The Hive Metastore therefore plays a crucial role in organizing and accessing the structured view of the underlying data.
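
A minimal sketch of the kind of table definition the card describes, issued here through PySpark with Hive support enabled rather than the Hive shell; the table name, columns, and S3 location are placeholder assumptions.

  from pyspark.sql import SparkSession

  # enableHiveSupport() makes Spark use the Hive Metastore
  # (on EMR this can be the Glue Data Catalog or an external RDS metastore).
  spark = SparkSession.builder.appName("ratings-ddl").enableHiveSupport().getOrCreate()

  # Only the schema goes into the metastore; the raw CSV files stay in S3.
  spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
          user_id INT,
          movie_id INT,
          rating INT,
          rated_at BIGINT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION 's3://my-bucket/ratings/'
  """)

  # With the structure registered, the CSV data can be queried like a SQL table.
  spark.sql("SELECT rating, COUNT(*) AS n FROM ratings GROUP BY rating").show()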

47
Q

By default, where is the Hive Metastore stored?

A

In a MySQL database on the master node of the cluster.

48
Q

Does Hive allow the use of an external metastore?

A

Yes.

49
Q

Where can an external metastore be hosted?

A

Outside the cluster or on another node within it.

50
Q

What can serve as a Hive Metastore?

A

AWS Glue Data Catalog.

51
Q

What does the AWS Glue Data Catalog provide?

A

Centralized metadata for unstructured data.

52
Q

What other AWS services can utilize the AWS Glue Data Catalog as a Hive Metastore?

A

Amazon EMR, Redshift, and Athena.

53
Q

How can the Hive Metastore be stored externally for persistence?

A

In an external Amazon RDS instance.

54
Q

What is the benefit of storing the Hive Metastore in an external RDS instance?

A

Ensuring persistence even if the cluster is shut down.

55
Q

Hive integrated with which AWS services?

A
  1. Hive integrates with AWS in several ways. It can be used with Amazon S3 to automatically load table partitions from different subdirectories in S3.
  2. Hive on Amazon EMR allows for specifying an off-instance metadata store and writing data directly to S3 without temporary files. It also supports referencing resources such as scripts and libraries stored in S3.
  3. Hive on EMR can integrate with Amazon DynamoDB by defining an external Hive table based on a DynamoDB table, enabling analysis of DynamoDB data and data movement between DynamoDB, EMRFS, and S3. Additionally, Hive on EMR supports join operations across DynamoDB tables.
56
Q

Apache Pig

A

Apache Pig is an important component of the Hadoop ecosystem, included in Amazon EMR. It offers an alternative interface to MapReduce, addressing the complexity of writing code for mappers and reducers.

57
Q

What is Apache Pig Latin?

A

Pig Latin is a scripting language introduced by Apache Pig. It allows users to define map and reduce steps using SQL-style syntax, simplifying development compared to writing Java code directly.

58
Q

Extensibility of Apache Pig

A

Apache Pig is highly extensible with user-defined functions, enabling users to expand on its functionalities by writing custom code.

59
Q

Apache Pig Integration

A

Pig operates on top of MapReduce or Tez, which sit on top of YARN and HDFS/EMRFS. It shares similarities with Hive in terms of its architecture and integration within the Hadoop ecosystem.

60
Q

Apache Pig Relevance

A

Although Pig is considered an older technology, it is still relevant and may appear in exams. Understanding its purpose, features, and syntax is important for comprehensive knowledge of the Hadoop ecosystem.

61
Q

Pig and AWS Integration

A

Pig and AWS have integration capabilities that enhance the functionality of Pig on EMR. Pig can work with data on both HDFS and S3 through EMRFS, similar to Hive. It can load external JARs and scripts from S3. However, the integration between Pig and AWS is limited to these features, and the core functionality of Pig remains unchanged.

62
Q

What is HBase?

A

HBase is a non-relational database designed for petabyte-scale data within the Hadoop ecosystem. It operates on distributed data across a Hadoop cluster and is based on Google’s BigTable technology. HBase treats unstructured data as a NoSQL database, allowing fast queries due to its in-memory operation. It integrates with Hive, enabling SQL-style commands to be issued on data exposed through HBase. The combination of HBase’s distributed nature and integration with Hive makes it a powerful tool for managing and querying large-scale data within the Hadoop ecosystem.

63
Q

HBase Features

A

HBase treats unstructured data as a NoSQL database, allowing fast queries due to its in-memory operation. It integrates with Hive, enabling SQL-style commands on data exposed through HBase.

64
Q

HBase Benefits

A

The combination of HBase’s distributed nature and integration with Hive makes it a powerful tool for managing and querying large-scale data within the Hadoop ecosystem.

65
Q

HBase vs DynamoDB

A

HBase and DynamoDB are both NoSQL databases designed for similar use cases. However, when choosing between the two for use with EMR and storing data on an EMR cluster, DynamoDB offers some advantages. It is fully managed and scales automatically, separate from the EMR cluster, providing a serverless solution. DynamoDB also has better integration with other AWS services and AWS Glue. On the other hand, HBase may be a better choice if there is a possibility of moving to a non-AWS Hadoop cluster in the future or if dealing with sparse data or high-frequency counters. HBase offers consistent reads, better performance for writes and updates, and integration with Hadoop services. Ultimately, the choice between HBase and DynamoDB depends on the specific ecosystem and integration requirements, with DynamoDB being well-suited for AWS integration and HBase offering more compatibility with Hadoop.

66
Q

HBase Advantages

A

HBase offers consistent reads, better write/update performance, and integration with Hadoop services. It is suitable for non-AWS Hadoop clusters, sparse data, and high-frequency counters.

67
Q

DynamoDB Advantages

A

DynamoDB is fully managed, scales automatically, and provides a serverless solution. It has better integration with AWS services and AWS Glue. It is well-suited for AWS integration.

68
Q

What is the difference between HBase and MapReduce?

A

MapReduce and HBase are both components of the Hadoop ecosystem, but they serve different purposes and have distinct characteristics:

  1. Purpose:
    • MapReduce: MapReduce is a programming model and software framework designed for processing and analyzing large datasets in a distributed manner. It focuses on data processing tasks such as filtering, sorting, and aggregating data.
    • HBase: HBase, on the other hand, is a distributed, scalable, and non-relational database that is built on top of Hadoop. It is designed for storing and managing structured and semi-structured data in a fault-tolerant and highly available manner.
  2. Data Storage:
    • MapReduce: MapReduce does not provide its own storage system. It processes data stored in a distributed file system, such as Hadoop Distributed File System (HDFS), by dividing the data into smaller chunks and processing them in parallel.
    • HBase: HBase stores data in a distributed manner directly in its own storage system, which is based on the concept of BigTable. It organizes data into tables with rows and columns and allows random access to data using a key-value model.
  3. Data Processing:
    • MapReduce: MapReduce processes data by splitting it into smaller chunks, which are then processed in parallel across a cluster of nodes. It follows a two-step process: the map phase and the reduce phase. The map phase applies a function to each data item, and the reduce phase aggregates the results of the map phase to produce the final output.
    • HBase: HBase provides random read and write access to data, allowing fast and efficient retrieval and modification of individual records. It supports real-time data processing and enables high-speed queries by leveraging its in-memory capabilities.
  4. Use Cases:
    • MapReduce: MapReduce is suitable for batch processing and analyzing large volumes of data where data processing can be divided into map and reduce tasks. It is commonly used for tasks such as log analysis, data aggregation, and ETL (Extract, Transform, Load) operations.
    • HBase: HBase is well-suited for applications that require low-latency random access to large amounts of structured data, such as time series data, sensor data, or user profiles. It is often used for use cases involving real-time data processing, real-time analytics, and serving as a distributed database for web applications.

In summary, MapReduce is a distributed data processing framework, while HBase is a distributed database designed for structured data storage and retrieval. MapReduce focuses on batch processing and analysis, whereas HBase provides real-time, random access to large-scale structured data.

69
Q

What is Presto?

A

Presto is a technology pre-installed on Amazon EMR that enables connection to various big data databases and data stores simultaneously. It allows SQL-style queries across multiple databases and supports SQL join commands to combine data from different technologies within a cluster. Presto offers interactive queries at a petabyte scale, has a familiar SQL syntax, and is optimized for OLAP applications.

70
Q

Who developed Presto?

A

Presto was initially developed by Facebook and is partially maintained by them as an open-source project. It provides high-performance querying capabilities for analyzing massive data sets stored in different databases within an ecosystem.

71
Q

How is Presto related to Amazon Athena?

A

Amazon Athena is a serverless version of Presto that utilizes the same technology. It provides JDBC, command line, and Tableau interfaces for accessing and analyzing data from various sources.

72
Q

Presto Connectors?

A

Presto supports connectors for multiple data sources, including HDFS, S3, Cassandra, MongoDB, HBase, Redshift, and Teradata. It allows users to unify data from disparate sources and perform queries across the entire cluster.

73
Q

How is the performance of Presto?

A

Presto is known for its high performance, processing data in-memory and minimizing unnecessary IO overhead. It is suitable for efficient interactive querying of massive data sets but not for OLTP or batch processing.

74
Q

What is Zeppelin?

A

Zeppelin, which comes pre-installed on Amazon EMR, is an interactive notebook on your cluster that allows you to run Python scripts and code against your data. It supports iPython notebook-like functionality, where you can write code blocks and intersperse them with comments and notes. Zeppelin integrates with Apache Spark, JDBC, HBase, Elasticsearch, and more, allowing you to kick off various tasks from the notebook. It enables interactive Spark code execution, speeding up development cycles and facilitating experimentation. Zeppelin also provides visualization capabilities for charts and graphs, making it easier to analyze and interpret results. Additionally, it supports Spark SQL for issuing SQL queries directly against the data. Zeppelin makes Spark more accessible as a data science tool rather than just a programming environment.

75
Q

Similarity between Zeppelin and EMR Notebooks?

A

Amazon EMR offers a similar concept called EMR Notebook, which includes AWS integration and features such as automatic backup to S3 and the ability to provision and manage clusters from the notebook. EMR Notebooks are hosted in a VPC for security and come with graphical libraries from the Anaconda repository for prototyping and exploratory analysis. They can be attached to existing clusters or used to create new clusters. EMR Notebooks are provided at no additional charge to Amazon EMR customers, offering value to Hadoop clusters running on EMR.

76
Q

What is Hue?

A

Hue, short for Hadoop User Experience, is the front-end interface and management console for an Amazon EMR cluster. It serves as a centralized tool for managing the cluster, including spinning up services, monitoring operational insights, and facilitating data movement between HDFS, EMRFS, and S3. Hue can integrate with IAM to ensure appropriate access control for users. While using Hue, it’s important to remember that it primarily functions as a management and monitoring tool for the EMR cluster, providing a front-end console for cluster operations.

77
Q

What is Splunk?

A
  1. Splunk is an operational tool used for monitoring and gaining insights into your Amazon EMR cluster.
  2. It continuously collects and indexes data to provide real-time information about the performance and activities of your cluster.
  3. Splunk can be deployed on EMR or set up as a separate cluster.
  4. Amazon offers public AMIs with Splunk Enterprise for easy deployment and monitoring of your EMR cluster.
  5. Splunk helps visualize and analyze data from EMR and S3 within your cluster.
  6. While Splunk may be mentioned in a list of technologies in an exam question, remember that its main purpose is to provide operational insights; don't let it distract you in the context of a question.
78
Q

What is Flume?

A
  1. Flume is a distributed and reliable service used for streaming data into your cluster, similar to Kinesis or Kafka.
  2. It is designed specifically for efficiently collecting, aggregating, and moving large amounts of log data.
  3. Flume operates based on the concept of sources, channels, and sinks.
  4. A source (such as a web server) provides events to Flume, which are then stored in one or more channels. Channels act as passive stores for events until they are consumed by a Flume sink.
  5. Sinks remove events from channels and place them in external repositories like HDFS or Hive.
  6. Examples of sinks include HDFS sink for writing events to HDFS and Hive sink for streaming events to Hive tables.
  7. Flume is used to stream log data from external sources into various destinations, such as HDFS, Hive, or HBase.
  8. Understanding Flume’s purpose as a log data streaming tool is important, and it may be presented as an alternative technology for streaming applications in an EMR cluster.
79
Q

What is MXNet?

A
  1. MXNet is an alternative to TensorFlow and a library for building and accelerating neural networks.
  2. It is included in EMR and is considered the preferred framework for deep learning on EMR.
  3. For the purpose of the exam, it is not necessary to design neural networks.
  4. MXNet is a framework used for building distributed deep learning applications on an entire EMR cluster.