EMR Flashcards

1
Q

What is EMR?

A

Amazon EMR (Elastic MapReduce) is a cloud-based big data platform for processing large amounts of data. It uses open-source tools such as Apache Spark and Hadoop, along with several other leading open-source frameworks, and supports data processing tasks such as web indexing, data transformation (ETL), machine learning, financial analysis, scientific simulation, and bioinformatics.

2
Q

Features of EMR?

A

Scalability: AWS EMR allows you to quickly and easily scale your processing capacity. You can add or remove cluster instances as your needs change, and you only pay for what you use.

Flexibility: EMR supports multiple big data frameworks, including Apache Spark, Hadoop, HBase, Presto, and Flink. It also integrates with other AWS services like AWS Glue, Amazon S3, DynamoDB, and more.

Speed: EMR is designed to process large data sets quickly and efficiently. It distributes the data and processing across a resizable cluster of Amazon EC2 instances.

Security: AWS EMR ensures data is stored securely, with options for encryption at rest and in transit. It is also integrated with AWS Lake Formation to provide granular data access control.

Cost-Effective: With AWS EMR, you can use EC2 Spot Instances to save on computing costs. You also have the option to use Reserved Instances for long-term workloads, or On-Demand Instances for short-term workloads.

3
Q

What are the different components of EMR Cluster?

A

An Amazon Elastic MapReduce (EMR) cluster is essentially a collection of Amazon EC2 instances, known as nodes, that are running Hadoop. Each node in the cluster has a specific role or node type: master node, core node, and task node.

  1. Master Node: Every EMR cluster will always have at least one master node. The master node manages the cluster, runs software components to coordinate data distribution, and supervises tasks among other nodes for processing. It monitors the overall health of the cluster and tracks the status of tasks. A minimal single-node EMR cluster could just consist of a master node doing everything.
  2. Core Node: Core nodes run tasks and store data in the Hadoop Distributed File System (HDFS) or in the EMR File System (EMRFS) which enables data writing into S3. These nodes perform the actual work of processing and storing data across the cluster. In a multi-node cluster, there will be at least one core node.
  3. Task Node: Task nodes, a relatively new addition, only run tasks without storing any data in HDFS or EMRFS. These nodes are optional and are typically added when there’s a need for more processing capacity but no additional storage. Task nodes can be especially beneficial in EMR as it often uses S3 for storage. Using task nodes helps save money as you don’t pay for unnecessary storage.

It's also worth noting that there's no risk of data loss when removing a task node, as it doesn't store any data. Task nodes are well suited to spot instances, which are an efficient way to add capacity and cut costs on an EMR cluster; this may come up in the exam. If a spot instance goes down, it doesn't affect the data or the functioning of the cluster, since task nodes only provide extra processing capacity. Hence, using spot instances for task nodes is a recommended strategy for cost-efficient and dynamic cluster expansion.

4
Q

What are the different types of clusters?

A

There are two main ways of using Amazon Elastic MapReduce (EMR): transient clusters and long-running clusters.

Transient Clusters: Transient clusters are temporary and automatically terminate once all assigned steps are complete. When setting up a transient cluster, you specify the type of hardware for the EMR cluster and define the processing steps. The cluster carries out these steps—such as loading data, processing data, and storing results—and shuts down automatically when it’s done. This strategy is cost-effective as you only pay for the time the cluster is operational. If you occasionally run one-off jobs, transient clusters are a good choice. They spin up resources, execute your job, and then shut down, potentially saving money.

Long-Running Clusters: If you require a persistent data warehouse with continuous or periodic processing of large data sets, a long-running cluster is more suitable. In this scenario, you spin up a cluster with specified parameters and leave it running until manually terminated. To address occasional spikes in capacity needs, you can add more task nodes using spot instances. For long-running clusters, you can use reserved instances to save more money if you plan to keep the cluster operational for a prolonged period. By default, termination protection is enabled and auto termination is disabled on a long-running cluster, ensuring its preservation as long as possible.

In summary, choose transient clusters for predefined, one-time tasks, and long-running clusters for continuous data processing and access in a more persistent environment.
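
To make this concrete, here is a minimal boto3 sketch of launching a transient cluster that runs its steps and then terminates itself. The release label, instance types, roles, and S3 paths are placeholder assumptions, not values from the card.

  import boto3

  emr = boto3.client("emr", region_name="us-east-1")

  # Launch a transient cluster: because KeepJobFlowAliveWhenNoSteps is False,
  # the cluster shuts down automatically once the listed steps finish.
  response = emr.run_job_flow(
      Name="nightly-etl-transient",
      ReleaseLabel="emr-6.9.0",
      Applications=[{"Name": "Spark"}],
      Instances={
          "InstanceGroups": [
              {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
              {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
          ],
          "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps are done
      },
      Steps=[
          {
              "Name": "spark-etl",
              "ActionOnFailure": "TERMINATE_CLUSTER",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": ["spark-submit", "s3://my-bucket/scripts/etl.py"],
              },
          }
      ],
      JobFlowRole="EMR_EC2_DefaultRole",
      ServiceRole="EMR_DefaultRole",
  )
  print(response["JobFlowId"])  # you pay only while the cluster is running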

5
Q

How can you interact with EMR?

A

Two primary ways to use and interact with Amazon Elastic MapReduce (EMR):

  1. Direct Interaction with Master Node: When launching an EMR cluster, you select the desired frameworks and applications, such as Apache Spark. The cluster automatically installs these when spinning up. If you have a long-lived cluster, you can connect directly to the master node and run your jobs from there. This approach is especially suitable for those comfortable with the command-line interface. For instance, you could set up a Spark-enabled EMR cluster, connect to the master node, and initiate your Spark driver script to leverage the full power of the cluster.
  2. Using AWS Console: Alternatively, you can submit steps via the AWS console. This process can be done purely graphically through the console. Basic tasks such as processing data in S3 or from the Hadoop Distributed File System (HDFS) can be defined as steps. Once defined, you can initiate these steps via the AWS console without needing to connect directly to the master node or use the command line. This data can then be output to S3 or another location.

In summary, there are two main ways to use EMR: directly interacting with the master node (usually for command-line users) or defining and initiating steps through the AWS console. The choice between the two largely depends on the user’s comfort with the command line and their specific use case.
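
Steps can also be submitted programmatically rather than through the console. Below is a minimal boto3 sketch of adding a step to an already-running cluster; the cluster ID, script path, and output location are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Submit a step to an existing cluster without connecting to the master node.
  emr.add_job_flow_steps(
      JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      Steps=[
          {
              "Name": "process-s3-data",
              "ActionOnFailure": "CONTINUE",
              "HadoopJarStep": {
                  "Jar": "command-runner.jar",
                  "Args": [
                      "spark-submit",
                      "s3://my-bucket/scripts/process.py",
                      "--output", "s3://my-bucket/output/",
                  ],
              },
          }
      ],
  )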

6
Q

What are the different kinds of storage options in EMR?

A
  1. HDFS (Hadoop Distributed File System)
  2. EMRFS (EMR File System)

In summary, you can use either HDFS or EMRFS for storage in an EMR cluster, with the main distinction being data persistence after cluster termination. The improved consistency of S3 has simplified using EMRFS.

7
Q

Details of HDFS?

A

Since EMR is fundamentally a Hadoop cluster running on EC2, you can use HDFS for data storage. It utilizes the local storage of each instance, distributing the storage across the cluster.
Files are stored in blocks, and these blocks are distributed across the cluster.
To ensure redundancy, multiple copies of each file block are stored across instances. However, the storage is ephemeral, meaning all data is lost when the cluster is shut down. Despite this limitation, HDFS can still be used for caching intermediate results or workloads with substantial random IO.

8
Q

Details of EMR File System?

A

EMRFS enables the use of S3 as if it were HDFS, providing persistent storage even after cluster termination.
There used to be a consistency issue when multiple nodes tried to write to the same S3 location simultaneously. This was resolved with the introduction of EMRFS Consistent View, which uses DynamoDB to track file access consistency. However, this added complexity and required careful management of read/write capacity for DynamoDB. As of 2021, Amazon S3 itself guarantees strong consistency, obviating the need for EMRFS Consistent View.

9
Q

What are the alternative storage options for EMR clusters?

A

Amazon Elastic MapReduce (EMR) clusters also support two other storage options:

  1. Local File System: This is a fast option for data storage, but it’s ephemeral and only suitable for temporary data, like temporary buffers or caches. Data in the local file system will not be backed up and will be lost when the cluster is terminated.
  2. Elastic Block Store (EBS) for HDFS: This option allows the use of EBS-only instance types (e.g., M4, C4) for data storage. However, like the local file system, EBS storage is also deleted when the cluster is terminated. EBS volumes can only be attached when launching a cluster, so there’s no possibility to expand storage capacity later. If an EBS volume is manually detached while running, EMR will treat it as a failure and automatically replace it, showing resilience to this failure mode.

In summary, both the local file system and EBS for HDFS are transient storage options that don’t persist data after cluster termination. For persistent storage that survives after cluster termination, EMRFS with S3 should be used.

10
Q

How does Amazon EMR charge?

A

Amazon Elastic MapReduce (EMR) charges by the hour. The longer the cluster runs, the higher the cost. Costly instance types like GPU instances can make this particularly expensive.

11
Q

How can you save money while running tasks in Amazon EMR?

A

Running tasks as a set of steps that automatically start and stop a cluster when done is recommended. This minimizes the runtime of the cluster and thus reduces cost.

12
Q

What happens in Amazon EMR when a core node fails?

A

EMR will automatically provision a new node in case of a core node failure, allowing tasks to pick up where they left off.

13
Q

What is the best option to add capacity in Amazon EMR?

A

The best option to add capacity in Amazon EMR is often to add or remove task nodes on the fly. Task nodes are similar to core nodes but lack their own HDFS storage capacity.
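
A minimal boto3 sketch of adding spot task nodes to a running cluster on the fly; the cluster ID, instance type, and count are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Add a spot-priced TASK instance group; because task nodes hold no HDFS data,
  # they can later be shrunk or removed without risking data loss.
  emr.add_instance_groups(
      JobFlowId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      InstanceGroups=[
          {
              "Name": "extra-task-capacity",
              "InstanceRole": "TASK",
              "Market": "SPOT",
              "InstanceType": "m5.xlarge",
              "InstanceCount": 4,
          }
      ],
  )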

14
Q

How can you increase processing capacity if you are using EMRFS with S3 for persistent storage?

A

You can increase processing capacity by using task nodes. Even though task nodes don’t increase storage capacity, they can help in increasing the processing capacity.

15
Q

How can you handle temporary surges in processing needs in Amazon EMR?

A

Adding and removing task nodes can effectively handle temporary surges in processing needs, for example during a high-traffic season for an e-commerce website.

16
Q

How can you increase both processing and HDFS storage capacity in Amazon EMR?

A

You can increase both processing and HDFS storage capacity by resizing the cluster’s core nodes.

17
Q

What is the risk of adding and removing core nodes on the fly in Amazon EMR?

A

Adding and removing core nodes on the fly in Amazon EMR carries the risk of data loss if using HDFS storage, as removing a core node also removes the underlying storage.

18
Q

When was Managed Scaling in Amazon EMR introduced and what did it replace?

A

Managed Scaling in Amazon EMR was introduced in 2020, replacing the previous ‘EMR Automatic Scaling’ that was based on CloudWatch metrics.

19
Q

What were the limitations of Amazon EMR automatic scaling before 2020?

A

Prior to 2020, automatic scaling in Amazon EMR could only add or remove capacity within instance groups. It did not support mixed instance types.

20
Q

What does EMR Managed Scaling support?

A

EMR Managed Scaling supports instance groups as well as instance fleets. It can scale spot instances, on-demand instances, and instances covered by a Savings Plan up and down within the same cluster. This applies to Spark, Hive, and YARN workloads on EMR.

21
Q

How does Managed Scaling in EMR scale up?

A

When scaling up, Managed Scaling first tries to add core nodes. If it reaches the limit, it then adds task nodes, up to the maximum number of units specified by the user.

22
Q

How does Managed Scaling in EMR scale down?

A

Scaling down with Managed Scaling starts by removing task nodes and then core nodes, adhering to the minimum constraints set by the user.

23
Q

What is the order of node removal when scaling down with Managed Scaling?

A

Spot nodes will always be removed before on-demand instances when scaling down with Managed Scaling.

24
Q

What configuration does Managed Scaling in EMR allow?

A

Managed Scaling allows specifying a maximum and minimum number of units (core nodes and task nodes), and it can be applied across a fleet, not just an instance group.
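
A minimal boto3 sketch of attaching a managed scaling policy to a cluster; the cluster ID and the unit limits are placeholder assumptions.

  import boto3

  emr = boto3.client("emr")

  # Capacity is expressed in units; UnitType can also be VCPU or InstanceFleetUnits.
  emr.put_managed_scaling_policy(
      ClusterId="j-XXXXXXXXXXXXX",  # hypothetical cluster ID
      ManagedScalingPolicy={
          "ComputeLimits": {
              "UnitType": "Instances",
              "MinimumCapacityUnits": 2,
              "MaximumCapacityUnits": 10,
              "MaximumCoreCapacityUnits": 4,      # capacity beyond this comes from task nodes
              "MaximumOnDemandCapacityUnits": 4,  # capacity beyond this comes from spot
          }
      },
  )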

25
Q

What are the key modules of Hadoop architecture?

A

Hadoop architecture comprises several modules: Hadoop Common (or Hadoop Core), Hadoop Distributed File System (HDFS), YARN, and MapReduce. These form the basis of Hadoop.

26
Q

What is Hadoop Common or Hadoop Core?

A

Hadoop Common or Hadoop Core includes libraries and utilities that other Hadoop modules build on. It provides all the file system and operating system level abstractions needed on top of the cluster, along with the JAR files and scripts required to start Hadoop.

27
Q

What is HDFS in Hadoop?

A

Hadoop Distributed File System (HDFS) is a distributed, scalable file system that stores blocks of data across instances in the cluster. It ensures data redundancy by storing multiple copies of those blocks on different instances. However, on Amazon EMR, HDFS is ephemeral and data will be lost upon terminating the cluster.

28
Q

What is YARN in Hadoop?

A

YARN (Yet Another Resource Negotiator) is an abstraction layer added in Hadoop 2.0 between MapReduce and HDFS. It allows more than one data processing framework and centrally manages cluster resources.

29
Q

What is MapReduce in Hadoop?

A

MapReduce is a core data processing framework in Hadoop for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. It is made up of mappers (which map data to sets of key value pairs - the intermediate results of processing) and reducers (which combine those intermediate results, apply additional algorithms, and produce the final output).
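
To make the mapper/reducer split concrete, here is a classic Hadoop Streaming style word count sketch in Python. The two-script layout and file names are illustrative assumptions, not part of the card.

  # mapper.py - emits one "word<TAB>1" pair per word read from stdin
  import sys

  for line in sys.stdin:
      for word in line.strip().split():
          print(f"{word}\t1")

  # reducer.py - Hadoop Streaming delivers mapper output sorted by key,
  # so counts for each word can be summed as the input streams through
  import sys

  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t")
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print(f"{current_word}\t{current_count}")
          current_word, current_count = word, int(count)
  if current_word is not None:
      print(f"{current_word}\t{current_count}")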

30
Q

What has largely supplanted MapReduce for distributed file processing on a Hadoop cluster?

A

Apache Spark has largely supplanted MapReduce for distributed file processing on a Hadoop cluster due to its faster speed, extensibility, and more versatile capabilities.

31
Q

Amazon EMR Serverless

A

A serverless option in Amazon EMR that automatically scales resources up and down for data analysts and engineers to run open-source big data analytics frameworks without having to manage clusters or servers.

32
Q

Frameworks supported by EMR Serverless

A

Apache Spark and Apache Hive. (Frameworks such as HBase, Pig, and Zeppelin still require a regular EMR cluster.)

33
Q

Creating an EMR Serverless Application

A

To use EMR Serverless, you first create an EMR Serverless application. This can be done using the AWS Management Console, the AWS CLI, or the AWS SDKs. Jobs are then submitted to the application, typically as Apache Spark or Apache Hive workloads.

34
Q

Job submission to EMR Serverless

A

When a job is submitted to EMR Serverless, Amazon EMR will automatically provision the resources needed to run the job. The resources will be scaled up and down as needed and released when the job is finished.

35
Q

Monitoring Jobs in EMR Serverless

A

The progress of jobs can be monitored using the AWS Management Console, the AWS CLI, or the AWS SDKs. Logs for jobs can also be viewed.

36
Q

Benefits of EMR Serverless

A

Ease of use due to no need to manage clusters, cost-effectiveness as you only pay for resources used, and scalability as resources are automatically scaled up and down as needed.

37
Q

How do you use EMR Serverless?

A
  • You interact with the AWS CLI to create a serverless application. At the time this was written, only the CLI was supported, with console and SDK support expected to follow.
  • A job execution role is set up in IAM, ensuring the job has permission to access Amazon EMR Serverless, the scripts and data in S3, Glue metadata (if required), and KMS keys for encryption.
  • An EMR Serverless application is then created using Spark, Hive, or any other preferred framework.
  • The job is fed in through an EMR job request, with a link to the Spark script or Hive query.
  • Here's a command line example of invoking an EMR Serverless job: aws emr-serverless start-job-run, passing in the application ID and the execution role ARN (see the sketch after this list).
  • Under job driver, the entry point is the path to the script. Arguments can be passed in as parameters. The Spark submit parameters can be overridden if needed.
  • Even in the serverless setup, the user maintains control over parameters such as executor cores and driver cores.
  • Configuration overrides specific to EMR Serverless can also be sent, for instance directing the job's logs to a specific path in S3.
  • Upon completion, the outputs and logs are stored in the pre-specified locations.
  • The system can be shut down when not in use, similar to standard EMR.
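
Below is a rough boto3 equivalent of that start-job-run call, assuming an EMR Serverless application already exists. The application ID, role ARN, and S3 paths are placeholder assumptions.

  import boto3

  serverless = boto3.client("emr-serverless")

  response = serverless.start_job_run(
      applicationId="00f1abcdefghijkl",  # hypothetical application ID
      executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
      jobDriver={
          "sparkSubmit": {
              "entryPoint": "s3://my-bucket/scripts/etl.py",
              "entryPointArguments": ["--input", "s3://my-bucket/input/"],
              # executor and driver sizing stays under your control, even serverless
              "sparkSubmitParameters": "--conf spark.executor.cores=2 --conf spark.driver.cores=2",
          }
      },
      configurationOverrides={
          "monitoringConfiguration": {
              "s3MonitoringConfiguration": {"logUri": "s3://my-bucket/logs/"}
          }
      },
  )
  print(response["jobRunId"])
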
38
Q

EMR on EKS - What is it?

A

EMR on EKS is a serverless approach to EMR that allows you to submit a Spark job on Elastic Kubernetes Service without having to worry about provisioning clusters. It provides the benefits of Kubernetes as well as the fully managed aspect of EMR, enabling an automated setup for Spark applications within EKS.

39
Q

Advantages of EMR on EKS

A

EMR on EKS allows resource sharing between Spark and other applications running in Kubernetes, potentially making more efficient use of the hardware and saving costs. It also integrates with various Amazon services and can be spread across multiple availability zones.

40
Q

Deployment of EMR workload to Amazon EKS

A

With a few clicks in the console, you can choose the Apache Spark version and deploy an EMR workload to Amazon EKS; EMR automatically packages the workload into a container. EMR provides prebuilt connectors for integrating with other AWS services, manages the deployment of the container on the EKS cluster, and takes care of scaling, logging, and monitoring of the workload.

41
Q

What is Apache Spark?

A
  1. Apache Spark is an open source distributed processing framework used for big data workloads.
  2. Spark sits alongside MapReduce in the Hadoop stack and can replace it for analysis tasks in big data processing.
  3. Spark outperforms MapReduce due to in-memory caching and a query execution optimizer, resulting in more efficient operations.
  4. Apache Spark supports programming languages such as Java, Scala, Python, and R, with Scala and Python being more popular.
  5. Code in Spark is reusable across different applications due to its software architecture.
  6. Spark supports batch processing, interactive queries, real-time analytics, machine learning through the MLlib library, and graph processing via GraphX.
  7. Spark Streaming and Structured Streaming allow real-time processing of data from a stream, and can integrate with Kinesis or Kafka on EMR.
  8. Spark is not suitable for Online Transaction Processing (OLTP), i.e., handling thousands of transactions per second. Instead, it’s designed for Online Analytical Processing (OLAP), running longer queries for analysis.
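
To illustrate the points above, here is a minimal PySpark sketch of an OLAP-style aggregation. The S3 path and column names are placeholder assumptions; on EMR this would typically be launched with spark-submit.

  from pyspark.sql import SparkSession

  # On EMR, the session runs on the cluster and resources are managed by YARN.
  spark = SparkSession.builder.appName("orders-summary").getOrCreate()

  # Read data from S3 via EMRFS (path is a placeholder).
  orders = spark.read.option("header", True).csv("s3://my-bucket/orders/")

  # In-memory caching is one reason Spark outperforms plain MapReduce for iterative work.
  orders.cache()

  # An analytical aggregation (OLAP) rather than OLTP-style point lookups.
  orders.groupBy("customer_id").count().orderBy("count", ascending=False).show(10)

  spark.stop()
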
42
Q

What is an RDD in Apache Spark?

A

RDD stands for Resilient Distributed Dataset and is a fundamental data structure in Apache Spark. It represents an immutable, partitioned collection of objects that can be processed in parallel across a cluster of machines. RDDs provide fault tolerance by allowing the data to be automatically recovered in case of failures.

Key characteristics of RDDs include:

  1. Resilient: RDDs can recover lost data partitions by utilizing lineage information to rebuild lost partitions.
  2. Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel processing and data locality optimization.
  3. Immutable: RDDs are read-only and cannot be modified once created. However, new RDDs can be derived from existing ones through transformations.
  4. Lazily Evaluated: RDDs support lazy evaluation, meaning that transformations on RDDs are not executed immediately but are computed only when an action requires the result.

RDDs provide a programming interface for performing operations such as transformations (e.g., map, filter, reduceByKey) and actions (e.g., count, collect, reduce, save). These operations enable developers to perform distributed data processing tasks in a concise and scalable manner.

It’s worth noting that while RDDs were the primary data abstraction in earlier versions of Apache Spark, newer versions introduced higher-level APIs like DataFrames and Datasets that provide optimizations and a more structured approach to working with data.
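
A minimal PySpark sketch of the RDD behaviour described above: transformations are lazy and only actions trigger execution. The numbers and partition count are arbitrary.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
  sc = spark.sparkContext

  # Build a partitioned RDD and chain lazy transformations.
  numbers = sc.parallelize(range(1, 1001), numSlices=8)  # distributed across partitions
  evens = numbers.filter(lambda n: n % 2 == 0)           # transformation (lazy)
  squares = evens.map(lambda n: n * n)                    # transformation (lazy)

  # Actions trigger execution; lineage lets Spark rebuild lost partitions on failure.
  print(squares.count(), squares.reduce(lambda a, b: a + b))

  spark.stop()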

43
Q

What is Spark Streaming?

A

Key points about Spark Streaming:

  1. Spark Streaming is a component of Apache Spark designed for real-time processing and analysis of streaming data.
  2. It operates on micro-batches, allowing developers to use batch processing operations on continuous data streams.
  3. Spark Streaming provides fault tolerance and scalability, making it suitable for handling large-scale streaming data.
  4. It integrates seamlessly with other components of Apache Spark, such as Spark SQL, MLlib, and GraphX, enabling unified processing of both batch and streaming data.
  5. It supports windowed operations, allowing data to be processed over specific time intervals or sliding windows.
  6. Spark Streaming offers exactly-once processing semantics, ensuring that each data record is processed only once, even in the presence of failures.
  7. It provides connectors for various external systems, including Kafka, Flume, and Amazon Kinesis, enabling easy integration with different data sources and sinks.
  8. Spark Streaming is widely used for real-time data processing applications, such as real-time analytics, fraud detection, log analysis, and monitoring.
  9. It offers a high-level API in Scala, Java, Python, and R, making it accessible to developers with different programming backgrounds.
  10. With its speed, scalability, and flexibility, Spark Streaming has become a popular choice for building real-time streaming applications in the big data ecosystem.
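
A minimal sketch using the newer Structured Streaming API, with the built-in rate source standing in for Kafka or Kinesis; the window size and row rate are arbitrary assumptions.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import window

  spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

  # The rate source generates test rows; in practice this would be a Kafka or
  # Kinesis source configured through the appropriate connector on the cluster.
  events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

  # Windowed aggregation over event timestamps, processed as micro-batches.
  counts = events.groupBy(window(events.timestamp, "1 minute")).count()

  query = counts.writeStream.outputMode("complete").format("console").start()
  query.awaitTermination()  # runs until stopped
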
44
Q

Spark SQL?

A

The key points about Spark SQL:

  1. Spark SQL is a component of Apache Spark that provides a programming interface for working with structured and semi-structured data.
  2. It allows developers to query structured data using SQL syntax and leverage the power of Spark’s distributed processing capabilities.
  3. Spark SQL supports various data sources, including Hive, Avro, Parquet, JSON, and JDBC, enabling seamless integration with existing data systems.
  4. It provides a DataFrame API, which is an abstraction over distributed collections of data, allowing developers to manipulate structured data using familiar SQL-like operations.
  5. Spark SQL supports both batch and streaming data processing, enabling unified processing of structured data from different sources.
  6. It optimizes queries using a cost-based optimizer, leveraging techniques like predicate pushdown, column pruning, and join reordering to improve query performance.
  7. It supports complex data types, user-defined functions (UDFs), and custom aggregations, allowing developers to handle complex data transformations and analytics.
  8. Spark SQL can be seamlessly integrated with other Spark components, such as Spark MLlib for machine learning and Spark Streaming for real-time data processing.
  9. It provides interoperability with popular data analysis tools and libraries, such as Apache Hive, Apache Kafka, and Apache Parquet.
  10. Spark SQL is widely used in various applications, including data exploration, ad-hoc querying, data integration, and ETL (Extract, Transform, Load) processes, offering a unified platform for data processing and analytics in the Apache Spark ecosystem.
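
A minimal Spark SQL sketch showing the same aggregation expressed as a SQL query and via the DataFrame API; the Parquet path and column names are placeholder assumptions.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

  # Load structured data into a DataFrame (path is a placeholder).
  sales = spark.read.parquet("s3://my-bucket/sales/")

  # Query with familiar SQL syntax by registering a temporary view...
  sales.createOrReplaceTempView("sales")
  spark.sql("""
      SELECT region, SUM(amount) AS total
      FROM sales
      GROUP BY region
      ORDER BY total DESC
      LIMIT 5
  """).show()

  # ...or express the same thing with DataFrame operations.
  (sales.groupBy("region").sum("amount")
        .withColumnRenamed("sum(amount)", "total")
        .orderBy("total", ascending=False)
        .show(5))

  spark.stop()
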
45
Q

What is Hive?

A
  • Hive is a tool that allows executing SQL-like queries on unstructured data stored in HDFS or in S3 (in the case of EMR), sitting on top of Hadoop YARN.
  • Hive uses MapReduce or Tez as an underlying engine to distribute the processing of SQL queries on the data.
  • Tez is an alternative to MapReduce and provides faster processing using in-memory directed acyclic graphs.
  • Hive provides a familiar SQL syntax, called HiveQL, and an interactive interface for querying data.
  • It is scalable and suitable for data warehouse and OLAP applications.
  • Hive is not as fast as technologies like Apache Spark, but it is easier to use for simple OLAP queries.
  • Hive is optimized and extensible, supporting user-defined functions and providing interfaces like Thrift server, JDBC, and ODBC drivers.
  • It can be accessed by external applications for analytics or web services.
  • However, Hive is not designed for Online Transaction Processing (OLTP) and should not be used for high-frequency, real-time queries.
46
Q

Exam Question

What is the Hive Metastore?

A

The Hive Metastore is what imposes structure on otherwise unstructured data. It stores information about the columns, data types, and other details that define the structure of the data, and it acts as a reference point for querying the underlying data, such as CSV files, as if it were a SQL table. For example, a structured table can be created in Hive over raw ratings CSV data by specifying column names, data types, the data format, and the data's location. The Hive Metastore therefore plays a crucial role in organizing and accessing the structured view of the underlying data.
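
A minimal sketch of the kind of table definition the card describes, issued here through PySpark with Hive support enabled rather than the Hive shell; the table name, columns, and S3 location are placeholder assumptions.

  from pyspark.sql import SparkSession

  # enableHiveSupport() makes Spark use the Hive Metastore
  # (on EMR this can be the Glue Data Catalog or an external RDS metastore).
  spark = SparkSession.builder.appName("ratings-ddl").enableHiveSupport().getOrCreate()

  # Only the schema goes into the metastore; the raw CSV files stay in S3.
  spark.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS ratings (
          user_id INT,
          movie_id INT,
          rating INT,
          rated_at BIGINT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION 's3://my-bucket/ratings/'
  """)

  # With the structure registered, the CSV data can be queried like a SQL table.
  spark.sql("SELECT rating, COUNT(*) AS n FROM ratings GROUP BY rating").show()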

47
Q

By default, where is the Hive Metastore stored?

A

In a MySQL database on the master node of the cluster.

48
Q

Does Hive allow the use of an external metastore?

A

Yes.

49
Q

Where can an external metastore be hosted?

A

Outside the cluster or on another node within it.

50
Q

What can serve as a Hive Metastore?

A

AWS Glue Data Catalog.

51
Q

What does the AWS Glue Data Catalog provide?

A

Centralized metadata for unstructured data.

52
Q

What other AWS services can utilize the AWS Glue Data Catalog as a Hive Metastore?

A

Amazon EMR, Redshift, and Athena.

53
Q

How can the Hive Metastore be stored externally for persistence?

A

In an external Amazon RDS instance.

54
Q

What is the benefit of storing the Hive Metastore in an external RDS instance?

A

Ensuring persistence even if the cluster is shut down.

55
Q

Hive integrated with which AWS services?

A
  1. Hive integrates with AWS in several ways. It can be used with Amazon S3 to automatically load table partitions from different subdirectories in S3.
  2. Hive on Amazon EMR allows for specifying an off-instance metadata store and writing data directly to S3 without temporary files. It also supports referencing resources such as scripts and libraries stored in S3.
  3. Hive on EMR can integrate with Amazon DynamoDB by defining an external Hive table based on a DynamoDB table, enabling analysis of DynamoDB data and data movement between DynamoDB, EMRFS, and S3. Additionally, Hive on EMR supports join operations across DynamoDB tables.
56
Q

Apache Pig

A

Apache Pig is an important component of the Hadoop ecosystem, included in Amazon EMR. It offers an alternative interface to MapReduce, addressing the complexity of writing code for mappers and reducers.

57
Q

What is Apache Pig Latin?

A

Pig Latin is a scripting language introduced by Apache Pig. It allows users to define map and reduce steps using SQL-style syntax, simplifying development compared to writing Java code directly.

58
Q

Extensibility of Apache Pig

A

Apache Pig is highly extensible with user-defined functions, enabling users to expand on its functionalities by writing custom code.

59
Q

Apache Pig Integration

A

Pig operates on top of MapReduce or Tez, which sit on top of YARN and HDFS/EMRFS. It shares similarities with Hive in terms of its architecture and integration within the Hadoop ecosystem.

60
Q

Apache Pig Relevance

A

Although Pig is considered an older technology, it is still relevant and may appear in exams. Understanding its purpose, features, and syntax is important for comprehensive knowledge of the Hadoop ecosystem.

61
Q

Pig and AWS Integration

A

Pig and AWS have integration capabilities that enhance the functionality of Pig on EMR. Pig can work with data on both HDFS and S3 through EMRFS, similar to Hive. It can load external JARs and scripts from S3. However, the integration between Pig and AWS is limited to these features, and the core functionality of Pig remains unchanged.

62
Q

What is HBase?

A

HBase is a non-relational database designed for petabyte-scale data within the Hadoop ecosystem. It operates on distributed data across a Hadoop cluster and is based on Google’s BigTable technology. HBase treats unstructured data as a NoSQL database, allowing fast queries due to its in-memory operation. It integrates with Hive, enabling SQL-style commands to be issued on data exposed through HBase. The combination of HBase’s distributed nature and integration with Hive makes it a powerful tool for managing and querying large-scale data within the Hadoop ecosystem.

63
Q

HBase Features

A

HBase treats unstructured data as a NoSQL database, allowing fast queries due to its in-memory operation. It integrates with Hive, enabling SQL-style commands on data exposed through HBase.

64
Q

HBase Benefits

A

The combination of HBase’s distributed nature and integration with Hive makes it a powerful tool for managing and querying large-scale data within the Hadoop ecosystem.

65
Q

HBase vs DynamoDB

A

HBase and DynamoDB are both NoSQL databases designed for similar use cases. However, when choosing between the two for use with EMR and storing data on an EMR cluster, DynamoDB offers some advantages. It is fully managed and scales automatically, separate from the EMR cluster, providing a serverless solution. DynamoDB also has better integration with other AWS services and AWS Glue. On the other hand, HBase may be a better choice if there is a possibility of moving to a non-AWS Hadoop cluster in the future or if dealing with sparse data or high-frequency counters. HBase offers consistent reads, better performance for writes and updates, and integration with Hadoop services. Ultimately, the choice between HBase and DynamoDB depends on the specific ecosystem and integration requirements, with DynamoDB being well-suited for AWS integration and HBase offering more compatibility with Hadoop.

66
Q

HBase Advantages

A

HBase offers consistent reads, better write/update performance, and integration with Hadoop services. It is suitable for non-AWS Hadoop clusters, sparse data, and high-frequency counters.

67
Q

DynamoDB Advantages

A

DynamoDB is fully managed, scales automatically, and provides a serverless solution. It has better integration with AWS services and AWS Glue. It is well-suited for AWS integration.

68
Q

What is the difference between HBase and MapReduce?

A

MapReduce and HBase are both components of the Hadoop ecosystem, but they serve different purposes and have distinct characteristics:

  1. Purpose:
    • MapReduce: MapReduce is a programming model and software framework designed for processing and analyzing large datasets in a distributed manner. It focuses on data processing tasks such as filtering, sorting, and aggregating data.
    • HBase: HBase, on the other hand, is a distributed, scalable, and non-relational database that is built on top of Hadoop. It is designed for storing and managing structured and semi-structured data in a fault-tolerant and highly available manner.
  2. Data Storage:
    • MapReduce: MapReduce does not provide its own storage system. It processes data stored in a distributed file system, such as Hadoop Distributed File System (HDFS), by dividing the data into smaller chunks and processing them in parallel.
    • HBase: HBase stores data in a distributed manner directly in its own storage system, which is based on the concept of BigTable. It organizes data into tables with rows and columns and allows random access to data using a key-value model.
  3. Data Processing:
    • MapReduce: MapReduce processes data by splitting it into smaller chunks, which are then processed in parallel across a cluster of nodes. It follows a two-step process: the map phase and the reduce phase. The map phase applies a function to each data item, and the reduce phase aggregates the results of the map phase to produce the final output.
    • HBase: HBase provides random read and write access to data, allowing fast and efficient retrieval and modification of individual records. It supports real-time data processing and enables high-speed queries by leveraging its in-memory capabilities.
  4. Use Cases:
    • MapReduce: MapReduce is suitable for batch processing and analyzing large volumes of data where data processing can be divided into map and reduce tasks. It is commonly used for tasks such as log analysis, data aggregation, and ETL (Extract, Transform, Load) operations.
    • HBase: HBase is well-suited for applications that require low-latency random access to large amounts of structured data, such as time series data, sensor data, or user profiles. It is often used for use cases involving real-time data processing, real-time analytics, and serving as a distributed database for web applications.

In summary, MapReduce is a distributed data processing framework, while HBase is a distributed database designed for structured data storage and retrieval. MapReduce focuses on batch processing and analysis, whereas HBase provides real-time, random access to large-scale structured data.

69
Q

What is Presto?

A

Presto is a technology pre-installed on Amazon EMR that enables connection to various big data databases and data stores simultaneously. It allows SQL-style queries across multiple databases and supports SQL join commands to combine data from different technologies within a cluster. Presto offers interactive queries at a petabyte scale, has a familiar SQL syntax, and is optimized for OLAP applications.

70
Q

Who developed Presto?

A

Presto was initially developed by Facebook and is partially maintained by them as an open-source project. It provides high-performance querying capabilities for analyzing massive data sets stored in different databases within an ecosystem.

71
Q

How is Presto related to Amazon Athena?

A

Amazon Athena is a serverless version of Presto that utilizes the same technology. It provides JDBC, command line, and Tableau interfaces for accessing and analyzing data from various sources.

72
Q

Presto Connectors?

A

Presto supports connectors for multiple data sources, including HDFS, S3, Cassandra, MongoDB, HBase, Redshift, and Teradata. It allows users to unify data from disparate sources and perform queries across the entire cluster.

73
Q

How is the performance of Presto?

A

Presto is known for its high performance, processing data in-memory and minimizing unnecessary IO overhead. It is suitable for efficient interactive querying of massive data sets but not for OLTP or batch processing.

74
Q

What is Zeppelin?

A

Zeppelin, which comes pre-installed on Amazon EMR, is an interactive notebook on your cluster that allows you to run Python scripts and code against your data. It supports iPython notebook-like functionality, where you can write code blocks and intersperse them with comments and notes. Zeppelin integrates with Apache Spark, JDBC, HBase, Elasticsearch, and more, allowing you to kick off various tasks from the notebook. It enables interactive Spark code execution, speeding up development cycles and facilitating experimentation. Zeppelin also provides visualization capabilities for charts and graphs, making it easier to analyze and interpret results. Additionally, it supports Spark SQL for issuing SQL queries directly against the data. Zeppelin makes Spark more accessible as a data science tool rather than just a programming environment.

75
Q

Similarity between Zeppelin and EMR Notebooks?

A

Amazon EMR offers a similar concept called EMR Notebook, which includes AWS integration and features such as automatic backup to S3 and the ability to provision and manage clusters from the notebook. EMR Notebooks are hosted in a VPC for security and come with graphical libraries from the Anaconda repository for prototyping and exploratory analysis. They can be attached to existing clusters or used to create new clusters. EMR Notebooks are provided at no additional charge to Amazon EMR customers, offering value to Hadoop clusters running on EMR.

76
Q

What is Hue?

A

Hue, short for Hadoop User Experience, is the front-end interface and management console for an Amazon EMR cluster. It serves as a centralized tool for managing the cluster, including spinning up services, monitoring operational insights, and facilitating data movement between HDFS, EMRFS, and S3. Hue can integrate with IAM to ensure appropriate access control for users. While using Hue, it’s important to remember that it primarily functions as a management and monitoring tool for the EMR cluster, providing a front-end console for cluster operations.

77
Q

What is Splunk?

A
  1. Splunk is an operational tool used for monitoring and gaining insights into your Amazon EMR cluster.
  2. It continuously collects and indexes data to provide real-time information about the performance and activities of your cluster.
  3. Splunk can be deployed on EMR or set up as a separate cluster.
  4. Amazon offers public AMIs with Splunk Enterprise for easy deployment and monitoring of your EMR cluster.
  5. Splunk helps visualize and analyze data from EMR and S3 within your cluster.
  6. While Splunk may be mentioned in a list of technologies in an exam question, remember that its main purpose is to provide operational insights; don't let it distract you in the context of a question.
78
Q

What is Flume?

A
  1. Flume is a distributed and reliable service used for streaming data into your cluster, similar to Kinesis or Kafka.
  2. It is designed specifically for efficiently collecting, aggregating, and moving large amounts of log data.
  3. Flume operates based on the concept of sources, channels, and sinks.
  4. A source (such as a web server) provides events to Flume, which are then stored in one or more channels. Channels act as passive stores for events until they are consumed by a Flume sink.
  5. Sinks remove events from channels and place them in external repositories like HDFS or Hive.
  6. Examples of sinks include HDFS sink for writing events to HDFS and Hive sink for streaming events to Hive tables.
  7. Flume is used to stream log data from external sources into various destinations, such as HDFS, Hive, or HBase.
  8. Understanding Flume’s purpose as a log data streaming tool is important, and it may be presented as an alternative technology for streaming applications in an EMR cluster.
79
Q

What is MXNet?

A
  1. MXNet is an alternative to TensorFlow and a library for building and accelerating neural networks.
  2. It is included in EMR and is considered the preferred framework for deep learning on EMR.
  3. For the purpose of the exam, it is not necessary to design neural networks.
  4. MXNet is a framework used for building distributed deep learning applications on an entire EMR cluster.