Big Data Refresher Flashcards
What is Spark?
Spark is an open-source parallel processing framework for running large-scale data analytics applications across clustered computers.
What is Hadoop?
Hadoop is an open-source framework that utilizes a network of clustered computers to store and process large datasets.
What is Hive?
Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL, built on top of Apache Hadoop.
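A minimal sketch of querying a Hive table from Spark with plain SQL; it assumes a Hive metastore is configured, and the table name `sales` and its columns are hypothetical:

```python
from pyspark.sql import SparkSession

# Enable Hive support so Spark can read tables registered in the Hive metastore
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

# 'sales' is a hypothetical table assumed to exist in the metastore
df = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
df.show()
```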
What are the core components of Spark?
Spark Core, Spark SQL, Spark Streaming, MLlib, GraphX, and SparkR.
What are the core components of Hadoop?
HDFS, YARN, and MapReduce.
What is HDFS?
HDFS stands for Hadoop Distributed File System, and it is the storage component of Hadoop. It is responsible for storing large datasets of structured and unstructured data across the nodes of a cluster. It consists of two core components: the NameNode and the DataNode. The NameNode is the primary or master node and holds the metadata about the data. The DataNodes are where the actual data is stored; they read, write, process, and replicate the data blocks.
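As a rough illustration, a Spark job reads a file from HDFS by addressing the NameNode; the host, port, and file path below are placeholders, not real values:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-read").getOrCreate()

# The NameNode resolves this path's metadata; the DataNodes serve the actual blocks.
# Host, port, and file path are hypothetical, for illustration only.
lines = spark.read.text("hdfs://namenode-host:8020/data/events/part-0000.txt")
print(lines.count())
```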
What is YARN?
YARN stands for Yet Another Resource Negotiator. It is the resource management component of Hadoop. YARN consists of three components: the ResourceManager, the NodeManager, and the ApplicationMaster. The ResourceManager is in charge of allocating resources to all the applications in the system. The NodeManager is responsible for containers and for monitoring their resource usage such as CPU, memory, and disk. The ApplicationMaster works as an interface between the ResourceManager and the NodeManagers and negotiates resources as required by the two.
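A hedged sketch of running Spark on a YARN cluster from PySpark; the executor counts, memory, and core values are illustrative assumptions (in practice the application is usually launched with spark-submit against a configured Hadoop cluster):

```python
from pyspark.sql import SparkSession

# Ask YARN's ResourceManager for containers with the requested resources;
# the NodeManagers launch and monitor the executor containers.
spark = (SparkSession.builder
         .appName("yarn-example")
         .master("yarn")                              # run against a YARN cluster
         .config("spark.executor.instances", "4")     # illustrative values
         .config("spark.executor.memory", "2g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
```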
What is MapReduce?
MapReduce is the processing component of Hadoop. MapReduce makes use of two functions, map() and reduce(). Map() performs sorting and filtering of the data and organizes it into groups, generating key-value pairs that are later processed by reduce().
Reduce() does the summarization by aggregating the mapped data. In simple terms, reduce() takes the output generated by map() as input and combines those tuples into a smaller set of tuples.
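The classic word-count example sketches the two phases: a map step emits key-value pairs and a reduce step aggregates them per key. It is written here with Spark's RDD API rather than Hadoop MapReduce itself, and the input path is an assumption:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count").getOrCreate()
sc = spark.sparkContext

# Map phase: emit a (word, 1) pair for every word in the (hypothetical) input file
pairs = (sc.textFile("hdfs:///data/input.txt")
           .flatMap(lambda line: line.split())
           .map(lambda word: (word, 1)))

# Reduce phase: sum the counts for each key (word)
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.take(10))
```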
What are the characteristics of HDFS?
Fault tolerant - The Hadoop framework divides data into blocks and then creates multiple copies of each block on different machines in the cluster.
Scalable - Whenever requirements grow, you can scale the cluster. Two scalability mechanisms are available in HDFS: vertical and horizontal scaling.
High availability - In unfavorable situations such as the failure of a node, a user can still access their data from other nodes, because duplicate copies of the blocks are present on other nodes in the HDFS cluster.
How is Apache Spark different from MapReduce?
- Spark processes data both in (near) real time and in batches, whereas MapReduce only does batch processing.
- Spark can be up to 100 times faster than MapReduce for in-memory workloads.
- Spark keeps intermediate data in RAM, whereas MapReduce writes it to disk (see the caching sketch below).
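A minimal illustration of keeping a dataset in memory with cache() so repeated actions reuse it instead of recomputing it; the data is synthetic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Synthetic data for illustration
nums = spark.range(1_000_000)

# cache() keeps the dataset in executor memory after it is first computed,
# so later actions reuse it instead of recomputing it
nums.cache()
print(nums.count())                        # first action materializes the cache
print(nums.filter("id % 2 = 0").count())   # reuses the cached data
```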
How does Spark run its applications with the help of its architecture?
Spark applications run as independent sets of processes coordinated by the SparkSession object in the driver program. The cluster manager (or resource manager) assigns tasks to the worker nodes, with one task per partition. Iterative algorithms apply operations to the data repeatedly, so they benefit from caching datasets across iterations. A task applies its unit of work to the dataset in its partition and outputs a new partitioned dataset. Finally, the results are sent back to the driver application or can be saved to disk.
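A hedged sketch of that flow: the driver creates the SparkSession, the work is split into one task per partition, and the action brings the result back to the driver. The partition count and data are arbitrary choices for illustration:

```python
from pyspark.sql import SparkSession

# The driver program creates the SparkSession that coordinates the job
spark = SparkSession.builder.appName("architecture-demo").getOrCreate()
sc = spark.sparkContext

# 8 partitions -> the cluster manager schedules 8 tasks across the workers
rdd = sc.parallelize(range(1000), numSlices=8)

# Each task applies the function to the elements of its own partition in parallel
squared = rdd.map(lambda x: x * x)

# The action sends the aggregated result back to the driver
print(squared.sum())
```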
What is an RDD?
An RDD (Resilient Distributed Dataset) is an immutable distributed collection of the elements of your data, partitioned across the nodes in your cluster so that it can be operated on in parallel.
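A small sketch of those properties: transformations never modify the original RDD, they return a new one, and each partition is processed in parallel. The data and partition count are arbitrary:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD split across 4 partitions
rdd = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=4)

# RDDs are immutable: filter() returns a brand-new RDD; the original is untouched
evens = rdd.filter(lambda x: x % 2 == 0)

print(rdd.count())       # 6
print(evens.collect())   # [2, 4, 6]
```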
What is lazy evaluation in Spark?
When Spark operates on any dataset, it remembers the instructions rather than executing them right away. For example, when a transformation is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which helps optimize the overall data processing workflow.
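An illustrative sketch: the transformation only records the plan, and nothing runs until an action such as sum() or count() is called:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))

# Transformation: nothing is executed yet, Spark only records the lineage
doubled = rdd.map(lambda x: x * 2)

# Action: triggers execution of the recorded transformations
print(doubled.sum())   # 90
```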
What is a Parquet file and what are its advantages?
Parquet is a columnar storage file format used to store large datasets efficiently. Some of its advantages are that it lets you fetch only the specific columns you need, consumes less space, uses type-specific encoding, and reduces the amount of I/O required.
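A sketch of writing and reading Parquet with column pruning; the output path and the sample rows are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 12.5)],
    ["id", "name", "amount"])

# Columnar storage: each column is stored (and encoded) separately
df.write.mode("overwrite").parquet("/tmp/example.parquet")

# Reading back only the columns we need avoids scanning the rest of the file
spark.read.parquet("/tmp/example.parquet").select("id", "amount").show()
```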
What is Shuffling in Spark?
Shuffling is the process of redistributing data across partitions. It is typically triggered by wide transformations such as joins and key-based aggregations, and it is expensive because it involves network and disk I/O.
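A minimal example of an operation that forces a shuffle: reduceByKey must bring rows with the same key together, even when they start out in different partitions. The data is synthetic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], numSlices=4)

# reduceByKey is a wide transformation: rows with the same key may live in
# different partitions, so Spark shuffles data across the network to group them
totals = pairs.reduceByKey(lambda x, y: x + y)
print(totals.collect())   # e.g. [('a', 4), ('b', 6)]
```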