3. Big Data Technologies Flashcards
What is the Hadoop ecosystem, and how does it work?
The Hadoop ecosystem is a framework for the distributed storage and processing of large data sets across clusters of commodity hardware using simple programming models. Its core pieces are HDFS for storage, YARN for resource management, and MapReduce for computation; tools such as Hive, HBase, and Pig build on top of them.
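As a sketch of the MapReduce model the ecosystem is built around, here is the classic word count written for Hadoop Streaming; the file names mapper.py and reducer.py are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emit a (word, 1) pair per word; Hadoop Streaming feeds
# input splits to this script on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum counts per word; the framework sorts mapper output
# by key, so identical words arrive as contiguous runs.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```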
How does Apache Spark differ from MapReduce?
Apache Spark is typically much faster than MapReduce because it keeps intermediate results in memory across stages, whereas MapReduce writes intermediate results to disk between every map and reduce phase.
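A minimal PySpark sketch of why in-memory reuse matters (the input path is illustrative): the filtered data is computed once, cached, and then served from memory for both actions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Illustrative input path -- adjust to your environment.
logs = spark.read.text("hdfs:///data/logs")

# cache() keeps the filtered rows in executor memory after the first action.
errors = logs.filter(logs.value.contains("ERROR")).cache()

# Both actions reuse the cached result; an equivalent two-job MapReduce
# flow would re-read the raw data from disk for each one.
print(errors.count())
print(errors.filter(errors.value.contains("timeout")).count())
```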
Explain the working of Spark’s DAG (Directed Acyclic Graph).
Spark’s DAG records the lineage of transformations to be applied to the data. The DAG scheduler splits it into stages at shuffle boundaries, pipelining narrow transformations within a stage for efficiency, and lost partitions can be recomputed from the lineage, which provides fault tolerance.
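A small PySpark illustration: the transformations below only build the DAG, toDebugString prints the recorded lineage, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))
result = (rdd.map(lambda x: x * 2)              # narrow: pipelined in one stage
             .filter(lambda x: x % 3 == 0)      # narrow: same stage as the map
             .map(lambda x: (x % 10, x))
             .reduceByKey(lambda a, b: a + b))  # wide: forces a stage boundary

# Nothing has executed yet; toDebugString shows the lineage (the DAG).
print(result.toDebugString().decode())

result.collect()  # the action triggers actual execution
```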
What is HDFS, and how does it achieve fault tolerance?
HDFS (Hadoop Distributed File System) stores large files as fixed-size blocks spread across multiple machines. It achieves fault tolerance by replicating each block on several DataNodes (three by default), so losing a node does not lose data.
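For illustration, replication can be inspected and set per file; this sketch assumes the third-party hdfs Python package (pip install hdfs) talking to WebHDFS, and the NameNode address and path are illustrative.

```python
from hdfs import InsecureClient

# Illustrative NameNode WebHDFS endpoint and user.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file, then raise its replication factor so the cluster can
# lose up to two DataNodes holding this file's blocks without data loss.
client.write("/data/events.csv", data=b"id,value\n1,42\n", overwrite=True)
client.set_replication("/data/events.csv", replication=3)

# WebHDFS FileStatus reports the current replication factor.
print(client.status("/data/events.csv")["replication"])
```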
How do you optimize Spark jobs for performance?
Spark jobs can be optimized by using efficient columnar formats such as Parquet, caching data that is reused, minimizing shuffles (for example, broadcasting small tables in joins), and tuning configurations such as shuffle parallelism, executor memory, and cores.
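A sketch combining several of these techniques in PySpark; the input paths and the partition count are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("tuning-demo")
         # Match shuffle parallelism to the data instead of the default 200.
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())

# Illustrative paths -- Parquet is columnar, so only needed columns are read.
orders = spark.read.parquet("hdfs:///data/orders")
countries = spark.read.parquet("hdfs:///data/countries")  # small dimension table

orders.cache()  # reused across several queries, so keep it in memory

# Broadcasting the small table avoids shuffling the large one for the join.
joined = orders.join(F.broadcast(countries), "country_id")
joined.groupBy("country_id").agg(F.sum("amount")).show()
```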
Explain the difference between RDD, DataFrame, and Dataset in Spark.
RDD (Resilient Distributed Dataset) is the low-level abstraction over raw distributed objects; DataFrame is a higher-level abstraction for structured data with named columns, optimized by the Catalyst engine; and Dataset (available in Scala and Java only) combines the benefits of both by adding compile-time type safety to the DataFrame API.
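A quick PySpark contrast of the two abstractions Python exposes (Dataset exists only on the JVM side):

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()
sc = spark.sparkContext

# RDD: a low-level collection of opaque Python objects; Spark cannot
# inspect or optimize the lambdas applied to it.
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
print(rdd.map(lambda t: t[1]).sum())

# DataFrame: named columns and a schema, so Catalyst can optimize queries.
df = spark.createDataFrame([Row(name="alice", age=34), Row(name="bob", age=29)])
df.filter(df.age > 30).show()

# Dataset (Scala/Java only) would add compile-time type checking on top of
# the DataFrame API; PySpark offers only RDDs and DataFrames.
```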
How does partitioning work in Spark?
Partitioning in Spark divides the data into chunks that executors process in parallel. Too few partitions underuse the cluster, too many add scheduling overhead, and partitioning by the right key keeps related records together and can avoid extra shuffles.
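A short PySpark sketch of controlling the partition count and key (the counts chosen here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(1_000_000)  # a single numeric column named "id"
print(df.rdd.getNumPartitions())

# Repartition by a key so rows with the same key land in the same partition;
# a later aggregation on that key then avoids a second shuffle.
by_key = df.withColumn("bucket", df.id % 16).repartition(16, "bucket")
print(by_key.rdd.getNumPartitions())

# coalesce() reduces the partition count without a full shuffle, e.g.
# before writing out a small result.
small = by_key.coalesce(4)
```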
What is the role of Apache Kafka in a data pipeline?
Apache Kafka serves as a distributed, durable publish-subscribe log that decouples the stages of a data pipeline: producers append events to topics in real time, and consumers read them independently at their own pace.
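A minimal sketch using the third-party kafka-python package (pip install kafka-python); the broker address and topic name are illustrative.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: append an event to the "page-views" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "url": "/home"}')  # async send
producer.flush()  # block until the broker acknowledges the message

# Consumer side (typically a separate process): read the same stream
# independently, starting from the earliest retained offset.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```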
How does data shuffling impact Spark performance?
Data shuffling can significantly slow down Spark jobs: wide transformations such as joins, groupBy, and repartition redistribute data across partitions, which means serialization, disk spills, and network I/O between executors.
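A classic PySpark illustration of reducing shuffle volume: reduceByKey aggregates within each partition before the shuffle, while groupByKey ships every record across the network first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 100_000)

# groupByKey moves every (key, value) record across the shuffle boundary
# before any aggregation happens.
counts_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first (map-side
# combine), so far less data crosses the network.
counts_fast = pairs.reduceByKey(lambda a, b: a + b)

print(counts_slow.collect())
print(counts_fast.collect())  # same result, much less shuffled data
```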
What are the advantages of using Parquet or ORC formats for big data?
Parquet and ORC are columnar formats with efficient compression and encoding schemes. Because queries can read only the columns they need (column pruning) and skip row groups via predicate pushdown, they reduce both storage costs and query times.
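A PySpark sketch of writing and reading both formats; the output paths and codec choices are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

# Illustrative paths and compression codecs.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/demo.parquet")
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/demo.orc")

# Column pruning and predicate pushdown: only the "id" column is read,
# and row groups that cannot match the filter are skipped.
spark.read.parquet("/tmp/demo.parquet").filter(F.col("id") < 10).select("id").show()
```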