3. Big Data Technologies Flashcards
What is the Hadoop ecosystem, and how does it work?
The Hadoop ecosystem is a framework for the distributed storage and processing of large data sets across clusters of commodity hardware using simple programming models. Its core pieces are HDFS for storage, YARN for resource management, and MapReduce for computation; tools such as Hive, HBase, and Pig build on top of them.
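As a sketch of the MapReduce model the ecosystem is built around, here is the classic word count written for Hadoop Streaming; the file names mapper.py and reducer.py are illustrative.

```python
#!/usr/bin/env python3
# mapper.py -- emit a (word, 1) pair per word; Hadoop Streaming feeds
# input splits to this script on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sum counts per word; the framework sorts mapper output
# by key, so identical words arrive as contiguous runs.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```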
How does Apache Spark differ from MapReduce?
Apache Spark is typically much faster than MapReduce because it keeps intermediate results in memory across stages, whereas MapReduce writes intermediate results to disk between every map and reduce phase.
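A minimal PySpark sketch of why in-memory reuse matters (the input path is illustrative): the filtered data is computed once, cached, and then served from memory for both actions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# Illustrative input path -- adjust to your environment.
logs = spark.read.text("hdfs:///data/logs")

# cache() keeps the filtered rows in executor memory after the first action.
errors = logs.filter(logs.value.contains("ERROR")).cache()

# Both actions reuse the cached result; an equivalent two-job MapReduce
# flow would re-read the raw data from disk for each one.
print(errors.count())
print(errors.filter(errors.value.contains("timeout")).count())
```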
Explain the working of Spark’s DAG (Directed Acyclic Graph).
Spark’s DAG records the lineage of transformations to be applied to the data. The DAG scheduler splits it into stages at shuffle boundaries, pipelining narrow transformations within a stage for efficiency, and lost partitions can be recomputed from the lineage, which provides fault tolerance.
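A small PySpark illustration: the transformations below only build the DAG, toDebugString prints the recorded lineage, and nothing actually runs until an action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dag-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000))
result = (rdd.map(lambda x: x * 2)              # narrow: pipelined in one stage
             .filter(lambda x: x % 3 == 0)      # narrow: same stage as the map
             .map(lambda x: (x % 10, x))
             .reduceByKey(lambda a, b: a + b))  # wide: forces a stage boundary

# Nothing has executed yet; toDebugString shows the lineage (the DAG).
print(result.toDebugString().decode())

result.collect()  # the action triggers actual execution
```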
What is HDFS, and how does it achieve fault tolerance?
HDFS (Hadoop Distributed File System) stores large files as fixed-size blocks spread across multiple machines. It achieves fault tolerance by replicating each block on several DataNodes (three by default), so losing a node does not lose data.
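For illustration, replication can be inspected and set per file; this sketch assumes the third-party hdfs Python package (pip install hdfs) talking to WebHDFS, and the NameNode address and path are illustrative.

```python
from hdfs import InsecureClient

# Illustrative NameNode WebHDFS endpoint and user.
client = InsecureClient("http://namenode:9870", user="hadoop")

# Write a small file, then raise its replication factor so the cluster can
# lose up to two DataNodes holding this file's blocks without data loss.
client.write("/data/events.csv", data=b"id,value\n1,42\n", overwrite=True)
client.set_replication("/data/events.csv", replication=3)

# WebHDFS FileStatus reports the current replication factor.
print(client.status("/data/events.csv")["replication"])
```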
How do you optimize Spark jobs for performance?
Spark jobs can be optimized by using efficient columnar formats such as Parquet, caching data that is reused, minimizing shuffles (for example, broadcasting small tables in joins), and tuning configurations such as shuffle parallelism, executor memory, and cores.
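A sketch combining several of these techniques in PySpark; the input paths and the partition count are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("tuning-demo")
         # Match shuffle parallelism to the data instead of the default 200.
         .config("spark.sql.shuffle.partitions", "64")
         .getOrCreate())

# Illustrative paths -- Parquet is columnar, so only needed columns are read.
orders = spark.read.parquet("hdfs:///data/orders")
countries = spark.read.parquet("hdfs:///data/countries")  # small dimension table

orders.cache()  # reused across several queries, so keep it in memory

# Broadcasting the small table avoids shuffling the large one for the join.
joined = orders.join(F.broadcast(countries), "country_id")
joined.groupBy("country_id").agg(F.sum("amount")).show()
```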
Explain the difference between RDD, DataFrame, and Dataset in Spark.
RDD (Resilient Distributed Dataset) is the low-level abstraction over raw distributed objects; DataFrame is a higher-level abstraction for structured data with named columns, optimized by the Catalyst engine; and Dataset (available in Scala and Java only) combines the benefits of both by adding compile-time type safety to the DataFrame API.
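A quick PySpark contrast of the two abstractions Python exposes (Dataset exists only on the JVM side):

```python
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("abstractions-demo").getOrCreate()
sc = spark.sparkContext

# RDD: a low-level collection of opaque Python objects; Spark cannot
# inspect or optimize the lambdas applied to it.
rdd = sc.parallelize([("alice", 34), ("bob", 29)])
print(rdd.map(lambda t: t[1]).sum())

# DataFrame: named columns and a schema, so Catalyst can optimize queries.
df = spark.createDataFrame([Row(name="alice", age=34), Row(name="bob", age=29)])
df.filter(df.age > 30).show()

# Dataset (Scala/Java only) would add compile-time type checking on top of
# the DataFrame API; PySpark offers only RDDs and DataFrames.
```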
How does partitioning work in Spark?
Partitioning in Spark divides the data into chunks that executors process in parallel. Too few partitions underuse the cluster, too many add scheduling overhead, and partitioning by the right key keeps related records together and can avoid extra shuffles.
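A short PySpark sketch of controlling the partition count and key (the counts chosen here are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

df = spark.range(1_000_000)  # a single numeric column named "id"
print(df.rdd.getNumPartitions())

# Repartition by a key so rows with the same key land in the same partition;
# a later aggregation on that key then avoids a second shuffle.
by_key = df.withColumn("bucket", df.id % 16).repartition(16, "bucket")
print(by_key.rdd.getNumPartitions())

# coalesce() reduces the partition count without a full shuffle, e.g.
# before writing out a small result.
small = by_key.coalesce(4)
```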
What is the role of Apache Kafka in a data pipeline?
Apache Kafka serves as a distributed, durable publish-subscribe log that decouples the stages of a data pipeline: producers append events to topics in real time, and consumers read them independently at their own pace.
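A minimal sketch using the third-party kafka-python package (pip install kafka-python); the broker address and topic name are illustrative.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: append an event to the "page-views" topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": 42, "url": "/home"}')  # async send
producer.flush()  # block until the broker acknowledges the message

# Consumer side (typically a separate process): read the same stream
# independently, starting from the earliest retained offset.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
```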
How does data shuffling impact Spark performance?
Data shuffling can significantly slow down Spark jobs: wide transformations such as joins, groupBy, and repartition redistribute data across partitions, which means serialization, disk spills, and network I/O between executors.
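A classic PySpark illustration of reducing shuffle volume: reduceByKey aggregates within each partition before the shuffle, while groupByKey ships every record across the network first.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)] * 100_000)

# groupByKey moves every (key, value) record across the shuffle boundary
# before any aggregation happens.
counts_slow = pairs.groupByKey().mapValues(sum)

# reduceByKey combines values within each partition first (map-side
# combine), so far less data crosses the network.
counts_fast = pairs.reduceByKey(lambda a, b: a + b)

print(counts_slow.collect())
print(counts_fast.collect())  # same result, much less shuffled data
```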
What are the advantages of using Parquet or ORC formats for big data?
Parquet and ORC are columnar formats with efficient compression and encoding schemes. Because queries can read only the columns they need (column pruning) and skip row groups via predicate pushdown, they reduce both storage costs and query times.
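A PySpark sketch of writing and reading both formats; the output paths and codec choices are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("formats-demo").getOrCreate()

df = spark.range(1_000_000).withColumn("squared", F.col("id") * F.col("id"))

# Illustrative paths and compression codecs.
df.write.mode("overwrite").option("compression", "snappy").parquet("/tmp/demo.parquet")
df.write.mode("overwrite").option("compression", "zlib").orc("/tmp/demo.orc")

# Column pruning and predicate pushdown: only the "id" column is read,
# and row groups that cannot match the filter are skipped.
spark.read.parquet("/tmp/demo.parquet").filter(F.col("id") < 10).select("id").show()
```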