Kafka & Spark Flashcards
What is Apache Kafka?
Apache Kafka is a distributed streaming platform for publishing, subscribing to, storing, and processing streams of records in real time.
What are the main components of Kafka?
The main components of Kafka are: Producers, Consumers, Topics, Partitions, Brokers, and ZooKeeper.
What is a Kafka topic?
A Kafka topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber and can have zero, one, or many consumers that subscribe to the data written to them.
Explain the role of ZooKeeper in Kafka.
ZooKeeper manages and coordinates Kafka brokers. It maintains metadata about the cluster, such as topic configurations, broker membership, and partition leadership. (Older clients also stored consumer offsets in ZooKeeper; modern consumers store them in an internal Kafka topic.)
What is the purpose of partitions in Kafka?
Partitions allow Kafka to distribute data across multiple brokers, enabling parallel processing and scalability. Each partition is an ordered, immutable sequence of records.
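Example (Python, assuming the third-party kafka-python client, a broker at localhost:9092, and a made-up topic name): records that share a key are hashed to the same partition, so their relative order is preserved.

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # Both records hash to the same partition because they share a key,
    # so events for "user-42" keep their order relative to each other.
    producer.send("page-views", key=b"user-42", value=b"viewed /home")
    producer.send("page-views", key=b"user-42", value=b"added item to cart")
    producer.flush()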
How does Kafka ensure fault tolerance?
Kafka ensures fault tolerance through data replication across multiple brokers. Each partition can have multiple replicas, with one serving as the leader and the others as followers; if the leader fails, an in-sync follower is elected as the new leader.
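Sketch of creating a replicated topic (again assuming kafka-python and a local cluster with at least three brokers; names are illustrative):

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # Each of the 3 partitions is copied to 3 brokers: one replica is the
    # leader, the others are followers that can take over on failure.
    admin.create_topics([
        NewTopic(name="orders", num_partitions=3, replication_factor=3)
    ])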
What is the role of a Kafka Producer?
A Kafka Producer is responsible for publishing data to one or more Kafka topics. It distributes records across a topic's partitions, either by hashing the record key or round-robin when no key is provided, which spreads the load across brokers.
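Minimal producer sketch (kafka-python; broker address and topic name are assumptions):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # Serialize Python dicts to JSON bytes before sending.
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # With no key, records are spread across the topic's partitions.
    producer.send("clickstream", {"user": "u1", "page": "/home"})
    producer.flush()  # block until buffered records are delivered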
Explain the concept of Consumer Groups in Kafka.
Consumer Groups in Kafka allow for scalable and fault-tolerant consumption of messages. Multiple consumers can work together to consume messages from one or more topics, with each record delivered to only one consumer within each subscribing group.
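Hedged sketch of a consumer group member (kafka-python; group and topic names are illustrative). Running several copies of this script with the same group_id splits the topic's partitions among them:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        group_id="analytics-service",   # all members of this group share the work
        auto_offset_reset="earliest",   # start from the beginning if no offset is stored
    )
    for record in consumer:
        print(record.partition, record.offset, record.value)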
What is Apache Spark?
Apache Spark is an open-source, distributed computing system for big data processing and analytics. It provides high-level APIs in Java, Scala, Python, and R, and ships with libraries for SQL, streaming, machine learning, and graph processing.
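A minimal PySpark entry point (assuming pyspark is installed locally); the Spark examples below reuse this spark session:

    from pyspark.sql import SparkSession

    # local[*] runs Spark in-process using all available CPU cores.
    spark = (
        SparkSession.builder
        .appName("flashcards-demo")
        .master("local[*]")
        .getOrCreate()
    )
    print(spark.version)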
What are the main components of Apache Spark?
The main components of Apache Spark are: Spark Core, Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX (for graph processing).
What is an RDD in Spark?
RDD stands for Resilient Distributed Dataset. It is Spark's fundamental data structure: an immutable, partitioned collection of objects that can be processed in parallel.
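Sketch of creating and transforming an RDD (reusing the spark session from the earlier example):

    # Distribute a local Python list across the cluster as an RDD.
    numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
    squares = numbers.map(lambda x: x * x)  # new RDD; the original is unchanged
    print(squares.collect())                # [1, 4, 9, 16, 25]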
How does Spark achieve fault tolerance?
Spark achieves fault tolerance through the lineage of RDDs. If a partition of an RDD is lost, Spark can rebuild it using the lineage graph, which tracks how the RDD was derived from other datasets.
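The lineage can be inspected directly: toDebugString() prints the chain of transformations Spark would replay to rebuild a lost partition (continuing the sketch above):

    evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
    # Shows the plan: map <- filter <- parallelize
    print(evens.toDebugString().decode("utf-8"))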
What is the difference between transformations and actions in Spark?
Transformations (like map and filter) create a new RDD from an existing one without running any computation. Actions (like count and collect) trigger execution and return results to the driver program.
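Sketch illustrating the split, and that transformations are lazy (same assumed spark session):

    data = spark.sparkContext.parallelize(range(10))
    filtered = data.filter(lambda x: x % 2 == 0)  # transformation: no job runs yet
    doubled = filtered.map(lambda x: x * 2)       # still nothing executed
    print(doubled.count())     # action: a job runs, returns 5
    print(doubled.collect())   # action: returns [0, 4, 8, 12, 16]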
Explain the concept of lazy evaluation in Spark.
Lazy evaluation in Spark means that the execution of transformations is delayed until an action is called. This allows Spark to optimize the execution plan.
What is a Spark DataFrame?
A Spark DataFrame is a distributed collection of data organized into named columns. It’s conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
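Small DataFrame sketch (same assumed spark session; the data is made up):

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.filter(df.age > 30).show()   # prints a table containing only alice's row

    # DataFrames can also be queried with SQL, like a relational table.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()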