Kafka & Spark Flashcards
What is Apache Kafka?
Apache Kafka is a distributed streaming platform for publishing, subscribing to, storing, and processing streams of records in real time.
What are the main components of Kafka?
The main components of Kafka are: Producers, Consumers, Topics, Partitions, Brokers, and ZooKeeper.
What is a Kafka topic?
A Kafka topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber and can have zero, one, or many consumers that subscribe to the data written to them.
Explain the role of ZooKeeper in Kafka.
ZooKeeper manages and coordinates Kafka brokers. It maintains metadata about the cluster, such as topic configurations, broker membership, and partition leadership. (Older clients also stored consumer offsets in ZooKeeper; modern consumers store them in an internal Kafka topic.)
What is the purpose of partitions in Kafka?
Partitions allow Kafka to distribute data across multiple brokers, enabling parallel processing and scalability. Each partition is an ordered, immutable sequence of records.
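Example (Python, assuming the third-party kafka-python client, a broker at localhost:9092, and a made-up topic name): records that share a key are hashed to the same partition, so their relative order is preserved.

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    # Both records hash to the same partition because they share a key,
    # so events for "user-42" keep their order relative to each other.
    producer.send("page-views", key=b"user-42", value=b"viewed /home")
    producer.send("page-views", key=b"user-42", value=b"added item to cart")
    producer.flush()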
How does Kafka ensure fault tolerance?
Kafka ensures fault tolerance through data replication across multiple brokers. Each partition can have multiple replicas, with one serving as the leader and the others as followers; if the leader fails, an in-sync follower is elected as the new leader.
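Sketch of creating a replicated topic (again assuming kafka-python and a local cluster with at least three brokers; names are illustrative):

    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # Each of the 3 partitions is copied to 3 brokers: one replica is the
    # leader, the others are followers that can take over on failure.
    admin.create_topics([
        NewTopic(name="orders", num_partitions=3, replication_factor=3)
    ])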
What is the role of a Kafka Producer?
A Kafka Producer is responsible for publishing data to one or more Kafka topics. It distributes records across a topic's partitions, either by hashing the record key or round-robin when no key is provided, which spreads the load across brokers.
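Minimal producer sketch (kafka-python; broker address and topic name are assumptions):

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        # Serialize Python dicts to JSON bytes before sending.
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # With no key, records are spread across the topic's partitions.
    producer.send("clickstream", {"user": "u1", "page": "/home"})
    producer.flush()  # block until buffered records are delivered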
Explain the concept of Consumer Groups in Kafka.
Consumer Groups in Kafka allow for scalable and fault-tolerant consumption of messages. Multiple consumers can work together to consume messages from one or more topics, with each record delivered to only one consumer within each subscribing group.
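Hedged sketch of a consumer group member (kafka-python; group and topic names are illustrative). Running several copies of this script with the same group_id splits the topic's partitions among them:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="localhost:9092",
        group_id="analytics-service",   # all members of this group share the work
        auto_offset_reset="earliest",   # start from the beginning if no offset is stored
    )
    for record in consumer:
        print(record.partition, record.offset, record.value)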
What is Apache Spark?
Apache Spark is an open-source, distributed computing system for big data processing and analytics. It provides high-level APIs in Java, Scala, Python, and R, and ships with libraries for SQL, streaming, machine learning, and graph processing.
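A minimal PySpark entry point (assuming pyspark is installed locally); the Spark examples below reuse this spark session:

    from pyspark.sql import SparkSession

    # local[*] runs Spark in-process using all available CPU cores.
    spark = (
        SparkSession.builder
        .appName("flashcards-demo")
        .master("local[*]")
        .getOrCreate()
    )
    print(spark.version)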
What are the main components of Apache Spark?
The main components of Apache Spark are: Spark Core, Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX (for graph processing).
What is an RDD in Spark?
RDD stands for Resilient Distributed Dataset. It is Spark's fundamental data structure: an immutable, partitioned collection of objects that can be processed in parallel.
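Sketch of creating and transforming an RDD (reusing the spark session from the earlier example):

    # Distribute a local Python list across the cluster as an RDD.
    numbers = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
    squares = numbers.map(lambda x: x * x)  # new RDD; the original is unchanged
    print(squares.collect())                # [1, 4, 9, 16, 25]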
How does Spark achieve fault tolerance?
Spark achieves fault tolerance through the lineage of RDDs. If a partition of an RDD is lost, Spark can rebuild it using the lineage graph, which tracks how the RDD was derived from other datasets.
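The lineage can be inspected directly: toDebugString() prints the chain of transformations Spark would replay to rebuild a lost partition (continuing the sketch above):

    evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * 10)
    # Shows the plan: map <- filter <- parallelize
    print(evens.toDebugString().decode("utf-8"))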
What is the difference between transformations and actions in Spark?
Transformations (like map and filter) create a new RDD from an existing one without running any computation. Actions (like count and collect) trigger execution and return results to the driver program.
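Sketch illustrating the split, and that transformations are lazy (same assumed spark session):

    data = spark.sparkContext.parallelize(range(10))
    filtered = data.filter(lambda x: x % 2 == 0)  # transformation: no job runs yet
    doubled = filtered.map(lambda x: x * 2)       # still nothing executed
    print(doubled.count())     # action: a job runs, returns 5
    print(doubled.collect())   # action: returns [0, 4, 8, 12, 16]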
Explain the concept of lazy evaluation in Spark.
Lazy evaluation in Spark means that the execution of transformations is delayed until an action is called. This allows Spark to optimize the execution plan.
What is a Spark DataFrame?
A Spark DataFrame is a distributed collection of data organized into named columns. It’s conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
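Small DataFrame sketch (same assumed spark session; the data is made up):

    df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
    df.filter(df.age > 30).show()   # prints a table containing only alice's row

    # DataFrames can also be queried with SQL, like a relational table.
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()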