Kafka Spark Flashcards

1
Q

What is Apache Kafka?

A

Apache Kafka is a distributed streaming platform that allows for publishing, subscribing to, storing, and processing streams of records in real time.

2
Q

What are the main components of Kafka?

A

The main components of Kafka are: Producers, Consumers, Topics, Partitions, Brokers, and ZooKeeper (replaced by the built-in KRaft controller in newer Kafka versions).

3
Q

What is a Kafka topic?

A

A Kafka topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber: a topic can have zero, one, or many consumers that subscribe to the data written to it.
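
A minimal sketch of creating a topic programmatically, assuming a broker at localhost:9092 and the kafka-python package; the topic name "orders" is made up for illustration:

    from kafka.admin import KafkaAdminClient, NewTopic

    # Connect to the cluster's admin API.
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

    # A topic is created with a partition count and a replication factor.
    admin.create_topics([
        NewTopic(name="orders", num_partitions=3, replication_factor=1)
    ])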

4
Q

Explain the role of ZooKeeper in Kafka.

A

ZooKeeper is used for managing and coordinating Kafka brokers. It maintains cluster metadata such as broker membership and topic configuration and handles controller election; older consumer clients also stored their offsets in ZooKeeper. Newer Kafka versions replace ZooKeeper with the built-in KRaft consensus layer.

5
Q

What is the purpose of partitions in Kafka?

A

Partitions allow Kafka to distribute data across multiple brokers, enabling parallel processing and scalability. Each partition is an ordered, immutable sequence of records.
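
A small sketch of how partitioning interacts with keys, assuming a broker at localhost:9092, the kafka-python package, and a hypothetical "orders" topic. The default partitioner hashes the record key, so records sharing a key land in the same partition and keep their relative order:

    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # All five records share the key "customer-42", so they hash to the
    # same partition and are read back in the order they were sent.
    for i in range(5):
        producer.send("orders", key=b"customer-42", value=f"event-{i}".encode())
    producer.flush()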

6
Q

How does Kafka ensure fault tolerance?

A

Kafka ensures fault tolerance through data replication across multiple brokers. Each partition can have multiple replicas, with one serving as the leader and others as followers.
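
A sketch of the producer-side half of this, assuming kafka-python, a local broker, and a hypothetical "orders" topic. With acks="all" the leader waits for the in-sync replicas before acknowledging, so an acknowledged record survives a leader failure:

    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        acks="all",   # wait for all in-sync replicas, not just the leader
        retries=5,    # retry transient failures such as a leader change
    )
    producer.send("orders", b"payload")
    producer.flush()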

7
Q

What is the role of a Kafka Producer?

A

A Kafka Producer is responsible for publishing data to one or more Kafka topics. It automatically distributes records across a topic's partitions, either round-robin or by hashing the record key, which balances load across brokers.
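
A minimal producer sketch, assuming kafka-python, a broker at localhost:9092, and a hypothetical "orders" topic:

    import json
    from kafka import KafkaProducer

    # Serialize Python dicts to JSON bytes before sending.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    producer.send("orders", value={"order_id": 1, "amount": 9.99})
    producer.flush()  # block until buffered records are delivered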

8
Q

Explain the concept of Consumer Groups in Kafka.

A

Consumer Groups in Kafka allow for scalable and fault-tolerant consumption of messages. Multiple consumers can work together to consume messages from one or more topics, with each record delivered to only one consumer within each subscribing group.
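
A consumer-group sketch under the same assumptions (local broker, kafka-python, hypothetical topic and group names). Running two copies of this script with the same group_id splits the topic's partitions between them, so each record is processed once per group:

    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="billing-service",    # all members of this group share the work
        auto_offset_reset="earliest",  # start from the beginning if no offset yet
    )

    for message in consumer:
        print(message.partition, message.offset, message.value)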

9
Q

What is Apache Spark?

A

Apache Spark is an open-source, distributed computing system used for big data processing and analytics. It provides high-level APIs in Java, Scala, Python, and R, and supports various libraries.

10
Q

What are the main components of Apache Spark?

A

The main components of Apache Spark are: Spark Core, Spark SQL, Spark Streaming, MLlib (for machine learning), and GraphX (for graph processing).

11
Q

What is an RDD in Spark?

A

RDD stands for Resilient Distributed Dataset. It is Spark's fundamental data structure: an immutable, distributed collection of objects that can be processed in parallel.
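
A minimal RDD sketch, assuming PySpark running in local mode:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "rdd-demo")

    # parallelize distributes a local collection across worker cores as an RDD.
    rdd = sc.parallelize(range(10))
    squares = rdd.map(lambda x: x * x)  # transformation: nothing runs yet
    print(squares.collect())            # action: executes in parallel, returns a list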

12
Q

How does Spark achieve fault tolerance?

A

Spark achieves fault tolerance through the lineage of RDDs. If a partition of an RDD is lost, Spark can rebuild it using the lineage graph, which tracks how the RDD was derived from other datasets.
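
The lineage graph can be inspected directly. A sketch assuming PySpark in local mode:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lineage-demo")

    doubled = (sc.parallelize(range(1000))
                 .filter(lambda x: x % 2 == 0)
                 .map(lambda x: x * 2))

    # toDebugString shows the chain of parent RDDs Spark would replay
    # to rebuild a lost partition.
    print(doubled.toDebugString().decode())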

13
Q

What is the difference between transformations and actions in Spark?

A

Transformations (like map and filter) create a new RDD from an existing one without executing any computation. Actions (like count and collect) trigger execution of the recorded transformations and return results to the driver program.
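
A sketch of the distinction, assuming PySpark in local mode:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "transform-vs-action")
    numbers = sc.parallelize(range(100))

    evens = numbers.filter(lambda x: x % 2 == 0)  # transformation: new RDD, no work done
    tripled = evens.map(lambda x: x * 3)          # transformation: still nothing has run

    print(tripled.count())  # action: runs the whole chain, prints 50
    print(tripled.take(5))  # action: prints [0, 6, 12, 18, 24]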

14
Q

Explain the concept of lazy evaluation in Spark.

A

Lazy evaluation in Spark means that transformations are not executed until an action is called. This lets Spark optimize the whole execution plan, for example by pipelining transformations and avoiding unnecessary work.
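
One way to see the laziness, assuming PySpark in local mode: a transformation that will fail is accepted without complaint, and the error only surfaces at the action that forces execution:

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "lazy-demo")
    rdd = sc.parallelize([1, 2, 0, 4])

    # Recording this transformation succeeds even though it divides by zero.
    inverses = rdd.map(lambda x: 1 / x)

    try:
        inverses.collect()  # execution happens here, and so does the failure
    except Exception as err:
        print("failed at the action, not the transformation:", type(err).__name__)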

15
Q

What is a Spark DataFrame?

A

A Spark DataFrame is a distributed collection of data organized into named columns. It’s conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
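
A minimal DataFrame sketch, assuming PySpark in local mode; the sample rows are made up:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("df-demo").getOrCreate()

    # Named columns make the data queryable like a SQL table.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 28), ("carol", 34)],
        ["name", "age"],
    )

    df.filter(df.age > 30).groupBy("age").count().show()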

16
Q

How does Spark Streaming work?

A

Spark Streaming works by dividing the streaming data into small batches and processing them using Spark’s batch processing engine. This approach is called micro-batch processing.
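
A word-count sketch using the newer Structured Streaming API (which also runs in micro-batch mode by default), assuming PySpark and a line source on localhost:9999, e.g. started with "nc -lk 9999":

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("stream-demo").getOrCreate()

    # Spark polls the socket and packs newly arrived lines into small batches.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Each micro-batch updates the running counts on the console.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()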

17
Q

How can Apache Kafka and Apache Spark be used together?

A

Kafka can be used as a data source for Spark Streaming, allowing real-time data ingestion. Spark can then process this streaming data and potentially write results back to Kafka or other systems.
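
A sketch of a Kafka-to-Spark pipeline with Structured Streaming, assuming a broker at localhost:9092, a hypothetical "orders" topic, and the spark-sql-kafka connector on the classpath (e.g. via --packages org.apache.spark:spark-sql-kafka-0-10_2.12:<spark-version>):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

    # Subscribe to the topic; records arrive as binary key/value columns.
    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "orders")
              .load())

    decoded = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    query = decoded.writeStream.format("console").outputMode("append").start()
    query.awaitTermination()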

18
Q

What are the benefits of using Kafka with Spark?

A

The combination allows for building end-to-end real-time data pipelines. Kafka provides reliable data ingestion and buffering, while Spark offers powerful processing capabilities, enabling complex analytics on streaming data.