8. Streaming Data and Real-Time Processing Flashcards
What are the key components of Apache Kafka?
The key components of Apache Kafka are producers (which publish records), brokers (the servers that store and serve data), consumers (which read records), and topics (named streams of records, each split into partitions). Cluster metadata is coordinated by ZooKeeper or, in newer versions, Kafka's built-in KRaft mode.
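A core detail of these components is that producers route each keyed record to one of a topic's partitions. The toy function below mimics that routing (real Kafka uses murmur2 hashing; CRC32 here is a stand-in for illustration):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition deterministically (CRC32 stand-in for murmur2)."""
    return zlib.crc32(key) % num_partitions

# Records with the same key always land on the same partition,
# which is what preserves per-key ordering in Kafka.
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Because the mapping is deterministic, all records for a given key stay on one partition and are consumed in order.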
How does Kafka achieve fault tolerance?
Kafka achieves fault tolerance by replicating each topic partition across multiple brokers. One replica acts as the leader and the rest as followers; if the leader's broker fails, an in-sync follower is elected as the new leader, so data remains available.
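The failover idea can be sketched in a toy model (not Kafka's actual controller logic): when the leader dies, a surviving replica from the in-sync set is promoted:

```python
def elect_leader(replicas, failed):
    """Promote the first surviving in-sync replica, or None if all are down."""
    for broker in replicas:
        if broker not in failed:
            return broker
    return None

replicas = ["broker-1", "broker-2", "broker-3"]   # replication factor 3
assert elect_leader(replicas, failed={"broker-1"}) == "broker-2"
assert elect_leader(replicas, failed=set(replicas)) is None  # partition unavailable
```

With a replication factor of N, the partition tolerates up to N-1 broker failures before data becomes unavailable.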
What is the difference between batch and streaming processing?
Batch processing operates on a bounded dataset collected in advance, while stream processing operates continuously on unbounded data, producing results as each event arrives.
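The contrast can be shown framework-free: a batch job computes over the complete dataset at once, while a streaming job maintains a running result per arriving element:

```python
def batch_total(records):
    """Batch: all data is available up front; one final answer."""
    return sum(records)

def streaming_totals(records):
    """Streaming: emit an updated total as each record arrives."""
    total = 0
    for r in records:
        total += r
        yield total

data = [3, 1, 4]
assert batch_total(data) == 8
assert list(streaming_totals(data)) == [3, 4, 8]
```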
Explain window functions in streaming frameworks like Flink or Spark Streaming.
Window functions divide an unbounded stream into finite slices, typically by time, so aggregations can be computed per slice. Common types are tumbling windows (fixed-size, non-overlapping), sliding windows (overlapping), and session windows (bounded by gaps in activity).
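A tumbling window, the simplest of these, can be sketched in plain Python: each event is assigned to the fixed-size window containing its event time, and values are counted per window:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """events: list of (event_time, value); returns {window_start: count}."""
    counts = defaultdict(int)
    for event_time, _value in events:
        # Integer-divide down to the start of the enclosing window.
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (4, "b"), (5, "c"), (11, "d")]
assert tumbling_window_counts(events, window_size=5) == {0: 2, 5: 1, 10: 1}
```

A sliding window differs only in that one event may belong to several overlapping windows.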
How does Spark Streaming ensure exactly-once semantics?
Spark Streaming provides exactly-once semantics by combining replayable sources, write-ahead logs, and checkpointing on the input side with idempotent or transactional output sinks; write-ahead logs alone only guarantee that received data is not lost (at-least-once).
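The write-ahead-log idea itself is simple to sketch (this is the general WAL pattern, not Spark's internals): every update is appended to a durable log before being applied to state, so after a crash the state can be rebuilt by replaying the log:

```python
def apply_updates(updates, wal):
    """Apply key/value updates, logging each one before applying it."""
    state = {}
    for key, value in updates:
        wal.append((key, value))   # 1. log first (real systems fsync this)
        state[key] = value         # 2. then apply to in-memory state
    return state

def recover(wal):
    """After a crash, rebuild state purely from the log."""
    state = {}
    for key, value in wal:
        state[key] = value
    return state

wal = []
state = apply_updates([("a", 1), ("b", 2), ("a", 3)], wal)
assert recover(wal) == state == {"a": 3, "b": 2}
```

Replay is safe here because the updates are idempotent (last write wins); non-idempotent side effects are why exactly-once also requires transactional sinks.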
What is Kafka offset management, and why is it important?
Kafka offset management tracks each consumer group's read position in every partition (stored in the internal __consumer_offsets topic). When and how offsets are committed determines whether, after a failure, messages are reprocessed (at-least-once) or potentially skipped (at-most-once).
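A toy model of the commit-and-resume cycle (assuming the common commit-after-process pattern): the consumer commits the offset of the next record to read, and after a restart it resumes exactly there:

```python
def consume(log, committed_offset, batch_size):
    """Read up to batch_size records starting at the committed offset."""
    records = log[committed_offset:committed_offset + batch_size]
    new_offset = committed_offset + len(records)  # commit after processing
    return records, new_offset

log = ["m0", "m1", "m2", "m3", "m4"]
batch, offset = consume(log, committed_offset=0, batch_size=3)
assert batch == ["m0", "m1", "m2"] and offset == 3

# Simulated restart: resuming from the committed offset neither
# re-reads nor skips messages.
batch, offset = consume(log, committed_offset=offset, batch_size=3)
assert batch == ["m3", "m4"] and offset == 5
```

If the consumer crashed after processing but before committing, it would restart at the old offset and reprocess the batch, which is the at-least-once case.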
Explain the concepts of watermarking and late arrival data.
Watermarking tracks the progress of event time through a stream: a watermark asserts that no events with timestamps older than it are still expected. Events that arrive after the watermark has passed their timestamp are "late" and can be dropped, routed to a side output, or still processed within a configured allowed lateness.
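A common strategy (Flink calls it bounded out-of-orderness) keeps the watermark a fixed delay behind the maximum event time seen so far; anything older than the watermark is classified as late. A minimal sketch:

```python
def split_on_time(events, max_delay):
    """events: event-time stamps in arrival order; returns (on_time, late)."""
    max_event_time = float("-inf")
    on_time, late = [], []
    for t in events:
        watermark = max_event_time - max_delay
        (late if t <= watermark else on_time).append(t)
        max_event_time = max(max_event_time, t)
    return on_time, late

# With a delay of 2, event 5 arrives after seeing event 10,
# when the watermark has already advanced to 8 -> it is late.
on_time, late = split_on_time([7, 10, 5, 9], max_delay=2)
assert on_time == [7, 10, 9] and late == [5]
```

A larger delay tolerates more disorder but makes windows wait longer before their results can be finalized.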
How do you handle out-of-order data in streaming pipelines?
Out-of-order data is handled by processing on event time rather than arrival time: events are buffered until a watermark indicates a window is complete, and an allowed-lateness setting determines how long stragglers are still accepted.
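The buffering side can be sketched as a toy reordering buffer: arriving events are held in a min-heap and released in event-time order once the watermark (max event time seen minus an allowed delay) has passed them:

```python
import heapq

def reorder(events, max_delay):
    """Emit event-time stamps in order, buffering until the watermark passes."""
    buffer, out = [], []
    max_seen = float("-inf")
    for t in events:
        heapq.heappush(buffer, t)
        max_seen = max(max_seen, t)
        watermark = max_seen - max_delay
        while buffer and buffer[0] <= watermark:
            out.append(heapq.heappop(buffer))
    out.extend(sorted(buffer))  # flush whatever remains at end-of-stream
    return out

assert reorder([3, 1, 4, 2, 6, 5], max_delay=2) == [1, 2, 3, 4, 5, 6]
```

This works as long as no event is delayed by more than max_delay; events later than that would need the late-arrival handling described above.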
What are the differences between Apache Flink and Spark Streaming?
Apache Flink is a true per-event streaming engine with low latency and first-class support for event time and stateful processing, while Spark Streaming processes data in micro-batches, trading some latency for throughput and a unified API with Spark's batch workloads.
How does Kinesis compare with Kafka?
Amazon Kinesis is a fully managed AWS service for real-time data streaming, scaled via shards and with bounded retention, while Kafka is an open-source distributed streaming platform that you operate yourself (or via a managed vendor), offering more configurability, longer retention options, and a broader ecosystem.