8. Streaming Data and Real-Time Processing Flashcards
What are the key components of Apache Kafka?
The key components of Apache Kafka are producers (which publish records), brokers (the servers that store and serve data), consumers (which read records), and topics (named streams of records, each split into partitions). Cluster metadata is coordinated by ZooKeeper or, in newer versions, Kafka's built-in KRaft mode.
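A core detail of these components is that producers route each keyed record to one of a topic's partitions. The toy function below mimics that routing (real Kafka uses murmur2 hashing; CRC32 here is a stand-in for illustration):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition deterministically (CRC32 stand-in for murmur2)."""
    return zlib.crc32(key) % num_partitions

# Records with the same key always land on the same partition,
# which is what preserves per-key ordering in Kafka.
assert choose_partition(b"user-42", 6) == choose_partition(b"user-42", 6)
```

Because the mapping is deterministic, all records for a given key stay on one partition and are consumed in order.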
How does Kafka achieve fault tolerance?
Kafka achieves fault tolerance by replicating each topic partition across multiple brokers. One replica acts as the leader and the rest as followers; if the leader's broker fails, an in-sync follower is elected as the new leader, so data remains available.
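The failover idea can be sketched in a toy model (not Kafka's actual controller logic): when the leader dies, a surviving replica from the in-sync set is promoted:

```python
def elect_leader(replicas, failed):
    """Promote the first surviving in-sync replica, or None if all are down."""
    for broker in replicas:
        if broker not in failed:
            return broker
    return None

replicas = ["broker-1", "broker-2", "broker-3"]   # replication factor 3
assert elect_leader(replicas, failed={"broker-1"}) == "broker-2"
assert elect_leader(replicas, failed=set(replicas)) is None  # partition unavailable
```

With a replication factor of N, the partition tolerates up to N-1 broker failures before data becomes unavailable.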
What is the difference between batch and streaming processing?
Batch processing operates on a bounded dataset collected in advance, while stream processing operates continuously on unbounded data, producing results as each event arrives.
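The contrast can be shown framework-free: a batch job computes over the complete dataset at once, while a streaming job maintains a running result per arriving element:

```python
def batch_total(records):
    """Batch: all data is available up front; one final answer."""
    return sum(records)

def streaming_totals(records):
    """Streaming: emit an updated total as each record arrives."""
    total = 0
    for r in records:
        total += r
        yield total

data = [3, 1, 4]
assert batch_total(data) == 8
assert list(streaming_totals(data)) == [3, 4, 8]
```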
Explain window functions in streaming frameworks like Flink or Spark Streaming.
Window functions divide an unbounded stream into finite slices, typically by time, so aggregations can be computed per slice. Common types are tumbling windows (fixed-size, non-overlapping), sliding windows (overlapping), and session windows (bounded by gaps in activity).
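A tumbling window, the simplest of these, can be sketched in plain Python: each event is assigned to the fixed-size window containing its event time, and values are counted per window:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """events: list of (event_time, value); returns {window_start: count}."""
    counts = defaultdict(int)
    for event_time, _value in events:
        # Integer-divide down to the start of the enclosing window.
        window_start = (event_time // window_size) * window_size
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (4, "b"), (5, "c"), (11, "d")]
assert tumbling_window_counts(events, window_size=5) == {0: 2, 5: 1, 10: 1}
```

A sliding window differs only in that one event may belong to several overlapping windows.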
How does Spark Streaming ensure exactly-once semantics?
Spark Streaming provides exactly-once semantics by combining replayable sources, write-ahead logs, and checkpointing on the input side with idempotent or transactional output sinks; write-ahead logs alone only guarantee that received data is not lost (at-least-once).
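The write-ahead-log idea itself is simple to sketch (this is the general WAL pattern, not Spark's internals): every update is appended to a durable log before being applied to state, so after a crash the state can be rebuilt by replaying the log:

```python
def apply_updates(updates, wal):
    """Apply key/value updates, logging each one before applying it."""
    state = {}
    for key, value in updates:
        wal.append((key, value))   # 1. log first (real systems fsync this)
        state[key] = value         # 2. then apply to in-memory state
    return state

def recover(wal):
    """After a crash, rebuild state purely from the log."""
    state = {}
    for key, value in wal:
        state[key] = value
    return state

wal = []
state = apply_updates([("a", 1), ("b", 2), ("a", 3)], wal)
assert recover(wal) == state == {"a": 3, "b": 2}
```

Replay is safe here because the updates are idempotent (last write wins); non-idempotent side effects are why exactly-once also requires transactional sinks.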
What is Kafka offset management, and why is it important?
Kafka offset management tracks each consumer group's read position in every partition (stored in the internal __consumer_offsets topic). When and how offsets are committed determines whether, after a failure, messages are reprocessed (at-least-once) or potentially skipped (at-most-once).
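A toy model of the commit-and-resume cycle (assuming the common commit-after-process pattern): the consumer commits the offset of the next record to read, and after a restart it resumes exactly there:

```python
def consume(log, committed_offset, batch_size):
    """Read up to batch_size records starting at the committed offset."""
    records = log[committed_offset:committed_offset + batch_size]
    new_offset = committed_offset + len(records)  # commit after processing
    return records, new_offset

log = ["m0", "m1", "m2", "m3", "m4"]
batch, offset = consume(log, committed_offset=0, batch_size=3)
assert batch == ["m0", "m1", "m2"] and offset == 3

# Simulated restart: resuming from the committed offset neither
# re-reads nor skips messages.
batch, offset = consume(log, committed_offset=offset, batch_size=3)
assert batch == ["m3", "m4"] and offset == 5
```

If the consumer crashed after processing but before committing, it would restart at the old offset and reprocess the batch, which is the at-least-once case.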
Explain the concepts of watermarking and late arrival data.
Watermarking tracks the progress of event time through a stream: a watermark asserts that no events with timestamps older than it are still expected. Events that arrive after the watermark has passed their timestamp are "late" and can be dropped, routed to a side output, or still processed within a configured allowed lateness.
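A common strategy (Flink calls it bounded out-of-orderness) keeps the watermark a fixed delay behind the maximum event time seen so far; anything older than the watermark is classified as late. A minimal sketch:

```python
def split_on_time(events, max_delay):
    """events: event-time stamps in arrival order; returns (on_time, late)."""
    max_event_time = float("-inf")
    on_time, late = [], []
    for t in events:
        watermark = max_event_time - max_delay
        (late if t <= watermark else on_time).append(t)
        max_event_time = max(max_event_time, t)
    return on_time, late

# With a delay of 2, event 5 arrives after seeing event 10,
# when the watermark has already advanced to 8 -> it is late.
on_time, late = split_on_time([7, 10, 5, 9], max_delay=2)
assert on_time == [7, 10, 9] and late == [5]
```

A larger delay tolerates more disorder but makes windows wait longer before their results can be finalized.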
How do you handle out-of-order data in streaming pipelines?
Out-of-order data is handled by processing on event time rather than arrival time: events are buffered until a watermark indicates a window is complete, and an allowed-lateness setting determines how long stragglers are still accepted.
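The buffering side can be sketched as a toy reordering buffer: arriving events are held in a min-heap and released in event-time order once the watermark (max event time seen minus an allowed delay) has passed them:

```python
import heapq

def reorder(events, max_delay):
    """Emit event-time stamps in order, buffering until the watermark passes."""
    buffer, out = [], []
    max_seen = float("-inf")
    for t in events:
        heapq.heappush(buffer, t)
        max_seen = max(max_seen, t)
        watermark = max_seen - max_delay
        while buffer and buffer[0] <= watermark:
            out.append(heapq.heappop(buffer))
    out.extend(sorted(buffer))  # flush whatever remains at end-of-stream
    return out

assert reorder([3, 1, 4, 2, 6, 5], max_delay=2) == [1, 2, 3, 4, 5, 6]
```

This works as long as no event is delayed by more than max_delay; events later than that would need the late-arrival handling described above.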
What are the differences between Apache Flink and Spark Streaming?
Apache Flink is a true per-event streaming engine with low latency and first-class support for event time and stateful processing, while Spark Streaming processes data in micro-batches, trading some latency for throughput and a unified API with Spark's batch workloads.
How does Kinesis compare with Kafka?
Amazon Kinesis is a fully managed AWS service for real-time data streaming, scaled via shards and with bounded retention, while Kafka is an open-source distributed streaming platform that you operate yourself (or via a managed vendor), offering more configurability, longer retention options, and a broader ecosystem.