Cloud Dataflow - Pipelines Flashcards
What is Cloud Dataflow?
- A fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness.
- Its serverless approach to resource provisioning and management provides virtually limitless capacity for the biggest data processing challenges; you pay only for what you use.
What are some Use Cases for Cloud Dataflow?
- Clickstream, point-of-sale, and segmentation analysis in retail
- Fraud detection in financial services
- Personalized user experiences in gaming
- IoT analytics in manufacturing, healthcare, and logistics
What are the advantages to using Cloud Dataflow?
Dataflow stands out particularly for out-of-order processing and real-time session management.
1) Accelerate development for batch and streaming: simple pipeline development via the Java and Python APIs in the Apache Beam SDK, plus code reuse between batch and streaming pipelines (see the minimal pipeline sketch after this list).
2) Simplify operations and management: serverless approach removes operational overhead.
3) Integrates with Stackdriver for a unified logging, monitoring, and alerting solution.
4) Provides a foundation for machine learning by letting you add ML models and APIs to your data processing pipelines.
5) Integrates seamlessly with GCP services for streaming event ingestion (Pub/Sub), data warehousing (BigQuery), machine learning (Cloud Machine Learning), and more. The Beam-based SDK also lets customers run pipelines on alternative execution engines, such as Apache Spark via Cloud Dataproc or on-premises.
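A minimal sketch of such a pipeline in the Beam Python SDK, assuming the apache_beam package is installed and using hypothetical gs:// paths; by default it runs locally, and the same code can be submitted to Cloud Dataflow by supplying DataflowRunner options:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs with the DirectRunner by default; pass --runner=DataflowRunner plus
# project/region/staging options to submit the same pipeline to Cloud Dataflow.
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')   # hypothetical path
     | 'ToWords' >> beam.FlatMap(lambda line: line.split())
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'CountPerWord' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda kv: f'{kv[0]}: {kv[1]}')
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output'))     # hypothetical path
```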
What are features of Cloud Dataflow?
- Automated resource management
- Dynamic work re-balancing
- Reliable and consistent ‘exactly-once’ processing, fault-tolerant execution
- Horizontal auto-scaling, results in better price-to-performance
- Unified programming model (Apache Beam SDK) provides MapReduce-like operations, powerful data windowing, and fine-grained correctness control for streaming and batch data
- Community-driven Innovation: Dataflow can be extended by forking or contributing to Apache Beam
What are the different types of Triggers provided by Dataflow?
- Time-based (Default)
- operate on a time reference, either event time or processing time
- AfterWatermark, AfterProcessingTime
- Data-driven
- examines data as it arrives and fires on a specified data condition, for example after a set number of elements
- AfterPane.elementCountAtLeast
- Composite
- combines time-based and data-driven triggers and fires when its sub-triggers' conditions are met (see the sketch below)
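A hedged sketch in the Beam Python SDK (hypothetical 60-second windows and thresholds) showing one trigger from each category; WindowInto takes a single trigger, so the composite here wraps the time-based and data-driven pieces:

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

# Time-based:  trigger.AfterWatermark(), trigger.AfterProcessingTime(delay)
# Data-driven: trigger.AfterCount(n)
# Composite:   trigger.AfterAll(...) fires once all of its sub-triggers have fired.
windowed = (events  # assumes `events` is an existing PCollection of (key, value) pairs
    | 'Window' >> beam.WindowInto(
        window.FixedWindows(60),                   # hypothetical 60-second windows
        trigger=trigger.AfterAll(                  # composite trigger
            trigger.AfterCount(100),               # data-driven piece
            trigger.AfterProcessingTime(delay=30)  # time-based piece
        ),
        accumulation_mode=trigger.AccumulationMode.DISCARDING)
    | 'Sum' >> beam.CombinePerKey(sum))
```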
What are the two basic primitives in Dataflow?
- PCollections: represent data sets across which parallel transformations may be performed. The ‘P’ is for parallel.
- PTransforms: applied to PCollections to create new PCollections. They perform element transformations, aggregate multiple elements together, or may be composite combinations of other PTransforms (sketched below).
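A brief sketch of the two primitives in the Beam Python SDK, using hypothetical in-memory data: each `|` applies a PTransform to a PCollection and yields a new PCollection.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    # beam.Create yields a PCollection from in-memory (hypothetical) data.
    amounts = p | 'Create' >> beam.Create([('a', 2), ('b', 5), ('a', 3)])

    # Element-wise PTransform: produces a new PCollection; the input is unchanged.
    doubled = amounts | 'Double' >> beam.Map(lambda kv: (kv[0], kv[1] * 2))

    # Aggregating PTransform: combines multiple elements per key.
    totals = doubled | 'SumPerKey' >> beam.CombinePerKey(sum)
```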
What does Triggering provide for Windowing and Watermarks?
- Tackles the problem of Watermarks being too fast or too slow
- Allows control of the materialization for a Window given some regular interval or other data condition
- If the watermark is too slow, a processing-time trigger can emit early, speculative results
- If the watermark is too fast, an element- or transaction-count trigger can fire again for late-arriving data (see the sketch below)
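A sketch of that idea in the Beam Python SDK (hypothetical window sizes and delays): early, processing-time firings address a too-slow watermark, while a per-element late firing handles data that arrives after a too-fast watermark has passed.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms import trigger

windowed = (events  # assumes an existing, timestamped PCollection `events`
    | beam.WindowInto(
        window.FixedWindows(300),  # hypothetical 5-minute event-time windows
        trigger=trigger.AfterWatermark(
            early=trigger.AfterProcessingTime(delay=60),  # speculative results every ~60s
            late=trigger.AfterCount(1)),                  # re-fire for each late element
        accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
        allowed_lateness=600))  # hypothetical: accept data up to 10 minutes late
```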
What are some differences between Dataflow and Spark?
- Spark windows per processing-time window, or per micro-batch, which reflects events as they arrived at the pipeline. The lack of event-time windowing support means emulating it with the available APIs, requiring additional code and duplicated logic.
- Dataflow windows per event-time window using watermarks, which reflects events as they happened. The result is clear and modular code.
What are the four major concepts when you think about data processing with Dataflow?
- Pipelines
- PCollections
- Transforms
- I/O Sources and Sinks
What is a Pipeline?
A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.
Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.
What is a PCollection?
A PCollection represents a set of data in your pipeline. The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).
PCollections are the inputs and outputs for each step in your pipeline.
PCollections are optimized for parallelism, unlike the standard JDK Collection class.
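A sketch contrasting the two cases in the Beam Python SDK, with hypothetical bucket and subscription names; reading from Pub/Sub yields an unbounded PCollection and requires the streaming pipeline option.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    # Bounded PCollection: fixed-size data from a (hypothetical) text file.
    bounded = p | 'ReadFile' >> beam.io.ReadFromText('gs://my-bucket/input.txt')

    # Unbounded PCollection: continuously updating data from a (hypothetical)
    # Pub/Sub subscription.
    unbounded = p | 'ReadPubSub' >> beam.io.ReadFromPubSub(
        subscription='projects/my-project/subscriptions/my-sub')
```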
What is a Transform?
A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.
Your transforms don’t need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.
In a Dataflow pipeline, a transform represents a step, or a processing operation that transforms data. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.
Transforms in the Dataflow model can be nested—that is, transforms can contain and invoke other transforms, thus forming composite transforms.
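A sketch of a composite transform in the Beam Python SDK: subclassing beam.PTransform and overriding expand() lets one named step contain and invoke other transforms (the names here are hypothetical).

```python
import apache_beam as beam

class CountWords(beam.PTransform):
    """Composite transform: wraps several sub-transforms behind one step."""
    def expand(self, lines):
        return (lines
                | 'Split' >> beam.FlatMap(lambda line: line.split())
                | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
                | 'SumPerWord' >> beam.CombinePerKey(sum))

# Applied like any other transform:
#   counts = lines | 'CountWords' >> CountWords()
```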
What are I/O Sources and Sinks?
The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.
The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.
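A sketch of built-in source and sink APIs in the Beam Python SDK, reading text from a hypothetical Cloud Storage path and writing rows to a hypothetical BigQuery table:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'ReadGCS' >> beam.io.ReadFromText('gs://my-bucket/events/*.csv')  # source
     | 'Parse' >> beam.Map(lambda line: dict(zip(['user', 'action'], line.split(','))))
     | 'WriteBQ' >> beam.io.WriteToBigQuery(                             # sink
           'my-project:my_dataset.events',
           schema='user:STRING,action:STRING',
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```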
What are the Dataflow SDK options and advantages?
For SDK 2.x:
Languages: Java and Python
Automatic patches and updates by Google
Issue and new version communications
Tested by Google
Eclipse Plugin Support
What questions should you consider when designing your pipeline?
- Where is your input data stored?
- What does your data look like?
- What do you want to do with your data?
- What does your output data look like, and where should it go?