Cloud Dataflow - Pipelines Flashcards

1
Q

What is Cloud Dataflow?

A
  • A fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness.
  • Its serverless approach to resource provisioning and management provides virtually limitless capacity for the biggest data processing challenges, and you pay only for what you use.
2
Q

What are some Use Cases for Cloud Dataflow?

A
  • Clickstream, point-of-sale, and segmentation analysis in retail
  • Fraud detection in financial services
  • Personalized user experience in gaming
  • IoT analytics in manufacturing, healthcare, and logistics
3
Q

What are the advantages to using Cloud Dataflow?

A

Dataflow is particularly strong for out-of-order processing and real-time session management.

1) Accelerates development for batch and streaming: simple pipeline development via the Java and Python APIs in the Apache Beam SDK, with code reuse between batch and streaming pipelines.
2) Simplifies operations and management: the serverless approach removes operational overhead.
3) Integrates with Stackdriver for a unified logging, monitoring, and alerting solution.
4) Builds a foundation for machine learning: add ML models and APIs to your data processing pipelines.
5) Integrates seamlessly with GCP services for streaming event ingestion (Pub/Sub), data warehousing (BigQuery), machine learning (Cloud Machine Learning), and more. The Beam-based SDK also lets customers run pipelines on alternative execution engines such as Apache Spark via Cloud Dataproc or on-premises.

4
Q

What are features of Cloud Dataflow?

A
  • Automated resource management
  • Dynamic work re-balancing
  • Reliable and consistent ‘exactly-once’ processing, fault-tolerant execution
  • Horizontal auto-scaling of worker resources for better price-to-performance
  • Unified programming model (Apache Beam SDK) provides MapReduce-like operations, powerful data windowing, and fine-grained correctness control for streaming and batch data
  • Community-driven Innovation: Dataflow can be extended by forking or contributing to Apache Beam
5
Q

What are the different types of Triggers provided by Dataflow?

A
  1. Time-based (default)
    • operate on a time reference, either event time or processing time
    • AfterWatermark, AfterProcessingTime
  2. Data-driven
    • examine data as it arrives and fire on a specified data condition, for example after a certain number of elements
    • AfterPane.elementCountAtLeast
  3. Composite
    • combine time-based and data-driven triggers, firing when all sub-triggers are met (see the sketch below)
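
As an illustration, here is a hedged Apache Beam (Java) sketch of how each trigger type can be attached to a windowed PCollection. The input collection events, the window sizes, lateness, and element counts are assumed values for illustration, not defaults:

import org.apache.beam.sdk.transforms.windowing.AfterAll;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Repeatedly;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<String> events = ...;  // some unbounded input, e.g. read from Pub/Sub

// 1. Time-based: fire when the watermark passes the end of each 1-minute window.
PCollection<String> timeTriggered = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardMinutes(5))
        .discardingFiredPanes());

// 2. Data-driven: fire whenever at least 100 elements have arrived in a pane.
PCollection<String> dataTriggered = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(100)))
        .withAllowedLateness(Duration.standardMinutes(5))
        .discardingFiredPanes());

// 3. Composite: fire only once both a processing-time delay and a minimum
//    element count have been satisfied (all sub-triggers met).
PCollection<String> compositeTriggered = events.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(Repeatedly.forever(AfterAll.of(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30)),
            AfterPane.elementCountAtLeast(100))))
        .withAllowedLateness(Duration.standardMinutes(5))
        .discardingFiredPanes());
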
6
Q

What are the two basic primitives in Dataflow?

A
  1. PCollections: represent data sets across which parallel transformations can be performed. The ‘P’ stands for parallel.
  2. PTransforms: applied to PCollections to create new PCollections. They perform element-wise transformations, aggregate multiple elements together, or act as composite combinations of other PTransforms.
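
A minimal sketch of both primitives in the Beam Java SDK, assuming a small in-memory data set (the element values are purely illustrative):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// PCollection: a (here, bounded) data set that the pipeline processes in parallel.
PCollection<String> words =
    p.apply(Create.of(Arrays.asList("stream", "batch", "stream")));

// PTransform: applied to a PCollection to produce a new PCollection.
// Count.perElement() is itself a composite of other PTransforms.
PCollection<KV<String, Long>> wordCounts = words.apply(Count.perElement());

p.run();
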
7
Q

What does Triggering provide for Windowing and Watermarks?

A
  • Tackles the problem of watermarks being either too slow or too fast
  • Allows you to control when results for a window are materialized, based on a regular interval or some other data condition
  • If the watermark is too slow, a processing-time trigger can emit early results
  • If the watermark is too fast (some data arrives late), an element- or transaction-count trigger can account for the late data
8
Q

What are some differences between Dataflow and Spark?

A
  • Spark windows by processing time (per micro-batch), reflecting events as they arrived at the pipeline. Its lack of event-time windowing support means emulating it with the available APIs, which requires additional code and duplicated logic.
  • Dataflow windows by event time with watermarks, reflecting events as they actually happened. The resulting pipelines are clear and modular.
9
Q

What are the four major concepts when you think about data processing with Dataflow?

A
  1. Pipelines
  2. PCollections
  3. Transforms
  4. I/O Sources and Sinks
10
Q

What is a Pipeline?

A

A pipeline encapsulates an entire series of computations that accepts some input data from external sources, transforms that data to provide some useful intelligence, and produces some output data. That output data is often written to an external data sink. The input source and output sink can be the same, or they can be of different types, allowing you to easily convert data from one format to another.

Each pipeline represents a single, potentially repeatable job, from start to finish, in the Dataflow service.

11
Q

What is a PCollection?

A

A PCollection represents a set of data in your pipeline. The Dataflow PCollection classes are specialized container classes that can represent data sets of virtually unlimited size. A PCollection can hold a data set of a fixed size (such as data from a text file or a BigQuery table), or an unbounded data set from a continuously updating data source (such as a subscription from Google Cloud Pub/Sub).

PCollections are the inputs and outputs for each step in your pipeline.

PCollections are optimized for parallelism, unlike the standard JDK Collection class.

12
Q

What is a Transform?

A

A transform is a data processing operation, or a step, in your pipeline. A transform takes one or more PCollections as input, performs a processing function that you provide on the elements of that PCollection, and produces an output PCollection.

Your transforms don’t need to be in a strict linear sequence within your pipeline. You can use conditionals, loops, and other common programming structures to create a branching pipeline or a pipeline with repeated structures. You can think of your pipeline as a directed graph of steps, rather than a linear sequence.

In a Dataflow pipeline, a transform represents a step, or a processing operation that transforms data. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.

Transforms in the Dataflow model can be nested—that is, transforms can contain and invoke other transforms, thus forming composite transforms.
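
As a hedged example, here is a simple element-wise transform written with ParDo and a DoFn (Beam 2.x annotation style); the input collection lines and the filtering condition are assumed for illustration:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> lines = ...;

// A DoFn describes per-element processing; ParDo runs it in parallel
// across the elements of the input PCollection.
PCollection<String> nonEmptyLines = lines.apply("FilterEmptyLines",
    ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        String line = c.element();
        if (!line.trim().isEmpty()) {
          c.output(line);  // only non-empty lines reach the output PCollection
        }
      }
    }));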

13
Q

What are I/O Sources and Sinks?

A

The Dataflow SDKs provide data source and data sink APIs for pipeline I/O. You use the source APIs to read data into your pipeline, and the sink APIs to write output data from your pipeline. These source and sink operations represent the roots and endpoints of your pipeline.

The Dataflow source and sink APIs let your pipeline work with data from a number of different data storage formats, such as files in Google Cloud Storage, BigQuery tables, and more. You can also use a custom data source (or sink) by teaching Dataflow how to read from (or write to) it in parallel.
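
A hedged sketch of a pipeline whose root is a BigQuery source and whose endpoint is a Cloud Storage text sink (Beam 2.x style); the project, table, field, and bucket names are assumptions:

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

Pipeline p = Pipeline.create(PipelineOptionsFactory.create());

// Source: read rows from a BigQuery table (the root of the pipeline).
PCollection<TableRow> rows = p.apply("ReadOrders",
    BigQueryIO.readTableRows().from("my-project:my_dataset.orders"));

// A transform between source and sink: pull out one field as a string.
PCollection<String> orderIds = rows.apply(
    MapElements.into(TypeDescriptors.strings())
        .via((TableRow row) -> (String) row.get("order_id")));

// Sink: write the results to text files in Cloud Storage (the endpoint).
orderIds.apply("WriteOrderIds", TextIO.write().to("gs://my-bucket/order-ids"));

p.run();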

14
Q

What are the Dataflow SDK options and advantages?

A

For SDK 2.x:

  • Languages: Java and Python
  • Automatic patches and updates by Google
  • Issue and new-version communications
  • Tested by Google
  • Eclipse plugin support

15
Q

What questions should you consider when designing your pipeline?

A
  1. Where is your input data stored?
  2. What does your data look like?
  3. What do you want to do with your data?
  4. What does your output data look like, where should it go?
16
Q

What is the basic design behind a Pipeline?

A

A simple pipeline is linear: Input > Transform(PCollection) > Output

But a pipeline can be much more complex: it represents a directed acyclic graph (DAG) of steps. It can have multiple input sources, multiple output sinks, and its operations (transforms) can output multiple PCollections.

It’s important to understand that transforms do not consume PCollections; instead, they consider each individual element of a PCollection and create a new PCollection as output. This way, you can do different things to different elements in the same PCollection.
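
A hedged sketch of the branching idea: one PCollection feeds two independent transforms, and each branch produces its own new output PCollection (the names collection and the prefix conditions are assumed):

import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

PCollection<String> names = ...;

// Neither branch "consumes" the input; each reads every element and
// emits its own new PCollection.
PCollection<String> aNames =
    names.apply("StartsWithA", Filter.by((String name) -> name.startsWith("A")));

PCollection<String> bNames =
    names.apply("StartsWithB", Filter.by((String name) -> name.startsWith("B")));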

17
Q

What are the parts of a Pipeline?

A

A pipeline consists of two parts:

  1. Data - PCollections for input, intermediate, and output data
  2. Transforms applied to that data
    • Core Transforms - the basic processing operations
      • The Dataflow SDKs supply core transforms such as ParDo, GroupByKey, Combine, and Flatten (merge), as well as other core transforms for combining, merging, and splitting data sets.
    • Composite Transforms - combine multiple transforms into larger composite transforms. The Dataflow SDKs also provide composite transforms for combining data, map/shuffle/reduce, and statistical analysis.
    • Root Transforms - often used to create the initial PCollection by reading data from a data source (pipeline I/O).
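A hedged sketch of two of the core transforms in the Beam Java SDK; the input PCollections are assumed placeholders:

import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

// Core transform: GroupByKey gathers all values that share a key.
PCollection<KV<String, Integer>> scores = ...;
PCollection<KV<String, Iterable<Integer>>> grouped =
    scores.apply(GroupByKey.<String, Integer>create());

// Core transform: Flatten merges several PCollections of the same type.
PCollection<String> logsA = ...;
PCollection<String> logsB = ...;
PCollection<String> allLogs =
    PCollectionList.of(logsA).and(logsB).apply(Flatten.<String>pCollections());
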
18
Q

How does Dataflow perform data encoding?

A

When you create or output pipeline data, you’ll need to specify how the elements in your PCollections are encoded and decoded to and from byte strings. Byte strings are used for intermediate storage as well as for reading from sources and writing to sinks. The Dataflow SDKs use objects called coders to describe how the elements of a given PCollection should be encoded and decoded.

You typically need to specify a coder when reading data into your pipeline from an external source (or creating pipeline data from local data), and also when you output pipeline data to an external sink.

You set the coder by calling the method .withCoder when you apply your pipeline’s Read or Write transform.
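
The exact coder API differs between the Dataflow SDK 1.x (.withCoder on Read/Write transforms, as described above) and the Beam-based 2.x SDK; here is a hedged Beam-style sketch, assuming an existing Pipeline p:

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// Tell the SDK how these elements are encoded/decoded to and from byte
// strings when pipeline data is created from local data.
PCollection<String> labels = p.apply(
    Create.of("red", "green", "blue").withCoder(StringUtf8Coder.of()));

// A coder can also be set explicitly on an existing PCollection.
PCollection<Integer> counts = ...;
counts.setCoder(VarIntCoder.of());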

19
Q

What are the PCollection limitations?

A
  1. A PCollection is immutable. Once created, you cannot add, remove, or change individual elements. (A transform instead produces a new PCollection.)
  2. A PCollection does not support random access to individual elements.
  3. A PCollection belongs to the pipeline in which it is created. You cannot share a PCollection between Pipeline objects.
20
Q

What is meant by bounded and unbounded PCollections, and what data-sources/sinks create and accept them?

A
  1. Bounded
    • fixed data set, known size that does not change
    • server logs from a month, orders from last week
    • TextIO, BigQueryIO, DataStoreIO and Custom Bounded IO produce and accept bounded PCollections
  2. Unbounded
    • continuously updating data set; streaming data
    • server logs as they are generated; all new orders as they are processed
    • PubsubIO, Custom Unbounded IO produce
    • PubsubIO, BigQueryIO accept
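A hedged Beam Java sketch of both cases, assuming an existing Pipeline p; the bucket path and subscription name are illustrative:

import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.values.PCollection;

// Bounded: a fixed, known-size data set read from files in Cloud Storage.
PCollection<String> lastWeeksOrders = p.apply("ReadOrders",
    TextIO.read().from("gs://my-bucket/orders/week1-*.txt"));

// Unbounded: a continuously updating stream read from a Pub/Sub subscription.
PCollection<String> liveOrders = p.apply("ReadLiveOrders",
    PubsubIO.readStrings().fromSubscription("projects/my-project/subscriptions/orders-sub"));
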
21
Q

What is Dataflow’s default Windowing behavior?

A

Dataflow’s default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections. Before you use a grouping transform such as GroupByKey on an unbounded PCollection, you must set a non-global windowing function.

If you don’t set a non-global windowing function for your unbounded PCollection and subsequently use a grouping transform such as GroupByKey or Combine, your pipeline will generate an error upon construction and your Dataflow job will fail.

You can alternatively set a non-default Trigger for a PCollection to allow the global window to emit “early” results under some other conditions. Triggers can also be used to allow for conditions such as late-arriving data.
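
A hedged sketch of setting a non-global windowing function before grouping an unbounded PCollection; the window size and element types are assumed:

import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// An unbounded PCollection of key/value events, e.g. parsed from Pub/Sub.
PCollection<KV<String, Integer>> events = ...;

// Replace the default single global window with 1-minute fixed windows
// so that GroupByKey has finite, per-window groups to emit.
PCollection<KV<String, Iterable<Integer>>> grouped = events
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(1))))
    .apply(GroupByKey.<String, Integer>create());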

22
Q

What I/O APIs (read and write transforms) are available in the Dataflow SDKs?

A
  1. Avro files
  2. BigQuery tables
  3. Bigtable
  4. Datastore
  5. Pub/Sub
  6. Text files
  7. Custom data sources and sinks
23
Q

What security does Dataflow provide to keep your data secure and private?

A
  • Pipeline Submission - project level account permission (email address based); auth required for gcloud and then pipelines submitted over HTTPS.
  • Pipeline Evaluation - temporary data stored in Cloud Storage is encrypted at rest and does not persist after pipeline evaluation concludes.
  • Cloud Logging only shows what your code produces
  • In-Flight Data - communications between sources, workers and sinks are encrypted and carried over HTTPS.
  • Data Locality - Zone selection and VPN provide narrow and secure options for pipeline logic evaluation on the workers or compute instances
24
Q

How is a Dataflow Pipeline constructed, what are the general steps?

A
  1. Create a Pipeline object
  2. Use a Read or Create transform to create one or more PCollections for your pipeline data.
  3. Apply transforms to each PCollection to change, filter, group, analyze or process elements in a PCollection. Each transform creates a new output PCollection, to which additional transforms may be applied until processing is complete.
  4. Write or otherwise output the final, transformed PCollection(s)
  5. Run the pipeline

————– code example ———————————————

// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline.
Pipeline p = Pipeline.create(options);
// Read the data
PCollection<String> lines = p.apply(
    TextIO.Read.named("ReadMyFile").from("gs://some/inputData.txt"));
// Apply transforms
PCollection<String> words = ...;
PCollection<String> reversedWords = words.apply(new ReverseWords());
// Write out final pipeline data
PCollection<String> filteredWords = ...;
filteredWords.apply(TextIO.Write.named("WriteMyFile").to("gs://some/outputData.txt"));
// Run the pipeline
p.run();
25
Q

What are common uses for Streaming Data?

A

Telemetry data—Internet of Things (IoT) devices are network-connected devices that gather data from the surrounding environment through sensors. Although each device might send only a single data point every minute, when you multiply that data by a large number of devices, you quickly need to apply big data strategies and patterns.

User events and analytics—A mobile app might log user events when the user opens the app and whenever an error or crash occurs. The aggregate of this data, across all mobile devices where the app is installed, can provide valuable information about usage, metrics, and code quality.

26
Q

What is Windowing?

A

In the Dataflow model, any PCollection can be subdivided into logical windows. Each element in a PCollection gets assigned to one or more windows according to the PCollection’s windowing function, and each individual window contains a finite number of elements. Grouping transforms then consider each PCollection’s elements on a per-window basis. GroupByKey, for example, implicitly groups the elements of a PCollection by key and window. Dataflow only groups data within the same window, and doesn’t group data in other windows.
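
A hedged sketch of some common windowing functions in the Beam Java SDK; the durations are illustrative, not defaults:

import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<String> events = ...;

// Fixed (tumbling) windows: every element lands in exactly one 1-minute window.
PCollection<String> fixedWindowed =
    events.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))));

// Sliding windows: 5-minute windows starting every minute, so an element
// can belong to more than one window.
PCollection<String> slidingWindowed = events.apply(Window.<String>into(
    SlidingWindows.of(Duration.standardMinutes(5)).every(Duration.standardMinutes(1))));

// Session windows (per key): a window closes after a 10-minute gap in activity.
PCollection<String> sessionWindowed = events.apply(
    Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));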