4_Dataflow Flashcards

1
Q

What is Dataflow?

  • Auto-scaling, NoOps (fully-managed), serverless, Stream and Batch Processing.
  • Global in scope
  • Built on Apache Beam.
  • Supports expressive SQL, Java and Python APIs.
  • Integrates with other tools (GCP and external):
    • Natively: Pub/Sub, BigQuery, AI Platform
    • Connectors: Bigtable, Apache Kafka
  • Pipelines are regional resources: each job runs in a single region
A
2
Q

IAM - Access Control

  • Project-level only - a user has access to all pipelines in the project (or none)
  • Dataflow Admin: Minimal role for creating and managing Dataflow jobs.
  • Dataflow Developer: Provides the permissions necessary to execute and manipulate Dataflow jobs.
  • Dataflow Viewer: Provides read-only access to all Dataflow-related resources.
  • Dataflow Worker: Provides the permissions necessary for a Compute Engine service account to execute work units for a Dataflow pipeline.

Pipeline data access is separated from pipeline access.

For Dataflow users, use these Dataflow-specific roles to limit access to Dataflow resources only, rather than granting broader project-level roles

A
3
Q

Data Access and Security

The Dataflow service uses several security mechanisms to keep your data secure and private. These mechanisms apply to the following scenarios:

  • When you submit a pipeline to the service
  • When the service evaluates your pipeline
  • When you request access to telemetry and metrics during and after pipeline execution
A
4
Q

Security and Permissions

Dataflow pipelines can be run either locally (to perform tests on small datasets), or on managed Google Cloud resources using the Dataflow managed service. Whether running locally or in the cloud, your pipeline and its workers use a permissions system to maintain secure access to pipeline files and resources.

When you run your pipeline, Dataflow uses two service accounts to manage security and permissions: the Dataflow service account and the controller service account. The Dataflow service uses the Dataflow service account as part of the job creation request, and during job execution to manage the job. Worker instances use the controller service account to access input and output resources after you submit your job.

Cloud Dataflow service account:

  • Automatically created
  • Manipulates job resources
  • Cloud Dataflow service agent role
  • Read/Write access to project resources

Controller service account:

  • Used by the Compute Engine worker instances to access input and output resources
  • Used for metadata operations
  • Defaults to the Compute Engine default service account; a user-managed controller service account can be specified instead
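
A minimal sketch of supplying a user-managed controller service account with the Beam Python SDK; the project, bucket, and service account names are hypothetical placeholders.

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions([
      '--runner=DataflowRunner',
      '--project=my-project',                 # hypothetical project ID
      '--region=us-central1',
      '--temp_location=gs://my-bucket/temp',  # hypothetical bucket
      # Workers run as this account instead of the default Compute Engine account:
      '--service_account_email=dataflow-worker@my-project.iam.gserviceaccount.com',
  ])

  with beam.Pipeline(options=options) as p:
      _ = p | beam.Create(['hello', 'world']) | beam.Map(print)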
A
5
Q

Apache Beam concepts

  • Pipeline: encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data. The input source and output sink can be the same or of different types, allowing you to convert data from one format to another. Apache Beam programs start by constructing a Pipeline object, and then using that object as the basis for creating the pipeline’s datasets. Each pipeline represents a single, repeatable job.
  • Element: single entry of data (e.g. table row).
  • PCollection: represents a potentially distributed, multi-element dataset that acts as the pipeline’s data. Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline. A PCollection can hold a dataset of a fixed size (bounded, for batch) or an unbounded dataset from a continuously updating data source (for stream).
  • Transform: represents a processing operation that transforms data. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces one or more PCollections as output. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.
  • Runners: are the software that accepts a pipeline and executes it. Most runners are translators or adapters to massively parallel big-data processing systems. Other runners exist for local testing and debugging. Supported runners:
    • Apache Flink
    • Apache Samza
    • Apache Spark
    • Google Cloud Dataflow
    • Hazelcast Jet
    • Twister2
    • Direct Runner
      • Using a DirectRunner configuration with a staging storage bucket is the quickest and easiest way of testing a new pipeline, without risking changes to a pipeline that is currently in production.
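
A sketch of how the same Beam Python pipeline can be pointed at different runners purely through pipeline options; the project and bucket names are hypothetical placeholders.

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  # Local testing and debugging with the Direct Runner:
  local_options = PipelineOptions(['--runner=DirectRunner'])

  # Managed execution on the Google Cloud Dataflow runner:
  cloud_options = PipelineOptions([
      '--runner=DataflowRunner',
      '--project=my-project',
      '--region=us-central1',
      '--staging_location=gs://my-bucket/staging',
      '--temp_location=gs://my-bucket/temp',
  ])

  with beam.Pipeline(options=local_options) as p:
      (p
       | 'Create' >> beam.Create([1, 2, 3])
       | 'Square' >> beam.Map(lambda x: x * x)
       | 'Print' >> beam.Map(print))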
A
6
Q

Pipeline Creation

  • Create a Pipeline object
  • Create an initial PCollection using a Read or Create transform
  • Apply further transforms as required
  • Write out the final PCollection (see the sketch below)
  • A pipeline is essentially a DAG, with transforms as the nodes and PCollections as the edges connecting them.
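
A sketch of these steps with the Beam Python SDK; the input and output paths are hypothetical placeholders.

  import apache_beam as beam
  from apache_beam.options.pipeline_options import PipelineOptions

  with beam.Pipeline(options=PipelineOptions()) as p:                           # 1. pipeline object
      lines = p | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')    # 2. initial PCollection
      counts = (lines                                                           # 3. chained transforms
                | 'Split' >> beam.FlatMap(lambda line: line.split())
                | 'Pair' >> beam.Map(lambda word: (word, 1))
                | 'Count' >> beam.CombinePerKey(sum))
      counts | 'Write' >> beam.io.WriteToText('gs://my-bucket/output')          # 4. write final PCollection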
A
7
Q

Updating Existing Pipeline

When you update a job on the Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Dataflow service retains the job name, but runs the replacement job with an updated Job ID.

The replacement job preserves any intermediate state data from the prior job, as well as any buffered data records or metadata currently “in-flight” from the prior job. For example, some records in your pipeline might be buffered while waiting for a window to resolve.

“In-flight” data will still be processed by the transforms in your new pipeline. However, additional transforms that you add in your replacement pipeline code may or may not take effect, depending on where the records are buffered.

When you launch your replacement job, you’ll need to set the following pipeline options to perform the update process in addition to the job’s regular options:

  • Pass the --update option.
  • Set the --jobName option in PipelineOptions to the same name as the job you want to update.
  • If any transform names in your pipeline have changed, you must supply a transform mapping and pass it using the --transformNameMapping option.
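
The options above are listed with their Java SDK names; the following is a sketch of the Python SDK equivalents (the job, project, and transform names are hypothetical).

  from apache_beam.options.pipeline_options import PipelineOptions

  options = PipelineOptions([
      '--runner=DataflowRunner',
      '--project=my-project',
      '--region=us-central1',
      '--update',                                                 # replace the running job
      '--job_name=my-streaming-job',                              # must match the existing job's name
      '--transform_name_mapping={"OldStepName": "NewStepName"}',  # only if transform names changed
  ])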
A
8
Q

ParDo Transforms

ParDo is the core parallel processing operation in the Apache Beam SDKs, invoking a user-specified function on each element of the input PCollection. DoFn is the template you use to create the user-defined functions that a ParDo references.

ParDo collects zero or more output elements into an output PCollection. The ParDo transform processes elements independently and possibly in parallel.

The ParDo processing paradigm is similar to the “Map” phase of a Map/Shuffle/Reduce-style algorithm.

ParDo is useful for:

  • Filtering and emitting input
  • Type conversion
  • Extracting parts of the input and calculating values from different parts of the input
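
A sketch of a DoFn applied with ParDo, filtering and emitting parts of each input element (the comma-separated input format is just an illustration).

  import apache_beam as beam

  class ExtractLongWords(beam.DoFn):
      def process(self, element):
          # A DoFn may emit zero or more output elements per input element.
          for word in element.split(','):
              if len(word) >= 5:          # filtering
                  yield word.strip()      # extracting and emitting

  with beam.Pipeline() as p:
      (p
       | beam.Create(['alpha,beta,gamma', 'longerword,hi'])
       | beam.ParDo(ExtractLongWords())
       | beam.Map(print))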
A
9
Q

Aggregation Transforms

Aggregation is the process of computing some value from multiple input elements. The primary computational pattern for aggregation in Apache Beam is to group all elements with a common key and window. Then, it combines each group of elements using an associative and commutative operation.

Example: GroupByKey is a Beam transform for processing collections of key/value pairs. It’s a parallel reduction operation, analogous to the Shuffle phase of a Map/Shuffle/Reduce-style algorithm.

Note: The Flatten transform merges multiple PCollection objects into a single logical PCollection, whereas join transforms such as CoGroupByKey merge data where there are related keys in two or more datasets.
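
A sketch of GroupByKey followed by a per-group combine; the sample key/value data is made up.

  import apache_beam as beam

  with beam.Pipeline() as p:
      (p
       | beam.Create([('cat', 1), ('dog', 5), ('cat', 9), ('dog', 2)])
       | beam.GroupByKey()                          # -> ('cat', [1, 9]), ('dog', [5, 2])
       | beam.Map(lambda kv: (kv[0], sum(kv[1])))   # combine each group of values
       | beam.Map(print))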

A
10
Q

Dealing with late/out of order data

  • Latency is to be expected (network latency, processing time, etc.)
  • Pub/Sub does not handle late or out-of-order data; that is resolved in Dataflow
  • This is resolved with Windows, Watermarks and Triggers
A
11
Q

Event Time vs Processing Time

Event Time is the time a data event occurs, determined by the timestamp on the data element itself (i.e. when data was generated). This contrasts with the time the actual data element gets processed at any stage in the pipeline (i.e. Processing time).
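
When the source does not assign timestamps automatically, event time can be attached to each element explicitly. A sketch with the Beam Python SDK, assuming each element is a dict with a hypothetical 'ts' field holding a Unix timestamp:

  import apache_beam as beam
  from apache_beam.transforms.window import TimestampedValue

  def add_event_timestamp(element):
      # Event time comes from the data itself; processing time is simply
      # whenever Dataflow happens to execute this step.
      return TimestampedValue(element, element['ts'])

  # Usage: events | beam.Map(add_event_timestamp)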

A
12
Q

Windowing

Windowing enables grouping operations over unbounded collections by dividing the collection into windows of finite collections according to the timestamps of the individual elements. A windowing function tells the runner how to assign elements to an initial window, and how to merge windows of grouped elements. Apache Beam lets you define different kinds of windows or use the predefined windowing functions.
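
A sketch of assigning elements to 60-second fixed windows before grouping, using explicit event timestamps; the sample data is made up.

  import apache_beam as beam
  from apache_beam.transforms.window import FixedWindows, TimestampedValue

  with beam.Pipeline() as p:
      (p
       | beam.Create([('user1', 10.0), ('user1', 70.0), ('user2', 15.0)])
       | beam.Map(lambda kv: TimestampedValue(kv, kv[1]))  # value doubles as the event timestamp
       | beam.WindowInto(FixedWindows(60))                 # 60-second fixed windows
       | beam.GroupByKey()                                 # grouping now happens per key and per window
       | beam.Map(print))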

A
13
Q

Watermarks

Apache Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Apache Beam tracks a watermark because data is not guaranteed to arrive in a pipeline in time order or at predictable intervals. In addition, there are no guarantees that data events will appear in the pipeline in the same order that they were generated.

Data that arrives with a timestamp that is inside the window but past the watermark is considered late data.

A
14
Q

Triggers

Triggers determine when to emit aggregated results as data arrives. For bounded data, results are emitted after all of the input has been processed. For unbounded data, results are emitted when the watermark passes the end of the window, indicating that the system believes all input data for that window has been processed. Apache Beam provides several predefined triggers and lets you combine them.

  • Allow late-arriving data, within the allowed lateness window, to re-aggregate previously emitted results
  • A trigger can be based on any of the following (or a combination):
    • Event time, as indicated by the timestamp on each data element.
    • Processing time, which is the time that the data element is processed at any given stage in the pipeline.
    • Number of data elements in a collection.
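
A sketch of a composite trigger on 60-second fixed windows with the Beam Python SDK, firing early on processing time, at the watermark, and again for late data; all durations are illustrative.

  import apache_beam as beam
  from apache_beam.transforms.window import FixedWindows
  from apache_beam.transforms.trigger import (
      AccumulationMode, AfterCount, AfterProcessingTime, AfterWatermark)

  windowing = beam.WindowInto(
      FixedWindows(60),
      trigger=AfterWatermark(
          early=AfterProcessingTime(30),   # speculative firing every 30s of processing time
          late=AfterCount(1)),             # re-fire for each late element
      accumulation_mode=AccumulationMode.ACCUMULATING,
      allowed_lateness=120)                # accept data up to 2 minutes past the watermark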
A
15
Q

Windows and Windowing functions

Windowing functions divide unbounded collections (i.e. streaming data) into logical components, or windows. Windowing functions group unbounded collections by the timestamps of the individual elements. Each window contains a finite number of elements.

You set the following windows with the Apache Beam SDK or Dataflow SQL streaming extensions:

  • Tumbling windows (called fixed windows in Apache Beam)
  • Hopping windows (called sliding windows in Apache Beam)
  • Session windows
  • Single global window: the default; all elements of a PCollection are placed in one window.
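
A sketch of the Beam windowing functions corresponding to each type above; durations are in seconds and purely illustrative.

  from apache_beam.transforms.window import (
      FixedWindows, GlobalWindows, Sessions, SlidingWindows)

  tumbling = FixedWindows(60)                    # fixed (tumbling): one-minute windows
  hopping = SlidingWindows(size=60, period=30)   # sliding (hopping): one-minute windows every 30s
  sessions = Sessions(gap_size=600)              # session: close after 10 minutes of inactivity
  default = GlobalWindows()                      # single global window (the default)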
A
16
Q

Tumbling windows

A tumbling window represents a consistent, disjoint time interval in the data stream.

For example, if you set a thirty-second tumbling window, the elements with timestamp values [0:00:00-0:00:30) are in the first window. Elements with timestamp values [0:00:30-0:01:00) are in the second window.
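
As a sketch, the thirty-second tumbling window above corresponds to FixedWindows in the Beam Python SDK:

  import apache_beam as beam
  from apache_beam.transforms.window import FixedWindows

  thirty_second_windows = beam.WindowInto(FixedWindows(30))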

A
17
Q

Hopping windows

A hopping window represents a consistent time interval in the data stream. Hopping windows can overlap, whereas tumbling windows are disjoint.

For example, a hopping window can start every thirty seconds and capture one minute of data. The frequency with which hopping windows begin is called the period. This example has a one-minute window and a thirty-second period.

To take running averages of data, use hopping windows. You can use one-minute hopping windows with a thirty-second period to compute a one-minute running average every thirty seconds.
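
A sketch of that running average with the Beam Python SDK: a one-minute sliding window starting every thirty seconds, averaged per window (the sample (timestamp, value) pairs are made up).

  import apache_beam as beam
  from apache_beam.transforms.combiners import MeanCombineFn
  from apache_beam.transforms.window import SlidingWindows, TimestampedValue

  with beam.Pipeline() as p:
      (p
       | beam.Create([(0.0, 10), (20.0, 30), (45.0, 50)])           # (event time, value)
       | beam.Map(lambda tv: TimestampedValue(tv[1], tv[0]))
       | beam.WindowInto(SlidingWindows(size=60, period=30))        # 1-minute window, 30s period
       | beam.CombineGlobally(MeanCombineFn()).without_defaults()   # average per window
       | beam.Map(print))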

A
18
Q

Session windows

A session window contains elements within a gap duration of another element. The gap duration is an interval between new data in a data stream. If data arrives after the gap duration, the data is assigned to a new window.

For example, session windows can divide a data stream representing user mouse activity. This data stream might have long periods of idle time interspersed with many clicks. A session window can contain the data generated by the clicks.

Session windowing assigns different windows to each data key. Tumbling and hopping windows contain all elements in the specified time interval, regardless of data keys.
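
A sketch of per-key session windows with a ten-minute gap; the sample user activity data is made up.

  import apache_beam as beam
  from apache_beam.transforms.window import Sessions, TimestampedValue

  with beam.Pipeline() as p:
      (p
       | beam.Create([('user1', 0.0), ('user1', 30.0), ('user1', 2000.0)])
       | beam.Map(lambda kv: TimestampedValue(kv, kv[1]))  # value doubles as the event timestamp
       | beam.WindowInto(Sessions(gap_size=600))           # new session after 600s of inactivity
       | beam.GroupByKey()                                 # sessions are computed per key
       | beam.Map(print))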

A
19
Q

Dataflow Templates

To give Dataflow developers and Dataflow consumers different levels of access, use Dataflow templates.

Templates separate the development activities (and the developers) from the execution activities: a consumer can launch a staged template without access to the pipeline code or a development environment.

A
20
Q

Side Inputs

In streaming analytics applications, it is common to enrich data with additional information that might be useful for further analysis. For example, if you have the storeId for a transaction, you might want to add information about the store location. You would typically add this additional information by taking an element and “denormalizing” it by bringing in information from a lookup table.

A side input is an additional input that your DoFn can access each time it processes an element in the input PCollection.

Side inputs are loaded in memory and are therefore cached automatically.
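
A sketch of the store-location enrichment described above, passing a small lookup table as a dict-valued side input; all names and data are hypothetical.

  import apache_beam as beam

  with beam.Pipeline() as p:
      stores = p | 'Stores' >> beam.Create([('s1', 'Berlin'), ('s2', 'Tokyo')])
      transactions = p | 'Txns' >> beam.Create([
          {'storeId': 's1', 'amount': 10.0},
          {'storeId': 's2', 'amount': 25.5},
      ])

      enriched = transactions | 'Enrich' >> beam.Map(
          lambda txn, locations: {**txn, 'location': locations.get(txn['storeId'])},
          locations=beam.pvalue.AsDict(stores))     # side input, available for every element

      enriched | beam.Map(print)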

A