Dataproc and Pub/Sub Flashcards

1
Q

What is Cloud Dataproc and when would I use it?

A
  • fully-managed service for running Apache Spark and Apache Hadoop clusters
  • Integrates with other GCP services for data-processing, analytics and machine learning
  • Apache Spark: compute engine for Hadoop data
  • Apache Hadoop: distributed processing of large data sets
  • Great for lifting existing data processing pipelines that use Spark, Hive or Pig onto GCP
2
Q

What is the difference between Dataflow and Dataproc?

A
  • Dataflow is a fully-managed, self-optimizing service that uses Apache Beam for writing batch and streaming data processing pipelines. It integrates with many other GCP sources and sinks
  • Dataproc is a fully-managed service for running Apache Hadoop and Apache Spark clusters. Great if you have existing pipelines using Spark, Hive or Pig; also integrated with other GCP services
3
Q

What are the main features of Dataproc?

A
  1. Automated cluster management
  2. Resizable clusters
  3. Integrated
  4. Versioning
  5. Highly available
  6. Developer tools
  7. Initialization Actions
  8. Automatic or Manual Configuration
  9. Flexible virtual machines - custom or preemptible
4
Q

What is Cloud Pub/Sub and how would I use it?

A
  • a scalable foundation for stream analytics and event-driven systems
  • Enterprise message-oriented middleware
  • Provides asynchronous messaging between senders and receivers
  • As part of GCP’s ‘stream analytics solution’, ingests event streams and delivers them to Cloud Dataflow for processing and BigQuery analysis
  • Scales to millions of messages per second and only pay for what you use
  • Use Cloud Pub/Sub to simplify scalable, distributed systems. All published data is synchronously replicated across availability zones to ensure that messages are available to consumers for processing as soon as they are ready. Fine-grained access controls allow for sophisticated cross-team and organizational data sharing. And end-to-end encryption adds security to your pipelines.
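The publish/subscribe pattern behind Pub/Sub can be illustrated with a toy in-process sketch (this is not the real `google-cloud-pubsub` API, just the fan-out/pull mechanics; the class and method names are illustrative):

```python
from collections import defaultdict, deque

class Topic:
    """Toy stand-in for a Pub/Sub topic: every subscription attached to
    the topic receives its own independent copy of each message."""
    def __init__(self):
        self.subscriptions = defaultdict(deque)  # name -> pending messages

    def subscribe(self, name):
        self.subscriptions[name]  # create the subscription's queue
        return name

    def publish(self, message):
        # Fan out: each subscription gets its own copy of the message.
        for queue in self.subscriptions.values():
            queue.append(message)

    def pull(self, name):
        """Pull-style delivery: return the next message, or None if empty."""
        queue = self.subscriptions[name]
        return queue.popleft() if queue else None

topic = Topic()
topic.subscribe("analytics")
topic.subscribe("audit-log")
topic.publish({"event": "signup", "user": 42})

print(topic.pull("analytics"))  # each subscriber sees the message
print(topic.pull("audit-log"))
print(topic.pull("analytics"))  # None: this subscription already consumed it
```

Publisher and subscribers never interact directly, which is what makes the messaging asynchronous: the topic decouples senders from receivers.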
5
Q

What are some common use-cases for Pub/Sub?

A
  1. Balancing workloads, e.g. among Compute Engine instances
  2. Implementing asynchronous workflows
  3. Distributing event notifications
  4. Refreshing distributed caches
  5. Logging to multiple systems
  6. Data streaming from various processes or devices
  7. Reliability improvement: failover recovery
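Use-case 1 (workload balancing) can be sketched in plain Python: a shared queue spreads tasks evenly across however many workers pull from it (the worker names and round-robin pulling order here are illustrative, not how Pub/Sub schedules delivery):

```python
from collections import deque
from itertools import cycle

# A shared task queue balances work: each worker pulls the next task,
# so load spreads across the pool automatically.
tasks = deque(range(6))
workers = {"w1": [], "w2": [], "w3": []}

for name in cycle(workers):  # workers take turns pulling
    if not tasks:
        break
    workers[name].append(tasks.popleft())

print(workers)  # {'w1': [0, 3], 'w2': [1, 4], 'w3': [2, 5]}
```

Adding a worker requires no change to the publisher; it just starts pulling from the same queue.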
6
Q

What are the benefits and features of Pub/Sub?

A
  1. Unified messaging - durability and low latency in one place
  2. Global presence - connect services anywhere in the world
  3. Flexible delivery options - push and pull supported
  4. Data reliability - replicated storage and guaranteed message delivery
  5. End-to-end reliability - application-level acknowledgement
  6. Data security and protection - encrypted on the wire and at rest
  7. Flow control - dynamic rate limiting
  8. Simplicity - easy-to-use REST/JSON API
7
Q

What four critical questions must be asked when building a data processing pipeline?

A
  1. What results are calculated? Sums, joins, histograms, machine learning models?
  2. Where in event time are results calculated? Does the time each event originally occurred affect results? Are results aggregated in fixed windows, sessions, or a single global window?
  3. When in processing time are results materialized? Does the time each event is observed within the system affect results? When are results emitted? Speculatively, as data evolve? When data arrive late and results must be revised? Some combination of these?
  4. How do refinements of results relate? If additional data arrive and results change, are they independent and distinct, do they build upon one another, etc.?

Example:

What? Sums of integers, keyed by team.

Where? Within fixed event-time windows of one hour.

When?

Early: Every 5 minutes of processing time.

On-time: When the watermark passes the end of the window.

Late: Every 10 minutes of processing time.

Final: When the watermark passes the end of the window + two hours.

How? Panes accumulate new values into prior results.
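The "What?" and "Where?" parts of the example above (sums of integers keyed by team, in fixed one-hour event-time windows) can be sketched in plain Python; the trigger and accumulation parts are omitted for brevity, and the function name is illustrative:

```python
from collections import defaultdict

WINDOW = 3600  # fixed event-time window of one hour, in seconds

def windowed_sums(events):
    """events: iterable of (team, score, event_time_seconds).
    Groups by (team, window start) using EVENT time, not arrival order,
    so out-of-order input yields the same result."""
    sums = defaultdict(int)
    for team, score, ts in events:
        window_start = ts - ts % WINDOW  # which hour the event belongs to
        sums[(team, window_start)] += score
    return dict(sums)

events = [
    ("red",  5, 100),   # hour starting at t=0
    ("blue", 3, 200),
    ("red",  2, 3700),  # next hour
    ("red",  4, 50),    # arrives last, but still lands in hour 0
]
print(windowed_sums(events))
# {('red', 0): 9, ('blue', 0): 3, ('red', 3600): 2}
```

Because grouping keys off each event's own timestamp, the late-arriving `("red", 4, 50)` record is still counted in the correct window.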

8
Q

What are the core out-of-order processing concepts?

A
  1. Event time
    • time at which events actually occurred
    • gold standard of Windowing
  2. Processing time
    • time at which events are observed in the system
  3. Windowing
    • By processing time or by event time
    • Fixed (size) windows: a common way to chop an unbounded data set into fixed-size windows, each then processed as a separate, bounded data source
    • Sliding windows: defined by a fixed length and a period; windows may overlap when the period is shorter than the length
    • Session windows: a windowing strategy that groups a user's burst of activity, closed by a subsequent gap of inactivity
  4. Watermarks
    • Tracks point in time when all the data in a certain window can be expected to have arrived in the pipeline. Data that arrives after that is considered ‘late data’.
  5. Triggers
    • Used to determine when enough data has been collected in a window and to emit the aggregated results of that window, called a ‘pane’
  6. Accumulation
    • relationships between multiple results observed for the same Window
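The difference between fixed and sliding window assignment can be made concrete with two small pure-Python helpers (the function names are illustrative; real frameworks like Beam do this assignment internally):

```python
def fixed_window(ts, size):
    """A timestamp falls into exactly ONE fixed window of `size` seconds."""
    start = ts - ts % size
    return (start, start + size)

def sliding_windows(ts, size, period):
    """Sliding windows of length `size` start every `period` seconds;
    a timestamp can fall into SEVERAL overlapping windows."""
    windows = []
    start = ts - ts % period  # most recent window start at or before ts
    while start > ts - size:  # walk back while ts is still inside the window
        windows.append((start, start + size))
        start -= period
    return windows

print(fixed_window(130, 60))         # (120, 180): one window
print(sliding_windows(130, 60, 30))  # [(120, 180), (90, 150)]: two overlap
```

With `period == size`, sliding windows degenerate into fixed windows, which is why fixed windows are often described as a special case.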
9
Q

What is streaming data processing?

A
  • an execution engine designed for unbounded data sets
  • input is infinite and may arrive unordered
  • low-latency, approximate or speculative results
10
Q

What is the Lambda Architecture?

A

Running a streaming system alongside a batch system, both performing the same calculation.

The streaming system gives you low-latency but inaccurate results, and the batch system later provides the correct ones.

  • maintenance headache: you run two systems and must merge their pipelines' results
  • great for ‘re-processing’ requirements
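The Lambda read path (prefer the accurate-but-delayed batch view, fall back to the fast streaming view) can be sketched as follows; the view names and counts are hypothetical:

```python
def serve(key, batch_view, speculative_view):
    """Lambda-style serving: prefer the exact, periodically recomputed
    batch view; fall back to the fast, approximate streaming view for
    keys the batch layer has not yet processed."""
    if key in batch_view:
        return batch_view[key], "batch"
    return speculative_view.get(key, 0), "streaming"

batch_view = {"page_a": 1000}                      # exact, but delayed
speculative_view = {"page_a": 1012, "page_b": 7}   # approximate, real-time

print(serve("page_a", batch_view, speculative_view))  # batch wins when present
print(serve("page_b", batch_view, speculative_view))  # streaming fills the gap
```

The maintenance headache is visible even in this sketch: the same counting logic must exist twice, once per system, and the two views must be reconciled at read time.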

11
Q

What is the Kappa Architecture? (Kafka)

A
  • append-only immutable log
  • the log streams to compute engines and data stores
  • a simplification of the Lambda Architecture, with the batch system removed
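The append-only log at the heart of Kappa can be sketched in a few lines: consumers track their own offsets, so "re-processing" is just resetting an offset and replaying (a toy illustration of the idea behind Kafka's log, not Kafka's actual API):

```python
class Log:
    """Append-only, immutable log. Records are never modified or deleted;
    consumers read from an offset they manage themselves."""
    def __init__(self):
        self._entries = []

    def append(self, record):
        self._entries.append(record)
        return len(self._entries) - 1  # offset of the new record

    def read_from(self, offset):
        """Replay every record from `offset` onward."""
        return self._entries[offset:]

log = Log()
for record in ["a", "b", "c"]:
    log.append(record)

print(log.read_from(0))  # full replay: 're-processing' the whole history
print(log.read_from(2))  # resume from a consumer's saved offset
```

Because the full history lives in one log, a single streaming job replayed from offset 0 replaces the Lambda batch layer.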