Data Engineering for Streaming Data Flashcards
What is batch processing?
Batch processing is when the processing and analysis happen on a set of stored data.
What is streaming data?
Streaming data is a flow of data records generated by various data sources.
What does streaming data processing mean?
Streaming data processing means that the data is analyzed in near real-time and that actions will be taken on the data as quickly as possible.
What is a tool for distributed message-oriented architectures?
Pub/Sub is a tool for handling distributed message-oriented architectures at scale.
What is Pub/Sub?
The name is short for Publisher/Subscriber: publishers send messages to subscribers.
Pub/Sub is a distributed messaging service that can receive messages from a variety of device streams, such as gaming events, IoT devices, and application streams (see the publish sketch below).
Its key features:
- at-least-once delivery of received messages to subscribing applications
- no provisioning required
- open APIs
- global service by default
- end-to-end encryption
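For illustration, here is a minimal publish sketch, assuming the google-cloud-pubsub Python client; the project name, topic name, and payload are hypothetical.

```python
from google.cloud import pubsub_v1

# Create a publisher client and build the fully qualified topic path.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "gaming-events")  # hypothetical names

# Messages are raw bytes; extra keyword arguments become string attributes.
future = publisher.publish(topic_path, data=b'{"player": "p1", "score": 100}')
print(f"Published message ID: {future.result()}")  # result() blocks until the server acknowledges
```

On the receiving side, subscriber applications attach to a subscription on the topic and pull (or receive pushed) copies of each message.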
Give a general use case for Pub/Sub.
- Pub/Sub reads, stores, and broadcasts to any subscribers of a data topic that new messages are available.
- As a subscriber of Pub/Sub, Dataflow can ingest and transform those messages in an elastic streaming pipeline and output the results into an analytics data warehouse like BigQuery.
- Finally, you can connect a data visualization tool, like Looker, to visualize and monitor the results of a pipeline, or an AI/ML tool such as Vertex AI to explore the data to uncover business insights or help with predictions (see the pipeline sketch below).
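As a hedged sketch of this pattern, the following Apache Beam (Python SDK) pipeline reads from a Pub/Sub topic, transforms the messages, and writes them to BigQuery; the topic, table, and schema are assumptions for illustration.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming mode keeps the pipeline running on unbounded Pub/Sub input.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/gaming-events")
        | "ParseJson" >> beam.Map(json.loads)  # bytes -> dict rows matching the schema
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.game_events",
            schema="player:STRING,score:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```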
How is Dataflow used?
Dataflow creates a pipeline to process both streaming data and batch data.
This is where ETL (extract, transform, load) happens; a batch sketch follows below.
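For contrast with the streaming sketch above, here is a minimal batch ETL sketch in the same Beam model; the bucket, file, table, and schema are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Transform step: turn "player,score" lines into typed records.
    player, score = line.split(",")
    return {"player": player, "score": int(score)}

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Extract" >> beam.io.ReadFromText("gs://my-bucket/scores.csv")
        | "Transform" >> beam.Map(parse_csv)
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:analytics.scores",
            schema="player:STRING,score:INTEGER",
        )
    )
```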
What is a popular solution for pipeline design?
Apache Beam.
It’s an open-source, unified programming model for defining and executing data processing pipelines.
List the advantages of Apache Beam.
- Apache Beam is unified, which means it uses a single programming model for both batch and streaming data.
- It’s portable, which means it can work on multiple execution environments, like Dataflow and Apache Spark, among others.
- It’s extensible, which means it allows you to write and share your own connectors and transformation libraries.
- It provides pipeline templates, so you don’t need to build a pipeline from scratch, and you can write pipelines in Java, Python, or Go.
- The Apache Beam software development kit, or SDK, is a collection of software development tools in one installable package. It provides a variety of libraries for transformations and data connectors to sources and sinks.
- Apache Beam creates a model representation from your code that is portable across many runners. Runners pass off your model for execution on a variety of different possible engines, with Dataflow being a popular choice (see the runner sketch below).
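To make the portability point concrete, here is a sketch of swapping runners purely through pipeline options; the project, region, and bucket are hypothetical.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code can run locally for testing...
local_options = PipelineOptions(runner="DirectRunner")

# ...or on Dataflow, by changing only the options, not the pipeline itself.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)
```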
What is Dataflow?
- Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem.
- It handles much of the complexity relating to infrastructure setup and maintenance and is built on Google’s infrastructure, which allows for reliable autoscaling to meet data pipeline demands.
- Dataflow is serverless and NoOps, short for “no operations.”
What is a NoOps environment?
A NoOps environment is one that doesn’t require management from an operations team, because maintenance, monitoring, and scaling are automated.
What is Serverless computing?
Serverless computing is a cloud computing execution model in which the provider, for example Google Cloud, manages infrastructure tasks on behalf of users. This includes tasks like resource provisioning, performance tuning, and ensuring pipeline reliability.
Describe the list of Dataflow templates.
They can be broken down into three categories (a launch sketch follows the list):
- streaming templates (for processing continuous, or real-time, data)
- batch templates (for processing bulk data, or batch load data)
- utility templates (for activities related to bulk compression, deletion, and conversion)
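As a hedged sketch, a Google-provided streaming template can be launched programmatically through the Dataflow REST API; the template path and parameter names below are assumptions based on the classic Pub/Sub-to-BigQuery template, and the project, topic, and table are hypothetical.

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Launch a stored template rather than building a pipeline from scratch.
request = dataflow.projects().templates().launch(
    projectId="my-project",
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",  # assumed template path
    body={
        "jobName": "pubsub-to-bq-demo",
        "parameters": {
            "inputTopic": "projects/my-project/topics/gaming-events",
            "outputTableSpec": "my-project:analytics.game_events",
        },
    },
)
response = request.execute()
print(response)
```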
What tools are available to interact with and visualize data?
Looker and Google Data Studio.
Describe key features of Looker.
- Looker supports BigQuery, as well as more than 60 different SQL databases.
- It allows developers to define a semantic modeling layer on top of databases using Looker Modeling Language, or LookML.
- The Looker platform is 100% web-based.
- There is also a Looker API, which can be used to embed Looker reports in other applications (see the sketch below).
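As a minimal sketch of the API point, assuming the looker-sdk Python package (with credentials supplied via a looker.ini file or environment variables), a saved Look can be run and its results fetched for use in another application; the Look ID is hypothetical.

```python
import looker_sdk

# init40() authenticates against the Looker API 4.0 endpoint using
# configuration from looker.ini or LOOKERSDK_* environment variables.
sdk = looker_sdk.init40()

# Run a saved Look (ID assumed) and fetch its results as JSON.
result = sdk.run_look(look_id="42", result_format="json")
print(result)
```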