Data Engineering for Streaming Data Flashcards
What is batch processing?
Batch processing is when the processing and analysis happen on a set of stored data.
What is streaming data?
Streaming data is a flow of data records generated by various data sources.
What does streaming data processing mean?
Streaming data processing means that the data is analyzed in near real-time and that actions will be taken on the data as quickly as possible.
What is a tool for distributed message-oriented architectures?
Pub/Sub is a tool for handling distributed message-oriented architectures at scale.
What is Pub/Sub?
The name is short for Publisher/Subscriber: publishers send messages to subscribers.
Pub/Sub is a distributed messaging service that can receive messages from a variety of device streams, such as gaming events, IoT devices, and application streams (see the publish sketch below).
Its key features:
- at-least-once delivery of received messages to subscribing applications
- no provisioning required
- open APIs
- global service by default
- end-to-end encryption
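For illustration, here is a minimal publish sketch, assuming the google-cloud-pubsub Python client; the project name, topic name, and payload are hypothetical.

```python
from google.cloud import pubsub_v1

# Create a publisher client and build the fully qualified topic path.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "gaming-events")  # hypothetical names

# Messages are raw bytes; extra keyword arguments become string attributes.
future = publisher.publish(topic_path, data=b'{"player": "p1", "score": 100}')
print(f"Published message ID: {future.result()}")  # result() blocks until the server acknowledges
```

On the receiving side, subscriber applications attach to a subscription on the topic and pull (or receive pushed) copies of each message.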
Give a general use case for Pub/Sub.
- Pub/Sub reads, stores, and broadcasts to any subscribers of a data topic that new messages are available.
- As a subscriber of Pub/Sub, Dataflow can ingest and transform those messages in an elastic streaming pipeline and output the results into an analytics data warehouse like BigQuery.
- Finally, you can connect a data visualization tool, like Looker, to visualize and monitor the results of a pipeline, or an AI/ML tool such as Vertex AI to explore the data to uncover business insights or help with predictions (see the pipeline sketch below).
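As a hedged sketch of this pattern, the following Apache Beam (Python SDK) pipeline reads from a Pub/Sub topic, transforms the messages, and writes them to BigQuery; the topic, table, and schema are assumptions for illustration.

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Streaming mode keeps the pipeline running on unbounded Pub/Sub input.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/gaming-events")
        | "ParseJson" >> beam.Map(json.loads)  # bytes -> dict rows matching the schema
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.game_events",
            schema="player:STRING,score:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```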
How is Dataflow used?
Dataflow creates a pipeline to process both streaming data and batch data.
This is where ETL (extract, transform, load) happens; a batch sketch follows below.
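For contrast with the streaming sketch above, here is a minimal batch ETL sketch in the same Beam model; the bucket, file, table, and schema are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_csv(line):
    # Transform step: turn "player,score" lines into typed records.
    player, score = line.split(",")
    return {"player": player, "score": int(score)}

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Extract" >> beam.io.ReadFromText("gs://my-bucket/scores.csv")
        | "Transform" >> beam.Map(parse_csv)
        | "Load" >> beam.io.WriteToBigQuery(
            "my-project:analytics.scores",
            schema="player:STRING,score:INTEGER",
        )
    )
```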
What is a popular solution for pipeline design?
Apache Beam.
It’s an open-source, unified programming model for defining and executing data processing pipelines.
List the advantages of Apache Beam.
- Apache Beam is unified, which means it uses a single programming model for both batch and streaming data.
- It’s portable, which means it can work on multiple execution environments, like Dataflow and Apache Spark, among others.
- It’s extensible, which means it allows you to write and share your own connectors and transformation libraries.
- It provides pipeline templates, so you don’t need to build a pipeline from scratch, and you can write pipelines in Java, Python, or Go.
- The Apache Beam software development kit, or SDK, is a collection of software development tools in one installable package. It provides a variety of libraries for transformations and data connectors to sources and sinks.
- Apache Beam creates a model representation from your code that is portable across many runners. Runners pass off your model for execution on a variety of different possible engines, with Dataflow being a popular choice (see the runner sketch below).
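To make the portability point concrete, here is a sketch of swapping runners purely through pipeline options; the project, region, and bucket are hypothetical.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code can run locally for testing...
local_options = PipelineOptions(runner="DirectRunner")

# ...or on Dataflow, by changing only the options, not the pipeline itself.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
)
```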
What is Dataflow?
- Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem.
- It handles much of the complexity relating to infrastructure setup and maintenance and is built on Google’s infrastructure, which allows for reliable autoscaling to meet data pipeline demands.
- Dataflow is serverless and NoOps, short for “no operations.”
What is a NoOps environment?
A NoOps environment is one that doesn’t require management from an operations team, because maintenance, monitoring, and scaling are automated.
What is Serverless computing?
Serverless computing is a cloud computing execution model in which the provider, for example Google Cloud, manages infrastructure tasks on behalf of users. This includes tasks like resource provisioning, performance tuning, and ensuring pipeline reliability.
Describe the list of Dataflow templates.
They can be broken down into three categories (a launch sketch follows the list):
- streaming templates (for processing continuous, or real-time, data)
- batch templates (for processing bulk data, or batch load data)
- utility templates (for activities related to bulk compression, deletion, and conversion)
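As a hedged sketch, a Google-provided streaming template can be launched programmatically through the Dataflow REST API; the template path and parameter names below are assumptions based on the classic Pub/Sub-to-BigQuery template, and the project, topic, and table are hypothetical.

```python
from googleapiclient.discovery import build

dataflow = build("dataflow", "v1b3")

# Launch a stored template rather than building a pipeline from scratch.
request = dataflow.projects().templates().launch(
    projectId="my-project",
    gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",  # assumed template path
    body={
        "jobName": "pubsub-to-bq-demo",
        "parameters": {
            "inputTopic": "projects/my-project/topics/gaming-events",
            "outputTableSpec": "my-project:analytics.game_events",
        },
    },
)
response = request.execute()
print(response)
```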
What tools are available to interact with and visualize data?
Looker and Google Data Studio.
Describe key features of Looker.
- Looker supports BigQuery, as well as more than 60 different SQL databases.
- It allows developers to define a semantic modeling layer on top of databases using Looker Modeling Language, or LookML.
- The Looker platform is 100% web-based.
- There is also a Looker API, which can be used to embed Looker reports in other applications (see the sketch below).
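As a minimal sketch of the API point, assuming the looker-sdk Python package (with credentials supplied via a looker.ini file or environment variables), a saved Look can be run and its results fetched for use in another application; the Look ID is hypothetical.

```python
import looker_sdk

# init40() authenticates against the Looker API 4.0 endpoint using
# configuration from looker.ini or LOOKERSDK_* environment variables.
sdk = looker_sdk.init40()

# Run a saved Look (ID assumed) and fetch its results as JSON.
result = sdk.run_look(look_id="42", result_format="json")
print(result)
```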