Data Engineering for Streaming Data Flashcards

1
Q

What is batch processing?

A

Batch processing is when the processing and analysis happen on a set of stored data.

2
Q

What is streaming data?

A

Streaming data is a flow of data records generated by various data sources.

3
Q

What does streaming data processing mean?

A

Streaming data processing means that the data is analyzed in near real-time and that actions will be taken on the data as quickly as possible.

4
Q

What is a tool for distributed message-oriented architectures?

A

Pub/Sub is a tool that handles distributed message-oriented architectures at scale.

5
Q

What is Pub/Sub?

A

The name is short for Publisher/Subscriber: publishers send messages to subscribers.

Pub/Sub is a distributed messaging service that can receive messages from various device streams, such as gaming events, IoT devices, and application streams.

Key properties:
- at-least-once delivery of received messages to subscribing applications
- no provisioning required
- open APIs
- global by default
- end-to-end encryption
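
As a quick illustration, here is a minimal sketch of publishing a message with the Pub/Sub Python client library; the project and topic names are hypothetical.

```python
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

# Hypothetical project and topic names.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "game-events")

# Payloads are raw bytes; extra keyword arguments become message attributes.
future = publisher.publish(
    topic_path,
    data=b'{"player": "p1", "score": 42}',
    source="game-server",
)
print(future.result())  # Blocks until Pub/Sub returns the message ID.
```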

6
Q

Give a general use case for Pub/Sub.

A
  • Pub/Sub reads, stores, and broadcasts to any subscribers of a data topic that new messages are available.
  • As a subscriber of Pub/Sub, Dataflow can ingest and transform those messages in an elastic streaming pipeline and output the results into an analytics data warehouse like BigQuery (see the sketch below).
  • Finally, you can connect a data visualization tool, like Looker, to visualize and monitor the results of a pipeline, or an AI or ML tool such as Vertex AI to explore the data to uncover business insights or help with predictions.
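
A rough sketch of the ingest-and-transform step, using the Apache Beam Python SDK; the subscription and table names are hypothetical.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# streaming=True keeps the pipeline running on the unbounded Pub/Sub source.
options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/game-events-sub")
        | "DecodeToRow" >> beam.Map(lambda msg: {"payload": msg.decode("utf-8")})
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",  # hypothetical dataset and table
            schema="payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )
```
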
7
Q

How is Dataflow used?

A

Dataflow creates a pipeline to process both streaming data and batch data.
This is where ETL (extract, transform, load) happens.
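
A minimal batch ETL sketch in the Beam Python SDK (the Cloud Storage paths are hypothetical):

```python
import apache_beam as beam

# Extract lines from a file, transform them, and load the cleaned result.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Extract" >> beam.io.ReadFromText("gs://my-bucket/raw/events.csv")
        | "Transform" >> beam.Map(str.strip)
        | "DropComments" >> beam.Filter(lambda line: line and not line.startswith("#"))
        | "Load" >> beam.io.WriteToText("gs://my-bucket/clean/events",
                                        file_name_suffix=".csv")
    )
```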

8
Q

What is a popular solution for pipeline design?

A

Apache Beam.
It’s an open source, unified programming model to define and execute data processing pipelines.

9
Q

List the advantages of Apache Beam.

A
  • Apache Beam is unified, which means it uses a single programming model for both batch and streaming data.
  • It’s portable, which means it can work on multiple execution environments, like Dataflow and Apache Spark, among others.
  • It’s extensible, which means it allows you to write and share your own connectors and transformation libraries.
  • It provides pipeline templates, so you don’t need to build a pipeline from scratch, and you can write pipelines in Java, Python, or Go.
  • The Apache Beam software development kit, or SDK, is a collection of software development tools in one installable package. It provides a variety of libraries for transformations and data connectors to sources and sinks.
  • Apache Beam creates a model representation from your code that is portable across many runners. Runners pass off your model for execution on a variety of different possible engines, with Dataflow being a popular choice (see the sketch below).
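
A minimal sketch of that portability, assuming a hypothetical project and staging bucket: the same pipeline runs locally or on Dataflow just by changing options.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run(runner: str = "DirectRunner") -> None:
    # Pass "DataflowRunner" to execute the identical model on Dataflow.
    options = PipelineOptions(
        runner=runner,
        project="my-project",                # hypothetical
        region="us-central1",
        temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
    )
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | beam.Create(["hello", "beam"])
            | beam.Map(str.upper)
            | beam.Map(print)
        )

run()  # Local run; run("DataflowRunner") submits the same code to Dataflow.
```
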
10
Q

What is Dataflow?

A
  • Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud ecosystem.
  • It handles much of the complexity of infrastructure setup and maintenance and is built on Google’s infrastructure, which allows for reliable autoscaling to meet data pipeline demands.
  • Dataflow is serverless and NoOps (“no operations”).
11
Q

What is a NoOps environment?

A

A NoOps environment is one that doesn’t require management from an operations team, because maintenance, monitoring, and scaling are automated.

12
Q

What is Serverless computing?

A

Serverless computing is a cloud computing execution model in which the provider, Google Cloud for example, manages infrastructure tasks on behalf of users. This includes tasks like resource provisioning, performance tuning, and ensuring pipeline reliability.

13
Q

Describe the categories of Dataflow templates.

A

They can be broken down into three categories:
- streaming templates (for processing continuous, or real-time, data; see the launch sketch below)
- batch templates (for processing bulk data, or batch load data)
- utility templates (for activities related to bulk compression, deletion, and conversion)
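
As an illustration, a sketch of launching the Google-provided Pub/Sub-to-BigQuery streaming template via the Dataflow REST API from Python; the project, subscription, and table names are hypothetical, and the template path and parameters should be checked against the current template documentation.

```python
from googleapiclient.discovery import build  # pip install google-api-python-client

# Uses Application Default Credentials for authentication.
dataflow = build("dataflow", "v1b3")
request = dataflow.projects().templates().launch(
    projectId="my-project",  # hypothetical
    gcsPath="gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery",
    body={
        "jobName": "pubsub-to-bq-example",
        "parameters": {
            "inputSubscription": "projects/my-project/subscriptions/game-events-sub",
            "outputTableSpec": "my-project:analytics.events",
        },
    },
)
print(request.execute()["job"]["id"])  # ID of the newly launched Dataflow job.
```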

14
Q

What is available to interact with and visualize data?

A

Looker and Google Data Studio

15
Q

Describe key features of Looker.

A
  • Looker supports BigQuery, as well as more than 60 different SQL databases.
  • It allows developers to define a semantic modeling layer on top of databases using Looker Modeling Language, or LookML.
  • The Looker platform is 100% web-based.
  • There is also a Looker API, which can be used to embed Looker reports in other applications (see the sketch below).
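
For instance, a minimal sketch using the looker-sdk Python package to pull a saved Look’s results over the API; the Look ID is hypothetical, and credentials are read from a looker.ini file or LOOKERSDK_* environment variables.

```python
import looker_sdk  # pip install looker-sdk

# Initialize the 4.0 API client from looker.ini or environment variables.
sdk = looker_sdk.init40()

# Fetch the results of a saved Look as JSON (hypothetical Look ID).
results = sdk.run_look(look_id="42", result_format="json")
print(results)
```
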
16
Q

What is the biggest administrative difference between Data Studio and Looker?

A

Data Studio is integrated into BigQuery, which makes data visualization possible with just a few clicks.

This means that leveraging Data Studio doesn’t require support from an administrator to establish a data connection, which is a requirement with Looker.

17
Q

When you build scalable and reliable pipelines, data often needs to be processed in near-real time, as soon as it reaches the system. Which type of challenge might this present to data engineers?

A

Velocity