Design and Develop Data Processing Flashcards

1
Q

What is Spark Structured Streaming?

A

Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data. The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
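
The idea that an incremental computation gives the same answer as a batch computation over all the data seen so far can be sketched in plain Python. This is only an illustration of the concept, not the Spark API: the names batch_word_count and IncrementalWordCount are hypothetical, and a "micro-batch" here is just a list of text lines.

```python
from collections import Counter


def batch_word_count(lines):
    """Batch view: count words over a complete, static list of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


class IncrementalWordCount:
    """Streaming view: keep a running result and update it per micro-batch."""

    def __init__(self):
        self.counts = Counter()

    def process(self, micro_batch):
        # Reuse the batch logic on the new data only, then merge it
        # into the running result instead of recomputing everything.
        self.counts.update(batch_word_count(micro_batch))
        return self.counts


stream = IncrementalWordCount()
stream.process(["spark streaming", "spark sql"])
stream.process(["structured streaming"])
```

After both micro-batches, stream.counts matches what batch_word_count would return for all three lines together.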

2
Q

What kind of Azure Stream Analytics window is fixed size, repeating, non-overlapping and in which events cannot belong to more than one window?

A

Tumbling Window

3
Q

Define Azure Stream Analytics Window: Sliding

A

Produces output only when an event occurs, that is, when an event enters or exits the window.

4
Q

Describe a non-relational database.

A

Non-relational (NoSQL) databases are good for varying data types and simple, fast queries. Apache Cassandra is a typical example.

5
Q

Describe the types of ADF triggers.

A

Schedule Trigger: Invokes a pipeline on a wall-clock schedule.

Tumbling Window Trigger: A trigger that operates on a periodic interval, while also retaining state.

Event-based Trigger: A trigger that responds to an event.

6
Q

List the 5 types of Azure Stream Analytics Windows.

A

Tumbling
Hopping
Sliding
Session
Snapshot

7
Q

Describe Stream Analytics Tumbling Window.

A

Tumbling windows are fixed-size, repeat, and do not overlap; an event cannot belong to more than one tumbling window. Example: “Tell me the count of Tweets per time zone every 10 seconds.”
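
A minimal Python sketch of tumbling-window assignment, mirroring the "count per time zone every 10 seconds" example. This is not Stream Analytics code: the function name tumbling_window_counts is hypothetical, timestamps are assumed to be integer seconds, and each window is assumed to exclude its start and include its end.

```python
from collections import Counter


def tumbling_window_counts(events, size_seconds):
    """Count events per (window_end, key) in fixed, non-overlapping windows.

    events: iterable of (timestamp_seconds, key) pairs.
    Assumption: a window covers (end - size, end], so an event that lands
    exactly on a boundary belongs to the window that ends there.
    """
    counts = Counter()
    for ts, key in events:
        # Round the timestamp up to the next multiple of the window size.
        window_end = -(-ts // size_seconds) * size_seconds
        counts[(window_end, key)] += 1
    return dict(counts)


tweets = [(1, "UTC"), (5, "UTC"), (10, "PST"), (11, "UTC"), (20, "UTC")]
per_window = tumbling_window_counts(tweets, 10)
```

Each event maps to exactly one window end, which is what makes the windows non-overlapping.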

8
Q

Describe Stream Analytics Hopping Window.

A

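Hopping windows hop forward in time by a fixed period. They can be thought of as tumbling windows that can overlap and can be emitted more often than the window size, so an event can belong to more than one hopping window. Example: “Every 5 seconds give me the count of tweets over the last 10 seconds.”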
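
To see why hopping windows overlap, the sketch below lists every window an event falls into for a given window size and hop. It is an illustration in plain Python, not Stream Analytics code: hopping_window_ends is a hypothetical name, timestamps are assumed to be integer seconds, windows are assumed to end on multiples of the hop, and each window is assumed to exclude its start and include its end.

```python
def hopping_window_ends(ts, size_seconds, hop_seconds):
    """Return the end times of all hopping windows that contain ts.

    Assumption: windows end on multiples of hop_seconds and cover
    (end - size_seconds, end].
    """
    # Smallest multiple of the hop that is >= ts.
    end = -(-ts // hop_seconds) * hop_seconds
    ends = []
    # Keep stepping forward while the window starting at end - size
    # still begins strictly before the event.
    while end - size_seconds < ts:
        ends.append(end)
        end += hop_seconds
    return ends


# 10-second windows emitted every 5 seconds: each event lands in two windows.
windows_for_event = hopping_window_ends(7, 10, 5)
```

When the hop equals the window size, each event falls into exactly one window, which is the tumbling case.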

9
Q

Describe Stream Analytics Sliding Window.

A

Sliding windows only output events for points in time when the content of the window actually changes, i.e., when an event enters or exits the window. Events can belong to more than one sliding window.

“Alert me whenever a topic is mentioned more than 3 times in under 10 seconds.”
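
The alert example can be sketched in plain Python by re-evaluating the window each time an event enters it. This is not Stream Analytics code: sliding_window_alerts is a hypothetical name, timestamps are assumed to be integer seconds, and the window is assumed to cover (ts - size, ts]. Because the condition is "more than N", only events entering the window can trigger it, so exits are handled just by evicting old timestamps.

```python
from collections import defaultdict, deque


def sliding_window_alerts(events, size_seconds, threshold):
    """Return (timestamp, topic, count) whenever a topic's count within
    the last size_seconds exceeds threshold.

    events: iterable of (timestamp_seconds, topic) pairs.
    """
    recent = defaultdict(deque)  # topic -> timestamps still inside the window
    alerts = []
    for ts, topic in sorted(events):
        window = recent[topic]
        window.append(ts)
        # Evict timestamps that have slid out of (ts - size, ts].
        while window[0] <= ts - size_seconds:
            window.popleft()
        if len(window) > threshold:
            alerts.append((ts, topic, len(window)))
    return alerts


mentions = [(0, "a"), (2, "a"), (4, "a"), (6, "a"), (20, "a"), (21, "b")]
alerts = sliding_window_alerts(mentions, 10, 3)
```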

10
Q

Describe Stream Analytics Session Window.

A

Group events that arrive at similar times, filtering out periods of time where there is no data.

“Tell me the count of tweets that occur within 5 minutes of each other.”
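
A rough Python sketch of session grouping by inactivity gap. It is a simplification and not Stream Analytics code: session_windows is a hypothetical name, timestamps are assumed to be integer seconds, a gap exactly equal to the timeout is assumed to keep the session open, and any cap on maximum session length is ignored.

```python
def session_windows(timestamps, timeout_seconds):
    """Group timestamps into sessions separated by gaps longer than the timeout."""
    sessions = []
    for ts in sorted(timestamps):
        # Extend the current session if this event is close enough
        # to the previous one; otherwise start a new session.
        if sessions and ts - sessions[-1][-1] <= timeout_seconds:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


# Tweets within 5 minutes (300 s) of each other share a session.
sessions = session_windows([0, 60, 200, 900, 1000], 300)
counts = [len(s) for s in sessions]
```

Periods with no data produce no session at all, which is the "filtering out" behaviour described in the answer.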

11
Q

Describe Stream Analytics Snapshot Window.

A

Group events that have the same timestamp.
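
Snapshot grouping is simple enough to sketch in a few lines of Python. This is an illustration only, not Stream Analytics code; snapshot_windows is a hypothetical name and events are assumed to be (timestamp, payload) pairs.

```python
from collections import defaultdict


def snapshot_windows(events):
    """Group event payloads by identical timestamp."""
    groups = defaultdict(list)
    for ts, payload in events:
        groups[ts].append(payload)
    return dict(groups)


groups = snapshot_windows([(1, "a"), (1, "b"), (2, "c")])
```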

12
Q

What are recommended optimization techniques for Azure Databricks?

A

Group jobs into scheduler pools by weight.

Optimize watermark policies to improve performance by better handling late data.

13
Q

How can PolyBase reduce the cost of transformation on a big data architecture?

A

By enabling ELT (extract, load, transform): data is loaded first and transformed inside the target system, which can greatly reduce the transformation work needed before loading.

14
Q

What coding languages can be used for cells in Databricks?

A

Scala, R, Python, and SQL (Spark SQL). Not C#.

15
Q

What are 3 valid inputs for Stream Analytics?

A

IoT Hub
Blob Storage
Event Hubs

16
Q

What are you charged for when using Data Factory?

A

Pipeline Orchestration

Inactive Pipelines

17
Q

What are the two options for storing Reference Data in Azure Stream Analytics?

A

SQL Database

Blob Storage

18
Q

What is a Databricks cluster?

A

A group of compute resources.

19
Q

How long is pipeline-run monitoring data kept in ADF?

A

45 days.

20
Q

What data ingestion tools, and which of their settings, can be used for maximum parallelization in ADLS?

A

DistCp: -m (number of mappers)
Azure Data Factory: parallelCopies
Sqoop: fs.azure.block.size, -m (number of mappers)

21
Q

Describe what a Stream Analytics window is used for.

A

Windows are subsets of events that fall within some period of time. The window is a simple way to work with the time component of query logic in the system.