Design and Develop Data Processing Flashcards
What is Spark Structured Streaming?
Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data. The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
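The idea of "same logic, incrementally updated result" can be illustrated with a plain-Python analogy. This is only a sketch of the concept and does not use the Spark API; the names batch_count and IncrementalCount are hypothetical.

```python
from collections import Counter

def batch_count(words):
    """Batch view: count every word in a complete, static dataset."""
    return Counter(words)

class IncrementalCount:
    """Streaming view: keep a running result and fold in each micro-batch."""
    def __init__(self):
        self.result = Counter()

    def update(self, micro_batch):
        self.result.update(micro_batch)  # only the newly arrived rows are processed
        return self.result

stream = IncrementalCount()
stream.update(["a", "b"])
stream.update(["a", "c"])
# The incrementally maintained result matches the batch result over all data.
print(stream.result == batch_count(["a", "b", "a", "c"]))  # True
```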
Which kind of Azure Stream Analytics window is fixed-size, repeating, and non-overlapping, so that an event cannot belong to more than one window?
Tumbling Window
Define the Azure Stream Analytics sliding window.
Produces output only when the content of the window changes, that is, when an event enters or exits the window.
Describe a non-relational database.
They are good for varying data types and simple, fast queries. Examples of non-relational databases include Apache Cassandra, MongoDB, and Azure Cosmos DB.
Describe the types of ADF triggers.
Schedule Trigger: Invokes a pipeline on a wall-clock schedule.
Tumbling Window Trigger: A trigger that operates on a periodic interval, while also retaining state.
Event-based Trigger: A trigger that responds to an event, such as the creation or deletion of a blob in a storage account.
List the 5 types of Azure Stream Analytics Windows.
Tumbling, Hopping, Sliding, Session, Snapshot
Describe Stream Analytics Tumbling Window.
Fixed-size windows that repeat and do not overlap; an event cannot belong to more than one tumbling window. “Tell me the count of Tweets per time zone every 10 seconds.”
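A minimal plain-Python sketch of tumbling-window assignment, assuming integer timestamps in seconds and start-inclusive windows beginning at time 0. The function name is hypothetical, and the service's exact boundary handling may differ.

```python
from collections import Counter

def tumbling_counts(timestamps, size):
    """Count events per fixed, non-overlapping window of `size` seconds."""
    counts = Counter()
    for ts in timestamps:
        window_start = (ts // size) * size  # each event maps to exactly one window
        counts[window_start] += 1
    return dict(sorted(counts.items()))

print(tumbling_counts([1, 4, 9, 10, 12, 25], 10))  # {0: 3, 10: 2, 20: 1}
```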
Describe Stream Analytics Hopping Window.
Fixed-size windows that hop forward in time by a fixed period, so an event can belong to more than one window. “Tumbling windows that can overlap and can be emitted more often than the window size.” “Every 5 seconds give me the count of tweets over the last 10 seconds.”
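A plain-Python sketch of the "every 5 seconds, count over 10 seconds" case, assuming integer timestamps, start-inclusive windows, and a first window that starts at time 0. The function name is hypothetical.

```python
def hopping_counts(timestamps, size, hop):
    """Count events per window of `size` seconds that starts every `hop` seconds."""
    counts = {}
    for ts in timestamps:
        # Begin at the latest window start at or before ts, then walk back
        # through earlier windows for as long as they still cover ts.
        start = (ts // hop) * hop
        while start > ts - size and start >= 0:
            counts[start] = counts.get(start, 0) + 1
            start -= hop
    return dict(sorted(counts.items()))

# The event at t=7 is counted in both the window starting at 0 and the one starting at 5.
print(hopping_counts([1, 4, 7, 12], size=10, hop=5))  # {0: 3, 5: 2, 10: 1}
```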
Describe Stream Analytics Sliding Window.
Only output events for points in time when the content of the window actually changes, i.e., when an event enters or exits the window. Events can belong to more than one sliding window.
“Alert me whenever a topic is mentioned more than 3 times in under 10 seconds.”
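The alert example can be sketched in plain Python as follows. This simplified version, with a hypothetical function name, evaluates the window only when an event enters it; a full sliding window would also re-evaluate when an event exits.

```python
from collections import deque

def sliding_alerts(timestamps, size, threshold):
    """Return the timestamps at which more than `threshold` events
    fall within the last `size` seconds."""
    window = deque()
    alerts = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Evict events that have left the window ending at ts.
        while window[0] <= ts - size:
            window.popleft()
        if len(window) > threshold:
            alerts.append(ts)
    return alerts

print(sliding_alerts([1, 3, 5, 8, 20, 21], size=10, threshold=3))  # [8]
```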
Describe Stream Analytics Session Window.
Group events that arrive at similar times, filtering out periods of time where there is no data.
“Tell me the count of tweets that occur within 5 minutes of each other.”
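A plain-Python sketch of session grouping, assuming a single timeout parameter: a new session starts whenever the gap since the previous event exceeds the timeout. The function name is hypothetical, and any maximum-duration limit on a session is left out for brevity.

```python
def session_windows(timestamps, timeout):
    """Group events into sessions separated by gaps longer than `timeout`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            sessions[-1].append(ts)   # close enough to the previous event: same session
        else:
            sessions.append([ts])     # gap too long (or first event): new session
    return sessions

print(session_windows([0, 2, 4, 20, 22, 50], timeout=5))
# [[0, 2, 4], [20, 22], [50]]
```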
Describe Stream Analytics Snapshot Window.
Group events that have the same timestamp.
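In plain Python, snapshot grouping amounts to keying events by their exact timestamp. The function name and the (timestamp, value) tuple layout below are assumptions made for illustration.

```python
from collections import defaultdict

def snapshot_groups(events):
    """Group (timestamp, value) pairs that share the same timestamp."""
    groups = defaultdict(list)
    for ts, value in events:
        groups[ts].append(value)
    return dict(groups)

print(snapshot_groups([(1, "a"), (1, "b"), (2, "c")]))  # {1: ['a', 'b'], 2: ['c']}
```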
What are recommended optimization techniques for Azure Databricks?
Group jobs into pools by weight.
Optimize watermark policies to improve performance by handling late data better.
How can PolyBase reduce the cost of transformation on a big data architecture?
By enabling ELT (extract, load, transform): data is loaded first and transformed inside the data warehouse, which can greatly reduce the transformation needed before loading.
What coding languages can be used for cells in Databricks?
Scala, R, Python, and SQL (Spark SQL). C# is not supported.
What are 3 valid inputs for Stream Analytics?
IoT Hub
Blob Storage
Event Hubs