Data Analytics in Azure Flashcards
Batch vs Stream Processing
Batch processing: multiple data records are collected and stored before being processed together in a single operation.
Stream processing: a source of data is constantly monitored and processed in real time as new data events occur.
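The contrast can be illustrated with a minimal Python sketch. The function names (batch_total, stream_totals) and the sample readings are hypothetical and chosen only for illustration; no Azure service is involved.

```python
from typing import Iterable, Iterator, List


def batch_total(records: List[float]) -> float:
    """Batch: all records are collected first, then processed in one operation."""
    return sum(records)


def stream_totals(events: Iterable[float]) -> Iterator[float]:
    """Stream: each event is processed as it arrives, yielding a running total."""
    total = 0.0
    for value in events:
        total += value
        yield total


# Hypothetical sample data.
readings = [3.0, 5.0, 2.0]

batch_result = batch_total(readings)            # one result, after all data is in
stream_results = list(stream_totals(readings))  # one result per incoming event

print(batch_result)
print(stream_results)
```

The batch function returns a single answer once the whole input exists, while the stream function produces an updated answer after every event.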
Advantages of Batch Processing
Large volumes of data can be processed at a convenient time.
It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.
Disadvantages of Batch Processing
The time delay between ingesting the data and getting the results.
All of a batch job’s input data must be ready before the batch can be processed, which creates dependencies between data observations.
Problems with data, errors, and program crashes that occur during a batch job can bring the whole process to a halt.
General four-step architecture for stream processing
- Event generates data
- Data is captured at a streaming source
- Data is processed
- Results are written to an output (a.k.a. a sink)
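The four steps above can be sketched in plain Python. This is only an in-memory analogy: the queue stands in for a streaming source and the list stands in for a sink. The event fields and the names generate_events and process are assumptions made for the example.

```python
import queue


def generate_events():
    """Step 1: an event generates data (hypothetical sensor readings)."""
    for i in range(3):
        yield {"device": "sensor-1", "temperature_c": 20 + i}


# Step 2: data is captured at a streaming source (an in-memory queue here).
source = queue.Queue()
for event in generate_events():
    source.put(event)


def process(event):
    """Step 3: data is processed (add a Fahrenheit reading)."""
    result = dict(event)
    result["temperature_f"] = event["temperature_c"] * 9 / 5 + 32
    return result


# Step 4: results are written to an output, the sink (a plain list here).
sink = []
while not source.empty():
    sink.append(process(source.get()))

print(sink)
```

In a real deployment the queue and the list would be replaced by services such as those listed in the source and sink cards.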
Stream processing sources in Azure
Azure Event Hubs
Azure IoT Hub
Azure Data Lake Storage Gen2
Apache Kafka
Stream processing sinks in Azure
Azure Event Hubs
Azure Data Lake Storage Gen2
Azure Blob Storage
Azure SQL Database
Azure Synapse Analytics
Azure Databricks
Microsoft Power BI
Azure Stream Analytics
Azure Stream Analytics is a service for complex event processing and analysis of streaming data.
Stream Analytics is used to:
- Ingest data from an input
- Process the data by using a query to select, project and aggregate data values
- Write the results to an output
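The kind of windowed aggregation such a query performs can be sketched in Python. This is not Stream Analytics code (the service uses its own SQL-like query language). It is a plain-Python analogy of a fixed-size, non-overlapping (tumbling) window average, and the function name tumbling_window_avg and the event tuples are hypothetical.

```python
from collections import defaultdict


def tumbling_window_avg(events, window_seconds):
    """Average values per device within fixed, non-overlapping time windows.

    Each event is a (timestamp_seconds, device, value) tuple.
    Returns a dict keyed by (window_start, device).
    """
    buckets = defaultdict(list)
    for ts, device, value in events:
        # Select and project: keep only the fields needed, and assign
        # each event to the window that contains its timestamp.
        window_start = ts - ts % window_seconds
        buckets[(window_start, device)].append(value)
    # Aggregate: one average per (window, device) group.
    return {key: sum(vals) / len(vals) for key, vals in sorted(buckets.items())}


# Hypothetical input: timestamps are seconds since the start of the stream.
events = [(0, "a", 10.0), (5, "a", 20.0), (12, "a", 30.0), (14, "b", 40.0)]
averages = tumbling_window_avg(events, window_seconds=10)
print(averages)
```

With 10-second windows, the first two readings from device "a" fall into the window starting at 0 and are averaged together, while the later readings land in the window starting at 10.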
Apache Spark on Microsoft Azure
Apache Spark is a distributed processing framework for large-scale data analytics. You can use Spark on Microsoft Azure in the following services:
Azure Synapse Analytics
Azure Databricks
Azure HDInsight
Delta Lake
Delta Lake is an open-source storage layer that adds support for transactional consistency, schema enforcement, and other common data warehousing features to data lake storage.
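To make "schema enforcement" and "transactional consistency" concrete, here is a toy Python sketch. It does not use Delta Lake or its API. ToyTable and its methods are hypothetical and only mimic the idea that a write either satisfies the table schema and is committed in full, or is rejected and leaves the table unchanged.

```python
class ToyTable:
    """A toy in-memory table; an illustration only, not Delta Lake."""

    def __init__(self, schema):
        self.schema = schema  # column name -> expected Python type
        self.rows = []

    def append(self, batch):
        # Schema enforcement: validate every row before writing anything.
        for row in batch:
            if set(row) != set(self.schema):
                raise ValueError(f"columns do not match schema: {sorted(row)}")
            for column, expected in self.schema.items():
                if not isinstance(row[column], expected):
                    raise TypeError(f"{column!r} must be {expected.__name__}")
        # All-or-nothing commit: rows are added only if the whole batch passed.
        self.rows.extend(batch)


table = ToyTable({"id": int, "name": str})
table.append([{"id": 1, "name": "first"}])

try:
    # The second row has a string id, so the whole batch is rejected.
    table.append([{"id": 2, "name": "second"}, {"id": "3", "name": "third"}])
    rejected = False
except TypeError:
    rejected = True

print(rejected, len(table.rows))
```

The valid row in the rejected batch is not written either, which is the behavior a transactional write is meant to guarantee.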
Azure Data Explorer
A standalone Azure service for efficiently analyzing data. You can use it as the output destination when analyzing large volumes of diverse data from sources such as websites, applications, and IoT devices.
Kusto Query Language (KQL)
A query language, used in Azure Data Explorer, that is specifically optimized for fast read performance, particularly with telemetry data that includes a timestamp attribute.