Chapter 5 - Developing a Batch Processing Solution Flashcards
Define Streaming Processing vs Batch Processing
Credit Card Example.
Stream processing is like an instant credit card authorization: each transaction is evaluated immediately to determine whether it is approved or denied.
Batch processing is like a monthly statement: it includes a larger set of data collected over a set period of time.
What is Dynamic Data Masking (DDM)?
Dynamic data masking is a data protection feature that helps organizations safeguard sensitive information within their databases. It allows you to define masking rules for specific columns, ensuring that only authorized users see the original data while it is concealed from everyone else.
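A minimal T-SQL sketch of DDM, assuming a hypothetical dbo.Customers table with an Email column:

```sql
-- Mask the Email column so non-privileged users see an obfuscated value
ALTER TABLE dbo.Customers
ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Grant a specific user permission to see the unmasked data
GRANT UNMASK TO AnalystUser;
```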
What is row-level security (RLS)?
Row-level security simplifies the design and coding of security in your application. RLS helps you implement restrictions on data row access. For example, you can ensure that workers access only those data rows that are pertinent to their department. Another example is to restrict customers’ data access to only the data relevant to their company.
Implement RLS by using the CREATE SECURITY POLICY Transact-SQL statement, with predicates created as inline table-valued functions.
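A minimal sketch of an RLS filter predicate, assuming a hypothetical dbo.Sales table with a SalesRep column that matches the database user name:

```sql
CREATE SCHEMA Security;
GO

-- Inline table-valued function that returns a row only when the caller owns the data row
CREATE FUNCTION Security.fn_securitypredicate(@SalesRep AS nvarchar(128))
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS fn_securitypredicate_result
       WHERE @SalesRep = USER_NAME();
GO

-- Bind the predicate to the table so reads are filtered per user
CREATE SECURITY POLICY SalesFilter
ADD FILTER PREDICATE Security.fn_securitypredicate(SalesRep)
ON dbo.Sales
WITH (STATE = ON);
```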
What is PolyBase? How is it used to load data into a SQL Pool?
PolyBase is a tool that enables services such as SQL Server and Synapse dedicated SQL pools to copy and query data directly from external locations, including sources such as Azure Storage, Oracle, Teradata, Hadoop, and MongoDB.
PolyBase is integrated into T-SQL, so whenever you use a COPY INTO <table> FROM statement to read data from an external storage location, PolyBase kicks in. It is one of the fastest and most scalable ways to copy data.
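A minimal sketch of the COPY statement in a Synapse dedicated SQL pool, assuming a hypothetical staging table dbo.StageSales and a hypothetical storage account path:

```sql
COPY INTO dbo.StageSales
FROM 'https://mystorageaccount.blob.core.windows.net/raw/sales/*.csv'
WITH (
    FILE_TYPE = 'CSV',
    FIRSTROW  = 2,                                -- skip the header row
    CREDENTIAL = (IDENTITY = 'Managed Identity')  -- authenticate with the workspace managed identity
);
```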
What services is PolyBase available for?
ADF: This version of PolyBase can be used as an activity within ADF. The Copy Data activity can be defined in a pipeline and scheduled to run regularly.
ADB: This version of PolyBase can be used to transfer data between Azure Databricks and Synapse SQL pools.
SQL Server: This version can be used if the source is SQL Server. The SQL Server Integration Services (SSIS) platform can be used to define the source and destination mappings and do the orchestration while using the SSIS version of PolyBase.
What is the difference between transactional workloads versus analytical workloads?
- Transactional workloads consist of the insertion, update, or deletion of production data; the main concern is data quality and consistency.
- Analytical workloads consist of the analysis of data after it has been produced; the data is typically created or updated only as a result of an import or refresh, and the main concern is the performance of query results and reports.
What Azure services are primarily transactional workloads?
T-SQL stands for Transact-SQL, Microsoft's extension of SQL used by these services.
SQL Server and Azure SQL are optimized for transactional workloads.
SQL Server - the default isolation level is Read Committed, where only committed changes are visible to other transactions.
SQL Server - uses rowstore indexes, optimized for the retrieval of all columns for a single row.
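A minimal sketch illustrating the default isolation level and a rowstore (clustered B-tree) index, using a hypothetical dbo.Orders table:

```sql
-- Read Committed is the default isolation level in SQL Server
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- A rowstore table with a clustered index, optimized for retrieving all columns of a single row
CREATE TABLE dbo.Orders
(
    OrderId    int        NOT NULL,
    CustomerId int        NOT NULL,
    OrderDate  datetime2  NOT NULL,
    CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
);
```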
What Azure service is primarily used for analytical workloads?
Azure Synapse - uses columnstore indexes.
Azure Synapse provides the ability to run multiple tasks in parallel and allows you to decide how your data should be distributed across the different nodes in your pool when creating a table.
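A minimal sketch of a dedicated SQL pool table that chooses a hash distribution and a clustered columnstore index (table and column names are hypothetical):

```sql
-- Fact table distributed by hash on CustomerId, stored as a columnstore for analytical scans
CREATE TABLE dbo.FactSales
(
    SaleId     bigint        NOT NULL,
    CustomerId int           NOT NULL,
    SaleAmount decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerId),
    CLUSTERED COLUMNSTORE INDEX
);
```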
What does Synapse Link provide?
Synapse Link enables near-real-time integration of operational data from Azure SQL Database or SQL Server into a Synapse analytical data warehouse for analytical workloads.
Requires a linked service to be set up and established.
What are the different types of scaling resources called?
Scaling resources are referenced as Data Integration Units (DIUs); the allowed values double in size, starting from 2 up to 32 (2, 4, 8, 16, 32).
They can be set manually or autoscaled. The best approach is to start with a smaller number of units, monitor performance, and then adjust the allocation as needed.
What does the degree of parallelism impact?
The higher the degree of parallelism, the better the performance when copying data, because the copy is performed by multiple executors in parallel.
However, a higher level of parallelism results in a higher number of DIUs being allocated.
What is configuring batch size? Why is it important?
Batch size refers to the number of data records processed together during a task, which influences how quickly the task completes.
For example, in a Copy activity that copies data between a Source and a Sink (the destination location), the batch size controls how many rows are written to the sink in each batch.
What are Data Warehouse Units (DWUs)?
Data Warehouse Units are the unit of compute resources allocated to a dedicated SQL pool (i.e. Synapse SQL), representing a combination of CPU, memory, and I/O.
DWUs impact how quickly a standard query can scan data rows and perform complex aggregations, and they influence the speed of data ingestion from Azure Blob Storage or Azure Data Lake.
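A minimal sketch of scaling DWUs with T-SQL, assuming a hypothetical dedicated SQL pool named MyDedicatedPool (run from the master database):

```sql
-- Scale the dedicated SQL pool to 300 DWUs (Gen2 service objective)
ALTER DATABASE MyDedicatedPool
MODIFY (SERVICE_OBJECTIVE = 'DW300c');

-- Check the current service objective of each database
SELECT db.name, so.service_objective
FROM sys.database_service_objectives so
JOIN sys.databases db ON so.database_id = db.database_id;
```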