Chapter 2 - Implementing a Partition Strategy Flashcards
What is data partitioning?
The process of dividing data and storing it in physically different locations.
Involves vertical splitting of data across different files within the same machine.
Same machine ⇒ parallelism relies on local CPU cores and memory. (Slower than scaling out across machines.)
What is data distribution?
Horizontal splitting of data across different machines.
Different machines ⇒ parallelism scales out to a cluster, letting each node handle its own chunk of the data independently. (Faster)
What are the benefits of data partitioning?
- helps improve parallelization of queries by splitting monolithic data into smaller, easily consumable chunks
- functions as data pruning: queries skip the partitions they do not need, which reduces input/output (I/O) operations
- helps with deleting or archiving old data, e.g. any data more than 12 months old can be deleted with one command
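A minimal sketch of the pruning and cleanup benefits above, assuming rows are partitioned by the month of an `event_date` field. The function names and sample rows are illustrative only and not tied to any Azure service.

```python
from collections import defaultdict
from datetime import date


def partition_by_month(rows):
    """Group rows into partitions keyed by (year, month) of their event date."""
    partitions = defaultdict(list)
    for row in rows:
        d = row["event_date"]
        partitions[(d.year, d.month)].append(row)
    return partitions


def query_month(partitions, year, month):
    """Partition pruning: read only the single partition the query needs."""
    return partitions.get((year, month), [])


def drop_older_than(partitions, cutoff):
    """Delete whole partitions that fall before the cutoff month."""
    old_keys = [key for key in partitions if key < (cutoff.year, cutoff.month)]
    for key in old_keys:
        del partitions[key]
    return old_keys


# Hypothetical sample data.
rows = [
    {"id": 1, "event_date": date(2023, 1, 15)},
    {"id": 2, "event_date": date(2024, 3, 2)},
    {"id": 3, "event_date": date(2024, 3, 20)},
]
partitions = partition_by_month(rows)
march = query_month(partitions, 2024, 3)
dropped = drop_older_than(partitions, date(2024, 1, 1))
```

The query touches one partition instead of scanning every row, and old data is removed by dropping partitions rather than deleting row by row.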
What are the best practices for horizontal scaling in Azure?
- ADLS Gen2: used to store large volumes of data; data lakes are built on horizontal scaling.
- Synapse Analytics: handles large amounts of data with either provisioned or serverless workloads.
- Autoscaling: Virtual Machine Scale Sets automatically adjust the number of VM instances, e.g. autoscaling the nodes in an Azure Kubernetes Service (AKS) cluster.
- Serverless computing: Azure Functions or Azure Logic Apps, which autoscale based on demand and automate workflows with other Azure services.
- Data partitioning / sharding: Cosmos DB provides automatic and instant scalability, global distribution, and low latency.
What are examples of enhancing security on an Azure data analytics platform via partitioning?
- Dynamic Data Masking: masks sensitive data in query results by replacing actual values with obfuscated ones (e.g., asterisks) based on defined rules. Available in Azure SQL Database.
- Encryption of data at rest: Azure Disk Encryption (VMs and managed disks) encrypts data stored on disks to prevent unauthorized access to data files.
- Row-level Security: controls access to rows in a database based on characteristics of the user executing the query. Available in Synapse Analytics.
- Transparent data encryption (TDE): automatically encrypts the database, backups, and transaction log files. Protects data at rest from anyone other than an authorized user.
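To make masking and row-level security concrete, here is a rough sketch in plain Python. It only imitates the ideas and is not how Azure SQL Database or Synapse implement them; the `mask_value` and `rows_visible_to` helpers and the `region` attribute are assumptions for illustration.

```python
def mask_value(value, visible=4):
    """Replace all but the last `visible` characters with asterisks."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def rows_visible_to(rows, user_region):
    """Row-level filter: return only rows matching the caller's region."""
    return [row for row in rows if row["region"] == user_region]


# Hypothetical sample data.
rows = [
    {"customer": "A", "region": "EU", "card": "4111111111111111"},
    {"customer": "B", "region": "US", "card": "5500000000000004"},
]
eu_rows = rows_visible_to(rows, "EU")
masked_card = mask_value(eu_rows[0]["card"])
```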
What are five benefits of partitioning data?
- Increased availability (reduce impact of failure)
- Improved performance (focused on data)
- Reduced costs (i.e. archiving schedule)
- Manageability (smaller chunks)
- Scalability
What type of storage does an Azure Storage Account support?
An Azure storage account contains all of your Azure Storage data objects: blobs, files, queues, and tables.
The storage account provides a unique namespace for your Azure Storage data that’s accessible from anywhere in the world over HTTP or HTTPS. Data in your storage account is durable and highly available, secure, and scalable.
How does Azure Storage store and manage blobs?
Azure Storage uses <account name + container name + blob name> as the partition key.
It continues to store blobs in the same partition until it reaches the internal limit, then it repartitions and rebalances data among partitions automatically. Adding a 3-digit hash to filenames can improve rebalancing.
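A minimal sketch of the hash-prefix idea, assuming the three digits are derived from a SHA-256 hash of the blob name. The helper name `prefixed_blob_name` and the dash separator are illustrative choices, not part of any Azure SDK.

```python
import hashlib


def prefixed_blob_name(blob_name):
    """Prepend a stable 3-digit hash so similar names spread across key ranges."""
    digest = hashlib.sha256(blob_name.encode("utf-8")).hexdigest()
    prefix = int(digest, 16) % 1000  # value in 0..999
    return f"{prefix:03d}-{blob_name}"


# Hypothetical file name.
name = prefixed_blob_name("log-2024-03-01.csv")
```

Because the prefix is computed from the name itself, the same file always maps to the same prefixed name, while sequential names (such as date-stamped logs) no longer sort next to each other.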
What does the ADLS Gen2 Hierarchical Namespace option do?
Provides folder indexing and security (ability to create Access Control Lists at the folder / file level).
What is a horizontal partition?
A table divided into subsets of rows, each with the same schema as the parent table, stored in different data stores or database instances (e.g. index blocks 1000-1999, 2000-2999).
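A small sketch of range-based routing using the index blocks from the example. The instance names are hypothetical placeholders.

```python
# Each entry: (lowest ID, highest ID, hypothetical database instance name).
SHARDS = [
    (1000, 1999, "db_instance_1"),
    (2000, 2999, "db_instance_2"),
]


def shard_for(row_id):
    """Return the instance that holds the row with this ID."""
    for low, high, name in SHARDS:
        if low <= row_id <= high:
            return name
    raise KeyError(f"no shard covers id {row_id}")
```

Every shard stores rows with the same columns; only the range of IDs differs.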
What is a vertical partition?
Retains the primary key in each partition and splits the data by column, grouping the most utilized columns together to make reading the row faster.
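A minimal sketch of a vertical split, assuming a row is a dictionary and the caller decides which columns count as frequently used. The column names are made up for illustration.

```python
def split_vertically(row, key, hot_columns):
    """Split one row into a hot part and a cold part, both keeping the key."""
    hot = {key: row[key]}
    cold = {key: row[key]}
    for column, value in row.items():
        if column == key:
            continue
        target = hot if column in hot_columns else cold
        target[column] = value
    return hot, cold


# Hypothetical row.
row = {"CustomerID": 7, "Name": "Ada", "Email": "ada@example.com", "Notes": "long text"}
hot, cold = split_vertically(row, "CustomerID", {"Name", "Email"})
```

Both parts carry `CustomerID`, so the full row can be rebuilt by joining on the key.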
What is a functional partition?
Separating data based on business sense, such as sensitivity (CustomerID, CustomerName, etc.).
What is Azure Event Hubs?
Scalable service that processes events in real time. Streaming data is partitioned at the user's discretion.
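One common way to partition a stream is to hash a user-chosen key so that related events land in the same partition. The sketch below shows that general idea only; it is an assumption for illustration and not the hashing the service itself uses.

```python
import hashlib


def assign_partition(partition_key, partition_count):
    """Map a key to a partition index in the range 0..partition_count-1."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# Hypothetical device IDs used as partition keys.
first = assign_partition("device-42", 4)
second = assign_partition("device-42", 4)
```

Events with the same key always map to the same partition, which preserves their relative order within that partition.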
What is Azure Stream Analytics?
Real-time analytics service designed to help analyze and visualize streaming data in real time. Takes partitioned data for processing to monitor, trigger alerts, or provide real-time reporting.
Receives data from Azure Event Hubs.
What is Azure Databricks?
“Spark on Azure” streamlines big-data analytics and machine learning.
Collaborative notebooks for data science and data engineering.
Auto-scaling Spark clusters.
Integration with Azure Data Lake (Gen2), Cosmos DB (NoSQL), Synapse, and other Azure services.