Chapter 2 - Implementing a Partition Strategy Flashcards

Question 1

Q

What is data partitioning?

Answer

A

The process of dividing data and storing it in physically different locations.

Involves vertical splitting of data across different files within the same machine.

Same machine ⇒ parallelism relies on local CPU cores and memory. (Slower)

Question 2

Q

What is data distribution?

Answer

A

Horizontal splitting of data across different machines.

Different machines ⇒ parallelism scales out to a cluster, letting each node handle its own chunk of the data independently. (Faster)

Question 3

Q

Describe the benefits of data partitioning?

Answer

A

helps improve parallelization of queries by splitting monolithic data into smaller, easily consumable chunks
functions as data pruning where unnecessary data is ignored reducing Input/Output (I/O) operations.
helpful with deleting or archiving old data, i.e. any data that is + 12 months old can be easily DELETED with on cmd.

Question 4

Q

What are the best practices for horizontal scaling in Azure?

Answer

A

ADLS Gen2 : used to store large volume, data lakes are based on horizontal scaling
Synapse Analytics can handle large data amounts either provisioned or serverless workloads.
Autoscaling (Virtual Machine Scale Sets) to automatically increase the amount of VM instances. i.e. autoscale nodes in Azure Kubernetes Service (AKS) cluster.
Serverless computing: Azure Functions or Azure Logic Apps which autoscale based on demand and automate workflow with other Azure services.
Data partitioning / Sharding: Cosmos DB provides automatic and instant scalability, global distribution and low-latency capable.

Question 5

Q

Examples of enhancing security on Azure Data Analytics platform via partitioning?

Answer

A

Dynamic Data Masking: Masks sensitive data in query results by replacing actual data with obfuscated values (e.g., asterisks) based on defined rules. Azure SQL Database mask sensitive data in query results.

Encryption of data at rest: Azure Disk Encryption (VMs and managed disks). Encrypts stored data on disks to prevent unauthorized access to data files.
Row-level Security: Implemented in Synapse Analytics based on characteristics of the user executing the query. Controls access to rows in a database based on characteristics of the user executing the query.
Transparent data encryption: (TDE) automatically encrypts the database, backups, and transaction log files. Protects data at rest unless accessed by auth. user.

Question 6

Q

What are five benefits of partitioning data?

Answer

A

Increased availability (reduce impact of failure)
Improved performance (focused on data)
Reduced costs (i.e. archiving schedule)
Manageability (smaller chunks)
Scalability

Question 7

Q

What type of storage does an Azure Storage Account support?

Answer

A

An Azure storage account contains all of your Azure Storage data objects: blobs, files, queues, and tables.

The storage account provides a unique namespace for your Azure Storage data that’s accessible from anywhere in the world over HTTP or HTTPS. Data in your storage account is durable and highly available, secure, and scalable.

Question 8

Q

How does Azure Storage store and manage blobs?

Answer

A

Azure Storage uses <account name + container name + blob name>

It continues to store blobs in the same partition until is reaches the internal limit, then is repartitions and rebalances data amongst partitions automatically. Adding a 3-digit hash to filenames to improve rebalancing.

Question 9

Q

What does ADLS G2 Hierarchical Namespace option do?

Answer

A

Provides folder indexing and security (ability to create Access Control Lists at the folder / file level).

Question 10

Q

What is a horizontal partition?

Answer

A

Divided data table with subsets of rows stored in different data stores (same schema as the parent table) is stored in different database instances (i.e. index block 1000-1999, 2000-2999).

Question 11

Q

What is a vertical partition?

Answer

A

Retain the primary key, split the data by most utilized columns to make reading the row faster.

Question 12

Q

What is a functional partition?

Answer

A

Separating data based on business sense, such as sensitivity (CustomerID, CustomerName, etc.).

Question 13

Q

What is Azure Event Hub?

Answer

A

Scalable service that processes events in real time, streaming data is partitioned based on user discretion.

Question 14

Q

What is Azure Stream Analytics?

Answer

A

Real-time analytics service designed to help analyze and visualize streaming data in real time. Takes partitioned data for processing to monitor, trigger alerts, or provide real-time reporting.

Receives data from Azure Event Hub.

Question 15

Q

What is Azure Databricks?

Answer

A

“Spark on Azure” streamlines big-data analytics and machine learning.

Collaborative notebooks for data science and data engineering.

Auto-scaling Spark clusters.

Integration with Azure Data Lake (Gen2), Cosmos DB (NoSQL), Synapse, and other Azure services.

Question 16

Q

List the various services that Azure Synapse Analytics contain.

Answer

Study These Flashcards

A

SQL Data Warehouse (Serverless or Dedicated Pool), Synapse Spark, Azure Data Files (SMB, NFS protocols), Cosmos DB (NoSQL), Azure AD (IAM), Azure Purview (Governance)

Question 17

Q

Define a dedicated SQL Pool.

Answer

Study These Flashcards

A

Dedicated SQL pool is a Massively Parallel Processing (MPP) system that splits queries into 60 parallel queries. Each smaller query runs on a “DISTRIBUTION,” which acts as the basic unit of processing and storage for the dedicated SQL pool. This architecture allows for high throughput and scalable performance by distributing data and workloads across multiple distributions. It also ensures efficient resource utilization and faster query results by processing data in parallel across many nodes.

Question 18

Q

What are the three different distributions (shards)?

Answer

Study These Flashcards

A

Lookup. Think of a library using a master catalog to decide which shelf a book should go on. This strategy uses a catalog-like system to flexibly map data to physical partitions, making rebalancing easier by updating the map without altering how applications access data. However, just like checking a catalog before finding a book, this approach introduces extra lookup overhead.
Range. Imagine organizing books by alphabet ranges (A-M on one shelf, N-Z on another). This method groups similar data ranges together, making it easy to fetch several related items at once and manage them locally (like scheduling region-based updates). The downside is that some ranges (shelves) might get overloaded if they’re very popular, and rebalancing these ranges becomes difficult, much like trying to move a crowded shelf.
Hash. Picture assigning each book a random number to decide its shelf, spreading books evenly across shelves. Hash sharding uses a hash function to distribute data evenly, directly routing requests without a lookup table, which reduces hotspots. However, computing the hash for each lookup is like calculating a random number for every book, adding some overhead, and rebalancing can be hard once books are spread out.

Question 19

Q

______ is the rate at which you ingest data into an Azure Storage systems and the rate at which you move the data out of the Azure Storage system is called the ______.

Answer

Study These Flashcards

A

ingress, egress

Question 20

Q

What are the three types of tables supported by a dedicated SQL pool?

Answer

Study These Flashcards

A

Clustered Columnstore Tables:
Stores data in a columnar format, placing each column’s data contiguously to optimize compression and speed up analytics queries by scanning only necessary columns.

Ideal for data warehousing, big data analytics, and read-heavy operations where you perform aggregations, filtering, or scans on large datasets but don’t require frequent individual row updates. It excels in scenarios requiring high compression and fast query performance over many rows and columns.

Clustered Index Tables:
Organizes data rows according to a clustered index, physically ordering rows by key to facilitate fast row-based lookups at the cost of potentially slower inserts/updates.

Best for transactional systems or operational databases where quick retrieval of rows by key is critical. Use when you have frequent single-row reads, updates, or deletes based on indexed columns, and maintaining sorted order can greatly improve performance.

HEAP Tables:
Stores data in unstructured pages without a specific order, enabling fast bulk inserts but resulting in slower lookups and updates due to lack of organization.

Suitable for staging tables, temporary data storage, or scenarios where rapid bulk inserts are needed without immediate query performance requirements. Use when data organization isn’t a priority or when you plan to reorganize or index the data later after bulk loading.

Question 21

Q

What is range partitioning?

Answer

Study These Flashcards

A

Range partitioning involves organizing data into partitions based on specific value intervals (e.g., all records from January 2023 in one partition, February 2023 in another), which is a different approach than grouping by alphabetical order.

Question 22

Q

What are three key considerations for implementing a partitioning strategy for analytical workloads?

Answer

Study These Flashcards

A

Even Distribution - Choose partition key that ensures data is evenly distributed to avoid skewed hotspots
Queriable - Partition data to align with common query patterns to minimize data movement.
Storage Considerations - Be mindful of the cost implications of your partitioning strategy, it may impact storage and resources

Question 23

Q

What Azure services can you utilize to implement a partition strategy for streaming workloads?

Answer

Study These Flashcards

A

Azure Event Hubs
Azure Stream Analytics (ASA)
Azure Databricks (ADB)

Question 24

Q

What are the default limits for the number of storage accounts and storage account capacity in Azure?

Answer

Study These Flashcards

A

Number of Storage Accounts: 250 by default, 500 by request
Default maximum storage account capacity, 5 PiB

Chapter 2 - Implementing a Partition Strategy Flashcards

(24 cards)