Chapter 2 - Implementing a Partition Strategy Flashcards

1
Q

What is data partitioning?

A

The process of dividing data and storing it in physically different locations.

Involves vertical splitting of data across different files within the same machine.

Same machine ⇒ parallelism relies on local CPU cores and memory. (Slower)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

What is data distribution?

A

Horizontal splitting of data across different machines.

Different machines ⇒ parallelism scales out to a cluster, letting each node handle its own chunk of the data independently. (Faster)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Describe the benefits of data partitioning?

A
  • helps improve parallelization of queries by splitting monolithic data into smaller, easily consumable chunks
  • functions as data pruning where unnecessary data is ignored reducing Input/Output (I/O) operations.
  • helpful with deleting or archiving old data, i.e. any data that is + 12 months old can be easily DELETED with on cmd.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

What are the best practices for horizontal scaling in Azure?

A
  • ADLS Gen2 : used to store large volume, data lakes are based on horizontal scaling
  • Synapse Analytics can handle large data amounts either provisioned or serverless workloads.
  • Autoscaling (Virtual Machine Scale Sets) to automatically increase the amount of VM instances. i.e. autoscale nodes in Azure Kubernetes Service (AKS) cluster.
  • Serverless computing: Azure Functions or Azure Logic Apps which autoscale based on demand and automate workflow with other Azure services.
  • Data partitioning / Sharding: Cosmos DB provides automatic and instant scalability, global distribution and low-latency capable.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Examples of enhancing security on Azure Data Analytics platform via partitioning?

A

Dynamic Data Masking: Masks sensitive data in query results by replacing actual data with obfuscated values (e.g., asterisks) based on defined rules. Azure SQL Database mask sensitive data in query results.

  • Encryption of data at rest: Azure Disk Encryption (VMs and managed disks). Encrypts stored data on disks to prevent unauthorized access to data files.
  • Row-level Security: Implemented in Synapse Analytics based on characteristics of the user executing the query. Controls access to rows in a database based on characteristics of the user executing the query.
  • Transparent data encryption: (TDE) automatically encrypts the database, backups, and transaction log files. Protects data at rest unless accessed by auth. user.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

What are five benefits of partitioning data?

A
  • Increased availability (reduce impact of failure)
  • Improved performance (focused on data)
  • Reduced costs (i.e. archiving schedule)
  • Manageability (smaller chunks)
  • Scalability
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

What type of storage does an Azure Storage Account support?

A

An Azure storage account contains all of your Azure Storage data objects: blobs, files, queues, and tables.

The storage account provides a unique namespace for your Azure Storage data that’s accessible from anywhere in the world over HTTP or HTTPS. Data in your storage account is durable and highly available, secure, and scalable.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

How does Azure Storage store and manage blobs?

A

Azure Storage uses <account name + container name + blob name>

It continues to store blobs in the same partition until is reaches the internal limit, then is repartitions and rebalances data amongst partitions automatically. Adding a 3-digit hash to filenames to improve rebalancing.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

What does ADLS G2 Hierarchical Namespace option do?

A

Provides folder indexing and security (ability to create Access Control Lists at the folder / file level).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

What is a horizontal partition?

A

Divided data table with subsets of rows stored in different data stores (same schema as the parent table) is stored in different database instances (i.e. index block 1000-1999, 2000-2999).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What is a vertical partition?

A

Retain the primary key, split the data by most utilized columns to make reading the row faster.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

What is a functional partition?

A

Separating data based on business sense, such as sensitivity (CustomerID, CustomerName, etc.).

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What is Azure Event Hub?

A

Scalable service that processes events in real time, streaming data is partitioned based on user discretion.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is Azure Stream Analytics?

A

Real-time analytics service designed to help analyze and visualize streaming data in real time. Takes partitioned data for processing to monitor, trigger alerts, or provide real-time reporting.

Receives data from Azure Event Hub.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is Azure Databricks?

A

“Spark on Azure” streamlines big-data analytics and machine learning.

Collaborative notebooks for data science and data engineering.

Auto-scaling Spark clusters.

Integration with Azure Data Lake (Gen2), Cosmos DB (NoSQL), Synapse, and other Azure services.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

List the various services that Azure Synapse Analytics contain.

A

SQL Data Warehouse (Serverless or Dedicated Pool), Synapse Spark, Azure Data Files (SMB, NFS protocols), Cosmos DB (NoSQL), Azure AD (IAM), Azure Purview (Governance)

17
Q

Define a dedicated SQL Pool.

A

Dedicated SQL pool is a Massively Parallel Processing (MPP) system that splits queries into 60 parallel queries. Each smaller query runs on a “DISTRIBUTION,” which acts as the basic unit of processing and storage for the dedicated SQL pool. This architecture allows for high throughput and scalable performance by distributing data and workloads across multiple distributions. It also ensures efficient resource utilization and faster query results by processing data in parallel across many nodes.

18
Q

What are the three different distributions (shards)?

A
  1. Lookup. Think of a library using a master catalog to decide which shelf a book should go on. This strategy uses a catalog-like system to flexibly map data to physical partitions, making rebalancing easier by updating the map without altering how applications access data. However, just like checking a catalog before finding a book, this approach introduces extra lookup overhead.
  2. Range. Imagine organizing books by alphabet ranges (A-M on one shelf, N-Z on another). This method groups similar data ranges together, making it easy to fetch several related items at once and manage them locally (like scheduling region-based updates). The downside is that some ranges (shelves) might get overloaded if they’re very popular, and rebalancing these ranges becomes difficult, much like trying to move a crowded shelf.
  3. Hash. Picture assigning each book a random number to decide its shelf, spreading books evenly across shelves. Hash sharding uses a hash function to distribute data evenly, directly routing requests without a lookup table, which reduces hotspots. However, computing the hash for each lookup is like calculating a random number for every book, adding some overhead, and rebalancing can be hard once books are spread out.
19
Q

______ is the rate at which you ingest data into an Azure Storage systems and the rate at which you move the data out of the Azure Storage system is called the ______.

A

ingress, egress

20
Q

What are the three types of tables supported by a dedicated SQL pool?

A

Clustered Columnstore Tables:
Stores data in a columnar format, placing each column’s data contiguously to optimize compression and speed up analytics queries by scanning only necessary columns.

Ideal for data warehousing, big data analytics, and read-heavy operations where you perform aggregations, filtering, or scans on large datasets but don’t require frequent individual row updates. It excels in scenarios requiring high compression and fast query performance over many rows and columns.

Clustered Index Tables:
Organizes data rows according to a clustered index, physically ordering rows by key to facilitate fast row-based lookups at the cost of potentially slower inserts/updates.

Best for transactional systems or operational databases where quick retrieval of rows by key is critical. Use when you have frequent single-row reads, updates, or deletes based on indexed columns, and maintaining sorted order can greatly improve performance.

HEAP Tables:
Stores data in unstructured pages without a specific order, enabling fast bulk inserts but resulting in slower lookups and updates due to lack of organization.

Suitable for staging tables, temporary data storage, or scenarios where rapid bulk inserts are needed without immediate query performance requirements. Use when data organization isn’t a priority or when you plan to reorganize or index the data later after bulk loading.

21
Q

What is range partitioning?

A

Range partitioning involves organizing data into partitions based on specific value intervals (e.g., all records from January 2023 in one partition, February 2023 in another), which is a different approach than grouping by alphabetical order.

22
Q

What are three key considerations for implementing a partitioning strategy for analytical workloads?

A
  1. Even Distribution - Choose partition key that ensures data is evenly distributed to avoid skewed hotspots
  2. Queriable - Partition data to align with common query patterns to minimize data movement.
  3. Storage Considerations - Be mindful of the cost implications of your partitioning strategy, it may impact storage and resources
23
Q

What Azure services can you utilize to implement a partition strategy for streaming workloads?

A
  • Azure Event Hubs
  • Azure Stream Analytics (ASA)
  • Azure Databricks (ADB)
24
Q

What are the default limits for the number of storage accounts and storage account capacity in Azure?

A
  • Number of Storage Accounts: 250 by default, 500 by request
  • Default maximum storage account capacity, 5 PiB