Azure Synapse Analytics Performance Tuning Flashcards

1
Q

In a data warehouse, the simplest type of schema is called a

A

star schema

2
Q

Why is it called a star schema?

A

It’s called that because it looks a little bit like a star. It has a FACT TABLE IN THE MIDDLE and DIMENSION TABLES AROUND it. A fact table usually contains either individual events or summaries. For example, a sales database might have a fact table with either one record for each sales transaction or the total value of all sales transactions in a given period of time.

Dimension tables give more details about the items in the fact table. For example, if each sales transaction in the fact table contains a product ID, then there would be a product dimension table that describes that particular product in more detail, including things like brand, model number, and product category.

3
Q

A fact table usually contains…

A

A fact table usually contains either individual events or summaries.

4
Q

What is an integration table or staging table?

A

One type of table that isn’t part of a star schema is called an integration table. This is generally a table that sits in between a data source and the data warehouse. For example, a common practice when loading data into a dedicated SQL pool is to first load the data into a staging table, perform some transformations on that data, and then load it into the SQL pool.

5
Q

What is PolyBase?

A

The fastest way to load data into Synapse Analytics is to use PolyBase. It lets you read data from external sources using T-SQL. There are quite a few steps involved, though. Here’s what you need to do:

Convert your data into supported file formats, such as delimited text (CSV) or Parquet, and put the files in either Blob Storage or Data Lake Storage.
Create external tables by using these three T-SQL commands in this order:
CREATE EXTERNAL DATA SOURCE,

CREATE EXTERNAL FILE FORMAT, and

CREATE EXTERNAL TABLE

Load the data into a staging table in Synapse Analytics. This is a best practice so you can deal with data loading issues without affecting production tables.
Insert the data into production tables.
When you’re loading data into staging tables, you should use a round-robin distribution method. This’ll require a bit of explanation.

Tables in Synapse Analytics are actually spread out across 60 data distributions. This is why queries are so fast on this service: they’re massively parallelized. When you run a query, it spawns 60 queries that each run on one data distribution.

To make this work efficiently, you have to decide how the data will be distributed. This is also known as sharding. Synapse Analytics offers three choices: round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.
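
A minimal sketch of this flow, assuming the files are CSVs in Data Lake Storage; all object names (SalesLake, CsvFormat, ext_Sales, stg_Sales, FactSales) and the storage location are placeholders:

-- 1. Register the external storage location (may also require a CREDENTIAL)
CREATE EXTERNAL DATA SOURCE SalesLake
WITH ( TYPE = HADOOP,
       LOCATION = 'abfss://data@mydatalake.dfs.core.windows.net' );

-- 2. Describe the file layout
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH ( FORMAT_TYPE = DELIMITEDTEXT,
       FORMAT_OPTIONS ( FIELD_TERMINATOR = ',', FIRST_ROW = 2 ) );

-- 3. Define an external table over the files
CREATE EXTERNAL TABLE ext_Sales
( SaleId INT, ProductId INT, SaleDate DATE, Amount DECIMAL(18,2) )
WITH ( LOCATION = '/sales/', DATA_SOURCE = SalesLake, FILE_FORMAT = CsvFormat );

-- 4. Load into a round-robin staging table, then insert into production
CREATE TABLE stg_Sales
WITH ( DISTRIBUTION = ROUND_ROBIN, HEAP )
AS SELECT * FROM ext_Sales;

INSERT INTO dbo.FactSales SELECT * FROM stg_Sales;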

6
Q

What is a Round-Robin table?

A

A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.

7
Q

What is a hash-distributed table?

A

The idea is that you designate one of the columns as the hash key (often a business key, such as a billing or order number). Then the hash function uses the value in this column to determine which data distribution to store a particular row on. As long as you choose a hash key that’s appropriate for the most commonly run queries on this table, then query performance will be much better than it would be with a round-robin table.

Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics’ massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.

Here are some characteristics of a good distribution column:

It has many unique values so the rows will be spread out over the 60 distributions.
It’s frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don’t have frequent joins, then choose a column that’s often in GROUP BY clauses.
It’s not used in WHERE clauses, as this would limit query matches to only a few distributions.
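
For example, a fact table distributed on a hypothetical ProductId column (many unique values, frequently joined) could be created like this; all names are illustrative:

CREATE TABLE dbo.FactSales
( SaleId    INT           NOT NULL,
  ProductId INT           NOT NULL,  -- hash key: many unique values, used in JOINs
  SaleDate  DATE          NOT NULL,
  Amount    DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(ProductId) );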

8
Q

What is a Replicated table?

A

In contrast to a hash-distributed table, a replicated table is really simple. The entire table gets stored on each of the 60 data distributions. That seems to defeat the purpose of dividing data into 60 distributions, doesn’t it, so why would you ever use it? Well, if a relatively small dimension table is frequently used in joins and aggregations, then it will be much more efficient to have it on every distribution.
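
A sketch of a small dimension table declared as replicated (names illustrative):

CREATE TABLE dbo.DimProduct
( ProductId INT NOT NULL,
  Brand     NVARCHAR(50),
  Category  NVARCHAR(50) )
WITH ( DISTRIBUTION = REPLICATE );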

9
Q

In summary, what are the distribution methods for the three types of tables?

A

To summarize, here are the recommended distribution methods for the three types of tables. Fact tables should be hash-distributed. Dimension tables should be replicated if they’re small or hash-distributed if they’re large. Staging tables should use round-robin distribution.

In addition to distributing a table, you can also partition it by date range. For example, you could partition your sales data by month. That is, each month’s data would be in a separate partition. The biggest benefit of doing this is that you could use something called partition switching.

Let’s say you only keep five years’ worth of sales data in your SQL pool, and at the end of every month, you add the data for the latest month and delete the data for the oldest month. If you have a huge number of rows to add and remove, then it can take a long time and could potentially run into problems during the process. With partition switching, you can load the latest month’s data into a temporary table, and then in the production table, replace the old partition with the new one.
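
A hedged sketch of the switch itself, assuming the staging table has the same schema and partition boundaries as the production table and that partition 2 holds the new month (names and partition numbers are illustrative):

-- Metadata-only operation: swaps the new month's partition into the production table
ALTER TABLE dbo.stg_SalesNewMonth
SWITCH PARTITION 2 TO dbo.FactSales PARTITION 2;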

10
Q

When you’re loading data into staging tables, which distribution method should you use, and why?

A

a round-robin distribution method

A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.

It massively parallelizes the transformations.

11
Q

What is the recommended distribution method for fact tables with a clustered columnstore index?

A

Hash distribution

Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage.
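
Clustered columnstore is the default index type in dedicated SQL pools, but it can be spelled out explicitly together with hash distribution; a sketch with illustrative names:

CREATE TABLE dbo.FactSales
( SaleId    INT           NOT NULL,
  ProductId INT           NOT NULL,
  Amount    DECIMAL(18,2) NOT NULL )
WITH ( CLUSTERED COLUMNSTORE INDEX,
       DISTRIBUTION = HASH(ProductId) );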

12
Q

What is a columnstore index?

A

A columnstore index is a technology for storing, retrieving, and managing data by using a columnar data format, called a columnstore.

Key terms and concepts
The following key terms and concepts are associated with columnstore indexes.

Columnstore
A columnstore is data that’s logically organized as a table with rows and columns, and physically stored in a column-wise data format.

Rowstore
A rowstore is data that’s logically organized as a table with rows and columns, and physically stored in a row-wise data format. This format is the traditional way to store relational table data. In SQL Server, rowstore refers to a table where the underlying data storage format is a heap, a clustered index, or a memory-optimized table.

Note

In discussions about columnstore indexes, the terms rowstore and columnstore are used to emphasize the format for the data storage.

Rowgroup
A rowgroup is a group of rows that are compressed into columnstore format at the same time. A rowgroup usually contains the maximum number of rows per rowgroup, which is 1,048,576 rows.

For high performance and high compression rates, the columnstore index slices the table into rowgroups, and then compresses each rowgroup in a column-wise manner. The number of rows in the rowgroup must be large enough to improve compression rates, and small enough to benefit from in-memory operations.

A rowgroup from which all data has been deleted transitions from the COMPRESSED to the TOMBSTONE state, and is later removed by a background process named the tuple-mover. For more information about rowgroup statuses, see sys.dm_db_column_store_row_group_physical_stats (Transact-SQL).

Tip

Having too many small rowgroups decreases the columnstore index quality. Up to SQL Server 2017 (14.x), a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups.
Starting with SQL Server 2019 (15.x), a background merge task also works to merge COMPRESSED rowgroups from which a large number of rows have been deleted.
After merging smaller rowgroups, the index quality should be improved.
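
In SQL Server, rowgroup health (state, row counts, deleted rows) can be inspected through the DMV mentioned above; a minimal sketch:

SELECT OBJECT_NAME(object_id) AS table_name,
       row_group_id,
       state_desc,    -- e.g. OPEN, COMPRESSED, TOMBSTONE
       total_rows,
       deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats
ORDER BY table_name, row_group_id;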

13
Q

Why would you need Replicated tables?

A

For small dimension tables (master data or attributes) that are frequently used in joins and aggregations.

14
Q

Dimension tables should be…

A

replicated if they’re small or hash-distributed if they’re large.

15
Q

How do I partition by date range?

A

In addition to distributing a table, you can also partition it by date range. For example, you could partition your sales data by month. That is, each month’s data would be in a separate partition. The biggest benefit of doing this is that you could use something called partition switching.

Let’s say you only keep five years’ worth of sales data in your SQL pool, and at the end of every month, you add the data for the latest month and delete the data for the oldest month. If you have a huge number of rows to add and remove, then it can take a long time and could potentially run into problems during the process. With partition switching, you can load the latest month’s data into a temporary table, and then in the production table, replace the old partition with the new one.
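
A sketch of monthly range partitioning on the date column (boundary values and names are illustrative):

CREATE TABLE dbo.FactSales
( SaleId   INT           NOT NULL,
  SaleDate DATE          NOT NULL,
  Amount   DECIMAL(18,2) NOT NULL )
WITH ( DISTRIBUTION = HASH(SaleId),
       PARTITION ( SaleDate RANGE RIGHT FOR VALUES
                   ('2024-01-01', '2024-02-01', '2024-03-01') ) );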

16
Q

Before running a query…

A

the SQL pool’s query optimizer decides which query plan to use. For example, it will estimate how many rows will be returned by the query, and it will use one query plan if it’s a small number of rows and a different query plan if it’s a large number of rows.

17
Q

How does the SQL pool’s query optimizer decide which query plan to use?

A

In order to make the right decision, it needs to know your data pretty well. The way it does that is by generating statistics about important columns, such as columns that are used to join two tables together. Fortunately, SQL pools generate statistics automatically by default. However, these statistics are only generated when the SQL pool sees that statistics it needs for an incoming query are missing. So, if the statistics aren’t already there, the query will be slower the first time it’s run because the statistics have to be generated. This is why it’s important to generate statistics ahead of time, if possible.
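
Statistics can also be created ahead of time on the columns you expect to matter; a minimal sketch with illustrative names:

CREATE STATISTICS stats_FactSales_ProductId
ON dbo.FactSales (ProductId);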

18
Q

Do I need to update statistics one time only?

A

Even if statistics have already been created, they will become out-of-date as new data gets added. One approach is to update the statistics whenever new data is loaded. However, it would probably be excessive to update the statistics on every column, so you should focus on the ones that are important for query performance, such as date columns and columns that are used in JOIN, GROUP BY, ORDER BY, or DISTINCT clauses in SQL queries.
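
For example, after a load you might refresh a single statistics object or all statistics on a table (names illustrative):

UPDATE STATISTICS dbo.FactSales (stats_FactSales_ProductId);  -- one statistics object
UPDATE STATISTICS dbo.FactSales;                              -- all statistics on the table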

19
Q

What can I do if a query takes too long?

A

If a query takes much longer than you expect, then you can investigate it by querying the logs. You need to have the VIEW DATABASE STATE permission to do this. Even if you don’t know of a particular query that took a long time, you can query the logs to find the ones that took the longest time to execute.

Once you know which query you want to investigate, you can retrieve both the SQL statement that was used and the query plan that was executed. You can also find out how many rows came from each distribution. If a large number of rows came from one distribution, this could indicate that you have a hot spot. This is also known as data skew. To fix data skew, use a more appropriate column as the hash key so the rows are distributed more evenly.
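
A hedged sketch of that investigation in a dedicated SQL pool, using the request DMVs (the request ID is a placeholder):

-- Requires VIEW DATABASE STATE: find the slowest recent requests
SELECT TOP 10 request_id, command, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
ORDER BY total_elapsed_time DESC;

-- Then check per-distribution row counts for a suspect request to spot data skew
SELECT step_index, distribution_id, row_count
FROM sys.dm_pdw_sql_requests
WHERE request_id = 'QID1234';   -- placeholder request ID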

20
Q

Can I use caching to speed up query executions?

A

One feature that dramatically speeds up queries is caching. If you enable result set caching, then the results from all queries (with a few exceptions) get cached. If you or someone else runs the same query again, and the data hasn’t changed in the meantime, then the results are retrieved from the cache instead of from executing the query. Not only does this make the query lightning-fast, but it’s also “free” because it doesn’t use one of your concurrency slots to execute the query.
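
Result set caching is switched on at the database level, from the master database; a sketch assuming a pool named mySqlPool (a placeholder name):

-- Run in master
ALTER DATABASE mySqlPool SET RESULT_SET_CACHING ON;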

21
Q

What is the maximum size of the cache? How long does it last?

A

The maximum size of the cache is one terabyte per database, which is a lot, but it would still fill up eventually if it wasn’t managed. When the cache is almost full, the SQL pool removes results that haven’t been retrieved recently or that are no longer valid due to data changes. Even if the cache is not close to being full, result sets that haven’t been retrieved for 48 hours are removed.

22
Q

Best practice says that if you plan to run queries that will return large amounts of data…

A

then you should disable result set caching on the database before running them.
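
That can be done for the whole database, or just for the current session; a sketch (the pool name is a placeholder):

ALTER DATABASE mySqlPool SET RESULT_SET_CACHING OFF;  -- whole database (run in master)
SET RESULT_SET_CACHING OFF;                           -- current session only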

23
Q

How does Synapse ensure each query has enough resources to execute efficiently?

A

To ensure each query has enough resources to execute efficiently, Synapse SQL tracks resource utilization by assigning concurrency slots to each query. The system puts queries into a queue based on importance and concurrency slots. Queries wait in the queue until enough concurrency slots are available. Importance and concurrency slots determine CPU prioritization.

24
Q

What are workload groups?

A

Workload groups are CONTAINERS FOR A SET OF REQUESTS and are the basis for how workload management, including workload isolation, is configured on a system. Workload groups are created using the CREATE WORKLOAD GROUP syntax. A simple workload management configuration can manage data loads and user queries. For example, a workload group named wgDataLoads will define workload aspects for data being loaded into the system. Also, a workload group named wgUserQueries will define workload aspects for users running queries to read data from the system.
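
A minimal sketch of those two groups (the percentage values are illustrative, not recommendations):

CREATE WORKLOAD GROUP wgDataLoads
WITH ( MIN_PERCENTAGE_RESOURCE            = 30
     , CAP_PERCENTAGE_RESOURCE            = 50
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 10 );

CREATE WORKLOAD GROUP wgUserQueries
WITH ( MIN_PERCENTAGE_RESOURCE            = 0
     , CAP_PERCENTAGE_RESOURCE            = 100
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 3 );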

25
Q

What is Resource governance?

A

Workload groups govern memory and CPU resources. Disk and network IO as well as tempdb are not governed. Resource governance for memory and CPU is as follows:

Memory is governed at the request level and held throughout the duration of the request. See Resources per request definition for further details on how to configure the amount of memory per request. The MIN_PERCENTAGE_RESOURCE parameter for the workload group dedicates memory to that workload group exclusively. The CAP_PERCENTAGE_RESOURCE parameter for the workload group is a hard limit on the memory a workload group can consume.

CPU resources are governed at the workload group level and shared by all requests within a workload group. CPU resources are fluid compared to memory which is dedicated to a request for the duration of execution. Given CPU is a fluid resource, unused CPU resources can be consumed by all workload groups. This means that CPU utilization can exceed the CAP_PERCENTAGE_RESOURCE parameter for the workload group. This also means that the MIN_PERCENTAGE_RESOURCE parameter for the workload group is not a hard reservation like memory is. When CPU resources are under contention, utilization will align to the CAP_PERCENTAGE_RESOURCE definition for workload groups.

26
Q

What is Workload Isolation?

A

Workload isolation means resources are reserved, exclusively, for a workload group. Workload isolation is achieved by configuring the MIN_PERCENTAGE_RESOURCE parameter to greater than zero in the CREATE WORKLOAD GROUP syntax. For continuous execution workloads that need to adhere to tight SLAs, isolation ensures resources are always available for the workload group.

Configuring workload isolation implicitly defines a guaranteed level of concurrency. For example, a workload group with MIN_PERCENTAGE_RESOURCE set to 30% and REQUEST_MIN_RESOURCE_GRANT_PERCENT set to 2% is guaranteed 15 concurrency. The level of concurrency is guaranteed because fifteen 2% slots of resources are reserved within the workload group at all times (regardless of how REQUEST_MAX_RESOURCE_GRANT_PERCENT is configured). If REQUEST_MAX_RESOURCE_GRANT_PERCENT is greater than REQUEST_MIN_RESOURCE_GRANT_PERCENT and CAP_PERCENTAGE_RESOURCE is greater than MIN_PERCENTAGE_RESOURCE, additional resources can be added per request (based on resource availability). If REQUEST_MAX_RESOURCE_GRANT_PERCENT and REQUEST_MIN_RESOURCE_GRANT_PERCENT are equal and CAP_PERCENTAGE_RESOURCE is greater than MIN_PERCENTAGE_RESOURCE, additional concurrency is possible. Consider the below method for determining guaranteed concurrency:

[Guaranteed Concurrency] = [MIN_PERCENTAGE_RESOURCE] / [REQUEST_MIN_RESOURCE_GRANT_PERCENT]
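
For the 30% / 2% example above, the definition and the arithmetic look like this (a sketch; the group name is a placeholder):

CREATE WORKLOAD GROUP wgSLAReports
WITH ( MIN_PERCENTAGE_RESOURCE            = 30   -- reserved exclusively for this group
     , CAP_PERCENTAGE_RESOURCE            = 100
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 2 ); -- each request gets at least 2%

-- Guaranteed concurrency = 30 / 2 = 15 concurrent requests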

Note

There are specific service level minimum values for min_percentage_resource. For more information, see Effective Values for further details.

In the absence of workload isolation, requests operate in the shared pool of resources. Access to resources in the shared pool is not guaranteed and is assigned on an importance basis.

Configuring workload isolation should be done with caution as the resources are allocated to the workload group even if there are no active requests in the workload group. Over-configuring isolation can lead to diminished overall system utilization.

Users should avoid a workload management solution that configures 100% workload isolation: 100% isolation is achieved when the sum of min_percentage_resource configured across all workload groups equals 100%. This type of configuration is overly restrictive and rigid, leaving little room for resource requests that are accidentally mis-classified. There is a provision to allow one request to execute from workload groups not configured for isolation. The resources allocated to this request will show up as a zero in the systems DMVs and borrow a smallrc level of resource grant from system reserved resources.

27
Q

What is Workload containment?

A

Workload containment refers to limiting the amount of resources a workload group can consume. Workload containment is achieved by configuring the CAP_PERCENTAGE_RESOURCE parameter to less than 100 in the CREATE WORKLOAD GROUP syntax. Consider the scenario whereby users need read access to the system so they can run a what-if analysis via ad-hoc queries. These types of requests could have a negative impact on other workloads that are running on the system. Configuring containment ensures the amount of resources is limited.

Configuring workload containment implicitly defines a maximum level of concurrency. With a CAP_PERCENTAGE_RESOURCE set to 60% and a REQUEST_MIN_RESOURCE_GRANT_PERCENT set to 1%, up to a 60-concurrency level is allowed for the workload group. Consider the method included below for determining the maximum concurrency:

[Max Concurrency] = [CAP_PERCENTAGE_RESOURCE] / [REQUEST_MIN_RESOURCE_GRANT_PERCENT]
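
For the 60% / 1% example, a sketch (the group name is a placeholder):

CREATE WORKLOAD GROUP wgAdhocUsers
WITH ( MIN_PERCENTAGE_RESOURCE            = 0
     , CAP_PERCENTAGE_RESOURCE            = 60   -- hard ceiling for this group
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 1 );

-- Max concurrency = 60 / 1 = 60 concurrent requests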

Note

The effective CAP_PERCENTAGE_RESOURCE of a workload group will not reach 100% when workload groups with MIN_PERCENTAGE_RESOURCE at a level greater than zero are created. See sys.dm_workload_management_workload_groups_stats for effective runtime values.

28
Q

Explain Resources per request definition

A

Workload groups provide a mechanism to define the min and max amount of resources that are allocated per request with the REQUEST_MIN_RESOURCE_GRANT_PERCENT and REQUEST_MAX_RESOURCE_GRANT_PERCENT parameters in the CREATE WORKLOAD GROUP syntax. Resource in this case is memory. CPU resource governance is covered in the Resource governance section.

Note

REQUEST_MAX_RESOURCE_GRANT_PERCENT is an optional parameter that defaults to the same value that is specified for REQUEST_MIN_RESOURCE_GRANT_PERCENT.

Like choosing a resource class, configuring REQUEST_MIN_RESOURCE_GRANT_PERCENT sets the value for the resources utilized by a request. The amount of resources indicated by the set value is guaranteed for allocation to the request before it begins execution. For customers migrating from resource classes to workload groups, consider following the How To article to map from resources classes to workload groups as a starting point.

Configuring REQUEST_MAX_RESOURCE_GRANT_PERCENT to a value greater than REQUEST_MIN_RESOURCE_GRANT_PERCENT allows the system to allocate more resources per request. While scheduling a request, the system determines the actual resource allocation to the request, which is between REQUEST_MIN_RESOURCE_GRANT_PERCENT and REQUEST_MAX_RESOURCE_GRANT_PERCENT, based on resource availability in the shared pool and the current load on the system. The resources must exist in the shared pool of resources when the query is scheduled.
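
A sketch of a group whose requests can flex between a 2% minimum and an 8% maximum grant, depending on available resources (name and values illustrative):

CREATE WORKLOAD GROUP wgElasticQueries
WITH ( MIN_PERCENTAGE_RESOURCE            = 0
     , CAP_PERCENTAGE_RESOURCE            = 100
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 2
     , REQUEST_MAX_RESOURCE_GRANT_PERCENT = 8 );  -- extra memory per request when the shared pool allows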

Note

REQUEST_MIN_RESOURCE_GRANT_PERCENT and REQUEST_MAX_RESOURCE_GRANT_PERCENT have effective values that are dependent on the effective MIN_PERCENTAGE_RESOURCE and CAP_PERCENTAGE_RESOURCE values. See sys.dm_workload_management_workload_groups_stats for effective runtime values.

29
Q

Explain Execution rules

A

Execution Rules
On ad-hoc reporting systems, customers can accidentally execute runaway queries that severely impact the productivity of others. System admins are forced to spend time killing runaway queries to free up system resources. Workload groups offer the ability to configure a query execution timeout rule to cancel queries that have exceeded the specified value. The rule is configured by setting the QUERY_EXECUTION_TIMEOUT_SEC parameter in the CREATE WORKLOAD GROUP syntax.
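
A sketch of a group that cancels any query running longer than 10 minutes (name and values illustrative):

CREATE WORKLOAD GROUP wgAdhocReporting
WITH ( MIN_PERCENTAGE_RESOURCE            = 0
     , CAP_PERCENTAGE_RESOURCE            = 100
     , REQUEST_MIN_RESOURCE_GRANT_PERCENT = 3
     , QUERY_EXECUTION_TIMEOUT_SEC        = 600 );  -- cancel queries after 600 seconds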

30
Q

Explain Shared pool resources

A

Shared pool resources
Shared pool resources are the resources not configured for isolation. Workload groups with a MIN_PERCENTAGE_RESOURCE set to zero leverage resources in the shared pool to execute requests. Workload groups with a CAP_PERCENTAGE_RESOURCE greater than MIN_PERCENTAGE_RESOURCE also use shared resources. The amount of resources available in the shared pool is calculated as follows.

[Shared Pool] = 100 - [sum of MIN_PERCENTAGE_RESOURCE across all workload groups]

Access to resources in the shared pool is allocated on an importance basis. Requests with the same importance level will access shared pool resources on a first in/first out basis.

31
Q

In a hash-distributed table, you designate one of the columns as the hash key. Then the hash function uses the value in this column to determine which data distribution to store a particular row on. Some characteristics of a good distribution column are…

A

Here are some characteristics of a good distribution column:

It has many unique values.
It’s frequently used in JOINs or GROUP BY clauses.
And it’s not used in WHERE clauses.

32
Q

What is an integration table?

A

An integration table is a table that sits in between a data source and the data warehouse. One example is a staging table. PolyBase lets you read data from external sources into a staging table using T-SQL. You can create an external table by using these T-SQL commands:

CREATE EXTERNAL DATA SOURCE,

CREATE EXTERNAL FILE FORMAT, and

CREATE EXTERNAL TABLE

33
Q

What the heck is PolyBase?

A

What is PolyBase?
PolyBase enables your SQL Server instance to query data with T-SQL directly from SQL Server, Oracle, Teradata, MongoDB, Hadoop clusters, and Cosmos DB without separately installing client connection software. You can also use the generic ODBC connector to connect to additional providers using third-party ODBC drivers. PolyBase allows T-SQL queries to join the data from external sources to relational tables in an instance of SQL Server.

A key use case for data virtualization with the PolyBase feature is to allow the data to stay in its original location and format. You can virtualize the external data through the SQL Server instance, so that it can be queried in place like any other table in SQL Server. This process minimizes the need for ETL processes for data movement. This data virtualization scenario is possible with the use of PolyBase connectors.

Supported SQL products and services
PolyBase provides these same functionalities for the following SQL products from Microsoft:

SQL Server 2016 (13.x) and later versions (Windows only)
SQL Server 2019 (15.x) and later versions (Linux)
SQL Server Parallel Data Warehouse (PDW), hosted in the Analytics Platform System (APS)
Azure Synapse Analytics
Note

Data virtualization using PolyBase feature is available in preview for Azure SQL Managed Instance, scoped to querying external data stored in files in Azure Data Lake Storage (ADLS) Gen2 and Azure Blob Storage. Visit Data virtualization with Azure SQL Managed Instance to learn more.

34
Q

The default type of index for tables in SQL pools is …, and this type of table should usually use …

A

a clustered columnstore index

the hash distribution method