Azure Synapse Analytics Performance Tuning Flashcards
In a data warehouse, the simplest type of schema is called a
star schema
Why is it called a star schema?
It’s called that because it looks a little bit like a star. It has a FACT TABLE IN THE MIDDLE and DIMENSION TABLES AROUND it. A fact table usually contains either individual events or summaries. For example, a sales database might have a fact table with either one record for each sales transaction or the total value of all sales transactions in a given period of time.
Dimension tables give more details about the items in the fact table. For example, if each sales transaction in the fact table contains a product ID, then there would be a product dimension table that describes that particular product in more detail, including things like brand, model number, and product category.
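As a rough sketch of that example (the table and column names below are made up for illustration, not taken from a real schema), the fact and dimension tables might look like this in T-SQL. Distribution and index options are covered later in these cards.

-- Hypothetical dimension table: one row per product, with descriptive attributes.
CREATE TABLE dbo.DimProduct
(
    ProductKey      INT          NOT NULL,
    Brand           NVARCHAR(50) NOT NULL,
    ModelNumber     NVARCHAR(50) NOT NULL,
    ProductCategory NVARCHAR(50) NOT NULL
);

-- Hypothetical fact table: one row per sales transaction,
-- pointing back to the dimension through ProductKey.
CREATE TABLE dbo.FactSales
(
    SalesTransactionId BIGINT        NOT NULL,
    ProductKey         INT           NOT NULL,
    SalesDate          DATE          NOT NULL,
    SalesAmount        DECIMAL(18,2) NOT NULL
);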
A Fact table usually contains?
A fact table usually contains either individual events or summaries.
What is an integration table or staging table?
One type of table that isn’t part of a star schema is called an integration table. This is generally a table that sits in between a data source and the data warehouse. For example, a common practice when loading data into a dedicated SQL pool is to first load the data into a staging table, perform some transformations on that data, and then load it into the SQL pool.
What is PolyBase?
The fastest way to load data into Synapse Analytics is to use PolyBase. It lets you read data from external sources using T-SQL. There are quite a few steps involved, though. Here’s what you need to do:
Convert your data into supported file formats, such as delimited text (CSV) or Parquet, and put the files in either Blob storage or Data Lake Storage.
Create external tables by using these three T-SQL commands, in this order (a sketch of the full sequence follows these steps):
CREATE EXTERNAL DATA SOURCE,
CREATE EXTERNAL FILE FORMAT, and
CREATE EXTERNAL TABLE
Load the data into a staging table in Synapse Analytics. This is a best practice so you can deal with data loading issues without affecting production tables.
Insert the data into production tables.
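Here’s a minimal T-SQL sketch of steps 2 and 3. The storage location, file paths, and table and column names are placeholders, and a database-scoped credential may also be needed depending on how access to the storage account is configured.

-- Step 2a: point at the external storage (placeholder location).
CREATE EXTERNAL DATA SOURCE SalesDataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://sales@mydatalake.dfs.core.windows.net'
    -- CREDENTIAL = <database-scoped credential>, if required
);

-- Step 2b: describe the file layout.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2)
);

-- Step 2c: define an external table over the files.
CREATE EXTERNAL TABLE dbo.Sales_External
(
    SalesTransactionId BIGINT,
    ProductKey         INT,
    SalesDate          DATE,
    SalesAmount        DECIMAL(18,2)
)
WITH (
    LOCATION = '/sales/',
    DATA_SOURCE = SalesDataLake,
    FILE_FORMAT = CsvFormat
);

-- Step 3: load into a round-robin staging table with CTAS.
CREATE TABLE dbo.Sales_Staging
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS
SELECT * FROM dbo.Sales_External;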
When you’re loading data into staging tables, you should use a round-robin distribution method. This’ll require a bit of explanation.
Tables in Synapse Analytics are actually spread out across 60 data distributions. This is why queries are so fast on this service–they’re massively parallelized. When you run a query, it spawns 60 queries that each run on one data distribution.
To make this work efficiently, you have to decide how the data will be distributed. This is also known as sharding. Synapse Analytics offers three choices: round-robin, hash-distributed, and replicated. A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.
What is a Round-Robin table?
A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.
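For example, a staging table (hypothetical name and columns) could be declared like this; a heap is a common companion choice because it avoids the cost of building a columnstore during the load.

CREATE TABLE dbo.Sales_Staging
(
    SalesTransactionId BIGINT,
    ProductKey         INT,
    SalesDate          DATE,
    SalesAmount        DECIMAL(18,2)
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,  -- rows are spread evenly across the 60 distributions, no optimization
    HEAP                         -- skip columnstore compression while loading
);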
What is a hash-distributed table?
The idea is that you designate one of the columns as the hash key (often a natural or semantic key, such as a billing number, or some other key whose related rows should land on the same distribution). The hash function then uses the value in this column to determine which data distribution a particular row is stored on. As long as you choose a hash key that’s appropriate for the most commonly run queries on this table, query performance will be much better than it would be with a round-robin table.
Ideally, you should choose a distribution column that will spread the rows fairly evenly among the data distributions. If too many of the rows are on the same data distribution, then it will be a hot spot that reduces the advantages of Synapse Analytics’ massively parallel architecture. For example, if you were to choose a date column for the hash key, then all of the rows for a particular date would end up on the same distribution. So a query on that date would only run on that one distribution, which would make the query take much longer than if it were to run across all 60 distributions in parallel.
Here are some characteristics of a good distribution column:
It has many unique values so the rows will be spread out over the 60 distributions.
It’s frequently used in JOINs. If two fact tables are often joined together, then distribute both of the tables on the same join column. That way, rows from the two tables that have the same value in the join column will be stored on the same distribution, so they can be joined together easily. If you don’t have frequent joins, then choose a column that’s often in GROUP BY clauses.
It’s not frequently used in WHERE clauses. Filtering on the distribution column would limit a query to only a few distributions instead of running across all 60.
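Given those characteristics, a minimal sketch of a hash-distributed fact table might look like this (hypothetical names; ProductKey is assumed to be a frequent join column with many unique values).

CREATE TABLE dbo.FactSales
(
    SalesTransactionId BIGINT        NOT NULL,
    ProductKey         INT           NOT NULL,
    SalesDate          DATE          NOT NULL,
    SalesAmount        DECIMAL(18,2) NOT NULL
)
WITH (
    DISTRIBUTION = HASH(ProductKey),  -- the hash key decides which of the 60 distributions stores each row
    CLUSTERED COLUMNSTORE INDEX       -- the usual storage choice for large fact tables
);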
What is a Replicated table?
In contrast to a hash-distributed table, a replicated table is really simple. The entire table gets stored on each of the 60 data distributions. That seems to defeat the purpose of dividing data into 60 distributions, so why would you ever use it? Well, if a relatively small dimension table is frequently used in joins and aggregations, then it’s much more efficient to have a copy of it on every distribution, because joins against it never have to move data between distributions.
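A sketch of a small, frequently joined dimension table declared as replicated (hypothetical names; small tables are often given a clustered rowstore index rather than a columnstore).

CREATE TABLE dbo.DimProduct
(
    ProductKey      INT          NOT NULL,
    Brand           NVARCHAR(50) NOT NULL,
    ModelNumber     NVARCHAR(50) NOT NULL,
    ProductCategory NVARCHAR(50) NOT NULL
)
WITH (
    DISTRIBUTION = REPLICATE,       -- a full copy of the table is kept on each of the 60 distributions
    CLUSTERED INDEX (ProductKey)
);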
In summary, what are the distribution methods for the three types of tables?
To summarize, here are the recommended distribution methods for the three types of tables. Fact tables should be hash-distributed. Dimension tables should be replicated if they’re small or hash-distributed if they’re large. Staging tables should use round-robin distribution.
In addition to distributing a table, you can also partition it by date range. For example, you could partition your sales data by month. That is, each month’s data would be in a separate partition. The biggest benefit of doing this is that you could use something called partition switching.
Let’s say you only keep five years’ worth of sales data in your SQL pool, and at the end of every month, you add the data for the latest month and delete the data for the oldest month. If you have a huge number of rows to add and remove, then it can take a long time and could potentially run into problems during the process. With partition switching, you can load the latest month’s data into a temporary table, and then in the production table, replace the old partition with the new one.
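Here’s a hedged sketch of monthly partitioning and a partition switch. The boundary dates, table names, and partition numbers are illustrative only; the table you switch from must have an identical structure and partition scheme.

-- Fact table partitioned by month on SalesDate.
CREATE TABLE dbo.FactSales
(
    SalesTransactionId BIGINT        NOT NULL,
    ProductKey         INT           NOT NULL,
    SalesDate          DATE          NOT NULL,
    SalesAmount        DECIMAL(18,2) NOT NULL
)
WITH (
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SalesDate RANGE RIGHT FOR VALUES
        ('2024-01-01', '2024-02-01', '2024-03-01'))  -- one boundary per month
);

-- After loading the newest month into dbo.FactSales_NewMonth (same structure and partitioning),
-- swap it into the matching partition of the production table.
ALTER TABLE dbo.FactSales_NewMonth SWITCH PARTITION 2
    TO dbo.FactSales PARTITION 2;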
When you’re loading data into staging tables, which distribution method should you use, and why?
a round-robin distribution method
A round-robin table has a very simple distribution pattern. Rows are distributed evenly across the data distributions. This is why it’s the fastest distribution type for loading data into a staging table. It doesn’t perform any optimization.
It also massively parallelizes the transformations.
What is the recommended distribution method for fact tables with a clustered columnstore index?
Hash distribution
Columnstore indexes are the standard for storing and querying large data warehousing fact tables. This index uses column-based data storage and query processing to achieve gains up to 10 times the query performance in your data warehouse over traditional row-oriented storage.
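If a fact table was originally created as a rowstore, the index can be added afterwards; a minimal sketch with a hypothetical index and table name:

-- Convert an existing rowstore (heap) fact table to a clustered columnstore index.
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
    ON dbo.FactSales;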
What is a columnstore index?
A columnstore index is a technology for storing, retrieving, and managing data by using a columnar data format, called a columnstore.
Key terms and concepts
The following key terms and concepts are associated with columnstore indexes.
Columnstore
A columnstore is data that’s logically organized as a table with rows and columns, and physically stored in a column-wise data format.
Rowstore
A rowstore is data that’s logically organized as a table with rows and columns, and physically stored in a row-wise data format. This format is the traditional way to store relational table data. In SQL Server, rowstore refers to a table where the underlying data storage format is a heap, a clustered index, or a memory-optimized table.
Note
In discussions about columnstore indexes, the terms rowstore and columnstore are used to emphasize the format for the data storage.
Rowgroup
A rowgroup is a group of rows that are compressed into columnstore format at the same time. A rowgroup usually contains the maximum number of rows per rowgroup, which is 1,048,576 rows.
For high performance and high compression rates, the columnstore index slices the table into rowgroups, and then compresses each rowgroup in a column-wise manner. The number of rows in the rowgroup must be large enough to improve compression rates, and small enough to benefit from in-memory operations.
A rowgroup from which all data has been deleted transitions from the COMPRESSED state to the TOMBSTONE state, and is later removed by a background process named the tuple-mover. For more information about rowgroup statuses, see sys.dm_db_column_store_row_group_physical_stats (Transact-SQL).
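For example, a query like this against the DMV named above (shown here as in SQL Server; the table name is hypothetical) lists each rowgroup’s state and row counts:

SELECT t.name AS table_name,
       rg.row_group_id,
       rg.state_desc,      -- e.g. OPEN, CLOSED, COMPRESSED, TOMBSTONE
       rg.total_rows,
       rg.deleted_rows
FROM sys.dm_db_column_store_row_group_physical_stats AS rg
JOIN sys.tables AS t
    ON t.object_id = rg.object_id
WHERE t.name = 'FactSales'
ORDER BY rg.row_group_id;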
Tip
Having too many small rowgroups decreases the columnstore index quality. Up to and including SQL Server 2017 (14.x), a reorganize operation is required to merge smaller COMPRESSED rowgroups, following an internal threshold policy that determines how to remove deleted rows and combine the compressed rowgroups.
Starting with SQL Server 2019 (15.x), a background merge task also works to merge COMPRESSED rowgroups from which a large number of rows have been deleted.
After merging smaller rowgroups, the index quality should be improved.
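A minimal sketch of that reorganize operation, using SQL Server syntax and hypothetical index and table names:

-- Merge small COMPRESSED rowgroups and clean up deleted rows.
ALTER INDEX cci_FactSales ON dbo.FactSales REORGANIZE;

-- Optionally also force open delta rowgroups to be compressed.
ALTER INDEX cci_FactSales ON dbo.FactSales
REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON);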
Why would you need Replicated tables?
For dimension tables (master data or attributes), especially small ones that are frequently used in joins and aggregations.
Dimension tables should be…
replicated if they’re small or hash-distributed if they’re large.
How do I partition by date range?
In addition to distributing a table, you can also partition it by date range. For example, you could partition your sales data by month. That is, each month’s data would be in a separate partition. The biggest benefit of doing this is that you could use something called partition switching.
Let’s say you only keep five years’ worth of sales data in your SQL pool, and at the end of every month, you add the data for the latest month and delete the data for the oldest month. If you have a huge number of rows to add and remove, then it can take a long time and could potentially run into problems during the process. With partition switching, you can load the latest month’s data into a temporary table, and then in the production table, replace the old partition with the new one.