Chapter 15 Implementing Indexes and Statistics Flashcards

Question

What is a columnstore index?

Answer 1

In addition to regular row storage, SQL Server 2012 can store index data column by column in what's called a columnstore index. Columnstore indexes can speed up data warehousing queries by a large factor, from 10 to even 100 times. A columnstore index is just another nonclustered index on a table. The SQL Server Query Optimizer considers using the columnstore index during the query optimization phase just as it does any other index. All you have to do to take advantage of this feature is create a columnstore index on a table. A columnstore index is stored compressed; the compressed index can be up to 10 times smaller than its uncompressed size. Columnstore indexes use their own compression algorithm; you cannot use row or page compression on a columnstore index. When a query references a single column that is part of a columnstore index, SQL Server fetches only that column from disk; it doesn't fetch entire rows as with row storage. Together with compression, this reduces disk I/O and memory cache consumption. On the other hand, SQL Server has to return rows, so rows must be reconstructed when you execute a query. This row reconstruction takes some time and uses some CPU and memory resources.

Answer 2

Columnstore indexes accelerate data warehouse queries, not OLTP workloads. Because of the row reconstruction issues and other overhead when you update compressed data, a table that contains a columnstore index becomes read-only in SQL Server 2012. If you want to update such a table, you must first drop the columnstore index.
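A minimal sketch of the create, drop, load, recreate cycle described in Answers 1 and 2. The table dbo.FactSales, its columns, and the index name are hypothetical and only serve the illustration.

```sql
-- Hypothetical fact table and columns; adjust to your own schema.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_ncs_FactSales
ON dbo.FactSales (DateKey, ProductKey, CustomerKey, Quantity, SalesAmount);
GO

-- While the columnstore index exists (SQL Server 2012), the table is read-only.
-- To modify data, drop the index first...
DROP INDEX idx_ncs_FactSales ON dbo.FactSales;
GO

-- ...change the data...
INSERT INTO dbo.FactSales (DateKey, ProductKey, CustomerKey, Quantity, SalesAmount)
VALUES (20120101, 1, 1, 10, 100.00);
GO

-- ...and recreate the index afterward.
CREATE NONCLUSTERED COLUMNSTORE INDEX idx_ncs_FactSales
ON dbo.FactSales (DateKey, ProductKey, CustomerKey, Quantity, SalesAmount);
GO
```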

Answer 3

The columnstore index is divided into units called segments. Segments are stored as large objects and consist of multiple pages. Segments are the unit of transfer from disk to memory. Each segment has metadata that stores the min and max value of each column for that segment. This enables early segment elimination in the storage engine. SQL Server loads only those segments requested by a query into memory.
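As a rough illustration, segment metadata can be inspected through the sys.column_store_segments catalog view. The table name dbo.FactSales is a hypothetical placeholder; min_data_id and max_data_id hold the per-segment minimum and maximum in encoded form, not necessarily the raw column values.

```sql
-- One row per column segment of the columnstore index on the assumed table.
SELECT s.column_id,
       s.segment_id,
       s.row_count,
       s.min_data_id,   -- encoded minimum for this segment
       s.max_data_id,   -- encoded maximum for this segment
       s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
  ON p.hobt_id = s.hobt_id
WHERE p.object_id = OBJECT_ID(N'dbo.FactSales')   -- hypothetical table
ORDER BY s.column_id, s.segment_id;
```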

Answer 4

You can optimize queries that aggregate data and perform multiple joins by permanently storing the aggregated and joined data. You can create a view with a query that joins and aggregates data. Then you can index the view to get an indexed view. By indexing a view, you materialize it. Note that the view must be created with the SCHEMABINDING option if you want to index it. In addition, if the view aggregates data with GROUP BY, its SELECT list must include the COUNT_BIG(*) aggregate function. After you create the view and the index, execute the aggregate query again and measure the I/O to see the difference.
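A sketch of an indexed view, assuming the sample tables Sales.Orders and Sales.OrderDetails with NOT NULL columns qty and unitprice (SUM over a nullable expression is not allowed in an indexed view). The view and index names are made up for this example.

```sql
-- Indexed views require specific session settings, including these two.
SET ANSI_NULLS ON;
SET QUOTED_IDENTIFIER ON;
GO

CREATE VIEW Sales.OrderTotalsByCustomer      -- hypothetical view name
WITH SCHEMABINDING                           -- required before the view can be indexed
AS
SELECT o.custid,
       SUM(od.qty * od.unitprice) AS total_amount,
       COUNT_BIG(*)               AS row_count   -- required with GROUP BY
FROM Sales.Orders AS o                       -- two-part names are mandatory
JOIN Sales.OrderDetails AS od
  ON od.orderid = o.orderid
GROUP BY o.custid;
GO

-- The first index on a view must be unique and clustered; it materializes the view.
CREATE UNIQUE CLUSTERED INDEX idx_cl_OrderTotalsByCustomer
ON Sales.OrderTotalsByCustomer (custid);
GO

-- Compare logical reads before and after creating the index.
SET STATISTICS IO ON;
SELECT custid, total_amount
FROM Sales.OrderTotalsByCustomer;
```

Depending on the edition, the optimizer may use the view's index only when the query references the view with the NOEXPAND hint.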

Answer 5

A table can have only one clustered index, because the clustered index is the table itself, organized as a balanced tree.

Answer 6

Clustering key

Answer 7

Root level, intermediate level, leaf level.

Answer 8

The WHERE clause is one of the most important parts of a query that can benefit from an index. You can check whether an index was used by displaying the estimated or actual execution plan. You can also track index usage by querying the sys.dm_db_index_usage_stats DMV.
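A small sketch of checking index usage through the DMV, assuming the sample table Sales.Orders. The counters are reset when the instance restarts, so they only reflect activity since then.

```sql
-- Seeks, scans, lookups, and updates per index on the assumed table.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       i.name                   AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id
 AND i.index_id  = s.index_id
WHERE s.database_id = DB_ID()
  AND s.object_id   = OBJECT_ID(N'Sales.Orders');
```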

Answer 9

The execution plan will show that SQL Server used (for example) a clustered index scan. The whole table was scanned, regardless of how many indexes there are.

Answer 10

When the Ordered property of the operator is set to False, the scan is unordered - also known as an allocation scan. Remember that order is not guaranteed if you do not include the ORDER BY clause.

Answer 11

Adding a WHERE clause to a query does not guarantee that an index is going to be used. The clause has to be supported by an appropriate index, and it must be selective enough. If the query returns too many rows, it is less expensive for SQL Server to perform a table or clustered index scan than to do a nonclustered index seek and then RID or key lookups.

Answer 12

Yes. SQL Server can aggregate data by using a hash or a stream aggregate operator. The stream aggregate is faster; however, it needs sorted input. The hash match (aggregate) operator is used when an aggregate query is not supported by an index. An aggregate query can benefit from an index even if it does not include the GROUP BY clause. For example, if you use the MIN aggregate function and you have an appropriate index, then SQL Server can seek for the first value of an index only, and does not have to scan the entire table.
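Two example queries for this answer, assuming the sample table Sales.Orders with indexes whose leading key columns are orderdate and empid respectively. Which operators actually appear depends on the optimizer's choices, so verify with the execution plan.

```sql
-- With an index on orderdate, MIN can be satisfied by reading
-- only the first entry of the index instead of scanning the table.
SELECT MIN(orderdate) AS first_orderdate
FROM Sales.Orders;

-- With an index on empid, rows arrive already sorted by the grouping
-- column, which makes a Stream Aggregate possible; without such an
-- index, a Hash Match (Aggregate) is more likely.
SELECT empid, COUNT(*) AS num_orders
FROM Sales.Orders
GROUP BY empid;
```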

Answer 13

Yes. If there is no appropriate index for the ORDER BY clause, SQL Server must sort data before returning it. Sorting large data sets could be a big performance hit on SQL Server. The data needs to be sorted in memory or must be spilled to tempdb if it does not fit in memory.

Answer 14

When an index contains all the columns referenced by a query, the index is said to cover the query. Covered queries are very efficient.

Answer 15

In an attempt to cover more queries with a nonclustered index, you could try to add more columns to the key. However, with a longer key, the index would become less efficient. SQL Server 2012 allows you to include a column in a nonclustered index on the leaf level only and not as part of the key. You can do this by using the INCLUDE clause of the CREATE INDEX statement. The included column is not part of the key, and SQL Server does not use it for seeks. Included columns help cover queries. An index with nonkey columns can significantly improve query performance when all columns in the query are included in the index as key or nonkey columns. Performance gains are achieved because the query optimizer can locate all the column values within the index, resulting in less I/O. However, you should be careful not to include too many columns. For example, if you included all columns of a table, you would actually copy the table.
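A sketch of the INCLUDE clause, assuming the sample table Sales.Orders with columns custid, orderdate, and freight. The index name is made up.

```sql
-- custid is the key (usable for seeks); orderdate and freight are
-- stored on the leaf level only, to cover queries.
CREATE NONCLUSTERED INDEX idx_nc_custid_i_orderdate_freight
ON Sales.Orders (custid)
INCLUDE (orderdate, freight);

-- This query references only key and included columns,
-- so the index covers it and no lookups are needed.
SELECT custid, orderdate, freight
FROM Sales.Orders
WHERE custid = 42;
```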

Answer 16

A SARG (searchable argument) in a predicate helps the Query Optimizer decide to use an index. To write an appropriate SARG, you must ensure that a column that has an index on it appears in the predicate alone and not as a function parameter. The column name is alone on one side of the expression, and the constant or calculated value appears on the other side. Inclusive operators include =, >, <, >=, <=, BETWEEN, and LIKE. However, the LIKE operator is only inclusive if you don't use a wildcard % or _ at the *beginning* of the string you are comparing the column to.
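A few contrasting predicates, assuming the sample table Sales.Orders with indexes on orderdate and shipname. The column names are assumptions for the illustration.

```sql
-- Not a SARG: the indexed column is a function parameter.
SELECT orderid
FROM Sales.Orders
WHERE YEAR(orderdate) = 2007;

-- SARG: the column stands alone on one side of each comparison.
SELECT orderid
FROM Sales.Orders
WHERE orderdate >= '20070101'
  AND orderdate <  '20080101';

-- LIKE is searchable when the pattern does not start with a wildcard...
SELECT orderid
FROM Sales.Orders
WHERE shipname LIKE N'Ship to 1%';

-- ...and not searchable when it does.
SELECT orderid
FROM Sales.Orders
WHERE shipname LIKE N'%1';
```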

Answer 17

The Query Optimizer converts the IN operator to OR with a separate comparison to each element from the IN operator list.

Answer 18

WHERE, JOIN, GROUP BY, and ORDER BY.

Answer 19

Using the AND operator in the WHERE clause predicate means that each part of the predicate limits the result set even more than the previous part. The Query Optimizer understands how the logical AND operator works, and can use appropriate indexes. However, the logical OR operator is inclusive. If the two conditions use two different columns, then SQL Server conservatively takes the worst case and estimates that the query would return the maximum number of rows. Connecting multiple conditions in a predicate with the OR operator lowers the chance that SQL Server uses indexes. You should consider rewriting the predicate to a logically equivalent predicate that uses the AND operator.

Answer 20

You could modify the index that is already used to INCLUDE the columns from the SELECT list that are not part of the key.

Answer 21

SQL Server sorts data in memory or spills the data to tempdb if it does not fit in memory.

Answer 22

(1) The arguments in the predicate are not searchable. (2) The predicate is not selective enough.

Answer 23

SQL Server maintains statistics of the distribution of key values in special system statistical pages. The Query Optimizer uses these statistics to estimate the cardinality, or number of rows, in the query result set. In other words, statistics help the Query Optimizer produce an efficient query execution plan. By default, SQL Server creates statistics automatically for each index and for searchable nonkey columns (used as searchable arguments) during query execution. Each statistics object is stored in a statistics binary large object and is created on one or more columns. Statistics include a header with metadata about the statistics and a density vector to measure cross-column correlation. Statistics also include a histogram with the distribution of values in the first column.

Answer 24

(1) AUTO_CREATE_STATISTICS - When this option is set to on, SQL Server creates statistics automatically. This option is on by default and you should leave this option on in the vast majority of cases. (2) AUTO_UPDATE_STATISTICS - When this option is set to on, SQL Server automatically updates statistics when there are enough changes in the underlying tables and indexes. With this option on, SQL Server also updates out-of-date statistics during query optimization. SQL Server checks for outdated statistics before compiling a query and before executing a cached query. In general, you should leave this option turned on. (3) AUTO_UPDATE_STATISTICS_ASYNC - This option determines whether SQL Server uses synchronous or asynchronous statistics updates during query optimization. If the statistics are updated asynchronously, SQL Server cannot use them for the optimization of the query that triggered the update; however, SQL Server does not wait for the statistics update during the optimization phase.
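A sketch of checking and setting the three options. The database name TSQL2012 is an assumption; substitute your own.

```sql
-- Current settings (1 = on, 0 = off).
SELECT name,
       is_auto_create_stats_on,
       is_auto_update_stats_on,
       is_auto_update_stats_async_on
FROM sys.databases
WHERE name = N'TSQL2012';   -- assumed database name

-- Turn on automatic creation and synchronous automatic updates.
ALTER DATABASE TSQL2012 SET AUTO_CREATE_STATISTICS ON;
ALTER DATABASE TSQL2012 SET AUTO_UPDATE_STATISTICS ON;
ALTER DATABASE TSQL2012 SET AUTO_UPDATE_STATISTICS_ASYNC OFF;
```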

Answer 25

The histogram of a statistics object can have at most 200 steps.

Answer 26

You can get information about statistics by querying the sys.stats and sys.stats_columns catalog views.
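For example, a query along these lines lists each statistics object and its columns for one table. Sales.Orders is assumed from the sample database.

```sql
-- Statistics objects on the assumed table, with their columns in order.
SELECT s.name          AS stats_name,
       s.auto_created,
       s.user_created,
       s.has_filter,
       sc.stats_column_id,
       c.name          AS column_name
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
  ON sc.object_id = s.object_id
 AND sc.stats_id  = s.stats_id
JOIN sys.columns AS c
  ON c.object_id = sc.object_id
 AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID(N'Sales.Orders')
ORDER BY s.name, sc.stats_column_id;
```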

Answer 27

You can get detailed information about statistics with the DBCC SHOW_STATISTICS command. Note that you pass in the table name and the name of the statistics object, for example: DBCC SHOW_STATISTICS(N'Sales.Orders', N'idx_nc_empid'); By default, you get all statistics information, including the header, density vector, and histogram. From the header, you can get information like when the statistics were last updated. The WITH STAT_HEADER option returns only the header. The WITH HISTOGRAM option returns only the histogram of the statistics.
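The variants side by side, reusing the table and statistics names from the example in this answer:

```sql
-- Everything: header, density vector, and histogram.
DBCC SHOW_STATISTICS(N'Sales.Orders', N'idx_nc_empid');

-- Header only (includes the last update date and row counts).
DBCC SHOW_STATISTICS(N'Sales.Orders', N'idx_nc_empid') WITH STAT_HEADER;

-- Density vector only.
DBCC SHOW_STATISTICS(N'Sales.Orders', N'idx_nc_empid') WITH DENSITY_VECTOR;

-- Histogram only.
DBCC SHOW_STATISTICS(N'Sales.Orders', N'idx_nc_empid') WITH HISTOGRAM;
```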

Answer 28

You can manually maintain statistics with the CREATE STATISTICS, DROP STATISTICS, and UPDATE STATISTICS commands. You can also use the sys.sp_updatestats system procedure to manually update statistics for all tables in a database. Note that this stored procedure can take a long time to execute and use a lot of resources.
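A sketch of the manual maintenance commands, assuming the sample table Sales.Orders with a shippeddate column. The statistics name is made up.

```sql
-- Create a statistics object on one column.
CREATE STATISTICS st_orders_shippeddate
ON Sales.Orders (shippeddate);

-- Update that one object by reading all rows...
UPDATE STATISTICS Sales.Orders st_orders_shippeddate WITH FULLSCAN;

-- ...or update all statistics on the table with default sampling.
UPDATE STATISTICS Sales.Orders;

-- Update statistics for all tables in the current database.
-- This can run for a long time on a large database.
EXEC sys.sp_updatestats;

-- Drop the manually created object.
DROP STATISTICS Sales.Orders.st_orders_shippeddate;
```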

Answer 29

STATS_DATE() provides information about when the statistics were last updated.
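For instance, the function can be applied to every statistics object on a table. Sales.Orders is assumed from the sample database.

```sql
-- Last update time of each statistics object on the assumed table.
SELECT name AS stats_name,
       STATS_DATE(object_id, stats_id) AS last_updated
FROM sys.stats
WHERE object_id = OBJECT_ID(N'Sales.Orders');
```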

Answer 30

Similar to filtered indexes, you can also create filtered statistics. Statistics created by SQL Server automatically are always created on all rows of a table. If queries frequently select from a subset of rows that has a unique data distribution, filtered statistics can improve query plans.
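A minimal sketch of filtered statistics, assuming Sales.Orders has freight and shippeddate columns and that queries often target shipped orders only. The statistics name is hypothetical.

```sql
-- Statistics on freight, built only from rows that satisfy the filter.
CREATE STATISTICS st_orders_shipped_freight
ON Sales.Orders (freight)
WHERE shippeddate IS NOT NULL;
```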

Answer 31

You can create this statistic manually. However, before creating it manually, you should verify that AUTO_CREATE_STATISTICS and AUTO_UPDATE_STATISTICS database options are turned on and that the database is not read-only. If the database is read-only, the Query Optimizer cannot save statistics.

Answer 32

(1) When query execution times are slow, and you know that the queries are written correctly and supported with appropriate indexes. Before you use query hints, update statistics. SQL Server might not choose an index whose statistics are outdated. Check also whether auto-updating statistics is turned off for the database. (2) When insert operations occur on ascending or descending key columns. Statistics are not updated for every single row; therefore, the number of rows inserted might be too small to trigger a statistics update. If queries select from the recently added rows, the current statistics might not have cardinality estimates for these new values. In addition, bulk inserting rows into a table or truncating it can change the distribution of data a lot. Queries executed right after these operations might get a suboptimal execution plan. (3) After an upgrade from a previous version of SQL Server. Statistics information can change with a new version of SQL Server. To be on the safe side, you should update the statistics for the upgraded database.

Answer 33

You should use the sys.sp_updatestats system stored procedure.

Answer 34

One example is when a query predicate contains multiple columns that have cross-column relationships; statistics on the multiple columns can help improve the query plan. Statistics on multiple columns contain cross-column densities that are not available in single-column statistics. However, if the columns are already in the same index, the multicolumn statistics object already exists, so you should not create an additional one manually.
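A sketch of multicolumn statistics, assuming queries filter Sales.Orders on custid and empid together and that no index already has both columns as leading keys. The statistics name is made up.

```sql
-- Statistics on two columns; the histogram covers custid (the first column),
-- and the density vector covers custid and the (custid, empid) combination.
CREATE STATISTICS st_orders_custid_empid
ON Sales.Orders (custid, empid);

-- Inspect the cross-column densities.
DBCC SHOW_STATISTICS(N'Sales.Orders', N'st_orders_custid_empid')
WITH DENSITY_VECTOR;
```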