Performance Concepts: Query Optimization Flashcards
History Tab
- Displays query history for the last 14 days
- Users can view other users queries but cannot view their query results
Analysis of Query Duration: Compilation Time
Query engine performing operations like cost based optimization and micropartition pruning
Analysis of Query Duration: Execution Time
Steps to process the data, processing time on the CPU
Query Analysis: Profile Overview: Processing
The percentage of the total execution time spent on data processing by the CPU of the virtual warehouse
Query Analysis: Profile Overview: Local Disk I/O
-The percentage of time processing was blocked by local disk access
- In other words, when the virtual warehouse had to read or write data from the local SSD storage
Query Analysis: Profile Overview: Remote Disk I/O
When the virtual warehouse had to read or write data from the remote storage layer
Database Order of Execution
Rows->Groups->Result
Rows statements
FROM
JOIN
WHERE
Group statements
GROUP BY
HAVING
Result statements
SELECT
DISTINCT
ORDER BY
LIMIT
Bytes spilled to local storage
Volume of data spilled to virtual warehouse local disk
Bytes spilled to remote storage
Volume of data spilled to remote disk.
Caching Layers
Services Layer
Warehouse Layer
Storage Layer
Services Layer Cache Categories
Metadata Cache
Results Cache
Metadata Cache
- highly available service residing in the cloud services layer
- maintains metadata on object information and statistics for each cache type
- sometimes referred to as the metadata store cloud services layer or just services layer
What does the metadata cache store?
-maintains understanding of what tables create, what databases those are in, what are the micro partitions that make up those tables, clustering information and more
- enables queries to be executed without the need of a virtual warehouse
- For example, the row count of each table is stored in a cache allowing for executing something such as a select count without using the virtual warehouse
- Functions, context functions, describe commands, and show commands make use of metadata
Results Cache
- Also referred to as Result 24 hour
- Each time a persistent result is reused, the 24 hour retention period is extended up to a maximum of 31, after which the result is purged
To reuse a result, ‘Result Cache’ Rule
- New query exactly matches previous query
- The underlying table data has not changed
- The same role is used as the previous query
- If time context functions are used, such as CURRENT_TIME(), the result cache will not be used
- Result reuse is disabled using the session parameter USE_CACHED_RESULT
Warehouse Cache
- Virtual warehouses have local SSD storage which maintains raw table data used for processing a query
- The larger the virtual warehouse the greater the local cache
- It is purged when the virtual warehouse is resized, suspended, or dropped
- Can be used partially, retrieving the rest of the data required for a query from remote storage
Materialized View
- a pre-computed and persisted data set derived from a SELECT query
- are updated via a background process ensuring data is current and consistent with the base table
- improve query performance by making complex queries that are commonly executed, readily available
- an enterprise edition and above serverless feature
- use compute resources to perform automatic background maintenance
- use storage to store query results, adding to the monthly storage usage for an account
Materialized Views can be created on top of ____________ _____________ to improve their query performance.
external tables
Materialized Views are limited in the following ways:
- Can only query a single table
- Cannot use JOIN
- Cannot use UDF, HAVING, ORDER BY, LIMIT, WINDOW FUNCTIONS
Clustering
a way describing the distribution of a column’s values
Automatic Clustering
- Snowflake supports specifying one or more table columns/expressions as a clustering key for a table
- Clustering aims to co-locate data of the clustering key in the same micro-partitions
- Clustering improves performance of queries that frequently filter or sort on the clustered keys
- Clustering should be reserved for large tables in the multi-terabyte range
When choosing a clustering key, Snowflake recommends a maximum of __ or __ columns (or expressions) per key.
3, 4
A __________ cardinality column would be something like a unique ID or a timestamp down to a millisecond.
high
A ________ cardinality would be something like country or gender.
Low
When choosing a clustering key, a cardinality that is too _____ will negatively impact pruning because the values are too unique to be grouped in micro-partitions.
high
If multiple columns are chosen to be the clustering key, it is suggested that the ______ cardinality is selected first followed by the ________ cardinality.
lowest, highest
Clustering Key Code Examples
As _____ operations are performed on a clustered table, the data in the table might become less clustered.
DML
________________ is a background process which transparently reorganizes data in the micro-partitions by the clustering key.
reclustering
Initial clustering and subsequent reclustering operations consume __________ and ________ credits.
clustering, reclustering, compute, storage
Clustering is recommended for large tables which do not frequently _________ and are frequently ___________.
change, queried
__________ ____________ ___________ is a table level property aimed at improving the performance of selective point lookup queries.
Search optimization service