2025 Clustering and Micro-Partitions Flashcards
How is Snowflake data stored?
Data is structured in a columnar fashion as encrypted, compressed files called micro-partitions.
What is the uncompressed size of the data stored in each micro-partition?
50 MB to 500 MB
Are the micro-partitions immutable?
Yes
What is query pruning?
Snowflake uses micro-partition metadata to directly identify the micro-partitions that contain the data a query needs, instead of scanning the entire dataset.
What guidelines are followed in query pruning?
Prune micro-partitions not needed
Prune columns not needed
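The two pruning guidelines above can be sketched in a few lines of Python. This is only a toy model, not Snowflake's implementation: the `MicroPartition` class, the `prune` function, and the sample data are all hypothetical names invented for illustration, and the metadata is reduced to a single min/max range per partition.

```python
from dataclasses import dataclass


@dataclass
class MicroPartition:
    name: str
    min_value: int  # smallest value of the filter column in this partition
    max_value: int  # largest value of the filter column in this partition
    columns: dict   # column name -> list of values (columnar layout)


def prune(partitions, low, high, wanted_columns):
    """Return only the partitions and columns a range query must read."""
    scanned = {}
    for p in partitions:
        # Partition pruning: skip when the metadata range cannot match.
        if p.max_value < low or p.min_value > high:
            continue
        # Column pruning: read only the columns the query references.
        scanned[p.name] = {c: p.columns[c] for c in wanted_columns}
    return scanned


# Hypothetical sample data: three partitions with disjoint id ranges.
partitions = [
    MicroPartition("mp1", 1, 10, {"id": [1, 10], "city": ["A", "B"]}),
    MicroPartition("mp2", 11, 20, {"id": [11, 20], "city": ["C", "D"]}),
    MicroPartition("mp3", 21, 30, {"id": [21, 30], "city": ["E", "F"]}),
]

result = prune(partitions, 12, 18, ["id"])
print(result)  # {'mp2': {'id': [11, 20]}}
```

Only `mp2` is read, and only its `id` column, because the other partitions' ranges cannot contain values between 12 and 18.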
What does clustering mean?
Dividing datasets into small groups based on data similarity. Used by Snowflake for efficient data pruning, resulting in optimized query performance. It involves organizing data based on the contents of one or more columns, called clustering keys.
Clustering is recommended for tables of what size?
Tables larger than 1 TB. Below this size, there is a good chance that the clustering cost will surpass its benefits.
How can you check if a table might benefit from a clustering approach?
Clone the table and apply the clustering approach on the cloned table and see if the query performance improves.
Besides size, what makes a table a good candidate for clustering?
Tables that do not change frequently and are queried regularly
Is reclustering a table maintained by Snowflake?
Yes, and it will consume credits and have associated storage costs
How can you set clustering keys on a table?
With the CLUSTER BY clause of the CREATE TABLE or ALTER TABLE statements
Is there a default clustering key in Snowflake?
No. If none is defined, clusters are created during data inserts.
To manage costs, Snowflake recommends using at most how many columns/expressions in a clustering key?
3 or 4
What columns are recommended for clustering?
Columns frequently used in selective filters
Columns frequently used in joining predicates
Columns with a suitable number of distinct values: large enough for effective query pruning and small enough for co-locating data in the same micro-partitions
In the case of multicolumn clustering, how does Snowflake recommend ordering the columns?
From lowest to highest cardinality.
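As a rough illustration of that ordering rule, the sketch below counts distinct values per candidate column in a small in-memory sample and sorts the columns from lowest to highest cardinality. The function name `order_clustering_columns` and the sample rows are hypothetical; on a real table the counts would come from the data itself rather than from a hand-written list.

```python
def order_clustering_columns(rows, candidate_columns):
    """Sort candidate columns from lowest to highest distinct-value count."""
    cardinality = {
        col: len({row[col] for row in rows}) for col in candidate_columns
    }
    return sorted(candidate_columns, key=lambda col: cardinality[col])


# Hypothetical sample rows: 2 regions, 3 dates, 4 order ids.
rows = [
    {"region": "EU", "order_date": "2025-01-01", "order_id": 1},
    {"region": "EU", "order_date": "2025-01-01", "order_id": 2},
    {"region": "US", "order_date": "2025-01-02", "order_id": 3},
    {"region": "US", "order_date": "2025-01-03", "order_id": 4},
]

ordered = order_clustering_columns(rows, ["order_id", "order_date", "region"])
print(ordered)  # ['region', 'order_date', 'order_id']
```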
How do you calculate the average depth of a table according to the clustering keys?
SYSTEM$CLUSTERING_DEPTH('<t1>', '(<c1>,<c2>,..)'[,'<p>'])
What is the clustering depth for a table with no micro-partitions?
0
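To build intuition for what "depth" measures, here is a simplified toy metric in Python. It is an assumption made for illustration and not the formula Snowflake uses: each micro-partition is reduced to a (min, max) range on the clustering column, and the depth is taken as the mean number of partitions whose ranges overlap each partition, counting the partition itself. The function name `average_overlap_depth` is hypothetical.

```python
def average_overlap_depth(ranges):
    """Toy depth: mean count of partitions overlapping each partition (itself included)."""
    if not ranges:
        return 0  # no micro-partitions: depth 0, as on the card above
    total = 0
    for lo, hi in ranges:
        # Two closed ranges overlap when each starts before the other ends.
        total += sum(
            1 for other_lo, other_hi in ranges
            if other_lo <= hi and lo <= other_hi
        )
    return total / len(ranges)


print(average_overlap_depth([]))                             # 0
print(average_overlap_depth([(1, 10), (11, 20), (21, 30)]))  # 1.0
print(average_overlap_depth([(1, 30), (5, 25), (10, 20)]))   # 3.0
```

Disjoint ranges give the minimum populated depth of 1, while fully overlapping ranges give a depth equal to the number of partitions.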
What does SYSTEM$CLUSTERING_INFORMATION do?
Provides useful metrics like overlapping micro-partitions and partition depth, but it does not independently determine clustering efficiency. Information is returned as a JSON object
Using clustering information, what is a good indicator that the table is not well clustered?
A high value of average_overlaps or average_depth
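Because the function returns JSON, the check can be scripted. The sketch below parses a sample result and flags a high `average_depth`. The field names follow the function's documented output, but every number in `sample` is made up for illustration, and the `depth_threshold` default of 10.0 is an arbitrary, hypothetical cut-off rather than a Snowflake recommendation.

```python
import json

# Invented sample output; the values are illustrative only.
sample = """
{
  "cluster_by_keys": "LINEAR(order_date)",
  "total_partition_count": 1000,
  "total_constant_partition_count": 20,
  "average_overlaps": 180.5,
  "average_depth": 95.2
}
"""


def looks_poorly_clustered(info_json, depth_threshold=10.0):
    """Flag a table whose average_depth exceeds a chosen (hypothetical) threshold."""
    info = json.loads(info_json)
    return info["average_depth"] > depth_threshold


print(looks_poorly_clustered(sample))  # True
```

What counts as "high" depends on the table, so the threshold is best judged relative to the total partition count and to how the value changes over time.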
What does it mean when there is one micro-partition and the average depth is 1?
The whole table will always be read for any query
In clustering information, when looking at constant micro-partitions, what is good and what is bad?
The higher the number of constant micro-partitions is, the more micro-partitions can be pruned from queries executed on the table
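A small Python model shows why constant micro-partitions help. It assumes a constant micro-partition is one whose minimum and maximum values for the clustering column are equal; the function names and the sample ranges are hypothetical.

```python
def count_constant_partitions(ranges):
    """Count partitions whose min and max clustering values are equal."""
    return sum(1 for lo, hi in ranges if lo == hi)


def partitions_scanned_for_equality(ranges, value):
    """Indexes of partitions an equality filter on `value` cannot skip."""
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= value <= hi]


# Hypothetical (min, max) ranges of the clustering column per partition.
well_clustered = [("A", "A"), ("B", "B"), ("C", "C")]
poorly_clustered = [("A", "C"), ("A", "C"), ("B", "C")]

print(count_constant_partitions(well_clustered))               # 3
print(count_constant_partitions(poorly_clustered))             # 0
print(partitions_scanned_for_equality(well_clustered, "B"))    # [1]
print(partitions_scanned_for_equality(poorly_clustered, "B"))  # [0, 1, 2]
```

With three constant partitions, a filter on `"B"` reads one partition; with none, the same filter reads all three.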