Performance Management Flashcards

1
Q

Clustering in Snowflake

A

As data is loaded into Snowflake, it is micro-partitioned and these partitions are clustered based on its metadata (by similarity). This is done automatically and requires no user intervention/management.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
2
Q

Can Materialized Views be clustered?

A

Yes

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
3
Q

Cluster Key

A

To manually improve clustering, Snowflake supports the addition of a Cluster Key to a table. A clustering key is deigned to co-locate data in a table in the same micro-partition. This improves overall clustering and query performance.

(Generally Snowflake does a good job doing this, but at times it can be better to manually intervene)

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
4
Q

How to determine if a table is clustered?

A

A table is considered to be clustered if a clustering key is present.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
5
Q

Should materialized views be clustered?

A

x

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
6
Q

How does clustering work in the background?

A

As data is inserted into a table, Clustering metadata is collected and recorded for each micro-partition created during the process. Snowflake leverages this clustering information to avoid unnecessary scanning of micro-partitions, and also uses this info for clustering key purposes.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
7
Q

When can clustering keys be defined?

A

CREATE TABLE

ALTER TABLE

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
8
Q

What should determine whether to define a clustering key?

A
  • Queries on the table are running slower than expected/degraded over time.
  • The clustering depth for a table is large.
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
9
Q

Impact of Clustering on Query Pruning

A

Micro-partition metadata maintained by Snowflake enables precise and optimal pruning of columns in micro-partitions, including columns with semi-structured data.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
10
Q

Clustering Query Pruning Example

A

If a query filters on value X, and value X is only present in 10% of all micro-partitions, a well clustered table will only scan 10% of its micro-partitions and ignore the rest of the micro-partitions.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
11
Q

What keys are best clustered? (in order of priority)

A
  1. Cluster columns that are most actively used by select filters
  2. Columns frequently joined on
How well did you know this?
1
Not at all
2
3
4
5
Perfectly
12
Q

Recommended Maximum Columns per Cluster Key

A

4

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
13
Q

What qualities make for a good cluster key

A

A column with low cardinality is a bad candidate, and a column with high cardinality is a bad candidate as well.

A column in the middle in terms of cardinality is best.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
14
Q

What is a View?

A

A view is a table-like object in Snowflake derived from a query. A view functions effectively like a table and can be used interchangeably with a table.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
15
Q

What is a Materialized View?

A

Identical to a normal view except that the predefined query is pre-computed and the output is stored for later use when the view is called.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
16
Q

Are materialized views faster than regular queries or views?

A

Yes, because the compute on a materialized view is done ahead of time unlike the compute for a regular query and view.

How well did you know this?
1
Not at all
2
3
4
5
Perfectly
17
Q

Search Optimization

A

Search Optimization is a service within Snowflake that aims to improve performance of selective lookup queries on tables. A point lookup query returns only one or a small number of distinct rows.

18
Q

When is Search Optimization useful?

A
  • Business users need fast response times for critical dashboards with highly selective features
  • Data scientists exploring large volumes of data and are looking for specific subsets of data
19
Q

Data types supported by Search Optimization

A
  • Fixed-point numbers (INT/NUMERIC)
  • DATE, TIME, and TIMESTAMP
  • VARCHAR
  • BINARY
20
Q

Search Optimization works best to improve the performance of a query when the following is true

A

For tables:

  • The table is not clustered
  • The table is frequently queried on columns other than the primary cluster key

For queries:

  • The query typically runs for 10+ seconds
  • At least one column accessed through the filter has 100k+ distinct values
21
Q

Alternative Solutions to Search Optimization

A
  • Clustering
  • Materialized Views
  • Caching
22
Q

Clustering a table can speed up which of the following? (Assuming they are on the clustering key)

A
  • Range searches

- Equality searches

23
Q

Granting access to Search Optimization

A

You must have OWNERSHIP on the table you want to optimize.

You must have the ADD SEARCH OPTIMIZATION privilege on the schema that contains the table.

24
Q

What factors impact query performance?

A
  • Query Queueing
  • Disk Spillage
  • Control Join Order
  • Unsatisfactory Pruning
  • Recognizing Row Explosion
25
Q

What (quantifiable) factors can impact query queueing?

A
  • QUEUED_LOAD
  • QUEUED_RESUMING
  • QUEUED_REPAIR
26
Q

Where are the three factors that detail query queueing pulled from?

A

INFORMATION_SCHEMA.QUERY_HISTORY view

27
Q

QUEUED_LOAD / QUEUED_OVERLOAD_TIME

A

Details the current load on a warehouse, can often describe whether or not there is capacity for a new query and details whether the warehouse is overloaded.

(This is the most common condition)

28
Q

QUEUED_RESUMING

A

Details whether or not the warehouse is currently resuming, say from auto-suspension or a restart. A query will execute once the warehouse is available.

29
Q

QUEUED_REPAIR

A

A faulty server within your warehouse is being replaced by a healthy one. This is very rare and typically only happens in the instance of hardware failure.

30
Q

What function is used to monitor activity history for a warehouse?

A

WAREHOUSE_LOAD_HISTORY() function

31
Q

Disk Spilling

A

When a Snowflake cannot fit an operation into memory (like tables/data that are being used to compute a query result), this data will spill into memory. If memory and local disk becomes full, the query data will spill into remote storage.

As data spills over into each new layer downstream, the slower the query becomes.

32
Q

Order of disk spillage layers

A

Descending by speed:

  • Memory
  • Local disk (SSD)
  • Remote storage (S3/Blob storage)
33
Q

Where can you see whether a query has disk spillage in the Snowflake UI?

A

Profile tab in the web interface, look for the “Spilling” section which dictates the number of bytes spilled to local storage.

34
Q

Control Join Order

A

In general, the SQL query optimizer chooses the correct order for joining tables, but in rare cases it does not it may be advisable to specify/correct the order in which queries join tables. When tables are not joined in an optimal order this can impact performance.

35
Q

How to manually control the join order of a query?

A

Specifying the join order in the query itself won’t change anything as the query optimizer will just decide for you in the backend, regardless of your specification in the query.

To do so, you must write several temp tables and join them in that order.

36
Q

Unsatisfactory Pruning

A

This is when unnecessary partitions are being scanned to compute a query, resulting in slower execution times.

37
Q

Solutions to Unsatisfactory Pruning

A

Improve clustering (clustering keys, etc) and pruning will improve.

38
Q

Row Explosion

A

Due to unanticipated data duplication due to joins on a query, more rows may be exploded out of a query than originally anticipated. This can exponentially increase compute and storage time/resources to complete a query.

39
Q

Row Explosion Solution

A

Double check to ensure your join statements are correct. At times there is no way to fix this if your joins are absolutely necessary.

40
Q

Query Profile Tool

A

The Query Profile tool provides execution details for a query as well as a graphical representation of the main components of the plan to process a query along with statistics for each component.

This tool helps identify common query performance issues mentioned in prior cards.

41
Q

Operator Trees

A

A query is divided into steps, with each step containing an Operator Tree. An Operator Tree contains the main elements/operator nodes and the relationships between each operator that comprise of a step within a query.

42
Q

Optimizing Query Workloads in Snowflake

A

Because Snowflake allows for simple creation and replication of data warehouses, it is best to segment query workloads based on their use case.

This allows for optimal caching, clustering, and concurrency.

i.e the marketing department should have their own warehouse separate from finance. Each data science team should have their own warehouse specific to their objective/subject matter.