D3.3 - DEFINE BEST PRACTICES THAT SHOULD BE CONSIDERED WHEN LOADING DATA Flashcards
1
Q
File Size: Optimizing parallel operations
A
- The number of load operations that run in parallel cannot exceed the number of data files to be loaded.
- To optimize the number of parallel operations, Snowflake recommends producing data files roughly 100-250 MB (or larger) in size, compressed (see the sketch below)
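A minimal pre-flight sketch in Python, assuming gzip-compressed files sitting in a local directory before staging (the ./data_to_load path and the *.gz pattern are hypothetical): it flags files that fall outside the roughly 100-250 MB compressed range recommended above.

```python
from pathlib import Path

# Recommended compressed file size range for parallel loading,
# per the guideline above (roughly 100-250 MB).
MIN_MB, MAX_MB = 100, 250

def check_staged_files(directory: str) -> None:
    """Flag compressed files that fall outside the recommended size range."""
    for path in sorted(Path(directory).glob("*.gz")):
        size_mb = path.stat().st_size / (1024 * 1024)
        if size_mb < MIN_MB:
            print(f"{path.name}: {size_mb:.1f} MB -- consider aggregating")
        elif size_mb > MAX_MB:
            print(f"{path.name}: {size_mb:.1f} MB -- consider splitting")
        else:
            print(f"{path.name}: {size_mb:.1f} MB -- OK")

check_staged_files("./data_to_load")  # hypothetical local staging directory
```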
2
Q
File Size: Very large files
A
- Loading very large files (larger than 100 GB) is not recommended. If you must load one, consider the ON_ERROR copy option value (see the sketch after this list).
- Aborting or skipping a file due to a small number of errors could result in delays and wasted credits.
- In addition, if a data loading operation continues past the limit of 24 hours, it could be aborted without any portion of the file being committed.
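A hedged sketch of using the ON_ERROR copy option through the Snowflake Python connector; the connection parameters, table, stage, and file names are all hypothetical placeholders.

```python
import snowflake.connector

# Connection parameters are hypothetical placeholders.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="load_wh",
    database="my_db",
    schema="public",
)

try:
    cur = conn.cursor()
    # ON_ERROR = 'CONTINUE' skips bad records instead of aborting the
    # whole (very large) file; table, stage, and file names are hypothetical.
    cur.execute("""
        COPY INTO my_table
        FROM @my_stage/huge_file.csv.gz
        FILE_FORMAT = (TYPE = 'CSV' FIELD_DELIMITER = ',' SKIP_HEADER = 1)
        ON_ERROR = 'CONTINUE'
    """)
    print(cur.fetchall())  # per-file results, including error counts
finally:
    conn.close()
```

With ON_ERROR = 'CONTINUE', a handful of bad records are skipped and reported rather than aborting a multi-hour load partway through.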
3
Q
File Size: Aggregating and splitting files
A
- Aggregate smaller files to minimize the processing overhead for each file.
- Split larger files into a greater number of files to distribute the load among the compute resources in an active warehouse.
- The number of data files that are processed in parallel is determined by the amount of compute resources in a warehouse.
- We recommend splitting large files by line to avoid records that span chunks (see the sketch below).
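A minimal splitting sketch, assuming a large uncompressed CSV (huge_export.csv is a hypothetical name): it breaks the file on line boundaries into roughly 150 MB chunks and gzips each one, so no record spans two chunks.

```python
import gzip
from pathlib import Path

# Target chunk size: mid-range of the 100-250 MB recommendation (assumption).
CHUNK_BYTES = 150 * 1024 * 1024

def split_by_line(source: str, out_dir: str) -> None:
    """Split a large text file into ~150 MB gzipped chunks on line
    boundaries, so no record ever spans two chunks."""
    Path(out_dir).mkdir(exist_ok=True)
    part, written, out = 0, 0, None
    with open(source, "rb") as src:
        for line in src:  # iterating by line keeps each record intact
            if out is None or written >= CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                out = gzip.open(f"{out_dir}/part_{part:04d}.csv.gz", "wb")
                written = 0
            out.write(line)
            written += len(line)  # uncompressed bytes, a rough proxy
    if out:
        out.close()

split_by_line("huge_export.csv", "./chunks")  # hypothetical names
```

Note that in this simple sketch only the first chunk carries the CSV header; a production splitter would typically copy the header into every chunk or omit headers entirely.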
4
Q
VARIANT Data Type
A
- Has a 16 MB size limit on an individual row (see the sketch below)
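A rough client-side guard sketched in Python; the record is an illustrative assumption, and since Snowflake measures the limit internally, the byte count of the serialized JSON here is only an approximation.

```python
import json

# Hypothetical record destined for a VARIANT column.
record = {"id": 1, "payload": {"tags": ["a", "b"], "notes": "..."}}

VARIANT_LIMIT = 16 * 1024 * 1024  # the 16 MB per-row limit noted above

serialized = json.dumps(record).encode("utf-8")
status = "exceeds" if len(serialized) > VARIANT_LIMIT else "is within"
print(f"Record is {len(serialized)} bytes and {status} the VARIANT limit")
```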
5
Q
File Formats
A
- Structured
- Semi-Structured
6
Q
What types are within Structured files?
A
- Delimited (CSV, TSV, etc.)
- Any valid single-byte delimiter is supported; the default is a comma (see the sketch below)
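A small sketch of the single-byte-delimiter point using Python's csv module; the pipe character and the users.psv file name are illustrative choices (the matching Snowflake file format option would be FIELD_DELIMITER = '|').

```python
import csv

rows = [["id", "name"], ["1", "Ada"], ["2", "Grace"]]  # sample data

# Any single-byte character can serve as the field delimiter;
# here a pipe replaces the default comma.
with open("users.psv", "w", newline="") as f:
    csv.writer(f, delimiter="|").writerows(rows)

with open("users.psv", newline="") as f:
    for row in csv.reader(f, delimiter="|"):
        print(row)
```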
7
Q
What types are within Semi-structured files?
A
- JSON
- Avro: includes automatic detection and processing of compressed files (see the sketch after this list)
- ORC: includes automatic detection and processing of compressed files
- Parquet: includes automatic detection and processing of compressed files; version 2 (V2) files are not supported
- XML: supported as a preview feature
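To make the compression-detection point concrete, a minimal sketch that produces a gzip-compressed newline-delimited JSON file (the file name and records are illustrative); with Snowflake's default COMPRESSION = AUTO file format setting, such a file can be loaded without declaring the compression explicitly.

```python
import gzip
import json

# Illustrative semi-structured records.
records = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]

# Write newline-delimited JSON, gzip-compressed; with the default
# COMPRESSION = AUTO, Snowflake detects the gzip encoding on load.
with gzip.open("events.json.gz", "wt", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```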