Data Loading & Unloading Flashcards
Stages
Stages are temporary storage locations for data files used in the data loading and unloading process
- Broken into Internal and External
Internal Stages are divided into what 3 stages?
- User stage
- Table stage
- Named stage
User Stage
- An internal stage
- Automatically allocated when a user is created
- Use a PUT command to get a file from your local machine to Snowflake
- reference to a user stage is: ls@~;
- Cannot be altered or dropped
- Not appropriate if multiple users need access to stage
Table Stage
- An internal stage
- Automatically allocated when a table is created
- Use a PUT command to get the file into a table stage
- Reference to a table stage is: ls @%MY_TABLE;
- Cannot be altered or dropped
- User must have ownership privileges on table
Named Stage
- An internal stage
- User created database object
- Use a PUT command to get the file into a name stage
- Reference to a named stage is: ls@MY_STAGE;
- Securable object
In internal storage, uncompressed files are automatically compressed using ________ when loaded into an internal storage unless explicitly set not to.
GZIP
In internal storage, stage files are automatically encrypted using ____ bit keys.
128
External Stages
External stages reference data files stored in a location outside of Snowflake, which we manage ourselves.
- Could be Amazon S3 buckets, Google cloud storage buckets, or Microsoft Azure containers
External named stage
- User created database object
- Files are uploaded using the cloud utilities of the cloud provider
- Reference to a named stage is ls @MY_STAGE;
- Storage locations can be private or public
- Copy options such as ON_ERROR and PURGE can be set on stages
Storage Integration
A storage integration is a reusable and securable Snowflake object which can be applied across stages and is recommended to avoid having to explicitly set sensitive information for each stage definition.
Stage Helper Command - LIST
List the contents of a stage:
- Path of staged file
- Size of staged file
- MD5 Hash of staged file
- Last updated timestamp
- Can optionally specify a path for specific folders or files
- Named and internal table stages can optionally include database and schema global pointer
Stage Helper Command - SELECT
- Query the contents of staged files directly using standard SQL for both internal and external stages
- Useful for inspected files prior to data loading/unloading
- Reference metadata columns such as filename and row numbers for a staged file
Stage Helper Command - REMOVE
- Remove files from either an external or internal stage
- Can optionally specify a path for specific folders or files
- Named and internal table stages can optionally include database and schema global pointer
PUT command
- The PUT command uploads data files from a local directory on a client machine to any of the three types of internal stage.
- Uploaded files are automatically encrypted with a 128-bit key with optional support for a 256-bit key.
PUT cannot be executed from within ____________.
worksheets
Duplicate files uploaded to a stage via PUT are _______.
ignored
COPY INTO <table> statement
- The COPY INTO <table> statement copies the contents of an internal or external location directly into a table
- COPY INTO <table> requires a user created virtual warehouse to execute
- Load history is stored in the metadata of the target table for 64 days, which ensures files are not loaded twice
What file formats can be uploaded into Snowflake?
- Delimited files (CSV, TSC, etc)
- JSON
- Avro
- ORC
- Parquet
- XML
Copy Option: ON_ERROR
Value that specifies the error handling for the load operation:
- CONTINUE
- SKIP_FILE
- SKIP_FILE_<num>
- SKIP_FILE_<num>%
- ABORT_STATEMENT</num></num>
Default Value - ‘ABORT_STATEMENT’
Copy Option: SIZE_LIMIT
Number that specifies the maximum size of data loaded by a COPY statement
Default - null(no size limit)
Copy Option: PURGE
Boolean that specifies whether to remove the data files from the stage automatically after the data is loaded successfully
Default Value - FALSE
Copy Option: RETURN_FAILED_ONLY
Boolean that specifies whether to return only files that have failed to load in the statement result
Default Value - FALSE
Copy Option: MATCH_BY_COLUMN_NAME
String that specifies whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data
Default Value - None
Copy Option: ENFORCE_LENGTH
Boolean that specifies whether to truncate text strings that exceed the target column length
Default Value - TRUE
Copy Option: TRUNCATECOLUMNS
Boolean that specifies whether to truncate text strings that exceed the target column length
Default Value - FALSE
Copy Option: FORCE
Boolean that specifies to load all files, regardless of whether they’ve been loaded previously and have not changed since they were loaded
Default Value - FALSE
Copy Option: LOAD_UNCERTAIN_FILES
Boolean that specifies to load files for which the load status is unknown. The COPY command skips these files by default.
Default Value - FALSE
Output Column Names: COPY INTO <table>
- FILE
- STATUS
- ROWS_PARSED
- ROWS_LOADED
- ERROR_LIMIT
- ERRORS_SEEN
- FIRST_ERROR
- FIRST_ERROR_LINE
- FIRST_ERROR_CHARACTER
- FIRST_ERROR_COLUMN_NAME
COPY INTO <table> : Column Name: FILE
Data Type - TEXT
Description - Name of source file and relative path to the file
COPY INTO <table> : Column Name: STATUS
Data Type - TEXT
Description - Status: loaded, load failed, or partially loaded
COPY INTO <table> : Column Name: ROWS_PARSED
Data Type - NUMBER
Description - Number of rows parsed from the source file
COPY INTO <table> : Column Name: ROWS_LOADED
Data Type - NUMBER
Description - Number of rows loaded from the source file
COPY INTO <table> : Column Name: ERROR_LMIT
Data Type - NUMBER
Description - If the number of errors reaches this limit, then abort
COPY INTO <table> : Column Name: ERRORS_SEEN
Data Type - NUMBER
Description - Number of error rows in the source file
COPY INTO <table> : Column Name: FIRST_ERROR
Data Type - TEXT
Description - First error of the source file
COPY INTO <table> : Column Name: FIRST_ERROR_LINE
Data Type - NUMBER
Description - Line number of the first error
COPY INTO <table> : Column Name: FIRST_ERROR_CHARACTER
Data Type - NUMBER
Description - Position of the first error character
COPY INTO <table> : Column Name: FIRST_ERROR_COLUMN_NAME
Data Type - TEXT
Description - Column name of the first error
VALIDATION_MODE statement
Optional parameter allows you to perform a dry-run of load process to expose errors when running COPY INTO<table>
- RETURN_N_ROWS
- RETURN_ERRORS
- RETURN_ALL_ERRORS