Data Loading & Unloading Flashcards

1
Q

Stages

A

Stages are temporary storage locations for data files used in the data loading and unloading process
- Broken into Internal and External

2
Q

Internal Stages are divided into what 3 stages?

A
  • User stage
  • Table stage
  • Named stage
3
Q

User Stage

A
  • An internal stage
  • Automatically allocated when a user is created
  • Use a PUT command to get a file from your local machine to Snowflake
  • Reference to a user stage is: ls @~;
  • Cannot be altered or dropped
  • Not appropriate if multiple users need access to stage
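As a sketch (file path is illustrative; PUT must be run from a client such as SnowSQL, not a worksheet), working with a user stage might look like:

```sql
-- Upload a local file to the current user's stage
PUT file:///tmp/my_data.csv @~/staged;

-- List the contents of the user stage
ls @~;
```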
4
Q

Table Stage

A
  • An internal stage
  • Automatically allocated when a table is created
  • Use a PUT command to get the file into a table stage
  • Reference to a table stage is: ls @%MY_TABLE;
  • Cannot be altered or dropped
  • User must have ownership privileges on table
5
Q

Named Stage

A
  • An internal stage
  • User created database object
  • Use a PUT command to get the file into a named stage
  • Reference to a named stage is: ls @MY_STAGE;
  • Securable object
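A minimal sketch of the named-stage lifecycle (stage and file names are illustrative):

```sql
-- Create an internal named stage
CREATE STAGE MY_STAGE;

-- Upload a local file into the named stage (from SnowSQL)
PUT file:///tmp/my_data.csv @MY_STAGE;

-- List its contents
ls @MY_STAGE;
```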
6
Q

Uncompressed files are automatically compressed using ________ when loaded into an internal stage, unless explicitly set not to.

A

GZIP

7
Q

In internal stages, staged files are automatically encrypted using ____-bit keys.

A

128

8
Q

External Stages

A

External stages reference data files stored in a location outside of Snowflake, which the user manages directly.
- Could be Amazon S3 buckets, Google Cloud Storage buckets, or Microsoft Azure containers

9
Q

External named stage

A
  • User created database object
  • Files are uploaded using the cloud utilities of the cloud provider
  • Reference to a named stage is ls @MY_STAGE;
  • Storage locations can be private or public
  • Copy options such as ON_ERROR and PURGE can be set on stages
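A sketch of an external named stage over an S3 bucket (bucket, credentials, and option values are illustrative):

```sql
-- External named stage referencing an S3 location, with copy options set on the stage
CREATE STAGE MY_EXT_STAGE
  URL = 's3://my-bucket/path/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
  COPY_OPTIONS = (ON_ERROR = 'SKIP_FILE' PURGE = TRUE);
```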
10
Q

Storage Integration

A

A storage integration is a reusable and securable Snowflake object which can be applied across stages and is recommended to avoid having to explicitly set sensitive information for each stage definition.
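A sketch of the recommended pattern (integration name, role ARN, and bucket are illustrative) — the stage references the integration, so credentials never appear in stage DDL:

```sql
CREATE STORAGE INTEGRATION MY_S3_INT
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/path/');

-- Stage definition carries no sensitive credentials
CREATE STAGE MY_EXT_STAGE
  URL = 's3://my-bucket/path/'
  STORAGE_INTEGRATION = MY_S3_INT;
```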

11
Q

Stage Helper Command - LIST

A

List the contents of a stage:
- Path of staged file
- Size of staged file
- MD5 Hash of staged file
- Last updated timestamp

  • Can optionally specify a path for specific folders or files
  • Named and table stages can optionally be qualified with a database and schema
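For example (stage and path names are illustrative):

```sql
-- List everything in a named stage, optionally narrowed by path
LIST @MY_STAGE;
LIST @MY_STAGE/2024/01/;

-- Fully qualified reference to a stage in another database and schema
LIST @MY_DB.MY_SCHEMA.MY_STAGE;
```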
12
Q

Stage Helper Command - SELECT

A
  • Query the contents of staged files directly using standard SQL for both internal and external stages
  • Useful for inspecting files prior to data loading/unloading
  • Reference metadata columns such as filename and row numbers for a staged file
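A sketch of querying a staged file directly (stage, file, and format names are illustrative):

```sql
-- Positional columns plus metadata columns for a staged CSV file
SELECT $1, $2, metadata$filename, metadata$file_row_number
FROM @MY_STAGE/my_data.csv.gz (FILE_FORMAT => 'MY_CSV_FORMAT');
```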
13
Q

Stage Helper Command - REMOVE

A
  • Remove files from either an external or internal stage
  • Can optionally specify a path for specific folders or files
  • Named and table stages can optionally be qualified with a database and schema
14
Q

PUT command

A
  • The PUT command uploads data files from a local directory on a client machine to any of the three types of internal stage.
  • Uploaded files are automatically encrypted with a 128-bit key with optional support for a 256-bit key.
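A sketch of a PUT with explicit options (run from SnowSQL; path, table name, and option values are illustrative):

```sql
-- Upload to a table stage with explicit options
PUT file:///tmp/my_data.csv @%MY_TABLE
  PARALLEL = 4            -- number of upload threads
  AUTO_COMPRESS = TRUE;   -- gzip the file if not already compressed
```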
15
Q

PUT cannot be executed from within ____________.

A

worksheets

16
Q

Duplicate files uploaded to a stage via PUT are _______.

A

ignored

17
Q

COPY INTO <table> statement

A
  • The COPY INTO <table> statement copies the contents of an internal or external location directly into a table
  • COPY INTO <table> requires a user created virtual warehouse to execute
  • Load history is stored in the metadata of the target table for 64 days, which ensures files are not loaded twice
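A sketch of a bulk load (table, stage path, and format options are illustrative):

```sql
-- Copy staged files into a table, skipping a header row and problem files
COPY INTO MY_TABLE
  FROM @MY_STAGE/2024/
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
  ON_ERROR = 'SKIP_FILE';
```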
18
Q

What file formats can be uploaded into Snowflake?

A
  • Delimited files (CSV, TSV, etc.)
  • JSON
  • Avro
  • ORC
  • Parquet
  • XML
19
Q

Copy Option: ON_ERROR

A

Value that specifies the error handling for the load operation:
- CONTINUE
- SKIP_FILE
- SKIP_FILE_<num>
- SKIP_FILE_<num>%
- ABORT_STATEMENT

Default Value - ‘ABORT_STATEMENT’
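For example (names and threshold are illustrative):

```sql
-- Skip any file once 10 error rows are seen; the default is to abort the statement
COPY INTO MY_TABLE FROM @MY_STAGE
  ON_ERROR = 'SKIP_FILE_10';
```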

20
Q

Copy Option: SIZE_LIMIT

A

Number that specifies the maximum size of data loaded by a COPY statement

Default Value - null (no size limit)

21
Q

Copy Option: PURGE

A

Boolean that specifies whether to remove the data files from the stage automatically after the data is loaded successfully

Default Value - FALSE

22
Q

Copy Option: RETURN_FAILED_ONLY

A

Boolean that specifies whether to return only files that have failed to load in the statement result

Default Value - FALSE

23
Q

Copy Option: MATCH_BY_COLUMN_NAME

A

String that specifies whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data

Default Value - None

24
Q

Copy Option: ENFORCE_LENGTH

A

Boolean that specifies whether to produce an error when loaded text strings exceed the target column length (TRUE = error, FALSE = truncate)

Default Value - TRUE

25
Q

Copy Option: TRUNCATECOLUMNS

A

Boolean that specifies whether to truncate text strings that exceed the target column length (TRUE = truncate, FALSE = error)

Default Value - FALSE

26
Q

Copy Option: FORCE

A

Boolean that specifies to load all files, regardless of whether they’ve been loaded previously and have not changed since they were loaded

Default Value - FALSE

27
Q

Copy Option: LOAD_UNCERTAIN_FILES

A

Boolean that specifies to load files for which the load status is unknown. The COPY command skips these files by default.

Default Value - FALSE

28
Q

Output Column Names: COPY INTO <table>

A
  • FILE
  • STATUS
  • ROWS_PARSED
  • ROWS_LOADED
  • ERROR_LIMIT
  • ERRORS_SEEN
  • FIRST_ERROR
  • FIRST_ERROR_LINE
  • FIRST_ERROR_CHARACTER
  • FIRST_ERROR_COLUMN_NAME
29
Q

COPY INTO <table> : Column Name: FILE

A

Data Type - TEXT
Description - Name of source file and relative path to the file

30
Q

COPY INTO <table> : Column Name: STATUS

A

Data Type - TEXT
Description - Status: loaded, load failed, or partially loaded

31
Q

COPY INTO <table> : Column Name: ROWS_PARSED

A

Data Type - NUMBER
Description - Number of rows parsed from the source file

32
Q

COPY INTO <table> : Column Name: ROWS_LOADED

A

Data Type - NUMBER
Description - Number of rows loaded from the source file

33
Q

COPY INTO <table> : Column Name: ERROR_LIMIT

A

Data Type - NUMBER
Description - If the number of errors reaches this limit, then abort

34
Q

COPY INTO <table> : Column Name: ERRORS_SEEN

A

Data Type - NUMBER
Description - Number of error rows in the source file

35
Q

COPY INTO <table> : Column Name: FIRST_ERROR

A

Data Type - TEXT
Description - First error of the source file

36
Q

COPY INTO <table> : Column Name: FIRST_ERROR_LINE

A

Data Type - NUMBER
Description - Line number of the first error

37
Q

COPY INTO <table> : Column Name: FIRST_ERROR_CHARACTER

A

Data Type - NUMBER
Description - Position of the first error character

38
Q

COPY INTO <table> : Column Name: FIRST_ERROR_COLUMN_NAME

A

Data Type - TEXT
Description - Column name of the first error

39
Q

VALIDATION_MODE statement

A

Optional parameter that allows you to perform a dry run of the load process to expose errors when running COPY INTO <table>:

  • RETURN_N_ROWS
  • RETURN_ERRORS
  • RETURN_ALL_ERRORS
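For example (table and stage names are illustrative):

```sql
-- Dry-run the load and return errors instead of loading
COPY INTO MY_TABLE FROM @MY_STAGE
  VALIDATION_MODE = 'RETURN_ERRORS';

-- Or parse and return the first 10 rows without loading them
COPY INTO MY_TABLE FROM @MY_STAGE
  VALIDATION_MODE = 'RETURN_10_ROWS';
```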
40
Q

VALIDATE statement

A
  • Validate is a table function to view all errors encountered during a previous COPY INTO execution
  • Validate accepts a job id of a previous query or the last load operation executed
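For example (table name and query ID are illustrative):

```sql
-- View errors from a specific previous COPY INTO execution
SELECT * FROM TABLE(VALIDATE(MY_TABLE, JOB_ID => '01a2b3c4-0000-1111-2222-333344445555'));

-- Or validate the last load operation executed
SELECT * FROM TABLE(VALIDATE(MY_TABLE, JOB_ID => '_last'));
```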
41
Q

File format options can be set on a named stage or _____ ____ statement.

A

COPY INTO

42
Q

Explicitly declared file format options can all be rolled up into independent _____ ______ objects.

A

File Format (reusable named Snowflake objects)

43
Q

File formats can be applied to both named stages and COPY INTO statements. If set on both, _____ ______ will take precedence.

A

COPY INTO

44
Q

In the File Format object the file format you’re expecting to load is set via ‘type’ property with one of the following values: ______, ______, ______, ______, _______, or ______.

A

CSV, JSON, AVRO, ORC, PARQUET, XML
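A sketch of rolling options into a named File Format object and applying it (names and option values are illustrative):

```sql
CREATE FILE FORMAT MY_CSV_FORMAT
  TYPE = 'CSV'
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1;

-- Apply it on a stage or directly in a COPY statement
COPY INTO MY_TABLE FROM @MY_STAGE
  FILE_FORMAT = (FORMAT_NAME = 'MY_CSV_FORMAT');
```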

45
Q

CSV File Type

A

Comma-Separated Values file
A plain text file that contains a list of data. They mostly use the comma character to separate data, but sometimes use other characters, like semicolons.

46
Q

JSON File Type

A

JavaScript Object Notation file
A file that stores simple data structures and objects in JavaScript Object Notation (JSON) format. It is primarily used for transmitting data between a web application and a server. They are lightweight, text-based, human-readable, and can be edited using a text editor

47
Q

AVRO File type

A

Stores the data definition in JSON format making it easy to read and interpret; the data itself is stored in binary format making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for Apache MapReduce processing.

48
Q

ORC File type

A

Optimized Row Columnar (ORC)
Open-source columnar storage file format originally released in early 2013 for Hadoop workloads. ORC provides a highly efficient way to store Apache Hive data, though it can store other data as well. It was designed and optimized specifically with Hive data in mind, improving the overall performance when Hive reads, writes, and processes data.

49
Q

PARQUET File type

A

Apache Parquet is a file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. It’s open-source and supports multiple coding languages, including Java, C++, and Python.

50
Q

XML File type

A

Extensible Markup Language file
It contains a formatted dataset that is intended to be processed by a website, web application, or software program. XML files can be thought of as text-based databases

51
Q

If a File Format object or options are not provided to either the stage or COPY statement, the default behavior will be to try to interpret the contents of a stage as a _____ with _____ encoding.

A

CSV, UTF-8

52
Q

SNOWPIPE

A

The Pipe object defines a COPY INTO <table> statement that will execute in response to a file being uploaded to a stage.

53
Q

The two methods for detecting when a new file has been uploaded to a stage:

A
  1. Automating Snowpipe using cloud messaging (external stages only)
  2. Calling Snowpipe REST endpoints (internal and external stages)
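A sketch of an auto-ingest pipe for the cloud-messaging method (pipe, table, and stage names are illustrative):

```sql
-- Pipe that loads new files as cloud event notifications arrive
CREATE PIPE MY_PIPE
  AUTO_INGEST = TRUE
AS
  COPY INTO MY_TABLE
  FROM @MY_EXT_STAGE
  FILE_FORMAT = (TYPE = 'JSON');
```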
54
Q

Snowpipe: Cloud Messaging flow

A

File uploaded to external stage –> cloud event notification (e.g. S3 –> SQS) –> Snowpipe queues the file –> Snowflake-managed compute executes the pipe's COPY INTO <table> statement
55
Q

Snowpipe: REST Endpoint flow

A

File uploaded to a stage (e.g. via PUT) –> client application calls the insertFiles REST endpoint (authenticated with a JWT) –> Snowpipe queues the file –> Snowflake-managed compute executes the pipe's COPY INTO <table> statement
56
Q

Snowpipe is designed to load new data typically within a _______ after a file notification is sent.

A

minute

57
Q

Snowpipe is a _______________ feature, using Snowflake managed compute resources to load data files not a user managed ________________ __________________.

A

serverless feature, Virtual Warehouse

58
Q

Snowpipe load history is stored in the __________ of the pipe for _____ days, used to prevent reloading the same files in a table.

A

metadata, 14

59
Q

When a pipe is paused, event messages received for the pipe enter a limited retention period. The period is ____ days by default.

A

14

60
Q

Compare Bulk Loading vs. Snowpipe: Authentication Feature

A

Bulk Loading: Relies on the security options supported by the client for authenticating and initiating a user session.
Snowpipe: When calling the REST endpoints: Requires key pair authentication with JSON Web Token (JWT). JWTs are signed using a public/private key pair with RSA encryption.

61
Q

Compare Bulk Loading vs. Snowpipe: Load History

A

Bulk Loading: Stored in the metadata of the target table for 64 days.
Snowpipe: Stored in the metadata of the pipe for 14 days.

62
Q

Compare Bulk Loading vs. Snowpipe: Compute Resources

A

Bulk Loading: Requires a user-specified warehouse to execute COPY statements.
Snowpipe: Uses Snowflake-supplied compute resources

63
Q

Compare Bulk Loading vs. Snowpipe: Billing

A

Bulk Loading: Billed for the amount of time each virtual warehouse is active.
Snowpipe: Snowflake tracks the resource consumption of loads for all pipes in an account, with per-second/per-core granularity, as Snowpipe actively queues and processes data files. In addition to resource consumption, an overhead is included in the utilization costs charged for Snowpipe: 0.06 credits per 1,000 files notified or listed via event notifications or REST API calls.

64
Q

Data Loading Best Practices

A
  • Break files into 100-250 MB (compressed) chunks
  • Organize data by path
  • Use separate virtual warehouses for loading and querying
  • Pre-sort data
  • Stage files no more than once per minute so they don't back up in the queue and incur overhead costs
65
Q

Table data can be unloaded to a stage via the ________ command.

A

COPY INTO <location>

66
Q

The ____ command is used to download a staged file to the local file system.

A

GET

67
Q

By default results unloaded to a stage using _______________ command are split in to multiple files.

A

COPY INTO <location>

68
Q

All data files unloaded to internal stages are automatically encrypted using ____-bit keys.

A

128

69
Q

COPY INTO <location> output files can be prefixed by specifying a string at the ____ of a stage path.

A

end

70
Q

COPY INTO <location> includes a _________ __ copy option to partition unloaded data into a directory structure.

A

PARTITION BY
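A sketch of a partitioned unload (table, columns, and stage path are illustrative):

```sql
-- Unload into a date-partitioned directory structure under the stage path
COPY INTO @MY_STAGE/unload/
  FROM (SELECT dt, id, amount FROM MY_TABLE)
  PARTITION BY ('dt=' || TO_VARCHAR(dt))
  FILE_FORMAT = (TYPE = 'PARQUET');
```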

71
Q

COPY INTO <location> can copy table records directly to a(n) _________ cloud provider's blob storage.

A

external

72
Q

COPY INTO <location> Copy Option: OVERWRITE

A

Definition: Boolean that specifies whether the COPY command overwrites existing files with matching names, if any, in the location where files are stored.
Default Value: FALSE

73
Q

COPY INTO <location> Copy Option: SINGLE

A

Definition: Boolean that specifies whether to generate a single file or multiple files.
Default Value: FALSE

74
Q

COPY INTO <location> Copy Option: MAX_FILE_SIZE

A

Definition: Number (>0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread.
Default Value: 16777216 bytes (16 MB)

75
Q

COPY INTO <location> Copy Option: INCLUDE_QUERY_ID

A

Definition: Boolean that specifies whether to uniquely identify unloaded files by including a universally unique identifier (UUID) in the filenames of unloaded data files.
Default Value: FALSE

76
Q

______ is the reverse of PUT. It allows users to specify a source stage and a _______ local directory to download files to.

A

GET, target
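A sketch of a GET with its optional parameters (run from SnowSQL; paths and pattern are illustrative):

```sql
-- Download staged files to a local directory
GET @MY_STAGE/unload/ file:///tmp/downloads/
  PARALLEL = 10                 -- download threads
  PATTERN = '.*[.]csv[.]gz';    -- only files matching the regex
```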

77
Q

GET cannot be used for _________ stages.

A

external

78
Q

GET cannot be ________ from within worksheets.

A

executed

79
Q

When using the GET command, downloaded files are automatically decrypted? T/F

A

True

80
Q

When using the GET command, __________ optional parameter specifies the number of threads to use for downloading files. Increasing this number can improve ____________ with downloading large files.

A

PARALLEL, parallelization

81
Q

When using the GET command, _________ optional parameter specifies a regular expression pattern for filtering files to download.

A

pattern

82
Q

Semi-structured Data Type: ARRAY

A

Contains 0 or more elements of data. Each element is accessed by its position in the array.

83
Q

Semi-structured Data Type: OBJECT

A

Represents collections of key-value pairs.

84
Q

Semi-structured Data Type: VARIANT

A

Universal semi-structured data type used to represent arbitrary data structures.

85
Q

VARIANT data type can hold up to ___MB compressed data per row.

A

16

86
Q

Semi-structured Data Formats supported by Snowflake.

A

JSON, AVRO, ORC, PARQUET, XML

87
Q

Loading Semi-Structured Data Flow

A

Semi-Structured Data file —PUT–> Stage — COPY INTO –> Table

88
Q

JSON File Format Options: DATE_FORMAT

A

Used only for loading JSON data into separate columns. Defines the format of date string values in the data files.

89
Q

JSON File Format Options: TIME_FORMAT

A

Used only for loading JSON data into separate columns. Defines the format of time string values in the data files.

90
Q

JSON File Format Options: COMPRESSION

A

Supported algorithms: GZIP, BZ2, BROTLI, ZSTD, DEFLATE, RAW_DEFLATE, NONE. If files are Brotli-compressed, BROTLI must be specified explicitly; AUTO cannot detect it.

91
Q

JSON File Format Options: ALLOW_DUPLICATE

A

Only used for loading. If TRUE, allows duplicate object field names (only the last one will be preserved)

92
Q

JSON File Format Options: STRIP_OUTER_ARRAY

A

Only used for loading. If TRUE, JSON parser will remove outer brackets []

93
Q

JSON File Format Options: STRIP_NULL_VALUES

A

Only used for loading. If TRUE, JSON parser will remove object fields or array elements containing NULL

94
Q

Three Semi-Structured Data Loading Approaches

A
  1. ELT (Extract, Load, Transform)
  2. ETL (Extract, Transform, Load)
  3. Automatic Schema Detection (INFER_SCHEMA, MATCH_BY_COLUMN_NAME)
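A sketch of the schema-detection approach (stage, file format, and table names are illustrative):

```sql
-- Create a table whose columns are inferred from staged Parquet files
CREATE TABLE MY_TABLE
  USING TEMPLATE (
    SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
    FROM TABLE(INFER_SCHEMA(LOCATION => '@MY_STAGE', FILE_FORMAT => 'MY_PARQUET_FORMAT'))
  );

-- Then load by matching column names rather than position
COPY INTO MY_TABLE FROM @MY_STAGE
  FILE_FORMAT = (FORMAT_NAME = 'MY_PARQUET_FORMAT')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
```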
95
Q

Unloading Semi-structured Data Flow

A

Table —COPY INTO–> Stage –GET–> Semi-structured Data Files

96
Q

Accessing Semi-Structured Data: Dot Notation Structure

A

SELECT <variant_column>:<first_level_element>.<subsequent_element> FROM <table>;

97
Q

Accessing Semi-Structured Data: Bracket Notation Structure

A

SELECT <variant_column>['<first_level_element>']['<subsequent_element>'] FROM <table>;

98
Q

Accessing Semi-Structured Data: Repeating Element

A

SELECT src:<array_element>[<element_index>] FROM <table>;
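A concrete sketch combining the three access styles against a VARIANT column named src (table and element names are illustrative):

```sql
SELECT
  src:customer.name,              -- dot notation
  src['customer']['address'],     -- bracket notation
  src:orders[0].order_id          -- repeating element by index
FROM MY_TABLE;
```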