Data Pipeline Creation & Mgmt (ontologize prep) Flashcards
foundational knowledge
What is Data Connection?
The primary way you get data into and out of Foundry.
Data Connection is where you
define how data should be exchanged between Foundry and external systems such as databases, ERP systems, SaaS apps, cloud storage, and more
Sources store information about
how to connect to a system
Sources store
credentials & secrets
Sources require
egress policies
Examples of sources
Snowflake, AWS S3, REST APIs, on-prem database, SFTP, ERP
Syncs store information about
what data to retrieve from a source and how to do it (e.g., schedule)
Syncs produce
datasets or streams
Incremental syncs are useful for
ingesting only the most recent data
For incremental syncs, you need to update
- The transaction type
- The logic used to determine which data to ingest (e.g. SQL query)
- What information should be saved between runs
The transaction type does not affect
which data is brought into Foundry
The transaction type does affect
how data is saved
Transaction type: Snapshot
overwrite all data in the Foundry dataset
Transaction type: Append
add to what’s already in the Foundry dataset
Transaction type: Update
change or add to what’s already in the Foundry dataset
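The three transaction types can be pictured with a toy model. This is only a sketch, not Foundry's implementation: it assumes a dataset is a mapping from file name to file content, and the function name `apply_transaction` and the string labels are hypothetical.

```python
def apply_transaction(dataset_files, incoming_files, transaction_type):
    """Return the dataset's files after committing one transaction (toy model)."""
    if transaction_type == "SNAPSHOT":
        # Overwrite: only the incoming files remain.
        return dict(incoming_files)
    if transaction_type == "APPEND":
        # Add only: in this sketch, touching an existing file is an error.
        overlap = set(dataset_files) & set(incoming_files)
        if overlap:
            raise ValueError(f"APPEND cannot modify existing files: {sorted(overlap)}")
        return {**dataset_files, **incoming_files}
    if transaction_type == "UPDATE":
        # Change or add: incoming files replace same-named files, others are kept.
        return {**dataset_files, **incoming_files}
    raise ValueError(f"Unknown transaction type: {transaction_type}")


current = {"a.csv": "v1"}
after_snapshot = apply_transaction(current, {"b.csv": "v1"}, "SNAPSHOT")
after_append = apply_transaction(current, {"b.csv": "v1"}, "APPEND")
after_update = apply_transaction(current, {"a.csv": "v2"}, "UPDATE")
```

In each case the incoming files are the same kind of input; only the way they are saved differs, which is the point of the two preceding cards.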
To make a sync get only the latest data, you need to provide
- A column whose values can be sorted reliably
- A starting value for that column that will be used in a WHERE clause
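A minimal sketch of this idea, using Python's built-in `sqlite3` as a stand-in for an external source. The table `events`, the column `id`, and the function `incremental_pull` are hypothetical names chosen for illustration, not part of Data Connection.

```python
import sqlite3


def incremental_pull(conn, last_seen_id):
    """Fetch only rows newer than the saved state and return the new state."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_seen_id,),
    ).fetchall()
    # The state saved between runs is the largest value seen so far.
    new_state = rows[-1][0] if rows else last_seen_id
    return rows, new_state


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# First run: the starting value 0 pulls everything currently in the source.
first_batch, state = incremental_pull(conn, 0)

# A new row arrives in the source; the second run pulls only that row.
conn.execute("INSERT INTO events VALUES (4, 'd')")
second_batch, state = incremental_pull(conn, state)
```

The column must sort reliably (here an increasing integer key); otherwise the `WHERE` filter could skip or re-read rows.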
Transactions are
file-based
Transaction type Update is not available for
non-file based sources
For update-based workflows, use
methods such as SCD2 (slowly changing dimension, type 2) or CDC (change data capture)
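One way to picture a CDC-style workflow is as an append-only change log that is collapsed downstream into the current state. The sketch below is a simplified illustration under that assumption; the record fields `seq`, `op`, `key`, and `value` and the function `current_state` are hypothetical.

```python
def current_state(change_log):
    """Collapse an append-only change log into the latest value per key."""
    latest = {}
    # Replay changes in sequence order so later changes win.
    for change in sorted(change_log, key=lambda c: c["seq"]):
        if change["op"] == "delete":
            latest.pop(change["key"], None)
        else:
            latest[change["key"]] = change["value"]
    return latest


log = [
    {"seq": 1, "op": "upsert", "key": "a", "value": 1},
    {"seq": 2, "op": "upsert", "key": "b", "value": 2},
    {"seq": 3, "op": "upsert", "key": "a", "value": 10},
    {"seq": 4, "op": "delete", "key": "b", "value": None},
]
state_now = current_state(log)
```

Because every change is only ever appended, no existing record has to be modified in place, which is why this pattern suits sources where the Update transaction type is not available.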
Virtual tables
- Window into external system
- Push down compute when possible (e.g. BigQuery)
Exports
- One way of exporting data to source systems
- Less flexible than External Transforms but potentially easier to configure and manage
Why use incremental transforms?
Make pipelines more efficient by reading, processing, and writing only the new or updated data
Snapshot transforms process
the whole input dataset(s) and write the entire output dataset(s)
Incremental transforms only
process the new or updated data
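The difference can be sketched in plain Python. This is not the Foundry transforms API; it assumes an append-only input list, a hypothetical per-row `transform_row` step, and a saved count of rows already processed.

```python
def transform_row(row):
    # Hypothetical per-row logic; stands in for any row-wise transformation.
    return row * 2


def snapshot_transform(all_input_rows):
    """Reprocess the whole input and rewrite the whole output."""
    return [transform_row(r) for r in all_input_rows]


def incremental_transform(all_input_rows, rows_already_processed, previous_output):
    """Process only rows added since the last run and append them to the output."""
    new_rows = all_input_rows[rows_already_processed:]
    output = previous_output + [transform_row(r) for r in new_rows]
    return output, len(all_input_rows), len(new_rows)


# Run 1: three input rows, nothing processed yet.
inputs = [1, 2, 3]
output, processed, work_done_run1 = incremental_transform(inputs, 0, [])

# Run 2: two rows were appended to the input; only those two are processed.
inputs = inputs + [4, 5]
output, processed, work_done_run2 = incremental_transform(inputs, processed, output)

full_rebuild = snapshot_transform(inputs)
```

Both approaches end with the same output for row-wise logic on append-only input, but the second incremental run touches two rows where the snapshot run touches all five.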