Data Pipeline Creation & Management (Ontologize Prep) Flashcards

foundational knowledge

1
Q

What is Data Connection?

A

The primary way you get data into and out of Foundry.

2
Q

Data Connection is where you

A

define how data should be exchanged between Foundry and external systems such as databases, ERP systems, SaaS apps, cloud storage, and more

3
Q

Sources store information for

A

how to connect to a system

4
Q

Sources store

A

credentials & secrets

5
Q

Sources require

A

egress policies

6
Q

Examples of sources

A

Snowflake, AWS S3, REST APIs, on-prem databases, SFTP, ERP

7
Q

Syncs store information about

A

what data to retrieve from a source and how to do it (e.g., schedule)

8
Q

Syncs produce

A

datasets or streams

9
Q

Incremental syncs are useful for

A

ingesting only the most recent data

10
Q

For incremental syncs, you need to specify

A
  • The transaction type
  • The logic used to determine which data to ingest (e.g. SQL query)
  • What information should be saved between runs
11
Q

The transaction type does not affect

A

which data is brought into Foundry

12
Q

The transaction type does affect

A

how data is saved

13
Q

Transaction type: Snapshot

A

overwrite all data in the Foundry dataset

14
Q

Transaction type: Append

A

add to what’s already in the Foundry dataset

15
Q

Transaction type: Update

A

change or add to what’s already in the Foundry dataset

16
Q

To make a sync get only the latest data, you need to provide

A
  1. A column whose values can be sorted reliably
  2. A starting value for that column that will be used in a WHERE clause
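
A sketch of that pattern (the table, column, and saved value are illustrative; the actual query lives in the sync configuration):

```python
# Illustrative sketch of an incremental sync query: the sync saves the highest
# value it has seen for a reliably sortable column and uses it as the lower
# bound of the next run's WHERE clause. Names and values here are hypothetical.
last_seen_order_id = 10_000  # persisted by the sync between runs

incremental_query = f"""
SELECT *
FROM orders
WHERE order_id > {last_seen_order_id}
ORDER BY order_id
"""
print(incremental_query)
```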
17
Q

Transactions are

A

file-based

18
Q

Transaction type Update is not available for

A

non-file-based sources

19
Q

For update-based workflows, use

A

methods such as SCD2 (slowly changing dimensions) or CDC (change data capture)

20
Q

Virtual tables

A
  • Window into external system
  • Push down compute when possible (e.g. BigQuery)
21
Q

Exports

A
  • One way of exporting data to source systems
  • Less flexible than External Transforms but potentially easier to configure and manage
22
Q

Why use incremental transforms?

A

Make pipelines more efficient by reading, processing, and writing only the new or updated data

23
Q

Snapshot transforms process

A

the whole input dataset(s) and write the entire output dataset(s)

24
Q

Incremental transforms only

A

process the new or updated data

25
A dataset view is
what’s in the dataset at a certain point in time
26
datasets
buckets of files with a historical record (e.g. parquet)
27
a dataset view references
a specific set of files (as well as other metadata)
28
dataset view is not
analogous to a view in SQL databases
29
each entry in a dataset’s history is a
transaction
30
each entry corresponds
to a different dataset view
31
what are transactions
They describe modifications made to the files in a dataset.
32
transactions are
atomic and blocking
33
transaction is
analogous to transactions in SQL databases
34
transaction type: delete
removes files from output dataset
35
open transactions
protect a dataset from conflicting updates
36
completed transactions
describe how we modified the output dataset(s) in a transform
37
non-incremental transforms will
read all the files of the input dataset(s)
38
read modes: incremental transforms give different options for input dataset(s)
- added (default, and the one used most of the time)
- current
- previous
- modified
- deleted
39
read modes: options for output dataset(s)
- added
- current (default)
- previous (the preferable choice)
- modified
- deleted
40
write modes: options for output dataset(s)
- modify (default)
- replace (what happens when running a Snapshot build)
41
Output transaction type when the input transaction is APPEND or UPDATE and the output write mode is REPLACE
SNAPSHOT
42
Output transaction type when the input transaction is APPEND or UPDATE and the output write mode is MODIFY
INCREMENTAL (APPEND or UPDATE)
43
Output transaction type when the input transaction is SNAPSHOT and the output write mode is REPLACE
SNAPSHOT
44
Output transaction type when the input transaction is SNAPSHOT and the output write mode is MODIFY
SNAPSHOT
45
Handling Snapshot inputs
Must specify which input datasets are OK to have as Snapshots
46
Output transaction type, with snapshot inputs specified, when the input transaction is APPEND or UPDATE and the output write mode is REPLACE
SNAPSHOT
47
Output transaction type, with snapshot inputs specified, when the input transaction is APPEND or UPDATE and the output write mode is MODIFY
INCREMENTAL (APPEND or UPDATE)
48
Output transaction type, with snapshot inputs specified, when the input transaction is SNAPSHOT and the output write mode is REPLACE
SNAPSHOT
49
Output transaction type, with snapshot inputs specified, when the input transaction is SNAPSHOT and the output write mode is MODIFY
SNAPSHOT or INCREMENTAL
50
Syntax notes: incremental transforms must use
the @incremental and @transform decorators
51
Syntax notes: incremental transforms can choose
read modes when creating dataframes
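
A minimal sketch of these syntax notes, with hypothetical dataset paths; with @incremental, the input's default read mode is "added" and the output's default write mode is "modify":

```python
from transforms.api import transform, incremental, Input, Output
from pyspark.sql import functions as F

@incremental()
@transform(
    processed=Output("/Pipeline/datasets/processed_events"),  # hypothetical path
    events=Input("/Pipeline/datasets/raw_events"),            # hypothetical path
)
def compute(events, processed):
    # Read mode "added" (the incremental default): only the rows appended to
    # the input since the last successful build.
    new_rows = events.dataframe("added")
    cleaned = new_rows.filter(F.col("event_id").isNotNull())
    # Default write mode "modify": the cleaned rows are added to the existing
    # output rather than replacing it.
    processed.write_dataframe(cleaned)
```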
52
Option 1: periodic Snapshot builds
typically time- or file-based branching logic in the transform
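
A sketch of the time-based branching idea from Option 1, with hypothetical dataset paths and an illustrative weekly trigger; setting the write mode to "replace" produces a Snapshot transaction on the output:

```python
from datetime import date
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/Pipeline/datasets/orders_clean"),  # hypothetical path
    orders=Input("/Pipeline/datasets/orders_raw"),  # hypothetical path
)
def compute(orders, out):
    if date.today().weekday() == 6:  # illustrative trigger: snapshot on Sundays
        # Periodic Snapshot branch: replace the whole output and re-read the
        # full current view of the input.
        out.set_mode("replace")
        df = orders.dataframe("current")
    else:
        # Incremental branch: append only the newly added rows.
        df = orders.dataframe("added")
    out.write_dataframe(df)
```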
53
Option 2: Projections
Analogous to materialized views in SQL databases
54
What is Pipeline Builder?
- No-code application for creating production pipeline segments
- Git-based version control
- Integration with Foundry’s data engineering suite
55
How is Pipeline Builder used generally?
Clean and prepare raw data for downstream use
56
How is Pipeline Builder used specifically?
Create a backing dataset
57
To use a dataset from another project, you must add a
Project Reference
58
Upstream schema changes
impact downstream nodes
59
Manual intervention is often needed to
ensure consistent schema
60
Monitoring View
a mission control for your data pipeline
61
The Scheduler application ensures
datasets build only if out of date with ancestors
62
Data Lineage gives you
quick insight into overall pipeline health and freshness
63
Job
the computation of a new version of one or more datasets
64
Build
Composed of one or more jobs, builds are the subject of schedules
65
Sync
A synchronization of data between services
66
Pipeline
Semantically related datasets that build in sequence to generate specific outputs
67
Schedule
Configurable logic defining what to build when
68
Health Check
Configurable logic to validate data quality
69
Monitoring View
A curated set of health checks displayed in a dashboard & subscribed to for alerting
70
Data pipelines programmatically
connect inputs to outputs
71
Targets are the pipeline's outputs that support
workflow products
72
Inputs begin a pipeline's logic, often as ingests from
external data sources
73
Intermediate datasets are
built along the path to building the targets
74
data connection
sync raw data from source systems into a Datasource Project
75
datasource project
apply consistent clean-up and schema validation to raw data
76
transform project
inter-project transforms that produce re-usable datasets
77
Ontology project
build datasets that back object and link types
78
workflow project
build operational workflows on top of Ontology data and applications
79
Schedule
- Configured logic that defines what to build when
- Time or event-based triggers
- View schedule metrics in a dashboard
- Can be the subject of a Health Check (e.g., did the schedule run successfully?)
80
Health Check
- Configured logic that validates dataset build or content
- Time, job, or content based
- Failures can alert and/or stop a pipeline
- Can be arranged with other checks in a Monitoring View
81
Data Expectations
validations that run during the job that builds a dataset. They can warn you of an issue or cause the job to fail.
82
Data Expectations are considered a type of
Data Health Checks
83
Examples of Data Expectations
- Ensure a column is a valid primary key
- Check that a column is a valid foreign key
- Verify the values in a column are allowable
84
union
combines two datasets to include all rows; requires that all inputs have the same schema. If input schemas do not all match, an error message will display with a list of missing columns.
85
user-defined functions (UDFs)
run custom code in Pipeline Builder that can be versioned and upgraded.
86
Data Health Checks
validations that run separately from the job that builds a dataset. They are “backwards looking” in that sense.
87
Monitoring Views
single- or cross-project monitoring of datasets, object types, and other resources
88
Benefits of using Data Expectations
- pipeline protection
- change management
- proactive testing
89
Expectation
A strongly typed requirement on the data structure or content (e.g. column is not null)
90
Check
A meaningful expectation (which can be a composite of multiple expectations) that is connected to a single dataset (output or input) in a transform and is used when identifying and monitoring it (e.g., “object schema validation”).
91
Pre-condition
A check that is assigned to an input of a transform, typically to validate essential assumptions on the input’s structure or content before proceeding with the build.
92
Post-condition
A check that is assigned to the output of the transform, typically to guarantee dataset SLAs are maintained and downstream dependencies are protected.
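
A sketch of pre- and post-condition checks in a transform, assuming hypothetical dataset paths; the expectation helpers shown (E.primary_key, E.col(...).non_null(), E.col(...).is_in(...)) follow the transforms expectations library:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output(
        "/Pipeline/datasets/customers_clean",  # hypothetical path
        checks=[
            # Post-conditions: guarantee the output's SLA and protect downstream consumers.
            Check(E.primary_key("customer_id"), "customer_id is a valid primary key", on_error="FAIL"),
            Check(E.col("status").is_in("active", "inactive"), "status values are allowable", on_error="WARN"),
        ],
    ),
    raw=Input(
        "/Pipeline/datasets/customers_raw",  # hypothetical path
        # Pre-condition: validate an essential assumption about the input before building.
        checks=Check(E.col("customer_id").non_null(), "customer_id is present", on_error="FAIL"),
    ),
)
def compute(raw):
    # Simple illustrative transform; the checks run during the job that builds the output.
    return raw.dropDuplicates(["customer_id"])
```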
93
Check Result
Produced when the check runs (during the build) and contains information on the expectations’ results and their breakdown. Can be monitored in Data Health.
94
4 ways to view health checks in Foundry
- dataset
- project
- pipeline
- platform
95
Scheduled builds can be configured to run:
- At certain times
- When data has been updated
- When logic has been updated
- Any combination of the above conditions
96
Scheduled builds can be configured to build:
- A single dataset
- A single dataset and all its dependencies
- All datasets that depend on a dataset
- All datasets that connect two datasets
- Any combination of the above configurations
97
The incremental() decorator can be used to
wrap a transform’s compute function with logic for enabling incremental computation