Data Pipeline Creation & Management (Ontologize Prep) Flashcards

foundational knowledge

1
Q

What is Data Connection?

A

The primary way you get data into and out of Foundry.

2
Q

Data Connection is where you

A

define how data should be exchanged between Foundry and external systems such as databases, ERP systems, SaaS apps, cloud storage, and more

3
Q

Sources store information for

A

how to connect to a system

4
Q

Sources store

A

credentials & secrets

5
Q

Sources require

A

egress policies

6
Q

Examples of sources

A

Snowflake, AWS S3, REST APIs, on-prem databases, SFTP, ERP

7
Q

Syncs store information about

A

what data to retrieve from a source and how to do it (e.g., schedule)

8
Q

Syncs produce

A

datasets or streams

9
Q

Incremental syncs are useful for

A

ingesting only the most recent data

10
Q

For incremental syncs, you need to specify

A
  • The transaction type
  • The logic used to determine which data to ingest (e.g. SQL query)
  • What information should be saved between runs
11
Q

The transaction type does not affect

A

which data is brought into Foundry

12
Q

The transaction type does affect

A

how data is saved

13
Q

Transaction type: Snapshot

A

overwrite all data in the Foundry dataset

14
Q

Transaction type: Append

A

add to what’s already in the Foundry dataset

15
Q

Transaction type: Update

A

change or add to what’s already in the Foundry dataset

16
Q

To make a sync get only the latest data, you need to provide

A
  1. A column whose values can be sorted reliably
  2. A starting value for that column that will be used in a WHERE clause
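
A sketch of that pattern (the table, column, and saved value are illustrative; the actual query lives in the sync configuration):

```python
# Illustrative sketch of an incremental sync query: the sync saves the highest
# value it has seen for a reliably sortable column and uses it as the lower
# bound of the next run's WHERE clause. Names and values here are hypothetical.
last_seen_order_id = 10_000  # persisted by the sync between runs

incremental_query = f"""
SELECT *
FROM orders
WHERE order_id > {last_seen_order_id}
ORDER BY order_id
"""
print(incremental_query)
```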
17
Q

Transactions are

A

file-based

18
Q

Transaction type Update is not available for

A

non-file-based sources

19
Q

For update-based workflows, use

A

methods such as SCD2 (slowly changing dimensions) or CDC (change data capture)

20
Q

Virtual tables

A
  • Window into external system
  • Push down compute when possible (e.g. BigQuery)
21
Q

Exports

A
  • One way of exporting data to source systems
  • Less flexible than External Transforms but potentially easier to configure and manage
22
Q

Why use incremental transforms?

A

Make pipelines more efficient by reading, processing, and writing only the new or updated data

23
Q

Snapshot transforms process

A

the whole input dataset(s) and write the entire output dataset(s)

24
Q

Incremental transforms only

A

process the new or updated data

25
A dataset view is
what’s in the dataset at a certain point in time
26
datasets
buckets of files with a historical record (e.g. parquet)
27
a dataset view references
a specific set of files (as well as other metadata)
28
dataset view is not
analogous to a view in SQL databases
29
each entry in a dataset’s history is a
transaction
30
each entry corresponds
to a different dataset view
31
what are transactions
They describe modifications made to the files in a dataset.
32
transactions are
atomic and blocking
33
transaction is
analogous to transactions in SQL databases
34
transaction type: delete
removes files from output dataset
35
open transactions
protect a dataset from conflicting updates
36
completed transactions
describe how we modified the output dataset(s) in a transform
37
non-incremental transforms will
read all the files of the input dataset(s)
38
read modes: incremental transforms give different options for input dataset(s)
- added (default, and the one used most of the time)
- current
- previous
- modified
- deleted
39
read modes: options for output dataset(s)
- added
- current (default)
- previous (the preferable choice)
- modified
- deleted
40
write modes: options for output dataset(s)
- modify (default)
- replace (what happens when running a Snapshot build)
41
Output transaction type when the input transaction is APPEND or UPDATE and the output write mode is REPLACE
SNAPSHOT
42
Output transaction type when the input transaction is APPEND or UPDATE and the output write mode is MODIFY
INCREMENTAL (APPEND or UPDATE)
43
Output transaction type when the input transaction is SNAPSHOT and the output write mode is REPLACE
SNAPSHOT
44
Output transaction type when the input transaction is SNAPSHOT and the output write mode is MODIFY
SNAPSHOT
45
Handling Snapshot inputs
Must specify which input datasets are OK to have as Snapshots
46
Output transaction type, with snapshot inputs specified, when the input transaction is APPEND or UPDATE and the output write mode is REPLACE
SNAPSHOT
47
Output transaction type, with snapshot inputs specified, when the input transaction is APPEND or UPDATE and the output write mode is MODIFY
INCREMENTAL (APPEND or UPDATE)
48
Output transaction type, with snapshot inputs specified, when the input transaction is SNAPSHOT and the output write mode is REPLACE
SNAPSHOT
49
Output transaction type, with snapshot inputs specified, when the input transaction is SNAPSHOT and the output write mode is MODIFY
SNAPSHOT or INCREMENTAL
50
Syntax notes: incremental transforms must use
the @incremental and @transform decorators
51
Syntax notes: incremental transforms can choose
read modes when creating dataframes
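
A minimal sketch of these syntax notes, with hypothetical dataset paths; with @incremental, the input's default read mode is "added" and the output's default write mode is "modify":

```python
from transforms.api import transform, incremental, Input, Output
from pyspark.sql import functions as F

@incremental()
@transform(
    processed=Output("/Pipeline/datasets/processed_events"),  # hypothetical path
    events=Input("/Pipeline/datasets/raw_events"),            # hypothetical path
)
def compute(events, processed):
    # Read mode "added" (the incremental default): only the rows appended to
    # the input since the last successful build.
    new_rows = events.dataframe("added")
    cleaned = new_rows.filter(F.col("event_id").isNotNull())
    # Default write mode "modify": the cleaned rows are added to the existing
    # output rather than replacing it.
    processed.write_dataframe(cleaned)
```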
52
Option 1: periodic Snapshot builds
typically time- or file-based branching logic in the transform
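
A sketch of the time-based branching idea from Option 1, with hypothetical dataset paths and an illustrative weekly trigger; setting the write mode to "replace" produces a Snapshot transaction on the output:

```python
from datetime import date
from transforms.api import transform, incremental, Input, Output

@incremental()
@transform(
    out=Output("/Pipeline/datasets/orders_clean"),  # hypothetical path
    orders=Input("/Pipeline/datasets/orders_raw"),  # hypothetical path
)
def compute(orders, out):
    if date.today().weekday() == 6:  # illustrative trigger: snapshot on Sundays
        # Periodic Snapshot branch: replace the whole output and re-read the
        # full current view of the input.
        out.set_mode("replace")
        df = orders.dataframe("current")
    else:
        # Incremental branch: append only the newly added rows.
        df = orders.dataframe("added")
    out.write_dataframe(df)
```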
53
Option 2: Projections
Analogous to materialized views in SQL databases
54
What is Pipeline Builder?
- No-code application for creating production pipeline segments
- Git-based version control
- Integration with Foundry’s data engineering suite
55
How is Pipeline Builder used generally?
Clean and prepare raw data for downstream use
56
How is Pipeline Builder used specifically?
Create a backing dataset
57
To use a dataset from another project, you must add a
Project Reference
58
Upstream schema changes
impact downstream nodes
59
Manual intervention is often needed to
ensure consistent schema
60
Monitoring View
a mission control for your data pipeline
61
The Scheduler application ensures
datasets build only if out of date with ancestors
62
Data Lineage gives you
quick insight into overall pipeline health and freshness
63
Job
the computation of a new version of one or more datasets
64
Build
Composed of one or more jobs, builds are the subject of schedules
65
Sync
A synchronization of data between services
66
Pipeline
Semantically related datasets that build in sequence to generate specific outputs
67
Schedule
Configurable logic defining what to build when
68
Health Check
Configurable logic to validate data quality
69
Monitoring View
A curated set of health checks displayed in a dashboard & subscribed to for alerting
70
Data pipelines programmatically
connect inputs to outputs
71
Targets are the pipeline's outputs that support
workflow products
72
Inputs begin a pipeline's logic, often as ingests from
external data sources
73
Intermediate datasets are
built along the path to building the targets
74
data connection
sync raw data from source systems into a Datasource Project
75
datasource project
apply consistent clean-up and schema validation to raw data
76
transform project
inter-project transforms that produce re-usable datasets
77
Ontology project
build datasets that back object and link types
78
workflow project
build operational workflows on top of Ontology data and applications
79
Schedule
- Configured logic that defines what to build when
- Time or event-based triggers
- View schedule metrics in a dashboard
- Can be the subject of a Health Check (e.g., did the schedule run successfully?)
80
Health Check
- Configured logic that validates dataset build or content
- Time, job, or content based
- Failures can alert and/or stop a pipeline
- Can be arranged with other checks in a Monitoring View
81
Data Expectations
validations that run during the job that builds a dataset. They can warn you of an issue or cause the job to fail.
82
Data Expectations are considered a type of
Data Health Checks
83
Examples of Data Expectations
- Ensure a column is a valid primary key
- Check that a column is a valid foreign key
- Verify the values in a column are allowable
84
union
combines two datasets to include all rows; requires that all inputs have the same schema. If input schemas do not all match, an error message will display with a list of missing columns.
85
user-defined functions (UDFs)
run custom code in Pipeline Builder that can be versioned and upgraded.
86
Data Health Checks
validations that run separately from the job that builds a dataset. They are “backwards looking” in that sense.
87
Monitoring Views
single- or cross-project monitoring of datasets, object types, and other resources
88
Benefits of using Data Expectations
- pipeline protection
- change management
- proactive testing
89
Expectation
A strongly typed requirement on the data structure or content (e.g. column is not null)
90
Check
A meaningful expectation (which can be a composite of multiple expectations) that is connected to a single dataset (output or input) in a transform and is used when identifying and monitoring it (e.g., “object schema validation”).
91
Pre-condition
A check that is assigned to an input of a transform, typically to validate essential assumptions on the input’s structure or content before proceeding with the build.
92
Post-condition
A check that is assigned to the output of the transform, typically to guarantee dataset SLAs are maintained and downstream dependencies are protected.
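
A sketch of pre- and post-condition checks in a transform, assuming hypothetical dataset paths; the expectation helpers shown (E.primary_key, E.col(...).non_null(), E.col(...).is_in(...)) follow the transforms expectations library:

```python
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E

@transform_df(
    Output(
        "/Pipeline/datasets/customers_clean",  # hypothetical path
        checks=[
            # Post-conditions: guarantee the output's SLA and protect downstream consumers.
            Check(E.primary_key("customer_id"), "customer_id is a valid primary key", on_error="FAIL"),
            Check(E.col("status").is_in("active", "inactive"), "status values are allowable", on_error="WARN"),
        ],
    ),
    raw=Input(
        "/Pipeline/datasets/customers_raw",  # hypothetical path
        # Pre-condition: validate an essential assumption about the input before building.
        checks=Check(E.col("customer_id").non_null(), "customer_id is present", on_error="FAIL"),
    ),
)
def compute(raw):
    # Simple illustrative transform; the checks run during the job that builds the output.
    return raw.dropDuplicates(["customer_id"])
```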
93
Check Result
Produced when the check runs (during the build) and contains information on the expectations’ results and their breakdown. Can be monitored in Data Health.
94
4 ways to view health checks in Foundry
- dataset
- project
- pipeline
- platform
95
Scheduled builds can be configured to run:
- At certain times
- When data has been updated
- When logic has been updated
- Any combination of the above conditions
96
Scheduled builds can be configured to build:
- A single dataset
- A single dataset and all its dependencies
- All datasets that depend on a dataset
- All datasets that connect two datasets
- Any combination of the above configurations
97
The incremental() decorator can be used to
wrap a transform’s compute function with logic for enabling incremental computation