Airflow Core Concepts - Sheet1 Flashcards

1
Q
  1. What is a DAG in Airflow?
A

A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run.

2
Q
  1. What is the role of a DAG in Airflow?
A

The DAG itself doesn’t care about what is happening inside the tasks; it is only concerned with how to execute them: the order to run them in, how many times to retry them, timeouts, and so on.

3
Q
  1. What does a basic DAG define?
A

A basic DAG defines the tasks and dictates the order in which they have to run, and which tasks depend on what others. It also states how often to run the DAG.

4
Q
  1. What are the three ways to declare a DAG in Airflow?
A

The three ways to declare a DAG in Airflow are: using a context manager, using a standard constructor (passing the DAG into any operators you use), or using the @dag decorator to turn a function into a DAG generator.
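
For example, a minimal sketch of the three declaration styles, assuming Airflow 2.4+ (the dag_ids, dates, and EmptyOperator tasks are placeholders):

    import datetime
    from airflow import DAG
    from airflow.decorators import dag
    from airflow.operators.empty import EmptyOperator

    # 1. Context manager: tasks created inside the block join the DAG implicitly.
    with DAG(dag_id="ctx_dag", start_date=datetime.datetime(2024, 1, 1), schedule=None):
        EmptyOperator(task_id="task")

    # 2. Standard constructor: the DAG object is passed into each operator.
    ctor_dag = DAG(dag_id="ctor_dag", start_date=datetime.datetime(2024, 1, 1), schedule=None)
    EmptyOperator(task_id="task", dag=ctor_dag)

    # 3. @dag decorator: the function becomes a DAG generator; calling it creates the DAG.
    @dag(dag_id="decorated_dag", start_date=datetime.datetime(2024, 1, 1), schedule=None)
    def generate_dag():
        EmptyOperator(task_id="task")

    generate_dag()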

5
Q
  1. What do DAGs need to run in Airflow?
A

DAGs need Tasks to run in Airflow, and those usually come in the form of either Operators, Sensors, or TaskFlow.

6
Q
  1. What does a Task/Operator in a DAG usually depend on?
A

A Task/Operator in a DAG usually has dependencies on other tasks (those upstream of it), and other tasks depend on it (those downstream of it).

7
Q
  1. How are individual task dependencies declared in a DAG?
A

Individual task dependencies in a DAG can be declared using the >> and << operators, or using the more explicit set_upstream and set_downstream methods.
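
For example, each of these statements declares the same dependency (first_task and second_task are placeholder operators):

    # All four statements say: first_task must run before second_task.
    first_task >> second_task
    second_task << first_task
    first_task.set_downstream(second_task)
    second_task.set_upstream(first_task)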

8
Q
  1. What is the cross_downstream method used for in a DAG?
A

The cross_downstream method is used in a DAG to make two lists of tasks depend on all parts of each other.
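
A short sketch, assuming placeholder operators op1 through op4:

    from airflow.models.baseoperator import cross_downstream

    # Equivalent to: [op1, op2] >> op3 and [op1, op2] >> op4,
    # i.e. op3 and op4 each depend on both op1 and op2.
    cross_downstream([op1, op2], [op3, op4])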

9
Q
  1. What is the chain method used for in a DAG?
A

The chain method is used in a DAG to chain together dependencies, or to create pairwise dependencies for lists of the same size.
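
A short sketch, assuming placeholder operators op1 through op6:

    from airflow.models.baseoperator import chain

    # Linear chain: op1 -> op2 -> op3 -> op4
    chain(op1, op2, op3, op4)

    # Pairwise dependencies for equally sized lists:
    # op1 -> op2 -> op4 -> op6 and op1 -> op3 -> op5 -> op6
    chain(op1, [op2, op3], [op4, op5], op6)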

10
Q
  1. How does Airflow load DAGs?
A

Airflow loads DAGs from Python source files, which it looks for inside its configured DAG_FOLDER. It takes each file, executes it, and then loads any DAG objects from that file.

11
Q
  1. Can you define multiple DAGs per Python file?
A

Yes, you can define multiple DAGs per Python file, or even spread one very complex DAG across multiple Python files using imports.

12
Q
  1. What does Airflow consider when searching for DAGs inside the DAG_FOLDER?
A

When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization.

13
Q
  1. How to consider all Python files when searching for DAGs inside the DAG_FOLDER?
A

To consider all Python files when searching for DAGs inside the DAG_FOLDER, you should disable the DAG_DISCOVERY_SAFE_MODE configuration flag.

14
Q
  1. What is an .airflowignore file?
A

An .airflowignore file is a file inside your DAG_FOLDER, or any of its subfolders, which describes patterns of files for the loader to ignore.

15
Q
  1. How to control if a python file needs to be parsed by Airflow in a more flexible way?
A

If .airflowignore does not meet your needs and you want a more flexible way to control whether a Python file needs to be parsed by Airflow, you can plug in your own callable by setting might_contain_dag_callable in the config file.

16
Q
  1. What’s the difference between context manager and standard constructor for DAG declaration?
A

The context manager automatically adds the DAG to any tasks inside it implicitly while the standard constructor requires the DAG to be passed into any operators used.

17
Q
  1. How to use a context manager for DAG declaration?
A

You can use a context manager for DAG declaration with the with statement and the DAG function.

18
Q
  1. How to use a standard constructor for DAG declaration?
A

You can use a standard constructor for DAG declaration by explicitly defining the DAG and passing it into any operators you use.

19
Q
  1. What is the @dag decorator for?
A

The @dag decorator is used to turn a function into a DAG generator in Airflow.
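
A minimal TaskFlow-style sketch (the names are placeholders):

    import datetime
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=datetime.datetime(2024, 1, 1))
    def example_taskflow():
        @task
        def hello():
            print("hello")

        hello()

    # Calling the generator creates the DAG at the top level of the file.
    example_taskflow()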

20
Q
  1. What’s the purpose of task dependencies in Airflow?
A

Task dependencies in Airflow dictate the order of task execution based on the dependencies between different tasks.

21
Q
  1. How to use the >> and << operators for task dependencies?
A

The >> and << operators are used to specify downstream and upstream dependencies respectively between different tasks.

22
Q
  1. How to use the set_upstream and set_downstream methods for task dependencies?
A

The set_upstream and set_downstream methods are used to specify upstream and downstream dependencies respectively between different tasks.

23
Q
  1. What does the cross_downstream function do?
A

The cross_downstream function is used to specify dependencies between two lists of tasks, where every task in the second list depends on every task in the first list.

24
Q
  1. How to use the chain method for task dependencies?
A

The chain method is used to specify a series of dependencies between tasks where each task is dependent on the previous one.

25
Q
  1. How does Airflow identify DAGs in Python source files?
A

When Airflow loads Python source files from its configured DAG_FOLDER, it executes each file and then loads any objects at the top level that are a DAG instance.

26
Q
  1. What is DAG_DISCOVERY_SAFE_MODE configuration flag for?
A

The DAG_DISCOVERY_SAFE_MODE configuration flag is used to make Airflow consider all Python files when searching for DAGs inside the DAG_FOLDER.

27
Q
  1. What is the purpose of .airflowignore file?
A

The .airflowignore file is used to specify patterns of files that Airflow should ignore when searching for DAGs.

28
Q
  1. What is might_contain_dag_callable in the config file for?
A

The might_contain_dag_callable in the config file is used to plug in your own callable that checks if a file needs to be parsed by Airflow.

29
Q
  1. What’s the difference between a task and an operator in Airflow?
A

In Airflow, an Operator is a reusable template for a predefined task, and a Task is the instantiation of an Operator inside a DAG. They’re essentially the same concept at different stages: the Operator defines the kind of work to run, and the Task is the concrete node in the DAG that executes it.

30
Q
  1. What is the purpose of an EmptyOperator in Airflow?
A

The EmptyOperator in Airflow is a simple operator that does nothing; it’s often used as a placeholder or for debugging.
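
For example (the task_id is a placeholder):

    from airflow.operators.empty import EmptyOperator

    # A no-op task, often used as a start/end marker or a temporary placeholder.
    start = EmptyOperator(task_id="start")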

31
Q
  1. How to specify a schedule for a DAG in Airflow?
A

A schedule for a DAG in Airflow can be specified using the schedule parameter when declaring the DAG; you can use cron expressions or preset strings like “@daily” for common schedules.
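
A sketch assuming Airflow 2.4+, where the schedule argument is available (dag_ids and dates are placeholders):

    import datetime
    from airflow import DAG

    # Both DAGs run once a day at midnight.
    cron_dag = DAG(dag_id="daily_cron", start_date=datetime.datetime(2024, 1, 1),
                   schedule="0 0 * * *")
    preset_dag = DAG(dag_id="daily_preset", start_date=datetime.datetime(2024, 1, 1),
                     schedule="@daily")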

32
Q
  1. What does the start_date parameter in a DAG do?
A

The start_date parameter in a DAG specifies the date when the DAG should start running.

33
Q
  1. What is the dag_id parameter for in a DAG?
A

The dag_id parameter in a DAG is used to uniquely identify the DAG in Airflow’s database.

34
Q
  1. What are the recommended ways to declare task dependencies?
A

The recommended ways to declare task dependencies are using the >> and << operators.

35
Q
  1. What does the cross_downstream function replace?
A

The cross_downstream function replaces the pattern where multiple tasks depend on multiple other tasks, which would otherwise require using the >> operator many times.

36
Q
  1. How does the chain function help in dynamic task declaration?
A

The chain function helps in dynamic task declaration by chaining together a dynamic number of tasks.

37
Q
  1. How does Airflow treat Python files that do not contain the strings airflow and dag?
A

Unless the DAG_DISCOVERY_SAFE_MODE configuration flag is disabled, Airflow ignores Python files that do not contain the strings airflow and dag.

38
Q
  1. How does the .airflowignore file work?
A

The .airflowignore file describes patterns of files that Airflow should ignore when loading DAGs; it covers the directory it’s in and all subfolders underneath it.

39
Q
  1. How to create a custom callable for might_contain_dag_callable?
A

You can create a custom callable for might_contain_dag_callable that takes a file path and an optional zip file and returns True if the file needs to be parsed by Airflow, otherwise it returns False.
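
A minimal sketch of such a callable; the module path and the .endswith rule are assumptions, and the [core] might_contain_dag_callable setting would point at it (e.g. my_company.dag_policy.might_contain_dag):

    from __future__ import annotations

    import zipfile

    def might_contain_dag(file_path: str, zip_file: zipfile.ZipFile | None = None) -> bool:
        # Return True if Airflow should parse this file when looking for DAGs.
        # Hypothetical rule: only files ending in "_dag.py" are parsed.
        return file_path.endswith("_dag.py")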

40
Q
  1. How can you make a task depend on multiple tasks using the >> operator?
A

You can make a task depend on multiple tasks using the >> operator by putting the multiple tasks in a list, like this: first_task >> [second_task, third_task].

41
Q
  1. How can you make a task depend on multiple tasks using the set_downstream method?
A

You can make a task depend on multiple tasks using the set_downstream method by passing multiple tasks as arguments, like this: first_task.set_downstream(second_task, third_task).

42
Q
  1. How can you specify a complex chain of dependencies using the chain function?
A

You can specify a complex chain of dependencies using the chain function by passing multiple tasks or lists of tasks as arguments, like this: chain(op1, [op2, op3], [op4, op5], op6).

43
Q
  1. How can you define multiple DAGs in a single Python file?
A

You can define multiple DAGs in a single Python file by creating multiple instances of the DAG class at the top level of the file.

44
Q
  1. How can you split a complex DAG across multiple Python files?
A

You can split a complex DAG across multiple Python files by defining different parts of the DAG in different files and then importing them into a main file.

45
Q
  1. How does Airflow find the DAGs in your Python files?
A

Airflow finds the DAGs in your Python files by executing each file and loading any DAG objects that are defined at the top level.

46
Q
  1. What happens if a DAG is defined inside a function in your Python file?
A

If a DAG is defined inside a function in your Python file, Airflow won’t find it because it only loads DAG objects that are defined at the top level.
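
A small sketch of the difference (the names are placeholders):

    import datetime
    from airflow import DAG

    def build_dag():
        # This DAG only exists inside the function; Airflow won't discover it
        # unless the result is exposed at the top level of the file.
        return DAG(dag_id="hidden_dag", start_date=datetime.datetime(2024, 1, 1), schedule=None)

    # Assigning the returned DAG to a top-level name makes it discoverable.
    my_dag = build_dag()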

47
Q
  1. How can you make Airflow parse all Python files in the DAG_FOLDER?
A

You can make Airflow parse all Python files in the DAG_FOLDER by disabling the DAG_DISCOVERY_SAFE_MODE configuration flag.

48
Q
  1. How can you exclude certain files from being parsed by Airflow?
A

You can exclude certain files from being parsed by Airflow by creating an .airflowignore file in your DAG_FOLDER and specifying the patterns of files to ignore.

49
Q
  1. What happens if you define a callable for might_contain_dag_callable?
A

If you define a callable for might_contain_dag_callable, it will replace the default Airflow heuristic for deciding if a file should be parsed.

50
Q
  1. What is the function of the dag_id parameter in the DAG declaration?
A

The dag_id parameter in the DAG declaration is used to assign a unique identifier to the DAG.

51
Q
  1. What are the two ways a DAG can run?
A

They can either be triggered manually (or via the API), or run on a defined schedule.

52
Q
  1. How is a DAG schedule defined?
A

It is defined as part of the DAG via the schedule argument

53
Q
  1. What argument is used to define a DAG schedule?
A

The “schedule” argument

54
Q
  1. What is a valid Crontab schedule value?
A

0 0 * * * is a valid Crontab schedule value

55
Q
  1. What happens every time you run a DAG?
A

A new instance of the DAG is created, which Airflow calls a DAG Run

56
Q
  1. Can DAG Runs for the same DAG run in parallel?
A

Yes, they can

57
Q
  1. What identifies the period of data that tasks should operate on?
A

The defined data interval in each DAG Run

58
Q
  1. What is the purpose of the data interval for tasks, operators, and sensors inside a DAG?
A

It identifies the period of data that these should operate on

59
Q
  1. What is instantiated along with a DAG Run when a DAG is run?
A

Tasks specified inside a DAG are instantiated into Task Instances

60
Q
  1. What are the two dates associated with a DAG run?
A

A start date and an end date

61
Q
  1. What does the period between the start and end date of a DAG run describe?
A

The time when the DAG actually ran

62
Q
  1. What is the “logical date” of a DAG run?
A

It describes the intended time a DAG run is scheduled or triggered

63
Q
  1. What value should the logical date equal if a DAG run is manually triggered by the user?
A

The date and time at which the DAG run was triggered, which equals the DAG run’s start date.

64
Q
  1. What does the logical date indicate when the DAG is being automatically scheduled?
A

It marks the start of the data interval that the scheduled run covers.

65
Q
  1. Is it necessary for every Operator/Task to be assigned to a DAG in order to run?
A

Yes, it is necessary

66
Q
  1. How can you calculate the DAG without passing it explicitly?
A

By declaring your Operator inside a with DAG block, inside a @dag decorator, or by putting your Operator upstream or downstream of an Operator that has a DAG

67
Q
  1. Can default arguments be applied to all operators tied to a DAG?
A

Yes, they can be applied using the “default_args” argument when creating the DAG
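
A sketch, assuming Airflow 2.4+ (names and values are placeholders):

    import datetime
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator

    with DAG(
        dag_id="defaults_example",
        start_date=datetime.datetime(2024, 1, 1),
        schedule="@daily",
        default_args={"retries": 2},  # applied to every operator in this DAG
    ):
        # This task inherits retries=2 from default_args.
        EmptyOperator(task_id="noop")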

68
Q
  1. What is the @dag decorator used for?
A

It is used to decorate a function to turn it into a DAG generator function

69
Q
  1. What are the parameters in your function set up as when using the @dag decorator?
A

They are set up as DAG parameters

70
Q
  1. What does Airflow do with DAGs declared in a function with @dag?
A

Airflow only loads DAGs that appear at the top level of a DAG file, so the @dag-decorated function must be called at least once and its result exposed at the top level for the DAG to be loaded.

71
Q
  1. When will a DAG run a Task by default?
A

When all the Tasks it depends on are successful

72
Q
  1. How can you modify the default behavior of a DAG running a Task?
A

By using Branching, Latest Only, Depends On Past, or Trigger Rules

73
Q
  1. What is branching in a DAG?
A

It is the ability to select which Task to move onto based on a condition

74
Q
  1. What does the @task.branch decorator do?
A

It expects the decorated function to return the ID of a task (or a list of task IDs) to follow; all other tasks directly downstream are skipped.
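
A minimal sketch (the task_ids are placeholders):

    from airflow.decorators import task

    @task.branch(task_id="choose_branch")
    def choose_branch():
        # Return the task_id (or list of task_ids) of the branch(es) to follow;
        # other tasks directly downstream of this one are skipped.
        return "process_weekday"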

75
Q
  1. How is branching used with XComs?
A

The branching function can use XCom values pushed by upstream tasks to dynamically decide which branch to follow.

76
Q
  1. What is the Latest Only operator?
A

It skips all tasks downstream of itself if you are not on the latest DAG run
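
A sketch (downstream_task is a placeholder):

    from airflow.operators.latest_only import LatestOnlyOperator

    latest_only = LatestOnlyOperator(task_id="latest_only")
    # downstream_task is skipped unless this run is the latest scheduled DAG run.
    latest_only >> downstream_task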

77
Q
  1. How does the Depends On Past functionality work?
A

A task with depends_on_past=True only runs if the same task succeeded (or was skipped) in the previous DAG Run.
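
A sketch using a BashOperator as a placeholder task:

    from airflow.operators.bash import BashOperator

    # With depends_on_past=True, this task runs only if its own previous
    # instance (in the previous DAG run) succeeded or was skipped.
    incremental_load = BashOperator(
        task_id="incremental_load",
        bash_command="echo load",
        depends_on_past=True,
    )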
