Airflow Core Concepts Flashcards
- What is a DAG in Airflow?
A DAG (Directed Acyclic Graph) is the core concept of Airflow, collecting Tasks together, organized with dependencies and relationships to say how they should run.
- What is the role of a DAG in Airflow?
The DAG itself doesn’t care about what is happening inside the tasks; it is only concerned with how to execute them: the order to run them in, how many times to retry them, whether they have timeouts, and so on.
- What does a basic DAG define?
A basic DAG defines the tasks, dictates the order in which they have to run and which tasks depend on which others, and states how often the DAG should run.
- What are the three ways to declare a DAG in Airflow?
The three ways to declare a DAG in Airflow are: using a context manager (the with statement), using the standard constructor and passing the DAG into any operators you use, or using the @dag decorator to turn a function into a DAG generator.
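A minimal sketch of all three styles (assuming Airflow 2.4+, where the schedule parameter replaced schedule_interval; the dag_ids and dates are illustrative):

```python
import datetime

from airflow import DAG
from airflow.decorators import dag
from airflow.operators.empty import EmptyOperator

# 1. Context manager: tasks created inside the block join the DAG implicitly.
with DAG(dag_id="dag_via_context_manager",
         start_date=datetime.datetime(2024, 1, 1), schedule="@daily"):
    EmptyOperator(task_id="placeholder")

# 2. Standard constructor: pass the DAG explicitly to every operator.
my_dag = DAG(dag_id="dag_via_constructor",
             start_date=datetime.datetime(2024, 1, 1), schedule="@daily")
EmptyOperator(task_id="placeholder", dag=my_dag)

# 3. @dag decorator: turns a function into a DAG generator.
@dag(start_date=datetime.datetime(2024, 1, 1), schedule="@daily")
def dag_via_decorator():
    EmptyOperator(task_id="placeholder")

dag_via_decorator()
```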
- What do DAGs need to run in Airflow?
DAGs need Tasks to run in Airflow, and those usually come in the form of either Operators, Sensors, or TaskFlow.
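For illustration, one of each flavor in a single DAG (a sketch; the file path and bash command are made up):

```python
import datetime

from airflow import DAG
from airflow.decorators import task
from airflow.operators.bash import BashOperator
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="task_flavors", start_date=datetime.datetime(2024, 1, 1), schedule=None):
    # A Sensor: waits for a condition to become true.
    wait_for_file = FileSensor(task_id="wait_for_file", filepath="/tmp/data.csv")

    # A classic Operator: runs a predefined piece of work.
    process = BashOperator(task_id="process", bash_command="echo processing")

    # A TaskFlow task: a plain Python function turned into a task.
    @task
    def summarize():
        print("done")

    wait_for_file >> process >> summarize()
```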
- What does a Task/Operator in a DAG usually depend on?
A Task/Operator in a DAG usually has dependencies on other tasks (those upstream of it), and other tasks depend on it (those downstream of it).
- How are individual task dependencies declared in a DAG?
Individual task dependencies in a DAG can be declared using the >> and << bitshift operators, or using the more explicit set_upstream and set_downstream methods.
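Both syntaxes side by side, using hypothetical EmptyOperator tasks:

```python
import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="dependency_syntax", start_date=datetime.datetime(2024, 1, 1), schedule=None):
    a = EmptyOperator(task_id="a")
    b = EmptyOperator(task_id="b")
    c = EmptyOperator(task_id="c")
    d = EmptyOperator(task_id="d")

    a >> b   # b runs after a (b is downstream of a)
    c << b   # c runs after b (<< points the other way)

    # The same kind of relationship with the explicit methods:
    c.set_downstream(d)   # equivalent to c >> d
    # d.set_upstream(c) would declare the same edge from the other side
```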
- What is the cross_downstream function used for in a DAG?
The cross_downstream function is used in a DAG to create a cross-product of dependencies: every task in one list becomes upstream of every task in a second list.
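A sketch of the cross-product it creates (task names are illustrative):

```python
import datetime

from airflow import DAG
from airflow.models.baseoperator import cross_downstream
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="cross_downstream_demo", start_date=datetime.datetime(2024, 1, 1), schedule=None):
    t1, t2, t3, t4 = (EmptyOperator(task_id=f"t{i}") for i in range(1, 5))

    # Equivalent to: t1 >> t3, t1 >> t4, t2 >> t3, t2 >> t4
    # (a plain [t1, t2] >> [t3, t4] is not allowed between two lists)
    cross_downstream([t1, t2], [t3, t4])
```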
- What is the chain function used for in a DAG?
The chain function is used in a DAG to chain together dependencies in a sequence, or to create pairwise dependencies between lists of the same size.
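A sketch of the pairwise form (a plain chain(t1, t2, t3) would simply mean t1 >> t2 >> t3):

```python
import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="chain_demo", start_date=datetime.datetime(2024, 1, 1), schedule=None):
    start, a, b, c, d, end = (
        EmptyOperator(task_id=n) for n in ["start", "a", "b", "c", "d", "end"]
    )

    # Pairwise dependencies for the two equal-length lists:
    # start >> a >> c >> end  and  start >> b >> d >> end
    chain(start, [a, b], [c, d], end)
```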
- How does Airflow load DAGs?
Airflow loads DAGs from Python source files, which it looks for inside its configured DAG_FOLDER. It takes each file, executes it, and then loads any DAG objects from that file.
- Can you define multiple DAGs per Python file?
Yes, you can define multiple DAGs per Python file, or even spread one very complex DAG across multiple Python files using imports.
- What does Airflow consider when searching for DAGs inside the DAG_FOLDER?
When searching for DAGs inside the DAG_FOLDER, Airflow only considers Python files that contain the strings airflow and dag (case-insensitively) as an optimization.
- How to consider all Python files when searching for DAGs inside the DAG_FOLDER?
To consider all Python files when searching for DAGs inside the DAG_FOLDER, you should disable the DAG_DISCOVERY_SAFE_MODE configuration flag.
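In airflow.cfg this lives under the [core] section; the equivalent environment variable is AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE:

```ini
[core]
dag_discovery_safe_mode = False
```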
- What is an .airflowignore file?
An .airflowignore file is a file inside your DAG_FOLDER, or any of its subfolders, which describes patterns of files for the loader to ignore.
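For example, a file like this at the root of the DAG_FOLDER (patterns are regular expressions by default; Airflow 2.3+ can switch to glob syntax via the DAG_IGNORE_FILE_SYNTAX setting):

```
project_a
tenant_[\d]
.*_test\.py
```

This would skip any path containing project_a, paths like tenant_1, and files ending in _test.py.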
- How to control whether a Python file is parsed by Airflow in a more flexible way?
If .airflowignore does not meet your needs and you want a more flexible way to control whether a Python file is parsed by Airflow, you can plug in your own callable by setting might_contain_dag_callable in the config file.
- What’s the difference between context manager and standard constructor for DAG declaration?
The context manager adds the DAG implicitly to any tasks declared inside it, while the standard constructor requires the DAG to be passed explicitly into any operators you use.
- How to use a context manager for DAG declaration?
You can use a context manager for DAG declaration by opening a with statement on the DAG constructor; any tasks created inside the block are added to that DAG implicitly.
- How to use a standard constructor for DAG declaration?
You can use a standard constructor for DAG declaration by explicitly defining the DAG and passing it into any operators you use.
- What is the @dag decorator for?
The @dag decorator is used to turn a function into a DAG generator in Airflow.
- What’s the purpose of task dependencies in Airflow?
Task dependencies in Airflow dictate the order of task execution: by default, a task runs only after all of its upstream tasks have completed successfully.
- How to use the >> and << operators for task dependencies?
The >> and << operators are used to specify downstream and upstream dependencies respectively between tasks: a >> b makes b run after a, and a << b makes a run after b.
- How to use the set_upstream and set_downstream methods for task dependencies?
The set_upstream and set_downstream methods are used to specify upstream and downstream dependencies respectively between different tasks.
- What does the cross_downstream function do?
The cross_downstream function specifies dependencies between two lists of tasks, where every task in the second list depends on every task in the first list.
- How to use the chain function for task dependencies?
The chain function is used to specify a linear series of dependencies, where each task (or list of tasks) depends on the previous one.
- How does Airflow identify DAGs in Python source files?
When Airflow loads Python source files from its configured DAG_FOLDER, it executes each file and then loads any objects at the top level that are a DAG instance.
- What is the DAG_DISCOVERY_SAFE_MODE configuration flag for?
The DAG_DISCOVERY_SAFE_MODE configuration flag controls DAG discovery: when enabled (the default), Airflow only parses Python files that contain the strings airflow and dag; disabling it makes Airflow consider all Python files when searching for DAGs inside the DAG_FOLDER.
- What is the purpose of .airflowignore file?
The .airflowignore file is used to specify patterns of files that Airflow should ignore when searching for DAGs.
- What is might_contain_dag_callable in the config file for?
The might_contain_dag_callable setting in the config file is used to plug in your own callable that decides whether a file needs to be parsed by Airflow.
- What’s the difference between a task and an operator in Airflow?
In Airflow, an operator is a template and a task is an instance of an operator: the operator defines the kind of work to run, and the task is the concrete node in a DAG that performs it.
- What is the purpose of an EmptyOperator in Airflow?
The EmptyOperator in Airflow is a simple operator that does nothing; it’s often used as a placeholder or for debugging.
- How to specify a schedule for a DAG in Airflow?
A schedule for a DAG in Airflow can be specified using the schedule parameter of the DAG constructor; you can use cron expressions or preset strings like “@daily” for common schedules.
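Both forms, sketched (assuming Airflow 2.4+, where the parameter is named schedule):

```python
import datetime

from airflow import DAG

# Cron expression: every day at 06:00.
with DAG(dag_id="cron_schedule", start_date=datetime.datetime(2024, 1, 1),
         schedule="0 6 * * *"):
    ...

# Preset string: once a day at midnight.
with DAG(dag_id="preset_schedule", start_date=datetime.datetime(2024, 1, 1),
         schedule="@daily"):
    ...
```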
- What does the start_date parameter in a DAG do?
The start_date parameter in a DAG specifies the date from which the DAG is scheduled; the scheduler creates no DAG runs before it.
- What is the dag_id parameter for in a DAG?
The dag_id parameter in a DAG is used to uniquely identify the DAG in Airflow’s database.
- What are the recommended ways to declare task dependencies?
The recommended ways to declare task dependencies are the >> and << bitshift operators.
- What does the cross_downstream function replace?
The cross_downstream function replaces the pattern where multiple tasks each depend on multiple other tasks, which would otherwise require repeating the >> operator several times.
- How does the chain function help in dynamic task declaration?
The chain function helps in dynamic task declaration by chaining together a dynamic number of tasks.
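For instance, chaining a list of tasks built at parse time (a sketch with EmptyOperators):

```python
import datetime

from airflow import DAG
from airflow.models.baseoperator import chain
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="dynamic_chain", start_date=datetime.datetime(2024, 1, 1), schedule=None):
    steps = [EmptyOperator(task_id=f"step_{i}") for i in range(5)]
    chain(*steps)  # step_0 >> step_1 >> step_2 >> step_3 >> step_4
```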
- How does Airflow treat Python files that do not contain the strings airflow and dag?
Unless the DAG_DISCOVERY_SAFE_MODE configuration flag is disabled, Airflow ignores Python files that do not contain the strings airflow and dag.
- How does the .airflowignore file work?
The .airflowignore file describes patterns of files that Airflow should ignore when loading DAGs; it covers the directory it’s in and all subfolders underneath it.
- How to create a custom callable for might_contain_dag_callable?
You can create a custom callable for might_contain_dag_callable that takes a file path and an optional zip file and returns True if the file needs to be parsed by Airflow, and False otherwise.
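A sketch of such a callable (the module path is hypothetical; the signature follows what the option expects, a file path plus an optional zip file):

```python
# my_plugins/dag_filter.py  (hypothetical module)
from __future__ import annotations

import zipfile


def might_contain_dag(file_path: str, zip_file: zipfile.ZipFile | None = None) -> bool:
    """Return True if Airflow should parse this file for DAG objects."""
    # Illustrative rule: only files whose names end in "_dag.py" define DAGs.
    return file_path.endswith("_dag.py")
```

Then point the config at it:

```ini
[core]
might_contain_dag_callable = my_plugins.dag_filter.might_contain_dag
```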
- How can you make multiple tasks depend on one task using the >> operator?
You can make multiple tasks depend on one task using the >> operator by putting the downstream tasks in a list, like this: first_task >> [second_task, third_task].
- How can you make multiple tasks depend on one task using the set_downstream method?
You can make multiple tasks depend on one task using the set_downstream method by passing a list of tasks, like this: first_task.set_downstream([second_task, third_task]).