Scheduling data processing jobs using Cron and Apache Airflow Flashcards
What is Apache Airflow?
Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows.
What is a DAG in Airflow?
DAG stands for Directed Acyclic Graph, and it represents a workflow, detailing the tasks and their dependencies in Airflow.
What are Tasks in Airflow?
Tasks are the basic units of execution in Airflow and are defined as individual steps in a DAG.
How do you define a DAG in Airflow?
A DAG is defined using Python code, with the DAG class from the airflow module.
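A minimal sketch of such a definition (the dag_id, dates, and task names here are illustrative, and `EmptyOperator` assumes Airflow 2.3+; older versions used `DummyOperator`):

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# The DAG class is used as a context manager so tasks defined
# inside the block are attached to this DAG automatically.
with DAG(
    dag_id="example_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
```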
What is an Operator in Airflow?
Operators are templates that define individual tasks in a DAG. They determine what the tasks do, such as running a script, executing a Bash command, or calling an API.
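As an illustrative fragment (assumed to sit inside a DAG definition block), a PythonOperator runs a Python callable and a BashOperator runs a shell command:

```python
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _extract():
    # placeholder for the real extraction logic
    print("pulling rows from the source system")

# Each operator instance becomes one task in the DAG.
extract = PythonOperator(task_id="extract", python_callable=_extract)
notify = BashOperator(task_id="notify", bash_command="echo 'extract done'")
```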
What is a Sensor in Airflow?
Sensors are a special type of operator that will keep running until a certain condition is met. They are used to wait for external events.
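For example, the built-in FileSensor polls for a file to appear before letting downstream tasks run (the filepath below is illustrative):

```python
from airflow.sensors.filesystem import FileSensor

wait_for_file = FileSensor(
    task_id="wait_for_input",
    filepath="/data/input.csv",  # illustrative path to watch for
    poke_interval=60,            # re-check every 60 seconds
    timeout=60 * 60,             # fail the sensor after one hour of waiting
)
```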
What is the purpose of the Airflow Scheduler?
The Scheduler continuously monitors all DAGs, triggers DAG runs when their schedule is due, and submits task instances for execution once their dependencies are satisfied.
What is the Airflow Web UI?
The Web UI is a graphical interface provided by Airflow to help users monitor and manage DAGs and tasks.
How does Airflow handle task dependencies?
Task dependencies are managed by setting upstream and downstream relationships in the DAG definition using the `>>` and `<<` bitshift operators.
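Assuming three tasks named extract, transform, and load have been defined, the dependencies can be wired either direction:

```python
# extract must finish before transform, which must finish before load
extract >> transform >> load

# the same chain written with the reversed operator
load << transform << extract
```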
What is a Task Instance in Airflow?
A Task Instance represents a specific run of a task in a DAG, characterized by its execution date and state.
What are XComs in Airflow?
XComs, or Cross-Communication, allow tasks to share small amounts of data, enabling inter-task communication.
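A sketch of pushing and pulling a value through the task instance (`ti`), which Airflow injects into a PythonOperator callable's context; the task ids and key are illustrative. Note that a callable's return value is also pushed automatically under the key `return_value`:

```python
def _push(ti):
    # store a small value for downstream tasks
    ti.xcom_push(key="row_count", value=42)

def _pull(ti):
    # fetch the value pushed by the upstream task
    row_count = ti.xcom_pull(task_ids="push_rows", key="row_count")
    print(f"upstream produced {row_count} rows")
```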
What is a Task Group in Airflow?
Task Groups allow for grouping related tasks within a DAG, improving organization and readability.
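A minimal sketch (group and task ids are illustrative; `EmptyOperator` assumes Airflow 2.3+):

```python
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with TaskGroup(group_id="transform") as transform_group:
    clean = EmptyOperator(task_id="clean")
    enrich = EmptyOperator(task_id="enrich")
    clean >> enrich  # dependencies inside the group

# the whole group can then be wired like a single task:
# extract >> transform_group >> load
```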
What are Hooks in Airflow?
Hooks are interfaces to external systems, providing methods to interact with databases, cloud services, and other systems.
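For instance, a PostgresHook wraps a configured Airflow connection; this sketch assumes the Postgres provider package is installed and that a connection with the id `my_postgres` exists:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

# the connection id refers to credentials stored in Airflow, not in code
hook = PostgresHook(postgres_conn_id="my_postgres")
rows = hook.get_records("SELECT count(*) FROM events")
```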
How can you handle retries and failures in Airflow tasks?
You can configure retries and failure handling using parameters like retries, retry_delay, and retry_exponential_backoff in the task definition.
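A sketch of those parameters on a task (the task id and callable name are illustrative):

```python
from datetime import timedelta
from airflow.operators.python import PythonOperator

call_api = PythonOperator(
    task_id="call_api",
    python_callable=fetch_from_api,    # assumed to be defined elsewhere
    retries=3,                         # re-attempt up to 3 times on failure
    retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
    retry_exponential_backoff=True,    # lengthen the delay after each retry
)
```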
What is the Airflow Executor?
The Executor is a key component that determines how task instances are executed. Popular executors include the SequentialExecutor, LocalExecutor, and CeleryExecutor.
Why is it important to set catchup=False in a DAG?
Setting catchup=False prevents Airflow from backfilling missed schedule intervals when the DAG was not running.
What is the importance of idempotency in Airflow tasks?
Ensuring that tasks are idempotent means they can be run multiple times without causing different outcomes, which is crucial for rerunning tasks reliably.
How do you handle sensitive information in Airflow?
Sensitive information should be handled using Airflow’s connection management and environment variables to avoid hardcoding secrets in DAGs.
What is the purpose of defining start_date and schedule_interval in a DAG?
The start_date marks the beginning of the DAG's first schedule interval (a run for an interval is triggered after that interval ends), and schedule_interval defines how often new DAG runs are triggered.
How can you test an Airflow DAG?
You can test a single task with the `airflow tasks test <dag_id> <task_id> <date>` CLI command (`airflow test` in older versions), which runs the task without recording state in the metadata database.