Scheduling data processing jobs using Cron and Apache Airflow Flashcards

1
Q

What is Apache Airflow?

A

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows.

2
Q

What is a DAG in Airflow?

A

DAG stands for Directed Acyclic Graph, and it represents a workflow, detailing the tasks and their dependencies in Airflow.

3
Q

What are Tasks in Airflow?

A

Tasks are the basic units of execution in Airflow and are defined as individual steps in a DAG.

4
Q

How do you define a DAG in Airflow?

A

A DAG is defined using Python code, with the DAG class from the airflow module.
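A minimal sketch of a DAG definition, assuming Airflow 2.3+ (EmptyOperator was DummyOperator in earlier versions); the dag_id and dates are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# The DAG object collects tasks and scheduling metadata; dag_id must be
# unique across the Airflow deployment.
with DAG(
    dag_id="example_pipeline",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="start")
```

Using the DAG as a context manager (`with DAG(...) as dag:`) attaches every task created inside the block to that DAG automatically.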

5
Q

What is an Operator in Airflow?

A

Operators are templates that define individual tasks in a DAG. They determine what the tasks do, such as running a script, executing a Bash command, or calling an API.
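A sketch of the two most common operators, assuming Airflow 2.x import paths; the dag_id, command, and function are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def greet():
    print("hello from Python")

with DAG(dag_id="operator_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # BashOperator runs a shell command.
    run_bash = BashOperator(task_id="run_bash", bash_command="echo hello")
    # PythonOperator calls a Python function.
    run_py = PythonOperator(task_id="run_py", python_callable=greet)
```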

6
Q

What is a Sensor in Airflow?

A

Sensors are a special type of operator that will keep running until a certain condition is met. They are used to wait for external events.
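A sketch using FileSensor, assuming Airflow 2.x; the file path and intervals are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.sensors.filesystem import FileSensor

with DAG(dag_id="sensor_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    # Poll every 60 seconds until the file appears; fail after one hour.
    wait_for_file = FileSensor(
        task_id="wait_for_file",
        filepath="/data/incoming/report.csv",  # illustrative path
        poke_interval=60,
        timeout=60 * 60,
    )
```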

7
Q

What is the purpose of the Airflow Scheduler?

A

The Scheduler monitors all DAGs, triggers DAG runs at their scheduled times, and submits tasks to the executor once their dependencies are met.

8
Q

What is the Airflow Web UI?

A

The Web UI is a graphical interface provided by Airflow to help users monitor and manage DAGs and tasks.

9
Q

How does Airflow handle task dependencies?

A

Task dependencies are managed by setting upstream and downstream tasks in the DAG definition using the >> and << operators.
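Airflow implements >> and << by overloading Python's shift operators on task objects. A toy illustration of the idea (not Airflow's actual classes):

```python
class Task:
    """Toy stand-in for an Airflow task, showing how >> sets dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):    # self >> other: other runs after self
        self.downstream.append(other)
        return other                # returning other allows chaining

    def __lshift__(self, other):    # self << other: self runs after other
        other.downstream.append(self)
        return other

extract = Task("extract")
transform = Task("transform")
load = Task("load")

extract >> transform >> load        # chain: extract -> transform -> load
```

Because __rshift__ returns its right-hand operand, dependencies can be chained in a single expression, which is why `a >> b >> c` reads as a pipeline.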

10
Q

What is a Task Instance in Airflow?

A

A Task Instance represents a specific run of a task in a DAG, characterized by its execution date and state.

11
Q

What are XComs in Airflow?

A

XComs, or Cross-Communication, allow tasks to share small amounts of data, enabling inter-task communication.
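A sketch of pushing and pulling an XCom with PythonOperator, assuming Airflow 2.x (which passes the task instance as the `ti` context argument); the dag_id and value are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def push_value():
    # The return value of a PythonOperator callable is pushed to XCom
    # automatically under the key "return_value".
    return 42

def pull_value(ti):
    # Pull the value pushed by the upstream task.
    value = ti.xcom_pull(task_ids="push_value")
    print(f"pulled {value}")

with DAG(dag_id="xcom_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    push = PythonOperator(task_id="push_value", python_callable=push_value)
    pull = PythonOperator(task_id="pull_value", python_callable=pull_value)
    push >> pull
```

XComs are stored in the metadata database, so they are only suitable for small values like file paths or row counts, not datasets.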

12
Q

What is a Task Group in Airflow?

A

Task Groups allow for grouping related tasks within a DAG, improving organization and readability.
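A sketch of a TaskGroup, assuming Airflow 2.3+; the dag_id and task names are illustrative:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator
from airflow.utils.task_group import TaskGroup

with DAG(dag_id="task_group_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    start = EmptyOperator(task_id="start")

    # Tasks inside the group get task_ids prefixed with "extract.",
    # and the group collapses into a single node in the Web UI.
    with TaskGroup(group_id="extract") as extract:
        from_api = EmptyOperator(task_id="from_api")
        from_db = EmptyOperator(task_id="from_db")

    end = EmptyOperator(task_id="end")
    start >> extract >> end
```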

13
Q

What are Hooks in Airflow?

A

Hooks are interfaces to external systems, providing methods to interact with databases, cloud services, and other systems.

14
Q

How can you handle retries and failures in Airflow tasks?

A

You can configure retries and failure handling using parameters like retries, retry_delay, and retry_exponential_backoff in the task definition.
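A sketch of retry configuration on a task, assuming Airflow 2.x; the dag_id, command, and URL are illustrative:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="retry_demo", start_date=datetime(2024, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    flaky = BashOperator(
        task_id="flaky_call",
        bash_command="curl --fail https://example.com/api",  # illustrative
        retries=3,                           # retry up to 3 times on failure
        retry_delay=timedelta(minutes=5),    # wait between attempts
        retry_exponential_backoff=True,      # grow the wait each retry
        max_retry_delay=timedelta(minutes=30),
    )
```

These parameters are accepted by every operator (they come from BaseOperator), so they can also be set once for all tasks via the DAG's default_args.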

15
Q

What is the Airflow Executor?

A

The Executor is a key component that determines how task instances are executed. Popular executors include the SequentialExecutor, LocalExecutor, and CeleryExecutor.

16
Q

Why is it important to set catchup=False in a DAG?

A

Setting catchup=False prevents Airflow from backfilling DAG runs for past schedule intervals between the start_date and the current date that were missed while the DAG was paused or not yet deployed.
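A sketch of a DAG whose start_date lies in the past, assuming Airflow 2.x; with catchup=True Airflow would create one run per missed daily interval, while catchup=False schedules only from the current interval forward:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="no_backfill_demo",              # illustrative name
    start_date=datetime(2023, 1, 1),        # far in the past
    schedule_interval="@daily",
    catchup=False,                          # skip the missed intervals
) as dag:
    EmptyOperator(task_id="noop")
```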

17
Q

What is the importance of idempotency in Airflow tasks?

A

Ensuring that tasks are idempotent means they can be run multiple times without causing different outcomes, which is crucial for rerunning tasks reliably.

18
Q

How do you handle sensitive information in Airflow?

A

Sensitive information should be handled using Airflow's Connections and Variables, environment variables, or a configured secrets backend, rather than hardcoding secrets in DAG files.

19
Q

What is the purpose of defining start_date and schedule_interval in a DAG?

A

The start_date specifies when the DAG should start running, and schedule_interval defines how often the DAG should be triggered.

20
Q

How can you test an Airflow DAG?

A

You can test a single task with the airflow tasks test command (airflow test in Airflow 1.x), which runs the task without checking dependencies or recording state in the metadata database.