Oozie & Airflow Flashcards
What is Oozie?
Apache Oozie lets you define, manage, schedule, and run workflows of jobs on Hadoop. It supports MapReduce, Spark, and Pig jobs.
It also provides monitoring and automatic retry of failed jobs.
What is Airflow?
Allows you to create and organize workflows using DAGs. The DAG is the core concept of Airflow: it describes how tasks are organized, as well as their dependencies and relationships.
What is a Workflow in Airflow?
It is a collection of tasks arranged in a way that shows each task’s relationships and dependencies. It is represented by a Directed Acyclic Graph (DAG).
Describe Airflow components
Webserver: The Airflow UI, built on Flask (Python), which lets you interact with Airflow’s functionality.
Scheduler: Monitors DAGs and triggers those tasks whose dependencies have been met.
Executor: Responsible for running tasks. In production environments, it pushes task execution out to workers.
Metadata Database: Keeps track of the status of each task. By default it uses SQLite, which is not recommended for production.
What are the types of Executors in Airflow?
Sequential Executor: The default, but limited: only one task is executed at a time.
Celery Executor: Uses several workers to execute jobs in a distributed way.
Kubernetes Executor: Each task is run in its own Kubernetes pod.
What are Operators in Airflow?
An operator determines what gets done by a task; each operator performs a specific kind of work. For example, the BashOperator executes a Bash command, the PythonOperator executes a Python function, and the AwsBatchOperator submits a job to AWS Batch.
What are Sensors in Airflow?
Special operators that wait for a condition to be met (often the completion of a long-running external job) before downstream tasks can run. For example:
- AthenaSensor: Polls the state of an Athena query until it reaches a success or failure state
- GoogleCloudStorageObjectSensor: Checks for the existence of a file in Google Cloud Storage
What are XComs in Airflow?
XComs (short for cross-communication) allow small pieces of data to be passed between tasks.
What are Jinja templates in Airflow?
Jinja is a template engine. It uses special placeholders to serve dynamic data: a template is simply text containing variables or expressions that get replaced with values when the template is rendered at runtime.
Suppose you want to reference a unique S3 file name that corresponds to the date of the DAG run; you can accomplish that via Jinja.
What problems does Airflow tackle?
Helps re-run tasks in case of failure (Failures)
Helps check the status of tasks and alert if necessary (Monitoring)
Helps backfill historical data (Processing historical data)
What are Hooks in Airflow?
Hooks abstract away a lot of the boilerplate code involved in connecting to data sources, and they are the building blocks of Airflow operators. They provide a uniform, easy-to-use interface to access external services like S3, MySQL, Hive, and Qubole.