Citi ETL Project Flashcards

1
Q

Explain the Python project: the Big Data Orchestration Framework

A

Configurable Big Data Orchestration Framework

As part of the Global Functions Data Services Technology Group at Citigroup, I spearheaded the development of a highly configurable orchestration framework using Python and the Luigi pipeline framework. This framework replaced hundreds of cumbersome AutoSys jobs with streamlined JSON-configured tasks, significantly enhancing the manageability and scalability of our big data platform.

The Python project was designed to be triggered by a custom File Watcher script, which ran at regular intervals to check for new signal files in a landing directory. These signal files, identified by a unique naming convention including product, timestamp, region, and ID, contained metadata about the associated data files. This metadata was crucial for guiding the subsequent processing steps.

Key Features and Components:
File Watcher:

The File Watcher script checked for new files matching a specific naming pattern and used caching logic to prevent duplicate processing.
It extracted key parameters (PRODUCT, REGION, ID, TIMESTAMP) from the signal file name to trigger the Luigi script with appropriate settings.
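For illustration, a stripped-down file watcher along these lines captures the idea; the directory paths, signal naming pattern, and the luigi command-line invocation below are placeholder assumptions rather than the production values:

```python
import re
import subprocess
from pathlib import Path

# Illustrative assumptions: landing directory, cache file, and signal naming pattern.
LANDING_DIR = Path("/data/landing")
PROCESSED_CACHE = Path("/data/landing/.processed")  # one processed filename per line
SIGNAL_PATTERN = re.compile(
    r"^(?P<product>\w+)_(?P<region>\w+)_(?P<id>\d+)_(?P<timestamp>\d{14})\.signal$"
)

def poll_landing_dir():
    seen = set(PROCESSED_CACHE.read_text().split()) if PROCESSED_CACHE.exists() else set()
    for signal in LANDING_DIR.glob("*.signal"):
        match = SIGNAL_PATTERN.match(signal.name)
        if not match or signal.name in seen:
            continue  # not a signal file, or already processed
        params = match.groupdict()
        # Trigger the Luigi pipeline with parameters taken from the file name
        # (module and task names here are hypothetical).
        subprocess.run(
            ["luigi", "--module", "pipeline", "RunProduct",
             "--product", params["product"], "--region", params["region"],
             "--run-id", params["id"], "--timestamp", params["timestamp"]],
            check=True,
        )
        seen.add(signal.name)
    PROCESSED_CACHE.write_text("\n".join(sorted(seen)))

if __name__ == "__main__":
    poll_landing_dir()  # in practice scheduled at regular intervals (e.g. via cron)
```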
Metadata Handling:

Signal files were formatted as CSV files containing metadata, including the names and expected formats of the data files.
Data integrity checks included verifying the existence of the data files and occasionally validating line counts to detect corruption.
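The integrity checks can be sketched roughly as follows, assuming purely for illustration that the signal CSV carries a data_file column and an optional expected_lines column:

```python
import csv
from pathlib import Path

def validate_signal(signal_path: Path, data_dir: Path) -> list[str]:
    """Read the signal CSV and report any missing or suspect data files."""
    problems = []
    with signal_path.open(newline="") as fh:
        for row in csv.DictReader(fh):
            data_file = data_dir / row["data_file"]    # assumed column name
            if not data_file.exists():
                problems.append(f"missing: {data_file}")
                continue
            expected = row.get("expected_lines")       # assumed optional column
            if expected:
                actual = sum(1 for _ in data_file.open())
                if actual != int(expected):
                    problems.append(
                        f"line count mismatch: {data_file} ({actual} vs {expected})"
                    )
    return problems
```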
Configuration Management:

Luigi Config Files, maintained in the project repository, were JSON-based configurations that defined the tasks to be executed for each product.
These files guided the pipeline, specifying tasks such as MOVE_FILES, VALIDATE_FILES, and RUN_SPARK_JOB, each configurable with various parameters.
Task Management:

Tasks were defined by a TASK_NAME and TASK_TYPE, with parameters passed through the JSON config file.
A Python dictionary was generated from the config file and used throughout the Luigi run to manage task execution.
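A hypothetical config entry and the dictionary built from it might look like this; the product name, TASK_NAME values, and parameter keys are invented for illustration:

```python
import json

# Hypothetical contents of a Luigi Config File for one product.
EXAMPLE_CONFIG = """
{
  "product": "GL_BALANCES",
  "tasks": [
    {"TASK_NAME": "move_raw",  "TASK_TYPE": "MOVE_FILES",
     "params": {"source": "/data/landing", "target": "/data/staging"}},
    {"TASK_NAME": "check_raw", "TASK_TYPE": "VALIDATE_FILES",
     "params": {"min_line_count": 1}, "depends_on": ["move_raw"]},
    {"TASK_NAME": "transform", "TASK_TYPE": "RUN_SPARK_JOB",
     "params": {"jar": "etl-jobs.jar", "job_config": "gl_balances.json"},
     "depends_on": ["check_raw"]}
  ]
}
"""

# The dictionary used throughout the Luigi run, keyed by TASK_NAME.
config = json.loads(EXAMPLE_CONFIG)
tasks_by_name = {task["TASK_NAME"]: task for task in config["tasks"]}
```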
Pipeline Orchestration:

Dependencies between tasks were defined in the Luigi Config File, forming a directed acyclic graph (DAG) to guide task execution.
Luigi managed parallel task execution where feasible, ensuring efficient processing while respecting task dependencies.
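In Luigi terms, the dependency edges from the config map naturally onto requires(); this is a simplified sketch in which the tasks_by_name dictionary and the run_task_type dispatcher are hypothetical stand-ins for the config-driven pieces described above:

```python
import luigi

# tasks_by_name would be built from the JSON config, as in the earlier sketch.
tasks_by_name = {
    "move_raw":  {"TASK_TYPE": "MOVE_FILES", "depends_on": []},
    "check_raw": {"TASK_TYPE": "VALIDATE_FILES", "depends_on": ["move_raw"]},
}

def run_task_type(task_type, params):
    """Hypothetical dispatcher that executes one task type (move, validate, spark, ...)."""
    print(f"running {task_type} with {params}")

class ConfiguredTask(luigi.Task):
    """One generic Luigi task per config entry; DAG edges come from 'depends_on'."""
    product = luigi.Parameter()
    task_name = luigi.Parameter()

    def requires(self):
        for upstream in tasks_by_name[self.task_name].get("depends_on", []):
            yield ConfiguredTask(product=self.product, task_name=upstream)

    def output(self):
        return luigi.LocalTarget(f"/tmp/markers/{self.product}_{self.task_name}.done")

    def run(self):
        task_def = tasks_by_name[self.task_name]
        run_task_type(task_def["TASK_TYPE"], task_def.get("params", {}))
        with self.output().open("w") as marker:
            marker.write("ok")
```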
Error Handling and Monitoring:

Extensive error handling and logging were integrated into the Python code to manage various configurations and edge cases.
Luigi’s event-handler decorators were used for retry logic and failure notifications, with email alerts sent whenever errors occurred.
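Luigi's event-handler decorators support exactly this kind of failure hook; a minimal sketch, with the sender, recipients, and SMTP host as placeholder values:

```python
import smtplib
from email.message import EmailMessage

import luigi

@luigi.Task.event_handler(luigi.Event.FAILURE)
def notify_on_failure(task, exception):
    """Send an email alert whenever any Luigi task fails."""
    msg = EmailMessage()
    msg["Subject"] = f"Luigi task failed: {task}"
    msg["From"] = "etl-alerts@example.com"        # placeholder address
    msg["To"] = "data-services-team@example.com"  # placeholder address
    msg.set_content(f"Task {task} failed with:\n{exception}")
    with smtplib.SMTP("localhost") as smtp:       # placeholder SMTP host
        smtp.send_message(msg)
```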
Integration with Spark Jobs:

One of the Luigi task types ran Java Spark jobs by executing a shell command templated in the Luigi Config File, with the necessary parameters filled in dynamically by the pipeline.
This integration allowed for seamless execution of complex data processing tasks within the broader orchestration framework.
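The RUN_SPARK_JOB task type could turn a command template from the config into a spark-submit call along these lines; the command_template field and parameter names are assumptions for illustration:

```python
import shlex
import subprocess

def run_spark_job(params: dict, runtime_args: dict) -> None:
    """Format the shell command from the config and run it, failing loudly on error."""
    # Hypothetical template in the config, e.g.:
    #   "spark-submit --class com.citi.etl.Main {jar} --config {job_config} --run-date {run_date}"
    merged = {**params, **runtime_args}  # runtime values (region, run date, ...) win
    command = params["command_template"].format(**merged)
    subprocess.run(shlex.split(command), check=True)
```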
Performance and Status Reporting:

Each Luigi task was timed, and the duration was recorded for analysis.
I developed a Python script using Pandas and Matplotlib to generate detailed performance and status reports, including visual graphs and metrics, for all big data jobs run within specified time frames.
Daily emails with timing graphs and visual metrics were sent to the team to identify bottlenecks and optimize performance. Weekly summaries, including all runs and any failed tasks, were also distributed.
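A simplified version of such a reporting script, assuming the recorded timings land in a CSV with task and duration_seconds columns (the file name and column names are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render to file without a display
import matplotlib.pyplot as plt
import pandas as pd

def build_daily_report(timings_csv: str, output_png: str) -> None:
    # Assumed columns: run_date, product, task, duration_seconds
    df = pd.read_csv(timings_csv)
    per_task = df.groupby("task")["duration_seconds"].mean().sort_values()
    ax = per_task.plot.barh(figsize=(8, 6), title="Average task duration (seconds)")
    ax.set_xlabel("seconds")
    plt.tight_layout()
    plt.savefig(output_png)

build_daily_report("task_timings.csv", "daily_timings.png")
```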
This orchestration framework was adopted by multiple teams within Citigroup, significantly improving the efficiency and reliability of our big data operations. The comprehensive error handling, robust task management, flexible configuration capabilities, and detailed performance reporting made it a valuable asset for managing diverse data processing needs.

2
Q

Explain the Java project: the Java Spark ETL Project

A

Java Spark ETL Project
As part of the Global Functions Data Services Technology Group at Citigroup, I contributed to the development of a Java Spark-based ETL framework. This project aimed to streamline and enhance data processing tasks by leveraging the power of Apache Spark for efficient, scalable, and high-performance data transformation.

Key Features and Components:
Spark Job Configuration:

Spark jobs were triggered by the Luigi project. The Luigi config file specified the name of the JAR file to be executed and the parameters to be passed.
These jobs were executed as part of batch processes, initiated whenever a Luigi run called them based on new data arrival.
Tasks and Dependencies:

The Spark jobs were highly configurable, defined by a JSON-based Spark jobs configuration file. Tasks included:
SQL File Execution: Running SQL syntax queries via Spark, specified within the configuration file.
Hive Integration: Reading from and loading data into Hive tables.
Column Manipulation: Adding or deleting columns with configurable names and types.
Dependencies between tasks were managed similarly to Luigi config tasks, with the added ability to specify dependency status (e.g., success or failure of preceding tasks).
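To make the configuration concrete, a hypothetical entry in the Spark jobs configuration file could look like the following, shown here as a Python literal; the task type names, field names, and dependency statuses are illustrative assumptions rather than the actual schema:

```python
# Hypothetical Spark jobs configuration, mirroring the JSON file described above.
spark_jobs_config = {
    "jobs": [
        {"name": "load_staging", "type": "HIVE_READ",
         "params": {"table": "staging.gl_balances"}},
        {"name": "apply_rules", "type": "SQL_FILE",
         "params": {"sql_file": "transform_gl_balances.sql"},
         "depends_on": [{"job": "load_staging", "status": "SUCCESS"}]},
        {"name": "add_audit_cols", "type": "ADD_COLUMNS",
         "params": {"columns": [{"name": "load_ts", "type": "timestamp"}]},
         "depends_on": [{"job": "apply_rules", "status": "SUCCESS"}]},
        {"name": "write_final", "type": "HIVE_LOAD",
         "params": {"table": "final.gl_balances"},
         "depends_on": [{"job": "add_audit_cols", "status": "SUCCESS"}]},
    ]
}
```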
Parallel Execution:

Parallel execution was managed using CompletableFuture from Java’s java.util.concurrent package, ensuring efficient task execution.
The DAG logic was carefully designed to enable safe and logical parallel processing, optimizing overall performance.
Error Handling and Monitoring:

Error handling was partially managed by the DAG logic within the Java Spark framework, with additional handling by the overarching Luigi project.
Monitoring was facilitated by Spark’s built-in user interface, which provided detailed logs and historical run data for analyzing successes and failures.
Integration with the Luigi Project:

The Java Spark jobs were seamlessly integrated into the broader Luigi pipeline. Luigi tasks specified the execution of Spark jobs, passing necessary parameters dynamically.
This integration ensured that the entire ETL process, from initial data ingestion to final transformation and loading, was cohesive and well-coordinated.
Challenges and Solutions:
Learning Curve: Getting up to speed with Spark commands and debugging issues was challenging initially. This was overcome through hands-on experience and continuous learning.
Configurable DAG Logic: Making the DAG highly configurable introduced complexities and edge cases. This was addressed by thorough testing and iterative refinement of the logic.
Project Impact:
The Java Spark ETL framework significantly enhanced the efficiency and reliability of Citigroup’s data processing operations. Its highly configurable nature allowed for flexible adaptation to various data processing needs, while the integration with the Luigi project ensured a cohesive and streamlined ETL pipeline.

3
Q

Concise description of the Java Spark-based ETL framework

A

In my role at Citigroup, I developed a Java Spark-based ETL framework designed for efficient and scalable data processing. Configured via JSON files, this framework managed a variety of tasks including SQL file execution, Hive integration, and column manipulation. The framework’s DAG logic ensured robust task dependencies and error handling, while its integration with the Luigi project enabled seamless orchestration of the entire ETL pipeline. This approach significantly improved data processing efficiency and scalability, contributing to cost reductions and enhanced performance.

4
Q

Concise description of the Python ETL Orchestration Framework

A

At Citigroup, I spearheaded the development of a Python-based ETL orchestration framework using Luigi. This framework was driven by JSON configuration files that defined tasks such as file movement, validation, and Spark job execution. Triggered by a file watcher, the Luigi pipeline managed task dependencies, ensuring efficient and orderly execution. The project included robust error handling, extensive logging, and performance monitoring, with daily and weekly reports generated using Pandas and Matplotlib to identify bottlenecks and optimize performance. This solution streamlined data processing workflows and improved overall operational efficiency.
