Introduce the orchestrator, particularly focusing on Airflow #263

Tian-2017 · 2024-04-05T07:08:58Z

Tian-2017
Apr 5, 2024
Collaborator

🌵What is the problem or issue we're trying to address?

Primarily aimed at boosting productivity, with the secondary advantage of cutting costs. Currently, constructing ETL pipelines, whether on Glue or Lambda, is time-consuming and not a conventional method within the data engineering sector.

🎯How is this affecting producers, consumers or platform engineers?

Airflow, the market Dominator as an example:
• Functions as both an orchestrator and a Python environment, capable of executing Python scripts.
• Significantly reduces time and cost for building pipelines, especially for small datasets, as Terraform is not required.
Benefits for Teams - Enhanced Productivity:
• Enables data engineers and analysts to focus on delivering more useful data.
• Potential to double efficiency across teams, with a heavily reduced reliance on Lambda and Glue.
• Suitable for datasets under 10 GB, which can be handled directly by Airflow.
Handling Large Datasets and dependency issues:
• For large datasets, the use of Glue (or AWS EMR) is recommended.
• Dependency issues (may happen but not common) can be managed using AWS ECS to separate the ETL pipelines, aligning with current industry practices.
Cost Efficiency:
• Promises substantial cost savings in the long term.

📝What is the proposed task?

No response

🤔How might this work be carried out?

A presentation needs to be prepared for 16 April; it will cover the choice of orchestrator and its benefits.
Set up Airflow on EC2 or opt for AWS Managed Workflows for Apache Airflow (MWAA).
Establish a repository for new ETL pipelines.
Implement CI/CD using GitHub Actions for deployment from GitHub.
Create both a staging and a production environment.

⌛How urgent is this work?

very important, relatively urgent

💪How much effort do you think this will take?

3 full weeks:

The first week will be dedicated to developing a prototype that showcases Airflow in a staging environment and executes some Python scripts.
The following two weeks will focus on automating everything and setting up the production environment.
We will allocate two days for knowledge sharing to train all team members. This will be significantly easier than using Terraform. It only requires data engineers to manage it, while analysts can easily use it.

🛠️What skills are needed?

Terraform (used to manage the AWS infrastructure);
Airflow;
GitHub Actions - automates the deployment of merged code to Airflow in the production environment.
Docker - used to build Airflow and others;
Make - for staging, customise a Makefile to upload the ETL scripts to an S3 bucket, which Airflow in staging will then parse the script in S3 bucket.

📃Additional Info:

No response

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Introduce the orchestrator, particularly focusing on Airflow #263

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 0 comments

Select a reply

Introduce the orchestrator, particularly focusing on Airflow #263

Tian-2017 Apr 5, 2024 Collaborator

🌵What is the problem or issue we're trying to address?

🎯How is this affecting producers, consumers or platform engineers?

📝What is the proposed task?

🤔How might this work be carried out?

⌛How urgent is this work?

💪How much effort do you think this will take?

🛠️What skills are needed?

📃Additional Info:

Replies: 0 comments

Tian-2017
Apr 5, 2024
Collaborator