Introduce the orchestrator, particularly focusing on Airflow #263
Tian-2017 started this conversation in Firebreak · April 24
🌵What is the problem or issue we're trying to address?
The primary aim is to boost productivity, with the secondary benefit of cutting costs. Currently, building ETL pipelines on Glue or Lambda is time-consuming and is not a conventional approach in the data engineering industry.
🎯How is this affecting producers, consumers or platform engineers?
• Airflow functions as both an orchestrator and a Python runtime, so it can execute Python ETL scripts directly.
• Significantly reduces the time and cost of building pipelines, especially for small datasets, since Terraform is not required for each pipeline.
• Enables data engineers and analysts to focus on delivering more useful data.
• Potential to roughly double efficiency across teams, with a much smaller reliance on Lambda and Glue.
• Suitable for datasets under 10 GB, which Airflow workers can process directly (see the sketch after this list).
• For large datasets, the use of Glue (or AWS EMR) is recommended.
• Dependency conflicts (possible but uncommon) can be managed by running ETL pipelines in separate containers on AWS ECS, in line with current industry practice.
• Promises substantial cost savings in the long term.
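To make this concrete, here is a minimal sketch of the kind of DAG the proposal would enable for a small dataset. The DAG id, bucket names, and file paths are illustrative assumptions rather than part of the proposal, and it assumes Airflow 2.4+ with pandas (plus s3fs/pyarrow) available on the workers.

```python
# Minimal sketch only: a small-dataset ETL that runs inside the Airflow worker
# itself, instead of a separate Glue or Lambda job. All names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_transform():
    # For datasets under ~10 GB the work can run directly on the worker,
    # e.g. with pandas. Reading s3:// paths assumes s3fs is installed.
    import pandas as pd

    df = pd.read_csv("s3://example-raw-bucket/input.csv")       # hypothetical bucket
    df = df.dropna()
    df.to_parquet("s3://example-curated-bucket/output.parquet")  # hypothetical bucket


with DAG(
    dag_id="small_dataset_etl",        # illustrative DAG id
    start_date=datetime(2024, 4, 1),
    schedule="@daily",                 # Airflow 2.4+ parameter name
    catchup=False,
) as dag:
    PythonOperator(
        task_id="extract_and_transform",
        python_callable=extract_and_transform,
    )
```

For pipelines with conflicting dependencies, the same DAG could instead trigger a containerised task on ECS, keeping the orchestration in Airflow while isolating the runtime environment.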
📝What is the proposed task?
No response
🤔How might this work be carried out?
⌛How urgent is this work?
Very important and relatively urgent.
💪How much effort do you think this will take?
3 full weeks.
🛠️What skills are needed?
Terraform (used to manage the AWS infrastructure);
Airflow;
GitHub Actions - automates the deployment of merged code to Airflow in the production environment.
Docker - used to build the Airflow image and supporting components;
Make - for staging, a customised Makefile uploads the ETL scripts to an S3 bucket, from which the Airflow deployment in staging then parses them (see the sketch after this list).
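As a rough illustration of that staging flow, the sketch below shows what the Makefile's upload target might invoke: syncing local DAG/ETL scripts to the S3 bucket the staging Airflow deployment reads from. The bucket name, prefix, and directory layout are assumptions for illustration only.

```python
# Hypothetical helper a `make upload` target could call. Bucket, prefix and
# local directory are placeholder values, not the real deployment settings.
from pathlib import Path

import boto3


def upload_dags(local_dir: str = "dags",
                bucket: str = "example-staging-airflow",
                prefix: str = "dags/") -> None:
    s3 = boto3.client("s3")
    for path in Path(local_dir).rglob("*.py"):
        # Mirror the local layout under the configured S3 prefix.
        key = prefix + path.relative_to(local_dir).as_posix()
        s3.upload_file(str(path), bucket, key)
        print(f"Uploaded {path} -> s3://{bucket}/{key}")


if __name__ == "__main__":
    upload_dags()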
📃Additional Info:
No response