Replies: 2 comments
-
Hi, HttpToS3Operator is an Airflow operator that performs the data transfer via the code you linked, so the data passes through the Airflow worker (if using the Celery executor). You have multiple options.
-
Yeah. The fact that the operator loads the whole file into memory is a feature, not a bug. But if you would like to contribute an operator that can stream to S3, @rogalski, feel free to contribute it. There is no need to have an issue for it, or you can easily use any available S3 mechanism to do so, like rclone.
-
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
main is affected
Apache Airflow version
2.10
Operating System
Linux
Deployment
Amazon (AWS) MWAA
Deployment details
No response
What happened
The whole file is loaded into memory at once.
Task exited with return code -9.
airflow/providers/src/airflow/providers/amazon/aws/transfers/http_to_s3.py, lines 164 to 175 in bb77ebf
In this code, response.content is an in-memory representation of the file content (as bytes).
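For illustration, here is a standalone sketch of the same eager pattern (not the operator's actual source; the function name and its parameters are hypothetical placeholders):

```python
import requests
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hypothetical standalone sketch of the eager pattern described above,
# not the provider's actual implementation.
def eager_http_to_s3(url: str, bucket: str, key: str) -> None:
    hook = S3Hook(aws_conn_id="aws_default")
    response = requests.get(url)
    # response.content materializes the entire body as bytes, so a file
    # larger than the worker's available RAM gets the task killed (-9).
    hook.load_bytes(response.content, key=key, bucket_name=bucket)
```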
What you think should happen instead
Lazy load via stream=True, response.raw and S3Hook.load_fileobj().
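A minimal sketch of that streaming approach, assuming a requests response opened with stream=True and the Amazon provider's S3Hook.load_file_obj(); the URL, bucket and key names are placeholders:

```python
import requests
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hypothetical sketch of the suggested streaming path; url/bucket/key are placeholders.
def stream_http_to_s3(url: str, bucket: str, key: str) -> None:
    hook = S3Hook(aws_conn_id="aws_default")
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # response.raw is a file-like object, so the upload can read it in
        # chunks instead of holding the whole body in worker memory.
        hook.load_file_obj(response.raw, key=key, bucket_name=bucket)
```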
How to reproduce
Run HttpToS3Operator with a file larger than the available RAM (~20 GB was enough in my setup), for example with a task like the one sketched below.
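A hypothetical reproduction task (connection IDs, endpoint, bucket and key below are placeholders):

```python
from airflow.providers.amazon.aws.transfers.http_to_s3 import HttpToS3Operator

# Hypothetical reproduction: point the operator at any file larger than the
# worker's available RAM; all connection IDs and names below are placeholders.
fetch_large_file = HttpToS3Operator(
    task_id="fetch_large_file",
    http_conn_id="http_default",
    endpoint="/large-file.bin",
    s3_bucket="my-bucket",
    s3_key="large-file.bin",
)
```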
Anything else
No response
Are you willing to submit PR?
Code of Conduct