Replies: 2 comments
-
Hi, HttpToS3Operator is an Airflow operator that performs the data transfer via the code you linked, so the data passes through the Airflow worker (if using the Celery executor). You have multiple options.
-
Yeah. The fact that the operator loads the whole file into memory is a feature, not a bug. But if you would like to contribute an operator that can stream to S3, @rogalski, feel free to contribute it. There is no need to have an issue for it, or you can easily use any available S3 mechanism to do so, like rclone.
-
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
main is affected
Apache Airflow version
2.10
Operating System
Linux
Deployment
Amazon (AWS) MWAA
Deployment details
No response
What happened
The whole file is loaded into memory at once.
Task exited with return code -9.
airflow/providers/src/airflow/providers/amazon/aws/transfers/http_to_s3.py, lines 164 to 175 in bb77ebf
In this code, response.content is an in-memory representation of the file content (as bytes).
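For illustration, here is a standalone sketch of the same eager pattern (not the operator's actual source; the function name and its parameters are hypothetical placeholders):

```python
import requests
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hypothetical standalone sketch of the eager pattern described above,
# not the provider's actual implementation.
def eager_http_to_s3(url: str, bucket: str, key: str) -> None:
    hook = S3Hook(aws_conn_id="aws_default")
    response = requests.get(url)
    # response.content materializes the entire body as bytes, so a file
    # larger than the worker's available RAM gets the task killed (-9).
    hook.load_bytes(response.content, key=key, bucket_name=bucket)
```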
What you think should happen instead
Lazy load via stream=True, response.raw and S3Hook.load_fileobj().
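A minimal sketch of that streaming approach, assuming a requests response opened with stream=True and the Amazon provider's S3Hook.load_file_obj(); the URL, bucket and key names are placeholders:

```python
import requests
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

# Hypothetical sketch of the suggested streaming path; url/bucket/key are placeholders.
def stream_http_to_s3(url: str, bucket: str, key: str) -> None:
    hook = S3Hook(aws_conn_id="aws_default")
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # response.raw is a file-like object, so the upload can read it in
        # chunks instead of holding the whole body in worker memory.
        hook.load_file_obj(response.raw, key=key, bucket_name=bucket)
```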
How to reproduce
Run HttpToS3Operator with a file larger than the available RAM (~20 GB was enough in my setup), for example with a task like the one sketched below.
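A hypothetical reproduction task (connection IDs, endpoint, bucket and key below are placeholders):

```python
from airflow.providers.amazon.aws.transfers.http_to_s3 import HttpToS3Operator

# Hypothetical reproduction: point the operator at any file larger than the
# worker's available RAM; all connection IDs and names below are placeholders.
fetch_large_file = HttpToS3Operator(
    task_id="fetch_large_file",
    http_conn_id="http_default",
    endpoint="/large-file.bin",
    s3_bucket="my-bucket",
    s3_key="large-file.bin",
)
```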
Anything else
No response
Are you willing to submit PR?
Code of Conduct