Remove S3 upload concurrency #2120

Merged
merged 2 commits into from
Jan 3, 2025

Conversation

benoit74
Contributor

Fix #2118

Changes:

  • Revert the socket increase from #2093 ("Increase S3 socket limit to 250"). It was not needed in the past, and I don't expect it to be needed as long as we do not push too many concurrent requests (which we shouldn't).
  • Await the upload of each image to S3. Not awaiting the upload could save some crawling time, but it cannot be handled efficiently without proper queue size management (otherwise we end up in the current situation, where too many uploads are queued at once). Implementing proper queue management is not worth it.
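To illustrate the second change, here is a minimal sketch of why awaiting matters. It is not the code touched by this PR: `FakeS3`, `uploadWithoutAwait` and `uploadWithAwait` are hypothetical names, and the fake client only counts how many uploads are in flight at once.

```typescript
// Hypothetical stand-in for an S3 client: it only tracks in-flight uploads.
class FakeS3 {
  inFlight = 0;
  peak = 0;
  uploaded: string[] = [];

  async upload(key: string): Promise<void> {
    this.inFlight += 1;
    this.peak = Math.max(this.peak, this.inFlight);
    // Simulate network latency.
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    this.inFlight -= 1;
    this.uploaded.push(key);
  }
}

// Not awaited in the loop: every upload is started before any has finished.
async function uploadWithoutAwait(keys: string[]): Promise<number> {
  const s3 = new FakeS3();
  const pending: Promise<void>[] = [];
  for (const key of keys) {
    pending.push(s3.upload(key));
  }
  // Awaited here only so the sketch can report the peak afterwards.
  await Promise.all(pending);
  return s3.peak;
}

// Awaited in the loop: the next upload starts only once the previous one is done.
async function uploadWithAwait(keys: string[]): Promise<number> {
  const s3 = new FakeS3();
  for (const key of keys) {
    await s3.upload(key);
  }
  return s3.peak;
}
```

With 20 keys, the first variant peaks at 20 simultaneous uploads while the second never exceeds one, at the cost of the crawl waiting on each upload.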

Note: we might want to open an issue to support async S3 upload with proper queue management in the medium/long term. I'm not at all convinced this is worth implementing: most (all?) Python scrapers work pretty well without it, and I do not expect much benefit (i.e. the price to pay is too high).
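For reference, a rough sketch of what "proper queue management" could involve if such an issue were ever opened. Everything here is an assumption for illustration (the `BoundedUploader` class, its `submit`/`drain` methods and the `limit` parameter are hypothetical, not existing code): the caller is blocked only when `limit` uploads are already in flight, so the pending queue cannot grow without bound.

```typescript
// Hypothetical bounded uploader: at most `limit` tasks run at the same time.
class BoundedUploader {
  private readonly limit: number;
  private active = 0;
  private readonly waiters: Array<() => void> = [];
  private readonly running = new Set<Promise<void>>();
  readonly errors: unknown[] = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  // Resolves once the task has been started, not once it has finished,
  // so the caller only waits when all slots are taken (backpressure).
  async submit(task: () => Promise<void>): Promise<void> {
    await this.acquire();
    const p: Promise<void> = task()
      .catch((err: unknown) => {
        // Keep failures for later inspection instead of losing them.
        this.errors.push(err);
      })
      .finally(() => {
        this.running.delete(p);
        this.release();
      });
    this.running.add(p);
  }

  // Wait for the uploads that are still in flight (call after the last submit).
  async drain(): Promise<void> {
    await Promise.all(Array.from(this.running));
  }

  private async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active += 1;
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // hand the slot directly to the next waiter, `active` is unchanged
    } else {
      this.active -= 1;
    }
  }
}
```

A crawler loop would then call `await uploader.submit(() => upload(key))` per image and `await uploader.drain()` at the end. Even this small version needs error collection and a drain step, which gives an idea of the extra complexity weighed against the expected benefit.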

@benoit74 benoit74 self-assigned this Dec 26, 2024
@kelson42
Collaborator

@benoit74 Thank you very much for the PR. I will let @audiodude fully judge this PR.


codecov bot commented Dec 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 75.25%. Comparing base (b66f1a9) to head (ea3db9b).
Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2120      +/-   ##
==========================================
- Coverage   75.26%   75.25%   -0.01%     
==========================================
  Files          41       41              
  Lines        3198     3197       -1     
  Branches      706      706              
==========================================
- Hits         2407     2406       -1     
  Misses        674      674              
  Partials      117      117              


@kelson42
Collaborator

kelson42 commented Jan 2, 2025

@audiodude Can you please review this PR?

Collaborator

@kelson42 kelson42 left a comment

LGTM

@kelson42 kelson42 merged commit a4db759 into main Jan 3, 2025
6 checks passed
@kelson42 kelson42 deleted the no_s3_concurrency branch January 3, 2025 09:04
Member

@audiodude audiodude left a comment

LGTM

Development

Successfully merging this pull request may close these issues.

Cache error while uploading object: RequestTimeTooSkewed
3 participants