
feat: Add deduplication to add_batch_of_requests #534


Open: wants to merge 56 commits into master

Conversation

Pijukatel
Contributor

@Pijukatel Pijukatel commented Aug 7, 2025

Description

  • Exclude already known requests from api_client.batch_add_requests calls to avoid expensive, redundant API calls.
  • Add all new requests to the cache when calling batch_add_requests.
  • Add a test that measures real API usage.
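
The deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `FakeApiClient` and `DedupQueueClient` are hypothetical names, requests are modelled as plain dicts with a `unique_key` field, and the cache is assumed to be a simple in-memory set. The fake client counts its calls, loosely mirroring the idea of measuring API usage in a test.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeApiClient:
    """Hypothetical stand-in for the real API client; records what it is sent."""

    calls: int = 0
    sent: list[str] = field(default_factory=list)

    async def batch_add_requests(self, requests: list[dict]) -> None:
        self.calls += 1
        self.sent.extend(request["unique_key"] for request in requests)


class DedupQueueClient:
    """Hypothetical wrapper that filters known requests before calling the API."""

    def __init__(self, api_client: FakeApiClient) -> None:
        self._api_client = api_client
        self._known_keys: set[str] = set()  # assumed in-memory cache of unique keys

    async def add_batch_of_requests(self, requests: list[dict]) -> int:
        # Keep only requests that are neither cached nor repeated within this batch.
        new_requests: dict[str, dict] = {}
        for request in requests:
            key = request["unique_key"]
            if key not in self._known_keys and key not in new_requests:
                new_requests[key] = request

        if new_requests:
            # One API call for the new requests only; skipped entirely if nothing is new.
            await self._api_client.batch_add_requests(list(new_requests.values()))
            # Cache the keys only after the call succeeded.
            self._known_keys.update(new_requests)
        return len(new_requests)


async def demo() -> tuple[int, int, int, int, list[str]]:
    api = FakeApiClient()
    queue = DedupQueueClient(api)
    first = await queue.add_batch_of_requests(
        [{"unique_key": "a"}, {"unique_key": "b"}, {"unique_key": "a"}]
    )
    second = await queue.add_batch_of_requests([{"unique_key": "a"}, {"unique_key": "b"}])
    third = await queue.add_batch_of_requests([{"unique_key": "b"}, {"unique_key": "c"}])
    return first, second, third, api.calls, api.sent


result = asyncio.run(demo())
```

In this sketch the second batch contains only known requests, so no API call is made for it; on a heavily cross-linked site, that is the case where the savings would be expected to come from.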

Issues

Testing

  • Added new integration tests to verify reduced API usage.
  • Compared a benchmark Actor built from master against one built from this PR. The Actor is a simple ParselCrawler that crawls the whole of crawlee.dev, which contains many duplicate links because the documentation is thoroughly cross-linked. Results:
    • Massive reduction in request queue cost.
    • Significant overall speed-up due to fewer API calls.

@Pijukatel Pijukatel changed the title Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests and test Aug 7, 2025
@Pijukatel Pijukatel added the enhancement and t-tooling labels Aug 7, 2025
@Pijukatel Pijukatel changed the title feat: Add deduplication to add_batch_of_requests and test feat: Add deduplication to add_batch_of_requests Aug 7, 2025
Base automatically changed from new-apify-storage-clients to master August 12, 2025 16:45
@github-actions github-actions bot added this to the 121st sprint - Tooling team milestone Aug 13, 2025
@github-actions github-actions bot added the tested label Aug 13, 2025
@Pijukatel Pijukatel requested review from vdusek and Mantisus August 13, 2025 12:48
@Pijukatel Pijukatel marked this pull request as ready for review August 13, 2025 12:48
Labels
  • enhancement: New feature or request.
  • t-tooling: Issues with this label are in the ownership of the tooling team.
  • tested: Temporary label used only programmatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure that duplicate links are handled in a cost effective way when using Apify RequestQueue
2 participants