How do I re-crawl already processed requests for incremental crawling? #2611

jezsung · 2024-08-12T03:50:40Z

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Feature

I would love to be able to set an expiration time on a request, so that the crawler re-processes the request once it has expired.

Ideally, I would love to have full control over whether the request needs to be re-processed or not.

Currently, the crawler seems to skip already processed requests completely, without even executing the preNavigationHooks or requestHandler.

Motivation

For example, on an e-commerce website, there's a category page that lists products. I'd like to crawl this page and call the enqueueLinks function to enqueue the product detail pages. Once the request for the category page URL is processed, it is saved to the request_queues storage and won't be processed again as long as the storage persists.

Currently, there seems to be no way to re-visit this category URL while keeping the storage and the other processed requests. Products may have been added since the last crawl, so I think re-crawling this URL would be ideal in most situations.

Ideal solution or implementation, and any additional constraints

Add an expiration time to a request and, every time the crawl runs, check whether the request needs to be processed based on that expiration time.

Alternative solutions or implementations

Allow modifying the output format of the request object and add a hook that determines whether a request needs to be processed or not.

This would let the user add a custom expires field to a request and a custom hook that checks whether the request has expired, returning a boolean.
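
To make the idea concrete, here is a rough sketch of what such a hook could look like. Nothing in it is existing Crawlee API: the `StoredRequest` shape, the `expires` field (assumed to be epoch milliseconds), the `ShouldReprocessHook` type, and the `selectRequestsToReprocess` helper are all hypothetical names used only for illustration, and the URLs are placeholders.

```typescript
// Hypothetical shape of a persisted request; not part of Crawlee's API.
interface StoredRequest {
    url: string;
    // Custom field proposed in this issue: epoch milliseconds after which
    // the request should be processed again. Absent means "never expires".
    expires?: number;
}

// Hypothetical hook signature: return true to process the request again.
type ShouldReprocessHook = (request: StoredRequest, now: number) => boolean;

// User-supplied hook: re-process only when the custom expiration has passed.
const reprocessWhenExpired: ShouldReprocessHook = (request, now) =>
    request.expires !== undefined && now >= request.expires;

// Stand-in for the check a crawler could run over already handled requests
// at the start of each crawl.
function selectRequestsToReprocess(
    handled: StoredRequest[],
    hook: ShouldReprocessHook,
    now: number = Date.now(),
): StoredRequest[] {
    return handled.filter((request) => hook(request, now));
}

const DAY_MS = 24 * 60 * 60 * 1000;
const lastCrawl = Date.UTC(2024, 7, 12);

const handled: StoredRequest[] = [
    // Category page: should be re-visited one day after the last crawl.
    { url: 'https://example.com/category/shoes', expires: lastCrawl + DAY_MS },
    // Product detail page: no expiration, so it is never re-processed.
    { url: 'https://example.com/product/123' },
];

// Simulate a crawl that starts two days after the last one.
const due = selectRequestsToReprocess(handled, reprocessWhenExpired, lastCrawl + 2 * DAY_MS);
console.log(due.map((request) => request.url));
```

With a hook like this, the category page would be picked up again on the next run while the product detail pages stay marked as processed, which is the incremental behavior I'm after.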

Other context

No response


jezsung added the feature label Aug 12, 2024
