How do I re-crawl already processed requests for incremental crawling? #2611

Open
jezsung opened this issue Aug 12, 2024 · 0 comments
Labels
feature Issues that represent new features or improvements to existing features.

jezsung commented Aug 12, 2024

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Feature

I would love to be able to set an expiration time on a request, so that the crawler re-processes the request once it has expired.

Ideally, I would like full control over whether a request needs to be re-processed.

Currently, the crawler seems to skip already processed requests completely, without even executing the preNavigationHooks or the requestHandler.

Motivation

For example, an e-commerce website has a category page that lists products. I'd like to crawl this page and call the enqueueLinks function to enqueue the product detail pages. Once the request for the category page URL has been processed, it is saved to the request_queues storage and won't be processed again for as long as the storage persists.

Currently, there seems to be no way to revisit this category URL while keeping the storage and the other processed requests. Products may have been added since the last crawl, so I think re-crawling this URL would be the right behavior in most situations.

Ideal solution or implementation, and any additional constraints

Add an expiration time to a request and, every time the crawl runs, check whether the request needs to be processed again based on that expiration time.
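To make the proposed check concrete, here is a minimal sketch of the expiration logic on its own. Everything in it is an assumption for illustration: `ProcessedRecord`, `ttlMs`, and `isExpired` are hypothetical names, not part of Crawlee's API, and the sketch does not touch the real request queue.

```typescript
// Hypothetical record of a handled request. Crawlee does not store a TTL today;
// this shape only illustrates the proposal.
interface ProcessedRecord {
    url: string;
    handledAt: number; // epoch milliseconds when the request was last handled
    ttlMs?: number;    // hypothetical per-request time to live; undefined means "never expires"
}

// Returns true when the request should be processed again on this run.
function isExpired(record: ProcessedRecord, now: number = Date.now()): boolean {
    if (record.ttlMs === undefined) return false;
    return now - record.handledAt >= record.ttlMs;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// A category page that should be revisited daily.
const category: ProcessedRecord = {
    url: 'https://example.com/category/shoes',
    handledAt: 0,
    ttlMs: DAY_MS,
};

// A product page with no expiration, which keeps the current skip behavior.
const product: ProcessedRecord = {
    url: 'https://example.com/product/123',
    handledAt: 0,
};
```

With this shape, requests without a TTL behave exactly as they do now, so the feature would be opt-in per request.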

Alternative solutions or implementations

Allow modifying the stored format of the request object, and add a hook that determines whether a request needs to be processed again.

This would let the user add a custom expires field to a request and register a custom hook that checks whether the request has expired, returning a boolean.
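A rough sketch of how such a hook could look from the user's side. The names `StoredRequest`, `ShouldReprocessHook`, and `selectForReprocessing` are hypothetical and only model the idea; the `expires` value is assumed to be an ISO 8601 string kept in the request's user data.

```typescript
// Hypothetical view of an already handled request as the hook would receive it.
interface StoredRequest {
    url: string;
    userData: Record<string, unknown>;
}

// Hypothetical hook signature: return true to process the request again.
type ShouldReprocessHook = (request: StoredRequest, now: Date) => boolean;

// User-defined hook: re-process when the custom `expires` timestamp has passed.
const reprocessIfExpired: ShouldReprocessHook = (request, now) => {
    const expires = request.userData['expires'];
    if (typeof expires !== 'string') return false; // no expiry set: keep skipping
    const expiresAt = Date.parse(expires);
    return !Number.isNaN(expiresAt) && expiresAt <= now.getTime();
};

// Stand-in for what the crawler would do at startup: pick handled requests to run again.
function selectForReprocessing(
    handled: StoredRequest[],
    hook: ShouldReprocessHook,
    now: Date = new Date(),
): StoredRequest[] {
    return handled.filter((request) => hook(request, now));
}

const handledRequests: StoredRequest[] = [
    { url: 'https://example.com/category/shoes', userData: { expires: '2024-08-01T00:00:00Z' } },
    { url: 'https://example.com/product/123', userData: {} },
];

const toReprocess = selectForReprocessing(
    handledRequests,
    reprocessIfExpired,
    new Date('2024-08-12T00:00:00Z'),
);
```

Keeping the decision in a user-supplied function means the crawler would not need to know about the expires field at all; it would only need to call the hook for requests it would otherwise skip.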

Other context

No response

jezsung added the feature label Aug 12, 2024