Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Feature
I would love to be able to set an expiration time on a request, letting the crawler re-process the request once it has expired.
Ideally, I would love to have full control over whether a request needs to be re-processed or not.
Currently, the crawler seems to skip an already-processed request completely, not even executing the `preNavigationHooks` or the `requestHandler`.
Motivation
For example, on an e-commerce website there is a category page that lists products. I'd like to crawl this page and call the `enqueueLinks` function to fetch the product detail pages. Once the request for the category page URL is processed, it is saved to the `request_queues` storage and won't be processed again for as long as the storage persists.
Currently, there seems to be no way to re-visit this category URL while keeping the storage and the other processed requests. Products may have been added since the last crawl, so I think re-crawling this URL would be ideal in most situations.
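To make the scenario concrete, this is roughly what the current setup looks like (the selector and URLs are placeholders). After the first run with persisted storage, the category URL is marked as handled and the `requestHandler` never runs for it again:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Enqueue the product detail pages linked from the category page.
        await enqueueLinks({ selector: 'a.product-link', label: 'DETAIL' });
    },
});

await crawler.run(['https://example.com/categories/shoes']);
```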
Ideal solution or implementation, and any additional constraints
Add an expiration time to a request and, on every crawl run, check whether the request needs to be processed based on that expiration time.
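Something along these lines; note that `expiresAt` is an imagined option to illustrate the proposal, not an existing crawlee field:

```ts
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ enqueueLinks }) {
        await enqueueLinks({ selector: 'a.product-link' });
    },
});

await crawler.run([
    {
        url: 'https://example.com/categories/shoes',
        // Imagined option: once this timestamp has passed, the crawler
        // would treat the handled request as stale and process it again
        // on the next run instead of skipping it.
        expiresAt: Date.now() + 24 * 60 * 60 * 1000,
    },
]);
```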
Alternative solutions or implementations
Allow modifying the output format of the request object, and add a hook that can determine whether a request needs to be processed or not.
This would let the user add a custom `expires` field to a request and register a custom hook that checks whether the request has expired, returning a boolean from the hook function. A rough sketch of what I have in mind follows below.
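In this sketch, `shouldProcessRequest` is an imagined crawler option, while `userData` already exists on requests and could carry the custom field:

```ts
import { PlaywrightCrawler } from 'crawlee';

const TTL_MS = 24 * 60 * 60 * 1000;

const crawler = new PlaywrightCrawler({
    // Imagined hook: called for requests the queue has already marked as
    // handled; returning true would make the crawler process them again.
    shouldProcessRequest: (request) => {
        const expires = request.userData.expires;
        return typeof expires !== 'number' || Date.now() > expires;
    },
    async requestHandler({ request, enqueueLinks }) {
        // Stamp the custom expiry so the hook can compare it on the next run.
        request.userData.expires = Date.now() + TTL_MS;
        await enqueueLinks({ selector: 'a.product-link' });
    },
});

await crawler.run(['https://example.com/categories/shoes']);
```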
Other context
No response
jezsung added the feature label (Issues that represent new features or improvements to existing features) on Aug 12, 2024