Feature request: Stop crawl at time #54
Comments
We would benefit from this feature too.
If the exception is raised, would you want the whole crawl to stop at that point? I think you can get what you want from max_pages by submitting crawl_limit; there is also a crawl_limit_by_page boolean, which I think is false by default. crawl_limit is the maximum number of URLs, and if crawl_limit_by_page is set to true then crawl_limit only applies to text/html content. I like the idea of max_time, though; I hadn't thought of that before. I'm thinking it would set a datetime and include that date in within_crawl_limits to check whether it has passed, so the crawler could also consume a stop_at datetime. max_time would just do the arithmetic for you.
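A minimal sketch of how that deadline could be derived and checked, in case it helps; the option names (:stop_at, :max_time), the helper names, and where this would hook into within_crawl_limits are assumptions for illustration, not the gem's actual internals.

```ruby
# Hypothetical sketch only: option names and helpers are assumptions,
# not this gem's API.
require 'date'

# Resolve an absolute deadline from the crawl options.
# :max_time may be an Integer (seconds from crawl start) or an absolute
# Time/DateTime; :stop_at is always an absolute point in time.
def stop_time(options, started_at = Time.now)
  limit = options[:stop_at] || options[:max_time]
  case limit
  when nil      then nil
  when Integer  then started_at + limit      # seconds after the crawl began
  when Time     then limit
  when DateTime then limit.to_time           # normalise for comparison
  end
end

# A within_crawl_limits-style check could call this alongside the existing
# crawl_limit checks and stop queueing new URLs once it returns false.
def within_time_limit?(deadline)
  deadline.nil? || Time.now < deadline
end

deadline = stop_time({ max_time: 15 * 60 })  # stop roughly 15 minutes from now
puts within_time_limit?(deadline)            # => true until the deadline passes
```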
Yes, I think raising the error, breaking, or returning should stop the crawl as the default. I wasn't aware of the crawl_limit options.
Hello -- this looks like a great crawler, but I need a way to cap crawl time on a per-URL basis when crawling.

Because of that I recommend two features:

1. Actually raise exceptions. This would allow me to decide any arbitrary conditions upon which to stop crawling (see the sketch at the end of this post).
2. Encode crawl stop options. This would be a higher-level way of enshrining these as features, and would be a lot cleaner overall.
Ideally :max_time would accept DateTime, Time, or Integer objects, where the integer would represent seconds. I'm totally new to this project, so feel free to let me know if these are crazy requests. I'm happy to help make this too, if you can give me a pointer as to where this would start out.
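For the first request, here is a rough user-side sketch of stopping a crawl by raising from the per-page block. FakeCrawler, the block argument, and the StopCrawl error are placeholders, not this gem's documented API; the point is only to show the control flow being asked for.

```ruby
# Illustrative only: FakeCrawler stands in for the real crawler so the
# control flow can be shown; nothing here is the gem's documented API.
class StopCrawl < StandardError; end

# Minimal stand-in that just yields a few fake "pages" to the block.
class FakeCrawler
  def crawl(_url)
    %w[page-1 page-2 page-3].each { |page| yield page }
  end
end

started_at = Time.now
max_time   = 15 * 60   # seconds

begin
  FakeCrawler.new.crawl("http://example.com") do |page|
    # Any arbitrary condition can end the crawl by raising.
    raise StopCrawl, "time budget exhausted" if Time.now - started_at > max_time
    puts "processed #{page}"
  end
rescue StopCrawl => e
  puts "Crawl stopped early: #{e.message}"
end
```

The real crawler would need to either let the exception propagate or rescue it and wind down cleanly, which is the behaviour the first request asks for.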