Feature request: Stop crawl at time #54

Open
samnissen opened this issue Apr 19, 2017 · 3 comments

Comments

@samnissen

Hello -- this looks like a great crawler, but I need a way to cap crawl time on a per-URL basis.

Because of that I recommend two features:

Actually raise exceptions

This would allow me to define arbitrary conditions under which to stop crawling.

require 'cobweb'
require 'securerandom'

# MyCustomError needs to be defined for the raise below to work
MyCustomError = Class.new(StandardError)

def condition
  SecureRandom.hex(10).include?("a") # or whatever condition I deem relevant
end

# :raise_arbitrary_exceptions is the option I'm proposing; it doesn't exist yet
CobwebCrawler.new(:raise_arbitrary_exceptions => true).crawl("http://pepsico.com") do |page|
  puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
  raise MyCustomError, "message" if condition
end
Just crawled http://www.pepsico.com/ and got a status of 200.
# ... eventually condition is met ...
MyCustomError: message
        from (somewhere):3
# ...
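
To make the intent concrete, here is a rough sketch of what I imagine the change could look like inside the crawler; invoke_block and its placement are my own invention, not Cobweb's actual internals:

# Sketch only: re-raise anything the caller's block raises instead of
# swallowing it, so the crawl loop unwinds and the crawl stops.
def invoke_block(page, options)
  yield page
rescue StandardError => e
  raise if options[:raise_arbitrary_exceptions] # proposed option, not in Cobweb today
  # otherwise swallow it, which is roughly what seems to happen today
  puts "Block raised #{e.class}: #{e.message}; continuing crawl"
end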

Encode crawl stop options

This would be a higher-level way of enshrining these limits as features, and would be a lot cleaner overall.

require 'cobweb'

pages = 0
puts Time.now #=> 2017-04-19 13:33:11 +0100 

CobwebCrawler.new(:max_pages => 1000, :max_time => 360).crawl("http://pepsico.com") do |page|
  pages += 1
end
puts "Stopped after #{pages} pages at #{Time.now}"
#=> Stopped after 1000 pages at 2017-04-19 13:36:25 +0100
# (... or some other time that is not more than 360 seconds from start time)

Ideally :max_time would accept DateTime, Time or Integer objects, where the integer would represent seconds.
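
Something like the following normalisation is what I have in mind; resolve_stop_time and its signature are just illustrative names, not anything that exists in Cobweb:

require 'date'

# Reduce whatever form :max_time takes to an absolute Time to compare against.
def resolve_stop_time(max_time, started_at = Time.now)
  case max_time
  when Integer  then started_at + max_time # seconds from crawl start
  when Time     then max_time              # already an absolute time
  when DateTime then max_time.to_time      # convert for comparison
  end                                      # nil means no limit set
end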

I'm totally new to this project, so feel free to let me know if these are unreasonable requests. I'm also happy to help build this, if you can point me to where in the codebase it would start.

@svenaas

svenaas commented Apr 28, 2020

We would benefit from :max_pages or :max_time options, especially in development and test environments.

@stewartmckee
Owner

If the exception is raised, would you want the whole crawl to stop at that point?

I think you get the same as max_pages by passing crawl_limit. There is also a crawl_limit_by_page boolean, which I think is false by default. crawl_limit is the maximum number of URLs, and if crawl_limit_by_page is set to true then the limit only applies to text/html content.
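
For the 1000-page case above, I think something along these lines already works (values arbitrary):

require 'cobweb'

# crawl_limit caps the total number of URLs; with crawl_limit_by_page set to
# true the cap counts only text/html pages, as described above.
CobwebCrawler.new(:crawl_limit => 1000, :crawl_limit_by_page => true).crawl("http://pepsico.com") do |page|
  puts "Crawled #{page[:url]} (#{page[:status_code]})"
end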

I like the idea of max_time though; I hadn't thought of that before. I'm thinking it would set a datetime and include that date in the within_crawl_limits check to see if it has passed, so it could also consume a stop_at datetime. max_time would just do the arithmetic for you.
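
Roughly what I'm picturing, as a sketch only (the real check lives in within_crawl_limits and the signature here is made up):

# Resolve stop_at / max_time into one deadline when the crawl starts, then
# treat a passed deadline like any other exceeded limit.
def within_crawl_limits?(options, queue_counter, crawl_started_at)
  return false if options[:crawl_limit] && queue_counter >= options[:crawl_limit]

  deadline   = options[:stop_at]
  deadline ||= crawl_started_at + options[:max_time] if options[:max_time]
  return false if deadline && Time.now >= deadline

  true
end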

@samnissen
Author

samnissen commented Mar 31, 2021

Yes, I think raising an error, breaking, or returning should stop the crawl by default.

I wasn't aware of crawl_limit; I'll check that out, thank you.

As for max_time, I'm thinking that would probably be an integer, whereas something like stop_at could be a datetime.
