Potential problem with reading batches when batches are deleted only on consumer close? #15

hermit-crab opened this issue Aug 25, 2019 · 4 comments

@hermit-crab

Good day. Let's say we have a million requests inside a slot, and the consumer defines either HCF_CONSUMER_MAX_REQUESTS = 15000 or HCF_CONSUMER_MAX_BATCHES = 150, or it just closes itself after N hours. It also defines HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True, so it only purges batches upon exiting.

In this case, since as far as I can tell there is no pagination for scrapycloud_frontier_slot.queue.iter(mincount), won't the consumer iterate over only the initial MAX_NEXT_REQUESTS, reading them over and over until it reaches either the max requests / max batches / self-enforced time limit?
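To illustrate what I mean, here is a tiny self-contained simulation (FakeSlotQueue is invented for illustration; it is not hubstorage's API): without pagination, and with deletion deferred to stop, every read returns the same leading batches.

```python
# Toy model of an un-paginated slot read: iter() always starts from the head
# of the queue, so until batches are deleted the consumer sees the same
# ~MAX_NEXT_REQUESTS requests on every read.
class FakeSlotQueue:
    def __init__(self, batches):
        self.batches = list(batches)

    def iter(self, mincount):
        out, count = [], 0
        for batch in self.batches:  # no offset / pagination: always the head
            out.append(batch)
            count += len(batch)
            if count >= mincount:
                break
        return out

queue = FakeSlotQueue([[f"req{i}", f"req{i + 1}"] for i in range(0, 20, 2)])
print(queue.iter(mincount=4))  # first two batches
print(queue.iter(mincount=4))  # the same two batches again, nothing was deleted
```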

@starrify
Member

Unfortunately hubstorage doesn't paginate HCF's slot read (yet?) (code). Therefore HCF_CONSUMER_DELETE_BATCHES_ON_STOP would cause the backend to access only the initial read (approximately MAX_NEXT_REQUESTS requests).

It seems that a good approach may not be available unless hubstorage supports pagination.

One possible workaround is to delete batches when the spider is idle (something like HCF_CONSUMER_DELETE_BATCHES_ON_IDLE?) to make sure previous requests are all consumed before performing the next read. However, this would usually hurt the spider's concurrency / throughput.
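A rough sketch of what such an extension could look like (only the Scrapy spider_idle signal is real; the setting name, the hcf_backend attribute and delete_processed_batches() are hypothetical):

```python
# Hypothetical sketch: delete the already-processed batches when the spider
# goes idle, i.e. when every previously scheduled request has finished.
from scrapy import signals


class DeleteBatchesOnIdle:
    def __init__(self, backend):
        # `backend` is assumed to expose delete_processed_batches(); how the
        # extension obtains the HCF backend is left out of this sketch.
        self.backend = backend

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(backend=getattr(crawler, "hcf_backend", None))  # hypothetical attribute
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # No requests are pending at this point, so the batches read so far
        # can be deleted safely before the next read happens.
        if self.backend is not None:
            self.backend.delete_processed_batches()
```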

@kalessin
Member

The idea behind HCF_CONSUMER_DELETE_BATCHES_ON_STOP is to support cases where one needs to ensure that batches are deleted only if the spider finishes. And yes, that requires setting MAX_NEXT_REQUESTS to the same number of requests/batches you want to read per job. So if you have 1 million requests, it is better to split the crawling among multiple jobs.
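For example (the settings are the ones discussed in this thread; the numbers are just illustrative):

```python
# settings.py for each consumer job: the read window matches what the job is
# expected to consume, and the 1M requests are split across ~100 jobs.
MAX_NEXT_REQUESTS = 10000                   # requests read per job
HCF_CONSUMER_MAX_REQUESTS = 10000           # close the spider after consuming this many
HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True  # batches are purged only on clean stop
# 1,000,000 requests / 10,000 per job -> about 100 consumer jobs
```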

About the alternative mentioned by @starrify: similar behavior, but deleting on the IDLE signal would require a different spider architecture. If you use the scrapy-frontera scheduler, new requests are not read on the idle signal but each time there are available concurrency slots and the local request queue is empty.

A possibility, without depending on hubstorage, is to implement (in scrapy-frontera?) some advanced request tracking that deletes each batch once all requests from that batch have been processed (either successfully or with errors), provided some configurable conditions are met (for example, a maximum number of errors). Even if hubstorage provides HCF batch pagination in the future, that would still be a useful feature; this is not the first time it has come up.
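A rough sketch of that tracking idea (hypothetical names, not an existing scrapy-frontera API):

```python
from collections import defaultdict


class BatchTracker:
    """Deletes a batch once every request read from it has been processed
    (successfully or with an error), subject to a configurable error budget."""

    def __init__(self, delete_batches, max_errors_per_batch=100):
        self._delete_batches = delete_batches   # callable taking a list of batch ids
        self._max_errors = max_errors_per_batch
        self._pending = defaultdict(set)        # batch_id -> pending request ids
        self._errors = defaultdict(int)         # batch_id -> error count

    def batch_read(self, batch_id, request_ids):
        self._pending[batch_id].update(request_ids)

    def request_done(self, batch_id, request_id, error=False):
        if error:
            self._errors[batch_id] += 1
        self._pending[batch_id].discard(request_id)
        if not self._pending[batch_id]:
            # Delete only when the configurable conditions are met.
            if self._errors[batch_id] <= self._max_errors:
                self._delete_batches([batch_id])
            self._pending.pop(batch_id, None)
            self._errors.pop(batch_id, None)
```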

@hermit-crab
Author

I haven't thoroughly checked whether the code already does this, but would it make sense to enforce consumer closure (i.e. only reading HCF once) after the first MAX_NEXT_REQUESTS are read, or to document it / show a warning when HCF_CONSUMER_MAX_REQUESTS / HCF_CONSUMER_MAX_BATCHES exceed MAX_NEXT_REQUESTS (with DELETE_BATCHES_ON_STOP turned on)? This seems like a non-obvious surprise issue that a user could only catch if they are well familiar with the framework mechanics and hubstorage limitations.
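Something along these lines, perhaps (check_consumer_settings is hypothetical; settings.getbool() / settings.getint() are standard Scrapy Settings methods):

```python
import logging

logger = logging.getLogger(__name__)


def check_consumer_settings(settings):
    """Warn when the consumer limits cannot be reached within a single
    un-paginated HCF read."""
    if not settings.getbool("HCF_CONSUMER_DELETE_BATCHES_ON_STOP"):
        return
    max_next = settings.getint("MAX_NEXT_REQUESTS")
    max_requests = settings.getint("HCF_CONSUMER_MAX_REQUESTS")
    if max_next and max_requests and max_requests > max_next:
        logger.warning(
            "HCF_CONSUMER_MAX_REQUESTS (%d) exceeds MAX_NEXT_REQUESTS (%d); "
            "with HCF_CONSUMER_DELETE_BATCHES_ON_STOP the same batches will "
            "be read over and over.",
            max_requests, max_next,
        )
```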

@kalessin
Member

Yes, that would be a good idea, @hermit-crab
