Potential problem with reading batches when batches are deleted only on consumer close? #15

hermit-crab opened this issue Aug 25, 2019 · 4 comments

@hermit-crab

Good day. Let's say we have a million requests inside a slot, and the consumer defines either HCF_CONSUMER_MAX_REQUESTS = 15000 or HCF_CONSUMER_MAX_BATCHES = 150, or it just closes itself after N hours. It also defines HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True, so it only purges batches upon exiting.

In this case, since as far as I can tell there is no pagination for scrapycloud_frontier_slot.queue.iter(mincount), won't the consumer iterate over only the initial MAX_NEXT_REQUESTS, reading them over and over until it reaches either the max requests / max batches / self-enforced time limit?
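To illustrate what I mean, here is a tiny self-contained simulation (FakeSlotQueue is invented for illustration; it is not hubstorage's API): without pagination, and with deletion deferred to stop, every read returns the same leading batches.

```python
# Toy model of an un-paginated slot read: iter() always starts from the head
# of the queue, so until batches are deleted the consumer sees the same
# ~MAX_NEXT_REQUESTS requests on every read.
class FakeSlotQueue:
    def __init__(self, batches):
        self.batches = list(batches)

    def iter(self, mincount):
        out, count = [], 0
        for batch in self.batches:  # no offset / pagination: always the head
            out.append(batch)
            count += len(batch)
            if count >= mincount:
                break
        return out

queue = FakeSlotQueue([[f"req{i}", f"req{i + 1}"] for i in range(0, 20, 2)])
print(queue.iter(mincount=4))  # first two batches
print(queue.iter(mincount=4))  # the same two batches again, nothing was deleted
```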

@starrify
Member

Unfortunately hubstorage doesn't paginate HCF's slot read (yet?) (code). Therefore HCF_CONSUMER_DELETE_BATCHES_ON_STOP would cause the backend to access only the initial read (approximately MAX_NEXT_REQUESTS requests).

It seems that a good approach may not be available unless hubstorage supports pagination.

One possible workaround is to delete batches when the spider is idle (something like HCF_CONSUMER_DELETE_BATCHES_ON_IDLE?) to make sure previous requests are all consumed before performing the next read. However, this would usually hurt the spider's concurrency / throughput.
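A rough sketch of what such an extension could look like (only the Scrapy spider_idle signal is real; the setting name, the hcf_backend attribute and delete_processed_batches() are hypothetical):

```python
# Hypothetical sketch: delete the already-processed batches when the spider
# goes idle, i.e. when every previously scheduled request has finished.
from scrapy import signals


class DeleteBatchesOnIdle:
    def __init__(self, backend):
        # `backend` is assumed to expose delete_processed_batches(); how the
        # extension obtains the HCF backend is left out of this sketch.
        self.backend = backend

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(backend=getattr(crawler, "hcf_backend", None))  # hypothetical attribute
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # No requests are pending at this point, so the batches read so far
        # can be deleted safely before the next read happens.
        if self.backend is not None:
            self.backend.delete_processed_batches()
```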

@kalessin
Member

The idea behind HCF_CONSUMER_DELETE_BATCHES_ON_STOP is to support cases where one needs to ensure that batches are deleted only if the spider finishes. And yes, that requires setting MAX_NEXT_REQUESTS to the same number of requests/batches you want to read per job. So if you have 1 million requests, it is better to split the crawling among multiple jobs.
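For example (the settings are the ones discussed in this thread; the numbers are just illustrative):

```python
# settings.py for each consumer job: the read window matches what the job is
# expected to consume, and the 1M requests are split across ~100 jobs.
MAX_NEXT_REQUESTS = 10000                   # requests read per job
HCF_CONSUMER_MAX_REQUESTS = 10000           # close the spider after consuming this many
HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True  # batches are purged only on clean stop
# 1,000,000 requests / 10,000 per job -> about 100 consumer jobs
```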

About the alternative mentioned by @starrify: similar behavior, but deleting on the IDLE signal would require a different spider architecture. If you use the scrapy-frontera scheduler, new requests are not read on the idle signal but each time there are available concurrency slots and the local request queue is empty.

A possibility, without depending on hubstorage, is to implement (in scrapy-frontera?) some advanced request tracking that deletes each batch once all requests from that batch have been processed (either successfully or with errors), provided some configurable conditions are met (for example, a maximum number of errors). Even if hubstorage provides HCF batch pagination in the future, that would still be a useful feature; this is not the first time it has come up.
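A rough sketch of that tracking idea (hypothetical names, not an existing scrapy-frontera API):

```python
from collections import defaultdict


class BatchTracker:
    """Deletes a batch once every request read from it has been processed
    (successfully or with an error), subject to a configurable error budget."""

    def __init__(self, delete_batches, max_errors_per_batch=100):
        self._delete_batches = delete_batches   # callable taking a list of batch ids
        self._max_errors = max_errors_per_batch
        self._pending = defaultdict(set)        # batch_id -> pending request ids
        self._errors = defaultdict(int)         # batch_id -> error count

    def batch_read(self, batch_id, request_ids):
        self._pending[batch_id].update(request_ids)

    def request_done(self, batch_id, request_id, error=False):
        if error:
            self._errors[batch_id] += 1
        self._pending[batch_id].discard(request_id)
        if not self._pending[batch_id]:
            # Delete only when the configurable conditions are met.
            if self._errors[batch_id] <= self._max_errors:
                self._delete_batches([batch_id])
            self._pending.pop(batch_id, None)
            self._errors.pop(batch_id, None)
```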

@hermit-crab
Author

I haven't thoroughly checked whether the code already does this, but would it make sense to enforce consumer closure (i.e. only reading HCF once) after the first MAX_NEXT_REQUESTS are read, or to document it / show a warning when HCF_CONSUMER_MAX_REQUESTS / HCF_CONSUMER_MAX_BATCHES exceed MAX_NEXT_REQUESTS (with DELETE_BATCHES_ON_STOP turned on)? This seems like a non-obvious surprise issue that a user could only catch if they are well familiar with the framework mechanics and hubstorage limitations.
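Something along these lines, perhaps (check_consumer_settings is hypothetical; settings.getbool() / settings.getint() are standard Scrapy Settings methods):

```python
import logging

logger = logging.getLogger(__name__)


def check_consumer_settings(settings):
    """Warn when the consumer limits cannot be reached within a single
    un-paginated HCF read."""
    if not settings.getbool("HCF_CONSUMER_DELETE_BATCHES_ON_STOP"):
        return
    max_next = settings.getint("MAX_NEXT_REQUESTS")
    max_requests = settings.getint("HCF_CONSUMER_MAX_REQUESTS")
    if max_next and max_requests and max_requests > max_next:
        logger.warning(
            "HCF_CONSUMER_MAX_REQUESTS (%d) exceeds MAX_NEXT_REQUESTS (%d); "
            "with HCF_CONSUMER_DELETE_BATCHES_ON_STOP the same batches will "
            "be read over and over.",
            max_requests, max_next,
        )
```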

@kalessin
Member

Yes, that would be a good idea, @hermit-crab
