Potential problem with reading batches when batches are deleted only on consumer close? #15
Comments
Unfortunately hubstorage doesn't paginate HCF's slot read (yet?) (code). Therefore it seems a good approach may not be available unless hubstorage supports pagination. One possible workaround is to delete batches when the spider is idle (something like …).
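A minimal sketch of that idle-deletion workaround, written as a Scrapy extension. It assumes the consumer tracks the ids of batches it has already read (the `processed_batch_ids` attribute and `hcf_slot` handle below are hypothetical) and that python-scrapinghub's frontier slot `q.delete()` accepts a list of batch ids:

```python
from scrapy import signals


class DeleteBatchesOnIdle:
    """Purge fully-read HCF batches whenever the spider goes idle."""

    def __init__(self, crawler):
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        # Hypothetical attributes: the frontier slot handle and the ids of
        # batches whose requests have all been fed to the spider.
        batch_ids = getattr(spider, "processed_batch_ids", [])
        if batch_ids:
            spider.hcf_slot.q.delete(batch_ids)
            spider.processed_batch_ids.clear()
```

Enabled through `EXTENSIONS` in the settings; deleting on idle means the next `queue.iter()` call starts past the already-consumed batches.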
About the alternative mentioned by @starrify: similar behavior, but deleting on the IDLE signal would require a different spider architecture. If you use the scrapy-frontera scheduler, new requests are not read on the idle signal but each time there are available concurrency slots and the local request queue is empty. A possibility, without depending on hubstorage, is to implement (in scrapy-frontera?) some advanced request tracking that deletes each batch once all requests for that batch have been processed (either successfully or with errors), provided some configurable conditions are met (for example, a maximum number of errors, things like that). Even if hubstorage provides HCF batch pagination in the future, that would still be a useful feature; this is not the first time it has come up.
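A rough sketch of that per-batch tracking idea, assuming each request carries its originating batch id in `request.meta` (the `hcf_batch_id` key is hypothetical) and that a `delete_batch()` callable is supplied by whoever wires the tracker up; error paths (e.g. `signals.spider_error`) would need the same decrement:

```python
from collections import defaultdict

from scrapy import signals


class BatchTracker:
    """Delete an HCF batch once all of its requests have been processed."""

    def __init__(self, crawler, delete_batch):
        self.pending = defaultdict(int)  # batch_id -> requests in flight
        self.delete_batch = delete_batch
        crawler.signals.connect(self.on_scheduled,
                                signal=signals.request_scheduled)
        crawler.signals.connect(self.on_response,
                                signal=signals.response_received)

    def on_scheduled(self, request, spider):
        batch_id = request.meta.get("hcf_batch_id")  # hypothetical meta key
        if batch_id is not None:
            self.pending[batch_id] += 1

    def on_response(self, response, request, spider):
        batch_id = request.meta.get("hcf_batch_id")
        if batch_id is not None:
            self.pending[batch_id] -= 1
            if self.pending[batch_id] == 0:
                # Every request from this batch is done: delete it now
                # instead of waiting for consumer close.
                self.delete_batch(batch_id)
                del self.pending[batch_id]
```

The configurable conditions mentioned above (max number of errors, and so on) could simply gate the `delete_batch()` call.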
I haven't thoroughly checked whether it's already in the code, but would it make sense to enforce consumer closure (i.e. only reading HCF once) after the first …?
Yes, that would be a good idea, @hermit-crab |
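A minimal sketch of that read-once behaviour, assuming the consumer sets a flag (the `hcf_read_done` attribute below is hypothetical) after its first read of the slot; going idle after that means all read requests were processed, so the spider stops instead of re-reading the same undeleted batches:

```python
from scrapy import signals


class CloseAfterFirstRead:
    """Close the consumer once the single initial HCF read is exhausted."""

    def __init__(self, crawler):
        self.crawler = crawler
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def spider_idle(self, spider):
        # `hcf_read_done` is a hypothetical flag the consumer would set
        # after its first (and only) read of the frontier slot.
        if getattr(spider, "hcf_read_done", False):
            self.crawler.engine.close_spider(spider, reason="hcf_read_once")
```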
Good day. Let's say we have a million requests inside a slot, and the consumer defines either `HCF_CONSUMER_MAX_REQUESTS = 15000` or `HCF_CONSUMER_MAX_BATCHES = 150`, or it just closes itself after N hours. It also defines `HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True`, so it only purges batches upon exiting. In this case, since as far as I can tell there is no pagination for `scrapycloud_frontier_slot.queue.iter(mincount)`, the consumer will be iterating over only the initial `MAX_NEXT_REQUESTS`, reading them over and over until it reaches either the max requests / max batches / self-enforced time limit, won't it?
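To restate the failure mode concretely, the configuration described above (setting names as quoted in this report; the `MAX_NEXT_REQUESTS` value is illustrative):

```python
HCF_CONSUMER_MAX_REQUESTS = 15000           # stop after 15k requests, or
HCF_CONSUMER_MAX_BATCHES = 150              # stop after 150 batches, or
                                            # a self-imposed time limit
HCF_CONSUMER_DELETE_BATCHES_ON_STOP = True  # batches purged only at close
MAX_NEXT_REQUESTS = 100                     # size of each slot read (illustrative)
```

With deletion deferred to close and no pagination in `queue.iter()`, every read returns the same leading batches, so the consumer cycles over the first `MAX_NEXT_REQUESTS` requests until one of the stop conditions fires.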