I think this is fine, though as the number of spiders increases, we might want to increase it further when running commands like `latestreleasedate` or `dryrun`.
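For one-off runs like these, the value could also be overridden on the command line rather than in settings.py. A minimal sketch, assuming these custom commands accept Scrapy's standard `-s` option (the value 32 is only illustrative):

```console
$ scrapy latestreleasedate -s CONCURRENT_REQUESTS=32
$ scrapy dryrun -s CONCURRENT_REQUESTS=32
```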
`CONCURRENT_REQUESTS_PER_DOMAIN`: 2 (default 8)
I think some sources can handle a lot more concurrent requests than this, and it would be nice for those sources to complete faster. We could leave the setting at the default in settings.py and lower it in individual spiders as needed, as sketched below. (Right now we set `CONCURRENT_REQUESTS` to 1 in a few cases.) If some servers can handle more than 8, we can also increase it. We don't need to be super accurate, though.
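As a sketch of the per-spider approach (the spider and URL here are hypothetical; `custom_settings` is Scrapy's standard per-spider override mechanism):

```python
import scrapy


class SlowSourceSpider(scrapy.Spider):
    # Hypothetical spider for a source that can't keep up with the default.
    name = 'slow_source'

    # The project-wide default stays in settings.py; throttle only this source.
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
    }

    def start_requests(self):
        # Hypothetical endpoint, for illustration only.
        yield scrapy.Request('https://example.com/releases.json')

    def parse(self, response):
        yield {'body_length': len(response.body)}
```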
`DOWNLOAD_TIMEOUT`: 360 (6 minutes)
I set this based on the performance of the `scrapy dryrun` command. In short, it needs to be high enough that slow servers can respond, but low enough that a spider doesn't wait a very long time for a non-response.
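For reference, the corresponding lines in settings.py might look like this (a sketch; the values mirror the proposal above, with Scrapy's own defaults noted in comments):

```python
# settings.py (sketch)

# Keep Scrapy's default of 8 project-wide, and lower it per spider via
# custom_settings for sources that can't handle it.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# High enough for slow servers to respond, low enough that a spider
# doesn't wait very long for a non-response. Scrapy's default is 180.
DOWNLOAD_TIMEOUT = 360  # 6 minutes
```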
Reference: https://docs.scrapy.org/en/latest/topics/settings.html