Pool connections across threads and/or make WaybackClient thread-safe #58
Comments
Now that #2 is finished, I’d like to get to this sometime relatively soon, so urllib3 is probably the better option: it’s production-safe, the same core implementation we already depend on, and compatible with the complicated hack we already have for gzip.
Updates, since it's been 2 and a half years:
Both require Python 3.7+, and we have so far tried to support 3.6. However, that’s mostly been for Google Colab users, and Colab now runs Python 3.10 (wow, they really zoomed forward; I recall checking last year and they were still on 3.6). Dropping 3.6 support might be OK now.
Some other urllib3 v2 notes:
We’ve since updated to support urllib3 v2 under the hood, but still call into it through requests, so the issues there have been handled. We are also moving up to Python 3.8+ in the next release, so version support is also a non-issue now. At the moment, I think I am going to move forward with direct usage of urllib3 rather than switching to httpx. I would like to move to httpx in the long run so we can get async support (I don’t think urllib3 will have this in the foreseeable future) and HTTP/2 support (but hopefully that’s actually coming soon in urllib3: urllib3/urllib3#3000). For now, though, getting thread-safety done is really important, and sticking with the known behavior of urllib3 for all our weird hacks and such should make that easier than shifting to httpx and learning whatever weird quirks it might have (there are always some). I hope to get this done over the next couple of weeks and include it in a v0.5.0 release before the end of the year.
This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit. Part of #58.
Hello there. Maybe try Niquests instead?
Thanks for the recommendation! I don’t think it’s necessarily any easier for us to use than httpx would be, but I’ll take a look.
This is a current major goal, but we didn’t have an issue tracking it!
This package should be thread-safe, but because it is based on Requests, we can’t guarantee that. Requests’s authors won’t guarantee it, and at various times have expressed everything from certainty that it’s not thread-safe to cautious optimism that it might be, but overall have expressed that they don’t plan to go out of their way to make certain (and then to maintain that status).
Thread safety is important for EDGI’s use: we need to pull lots of data and so need to make `get_memento()` calls concurrent. Connections also need to be pooled and shared across threads for stability (when EDGI implemented some hacks to do this, speed and reliability both improved considerably). In any case, other EDGI codebases implement some pretty nutty workarounds to make that relatively safe. That shouldn’t be necessary, and people should just be able to use a single `WaybackClient` across multiple threads.
Ideally, we should switch to a different HTTP client instead of Requests. Two options:
- urllib3 is thread-safe and is what Requests is based on, so it should be a reliable switch. Its API is very different, though, so there’s lots to update and test.
- httpx is new, has an API that matches Requests, and has lots of fancy features. It also has async support in case we ever want to make this package async-capable in the future. However, it’s still in beta (should be final before the end of the year), and may possibly have some funny under-the-hood differences we’ll need to account for. We’d at least need to figure out how to re-implement our crazy hack for Wayback’s gzip issues.
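To make the urllib3 option a bit more concrete, here is a minimal sketch of the pattern it would enable: one pooled HTTP object shared by several worker threads. The names `fetch_statuses`, `FakeHTTP`, and `FakeResponse` are hypothetical and exist only for illustration; they are not part of this package. The helper assumes nothing more than a thread-safe object with a `request(method, url)` method whose result has a `status` attribute, which is the shape `urllib3.PoolManager` has.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_statuses(http, urls, max_workers=4):
    """Request every URL concurrently through ONE shared ``http`` object.

    Assumption: ``http`` is thread-safe and exposes ``request(method, url)``
    returning an object with a ``.status`` attribute (as urllib3's
    PoolManager does).
    """
    def fetch(url):
        return url, http.request("GET", url).status

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map() yields results in the same order as ``urls``.
        return dict(executor.map(fetch, urls))


class FakeResponse:
    """Hypothetical stand-in for an HTTP response."""

    def __init__(self, status):
        self.status = status


class FakeHTTP:
    """Hypothetical offline stand-in so the sketch runs without a network."""

    def request(self, method, url):
        # Pretend only https:// URLs succeed.
        return FakeResponse(200 if url.startswith("https://") else 400)


demo_urls = ["https://example.com/a", "http://example.com/b"]
demo_result = fetch_statuses(FakeHTTP(), demo_urls)
```

With urllib3 installed, passing something like `urllib3.PoolManager(maxsize=4)` in place of `FakeHTTP()` should send the same calls over real pooled connections. How that would interact with our gzip hack is not covered here.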
Some other approaches worth keeping in the back pocket, but that probably aren’t ideal:
- A sketched-out implementation in #23 (“Sketch out a way to support multithreading”) makes a funky abstraction that lets it appear as though you are using a single `WaybackClient` on multiple threads, when in fact you are using several. It’s clever, but also a little hacky and probably has a lot of messy corner cases as a result. I don’t feel great about it.
- We could implement EDGI’s workaround from web-monitoring-processing under the hood so that all connections across all clients are pooled. This makes things magically seem to work even if you create separate clients on each thread, but it’s probably unexpected behavior. If a user was actually trying to isolate sets of connections, this would get in their way (and do it in a silent way so they might not even know). It also depends on a small part of Requests staying as thread-safe as it currently is, and there are no guarantees there.
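For a rough sense of what the first of these approaches involves, here is one way a "looks like one client, is actually several" facade could be built with `threading.local`. This is an illustrative sketch only and not necessarily how #23 does it; `PerThreadClient` and `StubClient` are hypothetical names, and `StubClient` is an offline stand-in for `WaybackClient`.

```python
import itertools
import threading


class PerThreadClient:
    """Looks like one client, but lazily builds a separate one per thread."""

    def __init__(self, client_factory):
        self._factory = client_factory
        self._local = threading.local()

    def _client(self):
        # Attributes on a threading.local object are private to each thread.
        if not hasattr(self._local, "client"):
            self._local.client = self._factory()
        return self._local.client

    def __getattr__(self, name):
        # Only invoked when normal lookup fails, so this forwards everything
        # else (e.g. get_memento) to the calling thread's own client.
        return getattr(self._client(), name)


_instance_numbers = itertools.count(1)


class StubClient:
    """Hypothetical stand-in for WaybackClient; does no network I/O."""

    def __init__(self):
        self.instance_number = next(_instance_numbers)

    def get_memento(self, url):
        # Report which underlying client handled the call.
        return self.instance_number, url


shared = PerThreadClient(StubClient)
main_thread_result = shared.get_memento("https://example.com/")

worker_results = []
worker = threading.Thread(
    target=lambda: worker_results.append(shared.get_memento("https://example.com/"))
)
worker.start()
worker.join()
```

Even in this small form, some of the corner cases show up: clients created on worker threads are never explicitly closed here, and each thread still gets its own connections rather than sharing a pool.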