Pool connections across threads and/or make WaybackClient thread-safe #58

Mr0grog · 2020-10-21T07:19:18Z

This is a current major goal, but we didn’t have an issue tracking it!

This package should be thread-safe, but because it is based on Requests, we can’t guarantee that. Requests’s authors won’t guarantee it, and at various times have expressed everything from certainty that it’s not thread-safe to cautious optimism that it might be, but overall have expressed that they don’t plan to go out of their way to make certain (and then to maintain that status).

Thread safety is important for EDGI’s use: we need to pull lots of data and so need to make get_memento() calls concurrent. Connections also need to be pooled and shared across threads for stability (when EDGI implemented some hacks to do this, speed and reliability both improved considerably). In any case, other EDGI codebases implement some pretty nutty workarounds to make that relatively safe. That shouldn’t be necessary, and people should just be able to use a single WaybackClient across multiple threads.

Ideally, we should switch to a different HTTP client instead of Requests. Two options:

urllib3 is thread-safe and is what Requests is based on, so it should be a reliable switch. Its API is very different, though, so there’s lots to update and test.
httpx is new, has an API that matches Requests, and has lots of fancy features. It also has async support in case we ever want to make this package async-capable in the future. However, it’s still in beta (should be final before the end of the year), and may possibly have some funny under-the-hood differences we’ll need to account for. We’d at least need to figure out how to re-implement our crazy hack for Wayback’s gzip issues.

Some other approaches worth keeping in the back pocket, but that probably aren’t ideal:

A sketched out implementation in Sketch out a way to support multithreading #23 makes a funky abstraction that lets it appear as though you are using a single WaybackClient on multiple threads, when in fact you are using several. It’s clever, but also a little hacky and probably has a lot of messy corner cases as a result. I don’t feel great about it.
We could implement EDGI’s workaround from web-monitoring-processing under the hood so that all connections across all clients are pooled. This makes things magically seem to work even if you create separate clients on each thread, but it’s probably unexpected behavior. If a user was actually trying to isolate sets of connections, this would get in their way (and do it in a silent way so they might not even know). It also depends on a small part of Requests staying as thread-safe as it currently is, and there are no guarantees there.

The text was updated successfully, but these errors were encountered:

Mr0grog · 2020-10-21T07:21:15Z

Now that #2 is finished, I’d like to get to this sometime relatively soon, so urllib3 is probably the better option: it’s production-safe, the same core implementation we already depend on, and compatible with the complicated hack we already have for gzip.

Mr0grog · 2023-05-09T17:38:07Z

Updates, since it's been 2 and a half years:

urllib3 v2 is out now, which adds more high-level tooling.
httpx is still v0.x, but no longer markets itself as beta or unstable; it's probably safe to use, and supports HTTP/2 (but not yet 3).

Both require Python 3.7+, and we have so far tried to support 3.6. However, that’s mostly been for Google Colab users, and Colab now runs Python 3.10 (wow, they really zoomed forward — I recall checking last year they were still on 3.6). Dropping 3.6 support might be ok now.

Mr0grog · 2023-05-09T19:05:05Z

Some other urllib3 v2 notes:

urllib3.HTTPResponse.from_httplib was removed (see HTTPConnection should return urllib3.HTTPResponse instead of http.client.HTTPResponse urllib3/urllib3#2648), so our hack for Wayback’s malformed gzip headers needs updating. OTOH, I think Wayback finally fixed that issue, so maybe we can just drop it. :

wayback/wayback/_client.py

Lines 157 to 186 in ee40997

    
           ##################################################################### 
        
           # HACK: handle malformed Content-Encoding headers from Wayback. 
        
           # When you send `Accept-Encoding: gzip` on a request for a memento, Wayback 
        
           # will faithfully gzip the response body. However, if the original response 
        
           # from the web server that was snapshotted was gzipped, Wayback screws up the 
        
           # `Content-Encoding` header on the memento response, leading any HTTP client to 
        
           # *not* decompress the gzipped body. Wayback folks have no clear timeline for 
        
           # a fix, hence the workaround here. More info in this issue: 
        
           # https://github.com/edgi-govdata-archiving/web-monitoring-processing/issues/309 
        
           # 
        
           # This subclass of urllib3's response class identifies the malformed headers 
        
           # and repairs them before instantiating the actual response object, so when it 
        
           # reads the body, it knows to decode it correctly. 
        
           # 
        
           # See what we're overriding from urllib3: 
        
           # https://github.com/urllib3/urllib3/blob/a6ec68a5c5c5743c59fe5c62c635c929586c429b/src/urllib3/response.py#L499-L526 
        
           class WaybackResponse(HTTPConnectionPool.ResponseCls): 
        
               @classmethod 
        
               def from_httplib(cls, httplib_response, **response_kwargs): 
        
                   headers = httplib_response.msg 
        
                   pairs = headers.items() 
        
                   if ('content-encoding', '') in pairs and ('Content-Encoding', 'gzip') in pairs: 
        
                       del headers['content-encoding'] 
        
                       headers['Content-Encoding'] = 'gzip' 
        
                   return super().from_httplib(httplib_response, **response_kwargs) 
        
           HTTPConnectionPool.ResponseCls = WaybackResponse 
        
           # END HACK 
        
           #####################################################################

It looks like some exceptions have changed, so we should be careful about those. (That said, rejiggering all the exception logic will be pretty messy switching away from Requests anyway.)
Release notes: https://github.com/urllib3/urllib3/releases/tag/2.0.0
Migration guide: https://urllib3.readthedocs.io/en/latest/v2-migration-guide.html

Mr0grog · 2023-12-13T00:41:03Z

We’ve since updated to support urllib3 v2 under the hood, but still call into it through requests, so the issues there have been handled. We are also moving up to Python 3.8+ in the next release, so version support is also a non-issue now.

At the moment, I think I am going to move forward with direct usage of urllib3 rather than switching to httpx. I would like to move to httpx in the long run so we can get async support (I don’t think urllib3 will have this in the foreseeable future) and HTTP/2 support (but hopefully that’s actually coming soon in urllib3: urllib3/urllib3#3000). For now, though, getting thread-safety done is really important, and keeping with the known behavior of urllib3 for all our weird hacks and such should make that easier than shifting to httpx and learning whatever weird quirks it might have (there are always some).

I hope to get this done over the next couple weeks and include it in a v0.5.0 release before the end of the year.

This starts the path toward removing the requests package by rejiggering *most* of our exception handling. There is a huge amount more to do and this doesn't even begin to start using urllib3 to make requests, so basically everything should be completely broken with this commit. Part of #58.

Ousret · 2024-01-16T03:49:05Z

Hello there,

Maybe, try Niquests instead?
We can assist if needed. There a non negligible chance that is going to ease your tasks ahead.

Mr0grog · 2024-01-17T07:12:34Z

Thanks for the recommendation! I don’t think it’s necessarily any easier for us to use than httpx would be, but I’ll take a look.

Mr0grog added the enhancement New feature or request label Oct 21, 2020

Mr0grog mentioned this issue Aug 31, 2021

ClosedConnectionError & rate limiting #82

Open

Mr0grog mentioned this issue Nov 7, 2022

Multiple threads #106

Open

Mr0grog added this to Wayback Roadmap Nov 10, 2022

Mr0grog added this to the v0.5.0 milestone Nov 10, 2022

Mr0grog mentioned this issue May 30, 2023

Not compatible with urllib3 2.0 #116

Closed

Mr0grog mentioned this issue Nov 2, 2023

search_calls_per_second needs to be dialed down #137

Closed

Mr0grog mentioned this issue Dec 7, 2023

Winter cleaning for rate limiting #147

Merged

4 tasks

Mr0grog self-assigned this Dec 13, 2023

Mr0grog pinned this issue Dec 13, 2023

Mr0grog mentioned this issue Dec 13, 2023

Add top-level convenience functions #149

Open

Mr0grog moved this to Prioritized in Wayback Roadmap Dec 13, 2023

Mr0grog moved this from Prioritized to In Progress in Wayback Roadmap Dec 13, 2023

Mr0grog linked a pull request Dec 14, 2023 that will close this issue

Remove requests dependency #152

Draft

Mr0grog mentioned this issue Dec 30, 2023

Remove requests from public interfaces #157

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pool connections across threads and/or make WaybackClient thread-safe #58

Pool connections across threads and/or make WaybackClient thread-safe #58

Mr0grog commented Oct 21, 2020

Mr0grog commented Oct 21, 2020

Mr0grog commented May 9, 2023

Mr0grog commented May 9, 2023

Mr0grog commented Dec 13, 2023

Ousret commented Jan 16, 2024 •

edited

Loading

Mr0grog commented Jan 17, 2024

Pool connections across threads and/or make WaybackClient thread-safe #58

Pool connections across threads and/or make WaybackClient thread-safe #58

Comments

Mr0grog commented Oct 21, 2020

Mr0grog commented Oct 21, 2020

Mr0grog commented May 9, 2023

Mr0grog commented May 9, 2023

Mr0grog commented Dec 13, 2023

Ousret commented Jan 16, 2024 • edited Loading

Mr0grog commented Jan 17, 2024

Ousret commented Jan 16, 2024 •

edited

Loading