Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch curl to using the multi-handle & workers #47

Merged
merged 10 commits into from
Oct 23, 2024

Conversation

bbockelm
Copy link
Collaborator

We've noticed the performance of the plugin is horribly slow -- to the point where we don't have enough cores in machines to have the origin keep up with the caches (why the caches are pushing so hard is a separate story).

A few judicious uses of pstack indicates that almost all the CPU time is spent in TLS handshakes. It turns out that the curl_easy_* APIs have no connection reuse unless you reuse the handle itself. So, every single request is a separate TLS connection to the upstream HTTP / S3 server. That is indeed, no fun.

This takes motivation (and code) from https://github.com/PelicanPlatform/xrdcl-pelican to have all the requests handled in a set of separate worker threads. The request object enters a queue, is processed by a worker, and then, on completion, the original thread is notified.

This means there's probably 5-ish connections held open to AWS (one per thread) instead of many thousands of short-lived ones.

@bbockelm bbockelm requested review from rw2 and jhiemstrawisc October 15, 2024 17:13
@bbockelm bbockelm force-pushed the multi_curl branch 5 times, most recently from 2b51ba7 to 38309ce Compare October 15, 2024 18:12
@rw2
Copy link
Collaborator

rw2 commented Oct 15, 2024

I built this and ran my stress test on it to see if maybe it fixed the failures problem as a side effect. Quite the opposite! The existing code base fails at something like a 0-20% rate, with high levels of variability. This update is failing at more like a 90% or more rate. I don't have a ton of time this afternoon to dig deeper, but have a bunch of time tomorrow to see if this might help us uncover the failure problem(s).

@rw2
Copy link
Collaborator

rw2 commented Oct 16, 2024

Here's the log of my latest failure. 95% of transfers failed. 241016 rich log.txt

@bbockelm bbockelm force-pushed the multi_curl branch 3 times, most recently from 15cc21c to 56656fd Compare October 22, 2024 06:03
bbockelm and others added 8 commits October 22, 2024 11:58
We now depend on XRootD 5.7.0 or later -- that means we'll need
upstream builds.
Start using the new XrdOucGatherConf-style syntax to process S3/HTTP
server configurations.  Used while otherwise debugging the branch.
This adds a new test fixture that launches an xrootd daemon for use
during the HTTP unit tests.
Print a fixed size of the buffer, not a string.  curl doesn't necessarily
null-terminate.
@bbockelm bbockelm force-pushed the multi_curl branch 2 times, most recently from 2728449 to 11f6b31 Compare October 22, 2024 10:07
When providing a successful response to a Stat request, initialize
all the fields inside the struct.  This ensures everything is zeroed
out when it is later used.

(Use of uninitialized data was reported by valgrind)
Copy link
Member

@jhiemstrawisc jhiemstrawisc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one was a little long to crawl through for line-by-line understanding like I usually like to do, but the overall curl_multi approach seems solid (aided by the fact that we use the same approach in xrdcl-pelican) and the extra tests are always welcome. If I had one request, it would be to crank up the volume on comment frequency and verbosity.

My matrix of hand-run tests all passed.

@bbockelm bbockelm merged commit 1312ddf into PelicanPlatform:main Oct 23, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants