Improve client handling around director outages #1389

bbockelm · 2024-06-06T20:34:41Z

A client unlucky enough to start while the director is restarting will get either a 404 or a 500 from the Traefik ingress (I think it depends slightly on where things are in the restart process).

This is fatal for the transfer; here are a few example hold messages that were generated:

Transfer input files failure at execution point [email protected] using protocol osdf. \
Details: Error from [email protected]: Failed to transfer files: \
FILETRANSFER:1:non-zero exit (1) from /usr/libexec/condor/stash_plugin. |\
Error: failed to get namespace information for remote URL osdf:///chtc/staging/blah/container.sif: \
Could not unmarshall the director's response: invalid character 'B' looking for beginning of value \
( URL file = osdf:///chtc/staging/blah/container.sif )|

Transfer input files failure at execution point [email protected] using protocol osdf. \
Details: Error from [email protected]: Failed to transfer files: \
FILETRANSFER:1:non-zero exit (1) from /usr/libexec/condor/stash_plugin. |\
Error: failed to get namespace information for remote URL osdf:///chtc/staging/blah/container.sif: 404: 404 page not found\n \
( URL file = osdf:///chtc/staging/blah/container.sif )|

A few thoughts:

There's an obvious bug in that we try to unmarshal a JSON struct out of the body for a non-JSON response (I think the first response is actually plain-text from Traefik).
If we could tell a response from the director apart from one generated by Traefik, we could retry. We should add a Server header from the director to detect this situation.
For error responses from Traefik -- or, in general, for network connectivity issues -- we could consider not only retrying in general but retrying more aggressively when in "plugin" mode. While I might want a CLI to give up after 5 seconds, I can probably spend 30 seconds of retrying in plugin mode.
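
The three ideas above could be sketched in Go roughly as follows. This is a minimal illustration under stated assumptions, not Pelican's actual client code: the `pelican-director` Server header value and every function name here are hypothetical, the response is reduced to a status code and two header strings so the example runs without a network, and the 5 and 30 second budgets are simply the figures floated above.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"mime"
	"strings"
	"time"
)

// directorServerPrefix is a HYPOTHETICAL Server header value. The issue only
// proposes that the director send one; it does not say what it would be.
const directorServerPrefix = "pelican-director"

// errNotDirector marks a response produced by something in front of the
// director (for example the ingress), which is worth retrying.
var errNotDirector = errors.New("response did not come from the director")

// fromDirector reports whether the Server header identifies the director.
func fromDirector(serverHeader string) bool {
	return strings.HasPrefix(strings.ToLower(serverHeader), directorServerPrefix)
}

// isJSON reports whether a Content-Type header declares a JSON body, so that
// plain-text error pages are never handed to the JSON decoder.
func isJSON(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	return err == nil && mediaType == "application/json"
}

// checkResponse returns nil only when the body is safe to unmarshal.
func checkResponse(status int, serverHeader, contentType string) error {
	if !fromDirector(serverHeader) {
		return fmt.Errorf("status %d, Server %q: %w", status, serverHeader, errNotDirector)
	}
	if !isJSON(contentType) {
		return fmt.Errorf("status %d: unexpected Content-Type %q", status, contentType)
	}
	return nil
}

// retryBudget returns how long to keep retrying: longer in plugin mode, where
// a failure holds the job, than for an interactive CLI.
func retryBudget(pluginMode bool) time.Duration {
	if pluginMode {
		return 30 * time.Second
	}
	return 5 * time.Second
}

// retry calls attempt until it succeeds, fails with a non-retryable error,
// or the budget runs out.
func retry(ctx context.Context, budget, pause time.Duration, attempt func() error) error {
	ctx, cancel := context.WithTimeout(ctx, budget)
	defer cancel()
	for {
		err := attempt()
		if err == nil || !errors.Is(err, errNotDirector) {
			return err
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up after %v: %w", budget, err)
		case <-time.After(pause):
		}
	}
}

// demo simulates two ingress answers followed by a director answer and
// returns how many attempts were made.
func demo() (int, error) {
	type reply struct {
		status              int
		server, contentType string
	}
	replies := []reply{
		{404, "", "text/plain; charset=utf-8"},
		{500, "", "text/plain; charset=utf-8"},
		{200, "pelican-director/7.x", "application/json"},
	}
	attempts := 0
	err := retry(context.Background(), retryBudget(false), 10*time.Millisecond, func() error {
		r := replies[attempts]
		attempts++
		return checkResponse(r.status, r.server, r.contentType)
	})
	return attempts, err
}

func main() {
	attempts, err := demo()
	fmt.Println("attempts:", attempts, "error:", err)
}
```

Only `errNotDirector` is treated as retryable here, so a genuine error from the director itself would still surface immediately instead of being retried for the whole budget.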


jhiemstrawisc · 2024-09-30T15:09:50Z

@bbockelm, it looks like you started working on this in #1565. Do you intend for that PR to close this issue?

aowen-uwmad · 2024-10-22T20:31:04Z

I encountered this bug (or a version of it) this afternoon, around 1:45 PM CDT, while directly using the pelican binary on Ubuntu 22.04 (on WSL, Windows 11).

Version: 7.8.6
Build Date: 2024-06-06T21:27:21Z
Build Commit: f911e9940cc4ba00827475dbd4a33545bd927554
Built By: goreleaser

Managed to catch the error with debug enabled:

$ ./pelican -d object get pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt downloaded-test.txt
WARNING[2024-10-22T13:47:18-05:00] Debug is set as a flag or in config, this will override anything set for Logging.Level within your configuration
DEBUG[2024-10-22T13:47:18-05:00] Launch progress bars display
DEBUG[2024-10-22T13:47:18-05:00] Len of source: 2
DEBUG[2024-10-22T13:47:18-05:00] Sources: [pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt]
DEBUG[2024-10-22T13:47:18-05:00] Destination: downloaded-test.txt
DEBUG[2024-10-22T13:47:18-05:00] Making new clients
DEBUG[2024-10-22T13:47:18-05:00] Created new client 0192b58e-0daf-7aa0-a498-a4ec5a9973de
DEBUG[2024-10-22T13:47:18-05:00] Detected pelican:// url, getting federation metadata from specified host osg-htc.org

DEBUG[2024-10-22T13:47:18-05:00] Performing federation service discovery for specified url against endpoint https://osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in director URL https://osdf-director.osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in registry URL https://osdf-registry.osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in JWKS URL https://osg-htc.org/osdf/public_signing_key.jwks
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in broker URL
DEBUG[2024-10-22T13:47:18-05:00] Will query director at https://osdf-director.osg-htc.org for object /ospool/uc-shared/public/OSG-Staff/validation/test.txt
DEBUG[2024-10-22T13:47:18-05:00] Director's response: &{301 Moved Permanently 301 HTTP/1.1 1 1 map[Alt-Svc:[h3=":443"; ma=86400] Cf-Cache-Status:[DYNAMIC] Cf-Ray:[8d6ba7b75a4c61b1-ORD] Connection:[keep-alive] Content-Length:[17] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 22 Oct 2024 18:47:19 GMT] Location:[https://osdf-director.osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt] Nel:[{"success_fraction":0,"report_to":"cf-nel","max_age":604800}] Report-To:[{"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=0FEEo1BzuV1dLAQwYiPjB3DWJ4%2FdYXERSYhM3446j0Fp4IGX6dsN1SfOzaFrHDV9nFdnqF9ZtEpgGDWbQFlwwken6NUuuRT6qqTPcfV7ThYyZJ5uWK53SWgvwBFZOazLABJTm7yNYjXbkd0Y"}],"group":"cf-nel","max_age":604800}] Server:[cloudflare] Server-Timing:[cfL4;desc="?proto=TCP&rtt=9006&sent=5&recv=7&lost=0&retrans=0&sent_bytes=3098&recv_bytes=524&delivery_rate=445156&cwnd=231&unsent_bytes=0&cid=c44fddeab861b0a0&ts=85&x=0"] Strict-Transport-Security:[max-age=2592000]] 0xc0009180c0 17 [] false false map[] 0xc000626000 0xc000204d10}
ERROR[2024-10-22T13:47:18-05:00] Error while querying the Director: Could not unmarshall the director's response: invalid character 'M' looking for beginning of value
ERROR[2024-10-22T13:47:18-05:00] Could not unmarshall the director's response: invalid character 'M' looking for beginning of value
DEBUG[2024-10-22T13:47:18-05:00] Shutting down transfer engine
DEBUG[2024-10-22T13:47:18-05:00] Job handler has been shutdown
ERROR[2024-10-22T13:47:18-05:00] Failure getting pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt: failed to get namespace information for remote URL pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt: Could not unmarshall the director's response: invalid character 'M' looking for beginning of value

Oddly, if I typed the command manually, I did not receive an error. As demonstrated live to @turetske and @jhiemstrawisc, if I copied the command from the website (using the "Copy" button in the code block, or highlighting and pressing Ctrl+C), it produced the above error. As such, I initially thought it was an issue with the website.

Later attempts to reproduce the error failed: the command worked as expected, presumably because the OSDF director was back online. I did change my wireless network in between tests; perhaps the first network was doing something funky?

bbockelm · 2024-10-26T15:29:29Z

@aowen-uwmad - that's a puzzler.

The headers indicate you're talking to an HTTP server that's proxied through Cloudflare. Needless to say, that's not where our director lives!

Is there anything special about your laptop's networking setup that might be relevant?

Next time this occurs, could you also run dig -t AAAA osdf-director.osg-htc.org and dig osdf-director.osg-htc.org to see if you're picking up a funky DNS resolution?

Another idea - could you and @williamnswanson try reproducing it for the ITB director?

aowen-uwmad · 2024-10-28T14:48:02Z

My laptop setup has a history of funky networking things, but it hasn't acted up in quite a while.
I can try to replicate it this afternoon.

bbockelm added the enhancement (New feature or request), client (Issue affecting the OSDF client), and director (Issue relating to the director component) labels Jun 6, 2024

bbockelm added this to the v7.10.0 milestone Jun 6, 2024

bbockelm assigned jhiemstrawisc Jun 6, 2024

jhiemstrawisc added the critical (High priority for next release) label Sep 24, 2024

jhiemstrawisc changed the milestone from v7.10.0 to v7.12.0 Sep 24, 2024

jhiemstrawisc changed the milestone from v7.12.0 to v7.13.0 Dec 16, 2024
