Improve client handling around director outages #1389

Open
bbockelm opened this issue Jun 6, 2024 · 4 comments
@bbockelm
Collaborator

bbockelm commented Jun 6, 2024

A client unlucky enough to start while the director is restarting will get either a 404 or a 500 from the Traefik ingress (I think which one depends slightly on where things are in the restart process).

This is a fatal outage; here are a few example hold messages that were generated:

Transfer input files failure at execution point [email protected] using protocol osdf. \
Details: Error from [email protected]: Failed to transfer files: \
FILETRANSFER:1:non-zero exit (1) from /usr/libexec/condor/stash_plugin. |\
Error: failed to get namespace information for remote URL osdf:///chtc/staging/blah/container.sif: \
Could not unmarshall the director's response: invalid character 'B' looking for beginning of value \
( URL file = osdf:///chtc/staging/blah/container.sif )|
Transfer input files failure at execution point [email protected] using protocol osdf. \
Details: Error from [email protected]: Failed to transfer files: \
FILETRANSFER:1:non-zero exit (1) from /usr/libexec/condor/stash_plugin. |\
Error: failed to get namespace information for remote URL osdf:///chtc/staging/blah/container.sif: 404: 404 page not found\n \
( URL file = osdf:///chtc/staging/blah/container.sif )|

A few thoughts:

  1. There's an obvious bug in that we try to unmarshal a JSON struct out of the body for a non-JSON response (I think the first response is actually plain-text from Traefik).
  2. If we differentiate a response from the director versus Traefik, we could retry. We should add a Server header from the director to detect this situation.
  3. For error responses from Traefik -- or, in general, for network connectivity issues -- we could consider not only retrying in general but retrying more aggressively when in "plugin" mode. While I might want a CLI to give up after 5 seconds, I can probably spend 30 seconds of retrying in plugin mode.
@bbockelm bbockelm added enhancement New feature or request client Issue affecting the OSDF client director Issue relating to the director component labels Jun 6, 2024
@bbockelm bbockelm added this to the v7.10.0 milestone Jun 6, 2024
@jhiemstrawisc jhiemstrawisc added the critical High priority for next release label Sep 24, 2024
@jhiemstrawisc jhiemstrawisc modified the milestones: v7.10.0, v7.12.0 Sep 24, 2024
@jhiemstrawisc
Member

@bbockelm, it looks like you started working on this in #1565. Do you intend for that PR to close this issue?

@aowen-uwmad
Contributor

I encountered this bug (or a version of it) this afternoon, around 1:45 PM CDT, while directly using the pelican binary in Ubuntu 22.04 (on WSL, Windows 11).

Version: 7.8.6
Build Date: 2024-06-06T21:27:21Z
Build Commit: f911e9940cc4ba00827475dbd4a33545bd927554
Built By: goreleaser

Managed to catch the error with debug enabled:

$ ./pelican -d object get pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validati
on/test.txt downloaded-test.txt
WARNING[2024-10-22T13:47:18-05:00] Debug is set as a flag or in config, this will override anything set for Logging.Level within your configuration
DEBUG[2024-10-22T13:47:18-05:00] Launch progress bars display
DEBUG[2024-10-22T13:47:18-05:00] Len of source: 2
DEBUG[2024-10-22T13:47:18-05:00] Sources: [pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt]
DEBUG[2024-10-22T13:47:18-05:00] Destination: downloaded-test.txt
DEBUG[2024-10-22T13:47:18-05:00] Making new clients
DEBUG[2024-10-22T13:47:18-05:00] Created new client 0192b58e-0daf-7aa0-a498-a4ec5a9973de
DEBUG[2024-10-22T13:47:18-05:00] Detected pelican:// url, getting federation metadata from specified host osg-htc.org

DEBUG[2024-10-22T13:47:18-05:00] Performing federation service discovery for specified url against endpoint https://osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in director URL https://osdf-director.osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in registry URL https://osdf-registry.osg-htc.org
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in JWKS URL https://osg-htc.org/osdf/public_signing_key.jwks
DEBUG[2024-10-22T13:47:18-05:00] Federation service discovery resulted in broker URL
DEBUG[2024-10-22T13:47:18-05:00] Will query director at https://osdf-director.osg-htc.org for object /ospool/uc-shared/public/OSG-Staff/validation/test.txt
DEBUG[2024-10-22T13:47:18-05:00] Director's response: &{301 Moved Permanently 301 HTTP/1.1 1 1 map[Alt-Svc:[h3=":443"; ma=86400] Cf-Cache-Status:[DYNAMIC] Cf-Ray:[8d6ba7b75a4c61b1-ORD] Connection:[keep-alive] Content-Length:[17] Content-Type:[text/plain; charset=utf-8] Date:[Tue, 22 Oct 2024 18:47:19 GMT] Location:[https://osdf-director.osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt] Nel:[{"success_fraction":0,"report_to":"cf-nel","max_age":604800}] Report-To:[{"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v4?s=0FEEo1BzuV1dLAQwYiPjB3DWJ4%2FdYXERSYhM3446j0Fp4IGX6dsN1SfOzaFrHDV9nFdnqF9ZtEpgGDWbQFlwwken6NUuuRT6qqTPcfV7ThYyZJ5uWK53SWgvwBFZOazLABJTm7yNYjXbkd0Y"}],"group":"cf-nel","max_age":604800}] Server:[cloudflare] Server-Timing:[cfL4;desc="?proto=TCP&rtt=9006&sent=5&recv=7&lost=0&retrans=0&sent_bytes=3098&recv_bytes=524&delivery_rate=445156&cwnd=231&unsent_bytes=0&cid=c44fddeab861b0a0&ts=85&x=0"] Strict-Transport-Security:[max-age=2592000]] 0xc0009180c0 17 [] false false map[] 0xc000626000 0xc000204d10}
ERROR[2024-10-22T13:47:18-05:00] Error while querying the Director: Could not unmarshall the director's response: invalid character 'M' looking for beginning of value
ERROR[2024-10-22T13:47:18-05:00] Could not unmarshall the director's response: invalid character 'M' looking for beginning of value
DEBUG[2024-10-22T13:47:18-05:00] Shutting down transfer engine
DEBUG[2024-10-22T13:47:18-05:00] Job handler has been shutdown
ERROR[2024-10-22T13:47:18-05:00] Failure getting pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt: failed to get namespace information for remote URL pelican://osg-htc.org/ospool/uc-shared/public/OSG-Staff/validation/test.txt: Could not unmarshall the director's response: invalid character 'M' looking for beginning of value

Oddly, if I typed the command manually, I did not receive an error. As demonstrated live to @turetske and @jhiemstrawisc, if I copied the command from the website (using the "Copy" button in the code block, or highlighting then Ctrl+C), it produced the above error. As such, I initially thought it was an issue with the website.

Later attempts to reproduce the error worked as expected; presumably the OSDF director was back online. I did change my wireless network in between tests; perhaps the first network was doing something funky?
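
For reference, the `invalid character 'M' looking for beginning of value` text in the log is what Go's `encoding/json` reports when the body starts with a character that cannot begin a JSON value. The logged 301 has `Content-Type: text/plain` and `Content-Length: 17`, which would be consistent with a body of just `Moved Permanently`, but the body itself was not logged, so that is an inference. The `Bad Gateway` body used for the `'B'` case from the original report is likewise only a guess. A minimal sketch that reproduces the messages under those assumptions (`unmarshalError` is a hypothetical helper):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unmarshalError returns the error text produced when body is decoded as
// JSON, or an empty string if decoding succeeds.
func unmarshalError(body string) string {
	var v map[string]interface{}
	if err := json.Unmarshal([]byte(body), &v); err != nil {
		return err.Error()
	}
	return ""
}

func main() {
	// Assumed body: the log shows only a 17-byte text/plain response.
	fmt.Println(unmarshalError("Moved Permanently"))
	// Assumed body: the original report only says the response was
	// probably plain text from Traefik.
	fmt.Println(unmarshalError("Bad Gateway"))
}
```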

@bbockelm
Collaborator Author

@aowen-uwmad - that's a puzzler.

The headers indicate you're talking to an HTTP server that's proxied through Cloudflare. Needless to say -- that's not where our director lives!

Is there anything special about your laptop's networking setup that might be relevant?

Next time this occurs, could you also run dig -t AAAA osdf-director.osg-htc.org and dig osdf-director.osg-htc.org to see if you're picking up a funky DNS resolution?

Another idea - could you and @williamnswanson try reproducing it for the ITB director?

@aowen-uwmad
Contributor

My laptop setup has a history of funky networking things, but it hasn't acted up in quite a while.
I can try to replicate it this afternoon.

@jhiemstrawisc jhiemstrawisc modified the milestones: v7.12.0, v7.13.0 Dec 16, 2024