Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Downtime set by origin and cache #1996

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

h2zh
Copy link
Collaborator

@h2zh h2zh commented Feb 6, 2025

Now origin and cache server can set their own downtime in the director by simply sending a request with payload {"enableDowntime": true/false} to API /api/v1.0/downtime. Then this PR will take care of the setting in the backend.

This is a backend change. Frontend UI is not included.

Closes #1251

@h2zh h2zh added cache Issue relating to the cache component origin Issue relating to the origin component labels Feb 6, 2025
@h2zh h2zh added this to the v7.14 milestone Feb 6, 2025
@h2zh h2zh requested a review from CannonLock February 6, 2025 23:53
@h2zh h2zh force-pushed the downtime-set-by-origin-and-cache branch from c7714da to 4fc17d3 Compare February 7, 2025 16:46
Copy link
Contributor

@CannonLock CannonLock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small items and one big one that might result in some refactoring.

It is unwieldy to work with one off endpoints like this that only are there to mark a boolean. Is there any way that we can design this more RESTfully so that we are instead POSTing a body like {downtime: true/false} to /director/servers/.

This way if we ever want to adjust a servers state we can just tack it on the same POST request rather then creating a whole new endpoint.

I realize this is not easy as the ServerName in this case is a url which is a unwieldy url key. Is there anyway that we can get the ServerAd from the Servername?

There are other endpoints in Pelican that follow this model that I hope to change someday as well: https://github.com/orgs/PelicanPlatform/discussions/1688

return
}
ctx.JSON(http.StatusOK, server_structs.SimpleApiResp{
Status: server_structs.RespFailed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Status: server_structs.RespFailed,
Status: server_structs.RespOK,

// Construct the request body
downtimeRequest := server_structs.DowntimeRequest{
ServerUrl: serverUrl,
Hostname: hostname,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why pass in Hostname if you don't use it?

Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this approach will work:

  1. The information is sent once, meaning the director has to actually be up and functional or the downtime is lost. That conflicts with the whole idea of periodically sending the service's state to the director.
  • Instead of adding the new API, make the service state part of the ad and send out a new ad. Basically, adding a new field to what we tell the director about ourselves.
  1. If you restart the origin/cache, it doesn't know what state it is in. That's fundamental concept: the service is an independent entity, it can't use the director as its "memory".
  • This information needs to be persisted at the service level. Make sure to talk to @CannonLock about what information should go into a downtime (it's not a boolean - e.g., there should be a start time, end time, who created the downtime, a description. We should model loosely on the downtimes we create in topology)..

Also - there's a new API and no corresponding documentation in the OpenAPI schema.

@bbockelm
Copy link
Collaborator

Additionally - when creating commit messages, please follow the advice starting in this paragraph:

  • The subject should state what the commit does.
  • The body should cover why the change is needed.
  • If the change is particularly complex, then you can cover the details of what the change is/how the change works.

Not every commit message is going to be perfect but we should always strive to better.

@h2zh h2zh modified the milestones: v7.14, v7.15 Feb 11, 2025
@CannonLock CannonLock assigned h2zh and unassigned CannonLock Feb 17, 2025
@h2zh
Copy link
Collaborator Author

h2zh commented Feb 20, 2025

Proposed system design based on @bbockelm's comment and conversation with @CannonLock.
downtime (3)

Rules between OSDF Federation admin and Origin/Cache admin: (checked with @williamnswanson)

  1. The OSDF admin can set any server to downtime.
  2. If the Origin administrator sets its own downtime, then OSDF admin cannot revoke the downtime of that origin.

Some design thoughts:

  1. Need to create a db for Cache to persist the downtime there, while Cache server is stateless at the moment - thinking of implementing it in another PR in case this one goes too giant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cache Issue relating to the cache component origin Issue relating to the origin component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Origin and cache should be able to control downtime
3 participants