
[RFD] Support for Power Management #64

Open
cjh1 opened this issue Nov 8, 2024 · 13 comments
Labels
rfd Request for Discussion

Comments

@cjh1
Member

cjh1 commented Nov 8, 2024

Support for Power Management

This RFD lays out a path to adding power management to OpenCHAMI.

What needs to change:

Currently, OpenCHAMI doesn't have a way to perform basic power management operations, such as powering on a set of nodes. A well-integrated power management service is an important component of a system management platform.

What do you propose?

The proposal is to bring the existing Cray-HPE Power Control Service into the OpenCHAMI project.

Starting from an existing code base that already integrates with SMD seems more pragmatic than starting from scratch. It also benefits sites with existing Cray-HPE hardware, which can reuse their integrations with the existing PCS API. In general, the PCS API seems to be quite functional, and many of the issues discussed below result from the implementation of the command line tools that use the PCS API rather than from deficiencies in the service itself.

In line with the transition of SMD to the OpenCHAMI project, the following set of changes would be performed initially:

  • The vendor directory will be removed and the Go version will be updated.
  • The release handling will be updated to use goreleaser and publish containers to ghcr.io.
  • The mux router will be switched out for chi to be consistent with other OpenCHAMI codebases (see the sketch below).
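
As an illustration of the router change, here is a minimal sketch of what chi-based route registration could look like. The paths, handler names, and port are hypothetical stand-ins, not the actual PCS routes:

```go
package main

import (
	"log"
	"net/http"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/chi/v5/middleware"
)

// Hypothetical handlers standing in for the real PCS handlers.
func getPowerStatus(w http.ResponseWriter, r *http.Request)   {}
func createTransition(w http.ResponseWriter, r *http.Request) {}

func describeTransition(w http.ResponseWriter, r *http.Request) {
	// chi exposes path parameters via URLParam.
	transitionID := chi.URLParam(r, "transitionID")
	_ = transitionID
}

func main() {
	r := chi.NewRouter()
	r.Use(middleware.Logger)

	// Illustrative routes only; the real PCS API paths may differ.
	r.Route("/v1", func(r chi.Router) {
		r.Get("/power-status", getPowerStatus)
		r.Post("/transitions", createTransition)
		r.Get("/transitions/{transitionID}", describeTransition)
	})

	log.Fatal(http.ListenAndServe(":8080", r))
}
```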

PCS and its tooling do have some pain points that will serve as a bug/feature list for future development.

Here are a few of the top issues raised by NERSC staff:

  • Quite frequently, the API reports success (HTTP 200) but there is an error talking to Redfish, and the underlying failure is not propagated back to the operator. In the case of SLURM, retry logic has been added to the daemon to try to overcome this flakiness. Sometimes operators have to call the Redfish interface directly, but this is rare.

    PCS is 'imperative' (go do this action) by design rather than 'declarative' (maintain this state), so it's unlikely that we would add this sort of retry logic to PCS. However, ensuring that errors are correctly propagated to the API would allow other tools to be built with a more declarative view of the system (a sketch of one way to surface these errors follows this list).

  • When interacting with BMCs in any form (via Redfish or IPMI or whatever) they don't always listen to you the first time. Some implementations of power control will re-send the same request several times and ask the BMC what it thinks happened.

    This is somewhat similar to the previous point and is probably out of scope in terms of PCS. However, PCS needs to provide accurate information in terms of how the BMCs respond to requests.

  • The PCS's view of what can be power capped is sometimes incorrect. Do we have more details on this?
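
To make the error-propagation point above more concrete, here is a sketch of one way a per-component Redfish failure could be surfaced in a transition status payload instead of disappearing behind a blanket HTTP 200. The types and field names are illustrative and do not reflect the actual PCS data model:

```go
package power

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// TaskResult is an illustrative per-component result; the real PCS
// transition/task schema may differ.
type TaskResult struct {
	Xname  string `json:"xname"`
	Status string `json:"status"`          // e.g. "succeeded" or "failed"
	Error  string `json:"error,omitempty"` // underlying Redfish/BMC error, if any
}

// recordRedfishResult captures the outcome of a single BMC call so the
// failure reaches the operator instead of being dropped.
func recordRedfishResult(xname string, err error) TaskResult {
	if err != nil {
		return TaskResult{
			Xname:  xname,
			Status: "failed",
			Error:  fmt.Sprintf("redfish request failed: %v", err),
		}
	}
	return TaskResult{Xname: xname, Status: "succeeded"}
}

// writeTransitionStatus reports partial failure explicitly: 200 only when
// everything succeeded, a 207-style mixed result otherwise.
func writeTransitionStatus(w http.ResponseWriter, results []TaskResult) {
	code := http.StatusOK
	for _, r := range results {
		if r.Status == "failed" {
			code = http.StatusMultiStatus
			break
		}
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(code)
	json.NewEncoder(w).Encode(results)
}
```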

The following fit more into the category of feature requests:

  • Progress tracking

    • The API currently provides an id for each transition, which then has to be polled with another command invocation to check the status. This is somewhat cumbersome for operators; it would be good to have an "execute and monitor" mode. This would include progress-bar features giving an idea of what percentage of nodes succeeded, failed, or are in progress, without details of which specific nodes. The point-in-time output of cray power transition describe, which you have to scroll through, is not very useful for such high-level monitoring.

    The presentation of progress is probably out of scope for PCS. However, providing an event stream associated with each transition would allow us to write more useful tools that could provide this sort of progress information without resorting to polling. One possible approach would be to add SSE or WebSockets to the API to allow a client to subscribe to specific events (see the SSE sketch after this list).

  • Retry logic on server side

    • Implement a queue and retry logic on the PCS service side.

    This is probably outside the scope of PCS as it was designed. However, it could be implemented by a service built on top of PCS.

  • Queuing of transitions

    • Currently, if a transition is issued while another transition is already in progress, the request is rejected (looking at the code this shouldn't happen, as it should lock the components with reservations?). This can happen, for example, if a SLURM command has been issued and one or more operators also issue a command. These transitions/requests could be queued, allowing them to be serialized. Operators would need the ability to view the queue of transitions/requests on a per-node basis.

    More investigation needs to be performed to understand how queuing/serialization of transitions could be implemented.
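
As mentioned under progress tracking above, one possible approach is an SSE endpoint that streams per-transition events so clients can build progress displays without polling. A minimal sketch, assuming a hypothetical per-transition event channel (this is not an existing PCS API):

```go
package power

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// TransitionEvent is an illustrative event type; a real payload would
// likely carry the operation and richer status detail from PCS.
type TransitionEvent struct {
	Xname  string `json:"xname"`
	Status string `json:"status"` // e.g. "in-progress", "succeeded", "failed"
}

// streamTransitionEvents writes Server-Sent Events until the transition's
// event channel is closed or the client disconnects.
func streamTransitionEvents(w http.ResponseWriter, r *http.Request, events <-chan TransitionEvent) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	for {
		select {
		case <-r.Context().Done(): // client went away
			return
		case ev, open := <-events:
			if !open { // transition finished
				return
			}
			data, _ := json.Marshal(ev)
			fmt.Fprintf(w, "event: transition\ndata: %s\n\n", data)
			flusher.Flush()
		}
	}
}
```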

Longer term goals

Transition to a cell-based architecture

In line with wider discussions (#41) across the collaboration, we should look at how we could transition away from a single PCS instance to multiple independent instances, for example one instance per cabinet, thus reducing the size of the failure domain. Given the imperative nature of PCS, it should be amenable to a cellular deployment.

Transition away from TRS

PCS currently uses the HMS Task Runner Service (TRS) to parallelize operations, for example sending requests to BMCs. It uses Kafka to queue tasks that can then be processed by workers. TRS doesn't seem to be under active development. Given this, it would be a good idea to move to a community-supported alternative, of which there are many. Here are just a few:

  • asynq
  • machinery (not sure how active this is)
  • taskq
  • river (a relatively new one)

An analysis will need to be performed to select an alternative that matches the needs of PCS. Moving away from TRS would allow us to leverage the features of a modern task queue and reduce the burden of having to maintain TRS along with PCS. TRS also has a "local" mode that uses goroutines; this may be enough to support the requests generated by PCS and would reduce the amount of TRS code that would need to be maintained.
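
For a sense of the shape of the change, here is a rough sketch of what enqueuing and handling a power transition might look like with asynq (Redis-backed). The task name, payload, and xname are hypothetical, and this is only meant to illustrate the pattern, not to recommend asynq over the other options:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/hibiken/asynq"
)

// PowerTransitionPayload is an illustrative payload; a real task would
// carry whatever PCS needs to drive the BMC.
type PowerTransitionPayload struct {
	Xname     string `json:"xname"`
	Operation string `json:"operation"` // e.g. "on", "off", "soft-restart"
}

const typePowerTransition = "power:transition"

func enqueueTransition(client *asynq.Client, xname, op string) error {
	payload, err := json.Marshal(PowerTransitionPayload{Xname: xname, Operation: op})
	if err != nil {
		return err
	}
	_, err = client.Enqueue(asynq.NewTask(typePowerTransition, payload))
	return err
}

func handlePowerTransition(ctx context.Context, t *asynq.Task) error {
	var p PowerTransitionPayload
	if err := json.Unmarshal(t.Payload(), &p); err != nil {
		return err
	}
	// Talk to the BMC here; returning an error lets asynq retry the task.
	log.Printf("sending %s to %s", p.Operation, p.Xname)
	return nil
}

func main() {
	redis := asynq.RedisClientOpt{Addr: "localhost:6379"}

	// Producer side: enqueue a transition request (xname is made up).
	client := asynq.NewClient(redis)
	if err := enqueueTransition(client, "x1000c0s0b0n0", "on"); err != nil {
		log.Fatal(err)
	}
	client.Close()

	// Consumer side: workers pull tasks and drive the BMCs.
	srv := asynq.NewServer(redis, asynq.Config{Concurrency: 50})
	mux := asynq.NewServeMux()
	mux.HandleFunc(typePowerTransition, handlePowerTransition)
	if err := srv.Run(mux); err != nil {
		log.Fatal(err)
	}
}
```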

Look at moving to PostgreSQL for state storage

OpenCHAMI's SMD implementation transitioned away from etcd as its persistent backend store because etcd was a big contributor to unplanned outages at LANL. This has not been our experience with PCS at NERSC. However, looking at the implementation of the storage provider for PCS, it does look like it would be amenable to a relational implementation if this were necessary. Another approach that might be worth considering is to use node-local storage with snapshotting, similar to the experiments implemented in https://github.com/OpenCHAMI/quack. This might fit nicely given that the power control state can be regenerated relatively easily.
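
Purely as an illustration of what a relational backing could look like, here is a hypothetical storage-provider sketch; the interface and schema are invented for this example and are not the actual PCS storage provider:

```go
package storage

import (
	"context"
	"database/sql"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// TransitionStore is an illustrative subset of what a storage provider
// might need; the real PCS interface differs.
type TransitionStore interface {
	SaveTransition(ctx context.Context, id, state string) error
	GetTransition(ctx context.Context, id string) (string, error)
}

type postgresStore struct{ db *sql.DB }

func NewPostgresStore(dsn string) (TransitionStore, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// Hypothetical single-table layout; power state can be regenerated,
	// so the schema can stay simple.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS transitions (
		id    TEXT PRIMARY KEY,
		state TEXT NOT NULL
	)`)
	if err != nil {
		return nil, err
	}
	return &postgresStore{db: db}, nil
}

func (s *postgresStore) SaveTransition(ctx context.Context, id, state string) error {
	_, err := s.db.ExecContext(ctx,
		`INSERT INTO transitions (id, state) VALUES ($1, $2)
		 ON CONFLICT (id) DO UPDATE SET state = EXCLUDED.state`, id, state)
	return err
}

func (s *postgresStore) GetTransition(ctx context.Context, id string) (string, error) {
	var state string
	err := s.db.QueryRowContext(ctx,
		`SELECT state FROM transitions WHERE id = $1`, id).Scan(&state)
	return state, err
}
```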

Operator-facing tools

PCS provides an API that can be used to build the operator-facing tools needed to perform power transitions. cray power is one of the current clients of PCS. Many of the issues/feature requests raised above would be addressed in client tools. The intent would be to implement a new command line interface to PCS that addresses these needs. Another RFD would be submitted to provide a detailed discussion of such a tool.

What alternatives exist?

  • Decide that power management is out of scope for OpenCHAMI and recommend integration with other tools, such as:

    • powerman
    • xCAT (rpower)
    • IPMI

    The downside of using these external tools is that they lack integration with SMD, for example creating a reservation for nodes that are being shut down.

  • Start from scratch and implement a new microservice from the ground up. This would avoid carrying any technical debt from PCS, however, it would involve significant development effort.

@cjh1 cjh1 added the rfd Request for Discussion label Nov 8, 2024
@cjh1 cjh1 mentioned this issue Nov 8, 2024
@alexlovelltroy
Member

This is great! I have a few comments.

  1. Scoping of PCS as a purely imperative server is appropriate for an initial attempt to add the service to OpenCHAMI and preserve some set of backwards compatibility, but we should open the discussion of how/if we would like it to evolve to encompass more scope. One of the issues we've seen with CSM is that the proliferation of microservices makes it hard to make a change because of the many microservices involved. If an individual module/microservice/etc isn't valuable alone, it may be too narrowly scoped.

  2. Be careful of replacing TRS with another task queuing system if we can handle the expected scale purely with goroutines. The best way to address distributed systems problems is to avoid them.

  3. On the LANL side, we're exploring some additional client paradigms that we haven't published yet. They kind of look like the kubectl ability to extend the CLI with additional modules as needed. We should collaborate on an RFD to describe it when the time comes.

@cjh1
Member Author

cjh1 commented Nov 8, 2024

  1. Scoping of PCS as a purely imperative server is appropriate for an initial attempt to add the service to OpenCHAMI and preserve some set of backwards compatibility, but we should open the discussion of how/if we would like it to evolve to encompass more scope. One of the issues we've seen with CSM is that the proliferation of microservices makes it hard to make a change because of the many microservices involved. If an individual module/microservice/etc isn't valuable alone, it may be too narrowly scoped.

Absolutely, I was just trying to keep the scope of this RFD manageable, but I think you make a good point.

  2. Be careful of replacing TRS with another task queuing system if we can handle the expected scale purely with goroutines. The best way to address distributed systems problems is to avoid them.

Agreed, the best option would be to use goroutines. The first step will be to check whether they can support the load; if they can, we are done.

  3. On the LANL side, we're exploring some additional client paradigms that we haven't published yet. They kind of look like the kubectl ability to extend the CLI with additional modules as needed. We should collaborate on an RFD to describe it when the time comes.

Yes, I saw that you had started to add some CLIs. Having a single command with subcommands/modules is a good approach. I am a fan of Typer and have used it to build CLIs like that before. I would be happy to collaborate on an RFD for that.

@jwlv

jwlv commented Jan 10, 2025

Just a quick note on a few things.

PCS was recently updated to use the latest version of Go (1.23).  Many of its image and module dependencies have also been updated.

Some additional detail on how TRS is currently used.  There are two modes of operation:

  1. "Local" - Uses retryablehttp within goroutines
  2. "Remote" - Uses Kafka

Remote mode was created to increase scalability but has not yet been adopted due to its added complexity.  Unless local mode shows itself to be inadequate, remote mode will likely remain a dormant feature.

Local mode has actually seen some substantial changes over the last few months to improve stability and scalability.  When those changes were tested, TRS showed no issue handling 24,000 concurrent connections within the simplified unit test framework that runs during builds in GitHub.  The limitations that capped this number came from the GitHub build framework itself rather than from any limitation within TRS.  I suspect we'd hit PCS scaling issues before we hit TRS scaling issues.

We were also able to add support for, and configuration of, active thread pools within TRS.  Provided remote BMC http servers are configured properly, TRS is now capable of keeping connections open rather than recycling them after every http request.
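
To illustrate the local-mode pattern described above (go-retryablehttp inside goroutines, with a shared transport that keeps connections open between requests), here is a rough sketch; it is illustrative only and not actual TRS code:

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"

	"github.com/hashicorp/go-retryablehttp"
)

// queryBMCs fans a GET out to each BMC endpoint concurrently, retrying
// transient failures and reusing connections via a shared transport.
func queryBMCs(endpoints []string) {
	client := retryablehttp.NewClient()
	client.RetryMax = 3
	client.HTTPClient.Timeout = 30 * time.Second
	// Keep connections open between requests instead of recycling them.
	client.HTTPClient.Transport = &http.Transport{
		MaxIdleConns:        500,
		MaxIdleConnsPerHost: 4,
		IdleConnTimeout:     90 * time.Second,
	}

	var wg sync.WaitGroup
	for _, url := range endpoints {
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			resp, err := client.Get(url)
			if err != nil {
				log.Printf("%s: %v", url, err) // surface the failure per endpoint
				return
			}
			resp.Body.Close()
			log.Printf("%s: %s", url, resp.Status)
		}(url)
	}
	wg.Wait()
}
```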

The goal of TRS is to reduce duplication and maintenance costs by providing a single highly scalable http communication mechanism for all HMS services within CSM.  To date, only PCS and FAS (Firmware Action Service) have migrated to TRS.  We would see benefit in migrating additional services like SMD and hmcollector as well.

@cjh1
Member Author

cjh1 commented Jan 10, 2025

@jwlv Thanks for this information, this is really useful. One of the things that we were going to do was to see how far we could get with just using the "local" version of TRS; it seems that you have already answered that question. Given that the "remote" version is a dormant feature, is there any plan to deprecate it, or is it used by other projects? One reason I ask is that we ran into an issue building on ARM because of the version of librdkafka, so if we could remove the Kafka dependency that would avoid problems going forward. More generally, is there a plan to continue supporting TRS going forward?

@jwlv

jwlv commented Jan 10, 2025

Given that the "remote" version is a dormant feature, is there any plan to deprecate it, or is it used by other projects? One reason I ask is that we ran into an issue building on ARM because of the version of librdkafka, so if we could remove the Kafka dependency that would avoid problems going forward.

Funny you mention that. Just this week I tried to build locally on my M4 MacBook Pro and it failed because of that exact same issue. The version TRS is referencing is quite old. I checked the latest version and found they've since added ARM support. I've already created an internal ticket to update TRS to the latest version which should resolve that issue for us.

More generally, is there a plan to continue supporting TRS going forward?

The plan is to continue supporting TRS for the indefinite future.

@cjh1
Member Author

cjh1 commented Jan 10, 2025

Funny you mention that. Just this week I tried to build locally on my M4 MacBook Pro and it failed because of that exact same issue. The version TRS is referencing is quite old. I checked the latest version and found they've since added ARM support. I've already created an internal ticket to update TRS to the latest version which should resolve that issue for us.

Excellent, I will look out for that change so I can bump our version as well.

@jwlv

jwlv commented Jan 10, 2025

One of the things that we were going to do was to see how far we could get with just using the "local" version of TRS

There were recent tests added to pkg/trs_http_api/trshttp_local_test.go that should probably be moved to Test/testApp.go as they are substantially more than just simple unit tests. Doing so may also allow us to get past the GitHub build constraints surrounding unit tests and see how far we can push TRS. Note the TestConnsWithHttpTxPolicy_PcsHugeBusy() test, where we could simply bump 24,000 up to a higher value.

@cjh1
Member Author

cjh1 commented Jan 10, 2025

There were recent tests added to pkg/trs_http_api/trshttp_local_test.go that should probably be moved to Test/testApp.go as they are substantially more than just simple unit tests. Doing so may also allow us to get past the GitHub build constraints surrounding unit tests and see how far we can push TRS.

I see the value of moving the code out of the unit tests if they are more of a stress/load test. However, how would that help with "get past the GitHub build constraints"? They would still be run in a GitHub Actions workflow. Or are you talking about running them in another environment?

@jwlv

jwlv commented Jan 10, 2025

I see the value of moving the code out of the unit tests if they are more of a stress/load test. However, how would that help with "get past the GitHub build constraints"? They would still be run in a GitHub Actions workflow. Or are you talking about running them in another environment?

Well, I was hoping a different workflow would be less restrictive, or at least allow us to configure it differently while leaving the unit test workflow as is. It would also allow us to run it in a different environment.

@alexlovelltroy
Member

What is the benefit to TRS in local mode over using built-in concurrency from the go language?

@jwlv

jwlv commented Jan 16, 2025

What is the benefit to TRS in local mode over using built-in concurrency from the go language?

TRS local mode fully leverages the built-in concurrency of the go language. It could have been built directly into PCS but there was value in providing its functionality via an external module so that it could be reused by other HMS services to reduce duplication and maintenance costs.

TRS leverages the external module go-retryablehttp for automatic retries under certain conditions. That code base isn't complex or substantial but is fairly general purpose, so we've had to add workarounds for certain issues within TRS to account for our specific use case requirements. There might be value in considering replacing it with our own custom implementation so that we can remove those workarounds.

@jwlv

jwlv commented Jan 17, 2025

Just FYI that changes have been pushed to both TRS and PCS which enable ARM builds. Numerous module and image dependencies were also updated to pick up fixes and resolve security issues.

@cjh1
Member Author

cjh1 commented Jan 21, 2025

Thanks @jwlv we will look at picking up the changes.
