Being able to scrape Prometheus metrics during graceful shutdown from management endpoints #41002

joshiste · 2024-06-06T11:14:06Z

I'll try to describe our use case and the problem we have:

We're using Prometheus to scrape the metrics.
We set server.port=8080 and management.server.port=9090, so a second HTTP server is used.
Stopping the application gracefully can take a while because the app has long-running processes that we wait on.
While waiting on these, we want the default server to be shutting down but the management server to stay up, so that we can still scrape the metrics.
Currently, the management server is started after and stopped before the default server, which prevents this. The phases/order of the servers cannot be changed in any way.

I fully acknowledge that the current order exists so that the health endpoints are not served before the default server is up, and that, as discussed in #31714, the phases must be configured carefully and are easy to get wrong. But I'd love to have some way to change the order (e.g. by subclassing).


wilkinsona · 2024-06-06T14:51:00Z

This isn't really related to the lifecycle phases, as they're not involved in closing the management context. That is done by org.springframework.boot.actuate.autoconfigure.web.server.ChildManagementContextInitializer.CloseManagementContextListener in response to the parent context's ContextClosedEvent.

Unfortunately, I think it will be quite difficult to allow the ordering to be changed, as we'd have to move away from using the ContextClosedEvent to close the management context. A Lifecycle or SmartLifecycle would seem like an obvious choice, as the phase could then be configured, but the application context does not expose the state of its closed flag. So I don't think it would be possible for us to distinguish between a stop() call that should just stop() the management context and a stop() call that should close() it.
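
To make the ordering problem concrete, here is a minimal toy model in plain Java. It is not Spring or Spring Boot code: `CloseOrderModel`, `Component`, the listener name, and the phase values are all hypothetical. It assumes, as Spring Framework's `AbstractApplicationContext` does, that the closed event is published before lifecycle beans are stopped, and that lifecycle beans stop in descending phase order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model (not Spring code) of the order of steps when a context closes. */
class CloseOrderModel {

    /** A stoppable component with a phase; components with higher phases stop first. */
    static final class Component {
        final String name;
        final int phase;

        Component(String name, int phase) {
            this.name = name;
            this.phase = phase;
        }
    }

    /** Returns the steps performed, in order, when the modelled context is closed. */
    static List<String> close(List<String> closedEventListeners, List<Component> lifecycleComponents) {
        List<String> steps = new ArrayList<>();
        // Step 1 (assumption): the "closed" event is published first, so its listeners run first.
        for (String listener : closedEventListeners) {
            steps.add("event:" + listener);
        }
        // Step 2 (assumption): lifecycle components are then stopped, highest phase first.
        List<Component> byPhaseDescending = new ArrayList<>(lifecycleComponents);
        byPhaseDescending.sort(Comparator.comparingInt((Component c) -> c.phase).reversed());
        for (Component c : byPhaseDescending) {
            steps.add("stop:" + c.name);
        }
        return steps;
    }

    public static void main(String[] args) {
        // Hypothetical phases: graceful shutdown stops before the main web server itself.
        List<String> steps = close(
                List.of("closeManagementContext"),
                List.of(new Component("mainWebServer", 1), new Component("gracefulShutdown", 2)));
        steps.forEach(System.out::println);
    }
}
```

In this model the management context is always closed in step 1, whatever phases the components in step 2 use, which is why adjusting phases alone would not keep the management server up during graceful shutdown.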

wilkinsona · 2024-06-17T16:18:45Z

I've opened spring-projects/spring-framework#33058 to see if Framework could make the application context's close state accessible to us.

jonatan-ivanov · 2024-06-17T19:27:19Z

I think one alternative solution to this could be using Prometheus RSocket Proxy (but you need to deploy an extra component in your infrastructure).

In the use case above, if Prometheus does not scrape while the long-running processes are running, or one or more of the scrapes fail, or Prometheus does not scrape often enough, I think you can end up in a similar situation even if the management endpoint is still able to accept traffic.

With the Prometheus RSocket Proxy, the Proxy can scrape the app and the app can also send data to the Proxy (which is scraped by Prometheus later). So if the ordering is right, your app can send the latest data to the Proxy right before the process stops (after your long-running process has finished its job).
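
The "push the latest data right before the process stops" ordering could look roughly like the sketch below. It is plain Java and does not use the Prometheus RSocket Proxy client: `FinalPushSketch`, the `pusher` callback, and the metric name `processed_items_total` are hypothetical stand-ins for whatever actually sends data to the proxy.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Toy sketch (hypothetical names, no real proxy client) of pushing a final snapshot at shutdown. */
class FinalPushSketch {

    private final AtomicLong processedItems = new AtomicLong();

    // Stands in for "send data to the proxy"; an assumption, not a real API.
    private final Consumer<String> pusher;

    FinalPushSketch(Consumer<String> pusher) {
        this.pusher = pusher;
    }

    /** Simulates the long-running work that graceful shutdown waits for. */
    void finishLongRunningWork(int remainingItems) {
        for (int i = 0; i < remainingItems; i++) {
            processedItems.incrementAndGet();
        }
    }

    /** Shutdown sequence: finish the work first, then push the latest value exactly once. */
    void shutdown(int remainingItems) {
        finishLongRunningWork(remainingItems);
        pusher.accept("processed_items_total " + processedItems.get());
    }
}
```

The point of the ordering is that the push happens after the work completes, so the last value sent does not depend on when Prometheus happens to scrape.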

wilkinsona · 2024-06-20T10:04:21Z

Framework 6.2 now provides an isClosed() accessor backed by its closed flag. That means that we may be able to rework things here so that the separate management context is closed as part of a lifecycle implementation rather than in response to the ContextClosedEvent. We can investigate further once we've created the 3.3.x branch and main has upgraded to Framework 6.2.0-M5 or its snapshots.
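
As a rough illustration of what such a lifecycle implementation might decide, here is a toy model in plain Java. It is not the Spring Boot implementation: `ManagementContextLifecycleModel` is hypothetical, and the `parentClosed` supplier stands in for calling `isClosed()` on the parent context, under the assumption that it already returns true while the parent's close is in progress.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/** Toy model (not Spring Boot code): close the child context only when the parent is closing. */
class ManagementContextLifecycleModel {

    // Stands in for the parent context's isClosed(); an assumption for illustration.
    private final BooleanSupplier parentClosed;

    private final List<String> actions = new ArrayList<>();

    private boolean running;

    ManagementContextLifecycleModel(BooleanSupplier parentClosed) {
        this.parentClosed = parentClosed;
    }

    void start() {
        this.running = true;
        this.actions.add("start");
    }

    /** Called for both a plain stop and a stop that is part of closing the parent. */
    void stop() {
        if (!this.running) {
            return;
        }
        this.running = false;
        // The closed flag is what lets the two kinds of stop() call be told apart.
        this.actions.add(this.parentClosed.getAsBoolean() ? "close" : "stop");
    }

    /** Returns a copy of the actions taken so far, in order. */
    List<String> actions() {
        return new ArrayList<>(this.actions);
    }
}
```

Because the decision is made inside a lifecycle callback rather than an event listener, the point at which the management context goes away could then be governed by a configurable phase.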

spring-projects-issues added the "status: waiting-for-triage" label Jun 6, 2024

wilkinsona added the "for: team-meeting" label Jun 6, 2024

philwebb removed the "for: team-meeting" label Jun 17, 2024

philwebb assigned wilkinsona Jun 17, 2024

wilkinsona mentioned this issue Jun 17, 2024

Provide a way to determine if a context is in the process of being closed spring-projects/spring-framework#33058

Closed
