Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Being able to scrape Prometheus metrics during graceful shutdown from management endpoints #41002

Open
joshiste opened this issue Jun 6, 2024 · 4 comments
Assignees
Labels
status: waiting-for-triage An issue we've not yet triaged

Comments

@joshiste
Copy link
Contributor

joshiste commented Jun 6, 2024

I try to describe our use case and the problem we have:

We're using Prometheus to scrape the metrics.
We set server.port=8080 and management.server.port=9090 (hence a second http server is used).
Stopping the application gracefully can take longer since the app has long-running processes that we're waiting on.
While waiting on these, we want the default server to be shutting down, but the management server to be up, so we can still scrape the metrics.
Currently, the management server is started after and stopped before the default server, preventing this. The phases/order for the servers cannot changed in any way.

I totally acknowledge that the current order is the way it is, to not serve the health endpoints before the default server is up. And as discussed in #31714 that the phases must be well configured and are easy to get wrong. But I'd love to have some kind of possibility to change the order (e.g. by subclassing).

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label Jun 6, 2024
@wilkinsona
Copy link
Member

wilkinsona commented Jun 6, 2024

This isn't really related to the lifecycle phases as they're not involved in closing the management context which is done by org.springframework.boot.actuate.autoconfigure.web.server.ChildManagementContextInitializer.CloseManagementContextListener in response to the parent context's ContextClosedEvent.

Unfortunately, I think it will be quite difficult to allow the ordering to be changed as we'd have to move away from using the ContextClosedEvent to close the management context. A Lifecycle or SmartLifecycle would seem like an obvious choice as the phase could then be configured but the application context does not expose the state of its closed flag so I don't think it would be possible for us to distinguish between a stop() call that should just stop() the management context and a stop() call that should close() it.

@wilkinsona wilkinsona added the for: team-meeting An issue we'd like to discuss as a team to make progress label Jun 6, 2024
@philwebb philwebb removed the for: team-meeting An issue we'd like to discuss as a team to make progress label Jun 17, 2024
@wilkinsona
Copy link
Member

I've opened spring-projects/spring-framework#33058 to see if Framework could make the application context's close state accessible to us.

@jonatan-ivanov
Copy link
Member

I think one alternative solution to this could be using Prometheus RSocket Proxy (but you need to deploy an extra component in your infrastructure).

In the use-case above, if Prometheus does not scrape while the long-running processes is running, or one/some of the scrapes fail or Prometheus is not scraping enough, I think you can be in a similar situation even if the management endpoint is still able to accept traffic.

In case of the Prometheus RSocket Proxy, both the Proxy can scrape the app and the app can also send data to the Proxy (that is scraped by Prometheus later). So if the ordering is right, your app can send the latest data to the Proxy right before the process stops (after your long-running process finished its job).

@wilkinsona
Copy link
Member

wilkinsona commented Jun 20, 2024

Framework 6.2 now provides an isClosed() accessor backed by its closed flag. That means that we may be able to rework things here so that the separate management context is closed as part of a lifecycle implementation rather than in response to the ContextClosedEvent. We can investigate further once we've created the 3.3.x branch and main has upgraded to Framework 6.2.0-M5 or its snapshots.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: waiting-for-triage An issue we've not yet triaged
Projects
None yet
Development

No branches or pull requests

5 participants