
[Housekeeping] Ensure accurate ActiveNodeExecutions and ActiveTaskExecutions metrics #4593

Closed
2 tasks done
hamersaw opened this issue Dec 13, 2023 · 4 comments
Assignees
Labels
backlogged (For internal use. Reserved for contributor team workflow.), exo, housekeeping (Issues that help maintain flyte and keep it tech-debt free)

Comments

@hamersaw
Contributor

Describe the issue

Currently, the only ActiveNodeExecutions and ActiveTaskExecutions metrics exported come from flyteadmin. This can result in inaccuracies if, for example, flyteadmin restarts or there is a rare inconsistency between data plane and control plane state (e.g., a FlyteWorkflow CR is manually deleted).

What if we do not do this?

If we do nothing, then these metrics are often inaccurate and therefore useless.

Related component(s)

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw added the housekeeping, exo, and backlogged labels on Dec 13, 2023

dosubot bot commented Dec 13, 2023

🤖

Hello @hamersaw!

I'm Dosu, a bot designed to assist you with bugs, answer your queries, and help you journey through the world of open-source contribution. While waiting for a human maintainer, I'm here to make sure your issue doesn't go unnoticed.

I'm taking a quick look at your issue regarding the ActiveNodeExecutions and ActiveTaskExecutions metrics. We'll delve into it in a moment. Thanks for your patience!

@hamersaw
Contributor Author

One proposal is to move these metrics to FlytePropeller so they can more accurately reflect actual execution status. A relatively high-level design is as follows:

  • Add NodeExecutionCount and TaskExecutionCount values to the ExecutionContext through the ControlFlow struct and increment these accordingly as FlytePropeller progresses through DAG execution (entrypoint here); see the sketch after this list.
  • Maintain an in-memory mapping of execution_id to node and task execution counts that is updated as FlytePropeller evaluates executions. Each update will in turn update the Prometheus gauge metrics.
  • Use a separate goroutine to periodically iterate over the in-memory mapping and remove executions that have been manually deleted (decrementing the gauge metrics).
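A minimal Go sketch of the first item, assuming hypothetical counter fields and increment methods on a ControlFlow struct (these names do not necessarily match the actual flytepropeller definition):

```go
// Hypothetical sketch only; not the actual flytepropeller ControlFlow definition.
package executioncontext

import "sync/atomic"

// ControlFlow carries execution-scoped bookkeeping through the ExecutionContext.
type ControlFlow struct {
	nodeExecutionCount uint32
	taskExecutionCount uint32
}

// IncrementNodeExecutionCount is called as each node execution is started
// while FlytePropeller walks the DAG.
func (c *ControlFlow) IncrementNodeExecutionCount() uint32 {
	return atomic.AddUint32(&c.nodeExecutionCount, 1)
}

// IncrementTaskExecutionCount is called as each task execution is started.
func (c *ControlFlow) IncrementTaskExecutionCount() uint32 {
	return atomic.AddUint32(&c.taskExecutionCount, 1)
}

// CurrentNodeExecutionCount exposes the running total for metric reporting.
func (c *ControlFlow) CurrentNodeExecutionCount() uint32 {
	return atomic.LoadUint32(&c.nodeExecutionCount)
}

// CurrentTaskExecutionCount exposes the running total for metric reporting.
func (c *ControlFlow) CurrentTaskExecutionCount() uint32 {
	return atomic.LoadUint32(&c.taskExecutionCount)
}
```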

With this solution, the metrics will only be eventually consistent across FlytePropeller restarts; however, since the workflow re-evaluation duration defaults to 30s, they should converge within that timeframe. A sketch of the in-memory tracking and cleanup follows.
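A hedged sketch of the second and third items, assuming illustrative names (executionMetrics, Track, Remove, Sweep) and prometheus/client_golang gauges; this is not the actual implementation:

```go
// Illustrative sketch only; names and structure are assumptions, not the
// actual flytepropeller metrics implementation.
package metrics

import (
	"context"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type executionCounts struct {
	nodes uint32
	tasks uint32
}

// executionMetrics maps execution_id -> counts and mirrors the totals into
// Prometheus gauges.
type executionMetrics struct {
	mu         sync.Mutex
	executions map[string]executionCounts

	activeNodeExecutions prometheus.Gauge
	activeTaskExecutions prometheus.Gauge
}

func newExecutionMetrics(reg prometheus.Registerer) *executionMetrics {
	m := &executionMetrics{
		executions: map[string]executionCounts{},
		activeNodeExecutions: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "active_node_executions", Help: "Currently active node executions.",
		}),
		activeTaskExecutions: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "active_task_executions", Help: "Currently active task executions.",
		}),
	}
	reg.MustRegister(m.activeNodeExecutions, m.activeTaskExecutions)
	return m
}

// Track records the latest counts observed while evaluating an execution and
// refreshes the gauges.
func (m *executionMetrics) Track(executionID string, nodes, tasks uint32) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.executions[executionID] = executionCounts{nodes: nodes, tasks: tasks}
	m.refresh()
}

// Remove drops an execution that completed or whose FlyteWorkflow CR is gone.
func (m *executionMetrics) Remove(executionID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.executions, executionID)
	m.refresh()
}

// refresh recomputes the gauge values from the mapping; callers hold the lock.
func (m *executionMetrics) refresh() {
	var nodes, tasks uint32
	for _, c := range m.executions {
		nodes += c.nodes
		tasks += c.tasks
	}
	m.activeNodeExecutions.Set(float64(nodes))
	m.activeTaskExecutions.Set(float64(tasks))
}

// Sweep periodically removes executions whose FlyteWorkflow CR no longer
// exists (e.g. manually deleted), decrementing the gauges as described above.
func (m *executionMetrics) Sweep(ctx context.Context, interval time.Duration, exists func(executionID string) bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			m.mu.Lock()
			for id := range m.executions {
				if !exists(id) {
					delete(m.executions, id)
				}
			}
			m.refresh()
			m.mu.Unlock()
		}
	}
}
```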

@Future-Outlier
Member

@hamersaw Are you working on this issue?
If not, can I try it?

@hamersaw
Contributor Author

hamersaw commented Jan 8, 2024

@Future-Outlier this is already being addressed, will assign accordingly. Thanks for the initiative!
