
[Housekeeping] Ensure accurate ActiveNodeExecutions and ActiveTaskExecutions metrics #4593

Closed
2 tasks done
hamersaw opened this issue Dec 13, 2023 · 4 comments
Assignees
Labels
backlogged (For internal use. Reserved for contributor team workflow.), exo, housekeeping (Issues that help maintain flyte and keep it tech-debt free)

Comments

@hamersaw
Contributor

Describe the issue

Currently, the only ActiveNodeExecutions and ActiveTaskExecutions metrics exported come from flyteadmin. This can result in inaccuracies if, for example, flyteadmin restarts or there is a rare inconsistency between data plane and control plane state (e.g., a FlyteWorkflow CR is manually deleted).

What if we do not do this?

If we do nothing, then these metrics are often inaccurate and therefore useless.

Related component(s)

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@hamersaw added the housekeeping, exo, and backlogged labels on Dec 13, 2023

dosubot bot commented Dec 13, 2023

🤖

Hello @hamersaw!

I'm Dosu, a bot designed to assist you with bugs, answer your queries, and help you journey through the world of open-source contribution. While waiting for a human maintainer, I'm here to make sure your issue doesn't go unnoticed.

I'm taking a quick look at your issue regarding the ActiveNodeExecutions and ActiveTaskExecutions metrics. We'll delve into it in a moment. Thanks for your patience!

@hamersaw
Contributor Author

One proposal is to move these metrics to FlytePropeller so they can more accurately reflect actual execution status. A relatively high-level design is as follows:

  • Add NodeExecutionCount and TaskExecutionCount values to the ExecutionContext through the ControlFlow struct and increment these accordingly as FlytePropeller progresses through DAG execution (entrypoint here); see the sketch after this list.
  • Maintain an in-memory mapping of execution_id to node and task execution counts that is updated as FlytePropeller evaluates executions. Each update will in turn update the Prometheus gauge metrics.
  • Use a separate goroutine to periodically iterate over the in-memory mapping and remove executions that have been manually deleted (decrementing the gauge metrics).
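A minimal Go sketch of the first item, assuming hypothetical counter fields and increment methods on a ControlFlow struct (these names do not necessarily match the actual flytepropeller definition):

```go
// Hypothetical sketch only; not the actual flytepropeller ControlFlow definition.
package executioncontext

import "sync/atomic"

// ControlFlow carries execution-scoped bookkeeping through the ExecutionContext.
type ControlFlow struct {
	nodeExecutionCount uint32
	taskExecutionCount uint32
}

// IncrementNodeExecutionCount is called as each node execution is started
// while FlytePropeller walks the DAG.
func (c *ControlFlow) IncrementNodeExecutionCount() uint32 {
	return atomic.AddUint32(&c.nodeExecutionCount, 1)
}

// IncrementTaskExecutionCount is called as each task execution is started.
func (c *ControlFlow) IncrementTaskExecutionCount() uint32 {
	return atomic.AddUint32(&c.taskExecutionCount, 1)
}

// CurrentNodeExecutionCount exposes the running total for metric reporting.
func (c *ControlFlow) CurrentNodeExecutionCount() uint32 {
	return atomic.LoadUint32(&c.nodeExecutionCount)
}

// CurrentTaskExecutionCount exposes the running total for metric reporting.
func (c *ControlFlow) CurrentTaskExecutionCount() uint32 {
	return atomic.LoadUint32(&c.taskExecutionCount)
}
```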

With this solution, the metrics will only be eventually consistent across FlytePropeller restarts; however, since the workflow re-evaluation duration defaults to 30s, they should converge within that timeframe. A sketch of the in-memory tracking and cleanup follows.
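A hedged sketch of the second and third items, assuming illustrative names (executionMetrics, Track, Remove, Sweep) and prometheus/client_golang gauges; this is not the actual implementation:

```go
// Illustrative sketch only; names and structure are assumptions, not the
// actual flytepropeller metrics implementation.
package metrics

import (
	"context"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type executionCounts struct {
	nodes uint32
	tasks uint32
}

// executionMetrics maps execution_id -> counts and mirrors the totals into
// Prometheus gauges.
type executionMetrics struct {
	mu         sync.Mutex
	executions map[string]executionCounts

	activeNodeExecutions prometheus.Gauge
	activeTaskExecutions prometheus.Gauge
}

func newExecutionMetrics(reg prometheus.Registerer) *executionMetrics {
	m := &executionMetrics{
		executions: map[string]executionCounts{},
		activeNodeExecutions: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "active_node_executions", Help: "Currently active node executions.",
		}),
		activeTaskExecutions: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "active_task_executions", Help: "Currently active task executions.",
		}),
	}
	reg.MustRegister(m.activeNodeExecutions, m.activeTaskExecutions)
	return m
}

// Track records the latest counts observed while evaluating an execution and
// refreshes the gauges.
func (m *executionMetrics) Track(executionID string, nodes, tasks uint32) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.executions[executionID] = executionCounts{nodes: nodes, tasks: tasks}
	m.refresh()
}

// Remove drops an execution that completed or whose FlyteWorkflow CR is gone.
func (m *executionMetrics) Remove(executionID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.executions, executionID)
	m.refresh()
}

// refresh recomputes the gauge values from the mapping; callers hold the lock.
func (m *executionMetrics) refresh() {
	var nodes, tasks uint32
	for _, c := range m.executions {
		nodes += c.nodes
		tasks += c.tasks
	}
	m.activeNodeExecutions.Set(float64(nodes))
	m.activeTaskExecutions.Set(float64(tasks))
}

// Sweep periodically removes executions whose FlyteWorkflow CR no longer
// exists (e.g. manually deleted), decrementing the gauges as described above.
func (m *executionMetrics) Sweep(ctx context.Context, interval time.Duration, exists func(executionID string) bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			m.mu.Lock()
			for id := range m.executions {
				if !exists(id) {
					delete(m.executions, id)
				}
			}
			m.refresh()
			m.mu.Unlock()
		}
	}
}
```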

@Future-Outlier
Member

@hamersaw Are you working on this issue?
If not, can I try it?

@hamersaw
Contributor Author

hamersaw commented Jan 8, 2024

@Future-Outlier this is already being addressed, will assign accordingly. Thanks for the initiative!
