Add tracking for active node and task execution counts in propeller #4986

Merged
merged 5 commits into from
Apr 4, 2024

Conversation

sshardool

@sshardool sshardool commented Feb 29, 2024

Tracking issue

#4593

Why are the changes needed?

Addresses the following point from the issue description:

Currently the only ActiveNodeExecutions and ActiveTaskExecutions metrics exported are from flyteadmin. 
This can result in inaccuracies if, for example, flyteadmin restarts or there is an uncommon inconsistency between data plane and control plane state (e.g. a FlyteWorkflow CR is manually deleted).

What changes were proposed in this pull request?

This PR adds metrics for active node and task executions to flytepropeller. The metrics are eventually consistent: they are refreshed when a workflow is re-evaluated, on the expectation that the workflow re-evaluation duration is an acceptable delay.

How was this patch tested?

Tested on a dev Flyte setup (single node).
Testing is in progress in larger-scale internal deployments.

Sample of the new metrics:

flyte:propeller:flyte_dev:execstats:active_node_executions 9
flyte:propeller:flyte_dev:execstats:active_task_executions 7
flyte:propeller:flyte_dev:execstats:active_workflow_executions 3


welcome bot commented Feb 29, 2024

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Feb 29, 2024
@sshardool sshardool marked this pull request as draft February 29, 2024 21:41

codecov bot commented Mar 1, 2024

Codecov Report

Attention: Patch coverage is 44.80519%, with 85 lines in your changes missing coverage. Please review.

Project coverage is 59.06%. Comparing base (37255a1) to head (38efa8d).
Report is 1 commit behind head on master.

Files                                                   Patch %   Lines
...er/pkg/controller/workflowstore/execution_stats.go   32.60%    60 Missing and 2 partials ⚠️
...ller/pkg/controller/executors/execution_context.go   37.50%    10 Missing ⚠️
flytepropeller/pkg/controller/controller.go             0.00%     8 Missing ⚠️
flytepropeller/pkg/controller/workflow/executor.go      83.33%    3 Missing and 2 partials ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4986      +/-   ##
==========================================
- Coverage   59.10%   59.06%   -0.04%     
==========================================
  Files         645      646       +1     
  Lines       55581    55714     +133     
==========================================
+ Hits        32851    32910      +59     
- Misses      20136    20208      +72     
- Partials     2594     2596       +2     
Flag Coverage Δ
unittests 59.06% <44.80%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


@sshardool sshardool force-pushed the origin/activenodes branch from 8a948ce to 1254afa Compare March 11, 2024 07:34
@sshardool sshardool force-pushed the origin/activenodes branch from 1254afa to 52c023b Compare March 12, 2024 06:04
@sshardool sshardool marked this pull request as ready for review March 12, 2024 06:11
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Mar 12, 2024
Signed-off-by: Shardool <[email protected]>
@sshardool sshardool changed the title [Draft] Add tracking for active node and task execution counts in propeller Add tracking for active node and task execution counts in propeller Mar 14, 2024
Signed-off-by: Paul Dittamo <[email protected]>
@pvditt

pvditt commented Apr 2, 2024

@sshardool this PR is great.

Have you noticed any increase in latency for workflow executions with this running? How many workers/threads are running for your propeller deployments? I figure lock contention on ExecutionStatsHolder updates should only add on the order of nanoseconds to propeller loops, so it shouldn't be a concern.

We might want to run this on an internal cluster for a little bit prior to merging, but this PR LGTM.

@sshardool

Thanks for the review, @pvditt. We currently have this running with 4 worker threads and have not seen any increase in execution latencies. You're right: the critical section for all operations that run inline with the work-queue tasks is very small (a single map entry update). Let me know if an additional metric tracking the wait duration (i.e., the time taken for mutex acquisition) in ExecutionStatsHolder:AddOrUpdateEntry() would help.
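
For illustration only, a wait-duration measurement of the kind mentioned above could look roughly like the sketch below. It assumes a holder guarded by a single `sync.Mutex`; `timedHolder`, `addOrUpdate`, and `waitStats` are hypothetical names rather than the PR's code, and a real deployment would presumably report the wait to a metrics library instead of accumulating it in a field.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// timedHolder is a hypothetical holder that records how long callers wait for its mutex.
type timedHolder struct {
	mu        sync.Mutex
	entries   map[string]int
	totalWait time.Duration // cumulative time spent blocked in Lock(); guarded by mu
	updates   int           // number of addOrUpdate calls; guarded by mu
}

func newTimedHolder() *timedHolder {
	return &timedHolder{entries: make(map[string]int)}
}

// addOrUpdate measures the time spent acquiring the mutex, then performs
// the single map entry update inside the critical section.
func (h *timedHolder) addOrUpdate(key string, count int) {
	start := time.Now()
	h.mu.Lock()
	h.totalWait += time.Since(start)
	h.updates++
	h.entries[key] = count
	h.mu.Unlock()
}

// waitStats returns the cumulative lock wait and the number of updates.
func (h *timedHolder) waitStats() (time.Duration, int) {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.totalWait, h.updates
}

func main() {
	h := newTimedHolder()
	var wg sync.WaitGroup
	// Four goroutines stand in for the 4 worker threads mentioned above.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for i := 0; i < 1000; i++ {
				h.addOrUpdate(fmt.Sprintf("exec-%d", worker), i)
			}
		}(w)
	}
	wg.Wait()

	total, n := h.waitStats()
	fmt.Printf("updates=%d mean lock wait=%v\n", n, total/time.Duration(n))
}
```

The measured wait includes the cost of the two clock reads, so for a critical section this small the numbers are best read as an upper bound on contention.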


@pvditt pvditt left a comment

thank you for adding this

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Apr 4, 2024
@pvditt pvditt merged commit f1c2231 into flyteorg:master Apr 4, 2024
46 of 48 checks passed

welcome bot commented Apr 4, 2024

Congrats on merging your first pull request! 🎉

Jeinhaus pushed a commit to Jeinhaus/flyte that referenced this pull request Apr 8, 2024
…lyteorg#4986)

* Add tracking for active node and task execution counts in propeller

Signed-off-by: Shardool <[email protected]>

* Update unit tests for task and node execution counts

Signed-off-by: Shardool <[email protected]>

* Fix linter errors

Signed-off-by: Shardool <[email protected]>

* fix linter errors

Signed-off-by: Paul Dittamo <[email protected]>

---------

Signed-off-by: Shardool <[email protected]>
Signed-off-by: Paul Dittamo <[email protected]>
Co-authored-by: Paul Dittamo <[email protected]>
Labels
enhancement New feature or request lgtm This PR has been approved by a maintainer size:XL This PR changes 500-999 lines, ignoring generated files.

2 participants