[Housekeeping] Enqueue in subqueue only on actionable Pod phase #2856

hamersaw · 2022-09-07T14:36:25Z

Describe the issue

Currently the FlytePropeller workqueue contains two levels. The first is where FlytePropeller pulls workflows from the periodically evaluate. The second is used as a batching queue to ensure fast downstream node evaluations while mitigating issues caused by high frequency execution. FlytePropeller adds workflows the the subqueue when k8s resource updates are observed.

This means that everytime a Pod phase is updated we re-evaluate the workflow, which may result is many unnecessary evaluations. For example, on Pod transition from "QUEUED" to "INITIALIZING" there is nothing that FlytePropeller can do. It may make sense to only only enqueue the workflow when k8s resources reach terminal phases (ex. "FAILED" or "SUCCEEDED") and FlytePropeller can actually reschedule or schedule downstream operations.

What if we do not do this?

The subqueue may be flooded with workflows and FlytePropeller will unnecessarily evaluate the workflow, incurring additional overhead.

Related component(s)

No response

Are you sure this issue hasn't been raised already?

Yes

Have you read the Code of Conduct?

Yes

P3rcy-8685 · 2022-10-07T11:03:00Z

I would like to work on this issue @samhita-alla @hamersaw

P3rcy-8685 · 2022-10-07T11:05:53Z

I would like to work on this issue @samhita-alla @hamersaw

I would love some guidance on this issue. I am familiar with the language but I am participating in open source projects for the first time

daniel-shuy · 2022-10-29T03:42:32Z

@hamersaw Looking through the code, I see that the WorkerPool is getting the workflow namespace and name from the subqueue (https://github.com/flyteorg/flytepropeller/blob/26ad85757c57cda41bf23a2b054d49eccaa8145d/pkg/controller/workers.go#L39-L91) and calling Propeller.Handle() to reconcile the workflow.

Correct me if I'm wrong, but this seems contrary to your explanation though, as Propeller.Handle() ignores terminated workflows (https://github.com/flyteorg/flytepropeller/blob/26ad85757c57cda41bf23a2b054d49eccaa8145d/pkg/controller/handler.go#L205-L211):

	if w.GetExecutionStatus().IsTerminated() {
		if HasCompletedLabel(w) && !HasFinalizer(w) {
			logger.Debugf(ctx, "Workflow is terminated.")
			// This workflow had previously completed, let us ignore it
			return nil
		}
	}

hamersaw · 2022-10-31T14:57:34Z

@daniel-shuy I think I may not have been clear enough in the description. So the workqueue retrieval that you linked to is for the top-level queue. When we create the queue we create a two-tiered structure. Basically, every time we want FlytePropeller to process a worklfow it is added to the top-level queue.

Items are added to the subqueue when k8s resource changes are detected. This means that every time a Pod phase changes (ex. Initialized -> Queued) the item is added to the subqueue. Then periodically, items from the subqueue are batched and added to the top-level queue. This scheme was designed because in large workflows Pod updates can happen very frequently and we shouldn't be evaluating a workflow every few milliseconds.

So this issue was designed to further minimize the number of times FlytePropeller needs to evaluate the workflow. Now that I review this, I think we need more than terminal Pod phases, because we still want to update node phase status' dynamically. But we only need to add the workflow to the subqueue if FlytePropeller can update the workflow using the Pod phase. So phase updates from Initializing -> Queued or Queued -> Running may not be useful. This is going to require some parsing through which Pod phase are actionable. Looking at the k8s Pod plugin may be a great place to start.

daniel-shuy · 2022-11-01T14:38:12Z

@hamersaw Thanks for the detailed explanation. Looks like this is even more complicated than I expected 😅

github-actions · 2024-04-11T00:06:40Z

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable.
Thank you for your contribution and understanding! 🙏

hamersaw · 2024-04-11T13:34:10Z

commenting to keep open.

peterxcli · 2024-10-28T02:14:23Z

@hamersaw Hi! I'm currently working on this issue and have a quick question. Is there a way to retrieve the previous phase of a task pod directly, or should I manually record it when K8s resource changes are detected?

Thanks for your guidance!

hamersaw added good first issue Good for newcomers propeller Issues related to flyte propeller housekeeping Issues that help maintain flyte and keep it tech-debt free labels Sep 7, 2022

hamersaw added this to the 1.3.0-candidate milestone Sep 7, 2022

samhita-alla added the hacktoberfest label Sep 22, 2022

samhita-alla mentioned this issue Sep 23, 2022

Flyte Hacktoberfest 2022: Issues, Ideas and Guidelines #2917

Closed

50 tasks

samhita-alla assigned P3rcy-8685 Oct 7, 2022

samhita-alla unassigned P3rcy-8685 Oct 28, 2022

hamersaw changed the title ~~[Housekeeping] Enqueue in subqueue only on terminal Pod phase~~ [Housekeeping] Enqueue in subqueue only on actionable Pod phase Oct 31, 2022

cosmicBboy removed this from the 1.3.0-candidate milestone Jan 23, 2023

cosmicBboy removed the hacktoberfest label Jul 15, 2023

github-actions bot added the stale label Apr 11, 2024

github-actions bot removed the stale label Apr 12, 2024

github-actions bot assigned peterxcli Oct 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Housekeeping] Enqueue in subqueue only on actionable Pod phase #2856

[Housekeeping] Enqueue in subqueue only on actionable Pod phase #2856

hamersaw commented Sep 7, 2022

P3rcy-8685 commented Oct 7, 2022

P3rcy-8685 commented Oct 7, 2022

daniel-shuy commented Oct 29, 2022

hamersaw commented Oct 31, 2022 •

edited

Loading

daniel-shuy commented Nov 1, 2022

github-actions bot commented Apr 11, 2024

hamersaw commented Apr 11, 2024

peterxcli commented Oct 28, 2024

[Housekeeping] Enqueue in subqueue only on actionable Pod phase #2856

[Housekeeping] Enqueue in subqueue only on actionable Pod phase #2856

Comments

hamersaw commented Sep 7, 2022

Describe the issue

What if we do not do this?

Related component(s)

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

P3rcy-8685 commented Oct 7, 2022

P3rcy-8685 commented Oct 7, 2022

daniel-shuy commented Oct 29, 2022

hamersaw commented Oct 31, 2022 • edited Loading

daniel-shuy commented Nov 1, 2022

github-actions bot commented Apr 11, 2024

hamersaw commented Apr 11, 2024

peterxcli commented Oct 28, 2024

hamersaw commented Oct 31, 2022 •

edited

Loading