running workflows showing with the "?" symbol #485

Open
oliver-sanders opened this issue Aug 14, 2023 · 16 comments

@oliver-sanders
Member

oliver-sanders commented Aug 14, 2023

When starting a stopped workflow, it will sometimes show in the GUI with the "?" symbol rather than the "▶" icon for a running workflow.

From inspecting the UI data store, when the workflow is started, the status and statusMsg fields are initially provided correctly:

 status: "running", statusMsg: "running"

But are then quickly set to "" in a subsequent delta:

status: "", statusMsg: ""

This then happens again, leaving the workflow with status "". That causes the workflow-icon component to display the "?" and prevents mutations from being run against the workflow.
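
To make the symptom concrete, here is a minimal Python sketch of a store that overwrites its fields with whatever each delta carries. It is not the actual UI code (which is JavaScript); the `apply_naive` and `icon_for` helpers and the `ICONS` table are hypothetical names used only for illustration.

```python
# Hypothetical sketch: a store that blindly overwrites fields from each delta.
ICONS = {"running": "▶"}  # assumed mapping; only the icons named in this issue


def apply_naive(store: dict, delta: dict) -> dict:
    """Return a copy of the store with every field in the delta overwritten."""
    merged = dict(store)
    merged.update(delta)
    return merged


def icon_for(status: str) -> str:
    """Fall back to "?" for any status the table does not know."""
    return ICONS.get(status, "?")


store = {}
deltas = [
    {"status": "running", "statusMsg": "running"},  # initially correct
    {"status": "", "statusMsg": ""},                # later delta blanks it
]
for delta in deltas:
    store = apply_naive(store, delta)

print(icon_for(store["status"]))
```

With the blank delta applied last, the lookup falls through to the "?" fallback, which matches what is seen in the GUI.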

@oliver-sanders oliver-sanders added the bug Something isn't working label Aug 14, 2023
@oliver-sanders oliver-sanders added this to the 1.3.1 milestone Aug 14, 2023
@oliver-sanders
Member Author

oliver-sanders commented Aug 14, 2023

This diff shows the status is getting into the data store correctly:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index a15341d8f..cf292ecfe 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -1855,6 +1855,7 @@ class DataStoreMgr:
         # Set status & msg if changed.
         status, status_msg = map(
             str, get_workflow_status(self.schd))
+        LOG.warning(f'{status=}, {status_msg=}')
         if w_data.status != status or w_data.status_msg != status_msg:
             w_delta.status = status
             w_delta.status_msg = status_msg

Producing messages like this:

WARNING - status='running', status_msg='running'

Which I think narrows it down to the UIS.
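
For readers unfamiliar with the code around that diff, the "set status & msg if changed" step can be sketched roughly as below. This is a simplified stand-in, not the real `DataStoreMgr` code: `WorkflowFields` and `update_status` are hypothetical names, and the real objects are protobuf messages rather than dataclasses.

```python
from dataclasses import dataclass


@dataclass
class WorkflowFields:
    """Stand-in for the workflow message; not the real class."""
    status: str = ""
    status_msg: str = ""


def update_status(w_data, w_delta, status, status_msg):
    """Copy the status fields into the delta only if they differ from the store.

    Returns True if the delta was modified.
    """
    if w_data.status != status or w_data.status_msg != status_msg:
        w_delta.status = status
        w_delta.status_msg = status_msg
        return True
    return False


# The store has no status yet, so the new status goes into the delta.
w_data = WorkflowFields()
w_delta = WorkflowFields()
changed = update_status(w_data, w_delta, "running", "running")
print(changed, w_delta.status, w_delta.status_msg)
```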

@dwsutherland
Member

dwsutherland commented Aug 16, 2023

Yeah, bit odd, Scheduler and UIS stores are fine:
image

So the last delta arriving and/or applied to the UIS must be running ..

and when I refresh the page it comes right.. So I guess I'll have to look at the deltas arriving/leaving the UIS

@dwsutherland
Member

Here's the problem delta arriving at the WUI:
image

@dwsutherland
Member

It's already been added, but we received an extra added workflow delta... I suspect the UIS is constructing this..

@dwsutherland
Member

Thing is.. that delta isn't consistently sent, so it makes me think it's from the UIS scan or something..

@dwsutherland
Member

Finding it really hard to replicate the issue.. the content of the delta says two things:

  • The added delta appears to contain the contact info, which suggests that the UIS data_store_mgr method _update_contact() is what populated the data store .. and that this is part of the "initial burst" being sent.
  • Even if it's sent, the WUI should probably be overwriting it with the updated delta (although I don't know if that is consistently sent alongside it).

Will carry on tomorrow.. might be fixed by just making sure those status fields aren't being set, but the WUI might break if the workflow has no status fields.

@dwsutherland
Member

The issue is more easily reproduced with higher scan rate (lower scan_interval), which is a clue in itself.. slowly narrowing in on it.

@dwsutherland
Member

dwsutherland commented Aug 23, 2023

I believe I've figured out what's going on here.. Two things really.

At the UIS, occasionally the UIS connects to the scheduler before the first start-up delta:

{
	"id": "1",
	"type": "data",
	"payload": {
		"data": {
			"deltas": {
				"id": "~sutherlander/simple/run1",
				"added": {
					"workflow": {
						"id": "~sutherlander/simple/run1",
						"status": "",
						"statusMsg": "",
						"owner": "sutherlander",
						"host": "cortex.virtualbox.org",
						"port": 43043,
						"stateTotals": {
							"waiting": 0,
							"expired": 0,
							"preparing": 0,
							"submit-failed": 0,
							"submitted": 0,
							"running": 0,
							"failed": 0,
							"succeeded": 0
						},
						"latestStateTasks": {},
						"__typename": "Workflow"
					},
					"__typename": "Added"
				},
				"updated": {
					"workflow": {
						"id": "~sutherlander/simple/run1",
						"status": "running",
						"statusMsg": "running",
						"port": 43043,
						"__typename": "Workflow"
					},
					"__typename": "Updated"
				},
				"pruned": {
					"__typename": "Pruned"
				},
				"__typename": "Deltas"
			}
		}
	}
}

gets published by the scheduler.. The UIS publishes the initial_burst (which already contains the added workflow in the running state)..
This is possible because there's a window of time between the scheduler starting up and publishing its first deltas (as deltas are batched every main loop). In that time the UIS subscribes to the scheduler, sends a query for a full dump, and queues the initial burst for the WUI.. It's actually quite impressive that it's that fast.

Fixing this (not sending this startup subscription if the initial burst has been sent) will be enough to resolve the issue.

However, the 2nd issue is the WUI's handling of the above delta. By my understanding, if it encounters an added workflow delta, it should wipe out the previous store, apply this added delta, and then apply the updated delta (which it clearly isn't doing, because it doesn't end up with running status).
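
The ordering described here (added first, then updated) can be sketched in Python as follows. This is only an illustration of the expected behaviour, not the WUI's implementation (which is JavaScript): `apply_deltas` is a hypothetical helper, and treating a non-empty `pruned.workflow` as a single workflow ID is an assumption. The example message is a trimmed copy of the delta shown above.

```python
def _fields(workflow: dict) -> dict:
    """Drop GraphQL bookkeeping keys before storing."""
    return {k: v for k, v in workflow.items() if k != "__typename"}


def apply_deltas(store: dict, deltas: dict) -> dict:
    """Apply one deltas message: added, then updated, then pruned."""
    store = dict(store)

    added = deltas.get("added", {}).get("workflow")
    if added:
        # An added workflow replaces whatever was stored under that ID.
        store[added["id"]] = _fields(added)

    updated = deltas.get("updated", {}).get("workflow")
    if updated:
        # An update is merged on top of the (possibly just added) entry.
        current = store.get(updated["id"], {})
        store[updated["id"]] = {**current, **_fields(updated)}

    pruned_id = deltas.get("pruned", {}).get("workflow")  # assumed shape
    if pruned_id:
        store.pop(pruned_id, None)

    return store


WID = "~sutherlander/simple/run1"
message = {
    "id": WID,
    "added": {"workflow": {"id": WID, "status": "", "statusMsg": "",
                           "port": 43043, "__typename": "Workflow"}},
    "updated": {"workflow": {"id": WID, "status": "running",
                             "statusMsg": "running", "port": 43043,
                             "__typename": "Workflow"}},
    "pruned": {"__typename": "Pruned"},
}
store = apply_deltas({}, message)
print(store[WID]["status"])
```

Applied in this order, the empty status from the added section is overwritten by the updated section, so the workflow would end up as running.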
Also, you can avoid sending empties if you add the stripNull: true arg to the workflow part:

fragment AddedDelta on Added {

  workflow (stripNull: true) {
    ...WorkflowData
  }
}

However, having no status would probably trip the scan panel up .. (and if the update was applied it wouldn't be an issue)
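
To illustrate what stripping empties would mean for the added workflow above, here is a rough Python sketch. It is not Cylc's implementation of stripNull: `strip_empty` and `is_empty` are hypothetical helpers, and the rule for what counts as empty (None, empty strings and empty containers, but not zeros) is an assumption.

```python
def is_empty(value) -> bool:
    """Assumed definition of "empty": None, "" or an empty container."""
    return value is None or value == "" or value == {} or value == []


def strip_empty(value):
    """Recursively drop empty entries from nested dicts."""
    if isinstance(value, dict):
        stripped = {k: strip_empty(v) for k, v in value.items()}
        return {k: v for k, v in stripped.items() if not is_empty(v)}
    return value


added_workflow = {
    "id": "~sutherlander/simple/run1",
    "status": "",
    "statusMsg": "",
    "port": 43043,
    "latestStateTasks": {},
}
result = strip_empty(added_workflow)
print(result)
```

Under these assumptions the stripped workflow carries no status key at all, which is the case that might trip up the scan panel.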

Addressing either of these should resolve the issue..

@MetRonnie
Member

I encountered this and it seems the UI received an empty delta (no added, pruned or updated; just the id is populated). Don't know if that is relevant?

image

@oliver-sanders
Member Author

Empty deltas are unnecessary but harmless.

@MetRonnie MetRonnie modified the milestones: 1.3.1, 1.3.2 Sep 7, 2023
@oliver-sanders oliver-sanders modified the milestones: 1.3.2, 1.5.0 Nov 2, 2023
@jarich

jarich commented Apr 3, 2024

This might be an unrelated bug, but we see this when we are running more than 8 workflows.

@hjoliver
Member

hjoliver commented Apr 3, 2024

@jarich - could be this?

#569

on a dual core machine (2 physical cores = 4 virtual cores) the maximum number of workflows that will update is 8! Ouch!

@oliver-sanders
Member Author

Release including fix coming out soon, hopefully today, will announce on Discourse.

@MetRonnie
Member

Can this be closed now?

@oliver-sanders
Member Author

I don't think so, I've seen this recently (note the 8 workflow thing commented above had a different root cause).

I think the underlying issue here is that sometimes the workflow status can be undefined. This is generally a short-lived glitch corrected by a subsequent delta, but is something that shouldn't happen.

@oliver-sanders oliver-sanders modified the milestones: 1.6.0, 1.7.0 Dec 3, 2024
@oliver-sanders
Member Author

I've seen this a few times of late and we have recently had another report. Due to the way we grey-out UI commands based on workflow state, this can leave the user unable to interact with the workflow.

The user report suggests that this issue may go away on reload. This could indicate that it is a UI issue, but it could also be caused by workflow state-change deltas.

I have a cylc8 suite which is appearing as "state unknown" in the GUI and clicking the various menus doesn't seem to show any options - does anyone know how I make its state known again?

Refreshing didn't help, but I tried closing the browser entirely and then restarting it and that seems to have got it back!
