Data store avoid graph re-walk #5660
Conversation
Force-pushed f10aa7b to 65cb9fc.
Coverage should improve with an n=2 window test.
Force-pushed 99a86e8 to f110f63.
Force-pushed 714f758 to 58dbf6e.
I've been having a play with this branch with the n=1 window in … using the following diff to track which children/parents are being walked over:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index b6a2960fb..e78cd767c 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -732,6 +732,7 @@ class DataStoreMgr:
None
"""
+ print(f'# increment_graph_window({source_tokens.relative_id})')
# common refrences
active_id = source_tokens.id
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..d2517383b 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -36,6 +36,7 @@ if TYPE_CHECKING:
def generate_graph_children(tdef, point):
"""Determine graph children of this task at point."""
+ print(f'# generate_graph_children({point}/{tdef.name})')
graph_children = {}
for seq, dout in tdef.graph_children.items():
for output, downs in dout.items():
@@ -78,6 +79,7 @@ def generate_graph_parents(tdef, point, taskdefs):
Infer parents be reversing upstream triggers that lead to point/task.
"""
+ print(f'# generate_graph_parents({point}/{tdef.name})')
graph_parents = {}
for seq, triggers in tdef.graph_parents.items():
if not seq.is_valid(point):

Using the graph from your example above with the default n=1 window:

flow.cylc:
[scheduler]
cycle point format = CCYY
[scheduling]
runahead limit = P1
initial cycle point = 2020
[[graph]]
P1Y = """
a => b => c => d => e => f => g
b => h => i => f
g[-P1Y] => a
"""
[runtime]
[[root]]
script = sleep 2
[[b, c, d, e, f, i]]
[[a]]
[[g,h]]

It looks like the algorithm is looking one edge further in front (e.g. here's a snippet from …
E.g. in the above output, when the n=0 window was … When the children of …
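For context, the n-window is the set of tasks within n graph edges of an active task. A minimal sketch of the concept (hypothetical and simplified, not the cylc implementation) as a breadth-first walk over an undirected adjacency mapping:

```python
from collections import deque

def n_window(graph, active, n):
    """Return all nodes within n edges of any node in the active set.

    graph: dict mapping node -> set of neighbours (children and parents
    combined, since the window extends in both directions).
    """
    window = set(active)
    frontier = deque((node, 0) for node in active)
    while frontier:
        node, dist = frontier.popleft()
        if dist == n:
            continue  # window boundary reached along this path
        for neighbour in graph.get(node, ()):
            if neighbour not in window:
                window.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return window

# The start of the a => b => c => d chain from the example workflow:
graph = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b', 'd'}, 'd': {'c'}}
print(sorted(n_window(graph, {'b'}, 1)))  # → ['a', 'b', 'c']
```

With n=2 from active task `b` the walk also reaches `d`, which matches the "one edge further" behaviour being inspected above.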
Moving on to the n=2 window (and dumping out the n-window after each call) with this diff:

diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index b6a2960fb..3a21bbbea 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -473,7 +473,7 @@ class DataStoreMgr:
self.updated_state_families = set()
# Update workflow state totals once more post delta application.
self.state_update_follow_on = False
- self.n_edge_distance = 1
+ self.n_edge_distance = 2
self.next_n_edge_distance = None
self.latest_state_tasks = {
state: deque(maxlen=LATEST_STATE_TASKS_QUEUE_SIZE)
@@ -732,6 +732,7 @@ class DataStoreMgr:
None
"""
+ print(f'# increment_graph_window({source_tokens.relative_id})')
# common refrences
active_id = source_tokens.id
@@ -1018,6 +1019,10 @@ class DataStoreMgr:
self.prune_trigger_nodes[active_id])
del self.prune_trigger_nodes[active_id]
+ task_proxies = self.data[self.workflow_id][TASK_PROXIES]
+ for tp_id in sorted(list(task_proxies) + list(self.added[TASK_PROXIES])):
+ print(f'$ {tp_id}')
+
def generate_edge(
self,
parent_tokens: Tokens,
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..d2517383b 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -36,6 +36,7 @@ if TYPE_CHECKING:
def generate_graph_children(tdef, point):
"""Determine graph children of this task at point."""
+ print(f'# generate_graph_children({point}/{tdef.name})')
graph_children = {}
for seq, dout in tdef.graph_children.items():
for output, downs in dout.items():
@@ -78,6 +79,7 @@ def generate_graph_parents(tdef, point, taskdefs):
Infer parents be reversing upstream triggers that lead to point/task.
"""
+ print(f'# generate_graph_parents({point}/{tdef.name})')
graph_parents = {}
for seq, triggers in tdef.graph_parents.items():
if not seq.is_valid(point):

Here's the log:
Looks good: gives the right result, with some possible savings to be made by avoiding duplicate …
There are duplicate … I spotted a [very] small optimisation: moving that return a little higher up in the code:

--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -1108,11 +1108,15 @@ class DataStoreMgr:
Orphan tasks with no children return (True, None) respectively.
"""
+ tp_id = tokens.id
+ task_proxies = self.data[self.workflow_id][TASK_PROXIES]
+ if tp_id in task_proxies or tp_id in self.added[TASK_PROXIES]:
+ # ghost task already created
+ return
+
name = tokens['task']
point_string = tokens['cycle']
t_id = self.definition_id(name)
- tp_id = tokens.id
- task_proxies = self.data[self.workflow_id][TASK_PROXIES]
is_orphan = False
if name not in self.schd.config.taskdefs:
@@ -1120,8 +1124,6 @@ class DataStoreMgr:
if itask is None:
itask = self.schd.pool.get_task(point_string, name)
- if tp_id in task_proxies or tp_id in self.added[TASK_PROXIES]:
- return
if itask is None:
itask = TaskProxy(
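The diff above is an instance of the generic check-the-cache-before-doing-any-work pattern: the membership test is moved ahead of the lookup/construction work so already-seen nodes cost almost nothing. A hypothetical sketch of the same idea (names are illustrative, not the cylc API):

```python
def generate_task_proxy(tp_id, store, make_proxy):
    """Create a task proxy unless one already exists in the store.

    Checking membership first skips the comparatively expensive
    construction work entirely for nodes seen on a previous walk.
    """
    if tp_id in store:  # proxy already created on an earlier walk
        return store[tp_id]
    proxy = make_proxy(tp_id)  # expensive construction happens once
    store[tp_id] = proxy
    return proxy

calls = []
store = {}
make = lambda tp_id: calls.append(tp_id) or {'id': tp_id}
generate_task_proxy('2020/a', store, make)
generate_task_proxy('2020/a', store, make)  # second call is a cheap no-op
print(len(calls))  # → 1
```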
Tried out the new graphDepth field; the numbers came out as expected, so this should close #5609, which will allow us to unlock the higher n-window views in the GUI! Is the graph window that the data store is currently configured to track available in the schema? That would be handy for the UI.
diff --git a/cylc/flow/data_store_mgr.py b/cylc/flow/data_store_mgr.py
index 03c77eb94..a085f2f3f 100644
--- a/cylc/flow/data_store_mgr.py
+++ b/cylc/flow/data_store_mgr.py
@@ -754,6 +754,7 @@ class DataStoreMgr:
# walk filling, that may not have been the entire walk).
# If walk already completed, must have gone from non-active to active
# again.. So redo walk (as walk nodes may be pruned).
+ print(f'# Active ID: {active_id}')
if (
active_id not in all_walks
or active_id in self.n_window_completed_walks
@@ -899,6 +900,7 @@ class DataStoreMgr:
)
continue
+ print(f'# Generate Children/Parents of {node_tokens.relative_id}')
# Children/downstream nodes
# TODO: xtrigger is workflow_state edges too
# Reference set for workflow relations
diff --git a/cylc/flow/taskdef.py b/cylc/flow/taskdef.py
index a4ae681e2..daae255f4 100644
--- a/cylc/flow/taskdef.py
+++ b/cylc/flow/taskdef.py
@@ -15,6 +15,7 @@
# along with this program. If not, see <http://www.gnu.org/licenses/>.
"""Task definition."""
+import inspect
from collections import deque
from typing import TYPE_CHECKING
@@ -36,6 +37,7 @@ if TYPE_CHECKING:
def generate_graph_children(tdef, point):
"""Determine graph children of this task at point."""
+ print(f'# Children of {point}/{tdef.name} - Called by {inspect.currentframe().f_back.f_code.co_name}')
graph_children = {}
for seq, dout in tdef.graph_children.items():
for output, downs in dout.items():
@@ -78,6 +80,7 @@ def generate_graph_parents(tdef, point, taskdefs):
Infer parents be reversing upstream triggers that lead to point/task.
"""
+ print(f'# parents of {point}/{tdef.name} - Called by {inspect.currentframe().f_back.f_code.co_name}')
graph_parents = {}
for seq, triggers in tdef.graph_parents.items():
if not seq.is_valid(point):
As you've spotted, I've been relying on …
I think we can add that as a workflow field (if it's not already). I've also thought of an optimization for the depth finder, so will put that up in the morning.
Force-pushed 58dbf6e to bee6937.
I changed it to reuse the …
Force-pushed 471ebc4 to 529c3ca.
There were still some duplicate …
Force-pushed 529c3ca to 76aecbc.
Done some more testing...

Latest graph patch:

diff --git a/src/components/cylc/SVGTask.vue b/src/components/cylc/SVGTask.vue
index c242a32c..d5925bb5 100644
--- a/src/components/cylc/SVGTask.vue
+++ b/src/components/cylc/SVGTask.vue
@@ -37,6 +37,10 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
queued: task.isQueued && !task.isHeld,
runahead: task.isRunahead && !(task.isHeld || task.isQueued),
}"
+ :style="`
+ --foreground: ${foreground};
+ --background: ${background};
+ `"
>
<!-- status
@@ -279,6 +283,31 @@ export default {
// otherwise the progress indicator may end up in the wrong place.
type: Number,
default: 0
+ },
+ colourMap: {
+ default: {
+ undefined: 'gray',
+ 0: 'hsl(0, 100%, 70%)',
+ 1: 'hsl(30, 100%, 70%)',
+ 2: 'hsl(60, 100%, 70%)',
+ 3: 'hsl(90, 100%, 70%)',
+ 4: 'hsl(120, 100%, 70%)',
+ 5: 'hsl(150, 100%, 70%)',
+ 6: 'hsl(180, 100%, 70%)',
+ 7: 'hsl(210, 100%, 70%)',
+ 8: 'hsl(240, 100%, 70%)',
+ 9: 'hsl(270, 100%, 70%)',
+ 10: 'hsl(300, 100%, 70%)',
+ 11: 'hsl(330, 100%, 70%)'
+ }
+ }
+ },
+ computed: {
+ foreground () {
+ return this.colourMap[this.task.graphDepth]
+ },
+ background () {
+ return 'white'
}
},
methods: {
@@ -342,19 +371,21 @@ export default {
</script>
<style lang="scss">
- $foreground: rgb(90,90,90);
- $background: rgb(255,255,255);
+ :root {
+ --foreground: rgb(90,90,90);
+ --background: rgb(255,255,255);
+ }
.c8-task {
.status {
.outline {
// NOTE: ensure the outline is filled so that it can be clicked
- fill: $background;
- stroke: $foreground;
+ fill: var(--background);
+ stroke: var(--foreground);
}
.progress {
fill: transparent;
- stroke: $foreground;
+ stroke: var(--foreground);
transform-origin: 50% 50%;
opacity: 0.4;
/* 0% progress */
@@ -397,75 +428,75 @@ export default {
}
&.preparing .status .dot {
- fill: $foreground;
+ fill: var(--foreground);
}
&.submitted .status .dot {
- fill: $foreground;
+ fill: var(--foreground);
}
&.running .status .hub {
- fill: $foreground;
+ fill: var(--foreground);
}
&.running .status .progress {
- fill: $foreground;
+ fill: var(--foreground);
}
&.succeeded .status .outline {
- fill: $foreground;
+ fill: var(--foreground);
}
&.failed .status {
.outline {
- fill: $foreground;
+ fill: var(--foreground);
}
.cross rect {
- fill: $background;
+ fill: var(--background);
}
}
&.submit-failed .status {
.outline {
- fill: $background;
+ fill: var(--background);
}
.cross rect {
- fill: $foreground;
+ fill: var(--foreground);
}
}
&.expired .status {
.outline {
- fill: $foreground;
+ fill: var(--foreground);
}
.dot {
- fill: $background;
+ fill: var(--background);
}
.expired rect {
- fill: $background;
+ fill: var(--background);
}
}
&.held .modifier {
.outline {
- stroke: $foreground;
+ stroke: var(--foreground);
}
.held rect {
- fill: $foreground;
+ fill: var(--foreground);
}
}
&.queued .modifier {
.queued rect {
- fill: $foreground;
+ fill: var(--foreground);
}
}
&.runahead .modifier {
.outline {
- stroke: $foreground;
+ stroke: var(--foreground);
}
.runahead circle {
- fill: $foreground;
+ fill: var(--foreground);
}
}
}
diff --git a/src/views/Graph.vue b/src/views/Graph.vue
index 95781fa9..7d5a62fb 100644
--- a/src/views/Graph.vue
+++ b/src/views/Graph.vue
@@ -154,6 +154,7 @@ fragment TaskProxyData on TaskProxy {
isRunahead
isQueued
name
+ graphDepth
task {
meanElapsedTime
}
@@ -654,6 +655,9 @@ export default {
for (const obj of json.objects) {
const [x, y] = obj.pos.split(',')
const bbox = nodeDimensions[obj.name]
+ if (!bbox) {
+ console.warn(`No Node: ${obj.name}`)
+ }
// translations:
// 1. The graphviz node coordinates
// 2. Centers the node on this coordinate
diff --git a/src/views/Table.vue b/src/views/Table.vue
index 316157ac..589730df 100644
--- a/src/views/Table.vue
+++ b/src/views/Table.vue
@@ -110,6 +110,7 @@ fragment TaskProxyData on TaskProxy {
isHeld
isQueued
isRunahead
+ graphDepth
task {
meanElapsedTime
}
diff --git a/src/views/Tree.vue b/src/views/Tree.vue
index f191e7c5..86b26bd5 100644
--- a/src/views/Tree.vue
+++ b/src/views/Tree.vue
@@ -135,6 +135,7 @@ fragment TaskProxyData on TaskProxy {
isHeld
isQueued
isRunahead
+ graphDepth
task {
meanElapsedTime
}

Using longest rather than shortest n-window distance

Spotted whilst testing this example in the n=10 window:

[scheduler]
allow implicit tasks = True
[scheduling]
initial cycle point = 1
cycling mode = integer
[[graph]]
P1 = """
a => b => c
b[-P1] => b
""" The task "c" should be n=2 for the first 6 cycles, but I actually got:
Note, it doesn't reproduce the same way each time?!

Reloading resets graph window

Not strictly related to the changes on this PR. Would be good to squeeze in now if easy, otherwise can bump to an issue. Reloading the workflow appears to reset the n-window. E.g. if I set the window to n=2, then reload, it goes back to n=1.

Graph errors

I've seen a couple of bits of traceback in the graph view with …, i.e. if there are sources or targets in … I've seen this when the workflow is paused, which is surprising as all deltas should have been applied (rules out race conditions in delta batching). I can't get a reproducible example at the moment so can't confirm the cause. Could be unrelated.
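For reference on the longest-vs-shortest point: a multi-source breadth-first search visits every node at its minimum edge distance first, so a node can never be recorded at a longer depth with this formulation. A hypothetical sketch (not cylc code) showing "c" landing at depth 2:

```python
from collections import deque

def graph_depths(graph, active):
    """Minimum edge distance of every reachable node from the active set.

    BFS explores nodes in order of increasing distance, so the first
    time a node is seen is guaranteed to be its shortest distance.
    """
    depths = {node: 0 for node in active}
    frontier = deque(active)
    while frontier:
        node = frontier.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in depths:  # first visit == shortest path
                depths[neighbour] = depths[node] + 1
                frontier.append(neighbour)
    return depths

# a => b => c with only "a" active: "c" must come out at depth 2
graph = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
print(graph_depths(graph, {'a'})['c'])  # → 2
```

A depth-first or per-walk-order traversal, by contrast, can stamp a node with whichever path reaches it first, which is one way the longer distance could have been recorded.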
* Sometimes tasks may appear in the n-window, then disappear only to reappear at a later time.
* This is a difficult case for `increment_graph_window` to handle.
* See cylc#5660
Force-pushed f68a63b to 7ca4035.
* Sometimes tasks may appear in the n-window, then disappear only to reappear at a later time.
* This is a difficult case for `increment_graph_window` to handle.
* See cylc#5660
Co-authored-by: Oliver Sanders <[email protected]>
* Test the added, updated and pruned deltas created by increment_graph_window as the window changes.
* Test window resizing with n=1,2,3
Force-pushed 7ca4035 to bc1a017.
Worked well in my testing
Profiling results for the following workflow in simulation mode:

[task parameters]
x = 1..100
y = 1..100
[scheduling]
cycling mode = integer
initial cycle point = 1
final cycle point = 1
[[graph]]
P1 = d[-P1] => a => b<x> => c<y> => d
Good positive change, increases functionality whilst improving performance 💯!
Thanks for all your hard work @dwsutherland, this looked like a tricky one!
Added changelog for #5660
addresses #5485, #5609
At present we've resolved an efficiency problem at n=1 window size by restricting the graph walk to only parents or only children (i.e. 'c' or 'p', or 'ccc' or 'ppp' for n=3). However, this excludes cousins, which is evident at n>1 (i.e. 'cp' or 'pc' at n=2).
This restriction was a cheap way to avoid graph re-walks. See #5475.
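To illustrate why the pure-children/pure-parents restriction misses cousins: at n=2 a cousin is only reachable by a mixed path such as 'cp' (one edge down, one edge up). A small sketch enumerating walk direction strings (the 'c'/'p' notation follows the description above; the code itself is hypothetical, not cylc's):

```python
from itertools import product

def walk_strings(n, restricted):
    """All walk direction strings of length 1..n over {'c', 'p'}.

    restricted=True keeps only the pure 'c...c'/'p...p' paths used by
    the old n=1-optimised walk; False also includes the mixed paths
    that reach cousins.
    """
    strings = set()
    for length in range(1, n + 1):
        for combo in product('cp', repeat=length):
            s = ''.join(combo)
            if not restricted or len(set(s)) == 1:  # pure paths only
                strings.add(s)
    return strings

print(sorted(walk_strings(2, True)))   # → ['c', 'cc', 'p', 'pp']
print(sorted(walk_strings(2, False)))  # 'cp' and 'pc' now included
```

The restricted set at n=2 never contains 'cp' or 'pc', which is exactly the cousin-exclusion described above.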
Here the problem is actually fixed at any window size, by changing the graph walk algorithm from a recursive approach to a walk location approach. From the method's description:
Window resizing will now happen as soon as it's set (not as the graph progresses).

N-window edge depth of nodes is now found/updated with the new field `graph_depth`.

A good example workflow: … Gives: …
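The walk-location idea, caching each node's completed walk rather than recursing afresh from every active task, can be sketched roughly as follows (a hypothetical simplification, not the actual `increment_graph_window` implementation):

```python
from collections import deque

def window_with_cache(graph, active, n, walk_cache):
    """n-window via BFS, reusing per-node walks from earlier calls.

    walk_cache maps node -> the set of nodes within n edges of it.
    A node already walked (e.g. by a previously active task) is not
    re-walked; its cached result is unioned in directly.
    """
    window = set()
    for node in active:
        if node not in walk_cache:
            seen = {node}
            frontier = deque([(node, 0)])
            while frontier:
                here, dist = frontier.popleft()
                if dist == n:
                    continue
                for nxt in graph.get(here, ()):
                    if nxt not in seen:
                        seen.add(nxt)
                        frontier.append((nxt, dist + 1))
            walk_cache[node] = seen  # remember the completed walk
        window |= walk_cache[node]
    return window

graph = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
cache = {}
window_with_cache(graph, {'a'}, 1, cache)
window_with_cache(graph, {'a', 'b'}, 1, cache)  # 'a' walk is reused
print(sorted(cache))  # → ['a', 'b']
```

The real implementation also has to prune cached walks as tasks leave the window, which is where much of the complexity in this PR lives.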
Performance
Taken from #5435
$ cylc play <id> --profile --no-detach
Run this for a few mins; when you get bored, ctrl+c it, then:
This PR is even faster than current (#5475) at `n=1`. At `n=2` there's a bit more load, but clearly not squared or cubed.
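The `--profile` run writes cProfile data that can be ranked with the stdlib `pstats` module. The sketch below profiles in-process for demonstration; for a real run you would load the file your run produced instead (the exact path depends on your setup), e.g. `pstats.Stats('<run-dir>/profile.prof')`:

```python
import cProfile
import io
import pstats

# Generate a small profile in-process so the example is self-contained.
profiler = cProfile.Profile()
profiler.enable()
sum(i * i for i in range(10000))  # stand-in for the scheduler workload
profiler.disable()

# Rank the hottest functions by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats('cumulative').print_stats(5)  # top 5 entries
report = out.getvalue()
print(report[:200])
```

Sorting by `'cumulative'` surfaces the high-level callers (useful for spotting graph re-walks); `'tottime'` surfaces the low-level hot loops.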
Check List

* I have read CONTRIBUTING.md and added my name as a Code Contributor.
* Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
* CHANGES.md entry included if this is a change that can affect users