Simplify WDL Toil job graphs #4524

Merged
merged 29 commits into from
Aug 2, 2023

Conversation

adamnovak
Member

This will fix #4465 by allowing multiple WDL workflow nodes to (at least sometimes) live in one Toil job.

It also removes some Toil jobs that don't really need to be separate jobs, by baking the transformations they would do into the WDLBaseJob.

I ran a version of the conformance tests (2b91928600bfb1397c190a9900ac323db5111d1b) against this and I got:

Before:

64 tests run, 37 succeeded, 27 failed, 6 skipped
    Failures: 19,20,27,31,41,42,43,44,45,46,47,48,49,50,51,52,54,55,56,60,61,63,64,65,66,68,69
DOCKER_HOST=unix:///Users/anovak/.docker/run/docker.sock python3 run.py    1.  334.19s user 104.37s system 297% cpu 2:27.39 total

After:

64 tests run, 37 succeeded, 27 failed, 6 skipped
	Failures: 19,20,27,31,41,42,43,44,45,46,47,48,49,50,51,52,54,55,56,60,61,63,64,65,66,68,69
DOCKER_HOST=unix:///Users/anovak/.docker/run/docker.sock python3 run.py    1.  152.97s user 49.88s system 279% cpu 1:12.58 total

So it's roughly twice as fast by wall-clock time (2:27.39 versus 1:12.58) and no less conformant.
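
As a quick check of the "twice as fast" estimate, the arithmetic on the two timing lines above can be sketched as follows. The helper name `to_seconds` is made up for this illustration; the figures are copied from the `time` output quoted above.

```python
def to_seconds(clock: str) -> float:
    """Convert an 'M:SS.ss' wall-clock string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the Before/After timing lines.
before_wall = to_seconds("2:27.39")
after_wall = to_seconds("1:12.58")
before_cpu = 334.19 + 104.37  # user + system seconds
after_cpu = 152.97 + 49.88    # user + system seconds

wall_speedup = before_wall / after_wall
cpu_speedup = before_cpu / after_cpu

print(f"wall-clock speedup: {wall_speedup:.2f}x")
print(f"CPU-time reduction: {cpu_speedup:.2f}x")
```

Both ratios come out slightly above 2, which matches the estimate. This is a single run on one machine, so it is an indication rather than a benchmark.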

Changelog Entry

To be copied to the draft changelog by merger:

  • WDL interpreter job postprocessing operations can now get attached to the nearest job with actual content instead of needing to be separate whole jobs
  • Multiple WDL declarations can now be evaluated in one job

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@adamnovak
Member Author

Looks like this failed the Giraffe WDL test with:

[2023-07-06T22:13:22+0000] [MainThread] [W] [toil.leader] Log from job "kind-WDLWorkflowNodeListJob/instance-p3kiwm04" follows:
=========>
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil] Running Toil version 5.12.0a1-b6cfd132e2cba6a361e00dbcecd674483562586c on host runner-svulqt5t-project-3-concurrent-12bgtdt.
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.worker] Working on job 'WDLWorkflowNodeListJob' decl-reference_index_file+ kind-WDLWorkflowNodeListJob/instance-p3kiwm04 v1
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.worker] Loaded body Job('WDLWorkflowNodeListJob' decl-reference_index_file+ kind-WDLWorkflowNodeListJob/instance-p3kiwm04 v1) from description 'WDLWorkflowNodeListJob' decl-reference_index_file+ kind-WDLWorkflowNodeListJob/instance-p3kiwm04 v1
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.wdl.wdltoil] Setting reference_index_file to select_first([REFERENCE_INDEX_FILE, indexReference.reference_index_file])
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.wdl.wdltoil] Setting reference_dict_file to select_first([REFERENCE_DICT_FILE, indexReference.reference_dict_file])
	[2023-07-06T22:13:21+0000] [MainThread] [I] [toil.job] Saving graph of 1 jobs, 1 non-service, 0 new
	Traceback (most recent call last):
	  File "/builds/databiosphere/toil/src/toil/worker.py", line 403, in workerScript
	    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
	  File "/builds/databiosphere/toil/src/toil/job.py", line 2782, in _runner
	    self._saveJobGraph(jobStore, saveSelf=False, returnValues=returnValues)
	  File "/builds/databiosphere/toil/src/toil/job.py", line 2572, in _saveJobGraph
	    self._fulfillPromises(returnValues, jobStore)
	  File "/builds/databiosphere/toil/src/toil/job.py", line 2318, in _fulfillPromises
	    pickle.dump(promisedValue, fileHandle, pickle.HIGHEST_PROTOCOL)
	RecursionError: maximum recursion depth exceeded while calling a Python object
	[2023-07-06T22:13:21+0000] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host runner-svulqt5t-project-3-concurrent-12bgtdt
<=========

Maybe I managed to override the recursion-limit bump code so that it no longer gets called?
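
For context, the traceback above ends in `pickle.dump` raising `RecursionError`. Below is a minimal sketch of the kind of recursion-limit bump being discussed. It assumes nothing about how Toil actually wires this up: the helper names `raised_recursion_limit`, `nested_list`, and `dumps_deep` are hypothetical, and the default of 10000 is only an illustrative value. Whether `sys.setrecursionlimit` also governs recursion inside the C pickler may depend on the Python version, so this only shows the pattern.

```python
import contextlib
import pickle
import sys
from typing import Any, Iterator


@contextlib.contextmanager
def raised_recursion_limit(limit: int = 10000) -> Iterator[None]:
    """Temporarily raise (never lower) the interpreter recursion limit."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old_limit, limit))
    try:
        yield
    finally:
        # Restore the previous limit even if pickling failed.
        sys.setrecursionlimit(old_limit)


def nested_list(depth: int) -> list:
    """Build a list nested `depth` levels deep, e.g. [[[]]] for depth 2."""
    value: list = []
    for _ in range(depth):
        value = [value]
    return value


def dumps_deep(value: Any) -> bytes:
    """Pickle a possibly deeply nested value under the raised limit."""
    with raised_recursion_limit():
        return pickle.dumps(value, pickle.HIGHEST_PROTOCOL)


data = dumps_deep(nested_list(50))
restored = pickle.loads(data)
```

If the bump lives in a constructor that a subclass overrides without calling it, the limit is never raised. That is one way the failure hypothesized above could come about.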

@adamnovak
Member Author

@DailyDreaming Would you be able to review this before the Toil meeting tomorrow?

Member

@DailyDreaming DailyDreaming left a comment

This looks well done. I like the organization and the comments are invaluable. Mostly minor comments, except I wonder whether there should be a test that something like coalesce_nodes actually groups together the IDs we would expect to run together.

@@ -885,6 +890,11 @@ def __init__(self, **kwargs: Any) -> None:
# TODO: Make sure C-level stack size is also big enough for this.
sys.setrecursionlimit(10000)

# We need an ordered list of postprocessing steps to apply, because we
# may ahve coalesced postprocessing steps deferred by several levels of
Member

ahve

@@ -885,6 +890,11 @@ def __init__(self, **kwargs: Any) -> None:
# TODO: Make sure C-level stack size is also big enough for this.
sys.setrecursionlimit(10000)

# We need an ordered list of postprocessing steps to apply, because we
# may ahve coalesced postprocessing steps deferred by several levels of
# jobs returing other jobs' promised RVs.
Member

returing
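
To illustrate why the quoted comment asks for an ordered list, here is a rough sketch of deferred postprocessing steps. It is not the code from this PR: `PostprocessingJob`, `then`, `defer_to`, `add_namespace`, and `remove` are hypothetical names, and a plain dict stands in for the WDL bindings.

```python
from typing import Any, Callable, Dict, List

Bindings = Dict[str, Any]
Step = Callable[[Bindings], Bindings]


class PostprocessingJob:
    """Hypothetical stand-in for a job that carries deferred postprocessing."""

    def __init__(self) -> None:
        self._steps: List[Step] = []

    def then(self, step: Step) -> "PostprocessingJob":
        """Queue a step to run after any steps that are already queued."""
        self._steps.append(step)
        return self

    def defer_to(self, other: "PostprocessingJob") -> None:
        """Hand our pending steps to the job whose result we will return."""
        other._steps.extend(self._steps)
        self._steps = []

    def postprocess(self, bindings: Bindings) -> Bindings:
        """Apply all queued steps in the order they were queued."""
        for step in self._steps:
            bindings = step(bindings)
        return bindings


def add_namespace(prefix: str) -> Step:
    """Step that prefixes every binding name with `prefix.`."""
    return lambda b: {f"{prefix}.{k}": v for k, v in b.items()}


def remove(name: str) -> Step:
    """Step that drops the binding called `name`, if present."""
    return lambda b: {k: v for k, v in b.items() if k != name}


inner = PostprocessingJob().then(add_namespace("call"))
outer = PostprocessingJob().then(remove("call.tmp"))
outer.defer_to(inner)  # the outer job returns the inner job's result

result = inner.postprocess({"out": 1, "tmp": 2})
print(result)  # {'call.out': 1}
```

The removal only works because it runs after the namespacing step. If the steps were kept in an unordered collection, nothing would be removed.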

# A conditional might only appear to depend on the variables in the
# conditional expression, but its body can depend on other stuff, and
# we need to make sure that that stuff has finished and updated the
# environment before the conditional body runs. TODO: This is because
Member

We should make a ticket for this.

"""
Given a topological order of WDL workflow node IDs, produce a list of
lists of IDs, still in topological order, where each list of IDs can be
run under a single Toil job.
Member

Well stated.
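
The contract in that docstring can be illustrated with a small sketch. This is not the PR's implementation: `coalesce_sketch` and the `can_share_job` predicate are hypothetical, and the rule used here (only consecutive declaration-like nodes are merged) is an assumption made for the example. The `decl-` prefix mirrors the job names in the log earlier in this thread; the `call-` prefix is made up.

```python
from typing import Callable, List


def coalesce_sketch(
    order: List[str],
    can_share_job: Callable[[str], bool],
) -> List[List[str]]:
    """
    Group a topological order of node IDs into consecutive chunks.

    Runs of adjacent nodes that the predicate accepts share a chunk;
    every other node gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    for node_id in order:
        # A chunk only ever holds shareable nodes, or one non-shareable
        # node, so checking the chunk's last member is enough.
        if chunks and can_share_job(node_id) and can_share_job(chunks[-1][-1]):
            chunks[-1].append(node_id)
        else:
            chunks.append([node_id])
    return chunks


order = ["decl-a", "decl-b", "call-x", "decl-c"]
chunks = coalesce_sketch(order, lambda node_id: node_id.startswith("decl-"))
print(chunks)  # [['decl-a', 'decl-b'], ['call-x'], ['decl-c']]
```

Because each chunk is a contiguous slice of the topological order, concatenating the chunks gives back the original order, and every dependency still points from an earlier chunk to a later one or stays within a chunk.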

Member

@DailyDreaming DailyDreaming left a comment

Looks great! Thanks for adding the (sub)tests.

@DailyDreaming DailyDreaming enabled auto-merge (squash) July 27, 2023 17:28
@DailyDreaming DailyDreaming merged commit 017a618 into master Aug 2, 2023
2 checks passed
Development

Successfully merging this pull request may close these issues.

Refactor WDL interpreter to run multiple WDL workflow nodes per Toil job
2 participants