Name WDL output files in a human-readable way #5046

Merged
merged 10 commits into from
Aug 14, 2024


adamnovak
Member

This fixes #5008 by cramming WDL task names into the encoded Toil file URIs, along with the UUIDs that identify source directories. When we make a directory to hold the files from a given UUID, we will name it after the task (or workflow) that uploaded the files.
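As a rough illustration of the idea (not Toil's actual encoding), the sketch below packs a file ID, a dotted task path, a source-directory UUID, and a basename into one URI and unpacks them again. The `examplefile:` scheme, the field order, and the function names `pack_uri` and `unpack_uri` are all assumptions made for this example.

```python
import uuid
from typing import Tuple
from urllib.parse import quote, unquote

# Hypothetical scheme for this sketch only; not the scheme Toil uses.
SCHEME = "examplefile:"


def pack_uri(file_id: str, task_path: str, dir_id: uuid.UUID, basename: str) -> str:
    """Encode all four fields into a single URI string."""
    parts = [file_id, task_path, str(dir_id), basename]
    # safe="" percent-encodes "/" too, so "/" can serve as the separator.
    return SCHEME + "/".join(quote(part, safe="") for part in parts)


def unpack_uri(uri: str) -> Tuple[str, str, uuid.UUID, str]:
    """Recover the fields that pack_uri() encoded."""
    if not uri.startswith(SCHEME):
        raise ValueError(f"Not an {SCHEME} URI: {uri}")
    parts = uri[len(SCHEME):].split("/")
    if len(parts) != 4:
        raise ValueError(f"Expected 4 fields, got {len(parts)}: {uri}")
    file_id, task_path, dir_id, basename = (unquote(part) for part in parts)
    return file_id, task_path, uuid.UUID(dir_id), basename
```

Because the task path travels with the file reference, whatever code later downloads the file can name the destination directory without looking anything else up.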

If I run:

toil-wdl-runner src/toil/test/wdl/miniwdl_self_test/self_test.wdl src/toil/test/wdl/miniwdl_self_test/inputs.json -o test_out

I get an output directory like this:

test_out
├── hello_caller.1.hello
│   └── Alyssa P. Hacker.txt
└── hello_caller.2.hello
    └── Ben Bitdiddle.txt

If a task uploads files from multiple directories, we append deduplicating numbers to the directory names.

Weird cases, like a scatter node in a workflow or a subworkflow uploading a file from a directory that the main workflow also references, should work, but only one WDL task path gets to name the directory we create.
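The naming behavior described above could look something like the following minimal sketch. It is not the code in this PR: the class name `DirectoryNamer`, the method `name_for`, and the `_1`, `_2` suffix format are assumptions chosen for illustration. The first task path seen for a UUID names its directory, and a name that is already taken gets a number appended.

```python
import uuid
from typing import Dict, Set


class DirectoryNamer:
    """Assign one human-readable directory name per source-directory UUID."""

    def __init__(self) -> None:
        self._name_for_dir: Dict[uuid.UUID, str] = {}
        self._used_names: Set[str] = set()

    def name_for(self, dir_id: uuid.UUID, task_path: str) -> str:
        # A UUID that already has a name keeps it, so only the first
        # task path to reference the directory gets to name it.
        if dir_id in self._name_for_dir:
            return self._name_for_dir[dir_id]
        candidate = task_path
        suffix = 0
        # Assumed suffix format: append _1, _2, ... until the name is free.
        while candidate in self._used_names:
            suffix += 1
            candidate = f"{task_path}_{suffix}"
        self._name_for_dir[dir_id] = candidate
        self._used_names.add(candidate)
        return candidate
```

With this scheme, a task that uploads from a single directory gets a directory named exactly after the task, and the numbers only appear when the same task path shows up with a second UUID.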

This adds some complexity to devirtualizing files because now you need to keep track of this additional bit of state. We might want to roll it up into an object with the caches. I had to change the share_files function to a kwarg on the standard library constructor because I neither wanted nor needed to deal with what would happen if you tried to merge this new state.
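To illustrate why a constructor kwarg sidesteps the merge problem, here is a hedged sketch. The class `ExampleStdLib` and its attributes `devirtualized_paths` and `dir_names` are hypothetical; only the parameter names `task_path` and `share_files_with` come from the signature quoted in the review. A new instance either creates fresh state or aliases another instance's state at construction time, so there is never a moment where two independent sets of state must be reconciled.

```python
from typing import Dict, Optional


class ExampleStdLib:
    """Hypothetical stand-in for a standard library object with file state."""

    def __init__(
        self,
        task_path: str,
        share_files_with: Optional["ExampleStdLib"] = None,
    ) -> None:
        self.task_path = task_path
        if share_files_with is None:
            # Fresh state: a download cache plus the UUID-to-name mapping.
            self.devirtualized_paths: Dict[str, str] = {}
            self.dir_names: Dict[str, str] = {}
        else:
            # Alias the other instance's dicts instead of copying them, so
            # both instances always see the same state and nothing is merged.
            self.devirtualized_paths = share_files_with.devirtualized_paths
            self.dir_names = share_files_with.dir_names
```

Rolling the shared dictionaries into a single object, as suggested above, would reduce this to sharing one reference.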

Since we have one copy of the download logic for outputs and for task input files, tasks should also get nicer input file names for free now.

Changelog Entry

To be copied to the draft changelog by merger:

  • WDL output files will now live in directories named after their tasks instead of UUID directories

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passes tests.
  • Make sure the PR has been reviewed since its last modification. If not, review it.
  • Merge with the GitHub "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

Member

@DailyDreaming DailyDreaming left a comment

LGTM! Just minor comments. It seems unlikely, but I just wanted to ask if there's a possibility of longer filenames (possibly too long) with the deduplication.

Thanks for adding the tests!

"""
Set up the standard library.


:param task_path: Dotted WDL name of the part of the wrokflow this library is working for.

wrokflow

:param enforce_existence: If true, then if a file is detected as
nonexistent, raise an error. Else, let it pass through
:param share_files_with: Use the same file upload and download paths as
the provided standard library.

This description makes share_files_with sound like a bool. Could it be a little more descriptive?

file_to_mountpoint: Dict[str, str],
current_directory_override: Optional[str] = None,
share_files_with: Optional[ToilWDLStdLibBase] = None
):

Could a note in the docstring be added for share_files_with?

@adamnovak adamnovak merged commit b5fd307 into master Aug 14, 2024
3 checks passed
@adamnovak adamnovak deleted the issues/5008-hide-uuids branch August 14, 2024 18:35
Development

Successfully merging this pull request may close these issues.

Name WDL output subdirectories with human-readable names instead of UUIDs
2 participants