Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix non-blocking option #354

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Fix non-blocking option #354

wants to merge 1 commit into from

Conversation

forsyth2
Copy link
Collaborator

Issue resolution

Select one: This pull request is...

  • a bug fix: increment the patch version
  • a small improvement: increment the minor version
  • an incompatible (non-backwards compatible) API change: increment the major version

1. Does this do what we want it to do?

Objectives:

  • Objective 1
  • Objective 2
  • ...
  • Objective n

Required:

  • Product Management: I have confirmed with the stakeholders that the objectives above are correct and complete.
  • Testing: I have added at least one automated test. Every objective above is represented in at least one test.
  • Testing: I have considered likely and/or severe edge cases and have included them in testing.

If applicable:

  • Testing: this pull request adds at least one new possible command line option. I have tested using this option with and without any other option that may interact with it.

2. Are the implementation details accurate & efficient?

Required:

  • Logic: I have visually inspected the entire pull request myself.
  • Logic: I have left GitHub comments highlighting important pieces of code logic. I have had these code blocks reviewed by at least one other team member.

If applicable:

  • Dependencies: This pull request introduces a new dependency. I have discussed this requirement with at least one other team member. The dependency is noted in zstash/conda, not just an import statement.

3. Is this well documented?

Required:

  • Documentation: by looking at the docs, a new user could easily understand the functionality introduced by this pull request.

4. Is this code clean?

Required:

  • Readability: The code is as simple as possible and well-commented, such that a new team member could understand what's happening.
  • Pre-commit checks: All the pre-commits checks have passed.

If applicable:

  • Software architecture: I have discussed relevant trade-offs in design decisions with at least one other team member. It is unlikely that this pull request will increase tech debt.

@forsyth2 forsyth2 added the semver: bug Bug fix (will increment patch version) label Oct 31, 2024
@forsyth2
Copy link
Collaborator Author

@golaz updates on diagnosing the issue:

Where is the non_blocking option referenced?

Debugging the call path for non_blocking, on main branch:

$ git grep -in non_blocking
create.py:92:    globus_finalize(non_blocking=args.non_blocking)
create.py:168:    if args.non_blocking:
globus.py:297:def globus_finalize(non_blocking: bool = False):
globus.py:322:    if not non_blocking:
update.py:48:    globus_finalize(non_blocking=args.non_blocking)

In create.py:

    globus_finalize(non_blocking=args.non_blocking)
    if args.non_blocking:
        args.keep = True

In globus.py:

def globus_finalize(non_blocking: bool = False):
    if not non_blocking:
        if task_id:
            globus_wait(task_id)
        if last_task_id:
            globus_wait(last_task_id)

In update.py:

    globus_finalize(non_blocking=args.non_blocking)

It looks like globus_wait is what enacts the non_blocking behavior. not non_blocking indicates that we should block, i.e., wait.

What are the call paths to get to globus_wait?

$ git grep -in globus_wait
globus.py:250:        globus_wait(task_id)
globus.py:253:def globus_wait(task_id: str):
globus.py:324:            globus_wait(task_id)
globus.py:326:            globus_wait(last_task_id)

The 2nd match is the definition of globus_wait, and the 3rd and 4th matches are the calls listed above. That leaves the first match, inside the globus_transfer function:

    if transfer_type == "get" and task_id:
        globus_wait(task_id)

That is currently always run. Meaning we always wait.

Our problem however is that non-blocking's behavior is always occurring -- that is, even when it's off, we're not waiting.

Option Expected behavior Current behavior Behaviors match?
--non-blocking on do not wait do not wait
--non-blocking off (i.e., not non_blocking) wait do not wait

So... the code is suggesting we're always waiting, but #290 tells us that we're never waiting.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Nov 4, 2024

What are the call paths to globus_transfer?

As mentioned above, the one call to globus_wait not wrapped in if not not_blocking (i.e., run if we should wait) is in globus_transfer.

Again, on main (not this PR's branch):

$ git grep -in "globus_transfer("
zstash/globus.py:160:def globus_transfer(
zstash/hpss.py:90:            globus_transfer(endpoint, url_path, name, transfer_type)

globus_transfer is called by hpss.hpss_transfer:

$ git grep -in "hpss_transfer("
zstash/hpss.py:14:def hpss_transfer(
zstash/hpss.py:111:    hpss_transfer(hpss, file_path, "put", cache, keep)
zstash/hpss.py:118:    hpss_transfer(hpss, file_path, "get", cache, False)

hpss_transfer called by hpss.hpss_put and hpss.hpss_get.

$ git grep -in "hpss_put("
zstash/create.py:90:    hpss_put(hpss, get_db_filename(cache), cache, keep=True)
zstash/hpss.py:107:def hpss_put(hpss: str, file_path: str, cache: str, keep: bool = True):
zstash/hpss_utils.py:159:            hpss_put(hpss, os.path.join(cache, tfname), cache, keep)
zstash/update.py:46:    hpss_put(hpss, get_db_filename(cache), cache, keep=True)

Continuing on in this fashion, we find the following call paths to the primary zstash commands:

globus.globus_transfer
-hpss.hpss_transfer
--hpss.hpss_put
---create.create
---hpss_utils.add_files
----create.create_database (2 appearances, 1 wrapped in try/except block for symlink checking)
-----create.create (calls `create.create_database` just before calling `hpss.hpss_put`)
----update.update_database (2 appearances, 1 wrapped in try/except block for symlink checking)
-----update.update (calls `update.update_database` a few lines before calling `hpss.hpss_put`)
---update.update 
--hpss.hpss_get
---extract.extract_database
----extract.extract
---extract.extractFiles
----extract.extract_database
-----extract.extract
---ls.ls_database
----ls.ls
---update.update_database 
----update.update

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Nov 4, 2024

@TonyB9000 Thanks for agreeing to look into this more. See my comments above for what I've found so far.

@TonyB9000
Copy link
Collaborator

@forsyth2 OK. Ill probably conduct a series of tests between acme1 and chrysalis, to determine when and if the globus blocking can be effected (a "simplest driver"). If globus blocking CAN be effected, it should be possible to do a code-wise side-by-side.

Note: Globus blocking/non-blocking should be independent of HPSS, right? I want to ensure that HPSS tape-drive does not send a weird "we're all done - nothing to see here" signal to the calling apparatus, the moment it gets invoked.

@TonyB9000
Copy link
Collaborator

@forsyth2 I cannot seem to git-clone the repo:

(dsm_rel_e2c) [bartoletti1@acme1 gitrepo]$ git clone [email protected]:E3SM-Project/zstash.git
Cloning into 'zstash'...
The authenticity of host 'github.com (140.82.116.4)' can't be established.
ECDSA key fingerprint is SHA256:p2QAMXNIC1TJYWeIOttrVc98/R1BUFWu3/LiyKgUfQM.
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
Warning: Permanently added 'github.com,140.82.116.4' (ECDSA) to the list of known hosts.
[email protected]: Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@forsyth2
Copy link
Collaborator Author

forsyth2 commented Nov 5, 2024

Globus blocking/non-blocking should be independent of HPSS, right?

I believe so, but I'd need to go through the code to really check.

The authenticity of host 'github.com (140.82.116.4)' can't be established.

I think you need to delete a line in your ssh config file. I remember needing to do something like that, but I don't recall the exact file name.


# This is the `--non-blocking` behavior, even though we did not specify it.

# Now, with changes in this PR:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this PR mainly involve propagating if not non_blocking: on the remaining globus_wait not wrapped in a conditional. But I don't think that really matters, given the summary table in #354 (comment)


# In a different window:
ls /lcrc/group/e3sm/ac.forsyth2/E3SMv2_test/zstash_extractions/v2.NARRM.historical_0151/tests/zstash
# 000000.tar 000001.tar 000002.tar 000003.tar 000004.tar 000005.tar index.db
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these tar files completed? empty?


# From https://docs.e3sm.org/zstash/_build/html/main/usage.html:
# `--maxsize MAXSIZE`` specifies the maximum size (in GB) for tar files. The default is 256 GB. Zstash will create tar files that are smaller than MAXSIZE except when individual input files exceed MAXSIZE (as individual files are never split up between different tar files).
# `--non-blocking` Zstash will submit a Globus transfer and immediately create a subsequent tarball. That is, Zstash will not wait until the transfer completes to start creating a subsequent tarball. On machines where it takes more time to create a tarball than transfer it, each Globus transfer will have one file. On machines where it takes less time to create a tarball than transfer it, the first transfer will have one file, but the number of tarballs in subsequent transfers will grow finding dynamically the most optimal number of tarballs per transfer. NOTE: zstash is currently always non-blocking.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, you want transfer to go as fast possible, so --non-blocking make sense.

BUT there are situations where the disk is nearly full, so you want to block the transfers (and delete local cache) to avoid filling up the disk space.

=> So we want to hold up generating a new tarball until the previous tarball was transferred over (and its local copy is deleted)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver: bug Bug fix (will increment patch version)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

zstash is always non-blocking
2 participants