Addition of a new method to StageOutImpl to log details about failing gfal commands #12081

Merged
merged 2 commits into from
Oct 17, 2024

@anpicci (Contributor) commented Aug 23, 2024

Fixes #11731

Status

In development

Description

This PR introduces a new method in StageOutImpl.py, further defined in Backends/GFAL2Impl.py, to address @stlammel's request to enrich the information stored in the logs for a failing StageOut command.

Is it backward compatible (if not, which system does it affect?)

YES

Related PRs

None

External dependencies / deployment changes

No

@anpicci added the "Technical Debt" label (used to track issues that address technical needs internal to WM team) Aug 23, 2024
@anpicci requested a review from amaltaro August 23, 2024 16:12
@anpicci self-assigned this Aug 23, 2024
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 31 new failures
    • 47 tests deleted
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 3 warnings and errors that must be fixed
    • 19 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 8 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15167/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Andrea, please find a few comments along the code.

src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
src/python/WMCore/Storage/StageOutImpl.py (outdated, resolved)
src/python/WMCore/Storage/StageOutImpl.py (resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 40 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15201/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 93 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 16 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15202/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests no longer failing
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15203/artifact/artifacts/PullRequestReport.html

@todor-ivanov (Contributor) left a comment

Hi @anpicci, I think the changes look good at first glance. I did not check the PR description to see whether this is the final version, though. In any case, I asked only one question inline, along the same line of thought.

src/python/WMCore/Storage/StageOutImpl.py (resolved)
copyCommandDict['source'] = self.createFinalPFN(sourcePFN)
copyCommandDict['destination'] = self.createFinalPFN(targetPFN)

copyCommand = self.copyCommand % copyCommandDict
result += copyCommand

logging.debug("Actual command which failed: %s", copyCommand)
Contributor

We don't know if the command failed (retries > 0) or not (retry=0), so it is a misleading message.

echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Source PFN: {source}"
echo "Target PFN: {destination}"
echo "Cleaning up failed file: {remove_command}"
Contributor

This command is now missing the file removal logic, as you deleted that in favor of simply printing it.

Contributor

You might have missed this comment.

src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
"""
result = "#!/bin/bash\n"

copyCommandDict = {'checksum': '', 'options': '', 'source': '', 'destination': ''}
Contributor

AFAICT, the actual construction of the stage out command (without any of the exit status check and file removal) is exactly the same between this method and createStageOutCommand. If that is correct, then I think we should separate that logic into a different method and call that method from both createStageOutCommand() and createDebuggingCommand(). This way we guarantee that the code/command will never diverge by accident.

Another option, though it is a larger step, would be to save the actual stage out command (without the exit status check/file removal overload) in an instance variable and just access that inside createDebuggingCommand(). This gives us even less room for potential differences in the command construction for real execution and debugging execution (plus, it avoids many internal calls).
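The two options in this comment can be sketched together as follows. This is only an illustration under stated assumptions, not the code merged in this PR: the class name GFAL2Sketch, the method buildCopyCommand, the attribute lastCopyCommand, and the command template are all hypothetical, and only the dictionary keys (checksum, options, source, destination) come from the snippet under review.

```python
class GFAL2Sketch:
    """Hypothetical stand-in for a stage-out backend; not the WMCore class."""

    # Assumed template; the real copy command template may differ.
    COPY_TEMPLATE = "gfal-copy %(checksum)s %(options)s %(source)s %(destination)s"

    def __init__(self):
        # Second option from the comment: remember the last command built.
        self.lastCopyCommand = None

    def buildCopyCommand(self, sourcePFN, targetPFN, options="", checksum=""):
        """Single place where the bare copy command is constructed."""
        fields = {"checksum": checksum, "options": options,
                  "source": sourcePFN, "destination": targetPFN}
        # Collapse the extra spaces left behind by empty fields.
        command = " ".join((self.COPY_TEMPLATE % fields).split())
        self.lastCopyCommand = command
        return command

    def createStageOutCommand(self, sourcePFN, targetPFN, options="", checksum=""):
        """Script for the real transfer: shared command plus exit-status capture."""
        copyCommand = self.buildCopyCommand(sourcePFN, targetPFN, options, checksum)
        return "#!/bin/bash\n%s\nEXIT_STATUS=$?\n" % copyCommand

    def createDebuggingCommand(self, sourcePFN, targetPFN, options="", checksum=""):
        """Script for diagnostics: it only reports the same shared command."""
        copyCommand = self.buildCopyCommand(sourcePFN, targetPFN, options, checksum)
        return "#!/bin/bash\necho 'Command under investigation: %s'\n" % copyCommand
```

Because both scripts embed the string returned by buildCopyCommand, the command reported while debugging cannot drift from the one actually executed.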

src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 2 tests no longer failing
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 90 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 14 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15224/artifact/artifacts/PullRequestReport.html

@anpicci requested a review from amaltaro September 24, 2024 08:15
@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 10 warnings and errors that must be fixed
    • 92 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 21 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15231/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 90 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 18 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15232/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15233/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15234/artifact/artifacts/PullRequestReport.html

@todor-ivanov (Contributor)

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 4 tests no longer failing
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 19 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15239/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Andrea, please find some comments inline.
In addition, please review the ticket requirements. AFAICT, it is still missing 2 pieces of desired information:

  • location of the gfal-cp binary (to be retrieved with which gfal-cp)
  • version of gfal-cp (please check the documentation to see how to get it, perhaps a --version)
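For the first item, one way to obtain the same information as the shell's which from Python is shutil.which. The sketch below is an assumption about how this could be done, not the approach taken in the PR, and describeBinary is a hypothetical helper name. The version query is left out on purpose, since the exact flag is not confirmed in this thread.

```python
import shutil


def describeBinary(name):
    """Return a log-friendly line with the resolved location of an executable."""
    # shutil.which searches PATH the same way the shell would resolve the name.
    path = shutil.which(name)
    if path is None:
        return "%s: not found in PATH" % name
    return "%s: %s" % (name, path)
```

The returned line can be passed straight to logging.info() before the debugging command is executed.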

logging.error("Maximum number of retries exhausted. Further details on the failed command reported below.")
command = self.createDebuggingCommand(sourcePFN, targetPFN, options, checksums)
self.executeCommand(command)
raise stageOutEx
Contributor

I guess this will have to be tested, but I wonder if we should do raise stageOutEx from None to avoid chaining exceptions? Just reading this code, I am not really sure what will happen.
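To illustrate the question, the sketch below shows how Python treats the two forms. It assumes the raise happens while another exception is being handled, which is the only situation where implicit chaining applies. StageOutError here is a hypothetical stand-in, not the WMCore exception class.

```python
class StageOutError(Exception):
    """Hypothetical stand-in for the stage-out exception type."""


def failWithChaining():
    """Plain raise inside an except block: the ValueError is attached as context."""
    try:
        raise ValueError("debugging command failed")
    except ValueError:
        raise StageOutError("stage out failed")


def failWithoutChaining():
    """raise ... from None: the context is suppressed when the traceback is printed."""
    try:
        raise ValueError("debugging command failed")
    except ValueError:
        raise StageOutError("stage out failed") from None
```

In the first case the traceback includes the "During handling of the above exception, another exception occurred" section. In the second case only the StageOutError is shown.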

echo "ERROR: gfal-copy exited with $EXIT_STATUS"
echo "Source PFN: {source}"
echo "Target PFN: {destination}"
echo "Cleaning up failed file: {remove_command}"
Contributor

You might have missed this comment.


proxyInfo = None
try:
    stdout, stderr, returnCode = runCommand("voms-proxy-info", timeout=10)
Contributor

I am not sure it will work either, as I am not sure it can find the proxy file. We might have to append this command with -file $X509_USER_PROXY, but these things need to be tested.

destination=copyCommandDict['destination'],
setup_info=self.setups
)
if proxyInfo is not None:
Contributor

I feel like just printing the proxyInfo content in the try/except above will give log viewers a much better experience than passing it over (the actual content already retrieved) to a sub-process to execute all together. Especially if that can fail, because then it will print the whole command that failed.
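A minimal sketch of this suggestion is shown below: the diagnostic output is logged from Python as soon as it is retrieved, instead of being embedded in a shell script. It uses the standard subprocess module as a stand-in for the runCommand helper seen in the snippet, and captureDiagnostics is a hypothetical name.

```python
import logging
import subprocess


def captureDiagnostics(argv, timeout=10):
    """Run a diagnostic command and log its output directly; return it or None."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout, check=False)
    except (OSError, subprocess.TimeoutExpired) as ex:
        # Missing binary, permission problem, or timeout: report and move on.
        logging.warning("Could not run %s: %s", argv[0], ex)
        return None
    if proc.returncode != 0:
        logging.warning("%s exited with %s: %s", argv[0], proc.returncode, proc.stderr)
        return None
    logging.info("Output of %s:\n%s", " ".join(argv), proc.stdout)
    return proc.stdout
```

A call such as captureDiagnostics(["voms-proxy-info"]) would then log the proxy details on its own. Whether the proxy file has to be passed explicitly is the open question raised in the earlier comment and would need to be tested.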

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 92 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 20 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15242/artifact/artifacts/PullRequestReport.html

@anpicci (Contributor Author) commented Sep 27, 2024

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 3 new failures
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15243/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 91 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 17 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15244/artifact/artifacts/PullRequestReport.html

@anpicci (Contributor Author) commented Sep 27, 2024

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
  • Python3 Pylint check: succeeded
    • 89 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 15 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15245/artifact/artifacts/PullRequestReport.html

@anpicci (Contributor Author) commented Oct 16, 2024

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 2 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 88 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15319/artifact/artifacts/PullRequestReport.html

@anpicci force-pushed the devb_11731 branch 2 times, most recently from 2c576e8 to 3f5b620, October 16, 2024 15:41
@anpicci (Contributor Author) commented Oct 16, 2024

Hi @stlammel, I have added some newlines, as can be seen here.

Regarding the stdout/stderr, I agree with your sentiment. However, changing it would require modifications to a dependency that also affects other WMCore scripts. As a result, it is out of the scope of this PR, and I agreed with @amaltaro to open a new issue to fix the confusion between stdout and stderr appearing in the logs.

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 88 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15325/artifact/artifacts/PullRequestReport.html

@anpicci requested a review from amaltaro October 16, 2024 16:20
@stlammel

Ok, thanks Andrea! - Stephan

@anpicci (Contributor Author) commented Oct 16, 2024

@amaltaro everything should be fine with my additions in this PR, and it is ready to be reviewed.

@amaltaro (Contributor) left a comment

@anpicci these changes are looking good to me.
However, I would suggest revisiting the pycodestyle report in Jenkins and resolving those warnings (it looks like all of them have been added by this PR).
Once you provide those fixes, feel free to squash the commits accordingly.

for retryCount in range(self.numRetries + 1):
    try:
        logging.info("Running the stage out...")
        self.executeCommand(command)
        break
        break # This line won't be reached due to the raised error
Contributor

Unless there is no exception, in which case it does get executed.
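The control flow under discussion can be sketched in isolation as follows. This is a simplified illustration, not the StageOutImpl code: runWithRetries is a hypothetical function, RuntimeError stands in for the stage-out exception, and numRetries is assumed to be zero or greater.

```python
import logging


def runWithRetries(action, numRetries):
    """Call action() up to numRetries + 1 times; return the number of attempts used."""
    lastError = None
    for retryCount in range(numRetries + 1):
        try:
            action()
            break  # reached only when action() returned without raising
        except RuntimeError as ex:
            lastError = ex
            logging.warning("Attempt %d failed: %s", retryCount + 1, ex)
    else:
        # The loop ended without a break, so every attempt raised.
        raise lastError
    return retryCount + 1
```

The break is skipped whenever action() raises, because control jumps straight to the except clause. On a successful attempt it is what ends the loop.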

src/python/WMCore/Storage/StageOutImpl.py (outdated, resolved)
src/python/WMCore/Storage/Backends/GFAL2Impl.py (outdated, resolved)
test/python/WMCore_t/Storage_t/Backends_t/GFAL2Impl_t.py (outdated, resolved)
@anpicci (Contributor Author) commented Oct 17, 2024

test this please

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 88 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15330/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 1 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 88 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15331/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 6 new failures
    • 1 changes in unstable tests
  • Python3 Pylint check: failed
    • 2 warnings and errors that must be fixed
    • 79 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 4 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15332/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 81 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15333/artifact/artifacts/PullRequestReport.html

@anpicci (Contributor Author) commented Oct 17, 2024

@amaltaro it should be ready now

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failures
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 81 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 5 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15334/artifact/artifacts/PullRequestReport.html

@amaltaro (Contributor) left a comment

Thanks Andrea. It is looking better now, and apparently you even made changes that were not required. However, I still see some easy fixes reported in the Jenkins report section "Warnings from pycodestyle (pep8) by file" and we should make use of those.

Maybe it is something that we can improve in the contribution guidelines, giving people the relevant tools/directions on how they can check those out in their local nodes instead of relying on Jenkins.

We need this for the new agent release, so let us move on with this and keep these in mind for future contributions. Thanks!

@amaltaro removed the "PR: Do not merge yet", "PR: squashing needed", and "Technical Debt" labels Oct 17, 2024
@amaltaro merged commit 64c182f into dmwm:master Oct 17, 2024
2 of 4 checks passed
@anpicci (Contributor Author) commented Oct 17, 2024

@amaltaro I agree. In particular, the pylint comment about f-strings is worth addressing on a regular basis. It seems that using .format() is not sufficient to resolve that comment, which means we should adopt the f"string{python_var}" approach.
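As a small illustration of the point above: the pylint message in question is presumably consider-using-f-string (C0209), which is reported for both %-formatting and str.format() on ordinary strings. The variable names below are hypothetical.

```python
python_var = "gfal-copy"

# Both of these forms are reported by the check.
percentStyle = "command: %s" % python_var
formatStyle = "command: {}".format(python_var)

# The f-string form is the one the check accepts.
fStringStyle = f"command: {python_var}"
```

All three produce the same text, so the change is purely stylistic. Arguments passed lazily to logging calls, as in logging.debug("... %s", value), are a separate case with their own pylint checks.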

@stlammel

We/WMCore is probably picking up gfal2 from the container, right? Do we log the full container name that is being used, or the gfal2 version/RPM we use? (It just came to me overnight that /usr/bin/gfal-copy may be different for the same OS container request, as we select by generic names, which are links that change.)

- Stephan

Successfully merging this pull request may close these issues.

WMCore stage-out script improvement
5 participants