Factory monitor not updating after factory lockup #324

Open
mmascher opened this issue Jul 31, 2023 · 1 comment
Labels: BUG (For BUGS), factory-mon (for affected component), factoryops (Factory Operations stakeholder), Low (Low priority)

Comments

@mmascher
Contributor

Describe the bug
The UCSD factory machine locked up for an unknown reason (possibly a cooling issue in the room). Once the machine recovered, the monitor was not available. It turned out that some monitor cache files were empty, which the factory did not expect.

To Reproduce
Run the factory for a while, then empty one of the .ftstpk cache files, for example:
/var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
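The underlying error can probably be reproduced in isolation, without a running factory. Below is a minimal sketch using only the Python standard library. The helper names and the temporary file are hypothetical stand-ins and are not GlideinWMS code; the temporary file plays the role of an emptied .ftstpk cache file.

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickle file with no protection against empty files."""
    with open(path, "rb") as fo:
        return pickle.load(fo)


def reproduce_empty_cache_error():
    """Create a zero-length stand-in cache file and return the message of the error raised on load."""
    # Hypothetical stand-in for a real .ftstpk file, created empty on purpose.
    fd, path = tempfile.mkstemp(suffix=".ftstpk")
    os.close(fd)
    try:
        load_pickle(path)
    except EOFError as err:
        return str(err)
    finally:
        os.remove(path)
    return None


if __name__ == "__main__":
    # Prints the same message that appears in the traceback of this report.
    print(reproduce_empty_cache_error())
```

The message matches the `EOFError: Ran out of input` line in the traceback under "Additional context".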

Expected behavior
The factory should handle empty cache files gracefully, and the monitor should remain available.
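One possible way to get this behavior is to treat an empty or unreadable cache file the same as a missing one, so that the caller rebuilds the cache instead of failing. The sketch below illustrates the idea only; `load_cache_or_none` is a hypothetical helper and not the fix that was actually applied in GlideinWMS.

```python
import os
import pickle


def load_cache_or_none(path):
    """Return the unpickled cache content, or None if the file is missing, empty, or unreadable.

    Hypothetical helper: a None result tells the caller to rebuild the cache.
    """
    try:
        # A zero-length file is what a lockup can leave behind; skip it early.
        if os.path.getsize(path) == 0:
            return None
        with open(path, "rb") as fo:
            return pickle.load(fo)
    except (OSError, EOFError, pickle.UnpicklingError):
        # Missing file, permission problem, or truncated/corrupt pickle data.
        return None
```

A caller could then fall back to re-parsing the original log whenever the function returns None. Note that a cache that legitimately stores `None` would be indistinguishable from a failed load in this sketch.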

Info (please complete the following information):

  • Priority: low
  • Stakeholders: FactoryOps
  • Components: factory monitoring

Additional context

...
[2023-07-29 23:37:07,142] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
[2023-07-29 23:37:07,218] ERROR: glideFactoryEntry:1819: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1386, in loadCache
    data = util.file_pickle_load(fname)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 306, in file_pickle_load
    conditional_raise(mask_exceptions)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/util.py", line 295, in file_pickle_load
    data = pickle.load(fo)
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/glideinwms/factory/glideFactoryEntry.py", line 1817, in perform_work_v3
    log_stats[credential_username + ":" + client_int_name].load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 671, in load
    obj.load()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 82, in load
    return self.loadCache()
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 104, in loadCache
    self.data = loadCache(self.cachename)
  File "/usr/lib/python3.6/site-packages/glideinwms/lib/condorLogParser.py", line 1388, in loadCache
    raise RuntimeError("Could not read %s" % fname)
RuntimeError: Could not read /var/log/gwms-factory/server/entry_CMSHTPC_T2_US_Purdue_Negishi_Op/condor_activity_20230729_UCSD-CMS-Frontend.main.log.fecmsucsd.ftstpk
[2023-07-29 23:38:34,834] DEBUG: glideFactoryEntry:1058: Checking security credentials for client UCSD-CMS-Frontend.main
...
@mmascher mmascher self-assigned this Jul 31, 2023
@github-actions github-actions bot added the BUG, factory-mon, factoryops, and Low labels Jul 31, 2023
@mambelli
Contributor

mambelli commented Oct 3, 2023

This is similar to Issue #338, fixed in PR #339. It was also visible in the upgrades tested under EL9.
A protection was added before invoking the RRD libraries.
@mmascher To test under 3.10.5 and close.


2 participants