Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StepChainParentageFix ReqMgr2 task getting killed #11693

Closed
amaltaro opened this issue Aug 24, 2023 · 8 comments · Fixed by #11694
Closed

StepChainParentageFix ReqMgr2 task getting killed #11693

amaltaro opened this issue Aug 24, 2023 · 8 comments · Fixed by #11694
Assignees

Comments

@amaltaro
Copy link
Contributor

Impact of the bug
ReqMgr2 CherryPy threads (reqmgr2-tasks k8s service)

Describe the bug
Resuming investigation of the high number of workflows stuck in announced status:
https://its.cern.ch/jira/browse/CMSPROD-26

and as a follow up of this ticket:
#11260

I found out gazillions of DBS related errors, example [1] and also something more worrisome, which is actual a restart of the service itself inside the kubernetes pod [2]. These records are available in the log reqmgr2-20230824-reqmgr2-tasks-77664bfc6c-xwcrc.log. It is not clear to me whether this restart gets somehow triggered by CherryPy framework itself, or from a different cmsweb related watchdog.

The effect of these restarts is that the service never manages to converge, given that it first deals with all the dataset parantage, and only then it tries to update the relevant request status to archived.

How to reproduce it
Not sure

Expected behavior
The service should not crash!

Second, I am inclined to deal with one workflow at a time, even if that means that some datasets will be looked twice.

Third, we need to understand those DBS errors. It could be a problem on our client side.

Additional context and error message
[1]

DBS Server error: [{'error': {'reason': 'DBSError Code:113 Description:DBS validation error, e.g. input parameter does not match lexicon rules Function:dbs.fileparents.InsertFileParentsBlockTxt Message: Error: nested DBSError Code:115 De
scription:DBS decode record failure, e.g. mailformed JSON Function:dbs.DecodeValidatorError Message: Error: validation error', 'message': '', 'function': 'dbs.fileparents.InsertFileParents', 'code': 110, 'stacktrace': '\ngoroutine 138212
03362 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0xae4c20?, 0xc001d6b590?}, 0x6e, {0x0, 0x0}, {0xa12f61, 0x21})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.(*API).InsertFileParents(0xc00032
c460)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/fileparents.go:261 +0x14b\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, 0xc000123f80}, 0xc000bf5b00, {0x9ff6a1, 0xb})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:549 +0x12f7
\ngithub.com/dmwm/dbs2go/web.FileParentsHandler({0xae81b0?, 0xc000123f80?}, 0x454134?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:792 +0x45\nnet/http.HandlerFunc.ServeHTTP(0x870000c000d62101?, {0xae81b0?, 0xc000123f80?}, 0x0?)\
n\t/usr/local/go/src/net/http/server.go:2109 +0x2f\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, 0xc000123f80?}, 0xc000887520?)\n\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 +0x38\nnet/http.HandlerFunc.Serve
HTTP(0x94c100?, {0xae81b0?, 0xc000123f80?}, 0x11?)\n\t/usr/local/go/src/net/http/'}, 'http': {'method': 'POST', 'code': 400, 'timestamp': '2023-08-24 12:09:49.180341384 +0000 UTC m=+1516348.589639680', 'path': '/dbs/prod/global/DBSWriter
/fileparents', 'user_agent': 'DBSClient/Unknown/', 'x_forwarded_host': 'dbs-prod.cern.ch', 'x_forwarded_for': '137.138.31.95', 'remote_addr': '188.185.29.95:56452'}, 'exception': 400, 'type': 'HTTPError', 'message': 'DBSError Code:110 De
scription:DBS DB insert record error Function:dbs.fileparents.InsertFileParents Message: Error: nested DBSError Code:113 Description:DBS validation error, e.g. input parameter does not match lexicon rules Function:dbs.fileparents.InsertF
ileParentsBlockTxt Message: Error: nested DBSError Code:115 Description:DBS decode record failure, e.g. mailformed JSON Function:dbs.DecodeValidatorError Message: Error: validation error'}]

[2]

[24/Aug/2023:12:10:50] reqmgr2-tasks-77664bfc6c-xwcrc 127.0.0.1 "GET /reqmgr2/data/info HTTP/1.1" 200 OK [data: 296 in 706 out 99937 us ] [auth: OK "" "" ] [ref: "" "Go-http-client/1.1" ]
[24/Aug/2023:12:11:53]  WATCHDOG: server exited with exit code signal 9... restarting
[24/Aug/2023:12:11:53]  INFO: final CherryPy configuration: {'engine.autoreload.on': False,
@vkuznet
Copy link
Contributor

vkuznet commented Aug 24, 2023

Alan, DBS provides full report of what has happened and you need to use your python skills to make it human friendly. Here is one suggestion:

  • put DBS error into /tmp/erro.rjson
  • fire python seesion with the following codebase:
>>> import json
>>> data=json.load(open('/tmp/error.json'))
>>> data[0]
>>> import pprint
>>> pprint.pprint(data[0])
{'error': {'code': 110,
           'function': 'dbs.fileparents.InsertFileParents',
           'message': '',
           'reason': 'DBSError Code:113 Description:DBS validation error, e.g. '
                     'input parameter does not match lexicon rules '
                     'Function:dbs.fileparents.InsertFileParentsBlockTxt '
                     'Message: Error: nested DBSError Code:115 Description:DBS '
                     'decode record failure, e.g. mailformed JSON '
                     'Function:dbs.DecodeValidatorError Message: Error: '
                     'validation error',
           'stacktrace': '\n'
                         'goroutine 13821203362 [running]:\n'
                         'github.com/dmwm/dbs2go/dbs.Error({0xae4c20?, '
                         '0xc001d6b590?}, 0x6e, {0x0, 0x0}, {0xa12f61, 0x21})\n'
                         '\t'
                         '/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 '
                         '+0x99\n'
                         'github.com/dmwm/dbs2go/dbs.(*API).InsertFileParents(0xc00032c460)\n'
                         '\t'
                         '/go/src/github.com/vkuznet/dbs2go/dbs/fileparents.go:261 '
                         '+0x14b\n'
                         'github.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, '
                         '0xc000123f80}, 0xc000bf5b00, {0x9ff6a1, 0xb})\n'
                         '\t'
                         '/go/src/github.com/vkuznet/dbs2go/web/handlers.go:549 '
                         '+0x12f7\n'
                         'github.com/dmwm/dbs2go/web.FileParentsHandler({0xae81b0?, '
                         '0xc000123f80?}, 0x454134?)\n'
                         '\t'
                         '/go/src/github.com/vkuznet/dbs2go/web/handlers.go:792 '
                         '+0x45\n'
                         'net/http.HandlerFunc.ServeHTTP(0x870000c000d62101?, '
                         '{0xae81b0?, 0xc000123f80?}, 0x0?)\n'
                         '\t/usr/local/go/src/net/http/server.go:2109 +0x2f\n'
                         'github.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, '
                         '0xc000123f80?}, 0xc000887520?)\n'
                         '\t'
                         '/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 '
                         '+0x38\n'
                         'net/http.HandlerFunc.ServeHTTP(0x94c100?, '
                         '{0xae81b0?, 0xc000123f80?}, 0x11?)\n'
                         '\t/usr/local/go/src/net/http/'},
 'exception': 400,
 'http': {'code': 400,
          'method': 'POST',
          'path': '/dbs/prod/global/DBSWriter/fileparents',
          'remote_addr': '188.185.29.95:56452',
          'timestamp': '2023-08-24 12:09:49.180341384 +0000 UTC '
                       'm=+1516348.589639680',
          'user_agent': 'DBSClient/Unknown/',
          'x_forwarded_for': '137.138.31.95',
          'x_forwarded_host': 'dbs-prod.cern.ch'},
 'message': 'DBSError Code:110 Description:DBS DB insert record error '
            'Function:dbs.fileparents.InsertFileParents Message: Error: nested '
            'DBSError Code:113 Description:DBS validation error, e.g. input '
            'parameter does not match lexicon rules '
            'Function:dbs.fileparents.InsertFileParentsBlockTxt Message: '
            'Error: nested DBSError Code:115 Description:DBS decode record '
            'failure, e.g. mailformed JSON Function:dbs.DecodeValidatorError '
            'Message: Error: validation error',
 'type': 'HTTPError'}

Now, you can easily read it and it is very clear. In particular, read message section which provides nested error explanation of what happened on server side. Here we have validation error which means that your data input did not confirm with DBS data-type exception, i.e. if you supply string where it expects integer. The error also clearly indicate the client, provide timestamp, IPs, and event fully provides a stack trace of DBS function calls such that you can easily find lines in a DBS code where it happen. It also provide DBS and HTTP error codes, in former case you can look-up DBS error code from DBS codebase which in this case was 110 and you can see its explanation here https://github.com/dmwm/dbs2go/blob/master/dbs/errors.go#L49 And, if you inspect message you can see how this code happens in chain of DBS calls where it reports each time proper code, i.e. 110->113->115 and again you can look-up all those codes in dbs/errors.go file. I hope this will help to investigate the issue. Bottom line, check your client data, the JSON which it seems to me contains wrong data-type for some of its attributes.

@amaltaro
Copy link
Contributor Author

Valentin, thank you for providing these details. However, I have to strongly disagree with your statement that:

 'message': 'DBSError Code:110 Description:DBS DB insert record error '
            'Function:dbs.fileparents.InsertFileParents Message: Error: nested '
            'DBSError Code:113 Description:DBS validation error, e.g. input '
            'parameter does not match lexicon rules '
            'Function:dbs.fileparents.InsertFileParentsBlockTxt Message: '
            'Error: nested DBSError Code:115 Description:DBS decode record '
            'failure, e.g. mailformed JSON Function:dbs.DecodeValidatorError '
            'Message: Error: validation error',

is "easy to read and very clear"! It might be easier for you because you implemented it, but I doubt any mortal client would manage to consume this information.

My comments and questions are:

  • What is a "nested DBSError"?
  • what is the origin - read the first exception that triggered the others - in this stacked error?
  • it failed in the InsertFileParents, yes, because I used insertFileParents. But what failed in this API? Missing data? Data not compliant with the server lexicon? If so, which data? What I provided? What I should have provided?
  • returning a stack trace with registers/memory address to the end client is not helpful

This is not the first problem that we can't decode from the error message, example:
dmwm/dbs2go#98

Is there any chance that the service actually logs some more interesting and helpful information?

@vkuznet
Copy link
Contributor

vkuznet commented Aug 24, 2023

Alan

  • nested DBSError comes from the stack trace shown in error message, i.e. one function call another, etc. For each function call we report the error, that's why it is nested
  • The origin comes from the left of the message, e.g. DBSError Code:110 ... and as you read it you go deeper and deeper in the function call nesteness
  • the insertFileParents is top level API reported in error which calls data validation, etc. The validation error reports parameter does not match lexicon rules so there is a mismatch of input and lexicon rules, and then it reports validation error mailformed JSON ... validation error . In other words check your JSON you send to DBS and check data-types in it and/or patterns. Since there are lots of checks applied and your data input can be huge (e.g. hundreds of MB for some blocks) we do not return/print it. You need to look at your client and the data it sends and validate it.
  • finally, returned stack is useful to debug DBS server itself, and not so useful for the client.

To resolve the issue you need to look at JSON data from your client and check it. I understand that it is not ideal error report but it is the best we can do based on amount of DBS load and dynamic size of our data. If you need to know lexicon rules DBS uses they can be found for reader, writer, migration DBS servers over here https://github.com/dmwm/dbs2go/tree/master/static We had a long standing issue with lexicon convergence between DBS and WMCore and it is still unresolved. In particular, we may need to look at

@vkuznet
Copy link
Contributor

vkuznet commented Aug 24, 2023

and if you look at a stack which provides <file name:line_number>, you will see that error happened at dbs/fileparents.go:261 which shows that actual issue is file parent block of the JSON which failed validation. And, if you'll follow the code you'll see that API which fails is InsertFileParentsBlockTxt which insert every single file parent, see loop https://github.com/dmwm/dbs2go/blob/master/dbs/fileparents.go#L361 and in this loop (I think) validation error happen, see line https://github.com/dmwm/dbs2go/blob/master/dbs/fileparents.go#L368 which uses FileParents struct. This struct has required fields:

  • json:"this_file_id" validate:"required,number,gt=0"
  • json:"parent_file_id" validate:"required,number,gt=0"
    which means that either field should be present in JSON and it is required and it should have number larger than zero. I suggest that you look-up your JSON data and see if it provides parents in it and check what JSON key-values for these parents are.

At the end, stack info is very valuable to debug process as it points to exact lines of the code which someone can inspect to understand the logic of the code flow.

@amaltaro
Copy link
Contributor Author

An update on the service getting auto-restarted. I am still not sure whether it is triggered by a single workflow, but I did see this workflow

cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766

causing a kill command of the CherryPy application multiple times. As it can be seen in this grafana plot:
https://monit-grafana.cern.ch/d/M2U6T6uGk/kube-eagle-k8s-metrics?orgId=11&from=1692890827369&to=1692977227369&refresh=30m&viewPanel=223

memory footprint goes from a standard ~500MB all the way up to 5GB and beyond, making the service to restart itself.
Note that this memory pattern is present either in the current production implementation and in the one proposed in this PR.

I created an ad-hoc script to deal only with a (that) workflow, which is running for more than 30min now in dry-run mode. I am soon going to update the PR with that script such that we have an easier way to deal with this in the future. A long term solution will be tracked in a new GH issue.

@amaltaro
Copy link
Contributor Author

It turns out I misused the psutil module and I failed to fetch the max memory used when resolving that workflow. From grafana schedd dashboard, I believe it was around the 5GB.

I just wanted to say that it took almost 2h to fix this one workflow as well:

INFO:root:  time spent with workflow cmsunified_task_SMP-RunIISummer20UL18wmLHEGEN-00542__v1_T_230314_093054_8766 is: 8274.428401231766 secs
INFO:root:Final memory info: pfullmem(rss=56569856, vms=283303936, shared=6991872, text=4096, lib=0, data=124456960, dirty=0, uss=53334016, pss=53663744, swap=0) with PSS: 51.177734375 MB
INFO:root:Final memory info: pcputimes(user=3270.38, system=10.77, children_user=0.0, children_system=0.0, iowait=0.05)

Note that many of the 3k+ blocks actually failed to get the parentage resolved, since many files lack their parentage, thus failing on the server side. Right now the production instance of reqmgr2-tasks is supposed to skip the workflow above, such that it can gracefully process the others.

@amaltaro
Copy link
Contributor Author

Quick update from patching reqmgr2-tasks and continuously skipping the large workflow from the comment above:

  • we managed to archive around 1k workflows
  • yet, 2334 are at the moment stuck due to ParentageResolved issues (failing to have partial block inserted into DBS server)
  • and another 379 workflows are stuck at the moment due to tape transfers not yet completed

Today we have a chat with Dennis (DBS expert) and we will try to get this issue resolved on the server side in the next week or two.

@amaltaro
Copy link
Contributor Author

Moved from 2023 Q3 to Q4, as we plan to finish it in the coming days/weeks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

2 participants