large delay in file injection #9543

Open
vlimant opened this issue Feb 14, 2020 · 20 comments

Comments
@vlimant (Contributor) commented Feb 14, 2020

Impact of the bug
Workflows are ending up in "completed" with missing files/statistics. I talked to @amaltaro about this; apparently he did not have time to put it on anyone's radar before leaving.

Describe the bug
Workflows are reaching "completed" while files are still being injected.

How to reproduce it
https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_task_EXO-RunIIFall17wmLHEGS-02448__v1_T_200211_194626_1071
https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_EXO-RunIIFall17wmLHEGS-02448

and many others in "filemismatch" under https://cms-unified.web.cern.ch/cms-unified//assistance.html are likely affected

Expected behavior
#8148 for example

@todor-ivanov (Contributor):

@vlimant Thanks for reporting it. I am looking at the issue now. May come back with further questions later, though.

@todor-ivanov (Contributor):

I still do not know why it happens, but I would like to add some more information about the consequences. The workflow in question is an ACDC, and the original request is this one:
https://cmsweb.cern.ch/reqmgr2/fetch?rid=pdmvserv_task_EXO-RunIIFall17wmLHEGS-02448__v1_T_191031_011220_8129

Comparing the number of events in the request with the numbers (both events and files) in the output datasets, there is some difference:

----------------------------------------------------
==> pdmvserv_task_EXO-RunIIFall17wmLHEGS-02448__v1_T_191031_011220_8129
----------------------------------------------------

|***************************************************|
|                         | CouchDB | PhEDEx | DBS  |
|---------------------------------------------------|
| Same dataset name       | ok      | ok     | ok   |
| Same dataset size       | NOPE    | NOPE   | NOPE | 
| Same number of blocks   | --      | ok     | ok   |
| Same number of files    | NOPE    | NOPE   | NOPE | 
| Same number of events   | NOPE    | --     | NOPE |
| Same number of lumis    | ok      | --     | ok   |
| Same PhEDEx Node Name   | --      | NOPE   | NOPE |
|***************************************************|

======> Request information from reqmgr2 db: 
    InputDataset: 
    RequestNumEvents: 50000
    Runtime: 2439.70333333
    TotalEstimatedJobs: 13
    TotalInputEvents: 50000
    TotalInputFiles: 0
    TotalInputLumis: 50
    lumis: 50
...
======> Couch output dataset info: 
    /LQToBMu_M-1000_single_TuneCP2_PSweights_13TeV-madgraph-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM: 
        dsetSize: 3084801131
        events: 46000
        numFiles: 7
...
======> DBS info: 
    /LQToBMu_M-1000_single_TuneCP2_PSweights_13TeV-madgraph-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM: 
        dsetSize: 3361332847
        events: 50000
        lumis: 50
        numBlocks: 8
        numFiles: 8
        prep_id: EXO-RunIIFall17MiniAODv2-03804
...
======> PhEDEx info: 
    /LQToBMu_M-1000_single_TuneCP2_PSweights_13TeV-madgraph-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM: 
        dsetSize: 3361332847
        numBlocks: 8
        numFiles: 8

DBS and PhEDEx agree on the number of files, while CouchDB shows a discrepancy. Querying CouchDB about the ACDC clearly shows that the ACDC covered the event deficit and that the missing file is there:

    output: 
        /LQToBMu_M-1000_single_TuneCP2_PSweights_13TeV-madgraph-pythia8/RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM: 
            events: 4000
            nFiles: 1
            size: 276531716
            tasks: 
                /cmsunified_ACDC0_task_EXO-RunIIFall17wmLHEGS-02448__v1_T_200211_194626_1071/EXO-RunIIFall17DRPremix-03855_0/EXO-RunIIFall17DRPremix-03855_1/EXO-RunIIFall17MiniAODv2-03804_0/EXO-RunIIFall17MiniAODv2-03804_0MergeMINIAODSIMoutput

Because of the delayed injection of the files (after the workflow has been completed), CouchDB does not update the information about the output. But at least the requested number of events is present.
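
As a side note, the DBS numbers in such a report can be cross-checked directly with the dbs3-client; a minimal sketch, assuming the prod/global DBSReader instance and the filesummaries fields as I recall them:

```python
from dbs.apis.dbsClient import DbsApi

# Read-only DBS instance (assumption: prod/global is the one used in the report above)
dbsApi = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')

dataset = ('/LQToBMu_M-1000_single_TuneCP2_PSweights_13TeV-madgraph-pythia8/'
           'RunIIFall17MiniAODv2-PU2017_12Apr2018_94X_mc2017_realistic_v14-v1/MINIAODSIM')

# filesummaries returns a single record with the dataset-level counters
summary = dbsApi.listFileSummaries(dataset=dataset, validFileOnly=1)[0]
print("dsetSize:  %s" % summary['file_size'])
print("events:    %s" % summary['num_event'])
print("lumis:     %s" % summary['num_lumi'])
print("numBlocks: %s" % summary['num_block'])
print("numFiles:  %s" % summary['num_file'])
```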

@amaltaro (Contributor):

I suggest we follow this up in this original issue:
#8148

which is related to the asynchronous completion of workflows, where the workflow goes to completed while files are still left to be injected into DBS/PhEDEx. Closing this issue as a duplicate.

Todor, from your report, files are fully available in PhEDEx and DBS. I think Sharad tagged us on a couple of JIRA tickets in the last few days, reporting what Jean-Roch reported privately to me on slack.

When you have a chance, could you please have a look at one of those tickets and check the DBS3Upload component logs just to make sure DBS is working properly and injections are happening without much delay? Please report on the other thread and/or JIRA issues.

@vlimant (Contributor, Author) commented Feb 19, 2020

@amaltaro I think this issue has value on its own: we are not talking about the hard development work of synchronizing "completed" with all files being injected (#8148), but rather about the fact that DBSUploader is lagging more than 2 days behind in injecting all files. I'll let you guys reopen it to address the actual problem here, which has been impacting production over the last several weeks.

2 days is already a very long grace period IMO, given that it's only about making POSTs to DBS and PhEDEx.

@vlimant (Contributor, Author) commented Feb 19, 2020

The 2-day grace period is defined here: https://github.com/CMSCompOps/WmAgentScripts/blob/master/Unified/checkor.py#L1051
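
For readers not familiar with Unified, the check behind that grace period boils down to a time comparison of roughly this shape (a schematic sketch with made-up names; the real logic lives in checkor.py at the link above and is more involved):

```python
import time

INJECTION_GRACE_PERIOD = 2 * 24 * 3600  # the two days discussed above, in seconds

def file_mismatch_is_real(workflow_completed_at, dbs_file_count, expected_file_count):
    """Only treat a file-count mismatch as an error once the grace period has expired."""
    grace_expired = (time.time() - workflow_completed_at) > INJECTION_GRACE_PERIOD
    return grace_expired and dbs_file_count < expected_file_count
```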

@vlimant (Contributor, Author) commented Feb 19, 2020

https://its.cern.ch/jira/browse/CMSCOMPPR-11418 shows a > 1 week delay in injection, if you want to look at something urgent, @todor-ivanov.

@todor-ivanov (Contributor):

Looking at this exact one now.

@todor-ivanov reopened this Feb 19, 2020
@todor-ivanov (Contributor) commented Feb 20, 2020

In the DBSUploader component log I find warnings like this one:

2020-02-16 17:22:19,034:140114997479168:WARNING:DBSUploadPoller:Got a proxy error for block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#f00d851c-d75c-4211-9651-c456f0784501.

This particular block was retried a few lines below, but for many others I do not see a retry of the injection:

2020-02-16 17:28:02,499:140114997479168:INFO:DBSUploadPoller:About to insert block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#f00d851c-d75c-4211-9651-c456f0784501
2020-02-16 17:28:03,404:140114997479168:WARNING:DBSUploadPoller:Block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#f00d851c-d75c-4211-9651-c456f0784501 already exists. Marking it as uploaded.

The aforementioned dataset still shows a discrepancy between the blocks injected into PhEDEx and those in DBS, but it concerns other blocks, not this one:

cmst1@vocms0192:/data/srv/wmagent/current $ python checkDsetFileCount.py -d /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM
Dataset: /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM
Phedex file count :  36775
DBS file count    :  36751
 - valid files    :  36751
 - invalid files  :  0
 - valid+invalid  :  36751
Blocks in PhEDEx but not in DBS:  set([u'/BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#9aea6679-5127-4ebe-8fe5-8246eca8dfc0', u'/BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#1d8c92eb-79de-4f9d-b5aa-38a517390533', u'/BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#e4647650-e232-4ba5-b44c-0e6180d77be5'])
Blocks in DBS but not in PhEDEx:  set([])
Block with file mismatch: /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#1d8c92eb-79de-4f9d-b5aa-38a517390533
	PhEDEx: 1		DBS: None
Block with file mismatch: /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#9aea6679-5127-4ebe-8fe5-8246eca8dfc0
	PhEDEx: 6		DBS: None
Block with file mismatch: /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#e4647650-e232-4ba5-b44c-0e6180d77be5
	PhEDEx: 17		DBS: None

The situation is similar for all the blocks I find with a proxy WARNING in the logs.
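
checkDsetFileCount.py is an operations script rather than part of WMCore, but the comparison it performs can be approximated as below; this is a sketch that assumes the PhEDEx datasvc blockreplicas JSON layout and the DBS listBlocks detail fields behave as described, and the certificate paths are placeholders:

```python
import requests
from dbs.apis.dbsClient import DbsApi

CERT = ('/path/to/usercert.pem', '/path/to/userkey.pem')  # placeholder credentials

dataset = ('/BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/'
           'RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM')

# Files per block according to PhEDEx (datasvc JSON API)
phedexUrl = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockreplicas'
resp = requests.get(phedexUrl, params={'dataset': dataset}, cert=CERT, verify=False)
phedexBlocks = {blk['name']: int(blk['files'])
                for blk in resp.json()['phedex']['block']}

# Files per block according to DBS
dbsApi = DbsApi(url='https://cmsweb.cern.ch/dbs/prod/global/DBSReader')
dbsBlocks = {blk['block_name']: blk['file_count']
             for blk in dbsApi.listBlocks(dataset=dataset, detail=True)}

print("Blocks in PhEDEx but not in DBS: ", set(phedexBlocks) - set(dbsBlocks))
print("Blocks in DBS but not in PhEDEx: ", set(dbsBlocks) - set(phedexBlocks))
for name in sorted(set(phedexBlocks) | set(dbsBlocks)):
    if phedexBlocks.get(name) != dbsBlocks.get(name):
        print("Block with file mismatch: %s\n\tPhEDEx: %s\t\tDBS: %s"
              % (name, phedexBlocks.get(name), dbsBlocks.get(name)))
```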

So in the code, [1] is the place where the proxy error is caught and turned into a WARNING, and the relevant block is added to a check list. I am now looking for the place where this list is used, and whether the injections for the blocks in it are supposed to be retried.

[1]
https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/DBS3Buffer/DBSUploadPoller.py#L122-L127
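
For context, my reading of the code in [1] is roughly the behaviour sketched below (paraphrased, not the actual WMCore implementation): on a proxy error the block is only appended to a "to be checked" list, and a later pass asks DBS whether the block made it in despite the error; anything that neither shows up there nor gets re-inserted is where the delay builds up.

```python
import logging

def uploadBlocks(dbsApi, blockDumps):
    """Sketch of the insert/verify cycle around the proxy-error handling (not WMCore code)."""
    blocksToCheck = []
    for blockName, blockDump in blockDumps.items():
        try:
            logging.info("About to insert block %s", blockName)
            dbsApi.insertBulkBlock(blockDump)  # the heavy DBSWriter bulkblocks call
        except Exception as exc:
            if 'Proxy Error' in str(exc):
                # The frontend dropped the request; DBS may or may not have the block.
                logging.warning("Got a proxy error for block %s.", blockName)
                blocksToCheck.append(blockName)
            else:
                raise

    # Later: check whether the "proxy error" blocks actually made it into DBS.
    for blockName in blocksToCheck:
        if dbsApi.listBlocks(block_name=blockName):
            logging.warning("Block %s already exists. Marking it as uploaded.", blockName)
        # otherwise the block remains pending and should be retried on a later cycle
```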

@yuyiguo (Member) commented Feb 20, 2020

@todor-ivanov
I assume you were uploading to DBS prod, right? What was the time frame when you saw the problem? Which nodes or DNs did you use to upload blocks to DBS? Please give me as many details as possible.
If you could copy/paste the DBS error and timestamp you got, it would help.

2020-02-16 17:28:02,499:140114997479168:INFO:DBSUploadPoller:About to insert block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#f00d851c-d75c-4211-9651-c456f0784501
2020-02-16 17:28:03,404:140114997479168:WARNING:DBSUploadPoller:Block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#f00d851c-d75c-4211-9651-c456f0784501 already exists. Marking it as uploaded.

The block was already in DBS. What was the problem?

Here is the DBS monitoring page, and it shows a healthy status:
DBS monitoring

Here is the DBS prod writer monitoring; it looked fine except for some high load this morning, FNAL time.
DBS writer
Thanks,
yuyi

@todor-ivanov (Contributor):

Hi @yuyiguo, thanks for taking a look. I picked a bad example before. But here is another block which is missing in DBS:

2020-01-28 08:28:35,596:139757365487360:WARNING:DBSUploadPoller:Got a proxy error for block /EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO#90a68d7f-b744-4bce-9ef6-253dfed91b8f

See the comparison between DBS and PhEDEx for this dataset:

Dataset: /EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO
Phedex file count :  8199
DBS file count    :  8046
 - valid files    :  8046
 - invalid files  :  0
 - valid+invalid  :  8046
Blocks in PhEDEx but not in DBS:  set([u'/EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO#c1d73877-fb3e-456c-a22c-796ddd584e8b', u'/EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO#90a68d7f-b744-4bce-9ef6-253dfed91b8f'])
Blocks in DBS but not in PhEDEx:  set([])
Block with file mismatch: /EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO#c1d73877-fb3e-456c-a22c-796ddd584e8b
	PhEDEx: 152		DBS: None
Block with file mismatch: /EGamma/Run2018D-SinglePhotonJetPlusHOFilter-12Nov2019_UL2018-v4/RAW-RECO#90a68d7f-b744-4bce-9ef6-253dfed91b8f
	PhEDEx: 1		DBS: None

Please find a list of proxy errors from the agent logs attached below. Plenty of them are retried, I can see that now. But for some blocks we are seeing quite significant delays before they make it into DBS.
blocks.proxyErr.txt

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov
I checked several DBS logs and did not find any error or timeout from the block upload API bulkblocks.
I got an email alert that the CMSWEB frontend was heavily loaded. Could the errors be coming from the frontends?
In order to investigate the problem further, I need details: error code, error message, timestamp, CN and DBS server name.

@todor-ivanov (Contributor):

Thank you @yuyiguo for checking it. If the situation on the DBS side is calm, then I am in a very difficult position. In case you are still willing to dig further into DBS, you can find all the information I can provide in the file attached to my previous message. The timestamps are there (please notice that they are spread over the last two months, simply because that is how much log history we have). The error code is 502 again, the same as in the problem with the lost dataset parentage information; this usually does come from the frontend, but only because it simply cannot forward the request to the backend (either due to timeouts from the backend or some other communication issue). Machine names are also there. The block IDs that have been tried are there as well - most of them succeeded after N retries, but some of them never make it to DBS. I am currently trying to understand that exact part: whether the retries for some of the blocks really stop at some point. But this part of the code has not been touched lately, so I am skeptical.

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov
Just saw your attached file. Let me follow the timestamps to find out more.

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov
vocms0250: 2020-02-17 15:42:37,121:140114997479168:WARNING:DBSUploadPoller:Got a proxy error for block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#3c50965e-a4d9-42dd-a006-83b1430c93a1.
What was the CN used for this request at 2020-02-17 15:42:37?

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov
What is the timezone of these times?

@todor-ivanov (Contributor):

@yuyiguo the times are all in CERN/Geneva time.
the CN should be:
wmagent/vocms0280.cern.ch

@todor-ivanov (Contributor) commented Feb 21, 2020

Sorry @yuyiguo, I need to correct myself here:

the times are all in CERN/Geneva time.

The correct statement should be:

All CERN machines (vocms0***) are in Geneva time
All FNAL machines (cmsgwms-submit*) are in Chicago time

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov
Thanks !

@yuyiguo (Member) commented Feb 21, 2020

@todor-ivanov

I only have eight days of log files available to me; they cover Feb 14 to 21.
I used the first block in your attached file as an example:
vocms0250: 2020-02-17 15:42:37,121:140114997479168:WARNING:DBSUploadPoller:Got a proxy error for block /BdToDStarMuNu_ToD0Pi_HardQCD_TuneCP5_13TeV-pythia8-evtgen/RunIIFall18GS-102X_upgrade2018_realistic_v11-v1/GEN-SIM#3c50965e-a4d9-42dd-a006-83b1430c93a1.

Here is what I found in one of the log files. The message matched node vocms0250/188.184.86.128 and the time is close to 2020-02-17 15:45.
INFO:cherrypy.access:REQUEST [17/Feb/2020 13:54:24] 188.184.86.128 47836 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:13:54:25] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 39109 in 4 out 756829 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:REQUEST [17/Feb/2020 14:26:04] 188.184.86.128 56146 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:14:26:05] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 31201 in 4 out 292460 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:REQUEST [17/Feb/2020 15:37:36] 188.184.86.128 46966 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:15:37:37] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 20809 in 4 out 247050 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
**INFO:cherrypy.access:REQUEST [17/Feb/2020 15:48:09] 188.184.86.128 54000 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:15:48:09] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 2607655 in 255 out 390192 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]**
INFO:cherrypy.access:REQUEST [17/Feb/2020 17:41:24] 188.184.86.128 45126 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:REQUEST [17/Feb/2020 17:41:25] 188.184.86.128 49164 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:17:41:26] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 5999 in 4 out 2176891 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:[17/Feb/2020:17:41:28] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 205739 in 4 out 3158761 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:REQUEST [17/Feb/2020 18:14:05] 188.184.86.128 55096 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:18:14:06] vocms0165.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 3311 in 4 out 208248 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
The 400 Bad Request entry [data: 2607655 in 255 out 390192 us ] has a much bigger input (2607655 in) than the others, and it ended up with HTTP 400.

INFO:cherrypy.access:[17/Feb/2020:12:26:11] vocms0163.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 2613442 in 255 out 133515382 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]

vocms0766/dbs/DBSGlobalWriter-20200216-vocms0766.log:INFO:cherrypy.access:[16/Feb/2020:17:28:03] vocms0766.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 2611039 in 255 out 360011 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
These are the 2nd and 3rd entries in your file, but the timing seems late compared to the second one in your file. All of them had bad input. Why?

grep bulkblocks vocms0136/dbs/*.log | grep vocms0280 | grep "400 Bad Request"

vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:35:51] vocms0136.cern.ch 137.138.152.19 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 6685 in 182 out 167820 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:36:19] vocms0136.cern.ch 137.138.152.19 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 18125 in 200 out 35475 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:36:23] vocms0136.cern.ch 137.138.152.19 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 3694 in 184 out 76390 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:54:47] vocms0136.cern.ch 137.138.152.31 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 3757 in 183 out 34609 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:54:48] vocms0136.cern.ch 137.138.152.31 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 3339 in 205 out 62440 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
vocms0136/dbs/DBSGlobalWriter-20200221-vocms0136.log:INFO:cherrypy.access:[21/Feb/2020:02:54:49] vocms0136.cern.ch 137.138.152.31 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 400 Bad Request [data: 3335 in 205 out 20000 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]

The above requests were all from the CN wmagent/vocms0280 and they were not big blocks. They all failed with "400 Bad Request". How was the data generated?

How does WMAgent call the bulkblocks API? This is a very heavy API that inserts everything regarding a block into DBS, including files, lumi sections, parentages and so on. Within one second (17/Feb/2020 12:13:08), on only one of the DBS servers, WMAgent sent SIX blocks to insert into DBS. The agents may be competing for resources with each other. I would suggest you stretch out the uploading a bit. This may actually speed things up, because a block finishes in one upload instead of needing retries.

INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 60496 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 35868 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 38584 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 2831 in 4 out 90604 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 4348 in 4 out 70541 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 4220 in 4 out 96751 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 37856 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 60580 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 2950 in 4 out 96441 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 3141 in 4 out 38843 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]
INFO:cherrypy.access:REQUEST [17/Feb/2020 12:13:08] 188.184.86.128 38644 POST /bulkblocks [/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch] [{}]
INFO:cherrypy.access:[17/Feb/2020:12:13:08] vocms0136.cern.ch 188.184.86.128 "POST /dbs/prod/global/DBSWriter/bulkblocks HTTP/1.1" 200 OK [data: 4924 in 4 out 77913 us ] [auth: OK "/DC=ch/DC=cern/OU=computers/CN=wmagent/vocms0280.cern.ch" "" ] [ref: "" "DBSClient/3.10.0/" ]

Let me know if you need more info from me. So far, I have seen DBS doing well. The errors were all "400 Bad Request". You may want to check the input data.
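
To illustrate the suggestion about stretching out the uploads: instead of firing several bulkblocks calls within the same second, the agent could pace them, along these lines (a sketch only; the real DBS3Upload component is multi-threaded and more involved, and the pause value here is arbitrary):

```python
import time
import logging

def pacedBulkInsert(dbsApi, blockDumps, pause=2.0):
    """Insert blocks one at a time with a small pause, instead of bursting them."""
    for blockName, blockDump in blockDumps.items():
        try:
            dbsApi.insertBulkBlock(blockDump)
            logging.info("Inserted block %s", blockName)
        except Exception as exc:
            logging.warning("Insert of block %s failed: %s", blockName, exc)
        time.sleep(pause)  # spread the load across the DBS writers
```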

@todor-ivanov (Contributor):

Hi @yuyiguo, thanks a lot. What you provide here is quite exhaustive and helpful. Just to confirm, when you say:

This is a very heavy API that insert everything regarding a block into DBS, this includes files, lumi sections, parentages and so on. In one second 17/Feb/2020 12:13:08 on only one of DBS servers, WMAgent send SIX blocks to insert into DBS.

Do you mean that the situation is similar to the one with the heavy API here [1] too?

This indeed may explain why we have the injection retries, and the more we have, the slower we get. Of course we may expect to reach some saturation point, but such a situation could explain what P&R sees as files being present in PhEDEx but delayed in DBS.

[1]
#9537
dmwm/DBS#613

@amaltaro removed this from the July_2020 milestone Sep 27, 2021
@klannon added the Technical Debt and Stakeholders labels and removed the Technical Debt label Sep 29, 2021