WMAgent fails to inject some files into rucio #9763

Closed
nsmith- opened this issue Jun 25, 2020 · 21 comments · Fixed by #9809

Comments

@nsmith-

nsmith- commented Jun 25, 2020

Impact of the bug
Bug causes DBS-rucio inconsistency.

Describe the bug
Unified detected a mismatch between DBS and Rucio (well, Rucio/PhEDEx) in the number of files for a NanoAOD output dataset. Indeed, if one compares the DBS block file listing with the output of

rucio list-content cms:/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/RunIIFall17NanoAODv7-PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/NANOAODSIM#77be8101-f733-4a03-b8df-a6bd3a2e1d8b

one sees that the following two files are in the block in DBS but not in Rucio:

/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root
/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/31AA4395-4218-334E-AEF9-96082A6867D8.root

Both files are present on storage at T2_US_Wisconsin, the origin site for the block.

How to reproduce it
Unknown

Expected behavior
All files in DBS should be part of the block in Rucio.

@nsmith-
Author

nsmith- commented Jun 25, 2020

@FernandoGarzon found this initially while working through the filemismatch backlog in Unified.

@amaltaro
Contributor

Hi @nsmith- @FernandoGarzon thanks for reporting it.
I believe it's related to an old issue: #8148
or a similar one: #9543

which basically report the same temporary inconsistency between DBS and PhEDEx.

I'll look into it once I finish the bug-fix for input data placement. Meanwhile, please keep this ticket up-to-date in case you see that things went into a consistent state.

@dmielaikaite

dmielaikaite commented Jun 30, 2020

Hi @amaltaro ,
We found one more example:
/store/mc/RunIIAutumn18NanoAODv7/Wprimetotb_M2400W480_LH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/230000/23AE2536-2EC3-4D41-8B09-EF65978009CA.root

DBS link

rucio:
rucio list-file-replicas cms:/store/mc/RunIIAutumn18NanoAODv7/Wprimetotb_M2400W480_LH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/230000/23AE2536-2EC3-4D41-8B09-EF65978009CA.root
2020-06-30 13:49:41,964 ERROR Data identifier not found. Details: Data identifier 'cms:/store/mc/RunIIAutumn18NanoAODv7/Wprimetotb_M2400W480_LH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/230000/23AE2536-2EC3-4D41-8B09-EF65978009CA.root' not found

@dmielaikaite

And here are the workflows that are stuck because of it:
assistance-filemismatch

@FernandoGarzon

FernandoGarzon commented Jul 7, 2020

Hi @amaltaro

A few more examples:

cms:/store/mc/RunIISummer19UL16MiniAODAPV/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v8-v1/270000/34A729BB-989D-F44B-8A75-96F042DBF408.root
cms:/store/mc/RunIIFall17NanoAODv7/Wprimetotb_M4000W40_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/230000/C21AD0C2-E567-8C46-AAC9-C0ACA1A622CE.root
cms:/store/mc/RunIIFall17NanoAODv7/Wprimetotb_M5000W50_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/10000/3A5C9323-103E-3441-8AA0-F397EC770109.root
cms:/store/mc/RunIIFall17NanoAODv7/Wprimetotb_M4400W880_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/10000/80564586-DB78-7B4B-87F4-6E3B576BFA23.root
cms:/store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/2AE884F4-31AD-A947-B5C1-DE84DD7AAFCF.root
cms:/store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/A526F9ED-E7B6-F64A-89E4-27B1BCC4DDAD.root
cms:/store/data/Run2016C/MuonEG/NANOAOD/Nano02Dec2019-v1/20000/85941805-78AD-1140-9BC4-51F313E0F670.root
cms:/store/data/Run2016C/MuonEG/NANOAOD/Nano02Dec2019-v1/20000/B3F6F1FF-4E26-1144-9CBC-82359B6F8C29.root
cms:/store/data/Run2016C/MuonEG/NANOAOD/Nano02Dec2019-v1/20000/933693B8-99C5-1140-B0F6-8996A4C41E48.root
cms:/store/data/Run2018B/MinimumBias5/NANOAOD/Nano02Dec2019-v1/240000/7DB055BA-A5CD-2748-92C1-3C8A5F005CF1.root
cms:/store/data/Run2018B/MinimumBias5/NANOAOD/Nano02Dec2019-v1/240000/1A578A80-511E-A14F-9376-24E0BAD1D2DF.root
cms:/store/data/Run2018B/MinimumBias5/NANOAOD/Nano02Dec2019-v1/240000/71D54310-9D1E-624D-94E3-281689691F45.root
cms:/store/mc/RunIIFall17NanoAODv7/Wprimetotb_M5600W1680_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/230000/1E449F53-3A32-6D43-AD62-4DF9A7587195.root
cms:/store/mc/RunIIFall17NanoAODv6/BuToMuMuK_BMuonFilter_SoftQCDnonD_TuneCP5_13TeV-pythia8-evtgen/NANOAODSIM/PU2017_12Apr2018_Nano25Oct2019_102X_mc2017_realistic_v7-v1/240000/EAC56EE9-1EFB-7047-83B9-C55F97B737D6.root
cms:/store/mc/RunIIFall17NanoAODv7/SMS-T2tt_mStop-350to400_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_102X_mc2017_realistic_v8_ext1-v1/10000/C3B6FF99-F65E-014B-8C46-2EFBDE6AC411.root
cms:/store/mc/RunIIFall17NanoAODv7/SMS-T2tt_mStop-350to400_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_102X_mc2017_realistic_v8_ext1-v1/10000/1FFA6EEA-06D8-6D48-BBF6-5C13A4C84729.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/C615667B-D98E-6243-9F76-8E6708948A10.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/A1BAC45B-673B-9441-BFDE-45F3CFE6BA15.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/509D78C0-9E26-5944-9D91-47F470415599.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/9FB337A3-7AFB-7F4D-9774-AEF6FFA0AD49.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/99FA4E38-ADD8-834A-A714-90DB18A1FFBD.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-T6ttWW_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/E728EC41-D0CA-C044-9F43-C6371EC4EBCD.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-TChiStauStau_x0p95_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/472ABED0-3377-CD4E-AE41-74408B959EB1.root
cms:/store/mc/RunIIAutumn18NanoAODv7/SMS-TChiStauStau_x0p95_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall18Fast_Nano02Apr2020_102X_upgrade2018_realistic_v21-v1/10000/64905808-7386-684B-8F79-04A6EBD78308.root
cms:/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root
cms:/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/31AA4395-4218-334E-AEF9-96082A6867D8.root

All of these are valid in DBS.

@FernandoGarzon

FernandoGarzon commented Jul 7, 2020

Here are a few files that are valid in DBS and in Rucio but do not have a physical replica yet:

/store/mc/RunIIFall17NanoAODv7/Wprimetotb_M4400W880_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/10000/EBDEA156-F0BC-9144-AD53-F516F74F0A24.root

/store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/20000/7559D271-9426-4342-AC5D-1823C8AA6129.root

/store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/D4B3F955-6768-C747-B536-46D93A2BBB85.root

/store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/42526BA7-A04C-C246-B69B-C85D61B4C0F6.root


$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/mc/RunIIFall17NanoAODv7/Wprimetotb_M4400W880_RH_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/PU2017_12Apr2018_Nano02Apr2020_102X_mc2017_realistic_v8-v1/10000/EBDEA156-F0BC-9144-AD53-F516F74F0A24.root
[ERROR] Server responded with an error: [3011] No valid location found

$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/20000/7559D271-9426-4342-AC5D-1823C8AA6129.root
[ERROR] Server responded with an error: [3011] No valid location found

$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/D4B3F955-6768-C747-B536-46D93A2BBB85.root
[ERROR] Server responded with an error: [3011] No valid location found

$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/data/Run2016G/ZeroBias/NANOAOD/Nano02Dec2019-v1/240000/42526BA7-A04C-C246-B69B-C85D61B4C0F6.root
[ERROR] Server responded with an error: [3011] No valid location found

@amaltaro
Contributor

amaltaro commented Jul 7, 2020

Nick, Fernando, @nsmith- @FernandoGarzon
it looks like we have a bigger problem here!

I just had a look at vocms0251 RucioInjector logs and I see tons of failures since June 20, e.g.:

2020-06-20 21:27:51,900:139979273836288:INFO:RucioInjectorPoller:Preparing to insert replicas into Rucio...
2020-06-20 21:27:51,936:139979273836288:ERROR:Rucio:Failed to add replicas for: [{'adler32': 'ed6e0cf', 'state': 'A', 'bytes': 1729751147, 'name': '/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root', 'scope': 'cms'}] and block: /SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/RunIIFall17NanoAODv7-PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/NANOAODSIM#77be8101-f733-4a03-b8df-a6bd3a2e1d8b. Error: Provided object does not match schema.
Details: Problem validating dids : u'ed6e0cf' does not match '^[a-fA-F\\d]{8}$'

Failed validating 'pattern' in schema['items']['properties']['adler32']:
    {'description': 'adler32',
     'pattern': '^[a-fA-F\\d]{8}$',
     'type': 'string'}

On instance[0]['adler32']:
    u'ed6e0cf'
2020-06-20 21:27:51,936:139979273836288:INFO:RucioInjectorPoller:Starting closeBlocks method

this seems to be related to the changes Eric was asking us to validate a week or two ago (which went fine in testbed somehow).

In short, it seems none of the NANO data is managing to get injected into Rucio since the Lexicon DID validation was put in place. Apologies for not spotting it before.

@nsmith-
Author

nsmith- commented Jul 7, 2020

Uh, it looks to me like the adler32 is what is failing validation. Why is it not 8 characters? I would guess a leading zero is being dropped somewhere.
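
The schema failure can be reproduced outside Rucio with nothing but the pattern quoted in the log. The sketch below is only an illustration in Python 3: the helper name `is_valid_adler32` is hypothetical, and `0ed6e0cf` is simply the failing value left-padded with one zero, on the assumption that a leading zero was dropped.

```python
import re

# Pattern copied from the schema error quoted in the RucioInjector log.
ADLER32_PATTERN = re.compile(r'^[a-fA-F\d]{8}$')


def is_valid_adler32(value):
    """Hypothetical helper: True if the string is exactly 8 hex characters."""
    return ADLER32_PATTERN.match(value) is not None


print(is_valid_adler32('ed6e0cf'))   # False: only 7 characters
print(is_valid_adler32('0ed6e0cf'))  # True: same value, zero-padded to 8
```

Running the same check on a checksum before injection would flag short values without needing a round trip to the server.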

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

I've been looking into how this adler32 checksum gets calculated, and I initially thought it came from the CMSSW framework. That does not seem to be the case, so I'm investigating other parts of the code.
Could you please confirm whether this validation of the adler32 field is something new in Rucio? Or has it been in place since day 1?
From the WMCore side, we have not made any changes on that front at all (and it always worked well in PhEDEx, though of course it could be that they don't really validate that field).

@nsmith-
Author

nsmith- commented Jul 8, 2020

As far as I can tell, the adler32 field validation has been in place since the beginning. I'm looking at what has been injected into Rucio by WMCore, and it seems new files whose adler32 starts with the digit 0 continue to be injected.

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

Alright, I think I found where exactly the adler32 checksum gets calculated, here:
https://github.com/dmwm/WMCore/blob/master/src/python/Utils/FileTools.py#L16

Now that we know where to look, I transferred one of the files that fails to get inserted into Rucio:

2020-06-20 21:27:51,936:139979273836288:ERROR:Rucio:Failed to add replicas for: [{'adler32': 'ed6e0cf', 'state': 'A', 'bytes': 1729751147, 'name': '/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root', 'scope': 'cms'}] and block: /SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/RunIIFall17NanoAODv7-PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/NANOAODSIM#77be8101-f733-4a03-b8df-a6bd3a2e1d8b. Error: Provided object does not match schema.
Details: Problem validating dids : u'ed6e0cf' does not match '^[a-fA-F\\d]{8}$'

Failed validating 'pattern' in schema['items']['properties']['adler32']:
    {'description': 'adler32',
     'pattern': '^[a-fA-F\\d]{8}$',
     'type': 'string'}

On instance[0]['adler32']:
    u'ed6e0cf'

and the file is here:

amaltaro@lxplus721:~/workarea $ xrdcp root://cmsxrootd.fnal.gov//store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root .
[1.611GB/1.611GB][100%][==================================================][3.81MB/s]   
amaltaro@lxplus721:~/workarea $

And here is the output of my script, which executes the calculateChecksums function and yields the same adler32 checksum:

amaltaro@lxplus721:~/workarea $ python checkCksum.py 
adler32 is: ed6e0cf
cksum is: 3199116811

As for the solution, I'm going to change that function so that it adds leading zeroes and always returns an 8-character checksum.

In addition to that, we also need to patch the RucioInjector component to add leading zeroes to what has already been calculated and persisted in the database.

@nsmith- do you see any problem with this approach?
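
A minimal sketch of the two parts of that proposal, in Python 3. It is not the WMCore implementation: `read_chunks`, `adler32_hex` and `pad_adler32` are hypothetical names used only for illustration. The first part formats a freshly computed checksum with a fixed width of 8; the second left-pads a string that was already persisted.

```python
import zlib


def read_chunks(path, chunk_size=4 * 1024 * 1024):
    """Yield the contents of a file in fixed-size binary chunks."""
    with open(path, 'rb') as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            yield chunk


def adler32_hex(chunks):
    """Return the adler32 of the chunks as an 8-character hex string."""
    value = 1  # adler32 starting value
    for chunk in chunks:
        value = zlib.adler32(chunk, value)
    # '%08x' keeps leading zeros, unlike slicing the output of hex().
    return '%08x' % (value & 0xffffffff)


def pad_adler32(stored):
    """Left-pad an already persisted checksum string to 8 characters."""
    return stored.zfill(8)


print(adler32_hex([b'a']))     # 00620062
print(pad_adler32('ed6e0cf'))  # 0ed6e0cf
```

For a file on disk the call would be `adler32_hex(read_chunks(path))`. Padding a stored value does not change the number it represents, only its textual form.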

@nsmith-
Author

nsmith- commented Jul 8, 2020

No, I think this is OK. But I'm a bit confused why the leading zeros are stripped sometimes but not others, as I see recently injected files with a leading 0.

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

Could it be that those are not really leading 0, it just happened to be 0?

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

Actually, I wonder whether this modification, applied to files that were already created, could cause any sort of problem in the DM system. The reason is that DBS and Rucio would then have different adler32 checksum values...

@nsmith-
Author

nsmith- commented Jul 8, 2020

Indeed DBS has a 7-digit adler32 for your example file: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/files?detail=1&logical_file_name=/store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root

To confirm, indeed it does have a leading zero:

$ xrdadler32 root://cms-xrd-global.cern.ch//store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root
0ed6e0cf root://cms-xrd-global.cern.ch//store/mc/RunIIFall17NanoAODv7/SMS-T1tttt_TuneCP2_13TeV-madgraphMLM-pythia8/NANOAODSIM/PUFall17Fast_Nano02Apr2020_pilot_102X_mc2017_realistic_v8_ext1-v1/10000/5CDFB6D9-D1B1-B242-9504-F5547B3DE763.root

I checked the files in the list from Fernando above and not all of them have this issue, but a fair number do:

adler32_length count
6     2
7    11
8    13

@nsmith-
Author

nsmith- commented Jul 8, 2020

I looked in TMDB and I do find a lot of examples where the adler32 is shorter than 8 characters. I can only assume that everywhere it is used (essentially just FTS) it is converted back to a proper 32-bit value. One puzzle though, it seems new files are being inserted into TMDB with checksums with leading zeros even today, just as new files are going into Rucio OK. Why does this bug not affect all agents?
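
To illustrate what "converted back to a proper 32-bit value" could look like, here is a small Python 3 sketch that compares two checksum strings by numeric value rather than as text. Whether FTS or any other consumer actually does this is the assumption made above, not something verified here, and `same_adler32` is a hypothetical helper.

```python
def same_adler32(left, right):
    """Compare two adler32 hex strings by value, ignoring padding and case."""
    return int(left, 16) == int(right, 16)


# DBS value versus the xrdadler32 output for the example file.
print(same_adler32('ed6e0cf', '0ed6e0cf'))  # True: same 32-bit number
print('ed6e0cf' == '0ed6e0cf')              # False: the strings differ
```

A consumer that compares the raw strings would see a mismatch between the 7-character and 8-character forms, while one that parses them as integers would not.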

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

From what I read, adler32 is calculated with successive additions of chunks of the data. So I think one possible scenario would be that the sum goes beyond 0xffffffff, resulting in something like 0xXXX00ffffff, which would then, perhaps, have the leading 0. Just speculation though.
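
For reference, the standard Adler-32 definition (as given in the zlib format specification, RFC 1950) keeps two running sums, each reduced modulo 65521, and packs them as B * 65536 + A. Under that definition the result always fits in 32 bits, and the first hex digit is zero whenever B is below 0x1000. The Python 3 sketch below is a textbook version for illustration only; `adler32_reference` is a hypothetical name and this is not a statement about how the WMCore function is written.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32_reference(data):
    """Textbook Adler-32: two running sums, each reduced modulo 65521."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a


value = adler32_reference(b'a')
print(hex(value)[2:])               # 620062: hex() drops the leading zeros
print('%08x' % value)               # 00620062: fixed width keeps them
print(value == zlib.adler32(b'a'))  # True: agrees with zlib
```

Since B is roughly uniform over 0..65520 for large inputs, about one checksum in sixteen would be expected to start with a zero digit, which would explain why only some files are affected.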

Let's first clear this problem from the system, then we can investigate the other files that are apparently missing in Rucio.

@amaltaro
Contributor

amaltaro commented Jul 8, 2020

The file backlog should be cleared. @FernandoGarzon please let us know if you see further files missing there.
There is still one development that we need to make regarding DBS/Rucio to keep things properly synchronized, so I do not rule out the possibility of temporarily having a few more.

@amaltaro amaltaro reopened this Jul 8, 2020
@amaltaro
Contributor

@FernandoGarzon @nsmith- I haven't heard anything else here - since we fixed the adler32 issue - could you please check whether things are fine on your side, and if so, close this ticket? Thanks

@FernandoGarzon

Hello

I've been running a consistency check every day for the last week, and I just made a final run. I haven't found a single file with the issue described. It seems fine to me.

@amaltaro
Contributor

Thanks for confirming it, Fernando. Please reopen it if the issue pops up again in the coming days/weeks.
