
Preprocessing a file at FNAL leads to an unclear exception #1138

Closed
bockjoo opened this issue Jul 24, 2024 · 22 comments · Fixed by #1168
Labels
question Further information is requested

Comments


bockjoo commented Jul 24, 2024

I am trying to preprocess a file at FNAL with coffea 2024.6.1, but I get this exception:

Traceback (most recent call last):
  File "/cmsuf/t2/operations/opt/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/submitFullDataset.py", line 1066, in <module>
    dataset_runnable, dataset_updated = preprocess(
                                        ^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/coffea/dataset_tools/preprocess.py", line 381, in preprocess
    processed_files_without_forms = processed_files[
                                    ^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/highlevel.py", line 1066, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 669, in _getitem
    return self._getitem_fields(list(where))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/indexedoptionarray.py", line 346, in _getitem_fields
    self._content._getitem_fields(where, only_fields),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/emptyarray.py", line 193, in _getitem_fields
    raise ak._errors.index_error(self, where, "not an array of records")
IndexError: cannot slice EmptyArray (of length 0) with ['file', 'object_path', 'steps', 'num_entries', 'uuid']: not an array of records


This error occurred while attempting to slice

    <Array [None, None] type='2 * ?unknown'>

with

    ['file', 'object_path', 'steps', 'num_entries', 'uuid']

It was unclear what went wrong.

bockjoo added the question label Jul 24, 2024

lgray commented Jul 24, 2024

@bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

@NJManganelli
Collaborator

I'll note that I've been seeing this with the occasional file opened via xrootd. One specific example: the /DoubleMuon/Run2016F*NanoAODv9-v1/NANOAOD file (there's just one, about 2.1 GB). I'm still investigating, because it seems to open fine in uproot from Wisconsin, but whichever replica is picked up by the DataDiscoveryCLI with round-robin replica choice triggers this error. The first option is also the T1_US_FNAL disks, which are under maintenance today.

@NJManganelli
Collaborator

Here's a single-file CMS dataset for which many replicas fail:

Sites availability for dataset: /DoubleMuon/Run2016F-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD
                Available replicas                
┏━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Index ┃ Site            ┃ Files ┃ Availability ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│   0   │ T1_US_FNAL_Disk │ 1 / 1 │    100.0%    │
│   1   │ T2_DE_DESY      │ 1 / 1 │    100.0%    │
│   2   │ T2_CH_CSCS      │ 1 / 1 │    100.0%    │
│   3   │ T1_DE_KIT_Disk  │ 1 / 1 │    100.0%    │
│   4   │ T3_KR_KISTI     │ 1 / 1 │    100.0%    │
│   5   │ T2_IT_Legnaro   │ 1 / 1 │    100.0%    │
│   6   │ T2_US_Wisconsin │ 1 / 1 │    100.0%    │
│   7   │ T2_BE_IIHE      │ 1 / 1 │    100.0%    │
│   8   │ T1_RU_JINR_Disk │ 1 / 1 │    100.0%    │
│   9   │ T3_US_NotreDame │ 1 / 1 │    100.0%    │
│  10   │ T3_IT_Trieste   │ 1 / 1 │    100.0%    │
│  11   │ T2_DE_RWTH      │ 1 / 1 │    100.0%    │
│  12   │ T3_KR_UOS       │ 1 / 1 │    100.0%    │
└───────┴─────────────────┴───────┴──────────────┘

This code should reproduce the failure:

from coffea.dataset_tools import preprocess
run2016f = {
    "0": {"files": {
                "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "1": {"files": {        
                "root://dcache-cms-xrootd.desy.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "2": {"files": {          
                "root://storage01.lcg.cscs.ch:1096//pnfs/lcg.cscs.ch/cms/trivcat/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "3": {"files": {  
                "root://cmsdcache-kit-disk.gridka.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "4": {"files": {  
                "root://cms-xrdr.sdfarm.kr:1094//xrd/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "5": {"files": {  
                 "root://t2-xrdcms.lnl.infn.it:7070//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "6": {"files": {  
                "root://cmsxrootd.hep.wisc.edu:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "7": {"files": {  
                "root://maite.iihe.ac.be:1095//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "8": {"files": {  
                "root://xrootd01.jinr-t1.ru:1094//pnfs/jinr-t1.ru/data/cms/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "9": {"files": {  
                "root://deepthought.crc.nd.edu//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "10": {"files": {  
                "root://cmsxrd.ts.infn.it:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "11": {"files": {  
                "root://grid-cms-xrootd.physik.rwth-aachen.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "12": {"files": {  
                "root://cms.sscc.uos.ac.kr:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
}
for key in run2016f:
    try:
        preprocess({key: run2016f[key]}, recalculate_steps=True, files_per_batch=10, save_form=True)
    except Exception:
        print(key, "FAILED")

Output for me right now:

0 FAILED [Disk downtime at FNAL today, though]
2 FAILED [T2_CH_CSCS]
5 FAILED [T2_IT_Legnaro]
12 FAILED [T3_KR_UOS]


JoyYTZhou commented Aug 21, 2024

Hi,

I also encountered the same issue for two datasets with many files. I have tried adding IndexError to the file_exceptions option in ddc.do_preprocess. Unfortunately, the error is still not caught. I am guessing that it's because the error is raised by awkward. Has there been any new fix to skip the problematic files?


lgray commented Aug 21, 2024

It means that no sites returned a valid list of files when trying to establish their existence.


bockjoo commented Aug 21, 2024

> @bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

I think I was reading a file via the root:// protocol, not https, with skip_bad_files=True, and preprocess failed to open/read the file from FNAL.


bockjoo commented Aug 21, 2024

When uproot raises an exception, it does not provide the file name or the reason for the error. These should be added to make the error clearer, e.g., when raising OSError in fsspec_xrootd/xrootd.py.
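
As a rough illustration of the kind of context being asked for (a hedged sketch, not the actual uproot or fsspec_xrootd code), a caller can re-raise with the file name attached so a failed open or read can be traced back to a specific replica:

import uproot

# Hedged sketch: re-raise with the file name attached so the failing replica
# is visible in the error message.
url = "root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root"
try:
    with uproot.open(url) as fin:
        print(fin["Events"].num_entries)
except OSError as exc:
    raise OSError(f"failed to open or read {url}") from exc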

@JoyYTZhou

> It means that no sites returned a valid list of files when trying to establish their existence.

How could that happen when the error does not appear during ddc.load_dataset_definition? I know for a fact that these files exist because I was using the generic root://cmsxrootd.fnal.gov/ redirector and was able to preprocess them in an older version of my code.

I thought the error would be raised whenever there was even one bad file.


bockjoo commented Aug 21, 2024

At the moment, this fails:

xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root /dev/null

as reported here; I have reported it to a FNAL admin.
Another way to open the file is

root://cms-xrd-global.cern.ch:1094//store/test/xrootd/T1_US_FNAL/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

instead of

root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

Normally, it's supposed to be accessed using

root://cms-xrd-global.cern.ch:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

which will open the file from one of these sites:

T1_DE_KIT_Disk
T1_IT_CNAF_Disk
T1_RU_JINR_Disk
T1_US_FNAL_Disk
T2_BE_IIHE
T2_BE_UCL
T2_CH_CSCS
T2_DE_DESY
T2_DE_RWTH
T2_EE_Estonia
T2_FR_GRIF
T2_IT_Legnaro
T2_UK_London_IC
T2_US_Vanderbilt
T2_US_Wisconsin


lgray commented Aug 21, 2024

Redirectors are known to be flaky for accessing files consistently; prior success unfortunately means you were just lucky.
You should find where this file is located and use a concrete xrootd endpoint instead of a redirector.

This particular error happens when you try to slice an array that consists entirely of None, which only occurs when every single file you passed failed to be accessed. Otherwise the fields it is complaining about are all present and the slice works as expected.
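
A minimal sketch of that failure mode (assuming current awkward behavior; the array literal is illustrative, not taken from preprocess internals):

import awkward as ak

# An array containing only None has no record fields, so selecting the
# preprocessing columns from it raises the IndexError seen in the traceback above.
all_failed = ak.Array([None, None])  # type: 2 * ?unknown
all_failed[["file", "object_path", "steps", "num_entries", "uuid"]]  # raises IndexError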

I'll make a PR that should at least report this outcome more clearly. I'll @ you and you can try it.


lgray commented Aug 22, 2024

Could one of you please try #1168?


JoyYTZhou commented Aug 22, 2024

This fix produces the updated error message:

Exception: There was no populated list of files returned from querying your input dataset.
Please check your xrootd endpoints, and avoid redirectors.
Input dataset: /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22EENanoAODv12-130X_mcRun3_2022_realistic_postEE_v6-v2/NANOAODSIM
As parsed for querying: [{file: ..., ...}, {file: ..., ...}, ..., {file: ..., ...}, {file: ..., ...}]

If my dataset_definition contains several datasets and only one of them is failing like this, would it be possible to save at least the successfully preprocessed results?


lgray commented Aug 22, 2024

@JoyYTZhou I have added something to the PR that should give this functionality. Have a look in the PR and give it a try.


JoyYTZhou commented Aug 23, 2024

@lgray Now I do get the preprocessed result dumped to my terminal (I would've preferred it to be a json.gz), but it still only dumps the result for the dataset that failed (which shows None in every field).

Based on the printed table index, that failed dataset was not the first to be processed, yet none of the previous results were dumped. If the failed dataset somehow always happened to be run first, the preprocessed result wouldn't be useful. I could also just delete the failed dataset from my query; that always works.


lgray commented Aug 23, 2024

@JoyYTZhou

All of the passed results are returned as two dictionaries:

  • the first is only the successfully parsed results
  • the second is the input dictionary updated with parsed results where they exist

What gets dumped to the screen are only the failed runs, as a standard Python user warning.
They are not meant for manipulation by the user, only to tell you what went wrong.
This is why they are not dumped to a json file: it would serve no purpose, and copy/pasting is a user-interface design choice that does not scale well.
You can also find out which datasets failed by finding the dataset keys in your input fileset that are not in the output dictionary of successfully parsed results.

You may save or further process the returned dictionaries however you wish.
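
A minimal sketch of working with those two dictionaries (fileset here stands in for your own input dataset definition, and allow_empty_datasets is the option added in #1168):

import gzip
import json

from coffea.dataset_tools import preprocess

# The successful results and the updated input are returned directly.
dataset_runnable, dataset_updated = preprocess(
    fileset, files_per_batch=10, save_form=True, allow_empty_datasets=True
)

# Dataset keys present in the input but absent from the successful results failed entirely.
failed = set(fileset) - set(dataset_runnable)
print("failed datasets:", sorted(failed))

# Save or further process the successful results however you wish, e.g. as json.gz.
with gzip.open("preprocessed_available.json.gz", "wt") as fout:
    json.dump(dataset_runnable, fout, indent=2)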


lgray commented Aug 23, 2024

@JoyYTZhou have you been able to 1) pass allow_empty_datasets=True to preprocess and then 2) access the successfully parsed datasets from what is returned by that function?

If you don't want to see the printout when the dataset fails you can use the control mechanisms available to you via https://docs.python.org/3/library/warnings.html.
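
For example, a sketch of silencing that warning with the standard-library controls (fileset again stands in for your own input definition):

import warnings

from coffea.dataset_tools import preprocess

# Suppress the UserWarning emitted for datasets whose files all failed to be accessed.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    dataset_runnable, dataset_updated = preprocess(fileset, allow_empty_datasets=True)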


JoyYTZhou commented Aug 23, 2024

@lgray Yes, there's such an option. However, since preprocess is called by ddc.do_preprocess, and that is really what users are recommended to use, there needs to be a **kwargs passthrough in do_preprocess in DataDiscoveryCLI so that I don't have to constantly go into the source code to turn options on and off.

If the successfully parsed datasets are returned by preprocess, then I should be able to see a json.gz produced by do_preprocess. I am not seeing that. I might use preprocess directly to check, but that rather defeats the purpose of using DataDiscoveryCLI.
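
A hypothetical sketch of the requested passthrough (not the actual DataDiscoveryCLI code):

from coffea.dataset_tools import preprocess

class DataDiscoveryCLISketch:
    def __init__(self, fileset):
        self.fileset = fileset

    def do_preprocess(self, **preprocess_kwargs):
        # Forward any extra keyword arguments (e.g. allow_empty_datasets=True,
        # skip_bad_files=True) straight through to dataset_tools.preprocess.
        return preprocess(self.fileset, **preprocess_kwargs)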

@ikrommyd
Contributor

https://github.com/CoffeaTeam/coffea/pull/1137/files needs to be updated to add the extra argument.


JoyYTZhou commented Aug 23, 2024

> @lgray Yes, there's such an option. However, since preprocess is called by ddc.do_preprocess, and that is really what users are recommended to use, there needs to be a **kwargs passthrough in do_preprocess in DataDiscoveryCLI so that I don't have to constantly go into the source code to turn options on and off.
>
> If the successfully parsed datasets are returned by preprocess, then I should be able to see a json.gz produced by do_preprocess. I am not seeing that. I might use preprocess directly to check, but that rather defeats the purpose of using DataDiscoveryCLI.

@lgray Actually, never mind: the results do get dumped when allow_empty_datasets=True is passed to preprocess. I would still appreciate it if do_preprocess got a **kwargs passthrough.


lgray commented Aug 23, 2024

Composability does not defeat the purpose of a shortcut.

I'll add allow_empty_datasets to do_preprocess.

My bad for missing that you were using that as opposed to preprocess directly.


lgray commented Aug 23, 2024

OK added to the rucio utils. Please give it a try.

@JoyYTZhou

> OK added to the rucio utils. Please give it a try.

Yes, I get the successful outputs now. Thank you. I think this fix closes the issue.
