
Preprocessing a file at FNAL leads to an unclear exception #1138

Closed
bockjoo opened this issue Jul 24, 2024 · 22 comments · Fixed by #1168
Labels
question Further information is requested

Comments


bockjoo commented Jul 24, 2024

I am trying to preprocess a file at FNAL with coffea 2024.6.1, but I get this exception:

Traceback (most recent call last):
  File "/cmsuf/t2/operations/opt/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/submitFullDataset.py", line 1066, in <module>
    dataset_runnable, dataset_updated = preprocess(
                                        ^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/coffea/dataset_tools/preprocess.py", line 381, in preprocess
    processed_files_without_forms = processed_files[
                                    ^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/highlevel.py", line 1066, in __getitem__
    prepare_layout(self._layout[where]),
                   ~~~~~~~~~~~~^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 512, in __getitem__
    return self._getitem(where)
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/content.py", line 669, in _getitem
    return self._getitem_fields(list(where))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/indexedoptionarray.py", line 346, in _getitem_fields
    self._content._getitem_fields(where, only_fields),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/bockjoo/opt/cmsio2/cms/services/T2/ops/Work/AAA/vll-analysis.Coffea2024.6.1/lib/python3.12/site-packages/awkward/contents/emptyarray.py", line 193, in _getitem_fields
    raise ak._errors.index_error(self, where, "not an array of records")
IndexError: cannot slice EmptyArray (of length 0) with ['file', 'object_path', 'steps', 'num_entries', 'uuid']: not an array of records


This error occurred while attempting to slice

    <Array [None, None] type='2 * ?unknown'>

with

    ['file', 'object_path', 'steps', 'num_entries', 'uuid']

It was unclear what went wrong.

bockjoo added the question label Jul 24, 2024

lgray commented Jul 24, 2024

@bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

@NJManganelli
Collaborator

I'll note that I've been seeing this with the occasional file opened via xrootd. One specific example: the /DoubleMuon/Run2016F*NanoAODv9-v1/NANOAOD file (there's just one, about 2.1 GB). I'm still investigating, because it seems to open fine in uproot from Wisconsin, but whichever replica is picked up by the DataDiscoveryCLI with round-robin replica choice triggers this error. The first option is also the T1_US_FNAL disks, which are under maintenance today.

@NJManganelli
Collaborator

Here's a single-file CMS dataset for which many replicas fail:

Sites availability for dataset: /DoubleMuon/Run2016F-UL2016_MiniAODv2_NanoAODv9-v1/NANOAOD
                Available replicas                
┏━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Index ┃ Site            ┃ Files ┃ Availability ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━┩
│   0   │ T1_US_FNAL_Disk │ 1 / 1 │    100.0%    │
│   1   │ T2_DE_DESY      │ 1 / 1 │    100.0%    │
│   2   │ T2_CH_CSCS      │ 1 / 1 │    100.0%    │
│   3   │ T1_DE_KIT_Disk  │ 1 / 1 │    100.0%    │
│   4   │ T3_KR_KISTI     │ 1 / 1 │    100.0%    │
│   5   │ T2_IT_Legnaro   │ 1 / 1 │    100.0%    │
│   6   │ T2_US_Wisconsin │ 1 / 1 │    100.0%    │
│   7   │ T2_BE_IIHE      │ 1 / 1 │    100.0%    │
│   8   │ T1_RU_JINR_Disk │ 1 / 1 │    100.0%    │
│   9   │ T3_US_NotreDame │ 1 / 1 │    100.0%    │
│  10   │ T3_IT_Trieste   │ 1 / 1 │    100.0%    │
│  11   │ T2_DE_RWTH      │ 1 / 1 │    100.0%    │
│  12   │ T3_KR_UOS       │ 1 / 1 │    100.0%    │
└───────┴─────────────────┴───────┴──────────────┘

This code should reproduce the failure:

from coffea.dataset_tools import preprocess
run2016f = {
    "0": {"files": {
                "root://cmsdcadisk.fnal.gov//dcache/uscmsdisk/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "1": {"files": {        
                "root://dcache-cms-xrootd.desy.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "2": {"files": {          
                "root://storage01.lcg.cscs.ch:1096//pnfs/lcg.cscs.ch/cms/trivcat/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "3": {"files": {  
                "root://cmsdcache-kit-disk.gridka.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "4": {"files": {  
                "root://cms-xrdr.sdfarm.kr:1094//xrd/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "5": {"files": {  
                 "root://t2-xrdcms.lnl.infn.it:7070//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "6": {"files": {  
                "root://cmsxrootd.hep.wisc.edu:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "7": {"files": {  
                "root://maite.iihe.ac.be:1095//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "8": {"files": {  
                "root://xrootd01.jinr-t1.ru:1094//pnfs/jinr-t1.ru/data/cms/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "9": {"files": {  
                "root://deepthought.crc.nd.edu//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "10": {"files": {  
                "root://cmsxrd.ts.infn.it:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "11": {"files": {  
                "root://grid-cms-xrootd.physik.rwth-aachen.de:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
    "12": {"files": {  
                "root://cms.sscc.uos.ac.kr:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root": "Events"}},
}
for key in run2016f:
    try:
        preprocess({key: run2016f[key]}, recalculate_steps=True, files_per_batch=10, save_form=True)
    except Exception:
        print(key, "FAILED")

Output for me right now:

0 FAILED [Disk downtime at FNAL today, though]
2 FAILED [T2_CH_CSCS]
5 FAILED [T2_IT_Legnaro]
12 FAILED [T3_KR_UOS]


JoyYTZhou commented Aug 21, 2024

Hi,

I also encountered the same issue for two datasets with many files. I have tried adding IndexError to the file_exceptions option in ddc.do_preprocess. Unfortunately, the error is still not caught. I am guessing that it's because the error is raised by awkward. Has there been any new fix to skip the problematic files?


lgray commented Aug 21, 2024

It means that no sites returned a valid list of files when trying to establish their existence.


bockjoo commented Aug 21, 2024

> @bockjoo I thought you described this as an uproot problem on Slack, where you saw something deserializing incorrectly when using https.

I think I was reading a file via the root:// protocol, not https, with skip_bad_files=True, and preprocess failed to open/read the file from FNAL.


bockjoo commented Aug 21, 2024

When uproot raises an exception, it does not provide the file name or the reason for the error. These should be added to make the error clearer, e.g., when raising OSError in fsspec_xrootd/xrootd.py.
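
As a rough illustration of the kind of context being asked for (a hedged sketch, not the actual uproot or fsspec_xrootd code), a caller can re-raise with the file name attached so a failed open or read can be traced back to a specific replica:

import uproot

# Hedged sketch: re-raise with the file name attached so the failing replica
# is visible in the error message.
url = "root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root"
try:
    with uproot.open(url) as fin:
        print(fin["Events"].num_entries)
except OSError as exc:
    raise OSError(f"failed to open or read {url}") from exc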

@JoyYTZhou

> It means that no sites returned a valid list of files when trying to establish their existence.

How could that happen when the error does not appear during ddc.load_dataset_definition? I know for a fact that these files exist because I was using the generic root://cmsxrootd.fnal.gov/ redirector and was able to preprocess them in an older version of my code.

I thought the error would be raised whenever there was even one bad file.


bockjoo commented Aug 21, 2024

At the moment, this fails:

xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root /dev/null

as reported here; I have reported it to a FNAL admin.
Another way to open the file is

root://cms-xrd-global.cern.ch:1094//store/test/xrootd/T1_US_FNAL/store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

instead of

root://cmsxrootd.fnal.gov//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

Normally, it's supposed to be accessed using

root://cms-xrd-global.cern.ch:1094//store/data/Run2016F/DoubleMuon/NANOAOD/UL2016_MiniAODv2_NanoAODv9-v1/120000/E6EF2FBB-676E-A447-B572-B575EDB3CC1C.root

which will open the file from one of these sites:

T1_DE_KIT_Disk
T1_IT_CNAF_Disk
T1_RU_JINR_Disk
T1_US_FNAL_Disk
T2_BE_IIHE
T2_BE_UCL
T2_CH_CSCS
T2_DE_DESY
T2_DE_RWTH
T2_EE_Estonia
T2_FR_GRIF
T2_IT_Legnaro
T2_UK_London_IC
T2_US_Vanderbilt
T2_US_Wisconsin


lgray commented Aug 21, 2024

Redirectors are known to be flaky for accessing files consistently; prior success unfortunately means you were just lucky.
You should find where this file is located and use a concrete xrootd endpoint instead of a redirector.

This particular error happens when you try to slice an array that consists entirely of None, which only occurs when every single file you passed failed to be accessed. Otherwise the fields it is complaining about are all present and the slice works as expected.
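
A minimal sketch of that failure mode (assuming current awkward behavior; the array literal is illustrative, not taken from preprocess internals):

import awkward as ak

# An array containing only None has no record fields, so selecting the
# preprocessing columns from it raises the IndexError seen in the traceback above.
all_failed = ak.Array([None, None])  # type: 2 * ?unknown
all_failed[["file", "object_path", "steps", "num_entries", "uuid"]]  # raises IndexError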

I'll make a PR that should at least report this outcome more clearly. I'll @ you and you can try it.


lgray commented Aug 22, 2024

Could one of you please try #1168?


JoyYTZhou commented Aug 22, 2024

This fix produces the updated error message:

Exception: There was no populated list of files returned from querying your input dataset.
Please check your xrootd endpoints, and avoid redirectors.
Input dataset: /ZZto2L2Nu_TuneCP5_13p6TeV_powheg-pythia8/Run3Summer22EENanoAODv12-130X_mcRun3_2022_realistic_postEE_v6-v2/NANOAODSIM
As parsed for querying: [{file: ..., ...}, {file: ..., ...}, ..., {file: ..., ...}, {file: ..., ...}]

If my dataset_definition contains several datasets and only one of them is failing like this, would it be possible to save at least the successfully preprocessed results?


lgray commented Aug 22, 2024

@JoyYTZhou I have added something to the PR that should give this functionality. Have a look in the PR and give it a try.


JoyYTZhou commented Aug 23, 2024

@lgray Now I do get the preprocessed result dumped to my terminal (I would've preferred it to be a json.gz), but it still only dumps the result for the dataset that failed (which shows None in every field).

Based on the printed table index, that failed dataset was not the first to be processed, yet none of the previous results were dumped. If the failed dataset somehow always happened to be run first, the preprocessed result wouldn't be useful. I could also just delete the failed dataset from my query; that always works.


lgray commented Aug 23, 2024

@JoyYTZhou

All of the passed results are returned as two dictionaries:

  • the first is only the successfully parsed results
  • the second is the input dictionary updated with parsed results where they exist

What gets dumped to the screen are only the failed runs, as a standard Python user warning.
They are not meant for manipulation by the user, only to tell you what went wrong.
This is why they are not dumped to a json file: it would serve no purpose, and copy/pasting is a user-interface design choice that does not scale well.
You can also find out which datasets failed by finding the dataset keys in your input fileset that are not in the output dictionary of successfully parsed results.

You may save or further process the returned dictionaries however you wish.
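
A minimal sketch of working with those two dictionaries (fileset here stands in for your own input dataset definition, and allow_empty_datasets is the option added in #1168):

import gzip
import json

from coffea.dataset_tools import preprocess

# The successful results and the updated input are returned directly.
dataset_runnable, dataset_updated = preprocess(
    fileset, files_per_batch=10, save_form=True, allow_empty_datasets=True
)

# Dataset keys present in the input but absent from the successful results failed entirely.
failed = set(fileset) - set(dataset_runnable)
print("failed datasets:", sorted(failed))

# Save or further process the successful results however you wish, e.g. as json.gz.
with gzip.open("preprocessed_available.json.gz", "wt") as fout:
    json.dump(dataset_runnable, fout, indent=2)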


lgray commented Aug 23, 2024

@JoyYTZhou have you been able to 1) pass allow_empty_datasets=True to preprocess and then 2) access the successfully parsed datasets from what is returned by that function?

If you don't want to see the printout when the dataset fails you can use the control mechanisms available to you via https://docs.python.org/3/library/warnings.html.
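
For example, a sketch of silencing that warning with the standard-library controls (fileset again stands in for your own input definition):

import warnings

from coffea.dataset_tools import preprocess

# Suppress the UserWarning emitted for datasets whose files all failed to be accessed.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    dataset_runnable, dataset_updated = preprocess(fileset, allow_empty_datasets=True)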


JoyYTZhou commented Aug 23, 2024

@lgray Yes, there's such an option. However, since preprocess is called by ddc.do_preprocess, and that is really what users are recommended to use, there needs to be a **kwargs passthrough in do_preprocess in DataDiscoveryCLI so that I don't have to constantly go into the source code to turn options on and off.

If the successfully parsed datasets are returned by preprocess, then I should be able to see a json.gz produced by do_preprocess. I am not seeing that. I might use preprocess directly to check, but that rather defeats the purpose of using DataDiscoveryCLI.
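
A hypothetical sketch of the requested passthrough (not the actual DataDiscoveryCLI code):

from coffea.dataset_tools import preprocess

class DataDiscoveryCLISketch:
    def __init__(self, fileset):
        self.fileset = fileset

    def do_preprocess(self, **preprocess_kwargs):
        # Forward any extra keyword arguments (e.g. allow_empty_datasets=True,
        # skip_bad_files=True) straight through to dataset_tools.preprocess.
        return preprocess(self.fileset, **preprocess_kwargs)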

@ikrommyd
Contributor

https://github.com/CoffeaTeam/coffea/pull/1137/files needs to be updated to add the extra argument.


JoyYTZhou commented Aug 23, 2024

> @lgray Yes, there's such an option. However, since preprocess is called by ddc.do_preprocess, and that is really what users are recommended to use, there needs to be a **kwargs passthrough in do_preprocess in DataDiscoveryCLI so that I don't have to constantly go into the source code to turn options on and off.
>
> If the successfully parsed datasets are returned by preprocess, then I should be able to see a json.gz produced by do_preprocess. I am not seeing that. I might use preprocess directly to check, but that rather defeats the purpose of using DataDiscoveryCLI.

@lgray Actually, never mind: the results do get dumped when allow_empty_datasets=True is passed to preprocess. I would still appreciate it if do_preprocess got a **kwargs passthrough.


lgray commented Aug 23, 2024

Composability does not defeat the purpose of a shortcut.

I'll add allow_empty_datasets to do_preprocess.

My bad for missing that you were using that as opposed to preprocess directly.


lgray commented Aug 23, 2024

OK added to the rucio utils. Please give it a try.

@JoyYTZhou

> OK added to the rucio utils. Please give it a try.

Yes, I get the successful outputs now. Thank you. I think this fix closes the issue.
