
dask-awkward array loaded by dak.from_parquet and field-sliced is not populated (has PlaceholderArrays) #501

Open
jpivarski opened this issue Apr 19, 2024 · 7 comments

Comments

@jpivarski (Collaborator)

This ZIP contains a ROOT file and a Parquet file.

dak-issue-501.zip

If we open it with uproot.dask, extract one field and compute it, we get what we expect:

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [     0      9     16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36  126449.445 112335.195 ...  12131.118  13738.865  10924.1  ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([     0,      9,     16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ...,  12131.118,  13738.865,
        10924.1  ], dtype=float32)

But if we open it with dak.from_parquet, extract the same field, and compute it, the array's buffers are PlaceholderArrays instead of real data:

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x7a45532925c0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x7a4553292260>

And of course, that would cause trouble downstream.

What happened here? This is almost the simplest case of column optimization that one could have.
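
For comparison, here is a quick check (a sketch; I expect, but have not shown, that the eager reader returns real buffers) that would isolate the problem to the lazy path:

import awkward as ak

# eager read of the whole file, then slice out the one field
eager = ak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"]

# these should be real NumPy buffers, not PlaceholderArrays
print(type(eager.layout.offsets.data))
print(type(eager.layout.content.data))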

@martindurant (Collaborator)

Since #491 is in progress, can we test with that code, rather than trying to fix code that's about to disappear?

@jpivarski (Collaborator, Author)

Good idea. I'll check it on that branch.

@jpivarski (Collaborator, Author)

On that branch, the uproot.dask case is now broken (I assume something needs to be updated in Uproot),

>>> import uproot
>>> result = uproot.dask("dak-issue-501.root")["AnalysisJetsAuxDyn.pt"].compute()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 277, in dask
    return _get_dak_array(
  File "/home/jpivarski/irishep/uproot5/src/uproot/_dask.py", line 1560, in _get_dak_array
    return dask_awkward.from_map(
  File "/tmp/dask-awkward/src/dask_awkward/lib/io/io.py", line 630, in from_map
    form_with_unique_keys(io_func.form, "@"),
AttributeError: '_UprootRead' object has no attribute 'form'

and the dak.from_parquet case is unchanged,

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501.parquet")["AnalysisJetsAuxDyn.pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>[## ... ##]</Index></offsets>
    <content><NumpyArray dtype='float32' len='##'>[## ... ##]</NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8eaa0>
>>> result.layout.content.data
<awkward._nplikes.placeholder.PlaceholderArray object at 0x770054e8ead0>

@martindurant (Collaborator)

I think this must be because the one field's name contains a "." character, which is also used to indicate nesting.
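
To make the ambiguity concrete (a minimal sketch, separate from the reproducer above): a field whose name literally contains a dot and a genuinely nested record describe the same dotted column path if field names are joined with ".".

import awkward as ak

# a record field whose name literally contains a dot
flat = ak.Array([{"AnalysisJetsAuxDyn.pt": [1.0, 2.0]}])

# a nested record: field "pt" inside field "AnalysisJetsAuxDyn"
nested = ak.Array([{"AnalysisJetsAuxDyn": {"pt": [1.0, 2.0]}}])

# a selector that joins or splits on "." cannot tell these apart:
# both look like a column named "AnalysisJetsAuxDyn.pt"
print(flat.fields)                          # ['AnalysisJetsAuxDyn.pt']
print(nested.fields)                        # ['AnalysisJetsAuxDyn']
print(nested["AnalysisJetsAuxDyn"].fields)  # ['pt']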

@jpivarski (Collaborator, Author)

That's right, it is:

dak-issue-501-nodot.zip

>>> import dask_awkward as dak
>>> result = dak.from_parquet("dak-issue-501-nodot.parquet")["AnalysisJetsAuxDyn_pt"].compute()
>>> result.layout
<ListOffsetArray len='20000'>
    <offsets><Index dtype='int64' len='20001'>
        [     0      9     16 ... 189827 189839 189853]
    </Index></offsets>
    <content><NumpyArray dtype='float32' len='189853'>
        [131514.36  126449.445 112335.195 ...  12131.118  13738.865  10924.1  ]
    </NumpyArray></content>
</ListOffsetArray>
>>> result.layout.offsets.data
array([     0,      9,     16, ..., 189827, 189839, 189853])
>>> result.layout.content.data
array([131514.36 , 126449.445, 112335.195, ...,  12131.118,  13738.865,
        10924.1  ], dtype=float32)

But we should expect that field names can contain any characters, right? When I search for "parquet column names dot", I see many instances of people doing this in Spark, which uses backticks to avoid interpreting the dot as nesting.

Handling column names with dots in them (by requiring such names to be quoted with backticks) might need to be implemented in ak.forms.Form.select_columns. Would anything be needed in dask-awkward?
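
Something like the following (purely hypothetical; the helper name and the backtick convention are not existing awkward or dask-awkward API) is the kind of splitting rule I have in mind:

import re

def split_column_path(path):
    """Split a dotted column path into components, treating
    backtick-quoted segments (Spark-style) as literal field names.
    Hypothetical helper, not an existing API."""
    tokens = re.findall(r"`([^`]*)`|([^.`]+)", path)
    return [quoted if quoted else plain for quoted, plain in tokens]

print(split_column_path("AnalysisJetsAuxDyn.pt"))          # ['AnalysisJetsAuxDyn', 'pt']
print(split_column_path("`AnalysisJetsAuxDyn.pt`"))        # ['AnalysisJetsAuxDyn.pt']
print(split_column_path("event.`AnalysisJetsAuxDyn.pt`"))  # ['event', 'AnalysisJetsAuxDyn.pt']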

@jpivarski (Collaborator, Author)

Actually, how does the dot cause column optimization to fail? What assumption is being broken?

If scikit-hep/awkward#3088 is fixed, what else would be needed?
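
One way to see what the optimizer decides it needs (assuming dask-awkward still exposes this as report_necessary_columns; the exact name has changed between versions):

import dask_awkward as dak

lazy = dak.from_parquet("dak-issue-501.parquet")
pt = lazy["AnalysisJetsAuxDyn.pt"]

# maps each input layer to the columns the optimizer thinks are needed;
# a mis-parsed dotted name should be visible here
print(dak.report_necessary_columns(pt))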

@martindurant (Collaborator)

I haven't spotted a place where we assume dots to be special, but I suspect that parquet might have this built in (or not).
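
One way to check whether the dot handling comes from pyarrow itself rather than from us (a sketch of the experiment, not a claim about its outcome):

import pyarrow as pa
import pyarrow.parquet as pq

# a flat column whose name literally contains a dot ...
pq.write_table(pa.table({"a.b": [1.0, 2.0]}), "flat.parquet")

# ... and a nested struct column whose dotted path is also "a.b"
pq.write_table(pa.table({"a": [{"b": 1.0}, {"b": 2.0}]}), "nested.parquet")

# does selecting the string "a.b" pick the flat column, the nested one,
# or raise? whatever happens tells us what pyarrow assumes about dots
for fn in ("flat.parquet", "nested.parquet"):
    try:
        print(fn, pq.read_table(fn, columns=["a.b"]).column_names)
    except Exception as exc:
        print(fn, "->", exc)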
