Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordArray with duplicated field names cause issues with to_buffers, unpickling and loading from arrow #3247

Open
nikoladze opened this issue Sep 18, 2024 · 1 comment
Labels
bug The problem described is something that must be fixed

Comments

@nikoladze
Copy link
Contributor

Version of Awkward Array

2.6.8

Description and code to reproduce

It seems RecordArray allows for duplicated fields, e.g. when constructing via the Layout API

>>> array = ak.Array(ak.contents.RecordArray([ak.contents.NumpyArray([1, 2, 3]), ak.contents.NumpyArray([1, 2, 3])], ["a", "a"]))
>>> array
<Array [{a: 1, a: 1}, {...}, {a: 3, a: 3}] type='3 * {a: int64, a: int64}'>

Another possibility this can happen is if one (like me, accidentally) repeats a record field twice when selecting multiple record fields:

>>> array = ak.zip({"a": [1, 2, 3]})[["a", "a"]]
>>> array
<Array [{a: 1, a: 1}, {...}, {a: 3, a: 3}] type='3 * {a: int64, a: int64}'>

Now, such arrays cause issues when

  1. exploding via to_buffers:
>>> ak.to_buffers(array)
(RecordForm([NumpyForm('int64', form_key='node1'), NumpyForm('int64', form_key='node2')], ['a', 'a'], form_key='node0'), 3, {'node1-data': array([1, 2, 3])})

one can see node2-data is missing

  1. which consequently leads to problems unpickling
>>> pickle.loads(pickle.dumps(array))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_pickle.py", line 107, in unpickle_array_schema_1
    return _impl(
           ^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 150, in _impl
    out = _reconstitute(form, length, container, getkey, backend, byteorder, simplify)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 405, in _reconstitute
    _reconstitute(
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 196, in _reconstitute
    raw_array = container[getkey(form, "data")]
                ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'node2-data'
  1. also a roundtrip to and from arrow doesn't work anymore:
>>> ak.from_arrow(ak.to_arrow(array))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_dispatch.py", line 39, in dispatch
    gen_or_result = func(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_arrow.py", line 45, in from_arrow
    return _impl(array, generate_bitmasks, highlevel, behavior, attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_arrow.py", line 55, in _impl
    out = awkward._connect.pyarrow.handle_arrow(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_connect/pyarrow/conversions.py", line 757, in handle_arrow
    out = popbuffers(
          ^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_connect/pyarrow/conversions.py", line 370, in popbuffers
    paarray.field(field_name), a, b, buffers, generate_bitmasks
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 3913, in pyarrow.lib.StructArray.field
KeyError: 'a'

This error occurred while calling

    ak.from_arrow(
        AwkwardArrowArray-instance
    )

Probably one could just not allow arrays with duplicated field names. I'm not sure if there is any useful application of this - when i discovered this in my code this was also actually something i did not intend to do (just accidentally repeated a field name), so writing this minimal reproducer was already worth it :)

@nikoladze nikoladze added the bug (unverified) The problem described would be a bug, but needs to be triaged label Sep 18, 2024
@jpivarski jpivarski added bug The problem described is something that must be fixed and removed bug (unverified) The problem described would be a bug, but needs to be triaged labels Sep 18, 2024
@jpivarski
Copy link
Member

I reserve judgement about the round-trip through Arrow or Parquet (it depends on whether those libraries are okay with duplicated field names), but the round-trip through buffers seems very fixable. It already has a Form distinct form_keys:

>>> a = ak.Array({"a": [1, 2, 3]})[["a", "a"]]
>>> form, length, containers = ak.to_buffers(a)
>>> print(form)
{
    "class": "RecordArray",
    "fields": [
        "a",
        "a"
    ],
    "contents": [
        {
            "class": "NumpyArray",
            "primitive": "int64",
            "form_key": "node1"
        },
        {
            "class": "NumpyArray",
            "primitive": "int64",
            "form_key": "node2"
        }
    ],
    "form_key": "node0"
}

The only problem is that the containers only has "node1-data"; it's missing "node2-data":

>>> containers
{'node1-data': array([1, 2, 3])}

Ideally, "node2-data" would be a view of "node1-data", as it is in the array itself, though serialization might break that link. (Pickle wouldn't, but other forms of serialization might.)

It does iterate over both fields named "a":

for field, content in zip(self._fields, self._contents):
content._to_buffers(
form.content(field), getkey, container, backend, byteorder
)

>>> a.layout._fields
['a', 'a']
>>> a.layout._contents
[<NumpyArray dtype='int64' len='3'>[1 2 3]</NumpyArray>,
 <NumpyArray dtype='int64' len='3'>[1 2 3]</NumpyArray>]

The problem is form.content(field), which looks up the child by name in the Form. Both times when form.content("a") is called, it goes to the same element of form._contents, so it writes "node1-data" to the output containers twice.

The solution to this would be to enumerate over the fields by index and get the right Form child by index:

 for index, content in enumerate(self._contents): 
     content._to_buffers( 
         form.content(index), getkey, container, backend, byteorder 
     )

(See definition of Form.content.)

But that's exactly the same as the no-fields case above this code, so the whole if-then-else can be removed in favor of the first case:

if self._fields is None:
for i, content in enumerate(self._contents):
content._to_buffers(
form.content(i), getkey, container, backend, byteorder
)
else:
for field, content in zip(self._fields, self._contents):
content._to_buffers(
form.content(field), getkey, container, backend, byteorder
)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug The problem described is something that must be fixed
Projects
None yet
Development

No branches or pull requests

2 participants