Issue Regularizing an awkward Array of Strings #2465

tomeichlersmith · 2023-05-21T19:58:08Z

tomeichlersmith
May 21, 2023

Let's say I have a ragged array of strings.¹ I want to flatten this array to numpy so I can fill a histogram of the values.² The specific example is below, but the TLDR is that I get an error when attempting to convert an ak.Array of strings to a np.array of strings when awkward is trying to regularize the array. This error reads (to me) like awkward array is mistreating the strings as arrays themselves, but I cannot reproduce this error with a smaller example.

This same error occurs when trying ak.to_regular, which I suppose makes sense since NumPy would need a regular array.

Is this a bug with awkward array? Or (more likely) is this a bug with how I'm using it? Perhaps there's a workaround where I tell awkward what type to use in the NumPy array? I couldn't find such a parameter in the docs, but I could easily have missed it.

In any case, I have this set of strings

[['PCB_volume', 'PCB_volume', 'PCB_volume', ..., 'PCB_volume', 'PCB_volume'],
 ['W_front_volume_0', 'W_front_volume_0', ..., 'W_front_volume_0'],
 ['PCB_volume', 'PCB_volume', 'PCB_volume', ..., 'PCB_volume', 'PCB_volume'],
 ['W_front_volume_0', 'W_front_volume_0', ..., 'W_front_volume_0'],
 ['W_cooling_volume_1', 'W_cooling_volume_1', ..., 'W_cooling_volume_1'],
 ['W_cooling_volume_0', 'W_cooling_volume_0', ..., 'W_cooling_volume_0'],
 ['W_cooling_volume_0', 'W_cooling_volume_0', ..., 'W_cooling_volume_0'],
 ['W_cooling_volume_0', ..., 'nohole_motherboard6_assembly'],
 ['W_front_volume_2', 'W_front_volume_2', ..., 'W_front_volume_6'],
 ['PCB_volume', 'PCB_volume', ..., 'W_cooling_volume_2', 'W_cooling_volume_2'],
 ...,
 ['W_cooling_volume_0', 'W_cooling_volume_0', ..., 'W_cooling_volume_0'],
 ['PCB_volume', 'PCB_volume', 'PCB_volume', ..., 'PCB_volume', 'PCB_volume'],
 ['W_front_volume_1', 'W_front_volume_1', ..., 'W_cooling_volume_4'],
 ['PCB_volume', 'PCB_volume', 'PCB_volume', ..., 'PCB_volume', 'PCB_volume'],
 ['PCB_volume', 'PCB_volume', 'PCB_volume', ..., 'Glue_volume', 'Glue_volume'],
 ['W_front_volume_0', 'W_front_volume_0', ..., 'W_cooling_volume_1'],
 ['W_front_volume_2', 'W_front_volume_2', ..., 'W_front_volume_2'],
 ['W_front_volume_4', 'W_front_volume_4', ..., 'W_front_volume_4'],
 ['W_cooling_volume_0', 'W_cooling_volume_0', ..., 'W_cooling_volume_1']]
-------------------------------------------------------------------------------
type: 15491 * var * string

And I can ak.flatten it easily

['PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 'PCB_volume',
 ...,
 'W_cooling_volume_0',
 'W_cooling_volume_0',
 'W_cooling_volume_0',
 'W_cooling_volume_1',
 'W_cooling_volume_1',
 'W_cooling_volume_1',
 'W_cooling_volume_1',
 'W_cooling_volume_1',
 'W_cooling_volume_1']
----------------------
type: 334460 * string

But I cannot convert it with ak.to_numpy (I get the same error using ak.to_numpy(ak.flatten(<>)) and just ak.to_numpy alone).

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[115], line 1
----> 1 ak.to_numpy(ak.flatten(strings))

File /opt/conda/lib/python3.10/site-packages/awkward/operations/ak_to_numpy.py:42, in to_numpy(array, allow_missing)
      8 """
      9 Args:
     10     array: Array-like data (anything #ak.to_layout recognizes).
   (...)
     36 See also #ak.from_numpy and #ak.to_cupy.
     37 """
     38 with ak._errors.OperationErrorContext(
     39     "ak.to_numpy",
     40     {"array": array, "allow_missing": allow_missing},
     41 ):
---> 42     return _impl(array, allow_missing)

File /opt/conda/lib/python3.10/site-packages/awkward/operations/ak_to_numpy.py:54, in _impl(array, allow_missing)
     51 backend = ak._backends.NumpyBackend.instance()
     52 numpy_layout = layout.to_backend(backend)
---> 54 return numpy_layout.to_backend_array(allow_missing=allow_missing)

File /opt/conda/lib/python3.10/site-packages/awkward/contents/content.py:1124, in Content.to_backend_array(self, allow_missing, backend)
   1122 else:
   1123     backend = ak._backends.regularize_backend(backend)
-> 1124 return self._to_backend_array(allow_missing, backend)

File /opt/conda/lib/python3.10/site-packages/awkward/contents/listarray.py:1451, in ListArray._to_backend_array(self, allow_missing, backend)
   1450 def _to_backend_array(self, allow_missing, backend):
-> 1451     return self.to_RegularArray()._to_backend_array(allow_missing, backend)

File /opt/conda/lib/python3.10/site-packages/awkward/contents/listarray.py:308, in ListArray.to_RegularArray(self)
    306 def to_RegularArray(self):
    307     offsets = self._compact_offsets64(True)
--> 308     return self._broadcast_tooffsets64(offsets).to_RegularArray()

File /opt/conda/lib/python3.10/site-packages/awkward/contents/listoffsetarray.py:282, in ListOffsetArray.to_RegularArray(self)
    277 _size = ak.index.Index64.empty(1, self._backend.index_nplike)
    278 assert (
    279     _size.nplike is self._backend.index_nplike
    280     and self._offsets.nplike is self._backend.index_nplike
    281 )
--> 282 self._handle_error(
    283     self._backend[
    284         "awkward_ListOffsetArray_toRegularArray",
    285         _size.dtype.type,
    286         self._offsets.dtype.type,
    287     ](
    288         _size.data,
    289         self._offsets.data,
    290         self._offsets.length,
    291     )
    292 )
    293 size = self._backend.index_nplike.index_as_shape_item(_size[0])
    294 length = self._offsets.length - 1

File /opt/conda/lib/python3.10/site-packages/awkward/contents/content.py:281, in Content._handle_error(self, error, slicer)
    278 message += filename
    280 if slicer is None:
--> 281     raise ak._errors.wrap_error(ValueError(message))
    282 else:
    283     raise ak._errors.index_error(self, slicer, message)

ValueError: while calling

    ak.to_numpy(
        array = <Array ['PCB_volume', ...] type='334460 * string'>
        allow_missing = True
    )

Error details: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-12/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

¹ In my specific case, I am looking at the names of Geant4 volumes from which sim particles originated, but that is not important.
² To be very specific, I first ran into this issue trying to fill a Hist.hist object with these strings.

jpivarski · 2023-05-21T22:19:16Z

jpivarski
May 21, 2023
Maintainer

Is this a bug with awkward array? Or (more likely) is this a bug with how I'm using it?

On the face of it, this looks like a bug in Awkward Array. After flattening the list of strings into just strings, it should be functionally equivalent to the following.

An array of variable length strings

>>> array = ak.Array(["one", "two", "three", "four", "five"])

is stored in a compact way, with all the character data contiguously, delineated by integer offsets

>>> array.layout.content.data.tobytes()
b'onetwothreefourfive'
>>> array.layout.offsets.data
array([ 0,  3,  6, 11, 15, 19])

whereas a NumPy array is contiguous with fixed-sized strings (by padding all the short ones to match the length of the longest one). HOWEVER, ak.to_numpy knows this and pads strings to make them NumPy compatible.

>>> ak.to_numpy(array)
array(['one', 'two', 'three', 'four', 'five'], dtype='<U5')
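To make the offsets layout concrete, here is a minimal plain-Python sketch (no Awkward needed) that rebuilds the strings from the byte buffer and offsets shown earlier. The helper name `strings_from_offsets` is hypothetical; it is not an Awkward function.

```python
def strings_from_offsets(content: bytes, offsets: list[int]) -> list[str]:
    """Slice one contiguous UTF-8 buffer into strings using consecutive offsets."""
    return [
        content[start:stop].decode("utf-8")
        for start, stop in zip(offsets[:-1], offsets[1:])
    ]


# Values copied from the REPL session in this post.
content = b"onetwothreefourfive"
offsets = [0, 3, 6, 11, 15, 19]

strings = strings_from_offsets(content, offsets)
print(strings)
```

Each string is just the byte range between one offset and the next, which is why strings of different lengths cost no padding in this layout.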

We can see this by looking at the NumPy array's raw bytes (every 4th byte because NumPy uses UTF-32 for general strings, whereas Awkward uses UTF-8).

>>> ak.to_numpy(array).tobytes()[::4]
b'one\x00\x00two\x00\x00threefour\x00five\x00'
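The fixed-width padding can also be mimicked in plain Python, as a rough sketch of what a NumPy Unicode buffer looks like. The helper `pad_to_fixed_width` is hypothetical, and the little-endian byte order is an assumption made to match the `<U5` dtype shown in this post.

```python
def pad_to_fixed_width(strings):
    """Encode strings as NUL-padded, fixed-width UTF-32 (little-endian)."""
    width = max((len(s) for s in strings), default=0)
    return b"".join(
        s.ljust(width, "\x00").encode("utf-32-le") for s in strings
    )


raw = pad_to_fixed_width(["one", "two", "three", "four", "five"])

# Every character takes 4 bytes; for ASCII text the first byte of each
# group carries the character, so a stride of 4 recovers readable text.
every_fourth = raw[::4]
print(every_fourth)
```

Five strings of width five, at four bytes per character, occupy 100 bytes here, compared with 19 bytes of character data in the offsets-based layout.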

Okay, so in my copy of Awkward (fairly recent git cloned version of main), ak.to_numpy does this properly. What could be going on in your case?

I tried a jagged array of strings (array of variable-length lists of variable-length strings),

>>> array = ak.Array([["one", "two"], ["three"], [], ["four", "five"]])

Passing this directly into ak.to_numpy raises the error that you saw because although ak.to_numpy automatically pads strings (who wouldn't want that?), it does not automatically pad other (more visible) lists¹.

>>> ak.to_numpy(array)
Traceback (most recent call last):
...
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-15/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jpivarski/irishep/awkward/src/awkward/operations/ak_to_numpy.py", line 38, in to_numpy
    with ak._errors.OperationErrorContext(
  File "/Users/jpivarski/irishep/awkward/src/awkward/_errors.py", line 56, in __exit__
    self.handle_exception(exception_type, exception_value)
  File "/Users/jpivarski/irishep/awkward/src/awkward/_errors.py", line 71, in handle_exception
    raise self.decorate_exception(cls, exception)
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-15/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)

This error occurred while calling

    ak.to_numpy(
        array = <Array [['one', 'two'], ..., [...]] type='4 * var * string'>
        allow_missing = True

But you said that you tried both flattening and not flattening. If you flatten the lists (to get an array of strings only), it should work.

>>> ak.to_numpy(ak.flatten(array))
array(['one', 'two', 'three', 'four', 'five'], dtype='<U5')

For me, it does. But I also noticed that your error message is old, suggesting that you have an old version of Awkward. Might this be a bug that has been fixed?

¹ If you want to pad variable-length lists such that they all have the same length, as a prelude to ak.to_numpy, you can do that by ak.pad_none (pads short lists with None), followed by ak.fill_none (replaces None with some value of your choosing). If you have to use the version of Awkward that you're showing here, one with a bug in it, then this could be a work-around for you. It's what ak.to_numpy should be doing automatically for strings (the fill value is \x00 because most string-interpreters would interpret that as the end of a string that's shorter than its buffer).
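The pad-then-fill idea from the footnote can be illustrated without Awkward. The two helpers below are hypothetical plain-Python stand-ins for the roles that ak.pad_none and ak.fill_none play; they are not the library functions and only handle one level of nesting.

```python
def pad_with_none(lists, target):
    """Extend every short list with None until it has `target` items."""
    return [row + [None] * (target - len(row)) for row in lists]


def fill_missing(lists, value):
    """Replace each None with `value`, leaving other items untouched."""
    return [[value if item is None else item for item in row] for row in lists]


ragged = [["one", "two"], ["three"], [], ["four", "five"]]
target = max(len(row) for row in ragged)

regular = fill_missing(pad_with_none(ragged, target), "")
print(regular)
```

After these two steps every inner list has the same length, which is the property a rectangular NumPy array needs.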

0 replies

tomeichlersmith · 2023-05-22T00:37:30Z

tomeichlersmith
May 22, 2023
Author

Thank you for the extra context @jpivarski ! I have another clue that might lead us towards a solution.

If I dump my strings to JSON and then read them back in, the flattening works.

>>> import pathlib
>>> ak.to_json(strings, 'volumes.json')
>>> strings = ak.from_json(pathlib.Path('volumes.json'))
>>> ak.to_numpy(ak.flatten(strings))
array(['PCB_volume', 'PCB_volume', 'PCB_volume', ...,
       'W_cooling_volume_1', 'W_cooling_volume_1', 'W_cooling_volume_1'],
      dtype='<U28')

So I think this has to do with the memory layout because the way I am getting these strings is by making a selection of a larger awkward array. I have cooked up a smaller example to test this in a more portable way.

>>> a = ak.Array({
...     'keep' : [
...         [True],
...         [False, True, True],
...         [],
...         [True, False]
...     ],
...     'mystr' : [
...         ['yes'],
...         ['foo','blabla','hellothere'],
...         [],
...         ['generalkenobi','anakin']
...     ]
... })
>>> ak.to_numpy(ak.flatten(a['mystr'][a['keep']]))
# produces "cannot convert to Regular array" error

For context, I am using Awkward Array v2.1.1.

4 replies

agoose77 May 23, 2023
Maintainer

This is good detective-work @tomeichlersmith! As @jpivarski points out, the fix came from #2449, which improved how we handle strings in this context, replacing the previous naive method with a string-aware implementation:

awkward/src/awkward/contents/listarray.py

Lines 1431 to 1437 in b7971d3

    array_param = self.parameter("__array__")
    if array_param in {"bytestring", "string"}:
        # As our array-of-strings _may_ be empty, we should pass the dtype
        dtype = np.str_ if array_param == "string" else np.bytes_
        return backend.nplike.asarray(self.to_list(), dtype=dtype)
    else:
        return self.to_RegularArray()._to_backend_array(allow_missing, backend)
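The comment about the empty case in that snippet can be seen with NumPy alone. This is a small sketch of NumPy's behaviour, not part of the Awkward source: without an explicit dtype, an empty list carries no information about its element type.

```python
import numpy as np

# With no elements to inspect, NumPy falls back to its default float dtype.
empty_default = np.asarray([])

# Passing the dtype explicitly keeps the result a string array.
empty_as_str = np.asarray([], dtype=np.str_)

# For non-empty input the width is taken from the longest string.
nonempty = np.asarray(["yes", "blabla"], dtype=np.str_)

print(empty_default.dtype, empty_as_str.dtype, nonempty.dtype)
```

So passing `dtype` is what keeps an empty array of strings from coming back as an empty array of floats.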

tomeichlersmith May 23, 2023
Author

Is this already incorporated into the tests? If not, I can put my simple test script into your tests.

agoose77 May 23, 2023
Maintainer

We should add a test for this, as we don't yet have one. Are you familiar with the low-level ak.contents.Content objects that we use to describe an array? The test would be best if it built such a ListArray layout: this layout is produced by your slice, and is fixed by #2449

tomeichlersmith May 23, 2023
Author

I am not, so it may be quickest if a more experienced developer wrote the test. I'm not sure how long it will take me to grasp the ak.contents.Content class.

tomeichlersmith · 2023-05-22T00:42:58Z

tomeichlersmith
May 22, 2023
Author

However, you are correct that this is an issue with my older version of awkward! Hooray 🎉 I will upgrade.

Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward as ak
>>> ak.__version__
'2.2.1'
>>> a = ak.Array({
...     'keep' : [
...         [True],
...         [False, True, True],
...         [],
...         [True, False]
...     ],
...     'mystr' : [
...         ['yes'],
...         ['foo','blabla','hellothere'],
...         [],
...         ['generalkenobi','anakin']
...     ]
... })
>>> ak.to_numpy(ak.flatten(a['mystr'][a['keep']]))
array(['yes', 'blabla', 'hellothere', 'generalkenobi'], dtype='<U13')

2 replies

tomeichlersmith May 22, 2023
Author

For any future reader

I wrote a quick script to test this - right now, only the latest release 2.2.1 passes this test script.

import awkward as ak
import numpy as np
a = ak.Array({
    'keep' : [
        [True],
        [False, True, True],
        [],
        [True, False]
    ],
    'mystr' : [
        ['yes'],
        ['foo','blabla','hellothere'],
        [],
        ['generalkenobi','anakin']
    ]
})
print(ak.__version__)
assert (ak.to_numpy(ak.flatten(a['mystr'][a['keep']])) == np.array(['yes','blabla','hellothere','generalkenobi'])).all()

jpivarski May 22, 2023
Maintainer

Among the fixes in 2.2.1 is one related to #2449 (although that addressed the lack of information due to empty strings).

jpivarski · 2023-12-30T15:38:59Z

jpivarski
Dec 30, 2023
Maintainer

I want to keep all of the discussions open. Issues get closed when they're done, but it's valuable to keep discussions around—even if they're resolved—because they're useful to other people with the same questions.

0 replies
