Issue Regularizing an Awkward Array of Strings #2465
-
Let's say I have a ragged array of strings. I want to flatten this array to NumPy so I can fill a histogram of the values. The specific example is below, but the TL;DR is that I get an error when attempting to convert an
This same error occurs when trying
Is this a bug with Awkward Array? Or (more likely) is this a bug with how I'm using it? Perhaps there's a workaround where I tell Awkward what kind of type to use in the NumPy array? I couldn't find such a parameter in the docs, but I could easily have missed it.
In any case, I have this set of strings
And I can
But I cannot convert
Replies: 4 comments 6 replies
-
On the face of it, this looks like a bug in Awkward Array. After flattening the list of strings into just strings, it should be functionally equivalent to the following.

An array of variable-length strings
>>> array = ak.Array(["one", "two", "three", "four", "five"])
is stored in a compact way, with all the character data contiguous, delineated by integer offsets
>>> array.layout.content.data.tobytes()
b'onetwothreefourfive'
>>> array.layout.offsets.data
array([ 0, 3, 6, 11, 15, 19])
whereas a NumPy array is contiguous with fixed-size strings (by padding all the short ones to match the length of the longest one). HOWEVER,
>>> ak.to_numpy(array)
array(['one', 'two', 'three', 'four', 'five'], dtype='<U5')
We can see this by looking at the NumPy array's raw bytes (every 4th byte because NumPy uses UTF-32 for general strings, whereas Awkward uses UTF-8).
>>> ak.to_numpy(array).tobytes()[::4]
b'one\x00\x00two\x00\x00threefour\x00five\x00'
Okay, so in my copy of Awkward (fairly recent git cloned version of
I tried a jagged array of strings (array of variable-length lists of variable-length strings),
>>> array = ak.Array([["one", "two"], ["three"], [], ["four", "five"]])
Passing this directly into
>>> ak.to_numpy(array)
Traceback (most recent call last):
...
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-15/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jpivarski/irishep/awkward/src/awkward/operations/ak_to_numpy.py", line 38, in to_numpy
with ak._errors.OperationErrorContext(
File "/Users/jpivarski/irishep/awkward/src/awkward/_errors.py", line 56, in __exit__
self.handle_exception(exception_type, exception_value)
File "/Users/jpivarski/irishep/awkward/src/awkward/_errors.py", line 71, in handle_exception
raise self.decorate_exception(cls, exception)
ValueError: cannot convert to RegularArray because subarray lengths are not regular (in compiled code: https://github.com/scikit-hep/awkward/blob/awkward-cpp-15/awkward-cpp/src/cpu-kernels/awkward_ListOffsetArray_toRegularArray.cpp#L22)
This error occurred while calling
ak.to_numpy(
array = <Array [['one', 'two'], ..., [...]] type='4 * var * string'>
allow_missing = True

But you said that you tried both flattening and not flattening. If you flatten the lists (to get an array of strings only), it should work.
>>> ak.to_numpy(ak.flatten(array))
array(['one', 'two', 'three', 'four', 'five'], dtype='<U5')
For me, it does. But I also noticed that your error message is old, suggesting that you have an old version of Awkward. Might this be a bug that has been fixed?
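The difference between the two layouts described above can be reproduced without Awkward at all. The following is only a rough sketch in plain NumPy: the names `content`, `offsets`, `get_string`, and `fixed` are chosen here for illustration and are not Awkward's internals.

```python
import numpy as np

strings = ["one", "two", "three", "four", "five"]

# Compact layout: every string's UTF-8 bytes back to back,
# plus integer offsets marking where each string starts and stops.
encoded = [s.encode("utf-8") for s in strings]
content = b"".join(encoded)
offsets = np.concatenate(([0], np.cumsum([len(b) for b in encoded])))

def get_string(i):
    """Recover string i from the compact layout (illustrative helper)."""
    return content[offsets[i]:offsets[i + 1]].decode("utf-8")

# Fixed-width layout: NumPy pads every entry to the longest string
# and stores each character in 4 bytes (UTF-32).
fixed = np.array(strings)

print(content)          # the contiguous character data
print(offsets)          # string boundaries
print(get_string(2))    # 'three'
print(fixed.nbytes)     # 5 strings * 5 characters * 4 bytes
```

The compact form needs 19 bytes of character data plus six offsets, whereas the fixed-width array needs 100 bytes, which is why a conversion step (and padding) is involved when going to NumPy.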
-
Thank you for the extra context, @jpivarski! I have another clue that might lead us towards a solution. If I dump my strings to JSON and then read them back in, the flattening works.
>>> ak.to_json(strings, 'volumes.json')
>>> strings = ak.from_json(pathlib.Path('volumes.json'))
>>> ak.to_numpy(ak.flatten(strings))
array(['PCB_volume', 'PCB_volume', 'PCB_volume', ...,
'W_cooling_volume_1', 'W_cooling_volume_1', 'W_cooling_volume_1'],
      dtype='<U28')
So I think this has to do with the memory layout, because I am getting these strings by making a selection from a larger Awkward Array. I have cooked up a smaller example to test this in a more portable way.
>>> a = ak.Array({
'keep' : [
[True],
[False, True, True],
[],
[True, False]
],
'mystr' : [
['yes'],
['foo','blabla','hellothere'],
[],
['generalkenobi','anakin']
]
})
>>> ak.to_numpy(ak.flatten(a['mystr'][a['keep']]))
# produces "cannot convert to Regular array" error
For context, I am using Awkward Array v2.1.1.
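If upgrading is not an option, one possible fallback is to go through plain Python lists instead of `ak.to_numpy`. This is a minimal sketch, assuming the data is small enough that a round trip through Python objects is acceptable; `flatten_strings_to_numpy` is a hypothetical helper written for this example, not part of Awkward or NumPy.

```python
import numpy as np

def flatten_strings_to_numpy(nested):
    """Flatten one level of list nesting into a fixed-width NumPy string array.

    Hypothetical helper: `nested` is a plain Python list of lists of str.
    """
    flat = [s for sublist in nested for s in sublist]
    return np.array(flat, dtype=str)

# With Awkward installed, `selected` could come from
#     ak.to_list(a['mystr'][a['keep']])
# A literal list is used here so the sketch runs on its own.
selected = [["yes"], ["blabla", "hellothere"], [], ["generalkenobi"]]

values = flatten_strings_to_numpy(selected)
print(values)
```

Because the flattening happens in Python, this sidesteps the array's memory layout entirely, at the cost of speed on large inputs.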
-
However, you are correct that this is an issue with my older version of Awkward! Hooray 🎉 I will upgrade.
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import awkward as ak
>>> ak.__version__
'2.2.1'
>>> a = ak.Array({
... 'keep' : [
... [True],
... [False, True, True],
... [],
... [True, False]
... ],
... 'mystr' : [
... ['yes'],
... ['foo','blabla','hellothere'],
... [],
... ['generalkenobi','anakin']
... ]
... })
>>> ak.to_numpy(ak.flatten(a['mystr'][a['keep']]))
array(['yes', 'blabla', 'hellothere', 'generalkenobi'], dtype='<U13')
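Once the strings are in a flat NumPy array, the histogram mentioned at the start of the thread can be filled by counting distinct values. The sketch below uses `np.unique` with `return_counts=True`; the sample `values` are made up for illustration (volume names in the style of the output shown earlier), not the real data.

```python
import numpy as np

# Stand-in for the result of ak.to_numpy(ak.flatten(...)); values are illustrative.
values = np.array(
    ["PCB_volume", "PCB_volume", "W_cooling_volume_1", "PCB_volume"]
)

# np.unique returns the sorted distinct labels and how often each occurs.
labels, counts = np.unique(values, return_counts=True)
histogram = dict(zip(labels.tolist(), counts.tolist()))

print(histogram)  # {'PCB_volume': 3, 'W_cooling_volume_1': 1}
```

The `labels` and `counts` arrays can then be handed to whichever plotting or histogramming tool is in use as category names and bin contents.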
-
I want to keep all of the discussions open. Issues get closed when they're done, but it's valuable to keep discussions around, even if they're resolved, because they're useful to other people with the same questions.