-
I manually moved this here because the error is a [...]

I'll leave it for others to talk about strategies of turning unions into non-unions, as they'll be prompted by use-cases. Your use-case, however, shouldn't be happening: it's a union of exactly one type. A union like that shouldn't have been created; it should be the single type itself instead. If you can find the single step that introduces that [...]

Here's how unions are supposed to work. The following must be a union type because it mixes integers and strings:

>>> must_be_union = ak.Array([1, 2, 3, "four", "five"])
>>> must_be_union
<Array [1, 2, 3, 'four', 'five'] type='5 * union[int64, string]'>

The following is still a union type because it could contain multiple types, even though this particular value does not.

>>> union_type_but_not_values = must_be_union[:3]
>>> union_type_but_not_values
<Array [1, 2, 3] type='3 * union[int64, string]'>

We may need new functions/tooling to deal with such cases. Similarly, #487 is an open issue to provide tooling for eliminating unwanted option-types.

The strange case you encountered is something I have to build manually because high-level functions are not supposed to produce it:

>>> a = union_type_but_not_values.layout
>>> bad_union = ak.Array(ak.layout.UnionArray8_64(a.tags, a.index, [a.contents[0]]))
>>> bad_union
<Array [1, 2, 3] type='3 * union[int64]'>

If anything like a [...]

>>> fixed_it = ak.Array(bad_union.layout.simplify())
>>> fixed_it
<Array [1, 2, 3] type='3 * int64'>

That's why I'd like to find out which of your steps is producing this union of only one type (a record in your example). That one operation is probably not calling [...]

With your data, though, it should be possible to turn

>>> df_Pangenome = df.Pangenome
>>> ak.type(df_Pangenome)
20 * var * union[{"Gene": option[string], "Annotation": option[string]}]

into something usable by [...]

>>> fixed = ak.Array(
... ak.layout.ListArray64(
... df_Pangenome.layout.starts,
... df_Pangenome.layout.stops,
... ak.layout.IndexedArray64(
... df_Pangenome.layout.content.index,
... df_Pangenome.layout.content.contents[0],
... )
... )
... )
...
>>> ak.type(fixed)
20 * var * {"Gene": option[string], "Annotation": option[string]}

but I haven't tested the above (I simulated the Python outputs) because your example is not reproducible (no access to [...]
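
To make the tags/index/contents bookkeeping above more concrete, here is a minimal pure-Python sketch of a union layout. It is a toy model under stated assumptions, not Awkward Array's implementation: the class `ToyUnion`, its `head` method, and the function `collapse_single_content` are hypothetical names invented for illustration, and they only mimic what the low-level layout and `simplify()` are described as doing in this reply.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence, Union


@dataclass
class ToyUnion:
    """Toy stand-in for a union layout (hypothetical; not Awkward's class)."""
    tags: List[int]                # tags[i]: which content element i comes from
    index: List[int]               # index[i]: position inside that content
    contents: List[Sequence[Any]]  # one sequence per possible type

    def to_list(self) -> List[Any]:
        """Materialize the values by following tags and index."""
        return [self.contents[t][j] for t, j in zip(self.tags, self.index)]

    def head(self, n: int) -> "ToyUnion":
        """Keep the first n elements; the contents (and so the type) are unchanged."""
        return ToyUnion(self.tags[:n], self.index[:n], self.contents)


def collapse_single_content(u: ToyUnion) -> Union[ToyUnion, List[Any]]:
    """If the union has exactly one content, return plain values; otherwise return u."""
    if len(u.contents) == 1:
        return [u.contents[0][j] for j in u.index]
    return u


must_be_union = ToyUnion(
    tags=[0, 0, 0, 1, 1],
    index=[0, 1, 2, 0, 1],
    contents=[[1, 2, 3], ["four", "five"]],
)
first_three = must_be_union.head(3)  # only ints remain, but there are still two contents
bad_union = ToyUnion(first_three.tags, first_three.index, [first_three.contents[0]])
fixed = collapse_single_content(bad_union)

print(must_be_union.to_list())
print(len(first_three.contents))
print(fixed)
```

The point of the sketch is that slicing alone never removes a content, so the "type" stays a union, while a union with a single content carries no more information than an indexed view of that content.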
-
Hi,

Perhaps this issue has already been resolved, but I am facing much the same thing while working with Awkward Array.

# Jet.pt
<ListOffsetArray len='100000'>
<offsets><Index dtype='int64' len='100001'>
[ 0 2 8 ... 533498 533505 533508]
</Index></offsets>
<content><NumpyArray dtype='float32' len='533508'>
[31.162498 32.81094 31.387499 ... 31.264063 25.385939 17.058594]
</NumpyArray></content>
</ListOffsetArray>

and

# hcand.pt
<ListArray len='100000'>
<starts><Index dtype='int64' len='100000'>
[39588 39588 39588 ... 39588 39588 39588]
</Index></starts>
<stops><Index dtype='int64' len='100000'>
[39588 39588 39588 ... 39588 39588 39588]
</Index></stops>
<content><UnionArray len='39588'>
<tags><Index dtype='int8' len='39588'>[0 0 0 ... 0 0 0]</Index></tags>
<index><Index dtype='int64' len='39588'>
[ 21 67981 98 ... 676112 524119 676185]
</Index></index>
<content index='0'>
<NumpyArray dtype='float32' len='726294'>
[22.018867 38.186386 14.039998 ... 27.71415 29.300837
10.972406]
</NumpyArray>
</content>
<content index='1'>
<ListArray len='0'>
<starts><Index dtype='int64' len='0'>
[]
</Index></starts>
<stops><Index dtype='int64' len='0'>
[]
</Index></stops>
<content><NumpyArray dtype='float32' len='13196'>
[34.344795 47.639206 34.660675 ... 44.429535 52.219776
50.008633]
</NumpyArray></content>
</ListArray>
</content>
</UnionArray></content>
</ListArray>

But I can't write this array to Parquet; it shows the following error:

ArrowNotImplementedError Traceback (most recent call last)
Cell In[9], line 1
----> 1 ak.to_parquet(events.hcand.pt, "temp.parquet")
File ~/Work/ColumnflowAnalyses/CPinHToTauTau/data/software/venvs/venv_columnar_dev_3cbb5aff/lib/python3.9/site-packages/awkward/_dispatch.py:68, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
63 else:
64 raise AssertionError(
65 "high-level functions should only implement a single yield statement"
66 )
---> 68 return gen_or_result
File ~/Work/ColumnflowAnalyses/CPinHToTauTau/data/software/venvs/venv_columnar_dev_3cbb5aff/lib/python3.9/site-packages/awkward/_errors.py:67, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
60 try:
61 # Handle caught exception
62 if (
63 exception_type is not None
64 and issubclass(exception_type, Exception)
65 and self.primary() is self
66 ):
---> 67 self.handle_exception(exception_type, exception_value)
68 finally:
69 # `_kwargs` may hold cyclic references, that we really want to avoid
70 # as this can lead to large buffers remaining in memory for longer than absolutely necessary
71 # Let's just clear this, now.
72 self._kwargs.clear()
File ~/Work/ColumnflowAnalyses/CPinHToTauTau/data/software/venvs/venv_columnar_dev_3cbb5aff/lib/python3.9/site-packages/awkward/_errors.py:82, in ErrorContext.handle_exception(self, cls, exception)
80 self.decorate_exception(cls, exception)
81 else:
---> 82 raise self.decorate_exception(cls, exception)
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<0: extension<awkward<AwkwardArrowType>> not null=0, 1: extension<awkward<AwkwardArrowType>> not null=1>

Please let me know if there is any other way.
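
The layout printed above shows a union with two contents in which the visible tags are all 0 and the second content has length 0. Assuming every tag really is 0 (the printout elides most of them, so this would need to be checked on the real array), the union holds only one kind of value and could be projected onto that content. The sketch below illustrates that check in plain Python; `project_single_used_content` is a hypothetical helper, the sample numbers are made up, and nothing here is an Awkward Array API.

```python
from typing import Any, List, Sequence


def project_single_used_content(
    tags: Sequence[int],
    index: Sequence[int],
    contents: Sequence[Sequence[Any]],
) -> List[Any]:
    """Return plain values if at most one content is referenced by the tags."""
    used = set(tags)
    if len(used) > 1:
        # The union genuinely mixes types; projecting would lose information.
        raise ValueError(f"union mixes contents {sorted(used)}; cannot project")
    if not used:
        return []
    (only,) = used
    return [contents[only][j] for j in index]


# Made-up miniature of the printed layout: every tag is 0 and content 1 is empty.
tags = [0, 0, 0]
index = [2, 0, 1]
contents = [[22.0, 38.5, 14.0], []]

projected = project_single_used_content(tags, index, contents)
print(projected)
```

If the check raises, the union really does mix both contents, and the data would have to be restructured rather than projected before writing it out.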
-
From @nongiga:
I'll try to make this as easy to reproduce as I can, but it's a little complicated.
I have a data structure that looks a little like this:
I am loading multiple CSV files and saving their content in a subfield of the structure, making a final struct that looks more like this:
I did it using the following code, a bit bulky but it gets the job done:
The problem is that when I try to save the file with to_parquet:
ak.to_parquet(df.Pangenome, datadir+"/up_to_pangenome.parquet", explode_records=True)
I get:
ArrowNotImplementedError: Unhandled type for Arrow to Parquet schema conversion: dense_union<0: string not null=0, 1: string not null=1>
Nonetheless, when I try to save the Isolates substructure within the array:
ak.to_parquet(df.Isolates, datadir+"/up_to_pangenome.parquet", explode_records=True)
I get no error.
When I print the type for df.Pangenome I get:
But for the isolates I get:
Is there a way to convert the union type to a different type within the list to circumvent this?
Thanks!