CI: enable string dtype for test job with pandas main #947

Open · wants to merge 1 commit into base: main

Conversation

jorisvandenbossche
Member

Currently you still have to enable this manually even when using pandas main (although it is probably time to switch the default on the pandas side). Enabling it should also show that the simple test added in #933 is failing again: it is related to a np.array(inp, copy=False) call that now errors if copy=False cannot be honored. That was a behavior change in numpy 2+ that pandas will start following in 3.0.
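For illustration, here is a minimal sketch of that NumPy 2 semantics change (the list of strings is an assumed example, not the exact data fastparquet passes to np.array):

import numpy as np

values = ["some", "strings"]       # converting a Python list always requires a copy

arr = np.asarray(values)           # fine: copying is allowed

try:
    np.array(values, copy=False)   # NumPy >= 2.0: copy=False means "never copy"
except ValueError as exc:
    print(exc)                     # raised because the copy cannot be avoided

Before NumPy 2, copy=False only meant "avoid a copy if possible", so the same call silently copied instead of raising.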

@jorisvandenbossche
Member Author

Enabling it should also show that the simple test added in #933 is failing again: it is related to a np.array(inp, copy=False) call that now errors if copy=False cannot be honored. That was a behavior change in numpy 2+ that pandas will start following in 3.0.

Sorry, that test should of course already have been using the string dtype, because it uses a fixture to enable the option on the fly (a sketch of such a fixture is shown after the traceback below). So I am not entirely sure why it is not failing in CI. If I run that example locally in my pandas dev env, it does fail:

In [3]: pd.options.future.infer_string = True

In [4]: pd.__version__
Out[4]: '3.0.0.dev0+1487.g160b3eb4be.dirty'

In [5]: import fastparquet

In [6]: fastparquet.__version__
Out[6]: '2024.11.0'

In [7]: df = pd.DataFrame({"a": ["some", "strings"]})
   ...: df.to_parquet("temp.parquet", engine="fastparquet")
...
ValueError: Error converting column "a" to bytes using encoding UTF8. Original error: Unable to avoid copy while creating an array as requested.

Full error traceback:

In [7]: df = pd.DataFrame({"a": ["some", "strings"]})
   ...: df.to_parquet("temp.parquet", engine="fastparquet")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:290, in convert(data, se)
    288 if converted_type == parquet_thrift.ConvertedType.UTF8:
    289     # TODO: into bytes in one step
--> 290     out = array_encode_utf8(data)
    291 elif converted_type is None:

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/speedups.pyx:43, in fastparquet.speedups.array_encode_utf8()

File ~/scipy/repos/pandas/pandas/core/series.py:888, in Series.__array__(self, dtype, copy)
    887 else:
--> 888     arr = np.array(values, dtype=dtype, copy=copy)
    890 if copy is True:

File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:681, in ArrowExtensionArray.__array__(self, dtype, copy)
    679 if copy is False:
    680     # TODO: By using `zero_copy_only` it may be possible to implement this
--> 681     raise ValueError(
    682         "Unable to avoid copy while creating an array as requested."
    683     )
    684 elif copy is None:
    685     # `to_numpy(copy=False)` has the meaning of NumPy `copy=None`.

ValueError: Unable to avoid copy while creating an array as requested.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[7], line 2
      1 df = pd.DataFrame({"a": ["some", "strings"]})
----> 2 df.to_parquet("temp.parquet", engine="fastparquet")

File ~/scipy/repos/pandas/pandas/core/frame.py:2972, in DataFrame.to_parquet(self, path, engine, compression, index, partition_cols, storage_options, **kwargs)
   2882 """
   2883 Write a DataFrame to the binary parquet format.
   2884 
   (...)
   2968 >>> content = f.read()
   2969 """
   2970 from pandas.io.parquet import to_parquet
-> 2972 return to_parquet(
   2973     self,
   2974     path,
   2975     engine,
   2976     compression=compression,
   2977     index=index,
   2978     partition_cols=partition_cols,
   2979     storage_options=storage_options,
   2980     **kwargs,
   2981 )

File ~/scipy/repos/pandas/pandas/io/parquet.py:477, in to_parquet(df, path, engine, compression, index, storage_options, partition_cols, filesystem, **kwargs)
    473 impl = get_engine(engine)
    475 path_or_buf: FilePath | WriteBuffer[bytes] = io.BytesIO() if path is None else path
--> 477 impl.write(
    478     df,
    479     path_or_buf,
    480     compression=compression,
    481     index=index,
    482     partition_cols=partition_cols,
    483     storage_options=storage_options,
    484     filesystem=filesystem,
    485     **kwargs,
    486 )
    488 if path is None:
    489     assert isinstance(path_or_buf, io.BytesIO)

File ~/scipy/repos/pandas/pandas/io/parquet.py:339, in FastParquetImpl.write(self, df, path, compression, index, partition_cols, storage_options, filesystem, **kwargs)
    334     raise ValueError(
    335         "storage_options passed with file object or non-fsspec file path"
    336     )
    338 with catch_warnings(record=True):
--> 339     self.api.write(
    340         path,
    341         df,
    342         compression=compression,
    343         write_index=index,
    344         partition_on=partition_cols,
    345         **kwargs,
    346     )

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:1343, in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times, custom_metadata, stats)
   1339     fmd.key_value_metadata = kvm
   1341 if file_scheme == 'simple':
   1342     # Case 'simple'
-> 1343     write_simple(filename, data, fmd,
   1344                  row_group_offsets=row_group_offsets,
   1345                  compression=compression, open_with=open_with,
   1346                  has_nulls=None, append=False, stats=stats)
   1347 else:
   1348     # Case 'hive', 'drill'
   1349     write_multi(filename, data, fmd,
   1350                 row_group_offsets=row_group_offsets,
   1351                 compression=compression, file_scheme=file_scheme,
   1352                 write_fmd=True, open_with=open_with,
   1353                 mkdirs=mkdirs, partition_on=partition_on,
   1354                 append=False, stats=stats)

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:1004, in write_simple(fn, data, fmd, row_group_offsets, compression, open_with, has_nulls, append, stats)
   1002 of = open_with(fn, mode)
   1003 with of as f:
-> 1004     write_to_file(f)

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:988, in write_simple.<locals>.write_to_file(f)
    986 rgs = fmd.row_groups
    987 for i, row_group in enumerate(data):
--> 988     rg = make_row_group(f, row_group, fmd.schema,
    989                         compression=compression, stats=stats)
    990     if rg is not None:
    991         rgs.append(rg)

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:805, in make_row_group(f, data, schema, compression, stats)
    803         else:
    804             st = column.name in stats
--> 805         chunk = write_column(f, coldata, column,
    806                              compression=comp, stats=st)
    807         cols.append(chunk)
    808 rg = ThriftObject.from_fields(
    809     "RowGroup", num_rows=rows, columns=cols,
    810     total_byte_size=sum([c.meta_data.total_uncompressed_size for c in cols]))

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:638, in write_column(f, data0, selement, compression, datapage_version, stats)
    634     data = data.astype('int32')
    636 if datapage_version == 1:
    637     bdata = b"".join([
--> 638         repetition_data, definition_data, encode[encoding](data, selement), 8 * b'\x00'
    639     ])
    640     dph = parquet_thrift.DataPageHeader(
    641         num_values=check_32(row_end - row_start),
    642         encoding=getattr(parquet_thrift.Encoding, encoding),
   (...)
    645         i32=1
    646     )
    647     l0 = len(bdata)

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:391, in encode_plain(data, se)
    389 def encode_plain(data, se):
    390     """PLAIN encoding; returns byte representation"""
--> 391     out = convert(data, se)
    392     if se.type == parquet_thrift.Type.BYTE_ARRAY:
    393         return pack_byte_array(list(out))

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:298, in convert(data, se)
    295     except Exception as e:  # pragma: no cover
    296         ct = parquet_thrift.ConvertedType._VALUES_TO_NAMES[
    297             converted_type] if converted_type is not None else None
--> 298         raise ValueError('Error converting column "%s" to bytes using '
    299                          'encoding %s. Original error: '
    300                          '%s' % (data.name, ct, e))
    302 elif converted_type == parquet_thrift.ConvertedType.TIME_MICROS:
    303     # TODO: shift inplace
    304     if data.dtype == "m8[ns]":

ValueError: Error converting column "a" to bytes using encoding UTF8. Original error: Unable to avoid copy while creating an array as requested.
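As a side note, the fixture-based opt-in mentioned above could look roughly like this; this is a sketch with an assumed fixture name, not the actual fixture added in #933:

import pandas as pd
import pytest

@pytest.fixture
def string_dtype_enabled():
    # temporarily enable pandas' future string dtype for one test,
    # restoring the previous setting afterwards
    with pd.option_context("future.infer_string", True):
        yield

Any test requesting the fixture then runs with the option enabled, without affecting the rest of the suite.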

Now, enabling the future string dtype for the full test suite surfaces a bunch of other failures. Based on a quick look at the logs, I assume most of them are cases where the tests themselves need updating to run correctly with the future pandas release, for example tests that hardcode object dtype as the expected dtype, while in the future that might be the string dtype.
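As an illustration of the kind of adjustment such failures usually need (a sketch, not an actual fastparquet test), the expected dtype has to depend on whether the option is enabled:

import pandas as pd

df = pd.DataFrame({"a": ["some", "strings"]})

if pd.options.future.infer_string:
    # with the future option enabled, string columns use pandas' StringDtype
    assert isinstance(df["a"].dtype, pd.StringDtype)
else:
    # legacy behavior: strings are stored in an object-dtype column
    assert df["a"].dtype == object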

@martindurant
Member

Specifically, it's test_gh929 that would fail?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Jan 27, 2025

No, AFAIK that is a test for datetime dtype; it's the test_auto_string test added in #933 that is failing for me locally (based on the output shown in the comment above).
