bug: join may result in duplicate column names #21048

Open
2 tasks done
leoliu0 opened this issue Feb 3, 2025 · 2 comments
Labels
bug Something isn't working P-high Priority: high python Related to Python Polars

Comments

@leoliu0

leoliu0 commented Feb 3, 2025

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

When a DataFrame has duplicate column names, to_pandas fails:

import polars as p
df = p.DataFrame({'a':[1,2,3,4,5],'b':[6,6,6,6,6]})
df = df.join(df,on=['a']).join(df,on=['a'])
df.to_pandas()

This panics with: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("RecordBatch requires an equal number of fields and arrays"))

It seems that chained joins can create duplicate column names, which is very surprising behavior.
It would be great if join automatically renamed the additional columns to xxx_right2, xxx_right3, etc.
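The renaming requested above could look roughly like the sketch below. It is plain Python and only computes output column names; `suffix_right_columns` is a hypothetical helper written for illustration, not a Polars function, and the numbering scheme (`_right`, `_right2`, `_right3`, ...) is just the one proposed in this report, not what Polars does.

```python
def suffix_right_columns(left_cols, right_cols, on, suffix="_right"):
    """Return join output names, numbering the suffix on repeated clashes.

    Hypothetical scheme from this report; not Polars behavior.
    """
    taken = set(left_cols)
    out = list(left_cols)
    for name in right_cols:
        if name in on:
            continue  # join keys are kept once, from the left frame
        candidate = name
        if candidate in taken:
            candidate = f"{name}{suffix}"
            n = 2
            # Keep bumping the counter until the name is unused
            while candidate in taken:
                candidate = f"{name}{suffix}{n}"
                n += 1
        taken.add(candidate)
        out.append(candidate)
    return out


first = suffix_right_columns(["a", "b"], ["a", "b"], on=["a"])
second = suffix_right_columns(first, ["a", "b"], on=["a"])
print(first)   # ['a', 'b', 'b_right']
print(second)  # ['a', 'b', 'b_right', 'b_right2']
```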

Log output

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-arrow/src/record_batch.rs:27:47:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("RecordBatch requires an equal number of fields and arrays"))
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <polars_core::frame::RecordBatchIter as core::iter::traits::iterator::Iterator>::next
   4: <core::iter::adapters::GenericShunt<I,R> as core::iter::traits::iterator::Iterator>::next
   5: polars_python::dataframe::export::<impl polars_python::dataframe::PyDataFrame>::__pymethod_to_pandas__
   6: pyo3::impl_::trampoline::trampoline
   7: polars_python::dataframe::export::_::__INVENTORY::trampoline
   8: method_vectorcall_NOARGS
             at /usr/src/debug/python311/Python-3.11.11/Objects/descrobject.c:453:24
   9: _PyObject_VectorcallTstate
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_call.h:92:11
  10: PyObject_Vectorcall
             at /usr/src/debug/python311/Python-3.11.11/Objects/call.c:299:12
  11: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:4769:0
  12: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  13: _PyEval_Vector
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:6434
  14: _PyObject_VectorcallTstate
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_call.h:92:11
  15: method_vectorcall
             at /usr/src/debug/python311/Python-3.11.11/Objects/classobject.c:59
  16: _PyVectorcall_Call
             at /usr/src/debug/python311/Python-3.11.11/Objects/call.c:257:24
  17: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:5376:0
  18: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  19: _PyEval_Vector
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:6434
  20: PyEval_EvalCode
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:1148:0
  21: builtin_exec_impl
             at /usr/src/debug/python311/Python-3.11.11/Python/bltinmodule.c:1077:0
  22: builtin_exec
             at /usr/src/debug/python311/Python-3.11.11/Python/clinic/bltinmodule.c.h:465
  23: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:5091:0
  24: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  25: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  26: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  27: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  28: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  29: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  30: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Inc
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
/tmp/ipykernel_774697/246086097.py in ?()
----> 1 df.to_pandas()

~/.local/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, use_pyarrow_extension_array, **kwargs)
   2431             return self._to_pandas_with_object_columns(
   2432                 use_pyarrow_extension_array=use_pyarrow_extension_array, **kwargs
   2433             )
   2434 
-> 2435         return self._to_pandas_without_object_columns(
   2436             self, use_pyarrow_extension_array=use_pyarrow_extension_array, **kwargs
   2437         )

~/.local/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, df, use_pyarrow_extension_array, **kwargs)
   2482     ) -> pd.DataFrame:
   2483         if not df.width:  # Empty dataframe, cannot infer schema from batches
   2484             return pd.DataFrame()
   2485 
-> 2486         record_batches = df._df.to_pandas()
   2487         tbl = pa.Table.from_batches(record_batches)
   2488         if use_pyarrow_extension_array:
   2489             return tbl.to_pandas(

PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("RecordBatch requires an equal number of fields and arrays"))
lude/internal/pycore_ceval.h:73:16
  31: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  32: gen_send_ex
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:287:0
  33: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:5221:0
  34: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  35: _PyEval_Vector
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:6434
  36: _PyObject_VectorcallTstate
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_call.h:92:11
  37: method_vectorcall
             at /usr/src/debug/python311/Python-3.11.11/Objects/classobject.c:59
  38: _PyVectorcall_Call
             at /usr/src/debug/python311/Python-3.11.11/Objects/call.c:257:24
  39: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:5376:0
  40: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  41: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  42: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  43: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  44: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  45: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  46: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  47: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  48: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  49: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  50: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  51: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  52: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  53: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  54: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:2585:0
  55: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  56: gen_send_ex2
             at /usr/src/debug/python311/Python-3.11.11/Objects/genobject.c:219
  57: task_step_impl
             at /usr/src/debug/python311/Python-3.11.11/Modules/_asynciomodule.c:2693:22
  58: task_step
             at /usr/src/debug/python311/Python-3.11.11/Modules/_asynciomodule.c:2993:11
  59: cfunction_vectorcall_O
             at /usr/src/debug/python311/Python-3.11.11/Objects/methodobject.c:514:24
  60: _PyObject_VectorcallTstate
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_call.h:92:11
  61: context_run
             at /usr/src/debug/python311/Python-3.11.11/Python/context.c:673:29
  62: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/src/debug/python311/Python-3.11.11/Objects/methodobject.c:443:24
  63: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:5376:0
  64: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  65: _PyEval_Vector
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:6434
  66: PyEval_EvalCode
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:1148:0
  67: builtin_exec_impl
             at /usr/src/debug/python311/Python-3.11.11/Python/bltinmodule.c:1077:0
  68: builtin_exec
             at /usr/src/debug/python311/Python-3.11.11/Python/clinic/bltinmodule.c.h:465
  69: cfunction_vectorcall_FASTCALL_KEYWORDS
             at /usr/src/debug/python311/Python-3.11.11/Objects/methodobject.c:443:24
  70: _PyObject_VectorcallTstate
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_call.h:92:11
  71: PyObject_Vectorcall
             at /usr/src/debug/python311/Python-3.11.11/Objects/call.c:299:12
  72: _PyEval_EvalFrameDefault
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:4769:0
  73: _PyEval_EvalFrame
             at /usr/src/debug/python311/Python-3.11.11/./Include/internal/pycore_ceval.h:73:16
  74: _PyEval_Vector
             at /usr/src/debug/python311/Python-3.11.11/Python/ceval.c:6434
  75: PyObject_Call
             at /usr/src/debug/python311/Python-3.11.11/Objects/call.c:355:12
  76: pymain_run_module
             at /usr/src/debug/python311/Python-3.11.11/Modules/main.c:300
  77: pymain_run_python
             at /usr/src/debug/python311/Python-3.11.11/Modules/main.c:599:0
  78: Py_RunMain
             at /usr/src/debug/python311/Python-3.11.11/Modules/main.c:684
  79: <unknown>
  80: __libc_start_main
  81: _start
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Issue description

As described above

Expected behavior

Should complete without error

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Linux-6.12.10-arch1-1-x86_64-with-glibc2.40
Python:              3.11.11 (main, Dec 28 2024, 09:46:27) [GCC 14.2.1 20240910]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  1.2.0
altair               5.4.1
azure.identity       <not installed>
boto3                1.35.73
cloudpickle          3.1.0
connectorx           0.4.0
deltalake            0.21.0
fastexcel            0.12.0
fsspec               2024.10.0
gevent               24.10.3
google.auth          <not installed>
great_tables         0.13.0
matplotlib           3.9.2
numpy                1.26.4
openpyxl             3.1.5
pandas               2.2.3
pyarrow              18.0.0
pydantic             2.9.2
pyiceberg            0.7.1
sqlalchemy           2.0.36
torch                2.6.0+cu124
xlsx2csv             0.8.3
xlsxwriter           3.2.0
@leoliu0 leoliu0 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Feb 3, 2025
@MarcoGorelli MarcoGorelli added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels Feb 3, 2025
@MarcoGorelli
Collaborator

woooah

In [1]: import polars as pl

In [2]: df = pl.DataFrame({'a':[1,2,3,4,5],'b':[6,6,6,6,6]})
   ...: df = df.join(df,on=['a']).join(df,on=['a'])

In [3]: df
Out[3]:
shape: (5, 4)
┌─────┬─────┬─────────┬─────────┐
│ a   ┆ b   ┆ b_right ┆ b_right │
│ --- ┆ --- ┆ ---     ┆ ---     │
│ i64 ┆ i64 ┆ i64     ┆ i64     │
╞═════╪═════╪═════════╪═════════╡
│ 1   ┆ 6   ┆ 6       ┆ 6       │
│ 2   ┆ 6   ┆ 6       ┆ 6       │
│ 3   ┆ 6   ┆ 6       ┆ 6       │
│ 4   ┆ 6   ┆ 6       ┆ 6       │
│ 5   ┆ 6   ┆ 6       ┆ 6       │
└─────┴─────┴─────────┴─────────┘

thanks @leoliu0 for the report, this shouldn't happen, and is (I guess) a result of some check being bypassed?
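Until the missing check is restored, a caller can guard against this state by inspecting the column names before exporting. The sketch below is plain Python; `duplicate_names` is a hypothetical helper, and the list stands in for `df.columns` from the output above.

```python
from collections import Counter


def duplicate_names(names):
    """Return each name that occurs more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, n in counts.items() if n > 1]


# Column names as shown in the frame above
cols = ["a", "b", "b_right", "b_right"]
print(duplicate_names(cols))  # ['b_right']
```

An empty result means the names are unique, so a call such as to_pandas should not hit this particular panic.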

@MarcoGorelli MarcoGorelli changed the title from "to_pandas raise ComputeError" to "bug: join may result in duplicate column names" Feb 3, 2025
@orlp
Collaborator

orlp commented Feb 3, 2025

>>> df.lazy().join(df.lazy(),on=['a']).join(df.lazy(),on=['a']).collect(new_streaming=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/orlp/.localpython/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2056, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.DuplicateError: column with name 'b_right' already exists

You may want to try:
- renaming the column prior to joining
- using the `suffix` parameter to specify a suffix different to the default one ('_right')

New-streaming already detects this 😎
