Reading CSV lazy vs eager behaves inconsistently #21094

Open
2 tasks done
schmidttill opened this issue Feb 5, 2025 · 1 comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@schmidttill
Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Difference between read_csv and scan_csv

import tempfile

import polars as pl

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C,
1,2,3,
4,5,6,
9,10,11,
""".strip())
    f.seek(0)

    df = pl.read_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False, new_columns=["A", "B", "C"])
    print(df)

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘

    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False, new_columns=["A", "B", "C"]).collect()
    print(lf)

shape: (3, 4)
┌─────┬─────┬─────┬──────────┐
│ A   ┆ B   ┆ C   ┆ column_4 │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ null     │
│ 4   ┆ 5   ┆ 6   ┆ null     │
│ 9   ┆ 10  ┆ 11  ┆ null     │
└─────┴─────┴─────┴──────────┘

Without new_columns

    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False).collect()
    print(lf)

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘

Log output

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘
shape: (3, 4)
┌─────┬─────┬─────┬──────────┐
│ A   ┆ B   ┆ C   ┆ column_4 │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ null     │
│ 4   ┆ 5   ┆ 6   ┆ null     │
│ 9   ┆ 10  ┆ 11  ┆ null     │
└─────┴─────┴─────┴──────────┘
shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘
read files in parallel
read files in parallel

Issue description

I noticed an inconsistency between read_csv and scan_csv. When reading the file with both the schema and new_columns parameters, scan_csv adds an additional column (column_4) and ignores the requested Int32 dtypes, returning i64 instead. The same call works as expected with read_csv.

If new_columns is omitted, the output looks as expected.

Expected behavior

read_csv and scan_csv should produce the same output for the same arguments.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.11.10 (main, Oct  7 2024, 23:30:23) [MSC v.1929 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
'az' is not recognized as an internal or external command,
operable program or batch file.
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.1
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             2.10.5
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
@mcrumiller
Contributor

mcrumiller commented Feb 5, 2025

scan_csv has it correct and read_csv is wrong here. Your CSV is malformed: there is a trailing comma on each line, which means that your input data actually has a 4th, unnamed column.

Compare the output with and without the trailing comma:

from io import StringIO
import polars as pl

csv_with_trailing_comma = StringIO(
    "A,B,C,\n"
    "1,2,3,\n"
)

print(pl.scan_csv(csv_with_trailing_comma, new_columns=["A", "B", "C"]).collect())
# shape: (1, 4)
# ┌─────┬─────┬─────┬──────┐
# │ A   ┆ B   ┆ C   ┆      │
# │ --- ┆ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪═════╪══════╡
# │ 1   ┆ 2   ┆ 3   ┆ null │
# └─────┴─────┴─────┴──────┘

csv_without_trailing_comma = StringIO(
    "A,B,C\n"
    "1,2,3\n"
)

print(pl.scan_csv(csv_without_trailing_comma, new_columns=["A", "B", "C"]).collect())
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┘

It is odd, though, that the 4th column only shows up when new_columns is used.
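
To make the "4th unnamed column" point concrete without involving Polars at all, here is a minimal sketch using only Python's standard-library csv module. It counts the fields in each row of the data from the issue, then counts them again after dropping one trailing comma per line. The names raw, cleaned, field_counts and cleaned_counts are made up for this illustration, and stripping the comma is just one possible workaround under the assumption that the trailing delimiter is never meaningful in your data. It is not something suggested by the Polars maintainers.

```python
import csv
from io import StringIO

# Same data as in the issue: every line ends with a trailing comma.
raw = "A,B,C,\n1,2,3,\n4,5,6,\n9,10,11,\n"

# A CSV parser sees an empty 4th field after the last comma on each line.
rows = list(csv.reader(StringIO(raw)))
field_counts = {len(row) for row in rows}

# Hypothetical cleanup: drop exactly one trailing comma per line
# (str.removesuffix needs Python 3.9+).
cleaned = "\n".join(line.removesuffix(",") for line in raw.splitlines()) + "\n"
cleaned_rows = list(csv.reader(StringIO(cleaned)))
cleaned_counts = {len(row) for row in cleaned_rows}

print(rows[0])         # header row, including the empty trailing field
print(field_counts)    # number of fields per row before cleanup
print(cleaned_counts)  # number of fields per row after cleanup
```

If the cleaned text is what gets handed to read_csv or scan_csv, both readers should at least agree on the column count, though that does not explain why new_columns changes the scan_csv result.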
