Reading CSV lazy vs eager behaves inconsistently #21094

schmidttill · 2025-02-05T11:44:59Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Difference between `read_csv` and `scan_csv`

import tempfile

import polars as pl

with tempfile.NamedTemporaryFile() as f:
    f.write(b"""
A,B,C,
1,2,3,
4,5,6,
9,10,11,
""".strip())
    f.seek(0)

    df = pl.read_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False, new_columns=["A", "B", "C"])
    print(df)

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘

    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False, new_columns=["A", "B", "C"]).collect()
    print(lf)

shape: (3, 4)
┌─────┬─────┬─────┬──────────┐
│ A   ┆ B   ┆ C   ┆ column_4 │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ null     │
│ 4   ┆ 5   ┆ 6   ┆ null     │
│ 9   ┆ 10  ┆ 11  ┆ null     │
└─────┴─────┴─────┴──────────┘

Without `new_columns`

    lf = pl.scan_csv(f.name, schema=dict.fromkeys("ABC", pl.Int32), truncate_ragged_lines=True, skip_rows=1, has_header=False).collect()
    print(lf)

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘

Log output

shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘
shape: (3, 4)
┌─────┬─────┬─────┬──────────┐
│ A   ┆ B   ┆ C   ┆ column_4 │
│ --- ┆ --- ┆ --- ┆ ---      │
│ i64 ┆ i64 ┆ i64 ┆ str      │
╞═════╪═════╪═════╪══════════╡
│ 1   ┆ 2   ┆ 3   ┆ null     │
│ 4   ┆ 5   ┆ 6   ┆ null     │
│ 9   ┆ 10  ┆ 11  ┆ null     │
└─────┴─────┴─────┴──────────┘
shape: (3, 3)
┌─────┬─────┬─────┐
│ A   ┆ B   ┆ C   │
│ --- ┆ --- ┆ --- │
│ i32 ┆ i32 ┆ i32 │
╞═════╪═════╪═════╡
│ 1   ┆ 2   ┆ 3   │
│ 4   ┆ 5   ┆ 6   │
│ 9   ┆ 10  ┆ 11  │
└─────┴─────┴─────┘
read files in parallel
read files in parallel

Issue description

I noticed an inconsistency between read_csv and scan_csv. When reading the file with both the schema and new_columns parameters, scan_csv adds an additional column (column_4) and returns i64 columns rather than the Int32 requested in schema. The same call works just fine with read_csv.

If new_columns is omitted, the output looks as expected.

Expected behavior

read_csv and scan_csv should produce the same output when given the same arguments.

Installed versions

--------Version info---------
Polars:              1.21.0
Index type:          UInt32
Platform:            Windows-10-10.0.22631-SP0
Python:              3.11.10 (main, Oct  7 2024, 23:30:23) [MSC v.1929 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            <not installed>
adbc_driver_manager  <not installed>
'az' is not recognized as an internal or external command,
operable program or batch file.
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            <not installed>
fsspec               <not installed>
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
numpy                2.2.1
openpyxl             <not installed>
pandas               2.2.3
pyarrow              19.0.0
pydantic             2.10.5
pyiceberg            <not installed>
sqlalchemy           <not installed>
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>


mcrumiller · 2025-02-05T12:40:47Z

scan_csv has it correct and read_csv is wrong here. Your CSV is malformed and you have a trailing comma at each line, which means that your input data actually has a 4th unnamed column.
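The field count implied by the trailing comma can be checked without Polars. The following is a minimal sketch using the standard library `csv` module on a shortened copy of the data from the report; it only illustrates how a generic CSV parser counts fields and is not a description of how Polars parses internally.

```python
import csv
from io import StringIO

# Same shape as the file in the report: every line ends with a comma.
raw = "A,B,C,\n1,2,3,\n4,5,6,\n"

# csv.reader treats whatever follows the last delimiter as one more field,
# even when that field is empty.
rows = list(csv.reader(StringIO(raw)))
field_counts = [len(row) for row in rows]

print(rows[0])       # header fields, including the empty trailing one
print(field_counts)  # number of fields found on each line
```

Every line, header included, yields four fields, with the last one being an empty string.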

Compare the output with and without the trailing comma:

from io import StringIO
import polars as pl

csv_with_trailing_comma = StringIO(
    "A,B,C,\n"
    "1,2,3,\n"
)

print(pl.scan_csv(csv_with_trailing_comma, new_columns=["A", "B", "C"]).collect())
# shape: (1, 4)
# ┌─────┬─────┬─────┬──────┐
# │ A   ┆ B   ┆ C   ┆      │
# │ --- ┆ --- ┆ --- ┆ ---  │
# │ i64 ┆ i64 ┆ i64 ┆ str  │
# ╞═════╪═════╪═════╪══════╡
# │ 1   ┆ 2   ┆ 3   ┆ null │
# └─────┴─────┴─────┴──────┘

csv_without_trailing_comma = StringIO(
    "A,B,C\n"
    "1,2,3\n"
)

print(pl.scan_csv(csv_without_trailing_comma, new_columns=["A", "B", "C"]).collect())
# shape: (1, 3)
# ┌─────┬─────┬─────┐
# │ A   ┆ B   ┆ C   │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 3   │
# └─────┴─────┴─────┘

It is odd, though, that the 4th column only shows up when new_columns is used.
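Until the two readers agree, one possible workaround is to strip the trailing delimiter before the text reaches any CSV reader. The sketch below uses only the standard library; `strip_trailing_delimiter` is a hypothetical helper, not part of Polars. It assumes that no field contains a quoted newline and that the last real column is never legitimately empty, since in either case this naive approach would corrupt the data.

```python
import csv
from io import StringIO


def strip_trailing_delimiter(text: str, delimiter: str = ",") -> str:
    """Remove at most one trailing delimiter from each line (hypothetical helper)."""
    cleaned_lines = []
    for line in text.splitlines():
        if line.endswith(delimiter):
            line = line[: -len(delimiter)]
        cleaned_lines.append(line)
    return "\n".join(cleaned_lines) + "\n"


raw = "A,B,C,\n1,2,3,\n4,5,6,\n9,10,11,\n"
cleaned = strip_trailing_delimiter(raw)

# Verify the result with the standard csv module: three fields per line.
rows = list(csv.reader(StringIO(cleaned)))
field_counts = [len(row) for row in rows]
print(cleaned)
print(field_counts)
```

The cleaned text can then be wrapped in `io.StringIO` and passed to a CSV reader in place of the original file, so that both code paths see exactly three columns.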

schmidttill added the labels bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) on Feb 5, 2025
