feat: Support schema arg in read/scan_parquet() #19013

Merged
ritchie46 merged 1 commit into pola-rs:main from scan-parquet-schema-arg
Sep 30, 2024

Conversation

nameexhaustion
Collaborator

@nameexhaustion nameexhaustion commented Sep 30, 2024

In combination with allow_missing_columns, this supports reading from datasets whose column names differ across files. In the example below, the columns in schema that the file lacks (extra1, extra2) are filled with nulls:

import polars as pl

path = "hf://datasets/nameexhaustion/polars-docs/iris.parquet"
print(pl.read_parquet(path))
# shape: (150, 5)
# ┌──────────────┬─────────────┬──────────────┬─────────────┬───────────┐
# │ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species   │
# │ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---       │
# │ f64          ┆ f64         ┆ f64          ┆ f64         ┆ str       │
# ╞══════════════╪═════════════╪══════════════╪═════════════╪═══════════╡
# │ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa    │
# │ 4.9          ┆ 3.0         ┆ 1.4          ┆ 0.2         ┆ setosa    │
# │ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa    │
# │ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa    │
# │ 5.0          ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa    │
# │ …            ┆ …           ┆ …            ┆ …           ┆ …         │
# │ 6.7          ┆ 3.0         ┆ 5.2          ┆ 2.3         ┆ virginica │
# │ 6.3          ┆ 2.5         ┆ 5.0          ┆ 1.9         ┆ virginica │
# │ 6.5          ┆ 3.0         ┆ 5.2          ┆ 2.0         ┆ virginica │
# │ 6.2          ┆ 3.4         ┆ 5.4          ┆ 2.3         ┆ virginica │
# │ 5.9          ┆ 3.0         ┆ 5.1          ┆ 1.8         ┆ virginica │
# └──────────────┴─────────────┴──────────────┴─────────────┴───────────┘
print(
    pl.read_parquet(
        path,
        schema={
            "sepal_length": pl.Float64,
            "extra1": pl.Null,
            "extra2": pl.UInt8,
        },
        allow_missing_columns=True,
    )
)
# shape: (150, 3)
# ┌──────────────┬────────┬────────┐
# │ sepal_length ┆ extra1 ┆ extra2 │
# │ ---          ┆ ---    ┆ ---    │
# │ f64          ┆ null   ┆ u8     │
# ╞══════════════╪════════╪════════╡
# │ 5.1          ┆ null   ┆ null   │
# │ 4.9          ┆ null   ┆ null   │
# │ 4.7          ┆ null   ┆ null   │
# │ 4.6          ┆ null   ┆ null   │
# │ 5.0          ┆ null   ┆ null   │
# │ …            ┆ …      ┆ …      │
# │ 6.7          ┆ null   ┆ null   │
# │ 6.3          ┆ null   ┆ null   │
# │ 6.5          ┆ null   ┆ null   │
# │ 6.2          ┆ null   ┆ null   │
# │ 5.9          ┆ null   ┆ null   │
# └──────────────┴────────┴────────┘
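
To make the multi-file case concrete, here is a rough pure-Python model of the behaviour shown in the output above. It is only a sketch: `conform_to_schema`, the sample rows, and the string dtype labels are hypothetical and are not part of Polars or of this PR's implementation. It assumes each file is projected onto the requested schema, with absent columns filled by nulls when `allow_missing_columns` is set.

```python
from typing import Any


def conform_to_schema(
    rows: list[dict[str, Any]],
    schema: dict[str, str],
    allow_missing_columns: bool = False,
) -> list[dict[str, Any]]:
    """Project one file's rows onto `schema` (hypothetical helper, not Polars code)."""
    # Columns that actually occur in this file.
    present = set().union(*(row.keys() for row in rows)) if rows else set()
    missing = [name for name in schema if name not in present]
    if missing and not allow_missing_columns:
        raise KeyError(f"columns missing from file: {missing}")
    # Keep only the schema columns, in schema order; absent ones become None (null).
    return [{name: row.get(name) for name in schema} for row in rows]


# Two "files" whose column names differ (made-up data).
file_a = [{"id": 1, "x": 0.5}, {"id": 2, "x": 1.5}]
file_b = [{"id": 3, "y": "b"}]
schema = {"id": "Int64", "x": "Float64", "y": "String"}

combined = [
    row
    for file_rows in (file_a, file_b)
    for row in conform_to_schema(file_rows, schema, allow_missing_columns=True)
]
print(combined)
```

Without `allow_missing_columns=True`, the helper raises on the first file that lacks a schema column, which is the distinction the flag is meant to control in this model.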

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels Sep 30, 2024

codecov bot commented Sep 30, 2024

Codecov Report

Attention: Patch coverage is 50.66667% with 37 lines in your changes missing coverage. Please review.

Project coverage is 79.87%. Comparing base (c23266b) to head (2a0c345).
Report is 45 commits behind head on main.

Files with missing lines                                  Patch %   Lines
...ates/polars-stream/src/nodes/parquet_source/mod.rs     0.00%     14 Missing ⚠️
crates/polars-io/src/parquet/read/utils.rs                57.69%    11 Missing ⚠️
...-stream/src/nodes/parquet_source/metadata_fetch.rs     0.00%     5 Missing ⚠️
py-polars/polars/io/parquet/functions.py                  0.00%     2 Missing and 1 partial ⚠️
...tes/polars-stream/src/nodes/parquet_source/init.rs     0.00%     2 Missing ⚠️
...-stream/src/nodes/parquet_source/metadata_utils.rs     0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #19013      +/-   ##
==========================================
+ Coverage   79.84%   79.87%   +0.02%     
==========================================
  Files        1524     1524              
  Lines      207653   207693      +40     
  Branches     2905     2906       +1     
==========================================
+ Hits       165802   165885      +83     
+ Misses      41303    41259      -44     
- Partials      548      549       +1     
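
The totals in the diff above can be sanity-checked with a few lines of Python. This is a sketch that assumes project coverage is hits divided by total lines, which the reported figures are consistent with; `coverage_pct` is a hypothetical helper, not part of Codecov.

```python
# Figures copied from the coverage diff above.
base = {"hits": 165802, "misses": 41303, "partials": 548, "lines": 207653}
head = {"hits": 165885, "misses": 41259, "partials": 549, "lines": 207693}


def coverage_pct(report: dict[str, int]) -> float:
    """Hit lines as a percentage of all tracked lines (assumed formula)."""
    # Every tracked line is exactly one of hit, missed, or partially covered.
    assert report["hits"] + report["misses"] + report["partials"] == report["lines"]
    return 100 * report["hits"] / report["lines"]


base_pct = coverage_pct(base)
head_pct = coverage_pct(head)
print(f"base={base_pct:.4f}% head={head_pct:.4f}% delta={head_pct - base_pct:+.4f}")
```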


@nameexhaustion nameexhaustion marked this pull request as ready for review September 30, 2024 07:16
@ritchie46 ritchie46 merged commit 1fe5d03 into pola-rs:main Sep 30, 2024
29 checks passed
@coastalwhite
Collaborator

I am not 100% sure this should be the name. I cannot think of a much better one off the top of my head at the moment, but with #17418 this might lead to confusion at some point. Maybe something like pl_schema?

@c-peters c-peters added the accepted Ready for implementation label Oct 6, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Oct 17, 2024

I am not 100% sure this should be the name. I cannot think of a much better one off the top of my head at the moment, but with #17418 this might lead to confusion at some point. Maybe something like pl_schema?

@coastalwhite: schema is consistent with the other read/scan funcs; perhaps on export/write we could use something like target_schema, to distinguish the two uses 🤔

@nameexhaustion nameexhaustion deleted the scan-parquet-schema-arg branch October 28, 2024 04:54
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars
5 participants