perf: cache .schema and .columns for lazy-only backends #2085

Open
MarcoGorelli opened this issue Feb 24, 2025 · 1 comment
Labels: high priority (Your PR will be reviewed very quickly if you address this), performance

Comments

@MarcoGorelli
Member

See #2064 for context

For Dask / PySpark / DuckDB / polars.LazyFrame, we should probably cache the schema and column names.

Some guidelines:

First, do not do this for eager backends, because those may allow in-place operations which modify the data type. Example:

In [49]: df_pd = pd.DataFrame({'a':[1,1,2], 'b': [4,5,6]})

In [50]: df = nw.from_native(df_pd)

In [51]: df.schema
Out[51]: Schema([('a', Int64), ('b', Int64)])

In [52]: df
Out[52]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|        a  b      |
|     0  1  4      |
|     1  1  5      |
|     2  2  6      |
└──────────────────┘

In [53]: df.schema
Out[53]: Schema([('a', Int64), ('b', Int64)])

In [54]: df_native = df.to_native()

In [55]: df_native.loc['a', 0] = 1.5

In [56]: df
Out[56]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|      a    b    0 |
| 0  1.0  4.0  NaN |
| 1  1.0  5.0  NaN |
| 2  2.0  6.0  NaN |
| a  NaN  NaN  1.5 |
└──────────────────┘

In [57]: df.schema
Out[57]: Schema([('a', Float64), ('b', Float64), (0, Float64)])

Is this poor design on pandas' part? Arguably. But it's just what we've got to deal with.

Second, be careful about using lru_cache on properties: https://youtu.be/sVjtp6tGo0g
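
As a rough illustration of one pitfall to watch for (a sketch of my own, not necessarily the exact point the linked video makes): stacking `functools.lru_cache` under `property` puts the cache on the class-level function and keys it on `self`, so the cache keeps a strong reference to every instance it has seen. `LazyFrameLike` below is a hypothetical stand-in, not Narwhals code.

```python
import functools
import gc
import weakref


class LazyFrameLike:
    """Hypothetical stand-in for a lazy-backend wrapper (not Narwhals code)."""

    def __init__(self, columns):
        self._columns = list(columns)

    @property
    @functools.lru_cache(maxsize=None)
    def columns(self):
        # The cache belongs to the function object shared by all instances
        # and uses `self` as the key, so it holds a strong reference to it.
        return list(self._columns)


frame = LazyFrameLike(["a", "b"])
frame.columns  # populate the cache
ref = weakref.ref(frame)

del frame
gc.collect()

# The instance is still reachable through the cache entry.
still_alive = ref() is not None
print(still_alive)
```

With an unbounded cache, wrappers created in a long-running process would never be released this way, which is one reason to prefer a per-instance cache.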

MarcoGorelli added the performance and high priority labels on Feb 24, 2025
@dangotbanned
Member

dangotbanned commented Feb 24, 2025

It is a bit hidden in the docs here (https://arrow.apache.org/docs/python/data.html#arrays):

"Arrow data is immutable, so values can be selected but not assigned."

Quoting @MarcoGorelli: "Second, careful about using lru_cache on properties: https://youtu.be/sVjtp6tGo0g"

@MarcoGorelli functools.cached_property would be preferred for this case I assume?
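
A minimal sketch of what the `functools.cached_property` approach could look like, assuming a hypothetical wrapper class. `LazyFrameLike`, `_collect_schema` and `schema_calls` are made up for illustration and are not part of Narwhals. The value is stored in the instance's own `__dict__`, so it is computed once per instance and is released together with the instance.

```python
from functools import cached_property


class LazyFrameLike:
    """Hypothetical lazy-backend wrapper (not Narwhals code)."""

    def __init__(self, native_schema):
        self._native_schema = dict(native_schema)
        self.schema_calls = 0  # counts how often the costly lookup runs

    def _collect_schema(self):
        # Stand-in for an expensive backend call that resolves the schema.
        self.schema_calls += 1
        return dict(self._native_schema)

    @cached_property
    def schema(self):
        # Computed on first access, then stored on the instance.
        return self._collect_schema()

    @cached_property
    def columns(self):
        # Reuses the cached schema rather than resolving it again.
        return list(self.schema)


lf = LazyFrameLike({"a": "Int64", "b": "Int64"})
first = lf.schema
second = lf.schema
cols = lf.columns
print(lf.schema_calls)
```

One caveat: `cached_property` needs an instance `__dict__`, so it would not work as-is on a class that defines `__slots__` without including `__dict__`.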
