perf: cache `.schema` and `.columns` for lazy-only backends #2085

MarcoGorelli · 2025-02-24T15:00:17Z

See #2064 for context

For Dask / PySpark / DuckDB / polars.LazyFrame, we should probably cache the schema and column names.

Some guidelines:

First, do not do this for eager backends, because those may allow in-place operations which modify the data type. Example:

In [49]: df_pd = pd.DataFrame({'a':[1,1,2], 'b': [4,5,6]})

In [50]: df = nw.from_native(df_pd)

In [51]: df.schema
Out[51]: Schema([('a', Int64), ('b', Int64)])

In [52]: df
Out[52]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|        a  b      |
|     0  1  4      |
|     1  1  5      |
|     2  2  6      |
└──────────────────┘

In [53]: df.schema
Out[53]: Schema([('a', Int64), ('b', Int64)])

In [54]: df_native = df.to_native()

In [55]: df_native.loc['a', 0] = 1.5

In [56]: df
Out[56]:
┌──────────────────┐
|Narwhals DataFrame|
|------------------|
|      a    b    0 |
| 0  1.0  4.0  NaN |
| 1  1.0  5.0  NaN |
| 2  2.0  6.0  NaN |
| a  NaN  NaN  1.5 |
└──────────────────┘

In [57]: df.schema
Out[57]: Schema([('a', Float64), ('b', Float64), (0, Float64)])

Is this poor design on pandas' part? Arguably. But, it's just what we've got to deal with.

Second, careful about using lru_cache on properties: https://youtu.be/sVjtp6tGo0g
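One commonly cited pitfall, sketched here without claiming it is exactly what the linked video covers: `lru_cache` stores its cache on the function object, which is shared by every instance, and keys each entry on `self`. The cache therefore keeps a strong reference to each instance it has seen. The `Frame` class below is hypothetical and exists only to show the effect; it is not Narwhals code.

```python
import gc
import weakref
from functools import lru_cache


class Frame:
    """Hypothetical wrapper, for illustration only (not Narwhals code)."""

    def __init__(self, native_schema):
        self._native_schema = native_schema

    @property
    @lru_cache(maxsize=None)
    def schema(self):
        # The cache lives on the function object, shared by all instances,
        # and is keyed on `self`, so it holds a strong reference to `self`.
        return dict(self._native_schema)


frame = Frame({"a": "Int64"})
frame.schema  # populates the shared cache
ref = weakref.ref(frame)

del frame
gc.collect()

# The cache entry still references the instance, so it was not collected.
still_alive = ref() is not None
```

With `maxsize=None` the entry is only dropped when the cache is cleared explicitly, so every wrapper whose `schema` was read stays in memory for the life of the process.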


dangotbanned · 2025-02-24T15:19:19Z

It is a bit hidden in the Arrow docs (https://arrow.apache.org/docs/python/data.html#arrays):

> Arrow data is immutable, so values can be selected but not assigned.

> Second, careful about using lru_cache on properties: youtu.be/sVjtp6tGo0g

@MarcoGorelli `functools.cached_property` would be preferred for this case, I assume?
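For reference, a minimal sketch of what the `functools.cached_property` approach could look like. Both `FakeLazyNative` and `LazyFrameWrapper` are hypothetical names made up for this example, and `collect_schema` here is just a method on the stand-in class; none of this is the actual Narwhals implementation.

```python
from functools import cached_property


class FakeLazyNative:
    """Stand-in for a lazy native frame; counts schema resolutions."""

    def __init__(self, schema):
        self._schema = schema
        self.schema_calls = 0

    def collect_schema(self):
        # Pretend this is the expensive step on a lazy backend.
        self.schema_calls += 1
        return dict(self._schema)


class LazyFrameWrapper:
    """Hypothetical wrapper, not the actual Narwhals implementation."""

    def __init__(self, native):
        self._native = native

    @cached_property
    def schema(self):
        # Computed on first access, then stored in this instance's __dict__.
        return self._native.collect_schema()

    @cached_property
    def columns(self):
        # Reuses the cached schema instead of resolving it again.
        return list(self.schema)


native = FakeLazyNative({"a": "Int64", "b": "Int64"})
lf = LazyFrameWrapper(native)

first = lf.schema
second = lf.schema
cols = lf.columns
```

Here the stand-in's `collect_schema` runs once no matter how often `schema` or `columns` is read. The cached value lives on the instance, so it is freed together with the wrapper. One caveat: `cached_property` needs an instance `__dict__`, so it does not work on classes that define `__slots__` without including `__dict__`.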

MarcoGorelli added the `performance` and `high priority` labels on Feb 24, 2025
