-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
predicate/projection pushdown causes ShapeError #19944
Comments
Attempted minimal repro. import polars as pl
filename = '19944.parquet'
n = 1_500_000
values = list('abcdefghijlmno')
dtype = pl.Enum(values)
s = pl.Series(values, dtype=dtype)
pl.select(
A = pl.lit('a'),
B = s.sample(n, with_replacement=True),
C = s.sample(n, with_replacement=True)
).write_parquet(filename)
r1 = pl.LazyFrame({'A': 'a', 'B': 'b'}).cast({'B': dtype})
r2 = pl.scan_parquet(filename).cast({'B': dtype, 'C': dtype})
l = pl.LazyFrame({'B': 'b'}).cast({'B': dtype})
r = pl.concat([r1, r2], how='diagonal_relaxed')
(l.join(r, on='B')
.filter(A = pl.col('A'))
.group_by('A')
.first()
.collect()
)
# ShapeError: unable to vstack, column names don't match: "B" and "A" Update: It seems this started to error in 1.13.0 |
I cannot reproduce this one. @mdavis-xyz can you make a minimal repro? Try to get out boto3 and aws for instance and add the minimal amount of columns operations needed to reproduce this. |
The MWE from @cmdlineluser does not produce the error for me. But here is one which does:
It seems the issue arises when doing a diagonal concat and one of the lazyframes is missing a column which is of type enum. If I directly concatenate LazyFrames without going via parquet, I do not get the error. If I use |
Hmm, woops. That might actually be a different error to what I originally reported.
vs
|
FWIW, I was getting both errors while trying to get a MWE. It seemed to go from No Error -> SchemaError -> ShapeError depending on the number of rows. |
I'm encountering the same issue, starting with Polars 1.13.0. General system info and some other package versions (not sure if they're relevant):
I've come up with a somewhat minimal reproduction case which shows the error happening. Generally, data frames are constructed in-memory and written to Parquet files. Those are then read back using At a high-level, I've observed:
In summary:
Here's the reproduction case - it's a bit verbose because it tests for various combinations of the issue using binary search: Reproduction case
import polars as pl
from polars.exceptions import ShapeError
pl.show_versions()
def _experiment(n, predicate_pushdown, lazy) -> bool:
col1_type = pl.Int8
col2_type = pl.Int8
data = {
'col1': [0] * n,
'col2': [0] * n
}
if lazy:
cls = pl.LazyFrame
else:
cls = pl.DataFrame
df1 = cls(data, schema={
'col1': col1_type,
'col2': col2_type,
})
df2 = cls(data, schema={
'col2': col2_type,
'col1': col1_type,
})
if lazy:
df1.sink_parquet('df1.parquet')
df2.sink_parquet('df2.parquet')
else:
df1.write_parquet('df1.parquet')
df2.write_parquet('df2.parquet')
df1 = pl.scan_parquet('df1.parquet')
df2 = pl.scan_parquet('df2.parquet')
df = pl.concat([df1, df2], how='diagonal_relaxed')
try:
df.filter(pl.col('col1') >= 0).collect(predicate_pushdown=predicate_pushdown)
except ShapeError as e:
print(df1.collect().shape)
print(e)
return False
return True
def experiment(predicate_pushdown, lazy) -> bool:
print(f'*** Trying to reproduce issue with predicate_pushdown={predicate_pushdown} lazy={lazy}')
failed = False
high = 5000000
low = 1
while low < high:
mid = (low + high) // 2
if _experiment(mid, predicate_pushdown, lazy):
low = mid + 1
else:
failed = True
high = mid
if failed:
print(f'*** Stacking starts to fail with dataframes of length {low} and predicate_pushdown={predicate_pushdown} lazy={lazy}\n\n')
else:
print(f'*** Everything worked as expected with predicate_pushdown={predicate_pushdown} lazy={lazy}\n\n')
experiment(predicate_pushdown=True, lazy=True)
experiment(predicate_pushdown=False, lazy=True)
experiment(predicate_pushdown=True, lazy=False)
experiment(predicate_pushdown=False, lazy=False) The failing results, as observed on my machine with Polars 1.16.0
The OK results, as observed on my machine with Polars 1.12.0:
|
Sorry @mdavis-xyz @ritchie46 I forgot to add that release v1.13.0 is the first Polars release where I'm seeing this issue 👍 Let me know if you need any more details or if you have any questions. |
I've just run my full code (more complex than the MWE) using polars v1.17.0, and now it works. Thanks! |
Same here, thanks a lot @coastalwhite @ritchie46 @nameexhaustion @cmdlineluser, that was a very quick fix ❤️ Can confirm the issue is solved in v1.17.0 🥇 ! |
Checks
Reproducible example
Note that this script will download 170MB of sample files from AWS S3. But you shouldn't need any AWS credentials for this to work. (I made the bucket public.)
Log output
Issue description
I get error:
For code which has no
vstack
command. I do have aconcat
but withdiagonal_relaxed
.Note that if I query from AWS S3 directly, I don't get the error. If I turn off either predicate or projection pushdown (or both), I don't get the error. Only when I turn on both, and read the file locally, do I get the error.
I suspect the error may be related to enums/categoricals.
It may be related to #13381
Expected behavior
The script should run without throwing an exception.
Querying files locally should give the same result as querying from S3.
Installed versions
The text was updated successfully, but these errors were encountered: