Use the Pandas expression tree for query preflighting. #175

gitosaurus · 2024-11-14T00:21:42Z

Requires extract_nest_names to be a method on NestedFrame so that the evaluation context is available at parsing time, since the Pandas Expr parsing does some eager evaluation.

Resolves #174 .

Change Description

My PR includes a link to the issue that I am addressing

Solution Description

The preflighting for NestedFrame.query() was examining the Python AST parse tree to locate the names of nests whose fields were being used in expressions. This was fine so long as the column names were identifier-like, but failed for columns with spaces. Pandas provides a way of dealing with such names, by enclosing them in backticks. This causes Pandas to transform them into a temporary identifier that is put into a resolver and used for the duration; this process is called cleaning. .query() needed to use this evaluator.

However, the Pandas evaluator does some eager evaluation, so that its parsing cannot occur outside of its DataFrame (and NestedFrame) context, so the proper nested field resolvers need to be preloaded for existing columns, as well as created on the fly for dynamic columns.

Code Quality

I have read the Contribution Guide
My code follows the code style of this project
My code builds (or compiles) cleanly without any errors or warnings
My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

My fix includes a new test that breaks as a result of the bug (if possible)
My change includes a breaking change
- My change includes backwards compatibility and deprecation warnings (if possible)

Requires `extract_nest_names` to be a method on `NestedFrame` so that the evaluation context is available at parsing time, since the Pandas Expr parsing does some eager evaluation. Resolves #174 .

codecov · 2024-11-14T00:23:04Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.40%. Comparing base (402ab66) to head (3a5852f).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #175   +/-   ##
=======================================
  Coverage   99.40%   99.40%           
=======================================
  Files          14       13    -1     
  Lines        1004     1016   +12     
=======================================
+ Hits          998     1010   +12     
  Misses          6        6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

github-actions · 2024-11-14T00:24:01Z

Before [`402ab66`] <v0.3.1>	After [`bd9afc5`]	Ratio	Benchmark (Parameter)
38.6±9ms	48.5±10ms	~1.26	benchmarks.ReassignHalfOfNestedSeries.time_run
7.91±0.1ms	15.4±0.2ms	1.95	benchmarks.NestedFrameQuery.time_run
22.1±0.1ms	23.3±0.4ms	1.05	benchmarks.AssignSingleDfToNestedSeries.time_run
9.73±0.2ms	9.80±0.2ms	1.01	benchmarks.NestedFrameAddNested.time_run
266M	266M	1.00	benchmarks.AssignSingleDfToNestedSeries.peakmem_run
88.1M	88M	1.00	benchmarks.NestedFrameAddNested.peakmem_run
93.2M	93.2M	1.00	benchmarks.NestedFrameQuery.peakmem_run
91.7M	91.6M	1.00	benchmarks.NestedFrameReduce.peakmem_run
1.59±0.02ms	1.59±0.02ms	1.00	benchmarks.NestedFrameReduce.time_run
285M	285M	1.00	benchmarks.ReassignHalfOfNestedSeries.peakmem_run

Click here to view all benchmarks.

hombit

Great! Probably it is out of scope of this PR, feel free to separate it into a new issue. We also need to support arbitrary names for sub-columns, for example this code doesn't work:

nf = NestedFrame(data={"dog": [1, 2, 3], "good dog": [2, 4, 6]}, index=[0, 1, 2])
nested = pd.DataFrame(
    data={"n/a": [0, 2, 4, 1, 4, 3, 1, 4, 1], "n/b": [5, 4, 7, 5, 3, 1, 9, 3, 4]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
nf = nf.add_nested(nested, "bad dog")
nf.query("`bad dog`.`n/a` > 2")

It is nice that nested column name could be arbitrary now, but if we need to choose between supporting arbitrary nested column names and sub-column names, I definitely would prefer the second option. The reason is that the nested column name is under our control, while sub-columns are coming from user's data.

wilsonbb

LGTM!

Use the Pandas expr tree for preflighting.

3a5852f

Requires `extract_nest_names` to be a method on `NestedFrame` so that the evaluation context is available at parsing time, since the Pandas Expr parsing does some eager evaluation. Resolves #174 .

gitosaurus linked an issue Nov 14, 2024 that may be closed by this pull request

Fix NestedFrame.query/eval to handle column names including a space #174

Closed

gitosaurus requested review from hombit and wilsonbb November 14, 2024 00:21

hombit approved these changes Nov 14, 2024

View reviewed changes

gitosaurus merged commit a86a532 into main Nov 14, 2024
10 checks passed

gitosaurus deleted the 174-fix-nest-name-extraction branch November 14, 2024 19:39

wilsonbb reviewed Nov 14, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use the Pandas expression tree for query preflighting. #175

Use the Pandas expression tree for query preflighting. #175

gitosaurus commented Nov 14, 2024

codecov bot commented Nov 14, 2024 •

edited

Loading

github-actions bot commented Nov 14, 2024

hombit left a comment •

edited

Loading

wilsonbb left a comment

Use the Pandas expression tree for query preflighting. #175

Use the Pandas expression tree for query preflighting. #175

Conversation

gitosaurus commented Nov 14, 2024

Change Description

Solution Description

Code Quality

Project-Specific Pull Request Checklists

Bug Fix Checklist

codecov bot commented Nov 14, 2024 • edited Loading

Codecov Report

github-actions bot commented Nov 14, 2024

hombit left a comment • edited Loading

Choose a reason for hiding this comment

wilsonbb left a comment

Choose a reason for hiding this comment

codecov bot commented Nov 14, 2024 •

edited

Loading

hombit left a comment •

edited

Loading