Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use the Pandas expression tree for query preflighting. #175

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

gitosaurus
Copy link
Contributor

Requires extract_nest_names to be a method on NestedFrame so that the evaluation context is available at parsing time, since the Pandas Expr parsing does some eager evaluation.

Resolves #174 .

Change Description

  • My PR includes a link to the issue that I am addressing

Solution Description

The preflighting for NestedFrame.query() was examining the Python AST parse tree to locate the names of nests whose fields were being used in expressions. This was fine so long as the column names were identifier-like, but failed for columns with spaces. Pandas provides a way of dealing with such names, by enclosing them in backticks. This causes Pandas to transform them into a temporary identifier that is put into a resolver and used for the duration; this process is called cleaning. .query() needed to use this evaluator.

However, the Pandas evaluator does some eager evaluation, so that its parsing cannot occur outside of its DataFrame (and NestedFrame) context, so the proper nested field resolvers need to be preloaded for existing columns, as well as created on the fly for dynamic columns.

Code Quality

  • I have read the Contribution Guide
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation

Project-Specific Pull Request Checklists

Bug Fix Checklist

  • My fix includes a new test that breaks as a result of the bug (if possible)
  • My change includes a breaking change
    • My change includes backwards compatibility and deprecation warnings (if possible)

Requires `extract_nest_names` to be a method on `NestedFrame` so that
the evaluation context is available at parsing time, since the
Pandas Expr parsing does some eager evaluation.

Resolves #174 .
Copy link

codecov bot commented Nov 14, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.40%. Comparing base (402ab66) to head (3a5852f).
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #175   +/-   ##
=======================================
  Coverage   99.40%   99.40%           
=======================================
  Files          14       13    -1     
  Lines        1004     1016   +12     
=======================================
+ Hits          998     1010   +12     
  Misses          6        6           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link

Before [402ab66] <v0.3.1> After [bd9afc5] Ratio Benchmark (Parameter)
38.6±9ms 48.5±10ms ~1.26 benchmarks.ReassignHalfOfNestedSeries.time_run
7.91±0.1ms 15.4±0.2ms 1.95 benchmarks.NestedFrameQuery.time_run
22.1±0.1ms 23.3±0.4ms 1.05 benchmarks.AssignSingleDfToNestedSeries.time_run
9.73±0.2ms 9.80±0.2ms 1.01 benchmarks.NestedFrameAddNested.time_run
266M 266M 1.00 benchmarks.AssignSingleDfToNestedSeries.peakmem_run
88.1M 88M 1.00 benchmarks.NestedFrameAddNested.peakmem_run
93.2M 93.2M 1.00 benchmarks.NestedFrameQuery.peakmem_run
91.7M 91.6M 1.00 benchmarks.NestedFrameReduce.peakmem_run
1.59±0.02ms 1.59±0.02ms 1.00 benchmarks.NestedFrameReduce.time_run
285M 285M 1.00 benchmarks.ReassignHalfOfNestedSeries.peakmem_run

Click here to view all benchmarks.

Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Probably it is out of scope of this PR, feel free to separate it into a new issue. We also need to support arbitrary names for sub-columns, for example this code doesn't work:

nf = NestedFrame(data={"dog": [1, 2, 3], "good dog": [2, 4, 6]}, index=[0, 1, 2])
nested = pd.DataFrame(
    data={"n/a": [0, 2, 4, 1, 4, 3, 1, 4, 1], "n/b": [5, 4, 7, 5, 3, 1, 9, 3, 4]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
nf = nf.add_nested(nested, "bad dog")
nf.query("`bad dog`.`n/a` > 2")

It is nice that nested column name could be arbitrary now, but if we need to choose between supporting arbitrary nested column names and sub-column names, I definitely would prefer the second option. The reason is that the nested column name is under our control, while sub-columns are coming from user's data.

@gitosaurus gitosaurus merged commit a86a532 into main Nov 14, 2024
10 checks passed
@gitosaurus gitosaurus deleted the 174-fix-nest-name-extraction branch November 14, 2024 19:39
Copy link
Contributor

@wilsonbb wilsonbb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix NestedFrame.query/eval to handle column names including a space
3 participants