Use the Pandas expression tree for query preflighting. #175
Requires `extract_nest_names` to be a method on `NestedFrame` so that the evaluation context is available at parsing time, since the Pandas `Expr` parsing does some eager evaluation. Resolves #174.
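For contrast, here is a rough sketch of what a purely syntactic preflight can and cannot do without any evaluation context. It uses only Python's standard `ast` module; `nest_names_via_ast` is a hypothetical stand-in written for illustration, not the project's `extract_nest_names`.

```python
import ast


def nest_names_via_ast(expr):
    """Hypothetical helper: return base names of attribute accesses like `nested.a`."""
    tree = ast.parse(expr, mode="eval")
    return {
        node.value.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name)
    }


# Identifier-like names parse fine with no DataFrame context at all.
plain = nest_names_via_ast("nested.a > 2 and dog < 3")

# Backticks are not valid Python syntax, so the quoted form cannot be parsed.
try:
    nest_names_via_ast("`bad dog`.a > 2")
    backtick_parsed = True
except SyntaxError:
    backtick_parsed = False
```

The first call needs nothing but the expression string, which is why this approach is attractive. The second call fails before any names can be collected, so backtick-quoted names have to go through a parser that understands them.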
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #175   +/-   ##
=======================================
  Coverage   99.40%   99.40%
=======================================
  Files          14       13    -1
  Lines        1004     1016   +12
=======================================
+ Hits          998     1010   +12
  Misses          6        6
Great! This is probably out of scope for this PR, so feel free to split it into a new issue: we also need to support arbitrary names for sub-columns. For example, this code doesn't work:
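import pandas as pd
from nested_pandas import NestedFrame

nf = NestedFrame(data={"dog": [1, 2, 3], "good dog": [2, 4, 6]}, index=[0, 1, 2])
nested = pd.DataFrame(
    data={"n/a": [0, 2, 4, 1, 4, 3, 1, 4, 1], "n/b": [5, 4, 7, 5, 3, 1, 9, 3, 4]},
    index=[0, 0, 0, 1, 1, 1, 2, 2, 2],
)
# The nest name may contain a space...
nf = nf.add_nested(nested, "bad dog")
# ...but this query does not work: the sub-column name "n/a" is not a plain identifier.
nf.query("`bad dog`.`n/a` > 2")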
It is nice that the nested column name can be arbitrary now, but if we had to choose between supporting arbitrary nested column names and arbitrary sub-column names, I would definitely prefer the latter. The nested column name is under our control, while sub-column names come from the user's data.
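For reference, a minimal sketch of how plain pandas treats such names on a flat `DataFrame`, assuming a reasonably recent pandas release. It says nothing about `NestedFrame` itself; the frame `flat` and its values are made up for illustration.

```python
import pandas as pd

# Flat frame whose column names are not valid Python identifiers.
flat = pd.DataFrame(
    {
        "n/a": [0, 2, 4, 1, 4, 3, 1, 4, 1],
        "good dog": [5, 4, 7, 5, 3, 1, 9, 3, 4],
    }
)

# Backticks let query() refer to names containing a slash or a space.
by_slash = flat.query("`n/a` > 2")
by_space = flat.query("`good dog` > 4")
```

Both queries select rows by ordinary comparison, so the quoting is the only thing that differs from a query on identifier-like column names.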
LGTM!
Change Description
Solution Description
The preflighting for `NestedFrame.query()` was examining the Python AST parse tree to locate the names of nests whose fields were being used in expressions. This was fine so long as the column names were identifier-like, but failed for columns with spaces. Pandas provides a way of dealing with such names, by enclosing them in backticks. This causes Pandas to transform them into a temporary identifier that is put into a resolver and used for the duration; this process is called cleaning. `.query()` needed to use this evaluator.

However, the Pandas evaluator does some eager evaluation, so its parsing cannot occur outside of its `DataFrame` (and `NestedFrame`) context. The proper nested field resolvers therefore need to be preloaded for existing columns, as well as created on the fly for dynamic columns.

Code Quality
Project-Specific Pull Request Checklists
Bug Fix Checklist